CN108540451A - A method for classifying and detecting attack behaviors using machine learning techniques - Google Patents

A method for classifying and detecting attack behaviors using machine learning techniques Download PDF

Info

Publication number
CN108540451A
CN108540451A CN201810202552.9A CN201810202552A CN 108540451 A
Authority
CN
China
Prior art keywords
data
value
training
training data
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810202552.9A
Other languages
Chinese (zh)
Inventor
吕坤
郑宇坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810202552.9A priority Critical patent/CN108540451A/en
Publication of CN108540451A publication Critical patent/CN108540451A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for classifying and detecting attack behaviors using machine learning techniques, and belongs to the field of information security technology. Specifically: 1. Collect network data and preprocess it to obtain training data. 2. Build and train a multi-stage classifier. 3. Use the trained multi-stage classifier to classify and detect test data. Compared with the prior art, the proposed method has the following advantages: 1. The preprocessing of the collected data reduces the data scale and removes some irrelevant data, improving overall efficiency. 2. The use of a multi-stage classifier and ensemble learning overcomes the limited fitting accuracy of a single classifier and greatly improves the detection accuracy of the system. 3. The data-partitioning design based on the improved random forest algorithm allows the detection of different attack types to be implemented as a parallel algorithm, improving the overall detection speed of the system.

Description

A method for classifying and detecting attack behaviors using machine learning techniques
Technical field
The present invention relates to a method for classifying and detecting attack behaviors using machine learning techniques, and belongs to the field of information security technology.
Background technology
While networks and computer technology have developed and brought convenience to people's lives, the security of network systems has raised new concerns. Since the number and variety of network attacks are growing exponentially, networks and information systems face serious security threats. Against this background, research on methods for protecting network security has important theoretical and practical value.
To protect the security of network systems, and to identify and prevent attacks from inside and outside the system as well as unauthorized user behavior, researchers have proposed active monitoring techniques for network systems. In this technique, monitoring nodes actively generate load on the monitored network and analyze the collected data, thereby obtaining status information about the monitored network and producing corresponding decisions. The main research problem of active monitoring systems is building suitable models to classify and detect the monitored information; the evaluation metrics of classification-detection techniques include detection time, detection accuracy, and false-alarm rate. As the number of monitoring nodes grows, active monitoring of the network becomes more complex, so analysis methods that are superior in both detection accuracy and detection time are needed.
Machine learning simulates human learning with computers: a learner is built from existing experience, the learner is then used to predict unknown data, and the learner is continuously improved in this process. Introducing machine learning techniques into active monitoring improves the precision of data analysis; machine learning models commonly used in active monitoring systems include SVMs, neural networks, logistic regression, and Bayesian networks.
Summary of the invention
The purpose of the present invention is to address the low detection accuracy, long response time, and high false-alarm rate of existing active monitoring techniques for attack behaviors in large-scale networks, by proposing a method for classifying and detecting attack behaviors using machine learning techniques. The method improves the quality of the data to be tested and reduces the data scale through a complete feature engineering procedure; on this basis it builds an ensemble machine learning model based on the random forest and support vector machine algorithms, and classifies the processed data to predict attack behaviors in the network system.
The purpose of the present invention is achieved through the following technical solution.
The specific operations of the proposed method for classifying and detecting attack behaviors using machine learning techniques are:
Step 1: Collect network data and preprocess it to obtain training data. The training data is divided into normal data and attack data; the attack data is further divided into multiple classes according to attack type, with the number of attack types denoted by the symbol N; N is a positive integer. The quantity of training data of each type is no less than 3000.
The specific operations for obtaining the training data are:
Step 1.1: Collect network data from the network system. The network data includes content-related features, traffic-related features, and connection-related features.
Step 1.2: Preprocess the network data, specifically:
Step 1.2.1: Clean the network data by removing records with missing feature values and records whose feature values fall outside the valid value range.
Step 1.2.2: Standardize the cleaned network data. Specifically: map character-type data to numeric values, or apply a binarizing numerical transformation. After standardization, each network data record is represented as a feature vector.
Step 1.2.3: Normalize the standardized network data by formula (1), so that the value of every feature of the network data lies in the range [0, 1].
Here, new_v denotes the value of any feature (denoted by the symbol V) of a network data record after normalization, new_v ∈ [0, 1]; v denotes the original value of feature V in the network data; max denotes the maximum of the original values of feature V over all network data; min denotes the minimum of the original values of feature V over all network data.
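Formula (1) itself is not reproduced in this text; from the surrounding definitions it is evidently the standard min-max normalization new_v = (v − min) / (max − min). A minimal sketch under that assumption (the function name is ours):

```python
def min_max_normalize(values):
    """Normalize raw feature values into [0, 1] (assumed form of formula (1)):
    new_v = (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```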
Step 1.3: Use the distance-based local outlier factor (LOF) algorithm to compute the local outlier factor of the preprocessed network data. Specifically:
Step 1.3.1: Represent each network data record as an m-dimensional feature vector, denoted by the symbol s, s = {x1, x2, x3, …, xm}, where m is the number of features contained in one network data record and m is a positive integer; x1, x2, x3, …, xm denote the m features. Then map the feature vectors into an m-dimensional feature space, so that each feature vector corresponds to a point in that space.
Step 1.3.2: Denote by the symbol p the point in the feature space corresponding to any m-dimensional feature vector, and compute the local outlier factor of point p by formula (2).
Here, LOFk(p) denotes the local outlier factor of the k-neighborhood of point p; the value of k is chosen manually, k > 10; Nk(p) denotes the k-neighborhood point set of point p, i.e., all points of the data set within the k-distance of p; lrdk(o) denotes the local reachability density of point o, o ∈ Nk(p); lrdk(p) denotes the local reachability density of point p, computed by formula (3).
Here, |Nk(p)| denotes the number of points in the k-neighborhood of point p; dist(p, o) denotes the distance from point p to point o.
Step 1.4: According to the local outlier factor LOFk(p), judge whether point p is an outlier.
Denote the outlier threshold by the symbol ε; ε is a manually set value with value range (1, 2]. When LOFk(p) > ε, mark point p as an abnormal point; the feature vector corresponding to p is then abnormal data and is deleted. Otherwise, perform the operation of step 1.5.
After the operation of step 1.4, the data remaining after the abnormal data has been removed is taken as the basic training data.
Step 1.5: If the dimension m of the basic training data satisfies m > Q, perform the operation of step 1.6. Otherwise, take the basic training data as the training data. Here, Q is a manually set value, Q ≥ 40.
Step 1.6: Apply a feature selection algorithm to reduce the dimension of the basic training data to T; T is a manually set value, T ≤ 40.
The feature selection algorithms include: the variance selection method, the information gain method, the mutual information method, the chi-square test method, and tree-model-based feature selection methods.
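As an illustration of one of the listed methods, here is a minimal information-gain score for a single discrete feature (this is the standard textbook definition, not a formula taken from the patent):

```python
import math
from collections import Counter

def _entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Information gain of one discrete feature with respect to the labels:
    H(labels) minus the weighted entropy of the labels within each feature value.
    Features are ranked by this score and the top T are kept."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * _entropy(subset)
    return _entropy(labels) - cond
```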
Step 1.7: Analyze and confirm the validity of the feature selection algorithm. The specific operations are:
Step 1.7.1: Select a machine learning model for classifying the basic training data. The machine learning model is a binary classification model used to divide the basic training data into normal data or attack data.
The machine learning models include logistic regression models, decision trees, and perceptrons.
Step 1.7.2: Feed the basic training data obtained in step 1.4 into the machine learning model to obtain classification results, then compute the classification accuracy on the basic training data, denoted by the symbol L1.
Step 1.7.3: Feed the dimension-reduced basic training data obtained in step 1.6 into the machine learning model, obtain classification results, and compute the classification accuracy on the dimension-reduced data, denoted by the symbol L2.
Step 1.7.4: Denote the evaluation threshold by the symbol δ, δ ∈ (0.9, 1). If L2/L1 < δ, increase T and repeat the operations of steps 1.6 to 1.7. Otherwise, end this step and take the dimension-reduced basic training data as the training data.
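The dimension-selection loop of step 1.7 can be sketched as follows; the ratio test L2/L1 ≥ δ is our reading of the partially rendered condition, and `evaluate_accuracy` is a hypothetical callback standing in for the accuracy measurements of steps 1.6 and 1.7:

```python
def select_dimension(evaluate_accuracy, t_initial, t_max, delta):
    """Sketch of the step 1.6-1.7 validity loop (assumed stopping rule
    L2/L1 >= delta). evaluate_accuracy(t) is a hypothetical callback
    returning (L1, L2): the classification accuracy before and after
    reducing the data to t dimensions."""
    t = t_initial
    while t <= t_max:
        l1, l2 = evaluate_accuracy(t)
        if l2 / l1 >= delta:          # dimension reduction kept enough accuracy
            return t
        t += 1                        # otherwise enlarge T and retry
    return t_max
```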
Step 2: Build and train the multi-stage classifier. The specific operation steps are:
Step 2.1: Improve the voting principle of the random forest algorithm to build the improved random forest classifier. Specifically:
Step 2.1.1: Construct n decision trees; n is a manually set value, n > 500.
Step 2.1.2: Denote the classification threshold by the symbol θ; θ is a manually set value, θ ∈ (0, 1).
Step 2.1.3: Denote a type of network data by the symbol y, y ∈ Y, where Y is the set of network data types.
Step 2.1.4: Obtain the classification result of a training data record by formula (4).
Here, x denotes a training data record, and f(x) denotes the value of the classification function indicating whether training data x belongs to type y; xi denotes the input of the i-th decision tree, a sample drawn from training data x; fi(xi) denotes the value of the classification function of the i-th decision tree, built by random sampling.
If the value of f(x) is 1, the initial type of training data x is labeled as class y.
Step 2.1.5: Feed the training data obtained in step 1 into the improved random forest classifier; the classification operation labels the initial type of the training data. Since the number of attack types is N, plus the normal data type, there are (N + 1) types in total.
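Formula (4) is not reproduced in the text. A plausible reading, given the threshold θ of step 2.1.2, is that a record is labeled class y when the fraction of decision trees voting for y exceeds θ. A sketch under that assumption (function name ours; the per-tree predictions stand in for fi(xi)):

```python
def improved_forest_vote(tree_votes, y, theta):
    """Assumed form of formula (4): return 1 (label x as class y) when the
    fraction of decision trees voting for class y exceeds the threshold
    theta, otherwise return 0. tree_votes lists each tree's predicted class."""
    share = sum(1 for v in tree_votes if v == y) / len(tree_votes)
    return 1 if share > theta else 0
```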
Step 2.2: Use an iterative method to build the integrated support vector machine (SVM) classifier based on the adaptive boosting (AdaBoost) idea. In the iteration, the weights of each training data record and of the integrated classifier are computed as follows:
Step 2.2.1: For m training data records, denote by the symbols w1, w2, w3, …, wm the weights of the 1st to m-th training data records, and initialize w1, w2, w3, …, wm to 1/m. Denote the current iteration number by the symbol t and set its initial value t = 0.
Step 2.2.2: Build the t-th base classifier, a support vector machine with a Gaussian kernel, denoted by the symbol gt; the classification function of this classifier is shown in formula (5). Feed the training data set into base classifier gt to obtain the training results.
Here, g(x) denotes the value of the classification function of base classifier gt; sgn(·) denotes the sign function; a is a training data record; the support vectors enter through the kernel expansion; b is the bias, whose initial value is a manually set value in the range [0, 0.5]; K(xr, xs) is the Gaussian kernel function.
Step 2.2.3: Select the quadratic loss function as the loss function of base classifier gt. Compute the value of the loss function of gt, and set a threshold for the loss function of gt; the threshold is a manually set value.
Step 2.2.4: If the value of the loss function of base classifier gt is less than the threshold, the multi-stage classifier is obtained; the classification function of the multi-stage classifier is shown in formula (6). Then compute the model weight of base classifier gt by formula (7), denoted by the symbol dt, and end the operation. If the value of the loss function of gt is not less than the threshold, perform the operation of step 2.2.5.
Here, g denotes the value of the classification function of the multi-stage classifier; T is the total number of iterations.
Here, a denotes the a-th training data record; wt(a) denotes the weight value of the a-th training data record in the t-th iteration; gt(a) denotes the training result of base classifier gt on training data a; yt(a) denotes the true result of training data a.
Step 2.2.5: Adjust the training data weights w1, w2, w3, …, wm using formula (8), and standardize the weight of each training data record so that the weights of all training data sum to 1. Then increment the iteration number by 1 and return to step 2.2.2 to perform its operation.
Here, wt+1(a) denotes the weight value of training data a in the (t + 1)-th iteration; β is an adjustment coefficient: when the training result of data a matches the true result, β = 1; when the training result of data a differs from the true result, β = −1.
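Formulas (7) and (8) are not reproduced in the text; the sketch below assumes the standard AdaBoost-style exponential reweighting, with the patent's β = ±1 expressed as a boolean correctness flag. Function and parameter names are ours:

```python
import math

def update_weights(weights, correct, d_t):
    """AdaBoost-style sketch of the step 2.2.5 reweighting (assumed form of
    formula (8)). weights[a] is w_t(a); correct[a] is True when g_t classified
    record a correctly (the beta = +1 case) and False otherwise (beta = -1);
    d_t is the model weight of base classifier g_t from formula (7)."""
    raw = [w * math.exp(-d_t if ok else d_t)   # shrink correct, grow wrong
           for w, ok in zip(weights, correct)]
    z = sum(raw)                               # normalizer: weights sum to 1
    return [w / z for w in raw]
```

The normalization step corresponds to the patent's requirement that the standardized weights of all training data sum to 1.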
Step 2.3: Train the support vector machines (SVM) based on the adaptive boosting (AdaBoost) idea.
Feed each type of training data obtained in step 2.1 separately into the input of one SVM based on the AdaBoost idea. After training, (N + 1) trained AdaBoost-based support vector machines (SVM) are obtained.
After the operation of step 2, a trained multi-stage classifier is obtained.
Step 3: Use the trained multi-stage classifier to classify and detect test data.
On the basis of the operation of step 2, feed the test data into the multi-stage classifier to obtain the final classification result.
Advantageous effects
Compared with the prior art, the proposed method for classifying and detecting attack behaviors using machine learning techniques has the following advantages:
1. The preprocessing of the collected data reduces the data scale and removes some irrelevant data, improving overall efficiency.
2. The use of a multi-stage classifier and ensemble learning overcomes the limited fitting accuracy of a single classifier and greatly improves the detection accuracy of the system.
3. The data-partitioning design based on the improved random forest algorithm allows the detection of different attack types to be implemented as a parallel algorithm, improving the overall detection speed of the system.
Description of the drawings
Fig. 1 is the operational flowchart of the method for classifying and detecting attack behaviors using machine learning techniques in the specific embodiment of the present invention.
Specific implementation
The technical solution of the present invention is further described below with reference to the drawings and a specific embodiment.
Network data is classified using the proposed method for classifying and detecting attack behaviors with machine learning techniques; the operating process is shown in Fig. 1, with the following specific steps:
Step 1: Collect network data and preprocess it to obtain training data.
The KDD CUP99 data is obtained, and a 10% sample of the KDD CUP99 data set is taken as the network data of this embodiment.
The training data is divided into normal (Normal) data and attack data; the attack data is further divided into 4 categories according to attack type. The quantity of training data of each type is no less than 3000. The attack types are: denial of service (DOS), surveillance or probing (Probing), remote unauthorized access (R2L), and illegal user-privilege escalation (U2R).
The specific operations for obtaining the training data are:
Step 1.1: The KDD CUP99 data is obtained, and a 10% sample of the KDD CUP99 data set is taken as the network data of this embodiment, containing 41 attributes, as shown in Table 1.
Table 1: Features of the KDD CUP99 data
Step 1.2: Preprocess the network data. The KDD CUP99 data set contains 38 numeric-valued attributes and 3 character-type attributes, namely Protocol_type, Flag, and Service. The specific processing is:
Step 1.2.1: Clean the network data by removing records with missing feature values and records whose feature values fall outside the valid value range.
Step 1.2.2: Standardize the cleaned network data. Specifically: map the character-type data to numeric values, or apply a binarizing numerical transformation to the strings. The character attribute Protocol_type means the protocol type, with value range TCP, ICMP, and UDP; TCP, ICMP, and UDP are numericalized to 0, 1, and 2 respectively. The numericalized mapping of the values of the feature attribute Flag is shown in Table 2. The feature attribute Service has dozens of distinct values, which are directly numericalized to 1, 2, 3, … in data-dictionary order.
Table 2: Numericalized mapping of the Flag attribute values
After standardization, each network data record is represented as a feature vector.
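The Protocol_type mapping is stated explicitly in the text; Table 2 and the Service data dictionary are not reproduced, so the Service handling below is a hypothetical illustration of "numbered in data-dictionary order":

```python
# Protocol_type mapping as stated in the embodiment: TCP -> 0, ICMP -> 1, UDP -> 2.
PROTOCOL_MAP = {"tcp": 0, "icmp": 1, "udp": 2}

def numericalize_record(protocol, service, service_dictionary):
    """Convert the character-type attributes of one KDD CUP99 record to numbers:
    Protocol_type via the fixed map, Service via its index in a (hypothetical)
    data dictionary, numbered from 1 as in the text."""
    return (PROTOCOL_MAP[protocol.lower()],
            service_dictionary.index(service) + 1)
```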
Step 1.2.3: Normalize the standardized network data by formula (1), so that the value of every feature of the network data lies in the range [0, 1].
Here, new_v denotes the value of any feature (denoted by the symbol V) of a network data record after normalization, new_v ∈ [0, 1]; v denotes the original value of feature V in the network data; max denotes the maximum of the original values of feature V over all network data; min denotes the minimum of the original values of feature V over all network data.
Step 1.3: Use the distance-based local outlier factor (LOF) algorithm to compute the local outlier factor of the preprocessed network data. Specifically:
Step 1.3.1: Represent each network data record as an m-dimensional feature vector, m = 41. The feature vector is denoted by the symbol s, s = {x1, x2, x3, …, xm}; x1, x2, x3, …, xm denote the m features. Then map the feature vectors into the m-dimensional feature space, so that each feature vector corresponds to a point in that space.
Step 1.3.2: Denote by the symbol p the point in the feature space corresponding to any m-dimensional feature vector, and compute the local outlier factor of point p by formula (2).
Here, LOFk(p) denotes the local outlier factor of the k-neighborhood of point p; the value of k is chosen manually, k = 50; Nk(p) denotes the k-neighborhood point set of point p, i.e., all points of the data set within the k-distance of p; lrdk(o) denotes the local reachability density of point o, o ∈ Nk(p); lrdk(p) denotes the local reachability density of point p, computed by formula (3).
Here, |Nk(p)| denotes the number of points in the k-neighborhood of point p; dist(p, o) denotes the distance from point p to point o.
Step 1.4: According to the local outlier factor LOFk(p), judge whether point p is an outlier.
Denote the outlier threshold by the symbol ε; ε is a manually set value, ε = 1.2. When LOFk(p) > ε, mark point p as an abnormal point; the feature vector corresponding to p is then abnormal data and is deleted. Otherwise, perform the operation of step 1.5.
After the operation of step 1.4, the data remaining after the abnormal data has been removed is taken as the basic training data.
Step 1.5: If the dimension m of the basic training data satisfies m > Q, perform the operation of step 1.6. Otherwise, take the basic training data as the training data. Here, Q = 40.
Step 1.6: Apply the information gain method to reduce the dimension of the basic training data to T; T is a manually set value, T = 20.
Step 1.7: Analyze and confirm the validity of the feature selection algorithm. The specific operations are:
Step 1.7.1: The logistic regression model is selected as the machine learning model for classifying the basic training data. The logistic regression model is a binary classification model used to divide the basic training data into normal data or attack data.
Step 1.7.2: Feed the basic training data obtained in step 1.4 into the logistic regression model to obtain classification results, then compute the classification accuracy on the basic training data, denoted by the symbol L1.
Step 1.7.3: Feed the dimension-reduced basic training data obtained in step 1.6 into the logistic regression model, obtain classification results, and compute the classification accuracy on the dimension-reduced data, denoted by the symbol L2.
Step 1.7.4: Denote the evaluation threshold by the symbol δ, δ = 0.95. If L2/L1 < δ, increase T and repeat the operations of steps 1.6 to 1.7. Otherwise, end this step and take the dimension-reduced basic training data as the training data.
After the operation of step 1, training data of T = 24 dimensions is obtained; the selected features correspond to the following numbers in Table 1: 5, 3, 6, 23, 26, 37, 2, 12, 30, 2, 3, 9, 10, 13, 15, 18, 1, 2, 7, 1, 14, 22, 4, 7.
Step 2: Build and train the multi-stage classifier.
Step 2.1: The voting principle of the random forest algorithm is improved, and the improved random forest algorithm is used to label the initial type of the training data. Specifically:
Step 2.1.1: Construct n decision trees, n = 1000.
Step 2.1.2: Denote the classification threshold by the symbol θ; θ is a manually set value, θ = 0.35.
Step 2.1.3: Denote a type of network data by the symbol y, y ∈ Y, where Y is the set of network data types.
The type set of the network data is Y = {normal, denial of service, probing, remote unauthorized access, illegal privilege escalation}.
Step 2.1.4: Obtain the classification result of a training data record by formula (4).
Here, x denotes a training data record, and f(x) denotes the value of the classification function indicating whether training data x belongs to type y; xi denotes the input of the i-th decision tree, a sample drawn from training data x; fi(xi) denotes the value of the classification function of the i-th decision tree, built by random sampling.
If the value of f(x) is 1, the initial type of training data x is labeled as class y.
Step 2.1.5: Feed the training data obtained in step 1 into the improved random forest classifier; the classification operation labels the initial type of the training data. Since the number of attack types is N, plus the normal data type, there are (N + 1) types in total.
Step 2.2: Use an iterative method to build the integrated support vector machine (SVM) classifier based on the adaptive boosting (AdaBoost) idea. In the iteration, the weights of each training data record and of the integrated classifier are computed as follows:
Step 2.2.1: For a training data set containing m training data records, denote by the symbols w1, w2, w3, …, wm the weights of the 1st to m-th training data records, and initialize w1, w2, w3, …, wm to 1/m, where m denotes the quantity of training data. Denote the current iteration number by the symbol t and set its initial value t = 0.
Step 2.2.2: Build the t-th base classifier, a support vector machine with a Gaussian kernel, denoted by the symbol gt; the classification function of this classifier is shown in formula (5). Feed the training data set into base classifier gt to obtain the training results.
Here, g(x) denotes the value of the classification function of base classifier gt; sgn(·) denotes the sign function; a is a training data record; the support vectors enter through the kernel expansion; b is the bias, whose initial value is a manually set value in the range [0, 0.5]; K(xr, xs) is the Gaussian kernel function.
For the base classifier of the normal type, b = 0.0.
For the base classifier of the denial of service (DOS) type, b = 0.13.
For the base classifier of surveillance or probing (Probing), b = 0.18.
For the base classifier of remote unauthorized access (R2L), b = 0.16.
For the base classifier of illegal user-privilege escalation (U2R), b = 0.21.
Step 2.2.3: Select the quadratic loss function as the loss function of base classifier gt. Compute the value of the loss function of gt, and set a threshold for the loss function of gt; the threshold is a manually set value.
Step 2.2.4: If the value of the loss function of base classifier gt is less than the threshold, the multi-stage classifier is obtained; the classification function of the multi-stage classifier is shown in formula (6). Then compute the model weight of base classifier gt by formula (7), denoted by the symbol dt, and end the operation. If the value of the loss function of gt is not less than the threshold, perform the operation of step 2.2.5.
Here, g denotes the value of the classification function of the multi-stage classifier; T is the total number of iterations.
Here, a denotes the a-th training data record; wt(a) denotes the weight value of the a-th training data record in the t-th iteration; gt(a) denotes the training result of base classifier gt on training data a; yt(a) denotes the true result of training data a.
Step 2.2.5: Adjust the training data weights w1, w2, w3, …, wm using formula (8), and standardize the weight of each training data record so that the weights of all training data sum to 1. Then increment the iteration number by 1 and return to step 2.2.2 to perform its operation.
Here, wt+1(a) denotes the weight value of training data a in the (t + 1)-th iteration; β is an adjustment coefficient: when the training result of data a matches the true result, β = 1; when the training result of data a differs from the true result, β = −1.
Step 2.3: Train the support vector machines (SVM) based on the adaptive boosting (AdaBoost) idea.
Feed each type of training data obtained in step 2.1 separately into the input of one SVM based on the AdaBoost idea. After training, 5 trained AdaBoost-based support vector machines (SVM) are obtained.
After the operation of step 2, a trained multi-stage classifier is obtained.
Step 3: Use the trained multi-stage classifier to classify and detect test data.
On the basis of the operation of step 2, the test data is fed separately into the classifier of each type in the multi-stage classifier to obtain the final classification results.
In this embodiment, the quantities of training data and test data used for each attack type, and the number of new attack-type samples in the test set, are shown in Table 3.
Table 3: Experimental data and sample-size statistics
To assess the validity of the classification-detection method, the classification accuracy, false-alarm rate, and miss rate are used as the evaluation metrics of the classification method. The detection metrics for each data type in this embodiment are shown in Table 4.
Table 4: Detection metrics for each data type

Claims (3)

1. A method of carrying out classification and detection on attack behaviour with machine learning techniques, characterized in that the specific operations are:
Step 1: Acquire network data and pre-process the network data to obtain training data. The training data are divided into normal data and attack data; the attack data are divided into a plurality of classes according to the different attack types, the number of attack types being denoted by the symbol N, where N is a positive integer; the quantity of training data of each type is no less than 3000.
The concrete operations for obtaining the training data are:
Step 1.1:Acquire network data from a network system. The network data include web-content-related features, network-traffic-related features, and network-connection-related features.
Step 1.2:Pre-process the network data, specifically:
Step 1.2.1:Carry out data cleansing on the network data, removing data whose feature items have missing values and data whose feature item values lie in an improper value range.
Step 1.2.2:Standardize the cleansed network data, specifically: apply numerical mapping or a binarizing numerical transformation to data of character type; after standardization, the network data are represented in feature vector form.
Step 1.2.3:Normalize the standardized network data by formula (1), so that the value of each feature item of the network data lies in the range [0,1], namely new_v = (v - min) / (max - min);
Wherein, new_v denotes the value of any feature item V in the network data after normalization, new_v ∈ [0,1]; v denotes the original value of feature item V in the network data; max denotes the maximum among the original values of feature item V over all the network data; min denotes the minimum among the original values of feature item V over all the network data.
Step 1.3:Using the distance-based local outlier factor (LOF) algorithm, calculate the local outlier factor of the pre-processed network data.
Step 1.4:According to the local outlier factor LOFk(p), judge whether point p is an outlier.
The threshold for outliers is denoted by the symbol ε; ε is a manually set value with value range (1,2]. When LOFk(p)>ε, point p is marked as an abnormal point, the feature vector corresponding to point p is treated as abnormal data, and the abnormal data are deleted; otherwise, execute the operation of step 1.5.
Through the operation of step 1.4, the data obtained after rejecting the abnormal data are taken as the basic training data.
Step 1.5:If the dimension m of the basic training data satisfies m>Q, execute the operation of step 1.6; otherwise, take the basic training data as the training data; wherein Q is a manually set value, Q≥40.
Step 1.6:Carry out dimension reduction on the basic training data using a feature selection algorithm, reducing the dimension of the basic training data to T, where T is a manually set value, T≤40.
Step 1.7:Analyse and confirm the validity of the feature selection algorithm. The concrete operations are:
Step 1.7.1:Select a machine learning model for classifying the basic training data. The machine learning model is a binary classification model for dividing the basic training data into normal data or attack data.
The machine learning model includes logistic regression models, decision trees, and perceptrons.
Step 1.7.2:Take the basic training data obtained in step 1.4 as the input of the machine learning model to obtain classification results, and then calculate the classification accuracy on the basic training data, denoted by the symbol L1.
Step 1.7.3:Take the dimension-reduced basic training data obtained in step 1.6 as the input of the machine learning model, obtain classification results, and calculate the classification accuracy on the dimension-reduced basic training data, denoted by the symbol L2.
Step 1.7.4:The threshold of the assessment result is denoted by the symbol δ, δ ∈ (0.9,1). If L2/L1 < δ, increase the value of T and repeat the operations of step 1.6 to step 1.7; otherwise, end the operation of this step and take the dimension-reduced basic training data as the training data.
Step 2: Build and train a multistage classifier. The concrete operation steps are:
Step 2.1:Improve the voting principle of the random forest algorithm to build an improved random forest classifier, specifically:
Step 2.1.1:Construct n decision trees, where n is a manually set value, n>500.
Step 2.1.2:The classification threshold is denoted by the symbol θ; θ is a manually set value, θ ∈ (0,1).
Step 2.1.3:A certain type of network data is denoted by the symbol y, y ∈ Y, where Y is the set of types of network data.
Step 2.1.4:Obtain the classification result of a training data item by formula (4);
Wherein, x denotes a training data item; f(x) denotes the value of the classification function for training data x belonging to type y; xi denotes the input of the i-th decision tree, xi being a sample drawn from training data x; fi(xi) denotes the value of the classification function of the i-th decision tree built by random sampling.
If the value of f(x) is 1, the initial type of training data x is labelled as class y.
Step 2.1.5:Take the training data obtained in step 1 as the input of the improved random forest classifier and carry out the classification operation, completing the labelling of the initial types of the training data. Since the number of attack types is N, plus the normal data type, there are (N+1) types in total.
Step 2.2:Build an ensemble classifier of support vector machines based on the adaptive boosting (AdaBoost) approach using an iterative method. The weight computation method in the iterative process for each training data item and for the ensemble classifier is specifically:
Step 2.2.1:For m training data items, denote the weights of the 1st to the m-th training data items by the symbols w1,w2,w3…wm respectively, and set the initial value of w1,w2,w3…wm to 1/m. The current iteration number is denoted by the symbol t, and the initial value of the current iteration number is set to t=0.
Step 2.2.2:Build the t-th base classifier of the support vector machine taking a Gaussian function as the kernel function, denoted by the symbol gt; the classification function of this classifier is shown in formula (5). Take the training data set as the input of base classifier gt to obtain the training results.
Wherein, g(x) denotes the value of the classification function of base classifier gt; sgn(·) denotes the sign function; a is a training data item; the support vectors and their coefficients are as given in formula (5); b is a bias term, the initial value of b being a manually set value with value range [0,0.5]; K(xr,xs) is the Gaussian kernel function.
Step 2.2.3:Select the quadratic loss function as the loss function of base classifier gt. Calculate the value of the loss function of base classifier gt, and set the threshold of the loss function of base classifier gt, the threshold being a manually set value.
Step 2.2.4:If the value of the loss function of base classifier gt is less than the threshold, the multistage classifier is obtained, the classification function of the multistage classifier being as in formula (6); then calculate the model weight of base classifier gt by formula (7), denoted by the symbol dt, and end the operation. If the value of the loss function of base classifier gt is not less than the threshold, execute the operation of step 2.2.5.
Wherein, g denotes the value of the classification function of the multistage classifier; T is the total number of iterations;
Wherein, a denotes the a-th training data item; wt(a) denotes the weight of the a-th training data item in the t-th iteration; gt(a) denotes the training result of base classifier gt on training data item a; yt(a) denotes the true label of training data item a;
Step 2.2.5:Adjust the weights w1,w2,w3…wm of the training data using formula (8), and normalize the weight of each training data item so that the weights of all the training data sum to 1; then increment the iteration number by 1, return to step 2.2.2, and execute the operation of step 2.2.2;
Wherein, wt+1(a) denotes the weight of training data item a in the (t+1)-th iteration; β is an adjustment coefficient: β=1 when the training result of training data item a is identical to the true label, and β=-1 when the training result of training data item a differs from the true label;
Step 2.3:Train support vector machines based on the adaptive boosting (AdaBoost) approach;
Each type of training data obtained in step 2.1 is separately input to one support vector machine based on the adaptive boosting (AdaBoost) approach; through training, (N+1) trained support vector machines based on the AdaBoost approach are obtained;
Through the operation of step 2, a trained multistage classifier is obtained;
Step 3: Carry out classification and detection on the test data with the trained multistage classifier;
Building on the operation of step 2, the test data are input to the multistage classifier to obtain the final classification results.
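Formula (4) is not reproduced in the text; a thresholded voting rule consistent with the surrounding definitions (n trees, classification threshold θ ∈ (0, 1), f(x) = 1 when the sample is assigned type y) would look like the following sketch, where the exact form of the vote is an assumption.

```python
def improved_rf_vote(tree_votes, y, theta):
    """Improved random forest voting sketch: f(x) = 1 when the fraction of
    the n decision trees whose prediction equals type y exceeds the
    classification threshold theta, else 0."""
    share = sum(1 for v in tree_votes if v == y) / len(tree_votes)
    return 1 if share > theta else 0
```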
2. The method of carrying out classification and detection on attack behaviour with machine learning techniques as claimed in claim 1, characterized in that the concrete operations of step 1.3, using the distance-based local outlier factor (LOF) algorithm to calculate the local outlier factor of the pre-processed network data, are:
Step 1.3.1:Represent each network data record as an m-dimensional feature vector, denoted by the symbol s, s={x1,x2,x3…xm}, where m is the number of feature items contained in one network data record and m is a positive integer; x1,x2,x3…xm denote the m feature items respectively. Then map the feature vectors into an m-dimensional feature space, each feature vector corresponding to one point in the feature space.
Step 1.3.2:Denote by the symbol p the point in the feature space corresponding to any m-dimensional feature vector, and calculate the local outlier factor of point p by formula (2);
Wherein, LOFk(p) denotes the local outlier factor of the k-th neighbourhood of point p; the value of k is manually specified, k>10; Nk(p) denotes the set of k-th neighbourhood points of point p, the k-th neighbourhood points of point p being all the points within the k-th distance of point p; lrdk(o) denotes the local reachability density of point o, o ∈ Nk(p); lrdk(p) denotes the local reachability density of point p, calculated by formula (3);
Wherein, |Nk(p)| denotes the number of points in the k-th neighbourhood of point p; dist(p,o) denotes the distance from point p to point o.
3. The method of carrying out classification and detection on attack behaviour with machine learning techniques as claimed in claim 1 or 2, characterized in that the feature selection algorithm in step 1.6 includes: the variance selection method, the information gain method, the mutual information method, the chi-square test method, and feature selection methods based on tree models.
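Claim 3 lists several acceptable feature selection algorithms; the sketch below pairs the simplest of them (variance selection) with the validity check of steps 1.6 to 1.7. The function names, the `accuracy` callable, and the use of an accuracy ratio against δ are illustrative assumptions.

```python
def select_top_variance(X, T):
    """Variance selection: keep the T feature columns with the largest variance."""
    def var(c):
        col = [row[c] for row in X]
        mu = sum(col) / len(col)
        return sum((v - mu) ** 2 for v in col) / len(col)
    keep = sorted(range(len(X[0])), key=var, reverse=True)[:T]
    return [[row[c] for c in keep] for row in X]

def reduce_with_check(X, T, accuracy, delta=0.95, T_max=40):
    """Steps 1.6-1.7 sketch: reduce to T dimensions, then enlarge T until the
    accuracy after reduction stays within a factor delta of the accuracy on
    the full data; `accuracy` is a hypothetical callable standing in for the
    binary model chosen in step 1.7.1 (T <= 40 per the claim)."""
    base = accuracy(X)  # L1 in the claim
    while T <= T_max:
        reduced = select_top_variance(X, T)
        if accuracy(reduced) / base >= delta:  # compare L2 against delta * L1
            return reduced
        T += 1
    return select_top_variance(X, T_max)
```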
CN201810202552.9A 2018-03-13 2018-03-13 A method of classification and Detection being carried out to attack with machine learning techniques Pending CN108540451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810202552.9A CN108540451A (en) 2018-03-13 2018-03-13 A method of classification and Detection being carried out to attack with machine learning techniques

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810202552.9A CN108540451A (en) 2018-03-13 2018-03-13 A method of classification and Detection being carried out to attack with machine learning techniques

Publications (1)

Publication Number Publication Date
CN108540451A true CN108540451A (en) 2018-09-14

Family

ID=63484291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810202552.9A Pending CN108540451A (en) 2018-03-13 2018-03-13 A method of classification and Detection being carried out to attack with machine learning techniques

Country Status (1)

Country Link
CN (1) CN108540451A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109379379A (en) * 2018-12-06 2019-02-22 中国民航大学 Based on the network inbreak detection method for improving convolutional neural networks
CN109450880A (en) * 2018-10-26 2019-03-08 平安科技(深圳)有限公司 Detection method for phishing site, device and computer equipment based on decision tree
CN109450876A (en) * 2018-10-23 2019-03-08 中国科学院信息工程研究所 A kind of DDos recognition methods and system based on various dimensions state-transition matrix feature
CN109660522A (en) * 2018-11-29 2019-04-19 华东师范大学 The mixed intrusion detection method based on deep layer self-encoding encoder towards Integrated Electronic System
CN109741175A (en) * 2018-12-28 2019-05-10 上海点融信息科技有限责任公司 Based on artificial intelligence to the appraisal procedure of credit again and equipment for purchasing automobile-used family by stages
CN110008976A (en) * 2018-12-05 2019-07-12 阿里巴巴集团控股有限公司 A kind of network behavior classification method and device
CN110133146A (en) * 2019-05-28 2019-08-16 国网上海市电力公司 A kind of Diagnosis Method of Transformer Faults and system considering unbalanced data sample
CN110478911A (en) * 2019-08-13 2019-11-22 苏州钛智智能科技有限公司 The unmanned method of intelligent game vehicle and intelligent vehicle, equipment based on machine learning
CN110581840A (en) * 2019-07-24 2019-12-17 中国科学院信息工程研究所 Intrusion detection method based on double-layer heterogeneous integrated learner
CN110719279A (en) * 2019-10-09 2020-01-21 东北大学 Network anomaly detection system and method based on neural network
CN111107077A (en) * 2019-12-16 2020-05-05 中国电子科技网络信息安全有限公司 SVM-based attack flow classification method
CN112269907A (en) * 2020-11-02 2021-01-26 山东万里红信息技术有限公司 Processing method of health big data of Internet of things
CN112333706A (en) * 2019-07-16 2021-02-05 ***通信集团浙江有限公司 Internet of things equipment anomaly detection method and device, computing equipment and storage medium
CN112367303A (en) * 2020-10-21 2021-02-12 中国电子科技集团公司第二十八研究所 Distributed self-learning abnormal flow cooperative detection method and system
CN112398779A (en) * 2019-08-12 2021-02-23 中国科学院国家空间科学中心 Network traffic data analysis method and system
CN112559591A (en) * 2020-12-08 2021-03-26 晋中学院 Outlier detection system and detection method for cold roll manufacturing process
CN112583844A (en) * 2020-12-24 2021-03-30 北京航空航天大学 Big data platform defense method for advanced sustainable threat attack
CN113114618A (en) * 2021-03-02 2021-07-13 西安电子科技大学 Internet of things equipment intrusion detection method based on traffic classification recognition
CN113239025A (en) * 2021-04-23 2021-08-10 四川大学 Ship track classification method based on feature selection and hyper-parameter optimization
CN113347181A (en) * 2021-06-01 2021-09-03 上海明略人工智能(集团)有限公司 Abnormal advertisement flow detection method, system, computer equipment and storage medium
CN114039837A (en) * 2021-11-05 2022-02-11 奇安信科技集团股份有限公司 Alarm data processing method, device, system, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150193693A1 (en) * 2014-01-06 2015-07-09 Cisco Technology, Inc. Learning model selection in a distributed network
CN104794192A (en) * 2015-04-17 2015-07-22 南京大学 Multi-level anomaly detection method based on exponential smoothing and integrated learning model
CN107276999A (en) * 2017-06-08 2017-10-20 西安电子科技大学 A kind of event detecting method in wireless sensor network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jia Bin: "Distributed detection model for DDoS attack traffic based on a combined classifier", Journal of Huazhong University of Science and Technology (Natural Science Edition) *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109450876B (en) * 2018-10-23 2020-12-22 中国科学院信息工程研究所 DDos identification method and system based on multi-dimensional state transition matrix characteristics
CN109450876A (en) * 2018-10-23 2019-03-08 中国科学院信息工程研究所 A kind of DDos recognition methods and system based on various dimensions state-transition matrix feature
CN109450880A (en) * 2018-10-26 2019-03-08 平安科技(深圳)有限公司 Detection method for phishing site, device and computer equipment based on decision tree
CN109660522A (en) * 2018-11-29 2019-04-19 华东师范大学 The mixed intrusion detection method based on deep layer self-encoding encoder towards Integrated Electronic System
CN109660522B (en) * 2018-11-29 2021-05-25 华东师范大学 Deep self-encoder-based hybrid intrusion detection method for integrated electronic system
CN110008976A (en) * 2018-12-05 2019-07-12 阿里巴巴集团控股有限公司 A kind of network behavior classification method and device
CN109379379B (en) * 2018-12-06 2021-03-02 中国民航大学 Network intrusion detection method based on improved convolutional neural network
CN109379379A (en) * 2018-12-06 2019-02-22 中国民航大学 Based on the network inbreak detection method for improving convolutional neural networks
CN109741175A (en) * 2018-12-28 2019-05-10 上海点融信息科技有限责任公司 Based on artificial intelligence to the appraisal procedure of credit again and equipment for purchasing automobile-used family by stages
CN110133146A (en) * 2019-05-28 2019-08-16 国网上海市电力公司 A kind of Diagnosis Method of Transformer Faults and system considering unbalanced data sample
CN112333706B (en) * 2019-07-16 2022-08-23 ***通信集团浙江有限公司 Internet of things equipment anomaly detection method and device, computing equipment and storage medium
CN112333706A (en) * 2019-07-16 2021-02-05 ***通信集团浙江有限公司 Internet of things equipment anomaly detection method and device, computing equipment and storage medium
CN110581840B (en) * 2019-07-24 2020-10-16 中国科学院信息工程研究所 Intrusion detection method based on double-layer heterogeneous integrated learner
CN110581840A (en) * 2019-07-24 2019-12-17 中国科学院信息工程研究所 Intrusion detection method based on double-layer heterogeneous integrated learner
CN112398779B (en) * 2019-08-12 2022-11-01 中国科学院国家空间科学中心 Network traffic data analysis method and system
CN112398779A (en) * 2019-08-12 2021-02-23 中国科学院国家空间科学中心 Network traffic data analysis method and system
CN110478911A (en) * 2019-08-13 2019-11-22 苏州钛智智能科技有限公司 The unmanned method of intelligent game vehicle and intelligent vehicle, equipment based on machine learning
CN110719279A (en) * 2019-10-09 2020-01-21 东北大学 Network anomaly detection system and method based on neural network
CN111107077B (en) * 2019-12-16 2021-12-21 中国电子科技网络信息安全有限公司 SVM-based attack flow classification method
CN111107077A (en) * 2019-12-16 2020-05-05 中国电子科技网络信息安全有限公司 SVM-based attack flow classification method
CN112367303A (en) * 2020-10-21 2021-02-12 中国电子科技集团公司第二十八研究所 Distributed self-learning abnormal flow cooperative detection method and system
CN112269907A (en) * 2020-11-02 2021-01-26 山东万里红信息技术有限公司 Processing method of health big data of Internet of things
CN112559591A (en) * 2020-12-08 2021-03-26 晋中学院 Outlier detection system and detection method for cold roll manufacturing process
CN112559591B (en) * 2020-12-08 2023-06-13 晋中学院 Outlier detection system and detection method for cold roll manufacturing process
CN112583844A (en) * 2020-12-24 2021-03-30 北京航空航天大学 Big data platform defense method for advanced sustainable threat attack
CN112583844B (en) * 2020-12-24 2021-09-03 北京航空航天大学 Big data platform defense method for advanced sustainable threat attack
CN113114618A (en) * 2021-03-02 2021-07-13 西安电子科技大学 Internet of things equipment intrusion detection method based on traffic classification recognition
CN113239025A (en) * 2021-04-23 2021-08-10 四川大学 Ship track classification method based on feature selection and hyper-parameter optimization
CN113347181A (en) * 2021-06-01 2021-09-03 上海明略人工智能(集团)有限公司 Abnormal advertisement flow detection method, system, computer equipment and storage medium
CN114039837A (en) * 2021-11-05 2022-02-11 奇安信科技集团股份有限公司 Alarm data processing method, device, system, equipment and storage medium
CN114039837B (en) * 2021-11-05 2023-10-31 奇安信科技集团股份有限公司 Alarm data processing method, device, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108540451A (en) A method of classification and Detection being carried out to attack with machine learning techniques
Khan et al. An improved convolutional neural network model for intrusion detection in networks
CN107395590B (en) A kind of intrusion detection method classified based on PCA and random forest
Özgür et al. A review of KDD99 dataset usage in intrusion detection and machine learning between 2010 and 2015
Ibrahimi et al. Management of intrusion detection systems based-KDD99: Analysis with LDA and PCA
CN106973038B (en) Network intrusion detection method based on genetic algorithm oversampling support vector machine
CN107862347A (en) A kind of discovery method of the electricity stealing based on random forest
Raj et al. Applications of pattern recognition algorithms in agriculture: a review
US20080086493A1 (en) Apparatus and method for organization, segmentation, characterization, and discrimination of complex data sets from multi-heterogeneous sources
CN102291392A (en) Hybrid intrusion detection method based on bagging algorithm
CN114124482B (en) Access flow anomaly detection method and equipment based on LOF and isolated forest
CN105354198A (en) Data processing method and apparatus
Chandolikar et al. Efficient algorithm for intrusion attack classification by analyzing KDD Cup 99
Chen et al. Pattern recognition using clustering algorithm for scenario definition in traffic simulation-based decision support systems
CN107483451A (en) Based on serial parallel structural network secure data processing method and system, social networks
CN113542241A (en) Intrusion detection method and device based on CNN-BiGRU mixed model
Mir et al. An experimental evaluation of bayesian classifiers applied to intrusion detection
CN116865994A (en) Network data security prediction method based on big data
Gong et al. Intrusion detection system combining misuse detection and anomaly detection using genetic network programming
Machoke et al. Performance Comparison of Ensemble Learning and Supervised Algorithms in Classifying Multi-label Network Traffic Flow
Thanh et al. An approach to reduce data dimension in building effective network intrusion detection systems
CN115392351A (en) Risk user identification method and device, electronic equipment and storage medium
Liu et al. Network intrusion detection based on chaotic multi-verse optimizer
CN113837481A (en) Financial big data management system based on block chain
Chareka et al. A study of fitness functions for data classification using grammatical evolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180914
