CN109286622A

CN109286622A - A network intrusion detection method based on a learned rule set

Info

Publication number: CN109286622A
Application number: CN201811122445.1A
Authority: CN
Inventors: 王劲松; 杨传印; 黄玮; 莫敬涛
Original assignee: Tianjin University of Technology
Current assignee: Tianjin University of Technology
Priority date: 2018-09-26
Filing date: 2018-09-26
Publication date: 2019-01-29
Anticipated expiration: 2038-09-26
Also published as: CN109286622B

Abstract

A network intrusion detection method based on a learned rule set. The network connection data in the international standard data set KDDCup99 is first preprocessed. An improved FOIL algorithm then removes redundant data items and extracts classification rules. Finally, network connection test data is classified according to these rules to determine whether a connection is an attack and, if so, its specific attack type. The method is validated experimentally on network connection data selected from KDDCup99, and the original FOIL algorithm is modified to suit the characteristics of this standard data set. The experimental results show that the improved algorithm increases the efficiency of both rule extraction and test data classification, improves the accuracy of the detection results to some extent, and avoids the low classification efficiency and high false alarm rate of traditional intrusion detection systems.

Description

A network intrusion detection method based on a learned rule set

Technical field

The invention relates to the field of network intrusion detection systems, and in particular to a network intrusion detection method based on a learned rule set.

Background art

As an important complement to the firewall, an intrusion detection system can collect and analyze information from several key points in a computer network or computer system without affecting network performance, and determine whether there are signs of intrusion in the network or system. It thereby helps protect the network system and plays a very important role in network system security.

Intrusion detection based on data mining has become a research hotspot, and many results have been produced in China and abroad. Several problems nevertheless remain widespread: data-mining-based intrusion detection methods still need improvement in detection accuracy, false alarm rate, and real-time performance. In particular, models that tightly integrate data mining technology with intrusion detection are needed to improve the accuracy and timeliness of detection.

Summary of the invention

The present invention addresses the low classification efficiency and high false alarm rate of traditional intrusion detection systems. It proposes a network intrusion detection method based on a learned rule set, which processes network connection data with an improved FOIL algorithm to improve the timeliness and accuracy of intrusion detection. Experiments on the KDDCup99 data set, compared against the original FOIL algorithm, indicate that applying the improved algorithm to intrusion detection is feasible.

Applying a rule set learning algorithm to intrusion detection is primarily a viewpoint centered on data mining and processing; the acquisition and processing of network connection data is outside the scope of the invention. The invention takes the international standard network connection data set KDDCup99 as an example and classifies intrusive network connections using data mining as its theoretical basis.

Technical solution of the present invention:

A network intrusion detection method based on a learned rule set, comprising the following steps:

Step 1: Divide the network connection data selected from the international standard data set KDDCup99 into a training set and a test set, then preprocess each data item in both sets so that every data item in a network connection record is made distinct.

Step 2: Using the improved FOIL algorithm, i.e. the rule set learning algorithm, remove redundant attribute data items from the training set, train on the remaining attribute data items, extract classification rules, and store the resulting rules in the classification rule base.

Step 3: Match each network connection record in the test set against the classification rules in the rule base one by one. For each matched rule, compute its average accuracy from the training set samples it covers, then accumulate the rule accuracies separately for each connection type appearing in the rule consequents.

Step 4: Save the accuracy of each connection type from Step 3; the connection type with the highest accuracy is the classification result for the network connection test record and serves as the final detection result. After the test set data has been classified, the correctly classified test data is added to the training set together with its detection results, as a source of training data for subsequent rule extraction. The classification rules can thus be updated dynamically to adapt to changes in network connections.

The data set preprocessing in Step 1 comprises the following steps:

Step 1.1: Using cross-validation, take 60% of the network connection data in the KDDCup99 data set as the training set and the remaining 40% as the test set.

Step 1.2: Add a sequence parameter to each data item in every network connection record, so that each data item becomes distinct and the data is easier to discriminate. The KDDCup99 data set contains many identical data items; for example, a single network connection record may contain several "0" values, yet each column has its own specific meaning. When processing a network connection record, the original FOIL algorithm treats identical data items as the same data, which affects both the speed of rule extraction and the accuracy of the classification results. To remedy this defect, a sequence parameter is added to the data item of each column during preprocessing. This distinguishes identical values within a network connection record and preserves the specific meaning of each value.

Removing redundant data items from the training set with the improved FOIL algorithm and extracting classification rules, as described in Step 2, comprises the following steps:

Step 2.1: Divide the preprocessed training set into positive examples and negative examples according to network connection type, and collect the attribute data items in the positive example set. Enumerate all network connection types in the training set; the records of one connection type are taken as positive examples and the records of all other connection types as negative examples. Collect the distinct attribute data items in the positive example set and add them to the positive-example attribute data item set Vset. Set the antecedent of classification rule r to empty.

Step 2.2: Compute the gain of each attribute data item v in the positive-example attribute data item set Vset, remove redundant data items that do not satisfy the restriction, and add the attribute data item with the largest gain that satisfies the restriction to the antecedent of classification rule r, yielding a new classification rule r'. The gain of attribute data item v is computed as follows:
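The formula itself is not reproduced in this text. Assuming it is the standard FOIL gain, it can be written with the symbols P, N, P^* and N^* used in this step as:

```latex
\mathrm{Gain}(v) = P^{*}\left[\log\frac{P^{*}}{P^{*}+N^{*}} - \log\frac{P}{P+N}\right]
```

Under this assumed form, an item that keeps many positive examples covered while excluding negative examples receives a large gain; the base of the logarithm does not change which item has the largest gain.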

When the antecedent of classification rule r is empty, P and N in the formula denote the number of samples in the positive example set and the negative example set, respectively, and P^* and N^* denote the number of samples in the positive and negative example sets covered by the new classification rule r' obtained by adding attribute data item v to the antecedent of r. In this case, the gains of all attribute data items are computed and compared, and the attribute data item with the largest gain is added to the antecedent of r.

When the antecedent of classification rule r is not empty, P and N in the formula denote the number of samples covered by r in the positive and negative example sets, respectively, and P^* and N^* denote the number of samples in the positive and negative example sets covered by the new rule r' obtained by adding attribute data item v to the antecedent of r. In this case, adding the attribute data item with the largest gain to the antecedent of r must satisfy the following restriction: the new rule r' must cover fewer samples in the negative example set, i.e. N^* < N. If the negative examples covered by r' are unchanged after the addition, and the attribute value of the largest-gain item is identical to that of some item in the antecedent of r, the data item is considered redundant and is deleted from Vset. The attribute data item with the largest gain is then sought among the remaining items and added to the antecedent of r under the same requirement, so as to specialize the rule.

Step 2.3: Save the classification rule r' from Step 2.2 and delete all samples in the negative example set that r' does not cover: traverse the negative example set and delete every sample that does not contain the antecedent of r'. If all samples in the negative example set have been deleted, r' can serve as a classification rule. If samples remain in the negative example set, collect the other attribute data items in the positive examples that contain the antecedent of r', return to Step 2.2, and add the largest-gain attribute data item that satisfies the requirement to the antecedent of r' to obtain a new classification rule R. Then delete all samples in the negative example set not covered by R, and repeat this process until all samples in the negative example set have been deleted.

Step 2.4: Save the classification rule R (or r') obtained in Step 2.3 and delete all samples in the positive example set that R (or r') covers: traverse the positive example set, check each sample for the antecedent of R (or r'), and delete every sample that contains it. If all samples in the positive example set have been deleted, rule extraction for this type is complete. If samples remain, collect all attribute data items in the remaining samples, return to Step 2.2, and repeat the preceding steps until every sample in the positive example set has been deleted and rule extraction for this type is complete. Each rule that deletes positive examples serves as a classification rule for samples of this type; the consequent of these rules is the network connection type of this type of sample, and the rules are stored in the classification rule base.

Step 2.5: Return to Step 2.1 and extract classification rules for the next class of samples, until the classification rules of all sample types have been found. This completes the extraction of classification rules from the training set.

Computing the average accuracy of the matched classification rules, as described in Step 3, comprises the following steps:

Step 3.1: Read the test set data, compare each network connection record in the test set with each classification rule in the rule base, and record the rules that match. Every network connection record in KDDCup99 has many data items, and the antecedents of the rules extracted in Step 2 may each contain several attribute data items. When a network connection record of unknown type in the test set is classified with the extracted rules, it may therefore match several rules; all matched rules are recorded.

Step 3.2: For the m matched classification rules, if their consequents are all the same, the network connection record of unknown type is assigned the connection type in those consequents. If the consequents differ, the average accuracy of each rule is computed separately. The average accuracy of classification rule R_i is computed as follows:
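The formula is not reproduced in this text. Since the advantages section names the Laplace accuracy estimation formula, it is presumably the usual Laplace estimate, written here with the symbols e, n and k defined in the next paragraph:

```latex
\mathrm{Accuracy}(R_i) = \frac{e + 1}{n + k}
```

For instance, under this assumed form a rule whose antecedent appears in n = 2 training samples, both of its own type (e = 2), with k = 2 connection types in the training set, would score (2 + 1) / (2 + 2) = 0.75.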

Here, k is the number of distinct network connection types in the training set, n is the number of training set samples that contain the antecedent of R_i, and e is the number of training set samples whose connection type is the consequent of R_i and that contain the antecedent of R_i. Once the average accuracy of every matched rule has been obtained, the accuracies are accumulated by connection type, giving the accuracy of connection type t_i for this network connection test record:
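This formula is also not reproduced in the text. Reading "accumulated by connection type" as a plain sum (an assumption; an average over the same rules would be another possible reading), it could be written as:

```latex
\mathrm{Accuracy}(t_i) = \sum_{j=1}^{s} \mathrm{Accuracy}(R_j), \qquad R_1, \dots, R_s \text{ being the matched rules with consequent } t_i
```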

Accuracy(t_i) denotes the accumulated accuracy of the s rules, among the m matched rules, whose consequent is connection type t_i.

Obtaining the classification result from the accuracies of the connection types of the matched rules and adding the classified test data to the training set, as described in Step 4, comprises the following steps:

Step 4.1: Save the accuracy of each network connection type computed in Step 3; the connection type with the highest accuracy is the final classification result for the network connection test record.
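Steps 3.2 and 4.1 can be sketched in Python as follows. This is an illustration only: it assumes the Laplace estimate (e + 1) / (n + k) for a rule and a plain sum per connection type, and the names laplace_accuracy and classify, the representation of a rule as an (antecedent, consequent) pair, and the behaviour when no rule matches are all assumptions rather than part of the method as stated.

```python
from collections import defaultdict

def laplace_accuracy(rule, training, k):
    """(e + 1) / (n + k) for one rule, counted over a non-empty training set."""
    antecedent, consequent = rule
    covered_types = [t for items, t in training if antecedent <= items]
    n = len(covered_types)                                # samples containing the antecedent
    e = sum(1 for t in covered_types if t == consequent)  # ... that also have the rule's type
    return (e + 1) / (n + k)

def classify(matched_rules, training):
    """Return (best connection type, accumulated accuracy per type)."""
    if not matched_rules:
        return None, {}                                   # assumption: no match, no decision
    k = len({t for _, t in training})                     # number of distinct connection types
    totals = defaultdict(float)
    for rule in matched_rules:
        totals[rule[1]] += laplace_accuracy(rule, training, k)
    best_type = max(totals, key=totals.get)               # Step 4.1: highest accumulated accuracy
    return best_type, dict(totals)
```

When all matched rules share one consequent, the sum has a single entry, so the same code also returns that connection type.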

Step 4.2: This step gives the method its self-learning capability and keeps the classification rules up to date. Real network conditions change dynamically, so the rules obtained from a single round of training may not suit continuously changing network data. The method therefore adds the classified test data, together with its classification results, to the training set and trains again, generating new classification rules and updating the classification rule base.

The invention has the following advantages that

Taking the international standard data set KDDCup99 as an example, the invention first divides the data into a training set and a test set by cross-validation and adds sequence parameters to the 41 attribute data items in both sets. The improved FOIL algorithm then removes redundant data items from the training set and extracts classification rules through training. Finally, the Laplace accuracy estimation formula is used to compute the average accuracy of the rules matched by each test set record of unknown type, and the classification result is obtained by comparing accuracies. The test set data and its classification results are also added to the training set to keep the training data current and generate new classification rules, which gives the method good adaptivity and a self-learning capability. The improved FOIL algorithm avoids much of the repeated traversal and computation incurred when the original FOIL algorithm processes the KDDCup99 data set, reduces the time complexity of the algorithm, substantially speeds up rule extraction and classification, and improves the accuracy of the classification results for network connection data. Adaptivity and self-learning also make the method more stable.

Brief description of the drawings

Fig. 1 is a flow chart of the network intrusion detection method based on a learned rule set according to the invention.

Detailed description of the embodiments

Specific embodiments of the invention are described in further detail below with reference to the accompanying drawing.

Applying a rule set learning algorithm to intrusion detection is primarily a data-centered viewpoint; the acquisition and processing of network connection data is outside the scope of the invention. The invention takes the international standard network connection data set KDDCup99 as an example and classifies network connection data using data mining as its theoretical basis.

Fig. 1 shows the detailed steps of the network intrusion detection method based on a learned rule set. The method provided by the invention comprises the following steps:

Step 1: Divide the network connection data selected from the international standard data set KDDCup99 into a training set and a test set, then preprocess each data item in both sets so that every data item in a network connection record is made distinct.

Step 1.1: Using cross-validation, take 60% of the network connection data in the KDDCup99 data set as the training set and the remaining 40% as the test set. Specifically, records randomly selected from the KDDCup99 data set are placed in groups of 10; 6 records chosen arbitrarily from each group are added to the training set and the remaining 4 are added to the test set.
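As an illustration only, the grouping described in Step 1.1 could be sketched in Python as follows. The function name split_60_40, the fixed random seed, and the treatment of a final group with fewer than 10 records are assumptions not stated in the method.

```python
import random

def split_60_40(records, seed=0):
    """Group shuffled records in tens; send 6 of each group to training, 4 to test."""
    rng = random.Random(seed)          # fixed seed only to make the sketch repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)              # random selection from the data set
    train, test = [], []
    for start in range(0, len(shuffled), 10):
        group = shuffled[start:start + 10]
        cut = round(len(group) * 0.6)  # 6 for a full group; assumption for a short last group
        train.extend(group[:cut])      # the group is already in random order
        test.extend(group[cut:])
    return train, test

train, test = split_60_40(range(100))
print(len(train), len(test))  # 60 40
```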

Step 1.2: Add a sequence parameter to each data item in every network connection record, so that each data item becomes distinct and the data is easier to discriminate. The KDDCup99 data set contains many identical data items; for example, a single network connection record may contain several "0" or "1" values, yet each column has its own specific meaning. When processing a network connection record, the original FOIL algorithm treats identical data items as the same data, which affects both the speed of rule extraction and the accuracy of the classification results. To remedy this defect, a sequence parameter place, i.e. the column in which the data item is located, is added to the data item of each column during preprocessing. Each data item in the data set can be represented by the structure DataItem { int place, string data }. Consider, for example, the following randomly selected network connection record:

Table 1: A randomly selected network connection record

0	tcp	http	SF	279	1129	......	0	normal

During preprocessing, sequence parameters are added to the data items as follows: (1,0), (2,tcp), (3,http), (4,SF), (5,279), ..., (40,0), (41,0). This preserves the position of each value, so the meaning of the data cannot be confused, and during subsequent rule extraction only the data in the corresponding column needs to be traversed. This greatly reduces the amount of data traversed and significantly improves the efficiency of rule extraction.
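A minimal Python sketch of this preprocessing is given below. The DataItem fields follow the structure named in Step 1.2; the helper add_positions and the shortened seven-attribute record are hypothetical and used only for illustration.

```python
from typing import List, NamedTuple, Tuple

class DataItem(NamedTuple):
    place: int   # 1-based column index of the value (the sequence parameter)
    data: str    # raw field value, kept as text

def add_positions(fields: List[str]) -> Tuple[List[DataItem], str]:
    """Turn a raw record into positioned attribute data items plus its connection type."""
    *attributes, label = fields        # the last field is the connection type
    items = [DataItem(i, value) for i, value in enumerate(attributes, start=1)]
    return items, label

# Shortened, hypothetical record: a real KDDCup99 record has 41 attributes.
record = ["0", "tcp", "http", "SF", "279", "1129", "0", "normal"]
items, label = add_positions(record)
```

The two "0" values in the record become DataItem(1, "0") and DataItem(7, "0"), which no longer compare equal, so the algorithm cannot confuse them.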

Step 2: Remove redundant data items from the training set using the improved FOIL algorithm, train on the remaining network connection data, extract classification rules, and store them in the classification rule base.

The following terms are defined first:

Network connection: each row of the international standard data set KDDCup99 is one network connection. Each connection has 42 fields: the first 41 are attributes of the connection and the last is its connection type. A connection of type normal is a normal network connection; all other connection types are attack types.

Sample: the training set selected from the international standard data set is a collection of network connection records of various connection types; each network connection record is one sample.

Attribute data item: each data item in a sample, after a sequence parameter has been added, is called an attribute data item. An attribute data item contains the sequence parameter and the data value at that position.

Classification rule: a classification rule consists of an antecedent and a consequent. The antecedent is composed of several attribute data items, and the consequent is a connection type.

Covering: if the data items of a sample include all data items in the antecedent of a classification rule, the sample is said to be covered by that rule.

Gain: while extracting classification rules, the FOIL algorithm must select attribute data items to specialize the rule antecedent. The gain is the difference in information coding before and after the attribute data item is added to the antecedent. The larger the gain, the more the attribute data item contributes to reducing the information coding; the FOIL algorithm uses gain as the main criterion for choosing which attribute data item to add to the antecedent.

Step 2.1: Divide the preprocessed training set into two classes according to network connection type and collect the attribute data items in the positive example set. Enumerate all network connection types in the training set; the records of one connection type are taken as positive examples and the records of all other connection types as negative examples. Collect the distinct attribute data items in the positive example set and add them to the positive-example attribute data item set Vset. Set the antecedent of classification rule r to empty.

Step 2.2: Compute the gain of each attribute data item v in the positive-example attribute data item set Vset, remove redundant data items that do not satisfy the restriction, and add the attribute data item with the largest gain that satisfies the restriction to the antecedent of classification rule r, yielding a new classification rule r'. The gain of attribute data item v is computed as follows:

When the antecedent of classification rule r is empty, P and N in the formula denote the number of samples in the positive example set and the negative example set, respectively, and P^* and N^* denote the number of samples in the positive and negative example sets covered by the new classification rule r' obtained by adding attribute data item v to the antecedent of r. In this case, once the gains of all attribute data items have been computed and compared, the attribute data item with the largest gain can be added directly to the rule antecedent.

When the antecedent of classification rule r is not empty, P and N in the formula denote the number of samples covered by r in the positive and negative example sets, respectively, and P^* and N^* denote the number of samples in the positive and negative example sets covered by the new rule r' obtained by adding attribute data item v to the antecedent of r. In this case, adding the attribute data item with the largest gain to the antecedent of r must satisfy the following restriction: the new rule r' must cover fewer samples in the negative example set, i.e. N^* < N. If the negative examples covered by r' are unchanged after the addition, and the attribute value of the largest-gain item is identical to that of some item in the antecedent of r, the data item is considered redundant and is deleted from Vset. The attribute data item with the largest gain is then sought among the remaining items and added to the antecedent of r under the same requirement, so as to specialize the rule.
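The selection in Step 2.2 can be sketched in Python as shown below. This is a sketch under stated assumptions, not the patented implementation: the gain is taken to be the standard FOIL gain because the formula is not reproduced in this text, samples and rule antecedents are represented as sets of (place, data) pairs, and the names foil_gain and select_item, the base-2 logarithm, and the tie-breaking by sorted order are all hypothetical choices.

```python
import math

def foil_gain(p, n, p_star, n_star):
    """Standard FOIL gain (assumed form of the gain used in Step 2.2)."""
    if p_star == 0:
        return float("-inf")  # assumed convention: an item covering no positives is never best
    return p_star * (math.log2(p_star / (p_star + n_star)) - math.log2(p / (p + n)))

def select_item(rule, vset, positives, negatives):
    """Pick the largest-gain item from vset that may be added to rule.

    rule is a frozenset of (place, data) items; positives and negatives are
    lists of such sets. Redundant items are removed from vset in place.
    """
    pos = [s for s in positives if rule <= s]   # samples covered by r (all, if r is empty)
    neg = [s for s in negatives if rule <= s]
    p, n = len(pos), len(neg)

    def gain(v):
        return foil_gain(p, n, sum(v in s for s in pos), sum(v in s for s in neg))

    candidates = sorted(set(vset) - rule)       # sorted only to break ties deterministically
    while candidates:
        best = max(candidates, key=gain)
        n_star = sum(best in s for s in neg)
        if not rule or n_star < n:              # restriction applies only to a non-empty antecedent
            return best
        if any(best[1] == data for _, data in rule):
            vset.discard(best)                  # same value as an item already in r: redundant
        candidates.remove(best)
    return None                                 # no admissible item is left
```

Returning None when no item satisfies the restriction is a design choice of the sketch; the method as described does not say what happens in that case.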

Redundant attribute data items are removed in this step mainly because, when the training set is small, a redundant attribute data item may be added to the rule antecedent. Such an item contributes nothing to classification and can also reduce classification accuracy. Consider the following example:

Table 2: Eight records chosen to illustrate the removal of redundant attribute data items

...	F1	F2	F3	F4	...	Type
...	0	0	0	0	...	normal
...	0	0	1	0	...	land
...	0	1	0	0	...	ipsweep
...	0	1	1	0	...	normal
...	1	0	0	1	...	teardrop
...	1	0	1	1	...	normal
...	1	1	0	1	...	normal
...	1	1	1	1	...	back

In these 8 samples, the number of samples of type normal (positive examples) equals the number of samples of other types (negative examples), and every attribute column has the same values in the same quantities. All values of attribute F4 are identical to those of F1, so F4 is a redundant attribute. When extracting classification rules for connections of type normal, the gain of every attribute data item in the normal samples is the same. If the attribute data item (F1,0) is added to the rule antecedent first and the second item is selected by gain alone, the redundant item (F4,0) may be brought into the antecedent, which contributes nothing to subsequent classification. Adding the restriction N^* < N ensures that the largest-gain attribute data item chosen each time is not redundant.
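The effect of the restriction on the data in Table 2 can be checked with a few lines of Python. The sketch only counts covered negative examples; the helper negatives_covered is hypothetical, and columns F1 to F4 are numbered 1 to 4 for the purpose of the example.

```python
# Rows of Table 2 restricted to columns F1..F4, with the connection type.
ROWS = [
    ((0, 0, 0, 0), "normal"), ((0, 0, 1, 0), "land"),
    ((0, 1, 0, 0), "ipsweep"), ((0, 1, 1, 0), "normal"),
    ((1, 0, 0, 1), "teardrop"), ((1, 0, 1, 1), "normal"),
    ((1, 1, 0, 1), "normal"), ((1, 1, 1, 1), "back"),
]

def negatives_covered(antecedent, positive_type="normal"):
    """Count negative examples that match every (column, value) pair of the antecedent."""
    return sum(
        1 for values, label in ROWS
        if label != positive_type
        and all(values[col - 1] == val for col, val in antecedent)
    )

n = negatives_covered([(1, 0)])                   # rule so far: F1 = 0
n_star_f4 = negatives_covered([(1, 0), (4, 0)])   # add the redundant F4 = 0
n_star_f2 = negatives_covered([(1, 0), (2, 0)])   # add F2 = 0 instead
print(n, n_star_f4, n_star_f2)  # 2 2 1
```

Adding (F4,0) leaves the number of covered negative examples at 2, so it fails N^* < N, whereas adding (F2,0) reduces it to 1 and is admissible.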

Step 2.3: Save the classification rule r' from Step 2.2 and delete all samples in the negative example set that r' does not cover: traverse the negative example set and delete every sample that does not contain the antecedent of r'. If all samples in the negative example set have been deleted, r' can serve as a classification rule. If samples remain in the negative example set, collect the other attribute data items in the positive examples that contain the antecedent of r', return to Step 2.2, and add the largest-gain attribute data item that satisfies the requirement to the antecedent of r' to obtain a new classification rule R. Then delete all samples in the negative example set not covered by R, and repeat this process until all samples in the negative example set have been deleted.

Step 2.4: Save the classification rule R (or r') obtained in Step 2.3 and delete all samples in the positive example set that R (or r') covers: traverse the positive example set, check each sample for the antecedent of R (or r'), and delete every sample that contains it. If all samples in the positive example set have been deleted, rule extraction for this type is complete. If samples remain, collect all attribute data items in the remaining samples, return to Step 2.2, and repeat the preceding steps until every sample in the positive example set has been deleted and rule extraction for this type is complete. Each rule that deletes positive examples serves as a classification rule for samples of this type; the consequent of these rules is the network connection type of this type of sample, and the rules are stored in the classification rule base.

An example is given below to illustrate the above process. The following data is randomly selected from KDDCup99:

Table 3: Five records randomly selected from the KDDCup99 data set

0	tcp	http	SF	54540	8314	……	0.04	0.04	back
14	tcp	http	RSTR	33580	7300	……	1	1	back
0	icmp	eco_i	SF	18	0	……	0	0	ipsweep
0	tcp	http	SF	321	480	……	1	1	normal
0	tcp	http	SF	277	3410	……	0	0	normal

Take these 5 samples as the training set. After sequence parameters are added, the samples of connection type back are taken as positive examples and the samples of all other types as negative examples. The classification rules for the positive examples are trained first. The distinct attribute data items in the positive examples form the attribute data item set Vset(back): {(1,0), (2,tcp), (3,http), (4,SF), (5,54540), (6,8314), (40,0.04), (41,0.04), (1,14), (4,RSTR), (5,33580), (6,7300), (40,1), (41,1)}. The antecedent of classification rule r is set to empty and the gain of each attribute data item is computed; the items (5,54540), (6,8314), (40,0.04), (41,0.04), (1,14), (4,RSTR), (5,33580), (6,7300), (40,1), (41,1) have the largest gain. If (40,1) is now added to the antecedent of r and the negative examples not covered by the antecedent of r: {(40,1)} → back are deleted, the negative example set is not emptied.

The gains of all attribute data items in the samples of type back covered by r are then computed again. If (41,1) has the largest gain, it should be deleted as a redundant data item: adding it to the antecedent of r would not make the new rule r' cover fewer negative examples, and its attribute value is identical to that of the attribute data item already in r. The attribute data items with the largest gain are then (1,14), (4,RSTR), (5,33580), (6,7300). One of them is selected and added to the rule antecedent, giving the classification rule for connection type back: r: {(40,1), (1,14)} → back. All negative examples not covered by the antecedent of r are deleted; all negative examples have now been deleted, so one classification rule for the positive examples has been found. The positive examples covered by the antecedent of r are deleted; since the positive example set is not yet empty, another rule for the positive examples is generated by the same steps: {(40,0.04)} → back. All classification rules for the positive examples have now been found. The samples of each other connection type are then taken as positive examples in turn and the corresponding rules are generated. The following classification rules are finally obtained: {(40,1), (1,14)} → back; {(40,0.04)} → back; {(2,icmp)} → ipsweep; {(5,321)} → normal; {(6,3410)} → normal. These rules are stored in the classification rule base.
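The overall loop of Steps 2.1 to 2.5 can be sketched in Python as follows. It is a simplified illustration: the gain is assumed to be the standard FOIL gain, items that fail the restriction N^* < N are simply skipped rather than deleted from Vset, and ties between equal gains are broken by sorted order. Because the choice among equal-gain items is arbitrary, the rules it produces for the five samples above need not be the same as the rules listed in the example. The names learn_rules_for_type and learn_rule_base are hypothetical.

```python
import math

def foil_gain(p, n, p_star, n_star):
    """Standard FOIL gain, assumed here; callers guarantee p_star >= 1."""
    return p_star * (math.log2(p_star / (p_star + n_star)) - math.log2(p / (p + n)))

def learn_rules_for_type(samples, target):
    """samples: list of (frozenset of (place, data) items, connection type)."""
    positives = [items for items, t in samples if t == target]   # Step 2.1
    negatives = [items for items, t in samples if t != target]
    rules = []
    while positives:
        rule, pos, neg = frozenset(), positives, negatives
        while neg:                                   # Steps 2.2 and 2.3
            scored = []
            for v in sorted(set().union(*pos) - rule):
                p_star = sum(v in s for s in pos)
                n_star = sum(v in s for s in neg)
                if not rule or n_star < len(neg):    # restriction N* < N
                    scored.append((foil_gain(len(pos), len(neg), p_star, n_star), v))
            if not scored:
                break                                # no admissible item left
            best = max(scored, key=lambda pair: pair[0])[1]
            rule = rule | {best}
            pos = [s for s in pos if best in s]      # keep only samples still covered
            neg = [s for s in neg if best in s]
        if neg or not rule:
            break                                    # positives cannot be separated further
        rules.append((rule, target))
        positives = [s for s in positives if not rule <= s]   # Step 2.4
    return rules

def learn_rule_base(samples):
    """Step 2.5: repeat the extraction for every connection type."""
    rule_base = []
    for target in sorted({t for _, t in samples}):
        rule_base.extend(learn_rules_for_type(samples, target))
    return rule_base
```

Each returned rule is a pair (antecedent, connection type); by construction a rule is only kept once it covers no negative example of its type.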

Step 3: Each network connection record in the test set is matched one by one against the classification rules in the classification rule library. For the different matched classification rules, the average accuracy of each rule is calculated from how the rule covers the samples in the training set, and the accuracies of the rules are then accumulated separately for each connection type appearing in the rules.

Step 3.1: Read the test set data, compare every network connection record in the test set with every classification rule in the classification rule library, and record the rules that match. Every network connection record in the KDDCup99 data set has many data items, and the antecedents of the many classification rules extracted in step 2 may each contain several attribute data items, so a test record of unknown type may match several classification rules when it is classified; all matched classification rules are recorded.

Step 4: Save the accuracies of the different connection types obtained in step 3 and compare them; the connection type with the maximum accuracy is the classification result of the network connection under test. In addition, to give the method a good self-learning property, once the test set data have been classified by the classification rules, they are added to the training set together with their classification results. This provides a new source of training data for subsequent rule extraction and ensures that the classification rules are updated dynamically.

Step 4.2: To ensure the self-learning property of the method and the dynamic updating of the classification rules, and considering that real network conditions change dynamically, the classification rules obtained from a single round of training may not adapt to constantly changing network data. In this method, the classified test data are therefore added to the training set together with their classification results and training is repeated, generating new classification rules and updating the classification rule library.

To illustrate steps 3 and 4, one record is selected from the network connection data in the KDDCup99 data set whose connection type is back, ipsweep, or normal, as follows:

Table 4: Record chosen from the KDDCup99 data whose connection type is one of the three above

14 | tcp | http | SF | 321 | 3410 | ...... | 1 | 0 | normal

This record is matched against the classification rules in the classification rule library. Three rules match this network connection test record: {(5,321)} → normal; {(6,3410)} → normal; {(40,1), (1,14)} → back. Because the consequents of the 3 matched rules are not all the same, the accuracies of the two connection types corresponding to the three rules must be calculated separately. The connection type with the larger accuracy is normal, so the classification result of this network connection test record is normal, i.e. a normal network connection.
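The matching step in this example can be reproduced with a short sketch. It assumes a rule is a pair of an antecedent set and a consequent, and that the last two values in Table 4 are data items 40 and 41; the data items elided in Table 4 are simply left out, and the names `RULES`, `record`, and `matching_rules` are illustrative only.

```python
# Rule library from the worked example: (antecedent, consequent).
RULES = [
    (frozenset({(40, 1), (1, 14)}), "back"),
    (frozenset({(40, 0.04)}), "back"),
    (frozenset({(2, "icmp")}), "ipsweep"),
    (frozenset({(5, 321)}), "normal"),
    (frozenset({(6, 3410)}), "normal"),
]

# Test record from Table 4; only the data items shown there are included,
# and positions 40 and 41 for the last two values are an assumption.
record = frozenset({
    (1, 14), (2, "tcp"), (3, "http"), (4, "SF"),
    (5, 321), (6, 3410), (40, 1), (41, 0),
})

def matching_rules(record, rules):
    """A rule matches when every item of its antecedent occurs in the record."""
    return [(antecedent, consequent) for antecedent, consequent in rules
            if antecedent <= record]

matched = matching_rules(record, RULES)
print([consequent for _, consequent in matched])  # ['back', 'normal', 'normal']
```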

To verify the performance of the improved FOIL algorithm relative to the original FOIL algorithm when applied to a network intrusion detection system, the following verification experiment was carried out. Experimental environment: one PC with an Intel Core i7-4770 3.4 GHz CPU, 8 GB of memory, a 1 TB hard disk, and Visual Studio 2013 as the software environment. Experimental data: records were selected at random from the KDDCup99 data set according to the proportions of the different network connection types, with no more than 50 records taken for any one connection type and 2150 records chosen in total. Using cross-validation, 60% of the records were used as training set data and the other 40% as test set data. Five experiments were run on the FOIL algorithm before and after the improvement; the results are shown in Table 5. The average accuracy of current network intrusion detection algorithms was obtained from published papers; the comparison is shown in Table 6.

Table 5: Comparison of verification results for the FOIL algorithm before and after improvement on the international standard data set KDDCup99

Table 6: Comparison of the average classification accuracy of current network intrusion detection algorithms on KDDCup99 network connection data

The experimental results show that the intrusion detection method of the invention greatly improves on the original FOIL algorithm in execution time, and performs well in the average accuracy of its classification results compared with other algorithms.

Claims

1. A network intrusion detection method based on a learning rule set, characterized in that the method comprises the following steps:

Step 1: Divide the network connection data selected from the international standard data set KDDCup99 into training set data and test set data, then preprocess each data item in the training set and the test set so as to specialize each data item in the network connection data;

Step 2: Using the improved FOIL algorithm, i.e. the learning rule set algorithm, remove the redundant attribute data items from the training set data, train on the remaining attribute data items, extract classification rules, and store the obtained classification rules in the classification rule library;

Step 3: Match each network connection record in the test set one by one against the classification rules in the classification rule library; for the different matched classification rules, calculate the average accuracy of each rule from how the rule covers the samples in the training set, and accumulate the average accuracies of the rules separately for each connection type appearing in the rules;

Step 4: Save the accuracies of the different connection types obtained in step 3 and find the connection type with the maximum accuracy, which is the classification result of this network connection test record and serves as the final detection result; after the data in the test set have been classified, add the correctly classified test set data to the training set together with their detection results as a training data source for subsequent extraction of classification rules, so that the classification rules can be updated dynamically to adapt to changes in network connections.

2. The network intrusion detection method based on a learning rule set according to claim 1, characterized in that the method of preprocessing the data set in step 1 is:

Step 1.1: Using cross-validation, take 60% of the network connection data in the KDDCup99 data set as the training set and the remaining 40% as the test set;

Step 1.2: Add a sequence parameter to each data item in every network connection record, so as to specialize each data item in the network connection data, make the data items easier to distinguish, and preserve the specific meaning of each data item.
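The preprocessing of claim 2 can be sketched as below, under the assumption that the sequence parameter is the 1-based position of a field in the record, which is how pairs such as (2, tcp) and (40, 1) read in the description. The function names and the fixed random seed are hypothetical choices for illustration.

```python
import random

def add_sequence_params(fields):
    """Pair each field with its 1-based position, e.g. 'tcp' in column 2 -> (2, 'tcp').
    The same raw value in two different columns thereby becomes two distinct items."""
    return frozenset((index, value) for index, value in enumerate(fields, start=1))

def split_60_40(records, seed=0):
    """Shuffle the records and split them 60% / 40% into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 6 // 10
    return shuffled[:cut], shuffled[cut:]

items = add_sequence_params([0, "tcp", "http", "SF"])
print(sorted(items))  # [(1, 0), (2, 'tcp'), (3, 'http'), (4, 'SF')]

train, test = split_60_40(range(10))
print(len(train), len(test))  # 6 4
```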

3. The network intrusion detection method based on a learning rule set according to claim 1, characterized in that the method of step 2 for removing redundant data items from the training set with the improved FOIL algorithm and extracting classification rules is:

Step 2.1: According to the network connection type, divide the preprocessed training set data into two classes, positive examples and negative examples, and collect the attribute data items in the positive example set; enumerate all network connection types in the training set, classify the network connection data of one connection type as positive examples and the data of all other connection types as negative examples, add the distinct attribute data items in the positive example set to the positive example attribute data item set Vset, and set the antecedent of classification rule r to empty;

Step 2.2: Calculate the gain of each attribute data item v in the positive example attribute data item set Vset, remove the redundant attribute data items that do not meet the restrictive condition, and add the attribute data item with the maximum gain that meets the restrictive condition to the antecedent of classification rule r to obtain a new classification rule r'; the gain of attribute data item v is calculated as follows:

When the antecedent of classification rule r is empty, P and N in the formula denote the numbers of samples in the positive example set and the negative example set respectively, and P* and N* denote the numbers of samples in the positive example set and the negative example set covered by the new classification rule r' obtained by adding attribute data item v to the antecedent of r; in this case, the gains of all attribute data items are calculated and compared, and the attribute data item with the maximum gain is added to the antecedent of r;

When the antecedent of classification rule r is not empty, P and N in the formula denote the numbers of samples covered by r in the positive example set and the negative example set respectively, and P* and N* denote the numbers of samples in the positive example set and the negative example set covered by the new classification rule r' obtained by adding attribute data item v to the antecedent of r; in this case, the attribute data item with the maximum gain may be added to the antecedent of r only if it meets the following restrictive condition: the new classification rule r' obtained after the item is added must cover fewer samples in the negative example set, i.e. N* < N; if the samples of the negative example set covered by r' do not change after the maximum-gain item is added, and the attribute value of that item is identical to the attribute value of some item in the antecedent of r, the item is considered redundant and can be deleted from the positive example attribute data item set Vset; the attribute data item with the maximum gain is then sought among the remaining attribute data items and added to the antecedent of r according to the above requirements, so as to specialize the classification rule;

Step 2.3: Save the classification rule r' obtained in step 2.2 and delete all samples in the negative example set that are not covered by r'; that is, traverse all samples in the negative example set and delete every sample that does not contain the antecedent of r'; if all samples in the negative example set have been deleted, r' can serve as a classification rule whose consequent is the network connection type of the positive examples; if samples remain in the negative example set, collect the other attribute data items in the positive examples that contain the antecedent of r', return to step 2.2, add the qualifying attribute data item with the maximum gain to the antecedent of r' to obtain a new classification rule R, and continue deleting all samples in the negative example set that are not covered by the antecedent of R; repeat this process until all samples in the negative example set have been deleted;

Step 2.4: Save the classification rule R or r' obtained in step 2.3 and delete all samples in the positive example set that are covered by the antecedent of R or r'; that is, traverse all samples in the positive example set, check one by one whether each sample contains the antecedent of R or r', and delete every sample that does; if all samples in the positive example set have been deleted, the extraction of all classification rules for this type is complete; if samples remain in the positive example set, collect all attribute data items in the remaining samples, return to step 2.2 and repeat all of the preceding steps until all samples in the positive example set have been deleted, at which point the extraction of classification rules for this type is complete; every classification rule that deletes positive examples can serve as a classification rule for network connections of this type, the consequent of these rules is the network connection type of the samples of this type, and the rules are stored in the classification rule library;

Step 2.5: Return to step 2.1 and extract the classification rules of the next class of samples, until the classification rules of all sample types have been found; the process of extracting classification rules from the training set then ends.
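Steps 2.2 to 2.4 can be sketched for a single connection type as below. The gain formula is not reproduced in this text, so the sketch assumes the standard FOIL gain, Gain(v) = P* · (log2(P*/(P*+N*)) − log2(P/(P+N))), which uses the same four quantities P, N, P*, N* defined above; with this assumed formula the gains need not coincide with those reported in the worked example of the description. The sketch is also simplified: the restriction N* < N is applied at every step and redundant items are simply skipped rather than deleted from Vset. The names `foil_gain` and `learn_rules` and the toy samples are hypothetical.

```python
import math

def foil_gain(p, n, p_new, n_new):
    """Standard FOIL gain (assumed form): p, n are the positives / negatives covered
    by r; p_new, n_new are those covered by r' = r plus the candidate item."""
    before = math.log2(p / (p + n))
    after = math.log2(p_new / (p_new + n_new))
    return p_new * (after - before)

def learn_rules(positives, negatives):
    """Greedy extraction of rule antecedents for one connection type."""
    rules, remaining = [], list(positives)
    while remaining:                       # step 2.4: until every positive is covered
        antecedent = frozenset()
        cur_pos, cur_neg = remaining, list(negatives)
        while cur_neg:                     # step 2.3: until no negative is covered
            candidates = set().union(*cur_pos) - antecedent
            best, best_gain = None, float("-inf")
            for item in sorted(candidates, key=lambda it: (it[0], str(it[1]))):
                p_new = sum(item in s for s in cur_pos)
                n_new = sum(item in s for s in cur_neg)
                if n_new >= len(cur_neg):  # restriction N* < N not met: skip item
                    continue
                gain = foil_gain(len(cur_pos), len(cur_neg), p_new, n_new)
                if gain > best_gain:
                    best, best_gain = item, gain
            if best is None:               # no item separates the remaining negatives
                return rules
            antecedent = antecedent | {best}
            cur_pos = [s for s in cur_pos if best in s]
            cur_neg = [s for s in cur_neg if best in s]
        rules.append(antecedent)
        remaining = [s for s in remaining if not antecedent <= s]
    return rules

# Toy samples (hypothetical, not the patent's training table).
back = [frozenset({(1, 0), (2, "tcp"), (40, 0.04)}),
        frozenset({(1, 14), (2, "tcp"), (40, 1)})]
others = [frozenset({(1, 0), (2, "icmp"), (40, 1)}),
          frozenset({(1, 0), (2, "tcp"), (40, 1)})]

rules = learn_rules(back, others)
print([sorted(r) for r in rules])  # [[(1, 14)], [(40, 0.04)]]
```

Ties between equal gains are broken here by a fixed sort order so that the output is reproducible; the text only says that one of the maximum-gain items is selected.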

4. The network intrusion detection method based on a learning rule set according to claim 1, characterized in that the method of step 3 for calculating the average accuracy of the different matched classification rules is:

Step 3.1: Read the test set data, compare every network connection record in the test set with every classification rule in the classification rule library, and record the rules that match; every network connection record in the KDDCup99 data set has many data items, and the antecedents of the many classification rules extracted in step 2 may each contain several attribute data items, so a test record of unknown type may match several classification rules when it is classified; all matched classification rules are recorded;

Step 3.2: For the m matched classification rules, if the consequents of these rules are all the same, the network connection record of unknown type has the connection type given by these consequents; if the consequents of the matched rules are not all the same, calculate the average accuracy of each of these rules, where the average accuracy of classification rule R_i is calculated as follows:

where k is the number of different network connection types in the training set, n is the number of samples in the training set that contain the antecedent of classification rule R_i, and e is the number of samples in the training set whose connection type is the consequent of R_i and which contain the antecedent of R_i; after the average accuracy of every matched classification rule has been obtained, these average accuracies are accumulated separately by connection type, giving the accuracy of connection type t_i for this network connection test record:

which denotes the accuracy Accuracy(t_i) of connection type t_i, accumulated over the s rules among the m matched classification rules whose consequent connection type is t_i.
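The two formulas of step 3.2 are not reproduced in this text. The definitions of k, n, and e are consistent with the Laplace expected accuracy (e + 1) / (n + k) used in rule-based classifiers, and the sketch below assumes that form, with Accuracy(t_i) taken as the sum over the matched rules whose consequent is t_i. The counts n and e in the usage example are hypothetical, as are the names `laplace_accuracy` and `classify`.

```python
from collections import defaultdict

def laplace_accuracy(e, n, k):
    """Assumed per-rule average accuracy: (e + 1) / (n + k)."""
    return (e + 1) / (n + k)

def classify(matched, k):
    """matched: list of (consequent, n, e) for every rule that matched the record.
    Returns the connection type with the largest accumulated accuracy and the totals."""
    totals = defaultdict(float)
    for consequent, n, e in matched:
        totals[consequent] += laplace_accuracy(e, n, k)
    return max(totals, key=totals.get), dict(totals)

# Hypothetical counts for three matched rules, with k = 3 connection types.
matched = [("normal", 1, 1), ("normal", 1, 1), ("back", 1, 1)]
label, totals = classify(matched, k=3)
print(label, totals)  # normal {'normal': 1.0, 'back': 0.5}
```

Summing rather than averaging the per-rule accuracies means that a connection type supported by more matched rules accumulates a larger score, which is how two matching normal rules outweigh one matching back rule here.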

5. The network intrusion detection method based on a learning rule set according to claim 1, characterized in that the method of step 4 for obtaining the classification result from the accuracies of the different connection types of the matched classification rules and adding the classified test data to the training set is:

Step 4.1: Save the accuracies of the different network connection types calculated in step 3 and compare them; the connection type with the maximum accuracy is the final classification result of the network connection test record;

Step 4.2: To ensure the self-learning property of the method and the dynamic updating of the classification rules, and considering that real network conditions change dynamically, the classification rules obtained from a single round of training may not adapt to constantly changing network data; in this method, the classified test data are therefore added to the training set together with their classification results and training is repeated, generating new classification rules and updating the classification rule library.
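The update cycle of step 4.2 can be outlined as follows. The rule extraction and classification procedures are passed in as functions, and the ones used in the usage example are trivial stand-ins rather than the procedures of steps 2 and 3; all names are hypothetical.

```python
def self_learning_round(training_set, test_records, extract_rules, classify):
    """Classify the test records with the current rules, append them with their
    predicted types to the training set, and extract a new rule library."""
    rules = extract_rules(training_set)
    labelled = [(record, classify(record, rules)) for record in test_records]
    updated = training_set + labelled
    return updated, extract_rules(updated)

# Stand-ins for illustration only: the "rule library" is the sorted list of known
# connection types, and every record is assigned the first of them.
def extract(training):
    return sorted({label for _, label in training})

def assign_first(record, rules):
    return rules[0]

training = [("r1", "back")]
updated, new_rules = self_learning_round(training, ["r2", "r3"], extract, assign_first)
print(len(updated), new_rules)  # 3 ['back']
```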