CN113934719B

CN113934719B - Industrial Internet intrusion detection data set processing method based on D-N

Info

Publication number: CN113934719B
Application number: CN202111202373.3A
Authority: CN
Inventors: 刘明山; 石伟诚; 周原; 韦晓宇
Original assignee: Jilin University
Current assignee: Jilin University
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2024-04-19
Anticipated expiration: 2041-10-15
Also published as: CN113934719A

Abstract

The invention discloses a D-N-based method for processing industrial Internet intrusion detection data sets. It addresses three problems that arise when existing ensemble learning algorithms are applied to industrial Internet intrusion detection: redundant data items in the data set degrade the generalization performance of the trained ensemble learning model; some types of data labels in the data set cannot be recognized by the individual learners; and some types of data labels are misidentified by the individual learners. As a result, the trained ensemble learning model has low detection precision. The invention provides a new method for processing the training and validation data sets when ensemble learning is used for industrial Internet intrusion detection.

Description

Industrial Internet intrusion detection data set processing method based on D-N

Technical Field

The invention relates to data set processing, data cleaning, the discretization-normalization mathematical method (D-N algorithm), and the classification and application of ensemble learning algorithms, and in particular to processing the KDD99 data set for industrial Internet intrusion detection with the CART-AMV ensemble learning algorithm.

Background

Ensemble learning replaces the complex single-algorithm pipeline of conventional machine learning: by constructing a large number of simple and varied individual learners, it can effectively reduce the algorithmic complexity and cost of machine learning. This is its advantage. Its disadvantage is that the training of the individual learners depends strongly on the training data set: the quality of the data structure of the training data set directly determines the generalization performance of the trained individual learners. When ensemble learning is applied to the classification problem of industrial Internet intrusion detection, large data sets such as KDD99, NSL-KDD and UNSW-NB15 are available. These data sets are large, contain real data, and cover intrusion attack types comprehensively. However, they suffer from high data redundancy, non-uniform data types, and some data labels that individual learners cannot recognize.

Disclosure of Invention

The invention provides a D-N-based processing algorithm for industrial Internet intrusion detection data sets. It addresses both the demanding requirements that ensemble learning algorithms place on the data structure of a data set and the deficiencies of existing industrial Internet intrusion detection data sets. The data set is analyzed and organized in three steps: data cleaning, data discretization and data normalization. The invention can be applied to data sets of various types and sizes.

The specific technical scheme for realizing the aim of the invention is as follows:

First, a data cleaning pool is constructed, and data items in which certain data labels take specific values are processed (for example, rewritten or rejected), which reduces data redundancy and improves the generalization performance of the trained ensemble learning model. Second, non-numerical data labels are converted into discrete numerical labels in three steps (code conversion, arithmetic mean calculation and mean absolute deviation calculation), which increases the number of usable data labels in the data set. Finally, the discretized non-numerical data labels and the continuous numerical data labels are normalized, which further reduces the order-of-magnitude differences between the central values of different data labels and improves the classification precision of the trained ensemble learning model.
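As an illustration of the first stage, the data cleaning pool can be pictured as a rule table with one entry per data label, where each rule maps a specific value to a processing mode. The following is a minimal Python sketch under stated assumptions, not the patented implementation: data labels are treated as column names, and the mode names `REJECT` and `REWRITE`, the helpers `make_pool` and `clean`, and the sample rows and rules are all hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical processing modes; the text only says items are "rewritten or rejected".
REJECT = "reject"
REWRITE = "rewrite"

Row = Dict[str, Any]
# pool[label][value] = (mode, replacement); the replacement is ignored for REJECT.
CleaningPool = Dict[str, Dict[Any, Tuple[str, Any]]]


def make_pool(labels: List[str]) -> CleaningPool:
    """Build the empty table E with the same labels, in the same order, as the data set."""
    return {label: {} for label in labels}


def clean(rows: List[Row], pool: CleaningPool) -> List[Row]:
    """Traverse rows, then columns, applying the rule registered for each value."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input data set is left untouched
        keep = True
        for label, rules in pool.items():
            rule = rules.get(new_row[label])
            if rule is None:
                continue  # no rule registered for this value
            mode, replacement = rule
            if mode == REJECT:
                keep = False
                break
            new_row[label] = replacement  # REWRITE
        if keep:
            cleaned.append(new_row)
    return cleaned


# Illustrative rows only; the column names follow KDD99, the values are made up.
rows = [
    {"protocol_type": "tcp", "service": "http", "src_bytes": 181},
    {"protocol_type": "udp", "service": "other", "src_bytes": 105},
    {"protocol_type": "tcp", "service": "HTTP", "src_bytes": 239},
]
pool = make_pool(list(rows[0].keys()))       # E
pool["service"]["other"] = (REJECT, None)    # E_f: drop rows whose service is "other"
pool["service"]["HTTP"] = (REWRITE, "http")  # E_f: rewrite a variant spelling
cleaned_rows = clean(rows, pool)             # D_f
```

Keeping the rules in a table that mirrors the labels of the data set means the traversal code stays the same when rules are added or removed.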

Drawings

Other objects and results of the invention will become more apparent from the following description and claims, taken in conjunction with the accompanying drawings. In the drawings:

FIG. 1 is an algorithm flow chart of an industrial Internet intrusion detection dataset processing algorithm based on D-N;

FIG. 2 is a statistical graph of the self-importance coefficients of the data labels, obtained by training with the CART-AMV ensemble learning algorithm after the kddcup.data_10_percentage.gz data set of the KDD99 series has been processed with the D-N-based industrial Internet intrusion detection data set processing algorithm;

FIG. 3 shows the value distributions, after processing by the D-N-based industrial Internet intrusion detection data set processing algorithm, of those data labels in FIG. 2 whose self-importance coefficient is greater than 0.6.

Detailed Description

(1) Input the data set D to be processed and traverse all data labels $l_1, l_2, \ldots, l_n$ of D.

(2) According to the result of traversing the data labels of D, create an empty table E, the data cleaning pool, whose labels have exactly the same order and names as the data labels of D.

(3) Under each data label $l_1, l_2, \ldots, l_n$ of the data cleaning pool E, enter the values to be processed, $v_{11}, v_{12}, \ldots, v_{1m}$; $v_{21}, v_{22}, \ldots, v_{2m}$; …; $v_{n1}, v_{n2}, \ldots, v_{nm}$, together with the processing mode M of each item. The updated data cleaning pool is denoted $E_f$.

(4) Traverse the data set D row by row and then column by column, compare each value against the data cleaning pool $E_f$, and handle the values to be processed according to the processing mode M, obtaining the traversed data set $D_f$.

(5) Traverse the data set $D_f$ column by column and then row by row. If the data type of a data label is numerical, skip this step. If it is non-numerical, count the number m of distinct values of the data label and assign them the simple codes 1, 2, …, m, obtaining the numerical data label values $x_1, x_2, \ldots, x_n$.

(6) From the numerical data label values $x_1, x_2, \ldots, x_n$ obtained in step (5), calculate the arithmetic mean AVG of the encoded values of each data label that underwent numerical encoding, where $\mathrm{AVG} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

(7) From the numerical data label values $x_1, x_2, \ldots, x_n$ obtained in step (5) and the arithmetic mean AVG obtained in step (6), calculate the mean absolute deviation STAD of the encoded values of each data label that underwent numerical encoding, where $\mathrm{STAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \mathrm{AVG} \rvert$.

(8) From $x_n$, AVG and STAD obtained in steps (5), (6) and (7), calculate the final discretized value $x'_n$ of each data label processed in steps (5) - (7). Note that if AVG = 0 or STAD = 0, the discretized value is $x'_n = 0$. This yields the traversed data set $D_d$.

(9) Traverse the values of each data label to obtain the maximum $x_{max}$ and the minimum $x_{min}$ of the discretized data label values.

(10) From $x_{max}$ and $x_{min}$ obtained in step (9), calculate the normalized value $x''_n$ of each data label.

(11) After steps (5) - (10) have been completed for all columns of the data set $D_f$, store all processed data in the new data set $D_n$, using the data format of data set D.

(12) Apply the data set processing algorithm of steps (1) - (11) to industrial Internet intrusion detection data sets, and further verify the usability of the method through experiments with ensemble-learning-based industrial Internet intrusion detection algorithms.
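Steps (5) to (10) can be sketched for a single column as follows. This is a hedged illustration rather than the method as claimed: the formulas for the discretized and the normalized value are not given in this text, so the sketch assumes the common forms `x' = (x - AVG) / STAD` and min-max scaling `(x' - x_min) / (x_max - x_min)`. It also assumes that numerical columns skip discretization and are only normalized. The function names and sample values are hypothetical.

```python
from typing import Hashable, List, Sequence


def encode(values: Sequence[Hashable]) -> List[int]:
    """Step (5): map the m distinct values of a non-numerical column to 1, 2, ..., m."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)  # codes follow order of first appearance
    return [codes[v] for v in values]


def discretize(xs: Sequence[float]) -> List[float]:
    """Steps (6)-(8): AVG, STAD, then the ASSUMED form x' = (x - AVG) / STAD."""
    n = len(xs)
    avg = sum(xs) / n                            # arithmetic mean
    stad = sum(abs(x - avg) for x in xs) / n     # mean absolute deviation
    if avg == 0 or stad == 0:                    # rule stated in step (8)
        return [0.0] * n
    return [(x - avg) / stad for x in xs]


def normalize(xs: Sequence[float]) -> List[float]:
    """Steps (9)-(10): ASSUMED min-max scaling onto [0, 1]."""
    x_min, x_max = min(xs), max(xs)
    if x_max == x_min:                           # guard added here, not in the source
        return [0.0] * len(xs)
    return [(x - x_min) / (x_max - x_min) for x in xs]


def dn_column(values: Sequence, is_numeric: bool) -> List[float]:
    """Run one column through the D-N stages under the assumptions above."""
    if is_numeric:
        return normalize(values)
    return normalize(discretize(encode(values)))


# Illustrative columns; the values are made up.
protocol_dn = dn_column(["tcp", "udp", "tcp", "icmp"], is_numeric=False)
src_bytes_dn = dn_column([0, 5, 10], is_numeric=True)
```

Processing one column at a time mirrors the column-first traversal of step (5) and keeps the per-label statistics (AVG, STAD, maximum, minimum) local to each data label.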

Claims

1. A D-N-based method for processing an industrial Internet intrusion detection data set, characterized in that the D-N-based processing algorithm analyzes the data types of the industrial Internet intrusion detection data set and performs data cleaning, discretization and normalization on the data in the data set, the method comprising the following steps:

(1) Inputting a data set D to be processed, and traversing all data labels $l_1, l_2, \ldots, l_n$ of the data set D;

(2) According to the traversing result of the data labels of the data set D, an empty table E, namely a data cleaning pool, is established, wherein the sequence and the name of the empty table E are completely consistent with those of the data labels of the data set D;

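(3) Under each data label $l_1, l_2, \ldots, l_n$ of the data cleaning pool E, respectively inputting the values of the data labels to be processed, $v_{11}, v_{12}, \ldots, v_{1m}$; $v_{21}, v_{22}, \ldots, v_{2m}$; …; $v_{n1}, v_{n2}, \ldots, v_{nm}$, and the processing mode M of each item, and updating the data cleaning pool to $E_f$;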

(4) Traversing the data set D row by row and then column by column, comparing against the data cleaning pool $E_f$, and processing the data labels to be processed in the processing mode M to obtain the traversed data set $D_f$;

(5) Traversing the data set $D_f$ column by column and then row by row; if the data type of a data label is numerical, skipping this step; if the data type of the data label is non-numerical, counting the number m of distinct values of the data label and simply coding the m values as 1, 2, …, m to obtain the numerical data label values $x_1, x_2, \ldots, x_n$;

(6) According to the numerical data label values $x_1, x_2, \ldots, x_n$ obtained in step (5), calculating the arithmetic mean AVG of the encoded values of each numerically encoded data label, wherein $\mathrm{AVG} = \frac{1}{n}\sum_{i=1}^{n} x_i$;

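(7) According to the numerical data label values $x_1, x_2, \ldots, x_n$ obtained in step (5) and the arithmetic mean AVG obtained in step (6), calculating the mean absolute deviation STAD of the encoded values of each numerically encoded data label, wherein $\mathrm{STAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \mathrm{AVG} \rvert$;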

(8) According to $x_n$, AVG and STAD obtained in steps (5), (6) and (7), calculating the final discretized value $x'_n$ of each data label processed in steps (5) - (7), wherein, if AVG = 0 or STAD = 0, the discretized value $x'_n = 0$, to obtain the traversed data set $D_d$;

(9) Traversing the values of each data label to obtain the maximum $x_{max}$ and the minimum $x_{min}$ of the discretized data label values;

(10) According to $x_{max}$ and $x_{min}$ obtained in step (9), calculating the normalized value $x''_n$ of each data label;

(11) After all columns of the data set $D_f$ have been processed in steps (5) - (10), storing all processed data in the new data set $D_n$ according to the data format of data set D.

2. The D-N-based industrial Internet intrusion detection data set processing method of claim 1, wherein: first, a data cleaning pool is constructed, and data items in which certain data labels take specific values are processed, that is, rewritten or rejected, which reduces data redundancy and improves the generalization performance of the trained ensemble learning model; second, non-numerical data labels are converted into discrete numerical labels in three steps, namely code conversion, arithmetic mean calculation and mean absolute deviation calculation, which increases the number of usable data labels in the data set; finally, the discretized non-numerical data labels and the continuous numerical data labels are normalized, which further reduces the order-of-magnitude differences between the central values of different data labels and improves the classification precision of the trained ensemble learning model.