CN113934719B - Industrial Internet intrusion detection data set processing method based on D-N - Google Patents

Publication number
CN113934719B
CN113934719B CN202111202373.3A CN202111202373A CN113934719B CN 113934719 B CN113934719 B CN 113934719B CN 202111202373 A CN202111202373 A CN 202111202373A CN 113934719 B CN113934719 B CN 113934719B
Authority
CN
China
Prior art keywords
data
data set
value
label
numerical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111202373.3A
Other languages
Chinese (zh)
Other versions
CN113934719A (en)
Inventor
刘明山
石伟诚
周原
韦晓宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN202111202373.3A
Publication of CN113934719A
Application granted
Publication of CN113934719B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a D-N-based method for processing industrial Internet intrusion detection data sets. When existing ensemble learning algorithms are applied to industrial Internet intrusion detection, redundant data items in the data set degrade the generalization performance of the trained ensemble model, and some data labels in the data set either cannot be recognized or are recognized incorrectly by the individual learners, which lowers the detection accuracy of the trained model. The invention mitigates these problems and provides a new method for processing the training and validation data sets used when ensemble learning is applied to industrial Internet intrusion detection.

Description

Industrial Internet intrusion detection data set processing method based on D-N
Technical Field
The invention relates to data set processing, data cleaning, the discretization-normalization mathematical method (D-N algorithm), and classification with ensemble learning algorithms and their applications, and in particular to processing the KDD99 data set for industrial Internet intrusion detection with the CART-AMV ensemble learning algorithm.
Background
Ensemble learning replaces a single complex algorithm with a large number of simple, diverse individual learners, which can effectively reduce the algorithmic complexity and cost of machine learning; this is its advantage. Its disadvantage is that the training of the individual learners depends strongly on the training data set used: the quality of the data structure of the training data set directly determines the generalization performance of the trained individual learners. When ensemble learning is applied to the classification problem of industrial Internet intrusion detection, large data sets such as KDD99, NSL-KDD and UNSW-NB15 are available. These data sets are large, contain real data and cover a comprehensive range of intrusion attack types. However, they suffer from high data redundancy, inconsistent data types, and some data labels that individual learners cannot recognize.
Disclosure of Invention
The invention provides a D-N-based industrial Internet intrusion detection data set processing algorithm. It addresses both the high demands that ensemble learning algorithms place on the data structure of a data set and the deficiencies of industrial Internet intrusion detection data sets, and it analyzes and organizes the data set in three steps: data cleaning, data discretization and data normalization. The invention can be applied to data sets of various types and sizes.
The specific technical scheme for achieving the object of the invention is as follows:
First, a data cleaning pool is constructed, and data items in which certain data labels take specific values are processed, for example rewritten or rejected, which reduces data redundancy and improves the generalization performance of the trained ensemble learning model. Second, non-numerical data labels are converted into discrete numerical labels in three steps (code conversion, arithmetic mean calculation and mean absolute deviation calculation), which increases the number of usable data labels in the data set. Finally, the discretized non-numerical data labels and the continuous numerical data labels are normalized, which further reduces the order-of-magnitude differences between the central values of different data labels and improves the classification accuracy of the trained ensemble learning model.
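The data cleaning pool described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the names `build_cleaning_pool`, `clean` and `DROP`, the dictionary layout of the pool, and the sample column names and values are hypothetical and are not taken from the patent. The processing mode is modeled here as either a replacement value (rewrite) or a drop marker (reject).

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]

# Hypothetical marker for the "reject" processing mode.
DROP = object()


def build_cleaning_pool(labels: Iterable[str]) -> Dict[str, Dict[object, object]]:
    """Create an empty pool with the same labels, in the same order, as the data set."""
    return {label: {} for label in labels}


def clean(dataset: List[Row], pool: Dict[str, Dict[object, object]]) -> List[Row]:
    """Compare every value against the pool and apply its processing mode.

    A pool entry maps a value to a replacement (rewrite) or to DROP (reject the row).
    The input data set is left unmodified.
    """
    cleaned: List[Row] = []
    for row in dataset:
        new_row = dict(row)
        keep = True
        for label, rules in pool.items():
            value = new_row[label]
            if value in rules:
                action = rules[value]
                if action is DROP:
                    keep = False
                    break
                new_row[label] = action
        if keep:
            cleaned.append(new_row)
    return cleaned


# Illustrative data only; column names and values are assumptions.
D = [
    {"protocol_type": "tcp", "service": "http", "src_bytes": 181},
    {"protocol_type": "udp", "service": "other", "src_bytes": 0},
    {"protocol_type": "tcp", "service": "HTTP", "src_bytes": 239},
]
E = build_cleaning_pool(D[0].keys())
E["service"]["HTTP"] = "http"  # rewrite a variant spelling
E["service"]["other"] = DROP   # reject rows with this value
D_f = clean(D, E)
```

Keeping the rules in a table keyed by data label, rather than hard-coding them in the traversal, means the same `clean` routine can be reused for data sets with different labels; only the pool contents change.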
Drawings
Other objects and results of the invention will become more apparent and more readily appreciated from the following description and claims, taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 is a flow chart of the D-N-based industrial Internet intrusion detection data set processing algorithm;
FIG. 2 is a statistical chart of the self-importance coefficients of the data labels obtained by training with the CART-AMV ensemble learning algorithm after the kddcup.data_10_percentage.gz data set of the KDD99 series has been processed with the D-N-based industrial Internet intrusion detection data set processing algorithm;
FIG. 3 shows the distribution, after processing with the D-N-based industrial Internet intrusion detection data set processing algorithm, of the data labels in FIG. 2 whose self-importance coefficient is greater than 0.6.
Detailed Description
(1) Input the data set $D$ to be processed and traverse all of its data labels $l_1, l_2, \ldots, l_n$.
(2) Based on the traversal of the data labels of $D$, create an empty table $E$, the data cleaning pool, whose data labels have exactly the same order and names as those of $D$.
(3) Under each data label $l_1, l_2, \ldots, l_n$ of the data cleaning pool $E$, enter the values to be processed, $v_{11}, v_{12}, \ldots, v_{1m}$; $v_{21}, v_{22}, \ldots, v_{2m}$; …; $v_{n1}, v_{n2}, \ldots, v_{nm}$, together with the processing mode $M$ of each entry; the updated data cleaning pool is denoted $E_f$.
(4) Traverse the data set $D$ row by row and then column by column, compare each value against the data cleaning pool $E_f$, and process every matching value according to its processing mode $M$, obtaining the traversed data set $D_f$.
(5) Traverse the data set $D_f$ column by column and then row by row. If the data type of a data label is numerical, skip this step; if it is non-numerical, count the number $m$ of distinct values of the data label and encode these $m$ values simply as 1, 2, …, $m$, obtaining the numeric data label values $x_1, x_2, \ldots, x_n$.
(6) From the numeric data label values $x_1, x_2, \ldots, x_n$ obtained in step (5), calculate the arithmetic mean AVG of the numeric values of each data label processed in this way, where $\mathrm{AVG} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
(7) From the numeric data label values $x_1, x_2, \ldots, x_n$ obtained in step (5) and the arithmetic mean AVG obtained in step (6), calculate the mean absolute deviation STAD of the numeric values of each data label processed in this way, where $\mathrm{STAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \mathrm{AVG} \rvert$.
(8) From $x_n$, AVG and STAD obtained in steps (5), (6) and (7), calculate the final discretized value $x'_n$ of the data label, where $x'_n = \frac{x_n - \mathrm{AVG}}{\mathrm{STAD}}$. If $\mathrm{AVG} = 0$ or $\mathrm{STAD} = 0$, the discretized value is $x'_n = 0$. This yields the traversed data set $D_d$.
(9) Traverse the values of each data label to obtain the maximum $x_{\max}$ and the minimum $x_{\min}$ of the discretized data label values.
(10) From $x_{\max}$ and $x_{\min}$ obtained in step (9), calculate the normalized value $x''_n$ of the data label, where $x''_n = \frac{x'_n - x_{\min}}{x_{\max} - x_{\min}}$.
(11) After steps (5)-(10) have been completed for every column of the data set $D_f$, store all processed data in a new data set $D_n$ in the data format of the data set $D$.
(12) Apply the industrial Internet intrusion detection data set processing algorithm described in steps (1)-(11) to an industrial Internet intrusion detection data set, and further verify the usability of the method through experiments with an ensemble-learning-based industrial Internet intrusion detection algorithm.
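Steps (5)-(10) can be sketched per column in Python. This is a minimal sketch, not the patent's implementation. It assumes the standardization $x' = (x - \mathrm{AVG})/\mathrm{STAD}$ for step (8) and min-max scaling for step (10); both forms are inferred from the definitions in the steps. Further assumptions of the sketch: codes are assigned in order of first appearance, numerical columns skip steps (5)-(8) and go straight to normalization, a constant column normalizes to 0, and columns are non-empty. All function names are hypothetical.

```python
from typing import List, Sequence


def encode(values: Sequence[object]) -> List[float]:
    """Step (5): map the m distinct values to 1, 2, ..., m (first-seen order, an assumption)."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    return [float(codes[v]) for v in values]


def discretize(xs: Sequence[float]) -> List[float]:
    """Steps (6)-(8): standardize by arithmetic mean and mean absolute deviation."""
    n = len(xs)
    avg = sum(xs) / n
    stad = sum(abs(x - avg) for x in xs) / n
    if avg == 0 or stad == 0:
        return [0.0] * n  # special case stated in step (8)
    return [(x - avg) / stad for x in xs]


def normalize(xs: Sequence[float]) -> List[float]:
    """Steps (9)-(10): min-max scaling to [0, 1] (assumed form)."""
    x_min, x_max = min(xs), max(xs)
    if x_max == x_min:
        return [0.0] * len(xs)  # assumption: a constant column maps to 0
    return [(x - x_min) / (x_max - x_min) for x in xs]


def process_column(values: Sequence[object]) -> List[float]:
    """Apply steps (5)-(10) to one column of the cleaned data set."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric:
        xs = [float(v) for v in values]   # numerical label: skip steps (5)-(8)
    else:
        xs = discretize(encode(values))   # non-numerical label: steps (5)-(8)
    return normalize(xs)


# Illustrative values only.
protocol = process_column(["tcp", "udp", "tcp", "icmp"])
src_bytes = process_column([0, 5, 10])
```

After this processing every column lies in the same range, so a label with large raw values (for example a byte count) no longer dominates a label that takes only a few coded values.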

Claims (2)

1. A D-N-based method for processing an industrial Internet intrusion detection data set, characterized in that the D-N-based industrial Internet intrusion detection data set processing algorithm can effectively analyze the data types of the industrial Internet intrusion detection data set and perform data cleaning, discretization and normalization on the data in the data set, comprising the following steps:
(1) Inputting a data set $D$ to be processed and traversing all data labels $l_1, l_2, \ldots, l_n$ of the data set $D$;
(2) Based on the traversal of the data labels of the data set $D$, creating an empty table $E$, the data cleaning pool, whose data labels have exactly the same order and names as those of the data set $D$;
(3) Under each data label $l_1, l_2, \ldots, l_n$ of the data cleaning pool $E$, entering the values to be processed, $v_{11}, v_{12}, \ldots, v_{1m}$; $v_{21}, v_{22}, \ldots, v_{2m}$; …; $v_{n1}, v_{n2}, \ldots, v_{nm}$, together with the processing mode $M$ of each entry, and updating the data cleaning pool to $E_f$;
(4) Traversing the data set $D$ row by row and then column by column, comparing each value against the data cleaning pool $E_f$, and processing every matching value according to its processing mode $M$ to obtain the traversed data set $D_f$;
(5) Traversing the data set $D_f$ column by column and then row by row; if the data type of a data label is numerical, skipping this step; if the data type of the data label is non-numerical, counting the number $m$ of distinct values of the data label and encoding these $m$ values simply as 1, 2, …, $m$ to obtain the numeric data label values $x_1, x_2, \ldots, x_n$;
(6) From the numeric data label values $x_1, x_2, \ldots, x_n$ obtained in step (5), calculating the arithmetic mean AVG of the numeric values of each data label processed in this way, where $\mathrm{AVG} = \frac{1}{n}\sum_{i=1}^{n} x_i$;
(7) From the numeric data label values $x_1, x_2, \ldots, x_n$ obtained in step (5) and the arithmetic mean AVG obtained in step (6), calculating the mean absolute deviation STAD of the numeric values of each data label processed in this way, where $\mathrm{STAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \mathrm{AVG} \rvert$;
(8) From $x_n$, AVG and STAD obtained in steps (5), (6) and (7), calculating the final discretized value $x'_n$ of the data label, where $x'_n = \frac{x_n - \mathrm{AVG}}{\mathrm{STAD}}$; if $\mathrm{AVG} = 0$ or $\mathrm{STAD} = 0$, the discretized value is $x'_n = 0$; obtaining the traversed data set $D_d$;
(9) Traversing the values of each data label to obtain the maximum $x_{\max}$ and the minimum $x_{\min}$ of the discretized data label values;
(10) From $x_{\max}$ and $x_{\min}$ obtained in step (9), calculating the normalized value $x''_n$ of the data label, where $x''_n = \frac{x'_n - x_{\min}}{x_{\max} - x_{\min}}$;
(11) After steps (5)-(10) have been completed for every column of the data set $D_f$, storing all processed data in a new data set $D_n$ in the data format of the data set $D$.
2. The D-N-based industrial Internet intrusion detection data set processing method of claim 1, wherein: first, a data cleaning pool is constructed, and data items in which certain data labels take specific values are processed, rewritten or rejected, which reduces data redundancy and improves the generalization performance of the trained ensemble learning model; second, non-numerical data labels are converted into discrete numerical labels in three steps (code conversion, arithmetic mean calculation and mean absolute deviation calculation), which increases the number of usable data labels in the data set; finally, the discretized non-numerical data labels and the continuous numerical data labels are normalized, which further reduces the order-of-magnitude differences between the central values of different data labels and improves the classification accuracy of the trained ensemble learning model.
CN202111202373.3A 2021-10-15 2021-10-15 Industrial Internet intrusion detection data set processing method based on D-N Active CN113934719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111202373.3A CN113934719B (en) 2021-10-15 2021-10-15 Industrial Internet intrusion detection data set processing method based on D-N

Publications (2)

Publication Number Publication Date
CN113934719A (en) 2022-01-14
CN113934719B (en) 2024-04-19

Family

ID=79279742

Country Status (1)

Country Link
CN (1) CN113934719B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107846392A (en) * 2017-08-25 2018-03-27 西北大学 A kind of intrusion detection algorithm based on improvement coorinated training ADBN
CN109347872A (en) * 2018-11-29 2019-02-15 电子科技大学 A kind of network inbreak detection method based on fuzziness and integrated study
CN111181939A (en) * 2019-12-20 2020-05-19 广东工业大学 Network intrusion detection method and device based on ensemble learning
CN112070131A (en) * 2020-08-25 2020-12-11 天津大学 Intrusion detection method based on partial deep learning theory
KR20210088146A (en) * 2020-01-06 2021-07-14 계명대학교 산학협력단 Network intrusion detection system and method based on ae-cgan model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
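Effective similarity definition and clustering based on time-series evolution analysis; 周原冰; 左新强; 顾杰; 赵春晖; Computer Engineering and Applications (计算机工程与应用); 2008-04-01 (10); full text *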

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant