CN115695027A

CN115695027A - Original network flow threat detection method and device

Info

Publication number: CN115695027A
Application number: CN202211379520.9A
Authority: CN
Inventors: 任传伦; 俞赛赛; 何明枢; 王小娟; 刘晓影; 张先国; 贾佳; 乌吉斯古愣; 刘文瀚; 孟祥頔
Original assignee: Beijing University of Posts and Telecommunications; CETC 15 Research Institute
Current assignee: Beijing University of Posts and Telecommunications; CETC 15 Research Institute
Priority date: 2022-11-04
Filing date: 2022-11-04
Publication date: 2023-02-03

Abstract

The invention discloses a method and a device for detecting an original network flow threat, wherein the method comprises the following steps: acquiring original network flow data, and processing the original network flow data by using an original network flow data representation model to obtain characteristic information of the original network flow data; dividing the characteristic information of the original network flow data to obtain training characteristic information of the original network flow data and test characteristic information of the original network flow data; processing training characteristic information of the original network flow data by using an automatic machine learning model to obtain an optimized original network flow threat detection model; and processing the test characteristic information of the original network flow data by using the optimized original network flow threat detection model to obtain an original network flow threat detection result. Therefore, the method solves the difficult problems of poor method applicability, low automation degree and the like caused by complex and changeable scenes, and effectively improves the automatic construction level of the threat detection analysis model in the network space battle scene.

Description

Original network flow threat detection method and device

Technical Field

The invention relates to the field of network security, in particular to a method and a device for detecting an original network flow threat.

Background

Among the existing methods for data conversion and data processing in network space, the semantic-based coding mode collects all semantic fields of a network flow in one representation that is complete and of constant size, but the representation is ambiguous; the coding mode based on the original binary form preserves the feature order and reduces the dependence on manually designed features, but it ignores many complex details of the network flow and may therefore introduce a large amount of interference.

In the existing network space scene construction method, a method based on a machine learning technology and a deep neural network technology can realize construction of a single model and model optimization, but the method has the problems of low adaptability, poor model effect and low automation degree for construction of network spaces with multiple scenes and different applications.

In the actual problem of constructing the network space battle scene, because different network spaces have great difference and have different requirements for a specific network space during construction, the conventional method introduces a plurality of problems, including:

(1) The semantic-based coding mode does not preserve the order of the message option fields in each network flow, and the field coding mode must be determined manually, so the feature ordering is not uniform across the network flows processed by the method, and the coding representation is not suitable for tasks such as device identification.

(2) The original binary-based coding mode ignores many complex details in the network flow, including the variable length of the network flow and the difference between different protocols, which will result in the inconsistent coding length of each network flow and the coding representation result having no interpretability.

(3) The method based on the machine learning technology and the deep neural network technology can only construct and optimize a manually specified single model, so that the training model is only suitable for a single specific network space scene and cannot realize cross-network space migration, the automation level of the model during construction is low, and the scene adaptability in practical application is poor.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method and a device for detecting an original network stream threat, which perform data transformation and data processing on a network stream in a network space based on an original network stream data characterization model, so that a coding representation has integrity, consistency and expandability, and the representation is standardized and has a constant size. In addition, the automatic machine learning model can be used for automatically selecting the optimal model and the hyper-parameters of the specific network space, the automation degree of model construction is improved, and the automatic machine learning model can realize multi-model fusion construction of the specific network space, so that the final model has high environmental adaptability.

In order to solve the above technical problem, a first aspect of an embodiment of the present invention discloses a method for detecting an original network flow threat, where the method includes:

S1, acquiring original network flow data, wherein the original network flow data comprises a plurality of network flow data;

S2, processing the original network flow data by using a preset original network flow data representation model to obtain the characteristic information of the original network flow data;

S3, dividing the characteristic information of the original network flow data to obtain training characteristic information of the original network flow data and test characteristic information of the original network flow data;

S4, processing training characteristic information of the original network flow data by using a preset automatic machine learning model to obtain an optimized original network flow threat detection model;

S5, processing the test characteristic information of the original network flow data by using the optimized original network flow threat detection model to obtain an original network flow threat detection result.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the original network flow data includes, but is not limited to, normal flow data, DDoS attack network flow data, and web attack network flow data;

the network flow data is a data packet in a binary form.

As an optional implementation manner, in a first aspect of an embodiment of the present invention, the processing, by using a preset original network flow data characterization model, the original network flow data to obtain feature information of the original network flow data includes:

the original network flow data characterization model is used for carrying out data conversion and data representation on the original network flow data to obtain characteristic information of the original network flow data;

the characteristic information of the original network flow data is a two-dimensional matrix; each row of the two-dimensional matrix represents characteristic information of a piece of network flow data.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the processing, by using a preset automatic machine learning model, training feature information of the original network flow data to obtain an optimized original network flow threat detection model includes:

and the automatic machine learning model is used for carrying out data preprocessing, feature training, model selection and hyper-parameter optimization on training feature information of the original network flow data to obtain an optimized original network flow threat detection model.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the performing data conversion and data representation on the original network flow data to obtain characteristic information of the original network flow data includes:

performing data conversion on the original network flow data, including data alignment and data internal padding on the original network flow data;

and performing data representation on the original network flow data, wherein the data representation comprises the step of changing the original network flow data into a two-dimensional matrix to obtain the characteristic information of the original network flow data.

As an optional implementation manner, in a first aspect of the embodiment of the present invention, the data alignment and data internal padding for the original network flow data includes:

carrying out data alignment on the binary data packet to obtain an original network flow data packet with a consistent length;

filling the interior of the data of the original network streaming data packets with the consistent length to obtain the original network streaming data packets with the consistent format;

the original network flow data packet with the consistent format comprises a plurality of packets, and each packet has the same number of characteristics and the same size.

As an optional implementation manner, in a first aspect of an embodiment of the present invention, the performing, by using a preset automatic machine learning model, data preprocessing, feature training, model selection, and hyper-parameter optimization on training feature information of the original network flow data to obtain an optimized original network flow threat detection model includes:

the preset automatic machine learning model processes the training characteristic information of the original network flow data, and the method comprises the following steps:

S401, data preprocessing; the data preprocessing comprises identifying the type of the training characteristic information of the original network flow data, deleting irrelevant characteristics of the training characteristic information and filtering low-frequency characteristics of the training characteristic information to obtain preprocessed training characteristic information;

S402, feature training; the feature training comprises training the preprocessed training characteristic information;

the training comprises the following steps: filling missing values in the preprocessed training characteristic information to obtain filled training characteristic information;

coding the filled training characteristic information to obtain coded training characteristic information;

carrying out data standardization processing on the coded training characteristic information to obtain standardized training characteristic information with a mean value of 0 and a standard deviation of 1;

S403, model selection; the model selection comprises training the models in a model library by utilizing the standardized training characteristic information to obtain an optimized original network flow threat detection model;

the model library is preset and comprises N basic models.
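As an illustration only, the filling and standardization operations of step S402 can be sketched for a single continuous feature column as follows. The function names are hypothetical, the median fill follows the detailed description, and the encoding step is left out because the text does not specify the encoding used.

```python
from statistics import fmean, median, pstdev
from typing import List, Optional


def fill_missing(column: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    mid = median(observed)
    return [mid if v is None else v for v in column]


def standardize(column: List[float]) -> List[float]:
    """Rescale a column to mean 0 and population standard deviation 1."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # A constant column has no spread to rescale; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


raw = [2.0, None, 4.0, 6.0]      # one feature column with a missing value
filled = fill_missing(raw)       # the median of the observed values is 4.0
scaled = standardize(filled)
```

The population standard deviation is used here so that the rescaled column has a standard deviation of exactly 1; a sample standard deviation would be an equally reasonable assumption.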

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the method further includes dividing, by using a random division method, the characteristic information of the original network stream data;

70% of the characteristic information of the original network flow data is training characteristic information of the original network flow data, and 30% of the characteristic information of the original network flow data is testing characteristic information of the original network flow data.
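A minimal sketch of the random 70%/30% division is given below. The helper name `random_split` and the fixed seed are assumptions made for reproducibility of the example; the source only states that the division is random.

```python
import random
from typing import List, Sequence, Tuple


def random_split(rows: Sequence, train_fraction: float = 0.7,
                 seed: int = 0) -> Tuple[List, List]:
    """Shuffle row indices and cut them into a training part and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(len(rows) * train_fraction))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Ten hypothetical feature rows standing in for the two-dimensional matrix.
features = [[i, i + 1] for i in range(10)]
train_rows, test_rows = random_split(features)
```

Every row ends up in exactly one of the two parts, so the test characteristic information is never seen during training.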

The second aspect of the present invention discloses an original network flow threat detection apparatus, the apparatus comprising:

the acquisition module is used for acquiring original network flow data, wherein the original network flow data comprises a plurality of network flow data;

the first processing module is used for processing the original network flow data by utilizing a preset original network flow data representation model to obtain the characteristic information of the original network flow data;

the second processing module is used for dividing the characteristic information of the original network flow data to obtain training characteristic information of the original network flow data and test characteristic information of the original network flow data;

the third processing module is used for processing the training characteristic information of the original network flow data by using a preset automatic machine learning model to obtain an optimized original network flow threat detection model;

and the fourth processing module is used for processing the test characteristic information of the original network flow data by using the optimized original network flow threat detection model to obtain an original network flow threat detection result.

As an optional implementation manner, in the second aspect of the embodiment of the present invention, the original network flow data includes, but is not limited to, normal flow data, DDoS attack network flow data, and web attack network flow data;

the network flow data is a data packet in a binary form.

As an optional implementation manner, in a second aspect of the embodiment of the present invention, the processing, by using a preset original network flow data characterization model, the original network flow data to obtain feature information of the original network flow data includes:

As an optional implementation manner, in a second aspect of the embodiment of the present invention, the processing, by using a preset automatic machine learning model, training feature information of the original network flow data to obtain an optimized original network flow threat detection model includes:

As an optional implementation manner, in a second aspect of the embodiment of the present invention, the performing data conversion and data representation on the original network flow data to obtain characteristic information of the original network flow data includes:

As an optional implementation manner, in a second aspect of the embodiment of the present invention, the data alignment and data internal padding on the original network flow data includes:

filling the data inside the original network flow data packets with the consistent lengths to obtain the original network flow data packets with the consistent formats;

As an optional implementation manner, in a second aspect of the embodiment of the present invention, the performing, by using a preset automatic machine learning model, data preprocessing, feature training, model selection, and hyper-parameter optimization on training feature information of the original network flow data to obtain an optimized original network flow threat detection model includes:

the preset automatic machine learning model processes the training characteristic information of the original network flow data, and the steps comprise:

carrying out data standardization processing on the coding training characteristic information to obtain standardized training characteristic information with an average value of 0 and a standard deviation of 1;

As an optional implementation manner, in a second aspect of the embodiment of the present invention, the method further includes dividing, by using a random division method, the characteristic information of the original network flow data;

The third aspect of the present invention discloses another original network flow threat detection apparatus, the apparatus comprising:

a memory storing executable program code;

a processor coupled with the memory;

the processor calls the executable program code stored in the memory to execute part or all of the steps of the original network flow threat detection method disclosed in the first aspect of the embodiment of the invention.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

(1) The method of the invention does not need to carry out semantic extraction on the original network flow, thereby being capable of retaining all information of the original network flow, simultaneously ensuring that the characteristic sequence in each data representation is the same, and being capable of ensuring the completeness and consistency of the representation.

(2) The invention focuses on the length difference between different network flows and the coding difference between different protocols, so that the data representation of different network flows has comparability and the representation has consistency and interpretability.

(3) The method processes network flow data quickly and can convert 1.5 million data packets per minute on average in a single thread; it can also collect and encode dynamic network flows online as well as encode network flows offline. Therefore, the method is suitable for constructing the network flow threat detection analysis model under various network space battle scenes.

(4) The method has the advantages that the characteristics for model training are screened without depending on manual work, and the multi-model fusion model can be trained aiming at the problem of single network space construction, so that the automatic and rapid parameter optimization is realized, and the optimal fusion model is output.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic flowchart of an original network flow threat detection method according to an embodiment of the present invention;

FIG. 2 is a flow chart of an original network flow data characterization model of an original network flow threat detection method according to an embodiment of the present invention;

FIG. 3 is a schematic representation of original network flow data of a method for threat detection of original network flow according to an embodiment of the present invention;

FIG. 4 is a neural network architecture of an automatic machine learning model of a method for threat detection of primitive network flows as disclosed in an embodiment of the present invention;

FIG. 5 is a multi-layer stacking strategy of an automatic machine learning model of a method for threat detection of primitive network flows according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an original network flow threat detection apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of another original network flow threat detection apparatus according to the embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or apparatus that comprises a list of steps or elements is not limited to those listed but may alternatively include other steps or elements not listed or inherent to such process, method, product, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The invention discloses a method and a device for detecting threat of original network flow, which can directly encode the semantic structure of the original network flow and eliminate the influence of inconsistent bits among the network flows by using an internal filling method; the process of multi-model selection, training and fusion is automatically carried out without manual intervention; the network flow processing and the network space model construction are effectively combined to form a complete construction process from the original network flow to the network space model. The following are detailed descriptions.

Example one

Referring to fig. 1, fig. 1 is a schematic flowchart of an original network flow threat detection method according to an embodiment of the present invention. The flow of the original network flow threat detection method described in fig. 1 may be used in the field of network security, such as intrusion detection and defense, and information theft prevention. As shown in fig. 1, the original network flow threat detection method may include the following operations:

S2, processing the original network flow data by using a preset original network flow data representation model to obtain characteristic information of the original network flow data;

optionally, the dividing may be performed according to a ratio of 7:3.

Fig. 2 is a flowchart of an original network flow data characterization model of an original network flow threat detection method according to an embodiment of the present invention. As can be seen from fig. 2, the flow of the raw network flow data characterization model includes raw network flow input, data conversion and data representation.

Optionally, to obtain the original network flow data, the user can collect an original data set of network flows and input the original network flows into the original network flow data characterization model for processing. The original network flow data is labeled, for example as normal flow, DDoS attack, web attack, and so on.

Optionally, the original network flow data characterization model represents each packet in its original binary form and aligns the binary data according to the specific semantic structure of the packet itself. The model uses internal padding to reduce inconsistency in the data representation, preserving the order of the option fields while limiting the effect of inconsistent bit representations.

The original network flow data characterization model ensures the integrity of the data representation: because it uses internal padding, each packet has positions for all header types, so the representation of each packet has the same number of features and each feature has the same meaning. The representation is interpretable at the bit level and can be mapped back to semantic fields to better understand the features. The representation is also standardized: each bit takes only the value -1, 0 or 1. Header information that does not exist is represented by -1, which distinguishes a bit that is actually set to 0 from a bit that has no value. The represented data has a constant size, with each packet having the same number of features. The number of payload bytes included is selectable. Each piece of network flow data forms a feature vector, and the feature vectors of all the network flow data are arranged to form a two-dimensional matrix. The conversion process from the original network flow data to the two-dimensional matrix data is realized by a data processing method.

FIG. 3 illustrates the original network flow data characterization formats for TCP and UDP network flows, including header placeholders for the IPv4 protocol, TCP protocol, UDP protocol, and ICMP protocol.
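The fixed-size, bit-level representation described above can be sketched as follows. This is an illustrative sketch, not the patented implementation: the slot widths are assumed to be the minimum header sizes (option fields are left out), and the names `HEADER_SLOTS`, `to_bits` and `encode_packet` are hypothetical.

```python
from typing import Dict, List

ABSENT = -1  # marks a bit position whose header is not present in the packet

# Assumed slot widths in bits (minimum header sizes, options omitted).
HEADER_SLOTS: Dict[str, int] = {"ipv4": 160, "tcp": 160, "udp": 64, "icmp": 64}


def to_bits(data: bytes) -> List[int]:
    """Expand bytes into 0/1 values, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def encode_packet(headers: Dict[str, bytes]) -> List[int]:
    """Give every header type a fixed slot; fill missing positions with -1."""
    vector: List[int] = []
    for name, width in HEADER_SLOTS.items():
        bits = to_bits(headers.get(name, b""))[:width]
        vector.extend(bits + [ABSENT] * (width - len(bits)))
    return vector


# Two hypothetical packets with placeholder header bytes.
udp_packet = {
    "ipv4": bytes(20),
    "udp": bytes([0x00, 0x35, 0xC0, 0x00, 0x00, 0x08, 0x00, 0x00]),
}
tcp_packet = {"ipv4": bytes(20), "tcp": bytes(20)}

# Each packet becomes one row of the two-dimensional matrix.
matrix = [encode_packet(udp_packet), encode_packet(tcp_packet)]
```

Because every row reserves the same slots in the same order, a given column always refers to the same header bit, which is what makes rows from TCP and UDP flows comparable.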

And processing the training characteristic information of the original network flow data by using a preset automatic machine learning model to obtain an optimized original network flow threat detection model.

Optionally, the automatic machine learning model may perform data preprocessing and multi-model fusion training to find a fusion model with optimal parameters. In the data processing stage, the automatic machine learning model automatically identifies the feature types, deletes irrelevant features and filters low-frequency features. The automatic machine learning model trains a separate representation vector for each categorical feature in the data, and determines the vector length of each feature according to the number of distinct values that the feature takes across all input data.

The automatic machine learning model handles missing values of continuous features by filling them with the median, encodes specific features, and then standardizes the data so that it has a mean of 0 and a standard deviation of 1. Finally, the categorical feature vectors are concatenated with the continuous data and fed into the three layers of dense blocks on the left side of FIG. 4 for training. In the model training stage, the automatic machine learning model can test more than 50 models, which fall into 6 basic model categories derived from tree methods, deep neural networks and nearest-neighbor algorithms; the invention is not limited in this respect. The automatic machine learning model trains a plurality of basic models, such as model 1, model 2 through model n in FIG. 5, using all features and a plurality of predefined algorithms of different types. Multi-model interaction is then performed within the same layer and the hyper-parameters of the models are updated, with the hyper-parameters of models of the same type (for example, model 1 and model 2 when both are tree models) kept consistent. The goal of the hyper-parameter update is to gradually increase the accuracy of each round of training. Finally, the output layers are stacked in a weighted manner: higher-performing models are given higher weights, and a weighted sum over the models produces the multi-model fusion model. The automatic machine learning model allows a model training speed parameter and a model size parameter to be preset, and can perform repeated k-fold cross-validation training to prevent overfitting.
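The weighted fusion step can be illustrated with a small sketch. The text does not say how the weights are derived, so weighting each model by its normalized validation accuracy is an assumption made here, and all names and numbers are hypothetical.

```python
from typing import List, Sequence


def accuracy_weights(accuracies: Sequence[float]) -> List[float]:
    """Normalize validation accuracies so that the weights sum to 1."""
    total = sum(accuracies)
    return [a / total for a in accuracies]


def fuse(probabilities: Sequence[Sequence[float]],
         weights: Sequence[float]) -> List[float]:
    """Weighted sum of the per-model class probabilities for one sample."""
    n_classes = len(probabilities[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, probabilities))
        for c in range(n_classes)
    ]


# Three hypothetical base models scoring one flow over the classes
# (normal, DDoS attack, web attack).
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
]
weights = accuracy_weights([0.5, 0.75, 0.75])  # stronger models weigh more
fused = fuse(model_probs, weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this example the two more accurate models outvote the first one, so the fused prediction is the second class even though the first model prefers the first class.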

The k-fold cross validation randomly divides the data set into k mutually exclusive subsets of the same size, each time randomly selecting k-1 subsets as the training set and using the remaining subset as the test set. When a round is completed, k-1 subsets are again randomly selected for training. After several rounds (fewer than k), a loss function is used to evaluate and select the optimal model and parameters.

The neural network architecture that the automatic machine learning model uses for tabular data composed of numerical and categorical features is shown in FIG. 4. The multi-layer stacking strategy of the automatic machine learning model is illustrated in FIG. 5, which shows two stacked layers and different types of base learners.
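One possible reading of this procedure is sketched below: the data set is cut into k disjoint folds, and in each round one randomly chosen fold is held out for testing while the other k-1 folds are used for training. Choosing the held-out fold without repetition, the fixed seed, and the function name `k_fold_rounds` are assumptions of this sketch.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_rounds(n_samples: int, k: int, rounds: int,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for `rounds` rounds, with rounds <= k."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # k mutually exclusive subsets of (near-)equal size.
    folds = [indices[i::k] for i in range(k)]
    # Pick which fold is held out in each round, without repetition.
    for held_out in rng.sample(range(k), rounds):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


splits = list(k_fold_rounds(n_samples=10, k=5, rounds=3))
```

Each split can then be used to fit a candidate model on the training indices and to compute the chosen loss on the held-out indices, with the results compared across rounds.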

Example two

Referring to fig. 6, fig. 6 is a schematic structural diagram of an original network flow threat detection apparatus according to an embodiment of the present invention. The process of the original network flow threat detection apparatus described in fig. 6 may be used in the network security field, such as intrusion detection and defense, and information theft prevention. As shown in fig. 6, the apparatus includes:

S301, an obtaining module, configured to obtain original network flow data, where the original network flow data includes a plurality of network flow data;

S302, a first processing module, configured to process the original network flow data by using a preset original network flow data representation model to obtain the characteristic information of the original network flow data;

S303, a second processing module, configured to divide the characteristic information of the original network flow data to obtain training characteristic information of the original network flow data and test characteristic information of the original network flow data;

S304, a third processing module, configured to process training feature information of the original network flow data by using a preset automatic machine learning model, to obtain an optimized original network flow threat detection model;

S305, a fourth processing module, configured to process the test feature information of the original network flow data by using the optimized original network flow threat detection model, to obtain an original network flow threat detection result.

Example three

Referring to fig. 7, fig. 7 is a schematic structural diagram of another original network flow threat detection apparatus according to an embodiment of the present invention. The process of the original network flow threat detection apparatus described in fig. 7 may be used in the network security field, such as intrusion detection and defense, and information theft prevention. As shown in fig. 7, the apparatus may include:

a memory S401 storing executable program codes;

a processor S402 coupled with the memory S401;

the processor S402 calls the executable program code stored in the memory S401 for executing the steps in the original network flow threat detection method described in the first embodiment or the second embodiment.

The above-described embodiments of the apparatus are only illustrative, and the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above technical solutions may essentially or in part contribute to the prior art, be embodied in the form of a software product, which may be stored in a computer-readable storage medium, including a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an electronically Erasable Programmable Read-Only Memory (EEPROM), an optical Disc-Read (CD-ROM) or other storage medium capable of storing data, a magnetic tape, or any other computer-readable medium capable of storing data.

Finally, it should be noted that the original network flow threat detection method and apparatus disclosed in the embodiments of the present invention are only preferred embodiments of the present invention, and are used only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for original network flow threat detection, the method comprising:

2. The original network flow threat detection method of claim 1, wherein said original network flow data includes, but is not limited to, normal flow data, DDoS attack network flow data, and Web attack network flow data;

the original network flow data is a data packet in binary form.

3. The original network flow threat detection method according to claim 1, wherein processing the original network flow data by using a preset original network flow data representation model to obtain the characteristic information of the original network flow data comprises:

performing data conversion and data representation on the original network flow data by using the original network flow data representation model to obtain the characteristic information of the original network flow data;

4. The method for detecting the original network flow threat according to claim 1, wherein the step of processing training feature information of the original network flow data by using a preset automatic machine learning model to obtain an optimized original network flow threat detection model comprises:

5. The method for threat detection of an original network flow according to claim 3, wherein the performing data conversion and data representation on the original network flow data to obtain the characteristic information of the original network flow data includes:

6. The original network flow threat detection method of claim 5, wherein performing data alignment and internal data padding on the original network flow data comprises:

performing data alignment on the binary data packets to obtain original network flow data packets of consistent length;

performing internal data padding on the original network flow data packets of consistent length to obtain original network flow data packets of consistent format; the original network flow data packets of consistent format comprise a plurality of packets, and each packet has the same number of features and the same size.
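
For illustration only, the alignment and padding steps of claim 6 can be sketched as follows. The sketch assumes that alignment means truncating every binary packet to a common length and that padding means filling shorter packets with zero bytes; the claim does not fix these choices, and the names `align_packets`, `to_feature_rows`, `length` and `pad_byte` are hypothetical.

```python
from typing import List


def align_packets(packets: List[bytes], length: int, pad_byte: int = 0) -> List[bytes]:
    """Return packets that all have exactly `length` bytes.

    Assumption: longer packets are truncated and shorter packets are
    right-padded with `pad_byte`; the patent text does not specify this.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    aligned = []
    for pkt in packets:
        trimmed = pkt[:length]                       # alignment to a common length
        fill = bytes([pad_byte]) * (length - len(trimmed))
        aligned.append(trimmed + fill)               # padding of short packets
    return aligned


def to_feature_rows(aligned: List[bytes]) -> List[List[int]]:
    """Treat each byte as one feature, so every packet has the same feature count."""
    return [list(pkt) for pkt in aligned]


if __name__ == "__main__":
    raw = [b"\x01\x02\x03", b"\x04", b""]
    print(to_feature_rows(align_packets(raw, length=2)))
```

Under these assumptions every packet yields a row of the same size, which matches the requirement that each packet has the same number of features.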

7. The method for detecting the threat of the original network flow according to claim 4, wherein the step of performing data preprocessing, feature training, model selection and hyper-parameter optimization on training feature information of the original network flow data by using a preset automatic machine learning model to obtain an optimized original network flow threat detection model comprises the following steps:

8. The original network flow threat detection method according to claim 1, further comprising dividing characteristic information of the original network flow data by a random division method;
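
For illustration only, a random division of the characteristic information into training and test parts might look like the sketch below. The split ratio, the fixed seed and the name `random_split` are assumptions made for the example and are not taken from the claims.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(samples: Sequence[T], test_ratio: float = 0.2,
                 seed: int = 0) -> Tuple[List[T], List[T]]:
    """Randomly divide samples into (training, test) lists.

    Assumption: a simple shuffled split without stratification.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)             # reproducible shuffle
    n_test = int(len(samples) * test_ratio)
    test = [samples[i] for i in indices[:n_test]]
    train = [samples[i] for i in indices[n_test:]]
    return train, test


if __name__ == "__main__":
    train, test = random_split(list(range(10)), test_ratio=0.5)
    print(len(train), len(test))
```

Every sample lands in exactly one of the two parts, so the test characteristic information is never seen while the detection model is being optimized.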

9. An original network flow threat detection apparatus, the apparatus comprising:

10. An original network flow threat detection apparatus, the apparatus comprising:

a memory storing executable program code;

a processor coupled with the memory;

the processor calls the executable program code stored in the memory to perform the original network flow threat detection method of any one of claims 1 to 8.