CN117527446B

CN117527446B - Network abnormal flow refined detection method

Info

Publication number: CN117527446B
Application number: CN202410003786.6A
Authority: CN
Inventors: 杨贻宏
Original assignee: Shanghai Artificial Intelligence Network System Engineering Technology Research Center Co ltd
Current assignee: Shanghai Artificial Intelligence Network System Engineering Technology Research Center Co ltd
Priority date: 2024-01-03
Filing date: 2024-01-03
Publication date: 2024-03-12
Anticipated expiration: 2044-01-03
Also published as: CN117527446A

Abstract

The application relates to a refined detection method for abnormal network traffic, comprising the following steps: acquiring a labeled dimension-reduction feature set and an unlabeled dimension-reduction feature set; inputting the labeled dimension-reduction feature set into a clustering model to obtain a plurality of cluster feature clusters; adding a feature mark to each cluster feature cluster and dividing an original detection model into a plurality of sub-detectors; inputting the cluster feature clusters with different feature marks into the corresponding sub-detectors to obtain a plurality of sub-detection models; evaluating the feature conformity between the unlabeled dimension-reduction feature set and the cluster feature clusters one by one to obtain the set of cluster feature clusters that best matches the unlabeled dimension-reduction feature set; and inputting the unlabeled dimension-reduction feature set into the best-matching sub-detection model to obtain an abnormal traffic detection result. The method addresses the difficulty that existing abnormal traffic detection techniques have in accurately capturing deviations in traffic features, and thereby improves abnormal traffic detection accuracy.

Description

Network abnormal flow refined detection method

Technical Field

The application relates to the technical field of network security, in particular to a network abnormal flow refined detection method.

Background

With the rapid development of 6G technology, network traffic is becoming increasingly complex and diverse, hidden malicious abnormal traffic behaviors emerge one after another, and network security problems are increasingly prominent. Traditional abnormal traffic detection methods require a switch to collect a large amount of traffic data passing through the backbone network and feed it into a single detection model for training. However, the feature distributions of traffic passing through different gateways often differ because of factors such as data source, transmission purpose, destination, and time, and a traditional single detection model cannot distinguish these differences in feature distribution. As a result, its detection of abnormal traffic is unsatisfactory, its scalability is poor, and it cannot perform refined abnormal traffic detection on the complex traffic passing through each gateway. In addition, to evade the abnormal traffic detection system, some attackers hide malicious traffic within normal traffic by means of encryption. Such encrypted abnormal traffic usually carries no label, so a traditional single detection approach has difficulty determining the attack category. The features extracted from the traffic data therefore match the trained single detection model poorly, and no effective detection feedback can be given to the relevant inspection personnel, which both greatly reduces the detection accuracy of the single detection model and lowers detection efficiency.

Disclosure of Invention

The application provides a refined detection method for abnormal network traffic, which solves the problem that existing abnormal traffic detection techniques have difficulty capturing feature deviations among unlabeled traffic, leading to low accuracy of abnormal traffic detection results. The method comprises the following steps:

S1, acquiring a labeled dimension-reduction feature set and an unlabeled dimension-reduction feature set from an original traffic data sample library;

S2, inputting the labeled dimension-reduction feature set into a clustering model for training to obtain a plurality of cluster feature clusters;

S3, adding a feature mark to each cluster feature cluster, and dividing an original abnormal traffic detection model into a plurality of sub-detectors using the feature marks and the cluster sizes as the basis for division;

S4, inputting the cluster feature clusters with different feature marks into the corresponding sub-detectors for training to obtain a plurality of sub-detection models;

S5, evaluating the feature conformity between the unlabeled dimension-reduction feature set and the plurality of cluster feature clusters one by one to obtain the set of cluster feature clusters that best matches the unlabeled dimension-reduction feature set, which specifically comprises the following steps:

S51, matching an unlabeled traffic data feature in the unlabeled dimension-reduction feature set against the cluster feature cluster whose feature-mark character number is 1, to obtain the feature conformity score of that cluster feature cluster;

S52, judging whether the feature conformity score of that cluster feature cluster is lower than a preset threshold; if it is lower, integrating that cluster feature cluster with the cluster feature cluster whose feature-mark character number is 2 to obtain a conformity evaluation feature cluster; if it is higher, taking that cluster feature cluster as the cluster feature cluster that best matches the unlabeled traffic data feature;

S53, matching the unlabeled traffic data feature against the conformity evaluation feature cluster to obtain the feature conformity score of the conformity evaluation feature cluster;

S54, taking a weighted average of the feature conformity scores obtained above to obtain a weighted feature conformity score for a cluster feature cluster;

S55, comparing the feature conformity scores and selecting the cluster feature cluster with the larger feature conformity score as the cluster feature cluster that best matches the unlabeled traffic data feature;

S6, inputting the unlabeled dimension-reduction feature set into the sub-detection model to which its best-matching cluster feature cluster belongs to detect abnormal traffic, and finally obtaining an abnormal traffic detection result.

Further, the original traffic data sample library in step S1 includes a labeled unencrypted traffic data sample, a labeled encrypted traffic data sample, a non-labeled unencrypted traffic data sample, and a non-labeled encrypted traffic data sample.

Further, the clustering model in step S2 comprises a K-Means clustering model whose clusters are generated under the guidance of an improved sine cosine optimization algorithm.

Further, the number of cluster feature clusters in the step S2 is determined by a preset cluster threshold.

Further, the step S2 specifically includes the following steps:

S21, defining the initialization parameters of the clustering model and establishing initialization sample clusters;

S22, inputting the initialization parameters and the initialization sample clusters from step S21 into the clustering model for training, to obtain a number of original cluster feature clusters less than or equal to the preset cluster threshold;

S23, calculating the fitness of the initialization sample clusters in step S22 and, based on the result, searching within each original cluster feature cluster to obtain the intra-cluster optimal solution of that original cluster feature cluster;

S24, according to the intra-cluster maximum and minimum values of the original cluster feature clusters in step S23, applying Cauchy-operator chaotic mutation to the sample clusters in the original cluster feature cluster set other than the global optimal cluster, and updating the global optimal cluster set;

S25, comparing, one by one, the fitness of the global optimal cluster set from step S24 with the fitness of the initial value of the global optimal cluster set, and finally obtaining the cluster feature cluster set with the optimal fitness.
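The text gives no formulas for steps S21 to S25, so the following is only a minimal sketch under stated assumptions: it uses the textbook sine cosine position update, maps the intra-cluster minimum and maximum onto the hypothetical bounds `lower` and `upper`, and replaces the Cauchy-operator chaotic mutation with plain Cauchy noise applied to every candidate except the current best. The names `sca_step`, `cauchy_mutation`, and `optimise`, the fitness function, and all parameter values are illustrative and do not come from the application.

```python
import math
import random


def clip(position, lower, upper):
    """Keep every coordinate inside the assumed [lower, upper] bounds."""
    return [min(upper, max(lower, v)) for v in position]


def sca_step(position, best, r1, rng):
    """Textbook sine cosine update of one candidate toward the current best."""
    moved = []
    for x, p in zip(position, best):
        r2 = rng.uniform(0.0, 2.0 * math.pi)
        r3 = rng.uniform(0.0, 2.0)
        wave = math.sin(r2) if rng.random() < 0.5 else math.cos(r2)
        moved.append(x + r1 * wave * abs(r3 * p - x))
    return moved


def cauchy_mutation(position, lower, upper, rng):
    """Add standard Cauchy noise to each coordinate, then clip to the bounds."""
    noisy = [x + math.tan(math.pi * (rng.random() - 0.5)) for x in position]
    return clip(noisy, lower, upper)


def optimise(fitness, candidates, lower, upper, iterations=50, seed=0):
    """Minimise fitness; the returned best is never worse than the initial best."""
    rng = random.Random(seed)
    best = min(candidates, key=fitness)
    for t in range(iterations):
        r1 = 2.0 * (1.0 - t / iterations)  # step amplitude shrinks over time
        updated = []
        for cand in candidates:
            trial = clip(sca_step(cand, best, r1, rng), lower, upper)
            if cand is not best:  # mutate only the non-best candidates
                mutant = cauchy_mutation(trial, lower, upper, rng)
                if fitness(mutant) < fitness(trial):
                    trial = mutant
            updated.append(trial)
        candidates = updated
        challenger = min(candidates, key=fitness)
        if fitness(challenger) < fitness(best):
            best = challenger
    return best


def sphere(v):
    """Toy fitness used only for this illustration."""
    return sum(x * x for x in v)


start = [[3.0, -2.0], [1.5, 4.0], [-4.0, 0.5]]
result = optimise(sphere, start, -5.0, 5.0)
```

In a clustering setting, each candidate would presumably encode a set of cluster centres and the fitness would measure cluster compactness, but that encoding is not specified in the text.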

Further, the feature mark in step S3 is a character-number mark denoting a typical feature that distinguishes a cluster feature cluster from the other cluster feature clusters, which facilitates feature conformity evaluation between the labeled feature set and the unlabeled feature set of step S1.

Further, the cluster size in the step S3 specifically represents the sample size of the labeled traffic data and the sample size of the unlabeled traffic data contained in the cluster feature cluster.

Further, the number of sub-detectors obtained by the division in step S3 is equal to the number of cluster feature clusters, and the number assigned to each sub-detector is equal to the character number of the feature mark of the corresponding cluster feature cluster.

Further, the sub detection model in step S4 is configured to finely detect abnormal traffic in the labeled feature set and the unlabeled feature set in step S1, where the model performance of the plurality of sub detection models is consistent with the model performance of the original abnormal traffic detection model in step S3.

Further, the conformity evaluation feature cluster in step S5 is obtained as follows: the features of the cluster feature cluster whose feature-mark character number is 2 that are identical to features of the cluster feature cluster whose feature-mark character number is 1 are removed, and the features of the cluster feature cluster numbered 2 that differ from those of the cluster feature cluster numbered 1 are added to the cluster feature cluster numbered 1, yielding the conformity evaluation feature cluster.
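The integration described above can be illustrated with a short sketch. It assumes, purely for illustration, that each cluster feature cluster is represented as a set of feature names; the function name `integrate_clusters` and the example features are hypothetical.

```python
def integrate_clusters(cluster_1, cluster_2):
    """Build the conformity evaluation feature cluster from two feature sets."""
    shared = cluster_1 & cluster_2      # identical features, kept only once
    differing = cluster_2 - shared      # features of cluster 2 absent from cluster 1
    return cluster_1 | differing        # cluster 1 supplemented with the differences


cluster_1 = {"source port", "packet length", "packet time interval"}
cluster_2 = {"source port", "transport protocol"}
evaluation_cluster = integrate_clusters(cluster_1, cluster_2)
print(sorted(evaluation_cluster))
# ['packet length', 'packet time interval', 'source port', 'transport protocol']
```

Under this set representation the result is simply the union of the two clusters, with the shared feature counted once.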

Further, the sub-detection model to which the cluster feature cluster belongs in the step S6 specifically represents a sub-detection model with the same character number as the feature tag of the cluster feature cluster.

Compared with the prior art, the technical method provided by the embodiment of the application has at least the following advantages: according to the network abnormal flow fine detection method, fine detection of abnormal flow is achieved by generating the cluster feature cluster to divide the plurality of sub-detection models, feature deviation among unlabeled flows is accurately captured by the feature coincidence degree assessment method, influence of the unlabeled flow feature deviation on an abnormal flow detection result is reduced, and therefore detection accuracy of an existing abnormal flow detection technology is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

Fig. 1 is a flow diagram of a network abnormal traffic refinement detection method provided in an embodiment of the present application;

Fig. 2 is a flow chart of another network abnormal traffic refinement detection method provided in an embodiment of the present application;

Fig. 3 is a flow chart of another network abnormal traffic refinement detection method according to an embodiment of the present application;

Fig. 4 is a schematic diagram of a network abnormal traffic refinement detection model provided in an embodiment of the present application.

Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of methods that are consistent with some aspects of the present application as detailed in the accompanying claims.

Traditional abnormal traffic detection methods include algorithms based on label statistics, algorithms based on parameter statistics, algorithms based on information entropy, and the like. These methods generally suffer from a cumbersome detection process, long detection time, low detection accuracy, and poor model scalability. Machine learning can alleviate these problems: by training an initialized abnormal traffic detection model on a large number of traffic data samples, unknown abnormal traffic can be detected in real time, which shortens detection time and improves detection accuracy compared with the traditional methods. However, the machine learning algorithms currently used for abnormal traffic detection, such as decision trees, support vector machines, Bayesian methods, and neural networks, cannot capture feature deviations among unlabeled traffic data and therefore often have difficulty accurately detecting abnormal traffic behaviors in unlabeled encrypted traffic data, so the detection effect of the abnormal traffic detection model is unsatisfactory.

In view of this, the application provides a refined detection method for abnormal network traffic. The method divides the original abnormal traffic detection model into a plurality of sub-detectors using the cluster feature marks and cluster sizes as the basis for division, inputs each cluster feature cluster into the sub-detector to which it belongs for training to obtain a plurality of sub-detection models, evaluates the feature conformity between the unlabeled dimension-reduction feature set and the cluster feature clusters one by one to obtain the set of cluster feature clusters that best matches the unlabeled dimension-reduction feature set, and inputs the unlabeled dimension-reduction feature set into the best-matching sub-detection model for abnormal traffic detection, finally obtaining an abnormal traffic detection result. By generating cluster feature clusters to divide the model into a plurality of sub-detection models, the method achieves refined detection of abnormal traffic; by providing a feature conformity evaluation method, it captures feature deviations among unlabeled traffic, which reduces the influence of such deviations on the abnormal traffic detection result and improves the detection accuracy of existing abnormal traffic detection models.

The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems through specific embodiments with reference to the accompanying drawings.

Fig. 1 is a flow chart of a network abnormal traffic refinement detection method provided in an embodiment of the present application. As shown in fig. 1, the method mainly comprises the following steps:

S101, acquiring a labeled dimension-reduction feature set and an unlabeled dimension-reduction feature set from an original traffic data sample library.

The flow data samples contained in the original flow data sample library should belong to the following categories: a tagged non-encrypted traffic data sample, a tagged encrypted traffic data sample, an untagged non-encrypted traffic data sample, and an untagged encrypted traffic data sample.

The source and characteristics of the data samples belonging to the above traffic classes in the original traffic data sample library are not limited in this application. For example, the traffic data samples constituting the original traffic data sample library may be traffic data generated during e-mail transmission or traffic data generated during network audio transmission; the flow data may be in an image format or in a text format.

Depending on the requirements of the application scenario, the traffic data samples in the original traffic data sample library may be cleaned before the labeled and unlabeled traffic data samples are split. Samples containing invalid feature information, such as missing values, zero values, illegal values, and repeated values, are removed from the original traffic data sample library to obtain a cleaned traffic data sample set, so that invalid traffic data does not affect the detection result of the abnormal traffic detection model.

Feature extraction for the traffic data samples may be implemented in any existing manner, using any existing data feature extraction method. For example, in one possible implementation, the explicit features of the traffic data are first extracted directly according to the original traffic data sample format, and the implicit features are then extracted by a Word2vec-based feature extraction algorithm. In another possible implementation, labeled and unlabeled traffic data samples are treated separately: for labeled samples, the explicit features are extracted directly according to the original labeled traffic data sample format; for unlabeled samples, the implicit features are extracted by a Word2vec-based feature extraction algorithm. The Word2vec-based feature extraction algorithm is applicable to traffic data in text format, for which feature extraction methods such as weight statistics and TF-IDF may also be used; for traffic data in image format, the SIFT algorithm, the HOG algorithm, or the like may be used.
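The cleaning step described above can be sketched as follows. This is a minimal illustration that assumes each traffic data sample is a dictionary of feature values; the field names, the legality rule passed as `is_legal`, and the function name `clean_samples` are hypothetical and are not specified in the application.

```python
def clean_samples(samples, required_fields, is_legal):
    """Drop samples with missing, zero, illegal, or repeated feature values."""
    seen = set()
    cleaned = []
    for sample in samples:
        values = [sample.get(field) for field in required_fields]
        if any(v is None for v in values):   # missing value
            continue
        if any(v == 0 for v in values):      # zero value
            continue
        if not is_legal(sample):             # illegal value
            continue
        key = tuple(values)
        if key in seen:                      # repeated sample
            continue
        seen.add(key)
        cleaned.append(sample)
    return cleaned


raw = [
    {"src_ip": "10.0.0.1", "length": 120, "bit_rate": 8000},
    {"src_ip": "10.0.0.1", "length": 120, "bit_rate": 8000},  # repeated
    {"src_ip": "10.0.0.2", "length": 0, "bit_rate": 4000},    # zero value
    {"src_ip": "10.0.0.3", "length": 64},                      # missing value
    {"src_ip": "10.0.0.4", "length": -5, "bit_rate": 100},    # illegal value
]
fields = ("src_ip", "length", "bit_rate")
cleaned = clean_samples(raw, fields, lambda s: s["length"] > 0 and s["bit_rate"] > 0)
```

Only the first sample survives in this example. Whether a zero is truly invalid depends on the feature, so a real pipeline would likely apply the zero check per field.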

Optionally, for unlabeled traffic data samples in text format, the features extracted with the Word2vec feature extraction algorithm may include, for example, the sample acquisition time, sample source IP address, destination IP address, sample byte length, sample arrival interval, and traffic transmission bit rate.

Feature dimension reduction for the traffic data samples may be implemented in any existing manner, using any existing data feature dimension reduction method. For example, the feature dimension reduction method may be multidimensional scaling (MDS), isometric mapping (ISOMAP), principal component analysis (PCA), sequential backward selection (SBS), and the like.

Taking the SBS algorithm as an example, dimension reduction of the labeled traffic data sample features may include the following steps:

The weights of the different features of the labeled traffic data samples are calculated according to the number of samples containing each feature, and a number of features with the largest weights (that is, those with the greatest influence on the traffic detection result) are selected to form an important feature set. A criterion performance evaluation function is defined as the difference between the performance of the traffic detection model when a given feature is retained and when it is removed. The criterion performance evaluation function is calculated for each feature, and the feature with the largest value is removed in the next round of iterative training.

These steps are repeated until the performance loss of the original traffic detection model reaches a peak or the feature dimension of the preset feature set is reached. The one or more features removed during the iterations are deleted from the original labeled traffic feature set to obtain the labeled dimension-reduction feature set.
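The iterative removal above can be sketched as a greedy loop. The sketch assumes the criterion is the model performance after removing a feature minus the performance before removal; because the performance before removal is the same for every candidate in a round, ranking by the performance after removal gives the same choice. Only the preset-dimension stopping rule is implemented, and the function name, the `evaluate` callback, and the toy usefulness weights are hypothetical.

```python
def sequential_backward_selection(features, evaluate, target_size):
    """Greedy SBS: remove one feature per round until target_size remain."""
    selected = list(features)
    removed = []
    while len(selected) > target_size:
        # Score each candidate by the model performance once it is removed.
        scores = {f: evaluate([g for g in selected if g != f]) for f in selected}
        to_drop = max(scores, key=scores.get)  # removal that hurts the least
        selected.remove(to_drop)
        removed.append(to_drop)
    return selected, removed


# Toy stand-in for a detection model: performance is a sum of feature weights.
usefulness = {
    "packet length": 0.40,
    "arrival interval": 0.30,
    "bit rate": 0.05,
    "timestamp": 0.01,
}


def toy_evaluate(subset):
    return sum(usefulness[f] for f in subset)


kept, dropped = sequential_backward_selection(usefulness, toy_evaluate, target_size=2)
```

Here `kept` holds the two most useful features and `dropped` lists the removed ones in removal order. In practice `evaluate` would retrain and score the traffic detection model on each candidate subset, which is the expensive part of SBS.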

S102, inputting the labeled dimension-reduction feature set into a clustering model for training to obtain a plurality of cluster feature clusters.

The labeled dimension-reduction feature set refers to the training set of labeled traffic data samples after feature dimension reduction. The clustering model includes a K-Means clustering model whose clusters are generated under the guidance of an improved sine cosine optimization algorithm. Optionally, the optimization algorithm may be another optimization algorithm, or a variant of the sine cosine optimization algorithm, that improves the convergence speed, clustering accuracy, cluster size, or other aspects of the existing clustering model. The initial clustering model may also be another clustering algorithm that is theoretically similar to K-Means, or a variant of the K-Means algorithm. The application does not strictly limit the optimization algorithm or the clustering algorithm employed. Before cluster feature clusters are obtained with the clustering model, a cluster threshold parameter must be preset; this parameter is a positive integer. The number of cluster feature clusters is determined by the preset cluster threshold and, under normal circumstances, should not exceed it.
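To illustrate how a preset cluster threshold bounds the number of cluster feature clusters, the following sketch uses plain Lloyd-style K-Means rather than the improved sine cosine guided variant described in the application. The function names, the random initialization, and the toy points are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, cluster_threshold, iterations=20, seed=0):
    """Plain K-Means that returns at most cluster_threshold non-empty clusters."""
    rng = random.Random(seed)
    k = min(cluster_threshold, len(points))
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Empty clusters are dropped, so the count can fall below the threshold.
        clusters = [c for c in clusters if c]
        centroids = [tuple(sum(col) / len(c) for col in zip(*c)) for c in clusters]
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = k_means(points, cluster_threshold=2)
```

With two well-separated groups of points and a threshold of 2, the sketch separates the low and high groups; the number of returned clusters never exceeds the threshold.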

S103, adding a feature mark to each cluster feature cluster, and dividing an original abnormal traffic detection model into a plurality of sub-detectors using the feature marks as the basis for division.

A cluster feature cluster is a sample set composed of labeled traffic data sharing typical or unique features. The features contained in different cluster feature clusters are not identical, and each cluster feature cluster should contain at least one typical or unique feature that distinguishes it from the others. For example, in a cluster feature cluster set consisting of n cluster feature clusters, suppose that only one cluster feature cluster contains the feature "packet time interval" and the remaining n-1 cluster feature clusters do not. The feature "packet time interval" is then the typical or unique feature that distinguishes that cluster feature cluster from the others, and it is the main basis for adding a feature mark to that cluster feature cluster. Optionally, if a cluster feature cluster contains two or more typical features that distinguish it from the other cluster feature clusters, only one of them is selected for adding the feature mark. Optionally, in the extreme case where a failure of the clustering algorithm leaves a cluster feature cluster with no typical or unique feature that distinguishes it from the others, that cluster feature cluster is merged with another cluster feature cluster.

The feature mark is a character-number mark denoting the typical or unique feature that distinguishes a cluster feature cluster from the other cluster feature clusters, and it facilitates feature conformity evaluation between unlabeled traffic data samples and labeled traffic data samples. For example, assume that the labeled traffic data samples contained in a cluster feature cluster consist essentially of seven-tuple packets, that is, each labeled traffic data sample has the format {source port number, destination port number, source IP address, destination IP address, transport protocol, packet length, timestamp}. After the feature mark is added to the cluster feature cluster, the format of each labeled traffic data sample becomes an eight-tuple packet: {character number, source port number, destination port number, source IP address, destination IP address, transport protocol, packet length, timestamp}, where the character number is the feature mark of the cluster feature cluster to which the labeled traffic data sample belongs, is binary-coded, and is located at the head position.
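The seven-tuple to eight-tuple change can be sketched as below. The field order follows the text; the 4-bit width of the binary character number, the type name `SevenTuple`, the function name `add_feature_mark`, and the sample values are assumptions made for illustration.

```python
from typing import NamedTuple


class SevenTuple(NamedTuple):
    source_port: int
    destination_port: int
    source_ip: str
    destination_ip: str
    transport_protocol: str
    packet_length: int
    timestamp: int


def add_feature_mark(sample, character_number, width=4):
    """Prepend the binary-coded character number to form the eight-tuple."""
    mark = format(character_number, f"0{width}b")  # e.g. 3 -> "0011"
    return (mark,) + tuple(sample)


sample = SevenTuple(51234, 443, "10.0.0.1", "10.0.0.2", "TCP", 1500, 1704240000)
marked = add_feature_mark(sample, character_number=3)
print(marked[0], len(marked))  # 0011 8
```

Placing the mark at the head lets a later stage read the cluster number of a sample without parsing the remaining fields.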

The original abnormal traffic detection model may be any existing machine learning or deep learning model, and its implementation principle is not strictly limited. For example, the original abnormal traffic detection model may be a machine learning model such as a support vector machine, a decision tree, XGBoost, or a random forest, or a deep learning model such as a recurrent neural network (RNN), a long short-term memory network (LSTM), or YOLO. The number of sub-detectors obtained by the division is equal to the number of generated cluster feature clusters, and the number assigned to each sub-detector is equal to the character number of the feature mark of the corresponding cluster feature cluster.

S104, inputting the cluster feature clusters with different feature marks into the corresponding sub-detectors for training to obtain a plurality of sub-detection models.

The sub-detection models are used for refined detection of abnormal traffic in the labeled and unlabeled traffic data samples of the original traffic data sample library. The number assigned to each sub-detection model is identical to the character number of the feature mark of the corresponding cluster feature cluster, the model performance of each sub-detection model is consistent with that of the original abnormal traffic detection model, and each sub-detection model can independently train on all labeled and unlabeled traffic data samples in the original traffic data sample library. After the sub-detection models are obtained, their detection effects are tested with the labeled traffic data sample test set after feature dimension reduction: the labeled test samples with different feature marks are input into the sub-detection models corresponding to their feature marks, the test results of each sub-detection model are output, and the detection effect of each sub-detection model, including detection accuracy and detection duration, is determined from these test results.
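The division, training, and per-mark testing described above can be sketched as follows. The `MajorityDetector` class is a deliberately trivial, hypothetical stand-in for the original abnormal traffic detection model; the cluster contents and labels are invented toy data.

```python
import copy
from collections import Counter


class MajorityDetector:
    """Hypothetical stand-in for the original abnormal traffic detection model."""

    def __init__(self):
        self.label = None

    def fit(self, samples, labels):
        # Remember the most frequent label seen in this cluster.
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, sample):
        return self.label


def train_sub_detectors(original_model, marked_clusters):
    """Train one copy of the original model per feature-marked cluster."""
    sub_models = {}
    for number, (samples, labels) in marked_clusters.items():
        sub_models[number] = copy.deepcopy(original_model).fit(samples, labels)
    return sub_models


def accuracy(model, samples, labels):
    hits = sum(model.predict(s) == y for s, y in zip(samples, labels))
    return hits / len(labels)


# Keys are feature-mark character numbers; values are (samples, labels).
marked_clusters = {
    1: ([(80, 120), (80, 130), (80, 9000)], ["normal", "normal", "flood"]),
    2: ([(22, 60), (22, 64)], ["brute force", "brute force"]),
}
sub_models = train_sub_detectors(MajorityDetector(), marked_clusters)

# Each test set is sent only to the sub-detection model with the same number.
test_sets = {1: ([(80, 110)], ["normal"]), 2: ([(22, 61)], ["brute force"])}
report = {n: accuracy(sub_models[n], *test_sets[n]) for n in sub_models}
```

Copying the original model for each cluster keeps the sub-detection models architecturally identical, which is one way to read the statement that their model performance is consistent with that of the original model.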

S105, evaluating the feature conformity between the unlabeled dimension-reduction feature set and the plurality of cluster feature clusters one by one to obtain the set of cluster feature clusters that best matches the unlabeled dimension-reduction feature set.

Feature conformity evaluation is performed between each unlabeled traffic data sample in the unlabeled dimension-reduction feature set and all cluster feature clusters. The cluster feature cluster that best matches each unlabeled traffic data sample is obtained from the feature conformity scores, which yields the set of cluster feature clusters that best matches all unlabeled traffic data samples in the unlabeled dimension-reduction feature set. Based on this set, each unlabeled traffic data sample is assigned the sub-detection model with the best detection effect, which is then used for refined detection of unlabeled abnormal traffic.

S106, respectively inputting the unlabeled dimension reduction feature sets into the sub detection models of the cluster feature clusters which are most matched with the unlabeled dimension reduction feature sets to detect abnormal flow, and finally obtaining an abnormal flow detection result.

The sub-detection model to which a cluster feature cluster belongs is the sub-detection model whose number is the same as the character number of the cluster's feature mark. Each unlabeled traffic data sample is input into the sub-detection model corresponding to the feature mark of its best-matching cluster feature cluster; this is the sub-detection model with the best detection effect for that sample. After all unlabeled traffic data samples in the unlabeled dimension-reduction feature set have been assigned to models, each sub-detection model performs refined abnormal traffic detection on the samples it receives and outputs a refined abnormal traffic detection result. The result should include the label to which each abnormal unlabeled traffic data sample belongs, that is, the abnormal behavior category of the abnormal traffic.
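The dispatch of unlabeled samples to their sub-detection models can be sketched as a simple lookup by character number. The two rule-based sub-detection models, the `matcher` function that stands in for feature conformity evaluation, and the sample fields are all hypothetical.

```python
def route_and_detect(unlabeled_samples, best_cluster_number, sub_models):
    """Send each sample to the sub-detection model of its best-matching cluster."""
    results = []
    for sample in unlabeled_samples:
        number = best_cluster_number(sample)
        label = sub_models[number](sample)
        results.append({"sample": sample, "model": number, "label": label})
    return results


# Hypothetical sub-detection models, keyed by feature-mark character number.
sub_models = {
    1: lambda s: "flood" if s["length"] > 1400 else "normal",
    2: lambda s: "scan" if s["dst_port"] < 1024 else "normal",
}


def matcher(sample):
    """Hypothetical matcher: the protocol alone decides the best cluster."""
    return 1 if sample["protocol"] == "UDP" else 2


samples = [
    {"protocol": "UDP", "length": 1500, "dst_port": 53},
    {"protocol": "TCP", "length": 60, "dst_port": 8080},
]
results = route_and_detect(samples, matcher, sub_models)
```

Each result records which sub-detection model handled the sample and the behavior category it assigned, mirroring the requirement that the detection result include the label of the abnormal sample.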

According to the network abnormal flow refinement detection method, an original abnormal flow detection model is divided into a plurality of sub-detectors by taking cluster feature marks and cluster sizes as division basis, the plurality of cluster feature clusters are input into the sub-detectors to which the cluster feature clusters belong respectively for training to obtain a plurality of sub-detection models, feature conformity evaluation is carried out on a label-free dimension reduction feature set and the plurality of cluster feature clusters one by one to obtain a cluster feature cluster set which is matched with the label-free dimension reduction feature set, abnormal flow detection is carried out by inputting the label-free dimension reduction feature set into the sub-detection model which is matched with the label-free dimension reduction feature set, and finally an abnormal flow detection result is obtained. According to the method, the clustering feature clusters are generated to divide the plurality of sub-detection models to realize the fine detection of the abnormal flow, and the feature deviation among the unlabeled flows is accurately captured by providing the feature coincidence degree assessment method, so that the influence of the unlabeled flow feature deviation on the abnormal flow detection result is reduced, and the detection accuracy of the existing abnormal flow detection model is improved.

The following describes in detail how the foregoing step S105 uses the feature conformity assessment method provided in the present application to obtain the cluster feature cluster set that best matches the unlabeled dimension reduction feature set.

Fig. 2 is a flow chart of another network abnormal traffic refinement detection method provided in the embodiment of the present application. As shown in Fig. 2, the aforementioned step S105 mainly includes the following steps:

S201, performing feature conformity matching between a given unlabeled flow data feature in the unlabeled dimension reduction feature set and the cluster feature cluster whose feature mark character number is 1, to obtain the feature conformity score of the cluster feature cluster whose feature mark character number is 1;

S202, judging whether the feature conformity score of the cluster feature cluster is lower than a preset threshold; if it is lower than the preset threshold, integrating the cluster feature cluster with the cluster feature cluster whose feature mark character number is 2 to obtain a conformity evaluation feature cluster; if it is higher than the preset threshold, taking the cluster feature cluster as the cluster feature cluster that best matches the unlabeled flow data feature;

S203, performing feature conformity matching between the unlabeled flow data feature and the conformity evaluation feature cluster to obtain the feature conformity score of the conformity evaluation feature cluster;

S204, computing a weighted average of the feature conformity score obtained in step S201 and the feature conformity score obtained in step S203 to obtain a weighted feature conformity score of the cluster feature cluster;

S205, comparing the feature conformity scores, and selecting the cluster feature cluster with the larger feature conformity value as the cluster feature cluster that best matches the unlabeled flow data feature;

S206, repeating the steps S201-S205 until feature conformity assessment is completed between the unlabeled dimension reduction feature set and all the cluster feature clusters with the feature labels, and finally obtaining the cluster feature cluster set which is most matched with the unlabeled dimension reduction feature set.
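Steps S201 to S206 can be sketched as follows for the two-cluster case that the text describes. This is a minimal sketch under stated assumptions: features and clusters are modeled as Python sets of feature tokens, Jaccard similarity stands in for the feature conformity score, the `threshold` and `weight` values are placeholders, a score equal to the threshold is treated as a match, and the weighted score from S204 is attributed to cluster 2. None of these choices is specified in the text, and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def jaccard_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity of two feature collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(set_a & set_b) / len(union)


def match_feature(
    feature: Set[str],
    cluster_1: Set[str],
    cluster_2: Set[str],
    threshold: float = 0.5,  # placeholder value, not taken from the source
    weight: float = 0.5,     # placeholder value, not taken from the source
) -> int:
    """Return the character number (1 or 2) of the best-matching cluster."""
    score_1 = jaccard_similarity(feature, cluster_1)              # S201
    if score_1 >= threshold:                                      # S202: match found
        return 1
    evaluation_cluster = cluster_1 | cluster_2                    # S202: integrate
    score_eval = jaccard_similarity(feature, evaluation_cluster)  # S203
    score_2 = weight * score_1 + (1.0 - weight) * score_eval      # S204 (assumed target: cluster 2)
    return 1 if score_1 >= score_2 else 2                         # S205


def match_all(
    features: Iterable[Set[str]], cluster_1: Set[str], cluster_2: Set[str]
) -> List[int]:
    """S206: repeat the assessment for every unlabeled feature."""
    return [match_feature(f, cluster_1, cluster_2) for f in features]
```

How the procedure extends beyond two cluster feature clusters is not spelled out in the text, so the sketch deliberately stops at the case described.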

Taking the Jaccard similarity method as an example, the feature conformity score of the cluster feature cluster in step S201 may be obtained as follows: calculate the similarity ratio between a given unlabeled flow data feature and the cluster feature cluster whose feature mark character number is 1; calculate the Jaccard distance between the two; and normalize the distance to a value between 0 and 1. The normalized value is the feature conformity score of the unlabeled flow data feature with respect to the cluster feature cluster whose feature mark character number is 1.
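For reference, the standard definitions of Jaccard similarity and Jaccard distance can be written as below. The symbols $x$ (the set of features of an unlabeled flow data sample) and $C_1$ (the set of features of the cluster feature cluster whose feature mark character number is 1) are assumptions introduced here, since the original symbols are not available in this text; treating both as finite, non-empty sets is likewise an assumption.

```latex
J(x, C_1) = \frac{|x \cap C_1|}{|x \cup C_1|}, \qquad
d_J(x, C_1) = 1 - J(x, C_1), \qquad
0 \le d_J(x, C_1) \le 1 .
```

Under these definitions the distance already lies between 0 and 1, so any further normalization would depend on how the features are encoded.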

The conformity evaluation feature cluster in step S202 is obtained as follows: the features of the cluster feature cluster whose feature mark character number is 2 that are identical to features in the cluster feature cluster whose feature mark character number is 1 are removed, and the remaining features, which differ from those in the cluster feature cluster whose feature mark character number is 1, are added to the cluster feature cluster whose feature mark character number is 1, yielding the conformity evaluation feature cluster.
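The construction above can be illustrated with set operations. The sketch assumes that each cluster feature cluster can be represented as a set of hashable feature tokens; the function name `build_evaluation_cluster` is hypothetical.

```python
from typing import Hashable, Iterable, Set


def build_evaluation_cluster(
    cluster_1: Iterable[Hashable], cluster_2: Iterable[Hashable]
) -> Set[Hashable]:
    """Add to cluster 1 only those features of cluster 2 that it lacks."""
    base = set(cluster_1)
    # Drop the features of cluster 2 that are identical to features in cluster 1.
    differing = set(cluster_2) - base
    # Supplement cluster 1 with the remaining, differing features.
    return base | differing
```

Under this set representation the result equals the union of the two clusters; the two-step form is kept only to mirror the wording of the description.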

In the method provided by this embodiment, feature conformity assessment is carried out between the unlabeled dimension reduction feature set and the cluster feature clusters one by one to obtain the cluster feature cluster set that best matches the unlabeled dimension reduction feature set; the unlabeled dimension reduction feature set is then input into the best-matching sub-detection model for abnormal flow detection, yielding the abnormal flow detection result. The feature conformity assessment method captures the feature deviation among unlabeled flows and reduces its influence on the abnormal flow detection result, thereby improving the detection accuracy of the existing abnormal flow detection model.

Taking a K-Means clustering algorithm whose cluster generation is guided by an improved sine cosine optimization algorithm as an example, the following describes in detail how, in step S102, the labeled dimension reduction feature set is input into the clustering model for training to obtain a plurality of cluster feature clusters.

Fig. 3 is a flowchart of another network abnormal traffic refinement detection method according to an embodiment of the present application. As shown in fig. 3, the foregoing step S102 mainly includes the following steps:

S301, defining initialization parameters of the improved sine cosine optimization clustering model, and establishing an initialization sample cluster;

S302, inputting the initialization parameters and the initialization sample cluster into a K-Means clustering model for training to obtain a plurality of original cluster feature clusters whose number is less than or equal to the cluster threshold parameter;

S303, calculating the fitness of the initialization sample cluster and, according to the result, carrying out optimization within each original cluster feature cluster to obtain the intra-cluster optimal solution of the original cluster feature cluster;

S304, according to the intra-cluster maximum value and the intra-cluster minimum value of the original cluster feature cluster, applying Cauchy-operator chaotic mutation to the sample clusters other than the global optimal cluster in the original cluster feature cluster set, and updating the global optimal cluster set;

S305, comparing the fitness of the global optimal cluster set in the original cluster feature clusters with the fitness of the initial value of the global optimal cluster set one by one, and finally obtaining the cluster feature cluster set with the optimal fitness.

The initialization parameters in step S301 mainly include the sample size, the cluster size, the number of clustering iterations, the cluster threshold parameter, the initial value of the global optimal cluster, and the like. After the initialization parameters are defined, a Monte Carlo simulation method is used to establish initialization clusters that are uniformly distributed and have no overlapping samples; these are used to generate a global optimal cluster feature cluster set from which a plurality of sub-detectors can be divided.
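One simple way to obtain non-overlapping, evenly sized initial clusters by random sampling is shown below. It is only an illustration of the stated properties (uniform distribution, no sample overlap): the function name `init_sample_clusters`, the use of a seeded shuffle, and the round-robin split are assumptions, and the application does not say how its Monte Carlo step is carried out.

```python
import random
from typing import List


def init_sample_clusters(n_samples: int, n_clusters: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition sample indices into disjoint, near-equal clusters."""
    if not 1 <= n_clusters <= n_samples:
        raise ValueError("n_clusters must be between 1 and n_samples")
    rng = random.Random(seed)  # seeded so that the draw is reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)       # random draw without replacement
    # Round-robin split: every index lands in exactly one cluster, and
    # cluster sizes differ by at most one.
    return [indices[i::n_clusters] for i in range(n_clusters)]
```

Because the split is a partition of a shuffled index list, no sample can appear in two initial clusters.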

The fitness of the initialization sample cluster in step S303 is mainly used to evaluate the performance of the initialization sample cluster. The quantities involved in its calculation are the maximum value range of the fitness, the fitness coefficient, the total number of training iterations, the current number of iterations, and the fitness index. The intra-cluster optimal solution of the original cluster feature clusters comprises the intra-cluster maximum value and the intra-cluster minimum value of each original cluster feature cluster, and the global optimal cluster set in the original cluster feature cluster set.

The purpose of the Cauchy-operator chaotic mutation in step S304 is to improve the convergence performance of the global search so as to obtain a better global optimal cluster set. Updating the global optimal cluster set specifically means updating the coordinates of the global optimal clusters in the global optimal cluster set.
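A generic form of chaos-driven Cauchy mutation is sketched below to make the idea concrete. It is not the operator of the application, whose formula is not given in this text: the logistic map as the chaotic sequence, the inverse-CDF Cauchy step, the `scale` factor, and clamping to per-dimension bounds (standing in for the intra-cluster minimum and maximum) are all assumptions, as are the function names.

```python
import math
from typing import List, Sequence


def logistic_map(x: float, r: float = 4.0) -> float:
    """One step of the logistic map, a common chaotic sequence on (0, 1)."""
    return r * x * (1.0 - x)


def cauchy_chaotic_mutation(
    points: Sequence[Sequence[float]],
    best_index: int,
    lower: Sequence[float],
    upper: Sequence[float],
    chaos_seed: float = 0.37,  # assumed starting value of the chaotic sequence
    scale: float = 0.1,        # assumed step-size factor
) -> List[List[float]]:
    """Perturb every point except the current best with Cauchy-distributed steps."""
    chaos = chaos_seed
    mutated: List[List[float]] = []
    for i, point in enumerate(points):
        if i == best_index:
            mutated.append(list(point))  # the global best is left unchanged
            continue
        new_point = []
        for d, value in enumerate(point):
            chaos = logistic_map(chaos)
            # Inverse CDF of the standard Cauchy distribution, driven by the
            # chaotic value instead of a pseudo-random uniform draw.
            step = math.tan(math.pi * (chaos - 0.5))
            candidate = value + scale * (upper[d] - lower[d]) * step
            # Keep the mutated coordinate inside the allowed range.
            new_point.append(min(max(candidate, lower[d]), upper[d]))
        mutated.append(new_point)
    return mutated
```

The heavy tails of the Cauchy distribution occasionally produce large jumps, which is the usual motivation for using it to escape local optima.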

The foregoing step S305 is implemented as follows: each global optimal cluster whose fitness is greater than the initial fitness of the global optimal cluster set is selected, and the original cluster feature cluster set is updated according to that global optimal cluster, until all global optimal clusters in the global optimal cluster set have been traversed. The cluster feature cluster set with optimal fitness is then used to divide sub-detectors with different feature marks.

In the method provided by this embodiment, the labeled dimension reduction feature set is input for training into a clustering model whose cluster generation is guided by a global optimization algorithm, so that a plurality of cluster feature clusters to which feature marks can be added are obtained, providing a model division basis for refined abnormal flow detection. Compared with other abnormal flow detection model training methods, the method generates the cluster feature clusters with an optimization-guided clustering model and thereby effectively identifies the feature deviation among different flow data.

Next, the structure and function of the network abnormal traffic refinement detection model built by the network abnormal traffic refinement detection method described in fig. 1 to 3 will be described.

Fig. 4 is a schematic diagram of a network abnormal traffic refinement detection model provided in an embodiment of the present application. As shown in Fig. 4, the model mainly includes: a K-Means clustering model whose cluster generation is guided by an improved sine cosine optimization algorithm, cluster feature clusters carrying feature marks, and a plurality of sub-detection models carrying cluster feature cluster marks (taking the XGBoost model as an example). The labeled dimension reduction feature set is input into the K-Means clustering model to generate a plurality of cluster feature clusters; feature marks are obtained from the cluster feature clusters, yielding a plurality of sub-detection XGBoost models carrying cluster feature cluster marks; the sub-detection XGBoost models are trained on the labeled dimension reduction flow data samples; after the feature conformity assessment is completed, each sub-detection XGBoost model tests the unlabeled dimension reduction flow data samples that best match it, finally yielding the refined abnormal flow detection result.

In the method provided by this embodiment, the original abnormal flow detection model is divided into a plurality of sub-detectors on the basis of the cluster feature marks and the cluster sizes, and the cluster feature clusters are input into the sub-detectors to which they belong for training, yielding the plurality of sub-detection models. Compared with other abnormal flow detection model training methods, the method achieves refined detection of abnormal flow by generating cluster feature clusters to divide a plurality of sub-detection models, and effectively improves the accuracy of the abnormal flow detection model.
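The per-cluster training described above can be sketched as follows. The sketch assumes that every labeled sample already carries the character number of its cluster feature cluster. A trivial majority-label trainer stands in for the real sub-detector (the description names XGBoost as one example) so that the code stays self-contained; the names `train_sub_detectors` and `majority_label_trainer` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Sequence[float]
Detector = Callable[[Sample], str]
Trainer = Callable[[List[Sample], List[str]], Detector]


def majority_label_trainer(samples: List[Sample], labels: List[str]) -> Detector:
    """Stand-in trainer: always predicts the most frequent training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda sample: most_common


def train_sub_detectors(
    samples: Sequence[Sample],
    labels: Sequence[str],
    cluster_marks: Sequence[int],
    trainer: Trainer = majority_label_trainer,
) -> Dict[int, Detector]:
    """Train one sub-detector per cluster mark on that cluster's samples only."""
    grouped: Dict[int, Tuple[List[Sample], List[str]]] = defaultdict(lambda: ([], []))
    for sample, label, mark in zip(samples, labels, cluster_marks):
        grouped[mark][0].append(sample)
        grouped[mark][1].append(label)
    # The resulting dictionary is keyed by the cluster mark character number.
    return {mark: trainer(xs, ys) for mark, (xs, ys) in grouped.items()}
```

Passing a different `trainer` (for example, one that fits a gradient-boosted classifier) would change the sub-detector type without changing the division logic.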

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A network abnormal flow refinement detection method is characterized by comprising the following steps:

S52, judging whether the feature conformity score of the cluster feature cluster is lower than a preset threshold; if it is lower than the preset threshold, integrating the cluster feature cluster with the cluster feature cluster whose feature mark character number is 2 to obtain a conformity evaluation feature cluster; if it is higher than the preset threshold, taking the cluster feature cluster as the cluster feature cluster that best matches said unlabeled traffic data feature;

2. The method according to claim 1, wherein the original traffic data sample library in step S1 includes a labeled unencrypted traffic data sample, a labeled encrypted traffic data sample, a non-labeled unencrypted traffic data sample, and a non-labeled encrypted traffic data sample.

3. The method according to claim 1, wherein the clustering model in step S2 includes a K-Means clustering model whose cluster generation is guided by an improved sine cosine optimization algorithm.

4. The method for detecting network abnormal traffic refinement according to claim 1, wherein the number of cluster feature clusters in the step S2 is determined by a preset cluster threshold.

5. The method for detecting network abnormal traffic refinement according to claim 1, wherein the step S2 specifically includes the steps of:

S22, inputting the initialization parameters and the initialization sample clusters in the step S21 into the clustering model for training to obtain a plurality of original cluster feature clusters whose number is less than or equal to a preset cluster threshold;

S23, calculating the fitness of the initialization sample cluster in the step S22 and, according to the result, carrying out optimization within each original cluster feature cluster to obtain an intra-cluster optimal solution of the original cluster feature cluster;

S25, comparing the fitness of the global optimal cluster set in the original cluster feature clusters in the step S24 with the fitness of the initial value of the global optimal cluster set one by one, and finally obtaining the cluster feature cluster set with optimal fitness.

6. The method for detecting network abnormal traffic refinement according to claim 1, wherein the feature mark in the step S3 is a character number mark for the typical features that distinguish a cluster feature cluster from the other cluster feature clusters, so as to facilitate feature conformity assessment on the labeled feature set and the unlabeled feature set in the step S1.

7. The method for detecting network abnormal traffic refinement according to claim 1, wherein the cluster size in the step S3 specifically represents a labeled traffic data sample size and an unlabeled traffic data sample size contained in the cluster feature cluster.

8. The method for detecting network abnormal traffic refinement according to claim 1, wherein in the step S3, the dividing dimension of the sub-detector is equal to the number of the cluster feature clusters, and the subordinate number of the sub-detector is equal to the character number of the cluster feature cluster to which the feature mark belongs.

9. The method according to claim 1, wherein the sub-detection model in step S4 is used for finely detecting the abnormal traffic in the tagged and untagged dimension-reduction feature sets in step S1, and the model performance of the plurality of sub-detection models is consistent with the model performance of the original abnormal traffic detection model in step S3.

10. The method for refined detection of network abnormal traffic according to claim 1, wherein the conformity evaluation feature cluster in the step S5 is obtained as follows: the features of the cluster feature cluster whose feature mark character number is 2 that are identical to features in the cluster feature cluster whose feature mark character number is 1 are removed, and the features of the cluster feature cluster whose feature mark character number is 2 that differ from the features in the cluster feature cluster whose feature mark character number is 1 are added to the cluster feature cluster whose feature mark character number is 1, yielding the conformity evaluation feature cluster.

11. The method for detecting network abnormal traffic refinement according to claim 1, wherein the sub-detection model to which the cluster feature cluster belongs in the step S6 is specifically the sub-detection model whose number equals the character number of the cluster feature mark.