CN110598774A - Encrypted flow detection method and device, computer readable storage medium and electronic equipment

Info

Publication number
CN110598774A
Authority
CN
China
Prior art keywords
data
encrypted
algorithm
training sample
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910827194.5A
Other languages
Chinese (zh)
Other versions
CN110598774B (en)
Inventor
罗赟骞
邬江
戴方岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Power Great Wall Internetworking Safety Technology Research Institute (beijing) Co Ltd
Original Assignee
China Power Great Wall Internetworking Safety Technology Research Institute (beijing) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Power Great Wall Internetworking Safety Technology Research Institute (beijing) Co Ltd
Priority to CN201910827194.5A
Publication of CN110598774A
Application granted
Publication of CN110598774B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides an encrypted traffic detection method and device, a computer readable storage medium and electronic equipment. The method comprises the following steps: extracting features of network sessions from a target file to serve as training samples, and constructing a training sample set, wherein data in the training samples comprise data of at least two data types; preprocessing the training samples to set the data type of predetermined training samples to a data type which can be identified by a predetermined algorithm, and obtaining a preprocessed training sample set, wherein the predetermined training samples comprise features of a network session whose data type, before extraction from the target file, was a data type which can be identified by the predetermined algorithm, and the predetermined algorithm can identify features of at least two data types; constructing an encrypted traffic detection model by adopting the predetermined algorithm; and detecting the object to be detected by using the constructed encrypted traffic detection model. The device is used for executing the encrypted traffic detection method. The invention constructs more comprehensive detection features, saves computing resources and improves detection accuracy.

Description

Encrypted flow detection method and device, computer readable storage medium and electronic equipment
Technical Field
The present invention relates to the field of network security, and in particular, to an encrypted traffic detection method, an encrypted traffic detection apparatus for performing the encrypted traffic detection method, a computer-readable storage medium, and an electronic device.
Background
With the rapid development of the internet of things, big data, cloud computing and high-speed mobile communication networks, the information confidentiality problem becomes more and more important, various security protocols for ensuring the network communication security are widely applied, and more internet traffic is encrypted. The encryption technology ensures the communication security of internet users, ensures that information cannot be intercepted and read by a third party, and simultaneously causes the traditional security detection mechanism to face failure.
The wide application of artificial intelligence technology provides an important means of discovering malicious traffic attacks. At present, research on malicious encrypted traffic detection falls mainly into session-based, session-statistics-based and certificate-based detection. Session-based detection extracts features of network flows and applies methods such as random forest; session-statistics-based detection extracts statistical features from network flow statistics and applies methods such as eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM); certificate-based detection extracts features from certificates and builds a detection model with methods such as the Support Vector Machine (SVM).
However, the existing detection model has incomplete features, occupies a large memory space, and needs to be further improved in detection accuracy.
Disclosure of Invention
To solve at least one aspect of the above problems of the prior art, it is an object of the present invention to provide an encrypted traffic detection method, an encrypted traffic detection apparatus that performs the encrypted traffic detection method, a computer-readable storage medium, and an electronic device. The method aims to reduce the memory space occupied by the encryption flow detection model and further improve the accuracy of encryption flow detection.
To achieve the above object, as a first aspect of the present invention, there is provided an encrypted traffic detection method including:
extracting features of network sessions from a target file to serve as training samples, and constructing a training sample set, wherein data in the training samples comprise data of at least two data types;
preprocessing training samples in the training sample set to set the data types of predetermined training samples to data types which can be identified by a predetermined algorithm, and obtaining the preprocessed training sample set, wherein the predetermined training samples comprise features of network sessions whose data types, before extraction from the target file, were data types which can be identified by the predetermined algorithm, and the predetermined algorithm can identify features of at least two data types;
constructing an encrypted flow detection model by using the pre-processed training sample set and adopting the predetermined algorithm;
and detecting the object to be detected by using the constructed encrypted flow detection model.
Optionally, the data in the training samples comprises numerical data and classification data, the predetermined algorithm being capable of identifying and processing the numerical data and the classification data.
Optionally, the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
Optionally, the target file includes a static packet file and/or a real-time network traffic file.
Optionally, the characteristics of the network session comprise at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
Optionally, a TLS/SSL session of the network session includes TLS/SSL handshake and certificate information.
Optionally, constructing the encrypted traffic detection model includes:
searching the optimal hyper-parameter of the preset algorithm by utilizing the preprocessed training sample set;
and training by using the pre-processed training sample set and the preset algorithm by using the optimal hyper-parameter to obtain the encrypted flow detection model.
Optionally, the detecting the object to be detected by using the constructed encrypted traffic detection model includes:
extracting the characteristics of an object to be detected;
preprocessing the extracted features of the object to be detected, so that any extracted feature whose data type before extraction was a data type which can be identified by the predetermined algorithm is set back to that data type;
inputting the preprocessed extracted characteristics of the object to be detected into the encrypted flow detection model for identification.
As a second aspect of the present invention, there is provided an encrypted traffic detection apparatus comprising:
the feature extraction module is used for extracting features of a network session from a target file to serve as training samples and constructing a training sample set, wherein data in the training samples comprise data of at least two data types;
the characteristic data processing module is used for preprocessing the training samples in the training sample set so as to set the data types of predetermined training samples to data types which can be identified by a predetermined algorithm and obtain the preprocessed training sample set, wherein the predetermined training samples comprise features of network sessions whose data types, before extraction from the target file, were data types which can be identified by the predetermined algorithm, and the predetermined algorithm can identify features of at least two data types;
the model construction module is used for constructing an encryption flow detection model by using the pre-processed training sample set and adopting the preset algorithm;
and the encrypted flow detection module is used for detecting the object to be detected by using the constructed encrypted flow detection model.
Optionally, the data in the training samples comprises numerical data and classification data, the predetermined algorithm being capable of identifying and processing the numerical data and the classification data.
Optionally, the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
Optionally, the target file includes a static packet file and/or a real-time network traffic file.
Optionally, the characteristics of the network session comprise at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
Optionally, a TLS/SSL session of the network session includes TLS/SSL handshake and certificate information.
Optionally, the model building module comprises:
the optimal hyper-parameter selection module is used for searching the optimal hyper-parameter of the preset algorithm by utilizing the preprocessed training sample set;
and the model training module is used for training by using the preprocessed training sample set by using the optimal hyper-parameter and the preset algorithm to obtain the encrypted flow detection model.
Optionally, the feature extraction module is further configured to extract features of the object to be detected.
The characteristic data processing module is further used for preprocessing the extracted features of the object to be detected, so that any extracted feature whose data type before extraction was a data type which can be identified by the preset algorithm is set back to that data type.
And the encrypted flow detection module is also used for inputting the preprocessed extracted characteristics of the object to be detected into the encrypted flow detection model for identification.
As a third aspect of the present invention, there is provided a computer-readable storage medium for storing an executable program capable of executing the above-described encrypted traffic detection method of the present invention.
As a fourth aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the encrypted traffic detection method of the present invention described above.
According to the characteristics of malicious encrypted traffic, the encrypted traffic detection model is constructed by using an algorithm capable of directly identifying and processing numerical data and non-numerical data, and the non-numerical data is not required to be converted into the numerical data, so that the occupied storage space of the model is reduced, and the detection accuracy is improved; meanwhile, non-numerical characteristic data is extracted, perfect detection characteristics are constructed, and malicious encrypted flow can be described more comprehensively, so that the detection accuracy is further improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a method of detecting encrypted traffic;
FIG. 2 is a flow chart of the construction of an encrypted traffic detection model using the predetermined algorithm;
FIG. 3 is a flow chart of detecting an object to be detected by using the constructed encrypted traffic detection model;
FIG. 4 is a block diagram of the encrypted traffic detection apparatus.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As a first aspect of the present invention, there is provided an encrypted traffic detection method. Fig. 1 is a flow chart of a method of detecting encrypted traffic. As shown in fig. 1, the encrypted traffic detection method according to the present embodiment includes:
In step S110, features of the network session are extracted from the target file as training samples, and a training sample set is constructed, where data in the training samples includes data of at least two data types.
In step S120, the training samples in the training sample set are preprocessed to set the data types of predetermined training samples to data types that can be recognized by a predetermined algorithm, yielding a preprocessed training sample set, where the predetermined training samples include features of network sessions whose data types, before extraction from the target file, were data types that can be recognized by the predetermined algorithm, and the predetermined algorithm can recognize features of at least two data types.
In step S130, the preprocessed training sample set is used to construct an encrypted traffic detection model by using the predetermined algorithm.
In step S140, the constructed encrypted traffic detection model is used to detect the object to be detected.
The inventors found that existing models can only identify and process numerical data. As a result, either only numerical features are extracted from the encrypted traffic, in which case malicious encrypted traffic cannot be fully described and detection is not accurate enough; or non-numerical feature data is extracted and must then be converted into numerical data, which occupies a large amount of memory, slows detection, and further limits detection accuracy.
In view of the above, in order to overcome the problem that the existing models can only recognize and process numerical data, and in order to process non-numerical data, the invention adopts an algorithm capable of directly recognizing and processing data of at least two data types, so that malicious encrypted traffic can be described more comprehensively, and waste of memory resources caused by converting non-numerical characteristic data into numerical data can be avoided, thereby effectively improving the accuracy of encrypted traffic detection.
In addition, research finds that the session connection features reflect how malicious encrypted traffic behaves at the connection level; the Transport Layer Security (TLS)/Secure Sockets Layer (SSL) session features and the X509 certificate features reflect how malicious traffic behaves with respect to its encryption attributes; and the Domain Name System (DNS) features indicate whether the domain name used in a session is suspicious, for example whether it may have been produced by a Domain Generation Algorithm (DGA). These features include non-numerical features, which capture behavior specific to particular attributes of malicious encrypted traffic and play an important role in describing it comprehensively. When the features of the network session are extracted, feature data of at least two data types are extracted together as training samples, so that a relatively complete set of detection features is constructed and the accuracy of encrypted traffic detection can be further improved.
It should be noted that, in the present invention, feature data of at least two data types are extracted. Depending on how the system processes the data, the data type of extracted non-numerical feature data may be changed by the system. So that the predetermined algorithm can recognize this feature data, its data type is set again in step S120, restoring the data type it had before extraction.
Optionally, the data in the training samples comprises numerical data and classification data, the predetermined algorithm being capable of identifying and processing the numerical data and the classification data.
As described above, among the features of the network session, non-numerical features play an important role in fully describing malicious encrypted traffic, and the non-numerical feature data is mainly classified data.
Existing encrypted traffic detection models cannot directly identify and process classification (categorical) features; they can handle them only after one-hot encoding, which makes the classification data sparse. If there are too many categories, the data becomes too sparse after one-hot encoding, which greatly increases the size of the training set and wastes computing resources. To avoid this waste, the invention adopts an algorithm which can directly identify and process both classification features and numerical features. Using such an algorithm also means that numerical features and classification features can be selected together as training samples, so that malicious encrypted traffic is described comprehensively and detection accuracy is improved.
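As a non-limiting illustration of the sparsity problem described above, the following Python sketch compares a one-hot expansion of a single classification feature with a single column of category codes. The cipher-suite strings and the helper names `one_hot` and `category_codes` are hypothetical and are not taken from the embodiment; integer codes are only a rough stand-in for how a categorical-aware learner may consume such a column.

```python
# Hypothetical values of one classification feature (e.g. a cipher-suite name).
values = [
    "TLS_AES_128_GCM_SHA256",
    "TLS_RSA_WITH_RC4_128_SHA",
    "TLS_AES_128_GCM_SHA256",
    "TLS_CHACHA20_POLY1305_SHA256",
]


def one_hot(column):
    """Expand one categorical column into one 0/1 column per distinct category."""
    categories = sorted(set(column))
    rows = [[1 if value == cat else 0 for cat in categories] for value in column]
    return categories, rows


def category_codes(column):
    """Keep a single column by mapping each distinct category to an integer code."""
    categories = sorted(set(column))
    index = {cat: i for i, cat in enumerate(categories)}
    return categories, [index[value] for value in column]


categories, dense = one_hot(values)
_, codes = category_codes(values)

# The one-hot width equals the number of distinct categories, so it grows
# with the category count; the coded column always stays one column wide.
print("one-hot columns:", len(dense[0]))
print("coded column:", codes)
```

With thousands of distinct categories, the one-hot table would gain thousands of mostly zero columns per feature, whereas the coded representation would still be a single column.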
Optionally, the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
The LightGBM algorithm and the Catboost algorithm can directly identify and process the classification features, so that the encrypted traffic detection model can be constructed by using the algorithms.
The LightGBM algorithm is a Gradient Boosting Decision Tree (GBDT) algorithm that is currently widely applied to tasks such as classification, regression and the like. Its main advantages are as follows: 1. it combines gradient-based one-side sampling with exclusive feature bundling, meeting efficiency and scalability requirements on high-dimensional, massive data; 2. it uses a histogram-based algorithm to accelerate training and reduce memory consumption; 3. it adopts a leaf-wise tree growth strategy, which improves the generalization performance of the algorithm; 4. it can process classification features directly, avoiding the problems that data becomes too sparse after one-hot encoding and that computing resources are wasted.
The Catboost algorithm is a Boosting ensemble learning algorithm that mainly addresses learning from classification features and can directly process and learn string-valued classification features. Its main advantages are as follows: 1. it supports the Graphics Processing Unit (GPU), making computation more efficient; 2. it provides visualization of the training process; 3. it supports modeling in multiple languages such as Python and R.
Experiments by the inventors show that the encrypted traffic detection model constructed with the Catboost algorithm differs by about 0.05% from the model constructed with the LightGBM algorithm on accuracy, F1 value (F-measure), recall and Area Under the Curve (AUC).
Based on this gap, the LightGBM algorithm is selected to construct the encrypted traffic detection model in the present embodiment. The LightGBM algorithm can directly identify and process classification features of the "category" type. Depending on how the system processes the data, however, an extracted network session feature that was originally of the "category" type may become a character type or an "object" type. So that the LightGBM algorithm can recognize such feature data, the data type of each extracted feature that was originally of the "category" type needs to be set to "category".
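The type-resetting step described above can be sketched as follows, assuming the features are held in a pandas DataFrame (the source does not name the data-handling library, so this is an assumption). The column names `duration` and `tls_version` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical feature table: one numerical and one classification feature.
df = pd.DataFrame({
    "duration": [0.42, 13.7, 2.1],
    "tls_version": ["TLSv12", "TLSv13", "TLSv12"],
})

# After loading, string-valued columns typically carry an "object" (or string)
# dtype rather than "category", so the intended type has to be set again.
categorical_columns = ["tls_version"]  # assumed list of "category" features
for name in categorical_columns:
    df[name] = df[name].astype("category")

print(df.dtypes)
```

Numerical columns are left untouched, so the table then carries both data types side by side.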
Optionally, the target file includes a static packet file and/or a real-time network traffic file.
The inventor of the invention finds that the existing encrypted traffic detection model based on session statistics cannot detect malicious encrypted traffic in real time. In the invention, the characteristic data of the network session can be extracted from the PCAP packet, the real-time network interface or other network flow files, thereby realizing the real-time detection of the encrypted flow.
In the present embodiment, the feature data of the network session is extracted from the static PCAP packet and/or the real-time network traffic, and further, the feature data of the network session required by the present invention may be extracted using the open source software Zeek.
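As a hedged illustration of how session records might be read once Zeek has processed a PCAP file or live interface, the sketch below parses a log in Zeek's default tab-separated ASCII format, in which a `#fields` header names the columns and `-` marks an unset value. The function name `parse_zeek_tsv` and the sample records are hypothetical; the embodiment does not specify how Zeek output is consumed.

```python
def parse_zeek_tsv(lines):
    """Turn Zeek tab-separated log lines into a list of dicts keyed by field name."""
    fields = []
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Header line: "#fields<TAB>name1<TAB>name2..."
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            values = line.split("\t")
            # "-" is Zeek's default marker for an unset field.
            records.append(
                {key: (None if value == "-" else value)
                 for key, value in zip(fields, values)}
            )
    return records


# Hypothetical excerpt of a connection log.
sample = [
    "#fields\tts\tuid\tproto\tduration\n",
    "#types\ttime\tstring\tenum\tinterval\n",
    "1567400000.0\tCabc123\ttcp\t1.25\n",
    "1567400003.5\tCdef456\ttcp\t-\n",
]
sessions = parse_zeek_tsv(sample)
print(sessions)
```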
Optionally, the characteristics of the network session comprise at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
As mentioned above, the session connection features reflect how malicious encrypted traffic behaves at the connection level; the TLS/SSL session features and X509 certificate features reflect how malicious traffic behaves with respect to its encryption attributes; and the DNS features indicate whether the domain name used in the session is suspicious, for example whether it may be a DGA domain name. To describe malicious encrypted traffic fully, the features used to construct the encrypted traffic detection model can be selected according to how malicious traffic behaves on these different attributes. HTTP features could also be used, but the inventors believe HTTP will fall out of use in the future, so they are not included in this embodiment.
As an embodiment of the present invention, the feature of the network session may be selected as follows to construct the encrypted traffic detection feature:
A total of 62 session connection features, TLS/SSL session features, X509 certificate features and DNS features relevant to building a malicious encrypted traffic detection model are extracted from the network session. The extracted features include numerical features and "category" type features, specifically as follows:
session connection characteristics refer to communication session characteristics associated with encrypted traffic communications. In the present embodiment, 5 features such as "session duration" are selected, as shown in table 1.
TABLE 1
TLS/SSL session characteristics refer to TLS/SSL handshake characteristic data generated in the process of carrying out encryption communication by using TLS/SSL protocol. The present embodiment has 11 of the features, as shown in table 2.
TABLE 2
And the X509 certificate feature refers to certificate data transmitted by a server side in the process of carrying out encrypted communication by using the TLS/SSL protocol. The present embodiment has 33 of these features, as shown in table 3.
TABLE 3
The DNS feature refers to the feature contained in the DNS requested before the session starts, and the DNS feature is selected mainly in consideration of the fact that the DNS domain name used by some malicious encrypted traffic is greatly different from a common normal domain name. 13 of these features were selected in this embodiment as shown in table 4.
TABLE 4
Optionally, a TLS/SSL session of the network session includes TLS/SSL handshake and certificate information.
The present embodiment is directed to network sessions in which the TLS/SSL session is established for the first time and establishment has completed. In that case the session information contains important features such as the TLS/SSL handshake and the certificate, whereas a TLS/SSL session resumed from previous session information does not contain this information. Therefore, in order to extract effective detection features from a session, the TLS/SSL session of the network session must include the TLS/SSL handshake and certificate, that is, the TLS/SSL session must be newly established and establishment must have completed.
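The session selection rule above can be sketched as a simple filter. The record keys `established` and `resumed` are assumptions modeled on the boolean fields of Zeek's TLS log, and the sample sessions are hypothetical; the embodiment does not state how the condition is checked.

```python
def has_full_handshake(session):
    """Keep only sessions that were newly established and completed the handshake."""
    return session.get("established") is True and session.get("resumed") is False


# Hypothetical TLS/SSL session records.
sessions = [
    {"uid": "C1", "established": True, "resumed": False},   # full handshake
    {"uid": "C2", "established": True, "resumed": True},    # resumed, no certificate
    {"uid": "C3", "established": False, "resumed": False},  # handshake never completed
]

usable = [s["uid"] for s in sessions if has_full_handshake(s)]
print(usable)
```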
Optionally, in order to use the extracted features of the network session for model training to obtain the encrypted traffic detection model, constructing a training sample set further includes: labeling each training sample as 'malicious' or 'normal' according to the nature of the network session, and constructing a training sample set of pairs (x_i, y_i), where x_i represents the feature data and y_i represents the corresponding label data. In the present embodiment, malicious is represented by 1 and normal by 0; the labels may also be defined in a customized manner.
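A minimal sketch of the labeling step, using the 1/0 convention of this embodiment. The function name `build_training_set` and the short feature vectors are hypothetical.

```python
MALICIOUS, NORMAL = 1, 0  # label convention used in this embodiment


def build_training_set(malicious_sessions, normal_sessions):
    """Pair each feature vector with its label to form (features, label) samples."""
    samples = [(features, MALICIOUS) for features in malicious_sessions]
    samples += [(features, NORMAL) for features in normal_sessions]
    return samples


# Hypothetical, heavily shortened feature vectors.
training_set = build_training_set(
    malicious_sessions=[[0.4, "TLSv10"]],
    normal_sessions=[[12.0, "TLSv13"], [3.1, "TLSv12"]],
)
labels = [label for _, label in training_set]
print(labels)
```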
Optionally, as an error-proofing process, in this embodiment, the preprocessing the training samples in the training sample set may include: the feature number of the training sample is checked, and if the training sample does not meet the specified feature number (in the present embodiment, the specified feature number is 62, wherein, the session connection feature is 5, the TLS/SSL session feature is 11, the X509 certificate feature is 33, and the DNS feature is 13), the training sample is discarded as a problem sample.
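The error-proofing check can be sketched as follows. The group sizes come from this embodiment; the dictionary keys and the function name `drop_problem_samples` are hypothetical.

```python
# Feature counts per group, as specified in this embodiment.
EXPECTED_COUNTS = {"connection": 5, "tls_ssl": 11, "x509": 33, "dns": 13}
EXPECTED_TOTAL = sum(EXPECTED_COUNTS.values())  # 62


def drop_problem_samples(samples):
    """Discard any sample whose feature count differs from the specified total."""
    return [sample for sample in samples if len(sample) == EXPECTED_TOTAL]


# Hypothetical samples: one complete, one missing a feature.
samples = [[0] * 62, [0] * 61]
kept = drop_problem_samples(samples)
print(EXPECTED_TOTAL, len(kept))
```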
Optionally, fig. 2 is a flowchart for constructing an encrypted traffic detection model by using the predetermined algorithm. As shown in fig. 2, the constructing of the encrypted traffic detection model by using the predetermined algorithm includes:
in step S131, the training sample set after the preprocessing is used to find the optimal hyper-parameter of the predetermined algorithm.
In general, the hyper-parameters have an important influence on the prediction accuracy. The hyper-parameters in the LightGBM algorithm determine the accuracy of the model, the speed of building the model and whether the model is over-fitted, so the number and the variation range of the hyper-parameters need to be determined, and the optimal hyper-parameters of the model are further obtained to build the optimal encrypted traffic detection model. In this embodiment, the parameters that the LightGBM algorithm needs to optimize are shown in table 5.
TABLE 5
Parameter name | Interpretation of parameter
num_leaves | Number of leaves of each tree; determines the accuracy of the model
learning_rate | Controls the speed of iteration; determines model accuracy
max_depth | Maximum depth of a tree; determines whether the model is over-fitted
min_data_in_leaf | Minimum number of records a leaf may contain; determines whether the model is over-fitted
feature_fraction | Proportion of features randomly selected in each tree-building iteration; determines the model construction speed
bagging_fraction | Proportion of data used in each iteration; typically used to speed up training and avoid over-fitting
max_bin | Maximum number of bins into which feature values are bucketed; determines the model construction speed
bagging_freq | Frequency of bagging; determines whether the model is over-fitted
n_estimators | Number of boosting iterations; determines the accuracy of the model
Optionally, in this embodiment, all training samples in the training sample set that is preprocessed in step S120 are used to find the optimal hyper-parameter of the encrypted traffic detection model.
Optionally, in this embodiment, any one of a grid search method, a random search method, or a heuristic method is used to find the optimal hyper-parameter of the model; and when the optimal hyper-parameter is searched, an N-fold cross validation method is adopted.
The grid search method is an exhaustive search over specified parameter values: the candidate values of each parameter are combined, all possible combinations are listed to form a grid, and each combination is evaluated by cross validation to obtain the optimal hyper-parameters.
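A minimal sketch of the exhaustive enumeration, using only the standard library. The function `toy_score` is a hypothetical stand-in for a cross-validated score, and the candidate values in `grid` are assumptions for the example.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that peaks at num_leaves=63, learning_rate=0.1."""
    return -abs(params["num_leaves"] - 63) / 100 - abs(params["learning_rate"] - 0.1)


grid = {"num_leaves": [31, 63, 127], "learning_rate": [0.05, 0.1, 0.2]}
best_params, best_score = grid_search(grid, toy_score)
```

The number of evaluations is the product of the list lengths (here 3 x 3 = 9), which is why the cost grows quickly as more parameters are added to the grid.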
The random search method does not exhaust all parameter values; instead, it draws a fixed number of parameter settings from a specified distribution to find the optimal hyper-parameters.
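A minimal sketch of this sampling scheme follows. The distributions in `space`, the trial count and the `toy_score` function (a hypothetical stand-in for a cross-validated score) are assumptions for illustration.

```python
import random


def random_search(space, score_fn, n_trials, seed=0):
    """Draw n_trials settings from the given samplers and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that peaks at num_leaves=63, learning_rate=0.1."""
    return -abs(params["num_leaves"] - 63) / 100 - abs(params["learning_rate"] - 0.1)


# Each entry maps a parameter name to a function that draws one value.
space = {
    "num_leaves": lambda rng: rng.randint(15, 255),            # uniform integers
    "learning_rate": lambda rng: 10 ** rng.uniform(-2, -0.5),  # log-uniform
}
best_params, best_score = random_search(space, toy_score, n_trials=20)
```

The cost is fixed by `n_trials` regardless of how many parameters are searched, in contrast to the grid, whose size multiplies with each added parameter.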
The heuristic method usually uses an optimization algorithm, such as particle swarm optimization or differential evolution, to find the optimal hyper-parameters.
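A minimal sketch of one such heuristic, differential evolution, is given below. It is not the embodiment's implementation: the population size, mutation weight, crossover rate, search bounds and the `toy_score` objective (a hypothetical stand-in for a cross-validated score) are all assumptions.

```python
import random


def differential_evolution(score_fn, bounds, pop_size=10, generations=30,
                           weight=0.8, crossover=0.9, seed=0):
    """Maximise score_fn over a box; returns best point, best score, history."""
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [score_fn(ind) for ind in population]
    history = [max(scores)]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < crossover:
                    value = population[a][d] + weight * (population[b][d] - population[c][d])
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = population[i][d]
                trial.append(value)
            trial_score = score_fn(trial)
            if trial_score >= scores[i]:  # greedy selection keeps the better vector
                population[i], scores[i] = trial, trial_score
        history.append(max(scores))
    best = max(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best], history


def toy_score(point):
    """Hypothetical score peaking at learning_rate=0.1, log2(num_leaves)=6."""
    return -((point[0] - 0.1) ** 2) - ((point[1] - 6.0) ** 2)


BOUNDS = [(0.01, 0.3), (4.0, 8.0)]  # (learning_rate, log2 of num_leaves)
best, score, history = differential_evolution(toy_score, BOUNDS)
```

Because a trial vector replaces its parent only when it scores at least as well, the best score in `history` never decreases from one generation to the next.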
The inventors found that, in theory, grid search is the least efficient, random search the next least efficient, and the heuristic method the most efficient; in terms of implementation, grid search and random search are simpler, while the heuristic method is more complex.
The basic idea of cross validation is to partition the original data into groups, one part serving as the training set and the other as the validation set. The classifier is first trained on the training set, and the trained model is then tested on the validation set; the result serves as the performance index for evaluating the classifier. The purpose of cross validation is to obtain a reliable and stable model.
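The N-fold procedure can be sketched as follows. The helper names `k_fold_indices` and `cross_validate`, and the `fit` and `evaluate` callbacks, are hypothetical; the sketch splits contiguously, whereas a practical run would usually shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def cross_validate(samples, labels, n_folds, fit, evaluate):
    """Train on N-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(samples), n_folds):
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        scores.append(evaluate(model,
                               [samples[i] for i in valid_idx],
                               [labels[i] for i in valid_idx]))
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))
```

Averaging over the N held-out folds gives one score per hyper-parameter setting, which is the quantity a search method would compare.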
In step S132, the predetermined algorithm is trained on the preprocessed training sample set using the optimal hyper-parameters, and the encrypted traffic detection model is obtained.
In this embodiment, the optimal hyper-parameters obtained in step S131 and all the training samples in the training sample set preprocessed in step S120 are used to train with the LightGBM algorithm, yielding the detection model.
Optionally, fig. 3 is a flowchart for detecting an object to be detected by using the constructed encrypted traffic detection model. As shown in fig. 3, the detecting the object to be detected by using the constructed encrypted traffic detection model includes:
In step S141, the features of the object to be detected are extracted.
Optionally, the object to be detected may be a static PCAP packet file or a dynamic real-time network traffic file.
Optionally, in this embodiment, the extracted features of the object to be tested include 62 session connection features (as shown in table 1), TLS/SSL session features (as shown in table 2), X509 certificate features (as shown in table 3), and DNS features (as shown in table 4) of the network session to be tested.
In step S142, the extracted features of the object to be detected are preprocessed, so that any feature whose data type before extraction was one that the predetermined algorithm can recognize is set back to that data type.
Optionally, in this embodiment, the LightGBM algorithm can directly recognize and process features of the "category" type. Depending on the data processing system, however, extracted features of the object to be detected that were originally of the "category" type may become a character type or an "object" type. To enable the LightGBM algorithm to recognize these features, their data type needs to be set to "category".
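A dependency-free sketch of the idea is shown below: categorical features that arrive as plain strings are mapped back to a fixed set of categories learned from the training data. The column names `tls_version` and `duration` and both helper functions are hypothetical; with pandas, the corresponding step would typically be `astype("category")` on the affected columns.

```python
def fit_categories(train_rows, categorical_columns):
    """Record the sorted set of values seen in training for each categorical column."""
    return {col: sorted({row[col] for row in train_rows})
            for col in categorical_columns}


def encode_categories(rows, categories):
    """Replace each categorical string by its integer code; unseen values become -1."""
    lookup = {col: {value: code for code, value in enumerate(values)}
              for col, values in categories.items()}
    encoded = []
    for row in rows:
        new_row = dict(row)  # numerical columns are copied unchanged
        for col, table in lookup.items():
            new_row[col] = table.get(row[col], -1)
        encoded.append(new_row)
    return encoded


train = [
    {"tls_version": "TLSv1.2", "duration": 1.5},
    {"tls_version": "TLSv1.3", "duration": 0.2},
]
to_detect = [
    {"tls_version": "TLSv1.3", "duration": 3.0},
    {"tls_version": "SSLv3", "duration": 1.0},  # value never seen in training
]
categories = fit_categories(train, ["tls_version"])
encoded = encode_categories(to_detect, categories)
```

Reusing the categories fitted on the training set when encoding the object to be detected keeps the codes consistent between model building and detection.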
In step S143, the preprocessed features of the object to be detected are input into the encrypted traffic detection model for identification.
Optionally, in this embodiment, inputting the features of the object to be detected into the encrypted traffic detection model for identification further includes: obtaining, from the encrypted traffic detection model, the anomaly probability value p of the object to be detected, and comparing p with a set threshold ε; if p is greater than ε, the object to be detected is judged to be malicious traffic; otherwise, it is judged to be normal traffic.
A high false alarm rate produces many false positives, so that security analysts cannot obtain effective alerts and the results of the algorithm lose their significance. Therefore, the threshold ε can be set dynamically: an appropriate threshold is chosen with reference to the false alarm rate produced by the algorithm, which reduces the false alarm rate and improves the accuracy of encrypted traffic detection.
Optionally, in this embodiment, the threshold is selected during training such that the false positive rate obtained by N-fold cross validation is one in ten thousand.
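One way to pick such a threshold is sketched below, under the assumption that anomaly probabilities for known-normal validation samples are available from cross validation. The helper names and the sample scores are hypothetical; the embodiment's target false positive rate of one in ten thousand would be passed as `target_fpr=1e-4`.

```python
import math


def threshold_for_fpr(benign_scores, target_fpr):
    """Smallest observed score usable as a threshold with FPR <= target_fpr.

    A sample is flagged when p > threshold, so the false positives are the
    benign scores strictly above the threshold.
    """
    ordered = sorted(benign_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ordered))  # tolerated false positives
    if allowed >= len(ordered):
        return float("-inf")
    # At most `allowed` scores lie strictly above the (allowed+1)-th highest one.
    return ordered[allowed]


def classify(p, threshold):
    """Decision rule from the text: malicious when p exceeds the threshold."""
    return "malicious" if p > threshold else "normal"


# Hypothetical anomaly probabilities of eight known-normal validation samples.
benign = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
threshold = threshold_for_fpr(benign, target_fpr=0.25)
false_positives = sum(classify(p, threshold) == "malicious" for p in benign)
```

With very few validation samples a strict target such as 1e-4 tolerates no false positives at all, so the threshold falls back to the highest benign score observed.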
As a second aspect of the present invention, an encrypted traffic detection apparatus is provided, and fig. 4 is a block diagram of the encrypted traffic detection apparatus. As shown in fig. 4, the apparatus includes a feature extraction module 110, a feature data processing module 120, an encrypted traffic detection model building module 130, and an encrypted traffic detection module 140.
The feature extraction module 110 is configured to perform step S110; specifically, the feature extraction module 110 is configured to extract features of the network session from the target file as training samples and construct a training sample set, where data in the training samples includes data of at least two data types.
The feature data processing module 120 is configured to perform step S120; specifically, the feature data processing module 120 is configured to preprocess the training samples in the training sample set, so as to set the data type of predetermined training samples to a data type that can be recognized by a predetermined algorithm, and obtain a preprocessed training sample set, where the predetermined training samples include features of a network session, extracted from the target file, whose data type before extraction was a data type that can be recognized by the predetermined algorithm, and the predetermined algorithm can recognize features of at least two data types.
The encrypted traffic detection model building module 130 is configured to perform step S130; specifically, the encrypted traffic detection model building module 130 is configured to build the encrypted traffic detection model from the preprocessed training sample set using the predetermined algorithm.
The encrypted traffic detection module 140 is configured to perform step S140; specifically, the encrypted traffic detection module 140 is configured to detect the object to be detected by using the constructed encrypted traffic detection model.
Optionally, the data in the training samples comprises numerical data and classification data, the predetermined algorithm being capable of identifying and processing the numerical data and the classification data.
Optionally, the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
Optionally, the target file includes a static packet file and/or a real-time network traffic file.
Optionally, the characteristics of the network session comprise at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
Optionally, a TLS/SSL session of the network session includes TLS/SSL handshake and certificate information.
Optionally, the encrypted traffic detection model building module 130 includes an optimal hyper-parameter selection module 150 and a model training module 160.
The optimal hyper-parameter selection module 150 is configured to perform step S131; specifically, the optimal hyper-parameter selection module 150 is configured to find the optimal hyper-parameters of the predetermined algorithm by using the preprocessed training sample set.
The model training module 160 is configured to perform step S132; specifically, the model training module 160 is configured to train the predetermined algorithm on the preprocessed training sample set using the optimal hyper-parameters, so as to obtain the encrypted traffic detection model.
Optionally, the feature extraction module 110 is further configured to perform step S141, that is, to extract the features of the object to be detected according to the features of the network session determined during model building.
Correspondingly, the feature data processing module 120 is further configured to perform step S142, that is, to preprocess the extracted features of the object to be detected, so that any feature whose data type before extraction was one that the predetermined algorithm can recognize is set back to that data type.
Correspondingly, the encrypted traffic detection module 140 is further configured to perform step S143, that is, to input the preprocessed features of the object to be detected into the encrypted traffic detection model for identification.
The working principle and beneficial effects of the encrypted traffic detection method have been described in detail above and are not repeated here.
As a third aspect of the present invention, there is provided a computer-readable storage medium for storing an executable program capable of executing the above-described encrypted traffic detection method of the present invention.
Computer-readable storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage media, or any other medium which can be used to store the desired information and which can be accessed by a computer.
As a fourth aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the encrypted traffic detection method of the present invention described above.
It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (18)

1. An encrypted traffic detection method, characterized in that the encrypted traffic detection method comprises:
extracting features of network sessions from a target file to serve as training samples, and constructing a training sample set, wherein data in the training samples comprise data of at least two data types;
preprocessing the training samples in the training sample set so as to set the data type of predetermined training samples to a data type that can be recognized by a predetermined algorithm, and obtaining the preprocessed training sample set, wherein the predetermined training samples comprise features of network sessions, extracted from the target file, whose data type before extraction was a data type that can be recognized by the predetermined algorithm, and the predetermined algorithm can recognize features of at least two data types;
constructing an encrypted traffic detection model by using the preprocessed training sample set and adopting the predetermined algorithm;
and detecting the object to be detected by using the constructed encrypted traffic detection model.
2. The encrypted traffic detection method according to claim 1, wherein the data in the training samples includes numerical data and classification data, and the predetermined algorithm is capable of recognizing and processing the numerical data and the classification data.
3. The encrypted traffic detection method according to claim 2, wherein the predetermined algorithm includes a LightGBM algorithm or a Catboost algorithm.
4. The encrypted traffic detection method according to claim 1, wherein the target file comprises a static packet file and/or a real-time network traffic file.
5. The encrypted traffic detection method of claim 1, wherein the characteristics of the network session comprise at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
6. The encrypted traffic detection method of claim 1, wherein a TLS/SSL session of the network session contains TLS/SSL handshake and certificate information.
7. The encrypted traffic detection method according to any one of claims 1 to 6, wherein constructing the encrypted traffic detection model includes:
searching for the optimal hyper-parameters of the predetermined algorithm by using the preprocessed training sample set;
and training the predetermined algorithm on the preprocessed training sample set using the optimal hyper-parameters to obtain the encrypted traffic detection model.
8. The encrypted traffic detection method according to any one of claims 1 to 6, wherein the detecting the object to be detected by using the constructed encrypted traffic detection model includes:
extracting the characteristics of an object to be detected;
preprocessing the extracted features of the object to be detected, so that any extracted feature whose data type before extraction was a data type that can be recognized by the predetermined algorithm is set to that data type;
inputting the preprocessed features of the object to be detected into the encrypted traffic detection model for identification.
9. An encrypted traffic detection device, characterized by comprising:
a feature extraction module, used for extracting features of a network session from a target file as training samples and constructing a training sample set, wherein the data in the training samples comprise data of at least two data types;
a feature data processing module, used for preprocessing the training samples in the training sample set so as to set the data type of predetermined training samples to a data type that can be recognized by a predetermined algorithm, and obtaining the preprocessed training sample set, wherein the predetermined training samples comprise features of network sessions, extracted from the target file, whose data type before extraction was a data type that can be recognized by the predetermined algorithm, and the predetermined algorithm can recognize features of at least two data types;
a model construction module, used for constructing an encrypted traffic detection model by using the preprocessed training sample set and adopting the predetermined algorithm;
and an encrypted traffic detection module, used for detecting the object to be detected by using the constructed encrypted traffic detection model.
10. The encrypted traffic detection device of claim 9, wherein the data in the training samples includes numerical data and classification data, and the predetermined algorithm is capable of identifying and processing the numerical data and the classification data.
11. The encrypted traffic detection device of claim 10, wherein the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
12. The encrypted traffic detection device of claim 9, wherein the target file comprises a static packet file and/or a real-time network traffic file.
13. The encrypted traffic detection apparatus of claim 9, wherein the characteristics of the network session comprise at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
14. The encrypted traffic detection apparatus of claim 9, wherein a TLS/SSL session of the network session contains TLS/SSL handshake and certificate information.
15. The encrypted traffic detection device according to any one of claims 9 to 14, wherein the model construction module includes:
an optimal hyper-parameter selection module, used for searching for the optimal hyper-parameters of the predetermined algorithm by using the preprocessed training sample set;
and a model training module, used for training the predetermined algorithm on the preprocessed training sample set using the optimal hyper-parameters to obtain the encrypted traffic detection model.
16. The encrypted traffic detection apparatus according to any one of claims 9 to 14, wherein
the feature extraction module is further used for extracting the features of the object to be detected;
the feature data processing module is further used for preprocessing the extracted features of the object to be detected, so that any extracted feature whose data type before extraction was a data type that can be recognized by the predetermined algorithm is set to that data type;
and the encrypted traffic detection module is further used for inputting the preprocessed features of the object to be detected into the encrypted traffic detection model for identification.
17. A computer-readable storage medium for storing an executable program capable of executing the encrypted traffic detection method according to any one of claims 1 to 8.
18. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the encrypted traffic detection method of any one of claims 1 to 8.
CN201910827194.5A 2019-09-03 2019-09-03 Encrypted flow detection method and device, computer readable storage medium and electronic equipment Active CN110598774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910827194.5A CN110598774B (en) 2019-09-03 2019-09-03 Encrypted flow detection method and device, computer readable storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN110598774A true CN110598774A (en) 2019-12-20
CN110598774B CN110598774B (en) 2023-04-07

Family

ID=68857386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910827194.5A Active CN110598774B (en) 2019-09-03 2019-09-03 Encrypted flow detection method and device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110598774B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106685962A (en) * 2016-12-29 2017-05-17 广东睿江云计算股份有限公司 System and method for defense of reflective DDOS attack flow
CN106790019A (en) * 2016-12-14 2017-05-31 北京天融信网络安全技术有限公司 The encryption method for recognizing flux and device of feature based self study
CN107294993A (en) * 2017-07-05 2017-10-24 重庆邮电大学 A kind of WEB abnormal flow monitoring methods based on integrated study
CN109347872A (en) * 2018-11-29 2019-02-15 电子科技大学 A kind of network inbreak detection method based on fuzziness and integrated study
US20190164060A1 (en) * 2017-11-24 2019-05-30 Yandex Europe Ag Method of and server for converting a categorical feature value into a numeric representation thereof
CN110113349A (en) * 2019-05-15 2019-08-09 北京工业大学 A kind of malice encryption traffic characteristics analysis method
CN110177123A (en) * 2019-06-20 2019-08-27 电子科技大学 Botnet detection method based on DNS mapping association figure


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GUOLIN KE.ET AL: ""LightGBM:A Highly Efficient Gradient Boosting Decision Tree"", 《ACM》 *
WEIXIN_42001089: ""LightGBM源码阅读+理论分析(处理特征类别,缺省值的实现细节)"", 《CSDN》 *
王华勇等: ""基于LightGBM改进的GBDT短期负荷预测研究"", 《自动化仪表》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111277578B (en) * 2020-01-14 2022-02-22 西安电子科技大学 Encrypted flow analysis feature extraction method, system, storage medium and security device
CN111277578A (en) * 2020-01-14 2020-06-12 西安电子科技大学 Encrypted flow analysis feature extraction method, system, storage medium and security device
CN113595967A (en) * 2020-04-30 2021-11-02 深信服科技股份有限公司 Data identification method, equipment, storage medium and device
CN112165487A (en) * 2020-09-27 2021-01-01 上海万向区块链股份公司 Zeek-based distributed network security and performance detection method and system
CN112165487B (en) * 2020-09-27 2022-07-15 上海万向区块链股份公司 Zeek-based distributed network security and performance detection method and system
CN112101485A (en) * 2020-11-12 2020-12-18 北京云真信科技有限公司 Target device identification method, electronic device, and medium
CN112714079B (en) * 2020-12-14 2022-07-12 成都安思科技有限公司 Target service identification method under VPN environment
CN112714079A (en) * 2020-12-14 2021-04-27 成都安思科技有限公司 Target service identification method under VPN environment
CN113364792A (en) * 2021-06-11 2021-09-07 奇安信科技集团股份有限公司 Training method of flow detection model, flow detection method, device and equipment
CN113364792B (en) * 2021-06-11 2022-07-12 奇安信科技集团股份有限公司 Training method of flow detection model, flow detection method, device and equipment
CN113676348A (en) * 2021-08-04 2021-11-19 南京赋乐科技有限公司 Network channel cracking method, device, server and storage medium
CN113676348B (en) * 2021-08-04 2023-12-29 南京赋乐科技有限公司 Network channel cracking method, device, server and storage medium
CN113765911A (en) * 2021-09-02 2021-12-07 恒安嘉新(北京)科技股份公司 Method, device, equipment and storage medium for detecting webshell encrypted flow
CN116346452A (en) * 2023-03-17 2023-06-27 中国电子产业工程有限公司 Multi-feature fusion malicious encryption traffic identification method and device based on stacking
CN116346452B (en) * 2023-03-17 2023-12-01 中国电子产业工程有限公司 Multi-feature fusion malicious encryption traffic identification method and device based on stacking

Also Published As

Publication number Publication date
CN110598774B (en) 2023-04-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant