CN114978593B - Graph matching-based encrypted traffic classification method and system for different network environments - Google Patents

Info

Publication number
CN114978593B
Authority
CN
China
Prior art keywords
matching
root
root cluster
network
cluster set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210397693.7A
Other languages
Chinese (zh)
Other versions
CN114978593A (en)
Inventor
张晓宇
李文灏
刘峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a graph matching-based encrypted traffic classification method and system for different network environments, belonging to the field of network traffic management. Using a purpose-designed encrypted traffic clustering algorithm and a graph matching-based encrypted traffic matching classification method, encrypted traffic of the same type is first aggregated separately within each network; the resulting clusters of same-type encrypted traffic from different networks are then matched, and the known labels are mapped onto the matched encrypted traffic clusters, thereby classifying the encrypted traffic under test.

Description

Encrypted traffic classification method and system based on graph matching for different network environments
Technical Field
The invention belongs to the field of network traffic management, relates to encrypted network traffic identification and classification technology, and particularly relates to a graph matching-based encrypted traffic classification method and system for different network environments.
Background
Encrypted traffic identification and classification is one of the major branches of network traffic management. It identifies and classifies the network application to which traffic belongs by analyzing collected encrypted network traffic. The technique is widely used in network security and network supervision, and is also applied in defensive devices such as intelligent Intrusion Detection Systems (IDS) to detect and filter malicious traffic. In recent years, with the growing adoption of encryption, network traffic has shifted from plaintext to ciphertext transmission. As a result, traditional network traffic detection and classification methods based on deep packet inspection are no longer suitable for encrypted traffic. Newer encrypted traffic classification techniques therefore abandon pattern matching on plaintext content and instead use the side-channel information of encrypted traffic as training features, learning the distribution of those side-channel features to identify and classify encrypted traffic.
At present, intelligent encrypted traffic identification and classification faces the following hard challenge: because network topologies are complex and uncertain, current techniques cannot guarantee stable generality. Across different network environments, unpredictable network fluctuation, delay, bandwidth, and topology readily perturb the feature distribution of encrypted traffic from the same network application, even under the same set of feature vectors. This unstable feature distribution prevents a model initialized in a single network from achieving stable identification and classification results elsewhere. The challenge has two causes. First, the training material of current techniques is the side-channel information of encrypted traffic, which is unstable across network environments, so the single distribution learned by a model cannot adapt to a perturbed side-channel feature distribution. Second, existing models are initialized in one known network environment and then deployed and tested in different network environments, where the learned distribution again fails to match the perturbed one.
Disclosure of Invention
The invention aims to provide a graph matching-based encrypted traffic classification method and system for different network environments.
To achieve this purpose, the invention adopts the following technical scheme:
A graph matching-based method for classifying encrypted traffic from different network environments comprises the following steps:
collecting encrypted traffic data in different network environments, and dividing the encrypted traffic data of each network environment into network sessions;
extracting multidimensional static features from each network session;
clustering the network sessions in multiple passes according to their multidimensional static features, to obtain a separate root cluster set for each network environment;
selecting one of the obtained root cluster sets at a time as the root cluster set under test, whose labels are unknown, and matching it against a root cluster set whose labels are known; for each of the two root cluster sets being matched, calculating the similarity between all root clusters within the set to obtain that set's similarity matrix;
traversing the two root cluster sets to obtain a set of candidate matching pairs, traversing the candidate matching pairs in that set, and calculating a coexistence value between matching pairs to obtain the matching matrix of the candidate matching pair set;
calculating, from the matching matrix, the correctness of each candidate matching pair, and screening the pairs to obtain a one-to-one mapping between the root clusters of the two sets;
mapping the label information of the known-label root cluster set one-to-one onto the root cluster set under test, so that the encrypted traffic in each root cluster under test is predicted to carry the corresponding known label, thereby achieving classification.
Further, a network traffic sniffer is used to collect the encrypted traffic data of the corresponding applications in each of the different network environments.
Further, a preset five-tuple { destination IP, destination port, source IP, source port, transport layer protocol } is used as a key value to perform network session segmentation.
Further, the multidimensional static features include a certificate feature, an address feature, a domain name feature, and a time feature of the session.
Further, the step of clustering comprises:
aggregating network sessions with identical certificate features, taken from the encrypted handshake of each session, to form an original root cluster set;
aggregating, on the basis of the original root cluster set, sessions with the same destination network address according to the address features of the network sessions, to obtain a supplemented root cluster set;
aggregating, on the basis of the supplemented root cluster set, sessions with similar domain name features according to the domain name features of the network sessions, further expanding the root cluster set;
aggregating, according to the time features, the remaining unaggregated network sessions into the root cluster with the most similar time features.
Further, if N network applications generate encrypted traffic data in M network environments, aggregation yields M root cluster sets, each containing N root clusters, for a total of N × M root clusters.
Further, the correctness of each candidate matching pair in the matching pair set is calculated by a spectral matching algorithm, which proceeds as follows: the matching matrix of the matching pair set is taken as input and its principal eigenvector is computed; the index of each entry of the principal eigenvector corresponds to the position of a matching pair in the matching pair set.
Further, the one-to-one matching pairs between the two root cluster sets are obtained by screening with an accept-reject algorithm, which proceeds as follows: the indices of the principal eigenvector are sorted by their corresponding values; starting from the index with the largest value, the corresponding matching pair is accepted, forming a unique mapping from one root cluster in one set to one root cluster in the other, and all other matching pairs involving either of these two root clusters are rejected; this repeats until every root cluster has a unique match.
A graph matching-based encrypted traffic classification system for different network environments comprises a memory and a processor, wherein a computer program is stored in the memory and the processor implements the steps of the above method when executing the program.
A computer-readable storage medium stores a computer program which, when executed by a processor, carries out the steps of the above method.
The method can use traffic data collected in a single network as the initial sample and stably identify and classify network application traffic collected in different network environments. Compared with traditional clustering algorithms, the proposed clustering method aggregates encrypted traffic more effectively. The method does not require large computing resources and achieves efficient, stable identification and classification of encrypted traffic with a non-learning framework. It thereby addresses the problem that a model initialized in a single network cannot be applied to identifying and classifying traffic of the same network applications in other networks.
Drawings
Fig. 1 is a schematic diagram of classification of encrypted traffic of different network environments based on graph matching according to an embodiment of the present invention.
Fig. 2 is a flowchart of the encrypted traffic clustering algorithm within the same network environment according to an embodiment of the present invention.
Fig. 3 is a flowchart of the encrypted traffic matching classification algorithm across different network environments according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
This embodiment provides a graph matching-based method for classifying encrypted traffic in different network environments. As shown in Fig. 1, it is suited to accurately and stably identifying and classifying encrypted traffic generated by different network applications, even when that traffic is generated in different network environments. The method consists of an encrypted traffic clustering algorithm applied within one network and an encrypted traffic matching classification algorithm applied across networks; Fig. 2 is a flowchart of the former and Fig. 3 is a flowchart of the latter.
The embodiment first aggregates, within each network environment, the encrypted traffic belonging to the same application, forming one encrypted traffic root cluster set per network environment. It then uses the designed graph matching classification method to match the root cluster set from a network with known labels against the unlabeled root cluster set from the network environment under test. Root clusters belonging to the same network application are matched one-to-one, and each known-label cluster passes its label to the matched cluster under test, thereby identifying and classifying the encrypted traffic under test.
The method specifically comprises the following steps:
collecting encrypted traffic data under different network environments: respectively collecting encrypted traffic data of corresponding applications under different network environments by using a network traffic sniffer;
1) Encrypted traffic clustering algorithm within the same network environment:
First, the encrypted traffic of one network environment is partitioned into network sessions, using a predefined five-tuple as the key.
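The segmentation step above can be illustrated with a minimal sketch. The `Packet` record, its field names, and the function `split_sessions` are hypothetical and introduced only for illustration. The sketch also assumes that the two directions of a connection are keyed separately, since the text does not state whether the five-tuple is direction-normalized.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical packet record; only header fields are needed here.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    timestamp: float


def split_sessions(packets):
    """Group packets into sessions keyed by the five-tuple
    {destination IP, destination port, source IP, source port, protocol}."""
    sessions = defaultdict(list)
    for pkt in packets:
        key = (pkt.dst_ip, pkt.dst_port, pkt.src_ip, pkt.src_port, pkt.proto)
        sessions[key].append(pkt)
    # Keep packets of each session in arrival order.
    for pkts in sessions.values():
        pkts.sort(key=lambda p: p.timestamp)
    return dict(sessions)
```

Each resulting dictionary entry is one network session, from which the static features described next would be extracted.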
Then, for each network session, multidimensional static features are extracted as the session's characterization vector. The static features include the certificate feature, address feature, domain name feature, and time feature of the session. The certificate feature is the plaintext certificate information extracted from the standard encrypted handshake stage of the session. The address feature is the address information of the session, including the destination IP address. The domain name feature is the domain name information embedded in semi-encrypted traffic sessions and the plaintext domain name information embedded in the encrypted handshake stage of encrypted traffic sessions. The time feature is derived from the arrival time of the first traffic packet of the encrypted traffic session.
Finally, the encrypted traffic within the same network environment is clustered. The clustering algorithm has four steps:
(1) Based on the certificate features of the encrypted handshake, network sessions with identical certificate information are aggregated to form a root cluster set. Certificate feature aggregation is formalized as follows:
[Two equation images not reproduced in this text]
where ζ_n is a single independent encrypted traffic session, f_cert is the certificate feature extraction function, and RC_cert is the root cluster set obtained with the certificate feature as the aggregation key; i and j denote two network sessions being aggregated, and n denotes the number of network sessions. The encrypted traffic sessions ζ_n of the same network environment are traversed, and f_cert is applied to extract each session's certificate feature. Sessions with identical certificate features are aggregated into the same root cluster, yielding the root cluster set RC_cert keyed by certificate feature. RC_cert is then merged into the root cluster set RCs of that network environment.
(2) Based on the address features of the network sessions, sessions with the same destination network address are aggregated on top of the original root cluster set, expanding it. Address feature aggregation is formalized as follows:
[Two equation images not reproduced in this text]
RCs = RC_cert ∪ RC_ip
where f_ip is the address feature extraction function. The remaining unaggregated encrypted traffic sessions are traversed, and sessions that share an address feature with an existing root cluster are aggregated into the corresponding certificate-keyed root cluster in RC_cert. Sessions that share no address feature with any existing root cluster are aggregated with one another by address feature, forming root clusters RC_ip keyed by address feature. RC_ip is then merged into the existing root cluster set RCs to form the supplemented root cluster set.
(3) Based on the domain name features of the network sessions, sessions with similar domain name features are aggregated on top of the supplemented root cluster set, expanding it further. The domain name feature similarity between two root clusters is formalized as follows:
[Equation image not reproduced in this text]
where α is the domain name feature (a list of domain names) of one root cluster in the root cluster set, β is the domain name feature (a list of domain names) of another root cluster in the same set, and Simhash is a function that computes the similarity between two domain names. The similarity between every pair of root clusters in the same root cluster set is computed, and the most similar root clusters are merged repeatedly until the number of root clusters in the set equals the number of network applications collected.
(4) Based on the time features, each remaining unaggregated network session is aggregated into the root cluster whose time features are most similar to its own.
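Steps (1), (2), and (4) above are key-based aggregations and can be sketched together as follows. This is only an illustrative sketch under several assumptions: each session is a dictionary with the hypothetical keys `cert` (certificate fingerprint or `None`), `dst_ip`, and `start` (arrival time of the first packet). The text also does not specify which sessions remain unaggregated after step (2) or how time similarity is measured. Here, sessions with no usable address are treated as leftovers, and the nearest first-packet arrival time is used as the time similarity.

```python
from collections import defaultdict


def aggregate_by_cert_and_address(sessions):
    """Steps (1) and (2): group by certificate, then by destination address."""
    root_clusters = defaultdict(list)
    ip_owner = {}   # destination IP -> key of the root cluster that holds it
    pending = []    # sessions without a certificate feature

    # Step (1): sessions sharing a certificate form one root cluster.
    for s in sessions:
        if s["cert"] is None:
            pending.append(s)
            continue
        key = ("cert", s["cert"])
        root_clusters[key].append(s)
        if s["dst_ip"] is not None:
            ip_owner.setdefault(s["dst_ip"], key)

    # Step (2): attach pending sessions to a cluster with the same address,
    # or start a new address-keyed root cluster for an unseen address.
    leftovers = []
    for s in pending:
        ip = s["dst_ip"]
        if ip is None:
            leftovers.append(s)  # assumption: nothing to aggregate on yet
            continue
        key = ip_owner.setdefault(ip, ("ip", ip))
        root_clusters[key].append(s)
    return dict(root_clusters), leftovers


def assign_by_time(root_clusters, leftovers):
    """Step (4): put each leftover session into the root cluster containing
    the session whose start time is closest (assumed similarity measure)."""
    result = {k: list(v) for k, v in root_clusters.items()}
    for s in leftovers:
        nearest = min(
            root_clusters,
            key=lambda k: min(abs(s["start"] - m["start"]) for m in root_clusters[k]),
        )
        result[nearest].append(s)
    return result
```

Keys of the form `("cert", ...)` correspond to RC_cert and keys of the form `("ip", ...)` to RC_ip, so the returned dictionary plays the role of RCs.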
Through the encrypted traffic clustering algorithm, the encrypted traffic data generated by N network applications in M network environments form M root cluster sets, each containing N root clusters.
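The domain-name merging of step (3), which reduces each root cluster set to N root clusters, might be sketched as follows. The patent's similarity formula is given only as an image, so this sketch makes assumptions throughout: a 64-bit SimHash over character trigrams, a per-domain similarity of one minus the normalized Hamming distance, and a cluster-level similarity equal to the mean over all domain pairs. All function names are hypothetical.

```python
import hashlib
from itertools import combinations

BITS = 64


def simhash(text):
    """64-bit SimHash fingerprint over the character trigrams of `text`."""
    grams = [text[i:i + 3] for i in range(max(1, len(text) - 2))]
    weights = [0] * BITS
    for g in grams:
        h = int.from_bytes(hashlib.md5(g.encode()).digest()[:8], "big")
        for b in range(BITS):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(BITS) if weights[b] > 0)


def domain_similarity(a, b):
    """Similarity in [0, 1]: one minus the normalized Hamming distance."""
    hamming = bin(simhash(a) ^ simhash(b)).count("1")
    return 1.0 - hamming / BITS


def cluster_similarity(alpha, beta):
    """Mean pairwise similarity of two domain lists (assumed aggregation)."""
    if not alpha or not beta:
        return 0.0
    total = sum(domain_similarity(a, b) for a in alpha for b in beta)
    return total / (len(alpha) * len(beta))


def merge_until(domain_lists, n_apps):
    """Repeatedly merge the most similar pair until n_apps clusters remain."""
    clusters = [list(d) for d in domain_lists]
    while len(clusters) > n_apps:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters
```

The stopping condition mirrors the text: merging ends when the number of root clusters equals the number of collected network applications, which is assumed to be at least one.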
2) Encrypted traffic matching classification algorithm across different network environments:
First, for the M root cluster sets (each containing N root clusters) that the encrypted traffic clustering algorithm obtains from the traffic of N network applications in M network environments, the N root clusters of each pair of root cluster sets are matched one-to-one. Of the two root cluster sets being matched, one belongs to the training set with known labels and the other to the test set with unknown labels. For each root cluster set, the similarity between its root clusters is calculated to obtain the set's similarity matrix. The similarity between two root clusters is formalized as follows:
[Equation image not reproduced in this text]
For the root cluster set of one network environment, the root clusters are traversed and the domain name feature similarity of each pair of root clusters is computed; this value is the similarity between the two root clusters.
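The construction of a per-set similarity matrix can be sketched as below. The function `similarity_matrix` accepts any pairwise similarity function. The `jaccard` function is only a stand-in so the example runs on its own and is not the Simhash-based measure of the text. Leaving the diagonal at zero is likewise an assumption.

```python
def similarity_matrix(root_clusters, sim):
    """Build the symmetric N x N matrix of pairwise root cluster similarities."""
    n = len(root_clusters)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            g[i][j] = g[j][i] = sim(root_clusters[i], root_clusters[j])
    return g


def jaccard(a, b):
    """Stand-in similarity between two domain lists (not the patent's measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For a root cluster set with 10 root clusters, this yields a 10 × 10 matrix, one for each of the two sets being matched.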
Then, the two root cluster sets to be matched are traversed to obtain the candidate matching pair set. The candidate matching pairs in this set are traversed and the coexistence value between matching pairs is calculated, giving the matching matrix of the candidate matching pair set. The matching matrix computation is formalized as follows:
[Two equation images not reproduced in this text]
where the entry taken from the similarity matrix of root cluster set a is the similarity of the two root clusters corresponding to matching pair α; the entry taken from the similarity matrix of root cluster set b is the similarity of the two root clusters corresponding to matching pair β; and θ_{a,b} is the noise tolerance of the similarity matrices of root cluster sets a and b. Traversing the candidate matching pair set and calculating the coexistence value between every two matching pairs yields the matching matrix of the matching pair set.
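A possible shape of the matching matrix computation is sketched below. Because the coexistence formula is available only as an image, the scoring rule here is an assumption in the style of standard spectral graph matching. A candidate pair maps root cluster i of set a to root cluster i' of set b. Two pairs (i, i') and (j, j') coexist well when the similarity of i and j in set a is close to the similarity of i' and j' in set b. The score is 1 − difference/θ when the difference is below the tolerance θ, and 0 otherwise. The names `candidate_pairs` and `matching_matrix` are hypothetical.

```python
from itertools import product


def candidate_pairs(n_a, n_b):
    """All candidate matches (i, i') between the two root cluster sets."""
    return list(product(range(n_a), range(n_b)))


def matching_matrix(ga, gb, theta):
    """Coexistence values between candidate pairs (assumed scoring rule)."""
    pairs = candidate_pairs(len(ga), len(gb))
    m = [[0.0] * len(pairs) for _ in pairs]
    for p, (i, ip) in enumerate(pairs):
        for q, (j, jp) in enumerate(pairs):
            if i == j or ip == jp:
                continue  # identical or conflicting pairs get no coexistence
            diff = abs(ga[i][j] - gb[ip][jp])
            if diff < theta:
                m[p][q] = 1.0 - diff / theta
    return pairs, m
```

With two 10 × 10 similarity matrices this produces 100 candidate pairs and a 100 × 100 matching matrix, which agrees with the sizes stated in the worked example.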
Then, from the matching matrix of the matching pair set, the correctness of each candidate matching pair is calculated by a spectral matching algorithm, and an accept-reject algorithm forms a one-to-one matching between the two root cluster sets. The spectral matching algorithm proceeds as follows: the matching matrix of the matching pair set is taken as input and its principal eigenvector is computed; the index of each entry of the principal eigenvector corresponds to the position of a matching pair in the matching pair set. The accept-reject algorithm proceeds as follows: the indices of the principal eigenvector are sorted by their corresponding values; starting from the index with the largest value, the corresponding matching pair is accepted, forming a unique mapping from one root cluster in root cluster set a to one root cluster in root cluster set b, and all other matching pairs involving either of these two root clusters are rejected. The algorithm stops when every root cluster has a unique match.
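The two procedures just described can be sketched in plain Python as follows. This is an illustrative sketch rather than the patented implementation. The principal eigenvector is approximated by power iteration, which the text does not prescribe. The iteration is applied to M + I, which has the same eigenvectors as M but avoids oscillation, and the matching matrix is assumed to be symmetric, non-negative, and non-empty. Function names are hypothetical.

```python
def principal_eigenvector(m, iters=200):
    """Approximate the principal eigenvector of a symmetric, non-negative
    matrix by power iteration on M + I."""
    n = len(m)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[r] + sum(m[r][c] * x[c] for c in range(n)) for r in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x


def accept_reject(pairs, x):
    """Greedy one-to-one selection: accept pairs in decreasing order of their
    eigenvector value, rejecting any pair that reuses a matched cluster."""
    order = sorted(range(len(pairs)), key=lambda k: x[k], reverse=True)
    used_a, used_b, mapping = set(), set(), {}
    for k in order:
        i, ip = pairs[k]
        if i in used_a or ip in used_b:
            continue  # conflicts with an already accepted pair
        mapping[i] = ip
        used_a.add(i)
        used_b.add(ip)
    return mapping
```

The returned dictionary maps the index of a root cluster in set a to the index of its matched root cluster in set b.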
Finally, the label information is mapped one-to-one from the known-label root cluster set to the unlabeled root cluster set under test, thereby matching and classifying the encrypted traffic of the network under test. Specifically, of the two root cluster sets now in one-to-one correspondence, one comes from a known network, and the network application label of each of its root clusters is known; the other comes from the network under test, and the labels of its root clusters are unknown. Through the unique one-to-one mapping, the label of each known root cluster is transferred to its matched unlabeled root cluster, and every encrypted traffic session aggregated in that root cluster is predicted to carry the transferred label.
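The label transfer described above amounts to a dictionary lookup, as the following minimal sketch shows. The data layout is an assumption: `mapping` holds the accepted one-to-one matches by root cluster index, `known_labels` lists the application label of each known root cluster, and `unknown_clusters` lists the session identifiers in each root cluster under test.

```python
def map_labels(mapping, known_labels, unknown_clusters):
    """Predict a label for every session in the root clusters under test."""
    predictions = {}
    for a_idx, b_idx in mapping.items():
        label = known_labels[a_idx]
        for session_id in unknown_clusters[b_idx]:
            predictions[session_id] = label
    return predictions
```

Every session inherits the label of the known root cluster matched to its own root cluster, so classification accuracy depends on both clustering purity and matching correctness.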
An example is listed below:
1. Aggregating encrypted traffic within the same network using the encrypted traffic clustering algorithm
The CrossNet2021 data set was collected by the inventors. It contains encrypted traffic data collected in two network environments; in each network environment, encrypted traffic generated by the same set of network applications was collected, and this traffic is generated by 10 common network applications:
1) First, for the encrypted traffic data of the 10 network applications collected in one of the network environments, network session segmentation is performed using the five-tuple {destination IP, destination port, source IP, source port, transport layer protocol} as the key, dividing the collected set of raw packets into individual network sessions;
2) The multidimensional static features of each network session obtained in step 1) are extracted, comprising the certificate feature, address feature, domain name feature, and time feature;
3) Network sessions with identical certificate features are aggregated into the same root cluster, giving a root cluster set keyed by certificate feature. This root cluster set is then merged into the root cluster set of the network environment;
4) Network sessions with identical address features are aggregated next, expanding the root cluster set obtained in step 3). The remaining unaggregated encrypted traffic sessions are traversed, and sessions that share an address feature with an existing root cluster are aggregated into the corresponding certificate-keyed root cluster. Sessions that share no address feature with any root cluster are aggregated with one another by address feature, forming root clusters keyed by address feature, which are then added to the existing root cluster set to form the supplemented root cluster set;
5) Root clusters with highly similar domain name features are aggregated next, merging and compressing the root cluster set obtained in step 4). The similarity between every pair of root clusters in the same root cluster set is computed, and the most similar root clusters are merged repeatedly until the number of root clusters in the set equals the number of network applications collected. Because CrossNet2021 contains the network traffic of 10 applications, the root cluster set obtained by aggregation now contains 10 root clusters;
6) Finally, a root cluster set containing 10 root clusters is obtained for the network traffic data collected in each network environment.
Table 1 compares the encrypted traffic clustering algorithm provided by the invention with other methods.
TABLE 1. Comparison of cluster purity in the two network environments of the CrossNet2021 data set
Data set         Method of the invention   BIRCH   DBSCAN   K-Means   Mean-Shift
CrossNet2021_A   0.998                     0.456   0.407    0.462     0.789
CrossNet2021_B   0.984                     0.546   0.356    0.517     0.889
Note: the metric used in Table 1 is the in-class purity score (IPS).
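The text does not define how the in-class purity score is computed. As a point of reference only, the conventional cluster purity measure can be sketched as follows: each cluster contributes the count of its most frequent ground-truth label, and the sum is divided by the total number of sessions. Whether IPS coincides with this definition is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def cluster_purity(clusters):
    """Conventional purity; `clusters` is a list of lists of true labels,
    one inner list per cluster."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return majority / total
```

A purity of 1.0 means every cluster contains sessions of a single application, so the values in Table 1 close to 1 indicate nearly homogeneous root clusters under this reading.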
2. Identifying and classifying encrypted traffic in different network environments using graph matching based encrypted traffic matching classification algorithm
Using two subdata sets of CrossNet2021 as example samples, with the data in CrossNet2021_ a as the training set, whose labels are known; the data in CrossNet2021_ B is used as a test set, whose labels require class prediction.
1) Aggregate the traffic data in CrossNet2021_A and CrossNet2021_B separately with the proposed encrypted traffic clustering algorithm to obtain two root cluster sets, RCa and RCb, each containing 10 root clusters;
2) For the RCa and RCb obtained in step 1), calculate the similarity between the root clusters within each set to obtain two similarity matrices, Ga and Gb, each of size 10 × 10;
3) Traverse the root clusters in RCa and RCb to obtain a candidate matching pair set of size 100;
4) Traverse the candidate matching pairs in the candidate matching pair set and calculate the coexistence value between every two matching pairs to obtain a matching matrix M of size 100 × 100;
5) For the matching matrix M obtained in step 4), calculate the correctness of each candidate matching pair with a spectral matching algorithm, then form a one-to-one matching between the two root cluster sets with the "accept-reject" algorithm. First, take the matching matrix M as input and calculate its principal eigenvector x; the index i of each entry of x corresponds to the position of a candidate matching pair in the matching pair set. Then sort the indices i by the magnitude of the corresponding entries and accept matching pairs starting from the largest entry; each accepted pair forms a unique mapping from one root cluster in RCa to one root cluster in RCb, and all other matching pairs involving either of those two root clusters are rejected. The algorithm stops when every root cluster has a unique match.
6) Step 5) yields two root cluster sets with a one-to-one mapping between them. RCa comes from the known network, so the network application labels of its root clusters are known; RCb comes from the network under test, so the labels of its root clusters are unknown. The label of each root cluster in RCa is transferred through the one-to-one mapping to its matched root cluster in RCb, and the encrypted traffic sessions aggregated in that root cluster are predicted to carry the transferred label.
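Steps 2) to 6) can be sketched in Python as follows. This is a minimal sketch rather than the patented implementation: the text gives no formula for the coexistence value, so a Gaussian kernel on the difference between the two intra-set similarities is assumed, along with its `sigma` parameter, a unit diagonal, a zero score for conflicting pairs, and power iteration for the principal eigenvector. All function names are hypothetical.

```python
import math


def matching_matrix(Ga, Gb, sigma=0.2):
    """Build the candidate matching pairs and their coexistence matrix M."""
    pairs = [(i, j) for i in range(len(Ga)) for j in range(len(Gb))]
    M = [[0.0] * len(pairs) for _ in pairs]
    for p, (i, j) in enumerate(pairs):
        for q, (k, l) in enumerate(pairs):
            if p == q:
                M[p][q] = 1.0  # assumed self-consistency of a pair
            elif i != k and j != l:
                # Assumed coexistence value: pairs (i->j) and (k->l) agree when
                # the similarity i~k in RCa is close to the similarity j~l in RCb.
                diff = Ga[i][k] - Gb[j][l]
                M[p][q] = math.exp(-(diff * diff) / (sigma * sigma))
            # Pairs sharing a root cluster conflict and keep the score 0.
    return pairs, M


def principal_eigenvector(M, iters=200):
    """Approximate the principal eigenvector of M by power iteration."""
    n = len(M)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        y = [sum(row[c] * x[c] for c in range(n)) for row in M]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


def accept_reject(pairs, x):
    """Greedily accept the highest-scoring pairs and reject conflicting ones."""
    order = sorted(range(len(pairs)), key=lambda p: x[p], reverse=True)
    used_a, used_b, mapping = set(), set(), {}
    for p in order:
        i, j = pairs[p]
        if i in used_a or j in used_b:
            continue  # rejected: one of its root clusters is already matched
        mapping[j] = i  # root cluster j of RCb is matched to root cluster i of RCa
        used_a.add(i)
        used_b.add(j)
    return mapping


def transfer_labels(mapping, labels_a):
    """Copy the known label of each RCa root cluster to its RCb match."""
    return {j: labels_a[i] for j, i in mapping.items()}
```

With two 10 × 10 similarity matrices, `matching_matrix` returns 100 candidate pairs and a 100 × 100 matrix, matching the sizes given in steps 3) and 4).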
The results of the method of the invention are compared with those of other methods. All methods are initialized and trained on CrossNet2021_A and tested across network environments on CrossNet2021_B. Table 2 lists the accuracy of cross-network-environment encrypted traffic classification.
Table 2. Classification accuracy of encrypted traffic across network environments
Data set | The method of the invention | FlowPrint | XGBoost | RBRN | FC-Net
CrossNet2021 | 95.87 | 36.01 | 49.51 | 61.69 | 72.74
The results in Tables 1 and 2 demonstrate the advantages of the proposed encrypted traffic clustering method and of the graph-matching-based matching and classification method, respectively.
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person skilled in the art may modify the technical solution of the present invention or substitute equivalents without departing from its spirit and scope, and the scope of protection of the present invention shall be determined by the claims.

Claims (10)

1. A method for classifying encrypted traffic of different network environments based on graph matching is characterized by comprising the following steps:
collecting encrypted flow data under different network environments, and dividing the encrypted flow data under the same network environment by taking a network session as a unit;
extracting the multidimensional static characteristics of each divided network session;
clustering the network session for multiple times according to the multidimensional static characteristics of the network session to obtain different root cluster sets corresponding to different network environments;
selecting one of the obtained root cluster sets each time as the root cluster set to be detected, whose labels are unknown, and matching it against a root cluster set whose labels are known; for the two root cluster sets to be matched, calculating the similarity between all root clusters within each root cluster set to obtain a similarity matrix of each root cluster set;
traversing the two root cluster sets to obtain a candidate matching pair set, traversing the candidate matching pairs in the matching pair set, and calculating a coexistence value between the matching pairs to obtain a matching matrix of the candidate matching pair set;
according to the matching matrix of the candidate matching pair set, calculating the correctness of each candidate matching pair in the matching pair set, and screening to obtain matching pairs which are mapped one to one in the two root cluster sets;
and mapping the label information in the root cluster set of the known label to the root cluster set to be detected of the unknown label in a one-to-one manner, predicting the encrypted flow in the root cluster set to be detected of the unknown label to be the known label, and realizing classification.
2. The method of claim 1, wherein the encrypted traffic data of the corresponding application is collected separately under different network environments using a network traffic sniffer.
3. The method of claim 1, wherein the network session segmentation is performed using a preset five-tuple { destination IP, destination port, source IP, source port, transport layer protocol } as a key.
4. The method of claim 1, wherein the multidimensional static features comprise a certificate feature, an address feature, a domain name feature, and a time feature of the session.
5. The method of claim 4, wherein the step of clustering comprises:
according to the certificate characteristics of the encryption handshake of the network sessions, the network sessions with the same certificate information characteristics are aggregated together to form an original root cluster set;
according to the address characteristics of network sessions, on the basis of the original root cluster set, the sessions with the same destination network address are aggregated together, and the original root cluster set is supplemented to obtain a supplemented root cluster set;
according to the domain name characteristics of the network session, on the basis of the supplemented root cluster set, the sessions with similar domain name characteristics are aggregated together, and the root cluster set is further expanded;
according to the time characteristics, the rest network sessions which are not aggregated are aggregated into the root cluster with the most similar time characteristics.
6. The method of claim 1, wherein if N network applications generate encrypted traffic data in M network environments, aggregation yields M root cluster sets, each root cluster set comprising N root clusters, for a total of N × M root clusters.
7. The method of claim 1, wherein the correctness of each candidate matching pair in the matching pair set is calculated by a spectral matching algorithm, which proceeds as follows: the matching matrix of the matching pair set is taken as input and its principal eigenvector is calculated, wherein the index of each value of the principal eigenvector corresponds to the position of a matching pair in the matching pair set.
8. The method of claim 7, wherein the one-to-one matching pairs of the two root cluster sets are obtained by an accept-reject algorithm comprising the steps of: sorting the indices of the principal eigenvector by their corresponding values; accepting matching pairs starting from the index with the largest value, each accepted pair forming a unique mapping from a root cluster in one root cluster set to a root cluster in the other root cluster set; and at the same time rejecting all other matching pairs involving those two root clusters, until every root cluster has a unique match.
9. A system for classifying encrypted traffic in different network environments based on graph matching, comprising a memory on which a computer program is stored and a processor which, when executing the program, implements the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
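The five-tuple session segmentation recited in claim 3 can be illustrated with a short Python sketch. It is a minimal sketch, assuming packets have already been parsed into records; the `Packet` type, its field names, and the function `split_sessions` are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical parsed packet record; the field names are illustrative."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    payload_len: int


def split_sessions(packets):
    """Group packets into sessions keyed by the five-tuple of claim 3."""
    sessions = defaultdict(list)
    for pkt in packets:
        # Key order follows the claim: destination IP, destination port,
        # source IP, source port, transport layer protocol.
        key = (pkt.dst_ip, pkt.dst_port, pkt.src_ip, pkt.src_port, pkt.proto)
        sessions[key].append(pkt)
    return dict(sessions)
```

As written, the two directions of a connection fall under different keys; sorting the two endpoints before building the key would be one way to merge them, although the text does not say whether that is done.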
CN202210397693.7A 2022-04-15 2022-04-15 Graph matching-based encrypted traffic classification method and system for different network environments Active CN114978593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210397693.7A CN114978593B (en) 2022-04-15 2022-04-15 Graph matching-based encrypted traffic classification method and system for different network environments

Publications (2)

Publication Number Publication Date
CN114978593A CN114978593A (en) 2022-08-30
CN114978593B (en) 2023-03-10

Family

ID=82977946

Country Status (1)

Country Link
CN (1) CN114978593B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant