CN115766204B - Dynamic IP equipment identification system and method for encrypted traffic - Google Patents

Dynamic IP equipment identification system and method for encrypted traffic

Info

Publication number
CN115766204B
Authority
CN
China
Prior art keywords
data
tls
data packet
fingerprint
traffic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211420599.5A
Other languages
Chinese (zh)
Other versions
CN115766204A (en)
Inventor
朱宇坤
牛伟纳
周玉祥
张小松
赵毅卓
陈瑞东
王楷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202211420599.5A
Publication of CN115766204A
Application granted
Publication of CN115766204B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a dynamic IP device identification system and method for encrypted traffic, belonging to the technical field of network monitoring. The system comprises a traffic collection module, a user fingerprint extraction module, a flow feature extraction module and a cluster analysis module. The technique is deployed on a gateway: inbound encrypted traffic is copied by port mirroring, the traffic on the mirror port is captured and parsed, and the parsing result is used to determine whether a packet is a Client Hello packet from the handshake stage of TLS/SSL encrypted traffic. If it is, a user fingerprint is extracted from the unencrypted Client Hello packet. The information of each packet is stored in the list of its data stream; when a data stream ends normally or times out, feature information is extracted from the stream and its memory is released, and the features are then preprocessed and vectorized. The cluster analysis module clusters the feature vector of each stream and identifies the accessing device from the clustering result, thereby distinguishing different accessing devices and recognizing the same accessing device.

Description

Dynamic IP equipment identification system and method for encrypted traffic
Technical Field
A dynamic IP device identification system and method for encrypted traffic, used for identifying devices behind dynamic IP addresses and belonging to the technical field of network traffic monitoring. The object under test is router equipment that performs NAT dynamic IP address translation, so that test cases can be automatically generated.
Background
Owing to the rapid development of the global internet, the 32-bit IP address space clearly cannot meet the needs of all network devices, so dynamic IP address translation is widely used. Dynamic IP address translation is a technique for reusing public IP addresses: multiple intranet hosts can access external resources through the same IP. It is commonly adopted by large operators in China (e.g., China Unicom, China Mobile).
Although dynamic address translation effectively alleviates the shortage of IP addresses, it poses great challenges to IP-based tracing: earlier IP-based techniques can hardly track a single user whose dynamic IP keeps changing. In addition, with growing attention to user privacy and the wide adoption of TLS/SSL encryption, encrypted traffic in the network has grown explosively, and more than 90% of current internet traffic is HTTPS encrypted traffic.
Thus, user device identification faces two key challenges: dynamic IP and traffic encryption. Current approaches to identifying encrypted traffic mainly rely on packet payloads, deep packet inspection, user behavior patterns and machine learning.
Many encryption protocols negotiate keys before encrypted transmission, and the key negotiation process is often unencrypted, so useful information can be extracted from this plaintext portion. Payload-based identification methods detect a small amount of information in the unencrypted portion and then combine it with statistical methods to identify the application or service. In "Markov Chain Fingerprinting to Classify Encrypted Traffic", Korczynski et al. propose a method for identifying SSL/TLS traffic. The method uses packet headers to build a fingerprint when the SSL/TLS protocol creates a session; the fingerprint is based on a first-order homogeneous Markov chain whose states model the SSL/TLS message sequences of the server and the client.
As networks have developed, port-based traffic identification and classification no longer meets requirements, and identification and classification methods based on deep packet inspection (DPI) have emerged. Moore et al. devised a classification method that relies on the payload of complete packets. The method can be regarded as an iterative process whose aim is to obtain features very accurately and then map them to a specific application. Grouping packets into data streams allows the collected information to be processed more efficiently and provides the context needed to identify the network application correctly, so DPI operates on streams rather than on individual packets. In "A comparison of supervised machine learning algorithms for classification of communications network traffic", the first step taken by Moore et al. is to aggregate packets into streams based on their five-tuple. For a TCP (Transmission Control Protocol) data stream, additional semantics can also be used to determine the start and end times of the stream. The second step is to test the characteristics of the stream repeatedly against different criteria until a highly certain application identification is obtained. This process includes 9 different identification methods. DPI technology is a layer-based capture tool that captures multiple packets and applies pattern matching to find the application that matches its signature.
Machine learning based identification methods use statistical features of traffic and are a form of DFI. Encryption techniques generally encrypt only the payload, so this approach is less affected by encryption. In the field of encrypted traffic identification there are many machine learning based methods. In "Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity" (Knowledge Discovery and Data Mining), Anderson et al. state that the main reasons machine learning methods perform only moderately in encrypted malware traffic classification are inaccurate ground truth and the non-stationarity of network data. Machine learning based identification methods can also be used to refine classification. In "Analyzing Android Encrypted Network Traffic to Identify User Actions", Conti et al. propose a method for identifying user behavior that considers three time series: (i) one obtained from outbound packets only; (ii) another obtained by considering only the bytes transmitted by outbound packets; (iii) a third obtained by combining (in time order) the bytes transmitted by inbound and outbound packets. The "shape" of the cumulative plot obtained from the time series differs between user behaviors. The proposed classification approach attempts to learn the "shape" of the network traffic associated with a particular user behavior and identifies the behavior by classifying that "shape".
In summary, the prior art has the following technical problems:
1. The prior art uses machine learning or deep learning methods to analyze network flow features and judge traffic behavior, but cannot effectively identify user devices for encrypted traffic under dynamic IP;
2. Prior art analysis of encrypted traffic depends on the payload of the packets; when deployed in a network, such methods have high runtime overhead and low detection performance, and may even affect and interfere with the transmission of normal traffic;
3. Most of the prior art uses a single technique or method and does not comprehensively consider the influence of different user devices on network flows, which leads to low accuracy and a high false alarm rate.
Disclosure of Invention
In view of the above problems, the object of the invention is to provide a dynamic IP device identification system and method for encrypted traffic, solving the problem that the prior art uses machine learning or deep learning methods to analyze network flow features and judge traffic behavior but cannot effectively identify user devices for encrypted traffic under dynamic IP.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A dynamic IP device identification system for encrypted traffic, comprising the following modules:
Traffic collection module: captures traffic using port mirroring, parses the protocol fields of each layer of every captured packet, and determines, based on the port method, whether the packet is TLS/SSL encrypted traffic;
User fingerprint extraction module: for packets of TLS/SSL encrypted traffic, parses the Client Hello packet of the handshake stage to extract fingerprint features, concatenates all extracted fingerprint features in order into a character string, computes the hash value of the string as the user fingerprint, and stores the user fingerprint in a database. The fingerprint features comprise the TLS/SSL protocol version, the encryption algorithms supported by the user, the supported extension list, the elliptic curves and the elliptic curve formats, and they are concatenated into the string in that order, beginning with the TLS/SSL protocol version;
Flow feature extraction module: divides the packets collected by the traffic collection module into different data streams according to their five-tuple, stores them per stream together with the session id, computes the flow features of each data stream after storage, and preprocesses the resulting flow features to obtain, for each data stream, a feature vector in a consistent data format. The flow features comprise the average size, transmission rate and number of packets;
Cluster analysis module: determines whether each data stream belongs to a previously recorded user connection; if so, the clustering algorithm is optimized; otherwise, the user fingerprint of the data stream is retrieved from the database, the feature vector of the data stream is clustered together with the feature vectors of the recorded data streams under the same user fingerprint using the optimized clustering algorithm, and whether the data stream is a connection from a new user is determined from the clustering result.
Further, the traffic collection module performs the following steps:
Port mirroring and traffic capture: copy all inbound traffic to a specific port, then capture the traffic using tcpdump or Wireshark;
Packet parsing: parse each packet in the captured traffic layer by layer according to the protocol formats to obtain basic data, comprising the source IP address, source port number, destination IP address, destination port number, protocol type and packet size;
Client Hello determination: first determine, based on the protocol type, whether each packet is encrypted with the TLS/SSL protocol. That is, the payload carried by each packet is parsed according to the TLS/SSL protocol format to obtain the contents of the main TLS/SSL fields, which comprise the Content Type field and the Handshake Type field, and the packet is judged to use TLS/SSL if the parsing result conforms to the TLS/SSL protocol specification. Whether the packet is a TLS handshake packet is then determined by checking the Content Type and Handshake Type fields: the Content Type field identifies the TLS/SSL record type, and a value of 22 indicates a handshake packet; a Handshake Type value of 01 indicates a Client Hello packet.
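The Client Hello check described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name is_client_hello and the constant names are hypothetical, the input is assumed to be the raw TCP payload of one packet, and only the record header and the first handshake byte are inspected.

```python
TLS_CONTENT_TYPE_HANDSHAKE = 22  # record-layer Content Type for handshake messages
TLS_HANDSHAKE_CLIENT_HELLO = 1   # Handshake Type for Client Hello


def is_client_hello(payload: bytes) -> bool:
    """Return True if the payload starts with a TLS record carrying a Client Hello."""
    # Record header: content type (1 byte), version (2 bytes), length (2 bytes).
    # The handshake message follows, and its first byte is the Handshake Type.
    if len(payload) < 6:
        return False
    content_type = payload[0]
    major_version = payload[1]
    handshake_type = payload[5]
    # SSL 3.0 and TLS 1.0 to 1.3 all carry major version 3 in the record header.
    return (
        content_type == TLS_CONTENT_TYPE_HANDSHAKE
        and major_version == 3
        and handshake_type == TLS_HANDSHAKE_CLIENT_HELLO
    )
```

A packet that passes this check would be handed to the fingerprint extraction step; every other packet would go straight to flow statistics.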
Further, the user fingerprint extraction module performs the following steps:
Feature field extraction: for packets of TLS/SSL encrypted traffic, parse the Client Hello packet of the handshake stage according to the TLS/SSL protocol format to obtain the fingerprint features required to generate the user fingerprint. The fingerprint feature fields comprise the TLS/SSL protocol version, the list of supported encryption algorithms, the list of supported extension types, the list of supported elliptic curve algorithms and the supported elliptic curve formats;
Feature field concatenation: convert the fingerprint features into decimal numbers, then concatenate them in order into a character string;
User fingerprint generation: compute a hash of the concatenated string using a hash algorithm, and take the resulting hash value as the user fingerprint.
Further, the flow feature extraction module performs the following steps:
Stream data storage: using a hash-array data structure, divide the packets of TLS/SSL encrypted traffic into different data streams according to their five-tuple. The five-tuple fields are concatenated into a character string and hashed, and the hash value serves as the index of the data stream, corresponding to the position of the stream's data structure in memory. The five-tuple consists of the source IP address, source port number, destination IP address, destination port number and protocol type;
Data statistics: compute the feature data of each data stream, including the number of packets, the transmitted size and the transmission rate, and record the session id corresponding to each data stream in the database. The session id is the session id carried in the packet Cookies and is used to determine whether connections belong to the same user;
Preprocessing and vectorization: preprocess and vectorize the results of the data statistics to obtain feature vectors. This comprises filling in missing values of the data samples, converting string or character features into a vector form that the clustering algorithm can process, and converting data of the same feature but different scales to the same scale.
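The preprocessing and vectorization step can be illustrated as follows. This is only a sketch under stated assumptions: the source does not say how missing values are filled or how scales are unified, so mean imputation and min-max scaling are assumed here, the function names fill_missing and min_max_scale are hypothetical, and the conversion of string features is omitted.

```python
from typing import List, Optional, Sequence


def fill_missing(rows: Sequence[Sequence[Optional[float]]]) -> List[List[float]]:
    """Replace None entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        for r in filled:
            if r[c] is None:
                r[c] = mean
    return filled


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale every column to [0, 1] so that all features share one scale."""
    n_cols = len(rows[0])
    lows = [min(r[c] for r in rows) for c in range(n_cols)]
    highs = [max(r[c] for r in rows) for c in range(n_cols)]
    return [
        [
            (r[c] - lows[c]) / (highs[c] - lows[c]) if highs[c] > lows[c] else 0.0
            for c in range(n_cols)
        ]
        for r in rows
    ]


# Each row: (number of packets, average packet size, transmission rate); values are made up.
raw = [[10, 100.0, None], [20, None, 4.0], [30, 300.0, 8.0]]
vectors = min_max_scale(fill_missing(raw))
```

Scaling matters because the packet count, packet size and rate have very different numeric ranges, and a distance-based clustering step would otherwise be dominated by the largest one.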
Further, the cluster analysis module is implemented as follows:
For the results of preprocessing and vectorization, the session id of each data stream is obtained from the Nginx server, and whether the data stream belongs to a previous connection is determined by whether the session id is already recorded in the database. If so, no clustering is performed; the stream is labeled directly and used to evaluate the clustering algorithm. The evaluation method is to compute the silhouette coefficient, which expresses the quality of the clustering. For a sample point i:
Compute a(i) = the average distance from vector i to the other points in the cluster to which it belongs;
Compute b(i) = the minimum, over all clusters that do not contain i, of the average distance from vector i to all points in that cluster;
The silhouette coefficient of sample point i is then:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
As can be seen, the silhouette coefficient lies in [-1, 1]; the closer it is to 1, the better both cohesion and separation are. Averaging the silhouette coefficients of all points gives the overall silhouette coefficient of the clustering result. The parameters of the clustering algorithm and the number of iterations are adjusted so that the silhouette coefficient moves closer to 1, and the result is recorded in the database;
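The evaluation just described can be written out directly. The sketch below computes a(i), b(i) and the averaged coefficient in plain Python; Euclidean distance is an assumption (the source does not name a distance measure), the function names are hypothetical, and a point that is alone in its cluster is given a score of 0 by convention.

```python
import math
from typing import Dict, List, Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient of a labeled set of feature vectors."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        # a(i): mean distance to the other points of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the points of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


flows = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette(flows, [0, 0, 1, 1])  # labels match the two visible groups
bad = silhouette(flows, [0, 1, 0, 1])   # labels cut across the groups
```

With session ids supplying trusted labels, a score like good versus bad gives a single number to compare parameter settings against.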
Otherwise, the user fingerprint corresponding to the data stream is retrieved from the database, and the similarity between its feature vector and each feature vector recorded in the database under the same user fingerprint is computed. If all similarity values exceed a given threshold, the data stream is judged to be an access request from a new user device; otherwise, the two most similar clusters are merged, and the result of each similarity computation is stored in a similarity table, until no cluster can be merged further with any other. Each category is one cluster, that is, one user device. Initially there are 0 clusters; 1 cluster is formed when the first user device connects; when a second user device connects, whether the two flows are similar is determined; and so on.
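The incremental behavior described above (start with 0 clusters, open a new cluster when a flow is unlike every known one, otherwise join the most similar) can be sketched as below. This is a simplified illustration, not the patented algorithm: the "similarity value" is interpreted as a Euclidean distance to each cluster centroid, so exceeding the threshold means "dissimilar"; the similarity table and the repeated merging of clusters are left out; and assign_device and the threshold value are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def distance(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def assign_device(
    clusters_by_fp: Dict[str, List[List[List[float]]]],
    fingerprint: str,
    vector: List[float],
    threshold: float,
) -> int:
    """Return the cluster (device) index for a flow, opening a new cluster if needed."""
    clusters = clusters_by_fp.setdefault(fingerprint, [])
    distances = [distance(vector, centroid(c)) for c in clusters]
    # No cluster yet, or every known cluster is farther away than the threshold:
    # treat the flow as coming from a new user device.
    if not clusters or min(distances) > threshold:
        clusters.append([vector])
        return len(clusters) - 1
    nearest = distances.index(min(distances))
    clusters[nearest].append(vector)
    return nearest


state: Dict[str, List[List[List[float]]]] = {}
first = assign_device(state, "fp-a", [0.0, 0.0], threshold=1.0)
second = assign_device(state, "fp-a", [0.1, 0.0], threshold=1.0)
third = assign_device(state, "fp-a", [5.0, 5.0], threshold=1.0)
```

Keying the clusters by user fingerprint means flows are only ever compared with flows that share the same Client Hello fingerprint, which keeps each comparison set small.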
A dynamic IP device identification method for encrypted traffic, comprising the following steps:
Step 1: capture traffic using port mirroring, parse the protocol fields of each layer of every captured packet, and determine, based on the port method, whether the packet is TLS/SSL encrypted traffic;
Step 2: for packets of TLS/SSL encrypted traffic, parse the Client Hello packet of the handshake stage to extract fingerprint features, concatenate all extracted fingerprint features in order into a character string, compute the hash value of the string as the user fingerprint, and store the user fingerprint in a database. The fingerprint features comprise the TLS/SSL protocol version, the encryption algorithms supported by the user, the supported extension list, the elliptic curves and the elliptic curve formats, and they are concatenated into the string in that order, beginning with the TLS/SSL protocol version;
Step 3: divide the packets collected in step 1 into different data streams according to their five-tuple, store them per stream together with the session id, compute the flow features of each data stream after storage, and preprocess the resulting flow features to obtain, for each data stream, a feature vector in a consistent data format. The flow features comprise the average size, transmission rate and number of packets;
Step 4: determine from the session id whether each data stream belongs to a previously recorded user connection; if so, optimize the clustering algorithm; otherwise, retrieve the user fingerprint of the data stream from the database, cluster the feature vector of the data stream together with the feature vectors of the recorded data streams under the same user fingerprint using the optimized clustering algorithm, and determine from the clustering result whether the data stream is a connection from a new user.
Further, step 1 comprises the following steps:
Port mirroring and traffic capture: copy all inbound traffic to a specific port, then capture the traffic using tcpdump or Wireshark;
Packet parsing: parse each packet in the captured traffic layer by layer according to the protocol formats to obtain basic data, comprising the source IP address, source port number, destination IP address, destination port number, protocol type and packet size;
Client Hello determination: first determine, based on the protocol type, whether each packet is encrypted with the TLS/SSL protocol. That is, the payload carried by each packet is parsed according to the TLS/SSL protocol format to obtain the contents of the main TLS/SSL fields, which comprise the Content Type field and the Handshake Type field, and the packet is judged to use TLS/SSL if the parsing result conforms to the TLS/SSL protocol specification. Whether the packet is a TLS handshake packet is then determined by checking the Content Type and Handshake Type fields: the Content Type field identifies the TLS/SSL record type, and a value of 22 indicates a handshake packet; a Handshake Type value of 01 indicates a Client Hello packet.
Further, step 2 comprises the following steps:
Feature field extraction: for packets of TLS/SSL encrypted traffic, parse the Client Hello packet of the handshake stage according to the TLS/SSL protocol format to obtain the fingerprint features required to generate the user fingerprint. The fingerprint feature fields comprise the TLS/SSL protocol version, the list of supported encryption algorithms, the list of supported extension types, the list of supported elliptic curve algorithms and the supported elliptic curve formats;
Feature field concatenation: convert the fingerprint features into decimal numbers, then concatenate them in order into a character string;
User fingerprint generation: compute a hash of the concatenated string using a hash algorithm, and take the resulting hash value as the user fingerprint.
Further, step 3 comprises the following steps:
Stream data storage: using a hash-array data structure, divide the packets of TLS/SSL encrypted traffic into different data streams according to their five-tuple. The five-tuple fields are concatenated into a character string and hashed, and the hash value serves as the index of the data stream, corresponding to the position of the stream's data structure in memory. The five-tuple consists of the source IP address, source port number, destination IP address, destination port number and protocol type;
Data statistics: compute the feature data of each data stream, including the number of packets, the transmitted size and the transmission rate, and record the session id corresponding to each data stream in the database. The session id is the session id carried in the packet Cookies and is used to determine whether connections belong to the same user;
Preprocessing and vectorization: preprocess and vectorize the results of the data statistics to obtain feature vectors. This comprises filling in missing values of the data samples, converting string or character features into a vector form that the clustering algorithm can process, and converting data of the same feature but different scales to the same scale.
Further, step 4 comprises the following steps:
For the results of preprocessing and vectorization, the session id of each data stream is obtained from the Nginx server, and whether the data stream belongs to a previous connection is determined by whether the session id is already recorded in the database. If so, no clustering is performed; the stream is labeled directly and used to evaluate the clustering algorithm. The evaluation method is to compute the silhouette coefficient, which expresses the quality of the clustering. For a sample point i:
Compute a(i) = the average distance from vector i to the other points in the cluster to which it belongs;
Compute b(i) = the minimum, over all clusters that do not contain i, of the average distance from vector i to all points in that cluster;
The silhouette coefficient of sample point i is then:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
As can be seen, the silhouette coefficient lies in [-1, 1]; the closer it is to 1, the better both cohesion and separation are. Averaging the silhouette coefficients of all points gives the overall silhouette coefficient of the clustering result. The parameters of the clustering algorithm and the number of iterations are adjusted so that the silhouette coefficient moves closer to 1, and the result is recorded in the database;
Otherwise, the user fingerprint corresponding to the data stream is retrieved from the database, and the similarity between its feature vector and each feature vector recorded in the database under the same user fingerprint is computed. If all similarity values exceed a given threshold, the data stream is judged to be an access request from a new user device; otherwise, the two most similar clusters are merged, and the result of each similarity computation is stored in a similarity table, until no cluster can be merged further with any other. Each category is one cluster, that is, one user device. Initially there are 0 clusters; 1 cluster is formed when the first user device connects; when a second user device connects, whether the two flows are similar is determined; and so on.
Compared with the prior art, the invention has the following beneficial effects:
1. The technique does not analyze the contents of every packet: fingerprint features are first extracted from the Client Hello handshake packet of the encrypted traffic handshake, and flow features are then computed per data stream, which reduces time and performance overhead;
2. The fingerprinting method for encrypted traffic effectively exploits the differences between user devices: the same fingerprint is produced only when devices are very similar, and the probability that such devices are assigned the same operator IP in the same period is low;
3. The invention can use the session information recorded by the Web server: all flows sent by the same user within a period of time are identified by session id and labeled, the existing clustering model is evaluated against these labels, and its parameters are adjusted continuously to optimize the model and further improve its accuracy.
Drawings
FIG. 1 is a general architecture diagram of the present invention;
FIG. 2 is a schematic diagram of a database storage structure according to the present invention;
FIG. 3 is a diagram of a scene of an embodiment of the present invention;
FIG. 4 is a schematic diagram of the character string obtained by converting the fingerprint features into decimal numbers and concatenating them in order.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
Consider a real network scenario in which 3 different user devices of one operator access an enterprise server through the same operator gateway to obtain resources. From the enterprise gateway's point of view, the source IP of all three devices is the operator gateway IP, 223.71.41.15. The devices occupy different ports on the operator gateway, but these ports change, so the devices cannot be distinguished effectively by port alone.
The invention combines encrypted-traffic user fingerprints, session ids and clustering to identify the accessing user devices. The main process is described module by module below:
1. Traffic collection module
The traffic collection module is deployed on the enterprise gateway. Port mirroring is required: normal inbound traffic is copied to a specific port so that normal network traffic is not affected, and traffic is then captured on that port using tools such as tcpdump or Wireshark. Each captured packet then undergoes simple protocol parsing to obtain common information such as the network five-tuple and the protocol hierarchy. Because packet protocol formats are fixed, protocol parsing can be implemented with network libraries such as Python's scapy. The main task is to determine whether the packet is encrypted: if encryption is used, the application layer of the packet is in TLS/SSL protocol format, and related fields such as the TLS/SSL protocol and the TLS/SSL protocol type can assist the determination.
Once a packet is determined to be encrypted traffic, it is further determined whether it is a TLS handshake packet (Client Hello) from the key negotiation process. If so, the fingerprint extraction module first extracts the fingerprint of the user device (i.e., the user fingerprint), and the packet enters the flow feature extraction module once extraction is finished. If not, the packet goes directly to the flow feature extraction module, which stores the stream data, extracts flow features and performs data preprocessing.
2. Fingerprint extraction module
The fingerprint extraction module targets the Client Hello handshake packet in the key negotiation process of encrypted traffic. Packets in the key negotiation process carry plaintext information, whereas data sent after key negotiation completes is encrypted with the negotiated session key, so a third party cannot obtain any information from it. The Client Hello packet carries a great deal of information related to the user device, including the TLS/SSL protocol version used, the list of encryption algorithms supported by the cipher suites, the extension list, and so on; two different devices differ noticeably in these fields. The fingerprint extraction module mainly comprises the following steps:
1. Application layer parsing. For each data packet of TLS/SSL encrypted traffic, the application layer content is parsed. The application layer encapsulates the TLS/SSL protocol content, and the required field values can be obtained by parsing the TLS/SSL protocol with existing parsing methods.
2. Protocol field extraction (that is, extraction of the fingerprint feature fields). The required field values are obtained from the data packet, including the TLS/SSL protocol version, the list of encryption algorithms supported by the user, the list of supported extension types, the list of supported elliptic curve algorithms and the elliptic curve formats.
3. Protocol field splicing. The fingerprint features are converted into decimal numbers and spliced in order into a character string. Taking fig. 4 as an example, the protocol fields are spliced into 771,4866-4867-4865-49196-49200-159-52393-52392-52394-49195-49199-158-49188-49192-107-49187-49191-103-49162-49172-57-49161-49171-51-255,11-10-22-23-13-43-45-51,29-23-30-25-24,0-1-2, where commas separate the fields and hyphens separate the list items within each field.
4. Hash value calculation. Finally, a hash algorithm is applied to the spliced character string to obtain the user fingerprint c81fc162549590f0e836b538fe5bfdd7.
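Steps 3 and 4 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the field values are taken from the fig. 4 example in the text, and MD5 is assumed as the hash because the text only says "a hash algorithm" and shows a 32-character hexadecimal result.

```python
import hashlib
from typing import Sequence


def build_fingerprint_string(version: int,
                             ciphers: Sequence[int],
                             extensions: Sequence[int],
                             curves: Sequence[int],
                             point_formats: Sequence[int]) -> str:
    """Join the five fields with commas and the list items with hyphens."""
    lists = (ciphers, extensions, curves, point_formats)
    parts = [str(version)]
    parts += ["-".join(str(v) for v in items) for items in lists]
    return ",".join(parts)


def user_fingerprint(spliced: str) -> str:
    """Hash the spliced string (MD5 is an assumption, see the lead-in)."""
    return hashlib.md5(spliced.encode("ascii")).hexdigest()


# Field values from the fig. 4 example, already converted to decimal.
ciphers = [4866, 4867, 4865, 49196, 49200, 159, 52393, 52392, 52394,
           49195, 49199, 158, 49188, 49192, 107, 49187, 49191, 103,
           49162, 49172, 57, 49161, 49171, 51, 255]
extensions = [11, 10, 22, 23, 13, 43, 45, 51]
curves = [29, 23, 30, 25, 24]
point_formats = [0, 1, 2]

spliced = build_fingerprint_string(771, ciphers, extensions, curves, point_formats)
print(spliced)
print(user_fingerprint(spliced))
```

Because the order of the list items is preserved, two clients that offer the same cipher suites in a different order produce different fingerprints, which is what makes the string useful for telling devices apart.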
3. Flow characteristic extraction module
The flow characteristic extraction module copies and stores each data packet on a per-flow basis. Once a flow is detected to have ended, the module computes the flow statistics to obtain a preliminary feature list, and then preprocesses the features and converts them into a vector format accepted by the cluster analysis module.
1. Stream data storage. Stream data storage can be implemented with a hash array data structure, in which each five-tuple corresponds to a unique stream. A structure array FlowBuff[Size] is defined, where Size is the maximum number of streams and the number of stored data streams stays below this value; each array element is a structure holding the stream characteristics. A hash is then computed over each five-tuple character string SrcIP-SrcPort-DstIP-DstPort-Protocol to obtain a number, and the statistical characteristics of the stream are stored at the array position corresponding to that number.
2. Data statistics. The characteristic data of each stream is counted per data stream, including the number of data packets, the transmitted size of the data stream, the data transmission rate and so on; at the same time, the session id corresponding to the data stream is stored in the database.
3. Preprocessing and vectorization. Based on the data statistics, preprocessing may include many steps depending on the actual situation. For example, when a sample lacks a certain field value, the missing value must be filled in; common methods are to use the mean of that field over all samples or to fill in 0 directly. Character-string features then require data type conversion, typically converting a time-formatted character string to an integer. Finally, feature scaling converts data of the same feature but different scales to a common scale, for example scaling the data packet size to the interval [-1, 1].
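The three steps above can be sketched in Python as follows. The names `FlowStats`, `flow_buff`, `update_flow`, `fill_missing` and `scale_to_unit_range` are hypothetical, as is the table size; CRC32 stands in for the unspecified hash function, and linear probing is an assumed way to handle collisions, which the text does not address.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional

SIZE = 1024  # assumed maximum number of flows (the "Size" of FlowBuff)


@dataclass
class FlowStats:
    key: str
    packets: int = 0
    total_bytes: int = 0
    first_ts: float = 0.0
    last_ts: float = 0.0

    def rate(self) -> float:
        """Bytes per second over the flow lifetime (0 for a single instant)."""
        duration = self.last_ts - self.first_ts
        return self.total_bytes / duration if duration > 0 else 0.0


flow_buff: List[Optional[FlowStats]] = [None] * SIZE


def update_flow(src_ip, src_port, dst_ip, dst_port, proto, size, ts) -> FlowStats:
    """Account one packet to the flow identified by its five-tuple."""
    key = f"{src_ip}-{src_port}-{dst_ip}-{dst_port}-{proto}"
    idx = zlib.crc32(key.encode("ascii")) % SIZE
    for _ in range(SIZE):  # linear probing over at most SIZE slots
        slot = flow_buff[idx]
        if slot is None:
            slot = flow_buff[idx] = FlowStats(key, first_ts=ts)
        if slot.key == key:
            break
        idx = (idx + 1) % SIZE
    else:
        raise RuntimeError("flow table full")
    slot.packets += 1
    slot.total_bytes += size
    slot.last_ts = ts
    return slot


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Replace None with the mean of the known values (0.0 if none known)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known) if known else 0.0
    return [mean if v is None else v for v in values]


def scale_to_unit_range(values: List[float]) -> List[float]:
    """Min-max scale one feature column to the interval [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]
```

Each feature column (for example the average packet size of every flow) would be passed through `fill_missing` and then `scale_to_unit_range` before the per-flow vectors are handed to the cluster analysis module.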
4. Cluster analysis module
The cluster analysis module first obtains the session id of the data stream. The session id can be obtained from the Nginx server: because the Nginx server holds the session key, it can decrypt the encrypted traffic to obtain the session id, and this data is generally stored in a log, so the gateway does not need to decrypt the traffic again. The session id determines whether the data stream belongs to a previous connection.
1. If the data stream belongs to a previous connection, cluster analysis is not needed: the stream is labelled directly, the clustering model is evaluated, the parameters of the clustering algorithm are adjusted and the result is recorded in the database.
2. If the data stream does not belong to a previous connection, the hash value (that is, the user fingerprint) corresponding to the data stream is retrieved from the database, and the similarity between the feature vector of the data stream and the other feature vectors recorded in the database under the same user fingerprint is calculated. It is then determined whether all the similarity values exceed a given threshold (for example, 90%), that is, whether the feature vector falls in a known category. If so, it can be determined that a new user device has made an access request. Otherwise, the two most similar clusters are merged, and the process is iterated, updating the similarity table, until the requirement is met.
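The evaluation of the clustering model mentioned in step 1 is specified in the claims as the silhouette coefficient (sometimes translated as "contour coefficient"). The following is a minimal sketch of that metric, assuming Euclidean distance, a dictionary of non-empty clusters keyed by a hypothetical device label, and the common convention that a point in a singleton cluster scores 0; none of these choices is stated in the text.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette(clusters: Dict[str, List[Vector]]) -> float:
    """Mean silhouette coefficient over all points.

    Assumes at least two clusters and that every cluster is non-empty.
    """
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for label, points in clusters.items():
        for i, p in enumerate(points):
            others = [q for j, q in enumerate(points) if j != i]
            if not others:
                scores.append(0.0)  # convention for singleton clusters
                continue
            # a(i): mean distance to the other points of the same cluster.
            a = sum(euclidean(p, q) for q in others) / len(others)
            # b(i): smallest mean distance to the points of any other cluster.
            b = min(
                sum(euclidean(p, q) for q in other_points) / len(other_points)
                for other_label, other_points in clusters.items()
                if other_label != label
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

A value close to 1 indicates compact, well-separated clusters, so clustering parameters can be tuned by keeping the setting that yields the highest mean score on streams whose device is already known from the session id.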
In summary, the technology of the present invention is deployed on a gateway. Inbound encrypted traffic is copied by port mirroring, the traffic on the mirror port is captured and parsed, and the parsing result is used to determine whether a packet is a Client Hello data packet from the handshake stage of TLS/SSL encrypted traffic; if so, the user fingerprint is extracted from the unencrypted Client Hello data packet. The information of each packet is stored in the list of its data stream; when a data stream ends normally or times out, feature information is extracted from the stream and its memory is released, and the features are then preprocessed and vectorized. The cluster analysis module performs cluster analysis on the feature vector of each stream and identifies the access equipment from the clustering result, thereby distinguishing different access devices from the same access device.
The processing method proposed in this embodiment is effective and simple, and can cope with complex real-world situations; for example, most traffic is transmitted encrypted, and the method can handle such encrypted traffic. The above are merely representative examples of the many specific applications of the present invention and should not be construed as limiting its scope of protection in any way. All technical solutions formed by transformation or equivalent substitution fall within the scope of protection of the invention.

Claims (6)

1. A dynamic IP device identification system for encrypted traffic, comprising the following modules:
A flow acquisition module: capturing traffic by port mirroring, parsing the protocol fields of each layer of each captured data packet, and judging, based on the port method, whether the data packet is TLS/SSL encrypted traffic;
A user fingerprint extraction module: for data packets of TLS/SSL encrypted traffic, parsing the Client Hello data packet of the handshake stage to extract fingerprint features, splicing all the extracted fingerprint features in order into a character string, calculating the hash value of the character string as the user fingerprint, and storing the user fingerprint in a database, wherein the fingerprint features comprise the TLS/SSL protocol version, the encryption algorithms supported by the user, the supported extension list, the elliptic curves and the elliptic curve formats, and are spliced into the character string in the order of TLS/SSL protocol version, encryption algorithms supported by the user, supported extension list, elliptic curves and elliptic curve formats;
A flow characteristic extraction module: dividing the data packets acquired by the flow acquisition module into different flows according to their five-tuples, storing them per flow together with the session id, counting the flow characteristics of each data flow after storage, and preprocessing the counted flow characteristics to obtain, for each data flow, a feature vector in a consistent data format, wherein the flow characteristics comprise the average size, the transmission rate and the number of the data packets;
A cluster analysis module: judging whether each data stream belongs to a previously recorded user connection and, if so, optimizing the clustering algorithm; otherwise, retrieving the user fingerprint of the data stream from the database, performing cluster analysis, based on the optimized clustering algorithm, on the feature vector of the data stream and the feature vectors of the data streams recorded under the same user fingerprint, and judging from the cluster analysis result whether the data stream is a connection from a new user;
The flow characteristic extraction module performs the following steps:
Stream data storage: using a hash array data structure, dividing the data packets of TLS/SSL encrypted traffic into different data streams according to their five-tuples, splicing the fields of each five-tuple into a character string for hash calculation, and using the hash value as the index of the data stream, which corresponds to the position in memory of the data structure holding the content of the data stream, wherein the five-tuple consists of the source IP address, source port number, destination IP address, destination port number and protocol type;
Data statistics: counting the characteristic data of each data stream, including the number of data packets, the transmitted size and the transmission rate of the data stream, and recording the session id corresponding to each data stream in the database, wherein the session id is the session id in the Cookies of the data packets and is used to judge whether connections belong to the same user;
Preprocessing and vectorization: preprocessing and vectorizing the result of the data statistics to obtain a feature vector, which comprises filling in the missing values of the data samples after the data statistics, converting features in character-string or character format into a vector form that the clustering algorithm can process, and converting data of the same feature but different scales to the same scale;
the specific implementation steps of the cluster analysis module are as follows:
Obtaining, from the Nginx server, the session id of the data stream for the result obtained by preprocessing and vectorization, and judging whether the data stream belongs to a previous connection based on whether the session id is recorded in the database; if so, performing no clustering, directly assigning the label and evaluating the clustering algorithm, wherein the evaluation method is to calculate the silhouette coefficient, which represents the quality of the clustering; for a sample point i:
calculating a(i) = the average distance from vector i to the other points in the cluster to which it belongs;
calculating b(i) = the minimum, over all clusters that do not contain i, of the average distance from vector i to all points in that cluster;
the silhouette coefficient of sample point i is then:
s(i) = (b(i) - a(i)) / max{a(i), b(i)}
The silhouette coefficient thus takes values in [-1, 1], and the closer it is to 1, the better both the cohesion and the separation; the silhouette coefficients of all points are averaged to obtain the overall silhouette coefficient of the clustering result; the parameters of the clustering algorithm, including the number of iterations, are adjusted so that the silhouette coefficient approaches 1, and the result is recorded in the database;
Otherwise, retrieving the user fingerprint corresponding to the data stream from the database, calculating the similarity between the feature vector and each feature vector recorded in the database under the same user fingerprint, and judging whether all similarity values exceed a given threshold; if so, judging that the data stream is an access request from a new user device; otherwise, merging the two most similar clusters while saving the result of each feature vector similarity calculation in a similarity table, until no cluster can be merged further with any other cluster, wherein each category is one cluster, that is, one user device: there are 0 clusters at the start, 1 cluster is formed when the first user device accesses, whether the two flows are similar is judged when the second user device accesses, and so on.
2. A dynamic IP device identification system for encrypted traffic according to claim 1, wherein: the flow acquisition module comprises the following steps:
Port mirroring and traffic capture: copying all inbound traffic to a specific port, and then capturing the traffic with tcpdump or Wireshark;
Data packet parsing: parsing the data packets in the captured traffic layer by layer according to the protocol format to obtain basic data, wherein the basic data comprises the source IP address, source port number, destination IP address, destination port number, protocol type and data packet size;
Client Hello packet determination: first judging, based on the protocol type, whether each data packet is encrypted with the TLS/SSL protocol, that is, parsing the payload carried by each data packet according to the TLS/SSL protocol format to obtain the content of the main fields of the TLS/SSL protocol, and judging whether the data packet uses the TLS/SSL protocol by checking whether the parsing result conforms to the TLS/SSL protocol specification; then judging whether the data packet is a TLS handshake packet by checking the Content Type field and the Handshake Type field, wherein the Content Type field marks the TLS/SSL protocol type, a Content Type of 22 indicates a handshake packet, and a Handshake Type of 01 indicates a Client Hello packet, and the main fields comprise the Content Type field and the Handshake Type field.
3. A dynamic IP device identification system for encrypted traffic as recited in claim 2, wherein: the user fingerprint extraction module comprises the following steps:
Feature field extraction: for data packets of TLS/SSL encrypted traffic, parsing the Client Hello data packet of the handshake stage according to the TLS/SSL protocol format to obtain the fingerprint features required to generate the user fingerprint, wherein the fingerprint feature fields comprise the TLS/SSL protocol version, the supported encryption algorithm list, the supported extension type list, the supported elliptic curve algorithm list and the supported elliptic curve formats;
Feature field splicing: converting the fingerprint features into decimal numbers, and then splicing the decimal numbers in order into a character string;
User fingerprint generation: applying a hash algorithm to the spliced character string, and taking the resulting hash value as the user fingerprint.
4. A method for dynamic IP device identification for encrypted traffic, comprising the steps of:
Step 1: capturing traffic by port mirroring, parsing the protocol fields of each layer of each captured data packet, and judging, based on the port method, whether the data packet is TLS/SSL encrypted traffic;
Step 2: for data packets of TLS/SSL encrypted traffic, parsing the Client Hello data packet of the handshake stage to extract fingerprint features, splicing all the extracted fingerprint features in order into a character string, calculating the hash value of the character string as the user fingerprint, and storing the user fingerprint in a database, wherein the fingerprint features comprise the TLS/SSL protocol version, the encryption algorithms supported by the user, the supported extension list, the elliptic curves and the elliptic curve formats, and are spliced into the character string in the order of TLS/SSL protocol version, encryption algorithms supported by the user, supported extension list, elliptic curves and elliptic curve formats;
Step 3: dividing the data packets acquired in step 1 into different flows according to their five-tuples, storing them per flow together with the session id, counting the flow characteristics of each data flow after storage, and preprocessing the counted flow characteristics to obtain, for each data flow, a feature vector in a consistent data format, wherein the flow characteristics comprise the average size, the transmission rate and the number of the data packets;
Step 4: judging, by means of the session id, whether each data stream belongs to a previously recorded user connection and, if so, optimizing the clustering algorithm; otherwise, retrieving the user fingerprint of the data stream from the database, performing cluster analysis, based on the optimized clustering algorithm, on the feature vector of the data stream and the feature vectors of the data streams recorded under the same user fingerprint, and judging from the cluster analysis result whether the data stream is a connection from a new user;
Step 3 comprises the following steps:
Stream data storage: using a hash array data structure, dividing the data packets of TLS/SSL encrypted traffic into different data streams according to their five-tuples, splicing the fields of each five-tuple into a character string for hash calculation, and using the hash value as the index of the data stream, which corresponds to the position in memory of the data structure holding the content of the data stream, wherein the five-tuple consists of the source IP address, source port number, destination IP address, destination port number and protocol type;
Data statistics: counting the characteristic data of each data stream, including the number of data packets, the transmitted size and the transmission rate of the data stream, and recording the session id corresponding to each data stream in the database, wherein the session id is the session id in the Cookies of the data packets and is used to judge whether connections belong to the same user;
Preprocessing and vectorization: preprocessing and vectorizing the result of the data statistics to obtain a feature vector, which comprises filling in the missing values of the data samples after the data statistics, converting features in character-string or character format into a vector form that the clustering algorithm can process, and converting data of the same feature but different scales to the same scale;
the step 4 comprises the following steps:
Obtaining, from the Nginx server, the session id of the data stream for the result obtained by preprocessing and vectorization, and judging whether the data stream belongs to a previous connection based on whether the session id is recorded in the database; if so, performing no clustering, directly assigning the label and evaluating the clustering algorithm, wherein the evaluation method is to calculate the silhouette coefficient, which represents the quality of the clustering; for a sample point i:
calculating a(i) = the average distance from vector i to the other points in the cluster to which it belongs;
calculating b(i) = the minimum, over all clusters that do not contain i, of the average distance from vector i to all points in that cluster;
the silhouette coefficient of sample point i is then:
s(i) = (b(i) - a(i)) / max{a(i), b(i)}
The silhouette coefficient thus takes values in [-1, 1], and the closer it is to 1, the better both the cohesion and the separation; the silhouette coefficients of all points are averaged to obtain the overall silhouette coefficient of the clustering result; the parameters of the clustering algorithm, including the number of iterations, are adjusted so that the silhouette coefficient approaches 1, and the result is recorded in the database;
Otherwise, retrieving the user fingerprint corresponding to the data stream from the database, calculating the similarity between the feature vector and each feature vector recorded in the database under the same user fingerprint, and judging whether all similarity values exceed a given threshold; if so, judging that the data stream is an access request from a new user device; otherwise, merging the two most similar clusters while saving the result of each feature vector similarity calculation in a similarity table, until no cluster can be merged further with any other cluster, wherein each category is one cluster, that is, one user device: there are 0 clusters at the start, 1 cluster is formed when the first user device accesses, whether the two flows are similar is judged when the second user device accesses, and so on.
5. The method for dynamic IP device identification for encrypted traffic of claim 4, wherein: the step 1 comprises the following steps:
Port mirroring and traffic capture: copying all inbound traffic to a specific port, and then capturing the traffic with tcpdump or Wireshark;
Data packet parsing: parsing the data packets in the captured traffic layer by layer according to the protocol format to obtain basic data, wherein the basic data comprises the source IP address, source port number, destination IP address, destination port number, protocol type and data packet size;
Client Hello packet determination: first judging, based on the protocol type, whether each data packet is encrypted with the TLS/SSL protocol, that is, parsing the payload carried by each data packet according to the TLS/SSL protocol format to obtain the content of the main fields of the TLS/SSL protocol, and judging whether the data packet uses the TLS/SSL protocol by checking whether the parsing result conforms to the TLS/SSL protocol specification; then judging whether the data packet is a TLS handshake packet by checking the Content Type field and the Handshake Type field, wherein the Content Type field marks the TLS/SSL protocol type, a Content Type of 22 indicates a handshake packet, and a Handshake Type of 01 indicates a Client Hello packet, and the main fields comprise the Content Type field and the Handshake Type field.
6. The method for dynamic IP device identification for encrypted traffic of claim 5, wherein: the step 2 comprises the following steps:
Feature field extraction: for data packets of TLS/SSL encrypted traffic, parsing the Client Hello data packet of the handshake stage according to the TLS/SSL protocol format to obtain the fingerprint features required to generate the user fingerprint, wherein the fingerprint feature fields comprise the TLS/SSL protocol version, the supported encryption algorithm list, the supported extension type list, the supported elliptic curve algorithm list and the supported elliptic curve formats;
Feature field splicing: converting the fingerprint features into decimal numbers, and then splicing the decimal numbers in order into a character string;
User fingerprint generation: applying a hash algorithm to the spliced character string, and taking the resulting hash value as the user fingerprint.
CN202211420599.5A 2022-11-14 2022-11-14 Dynamic IP equipment identification system and method for encrypted traffic Active CN115766204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211420599.5A CN115766204B (en) 2022-11-14 2022-11-14 Dynamic IP equipment identification system and method for encrypted traffic

Publications (2)

Publication Number Publication Date
CN115766204A CN115766204A (en) 2023-03-07
CN115766204B true CN115766204B (en) 2024-04-26

Family

ID=85370337

Country Status (1)

Country Link
CN (1) CN115766204B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105938562A (en) * 2016-04-13 2016-09-14 中国科学院信息工程研究所 Automatic network application fingerprint extraction method and system
CN108600414A (en) * 2018-05-09 2018-09-28 中国平安人寿保险股份有限公司 Construction method, device, storage medium and the terminal of device-fingerprint
CN109068272A (en) * 2018-08-30 2018-12-21 北京三快在线科技有限公司 Similar users recognition methods, device, equipment and readable storage medium storing program for executing
CN109672650A (en) * 2017-10-17 2019-04-23 阿里巴巴集团控股有限公司 Websites collection system, method and data processing method
CN111277587A (en) * 2020-01-19 2020-06-12 武汉思普崚技术有限公司 Malicious encrypted traffic detection method and system based on behavior analysis
CN111277578A (en) * 2020-01-14 2020-06-12 西安电子科技大学 Encrypted flow analysis feature extraction method, system, storage medium and security device
CN112019574A (en) * 2020-10-22 2020-12-01 腾讯科技(深圳)有限公司 Abnormal network data detection method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109936512B (en) * 2017-12-15 2021-10-01 华为技术有限公司 Flow analysis method, public service flow attribution method and corresponding computer system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
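Wu Siqi; Wang Junfeng. Research on a mobile traffic identification method based on multi-dimensional features of data streams. Journal of Sichuan University (Natural Science Edition). (02), full text. *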

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant