CN115766204A - Dynamic IP equipment identification system and method for encrypted flow - Google Patents

Publication number
CN115766204A
Authority
CN
China
Prior art keywords
data
tls
user
fingerprint
traffic
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211420599.5A
Other languages
Chinese (zh)
Other versions
CN115766204B (en)
Inventor
朱宇坤
牛伟纳
周玉祥
张小松
赵毅卓
陈瑞东
王楷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Application filed by University of Electronic Science and Technology of China
Priority to CN202211420599.5A
Publication of CN115766204A
Application granted
Publication of CN115766204B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a dynamic IP device identification system and method for encrypted traffic, belonging to the technical field of network monitoring. The technique is deployed on a gateway: inbound encrypted traffic is copied by port mirroring, the traffic on the mirror port is captured and parsed, and the parsing result is used to judge whether a packet is a Client Hello packet from the handshake phase of TLS/SSL encrypted traffic; if it is, a user fingerprint is extracted from the unencrypted Client Hello packet. The information of each packet is stored in the list of its data stream; when a data stream ends normally or times out, its feature information is extracted, its storage space is released, and the features are preprocessed and vectorized. The cluster analysis module performs cluster analysis on the feature vector of each stream and identifies the access device from the clustering result, thereby distinguishing different access devices and recognizing the same access device.

Description

Dynamic IP equipment identification system and method for encrypted flow
Technical Field
A dynamic IP device identification system and method for encrypted traffic, used for dynamic IP device identification and belonging to the technical field of network traffic monitoring. The object under test is a router device with NAT dynamic IP address translation, and test cases can be generated automatically.
Background
With the rapid development of the global Internet, the 32-bit IP address space can no longer meet the needs of all network devices, so dynamic IP address translation is widely used. Dynamic IP address translation reuses public IP addresses: multiple intranet hosts can access external resources through the same public IP. The technique is generally adopted by the major operators in China (such as China Unicom and China Mobile).
Although dynamic address translation effectively relieves the shortage of IP addresses, it poses a major challenge to IP-based source tracing: traditional IP-based techniques can hardly track a user whose dynamic IP keeps changing. In addition, with the growing emphasis on user privacy and the wide adoption of TLS/SSL encryption, encrypted traffic in the network has grown explosively, and more than 90% of current Internet traffic is HTTPS-encrypted.
User device identification therefore faces two key challenges: dynamic IP and traffic encryption. At present, encrypted traffic identification mainly relies on packet payloads, deep packet inspection, user behavior patterns, and machine learning.
Many encryption protocols negotiate keys before encrypting the transmission, and the key negotiation process is often unencrypted, so useful information can be extracted from this plaintext portion. Payload-based identification methods extract a small amount of information from the unencrypted portion and then combine it with statistical methods to identify the application or service. In "Markov Chain Fingerprinting to Classify Encrypted Traffic", Korczynski et al. propose a method for identifying SSL/TLS traffic. The method uses packet headers to create a fingerprint when the SSL/TLS protocol establishes a session; the fingerprint is based on a first-order homogeneous Markov chain whose states model the sequence of SSL/TLS messages exchanged between server and client.
With the development of networks, port-based traffic classification can no longer meet demand, and classification methods based on deep packet inspection (DPI) have emerged. Moore et al. designed a classification method that relies on the complete packet payload. The method can be viewed as an iterative process whose goal is to obtain features very accurately; grouping packets into the data flows of the corresponding application makes it possible to process the collected information more efficiently and to obtain the context needed to identify the network application correctly, so that DPI runs on flows rather than on individual packets. In the document "A Comparison of Supervised Machine Learning Algorithms for Classification of Communications Network Traffic", Moore et al. take as the first step the aggregation of packets into flows based on the five-tuple. For a TCP (Transmission Control Protocol) flow, additional semantics can also be used to determine the start and end times of the flow. The second step repeatedly tests the characteristics of the flow against different criteria until a well-defined application identification is obtained; this process includes 9 different identification methods. DPI captures a number of data packets, performs pattern matching on them, and finds the application whose signature matches the packets.
Machine-learning-based identification uses statistical features of the flow and is a deep flow inspection (DFI) method. Encryption generally encrypts only the payload, so this approach is less affected by encryption. There are many machine-learning-based methods in the field of encrypted traffic identification. In the document "Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity" (Knowledge Discovery and Data Mining), Anderson et al. point out that the main reasons machine learning methods often underperform in encrypted malware traffic classification are inaccurate ground truth and the non-stationarity of network data. Machine-learning-based identification can also be used to refine the classification. In the document "Analyzing Android Encrypted Network Traffic to Identify User Actions", Conti et al. propose a method for identifying user behavior that considers three time series: (i) one time series obtained only from outbound data packets; (ii) another time series obtained considering only the bytes transmitted by outbound data packets; (iii) a third time series obtained by combining, in chronological order, the bytes transmitted by inbound and outbound packets. The "shape" of the cumulative graph produced by the time series differs between user behaviors. The proposed classification method attempts to learn the "shape" of the network traffic associated with a particular user behavior and identifies user behavior by classifying that "shape".
In summary, the prior art has the following technical problems:
1. In the prior art, machine learning or deep learning methods analyze network flow features to judge traffic behavior, but cannot effectively identify user devices from encrypted traffic under dynamic IP;
2. In the prior art, the analysis of encrypted traffic depends on the payload of the data packets; deployed in a network, such methods have high runtime overhead and low detection performance, and may even affect the transmission of normal traffic and interfere with normal services;
3. In the prior art, mostly a single technique or method is used, and the influence of different user devices on network flows is not considered comprehensively, which leads to low accuracy and a high false alarm rate.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a dynamic IP device identification system and method for encrypted traffic, solving the problem that the prior art, which uses machine learning or deep learning to analyze network flow features and judge traffic behavior, cannot effectively identify user devices from encrypted traffic under dynamic IP.
In order to achieve the purpose, the invention adopts the following technical scheme:
a dynamic IP device identification system for encrypted traffic, comprising the following modules:
a traffic acquisition module: captures traffic by port mirroring, parses the protocol fields of each layer of every captured data packet, and then judges by a port-based method whether the data packet is TLS/SSL encrypted traffic;
a user fingerprint extraction module: for TLS/SSL encrypted traffic, parses the Client Hello data packet of the handshake phase to extract fingerprint features, splices all extracted fingerprint features in order into a character string, computes the hash value of the string as the user fingerprint, and stores the user fingerprint in a database, wherein the fingerprint features comprise the TLS/SSL protocol version, the encryption algorithms supported by the user, the supported extension list, the elliptic curves, and the elliptic curve formats, and the string is spliced in that order;
a stream feature extraction module: divides the data packets collected by the traffic acquisition module into different streams according to the five-tuple, stores them stream by stream together with the session id, then computes the stream features of each data stream and preprocesses them to obtain, for each data stream, a feature vector in a consistent data format, wherein the stream features comprise the average packet size, the transmission rate, and the number of packets;
a cluster analysis module: judges whether each data stream belongs to a previously recorded user connection; if so, the data stream is used to optimize the clustering algorithm; otherwise, the user fingerprint of the data stream is retrieved from the database, and the feature vector of the data stream is clustered, using the optimized clustering algorithm, together with the feature vectors of the recorded data streams under the same user fingerprint, so as to judge from the clustering result whether the data stream is a connection from a new user.
Further, the traffic acquisition module comprises the following steps:
port mirroring and traffic capture: a copy of all inbound traffic is sent to a specific port, and the traffic is then captured with tcpdump or wireshark;
data packet parsing: the data packets in the captured traffic are parsed layer by layer according to the protocol formats to obtain basic data, comprising the source IP address, source port number, destination IP address, destination port number, protocol type, and packet size;
Client Hello packet determination: first, whether each data packet is encrypted with the TLS/SSL protocol is judged based on the protocol type, that is, the payload carried by each data packet is parsed according to the TLS/SSL protocol format to obtain the content of the main TLS/SSL fields, and the packet is judged to use TLS/SSL if the parsing result conforms to the TLS/SSL protocol specification; then whether the packet is a TLS handshake packet is judged by checking the Content Type field and the Handshake Type field, wherein the Content Type field marks the TLS/SSL record type, with the value 22 indicating a handshake packet, and the Handshake Type field marks the handshake message type, with the value 01 indicating a Client Hello packet sent by a client; the main fields comprise the Content Type field and the Handshake Type field.
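The Client Hello check described above can be sketched directly on the raw TCP payload. This is a minimal illustration, not the patented implementation: the function name `is_client_hello` is hypothetical, and the check on the record version byte is an added assumption used as a cheap stand-in for "the parsing result conforms to the TLS/SSL specification".

```python
def is_client_hello(payload: bytes) -> bool:
    """Return True if a TCP payload starts with a TLS handshake record
    whose first handshake message is a Client Hello."""
    # TLS record header: Content Type (1 byte), version (2 bytes), length (2 bytes).
    # The handshake header follows at offset 5; its first byte is the Handshake Type.
    if len(payload) < 6:
        return False
    content_type = payload[0]
    major_version = payload[1]
    handshake_type = payload[5]
    # Content Type 22 marks a handshake record; Handshake Type 1 marks a Client Hello.
    # Assumption: SSL 3.0 and TLS 1.x records carry major version 3.
    return content_type == 22 and major_version == 3 and handshake_type == 1
```

Packets for which the function returns True would go to fingerprint extraction; all other TLS/SSL packets would go straight to stream feature extraction.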
Further, the user fingerprint extraction module comprises the following steps:
feature field extraction: the Client Hello data packet of the TLS/SSL encrypted traffic is parsed according to the TLS/SSL protocol format to obtain the fingerprint features required to generate the user fingerprint, wherein the fingerprint feature fields comprise the TLS/SSL protocol version, the list of supported encryption algorithms, the list of supported extension types, the list of supported elliptic curve algorithms, and the supported elliptic curve formats;
feature field splicing: the fingerprint features are converted into decimal numbers and then spliced in order into a character string;
user fingerprint generation: a hash algorithm is applied to the spliced character string, and the resulting hash value is used as the user fingerprint.
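The splicing and hashing steps above can be sketched as follows. The comma and hyphen separators follow the worked example the description gives for fig. 4; the function names are hypothetical, and SHA-256 is an assumption, since the text only says "a hash algorithm".

```python
import hashlib

def splice_fields(version, ciphers, extensions, curves, curve_formats):
    """Splice the fingerprint features, as decimal numbers, into one string.

    Fields are separated by commas and list entries by hyphens, in the order
    version, cipher list, extension list, elliptic curves, curve formats.
    """
    lists = (ciphers, extensions, curves, curve_formats)
    parts = [str(version)] + ["-".join(str(v) for v in lst) for lst in lists]
    return ",".join(parts)

def user_fingerprint(spliced: str) -> str:
    """Hash the spliced string; the hex digest serves as the user fingerprint."""
    # Assumption: SHA-256. Any stable hash function would fit the description.
    return hashlib.sha256(spliced.encode("ascii")).hexdigest()
```

For example, `splice_fields(771, [4866, 4867, 4865], [11, 10], [29, 23], [0, 1, 2])` yields `771,4866-4867-4865,11-10,29-23,0-1-2`, and two devices that advertise the same lists in the same order produce the same fingerprint.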
Further, the stream feature extraction module comprises the steps of:
stream data storage: based on a hash-array data structure, the TLS/SSL encrypted traffic data packets are divided into different data streams according to the five-tuple; the five-tuple fields are spliced into a character string and hashed, and the hash value is used as the index of the data stream, giving the position in memory of the data structure holding the corresponding data stream, wherein the five-tuple consists of the source IP address, source port number, destination IP address, destination port number, and protocol type;
data statistics: the feature data of each data stream are computed, including the number of packets, the transmitted size of the data stream, and the data transmission rate, and the session id corresponding to each data stream is recorded in the database, wherein the session id is the session id in the Cookie of the data packet and is used to judge whether data packets come from a connection of the same user;
preprocessing and vectorization: the results of the data statistics are preprocessed and vectorized to obtain feature vectors, which comprises filling in missing values of the data samples, converting features in string or character format into a vector form that the clustering algorithm can process, and converting data of the same feature expressed in different scales to the same scale.
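A minimal sketch of the stream storage and statistics steps, under stated assumptions: the class `FlowTable` and its method names are hypothetical, a Python dictionary stands in for the hash array, SHA-256 is an arbitrary choice of hash, and the rate is left as a missing value when a stream has no measurable duration.

```python
import hashlib

class FlowTable:
    """Groups packets into data streams indexed by a hash of the five-tuple."""

    def __init__(self):
        self.flows = {}  # hash index -> list of (timestamp_seconds, size_bytes)

    @staticmethod
    def index(src_ip, src_port, dst_ip, dst_port, proto):
        # Splice the five-tuple fields into one string and hash it.
        spliced = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}"
        return hashlib.sha256(spliced.encode("utf-8")).hexdigest()

    def add(self, five_tuple, timestamp, size):
        """Store one packet record in its stream and return the stream index."""
        key = self.index(*five_tuple)
        self.flows.setdefault(key, []).append((timestamp, size))
        return key

    def statistics(self, key):
        """Packet count, transmitted bytes, average packet size, and rate."""
        records = self.flows[key]
        count = len(records)
        total = sum(size for _, size in records)
        times = [t for t, _ in records]
        duration = max(times) - min(times)
        # A single-packet stream has no duration, so the rate is a missing value.
        rate = total / duration if duration > 0 else None
        return {"packets": count, "bytes": total,
                "avg_size": total / count, "rate": rate}
```

Because the source port is part of the five-tuple, two connections from the same operator gateway IP land in different streams even though their source IP is identical.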
Further, the cluster analysis module comprises the following specific implementation steps:
the session id of each data stream is obtained in the Nginx server from the results of preprocessing and vectorization, and whether the data stream belongs to a previous connection is judged by whether the session id is already recorded in the database; if so, no clustering is performed, the stream is labeled directly and used to evaluate the clustering algorithm, the evaluation method being to compute the silhouette coefficient, which expresses the quality of the clustering; for a sample point i:
calculate a(i) = average distance from vector i to the other points in the cluster to which it belongs;
calculate b(i) = minimum, over all clusters that do not contain i, of the average distance from vector i to all points in that cluster;
then the silhouette coefficient of sample point i is:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
the value of the silhouette coefficient lies in [-1, 1], and the closer it is to 1, the better the cohesion and separation; the silhouette coefficients of all points are averaged to obtain the overall silhouette coefficient of the clustering result; the parameters of the clustering algorithm are adjusted so that its silhouette coefficient moves closer to 1, thereby completing the adjustment of the number of iterations, and the results are recorded in the database;
otherwise, the user fingerprint corresponding to the data stream is retrieved from the database, and similarity is computed between the feature vector of the data stream and each feature vector recorded in the database under the same user fingerprint; whether all similarity values exceed a given threshold is judged; if so, the data stream is judged to be an access request from a new user device; otherwise, the two most similar clusters are merged, and each similarity result is stored in a similarity table, until no cluster can be merged further with any other cluster, wherein each category is a cluster, that is, one user device; with no data there are 0 clusters, the traffic of the first user device to connect forms 1 cluster, the traffic of the second user device to connect is judged for similarity against it, and so on.
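The evaluation metric used by the cluster analysis module, computed from a(i) and b(i) as defined above, can be sketched as follows. Euclidean distance is an assumption (the text does not name a distance measure), the function name `silhouette` is hypothetical, and scoring singleton clusters as 0 is a common convention rather than something the text states.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient of a clustering (Euclidean distance assumed)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)  # convention for singletons or a single cluster
            continue
        # a(i): mean distance to the other points of the same cluster
        # (the point itself contributes distance 0 to the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(sum(math.dist(point, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Streams labeled through their session id provide the known groupings against which this score can be tracked while the clustering parameters are adjusted.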
A dynamic IP device identification method for encrypted traffic comprises the following steps:
step 1: capture traffic by port mirroring, parse the protocol fields of each layer of every captured data packet, and judge by a port-based method whether the data packet is TLS/SSL encrypted traffic;
step 2: for TLS/SSL encrypted traffic, parse the Client Hello data packet of the handshake phase to extract fingerprint features, splice all extracted fingerprint features in order into a character string, compute the hash value of the string as the user fingerprint, and store the user fingerprint in a database, wherein the fingerprint features comprise the TLS/SSL protocol version, the encryption algorithms supported by the user, the supported extension list, the elliptic curves, and the elliptic curve formats, and the string is spliced in that order;
step 3: divide the data packets collected in step 1 into different streams according to the five-tuple, store them stream by stream together with the session id, then compute the stream features of each data stream and preprocess them to obtain, for each data stream, a feature vector in a consistent data format, wherein the stream features comprise the average packet size, the transmission rate, and the number of packets;
step 4: judge through the session id whether each data stream belongs to a previously recorded user connection; if so, use the data stream to optimize the clustering algorithm; otherwise, retrieve the user fingerprint of the data stream from the database and cluster the feature vector of the data stream, using the optimized clustering algorithm, together with the feature vectors of the recorded data streams under the same user fingerprint, so as to judge from the clustering result whether the data stream is a connection from a new user.
Further, the step 1 comprises the following steps:
port mirroring and traffic capture: a copy of all inbound traffic is sent to a specific port, and the traffic is then captured with tcpdump or wireshark;
data packet parsing: the data packets in the captured traffic are parsed layer by layer according to the protocol formats to obtain basic data, comprising the source IP address, source port number, destination IP address, destination port number, protocol type, and packet size;
Client Hello packet determination: first, whether each data packet is encrypted with the TLS/SSL protocol is judged based on the protocol type, that is, the payload carried by each data packet is parsed according to the TLS/SSL protocol format to obtain the content of the main TLS/SSL fields, and the packet is judged to use TLS/SSL if the parsing result conforms to the TLS/SSL protocol specification; then whether the packet is a TLS handshake packet is judged by checking the Content Type field and the Handshake Type field, wherein the Content Type field marks the TLS/SSL record type, with the value 22 indicating a handshake packet, and the Handshake Type field marks the handshake message type, with the value 01 indicating a Client Hello packet sent by a client; the main fields comprise the Content Type field and the Handshake Type field.
Further, the step 2 comprises the following steps:
feature field extraction: the Client Hello data packet of the TLS/SSL encrypted traffic is parsed according to the TLS/SSL protocol format to obtain the fingerprint features required to generate the user fingerprint, wherein the fingerprint feature fields comprise the TLS/SSL protocol version, the list of supported encryption algorithms, the list of supported extension types, the list of supported elliptic curve algorithms, and the supported elliptic curve formats;
feature field splicing: the fingerprint features are converted into decimal numbers and then spliced in order into a character string;
user fingerprint generation: a hash algorithm is applied to the spliced character string, and the resulting hash value is used as the user fingerprint.
Further, the step 3 comprises the following steps:
stream data storage: based on a hash-array data structure, the TLS/SSL encrypted traffic data packets are divided into different data streams according to the five-tuple; the five-tuple fields are spliced into a character string and hashed, and the hash value is used as the index of the data stream, giving the position in memory of the data structure holding the corresponding data stream, wherein the five-tuple consists of the source IP address, source port number, destination IP address, destination port number, and protocol type;
data statistics: the feature data of each data stream are computed, including the number of packets, the transmitted size of the data stream, and the data transmission rate, and the session id corresponding to each data stream is recorded in the database, wherein the session id is the session id in the Cookie of the data packet and is used to judge whether data packets come from a connection of the same user;
preprocessing and vectorization: the results of the data statistics are preprocessed and vectorized to obtain feature vectors, which comprises filling in missing values of the data samples, converting features in string or character format into a vector form that the clustering algorithm can process, and converting data of the same feature expressed in different scales to the same scale.
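The preprocessing named in step 3 can be sketched for numeric stream features as follows. The specific choices here are assumptions, since the text only requires that missing values be filled and scales unified: missing values (`None`) are filled with the column mean, and each feature is rescaled to [0, 1] by min-max scaling. The function name `preprocess` is hypothetical.

```python
def preprocess(rows):
    """Fill missing values and rescale each feature column to [0, 1].

    rows: list of equal-length feature lists, one per data stream,
          with None marking a missing value.
    """
    scaled_columns = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        # Assumption: fill a missing value with the mean of the observed values.
        mean = sum(present) / len(present) if present else 0.0
        filled = [mean if v is None else v for v in column]
        low, high = min(filled), max(filled)
        span = high - low
        # Assumption: min-max scaling; a constant column maps to 0.0.
        scaled_columns.append([(v - low) / span if span else 0.0 for v in filled])
    # Transpose back so that each row is again the feature vector of one stream.
    return [list(vector) for vector in zip(*scaled_columns)]
```

Rescaling matters for the later clustering because the packet count, the byte size, and the rate differ by orders of magnitude, and an unscaled distance would be dominated by the largest feature.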
Further, the step 4 comprises the following steps:
the session id of each data stream is obtained in the Nginx server from the results of preprocessing and vectorization, and whether the data stream belongs to a previous connection is judged by whether the session id is already recorded in the database; if so, no clustering is performed, the stream is labeled directly and used to evaluate the clustering algorithm, the evaluation method being to compute the silhouette coefficient, which expresses the quality of the clustering; for a sample point i:
calculate a(i) = average distance from vector i to the other points in the cluster to which it belongs;
calculate b(i) = minimum, over all clusters that do not contain i, of the average distance from vector i to all points in that cluster;
then the silhouette coefficient of sample point i is:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
the value of the silhouette coefficient lies in [-1, 1], and the closer it is to 1, the better the cohesion and separation; the silhouette coefficients of all points are averaged to obtain the overall silhouette coefficient of the clustering result; the parameters of the clustering algorithm are adjusted so that its silhouette coefficient moves closer to 1, thereby completing the adjustment of the number of iterations, and the results are recorded in the database;
otherwise, the user fingerprint corresponding to the data stream is retrieved from the database, and similarity is computed between the feature vector of the data stream and each feature vector recorded in the database under the same user fingerprint; whether all similarity values exceed a given threshold is judged; if so, the data stream is judged to be an access request from a new user device; otherwise, the two most similar clusters are merged, and each similarity result is stored in a similarity table, until no cluster can be merged further with any other cluster, wherein each category is a cluster, that is, one user device; with no data there are 0 clusters, the traffic of the first user device to connect forms 1 cluster, the traffic of the second user device to connect is judged for similarity against it, and so on.
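The device decision in step 4 can be illustrated with a simplified incremental sketch. Several assumptions are made: the "similarity value" is read as a Euclidean distance, so that exceeding the threshold against every recorded vector means "new device"; a stream that is close enough joins the cluster of its nearest recorded vector; and the function `assign_device` and its arguments are hypothetical. The full merge loop and the similarity table of the text are not reproduced.

```python
import math

def assign_device(vector, clusters, threshold):
    """Assign a stream's feature vector to a device cluster.

    clusters: list of clusters recorded under one user fingerprint,
              each cluster being a list of feature vectors (one device).
    Returns the index of the cluster the vector was placed in.
    """
    nearest, nearest_dist = None, math.inf
    for idx, members in enumerate(clusters):
        for member in members:
            d = math.dist(vector, member)
            if d < nearest_dist:
                nearest, nearest_dist = idx, d
    # All recorded vectors lie beyond the threshold (or none exist): new device.
    if nearest is None or nearest_dist > threshold:
        clusters.append([vector])
        return len(clusters) - 1
    # Otherwise the stream joins the most similar existing cluster.
    clusters[nearest].append(vector)
    return nearest
```

Starting from an empty list, the first stream creates cluster 0, and each later stream either joins an existing cluster or opens a new one, which mirrors the "0 clusters, then 1 cluster, and so on" progression in the text.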
Compared with the prior art, the invention has the beneficial effects that:
1. Instead of analyzing every data packet, the technique first extracts fingerprint features from the Client Hello handshake packet of the encrypted traffic handshake and then computes stream features per data stream, which reduces time and performance overhead;
2. The fingerprint method for encrypted traffic makes effective use of the differences between user devices: identical fingerprints are produced only when devices are extremely similar, and the probability that such devices are assigned the IP of the same operator in the same time period is low, so the method effectively reduces the false alarm rate;
3. Using the session information recorded by the Web server, the invention can determine and label, through the session id, all traffic sent by the same user within a time period, then evaluate the existing clustering model, continuously adjust its parameters, optimize the model, and further improve its accuracy.
Drawings
FIG. 1 is an overall architecture diagram of the present invention;
FIG. 2 is a diagram illustrating a database storage structure according to the present invention;
FIG. 3 is a diagram illustrating an exemplary implementation of the present invention;
FIG. 4 is a schematic diagram of the present invention for converting fingerprint features into decimal numbers and then sequentially splicing the decimal numbers into character strings.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific embodiments.
Consider the following real network scenario: three different user devices of one operator access an enterprise's server through the same operator gateway to obtain resources. For the enterprise gateway, the source IP of all three devices is the operator gateway IP, namely 223.71.41.15. The user devices occupy different operator gateway ports at the same time, but these ports change, so the devices cannot be distinguished effectively by port alone.
Therefore, the invention combines the encrypted traffic user fingerprint, the session id, and clustering to identify the accessing user device. The main process is described module by module:
1. Traffic acquisition module
The traffic acquisition module is deployed on the enterprise gateway. Port mirroring copies normal inbound traffic to a specific port so that normal network service is not affected, and traffic is then captured on that port with tools such as tcpdump and wireshark. Each captured data packet then undergoes simple protocol parsing to obtain common information such as the network five-tuple and the protocol hierarchy. Packet protocol formats are fixed, so the parsing can be implemented with a network library such as Python's scapy. Whether the packet is encrypted must then be judged: if encryption is used, the application layer of the packet is in TLS/SSL protocol format, and related fields such as the TLS/SSL protocol and the TLS/SSL protocol type can assist the judgment.
Once a data packet is determined to be encrypted traffic, it is further judged whether it is a TLS handshake packet from the key negotiation process. If so, the packet first enters the fingerprint extraction module, which extracts the fingerprint of the user device (the user fingerprint), and then enters the stream feature extraction module; if not (that is, the packet is not a Client Hello TLS handshake packet), it enters the stream feature extraction module directly for stream data storage, stream feature extraction, and data preprocessing.
2. Fingerprint extraction module
The fingerprint extraction module targets the Client Hello handshake packet in the key negotiation process of encrypted traffic. Packets in the key negotiation process carry plaintext information, whereas data sent after key negotiation completes is encrypted with the negotiated session key, so a third party cannot obtain any information from the encrypted data. The Client Hello packet carries a great deal of information related to the user equipment, including the TLS/SSL protocol version used by the user, the list of supported encryption algorithms, and the list of supported extensions, so two different devices differ noticeably. The fingerprint extraction module mainly comprises the following steps:
1. Application layer parsing. For each data packet of TLS/SSL encrypted traffic, the application layer content is parsed. The application layer encapsulates the TLS/SSL protocol content, and the required field values can be obtained through TLS/SSL protocol parsing using existing methods.
2. Protocol field extraction (feature field extraction, i.e. the fingerprint feature fields). Protocol field extraction obtains the required field values from the data packet, including the TLS/SSL protocol version, the list of encryption algorithms supported by the user, the list of supported extension types, the list of supported elliptic curve algorithms, the elliptic curve format, etc.
3. Protocol field splicing. The fingerprint features are converted into decimal numbers and then spliced in order into a character string. Taking fig. 4 as an example, splicing the protocol fields yields 771,4866-4867-4865-49196-49200-159-52393-52392-52394-49195-49199-158-49188-49192-107-49187-49191-103-49162-49172-49161-49171-51-255,11-10-22-23-13-43-45-51,29-23-30-25-24,0-1-2, where commas separate the fields and hyphens separate the list entries within each field.
4. Hash value calculation. Finally, a hash algorithm is applied to the spliced character string to obtain the user fingerprint c81fc162549590f0e836b538fe5bfdd7.
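The splicing and hashing steps can be sketched as follows. The source names only "a hash algorithm"; MD5 is an assumption here, chosen because the example fingerprint is 32 hexadecimal characters long. The function names are hypothetical.

```python
import hashlib
from typing import Sequence


def build_fingerprint_string(version: int,
                             ciphers: Sequence[int],
                             extensions: Sequence[int],
                             curves: Sequence[int],
                             curve_formats: Sequence[int]) -> str:
    """Join the five fields with commas and list entries with hyphens."""
    lists = (ciphers, extensions, curves, curve_formats)
    fields = [str(version)] + ["-".join(str(v) for v in lst) for lst in lists]
    return ",".join(fields)


def user_fingerprint(spliced: str) -> str:
    """Hash the spliced string (MD5 is an assumption, see above)."""
    return hashlib.md5(spliced.encode("ascii")).hexdigest()
```

A call such as `user_fingerprint(build_fingerprint_string(771, [4866, 4867], [11, 10], [29, 23], [0, 1, 2]))` returns a 32-character hexadecimal string that can be stored in the database as the user fingerprint.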
3. Flow feature extraction module
The stream feature extraction module copies and stores each data packet per stream. After a stream is detected to have ended, the module computes statistics over the stream's features to obtain a preliminary feature list, and the features are then preprocessed and converted into a vector format that the cluster analysis module can accept.
1. Stream data storage. Stream data storage can be implemented with a hash array data structure, in which each quintuple corresponds to a unique stream. A structure array FlowBuff[Size] is defined, where Size is the maximum number of streams (the number of stored data streams must stay below this value) and each array element is a structure holding the stream features. A hash is then computed over each quintuple character string SrcIP-SrcPort-DstIP-DstPort-Protocol to obtain a number, and the statistical features of the stream are stored at the position corresponding to that number.
2. Data statistics. The feature data of each stream is computed per data stream, including the number of data packets, the transmission size of the data stream, the data transmission rate, etc. At the same time, the session id corresponding to the data stream is stored in the database.
3. Preprocessing and vectorization. Depending on the practical situation, preprocessing of the data statistics may include many steps. For example, when a sample lacks a field value, the missing value must be filled; common methods are to use the mean of that field over all samples or to fill with 0. Next, string-typed features need data type conversion; converting a time-format string into an integer is a common case. Finally, the features are scaled so that data of the same feature in different specifications is converted to the same specification, for example by scaling the packet size to the range [-1, 1].
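The three steps above can be sketched as a small hash-indexed flow table with per-flow statistics and a scaling helper. The names `FlowStats`, `add_packet` and `scale_to_unit_range` are hypothetical. CRC32 as the hash, linear probing for collisions, and min-max scaling are assumptions, since the source does not specify them.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FlowStats:
    key: str                 # "SrcIP-SrcPort-DstIP-DstPort-Protocol"
    packet_count: int = 0
    total_bytes: int = 0
    first_ts: float = 0.0
    last_ts: float = 0.0

    def rate(self) -> float:
        """Bytes per second over the observed lifetime of the flow."""
        duration = self.last_ts - self.first_ts
        return self.total_bytes / duration if duration > 0 else 0.0


def add_packet(table: List[Optional[FlowStats]], key: str,
               size: int, ts: float) -> FlowStats:
    """Update the flow identified by key, creating it if necessary."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    start = zlib.crc32(key.encode("utf-8")) % len(table)
    for step in range(len(table)):          # linear probing (assumption)
        slot = (start + step) % len(table)
        entry = table[slot]
        if entry is None:
            entry = FlowStats(key=key, first_ts=ts)
            table[slot] = entry
        if entry.key == key:
            entry.packet_count += 1
            entry.total_bytes += size
            entry.last_ts = ts
            return entry
    raise RuntimeError("flow table is full")


def scale_to_unit_range(values: Sequence[float]) -> List[float]:
    """Min-max scale one feature column to [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```

A table is created with `flow_buff = [None] * size`, where `size` plays the role of Size in FlowBuff[Size]. When a flow ends, its `FlowStats` entry would be turned into a feature vector and the slot set back to `None`.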
4. Cluster analysis module
The cluster analysis module first obtains the session id of the data stream. The session id can be obtained from the Nginx server: because the Nginx server holds the session key, it can decrypt the encrypted traffic to obtain the session id, and this data is generally stored in a log, so the gateway does not need to decrypt the traffic again. The session id determines whether the data stream belongs to a previous connection.
1. If the data stream belongs to a previous connection, no cluster analysis is needed: the stream is labeled directly, the clustering model is evaluated, the clustering algorithm parameters are adjusted, and the results are recorded in the database.
2. If the data stream does not belong to a previous connection, the hash value (i.e. the user fingerprint) corresponding to the data stream is retrieved from the database, and similarity is calculated between the feature vector of the data stream and the other feature vectors recorded in the database under the same user fingerprint. The module judges whether all similarity values exceed a given threshold (e.g. 90%), that is, whether the feature vector falls in a known category; if so, it can be determined that a new user equipment has made an access request. Otherwise, the two most similar clusters are merged, and the process iterates, updating the similarity table, until the requirements are met.
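The iterative merging of the two most similar clusters can be sketched as threshold-based agglomerative clustering. This is one possible reading, not the patented procedure itself: cosine similarity, average linkage between clusters, and stopping when no pair of clusters reaches the threshold are all assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_similarity(c1: List[Vector], c2: List[Vector]) -> float:
    """Average pairwise similarity between two clusters (average linkage)."""
    total = sum(cosine_similarity(u, v) for u in c1 for v in c2)
    return total / (len(c1) * len(c2))


def merge_clusters(vectors: List[Vector], threshold: float = 0.9) -> List[List[Vector]]:
    """Repeatedly merge the two most similar clusters until no pair
    of clusters is at least `threshold` similar."""
    clusters: List[List[Vector]] = [[v] for v in vectors]
    while len(clusters) > 1:
        best_sim, best_i, best_j = -2.0, 0, 1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cluster_similarity(clusters[i], clusters[j])
                if sim > best_sim:
                    best_sim, best_i, best_j = sim, i, j
        if best_sim < threshold:
            break  # nothing left that is similar enough to merge
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]
    return clusters
```

Under this reading, the function would be run over the feature vectors that share one user fingerprint, and each resulting cluster would correspond to one user device.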
In summary, the technology of the invention is deployed on a gateway. Inbound encrypted traffic is copied by port mirroring, and the traffic on the mirror port is captured and parsed. Based on the parsing result, the system judges whether each packet is a Client Hello packet from the handshake phase of TLS/SSL encrypted traffic, and if so, extracts a user fingerprint from the unencrypted Client Hello packet. The information of each packet is stored in the list of its data stream; when a data stream ends normally or times out, feature information is extracted from the stream, the storage space is released, and the features are preprocessed and vectorized. The cluster analysis module performs cluster analysis on the feature vector of each stream and identifies the accessing device according to the clustering result, thereby determining whether traffic comes from different access devices or from the same one.
The processing method provided by this embodiment is effective and simple and can handle complex real-world situations; for example, it still works when most traffic is transmitted encrypted. The above are merely representative examples of the many specific applications of the invention and do not limit its scope of protection in any way. All technical solutions formed by transformation or equivalent substitution fall within the protection scope of the invention.

Claims (10)

1. A dynamic IP device identification system for encrypted traffic, comprising the following modules:
a traffic acquisition module: capturing traffic by port mirroring, parsing the protocol fields of each layer of every captured data packet, and judging whether the data packet is TLS/SSL encrypted traffic by a port-based method;
a user fingerprint extraction module: for TLS/SSL encrypted traffic data packets, parsing the Client Hello data packet of the handshake phase to extract fingerprint features, splicing all extracted fingerprint features in order into a character string, calculating a hash value of the character string as the user fingerprint, and storing the user fingerprint in a database, wherein the fingerprint features comprise the TLS/SSL protocol version, the encryption algorithms supported by the user, the supported extension list, the elliptic curves and the elliptic curve format, and the features are spliced into the character string in that order, namely TLS/SSL protocol version, encryption algorithms supported by the user, supported extension list, elliptic curves, elliptic curve format;
a stream feature extraction module: dividing the data packets acquired by the traffic acquisition module into different streams according to the quintuple, storing them per stream together with the session id, computing the stream features of each data stream after storage, and preprocessing the computed stream features to obtain, for each data stream, a feature vector in a consistent data format, wherein the stream features comprise the average data packet size, the transmission rate and the number of data packets;
a cluster analysis module: judging whether each data stream belongs to a previously recorded user connection, and if so, using it to optimize the clustering algorithm; otherwise, retrieving the user fingerprint of the data stream from the database and performing cluster analysis, based on the optimized clustering algorithm, on the feature vector of the data stream and the feature vectors of the recorded data streams under the same user fingerprint, so as to judge from the cluster analysis result whether the data stream is a connection from a new user.
2. The dynamic IP device identification system for encrypted traffic according to claim 1, characterized in that the traffic acquisition module performs the following steps:
port mirroring and traffic capture: copying all inbound traffic to a specific port as a backup, and then capturing the traffic with tcpdump or wireshark;
data packet parsing: parsing the data packets in the captured traffic layer by layer according to the protocol formats to obtain basic data, wherein the basic data comprises the source IP address, source port number, destination IP address, destination port number, protocol type and data packet size;
Client Hello packet judgment: first judging whether each data packet is encrypted with the TLS/SSL protocol based on the protocol type, that is, parsing the payload carried by each data packet according to the TLS/SSL protocol format to obtain the main field contents of the TLS/SSL protocol, and judging whether the data packet uses the TLS/SSL protocol by checking whether the parsing result conforms to the TLS/SSL protocol specification; then judging whether the data packet is a TLS handshake packet by checking the Content Type field and the Handshake Type field, wherein the Content Type field marks the TLS/SSL protocol type and a value of 22 indicates a handshake packet, the Handshake Type field marks the handshake packet type and a value of 01 indicates a Client Hello packet sent by the client, and the main field contents comprise the Content Type field and the Handshake Type field.
3. The dynamic IP device identification system for encrypted traffic according to claim 2, characterized in that the user fingerprint extraction module performs the following steps:
feature field extraction: parsing the Client Hello data packet of TLS/SSL encrypted traffic according to the TLS/SSL protocol format to obtain the fingerprint features required to generate the user fingerprint, wherein the fingerprint feature fields comprise the TLS/SSL protocol version, the list of supported encryption algorithms, the list of supported extension types, the list of supported elliptic curve algorithms and the supported elliptic curve format;
feature field splicing: converting the fingerprint features into decimal numbers and then splicing them in order into a character string;
user fingerprint generation: applying a hash algorithm to the spliced character string and taking the resulting hash value as the user fingerprint.
4. The dynamic IP device identification system for encrypted traffic according to claim 3, characterized in that the stream feature extraction module performs the following steps:
stream data storage: based on a hash array data structure, dividing the TLS/SSL encrypted traffic data packets into different data streams according to the quintuple, splicing the fields of the quintuple into a character string and computing its hash, and using the hash value as the index of the data stream, that is, the position in memory of the data structure contents of the corresponding data stream, wherein the quintuple consists of the source IP address, source port number, destination IP address, destination port number and protocol type;
data statistics: computing the feature data of each data stream, including the number of data packets, the transmission size of the data stream and the data transmission rate, and at the same time recording the session id corresponding to each data stream in the database, wherein the session id is the session id in the Cookie data packet and is used to judge whether data packets belong to a connection from the same user;
preprocessing and vectorization: preprocessing and vectorizing the results of the data statistics to obtain feature vectors, wherein the preprocessing and vectorization comprise filling missing values in the data samples after data statistics, converting features in string or character format into a vector form that the clustering algorithm can process, and converting data of the same feature in different specifications into the same specification.
5. The dynamic IP device identification system for encrypted traffic according to claim 4, characterized in that the cluster analysis module is implemented by the following specific steps:
for the results obtained by preprocessing and vectorization, obtaining the session id of each data stream from the Nginx server, and judging whether the data stream belongs to a previous connection based on whether the session id is already recorded in the database; if so, performing no clustering, directly labeling the stream and evaluating the clustering algorithm, wherein the evaluation method is to calculate the contour coefficient (silhouette coefficient) to express the quality of the clustering; for a sample point i:
calculating a(i) = the average distance from vector i to the other points in the cluster to which it belongs;
calculating b(i) = the minimum, over all clusters that do not contain i, of the average distance from vector i to all points in that cluster;
then the contour coefficient of sample point i is:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
it can be seen that the value of the contour coefficient lies in [-1, 1], and the closer it is to 1, the better the cohesion and separation; the contour coefficients of all points are averaged to obtain the overall contour coefficient of the clustering result; the parameters of the clustering algorithm are adjusted so that its contour coefficient moves closer to 1, thereby completing the adjustment of the number of iterations, and the results are recorded in the database;
otherwise, retrieving the user fingerprint corresponding to the data stream from the database, then calculating the similarity between the feature vector of the data stream and the feature vectors recorded in the database under the same user fingerprint, and judging whether all similarity values exceed a given threshold; if so, judging that the data stream is an access request from a new user device; otherwise, merging the two most similar clusters while storing the similarity calculation result of each feature vector in a similarity table, until no cluster can be merged further with any other cluster, wherein each category is a cluster, that is, a user device; initially there are 0 clusters because there is no data; after the traffic of the first user device has been received, when the traffic of a second user device arrives it is judged whether the two are similar, and so on.
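As an illustration of the contour (silhouette) coefficient defined in claim 5, the following sketch computes a(i), b(i) and the average coefficient over all points. Euclidean distance is an assumption, since the claim does not name the distance measure, and the function name `mean_silhouette` is hypothetical. A point that is alone in its cluster is given a coefficient of 0, which is the usual convention.

```python
import math
from typing import List, Sequence


def mean_silhouette(points: List[Sequence[float]], labels: List[int]) -> float:
    """Average silhouette coefficient of a clustering result."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("the coefficient needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        # a(i): mean distance to the other points of the same cluster
        own = [math.dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)  # singleton cluster convention
            continue
        a = sum(own) / len(own)
        # b(i): smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in clusters if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Two tight, well-separated clusters give a value close to 1, which is the direction in which the clustering parameters are tuned.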
6. A dynamic IP device identification method for encrypted traffic, characterized by comprising the following steps:
step 1: capturing traffic by port mirroring, parsing the protocol fields of each layer of every captured data packet, and judging whether the data packet is TLS/SSL encrypted traffic by a port-based method;
step 2: for TLS/SSL encrypted traffic data packets, parsing the Client Hello data packet of the handshake phase to extract fingerprint features, splicing all extracted fingerprint features in order into a character string, calculating a hash value of the character string as the user fingerprint, and storing the user fingerprint in a database, wherein the fingerprint features comprise the TLS/SSL protocol version, the encryption algorithms supported by the user, the supported extension list, the elliptic curves and the elliptic curve format, and the features are spliced into the character string in that order, namely TLS/SSL protocol version, encryption algorithms supported by the user, supported extension list, elliptic curves, elliptic curve format;
step 3: dividing the data packets collected in step 1 into different streams according to the quintuple, storing them per stream together with the session id, computing the stream features of each data stream after storage, and preprocessing the computed stream features to obtain, for each data stream, a feature vector in a consistent data format, wherein the stream features comprise the average data packet size, the transmission rate and the number of data packets;
step 4: judging by the session id whether each data stream belongs to a previously recorded user connection, and if so, using the data stream to optimize the clustering algorithm; otherwise, retrieving the user fingerprint of the data stream from the database and performing cluster analysis, based on the optimized clustering algorithm, on the feature vector of the data stream and the feature vectors of the recorded data streams under the same user fingerprint, so as to judge from the cluster analysis result whether the data stream is a connection from a new user.
7. The dynamic IP device identification method for encrypted traffic according to claim 6, characterized in that step 1 comprises the following steps:
port mirroring and traffic capture: copying all inbound traffic to a specific port as a backup, and then capturing the traffic with tcpdump or wireshark;
data packet parsing: parsing the data packets in the captured traffic layer by layer according to the protocol formats to obtain basic data, wherein the basic data comprises the source IP address, source port number, destination IP address, destination port number, protocol type and data packet size;
Client Hello packet judgment: first judging whether each data packet is encrypted with the TLS/SSL protocol based on the protocol type, that is, parsing the payload carried by each data packet according to the TLS/SSL protocol format to obtain the main field contents of the TLS/SSL protocol, and judging whether the data packet uses the TLS/SSL protocol by checking whether the parsing result conforms to the TLS/SSL protocol specification; then judging whether the data packet is a TLS handshake packet by checking the Content Type field and the Handshake Type field, wherein the Content Type field marks the TLS/SSL protocol type and a value of 22 indicates a handshake packet, the Handshake Type field marks the handshake packet type and a value of 01 indicates a Client Hello packet sent by the client, and the main field contents comprise the Content Type field and the Handshake Type field.
8. The dynamic IP device identification method for encrypted traffic according to claim 7, characterized in that step 2 comprises the following steps:
feature field extraction: parsing the Client Hello data packet of TLS/SSL encrypted traffic according to the TLS/SSL protocol format to obtain the fingerprint features required to generate the user fingerprint, wherein the fingerprint feature fields comprise the TLS/SSL protocol version, the list of supported encryption algorithms, the list of supported extension types, the list of supported elliptic curve algorithms and the supported elliptic curve format;
feature field splicing: converting the fingerprint features into decimal numbers and then splicing them in order into a character string;
user fingerprint generation: applying a hash algorithm to the spliced character string and taking the resulting hash value as the user fingerprint.
9. The dynamic IP device identification method for encrypted traffic according to claim 8, characterized in that step 3 comprises the following steps:
stream data storage: based on a hash array data structure, dividing the TLS/SSL encrypted traffic data packets into different data streams according to the quintuple, splicing the fields of the quintuple into a character string and computing its hash, and using the hash value as the index of the data stream, that is, the position in memory of the data structure contents of the corresponding data stream, wherein the quintuple consists of the source IP address, source port number, destination IP address, destination port number and protocol type;
data statistics: computing the feature data of each data stream, including the number of data packets, the transmission size of the data stream and the data transmission rate, and at the same time recording the session id corresponding to each data stream in the database, wherein the session id is the session id in the Cookie data packet and is used to judge whether data packets belong to a connection from the same user;
preprocessing and vectorization: preprocessing and vectorizing the results of the data statistics to obtain feature vectors, wherein the preprocessing and vectorization comprise filling missing values in the data samples after data statistics, converting features in string or character format into a vector form that the clustering algorithm can process, and converting data of the same feature in different specifications into the same specification.
10. The dynamic IP device identification method for encrypted traffic according to claim 9, characterized in that step 4 comprises the following steps:
for the results obtained by preprocessing and vectorization, obtaining the session id of each data stream from the Nginx server, and judging whether the data stream belongs to a previous connection based on whether the session id is already recorded in the database; if so, performing no clustering, directly labeling the stream and evaluating the clustering algorithm, wherein the evaluation method is to calculate the contour coefficient (silhouette coefficient) to express the quality of the clustering; for a sample point i:
calculating a(i) = the average distance from vector i to the other points in the cluster to which it belongs;
calculating b(i) = the minimum, over all clusters that do not contain i, of the average distance from vector i to all points in that cluster;
then the contour coefficient of sample point i is:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
it can be seen that the value of the contour coefficient lies in [-1, 1], and the closer it is to 1, the better the cohesion and separation; the contour coefficients of all points are averaged to obtain the overall contour coefficient of the clustering result; the parameters of the clustering algorithm are adjusted so that its contour coefficient moves closer to 1, thereby completing the adjustment of the number of iterations, and the results are recorded in the database;
otherwise, retrieving the user fingerprint corresponding to the data stream from the database, then calculating the similarity between the feature vector of the data stream and the feature vectors recorded in the database under the same user fingerprint, and judging whether all similarity values exceed a given threshold; if so, judging that the data stream is an access request from a new user device; otherwise, merging the two most similar clusters while storing the similarity calculation result of each feature vector in a similarity table, until no cluster can be merged further with any other cluster, wherein each category is a cluster, that is, a user device; initially there are 0 clusters because there is no data; after the traffic of the first user device has been received, when the traffic of a second user device arrives it is judged whether the two are similar, and so on.
CN202211420599.5A 2022-11-14 2022-11-14 Dynamic IP equipment identification system and method for encrypted traffic Active CN115766204B (en)


Publications (2)

Publication Number Publication Date
CN115766204A true CN115766204A (en) 2023-03-07
CN115766204B CN115766204B (en) 2024-04-26

Family

ID=85370337


Country Status (1)

Country Link
CN (1) CN115766204B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant