CN113518042A

CN113518042A - Data processing method, device, equipment and storage medium

Info

Publication number: CN113518042A
Application number: CN202011490495.2A
Authority: CN
Inventors: 彭婧; 甘祥; 郑兴; 郭晶; 范宇河; 唐文韬; 申军利; 刘羽
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-12-16
Filing date: 2020-12-16
Publication date: 2021-10-19
Anticipated expiration: 2040-12-16
Also published as: CN113518042B

Abstract

Embodiments of the application provide a data processing method, apparatus, device and storage medium. The method comprises: acquiring data traffic to be identified, wherein the data traffic to be identified is generated based on the Quick UDP Internet Connections (QUIC) protocol and comprises QUIC attribute information and domain name system information; combining the QUIC attribute information and the domain name system information in the data traffic to be identified into traffic characteristics of the data traffic to be identified; and calling a traffic type identification model to identify the traffic characteristics of the data traffic to be identified, obtaining the traffic type of the data traffic to be identified, and outputting that traffic type. With the method and apparatus, the traffic type of data traffic generated based on the QUIC protocol can be identified accurately and effectively.

Description

Data processing method, device, equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and storage medium.

Background

Quick UDP Internet Connections (QUIC) is a low-latency Internet transport layer protocol, built on the User Datagram Protocol (UDP) and developed by Google. An Internet-Draft of the QUIC specification was submitted to the Internet Engineering Task Force (IETF) for standardization in June 2015.

In an environment where Google products are heavily used, 10% to 20% of network traffic is carried over the QUIC protocol. At present there is no method for identifying whether data traffic generated by the QUIC protocol is malicious or normal. How to identify data traffic generated by the QUIC protocol is therefore a problem to be solved.

Disclosure of Invention

The embodiments of the invention provide a data processing method, apparatus, device and storage medium, which can accurately and effectively identify the traffic type of data traffic generated based on the QUIC protocol.

In one aspect, an embodiment of the present invention provides a data processing method, where the method includes:

acquiring data traffic to be identified, wherein the data traffic to be identified is generated based on the Quick UDP Internet Connections (QUIC) protocol and comprises QUIC attribute information and domain name system information;

combining the QUIC attribute information and the domain name system information in the data traffic to be identified into traffic characteristics of the data traffic to be identified;

and calling a traffic type identification model to identify the traffic characteristics of the data traffic to be identified, obtaining the traffic type of the data traffic to be identified, and outputting the traffic type of the data traffic to be identified.

In one aspect, an embodiment of the present invention provides a data processing apparatus, where the apparatus includes:

an acquisition module, configured to acquire data traffic to be identified, wherein the data traffic to be identified is generated based on the Quick UDP Internet Connections (QUIC) protocol and comprises QUIC attribute information and domain name system information;

a processing module, configured to combine the QUIC attribute information and the domain name system information in the data traffic to be identified into traffic characteristics of the data traffic to be identified;

wherein the processing module is further configured to call a traffic type identification model to identify the traffic characteristics of the data traffic to be identified, obtain the traffic type of the data traffic to be identified, and output the traffic type of the data traffic to be identified.

In one aspect, an embodiment of the present invention provides a computer device, which includes a processor, a communication interface, and a memory, where the processor, the communication interface, and the memory are connected to each other, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to perform operations involved in the above-mentioned data processing method.

In one aspect, an embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the above-mentioned data processing method is implemented.

In one aspect, an embodiment of the present invention further provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the data processing method.

The embodiments of the invention train an original traffic type identification model using the QUIC attribute information and the domain name system information in data traffic samples, so that the resulting traffic type identification model can identify the traffic type of data traffic generated based on the QUIC protocol. In addition, the QUIC attribute information and the domain name system information in the data traffic to be identified, which is generated based on the QUIC protocol, are extracted and combined as the traffic characteristics of that traffic, so that the traffic type identification model can accurately and effectively identify its traffic type.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a block diagram of a data processing system according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of establishing a traffic type identification model according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of a data processing method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a CHLO packet according to an embodiment of the present invention;

FIG. 6 is a schematic flow chart of a data processing method according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields and involves both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, cloud storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine Learning (ML) is a multi-disciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

Specifically, in the application, a traffic type identification model for identifying traffic types is trained through a massive data traffic training sample set and a machine learning algorithm, and the traffic type identification model can be used for identifying whether data traffic to be identified is normal data traffic or malicious data traffic.

Cloud computing in the narrow sense refers to a delivery and use mode of IT infrastructure, in which required resources are obtained over a network in an on-demand and easily scalable manner. Cloud computing in the broad sense refers to a delivery and use mode of services, in which required services are obtained over a network in an on-demand and easily scalable manner. Such services may be IT and software services, Internet-related services, or other services. Cloud computing is a product of the development and fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization, and load balancing.

Cloud security is a general term for the security software, hardware, users, organizations and secure cloud platforms built on the cloud computing business model. Cloud security integrates emerging technologies and concepts such as parallel processing, grid computing and behavior-based detection of unknown viruses. A large mesh of clients monitors abnormal software behavior in the network, obtains the latest information on trojans and malicious programs on the Internet, and sends it to the server for automatic analysis and processing; the resulting virus and trojan solutions are then distributed to every client.

The data processing method provided by the application can be packaged into a security service of the cloud platform, and when data traffic needing to be identified exists, the security service is called on the cloud platform to obtain an identification result. Subsequently, normal data traffic can be allowed to pass through and malicious data traffic can be intercepted according to the obtained identification result, so that the equipment is prevented from being attacked, and the network security of the equipment is improved.

It should be understood that the data processing method provided by the embodiment of the present application can be applied to various communication systems based on the QUIC protocol, such as: computer networks, global system for mobile communications (GSM) systems, Code Division Multiple Access (CDMA) systems, Wideband Code Division Multiple Access (WCDMA) systems, General Packet Radio Service (GPRS), Long Term Evolution (LTE) systems, LTE Frequency Division Duplex (FDD) systems, LTE Time Division Duplex (TDD), Universal Mobile Telecommunications System (UMTS), Worldwide Interoperability for Microwave Access (WiMAX) communication systems, and 5G communication systems, among others.

As shown in fig. 1, the embodiment of the present application provides a data processing system, which includes at least one terminal device 101 and at least one server 102. The terminal device 101 is also referred to as a terminal, User Equipment (UE), access terminal, subscriber unit, mobile device, user terminal, wireless communication device, or user agent. The terminal device 101 may be a Personal Digital Assistant (PDA) device, a smart TV, a handheld device with a wireless communication function (e.g., a smart phone or a tablet), a computing device (e.g., a Personal Computer (PC)), a vehicle-mounted device, a wearable device, and the like, but is not limited thereto.

The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

The terminal device 101 sends the data traffic to be identified to the server 102, the server 102 extracts the QUIC attribute information and the domain name system information in the data traffic to be identified generated based on the QUIC protocol, combines the information as the traffic characteristics of the data traffic to be identified, and then identifies the traffic characteristics of the data traffic to be identified by calling a pre-trained traffic type identification model, so that the traffic type of the data traffic to be identified is obtained, and the traffic type of the data traffic generated based on the QUIC protocol can be accurately and effectively identified.

A specific application scenario of the present application may be as follows: a bypass device is installed at the traffic egress of a machine room that houses a large number of servers (a switch may be located at the traffic egress), and the method of the present application is packaged as application software and installed on the bypass device. In this way, the security of QUIC-generated data traffic entering and leaving the machine room can be monitored, and once malicious data traffic is detected, the server is notified to stop sending or receiving it, so as to protect the server and the client.

It should be understood that the architecture diagram of the system described in the embodiment of the present application is for more clearly illustrating the technical solution of the embodiment of the present application, and does not constitute a limitation to the technical solution provided in the embodiment of the present application, and as a person having ordinary skill in the art knows that along with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.

In one embodiment, fig. 2 shows a data processing method provided by an embodiment of the present invention on the basis of the data processing system of fig. 1. The server 102 of fig. 1 is taken as the executing entity by way of example. The method of the embodiment of the present invention is described below with reference to fig. 2.

S201, acquiring data traffic to be identified, wherein the data traffic to be identified is generated based on the Quick UDP Internet Connections (QUIC) protocol and comprises QUIC attribute information and domain name system information.

As an Internet transport layer protocol, QUIC includes a transport encryption layer that protects the confidentiality of the transmitted data. Data traffic to be identified that is generated based on the QUIC protocol is therefore encrypted, which protects it against man-in-the-middle attacks such as eavesdropping and tampering.

In one embodiment, the data traffic to be identified contains QUIC data traffic. For security reasons, typically only a small portion of this traffic is unencrypted, and that portion is mainly generated when the connection is established. The unencrypted special fields in the QUIC data traffic are therefore extracted as the QUIC attribute information, and the domain name system information associated with the destination IP address is selected as the domain name system information.

S202, combining QUIC attribute information and domain name system information in the data traffic to be identified as traffic characteristics of the data traffic to be identified.

In one embodiment, the QUIC attribute information and the domain name system information are combined as the traffic characteristics of the data traffic to be identified, and these traffic characteristics are the input to the subsequent traffic type analysis.
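The combination in S202 can be pictured as flattening two groups of fields into one fixed-order vector. The following is a minimal sketch only: the field names, the CRC32 bucketing of string fields and the bucket count are hypothetical choices made for illustration, not details given in this description.

```python
import zlib

# Hypothetical numeric fields; the names are placeholders for illustration.
QUIC_NUMERIC = ["total_bytes", "mean_gap"]
DNS_NUMERIC = ["ip_count", "ttl", "alexa_rank", "domain_length",
               "digit_ratio", "non_alnum_ratio"]


def encode_category(value, buckets=1024):
    """Map a string to a stable bucket index (CRC32 is deterministic across runs)."""
    return zlib.crc32(value.encode("utf-8")) % buckets


def combine_features(quic_info, dns_info):
    """Concatenate QUIC attribute fields and DNS fields in a fixed order."""
    vector = [
        float(encode_category(quic_info["user_agent"])),
        float(encode_category(quic_info["fingerprint"])),
    ]
    vector += [float(quic_info[key]) for key in QUIC_NUMERIC]
    vector += [float(dns_info[key]) for key in DNS_NUMERIC]
    return vector


quic_info = {"user_agent": "Chrome/86.0", "fingerprint": "0123abcd",
             "total_bytes": 2978, "mean_gap": 0.02}
dns_info = {"ip_count": 2, "ttl": 300, "alexa_rank": 1500,
            "domain_length": 19, "digit_ratio": 0.1, "non_alnum_ratio": 0.05}
features = combine_features(quic_info, dns_info)
```

Keeping the field order fixed matters more than the particular encoding: the model sees positions, so training and identification must build the vector the same way.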

S203, calling a traffic type identification model to identify the traffic characteristics of the data traffic to be identified, obtaining the traffic type of the data traffic to be identified, and outputting the traffic type of the data traffic to be identified.

In one embodiment, when the traffic type identification model is called to identify the traffic type of the data traffic to be identified, the traffic type identification model may be regarded as a classifier, and the traffic type of the data traffic to be identified is identified through traffic features in the data traffic to be identified.

In the embodiment of the application, the QUIC attribute information and the domain name system information in the data traffic to be identified, which is generated based on the QUIC protocol, are extracted and combined as the traffic characteristics of the data traffic to be identified. A traffic type identification model is then called to identify these traffic characteristics and obtain the traffic type of the data traffic to be identified, so that the traffic type of data traffic generated based on the QUIC protocol can be identified accurately and effectively.

In one embodiment, the traffic characteristics of the data traffic to be identified are standardized to obtain standardized traffic characteristics; calling the traffic type identification model to identify the traffic characteristics to obtain the traffic type of the data traffic to be identified then includes: calling the traffic type identification model to identify the standardized traffic characteristics to obtain the traffic type of the data traffic to be identified.

In one embodiment, the QUIC attribute information and the domain name system information in the traffic characteristics of the data traffic to be identified are standardized by one or more of data cleaning, data standardization and data regularization to obtain standardized traffic characteristics, and the traffic type identification model is called to identify the standardized traffic characteristics to obtain the traffic type of the data traffic to be identified. Data cleaning mainly handles missing values, abnormal values, repeated values, noise data and the like in the extracted traffic characteristics. For example, missing values may be handled by fixed-value filling, mean filling, median filling, filling from the preceding or following record, interpolation, or random-number filling. Standardized data can be processed more quickly; in addition, data standardization and data regularization (for example, z-score standardization) eliminate the effect of differing feature scales that arises when the features are digitized, which improves the operating efficiency of the traffic type identification model.
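Two of the preprocessing steps named above, median filling of missing values and z-score standardization, can be sketched with the Python standard library. The function names and the sample TTL column are illustrative assumptions, not part of the described method.

```python
from statistics import mean, median, pstdev


def fill_missing(column, strategy="median"):
    """Replace None entries with the median (or mean) of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in column]


def z_score(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


ttl = [60, 300, None, 100, 60]   # made-up TTL column with one missing value
cleaned = fill_missing(ttl)      # the gap is filled with the median, 80
standardized = z_score(cleaned)
```

In a real pipeline the mean and standard deviation would be computed once on the training samples and reused for traffic to be identified, so that both are on the same scale.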

In one embodiment, calling the traffic type identification model to identify the standardized traffic characteristics to obtain the traffic type of the data traffic to be identified includes: calling a plurality of classification regression tree sets in the traffic type identification model to identify the standardized traffic characteristics, and obtaining a prediction result of the standardized traffic characteristics in each classification regression tree set, wherein each prediction result comprises a set of matching probabilities between the standardized traffic characteristics and a plurality of traffic types; and determining the traffic type of the data traffic to be identified based on the plurality of prediction results.

In one embodiment, the traffic type identification model is generated based on a Gradient Boosting Decision Tree (GBDT). When the plurality of classification regression tree sets in the traffic type identification model are used to identify the standardized traffic characteristics, each classification regression tree set yields a prediction result for the standardized traffic characteristics, each prediction result includes a set of matching probabilities between the standardized traffic characteristics and a plurality of traffic types, and the traffic type of the data traffic to be identified is determined from the plurality of prediction results.

Specifically, the gradient boosting decision tree algorithm trains one classification regression tree set for each possible traffic type of the data traffic samples. Assuming the traffic types are normal data traffic and malicious data traffic, two classification regression trees are trained simultaneously in each training round. For example, when the data traffic sample x is normal data traffic, in the first training round the input for the first classification regression tree is (x, 1) and the input for the second classification regression tree is (x, 0), where 1 indicates that the sample belongs to the class of that tree and 0 indicates that it does not. During training of the classification regression tree set of each traffic type, every node of a tree, starting from the root node, must be given a splitting rule consisting of a splitting feature and a splitting feature value; the split may be chosen using a criterion such as MSE or MAE. Data samples are divided among the child nodes of the next layer according to the splitting rule, so that each data traffic sample x is finally assigned to one leaf node, and the leaf node gives the samples assigned to it the predicted value w. The leaf nodes of a gradient boosting decision tree output continuous values: after one tree has been trained, w is added to the predicted value of each data traffic sample, and the next tree is then trained on the new predicted values. The gradient boosting decision tree accumulates the predicted values of all classification regression trees as the final predicted value, as shown in the following formula (1).

where N is the number of trees, f_i(x) is the predicted value of the i-th tree for sample x, and η is a hyperparameter.
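One form of formula (1) consistent with these definitions, assuming that η acts as a shrinkage (learning-rate) factor applied to every tree and writing F(x) for the accumulated prediction for sample x, is:

```latex
F(x) = \eta \sum_{i=1}^{N} f_i(x)
```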

Therefore, the final classification regression tree set for each traffic type includes a plurality of classification regression trees. When the two classification regression tree sets identify the traffic type of the data traffic z to be identified, the prediction results obtained are F₁(z) and F₂(z). The matching probability that the data traffic z to be identified belongs to the first classification regression tree set, that is, that its traffic type is normal data traffic, is given by the following formula (2).

The matching probability that the data traffic z to be identified belongs to the second classification regression tree set, that is, that its traffic type is malicious data traffic, is given by the following formula (3).

The traffic type with the larger matching probability is then taken as the traffic type of the data traffic z to be identified.
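The decision step can be sketched as follows, assuming that the matching probabilities take the usual softmax form p_k = exp(F_k(z)) / (exp(F₁(z)) + exp(F₂(z))) and that each tree set's score is the η-scaled sum of its trees' outputs. Both are assumptions for illustration; the per-tree outputs in the example are made-up numbers.

```python
import math


def accumulate(tree_outputs, eta):
    """Sum the per-tree predictions of one tree set, scaled by eta."""
    return eta * sum(tree_outputs)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(per_class_tree_outputs, labels, eta=0.1):
    """Pick the traffic type whose tree set gives the largest matching probability."""
    scores = [accumulate(outputs, eta) for outputs in per_class_tree_outputs]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda k: probs[k])
    return labels[best], probs


# Hypothetical outputs of three trees per set for one sample z.
label, probs = classify([[0.2, -0.1, 0.3], [1.5, 2.0, 1.0]],
                        ["normal", "malicious"])
```

With two traffic types the softmax reduces to a logistic function of F₂(z) − F₁(z), so comparing the two probabilities is the same as comparing the two accumulated scores.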

In an embodiment, since the method provided in the embodiment of the present application is widely applicable and the model is replaceable, different algorithms may be used to train the traffic type identification model, such as a Support Vector Machine (SVM), an l1-regularized logistic regression algorithm (l1-logistic regression), a Random Forest (RF), a gradient boosting decision tree algorithm, or an Extreme Gradient Boosting (XGBoost) algorithm, which is not limited in the present application.

In an embodiment, referring to fig. 3, before invoking the traffic type identification model to identify the traffic type of the data traffic to be identified, the method further includes a step of establishing the traffic type identification model, where the step specifically includes the following steps:

S301, acquiring a data traffic training sample set, wherein the data traffic training sample set comprises a plurality of data traffic samples and the traffic type of each data traffic sample, any data traffic sample in the data traffic training sample set is data traffic generated based on the QUIC protocol, and any data traffic sample in the data traffic training sample set comprises QUIC attribute information and domain name system information.

In one embodiment, the data traffic training sample set includes a plurality of data traffic samples, each generated based on the QUIC protocol, so that the data traffic samples in the data traffic training sample set are encrypted data traffic. The traffic type of each data traffic sample is either normal data traffic or malicious data traffic.

In one embodiment, before the original traffic type identification model is trained, as shown in fig. 4, the method further includes the following steps. First, an optical splitter is deployed at a switch of the machine room to capture all split data traffic of the machine room. The captured traffic is screened to obtain the data traffic generated based on the QUIC protocol, and this traffic is filtered to extract the QUIC data traffic it contains; the QUIC data traffic is decrypted and sent to an unencrypted-traffic detection system. The QUIC data traffic is then classified according to the result given by the unencrypted-traffic detection system, mainly into malicious data traffic and normal data traffic. Next, the QUIC attribute information in the classified QUIC data traffic is associated with the domain name system information related to the destination IP address. Finally, data preprocessing, namely the standardization described in the above embodiment, is performed on the associated data to obtain the traffic characteristics of the data traffic. In this way a large number of data traffic samples can be obtained and used to train the original traffic type identification model, so that the trained traffic type identification model can accurately and effectively identify the traffic type of the data traffic to be identified.

In one embodiment, each data traffic sample in the data traffic training sample set contains QUIC data traffic. For security reasons, typically only a small portion of this traffic is unencrypted, and that portion is mainly generated when a connection is established. The unencrypted special fields in the QUIC data traffic are therefore extracted as the QUIC attribute information, and the domain name system information associated with the destination IP address is selected as the domain name system information.

In one embodiment, the handshake packet generated by the QUIC data traffic when a connection is established is unencrypted. Taking GQUIC version Q035 as an example, the terminal device sends a CHLO packet that contains a plurality of unencrypted fields, as shown in fig. 5, from which the following characteristic fields can be extracted as model inputs: the Server Name Indication (SNI), the user agent of the client, and the QUIC joint attributes. The QUIC joint attributes comprise one or more of: the QUIC version VER, the padding PAD, the source address token STK, the common certificate sets CCS, the client nonce NONC, the authenticated encryption algorithm AEAD, the server config ID SCID, the connection ID truncation TCID, the proof demand PDMD, the supported maximum header list SMHL, the idle connection state lifetime ICSL, the client proof nonce NONP, the public value of the key exchange PUBS, the maximum incoming dynamic streams MIDS, the silent close on timeout SCLS, the key exchange algorithm KEXS, the expected leaf certificate XLCT, the signed certificate timestamp CSCT, the connection options COPT, the cached certificates CCRT, the estimated initial round-trip time IRTT, the initial session/connection flow control receive window CFCW, and the initial stream flow control receive window SFCW.
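How such fields might be read out of a CHLO message body can be sketched as below. The layout used here is an assumption based on the publicly described GQUIC handshake tag-value format: a 4-byte message tag, a 16-bit pair count, 2 bytes of padding, then one (4-byte tag, 32-bit end offset) index entry per pair, followed by the concatenated values, all little-endian. Locating the message inside the enclosing QUIC packet is not covered, and `build_message` exists only to produce synthetic test input.

```python
import struct


def parse_tag_value_message(payload: bytes):
    """Parse a handshake message body into (message_tag, {tag: value_bytes})."""
    msg_tag, count, _padding = struct.unpack_from("<4sHH", payload, 0)
    index_start = 8                        # header is 4 + 2 + 2 bytes
    data_start = index_start + 8 * count   # each index entry is 4 + 4 bytes
    fields, prev_end = {}, 0
    for i in range(count):
        tag, end = struct.unpack_from("<4sI", payload, index_start + 8 * i)
        name = tag.rstrip(b"\x00").decode("ascii")  # short tags are NUL-padded
        fields[name] = payload[data_start + prev_end:data_start + end]
        prev_end = end
    return msg_tag.decode("ascii"), fields


def build_message(msg_tag: bytes, pairs):
    """Build a synthetic message in the assumed layout (for testing only)."""
    index, data = b"", b""
    for tag, value in pairs:
        data += value
        index += struct.pack("<4sI", tag.ljust(4, b"\x00"), len(data))
    return struct.pack("<4sHH", msg_tag, len(pairs), 0) + index + data


chlo = build_message(b"CHLO", [(b"SNI", b"www.example.com"),
                               (b"UAID", b"Chrome/86.0"),
                               (b"VER", b"Q035")])
message_tag, chlo_fields = parse_tag_value_message(chlo)
```

Here `chlo_fields["SNI"]` and `chlo_fields["UAID"]` would supply the server name and the client user agent, and the remaining tags would feed the QUIC joint attributes.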

In one embodiment, after the QUIC joint attribute fields are obtained, the Message-Digest Algorithm 5 (MD5) is called to hash the QUIC joint attribute fields, yielding four groups of 32-bit hash values, which are concatenated into a 128-bit data fingerprint of the QUIC joint attributes. Other algorithms, such as the MD4 algorithm, a URL encryption algorithm or a JS encryption algorithm, may also be used, which is not limited in this application.
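A minimal sketch of such a fingerprint using Python's `hashlib` follows. How the joint attribute fields are serialized before hashing is not specified here, so the `tag=value` comma-joined form, the field order and the function name are assumptions for illustration.

```python
import hashlib


def quic_fingerprint(attributes, order):
    """Join the selected attribute values in a fixed order and hash them with MD5."""
    joined = ",".join(f"{tag}={attributes.get(tag, '')}" for tag in order)
    digest = hashlib.md5(joined.encode("utf-8")).digest()  # 16 bytes = 128 bits
    # Split the digest into four 32-bit groups and concatenate them as hex.
    words = [digest[i:i + 4].hex() for i in range(0, 16, 4)]
    return "".join(words)


attrs = {"VER": "Q035", "AEAD": "AESG", "KEXS": "C255"}  # illustrative values
fingerprint = quic_fingerprint(attrs, ["VER", "AEAD", "KEXS"])
```

The result is a 32-character hexadecimal string. Clients that send the same joint attribute values in the same order get the same fingerprint, which is what makes it usable as a categorical feature.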

In one embodiment, the QUIC attribute information further includes one or more of the size of the QUIC data traffic and the time sequence in which the datagrams of the QUIC data traffic arrive. For example, when the size of QUIC data traffic is analyzed, attention is paid mainly to small packets of about 64 bytes and large packets of about 1500 bytes; in normal data traffic, these two kinds of packets should make up only a small proportion of the packets.
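The size and arrival-time attributes can be summarized per connection along the following lines. The thresholds that decide what counts as a packet of "about 64 bytes" or "about 1500 bytes" (80 and 1400 here), the feature names and the sample packets are all assumptions for illustration.

```python
def size_and_timing_features(packets, small_max=80, large_min=1400):
    """Summarize (arrival_time_seconds, size_bytes) tuples given in capture order."""
    sizes = [size for _, size in packets]
    times = [t for t, _ in packets]
    total = len(packets)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "total_bytes": sum(sizes),
        "small_ratio": sum(s <= small_max for s in sizes) / total if total else 0.0,
        "large_ratio": sum(s >= large_min for s in sizes) / total if total else 0.0,
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }


packets = [(0.00, 1350), (0.02, 64), (0.05, 1500), (0.06, 64)]  # made-up capture
summary = size_and_timing_features(packets)
```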

In one embodiment, data analysis indicates that the difference between malicious data traffic and normal data traffic in the domain name system information is mainly reflected in the number of Internet Protocol (IP) addresses, the Time To Live (TTL), the Alexa website ranking, the domain name length, and the numeric and non-alphanumeric characters in the domain name. For example, the TTL values of normal data traffic are typically 60, 300, 20 and 30. In malicious data traffic a TTL of 300 is also common, but about 22% of domain name system responses have a TTL of 100, which is rare in normal data traffic. In malicious domain name system response messages the most common TTL values are 100, 300 and 60, whereas normal domain name system response messages practically never use a TTL of 100. The domain name used for the domain name system information may be obtained from the Server Name Indication (SNI) field.
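The lexical and response-level attributes listed above could be computed roughly as follows. This is a sketch under assumptions: "internet protocol number" is read as the number of IP addresses in the response, the ratios are taken over the domain name with dots removed, and all names and sample values are hypothetical.

```python
def domain_features(domain, ip_addresses, ttl, alexa_rank=None):
    """Derive simple DNS features from a domain name and its DNS response."""
    name = domain.rstrip(".").lower()
    chars = name.replace(".", "")          # ignore label separators for ratios
    digits = sum(c.isdigit() for c in chars)
    non_alnum = sum(not c.isalnum() for c in chars)
    denom = len(chars) or 1                # guard against an empty name
    return {
        "ip_count": len(ip_addresses),
        "ttl": ttl,
        "alexa_rank": alexa_rank,          # None when the domain is unranked
        "domain_length": len(name),
        "digit_ratio": digits / denom,
        "non_alnum_ratio": non_alnum / denom,
    }


dns = domain_features("a1b2-c3.example.com",
                      ["192.0.2.10", "192.0.2.11"], ttl=100)
```

For this made-up name the length is 19, with 3 digits and 1 hyphen among the 17 non-dot characters.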

S302, combining the QUIC attribute information and the domain name system information in each data flow sample into the flow characteristics of each data flow sample.

In one embodiment, as shown in table 1, the QUIC attribute information and the domain name system information in each data traffic sample are combined into the traffic characteristics of that data traffic sample. The QUIC attribute information includes one or more of: the user agent of the client, the data fingerprint of the QUIC joint attributes, the size of the QUIC data traffic, and the time sequence of arrival of a plurality of datagrams of the QUIC data traffic. The domain name system information includes one or more of: the internet protocol number, the time to live, the Alexa website ranking, the domain name length, and the ratio of domain name numeric characters and non-alphanumeric characters.

TABLE 1

S303, training the original traffic type recognition model by using the traffic characteristics of each data traffic sample and the traffic type of each data traffic sample to obtain a traffic type recognition model.

In an embodiment, the original traffic type identification model may be a Support Vector Machine (SVM), an l1-regularized logistic regression algorithm (l1-logistic regression), a Random Forest (RF), a gradient boosting decision tree algorithm, an Extreme Gradient Boosting (XGBoost) algorithm, or the like, which is not limited in this application. The original traffic type identification model is trained with the traffic characteristics of each data traffic sample in the data traffic training sample set and the corresponding traffic type. For example, based on the gradient boosting decision tree algorithm, a plurality of classification and regression trees are trained sequentially on the data traffic training sample set to obtain the final traffic type identification model.
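To make the sequential training of trees concrete, the following is a deliberately simplified sketch of gradient boosting for two traffic types. It is not the model of this application: each tree is reduced to a depth-1 regression tree (a stump), the loss is the log loss, and the names train_boosted_trees and malicious_probability, the learning rate, and the sample values are all assumptions of the example.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _fit_stump(rows, residuals):
    """Fit a depth-1 regression tree: (feature, threshold, left value, right value)."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    return None if best is None else best[1:]


def train_boosted_trees(rows, labels, n_trees=20, learning_rate=0.5):
    """Train stumps one after another; labels are 1 (malicious) or 0 (normal)."""
    scores = [0.0] * len(rows)
    trees = []
    for _ in range(n_trees):
        # Negative gradient of the log loss with respect to the current score.
        residuals = [y - _sigmoid(s) for y, s in zip(labels, scores)]
        stump = _fit_stump(rows, residuals)
        if stump is None:
            break
        trees.append(stump)
        j, threshold, left_value, right_value = stump
        scores = [
            s + learning_rate * (left_value if row[j] <= threshold else right_value)
            for row, s in zip(rows, scores)
        ]
    return trees


def malicious_probability(trees, row, learning_rate=0.5):
    """Sum the scaled outputs of all stumps and map the score to a probability."""
    score = sum(learning_rate * (lv if row[j] <= t else rv) for j, t, lv, rv in trees)
    return _sigmoid(score)


if __name__ == "__main__":
    # Hypothetical samples: [TTL, ratio of numeric characters in the domain name].
    rows = [[60.0, 0.0], [300.0, 0.1], [100.0, 0.4], [100.0, 0.5]]
    labels = [0, 0, 1, 1]
    model = train_boosted_trees(rows, labels)
    print(malicious_probability(model, [100.0, 0.45]))
```

Each new stump is fitted to the residuals left by the trees trained before it, which is the sense in which the trees are trained sequentially. A production system would more likely rely on an existing gradient boosting library.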

Specifically, when the original traffic type identification model is trained with the data traffic training sample set, the traffic type of each data traffic sample is either normal data traffic or malicious data traffic, where the malicious data traffic includes, for example, traffic from trojan horses, ransomware, infectious viruses, worms, downloaders, and other malware.

In the embodiment of the application, the data traffic training sample set with the traffic type is obtained, and the traffic characteristics of each data traffic sample in the data traffic training sample set are extracted, so that the original traffic type identification model is trained based on the traffic characteristics and the traffic types of each data traffic sample, and the obtained traffic type identification model can identify the traffic type of the data traffic to be identified.

In one embodiment, the obtaining a data traffic training sample set includes: acquiring a plurality of original data traffic samples and a protocol identifier of each original data traffic sample; extracting a plurality of data traffic samples generated by QUIC protocol from the plurality of raw data traffic samples according to the protocol identification of each raw data traffic sample; and acquiring the traffic type of each data traffic sample, and combining the multiple data traffic samples and the traffic type of each data traffic sample into a data traffic training sample set.

In one embodiment, a plurality of original data traffic samples and the protocol identifier of each original data traffic sample need to be obtained. The protocol identifier of an original data traffic sample is an identifier that distinguishes its protocol from that of other data traffic samples, and may be a preset identifier bit or the protocol name of the data traffic sample.

In one embodiment, in order to obtain the original data traffic samples, a packet capture platform may be deployed at a switch in the network. For example, a host with Sniffer software installed is connected to a certain port of the switch (the mirroring destination port), and the traffic of the other switch ports whose traffic needs to be collected (which need not be on the same switch) is mirrored to that port, so that the data traffic of multiple ports can be collected by monitoring one port. Alternatively, data traffic in the network may be captured by deploying an optical splitter at the switch. The manner of obtaining the original data traffic samples is not limited in this application.

In one embodiment, after the original data traffic samples are obtained, the protocol type of each original data traffic sample may be identified by using a traffic identification technology based on the network port number, on Deep Packet Inspection (DPI), on Dynamic Flow Inspection (DFI), on host behavior, or the like. The traffic identification technology based on the network port number determines the protocol used by the data traffic mainly by checking the source port number or the destination port number in the data packet against the port number list of the Internet Assigned Numbers Authority (IANA). For example, port 21 is registered for the FTP service, port 80 for web applications over the HTTP protocol, and port 25 for the common e-mail protocol SMTP. The DPI technology mainly analyzes the payload of the packets in the data traffic to determine whether the payload matches certain characteristic strings of a currently known protocol.
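The port-based identification described above amounts to a table lookup, as in the following sketch. The table contains only the three example ports named in the text; the function name protocol_by_port and the rule of checking the destination port before the source port are assumptions of the example.

```python
# Only the example ports mentioned in the text; a real table would follow the IANA registry.
WELL_KNOWN_PORTS = {21: "FTP", 25: "SMTP", 80: "HTTP"}


def protocol_by_port(src_port, dst_port, table=WELL_KNOWN_PORTS):
    """Return the protocol registered for either port, or None if neither is known."""
    # The destination port is checked first (an assumed tie-break rule).
    for port in (dst_port, src_port):
        if port in table:
            return table[port]
    return None


if __name__ == "__main__":
    print(protocol_by_port(51000, 80))
```

A lookup of this kind is cheap but fails for traffic on non-registered ports, which is one reason the text also lists DPI, DFI, and host-behavior techniques.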

In one embodiment, after the protocol identifier of each data traffic sample is obtained, the data traffic samples generated based on the QUIC protocol are extracted, each data traffic sample is further decrypted, the traffic type of each data traffic sample is obtained through a decrypted plaintext, and the plurality of data traffic samples and the traffic type of each data traffic sample are combined into a data traffic training sample set.
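The construction of the training sample set can be sketched as a filter followed by labeling. In this hypothetical example each original sample is a dictionary with a "protocol" key holding its protocol identifier, and traffic_type_of stands in for the decryption and intrusion detection step that yields the traffic type; both are assumptions, not structures defined in this application.

```python
def build_training_set(raw_samples, traffic_type_of):
    """Keep the samples identified as QUIC and pair each with its traffic type."""
    training_set = []
    for sample in raw_samples:
        if sample.get("protocol") != "QUIC":
            continue  # drop samples whose protocol identifier is not QUIC
        training_set.append((sample, traffic_type_of(sample)))
    return training_set


if __name__ == "__main__":
    raw = [
        {"id": 1, "protocol": "QUIC"},
        {"id": 2, "protocol": "HTTP"},
        {"id": 3, "protocol": "QUIC"},
    ]
    known_types = {1: "normal", 3: "malicious"}  # placeholder labels
    print(build_training_set(raw, lambda s: known_types[s["id"]]))
```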

In one embodiment, the obtaining the traffic type of each data traffic sample includes: decrypting each data flow sample by using a shared key generated by a key exchange algorithm to obtain the plaintext data flow of each data flow sample; and detecting the plaintext data traffic of each data traffic sample by using an intrusion detection method, and determining the traffic type of each data traffic sample.

In one embodiment, the shared key generated by the key exchange algorithm is an indispensable condition for the QUIC protocol to provide secure transmission, so that the data traffic samples can be encrypted by the shared key. Meanwhile, when the data traffic samples are received, each data traffic sample can be decrypted through a shared key generated by a key exchange algorithm, so that the plaintext data traffic of each data traffic sample is obtained.

In one embodiment, the key exchange algorithm that produces the shared key of the QUIC protocol is mainly the DH (Diffie-Hellman) algorithm, which is based on a key agreement mechanism: the terminal device and the server each generate a key that is private to itself, and each side then derives the same shared key by combining its own private key with the public value received from the other side. When, for key agreement, the server sends the data packets included in each data traffic sample to the client, or the client sends them to the server, each data traffic sample can be decrypted with the shared key generated by the DH algorithm, so as to obtain the decrypted plaintext data traffic of each data traffic sample.
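The key agreement idea can be illustrated with classic finite-field Diffie-Hellman on toy parameters. This sketch only shows why both sides arrive at the same shared value; the tiny prime, the function names, and the absence of any key derivation step are assumptions of the example, and real QUIC deployments use much stronger groups or elliptic curves.

```python
import secrets


def dh_keypair(p, g):
    """Generate a private exponent in [2, p - 2] and the matching public value."""
    private = secrets.randbelow(p - 3) + 2
    public = pow(g, private, p)
    return private, public


def dh_shared_secret(their_public, my_private, p):
    """Combine the peer's public value with one's own private exponent."""
    return pow(their_public, my_private, p)


if __name__ == "__main__":
    p, g = 23, 5  # toy parameters, far too small for real use
    client_private, client_public = dh_keypair(p, g)
    server_private, server_public = dh_keypair(p, g)
    client_key = dh_shared_secret(server_public, client_private, p)
    server_key = dh_shared_secret(client_public, server_private, p)
    print(client_key == server_key)
```

Only the public values cross the network; a party that holds one of the private values (for example a controlled terminal device used to collect samples) can recompute the shared value and decrypt the traffic.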

In one embodiment, the plaintext data traffic of each data traffic sample is detected by using an intrusion detection method, and the traffic type of each data traffic sample is determined.

Specifically, the intrusion detection method may determine whether a data traffic sample is abnormal from the plaintext data traffic of the decrypted sample, for example by using intrusion detection based on a probabilistic statistical model of user behavior, an intrusion detection method based on a neural network, an intrusion detection technique based on an expert system, an intrusion detection technique based on model inference, or the like. A data traffic sample in which an abnormality is detected is treated as malicious data traffic; otherwise, the data traffic sample is normal data traffic. A data traffic sample is abnormal when the behavior of the data traffic does not conform to the expected normal behavior pattern. The occurrence of abnormal data traffic means that unauthorized information access or data manipulation may be taking place in the network, such as denial of service (DoS) attacks that overload the corresponding servers, or worms and viruses that exploit known vulnerabilities to gain privileged access to, and attack, hosts through the network.

According to the embodiment of the application, data traffic that can be decrypted is used to distinguish malicious data traffic from normal data traffic in their pre-decryption form, which overcomes the current difficulty of collecting malicious data traffic generated based on the QUIC protocol.

In one embodiment, the data traffic to be identified comprises QUIC data traffic, and the QUIC data traffic comprises the QUIC attribute information and the domain name system information. The QUIC attribute information includes one or more of: the user agent of the client, the data fingerprint of the QUIC joint attributes, the size of the QUIC data traffic, and the time sequence of arrival of a plurality of datagrams of the QUIC data traffic. The domain name system information includes one or more of: the internet protocol number, the time to live, the Alexa website ranking, the domain name length, and the ratio of domain name numeric characters and non-alphanumeric characters.
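Combining the two groups of information into one traffic characteristic can be sketched as building a fixed-order numeric vector. The field names below, the assumption that the user agent and the fingerprint have already been encoded as integers, the use of a mean inter-arrival time to summarize the arrival time sequence, and the default value for missing fields are all hypothetical choices made for the example.

```python
# Hypothetical field names; categorical values are assumed to be pre-encoded as numbers.
QUIC_FIELDS = ("user_agent_id", "fingerprint_id", "traffic_size", "mean_interarrival")
DNS_FIELDS = ("ip_number", "ttl", "alexa_rank", "domain_length",
              "digit_ratio", "non_alnum_ratio")


def combine_features(quic_info, dns_info, missing=0.0):
    """Concatenate QUIC attribute values and DNS values in a fixed order."""
    quic_part = [float(quic_info.get(name, missing)) for name in QUIC_FIELDS]
    dns_part = [float(dns_info.get(name, missing)) for name in DNS_FIELDS]
    return quic_part + dns_part


if __name__ == "__main__":
    quic = {"user_agent_id": 3, "fingerprint_id": 17,
            "traffic_size": 4096, "mean_interarrival": 0.02}
    dns = {"ip_number": 2, "ttl": 100, "alexa_rank": 0,
           "domain_length": 16, "digit_ratio": 0.125}
    print(combine_features(quic, dns))
```

Keeping the order fixed ensures that the same position in the vector always carries the same feature, both during training and during identification.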

In one embodiment, for security purposes, usually only a small portion of the data traffic to be identified is unencrypted, mainly the data generated when the connection is established. Therefore, the unencrypted special fields included in the QUIC data traffic are extracted as the QUIC attribute information, while the domain name system information is extracted by selecting the domain name associated with the destination IP address.

In one embodiment, the handshake packet generated when a QUIC connection is established is unencrypted. Taking GQUIC version Q035 as an example, the terminal device sends a CHLO (client hello) packet containing a plurality of unencrypted fields, as shown in fig. 5, from which the following characteristic fields can be extracted as model inputs: the Server Name Indication (SNI), the user agent of the client, and the QUIC joint attributes. The QUIC joint attributes comprise one or more of: the QUIC version VER, the padding PAD, the source address token STK, the common certificate sets CCS, the client nonce NONC, the authenticated encryption algorithm AEAD, the server configuration ID SCID, the connection ID truncation TCID, the proof demand PDMD, the support for a maximum header list SMHL, the idle connection state lifetime ICSL, the client proof nonce NONP, the public value of the key exchange PUBS, the maximum incoming dynamic streams MIDS, the silent close on timeout SCLS, the key exchange algorithm KEXS, the expected leaf certificate XLCT, the signed certificate timestamp CSCT, the connection options COPT, the cached certificates CCRT, the estimated initial round-trip time IRTT, the initial session/connection flow control receive window CFCW, and the initial stream flow control receive window SFCW.

The QUIC attribute information further comprises one or more of the size of the QUIC data traffic and the time sequence of arrival of a plurality of datagrams of the QUIC data traffic. For example, when the size of QUIC data traffic is analyzed, attention is paid mainly to small packets of about 64 bytes and large packets of about 1500 bytes; in normal data traffic, the distribution ratio of these two kinds of packets should be small.

In one embodiment, the difference between malicious data traffic and normal data traffic in the domain name system information is mainly reflected in the internet protocol number, the Time To Live (TTL), the Alexa website ranking, the domain name length, and the ratio of domain name numeric characters and non-alphanumeric characters. For example, the most common TTL values in normal data traffic are 60, 300, 20, and 30. In malicious data traffic a TTL of 300 is also frequent, but about 22% of the domain name system responses have a TTL of 100, which is rare in normal data traffic. In other words, the most common TTL values in malicious domain name system response messages are 100, 300, and 60, whereas normal domain name system response messages rarely, if ever, use a TTL of 100. The domain name system information may be obtained based on the Server Name Indication (SNI) field.

In one embodiment, after the original traffic type identification model is trained to obtain the traffic type identification model, the traffic type identification model can be used to perform anomaly detection on data traffic generated based on the QUIC protocol, and the recall rate and the accuracy of the model are calculated. Feature extraction and parameter optimization are continuously improved, and, with manual intervention, the false positives and false negatives output by the model are added to the feature engineering of the traffic type identification model, so that the traffic type identification model is updated through iterative optimization.
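The evaluation step above can be sketched as follows, under the assumption that labels are the strings "malicious" and "normal". The function name evaluate and the returned index lists (intended for the manual review of false positives and false negatives) are illustrative choices, not part of this application.

```python
def evaluate(true_types, predicted_types, positive="malicious"):
    """Return recall, accuracy, and the indices of false positives and false negatives."""
    if len(true_types) != len(predicted_types):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_types, predicted_types))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_neg = [i for i, (t, p) in enumerate(pairs) if t == positive and p != positive]
    false_pos = [i for i, (t, p) in enumerate(pairs) if t != positive and p == positive]
    actual_pos = true_pos + len(false_neg)
    recall = true_pos / actual_pos if actual_pos else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0
    return recall, accuracy, false_pos, false_neg


if __name__ == "__main__":
    truth = ["malicious", "malicious", "normal", "normal"]
    predicted = ["malicious", "normal", "malicious", "normal"]
    print(evaluate(truth, predicted))
```

The samples at the returned indices are the candidates to be examined manually and fed back into the feature engineering in the next iteration.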

In an embodiment, fig. 6 provides a flow diagram of a data processing method. A data traffic capture module obtains a plurality of original data traffic samples and decrypts them. A traffic detection module is then called to identify the data traffic samples generated based on the QUIC protocol, to identify the traffic type of each such sample from the decrypted sample, and to associate the traffic type with the sample, yielding a data traffic sample set with traffic type labels. QUIC attribute information and domain name system information are extracted from the data traffic samples in the sample set and combined as the traffic characteristics of the samples. Data preprocessing, mainly the standardization processing explained in the above embodiment, is performed on the traffic characteristics. The preprocessed traffic characteristics are input into a machine learning model, which is trained according to the traffic types of the data traffic samples to obtain an output model, i.e., the traffic type identification model. When a data traffic identification task arrives, the traffic characteristics of the data traffic to be identified are extracted, and the output model is started to identify the traffic type of the data traffic, so as to judge whether the data traffic to be identified is malicious data traffic; the resulting malicious data traffic result is displayed in a result display interface. Other specific steps in the embodiments of the present application have been described in detail in the above embodiments and are not repeated here.

As shown in fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the apparatus includes:

an obtaining module 701, configured to obtain data traffic to be identified, where the data traffic to be identified is data traffic generated based on the Quick UDP Internet Connection (QUIC) protocol, and the data traffic to be identified includes QUIC attribute information and domain name system information;

a processing module 702, configured to combine the QUIC attribute information and the domain name system information in the data traffic to be identified into a traffic characteristic of the data traffic to be identified;

the processing module 702 is further configured to invoke a traffic type identification model to identify traffic characteristics of the data traffic to be identified, obtain a traffic type of the data traffic to be identified, and output the traffic type of the data traffic to be identified.

In an embodiment, the processing module 702 is specifically configured to:

carrying out standardization processing on the flow characteristics of the data flow to be identified to obtain standardized flow characteristics;

the calling the traffic type identification model to identify the traffic characteristics to obtain the traffic type of the data traffic to be identified, including:

and calling a flow type identification model to identify the standardized flow characteristics to obtain the flow type of the data flow to be identified.
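One common form of standardization is the z-score, sketched below. This passage does not state which standardization is used, so the z-score itself, the function names, and the rule of replacing a zero standard deviation by 1 are assumptions of the example.

```python
import statistics


def fit_standardizer(rows):
    """Compute the per-feature mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant feature has zero deviation; use 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(row, means, stds):
    """Map each feature value to (value - mean) / standard deviation."""
    return [(value - m) / s for value, m, s in zip(row, means, stds)]


if __name__ == "__main__":
    training_rows = [[1.0, 10.0], [3.0, 10.0]]
    means, stds = fit_standardizer(training_rows)
    print(standardize([3.0, 10.0], means, stds))
```

The means and deviations would be computed once on the training samples and then reused unchanged for the data traffic to be identified, so that both are scaled consistently.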

In an embodiment, the processing module 702 is specifically configured to:

calling a plurality of classification regression tree sets in a flow type identification model to identify the standardized flow characteristics to obtain a prediction result of the standardized flow characteristics in each classification regression tree set, wherein each prediction result comprises a matching probability set between the standardized flow characteristics and a plurality of flow types;

and determining the traffic type of the data traffic to be identified based on a plurality of prediction results.
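The last step, determining the traffic type from a plurality of prediction results, could for instance average the matching probabilities and select the most probable type. The averaging rule and the name decide_traffic_type are assumptions of this sketch; the text only states that the decision is based on the plurality of prediction results.

```python
def decide_traffic_type(prediction_results):
    """Average the per-type probabilities of all results and pick the largest.

    prediction_results: list of dicts mapping traffic type -> matching probability,
    one dict per classification and regression tree set.
    """
    if not prediction_results:
        raise ValueError("at least one prediction result is required")
    totals = {}
    for result in prediction_results:
        for traffic_type, probability in result.items():
            totals[traffic_type] = totals.get(traffic_type, 0.0) + probability
    averaged = {t: total / len(prediction_results) for t, total in totals.items()}
    best_type = max(averaged, key=averaged.get)
    return best_type, averaged


if __name__ == "__main__":
    results = [
        {"normal": 0.7, "malicious": 0.3},
        {"normal": 0.4, "malicious": 0.6},
        {"normal": 0.2, "malicious": 0.8},
    ]
    print(decide_traffic_type(results))
```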

In an embodiment, the processing module 702 is specifically configured to:

acquiring a data traffic training sample set, wherein the data traffic training sample set comprises a plurality of data traffic samples and the traffic type of each data traffic sample, any data traffic sample in the data traffic training sample set is data traffic generated based on a QUIC protocol, and any data traffic sample in the data traffic training sample set comprises QUIC attribute information and domain name system information;

combining QUIC attribute information and domain name system information in each data traffic sample into a traffic characteristic of each data traffic sample;

and training the original traffic type recognition model by using the traffic characteristics of each data traffic sample and the traffic type of each data traffic sample to obtain a traffic type recognition model.

In an embodiment, the processing module 702 is specifically configured to:

acquiring a plurality of original data traffic samples and a protocol identifier of each original data traffic sample;

extracting a plurality of data traffic samples generated by QUIC protocol from the plurality of raw data traffic samples according to the protocol identification of each raw data traffic sample;

and acquiring the traffic type of each data traffic sample, and combining the multiple data traffic samples and the traffic type of each data traffic sample into a data traffic training sample set.

In an embodiment, the processing module 702 is specifically configured to:

the obtaining of the traffic type of each data traffic sample includes:

decrypting each data flow sample by using a shared key generated by a key exchange algorithm to obtain the plaintext data flow of each data flow sample;

and detecting the plaintext data traffic of each data traffic sample by using an intrusion detection method, and determining the traffic type of each data traffic sample.

Fig. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application. As shown in fig. 8, the computer device includes: one or more processors 801, a memory 802, and a communication interface 803. The processor 801, the memory 802, and the communication interface 803 may be connected by a bus 804 or in other ways; the embodiment of the present application takes the connection by the bus 804 as an example.

The processor 801 (also referred to as a Central Processing Unit (CPU)) is the computing core and control core of the computer device; it can parse various instructions in the computer device and process various data of the computer device. For example, the CPU can parse a power-on or power-off instruction sent by a user to the computer device and control the computer device to power on or off; as another example, the CPU may transmit various types of interactive data between the internal structures of the computer device, and so on. The communication interface 803 may optionally include a standard wired interface and a wireless interface (e.g., Wi-Fi, a mobile communication interface, etc.), and transmits and receives data under the control of the processor 801. The memory 802 is the storage device of the computer device and is used for storing programs and data. It will be appreciated that the memory 802 can include both the internal memory of the computer device and the expanded memory supported by the computer device. The memory 802 provides storage space that stores the operating system of the computer device, which may include, but is not limited to, a Windows system, a Linux system, etc., which is not limited in this application.

In an embodiment, the processor 801 is specifically configured to:

the obtaining of the traffic type of each data traffic sample includes:

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed, may include the processes of the above embodiments of the data processing method. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

One or more embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps performed in the embodiments of the methods described above.

The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of data processing, the method comprising:

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein the invoking the traffic type recognition model to recognize the standardized traffic characteristics to obtain the traffic type of the data traffic to be recognized comprises:

4. The method of claims 1-3, further comprising:

5. The method of claim 4, wherein the obtaining a data traffic training sample set comprises:

6. The method of claim 5, wherein each data traffic sample is encrypted data;

the obtaining of the traffic type of each data traffic sample includes:

7. The method of claim 1, wherein data traffic to be identified comprises QUIC data traffic, said QUIC data traffic comprising said QUIC attribute information and said Domain name System information;

the QUIC attribute information includes: one or more of a user agent of a client, a data fingerprint of the QUIC joint attributes, a size of the QUIC data traffic, a time sequence of arrival of a plurality of datagrams of the QUIC data traffic;

the domain name system information includes: one or more of an internet protocol number, a time to live, an Alexa website ranking, a domain name length, a ratio of domain name numeric characters and non-alphanumeric characters.

8. A data processing apparatus, characterized in that the apparatus comprises:

9. A computer device comprising a memory, a communication interface and a processor, wherein the memory, the communication interface and the processor are connected to each other, the memory stores computer program code, and the processor calls the computer program code stored in the memory to execute the data processing method according to any one of claims 1 to 8.

10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the data processing method of any one of claims 1 to 8.