CN116155821A - ET-BERT flow classification method, storage medium and equipment based on multitask learning - Google Patents


Info

Publication number
CN116155821A
CN116155821A (application CN202310084193.2A)
Authority
CN
China
Prior art keywords
bert
bandwidth
duration
task
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310084193.2A
Other languages
Chinese (zh)
Inventor
刘兰
余永杰
吴亚峰
惠占发
陈桂铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN202310084193.2A priority Critical patent/CN116155821A/en
Publication of CN116155821A publication Critical patent/CN116155821A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L 47/2441 Traffic characterised by specific attributes, e.g. priority or QoS, relying on flow classification, e.g. using integrated services [IntServ]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses an ET-BERT traffic classification method, a storage medium and a device based on multi-task learning. The method is based on the assumption that multiple learning tasks are not completely independent and that several auxiliary tasks can promote the learning of another task through hard parameter sharing, so that, by combining a Bidirectional Encoder Representations from Transformers model (the ET-BERT model), the requirement for a large number of labeled training samples when executing the main task is reduced. The method comprises the steps of: acquiring a traffic dataset and preprocessing it; acquiring the time series features of the dataset; pre-training an ET-BERT model with bandwidth and duration prediction as auxiliary tasks according to the time series features; obtaining the optimal values of the bandwidth and duration dividers and converting them into tokens for batch optimization and training with an Adam optimizer; and fine-tuning the parameters of the pre-trained ET-BERT model and performing main-task traffic class prediction with the fine-tuned model.

Description

ET-BERT flow classification method, storage medium and equipment based on multitask learning
Technical Field
The invention relates to the technical fields of deep learning, network traffic analysis and network space security application, in particular to an ET-BERT traffic classification method, a storage medium and equipment based on multi-task learning.
Background
Network traffic classification has a wide range of applications in today's Internet, such as resource allocation, QoS provisioning, ISP billing and anomaly detection. Early approaches relied on human labor to continually find patterns in unencrypted payloads or to match port numbers. Owing to their inefficiency and poor accuracy, new approaches based on classical machine learning algorithms, such as Random Forest (RF) and K-Nearest Neighbor (KNN), emerged.
Classical machine learning algorithms achieved state-of-the-art accuracy in traffic classification tasks for several years. However, these relatively simple methods cannot capture the more complex patterns present in today's Internet traffic, so their accuracy has degraded. Recently, deep learning models have achieved state-of-the-art performance in traffic classification. Their ability to learn complex patterns and to perform automatic feature extraction makes them an ideal choice for traffic classification.
Although deep learning methods can achieve high accuracy, they require a large amount of labeled training data, and labeling is a time-consuming and cumbersome task in network traffic classification. To label each flow correctly, researchers typically isolate and capture the traffic of each class in a controlled environment with minimal background traffic, a time-consuming and laborious process. Furthermore, the traffic patterns observed in a controlled environment may differ considerably from real-world traffic, which makes inference inaccurate.
Disclosure of Invention
In order to overcome the above technical defects, the invention provides an ET-BERT traffic classification method, a storage medium and a device based on multi-task learning, which can reduce the requirement for a large number of labeled training samples in network traffic classification tasks.
In order to solve the problems, the invention is realized according to the following technical scheme:
In a first aspect, the present invention provides an ET-BERT traffic classification method based on multi-task learning, comprising the steps of:
acquiring a traffic dataset, and preprocessing the traffic dataset;
acquiring the time series features of the dataset;
pre-training an ET-BERT model with bandwidth and duration prediction as auxiliary tasks according to the time series features;
obtaining the optimal values of the bandwidth and duration dividers, and converting them into tokens for batch optimization and training with an Adam optimizer;
and fine-tuning the parameters of the pre-trained ET-BERT model, and performing main-task traffic class prediction with the fine-tuned ET-BERT model.
As an improvement to the above solution, the pre-training of the ET-BERT model further comprises multiplying the input of the traffic class softmax layer by a mask vector.
As an improvement of the above solution, obtaining the optimal values of the bandwidth and duration dividers comprises the steps of:
dividing the bandwidth and duration values into five classes and finding the average value of each class;
sorting the bandwidth class averages and taking the midpoint between every two consecutive bandwidth averages, the bandwidth midpoints being the optimal values obtained from the bandwidth dataset;
sorting the duration class averages and taking the midpoint between every two consecutive duration averages, the duration midpoints being the optimal values obtained from the duration dataset.
As an improvement of the above solution, converting the optimal values into tokens for batch optimization and training with an Adam optimizer using the Token3Embedding method comprises the steps of:
converting the optimal values of the bandwidth and duration dividers into hexadecimal sequences, and encoding the sequences;
representing the tokens with byte-pair encoding, and adding special token markers to the encoded sequence.
As an improvement of the above solution, the time series features are the packet length, arrival time and payload of the packets.
As an improvement of the above solution, the multi-task learning parameter tuning formula in the ET-BERT model is expressed as:
$$\min_{W}\ \sum_{i=1}^{N}\Big[\,l\big(\hat{y}_i^{B},\,y_i^{B}\big)+l\big(\hat{y}_i^{D},\,y_i^{D}\big)+\lambda\,l\big(\hat{y}_i^{T},\,y_i^{T}\big)\Big]+\rho\,\lVert W\rVert_{2,1}$$
wherein $l(\cdot,\cdot)$ is the cross-entropy loss function; $\lambda$ is a weight representing the importance of the main traffic class prediction task; $\rho$ is a regularization weight factor that shrinks the model coefficients and reduces model complexity to prevent overfitting; $W\in\mathbb{R}^{n\times k}$ is the weight matrix under multi-task learning, whose $i$-th row is $w_i=[W_{i,1},W_{i,2},\ldots,W_{i,k}]$ (each column playing the role of the weights in ordinary linear regression) and $\lVert W\rVert_{2,1}=\sum_{i=1}^{n}\lVert w_i\rVert_2$; $A_i$ denotes the input of the $i$-th data sample; and $\hat{y}_i^{B}$, $\hat{y}_i^{D}$ and $\hat{y}_i^{T}$ denote the respective outputs of the bandwidth, duration and traffic class prediction tasks, denoted $B$, $D$ and $T$.
In a second aspect, the present invention provides a computer-readable storage medium having stored therein at least one instruction, at least one program, code set or instruction set, which is loaded and executed by a processor to implement the ET-BERT traffic classification method based on multi-task learning as described in the first aspect.
In a third aspect, the present invention provides a device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, code set or instruction set loaded and executed by the processor to implement the ET-BERT traffic classification method based on multi-task learning as described in the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
according to the method and the device, bandwidth and duration are used as auxiliary tasks to pretrain an ET-BERT model according to the acquired time sequence characteristics, parameters of the ET-BERT model are finely adjusted, main task flow category prediction is carried out by using the ET-BERT model, flow category prediction can be improved, manual marking of a data set is reduced, and the requirement of a large number of marking training samples in network flow classification tasks is reduced.
Drawings
The invention is described in further detail below with reference to the attached drawing figures, wherein:
FIG. 1 is a flow chart of an ET-BERT flow classification method based on multi-task learning in one embodiment of the present application;
FIG. 2 is a schematic diagram of a multi-task learning in one embodiment of the present application;
FIG. 3 is a schematic flow chart of step S4 according to one embodiment of the present application;
FIG. 4 is an architecture diagram of the ET-BERT model based on the multi-task learning framework in one embodiment of the present application.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
It should be noted that, the numbers mentioned herein, such as S1 and S2 … …, are merely used as distinction between steps and do not represent that the steps must be strictly performed according to the order of the numbers.
The invention provides an ET-BERT traffic classification method based on a multi-task learning framework. The method is based on the assumption that multiple learning tasks are not completely independent and that several auxiliary tasks can promote the learning of another task through hard parameter sharing, so that, by combining a Bidirectional Encoder Representations from Transformers model (the ET-BERT model), the requirement for a large number of labeled training samples when executing the main task is reduced.
In one embodiment, as shown in fig. 1, an ET-BERT traffic classification method based on multi-task learning includes the following steps:
S1: acquiring a traffic dataset, and preprocessing the traffic dataset;
Specifically, the acquired dataset is the ISCX VPN-nonVPN dataset, captured at the University of New Brunswick, which contains the original PCAP files of several traffic types. The dataset provides fine-grained labels that allow different classifications: by application (e.g., AIM Chat, Gmail, Facebook, etc.), by traffic type (e.g., chat, streaming media, VoIP, etc.), and VPN/non-VPN. The flows are classified into 5 classes with different QoS requirements and bandwidth/duration characteristics: chat, email, file transfer, streaming media and VoIP. All flows are associated with one traffic type label; in traffic class prediction, only a small fraction of these labels is used to predict the traffic class in main-task learning.
Since the dataset is captured at the data link layer, it includes an Ethernet header. The data link header contains information about the physical link, such as the Media Access Control (MAC) address, which is necessary for forwarding frames in the network but not for traffic classification; therefore, in the preprocessing stage, the Ethernet header is first removed. In the transport layer, the header lengths of the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) differ: the former typically has a 20-byte header, the latter an 8-byte header. To make the transport layer segments uniform, zeros are appended to the end of the UDP segment header to match the length of the TCP header. The packet is then converted from bits to bytes, which helps reduce the input size. The dataset also contains some irrelevant data packets that should be discarded. In particular, it includes TCP segments in which the SYN, ACK or FIN flag is set to 1 and which contain no payload. These segments are required for the three-way handshake when setting up or closing a connection, but they carry no information about the application that generated them and can therefore be safely discarded. In addition, there are some Domain Name Service (DNS) segments in the dataset. These segments are used for hostname resolution, i.e. converting URLs to IP addresses; they are independent of either the application identity or the traffic characteristics and may therefore be omitted from the dataset.
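For illustration only, this preprocessing may be sketched in Python as follows; the Scapy library, the function name and the truncation of TCP options to the 20-byte fixed header are assumptions for the sketch, not part of the invention. Only transport-layer bytes are kept, so the Ethernet and IP headers are dropped implicitly:

```python
from scapy.all import rdpcap, IP, TCP, UDP, DNS

def preprocess(pcap_path):
    """Return cleaned transport-layer byte segments from a PCAP file."""
    cleaned = []
    for pkt in rdpcap(pcap_path):
        if not pkt.haslayer(IP) or pkt.haslayer(DNS):
            continue                          # skip non-IP frames and DNS segments
        if pkt.haslayer(TCP):
            tcp = pkt[TCP]
            payload = bytes(tcp.payload)
            # payload-less SYN/ACK/FIN handshake segments carry no app info
            if not payload and int(tcp.flags) & 0x13:
                continue
            segment = bytes(tcp)[:20] + payload            # 20-byte TCP header
        elif pkt.haslayer(UDP):
            udp = pkt[UDP]
            # pad the 8-byte UDP header with zeros to the 20-byte TCP length
            segment = bytes(udp)[:8] + b"\x00" * 12 + bytes(udp.payload)
        else:
            continue
        cleaned.append(segment)               # Ethernet/IP headers discarded
    return cleaned
```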
S2: acquiring the time series features of the dataset;
Specifically, in the traffic classification task only the first few packets of a flow are available, rather than the entire data stream; the time series features are therefore obtained by observing the first k packets, and the obtained features are the packet lengths, arrival times and payloads of those first k packets.
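A small Python sketch of gathering these three features is given below; the value k = 5, the dictionary keys and the assumption that the flow's packets are Scapy objects are illustrative, not part of the invention:

```python
from scapy.all import TCP, UDP

def first_k_features(flow_packets, k=5):
    """Collect length, arrival time and payload of the first k packets."""
    feats = []
    for pkt in flow_packets[:k]:
        if pkt.haslayer(TCP):
            payload = bytes(pkt[TCP].payload)
        elif pkt.haslayer(UDP):
            payload = bytes(pkt[UDP].payload)
        else:
            payload = b""
        feats.append({
            "length": len(pkt),               # packet length in bytes
            "arrival_time": float(pkt.time),  # capture timestamp
            "payload": payload,               # transport-layer payload
        })
    return feats
```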
S3: pre-training an ET-BERT model with bandwidth and duration prediction as auxiliary tasks according to the time series features;
Specifically, according to the time series features, bandwidth and duration prediction are used as auxiliary training tasks. An auxiliary task has two characteristics: 1. it is highly correlated with the primary traffic class task; 2. its labels are readily available. Bandwidth and duration are treated as the outputs of separate tasks, rather than as inputs as in the usual traffic classification methods.
The ET-BERT multi-task learning model architecture uses a Bidirectional Encoder Representations from Transformers (BERT) model; in natural language processing, this model achieves optimal results on many tasks. In addition, its widespread use in intersecting fields such as visual language and computer vision has demonstrated the advantage of using unlabeled data to help learn robust feature representations when labeled data are limited. The overall architecture of the method is shown in fig. 2, with the rectified linear unit (ReLU) serving as the activation function throughout the model. The bandwidth, duration and traffic class prediction tasks are denoted $B$, $D$ and $T$, respectively. There are $N$ training data, $A_i$ denotes the input of the $i$-th data sample, and $\hat{y}_i^{B}$, $\hat{y}_i^{D}$ and $\hat{y}_i^{T}$ denote the outputs of the bandwidth, duration and traffic class prediction tasks. The parameter tuning target formula of the multi-task learning method can be expressed as:
$$\min_{W}\ \sum_{i=1}^{N}\Big[\,l\big(\hat{y}_i^{B},\,y_i^{B}\big)+l\big(\hat{y}_i^{D},\,y_i^{D}\big)+\lambda\,l\big(\hat{y}_i^{T},\,y_i^{T}\big)\Big]+\rho\,\lVert W\rVert_{2,1}$$
wherein $W\in\mathbb{R}^{n\times k}$ is the weight matrix under multi-task learning, whose $i$-th row is $w_i=[W_{i,1},W_{i,2},\ldots,W_{i,k}]$, each column playing the role of the weights in ordinary linear regression, and $\lVert W\rVert_{2,1}=\sum_{i=1}^{n}\lVert w_i\rVert_2$. The $\ell_{2,1}$ term corresponds to a one-time row-wise sparsification of the parameter matrix $W$, i.e. row-wise feature selection.
Here $l$ is the cross-entropy loss function and $\lambda$ is a weight representing the importance of the traffic class prediction task. Since this task has far fewer training data samples than the two auxiliary tasks, $\lambda$ can be increased to compensate somewhat for the shortage of labeled data. Bandwidth and duration labels are available for all training data, but only a small portion of the data samples have traffic class labels.
During the training process, we multiply the input of the traffic class softmax layer by the mask vector to prevent back propagation of this task for data samples without traffic class labels.
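For illustration, the objective and the masking step may be sketched in PyTorch as follows; the function signature, default values and the use of dummy labels for masked samples are assumptions of the sketch, not part of the invention:

```python
import torch
import torch.nn.functional as F

def multitask_loss(out_b, y_b, out_d, y_d, out_t, y_t, mask, W,
                   lam=1.0, rho=1e-4):
    """Cross-entropy per head, masked and lambda-weighted main task,
    plus a row-wise l2,1 penalty on the shared weight matrix W."""
    loss_b = F.cross_entropy(out_b, y_b)
    loss_d = F.cross_entropy(out_d, y_d)
    per_sample = F.cross_entropy(out_t, y_t, reduction="none")
    # the mask vector zeroes the gradient for samples without class labels
    # (y_t may hold dummy indices for those samples)
    loss_t = (mask * per_sample).sum() / mask.sum().clamp(min=1)
    l21 = W.norm(dim=1).sum()            # sum of row-wise l2 norms of W
    return loss_b + loss_d + lam * loss_t + rho * l21
```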
S4: obtaining the optimal values of the bandwidth and duration dividers, and converting them into tokens for batch optimization and training with an Adam optimizer;
In one embodiment, as shown in fig. 3, step S4 includes the following steps:
S41: dividing the bandwidth and duration values into five classes and finding the average value of each class;
Specifically, the labels defining the bandwidth and duration classes are shown in Table 1; the bandwidth and duration values are divided into five classes, where [bw1, bw2, bw3, bw4] and [d1, d2, d3, d4] are the bandwidth and duration dividers. For example, if the bandwidth of a flow is between bw1 and bw2, class number 2 is assigned to the flow as its label. The number of classes for the bandwidth and duration prediction tasks may differ from the number of traffic classes; it may depend on the application, the scenario and the needs of the ISP. For example, an ISP that only cares about the difference between short and long flows may define only two duration classes.
Class number    Bandwidth B        Duration D
1               B < bw1            D < d1
2               bw1 < B < bw2      d1 < D < d2
3               bw2 < B < bw3      d2 < D < d3
4               bw3 < B < bw4      d3 < D < d4
5               B > bw4            D > d4
Table 1: Bandwidth and duration class definitions
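As a small illustration of Table 1, the class number for a value can be looked up against the sorted dividers; the numeric dividers below are placeholders, not values from the patent:

```python
import bisect

def to_class(value, dividers):
    """Map a bandwidth or duration value to class 1..len(dividers)+1."""
    return bisect.bisect_left(dividers, value) + 1

bw_dividers = [1e3, 1e4, 1e5, 1e6]    # placeholder [bw1, bw2, bw3, bw4]
print(to_class(5e4, bw_dividers))      # -> 3, since bw2 < B < bw3
```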
S42: sorting the bandwidth class averages from top to bottom and taking the midpoint between every two consecutive bandwidth averages, the bandwidth midpoints being the optimal values obtained from the bandwidth dataset;
S43: sorting the duration class averages from top to bottom and taking the midpoint between every two consecutive duration averages, the duration midpoints being the optimal values obtained from the duration dataset;
Specifically, to find the optimal values of the duration divider [d1, d2, d3, d4], the average duration of each class is first found. The averages are then ranked from top to bottom, and the midpoint between every two consecutive averages is taken as [d1, d2, d3, d4]. A similar approach is used to obtain the bandwidth divider. These values are the optimal values obtained from the entire dataset; the small amount of traffic-class labeled data, by contrast, is used only in the main task.
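The divider search just described can be sketched in a few lines of Python; the function assumes every class is non-empty, and the helper name is illustrative:

```python
import numpy as np

def find_dividers(values, labels, n_classes=5):
    """Average each class, sort the averages, and return midpoints of
    consecutive averages as the divider values [d1, d2, d3, d4]."""
    avgs = sorted(np.mean([v for v, c in zip(values, labels) if c == k])
                  for k in range(1, n_classes + 1))
    return [(a + b) / 2 for a, b in zip(avgs, avgs[1:])]
```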
S44: converting the optimal value of the bandwidth and duration frequency divider into hexadecimal sequences, and encoding the sequences;
s45: the token is represented by a byte pair code, and a special tag for the token is added to the code sequence.
In particular, encrypted traffic differs significantly from natural language and images in that it contains no human-understandable content or explicit semantic elements. A Token3Embedding method is therefore provided, which converts the found optimal values of the bandwidth and duration dividers into language-like tokens for batch optimization and training with an Adam optimizer.
The main principle of the Token3Embedding method is bidirectional encoding according to the contextual relationships of the traffic bytes. The found optimal values of the bandwidth and duration dividers are first converted into a hexadecimal sequence and then encoded, where each unit consists of two adjacent bytes. The tokens are then represented using byte-pair encoding, with each token ranging from 0 to 65535. The special markers [CLS], [SEP], [PAD] and [MASK] are added for the training tasks. As shown in fig. 4, the first token of each sequence is always [CLS], and the final hidden-layer state associated with this token is used to represent the complete sequence for classification tasks. The token [PAD] is a filler symbol used to meet the minimum length requirement. The token [MASK] may appear during pre-training to learn the context of the traffic. For the SBP task, the optimal values of each set of bandwidth and duration dividers are divided into two positions, and the special marker [SEP] indicates whether a segment belongs to segment A or segment B. Segment A is denoted position A and segment B is denoted position B, where position A is the network packet with the optimal bandwidth divider value and position B is the network packet with the optimal duration divider value.
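A hedged sketch of this byte tokenization is shown below: two adjacent bytes form one unit, giving token ids in 0-65535, framed by special markers. The marker ids, the maximum length and the omission of [MASK] (used only during pre-training) are assumptions of the sketch:

```python
CLS, SEP, PAD = 65536, 65537, 65538    # special-token ids are assumptions

def tokenize_bytes(data: bytes, max_len: int = 128):
    """Encode a byte sequence as 2-byte units framed by [CLS]/[SEP]."""
    units = [int.from_bytes(data[i:i + 2], "big")   # one unit = 2 bytes
             for i in range(0, len(data) - 1, 2)]
    toks = [CLS] + units[:max_len - 2] + [SEP]
    toks += [PAD] * (max_len - len(toks))           # pad to fixed length
    return toks
```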
Each token obtained by the Token3Embedding method is represented by three embeddings: a token embedding, a position embedding and a segment embedding. A complete token representation is constructed by summing the three embeddings, and the complete tokenized datagram is taken as the original input. The first set of embedding vectors is randomly initialized, with embedding dimension H = 768. After N Transformer encoder layers, the final token embedding is obtained.
Position embedding: since the transfer of traffic data is closely related to order, position embedding is used to ensure that the model learns to attend to the duration and bandwidth relationships of the tokens through their relative positions. An H-dimensional vector is assigned to each input token to represent its position in the sequence, with the embedding dimension H set to 768.
Segment embedding: segment embeddings A and B are learned from the input position sequence, with the embedding dimension set to 768.
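A minimal PyTorch sketch of the three summed embeddings is given below, with H = 768 as stated; the vocabulary size, maximum length and segment count are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Token3Embedding(nn.Module):
    """Sum of token, position and segment embeddings (sketch)."""
    def __init__(self, vocab_size=65540, max_len=128, n_segs=2, h=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, h)
        self.pos = nn.Embedding(max_len, h)
        self.seg = nn.Embedding(n_segs, h)

    def forward(self, ids, seg_ids):
        pos_ids = torch.arange(ids.size(1), device=ids.device)
        return self.tok(ids) + self.pos(pos_ids) + self.seg(seg_ids)
```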
S5: fine-tuning the parameters of the pre-trained ET-BERT model, and performing main-task traffic class prediction with the fine-tuned ET-BERT model.
In particular, fine-tuning can serve the traffic classification task well because: 1. the representations pre-trained on the auxiliary tasks are highly correlated with traffic class classification; 2. since the input of the auxiliary-task pre-trained model is at the data packet byte level, the main task of classifying packets and flows is converted into classifying the corresponding data packet byte tokens with the model; 3. the special [CLS] token output by the pre-trained model is a representation of the overall input traffic, which can be used directly for classification. Since the main-task model and the auxiliary-task pre-trained model have substantially the same structure, the task-specific data packet byte token representation is input into the pre-trained ET-BERT and all parameters of the end-to-end model are fine-tuned.
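For illustration, the fine-tuning head may be sketched as below: the [CLS] hidden state of the pre-trained encoder feeds a linear classifier and all parameters remain trainable end to end. The encoder interface and class count are assumptions, not ET-BERT's actual API:

```python
import torch.nn as nn

class FineTuneClassifier(nn.Module):
    """Classify flows from the [CLS] representation (sketch)."""
    def __init__(self, encoder, h=768, n_classes=5):
        super().__init__()
        self.encoder = encoder               # pre-trained, left trainable
        self.head = nn.Linear(h, n_classes)

    def forward(self, ids, seg_ids):
        hidden = self.encoder(ids, seg_ids)  # (batch, seq_len, h)
        return self.head(hidden[:, 0])       # logits from the [CLS] token
```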
When the number of training samples of one task in multi-task learning is significantly smaller than that of the other tasks, the shared parameters of the ET-BERT model are influenced more by the data-rich tasks during training. Increasing the weight of the loss function of the task with less data can therefore compensate for the lack of data and increase the influence of this task on the training process. Increasing λ helps the model fit the traffic class prediction task until maximum accuracy is reached; however, increasing λ further reduces the accuracy of all tasks, because when λ is very large the model heavily overfits the traffic classification training data and therefore performs poorly on the test data of all tasks. Furthermore, when λ is very large, the gradient update values for traffic class prediction become very large compared to the other tasks, which makes it extremely difficult for the training process to converge to a local minimum without fine-tuning the learning rate; this affects the execution of all tasks. Thus, for the multi-task learning method, a suitable λ value should be found as a hyper-parameter. A good starting point is to set λ to the ratio of the number of samples of the bandwidth and duration tasks to the number of samples of the traffic classification task, and to compare experimental results from there.
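Written out, this starting point is a single ratio; the sample counts below are illustrative placeholders, not values from the patent:

```python
# Starting value for the hyper-parameter lambda suggested above.
n_aux = 100_000      # samples with bandwidth/duration labels (placeholder)
n_cls = 2_000        # samples with traffic-class labels (placeholder)
lam = n_aux / n_cls  # then tune around this value experimentally
```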
According to the invention, an ET-BERT model is pre-trained with bandwidth and duration prediction as auxiliary tasks based on the acquired time series features, the parameters of the ET-BERT model are fine-tuned, and the fine-tuned model is used for main-task traffic class prediction. This improves traffic class prediction, reduces the manual labeling of datasets, and reduces the requirement for a large number of labeled training samples in network traffic classification tasks.
In one embodiment, a computer readable storage medium is provided, the computer readable storage medium storing a computer program, which when executed by a processor, causes the processor to implement the ET-BERT traffic classification method based on multi-task learning provided in the first aspect.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer-readable storage media (or non-transitory media) and communication media (or transitory media).
The term computer-readable storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer-readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
The computer readable storage medium may be an internal storage unit of the network management device according to the foregoing embodiment, for example, a hard disk or a memory of the network management device. The computer readable storage medium may also be an external storage device of the network management device, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the network management device.
In one embodiment, an apparatus is provided that includes a processor and a memory for storing a computer program; the processor is configured to execute the computer program and implement the ET-BERT traffic classification method based on the multi-task learning provided in the first aspect of the present invention when the computer program is executed.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (8)

1. An ET-BERT traffic classification method based on multi-task learning, characterized by comprising the following steps:
acquiring a traffic dataset, and preprocessing the traffic dataset;
acquiring the time series features of the dataset;
pre-training an ET-BERT model with bandwidth and duration prediction as auxiliary tasks according to the time series features;
obtaining the optimal values of the bandwidth and duration dividers, and converting them into tokens for batch optimization and training with an Adam optimizer;
and fine-tuning the parameters of the pre-trained ET-BERT model, and performing main-task traffic class prediction with the fine-tuned ET-BERT model.
2. The ET-BERT traffic classification method based on multi-task learning of claim 1, wherein the pre-training of the ET-BERT model further comprises multiplying an input of a traffic class softmax layer by a mask vector.
3. The ET-BERT traffic classification method based on multi-task learning according to claim 1, wherein obtaining the optimal values of the bandwidth and duration dividers comprises the steps of:
dividing the bandwidth and duration values into five classes and finding the average value of each class;
sorting the bandwidth class averages and taking the midpoint between every two consecutive bandwidth averages, the bandwidth midpoints being the optimal values obtained from the bandwidth dataset;
sorting the duration class averages and taking the midpoint between every two consecutive duration averages, the duration midpoints being the optimal values obtained from the duration dataset.
4. The ET-BERT traffic classification method based on multi-task learning according to claim 1, wherein converting the optimal values into tokens for batch optimization and training with an Adam optimizer by the Token3Embedding method comprises the steps of:
converting the optimal values of the bandwidth and duration dividers into hexadecimal sequences, and encoding the sequences;
representing the tokens with byte-pair encoding, and adding special token markers to the encoded sequence.
5. The ET-BERT traffic classification method based on multi-task learning according to any one of claims 1 to 4, wherein the time series features are the packet length, arrival time and payload of the packets.
6. The ET-BERT traffic classification method based on multi-task learning according to claim 5, wherein the multi-task learning objective formula of the ET-BERT model is expressed as:
$$\min_{W}\ \sum_{i=1}^{N}\Big[\,l\big(\hat{y}_i^{B},\,y_i^{B}\big)+l\big(\hat{y}_i^{D},\,y_i^{D}\big)+\lambda\,l\big(\hat{y}_i^{T},\,y_i^{T}\big)\Big]+\rho\,\lVert W\rVert_{2,1}$$
wherein $l$ is the cross-entropy loss function; $\lambda$ is the weight of the importance of the main traffic class prediction task; $\rho$ is a regularization weight factor for reducing the model coefficients and the model complexity, preventing overfitting; $W\in\mathbb{R}^{n\times k}$ is the weight matrix under multi-task learning, whose $i$-th row is $w_i=[W_{i,1},W_{i,2},\ldots,W_{i,k}]$, with $\lVert W\rVert_{2,1}=\sum_{i=1}^{n}\lVert w_i\rVert_2$; $A_i$ denotes the input of the $i$-th data sample; and $\hat{y}_i^{B}$, $\hat{y}_i^{D}$ and $\hat{y}_i^{T}$ denote the respective outputs of the bandwidth, duration and traffic class prediction tasks, denoted $B$, $D$ and $T$.
7. A computer-readable storage medium having stored therein at least one instruction, at least one program, code set or instruction set loaded and executed by a processor to implement the ET-BERT traffic classification method based on multi-task learning according to any one of claims 1 to 6.
8. A device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, code set or instruction set loaded and executed by the processor to implement the ET-BERT traffic classification method based on multi-task learning according to any one of claims 1 to 6.
CN202310084193.2A 2023-01-16 2023-01-16 ET-BERT flow classification method, storage medium and equipment based on multitask learning Pending CN116155821A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310084193.2A CN116155821A (en) 2023-01-16 2023-01-16 ET-BERT flow classification method, storage medium and equipment based on multitask learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310084193.2A CN116155821A (en) 2023-01-16 2023-01-16 ET-BERT flow classification method, storage medium and equipment based on multitask learning

Publications (1)

Publication Number Publication Date
CN116155821A 2023-05-23

Family

ID=86350258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310084193.2A Pending CN116155821A (en) 2023-01-16 2023-01-16 ET-BERT flow classification method, storage medium and equipment based on multitask learning

Country Status (1)

Country Link
CN (1) CN116155821A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220398462A1 (en) * 2021-06-14 2022-12-15 Microsoft Technology Licensing, Llc. Automated fine-tuning and deployment of pre-trained deep learning models
CN115118653A (en) * 2022-08-26 2022-09-27 南京可信区块链与算法经济研究院有限公司 Real-time service traffic classification method and system based on multi-task learning
CN115563533A (en) * 2022-09-23 2023-01-03 哈尔滨理工大学 Encrypted flow classification system, method, computer and storage medium based on multi-task learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XINJIE LIN et al.: "ET-BERT: A Contextualized Datagram Representation with Pre-training Transformers for Encrypted Traffic Classification", pages 1-4, retrieved from the Internet: <URL:http://arxiv.org/abs/2202.06335> *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230523