CN109873774B - Network traffic identification method and device - Google Patents


Info

Publication number: CN109873774B
Application number: CN201910036196.2A
Authority: CN (China)
Prior art keywords: sample, model, data stream, cluster, recognition model
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN109873774A
Inventors: 廖青, 赵晶玲, 李天琦, 刘月
Current Assignee: Beijing University of Posts and Telecommunications (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Beijing University of Posts and Telecommunications
Application filed by: Beijing University of Posts and Telecommunications
Priority: CN201910036196.2A
Publication of application: CN109873774A
Publication of grant: CN109873774B
Legal events: application granted; active; anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the invention provides a network traffic identification method and apparatus. The method comprises the following steps: when the current data stream has been received, extracting the packet header data of the data packets in the current data stream as a first sample; inputting the first sample into a semi-supervised model, and using the semi-supervised model to output the category of the first sample and the result of whether the first sample is located within the boundary distance of a cluster; when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model and taking the machine recognition model with the added output node as an online recognition model; and identifying the category of the next data stream after the current data stream with the online recognition model. Compared with the prior art, the method and apparatus change the structure of the machine recognition model and use the restructured model to identify the category of the next data stream after the current data stream, which improves the real-time performance of identifying the category of data streams.

Description

Network traffic identification method and device
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a network traffic identification method and apparatus.
Background
Traffic is the main carrier of data transmitted in a network, and traffic identification is a key link in network monitoring: only after traffic has been identified can different monitoring strategies be adopted for different traffic, for example rejection, optimization, marking, or priority classification. Identification of network traffic is therefore critical. Generally, network traffic is transmitted in the form of data streams; each data stream includes a plurality of data packets, each data packet includes header data of a fixed number of bytes, and features can be obtained from the header data, including the packet time interval, the stream duration, and the mean and variance of the packet size.
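The header-derived features named above (time interval, stream duration, mean and variance of packet size) can be computed directly from per-packet timestamps and sizes. A minimal sketch follows; the function name and feature keys are illustrative, not taken from the patent:

```python
from statistics import mean, pvariance

def flow_features(times, sizes):
    """Compute the flow features named in the background section from
    per-packet arrival times (seconds) and packet sizes (bytes)."""
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    return {
        'duration': times[-1] - times[0],          # stream duration
        'mean_gap': mean(gaps) if gaps else 0.0,   # mean packet time interval
        'mean_size': mean(sizes),                  # mean of packet size
        'var_size': pvariance(sizes),              # variance of packet size
    }
```

For example, a three-packet flow arriving at 0 s, 1 s, and 3 s with sizes 100, 200, and 300 bytes has a duration of 3 s and a mean inter-packet gap of 1.5 s.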
In the prior art, network traffic identification adopts a machine learning-based method: the features of network packet header data are mined with machine learning techniques, a machine learning model is obtained by training, a data stream is input into the trained machine learning model, and the category of the online network traffic is output. The machine learning model is obtained by training as follows: first, the features of the packet header data of the data packets in the whole data stream are counted; then, the features of all or part of the packet header data in the whole data stream are selected as a sample, and the model is trained on such samples. The machine learning model obtained this way is an offline model, and its internal structure is fixed.
Because real-time changes in the network environment change the characteristics of data streams, a machine learning model with a fixed internal structure cannot track them, so the real-time performance of identifying the category of online network traffic in the prior art is low.
Disclosure of Invention
The embodiment of the invention aims to provide a network traffic identification method and apparatus that improve the real-time performance of identifying the category of data streams. The specific technical scheme is as follows:
in a first aspect, a method for identifying network traffic provided in an embodiment of the present invention is applied to a server, and the method includes:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream as a first sample;
inputting the first sample into a semi-supervised model, and outputting the category of the first sample and a result of whether the first sample is located within the boundary distance of the cluster by using the semi-supervised model; the semi-supervised model is obtained by utilizing a first training sample set for training and comprises the distribution relation between the category of the obtained packet header data and other samples in the first training sample set; the first training sample set comprises samples with at least one class label; determining whether the samples with the class labels are positioned in the boundary distance of the clusters according to the distribution relation;
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, adding an output node in output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as an online recognition model;
the category of the next data stream after the current data stream is identified using an online identification model.
Optionally, in a case that receiving the current data stream is completed, before the step of extracting the data of the packet header in the current data stream as the first sample, the method further includes:
sequentially receiving data packets of a current data stream and acquiring quintuple information of the data packets;
judging whether the database stores quintuple information or not, if the database stores the quintuple information, storing packet header data of the data packet to a storage area of a path corresponding to the quintuple information;
and if the database does not store the quintuple information, creating a storage area of a path corresponding to the quintuple information, and storing the packet header data of the data packet into the storage area of the path corresponding to the quintuple information.
Optionally, under the condition that receiving the current data stream is completed, extracting header data of a data packet in the current data stream as a first sample, including:
judging whether each data packet of the current data stream contains an end identifier; if a data packet contains the end identifier, receiving of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
Optionally, under the condition that receiving the current data stream is completed, extracting header data of a data packet in the current data stream as a first sample, including:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream;
encoding packet header data of a data packet in a current data stream to obtain a fixed-dimension vector, and taking the fixed-dimension vector as a first sample.
Optionally, inputting the first sample into a semi-supervised model, and outputting a result of whether the first sample is located within the boundary distance of the cluster by using the semi-supervised model, including:
inputting the first sample into a semi-supervised model, and outputting the category of the first sample by using the semi-supervised model;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking the sample of which the local density exceeds a density threshold value and the minimum distance exceeds a distance threshold value as a third sample; the first training sample set is a set formed by samples used for training the semi-supervised model;
adding a third sample to the cluster and determining the third sample as a cluster center point of the cluster; the number of the clusters is the same as that of the third samples, and each cluster is provided with only one third sample;
if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, judging that the first sample is not positioned in the boundary distance of the cluster;
if the distance of the first sample from the third sample does not exceed the cluster boundary distance, the first sample is determined to be within the cluster boundary distance.
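The cluster-center selection and boundary test described in the steps above can be sketched as follows. This is an illustrative reading of CFSFDP-style center picking (local density counted within a cutoff distance d_c, minimum distance to any denser sample); the function names, the cutoff distance, and all threshold values are assumptions, not the patent's parameters:

```python
import math

def cfsfdp_centers(samples, d_c, density_thr, dist_thr):
    """Third samples / cluster centers: samples whose local density and
    minimum distance to any denser sample both exceed their thresholds."""
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # local density: number of neighbours closer than the cutoff d_c
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # minimum distance to a strictly denser sample
    # (for the densest sample, use its largest distance instead)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return [i for i in range(n)
            if rho[i] > density_thr and delta[i] > dist_thr]

def within_boundary(x, centers, samples, boundary):
    """A first sample lies within a cluster iff its distance to some
    cluster center does not exceed that cluster's boundary distance."""
    return any(math.dist(x, samples[c]) <= boundary for c in centers)
```

With two well-separated point groups, the densest point of each group is selected as a center, and a new sample is inside a cluster only when it falls within the boundary distance of a center.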
Optionally, when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to output nodes of a preset machine recognition model, and using the machine recognition model after the output node is added as an online recognition model, including:
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample with the same category as the first sample exists in the second training sample set, judging that the first sample is not a sample of a new category, and updating the parameters of a preset machine recognition model; the second training sample set is a set formed by data streams used for training the machine recognition model;
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample which is the same as the first sample in category does not exist in the second training sample set, the first sample is judged to be a sample of a new category, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output nodes is used as an online recognition model.
Optionally, when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to output nodes of a preset machine recognition model, and using the machine recognition model after the output node is added as an online recognition model, including:
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, increasing the parameter dimension in a preset machine identification model by one dimension, and taking the machine identification model with the increased parameter dimension as an online identification model.
Optionally, when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to output nodes of a preset machine recognition model, and using the machine recognition model after the output node is added as an online recognition model, including:
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, adding an output node in output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as a basic recognition model;
inputting the first sample into a basic recognition model, and calculating partial derivatives of a loss function of the basic recognition model to the weight and the bias of an output layer in the basic recognition model;
in the gradient descent direction, updating the weight and the bias of the output layer of the basic recognition model by using a parameter updating formula; the parameter updating formula contains the products of a proficiency increment with the partial derivatives of the loss function of the basic recognition model with respect to the weight and the bias of the output layer of the basic recognition model;
and determining the basic recognition model after updating the weight and the bias as an online recognition model.
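The structural change described in the steps above can be sketched in miniature: the output layer grows by one zero-initialized node (so existing class scores are unchanged), and only the output-layer weight and bias are updated along the gradient of a softmax cross-entropy loss. The class indices and the fixed learning rate are illustrative; the patent's proficiency-based step size is not reproduced here:

```python
import math

class OutputLayer:
    """Softmax output layer of the recognition model: one weight row
    and one bias per output node (class)."""
    def __init__(self, w, b):
        self.w, self.b = [row[:] for row in w], b[:]

    def add_output_node(self):
        # grow the model by one class; zero initialisation leaves the
        # logits of the existing classes untouched
        self.w.append([0.0] * len(self.w[0]))
        self.b.append(0.0)

    def forward(self, h):
        z = [sum(wi * hi for wi, hi in zip(row, h)) + bi
             for row, bi in zip(self.w, self.b)]
        m = max(z)                        # stabilised softmax
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        return [v / s for v in e]

    def sgd_step(self, h, label, lr):
        # cross-entropy gradient at the output layer: dL/dz_k = p_k - y_k
        p = self.forward(h)
        for k in range(len(self.b)):
            g = p[k] - (1.0 if k == label else 0.0)
            self.b[k] -= lr * g
            for j in range(len(h)):
                self.w[k][j] -= lr * g * h[j]
```

After adding the node, a few gradient steps on a sample of the new category raise that category's output probability while the probabilities still sum to one.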
Optionally, identifying the category of the next data stream after the current data stream by using an online identification model includes:
under the condition that the number of data packets in the next data stream after the current data stream reaches the preset number, extracting data of a data packet header in the next data stream as a second sample;
and inputting the second sample into the online identification model, and outputting the category of the second sample by using the online identification model.
In a second aspect, an embodiment of the present invention provides a network traffic identification apparatus, which is applied to a server, and includes:
the device comprises a sample module, a data processing module and a data processing module, wherein the sample module is used for extracting packet header data of a data packet in a current data stream as a first sample under the condition that the current data stream is received;
the monitoring module is used for inputting the first sample into the semi-supervised model and outputting, by using the semi-supervised model, the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster; the semi-supervised model is obtained by training with a first training sample set and comprises the distribution relation between the category of the obtained packet header data and the other samples in the first training sample set; the first training sample set comprises samples with at least one class label; whether the samples with the class labels are located within the boundary distance of the clusters is determined according to the distribution relation;
the modification module is used for adding an output node to the output nodes of the preset machine recognition model if the first sample is a sample of a new category while the first sample is located within the boundary distance of the cluster, and taking the machine recognition model with the added output node as the online recognition model;
and the identification module is used for identifying the category of the next data stream after the current data stream by using the online identification model.
Optionally, the network traffic identification apparatus provided in the embodiment of the present invention further includes:
the storage unit is used for sequentially receiving the data packets of the current data stream and acquiring quintuple information of the data packets;
judging whether the database stores quintuple information or not, if the database stores the quintuple information, storing packet header data of the data packet to a storage area of a path corresponding to the quintuple information;
and if the database does not store the quintuple information, creating a storage area of a path corresponding to the quintuple information, and storing the packet header data of the data packet into the storage area of the path corresponding to the quintuple information.
Optionally, the sample module is specifically configured to:
judging whether each data packet of the current data stream contains an end identifier; if a data packet contains the end identifier, receiving of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
Optionally, the sample module is specifically configured to:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream;
encoding packet header data of a data packet in a current data stream to obtain a fixed-dimension vector, and taking the fixed-dimension vector as a first sample.
Optionally, the monitoring module is specifically configured to:
inputting the first sample into a semi-supervised model, and outputting the category of the first sample by using the semi-supervised model;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking the sample of which the local density exceeds a density threshold value and the minimum distance exceeds a distance threshold value as a third sample; the first training sample set is a set formed by samples used for training the semi-supervised model;
adding a third sample to the cluster and determining the third sample as a cluster center point of the cluster; the number of the clusters is the same as that of the third samples, and each cluster is provided with only one third sample;
if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, judging that the first sample is not positioned in the boundary distance of the cluster;
if the distance of the first sample from the third sample does not exceed the cluster boundary distance, the first sample is determined to be within the cluster boundary distance.
Optionally, the modification module is specifically configured to:
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample with the same category as the first sample exists in the second training sample set, judging that the first sample is not a sample of a new category, and updating the parameters of a preset machine recognition model; the second training sample set is a set formed by data streams used for training the machine recognition model;
under the condition that the first sample is located within the boundary distance of the cluster, if a second sample which is the same as the first sample in category does not exist in the second training sample set, the first sample is judged to be a sample of a new category, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output nodes is used as an online recognition model.
Optionally, the modification module is specifically configured to:
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, increasing the parameter dimension in a preset machine identification model by one dimension, and taking the machine identification model with the increased parameter dimension as an online identification model.
Optionally, the modification module is specifically configured to:
under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, adding an output node in output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as a basic recognition model;
inputting the first sample into a basic recognition model, and calculating partial derivatives of a loss function of the basic recognition model to the weight and the bias of an output layer in the basic recognition model;
in the gradient descent direction, updating the weight and the bias of the output layer of the basic recognition model by using a parameter updating formula; the parameter updating formula contains the products of a proficiency increment with the partial derivatives of the loss function of the basic recognition model with respect to the weight and the bias of the output layer of the basic recognition model;
and determining the basic recognition model after updating the weight and the bias as an online recognition model.
Optionally, the identification module is specifically configured to:
under the condition that the number of data packets in the next data stream after the current data stream reaches the preset number, extracting data of a data packet header in the next data stream as a second sample;
and inputting the second sample into the online identification model, and outputting the category of the second sample by using the online identification model.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute a network traffic identification method as described in any one of the above.
In yet another aspect of the present invention, the present invention further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute any one of the above described network traffic identification methods.
In the network traffic identification method and apparatus provided by the embodiment of the invention, when the current data stream has been received, the packet header data of the data packets in the current data stream is extracted as a first sample; the first sample is input into a semi-supervised model, which outputs the category of the first sample and the result of whether the first sample is located within the boundary distance of a cluster; when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, an output node is added to the output nodes of a preset machine recognition model, and the machine recognition model with the added output node is taken as an online recognition model; the category of the next data stream after the current data stream is identified with the online recognition model. Compared with the prior art, the category of the current data stream is identified by the semi-supervised model, whether the current data stream is a sample of a new category is judged from the identified category, the structure of the machine recognition model is changed accordingly, and the machine learning model with the changed structure is used as the online recognition model to identify the category of the next data stream after the current data stream, which improves the real-time performance of identifying the category of data streams.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart of a network traffic identification method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating storing a current data stream according to an embodiment of the present invention;
fig. 3 is a diagram of a preset semi-supervised model architecture according to an embodiment of the present invention;
FIG. 4 is a network architecture diagram of an LSTM encoder loop provided by an embodiment of the present invention;
FIG. 5 is a diagram of the internal structure of an LSTM encoder according to an embodiment of the present invention;
FIG. 6 is a block diagram of a default machine identification model according to an embodiment of the present invention;
FIG. 7 is a block diagram of an online identification model provided by an embodiment of the present invention;
FIG. 8 is a graph of the effect of proficiency as a function of different parameters provided by embodiments of the present invention;
FIG. 9 is a diagram illustrating the effect of the proficiency function on different parameters on the horizontal axis and the vertical axis, respectively, according to an embodiment of the present invention;
FIG. 10 is a graph illustrating the effect of proficiency function at different β values provided by embodiments of the present invention;
fig. 11 is a structural diagram of a network traffic recognition apparatus according to an embodiment of the present invention;
fig. 12 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
In the network traffic identification method and apparatus provided by the embodiment of the invention, when the current data stream has been received, the packet header data of the data packets in the current data stream is extracted as a first sample; the first sample is input into a semi-supervised model, which outputs the category of the first sample and the result of whether the first sample is located within the boundary distance of a cluster; when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, an output node is added to the output nodes of a preset machine recognition model, and the machine recognition model with the added output node is taken as an online recognition model; the category of the next data stream after the current data stream is identified with the online recognition model.
First, a method for identifying network traffic according to an embodiment of the present invention is described below.
As shown in fig. 1, a network traffic identification method provided in an embodiment of the present invention is applied to a server, and the method includes:
s101, extracting header data of a data packet in a current data stream as a first sample under the condition that the current data stream is received;
before the step of S101, the method for identifying network traffic provided in the embodiment of the present invention further includes storing the current data stream:
as shown in fig. 2, the storing the current data stream includes:
s201, receiving data packets of the current data stream in sequence, and acquiring quintuple information of the data packets;
The quintuple information comprises: source IP address, destination IP address, source port number, destination port number, and transport layer protocol.
S202, judging whether the database stores quintuple information or not, and if the database stores the quintuple information, storing packet header data of the data packet to a storage area of a path corresponding to the quintuple information;
s203, if the database does not store the quintuple information, a storage area of a path corresponding to the quintuple information is created, and the header data of the data packet is stored in the storage area of the path corresponding to the quintuple information.
By storing the packet header data of each data packet in the storage area corresponding to its quintuple information, the embodiment of the invention improves the efficiency of looking up the packet header data of the data packets belonging to the same data stream.
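Steps S201 to S203 amount to a flow table keyed by the quintuple. A minimal in-memory sketch follows, with a dict standing in for the database and the per-path storage areas (the names are illustrative):

```python
# In-memory stand-in for the database of steps S201-S203: each
# quintuple keys the storage area holding that flow's packet headers.
flows = {}

def store_header(src_ip, dst_ip, src_port, dst_port, proto, header):
    """Store a packet's header under its flow's quintuple, creating
    the storage area the first time the quintuple is seen (S203)."""
    key = (src_ip, dst_ip, src_port, dst_port, proto)
    if key not in flows:       # quintuple not yet stored in the database
        flows[key] = []        # create the corresponding storage area
    flows[key].append(header)  # S202: file the header under the flow
    return key
```

Grouping by quintuple makes retrieving all headers of one data stream a single dictionary access, which is exactly the lookup-efficiency gain the paragraph above describes.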
In order to improve the real-time performance of identifying the category of the data stream, the first sample in S101 above may be obtained in at least one of the following embodiments:
in a possible implementation manner, by judging whether each data packet of the current data stream contains an end identifier, if one data packet contains the end identifier, the receiving of the data stream is completed; the header data of the data packets in the data stream is extracted as a first sample.
It can be understood that each data packet contains an end identifier, and if transmission of a data stream is ended, the end identifier of the last data packet in the data stream is changed.
In one possible embodiment, the first sample is obtained by:
the method comprises the following steps: under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream;
step two: encoding packet header data of a data packet in a current data stream to obtain a fixed-dimension vector, and taking the fixed-dimension vector as a first sample.
It is understood that a network flow is composed of a series of data packets; each packet has a regular header format containing a fixed number of bytes with different field values. For example, for a TCP packet, the header data is 54 bytes excluding optional fields: a 14-byte frame header, a 20-byte IP header, and a 20-byte TCP header. If a stream has p data packets and each packet header contains q bytes, converting each byte into an unsigned integer and taking each packet header as one row yields a fixed-dimension vector X ∈ R^(p×q) whose elements are integers in [0, 255]. Therefore, in this embodiment, the header data is encoded to obtain a fixed-dimension vector, where the fixed dimension is p×q, and the fixed-dimension vector is used as the first sample, improving the efficiency of identifying the category of the first sample.
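The byte-wise encoding just described can be sketched as follows: each of the p packet headers becomes one row of q unsigned integers in [0, 255]. Zero-padding headers shorter than q bytes is an assumption for the sketch; the patent does not specify how short headers are handled:

```python
def encode_flow(packet_headers, q=54):
    """Encode a flow's p packet headers into a p x q matrix of unsigned
    integers in [0, 255]: one header per row, one byte per column
    (q=54 matches the TCP header size quoted above)."""
    matrix = []
    for hdr in packet_headers:
        row = list(hdr[:q])          # bytes -> unsigned ints in [0, 255]
        row += [0] * (q - len(row))  # zero-pad short headers (assumption)
        matrix.append(row)
    return matrix
```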
S102, inputting the first sample into a semi-supervised model, and outputting the type of the first sample and a result of whether the first sample is positioned in the boundary distance of the cluster by using the semi-supervised model;
the semi-supervised model is obtained by utilizing a first training sample set for training and comprises the distribution relation between the category of the obtained packet header data and other samples in the first training sample set; the first training sample set comprises samples with at least one class label; the distribution relationship determines the result of whether the class-labeled sample lies within the cluster boundary distance.
It can be understood that the first training sample set required for training the semi-supervised model includes a part of samples with class labels, and the samples with class labels are packet header data of the obtained data stream after a class is determined. The remainder are samples without class labels. And dividing the samples of the first training sample set into a plurality of clusters, wherein the distribution of the samples of the first training sample set in each cluster is determined, and then the distribution of samples with class labels and samples without class labels in each cluster is determined. Therefore, the semi-supervised model is obtained by training the sample with the class label, and the semi-supervised model comprises the result of whether the sample with the class label is positioned within the boundary distance of the cluster.
In one possible embodiment, the semi-supervised model may be obtained by:
firstly, a first training sample set sample trains a preset semi-supervised model to obtain a trained semi-supervised model;
as shown in fig. 3, the preset semi-supervised model is composed of an LSTM (Long Short-Term Memory) encoder, a softmax layer, and a CFSFDP (Clustering by Fast Search and Find of Density Peaks) layer. First, samples carrying class labels are input into the LSTM encoder, which encodes a data stream of unfixed length into a fixed-dimension vector; this vector summarizes the output over the whole data stream sequence, so it can represent the characteristics of all data packets of the stream. The softmax layer maps the fixed-dimension vector to a fixed set of classes and outputs the class of the packet header data. The softmax layer is then removed from the preset semi-supervised model, the samples carrying class labels and the samples not carrying class labels are input into the LSTM encoder, and the output of the LSTM encoder is input into the CFSFDP clustering layer, which uses the CFSFDP algorithm to determine whether a fixed-dimension vector is the cluster center point of a cluster and to determine the class of each sample.
As shown in fig. 4 and 5, the process of encoding the data stream into a fixed-dimension vector by the LSTM encoder is as follows:
For example, for a data stream x_0, x_1, …, x_{t−1}, x_t, the packets are sequentially input into the recurrent network structure of fig. 4. Each input x_t produces an output h_t while the current state is passed on to the next step, so the output h_t contains not only the information of x_t but also that of x_0 through x_{t−1}.
The internal structure of the LSTM encoder is shown in fig. 5. At each step the encoder receives the current input x_t, the previous output h_{t−1}, and the previous cell state C_{t−1}. Assume the output h_t of each step has 128 dimensions and a certain data stream contains n data packets in total; all packets of the whole stream are taken as one sequence, the header data of each packet is one x_t, and the encoder output after the last input x_n of the stream is h_n.
Here x_t represents one packet header; t is the serial number of the packet header; n is the total number of data packets in a data stream; h_t is the output of the LSTM encoder when its input is x_t; h_n is the last output of the LSTM encoder after the whole data stream has been input, i.e., the encoded fixed-dimension vector.
Inputting the test samples in the test sample set into the trained semi-supervised model, and outputting the types of the test samples in the test sample set by using the trained semi-supervised model;
it can be understood that the class labels of the header data in one data stream are the same, the test sample is the header data of the whole data stream, and the class label identifies the class of the test sample.
Determining whether the trained semi-supervised model meets the test index based on the categories output by the trained semi-supervised model for the test samples in the test sample set and the label categories of those test samples;
wherein the test indexes are: the accuracy reaches an accuracy threshold, the precision reaches a precision threshold, the recall reaches a recall threshold, the F1 score reaches an F1 score threshold, and/or the F_β score reaches an F_β score threshold.
Here the accuracy threshold, precision threshold, recall threshold, F1 score threshold, and F_β score threshold are preset values, and the accuracy, precision, recall, F1 score, and F_β score are each calculated with their respective standard formulas.
If the trained semi-supervised model does not meet the test index, updating the parameters of the LSTM encoder in the trained semi-supervised model until the trained semi-supervised model meets the test index;
wherein the output formulas of the encoder are:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)

C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)

C_t = f_t · C_{t−1} + i_t · C̃_t

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)

h_t = o_t · tanh(C_t)

The parameters of the LSTM encoder are W_* and b_*, where W_* represents a weight parameter and b_* a bias parameter. f_t is the activation value of the forget gate at the current step; σ(·) is the Sigmoid function; W_f is the weight of the forget gate; h_{t−1} is the output of the previous step; x_t is the input of the current step; b_f is the bias of the forget gate; i_t is the activation value of the input gate at the current step; W_i is the weight of the input gate; b_i is the bias of the input gate; C̃_t is the intermediate (candidate) state of the current step; W_C is the state weight; b_C is the state bias; C_{t−1} is the state of the previous step; C_t is the state of the current step; o_t is the activation value of the output gate at the current step; W_o is the output gate weight; b_o is the output gate bias; i denotes the input gate, f the forget gate, o the output gate, C the state, and t the serial number of the data packet.
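A minimal sketch of one encoder step implementing the gate equations above (weights and inputs are illustrative, and the input/hidden sizes are reduced to 1 for readability; a real encoder uses 128-dimensional outputs and matrix parameters):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with scalar input and scalar hidden state.

    params holds (w_h, w_x, b) for each of the forget gate f, input
    gate i, candidate state C, and output gate o; the concatenation
    [h_{t-1}, x_t] becomes w_h*h_prev + w_x*x_t in the scalar case.
    """
    (wf, uf, bf), (wi, ui, bi), (wc, uc, bc), (wo, uo, bo) = params
    f_t = sigmoid(wf * h_prev + uf * x_t + bf)        # forget gate
    i_t = sigmoid(wi * h_prev + ui * x_t + bi)        # input gate
    c_tilde = math.tanh(wc * h_prev + uc * x_t + bc)  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                # new cell state
    o_t = sigmoid(wo * h_prev + uo * x_t + bo)        # output gate
    h_t = o_t * math.tanh(c_t)                        # new output
    return h_t, c_t

# Feed a "stream" of 3 scalar packet features; the final h is h_n.
params = [(0.5, 0.1, 0.0)] * 4
h, c = 0.0, 0.0
for x in [0.2, 0.7, 0.1]:
    h, c = lstm_step(x, h, c, params)
```

The final `h` after the loop plays the role of h_n: a fixed-dimension summary of the whole input sequence.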
If the trained semi-supervised model does not meet the test index, the parameters W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o of the LSTM encoder in the trained semi-supervised model are updated.
And step five, determining the trained semi-supervised model meeting the test index as the semi-supervised model.
According to the embodiment, the accuracy of determining the first sample category can be improved by determining the trained preset semi-supervised model meeting the test index as the semi-supervised model.
S103, under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, adding an output node in output nodes of a preset machine recognition model, and taking the machine recognition model with the added output node as an online recognition model;
and S104, identifying the category of the next data stream after the current data stream by using the online identification model.
Compared with the prior art, in the method and device of the present application the category of the current data stream is identified by the semi-supervised model, and whether the current data stream is a sample of a new category is judged based on the identified category. The structure of the machine recognition model is then changed, and the machine learning model with the changed structure is used as the online recognition model to identify the category of the next data stream after the current data stream. This improves the ability of the online recognition model to adapt to the network environment, and the real-time performance of identifying the category of a data stream can be improved.
In order to improve the real-time property of identifying the category of the data stream, at least one embodiment may be adopted in the above S102 to obtain the result of the category of the first sample and whether the first sample is located within the boundary distance of the cluster:
in one possible embodiment, the result of the class of the first sample and whether the first sample is located within the boundary distance of the cluster is obtained by:
the method comprises the following steps: inputting the first sample into a semi-supervised model, and outputting the category of the first sample by using the semi-supervised model;
step two: calculating the local density and the minimum distance of each sample in the first training sample set, and taking the sample of which the local density exceeds a density threshold value and the minimum distance exceeds a distance threshold value as a third sample; the first training sample set is a set formed by samples used for training the semi-supervised model;
step three: adding a third sample to the cluster and determining the third sample as a cluster center point of the cluster; the number of the clusters is the same as that of the third samples, and each cluster is provided with only one third sample;
step four: if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, judging that the first sample is not positioned in the boundary distance of the cluster;
the cluster boundary delimits a spherical region whose center is the cluster center point of the cluster and whose radius is the boundary distance; a sample is within the boundary distance of the cluster if it falls inside this region.
Step five: if the distance of the first sample from the third sample does not exceed the cluster boundary distance, the first sample is determined to be within the cluster boundary distance.
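The decision of steps four and five can be sketched as follows (the helper name and values are illustrative; it assumes, as stated in this embodiment, that the boundary distance equals m times the truncation distance d_c):

```python
import math

def within_boundary(sample, centers, d_c, m):
    """Return (in_boundary, nearest_center_index).

    A sample lies within a cluster's boundary distance if its distance
    to the nearest cluster center (third sample) does not exceed m * d_c.
    """
    dists = [math.dist(sample, c) for c in centers]
    k = dists.index(min(dists))
    return dists[k] <= m * d_c, k

centers = [(0.0, 0.0), (5.0, 5.0)]          # cluster center points
inside, k = within_boundary((0.3, 0.4), centers, d_c=0.5, m=2)   # dist 0.5 <= 1.0
outside, _ = within_boundary((2.5, 2.5), centers, d_c=0.5, m=2)  # dist ~3.54 > 1.0
```

A sample judged inside would then be assigned to the cluster of center `k`; one judged outside is not located within any cluster's boundary distance.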
First, denote the first training sample set as S = {x_j | j ∈ I_S}, where I_S = {1, 2, …, n}, and let d_ij denote the distance between samples x_i and x_j. For each sample x_i, the local density φ_i and the minimum distance θ_i are calculated, giving Φ = {φ_i | i ∈ I_S} and Θ = {θ_i | i ∈ I_S}. Here I_S is an index set; i and j are positive integers; Φ is the set of local densities and Θ the set of minimum distances.
The local density φ_i is calculated as:

φ_i = Σ_{j≠i} χ(d_ij − d_c)

or, using a Gaussian kernel,

φ_i = Σ_{j≠i} exp(−(d_ij / d_c)²)

where d_c is the truncation (cutoff) distance, a preset value; the boundary distance is m times the truncation distance, with m a preset value; and χ(·) is the step function

χ(x) = 1 if x < 0, and χ(x) = 0 otherwise,

with x the input of the step function.

From the cutoff-kernel formula for φ_i it can be seen that the relative size of φ_i depends on the number of samples whose distance to x_i is less than d_c: the more samples closer than d_c, the larger the value of φ_i. The Gaussian-kernel formula turns the discrete values of φ_i into continuous values, which improves the accuracy of the local density calculation.
The minimum distance θ_i is calculated as:

θ_i = min_{j: φ_j > φ_i} d_ij

i.e., the minimum distance between x_i and any sample of higher local density; for the sample with the highest local density, θ_i is taken as max_j d_ij.
the local density exceeds the density threshold and the minimum distance exceeds the distance threshold as a third sample.
Here the distance d_ij between the third sample and the first sample may be calculated using, for example, the Euclidean distance, the Manhattan distance, the Chebyshev distance, the Minkowski distance, the standardized Euclidean distance, or the cosine similarity distance.
In the embodiment, by calculating the local density and the minimum distance of each sample in the first training sample set, if the distance between the first sample and the third sample does not exceed the boundary distance of the cluster, it is determined that the first sample is located within the boundary distance of the cluster, and the efficiency of determining that the first sample is located within the boundary distance of the cluster can be improved.
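A pure-Python sketch of the local density and minimum distance computation (Gaussian kernel; the point set and d_c are illustrative):

```python
import math

def cfsfdp_stats(points, d_c):
    """Gaussian-kernel local density and minimum distance for CFSFDP.

    points: list of coordinate tuples; d_c: truncation distance.
    Returns (phi, theta): per-sample local density, and the minimum
    distance to any sample of higher density (or the maximum distance
    for the globally densest sample).
    """
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    phi = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    theta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if phi[j] > phi[i]]
        theta.append(min(denser) if denser else max(dist[i]))
    return phi, theta

# Two small groups: the density peak of each group gets a large theta,
# so both peaks would be selected as cluster centers (third samples).
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.12, 5), (5, 5.1)]
phi, theta = cfsfdp_stats(pts, d_c=0.5)
```

Samples with both high `phi` and high `theta` are the cluster center points; samples with high `phi` but low `theta` correspond to the fourth samples assigned to the nearest center.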
After the step of obtaining the result of whether the category of the first sample and the first sample are located within the boundary distance of the cluster by the foregoing embodiment, the method for identifying network traffic provided by the embodiment of the present invention further includes: the cluster center point of the cluster is updated.
In one possible embodiment, the cluster center point of a cluster is updated by:
the method comprises the following steps: based on the local density and the minimum distance of each sample in the first training sample set, taking the sample of which the local density exceeds a density threshold value and the minimum distance does not exceed a distance threshold value as a fourth sample;
step two: adding the fourth sample to the cluster where the cluster center point closest to the fourth sample is located;
step three: if the first sample is located within the boundary distance of the cluster, adding the first sample to the cluster where the cluster center point closest to the first sample is located;
step four: calculating the local density and the minimum distance of the samples in each cluster, determining the cluster center point of the cluster according to the samples of which the local density exceeds the density threshold and the minimum distance exceeds the distance threshold aiming at one cluster, and taking the cluster after the center point is updated as the updated cluster.
The cluster center point of the cluster is updated, so that the accuracy of determining whether the first sample is located within the boundary distance of the cluster can be improved.
In another possible embodiment, the result of the class of the first sample and whether the first sample is located within the boundary distance of the cluster is obtained by:
the method comprises the following steps: inputting the first sample into a semi-supervised model, and outputting the probability of different classes of the first sample and the result of whether the first sample is positioned in the boundary distance of the cluster by using an output node of the semi-supervised model; the output node corresponds to the category to which the first sample belongs.
For example: the g output node outputs the probability that the first sample belongs to the g category.
Step two: and selecting the class to which the first sample with the highest probability belongs as the class of the first sample from the probabilities of outputting different classes to which the first sample belongs at the output node.
In the present embodiment, the category to which the first sample with the highest probability belongs is selected as the category of the first sample, so that the accuracy of determining the category of the first sample can be improved.
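Step two above amounts to an argmax over the output-node probabilities; a trivial sketch (probability values are illustrative):

```python
def predict_class(probs):
    """Pick the category whose output node gives the highest
    probability for the first sample (the g-th node outputs the
    probability that the sample belongs to the g-th category)."""
    return max(range(len(probs)), key=probs.__getitem__)

cls = predict_class([0.05, 0.7, 0.25])  # node 1 has the highest probability
```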
In order to improve the real-time property of identifying the category of the data stream, the online identification model may be obtained in S103 by using at least one embodiment:
in one possible embodiment, the online identification model is obtained by:
the method comprises the following steps: under the condition that the first sample is located within the boundary distance of the cluster, if a second sample with the same category as the first sample exists in the second training sample set, judging that the first sample is not a sample of a new category, and updating the parameters of a preset machine recognition model; the second training sample set is a set formed by data streams used for training the machine recognition model;
step two: under the condition that the first sample is located within the boundary distance of the cluster, if a second sample which is the same as the first sample in category does not exist in the second training sample set, the first sample is judged to be a sample of a new category, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output nodes is used as an online recognition model.
Referring to fig. 6 and 7, the preset machine recognition model M_o uses a CNN (Convolutional Neural Network). The output layer of M_o comprises K nodes and the layer above the output layer comprises J nodes; the connections between the output layer and the layer above it carry the parameters of the output layer. In the case that the first sample is located within the boundary distance of the cluster: if a second sample of the same category as the first sample exists in the second training sample set, the first sample is judged not to be a sample of a new category, and the parameters between the output layer and the layer above it in the preset machine recognition model are updated; if no second sample of the same category as the first sample exists in the second training sample set, the first sample is judged to be a sample of a new category, an output node is added to the output nodes of the preset machine recognition model, and the machine recognition model with the added output node is used as the online recognition model.
In another possible implementation, in a case that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, the parameter dimension in the preset machine recognition model is increased by one dimension, and the machine recognition model with the increased parameter dimension is used as the online recognition model.
When the first sample belongs to a new category of samples, i.e., X ∈ C_{K+1}, then for a parameterized preset machine recognition model, adding an output node means increasing the dimension of the output-layer parameters by one: W ∈ R^{J×K} → W ∈ R^{J×(K+1)}, b ∈ R^K → b ∈ R^{K+1}, ρ ∈ R^K → ρ ∈ R^{K+1}, where W represents the weight set, R the set of real numbers, b the bias set, ρ the proficiency set, → the dimension change, and K the total number of output nodes.
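The dimension growth W ∈ R^{J×K} → R^{J×(K+1)} (and likewise for b and ρ) can be sketched as follows (the random initialisation of the new column is an assumption, not specified in the patent):

```python
import random

def add_output_node(W, b, rho):
    """Grow the output layer from K to K+1 nodes.

    W: J x K weight matrix (list of J rows); b: K biases; rho: K
    proficiencies. The new column/entries are freshly initialised:
    small random weights, zero bias, and zero proficiency for the
    new, as-yet-untrained category.
    """
    for row in W:                      # W in R^{JxK} -> R^{Jx(K+1)}
        row.append(random.uniform(-0.1, 0.1))
    b.append(0.0)                      # b in R^K -> R^{K+1}
    rho.append(0.0)                    # rho in R^K -> R^{K+1}
    return W, b, rho

J, K = 4, 6
W = [[0.0] * K for _ in range(J)]
b = [0.0] * K
rho = [0.5] * K
W, b, rho = add_output_node(W, b, rho)
```

Starting the new node's proficiency at 0 is consistent with the proficiency mechanism described below: the model is maximally plastic for a category it has never seen.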
In yet another possible embodiment, the online identification model is obtained by:
the method comprises the following steps: under the condition that the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node in output nodes of a preset machine identification model, and taking the machine identification model with the added output node as a basic identification model;
step two: inputting the first sample into a basic recognition model, and calculating partial derivatives of a loss function of the basic recognition model to the weight and the bias of an output layer in the basic recognition model;
step three: in the gradient descending direction, updating the weight and the bias of the output layer of the basic recognition model by using a parameter updating formula; the parameter updating formula comprises the result of multiplying the increment of proficiency and the partial derivative of the loss function of the basic recognition model on the weight and the bias of an output layer in the basic recognition model;
step four: and determining the basic recognition model after updating the weight and the bias as an online recognition model.
Assume the output of the layer above the output layer of the basic recognition model is F = {f_j} ∈ R^J, the output of the output layer is Y = {y_k} ∈ R^K, the weights are W = {w_jk} ∈ R^{J×K}, and the biases are b = {b_k} ∈ R^K. If the output layer of the basic recognition model is a softmax layer, the cross-entropy loss function of the basic recognition model can be derived as:

L = −Σ_k t_k · log(y_k)

wherein T = {t_k} ∈ R^K is the one-hot encoding of the data stream category, and y_k = g(z_k) is the activation value of the kth output node, with g(·) the softmax function:

y_k = g(z_k) = e^{z_k} / Σ_i e^{z_i}

z_k = Σ_j w_jk · f_j + b_k

Here f_j is the feature activation value of the jth node; b_k the bias of the kth node; R^K denotes dimension K; t_k the kth bit of the one-hot code; z_k the activation value of the kth node; w_jk the weight between node j of the layer above the output layer and output node k; i and k are serial numbers of output nodes, taking positive integers, with k also indexing bits of the one-hot code; j indexes the nodes of the layer above the output layer; K is the total number of output nodes and J the number of nodes of the layer above the output layer. Taking partial derivatives of the loss function gives the increments of the weights and biases:

∂L/∂w_jk = (y_k − t_k) · f_j,  ∂L/∂b_k = y_k − t_k,  with t_k = I{k = label}

wherein I{·} is the indicator function: I{condition} = 1 if the condition holds, and 0 otherwise.
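A small pure-Python sketch of the softmax output layer and its cross-entropy gradients (layer sizes and values are illustrative):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def output_layer_grads(F, W, b, label):
    """Gradients of the cross-entropy loss w.r.t. output-layer W and b.

    F: J activations of the previous layer; W: J x K weights;
    b: K biases; label: index of the true class, so the one-hot
    code is t_k = I{k == label}.
    Returns (dW, db) with dW[j][k] = (y_k - t_k) * f_j and
    db[k] = y_k - t_k.
    """
    J, K = len(F), len(b)
    z = [sum(W[j][k] * F[j] for j in range(J)) + b[k] for k in range(K)]
    y = softmax(z)
    t = [1.0 if k == label else 0.0 for k in range(K)]
    db = [y[k] - t[k] for k in range(K)]
    dW = [[db[k] * F[j] for k in range(K)] for j in range(J)]
    return dW, db

F = [1.0, 0.5]
W = [[0.2, -0.1, 0.0], [0.0, 0.3, 0.1]]
b = [0.0, 0.0, 0.0]
dW, db = output_layer_grads(F, W, b, label=0)
```

Because Σ_k y_k = 1 and Σ_k t_k = 1, the bias gradients always sum to zero, which is a quick sanity check on the derivation.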
it can be understood that, each time the basic recognition model is trained by using the samples of the new category to obtain the online recognition model, the parameters of the basic recognition model can continuously adapt to the samples of the new category, the characteristics of the samples of the new category are learned, and the samples of the old category do not participate in the training. The training mode which enables the online learning model to adapt to the new environment without limitation has a serious problem, namely when the internal parameters of the basic recognition model change, the learned characteristics are affected, and even the capability of the previous basic recognition model is possibly completely damaged, so that the recognition of samples which are not of a new category is seriously wrong, and the problem of 'catastrophic forgetting' is caused.
To solve the catastrophic forgetting problem, the underlying recognition model needs to make a trade-off between learning samples of the new class and retaining samples of the old class. When the stability of the basic recognition model is higher, the basic recognition model is more prone to retain the characteristics of the samples of the old category, and the characteristic ability for learning the samples of the new category is weakened; on the contrary, when the plasticity of the basic recognition model is higher, the basic recognition model has stronger ability of learning samples of new classes, and is easier to forget the characteristics of samples of old classes, and the key point of the trained online recognition model for adapting to the network environment lies in obtaining different tradeoffs between stability and plasticity.
To make the stability–plasticity of the online recognition model controllable, the embodiment of the present invention proposes a proficiency mechanism that introduces an additional set of parameters ρ = {ρ_k} ∈ R^K, where ρ represents the proficiency set and ρ_k represents the proficiency of the basic recognition model on the category output by the kth node, used to measure the recognition capability of the online recognition model on samples of each category.
Here ρ_k ∈ [0, 1), with initial value 0, meaning that the classification proficiency of the basic recognition model for all categories is initially 0. In order to use proficiency to influence the stability–plasticity of the model, the proficiency ρ should have the following properties:
1) proficiency is influenced by the result of identifying the sample class. The more times of correctly identifying the type of the sample, the higher the corresponding proficiency; the more times the wrong sample identifies a category, the lower the corresponding proficiency.
2) Proficiency affects variations in itself. When the proficiency is low, the difficulty of further improving or reducing the proficiency is small, and the proficiency is increased or reduced quickly; the higher the proficiency, the more difficult it is to further increase or decrease the proficiency itself, i.e., the slower the proficiency is increased or decreased.
3) Proficiency affects the difficulty of learning or forgetting knowledge. When the proficiency is low, it is relatively easy to learn more features of samples in a new class or forget the features of samples in an old class, namely the model parameters are updated more quickly; conversely, when proficiency is high, the difficulty of learning or forgetting is also greater, i.e., the model parameters are updated more slowly.
For example, if X ∈ C_k and Y ∈ C_k, the sample X has been correctly classified into the kth category, and the corresponding proficiency ρ_k increases; if X ∈ C_i but Y ∈ C_j, category i has been wrongly identified as category j, and the corresponding ρ_i and ρ_j decrease.
Referring to fig. 8, to realize property 2 and property 3, the embodiment of the present invention proposes a proficiency function for calculating the increment of proficiency:

The proficiency function prof(ρ_k) is a monotonically decreasing function of ρ_k controlled by two parameters α and β, which govern its overall trend. Fig. 8 shows the course of the proficiency function under different α and β: when ρ_k is small, the proficiency increment prof(ρ_k) is large; as ρ_k increases, prof(ρ_k) and its derivative both gradually decrease; when ρ_k reaches its limit value of 1, the proficiency increment prof(ρ_k) is 0 and the proficiency ρ_k is no longer updated.

The update formula of the proficiency ρ_k is: ρ_k ← ρ_k ± prof(ρ_k).

Referring to fig. 9, as proficiency increases, the proficiency increment prof(ρ_k) gradually decreases, i.e., the update amplitude of the parameters of the basic recognition model gradually decreases. Fig. 9 shows the proficiency increment prof(ρ_k) under different parameters; it can be seen that the update rate of the basic recognition model can be controlled by adjusting the parameters α and β of the proficiency increment.
Therefore, when the weights and biases of the output layer of the basic recognition model are updated with the parameter updating formula, the increment of proficiency is incorporated, and the parameter updating formula becomes:

w_jk ← w_jk − prof(ρ_k) · ∂L/∂w_jk

b_k ← b_k − prof(ρ_k) · ∂L/∂b_k

where prof(ρ_k) · ∂L/∂w_jk represents the increment of the weight w_jk, and prof(ρ_k) · ∂L/∂b_k represents the increment of the bias b_k. The proficiency function prof(ρ_k) is used to modulate the update of the weights W and the biases b, and it is required that prof(0) = 1, i.e., when the proficiency ρ_k is 0 the proficiency function does not affect the update of the model. As proficiency increases, the increment coefficient prof(ρ_k) gradually decreases, i.e., the update amplitude of the parameters of the basic recognition model gradually decreases; when ρ_k → 1, prof(ρ_k) → 0, i.e., the update amplitude of the basic recognition model tends to 0.
Referring to fig. 10, for different values of the parameter β, prof(ρ_k) decreases at different rates: the larger β, the faster prof(ρ_k) decreases.
From the above analysis it can be seen that the proficiency set ρ and the proficiency function prof(ρ_k) can control the capability and speed with which the basic recognition model updates its parameters through the two parameters α and β of the proficiency function, thereby balancing the stability and plasticity of the online recognition model and solving the 'catastrophic forgetting' problem.
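The proficiency-weighted update can be sketched as follows. Note that the concrete form of prof used here, α(1 − ρ_k)^β, is only an illustrative stand-in that satisfies the stated properties (prof(0) = 1 for α = 1, monotonically decreasing, prof → 0 as ρ_k → 1, faster decay for larger β); it is not claimed to be the patent's own formula:

```python
def prof(rho_k, alpha=1.0, beta=2.0):
    """Illustrative proficiency-increment function (assumed form):
    monotonically decreasing in rho_k, prof(0) = alpha, prof -> 0 as
    rho_k -> 1; beta controls how fast the increment decays."""
    return alpha * (1.0 - rho_k) ** beta

def proficiency_update(w, grad, rho_k):
    """Scale the gradient-descent step for an output node by
    prof(rho_k): high proficiency (rho_k near 1) nearly freezes the
    parameter (stability), low proficiency leaves the full step
    (plasticity)."""
    return w - prof(rho_k) * grad

w_new_low = proficiency_update(1.0, 0.5, rho_k=0.0)   # full-size step
w_new_high = proficiency_update(1.0, 0.5, rho_k=0.9)  # tiny step
```

Swapping in a different decreasing function with the same boundary behaviour leaves the mechanism unchanged; only the trade-off curve between stability and plasticity shifts.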
For example, taking the one-hot encoding of data stream categories, assume the samples of the first training set are divided into 6 categories, so the output layer of the basic recognition model has 6 nodes. A number is assigned to each category of sample; the sample categories of the first training set are "RDP (Remote Desktop Protocol)", "BitTorrent", "Web (World Wide Web)", "SSH (Secure Shell)", "eDonkey (eDonkey network)", and "NTP (Network Time Protocol)", with corresponding numbers 0, 1, 2, 3, 4, 5 and corresponding one-hot codes 100000, 010000, 001000, 000100, 000010, 000001. Assume the label of a sample in the first training set is 0, i.e., the category of the sample is "RDP". The outputs of the 1st to 6th nodes of the basic recognition model for this sample are 0.5, 0.1, 0.1, 0.1, 0.1, 0.1, so the basic recognition model assigns the highest probability to code 0, and the loss function of the basic recognition model is:
L = −1·log 0.5 + (−0·log 0.1) + (−0·log 0.1) + (−0·log 0.1) + (−0·log 0.1) + (−0·log 0.1) = −log 0.5 ≈ 0.693.
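The worked loss above can be checked numerically; only the true-class term of the cross-entropy sum is non-zero:

```python
import math

# One-hot label 100000 ("RDP"), node outputs (0.5, 0.1, 0.1, 0.1, 0.1, 0.1).
t = [1, 0, 0, 0, 0, 0]
y = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
L = -sum(tk * math.log(yk) for tk, yk in zip(t, y))  # = -log 0.5 = log 2
```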
in order to improve the real-time property of identifying the class of the data stream, at least one implementation manner may be adopted in S104 to identify the class of the data in the packet header of the next data stream after the current data stream:
in one possible embodiment, the class of data of the packet header in the next data stream after the current data stream is identified by the following steps:
the method comprises the following steps: under the condition that the number of data packets in the next data stream after the current data stream reaches the preset number, extracting data of a data packet header in the next data stream as a second sample;
step two: and inputting the second sample into the online identification model, and outputting the category of the second sample by using the online identification model.
The following provides a description of a network traffic identification apparatus according to an embodiment of the present invention.
As shown in fig. 11, a network traffic identification apparatus provided in an embodiment of the present invention is applied to a server, and the apparatus includes:
a sample module 1101, configured to extract packet header data of a data packet in a current data stream as a first sample when receiving the current data stream is completed;
a supervision module 1102, configured to input the first sample into a semi-supervised model, and output the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster by using the semi-supervised model; the semi-supervised model is obtained by training with a first training sample set and comprises the distribution relation between the category of the obtained packet header data and other samples in the first training sample set; the first training sample set comprises samples with at least one class label; the distribution relation determines whether a sample with a class label is located within the boundary distance of a cluster;
a changing module 1103, configured to, when the first sample is located within the boundary distance of the cluster, if the first sample is a new-class sample, add an output node to output nodes of a preset machine identification model, and use the machine identification model after the output node is added as an online identification model;
an identifying module 1104 is configured to identify a category of a next data stream after the current data stream using an online identification model.
Optionally, the network traffic identification apparatus provided in the embodiment of the present invention further includes:
the storage unit is used for sequentially receiving the data packets of the current data stream and acquiring quintuple information of the data packets;
judging whether the database stores quintuple information or not, if the database stores the quintuple information, storing packet header data of the data packet to a storage area of a path corresponding to the quintuple information;
and if the database does not store the quintuple information, creating a storage area of a path corresponding to the quintuple information, and storing the packet header data of the data packet into the storage area of the path corresponding to the quintuple information.
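An in-memory sketch of this storage logic (a real implementation would persist headers to a database path keyed by the five-tuple, as described; names and values here are illustrative):

```python
from collections import defaultdict

# Headers of packets belonging to the same flow are grouped under
# their five-tuple key; a new storage area is created automatically
# the first time an unseen five-tuple arrives.
flows = defaultdict(list)

def store_packet(src_ip, dst_ip, src_port, dst_port, proto, header):
    """Append a packet header to the flow identified by its five-tuple."""
    flows[(src_ip, dst_ip, src_port, dst_port, proto)].append(header)

store_packet("10.0.0.1", "10.0.0.2", 40000, 443, "TCP", b"\x45\x00")
store_packet("10.0.0.1", "10.0.0.2", 40000, 443, "TCP", b"\x45\x10")
```

Once a packet carrying the end identifier (e.g. a TCP FIN) arrives, the accumulated header list for that five-tuple forms the first sample.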
The sample module is specifically configured to:
judging whether each data packet of the current data stream contains an end identifier, if one data packet contains the end identifier, receiving the data stream is finished, and extracting packet header data of the data packet in the data stream as a first sample.
The sample module is specifically configured to:
under the condition that the current data stream is received, extracting packet header data of a data packet in the current data stream;
encoding packet header data of a data packet in a current data stream to obtain a fixed-dimension vector, and taking the fixed-dimension vector as a first sample.
The supervision module is specifically configured to:
input the first sample into the semi-supervised model, and output the category of the first sample using the semi-supervised model;
calculate the local density and the minimum distance of each sample in the first training sample set, and take each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used for training the semi-supervised model;
add each third sample to a cluster and determine the third sample as the cluster center point of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;
if the distance between the first sample and the third sample exceeds the boundary distance of the cluster, determine that the first sample is not within the boundary distance of the cluster;
if the distance between the first sample and the third sample does not exceed the boundary distance of the cluster, determine that the first sample is within the boundary distance of the cluster.
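The center-selection step resembles density-peak clustering. The sketch below (the function names, the strict-inequality density count, and the single shared boundary distance are assumptions for illustration, not claimed details) shows how the third samples and the boundary test could be computed:

```python
import math

def density_peak_centers(samples, d_c, rho_thresh, delta_thresh):
    """Select cluster center points ('third samples'): samples whose local
    density rho AND minimum distance delta both exceed their thresholds."""
    n = len(samples)
    dist = [[math.dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    # Local density: number of other samples closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        closer = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Minimum distance to any denser sample; the densest sample
        # takes its maximum distance instead, so it always qualifies.
        delta.append(min(closer) if closer else max(dist[i]))
    return [i for i in range(n) if rho[i] > rho_thresh and delta[i] > delta_thresh]

def within_boundary(x, centers, samples, boundary):
    """The first sample lies inside a known cluster if its distance to
    some center does not exceed that cluster's boundary distance."""
    return any(math.dist(x, samples[c]) <= boundary for c in centers)
```

A new sample that falls outside every cluster's boundary distance is treated as out of distribution and, per the method above, does not trigger the new-category check.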
The modification module is specifically configured to:
when the first sample is located within the boundary distance of the cluster, if a second sample of the same category as the first sample exists in the second training sample set, determine that the first sample is not a sample of a new category, and update the parameters of the preset machine recognition model; the second training sample set is the set of data streams used for training the machine recognition model;
when the first sample is located within the boundary distance of the cluster, if no second sample of the same category as the first sample exists in the second training sample set, determine that the first sample is a sample of a new category, add an output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as the online recognition model.
The modification module is specifically configured to:
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, increase the parameter dimension in the preset machine recognition model by one dimension, and use the machine recognition model with the increased parameter dimension as the online recognition model.
The modification module is specifically configured to:
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, add an output node to the output nodes of the preset machine recognition model, and use the machine recognition model with the added output node as a basic recognition model;
input the first sample into the basic recognition model, and calculate the partial derivatives of the loss function of the basic recognition model with respect to the weights and biases of the output layer of the basic recognition model;
update the weights and biases of the output layer of the basic recognition model along the direction of gradient descent using a parameter update formula; the parameter update formula includes the product of the learning rate and the partial derivatives of the loss function of the basic recognition model with respect to the weights and biases of the output layer of the basic recognition model;
and determine the basic recognition model with the updated weights and biases as the online recognition model.
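A minimal sketch of the structural change plus the gradient step (NumPy-based; the layer class, initialization scheme, and learning rate value are illustrative assumptions, not the patent's parameters):

```python
import numpy as np

class OutputLayer:
    """Softmax output layer standing in for the 'machine recognition model'."""

    def __init__(self, in_dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (n_classes, in_dim))
        self.b = np.zeros(n_classes)

    def add_output_node(self):
        """New-category sample: grow the layer by one output node while
        leaving the weights of the existing categories untouched."""
        self.W = np.vstack([self.W, np.zeros((1, self.W.shape[1]))])
        self.b = np.append(self.b, 0.0)

    def forward(self, x):
        z = self.W @ x + self.b
        e = np.exp(z - z.max())             # numerically stable softmax
        return e / e.sum()

    def sgd_step(self, x, label, lr=0.1):
        """One gradient-descent update of the output-layer parameters:
        parameter -= learning_rate * dLoss/dParameter."""
        p = self.forward(x)
        y = np.zeros_like(p)
        y[label] = 1.0
        dz = p - y                           # softmax + cross-entropy gradient
        self.W -= lr * np.outer(dz, x)       # dL/dW = dz x^T
        self.b -= lr * dz                    # dL/db = dz
```

Updating only the output layer keeps the online adaptation cheap, which is consistent with the real-time goal stated in the abstract.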
The identification module is specifically configured to:
when the number of data packets in the next data stream after the current data stream reaches a preset number, extract the packet header data of the data packets in the next data stream as a second sample;
and input the second sample into the online recognition model, and output the category of the second sample using the online recognition model.
An embodiment of the present invention further provides an electronic device, as shown in fig. 12, including a processor 1201, a communication interface 1202, a memory 1203, and a communication bus 1204, where the processor 1201, the communication interface 1202, and the memory 1203 communicate with one another via the communication bus 1204,
a memory 1203 for storing a computer program;
the processor 1201 is configured to implement the following steps when executing the program stored in the memory 1203:
when reception of the current data stream is complete, extracting the packet header data of a data packet in the current data stream as a first sample;
inputting the first sample into a semi-supervised model, and outputting, using the semi-supervised model, the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster;
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model, and using the machine recognition model with the added output node as an online recognition model;
identifying the category of the next data stream after the current data stream using the online recognition model.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to perform a network traffic identification method as described in any of the above embodiments.
In yet another embodiment, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform a method of network traffic identification as described in any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., Solid State Disk (SSD)), or the like.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus/electronic device/computer-readable storage medium/computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for relevant points, reference may be made to some descriptions of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A network traffic identification method, applied to a server, the method comprising:
when reception of the current data stream is complete, extracting the packet header data of a data packet in the current data stream as a first sample;
inputting the first sample into a semi-supervised model, and outputting, using the semi-supervised model, the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster; the semi-supervised model is obtained by training with a first training sample set and includes the distribution relation between the category of the obtained packet header data and other samples in the first training sample set; the first training sample set includes samples with at least one class label; the distribution relation determines whether the samples with class labels are located within the boundary distance of the clusters;
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model, and using the machine recognition model with the added output node as an online recognition model;
identifying the category of the next data stream after the current data stream using the online recognition model;
wherein inputting the first sample into the semi-supervised model and outputting, using the semi-supervised model, the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster comprises:
inputting the first sample into the semi-supervised model, and outputting the category of the first sample using the semi-supervised model;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used for training the semi-supervised model;
adding each third sample to a cluster and determining the third sample as the cluster center point of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;
determining that the first sample is not within the cluster boundary distance if the distance between the first sample and the third sample exceeds the cluster boundary distance;
determining that the first sample is within the cluster boundary distance if the distance between the first sample and the third sample does not exceed the cluster boundary distance.
2. The method of claim 1, wherein, before extracting the packet header data of a data packet in the current data stream as the first sample when reception of the current data stream is complete, the method further comprises:
sequentially receiving data packets of a current data stream and acquiring quintuple information of the data packets;
judging whether a database stores the quintuple information or not, if the database stores the quintuple information, storing packet header data of the data packet to a storage area of a path corresponding to the quintuple information;
and if the database does not store the quintuple information, creating a storage area of a path corresponding to the quintuple information, and storing the packet header data of the data packet to the storage area of the path corresponding to the quintuple information.
3. The method according to claim 1, wherein, when reception of the current data stream is complete, extracting the packet header data of a data packet in the current data stream as a first sample comprises:
determining whether each data packet of the current data stream contains an end identifier; if a data packet contains the end identifier, reception of the data stream is complete, and the packet header data of the data packets in the data stream is extracted as the first sample.
4. The method according to claim 1, wherein, when reception of the current data stream is complete, extracting the packet header data of a data packet in the current data stream as a first sample comprises:
when reception of the current data stream is complete, extracting the packet header data of the data packets in the current data stream;
and encoding the packet header data of the data packets in the current data stream to obtain a vector of fixed dimension, and taking the fixed-dimension vector as the first sample.
5. The method according to claim 1, wherein, when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model and using the machine recognition model with the added output node as an online recognition model comprises:
when the first sample is located within the boundary distance of the cluster, if a second sample of the same category as the first sample exists in a second training sample set, determining that the first sample is not a sample of a new category, and updating the parameters of the preset machine recognition model; the second training sample set is the set of data streams used for training the machine recognition model;
when the first sample is located within the boundary distance of the cluster, if no second sample of the same category as the first sample exists in the second training sample set, determining that the first sample is a sample of a new category, adding an output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as the online recognition model.
6. The method according to claim 1, wherein, when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model and using the machine recognition model with the added output node as an online recognition model comprises:
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding one dimension to the parameter dimension in the preset machine recognition model, and using the machine recognition model with the increased parameter dimension as the online recognition model.
7. The method according to claim 1, wherein, when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of a preset machine recognition model and using the machine recognition model with the added output node as an online recognition model comprises:
when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, adding an output node to the output nodes of the preset machine recognition model, and using the machine recognition model with the added output node as a basic recognition model;
inputting the first sample into the basic recognition model, and calculating the partial derivatives of the loss function of the basic recognition model with respect to the weights and biases of the output layer of the basic recognition model;
updating the weights and biases of the output layer of the basic recognition model along the direction of gradient descent using a parameter update formula; the parameter update formula includes the product of the learning rate and the partial derivatives of the loss function of the basic recognition model with respect to the weights and biases of the output layer of the basic recognition model;
and determining the basic recognition model with the updated weights and biases as the online recognition model.
8. The method of claim 1, wherein identifying the category of the next data stream after the current data stream using the online recognition model comprises:
when the number of data packets in the next data stream after the current data stream reaches a preset number, extracting the packet header data of the data packets in the next data stream as a second sample;
and inputting the second sample into the online recognition model, and outputting the category of the second sample using the online recognition model.
9. A network traffic identification device, applied to a server, the device comprising:
a sample module, configured to extract the packet header data of a data packet in the current data stream as a first sample when reception of the current data stream is complete;
a supervision module, configured to input the first sample into a semi-supervised model and output, using the semi-supervised model, the category of the first sample and the result of whether the first sample is located within the boundary distance of the cluster; the semi-supervised model is obtained by training with a first training sample set and includes the distribution relation between the category of the obtained packet header data and other samples in the first training sample set; the first training sample set includes samples with at least one class label; the distribution relation determines whether the samples with class labels are located within the boundary distance of the clusters;
a changing module, configured to: when the first sample is located within the boundary distance of the cluster, if the first sample is a sample of a new category, add an output node to the output nodes of a preset machine recognition model and use the machine recognition model with the added output node as an online recognition model;
an identification module, configured to identify the category of the next data stream after the current data stream using the online recognition model;
the supervision module is specifically configured to:
inputting the first sample into the semi-supervised model, and outputting the category of the first sample using the semi-supervised model;
calculating the local density and the minimum distance of each sample in the first training sample set, and taking each sample whose local density exceeds a density threshold and whose minimum distance exceeds a distance threshold as a third sample; the first training sample set is the set of samples used for training the semi-supervised model;
adding each third sample to a cluster and determining the third sample as the cluster center point of that cluster; the number of clusters equals the number of third samples, and each cluster contains exactly one third sample;
determining that the first sample is not within the cluster boundary distance if the distance between the first sample and the third sample exceeds the cluster boundary distance;
determining that the first sample is within the cluster boundary distance if the distance between the first sample and the third sample does not exceed the cluster boundary distance.
CN201910036196.2A 2019-01-15 2019-01-15 Network traffic identification method and device Active CN109873774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910036196.2A CN109873774B (en) 2019-01-15 2019-01-15 Network traffic identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910036196.2A CN109873774B (en) 2019-01-15 2019-01-15 Network traffic identification method and device

Publications (2)

Publication Number Publication Date
CN109873774A CN109873774A (en) 2019-06-11
CN109873774B true CN109873774B (en) 2021-01-01

Family

ID=66917604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910036196.2A Active CN109873774B (en) 2019-01-15 2019-01-15 Network traffic identification method and device

Country Status (1)

Country Link
CN (1) CN109873774B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111447151A (en) * 2019-10-30 2020-07-24 长沙理工大学 Attention mechanism-based time-space characteristic flow classification research method
CN113326946A (en) * 2020-02-29 2021-08-31 华为技术有限公司 Method, device and storage medium for updating application recognition model
CN111614514B (en) * 2020-04-30 2021-09-24 北京邮电大学 Network traffic identification method and device
WO2022083509A1 (en) * 2020-10-19 2022-04-28 华为技术有限公司 Data stream identification method and device
CN112367334A (en) * 2020-11-23 2021-02-12 中国科学院信息工程研究所 Network traffic identification method and device, electronic equipment and storage medium
CN113472654B (en) * 2021-05-31 2022-11-15 济南浪潮数据技术有限公司 Network traffic data forwarding method, device, equipment and medium

Citations (6)

Publication number Priority date Publication date Assignee Title
CN103150578A (en) * 2013-04-09 2013-06-12 山东师范大学 Training method of SVM (Support Vector Machine) classifier based on semi-supervised learning
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering
CN107729952A (en) * 2017-11-29 2018-02-23 新华三信息安全技术有限公司 A kind of traffic flow classification method and device
CN107846326A (en) * 2017-11-10 2018-03-27 北京邮电大学 A kind of adaptive semi-supervised net flow assorted method, system and equipment
CN108900432A (en) * 2018-07-05 2018-11-27 中山大学 A kind of perception of content method based on network Flow Behavior
CN109067612A (en) * 2018-07-13 2018-12-21 哈尔滨工程大学 A kind of online method for recognizing flux based on incremental clustering algorithm

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US11122058B2 (en) * 2014-07-23 2021-09-14 Seclytics, Inc. System and method for the automated detection and prediction of online threats

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
CN103150578A (en) * 2013-04-09 2013-06-12 山东师范大学 Training method of SVM (Support Vector Machine) classifier based on semi-supervised learning
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering
CN107846326A (en) * 2017-11-10 2018-03-27 北京邮电大学 A kind of adaptive semi-supervised net flow assorted method, system and equipment
CN107729952A (en) * 2017-11-29 2018-02-23 新华三信息安全技术有限公司 A kind of traffic flow classification method and device
CN108900432A (en) * 2018-07-05 2018-11-27 中山大学 A kind of perception of content method based on network Flow Behavior
CN109067612A (en) * 2018-07-13 2018-12-21 哈尔滨工程大学 A kind of online method for recognizing flux based on incremental clustering algorithm

Non-Patent Citations (1)

Title
Design and Implementation of a Machine Learning-Based Network Traffic Classification ***; Mei Guowei; China Excellent Master's Theses Full-text Database; 2018-06-30; pp. 39-59 *

Also Published As

Publication number Publication date
CN109873774A (en) 2019-06-11

Similar Documents

Publication Publication Date Title
CN109873774B (en) Network traffic identification method and device
CN108023876B (en) Intrusion detection method and intrusion detection system based on sustainability ensemble learning
CN108062561B (en) Short-time data flow prediction method based on long-time and short-time memory network model
CN110881037A (en) Network intrusion detection method and training method and device of model thereof, and server
WO2022227388A1 (en) Log anomaly detection model training method, apparatus and device
CN111368920B (en) Quantum twin neural network-based classification method and face recognition method thereof
CN113378545B (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
WO2021208727A1 (en) Text error detection method and apparatus based on artificial intelligence, and computer device
CN112687328B (en) Method, apparatus and medium for determining phenotypic information of clinical descriptive information
US11748426B2 (en) Personalized comment recommendation method based on link prediction model of graph bidirectional aggregation network
CN112084330A (en) Incremental relation extraction method based on course planning meta-learning
CN110830291B (en) Node classification method of heterogeneous information network based on meta-path
CN114139624A (en) Method for mining time series data similarity information based on integrated model
CN110175655B (en) Data identification method and device, storage medium and electronic equipment
CN116776270A (en) Method and system for detecting micro-service performance abnormality based on transducer
CN113342909B (en) Data processing system for identifying identical solid models
CN111737294A (en) Data flow classification method based on dynamic increment integration fuzzy
CN111460097A (en) Small sample text classification method based on TPN
CN110866169A (en) Learning-based Internet of things entity message analysis method
CN106533955B (en) A kind of sequence number recognition methods based on network message
CN111813941A (en) Text classification method, device, equipment and medium combining RPA and AI
CN109543712B (en) Method for identifying entities on temporal data set
CN117194742A (en) Industrial software component recommendation method and system
CN114861004A (en) Social event detection method, device and system
CN112463964B (en) Text classification and model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant