WO2020159439A1 - System and method for network anomaly detection and analysis - Google Patents


Info

Publication number
WO2020159439A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
network
circuitry configured
data
gradients
Application number
PCT/SG2020/050033
Other languages
French (fr)
Inventor
Quoc Phong NGUYEN
Dinil Mon DIVAKARAN
Kar Wai LIM
Kian Hsiang LOW
Mun Choon Chan
Original Assignee
Singapore Telecommunications Limited
Application filed by Singapore Telecommunications Limited filed Critical Singapore Telecommunications Limited
Publication of WO2020159439A1


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22: Parsing or analysis of headers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/552: Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566: Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning

Definitions

  • This invention relates to a system and method for detecting and identifying network anomalies.
  • this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic. This is done using deep learning techniques and a gradient-based fingerprinting technique.
  • Anomalies in network traffic may occur due to various types of cyber-related attacks and threats, such as Distributed Denial-of-Service (DDoS) attacks (e.g., TCP SYN flooding, DNS amplification attacks, etc.), brute force attempts, botnet communications, spam campaigns, network/port scans, etc.
  • DDoS Distributed Denial-of-Service
  • Such network anomalies may also occur due to non-malicious causes, such as faults that occur in the network, misconfigurations, improper Border Gateway Protocol (BGP) policy updates, changes in user behaviours, etc.
  • BGP Border Gateway Protocol
  • IoT Internet of Things
  • the detection of network anomalies remains challenging for a number of reasons.
  • the characteristics of network data are dependent on a number of factors, such as the end-user’s behaviour, the customer’s types of business (e.g., banking, retail), the types of the applications, the location, the time of the day, and are expected to change with time.
  • Such diversity and dynamism limit the application of rule-based systems for the detection of anomalies in network traffic.
  • Network routers typically extract and export meta data in the form of, but not limited to, NetFlow records.
  • a typical NetFlow record represents meta data of a set of related packets, and is often generated from sampled packets.
  • TCP Transmission Control Protocol
  • TLS Transport Layer Security
  • anomaly detectors go beyond merely indicating the presence of anomalies; they also seek to provide other information such as the time of the anomaly, the anomaly type, and the corresponding set of suspicious flows. In general, the more information that is passed (along with the alerts) to the analysts, the faster the analysis of the threat and the decision process.
  • supervised machine learning approaches require large sets of data with ground truth for training. As the network’s characteristics and attacks evolve, models have to be retrained and the corresponding labelled datasets have to be made available. This requires costly and laborious manual efforts, and yet, given the size of traffic flowing through a backbone network, it is highly impractical to assume that all data records are correctly labelled. Further, supervised machine learning approaches are unlikely to detect unseen and zero-day attack traffic.
  • an unsupervised network anomaly detection method that is scalable in terms of both network data size and feature dimension is therefore needed. Such a method would be useful in detecting anomalies in large-scale networks without the need to rely on domain knowledge. Further, in addition to detecting anomalies, the method should be able to analyse the detected anomalies and ascertain the type of attacks by identifying the main features that cause the anomalies. For the above reasons, those skilled in the art are constantly striving to come up with a system and method that is capable of receiving and quantitatively unifying unstructured and/or unlabelled information security threat data from any source or system, whereby the processed information is then provided back to all the upstream systems to actively tune and improve the security postures of these systems in near-real-time.
  • a first advantage of embodiments of systems and methods in accordance with the invention is that based on gradient information obtained from the deep neural network model, flagged anomalies may be effectively and efficiently identified.
  • a second advantage of embodiments of systems and methods in accordance with the invention is that gradient information obtained from the deep neural network model may be employed to detect future anomalies that have yet to be labelled or identified.
  • a third advantage of embodiments of systems and methods in accordance with the invention is that the invention does not require labelled data as training data and as a result, is likely to detect zero-day type network attacks.
  • a fourth advantage of embodiments of systems and methods in accordance with the invention is that the invention provides an efficient way of explaining and/or identifying attacks from detected anomalous traffic, as not all anomalous traffic may comprise cyber-attacks on the network.
  • a method for detecting and analysing anomalies in network traffic comprising: collecting network data from the network traffic; extracting one or more features from the network data to form a dataset; providing the dataset to a deep neural network model to train the deep neural network model; detecting anomalies in the network traffic using the trained deep neural network model; and computing gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
  • the step of extracting one or more features from the network data to form the dataset comprises: grouping the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses, wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
  • the one or more features or aggregated features extracted from the network data comprise one or more from the group comprising: mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate; entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and proportion of ports used for common applications including WinRPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
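As an illustration of how such aggregated features might be computed, the following Python sketch (not code from the patent; the flow tuples are hypothetical) derives the mean and standard deviation of flow durations and the entropy of destination ports for one window of records:

```python
# Illustrative sketch: computing a few of the aggregated features named
# above -- mean/std of flow durations and the entropy of destination ports --
# for one window of NetFlow records from a single source IP.
import math
from collections import Counter

def mean_std(values):
    """Mean and (population) standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical flows: (duration_seconds, destination_port)
flows = [(0.5, 80), (1.5, 80), (1.0, 443), (1.0, 53)]
dur_mean, dur_std = mean_std([d for d, _ in flows])
port_entropy = entropy([p for _, p in flows])
```

A scan from one source, for instance, would show high destination-port entropy, while a flood would show low entropy and high packet rates.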
  • the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
  • AE Autoencoder
  • VAE Variational Autoencoder
  • the training of the VAE model is performed using Adam as an optimisation algorithm.
  • the step of using the trained deep neural network model to detect anomalies in the network traffic comprises: obtaining a reconstruction error for each data point; and comparing the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
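The detection rule in this step can be sketched as follows (a minimal illustration with hypothetical feature vectors and threshold; in the actual system the reconstruction would come from the trained AE/VAE model):

```python
# Minimal sketch of the detection rule: a data point is flagged as an
# anomaly when the mean-square reconstruction error between its observed
# features and their reconstruction exceeds a predetermined threshold.
def reconstruction_error(observed, reconstructed):
    return sum((o - r) ** 2 for o, r in zip(observed, reconstructed)) / len(observed)

def is_anomaly(observed, reconstructed, threshold):
    return reconstruction_error(observed, reconstructed) > threshold

# Hypothetical feature vectors: one well-reconstructed, one poorly.
normal_point = is_anomaly([0.2, 0.4], [0.21, 0.39], threshold=0.01)  # False
odd_point = is_anomaly([0.9, 0.1], [0.2, 0.5], threshold=0.01)       # True
```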
  • the method further comprises the step of clustering the computed gradients for the one or more features.
  • the method further comprises the step of identifying types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
  • a computer system for detecting and analysing anomalies in network traffic comprising circuitry configured to collect network data from the network traffic; circuitry configured to extract one or more features from the network data to form a dataset; circuitry configured to provide the dataset to a deep neural network model to train the deep neural network model; circuitry configured to detect anomalies in the network traffic using the trained deep neural network model; and circuitry configured to compute gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
  • the circuitry configured to extract the one or more features from the network data to form the dataset further comprises: circuitry configured to group the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses, wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
  • the one or more features or aggregated features extracted from the network data comprise one or more from the group comprising: mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate; entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and proportion of ports used for common applications including WinRPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
  • the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
  • AE Autoencoder
  • VAE Variational Autoencoder
  • the training of the VAE model is performed using Adam as an optimisation algorithm.
  • the circuitry configured to use the trained deep neural network model to detect anomalies in the network traffic comprises: circuitry configured to obtain a reconstruction error for each data point; and circuitry configured to compare the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
  • the computer system further comprises circuitry configured to cluster the computed gradients for the one or more features.
  • the computer system further comprises circuitry configured to identify types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
  • FIG. 1 illustrating a process or method for detecting and analysing anomalies in a network in accordance with embodiments of the invention
  • FIG. 2 illustrating a block diagram of modules that may be used to implement the method for detecting and analysing network anomalies in accordance with embodiments of the invention
  • FIG. 3 illustrating a block diagram representative of processing systems providing embodiments in accordance with embodiments of the invention
  • FIG. 4 illustrating an exemplary architecture of an Autoencoder (AE) and a Variational Autoencoder (VAE) in accordance with embodiments of the invention
  • FIG. 6 illustrating distributions of reconstruction errors of training data for various anomalies as obtained using the VAE model for Example 1;
  • FIG. 7 illustrating the receiver operating characteristic (ROC) curves for VAE, AE and Gaussian-based thresholding (GBT) in detecting anomalies in the training dataset for Example 1;
  • FIG. 8 illustrating the ROC curves for VAE, AE and GBT in detecting anomalies in the test dataset for Example 1;
  • FIG. 9 illustrating the normalized gradients of spam, scan11 and scan44 based on selected features using the VAE model for Example 1;
  • This invention relates to a system and method for detecting and identifying network anomalies.
  • this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic using deep learning techniques and a gradient-based fingerprinting technique.
  • this invention relates to a system and method for processing a network’s meta data, such as NetFlow records obtained from network monitoring activities, using a deep neural network model to detect network anomalies whereby the anomalies are subsequently analysed based on gradient information obtained from the deep neural network model.
  • modules may be implemented as circuits, logic chips or any sort of discrete component. Still further, one skilled in the art will also recognize that a module may be implemented in software which may then be executed by a variety of processors. In embodiments of the invention, a module may also comprise computer instructions or executable code that may instruct a computer processor to carry out a sequence of events based on instructions received. The choice of the implementation of the modules is left as a design choice to a person skilled in the art and does not limit the scope of this invention in any way.
  • FIG. 1 sets out an exemplary flowchart of process 100 for detecting, analysing and identifying a network anomaly in accordance with embodiments of the invention.
  • Process 100 comprises the following steps:
  • Step 105 collecting network data
  • Step 110 extracting one or more features from the network data that is collected in step 105 to form a training dataset
  • Step 115 feeding the training dataset from step 110 into a deep neural network model, thereby training the deep neural network model to learn normal behaviour of the network;
  • Step 120 using the trained deep neural network model to detect an anomaly
  • Step 125 obtaining gradient information of the anomaly from the deep neural network to analyse and identify the anomaly.
  • system 200 comprises feature extraction module 205, gradient fingerprinting module 225 and VAE module 210 (which in turn comprises training module 215 and anomaly detection module 220).
  • system 200 may comprise a computer system.
  • FIG. 3 illustrates a block diagram representative of components of processing system 300 that may be provided within modules 205, 210, 215, 220 and 225 for implementing embodiments in accordance with embodiments of the invention.
  • each of modules 205, 210, 215, 220 and 225 may comprise controller 301 and user interface 302.
  • User interface 302 is arranged to enable manual interactions between a user and each of these modules as required and for this purpose includes the input/output components required for the user to enter instructions to provide updates to each of these modules.
  • components of user interface 302 may vary from embodiment to embodiment but will typically include one or more of display 340, keyboard 335 and track-pad 336.
  • Controller 301 is in data communication with user interface 302 via bus 315 and includes memory 320, processor 305 mounted on a circuit board that processes instructions and data for performing the method of this embodiment, an operating system 306, an input/output (I/O) interface 330 for communicating with user interface 302 and a communications interface, in this embodiment in the form of a network card 350.
  • Network card 350 may, for example, be utilized to send data from these modules via a wired or wireless network to other processing devices or to receive data via the wired or wireless network.
  • Wireless networks that may be utilized by network card 350 include, but are not limited to, Wireless-Fidelity (Wi-Fi), Bluetooth, Near Field Communication (NFC), cellular networks, satellite networks, telecommunication networks, Wide Area Networks (WAN), etc.
  • Memory 320 and operating system 306 are in data communication with CPU 305 via bus 310.
  • the memory components include both volatile and non-volatile memory and more than one of each type of memory, including Random Access Memory (RAM) 320, Read Only Memory (ROM) 325 and a mass storage device 345, the last comprising one or more solid-state drives (SSDs).
  • RAM Random Access Memory
  • ROM Read Only Memory
  • mass storage device 345 may comprise one or more solid-state drives (SSDs).
  • SSDs solid-state drives
  • Memory 320 also includes secure storage 346 for securely storing secret keys, or private keys.
  • the memory components described above comprise non-transitory computer-readable media and shall be taken to comprise all computer-readable media except for a transitory, propagating signal.
  • the instructions are stored as program code in the memory components but can also be hardwired.
  • Memory 320 may include a kernel and/or programming modules such as a software application that may be stored in either volatile or non-volatile memory.
  • processor 305 may be provided by any suitable logic circuitry for receiving inputs, processing them in accordance with instructions stored in memory and generating outputs (for example to the memory components or on display 340).
  • processor 305 may be a single core or multi-core processor with memory addressable space.
  • processor 305 may be multi-core, comprising, for example, an 8-core CPU.
  • step 105 network data are constantly collected by network routers within a monitored network.
  • step 105 may be performed by feature extraction module 205 as illustrated in Figure 2.
  • the network data collected by module 205 comprises NetFlow records. It should be appreciated that other forms of aggregate data that are collectable by routers in an Internet Service Provider (ISP) network may be collected at step 105 by module 205 and that the type of data collected at step 105 is not specifically limited only to NetFlow type records.
  • ISP Internet Service Provider
  • a NetFlow record comprises a set of packets that has the same five-tuple of source and destination IP addresses, source and destination ports, and protocol.
  • some of the important fields collected in the NetFlow records include, but are not limited to, the start time of the flow (based on the first sampled packet), duration, number of packets and Transmission Control Protocol (TCP) flag.
  • TCP Transmission Control Protocol
  • one or more features may be extracted from the NetFlow records collected at step 105 to form a training dataset.
  • the NetFlow records are first grouped into sliding windows of a predetermined duration based on source IP addresses before one or more aggregated features are extracted from each window and statistically analysed.
  • the NetFlow records are grouped into 3-minute long sliding windows based on the source IP addresses. This means that each data point in the training dataset corresponds to network statistics from a single source IP address within a 3-minute period. This enables identification of an offending IP address that is responsible for an anomaly and the time window to which the anomaly belongs.
  • the period of 3 minutes is chosen as a balance between the practicality and quality of statistics from the aggregated features, whereby the statistics will be insignificant if the window period is too short, and the capability for real time analysis is lost if the window period is too long.
  • the duration of a 3-minute sliding window is not meant to be limiting to the present invention and other durations may be used without departing from this invention.
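The grouping step might be sketched as below; for simplicity this uses non-overlapping 3-minute buckets rather than true sliding windows, and the record format (a dictionary with `src_ip` and `start_time` keys) is a hypothetical stand-in for a parsed NetFlow record:

```python
# Simplified sketch of the grouping step: NetFlow records keyed by
# (source IP, window index), using non-overlapping 180-second windows
# for simplicity instead of true sliding windows.
from collections import defaultdict

WINDOW_SECONDS = 180

def group_by_source_and_window(records):
    """records: iterable of dicts with 'src_ip' and 'start_time' (epoch seconds)."""
    groups = defaultdict(list)
    for rec in records:
        window_idx = int(rec["start_time"] // WINDOW_SECONDS)
        groups[(rec["src_ip"], window_idx)].append(rec)
    return groups

records = [
    {"src_ip": "10.0.0.1", "start_time": 10.0},
    {"src_ip": "10.0.0.1", "start_time": 170.0},
    {"src_ip": "10.0.0.1", "start_time": 200.0},  # falls in the next window
    {"src_ip": "10.0.0.2", "start_time": 20.0},
]
groups = group_by_source_and_window(records)
```

Each resulting group then yields one data point of aggregated features.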
  • 53 aggregated features are extracted from the NetFlow records to form the training dataset. These include the mean and standard deviation of flow durations, numbers of packets and bytes, packet and byte rates, entropies of protocol types, IP addresses, ports and TCP flags, and the proportions of ports used for common applications.
  • data points that contain too few flows may be removed from the training dataset to reduce noise.
  • the statistics may be further normalised into a Z-score or scaled to a value between 0 and 1.
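Both normalisation options can be sketched as follows (illustrative only; applied per feature column):

```python
# Sketch of the two normalisation options mentioned above: a Z-score
# (zero mean, unit variance) or min-max scaling to [0, 1].
import math

def z_score(values):
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

feature_column = [2.0, 4.0, 6.0]  # hypothetical raw feature values
```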
  • the training dataset from step 110 is fed into a deep neural network model found in training module 215 to learn the normal behaviour of the network.
  • Various unsupervised deep neural network models may be used. These models do not require labelled information (i.e. normal or anomalous) for training and instead exploit the fact that anomalous behaviours tend to differ greatly from the standard or normal behaviour of the network.
  • the unsupervised deep neural network model may comprise an Autoencoder (AE) model or a Variational Autoencoder (VAE) model that is a probabilistic generalisation of the AE model.
  • AE Autoencoder
  • VAE Variational Autoencoder
  • an exemplary architecture 400 for an AE or VAE model is illustrated in Figure 4.
  • the AE or VAE model is made of three main layers: an input layer 401 to take in the features; a latent representation layer 402 of the features; and an output layer 403 that is a reconstruction of the features.
  • the AE or VAE model comprises of two parts, an encoder 405 and a decoder 410.
  • the function of encoder 405 is to map a feature from the input layer 401 into its latent representation in the latent representation layer 402 while decoder 410 derives an output in the output layer 403 that is a reconstruction of the feature from the input layer 401 based on the latent representation.
  • the encoder 405 may be considered as part of a deep neural network in the sense that information from the input is passed through several mappings (and hidden layers) similar to the deep architecture in a supervised deep learning model; and likewise for the decoder 410.
  • the latent representation layer 402 may be set to a size of 100.
  • the encoder 405 and the decoder 410 may each have three hidden layers with sizes 512, 512, and 1,024 respectively, as illustrated in Figure 4.
  • nodes that are shaded represent the observed data (used as both inputs and outputs), while the unshaded nodes represent unobserved latent variables that correspond to the hidden layers.
  • the exact sizes or dimensions of the layers are shown above the nodes in this illustration.
  • the links between the layers show how the values of the next layer can be computed. Commonly, for an AE model, the value of one hidden layer can be computed as h_{l+1} = g_l(W_l h_l + b_l) (equation (1)), where W_l is a weight matrix and b_l a bias vector for layer l.
  • the function g_l is known as the activation function that transforms the computation in a non-linear way and allows complex relationships to be learned.
  • the learning of the parameters is generally achieved by minimizing the reconstruction errors (e.g. mean square errors) via backpropagation with random initialization, and can be optimized with a variety of optimizers such as the stochastic gradient descent optimizer.
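A numerical sketch of the per-layer computation in equation (1), with arbitrary illustrative weights and a ReLU activation (the actual layer sizes and activations of the model are as described above):

```python
# Numerical sketch of equation (1): the value of one hidden layer is an
# affine map of the previous layer passed through a non-linear activation.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer(h_prev, W, b, activation=relu):
    # h_next = g(W @ h_prev + b), as in equation (1)
    return activation(W @ h_prev + b)

# Arbitrary illustrative weights and bias for a 2 -> 2 layer.
W = np.array([[1.0, -1.0], [0.5, 0.5]])
b = np.array([0.0, -0.25])
h = layer(np.array([1.0, 2.0]), W, b)
```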
  • the detailed optimization steps are omitted here for brevity; the person skilled in the art may obtain more details from the publication by Y. Bengio titled "Practical recommendations for gradient-based training of deep architectures", Neural Networks: Tricks of the Trade, Springer, 2012, pages 437-478, which is incorporated herein in its entirety for details on optimisation.
  • the AE model may be viewed as a deterministic model that maps a set of inputs into their reconstructed outputs.
  • the VAE model is a generative model that treats the latent representation layer as random variables that are conditional on the inputs. While the encoder and decoder in the VAE follows the same computational model as the AE (i.e. as set out in equation (1)), the encoding process is instead used to compute the parameters for the conditional distributions of the latent representation layer. The parameters that are obtained from model training are then used to generate or sample the latent representation for decoding. A detailed explanation on the working of the VAE is set out in the later sections.
  • conditional distributions in the VAE model are generally assumed to be Gaussian for real-valued nodes.
  • if z_l denotes the value of the latent representation layer, then it can be written as z_l ~ N(μ_l, diag(σ_l²)) (equation (2)), where diag(·) denotes a function that transforms a vector into a diagonal matrix, and μ_l and σ_l² are the mean and variance for the conditional Gaussian distribution obtained from the output of the encoder.
  • the probabilistic nature of the VAE also means that the usual learning algorithm on a standard objective function (e.g. mean square error) may not be used to train the model. Instead, a class of approximate statistical inference methods known as Variational Bayes is used. The detailed steps are omitted here for brevity; the person skilled in the art may obtain more details from the publication by D.P. Kingma and M. Welling titled "Auto-encoding variational Bayes", arXiv preprint arXiv:1312.6114, 2013, which is incorporated herein in its entirety.
  • instead of the standard objective, an alternative objective function known as the variational lower bound is maximised.
  • stochastic sampling is used for approximation.
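The stochastic sampling step can be sketched via the reparameterisation trick, a standard VAE technique: draw z = μ + σ·ε with ε ~ N(0, I), so that gradients can flow through the sample. The sizes below follow the 100-unit latent layer mentioned earlier; the code is illustrative, not the patent's:

```python
# Sketch of the sampling step: given the encoder's mean and log-variance for
# the latent layer, draw z = mu + sigma * eps with eps ~ N(0, I)
# (the "reparameterisation trick").
import numpy as np

def sample_latent(mu, log_var, rng):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros(100)       # latent layer of size 100, as in the text
log_var = np.zeros(100)  # unit variance
z = sample_latent(mu, log_var, rng)
```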
  • the architecture of a VAE is similar to that of an AE as illustrated in Figure 4.
  • the training algorithm of the VAE model may be implemented using TensorFlow.
  • the parameters in the VAE are randomly initialized.
  • a forward pass on the encoder is subsequently performed by computing the distribution of the latent representation layer via equation (2).
  • several samples can be generated from the Gaussian distribution which are used to compute the variational lower bound, which consists of a Kullback-Leibler (KL) divergence term and an expectation term: -KL(q(z|x) ‖ p(z)) + E_{q(z|x)}[log p(x|z)].
  • KL Kullback-Leibler
  • z is the latent representation of the input features x.
  • the distributions p(·) correspond to the Gaussian prior and conditional distribution of the VAE model, while q(·) is a variational approximation of p(·), generally chosen to be Gaussian as well (such functions are commonly used in auto-encoding variational Bayes). Fortunately, this objective function can be maximized with stochastic optimization techniques since the gradients are readily available via automatic differentiation.
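For diagonal Gaussians, the KL term in the lower bound has a well-known closed form; a sketch (illustrative only, assuming a log-variance parameterisation of the encoder output):

```python
# KL divergence between the diagonal-Gaussian approximation
# q(z|x) = N(mu, diag(sigma^2)) and the standard-normal prior p(z) = N(0, I):
#   KL(q || p) = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
import numpy as np

def kl_diag_gaussian_vs_standard_normal(mu, log_var):
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# When q already equals the prior, the divergence is zero:
kl_zero = kl_diag_gaussian_vs_standard_normal(np.zeros(3), np.zeros(3))
```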
  • an optimisation algorithm that enables training to be performed in mini-batches, outlined in the publication by D.P. Kingma and J. Ba titled "Adam: A method for stochastic optimization", Third International Conference on Learning Representations, 2015 (herein incorporated in its entirety), may be used. This generally allows for real-time training by choosing a small mini-batch size and discarding the data after one epoch. Further, label information is not required during the training of the model.
  • the deep neural network model is then used for anomaly detection.
  • an IP address and its time window may be recognised by module 220 as abnormal when a reconstruction error of its input features is high.
  • the reconstruction error comprises the mean square difference between the observed features and the expectation of their reconstruction.
  • a high reconstruction error is generally observed when the network behaviour differs greatly from the normal behaviour that was learned by the VAE.
  • a threshold value may be selected such that a small percentage (say 5%) of the data is treated as anomalies. Otherwise, labelled information may be used to select the threshold to maximize the detection rate while keeping the false positive rate small. Note that although the anomalies correspond to unusual or unexpected network behaviours demonstrated by a particular source IP address, they may not necessarily be malicious in nature.
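The percentile rule for choosing the threshold can be sketched as follows (hypothetical error values; illustrative only):

```python
# Sketch of the percentile rule: pick the threshold so that a chosen small
# fraction (here 5%) of the training data is flagged as anomalous.
def percentile_threshold(errors, fraction=0.05):
    """Return the value exceeded by roughly `fraction` of `errors`."""
    ordered = sorted(errors)
    cut = round(len(ordered) * (1.0 - fraction))
    return ordered[min(cut, len(ordered) - 1)]

errors = list(range(100))                      # hypothetical reconstruction errors
threshold = percentile_threshold(errors, 0.05)
flagged = [e for e in errors if e > threshold]
```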
  • an anomaly is analysed by module 225 by obtaining gradient information from the deep neural network. This is done by analysing the gradient that is contributed by each feature at the anomalous data point, i.e. how the VAE’s objective function varies if a feature in the anomaly data point increases or decreases by a small amount. Intuitively, given the trained VAE and an anomaly data point, if the function (reconstruction error) changes quite a lot when a particular feature of the anomaly data point is varied by a small amount, then this feature at its current value is significantly abnormal, since it would likely perturb the VAE model (through optimization) to fit itself better.
  • Gradients are computed at step 125 for each feature from each data point i.
  • the gradient may be computed via automatic differentiation on the trained model, for example using the gradient facilities of the TensorFlow library.
  • the flagged anomalies can be clustered based on their gradients into groups that share similar behaviour, making it easier for analysts to investigate and identify the underlying attacks.
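The gradient fingerprint idea can be illustrated with a stand-in loss and a finite-difference gradient; a real system would use automatic differentiation on the trained VAE objective, as the text notes, and everything below (the reference point, loss, and values) is illustrative:

```python
# Illustrative sketch of the gradient fingerprint: the sensitivity of a loss
# L(x) to each input feature, approximated here by central finite differences
# on a stand-in quadratic loss.
import numpy as np

def numerical_gradient(loss, x, eps=1e-5):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (loss(x + step) - loss(x - step)) / (2 * eps)
    return grad

# Stand-in loss: squared distance to a hypothetical "normal" reference point.
normal = np.array([0.0, 0.0, 0.0])
loss = lambda x: np.sum((x - normal) ** 2)

anomaly = np.array([3.0, 0.0, 0.1])
fingerprint = numerical_gradient(loss, anomaly)
# The first feature dominates the fingerprint, marking it as the feature
# most responsible for the anomaly.
```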
  • VAE Variational Autoencoder
  • the VAE model that may be utilized in accordance with embodiments of the invention is discussed in greater detail in this section.
  • the VAE model may be used to represent the probability of a data point p(x) through a latent variable z, where the probability of a data point is given by p(x) = ∫ p(x|z) p(z) dz.
  • p(z) is the prior probability of the latent random variable z, which is assumed to be a standard Gaussian distribution, i.e. p(z) = N(z; 0, I), where 0 is a vector of all zeros of length d_z and I is an identity matrix.
  • given a data point x, the VAE would be able to learn about the posterior distribution of the latent random variable z, which is done via variational inference (VI). VI learns an approximated posterior such that it minimizes the Kullback-Leibler (KL) divergence between the approximate distribution q(z|x) and the true posterior p(z|x).
  • the mean and the variance of the approximate posterior are modeled with a neural network whose input is x and whose outputs are the mean μ and the variance σ². We denote this neural network as f_encode.
  • the KL divergence term in Equation (8) has a closed-form expression, since both distributions are Gaussian;
  • the log-likelihood term in Equation (8) can be replaced with the mean squared error (MSE) for real-valued data; and
  • the objective function may then be derived as Equation (10), i.e. the sum of the reconstruction error (MSE) and the KL divergence term.
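For reference, the standard VAE objective consistent with the description above (reconstruction MSE plus a closed-form Gaussian KL term) takes the following form; the exact constants and signs in the patent's Equation (10) may differ:

```latex
\mathcal{L}(x) \;=\;
\underbrace{\lVert x - \hat{x} \rVert^{2}}_{\text{reconstruction (MSE)}}
\;+\;
\underbrace{-\tfrac{1}{2}\sum_{i=1}^{d_z}\left(1 + \log\sigma_i^{2} - \mu_i^{2} - \sigma_i^{2}\right)}_{\mathrm{KL}\left(\mathcal{N}(\mu,\sigma^{2} I)\,\|\,\mathcal{N}(0, I)\right)}
```

where (μ, σ²) = f_encode(x), z is sampled from N(μ, σ²I), and x̂ = f_decode(z).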
  • Training of the VAE is iterative. At each iteration, a random mini-batch is drawn from the dataset, over which the average of L(x) is minimized. This is done using any available optimization package, such as Adam, which is available in the Tensorflow library. In particular, when this library is adopted, the method minimize can be called on the class tf.keras.optimizers.Adam with the average of Equation (10) over a mini-batch as the objective function, and the parameters of f_encode and f_decode as the trainable variables.
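The mini-batch training loop described above can be sketched as follows. This is a simplified stand-in in NumPy: a plain linear autoencoder trained with a hand-rolled Adam update, rather than the full VAE objective and the tf.keras.optimizers.Adam call; all names are illustrative.

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter array."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

def train(X, dim_z=2, epochs=50, batch_size=300, seed=0):
    """Iteratively minimize the average reconstruction error over random mini-batches."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=0.1, size=(d, dim_z))   # "encoder" weights
    V = rng.normal(scale=0.1, size=(dim_z, d))   # "decoder" weights
    sW = (np.zeros_like(W), np.zeros_like(W), 0)
    sV = (np.zeros_like(V), np.zeros_like(V), 0)
    for _ in range(epochs):
        idx = rng.permutation(len(X))            # draw random mini-batches
        for start in range(0, len(X), batch_size):
            xb = X[idx[start:start + batch_size]]
            z = xb @ W                           # encode
            xr = z @ V                           # decode (reconstruction)
            err = (xr - xb) / len(xb)            # gradient of MSE w.r.t. xr, up to a constant
            gV = z.T @ err
            gW = xb.T @ (err @ V.T)
            W, sW = adam_step(W, gW, sW)
            V, sV = adam_step(V, gV, sV)
    return W, V
```

In the patent's setting, the loss would be the average of Equation (10) over the mini-batch and the trainable variables the parameters of f_encode and f_decode.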
  • if L(x) in Equation (10) is large, it can be assumed that x is anomalous.
  • let δ be a threshold value such that L(x) exceeds δ for only a small fraction (e.g. 5%) of the dataset.
  • the gradient of the loss may be computed with respect to (w.r.t.) the anomalous data point x*, i.e. ∇L(x*), where L(x) is defined in Equation (10).
  • the data point x* is represented as a d_x-dimensional vector, and ∂L(x*)/∂x*_1 denotes the partial derivative of L(x*) w.r.t. the first feature/dimension of x* (these d_x dimensions are also referred to as the features).
  • the gradients may be computed using tf.gradients in the Tensorflow library by specifying Equation (10) as the ys and x* as the xs.
  • the i-th element of the negative gradient represents the adjustment to the i-th element of x*. Hence, we can make an anomaly x* less anomalous by adjusting its features in the direction of the negative gradient.
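The per-feature gradient analysis can be illustrated with central finite differences, as a stand-in for tf.gradients; the linear model M below is a hypothetical substitute for the trained VAE, and the loss is a plain reconstruction error:

```python
import numpy as np

def loss(x, M):
    """Reconstruction error of a single point under a fixed linear model M."""
    return float(np.sum((x @ M - x) ** 2))

def gradient_fingerprint(x, M, h=1e-5):
    """How much the loss changes when each feature of x is nudged up or down."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (loss(x + e, M) - loss(x - e, M)) / (2 * h)
    return g
```

With M reconstructing every feature except the last, an anomalous value in that feature dominates the fingerprint, mirroring the intuition that a feature whose small perturbation changes the loss significantly is the abnormal one.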
  • Such an example is illustrated in Figure 5, where the dataset consists of 2-dimensional data points following a Gaussian distribution (plots 505 and 510). We assume that data points that are more than 2 standard deviations from the mean may be treated as anomalies (plotted as dots in plots 505 and 510).
  • the bottom 2 plots, 515 and 520, show the negative values of the gradients of both the VAE loss, L(x), and the KL loss, at different data points.
  • the negative gradients of the VAE loss or the KL loss show the direction of the adjustment to these data points to make them normal (i.e., towards the mean/center in this case).
  • the UGR16 dataset, which contains anonymized NetFlow traces captured from a real network of a Tier 3 ISP, is used herein as an example.
  • the ISP provides cloud services and is used by many client companies of different sizes and markets.
  • the UGR trace is a fairly recent and large scale data trace that contains real background traffic from a wide range of Internet users, rather than specific traffic patterns from synthetically generated data (e.g., DARPA'98 and DARPA'99, UNB ISCX 2012, UNSW-NB15, CTU13).
  • the UGR contains traffic for the whole day, with traces spanning a 4-month period. Furthermore, the UGR's attack traffic data is a mixture of generated attacks, labelled real attacks, and botnet attacks from a controlled environment. Specifically, the labelled attacks comprise: • Low-rate Denial-of-Service (DoS): TCP SYN packets are sent to port 80 of the victims, with a packet size of 1280 bits and a rate of 100 packets/s. The rate of the attack is sufficiently low that the normal operation of the network is not affected.
  • Port scanning: a continuous SYN scan of common ports of the victims. There are two kinds of scanning: a one-to-one scan attack (Scan11) and a four-to-four scan attack (Scan44).
  • Botnet: simulated botnet traffic obtained from the execution of the malware known as Neris. This data comes from the CTU13 trace.
  • Blacklist: IP addresses published in public blacklists.
  • a total of five days of the UGR data were selected for use herein. Two Saturdays were used as training data, while three other days (a Friday and two Sundays) were chosen as test days.
  • a training dataset of 5,990,295 data points was obtained.
  • the data was then used by training module 215 (contained within module 210) to train a VAE model by stochastic optimization with 50 epochs and a mini-batch of size 300, whereby the weight decay was set to 0.01.
  • a total of 1,957,711 data points on March 18, 2,954,983 data points on March 20, and 2,878,422 data points on July 31 were processed. It should be noted that a data point is deemed to belong to an attack type if the proportion of flows labelled with that attack within the 3-minute aggregation is greater than 50%.
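The 50% rule in the last sentence amounts to a simple proportion check, sketched below with illustrative names (per-flow labels within one 3-minute aggregation are assumed to be available):

```python
def label_window(flow_labels, attack, threshold=0.5):
    """A data point (one source IP in one 3-minute window) is assigned an
    attack type when more than `threshold` of its flows carry that label."""
    flow_labels = list(flow_labels)
    frac = sum(1 for lbl in flow_labels if lbl == attack) / len(flow_labels)
    return frac > threshold
```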
  • Figures 6(a)-(f) illustrate the distribution of the reconstruction errors that were generated for the training data.
  • the ground truth labels were used to separate the anomalies from the background flows, which allows the anomalies to be examined to determine whether they behave differently from what was expected.
  • as shown in Figures 6(a)-(f), there is some overlap in the reconstruction error plots with the background plots for spam (Figure 6(a)), botnet (Figure 6(b)), DoS (Figure 6(c)), and scanning activities (Figures 6(d)-(f)).
  • a cut-off point may be used to distinguish between background noise versus an attack such as spam, botnet, DoS, and scanning activities.
  • care should be taken for blacklist type anomalies as the behaviour for blacklist appears to be indistinguishable from the background traffic.
  • the performance of the VAE model for anomaly detection was compared against that of AE and also a Gaussian-based thresholding (GBT) approach.
  • the baseline AE was configured to share the same architecture as the VAE.
  • the AE was implemented using Keras, a high level open source neural network library and the same TensorFlow backend was used for training the AE, i.e. by iteratively minimizing the reconstruction error (mean square error) using the stochastic optimizer Adam, with a mini-batch size chosen to be 256. Similar to the VAE, data points that have large reconstruction errors were then flagged as anomalies.
  • threshold-based approaches such as GBT work well for attacks that cause the volume of certain categories of traffic to increase significantly, such as spam and port scanning type attacks.
  • however, GBT does not appear to work as well for botnet and low-rate DoS type attacks in this example.
  • AE does not work as well for spam and low rate DoS type attacks.
  • the AE model may be unable to differentiate spam from regular email traffic because of the high volume of such traffic, and may be unable to detect low-rate DoS attacks due to the low volume of data.
  • the VAE model is the most robust and provides the best performance for detecting all attack modes.
  • Table II summarizes the individual and average area under the ROC curve (AUC) for the various types of attacks (except blacklist) and algorithms, and shows that VAE provides the best overall performance in terms of AUC.
  • the gradients for all the features for the various attacks may then be computed based on the VAE model.
  • Figure 9 illustrates the gradients of the features for spam, Scan11, and Scan44, whereby the black bars reflect one standard error for the gradients (which is useful for assessing the significance of the gradients, i.e., whether they are due to noise or not). Features that are not significant are also presented as contrast.
  • the ROC curves for the various attacks detected are plotted in the following way.
  • let L2 denote the Euclidean distance between the average gradient fingerprint obtained from data with labelled attacks and the gradient of the VAE's objective function w.r.t. each test data point;
  • let the normalized L2 distance denote the same distance metric computed on normalized gradient vectors, i.e. with each gradient vector scaled to unit length. This normalized distance is proportional to the cosine distance, since ||u/||u|| - v/||v||||² = 2(1 - cos(u, v)), where ||·|| denotes the Euclidean norm;
  • the ROC is then produced by varying the threshold on the L2 or normalized distance.
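The two distances can be sketched as below; the cosine relation follows from the standard identity for unit vectors (the exact formula in the source is not fully legible, so this is a hedged reconstruction, and the names are illustrative):

```python
import numpy as np

def l2_distance(g, fp):
    """Euclidean distance between a test gradient g and an attack fingerprint fp."""
    return float(np.linalg.norm(g - fp))

def normalized_distance(g, fp):
    """Distance between unit-length gradient vectors; proportional to cosine
    distance, since ||u/||u|| - v/||v||||^2 = 2 * (1 - cos(u, v))."""
    u = g / np.linalg.norm(g)
    v = fp / np.linalg.norm(fp)
    return float(np.linalg.norm(u - v))

def detect(gradients, fingerprint, threshold, metric=l2_distance):
    """Flag test points whose gradient lies within `threshold` of the fingerprint;
    sweeping `threshold` traces out the ROC curve."""
    return [metric(g, fingerprint) <= threshold for g in gradients]
```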
  • the gradients generated by the VAE module 210 may be used for clustering.
  • the idea is that if the clustering is effective, the identification of the types of attacks may be limited to a smaller number of clusters thereby increasing the overall identification speed of the system.
  • Figure 11 illustrates the clusters that four of the attacks belong to when they are clustered accordingly. For example, 92.4% of the DoS attacks appear in only two clusters (c82 and c84), with the other 7.6% appearing in four other clusters. For spam attacks, 74.3% appeared in two clusters (c11 and c15), while 25.7% appeared in another 11 clusters.
  • clustering is an effective tool that allows analysts to focus on a much smaller number of clusters for a particular type of attack.
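A sketch of the clustering step using plain Lloyd's k-means over gradient fingerprints (the patent does not name a specific clustering algorithm, so k-means is an assumption; farthest-point initialisation keeps the sketch deterministic):

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Cluster gradient fingerprints X (n points x d features) into k groups."""
    # farthest-point initialisation: robust when clusters are well separated
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # assign each fingerprint to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```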


Abstract

This document describes a system and method for detecting and identifying network anomalies. In particular, this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic using deep learning techniques and a gradient-based fingerprinting technique.

Description

SYSTEM AND METHOD FOR NETWORK ANOMALY DETECTION AND ANALYSIS
Field of the Invention
This invention relates to a system and method for detecting and identifying network anomalies. In particular, this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic. This is done using deep learning techniques and a gradient-based fingerprinting technique.
Summary of Prior Art
Anomalies in network traffic may occur due to various types of cyber-related attacks and threats, such as Distributed Denial-of-Service (DDoS) attacks (e.g., TCP SYN flooding, DNS amplification attacks, etc.), brute force attempts, botnet communications, spam campaigns, network/port scans, etc. Such network anomalies may also occur due to non-malicious causes, such as faults that occur in the network, misconfigurations, improper Border Gateway Protocol (BGP) policy updates, changes in user behaviours, etc. Hence, in order to ensure minimal disruptions to the network and to maintain the security of the network, these anomalies should be detected, identified and categorized in real time so that they may be rapidly addressed and resolved as required. Maintaining the network’s integrity is of increasing importance, especially with the rapid adoption of Internet of Things (IoT) technology. With wider implementations of IoT devices, cyber criminals are able to target and harness more resources by exploiting the vulnerabilities found in IoT devices (e.g. by using the Mirai attack).
The detection of network anomalies remains challenging for a number of reasons. First, the characteristics of network data are dependent on a number of factors, such as the end-user’s behaviour, the customer’s types of business (e.g., banking, retail), the types of applications, the location, the time of the day, and are expected to change with time. Such diversity and dynamism limit the application of rule-based systems for the detection of anomalies in network traffic.
Further, as capturing, storing and processing raw traffic from high capacity networks is not practical, Internet routers today typically extract and export meta data in the form of, but not limited to, NetFlow records. A typical NetFlow record represents meta data of a set of related packets, and is often generated from sampled packets. A downside of using NetFlow records is that useful information, including suspicious keywords in payloads, Transmission Control Protocol (TCP) state transitions, Transport Layer Security (TLS) connection handshakes, sizes of each packet, time between consecutive packets, etc., is unavailable. Therefore, anomaly detection solutions that utilize NetFlow records may contain incomplete and/or lossy information.
Finally, as security operation centre (SOC) analysts usually have a limited budget for analysing alerts raised by an anomaly detector (including alert escalation, threat and attack mitigation, intelligence gathering, forensic analysis, etc.), it is desirable that anomaly detectors go beyond merely indicating the presence of anomalies, and also seek to provide other information such as the time of the anomaly, the anomaly type, and the corresponding set of suspicious flows. In general, the more information that is passed (along with the alerts) to the analysts, the faster the analysis of the threat and the quicker the decision process.
Those skilled in the art have proposed that statistical models and machine learning algorithms be used to address the problems mentioned above. In other words, when the detection of anomalies is treated as a binary classification problem, a supervised machine learning model can be built using normal and anomalous data, and this may be used to classify anomalies. However, existing approaches have the following limitations.
First, many of the approaches exploit only a small number of features (e.g. traffic volume, flow rates, or entropy). Such approaches require the users to use the appropriate domain knowledge to select the right set of features, which may not always be feasible or optimal.
Second, supervised machine learning approaches require large sets of data with ground truth for training. As the network’s characteristics and attacks evolve, models have to be retrained and the corresponding labelled datasets have to be made available. This requires costly and laborious manual efforts, and yet, given the size of traffic flowing through a backbone network, it is highly impractical to assume that all data records are correctly labelled. Further, supervised machine learning approaches are unlikely to detect unseen and zero-day attack traffic.
In view of the above, it is most desirable to employ an unsupervised network anomaly detection method that is scalable in terms of both network data size and feature dimension. Such a method would be useful in detecting anomalies in large scale networks without the need to rely on domain knowledge. Further, in addition to detecting anomalies, the method should be able to analyse the detected anomalies and ascertain the type of attacks by identifying the main features that cause the anomalies. For the above reasons, those skilled in the art are constantly striving to come up with a system and method that is capable of receiving and quantitatively unifying unstructured and/or unlabelled information security threat data from any source or system, whereby the processed information is then provided back to all the upstream systems to actively tune and improve the security postures of these systems in near-real-time.
Summary of the Invention
The above and other problems are solved and an advance in the art is made by systems and methods provided by embodiments in accordance with the invention.
A first advantage of embodiments of systems and methods in accordance with the invention is that based on gradient information obtained from the deep neural network model, flagged anomalies may be effectively and efficiently identified.
A second advantage of embodiments of systems and methods in accordance with the invention is that gradient information obtained from the deep neural network model may be employed to detect future anomalies that have yet to be labelled or identified.
A third advantage of embodiments of systems and methods in accordance with the invention is that the invention does not require labelled data as training data and as a result, is likely to detect zero-day type network attacks.
A fourth advantage of embodiments of systems and methods in accordance with the invention is that the invention provides an efficient way of explaining and/or identifying attacks from detected anomalous traffic, as not all anomalous traffic may comprise cyber-attacks on the network.
The above advantages are provided by embodiments of a method in accordance with the invention operating in the following manner.
According to a first aspect of the invention, a method for detecting and analysing anomalies in network traffic is disclosed whereby the method performed by a computer system comprises: collecting network data from the network traffic; extracting one or more features from the network data to form a dataset; providing the dataset to a deep neural network model to train the deep neural network model; detecting anomalies in the network traffic using the trained deep neural network model; and computing gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies. With reference to the first aspect, the step of extracting one or more features from the network data to form the dataset comprises: grouping the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses, wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
With reference to the first aspect, the one or more features or aggregated features extracted from the network data comprises one or more from the group comprising of: mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate; entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and proportion of ports used for common applications including Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
With reference to the first aspect, the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
With reference to the first aspect, the training of the VAE model is performed using Adam as an optimisation algorithm.
With reference to the first aspect, the step of using the trained deep neural network model to detect anomalies in the network traffic comprises: obtaining a reconstruction error for each data point; and comparing the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
With reference to the first aspect, the method further comprises the step of clustering the computed gradients for the one or more features.
With reference to the first aspect, the method further comprises the step of identifying types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
According to a second aspect of the invention, a computer system for detecting and analysing anomalies in network traffic is disclosed, the system comprising circuitry configured to collect network data from the network traffic; circuitry configured to extract one or more features from the network data to form a dataset; circuitry configured to provide the dataset to a deep neural network model to train the deep neural network model; circuitry configured to detect anomalies in the network traffic using the trained deep neural network model; and circuitry configured to compute gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
With reference to the second aspect, the circuitry configured to extract the one or more features from the network data to form the dataset further comprises: circuitry configured to group the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses, wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
With reference to the second aspect, the one or more features or aggregated features extracted from the network data comprises one or more from the group comprising of: mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate; entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and proportion of ports used for common applications including Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
With reference to the second aspect, the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
With reference to the second aspect, the training of the VAE model is performed using Adam as an optimisation algorithm.
With reference to the second aspect, the circuitry configured to use the trained deep neural network model to detect anomalies in the network traffic comprises: circuitry configured to obtain a reconstruction error for each data point; and circuitry configured to compare the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
With reference to the second aspect, the computer system further comprises circuitry configured to cluster the computed gradients for the one or more features.
With reference to the second aspect, the computer system further comprises circuitry configured to identify types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.

Brief Description of the Drawings
The above and other problems are solved by features and advantages of a system and method in accordance with the present invention described in the detailed description and shown in the following drawings.
Figure 1 illustrating a process or method for detecting and analysing anomalies in a network in accordance with embodiments of the invention;
Figure 2 illustrating a block diagram of modules that may be used to implement the method for detecting and analysing network anomalies in accordance with embodiments of the invention;
Figure 3 illustrating a block diagram representative of processing systems providing embodiments in accordance with embodiments of the invention;
Figure 4 illustrating an exemplary architecture of an Autoencoder (AE) and a Variational Autoencoder (VAE) in accordance with embodiments of the invention;
Figure 5 illustrating plots of VAE loss and Kullback-Leibler (KL) loss from a recovered Gaussian distribution;
Figure 6 illustrating distributions of reconstruction errors of training data for various anomalies as obtained using the VAE model for Example 1;
Figure 7 illustrating the receiver operation characteristic (ROC) curves for VAE, AE and Gaussian-based thresholding (GBT) in detecting anomalies in the training dataset for Example 1;
Figure 8 illustrating the ROC curves for VAE, AE and GBT in detecting anomalies in the test dataset for Example 1;
Figure 9 illustrating the normalized gradients of spam, Scan11 and Scan44 based on selected features using the VAE model for Example 1;
Figure 10 illustrating the ROC curves for anomaly detection using fingerprints for Example 1; and
Figure 11 illustrating the distribution of clusters for each attack type for Example 1.

Detailed Description
This invention relates to a system and method for detecting and identifying network anomalies. In particular, this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic using deep learning techniques and a gradient-based fingerprinting technique. Still more particularly, this invention relates to a system and method for processing a network’s meta data, such as NetFlow records obtained from network monitoring activities, using a deep neural network model to detect network anomalies whereby the anomalies are subsequently analysed based on gradient information obtained from the deep neural network model.
The present invention will now be described in detail with reference to several embodiments thereof as illustrated in the accompanying drawings. In the following description, numerous specific features are set forth in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art, that embodiments may be realised without some or all of the specific features. Such embodiments should also fall within the scope of the current invention. Further, certain process steps and/or structures in the following may not be described in detail, and the reader will be referred to a corresponding citation, so as to not obscure the present invention unnecessarily.
Further, one skilled in the art will recognize that many functional units in this description have been labelled as modules throughout the specification. The person skilled in the art will also recognize that a module may be implemented as circuits, logic chips or any sort of discrete component. Still further, one skilled in the art will also recognize that a module may be implemented in software which may then be executed by a variety of processors. In embodiments of the invention, a module may also comprise computer instructions or executable code that may instruct a computer processor to carry out a sequence of events based on instructions received. The choice of the implementation of the modules is left as a design choice to a person skilled in the art and does not limit the scope of this invention in any way.
Figure 1 sets out an exemplary flowchart of process 100 for detecting, analysing and identifying a network anomaly in accordance with embodiments of the invention. Process 100 comprises the following steps:
Step 105: collecting network data;
Step 110: extracting one or more features from the network data that is collected in step 105 to form a training dataset;
Step 115: feeding the training dataset from step 110 into a deep neural network model, thereby training the deep neural network model to learn normal behaviour of the network;
Step 120: using the trained deep neural network model to detect an anomaly; and
Step 125: obtaining gradient information of the anomaly from the deep neural network to analyse and identify the anomaly.
The steps of process 100 may be performed by modules contained within system 200, as illustrated in Figure 2, whereby system 200 comprises feature extraction module 205, gradient fingerprinting module 225 and VAE module 210 (which in turn comprises training module 215 and anomaly detection module 220). In embodiments of the invention, system 200 may comprise a computer system.
In accordance with embodiments of the invention, a block diagram representative of components of processing system 300 that may be provided within modules 205, 210, 215, 220 and 225 for implementing embodiments in accordance with embodiments of the invention is illustrated in Figure 3. One skilled in the art will recognize that the exact configuration of each processing system provided within these modules may be different, that the exact configuration of processing system 300 may vary, and that Figure 3 is provided by way of example only.
In embodiments of the invention, each of modules 205, 210, 215, 220 and 225 may comprise controller 301 and user interface 302. User interface 302 is arranged to enable manual interactions between a user and each of these modules as required and for this purpose includes the input/output components required for the user to enter instructions to provide updates to each of these modules. A person skilled in the art will recognize that components of user interface 302 may vary from embodiment to embodiment but will typically include one or more of display 340, keyboard 335 and track-pad 336.
Controller 301 is in data communication with user interface 302 via bus 315 and includes memory 320, processor 305 mounted on a circuit board that processes instructions and data for performing the method of this embodiment, an operating system 306, an input/output (I/O) interface 330 for communicating with user interface 302 and a communications interface, in this embodiment in the form of a network card 350. Network card 350 may, for example, be utilized to send data from these modules via a wired or wireless network to other processing devices or to receive data via the wired or wireless network. Wireless networks that may be utilized by network card 350 include, but are not limited to, Wireless-Fidelity (Wi-Fi), Bluetooth, Near Field Communication (NFC), cellular networks, satellite networks, telecommunication networks, Wide Area Networks (WAN), etc.
Memory 320 and operating system 306 are in data communication with CPU 305 via bus 310. The memory components include both volatile and non-volatile memory, and more than one of each type of memory, including Random Access Memory (RAM) 320, Read Only Memory (ROM) 325 and a mass storage device 345, the last comprising one or more solid-state drives (SSDs). Memory 320 also includes secure storage 346 for securely storing secret keys, or private keys. One skilled in the art will recognize that the memory components described above comprise non-transitory computer-readable media and shall be taken to comprise all computer-readable media except for a transitory, propagating signal. Typically, the instructions are stored as program code in the memory components but can also be hardwired. Memory 320 may include a kernel and/or programming modules such as a software application that may be stored in either volatile or non-volatile memory.
Herein, the term "processor" is used to refer generically to any device or component that can process such instructions and may include: a microprocessor, microcontroller, programmable logic device or other computational device. That is, processor 305 may be provided by any suitable logic circuitry for receiving inputs, processing them in accordance with instructions stored in memory and generating outputs (for example to the memory components or on display 340). In this embodiment, processor 305 may be a single core or multi-core processor with memory addressable space. In one example, processor 305 may be multi-core, comprising, for example, an 8-core CPU.
With reference to Figure 1, at step 105, network data is constantly collected by network routers within a monitored network. In accordance with embodiments of the invention, step 105 may be performed by feature extraction module 205 as illustrated in Figure 2.
In the following description, for illustration purposes, it shall be assumed that the network data collected by module 205 comprises NetFlow records. It should be appreciated that other forms of aggregate data that are collectable by routers in an Internet Service Provider (ISP) network may be collected at step 105 by module 205 and that the type of data collected at step 105 is not specifically limited only to NetFlow type records.
As known to one skilled in the art, a NetFlow record comprises a set of packets that has the same five-tuple of source and destination IP addresses, source and destination ports, and protocol. In addition to the above, some of the important fields collected in the NetFlow records include, but are not limited to, the start time of the flow (based on the first sampled packet), duration, number of packets and Transmission Control Protocol (TCP) flag.
At step 110, using module 205, one or more features may be extracted from the NetFlow records collected at step 105 to form a training dataset. In an embodiment of the invention, the NetFlow records are first grouped into sliding windows of a predetermined duration based on source IP addresses before one or more aggregated features are extracted from each window and statistically analysed. In a preferred embodiment, the NetFlow records are grouped into 3-minute long sliding windows based on the source IP addresses. This means that each data point in the training dataset corresponds to network statistics from a single source IP address within a 3-minute period. This enables identification of an offending IP address that is responsible for an anomaly and the time window to which the anomaly belongs. The period of 3 minutes is chosen as a balance between the practicality and quality of statistics from the aggregated features, whereby the statistics will be insignificant if the window period is too short, and the capability for real-time analysis is lost if the window period is too long. Hence, as would be understood by the skilled person, the duration of a 3-minute sliding window is not meant to be limiting to the present invention and other durations may be used without departing from this invention.
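For illustration purposes, the grouping step described above may be sketched in Python as follows. The record field names (`src_ip`, `start`) are assumptions of this sketch, not a NetFlow schema, and non-overlapping (tumbling) windows are used for simplicity; a full implementation would place each flow into every overlapping sliding window.

```python
from collections import defaultdict

WINDOW_SECONDS = 180  # 3-minute aggregation window, as in the embodiment


def group_flows(flows):
    """Group NetFlow-like records into (source IP, window) buckets.

    Each record is a dict with at least 'src_ip' and 'start' (epoch seconds).
    One bucket corresponds to one data point in the training dataset.
    """
    buckets = defaultdict(list)
    for flow in flows:
        window_id = int(flow["start"] // WINDOW_SECONDS)
        buckets[(flow["src_ip"], window_id)].append(flow)
    return buckets


flows = [
    {"src_ip": "10.0.0.1", "start": 10.0},
    {"src_ip": "10.0.0.1", "start": 170.0},
    {"src_ip": "10.0.0.1", "start": 190.0},  # falls in the next window
    {"src_ip": "10.0.0.2", "start": 20.0},
]
buckets = group_flows(flows)
```

Each bucket would then be passed to the feature-extraction routines to compute the aggregated statistics described below.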
In a preferred embodiment, 53 aggregated features are extracted from the NetFlow records to form the training dataset. These include:
• mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate;
• entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and
• proportion of ports used for common applications (e.g. Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3).
To ensure that meaningful statistics are captured, data points that contain too few flows (e.g. fewer than 10) may be removed from the training dataset to reduce noise. The statistics may be further normalised into a Z-score or scaled to a value between 0 and 1.
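The filtering and normalisation described above may be sketched as follows; the function names and the `n_flows` field are illustrative only.

```python
def filter_sparse(points, min_flows=10):
    """Drop data points whose window contains too few flows."""
    return [p for p in points if p["n_flows"] >= min_flows]


def zscore(values):
    """Normalise a feature column into Z-scores (zero mean, unit variance)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


def minmax(values):
    """Scale a feature column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in values]
```

Either normalisation may be applied per feature; both leave the data dimensionless so that no single feature dominates the reconstruction error.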
At step 115, the training dataset from step 110 is fed into a deep neural network model found in training module 215 to learn the normal behaviour of the network. Various unsupervised deep neural network models may be used. These models do not require labelled information (i.e. normal or anomalous) for training and instead exploit the fact that anomalous behaviours tend to differ greatly from the standard or normal behaviour of the network. In an embodiment of the invention, the unsupervised deep neural network model may comprise an Autoencoder (AE) model or a Variational Autoencoder (VAE) model that is a probabilistic generalisation of the AE model.
An exemplary architecture 400 for an AE or VAE model is illustrated in Figure 4. In general, the AE or VAE model is made of three main layers: an input layer 401 to take in the features; a latent representation layer 402 of the features; and an output layer 403 that is a reconstruction of the features. As illustrated, the AE or VAE model comprises two parts, an encoder 405 and a decoder 410. The function of encoder 405 is to map a feature from the input layer 401 into its latent representation in the latent representation layer 402 while decoder 410 derives an output in the output layer 403 that is a reconstruction of the feature from the input layer 401 based on the latent representation.
The encoder 405 may be considered as part of a deep neural network in the sense that information from the input is passed through several mappings (and hidden layers) similar to the deep architecture in a supervised deep learning model; and likewise for the decoder 410.
As an illustrative example, the latent representation layer 402 may be set to a size of 100. In addition, the encoder 405 and the decoder 410 may each have three hidden layers with sizes 512, 512, and 1,024 respectively, as illustrated in Figure 4. In this illustration, nodes that are shaded represent the observed data (used as both inputs and outputs), while the unshaded nodes represent unobserved latent variables that correspond to the hidden layers. The exact sizes or dimensions of the layers are shown above the nodes in this illustration. Additionally, the links between the layers show how the values of the next layer can be computed. Commonly, for an AE model, the value of one hidden layer can be computed as:

h_{l+1} = g(W h_l + b) (1)

where h_l is a vector of values for the previous layer, W is a matrix of weights that signifies the relationship from the previous layer, and b is a vector of bias terms. Both W and b are parameters to be learned through training and optimising the model using the training dataset from step 110. The function g(·) is known as the activation function that transforms the computation in a non-linear way and allows complex relationships to be learned. Popularly used activation functions include the sigmoid function g(x) = 1/(1 + e^{−x}) and the Rectified Linear Unit (ReLU), g(x) = max(0, x), which is preferably used in the present invention.
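Equation (1) may be illustrated with a minimal NumPy sketch. The layer sizes follow Figure 4 (53 input features, a first hidden layer of 512); the random weights are placeholders standing in for trained parameters.

```python
import numpy as np


def relu(x):
    """Rectified Linear Unit activation, g(x) = max(0, x)."""
    return np.maximum(0.0, x)


def dense(h_prev, W, b, g=relu):
    """Equation (1): compute the next layer from the previous layer's values."""
    return g(W @ h_prev + b)


rng = np.random.default_rng(0)
h0 = rng.standard_normal(53)             # one data point of 53 features
W1 = rng.standard_normal((512, 53)) * 0.05  # placeholder, not trained weights
b1 = np.zeros(512)
h1 = dense(h0, W1, b1)                   # first hidden layer of the encoder
```

Stacking several such layers in the encoder and decoder yields the full AE; training then fits every W and b jointly.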
The learning of the parameters is generally achieved by minimizing the reconstruction errors (e.g. mean square errors) via backpropagation with random initialization, and can be optimized with a variety of optimizers such as the stochastic gradient descent optimizer. The detailed optimization steps are omitted here for brevity, and the person skilled in the art may obtain more details from the publication by Y. Bengio titled "Practical recommendations for gradient-based training of deep architectures", Neural Networks: Tricks of the Trade, Springer, 2012, pages 437-478, which is incorporated herein in its entirety for details on optimisation.
Based on the above, it can be said that the AE model may be viewed as a deterministic model that maps a set of inputs into their reconstructed outputs. On the other hand, the VAE model is a generative model that treats the latent representation layer as random variables that are conditional on the inputs. While the encoder and decoder in the VAE follow the same computational model as the AE (i.e. as set out in equation (1)), the encoding process is instead used to compute the parameters for the conditional distributions of the latent representation layer. The parameters that are obtained from model training are then used to generate or sample the latent representation for decoding. A detailed explanation on the working of the VAE is set out in the later sections.
In general, the conditional distributions in the VAE model are assumed to be Gaussian for real-valued nodes. For example, when z_l is denoted as the value of the latent representation layer, then it can be written as

z_l ~ N(μ_z, diag(σ_z²)) (2)

where diag(·) denotes a function that transforms a vector into a diagonal matrix, and μ_z and σ_z² are the mean and variance for the conditional Gaussian distribution obtained from the output of the encoder:

(μ_z, σ_z²) = f_encode(x) (3)

The parameters are interpreted in the same way as in the AE
model. Treatment of the hidden layers is also identical to that of the AE. The probabilistic nature of the VAE also means that the usual learning algorithm on a standard objective function (e.g. mean square error) may not be used to train the model. Instead, a class of approximate statistical inference methods known as Variational Bayes is used. The detailed steps are omitted here for brevity, and the person skilled in the art may obtain more details from the publication by D.P. Kingma and M. Welling titled "Auto-encoding variational Bayes", arXiv preprint arXiv:1312.6114, 2013, which is incorporated herein in its entirety. In essence, where the VAE model is used, an alternative objective function known as the variational lower bound is optimized, and stochastic sampling is used for approximation. In terms of architecture, the architecture of a VAE is similar to that of an AE as illustrated in Figure 4. The ReLU activation function is used by the encoder and the decoder in all of the intermediate layers, and the linear activation g(x) = x is used for the output.
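The sampling of the latent representation from the conditional Gaussian of Equation (2) may be sketched as follows, using the standard reparameterisation z = μ + σ ⊙ ε with ε ~ N(0, I). Working with log σ² and the latent size of 100 from Figure 4 are assumptions of this sketch; the encoder network producing μ and log σ² is omitted.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(sigma^2)) via the reparameterisation trick.

    Keeping the noise eps separate from (mu, sigma) is what makes the
    sampling step differentiable with respect to the encoder outputs.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


rng = np.random.default_rng(42)
mu = np.zeros(100)       # latent representation layer of size 100 (Figure 4)
log_var = np.zeros(100)  # i.e. sigma = 1
z = sample_latent(mu, log_var, rng)   # one stochastic latent sample
```

The sampled z is then fed to the decoder to produce the reconstruction, exactly as a deterministic AE would use its latent layer.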
In this regard, the training algorithm of the VAE model may be implemented using TensorFlow. In summary, before starting the training procedure, the parameters in the VAE are randomly initialized. A forward pass on the encoder is subsequently performed by computing the distribution of the latent representation layer via equation (2). With this, several samples can be generated from the Gaussian distribution which are used to compute the variational lower bound, which consists of a Kullback-Leibler (KL) divergence term and an expectation term:
L(x) = E_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ p(z)) (4)
where z is the latent representation of the input features x. Here, the distribution p(·) corresponds to the Gaussian prior and conditional distribution of the VAE model, while q(·) is a variational approximation of p(·), generally chosen to be Gaussian as well (as is common in auto-encoding variational Bayes). Fortunately, this objective function can be maximized with stochastic optimization techniques since the gradients are readily available via automatic differentiation.
As an example, an optimisation algorithm that enables training to be performed in mini-batches, as outlined in the publication by D.P. Kingma and J. Ba titled "Adam: A method for stochastic optimization", Third International Conference on Learning Representations, 2015, herein incorporated in its entirety, may be used. This generally allows for real-time training by choosing a small mini-batch size and discarding the data after one epoch. Further, label information is not required during the training of the model.
Returning to Figure 1, at step 120 and using module 220, once the parameters have been optimised by module 215 at step 115, the deep neural network model is then used for anomaly detection. As an example, when the VAE model is used in module 220, an IP address and its time window may be recognised by module 220 as abnormal when a reconstruction error of its input features is high. For clarity, the reconstruction error comprises the mean square difference between the observed features and the expectation of their reconstruction. A high reconstruction error is generally observed when the network behaviour differs greatly from the normal behaviour that was learned by the VAE. In embodiments of the invention, a threshold value may be selected such that a small percentage (say 5%) of the data is treated as anomalies. Otherwise, labelled information may be used to select the threshold to maximize the detection rate while keeping the false positive rate small. Note that although the anomalies correspond to unusual or unexpected network behaviours demonstrated by a particular source IP address, they may not necessarily be malicious in nature.
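The threshold selection described above may be sketched as follows. The function names are illustrative; the 95th-percentile cut-off corresponds to treating roughly 5% of the data as anomalies, as in the embodiment.

```python
import numpy as np


def fit_threshold(train_errors, alpha=95.0):
    """Pick delta so that alpha% of training reconstruction errors fall below it."""
    return np.percentile(train_errors, alpha)


def is_anomaly(error, delta):
    """Flag a data point whose reconstruction error reaches the threshold."""
    return error >= delta


# Synthetic reconstruction errors: mostly small, a few large outliers.
train_errors = np.array([0.1] * 95 + [5.0] * 5)
delta = fit_threshold(train_errors, 95.0)
```

At test time, each (source IP, window) data point is scored once and compared against the fixed delta, so detection is a constant-time operation per data point.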
At step 125, an anomaly is analysed by module 225 by obtaining gradient information from the deep neural network. This is done by analysing the gradient that is contributed by each feature at the anomalous data point, i.e. how the VAE's objective function varies if a feature in the anomalous data point increases or decreases by a small amount. Intuitively, given the trained VAE and an anomalous data point, if the objective function (reconstruction error) changes significantly when a particular feature of the data point is varied by a small amount, then this feature at its current value is significantly abnormal, since it would likely perturb the VAE model (through optimization) to fit itself better.
Gradients, or more technically the derivatives of the variational lower bound, are computed at step 125 for each feature j of each data point i. The gradient ∂L/∂x_{ij} may be obtained as follows:

g_{ij} = ∂L(x_i)/∂x_{ij} (5)
Two applications of the gradient can be immediately derived. In an embodiment of the invention, even without employing ground truth labels, the flagged anomalies can be clustered based on their gradients ∇_x L(x) into groups that share similar behaviour, making it easier for analysts to investigate. In another embodiment of the invention, if labelled information on certain types of attacks were used to train the deep neural network model, gradient-based fingerprints associated with the attacks may be derived. These fingerprints may then be used to identify similar future attacks. The anomalies that are identified through the fingerprinting approach are more accurate as labelled information was indirectly used in a way similar to semi-supervised learning.
Variational Autoencoder (VAE)
The VAE model that may be utilized in accordance with embodiments of the invention is discussed in greater detail in this section. The VAE model may be used to represent the probability of a data point p(x) through a latent variable z ∈ R^{d_z}, where the probability of a data point x is:

p(x) = ∫ p(x|z) p(z) dz (6)

where the conditional distribution p(x|z) is parameterised by a neural network f_decode, and the probability p(z) is the prior probability of the latent random variable z, which is assumed to be a standard Gaussian distribution, i.e. p(z) = N(z; 0, I), where 0 is a vector of all 0s of length d_z, and I is an identity square matrix with the number of columns equal to d_z.
Given a data point x, the VAE is able to learn about the posterior distribution of the latent random variable, p(z|x), which is done via variational inference (VI). VI learns an approximated posterior, q(z|x), such that it minimizes the Kullback-Leibler (KL) divergence between the approximate distribution q(z|x) and the exact posterior distribution p(z|x), denoted as D_KL(q(z|x) ‖ p(z|x)). This is done by minimizing the following objective function:

L(x) = −E_{q(z|x)}[log p(x|z)] + D_KL(q(z|x) ‖ p(z)) (8)
where L(x) = D_KL(q(z|x) ‖ p(z|x)) − log p(x), so that minimizing L(x) over q is equivalent to minimizing the KL divergence to the exact posterior, since log p(x) does not depend on q. The approximate probability q(z|x) is assumed to be a Gaussian distribution with the mean being a column vector of length d_z, denoted as μ_z, and the covariance matrix being a square matrix whose diagonal is a column vector of length d_z, denoted as σ_z². The mean and the variance are modeled with a neural network whose input is x and outputs are (μ_z, σ_z²). We denote this neural network as f_encode.
It is noted that as q(z|x) and p(z) are both Gaussian distributions, the second term in Equation (8) has a closed-form expression:

D_KL(q(z|x) ‖ p(z)) = ½ Σ(σ_z² + μ_z² − 1 − log σ_z²) (9)

where Σ(a) denotes the sum of all elements in the vector a. The first term in Equation (8), i.e. −E_{q(z|x)}[log p(x|z)], can be replaced with the mean squared error (MSE) for a continuous random variable x, i.e. ‖x − f_decode(z)‖₂², where z is a sample drawn from q(z|x). Hence, the objective function may be derived as:

L(x) = ‖x − f_decode(z)‖₂² + ½ Σ(σ_z² + μ_z² − 1 − log σ_z²) (10)
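Equations (9) and (10) may be illustrated with a minimal NumPy sketch. Working with log σ_z² (rather than σ_z² directly) for numerical stability is an implementation choice of this sketch, not mandated by the description.

```python
import numpy as np


def kl_term(mu_z, log_var_z):
    """Closed-form KL between N(mu_z, diag(sigma_z^2)) and N(0, I), Equation (9)."""
    var = np.exp(log_var_z)
    return 0.5 * np.sum(var + mu_z ** 2 - 1.0 - log_var_z)


def vae_loss(x, x_recon, mu_z, log_var_z):
    """Equation (10): squared reconstruction error plus the KL regulariser."""
    mse = np.sum((x - x_recon) ** 2)
    return mse + kl_term(mu_z, log_var_z)
```

Note that when the approximate posterior equals the prior (μ_z = 0, σ_z² = 1), the KL term vanishes, so the loss reduces to the plain AE reconstruction error.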
It should be noted that the training of the VAE is iterative. At each iteration, a random mini-batch is drawn from the dataset, over which the average of L(x) is minimized. This is done using any available optimization package, such as Adam, which is available in the Tensorflow library. In particular, when this library is adopted, the method minimize can be called in the class tf.keras.optimizers.Adam with the average of Equation (10) over a mini-batch as the objective function, and the parameters of f_encode and f_decode as the trainable variables.
Anomaly Detection
Intuitively, following on from the section on VAE above, if L(x) in Equation (10) is large, it can be assumed that x is anomalous. For example, let δ be a threshold value such that L(x) < δ for α% of the dataset. For example, if α = 95, it means that 95% of the dataset have the loss value L(x) less than the threshold value δ. Then given a test data point x*, we can compute the loss L(x*), and classify x* as an anomaly if L(x*) ≥ δ.
Anomaly Explanation using Gradient Analysis
The gradient of the loss L(x) may be computed with respect to (w.r.t.) a test data point x*:

g* = ∇_x L(x)|_{x=x*} = (∂L(x)/∂x_1, ..., ∂L(x)/∂x_{d_x})ᵀ|_{x=x*} (11)

where L(x) is defined in Equation (10), the data point x* is represented as a d_x-dimensional vector x* = (x*_1, ..., x*_{d_x})ᵀ, and ∂L(x)/∂x_1 denotes the partial derivative of L(x) w.r.t. the first feature/dimension of x (these d_x dimensions are also referred to as the features). These gradients can be obtained with tf.gradients in the Tensorflow library by specifying Equation (10) as the ys and x* as the xs. As such,

x* ← x* − ε g* (12)

where g* is the column vector of gradients of length d_x defined in Equation (11). The i-th element of g* represents the adjustment to the i-th element of x*. Hence, we can make an anomaly x* less anomalous by adjusting x* ← x* − ε g* for a small positive number ε, i.e., a small adjustment in the direction of the negative gradient of the loss w.r.t. the data point. Therefore, we use this gradient of the loss as an explanation of the anomaly. For example:

• If ∂L(x)/∂x_1|_{x=x*} ≫ 0, then it means that the first dimension/feature of x*, i.e. x*_1, is too large to be a normal data point in the dataset. This is because increasing the first dimension/feature of x* by a little increases the loss significantly.

• If ∂L(x)/∂x_1|_{x=x*} ≪ 0, then it means that the first dimension/feature of x* is too small to be a normal data point in the dataset. This is because decreasing the first dimension/feature of x* by a little increases the loss significantly.

• If ∂L(x)/∂x_1|_{x=x*} ≈ 0, then it means that the first dimension/feature of x* does not explain why x* is anomalous. This is because changing the first dimension/feature of x* by a little has no effect on the loss.
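The gradient-based explanation above may be illustrated with a toy example. A simple squared distance to a "normal" point stands in for the VAE loss L(x), and central finite differences stand in for tf.gradients; both substitutions are assumptions of this sketch.

```python
import numpy as np

mu = np.array([10.0, 50.0])  # toy "normal" behaviour learned from data


def loss(x):
    """Stand-in for the VAE loss L(x): large far from normal behaviour."""
    return np.sum((x - mu) ** 2)


def gradient(x, h=1e-5):
    """Central finite-difference gradient of loss() w.r.t. each feature."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (loss(x + e) - loss(x - e)) / (2 * h)
    return g


x_star = np.array([30.0, 50.0])  # anomalous: first feature far above normal
g = gradient(x_star)             # g[0] >> 0 explains the anomaly; g[1] ~ 0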
Such an example is illustrated in Figure 5, where the dataset consists of 2-dimensional data points following a Gaussian distribution (plots 505 and 510). We assume that data points that are more than 2 standard deviations from the mean may be treated as anomalies (plotted as dots in plots 505 and 510). The bottom two plots 515 and 520 show the negative values of the gradients of both the VAE loss, L(x), and the KL loss, D_KL(q(z|x) ‖ p(z)), at different data points in the domain. The negative gradients of the VAE loss or the KL loss show the direction of the adjustment to these data points to make them normal (i.e., towards the mean/center in this case). As an example, reference is made to the data point plotted as the red star in plots 515 and 520 (near the location (30, 200)). Its negative gradient vector points to the left, so it means that the feature on the horizontal axis is too large and the feature on the vertical axis plays no role in explaining the anomaly.
The present invention is further explained by way of an example.
Example 1
Dataset and evaluation
The UGR16 dataset, which contains anonymized NetFlow traces captured from a real network of a Tier 3 ISP, is used herein as an example. The ISP provides cloud services and is used by many client companies of different sizes and markets. The UGR trace is a fairly recent and large scale data trace that contains real background traffic from a wide range of Internet users, rather than specific traffic patterns from synthetically generated data (e.g., DARPA'98 and DARPA'99, UNB ISCX 2012, UNSW-NB15, CTU13).
The UGR contains traffic for the whole day and traces over a 4-month period. Furthermore, the UGR's attack traffic data is a mixture of generated attacks, labelled real attacks, and botnet attacks from a controlled environment. Specifically, the labelled attacks comprise:

• Low-rate Denial-of-Service (DoS): TCP SYN packets are sent to victims, with packets of size 1280 bits at a rate of 100 packets/s to port 80. The rate of the attack is sufficiently low such that the normal operation of the network is not affected.

• Port scanning: a continuous SYN scan to common ports of victims. There are two kinds of scanning: one-to-one scan attack (Scan11) and four-to-four (Scan44).
• Botnet: A simulated botnet traffic obtained from the execution of the malware known as Neris. This data comes from the CTU13 trace.
• Spam: Peaks of SMTP traffic data that were determined as a spam campaign.
• Blacklist: IP addresses published in the public blacklists.
A total of five days of the UGR data were selected for use herein. Two Saturdays were used as training data while three other days (Friday and Sundays) were chosen as test days.
The statistics for the data are presented in Table I below whereby NetFlow records that are not given any labels make up the background data. Note that the data for March 18 is smaller as it does not comprise a full day’s data.
[Table I: statistics of the selected UGR16 data]
After the data was processed by feature extraction module 205, a training dataset of 5,990,295 data points was obtained. The data was then used by training module 215 (contained within module 210) to train a VAE model by stochastic optimization with 50 epochs and a mini-batch of size 300, whereby the weight decay was set to 0.01. As a result, a total of 1,957,711 data points on March 18, 2,954,983 data points on March 20, and 2,878,422 data points on July 31 were processed. It should be noted that a data point belongs to an attack type if the proportion of flows that are labelled with such attack within the 3-minute aggregation is greater than 50%. Figures 6(a)-(f) illustrate the distribution of the reconstruction errors that were generated for the training data. The ground truth labels were used to separate the anomalies from the background flows, which allows the anomalies to be examined to determine whether they behave differently from what was expected. As can be seen from Figures 6(a)-(f), there is some overlap in the reconstruction error plots with the background plots for spam (Figure 6(a)), botnet (Figure 6(b)), DoS (Figure 6(c)), and scanning activities (Figures 6(d)-(f)). A cut-off point may be used to distinguish between background noise and an attack such as spam, botnet, DoS, and scanning activities. However, from Figure 6(f), care should be taken for blacklist type anomalies as the behaviour for blacklist appears to be indistinguishable from the background traffic.
Comparing Anomaly Detection performance
As a reference, the performance of the VAE model for anomaly detection was compared against that of AE and also a Gaussian-based thresholding (GBT) approach. For a fairer comparison, the baseline AE was configured to share the same architecture as the VAE. Further, as illustrated in Figure 4, the AE was implemented using Keras, a high level open source neural network library, and the same TensorFlow backend was used for training the AE, i.e. by iteratively minimizing the reconstruction error (mean square error) using the stochastic optimizer Adam, with a mini-batch size chosen to be 256. Similar to the VAE, data points that have large reconstruction errors were then flagged as anomalies.
For the GBT, independent but non-identical Gaussian distribution models were fitted to the features to learn the standard behaviours of the data. Z-scores were then calculated for all features in the testing dataset and the product of the average, standard deviation, and the maximum of the Z-scores were used as final score for anomaly detection. The data points with scores that exceed a certain threshold were then considered as anomalies.
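The GBT scoring described above may be sketched as follows. The combination of statistics follows the description; taking absolute Z-scores before combining them is an assumption of this sketch.

```python
import numpy as np


def gbt_score(x, mu, sigma):
    """Gaussian-based thresholding score for one data point.

    Per-feature Z-scores are computed against the fitted Gaussians, and the
    final score is the product of their average, standard deviation and
    maximum, as described above. Larger scores suggest anomalies.
    """
    z = np.abs((x - mu) / sigma)
    return z.mean() * z.std() * z.max()


# Toy fitted model: each feature ~ N(0, 1).
mu = np.zeros(3)
sigma = np.ones(3)
typical = gbt_score(np.array([0.1, 0.2, 0.0]), mu, sigma)
outlier = gbt_score(np.array([5.0, 0.2, 0.0]), mu, sigma)
```

Data points whose score exceeds a chosen threshold would then be flagged, mirroring the reconstruction-error thresholding used for the AE and VAE.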
The performance of the three methods (VAE, AE and GBT) were evaluated using Receiver Operating Characteristic (ROC) curves, where the true positive rates were plotted against the false positive rates by varying the sensitivity or threshold for anomaly detection. A good performance occurs when the true positive values are high while the false positive values are low. The ROC results obtained for the training datasets are plotted at Figures 7(a)-(f) and results obtained for the testing datasets are plotted at Figures 8(a)-(d).
From the plots in these two figures, it can be seen that threshold-based approaches such as GBT work well for attacks that cause the volume of certain categories of traffic to increase significantly, such as spam and port scanning type attacks. However, such approaches do not appear to work as well for botnet and low rate spam type attacks in this example.
On the other hand, AE does not work as well for spam and low-rate DoS type attacks. For these two types of attacks, the AE model may be unable to differentiate spam from regular email traffic because of the high volume of data, and would be unable to detect the DoS attacks due to their low rate.
From these plots, it can be seen that the VAE model is the most robust and provides the best performance for detecting all attack modes. Table II below summarizes the individual and average area under the ROC curve (AUC) for the various types of attacks (except blacklist) and algorithms, and shows that VAE provides the best overall performance in terms of AUC.
[Table II: area under the ROC curve (AUC) for the various attacks and algorithms]
Identifying Anomalies and Using Gradient Fingerprints
Using the gradient fingerprinting module 225, the gradients for all the features for the various attacks may then be computed based on the VAE model. Figure 9 illustrates the gradients of the features for spam, Scan11, and Scan44, whereby the black bars reflect one standard error for the gradients (which are useful for assessing the significance of the gradients, i.e., whether they are due to noise or not). Features that are not significant are also presented as contrast.
From Figure 9, it is observed that only a small subset of the features exhibit large gradients. These features with the greatest absolute gradients provide an explanation for why these flows of an IP are detected as anomalies. For example, in the case of spam attacks 902 (which include spam emails), five features were found to have more positive gradients (higher than the learned normal) while four features were found to have much more negative gradients (lower than the learned normal). Critically, these combinations of gradients and features can be used as a fingerprint to identify or cluster similar attacks. For example, it can be observed from Figure 9 that Scan11 and Scan44 (plots 904 and 906 respectively) type attacks exhibit similar gradient fingerprints.
To further validate the analysis that the gradient fingerprints can be used to identify similar attacks, the ROC for various attacks detected are plotted in the following way. First, let L2 denote the Euclidean distance between the average gradient fingerprint obtained from data with labelled attacks and the gradient of the VAE's objective function w.r.t. each test data point. Similarly, L̄2 is defined as the same distance metric but computed based on normalized gradient vectors, i.e. g/‖g‖₂. The normalized distance is taken to be proportional to the cosine distance and may be computed as:

L̄2(a, b) = ‖ a/‖a‖₂ − b/‖b‖₂ ‖₂ = √(2 − 2 cos(a, b))

where ‖·‖₂ is the Euclidean norm.
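The normalized distance above may be checked with a short sketch; the function name is illustrative.

```python
import numpy as np


def normalized_l2(a, b):
    """Euclidean distance between unit-normalised gradient vectors.

    Equals sqrt(2 - 2*cos(a, b)), so it depends only on the angle between
    the two fingerprints, not on their magnitudes.
    """
    a_hat = a / np.linalg.norm(a)
    b_hat = b / np.linalg.norm(b)
    return np.linalg.norm(a_hat - b_hat)
```

Parallel fingerprints give distance 0, orthogonal ones give √2, and opposite ones give 2, so thresholding this distance amounts to thresholding the angle between a test gradient and a known attack fingerprint.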
The ROC is then produced by varying the threshold on the L2 or L̄2 distance. If the gradient fingerprint is to be a good measure of an attack, the ROC is expected to have high true positive values and low false positive values. The results of this analysis are plotted in Figure 10. It can be seen that with the use of L̄2, the gradient fingerprints learned were found to be indeed good representations of these attacks. In fact, these plots show that this approach is an improvement over the simple use of reconstruction errors for anomaly detection. The AUC results of this approach are listed in the last row of Table II.
Clustering of Anomalies
In other embodiments of the invention, the gradients generated by the VAE module 210 may be used for clustering. The idea is that if the clustering is effective, the identification of the types of attacks may be limited to a smaller number of clusters thereby increasing the overall identification speed of the system.
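The clustering idea may be sketched with a minimal k-means on synthetic fingerprints. Lloyd's algorithm with farthest-point initialisation is used here for a deterministic, self-contained illustration; the embodiment below uses a library implementation with k = 100 and random seeding.

```python
import numpy as np


def kmeans(points, k, iters=20):
    """Minimal k-means for grouping gradient fingerprints (illustrative only)."""
    # Farthest-point initialisation: start from the first point, then
    # repeatedly add the point farthest from the chosen centers.
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dists.argmax()])
    centers = np.array(centers)
    # Lloyd iterations: assign each point to its nearest center, then
    # move each center to the mean of its assigned points.
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers


# Two well-separated synthetic "fingerprint" clouds of 5 features each.
rng = np.random.default_rng(1)
cloud_a = rng.normal(0.0, 0.1, size=(20, 5))   # fingerprints of one attack type
cloud_b = rng.normal(5.0, 0.1, size=(20, 5))   # fingerprints of another
points = np.vstack([cloud_a, cloud_b])
labels, centers = kmeans(points, k=2)
```

If fingerprints of the same attack type cluster together, an analyst need only inspect a few clusters per attack rather than every flagged anomaly.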
In this embodiment, k-means clustering is performed with a random initial seed on the dataset (over the training set) with k = 100. Figure 11 illustrates the clusters that four of the attacks belong to when they are clustered accordingly. For example, 92.4% of the DoS attacks appear in only two clusters (c82 and c84), with the other 7.6% appearing in four other clusters. For spam attacks, 74.3% of them appeared in two clusters (c11 and c15), while 25.7% appeared in another 11 clusters. Hence, clustering is an effective tool for analysts to focus on a much smaller number of clusters for particular types of attacks.

Numerous other changes, substitutions, variations and modifications may be ascertained by one skilled in the art and it is intended that the present invention encompass all such changes, substitutions, variations and modifications as falling within the scope of the appended claims.

Claims

CLAIMS:
1. A method for detecting and analysing anomalies in network traffic, the method to be performed by a computer system comprising:
collecting network data from the network traffic;
extracting one or more features from the network data to form a dataset;
providing the dataset to a deep neural network model to train the deep neural network model;
detecting anomalies in the network traffic using the trained deep neural network model; and
computing gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
2. The method according to claim 1, wherein the step of extracting one or more features from the network data to form the dataset comprises:
grouping the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses,
wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
3. The method according to claim 1 or 2, wherein the one or more features or aggregated features extracted from the network data comprise one or more from the group comprising:
mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate;
entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and
proportion of ports used for common applications including Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
4. The method according to any one of claims 1 to 3, wherein the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
5. The method according to claim 4, wherein the training of the VAE model is performed using Adam as an optimisation algorithm.
6. The method according to any one of claims 4 to 5, wherein the step of using the trained deep neural network model to detect anomalies in the network traffic comprises:
obtaining a reconstruction error for each data point; and
comparing the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
7. The method according to any one of claims 1 to 6, further comprising:
clustering the computed gradients for the one or more features.
8. The method according to claim 7, further comprising:
identifying types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
9. A computer system for detecting and analysing anomalies in network traffic, the system comprising:
circuitry configured to collect network data from the network traffic;
circuitry configured to extract one or more features from the network data to form a dataset;
circuitry configured to provide the dataset to a deep neural network model to train the deep neural network model;
circuitry configured to detect anomalies in the network traffic using the trained deep neural network model; and
circuitry configured to compute gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
10. The computer system according to claim 9, wherein the circuitry configured to extract the one or more features from the network data to form the dataset further comprises:
circuitry configured to group the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses,
wherein the circuitry configured to extract one or more features from the network data to form a dataset comprises circuitry configured to extract one or more aggregated features from each of the plurality of sliding windows.
11. The computer system according to claim 9 or 10, wherein the one or more features or aggregated features extracted from the network data comprises one or more from the group comprising:
mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate;
entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and
proportion of ports used for common applications including Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
12. The computer system according to any one of claims 9 to 11, wherein the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
13. The computer system according to claim 12, wherein the training of the VAE model is performed using Adam as an optimisation algorithm.
14. The computer system according to any one of claims 12 to 13, wherein the circuitry configured to use the trained deep neural network model to detect anomalies in the network traffic comprises:
circuitry configured to obtain a reconstruction error for each data point; and
circuitry configured to compare the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
15. The computer system according to any one of claims 9 to 14, further comprising:
circuitry configured to cluster the computed gradients for the one or more features.
16. The computer system according to claim 15, further comprising:
circuitry configured to identify types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
PCT/SG2020/050033 2019-01-29 2020-01-22 System and method for network anomaly detection and analysis WO2020159439A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201900840P 2019-01-29

Publications (1)

Publication Number Publication Date
WO2020159439A1 true WO2020159439A1 (en) 2020-08-06

Family

ID=71842475

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2020/050033 WO2020159439A1 (en) 2019-01-29 2020-01-22 System and method for network anomaly detection and analysis

Country Status (1)

Country Link
WO (1) WO2020159439A1 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180337836A1 (en) * 2011-11-07 2018-11-22 Netflow Logic Corporation Method and system for confident anomaly detection in computer network traffic
US20190020669A1 (en) * 2017-07-11 2019-01-17 The Boeing Company Cyber security system with adaptive machine learning features
CN109274673A (en) * 2018-09-26 2019-01-25 广东工业大学 A kind of detection of exception of network traffic and defence method


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AMARASINGHE K., KENNEY KEVIN; MANIC MILOS: "Toward Explainable Deep Neural Network Based Anomaly Detection", 2018 11TH INTERNATIONAL CONFERENCE ON HUMAN SYSTEM INTERACTION (HSI, 6 July 2018 (2018-07-06), pages 1 - 7, XP033384140 *
AN JINWON; SUNGZOON CHO: "Variational Autoencoder based Anomaly Detection using Reconstruction Probability", 27 December 2015 (2015-12-27), XP0055688241, Retrieved from the Internet <URL:http://dm.snu.ac.kr/static/docs/TR/SNUDM-TR-2015-03.pdf> [retrieved on 20200318] *
CHALAPATHY RAGHAVENDRA; SANJAY CHAWLA: "Deep Learning for Anomaly Detection: A Survey", 24 January 2019 (2019-01-24), XP055672228, Retrieved from the Internet <URL:https://arxiv.org/pdf/1901.03407.pdf> [retrieved on 20200318] *
SHEN S., SHRUTI TOPLE; PRATEEK SAXENA: "AUROR: Defending Against Poisoning Attacks in Collaborative Deep Learning Systems", ACSAC '16: PROCEEDINGS OF THE 32ND ANNUAL CONFERENCE ON COMPUTER SECURITY APPLICATIONS DECEMBER 2016, 31 December 2016 (2016-12-31), pages 508 - 519, XP058306858 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11909482B2 (en) * 2020-08-18 2024-02-20 Qualcomm Incorporated Federated learning for client-specific neural network parameter generation for wireless communication
US20220060235A1 (en) * 2020-08-18 2022-02-24 Qualcomm Incorporated Federated learning for client-specific neural network parameter generation for wireless communication
CN112134847A (en) * 2020-08-26 2020-12-25 郑州轻工业大学 Attack detection method based on user flow behavior baseline
CN114124420A (en) * 2020-08-28 2022-03-01 哈尔滨理工大学 Network flow abnormity detection method based on deep neural network
JP7444271B2 (en) 2020-09-18 2024-03-06 日本電信電話株式会社 Learning devices, learning methods and learning programs
CN112598111A (en) * 2020-12-04 2021-04-02 光大科技有限公司 Abnormal data identification method and device
CN114615003A (en) * 2020-12-07 2022-06-10 ***通信有限公司研究院 Verification method and device for command and control C & C domain name and electronic equipment
US11570046B2 (en) 2020-12-17 2023-01-31 Nokia Solutions And Networks Oy Method and apparatus for anomaly detection in a network
CN112953924A (en) * 2021-02-04 2021-06-11 西安电子科技大学 Network abnormal flow detection method, system, storage medium, terminal and application
CN112953924B (en) * 2021-02-04 2022-10-21 西安电子科技大学 Network abnormal flow detection method, system, storage medium, terminal and application
CN113259332A (en) * 2021-04-29 2021-08-13 上海电力大学 Multi-type network flow abnormity detection method and system based on end-to-end
CN113255750B (en) * 2021-05-17 2022-11-08 安徽大学 VCC vehicle attack detection method based on deep learning
CN113255750A (en) * 2021-05-17 2021-08-13 安徽大学 VCC vehicle attack detection method based on deep learning
CN113965393A (en) * 2021-10-27 2022-01-21 浙江网安信创电子技术有限公司 Botnet detection method based on complex network and graph neural network
CN113965393B (en) * 2021-10-27 2023-08-01 浙江网安信创电子技术有限公司 Botnet detection method based on complex network and graph neural network
CN114448661A (en) * 2021-12-16 2022-05-06 北京邮电大学 Slow denial of service attack detection method and related equipment
CN114273981A (en) * 2022-03-04 2022-04-05 苏州古田自动化科技有限公司 Horizontal five-axis numerical control machining center with error compensation function
CN114301719A (en) * 2022-03-10 2022-04-08 中国人民解放军国防科技大学 Malicious update detection method and model based on variational self-encoder
CN114301719B (en) * 2022-03-10 2022-05-13 中国人民解放军国防科技大学 Malicious update detection method and system based on variational self-encoder
CN116436819A (en) * 2023-02-22 2023-07-14 深圳市昆腾电源科技有限公司 Parallel operation UPS communication abnormality detection method and device and parallel operation UPS system
CN116628554B (en) * 2023-05-31 2023-11-03 烟台大学 Industrial Internet data anomaly detection method, system and equipment
CN116628554A (en) * 2023-05-31 2023-08-22 烟台大学 Industrial Internet data anomaly detection method, system and equipment
CN116633809B (en) * 2023-06-26 2024-01-23 中国信息通信研究院 Detection method and system based on artificial intelligence
CN116633809A (en) * 2023-06-26 2023-08-22 中国信息通信研究院 Detection method and system based on artificial intelligence
CN116915512A (en) * 2023-09-14 2023-10-20 国网江苏省电力有限公司常州供电分公司 Method and device for detecting communication flow in power grid
CN116915512B (en) * 2023-09-14 2023-12-01 国网江苏省电力有限公司常州供电分公司 Method and device for detecting communication flow in power grid
CN117633665A (en) * 2024-01-26 2024-03-01 深圳市互盟科技股份有限公司 Network data monitoring method and system
CN117633665B (en) * 2024-01-26 2024-05-28 深圳市互盟科技股份有限公司 Network data monitoring method and system

Similar Documents

Publication Publication Date Title
WO2020159439A1 (en) System and method for network anomaly detection and analysis
Nguyen et al. Gee: A gradient-based explainable variational autoencoder for network anomaly detection
Idhammad et al. Detection system of HTTP DDoS attacks in a cloud environment based on information theoretic entropy and random forest
Janarthanan et al. Feature selection in UNSW-NB15 and KDDCUP'99 datasets
Cordero et al. Analyzing flow-based anomaly intrusion detection using replicator neural networks
Yang et al. TLS/SSL encrypted traffic classification with autoencoder and convolutional neural network
Catak et al. Distributed denial of service attack detection using autoencoder and deep neural networks
Ahmad et al. A comprehensive deep learning benchmark for IoT IDS
Tufan et al. Anomaly-based intrusion detection by machine learning: A case study on probing attacks to an institutional network
Abdelaty et al. Gadot: Gan-based adversarial training for robust ddos attack detection
Salahuddin et al. Chronos: Ddos attack detection using time-based autoencoder
Monshizadeh et al. Performance evaluation of a combined anomaly detection platform
Labonne Anomaly-based network intrusion detection using machine learning
Hossain et al. Ensuring network security with a robust intrusion detection system using ensemble-based machine learning
Bodström et al. State of the art literature review on network anomaly detection with deep learning
Atli Anomaly-based intrusion detection by modeling probability distributions of flow characteristics
Al-Fawa'reh et al. Detecting stealth-based attacks in large campus networks
Almarshdi et al. Hybrid Deep Learning Based Attack Detection for Imbalanced Data Classification.
Borisenko et al. Intrusion detection using multilayer perceptron and neural networks with long short-term memory
Babbar et al. Evaluation of deep learning models in its software-defined intrusion detection systems
de Campos et al. Network intrusion detection system using data mining
Koniki et al. An anomaly based network intrusion detection system using LSTM and GRU
Wei et al. Reconstruction-based LSTM-Autoencoder for Anomaly-based DDoS Attack Detection over Multivariate Time-Series Data
Shaik et al. capsAEUL: Slow http DoS attack detection using autoencoders through unsupervised learning
Gouveia et al. Deep Learning for Network Intrusion Detection: An Empirical Assessment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20749242

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20749242

Country of ref document: EP

Kind code of ref document: A1