WO2020159439A1 - System and method for network anomaly detection and analysis - Google Patents


Info

Publication number
WO2020159439A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
network
circuitry configured
data
gradients
Application number
PCT/SG2020/050033
Other languages
French (fr)
Inventor
Quoc Phong NGUYEN
Dinil Mon DIVAKARAN
Kar Wai LIM
Kian Hsiang LOW
Mun Choon Chan
Original Assignee
Singapore Telecommunications Limited
Application filed by Singapore Telecommunications Limited filed Critical Singapore Telecommunications Limited
Publication of WO2020159439A1


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22: Parsing or analysis of headers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/552: Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566: Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning

Definitions

  • This invention relates to a system and method for detecting and identifying network anomalies.
  • this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic. This is done using deep learning techniques and a gradient-based fingerprinting technique.
  • Anomalies in network traffic may occur due to various types of cyber-related attacks and threats, such as Distributed Denial-of-Service (DDoS) attacks (e.g., TCP SYN flooding, DNS amplification attacks, etc.), brute force attempts, botnet communications, spam campaigns, network/port scans, etc.
  • DDoS Distributed Denial-of-Service
  • Such network anomalies may also occur due to non-malicious causes, such as faults that occur in the network, misconfigurations, improper Border Gateway Protocol (BGP) policy updates, changes in user behaviours, etc.
  • BGP Border Gateway Protocol
  • IoT Internet of Things
  • the detection of network anomalies remains challenging for a number of reasons.
  • the characteristics of network data are dependent on a number of factors, such as the end-user’s behaviour, the customer’s types of business (e.g., banking, retail), the types of the applications, the location, the time of the day, and are expected to change with time.
  • Such diversity and dynamism limit the application of rule-based systems for the detection of anomalies in network traffic.
  • Network routers typically extract and export meta data in the form of, but not limited to, NetFlow records.
  • a typical NetFlow record represents meta data of a set of related packets, and is often generated from sampled packets.
  • TCP Transmission Control Protocol
  • TLS Transport Layer Security
  • anomaly detectors go beyond merely indicating the presence of anomalies; they also seek to provide other information such as the time of the anomaly, the anomaly type, and the corresponding set of suspicious flows. In general, the more information that is passed (along with the alerts) to the analysts, the faster the analysis of the threat and the decision process.
  • supervised machine learning approaches require large sets of data with ground truth for training. As the network’s characteristics and attacks evolve, models have to be retrained and the corresponding labelled datasets have to be made available. This requires costly and laborious manual efforts, and yet, given the size of traffic flowing through a backbone network, it is highly impractical to assume that all data records are correctly labelled. Further, supervised machine learning approaches are unlikely to detect unseen and zero-day attack traffic.
  • an unsupervised network anomaly detection method that is scalable in terms of both network data size and feature dimension is therefore needed. Such a method would be useful in detecting anomalies in large-scale networks without the need to rely on domain knowledge. Further, in addition to detecting anomalies, the method should be able to analyse the detected anomalies and ascertain the type of attacks by identifying the main features that cause the anomalies. For the above reasons, those skilled in the art are constantly striving to come up with a system and method that is capable of receiving and quantitatively unifying unstructured and/or unlabelled information security threat data from any source or system, whereby the processed information is then provided back to all the upstream systems to actively tune and improve the security postures of these systems in near-real-time.
  • a first advantage of embodiments of systems and methods in accordance with the invention is that based on gradient information obtained from the deep neural network model, flagged anomalies may be effectively and efficiently identified.
  • a second advantage of embodiments of systems and methods in accordance with the invention is that gradient information obtained from the deep neural network model may be employed to detect future anomalies that have yet to be labelled or identified.
  • a third advantage of embodiments of systems and methods in accordance with the invention is that the invention does not require labelled data as training data and as a result, is likely to detect zero-day type network attacks.
  • a fourth advantage of embodiments of systems and methods in accordance with the invention is that the invention provides an efficient way of explaining and/or identifying attacks from detected anomalous traffic, as not all anomalous traffic may comprise cyber-attacks on the network.
  • a method for detecting and analysing anomalies in network traffic comprising: collecting network data from the network traffic; extracting one or more features from the network data to form a dataset; providing the dataset to a deep neural network model to train the deep neural network model; detecting anomalies in the network traffic using the trained deep neural network model; and computing gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
  • the step of extracting one or more features from the network data to form the dataset comprises: grouping the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses, wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
  • the one or more features or aggregated features extracted from the network data comprise one or more from the group comprising: mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate; entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and proportion of ports used for common applications including WinRPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
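As an illustration of how such aggregated features might be computed, the following Python sketch (not code from the patent; the flow tuples are hypothetical) derives the mean and standard deviation of flow durations and the entropy of destination ports for one window of records:

```python
# Illustrative sketch: computing a few of the aggregated features named
# above -- mean/std of flow durations and the entropy of destination ports --
# for one window of NetFlow records from a single source IP.
import math
from collections import Counter

def mean_std(values):
    """Mean and (population) standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical flows: (duration_seconds, destination_port)
flows = [(0.5, 80), (1.5, 80), (1.0, 443), (1.0, 53)]
dur_mean, dur_std = mean_std([d for d, _ in flows])
port_entropy = entropy([p for _, p in flows])
```

A scan from one source, for instance, would show high destination-port entropy, while a flood would show low entropy and high packet rates.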
  • the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
  • AE Autoencoder
  • VAE Variational Autoencoder
  • the training of the VAE model is performed using Adam as an optimisation algorithm.
  • the step of using the trained deep neural network model to detect anomalies in the network traffic comprises: obtaining a reconstruction error for each data point; and comparing the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
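The detection rule in this step can be sketched as follows (a minimal illustration with hypothetical feature vectors and threshold; in the actual system the reconstruction would come from the trained AE/VAE model):

```python
# Minimal sketch of the detection rule: a data point is flagged as an
# anomaly when the mean-square reconstruction error between its observed
# features and their reconstruction exceeds a predetermined threshold.
def reconstruction_error(observed, reconstructed):
    return sum((o - r) ** 2 for o, r in zip(observed, reconstructed)) / len(observed)

def is_anomaly(observed, reconstructed, threshold):
    return reconstruction_error(observed, reconstructed) > threshold

# Hypothetical feature vectors: one well-reconstructed, one poorly.
normal_point = is_anomaly([0.2, 0.4], [0.21, 0.39], threshold=0.01)  # False
odd_point = is_anomaly([0.9, 0.1], [0.2, 0.5], threshold=0.01)       # True
```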
  • the method further comprises the step of clustering the computed gradients for the one or more features.
  • the method further comprises the step of identifying types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
  • a computer system for detecting and analysing anomalies in network traffic comprising circuitry configured to collect network data from the network traffic; circuitry configured to extract one or more features from the network data to form a dataset; circuitry configured to provide the dataset to a deep neural network model to train the deep neural network model; circuitry configured to detect anomalies in the network traffic using the trained deep neural network model; and circuitry configured to compute gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
  • the circuitry configured to extract the one or more features from the network data to form the dataset further comprises: circuitry configured to group the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses, wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
  • the one or more features or aggregated features extracted from the network data comprise one or more from the group comprising: mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate; entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and proportion of ports used for common applications including WinRPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
  • the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
  • AE Autoencoder
  • VAE Variational Autoencoder
  • the training of the VAE model is performed using Adam as an optimisation algorithm.
  • the circuitry configured to use the trained deep neural network model to detect anomalies in the network traffic comprises: circuitry configured to obtain a reconstruction error for each data point; and circuitry configured to compare the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
  • the computer system further comprises circuitry configured to cluster the computed gradients for the one or more features.
  • the computer system further comprises circuitry configured to identify types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
  • FIG. 1 illustrating a process or method for detecting and analysing anomalies in a network in accordance with embodiments of the invention
  • FIG. 2 illustrating a block diagram of modules that may be used to implement the method for detecting and analysing network anomalies in accordance with embodiments of the invention
  • FIG. 3 illustrating a block diagram representative of processing systems providing embodiments in accordance with embodiments of the invention
  • FIG. 4 illustrating an exemplary architecture of an Autoencoder (AE) and a Variational Autoencoder (VAE) in accordance with embodiments of the invention
  • FIG. 6 illustrating distributions of reconstruction errors of training data for various anomalies as obtained using the VAE model for Example 1;
  • FIG. 7 illustrating the receiver operating characteristic (ROC) curves for VAE, AE and Gaussian-based thresholding (GBT) in detecting anomalies in the training dataset for Example 1;
  • FIG. 8 illustrating the ROC curves for VAE, AE and GBT in detecting anomalies in the test dataset for Example 1;
  • FIG. 9 illustrating the normalized gradients of spam, scan11 and scan44 based on selected features using the VAE model for Example 1;
  • This invention relates to a system and method for detecting and identifying network anomalies.
  • this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic using deep learning techniques and a gradient-based fingerprinting technique.
  • this invention relates to a system and method for processing a network’s meta data, such as NetFlow records obtained from network monitoring activities, using a deep neural network model to detect network anomalies whereby the anomalies are subsequently analysed based on gradient information obtained from the deep neural network model.
  • modules may be implemented as circuits, logic chips or any sort of discrete component. Still further, one skilled in the art will also recognize that a module may be implemented in software which may then be executed by a variety of processors. In embodiments of the invention, a module may also comprise computer instructions or executable code that may instruct a computer processor to carry out a sequence of events based on instructions received. The choice of the implementation of the modules is left as a design choice to a person skilled in the art and does not limit the scope of this invention in any way.
  • FIG. 1 sets out an exemplary flowchart of process 100 for detecting, analysing and identifying a network anomaly in accordance with embodiments of the invention.
  • Process 100 comprises the following steps:
  • Step 105 collecting network data
  • Step 110 extracting one or more features from the network data that is collected in step 105 to form a training dataset
  • Step 115 feeding the training dataset from step 110 into a deep neural network model, thereby training the deep neural network model to learn normal behaviour of the network;
  • Step 120 using the trained deep neural network model to detect an anomaly
  • Step 125 obtaining gradient information of the anomaly from the deep neural network to analyse and identify the anomaly.
  • system 200 comprises feature extraction module 205, gradient fingerprinting module 225 and VAE module 210 (which in turn comprises training module 215 and anomaly detection module 220).
  • system 200 may comprise a computer system.
  • FIG. 3 illustrates a block diagram representative of components of processing system 300 that may be provided within modules 205, 210, 215, 220 and 225 for implementing embodiments in accordance with embodiments of the invention.
  • each of modules 205, 210, 215, 220 and 225 may comprise controller 301 and user interface 302.
  • User interface 302 is arranged to enable manual interactions between a user and each of these modules as required and for this purpose includes the input/output components required for the user to enter instructions to provide updates to each of these modules.
  • components of user interface 302 may vary from embodiment to embodiment but will typically include one or more of display 340, keyboard 335 and track-pad 336.
  • Controller 301 is in data communication with user interface 302 via bus 315 and includes memory 320, processor 305 mounted on a circuit board that processes instructions and data for performing the method of this embodiment, an operating system 306, an input/output (I/O) interface 330 for communicating with user interface 302 and a communications interface, in this embodiment in the form of a network card 350.
  • Network card 350 may, for example, be utilized to send data from these modules via a wired or wireless network to other processing devices or to receive data via the wired or wireless network.
  • Wireless networks that may be utilized by network card 350 include, but are not limited to, Wireless-Fidelity (Wi-Fi), Bluetooth, Near Field Communication (NFC), cellular networks, satellite networks, telecommunication networks, Wide Area Networks (WAN), etc.
  • Memory 320 and operating system 306 are in data communication with CPU 305 via bus 310.
  • the memory components include both volatile and non-volatile memory and more than one of each type of memory, including Random Access Memory (RAM) 320, Read Only Memory (ROM) 325 and a mass storage device 345, the last comprising one or more solid-state drives (SSDs).
  • RAM Random Access Memory
  • ROM Read Only Memory
  • mass storage device 345 may comprise one or more solid-state drives (SSDs).
  • SSDs solid-state drives
  • Memory 320 also includes secure storage 346 for securely storing secret keys, or private keys.
  • the memory components described above comprise non-transitory computer-readable media and shall be taken to comprise all computer-readable media except for a transitory, propagating signal.
  • the instructions are stored as program code in the memory components but can also be hardwired.
  • Memory 320 may include a kernel and/or programming modules such as a software application that may be stored in either volatile or non-volatile memory.
  • processor 305 may be provided by any suitable logic circuitry for receiving inputs, processing them in accordance with instructions stored in memory and generating outputs (for example to the memory components or on display 340).
  • processor 305 may be a single core or multi-core processor with memory addressable space.
  • processor 305 may be multi-core, comprising, for example, an 8-core CPU.
  • step 105 network data are constantly collected by network routers within a monitored network.
  • step 105 may be performed by feature extraction module 205 as illustrated in Figure 2.
  • the network data collected by module 205 comprises NetFlow records. It should be appreciated that other forms of aggregate data that are collectable by routers in an Internet Service Provider (ISP) network may be collected at step 105 by module 205 and that the type of data collected at step 105 is not specifically limited only to NetFlow type records.
  • ISP Internet Service Provider
  • a NetFlow record comprises a set of packets that has the same five-tuple of source and destination IP addresses, source and destination ports, and protocol.
  • some of the important fields collected in the NetFlow records include, but are not limited to, the start time of the flow (based on the first sampled packet), duration, number of packets and Transmission Control Protocol (TCP) flag.
  • TCP Transmission Control Protocol
  • one or more features may be extracted from the NetFlow records collected at step 105 to form a training dataset.
  • the NetFlow records are first grouped into sliding windows of a predetermined duration based on source IP addresses before one or more aggregated features are extracted from each window and statistically analysed.
  • the NetFlow records are grouped into 3-minute long sliding windows based on the source IP addresses. This means that each data point in the training dataset corresponds to network statistics from a single source IP address within a 3-minute period. This enables identification of an offending IP address that is responsible for an anomaly and the time window to which the anomaly belongs.
  • the period of 3 minutes is chosen as a balance between the practicality and quality of statistics from the aggregated features, whereby the statistics will be insignificant if the window period is too short, and the capability for real time analysis is lost if the window period is too long.
  • the duration of a 3-minute sliding window is not meant to be limiting to the present invention and other durations may be used without departing from this invention.
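The grouping step might be sketched as below; for simplicity this uses non-overlapping 3-minute buckets rather than true sliding windows, and the record format (a dictionary with `src_ip` and `start_time` keys) is a hypothetical stand-in for a parsed NetFlow record:

```python
# Simplified sketch of the grouping step: NetFlow records keyed by
# (source IP, window index), using non-overlapping 180-second windows
# for simplicity instead of true sliding windows.
from collections import defaultdict

WINDOW_SECONDS = 180

def group_by_source_and_window(records):
    """records: iterable of dicts with 'src_ip' and 'start_time' (epoch seconds)."""
    groups = defaultdict(list)
    for rec in records:
        window_idx = int(rec["start_time"] // WINDOW_SECONDS)
        groups[(rec["src_ip"], window_idx)].append(rec)
    return groups

records = [
    {"src_ip": "10.0.0.1", "start_time": 10.0},
    {"src_ip": "10.0.0.1", "start_time": 170.0},
    {"src_ip": "10.0.0.1", "start_time": 200.0},  # falls in the next window
    {"src_ip": "10.0.0.2", "start_time": 20.0},
]
groups = group_by_source_and_window(records)
```

Each resulting group then yields one data point of aggregated features.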
  • 53 aggregated features are extracted from the NetFlow records to form the training dataset. These include the mean and standard deviation of flow durations, numbers of packets and bytes, packet and byte rates, entropies of protocol types, IP addresses, ports and TCP flags, and the proportions of ports used for common applications.
  • data points that contain too few flows may be removed from the training dataset to reduce noise.
  • the statistics may be further normalised into a Z-score or scaled to a value between 0 and 1.
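Both normalisation options can be sketched as follows (illustrative only; applied per feature column):

```python
# Sketch of the two normalisation options mentioned above: a Z-score
# (zero mean, unit variance) or min-max scaling to [0, 1].
import math

def z_score(values):
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

feature_column = [2.0, 4.0, 6.0]  # hypothetical raw feature values
```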
  • the training dataset from step 110 is fed into a deep neural network model found in training module 215 to learn the normal behaviour of the network.
  • Various unsupervised deep neural network models may be used. These models do not require labelled information (i.e. normal or anomalous) for training and instead exploit the fact that anomalous behaviours tend to differ greatly from the standard or normal behaviour of the network.
  • the unsupervised deep neural network model may comprise an Autoencoder (AE) model or a Variational Autoencoder (VAE) model that is a probabilistic generalisation of the AE model.
  • AE Autoencoder
  • VAE Variational Autoencoder
  • an exemplary architecture 400 for an AE or VAE model is illustrated in Figure 4.
  • the AE or VAE model is made of three main layers: an input layer 401 to take in the features; a latent representation layer 402 of the features; and an output layer 403 that is a reconstruction of the features.
  • the AE or VAE model comprises of two parts, an encoder 405 and a decoder 410.
  • the function of encoder 405 is to map a feature from the input layer 401 into its latent representation in the latent representation layer 402 while decoder 410 derives an output in the output layer 403 that is a reconstruction of the feature from the input layer 401 based on the latent representation.
  • the encoder 405 may be considered as part of a deep neural network in the sense that information from the input is passed through several mappings (and hidden layers) similar to the deep architecture in a supervised deep learning model; and likewise for the decoder 410.
  • the latent representation layer 402 may be set to a size of 100.
  • the encoder 405 and the decoder 410 may each have three hidden layers with sizes 512, 512, and 1,024 respectively, as illustrated in Figure 4.
  • nodes that are shaded represent the observed data (used as both inputs and outputs), while the unshaded nodes represent unobserved latent variables that correspond to the hidden layers.
  • the exact sizes or dimensions of the layers are shown above the nodes in this illustration.
  • the links between the layers show how the values of the next layer can be computed. Commonly, for an AE model, the value of one hidden layer can be computed as h_{l+1} = g_l(W_l h_l + b_l) (equation (1)), where W_l is a weight matrix and b_l a bias vector for layer l.
  • the function g_l is known as the activation function that transforms the computation in a non-linear way and allows complex relationships to be learned.
  • the learning of the parameters is generally achieved by minimizing the reconstruction errors (e.g. mean square errors) via backpropagation with random initialization, and can be optimized with a variety of optimizers such as the stochastic gradient descent optimizer.
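A numerical sketch of the per-layer computation in equation (1), with arbitrary illustrative weights and a ReLU activation (the actual layer sizes and activations of the model are as described above):

```python
# Numerical sketch of equation (1): the value of one hidden layer is an
# affine map of the previous layer passed through a non-linear activation.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer(h_prev, W, b, activation=relu):
    # h_next = g(W @ h_prev + b), as in equation (1)
    return activation(W @ h_prev + b)

# Arbitrary illustrative weights and bias for a 2 -> 2 layer.
W = np.array([[1.0, -1.0], [0.5, 0.5]])
b = np.array([0.0, -0.25])
h = layer(np.array([1.0, 2.0]), W, b)
```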
  • the detailed optimization steps are omitted here for brevity; the person skilled in the art may obtain more details from the publication by Y. Bengio titled "Practical recommendations for gradient-based training of deep architectures", Neural Networks: Tricks of the Trade, Springer, 2012, pages 437-478, which is incorporated herein in its entirety for details on optimisation.
  • the AE model may be viewed as a deterministic model that maps a set of inputs into their reconstructed outputs.
  • the VAE model is a generative model that treats the latent representation layer as random variables that are conditional on the inputs. While the encoder and decoder in the VAE follows the same computational model as the AE (i.e. as set out in equation (1)), the encoding process is instead used to compute the parameters for the conditional distributions of the latent representation layer. The parameters that are obtained from model training are then used to generate or sample the latent representation for decoding. A detailed explanation on the working of the VAE is set out in the later sections.
  • conditional distributions in the VAE model are generally assumed to be Gaussian for real-valued nodes.
  • if z_l denotes the value of the latent representation layer, then it can be written as z_l ~ N(μ_l, diag(σ_l²)) (equation (2)), where diag(·) denotes a function that transforms a vector into a diagonal matrix, and μ_l and σ_l² are the mean and variance for the conditional Gaussian distribution obtained from the output of the encoder.
  • the probabilistic nature of the VAE also means that the usual learning algorithm on a standard objective function (e.g. mean square error) may not be used to train the model. Instead, a class of approximate statistical inference methods known as Variational Bayes is used. The detailed steps are omitted here for brevity; the person skilled in the art may obtain more details from the publication by D.P. Kingma and M. Welling titled "Auto-encoding variational Bayes", arXiv preprint arXiv:1312.6114, 2013, which is incorporated herein in its entirety.
  • instead of the standard objective, an alternative objective function known as the variational lower bound is maximised.
  • stochastic sampling is used for approximation.
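The stochastic sampling step can be sketched via the reparameterisation trick, a standard VAE technique: draw z = μ + σ·ε with ε ~ N(0, I), so that gradients can flow through the sample. The sizes below follow the 100-unit latent layer mentioned earlier; the code is illustrative, not the patent's:

```python
# Sketch of the sampling step: given the encoder's mean and log-variance for
# the latent layer, draw z = mu + sigma * eps with eps ~ N(0, I)
# (the "reparameterisation trick").
import numpy as np

def sample_latent(mu, log_var, rng):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros(100)       # latent layer of size 100, as in the text
log_var = np.zeros(100)  # unit variance
z = sample_latent(mu, log_var, rng)
```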
  • the architecture of a VAE is similar to that of an AE as illustrated in Figure 4.
  • the training algorithm of the VAE model may be implemented using TensorFlow.
  • the parameters in the VAE are randomly initialized.
  • a forward pass on the encoder is subsequently performed by computing the distribution of the latent representation layer via equation (2).
  • several samples can be generated from the Gaussian distribution which are used to compute the variational lower bound, which consists of a Kullback-Leibler (KL) divergence term and an expectation term: -KL(q(z|x) ‖ p(z)) + E_{q(z|x)}[log p(x|z)].
  • KL Kullback-Leibler
  • z is the latent representation of the input features x.
  • the distributions p(·) correspond to the Gaussian prior and conditional distribution of the VAE model, while q(·) is a variational approximation of p(·), generally chosen to be Gaussian as well (such functions are commonly used in auto-encoding variational Bayes). Fortunately, this objective function can be maximized with stochastic optimization techniques since the gradients are readily available via automatic differentiation.
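For diagonal Gaussians, the KL term in the lower bound has a well-known closed form; a sketch (illustrative only, assuming a log-variance parameterisation of the encoder output):

```python
# KL divergence between the diagonal-Gaussian approximation
# q(z|x) = N(mu, diag(sigma^2)) and the standard-normal prior p(z) = N(0, I):
#   KL(q || p) = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
import numpy as np

def kl_diag_gaussian_vs_standard_normal(mu, log_var):
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# When q already equals the prior, the divergence is zero:
kl_zero = kl_diag_gaussian_vs_standard_normal(np.zeros(3), np.zeros(3))
```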
  • an optimisation algorithm that enables training to be performed in mini-batches, outlined in the publication by D.P. Kingma and J. Ba titled "Adam: A method for stochastic optimization", Third International Conference on Learning Representations, 2015 (herein incorporated in its entirety), may be used. This generally allows for real-time training by choosing a small mini-batch size and discarding the data after one epoch. Further, label information is not required during the training of the model.
  • the deep neural network model is then used for anomaly detection.
  • an IP address and its time window may be recognised by module 220 as abnormal when a reconstruction error of its input features is high.
  • the reconstruction error comprises the mean square difference between the observed features and the expectation of their reconstruction.
  • a high reconstruction error is generally observed when the network behaviour differs greatly from the normal behaviour that was learned by the VAE.
  • a threshold value may be selected such that a small percentage (say 5%) of the data is treated as anomalies. Otherwise, labelled information may be used to select the threshold to maximize the detection rate while keeping the false positive rate small. Note that although the anomalies correspond to unusual or unexpected network behaviours demonstrated by a particular source IP address, they may not necessarily be malicious in nature.
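The percentile rule for choosing the threshold can be sketched as follows (hypothetical error values; illustrative only):

```python
# Sketch of the percentile rule: pick the threshold so that a chosen small
# fraction (here 5%) of the training data is flagged as anomalous.
def percentile_threshold(errors, fraction=0.05):
    """Return the value exceeded by roughly `fraction` of `errors`."""
    ordered = sorted(errors)
    cut = round(len(ordered) * (1.0 - fraction))
    return ordered[min(cut, len(ordered) - 1)]

errors = list(range(100))                      # hypothetical reconstruction errors
threshold = percentile_threshold(errors, 0.05)
flagged = [e for e in errors if e > threshold]
```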
  • an anomaly is analysed by module 225 by obtaining gradient information from the deep neural network. This is done by analysing the gradient that is contributed by each feature at the anomalous data point, i.e. how the VAE’s objective function varies if a feature in the anomaly data point increases or decreases by a small amount. Intuitively, given the trained VAE and an anomaly data point, if the function (reconstruction error) changes quite a lot when a particular feature of the anomaly data point is varied by a small amount, then this feature at its current value is significantly abnormal, since it would likely perturb the VAE model (through optimization) to fit itself better.
  • Gradients are computed at step 125 for each feature from each data point i.
  • the gradient may be computed via automatic differentiation on the trained model, for example using the gradient facilities of the TensorFlow library.
  • the flagged anomalies can be clustered based on their gradients into groups that share similar behaviour, making it easier for analysts to investigate and identify the underlying attacks.
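The gradient fingerprint idea can be illustrated with a stand-in loss and a finite-difference gradient; a real system would use automatic differentiation on the trained VAE objective, as the text notes, and everything below (the reference point, loss, and values) is illustrative:

```python
# Illustrative sketch of the gradient fingerprint: the sensitivity of a loss
# L(x) to each input feature, approximated here by central finite differences
# on a stand-in quadratic loss.
import numpy as np

def numerical_gradient(loss, x, eps=1e-5):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (loss(x + step) - loss(x - step)) / (2 * eps)
    return grad

# Stand-in loss: squared distance to a hypothetical "normal" reference point.
normal = np.array([0.0, 0.0, 0.0])
loss = lambda x: np.sum((x - normal) ** 2)

anomaly = np.array([3.0, 0.0, 0.1])
fingerprint = numerical_gradient(loss, anomaly)
# The first feature dominates the fingerprint, marking it as the feature
# most responsible for the anomaly.
```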
  • VAE Variational Autoencoder
  • the VAE model that may be utilized in accordance with embodiments of the invention is discussed in greater detail in this section.
  • the VAE model may be used to represent the probability of a data point p(x) through a latent variable z, where the probability of a data point is given by p(x) = ∫ p(x|z) p(z) dz.
  • p(z) is the prior probability of the latent random variable z, which is assumed to be a standard Gaussian distribution, i.e. p(z) = N(z; 0, I), where 0 is a vector of all zeros of length d_z and I is an identity matrix.
  • given a data point x, the VAE would be able to learn about the posterior distribution of the latent random variable z, which is done via variational inference (VI). VI learns an approximated posterior such that it minimizes the Kullback-Leibler (KL) divergence between the approximate distribution q(z|x) and the true posterior p(z|x).
  • the mean and the variance of the approximate posterior are modeled with a neural network whose input is x and whose outputs are the mean μ and the variance σ². We denote this neural network as f_encode.
  • the KL divergence term in Equation (8) has a closed-form expression, since both distributions are Gaussian;
  • the log-likelihood term in Equation (8) can be replaced with the mean squared error (MSE) for real-valued data; and
  • the objective function may then be derived as Equation (10), i.e. the sum of the reconstruction error (MSE) and the KL divergence term.
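For reference, the standard VAE objective consistent with the description above (reconstruction MSE plus a closed-form Gaussian KL term) takes the following form; the exact constants and signs in the patent's Equation (10) may differ:

```latex
\mathcal{L}(x) \;=\;
\underbrace{\lVert x - \hat{x} \rVert^{2}}_{\text{reconstruction (MSE)}}
\;+\;
\underbrace{-\tfrac{1}{2}\sum_{i=1}^{d_z}\left(1 + \log\sigma_i^{2} - \mu_i^{2} - \sigma_i^{2}\right)}_{\mathrm{KL}\left(\mathcal{N}(\mu,\sigma^{2} I)\,\|\,\mathcal{N}(0, I)\right)}
```

where (μ, σ²) = f_encode(x), z is sampled from N(μ, σ²I), and x̂ = f_decode(z).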
  • Training of the VAE is iterative. At each iteration, a random mini-batch is drawn from the dataset, over which the average of L(x) is minimized. This is done using any available optimization package, such as Adam, which is available in the Tensorflow library. In particular, when this library is adopted, the method minimize can be called on the class tf.keras.optimizers.Adam with the average of Equation (10) over a mini-batch as the objective function, and the parameters of f_encode and f_decode as the trainable variables.
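The mini-batch training loop described above can be sketched as follows. This is a simplified stand-in in NumPy: a plain linear autoencoder trained with a hand-rolled Adam update, rather than the full VAE objective and the tf.keras.optimizers.Adam call; all names are illustrative.

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter array."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

def train(X, dim_z=2, epochs=50, batch_size=300, seed=0):
    """Iteratively minimize the average reconstruction error over random mini-batches."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=0.1, size=(d, dim_z))   # "encoder" weights
    V = rng.normal(scale=0.1, size=(dim_z, d))   # "decoder" weights
    sW = (np.zeros_like(W), np.zeros_like(W), 0)
    sV = (np.zeros_like(V), np.zeros_like(V), 0)
    for _ in range(epochs):
        idx = rng.permutation(len(X))            # draw random mini-batches
        for start in range(0, len(X), batch_size):
            xb = X[idx[start:start + batch_size]]
            z = xb @ W                           # encode
            xr = z @ V                           # decode (reconstruction)
            err = (xr - xb) / len(xb)            # gradient of MSE w.r.t. xr, up to a constant
            gV = z.T @ err
            gW = xb.T @ (err @ V.T)
            W, sW = adam_step(W, gW, sW)
            V, sV = adam_step(V, gV, sV)
    return W, V
```

In the patent's setting, the loss would be the average of Equation (10) over the mini-batch and the trainable variables the parameters of f_encode and f_decode.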
  • if L(x) in Equation (10) is large, it can be assumed that x is anomalous.
  • let δ be a threshold value such that L(x) exceeds δ for only a small fraction (e.g. 5%) of the dataset.
  • the gradient of the loss may be computed with respect to (w.r.t.) the anomalous data point x*, i.e. ∇L(x*), where L(x) is defined in Equation (10).
  • the data point x* is represented as a d_x-dimensional vector, and ∂L(x*)/∂x*_1 denotes the partial derivative of L(x*) w.r.t. the first feature/dimension of x* (these d_x dimensions are also referred to as the features).
  • the gradients may be computed using tf.gradients in the Tensorflow library by specifying Equation (10) as the ys and x* as the xs.
  • the i-th element of the negative gradient represents the adjustment to the i-th element of x*. Hence, we can make an anomaly x* less anomalous by adjusting its features in the direction of the negative gradient.
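The per-feature gradient analysis can be illustrated with central finite differences, as a stand-in for tf.gradients; the linear model M below is a hypothetical substitute for the trained VAE, and the loss is a plain reconstruction error:

```python
import numpy as np

def loss(x, M):
    """Reconstruction error of a single point under a fixed linear model M."""
    return float(np.sum((x @ M - x) ** 2))

def gradient_fingerprint(x, M, h=1e-5):
    """How much the loss changes when each feature of x is nudged up or down."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (loss(x + e, M) - loss(x - e, M)) / (2 * h)
    return g
```

With M reconstructing every feature except the last, an anomalous value in that feature dominates the fingerprint, mirroring the intuition that a feature whose small perturbation changes the loss significantly is the abnormal one.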
  • Such an example is illustrated in Figure 5, where the dataset consists of 2-dimensional data points following a Gaussian distribution (plots 505 and 510). We assume that data points that are more than 2 standard deviations from the mean may be treated as anomalies (plotted as dots in plots 505 and 510).
  • the bottom 2 plots, 515 and 520, show the negative values of the gradients of both the VAE loss, L(x), and the KL loss, at different data points.
  • the negative gradients of the VAE loss or the KL loss show the direction of the adjustment to these data points to make them normal (i.e., towards the mean/center in this case).
  • the UGR16 dataset, which contains anonymized NetFlow traces captured from a real network of a Tier 3 ISP, is used herein as an example.
  • the ISP provides cloud services and is used by many client companies of different sizes and markets.
  • the UGR trace is a fairly recent and large scale data trace that contains real background traffic from a wide range of Internet users, rather than specific traffic patterns from synthetically generated data (e.g., DARPA'98 and DARPA'99, UNB ISCX 2012, UNSW-NB15, CTU13).
  • the UGR contains traffic for the whole day, with traces spanning a 4-month period. Furthermore, the UGR's attack traffic data is a mixture of generated attacks, labelled real attacks, and botnet attacks from a controlled environment. Specifically, the labelled attacks comprise: • Low-rate Denial-of-Service (DoS): TCP SYN packets are sent to port 80 of the victims, with a packet size of 1280 bits and a rate of 100 packets/s. The rate of the attack is sufficiently low that the normal operation of the network is not affected.
  • Port scanning: a continuous SYN scan of common ports of the victims. There are two kinds of scanning: a one-to-one scan attack (Scan11) and a four-to-four scan attack (Scan44).
  • Botnet: simulated botnet traffic obtained from the execution of the malware known as Neris. This data comes from the CTU13 trace.
  • Blacklist: IP addresses published in public blacklists.
  • a total of five days of the UGR data were selected for use herein. Two Saturdays were used as training data, while three other days (a Friday and two Sundays) were chosen as test days.
  • a training dataset of 5,990,295 data points was obtained.
  • the data was then used by training module 215 (contained within module 210) to train a VAE model by stochastic optimization with 50 epochs and a mini-batch of size 300, whereby the weight decay was set to 0.01.
  • a total of 1,957,711 data points on March 18, 2,954,983 data points on March 20, and 2,878,422 data points on July 31 were processed. It should be noted that a data point is deemed to belong to an attack type if the proportion of flows labelled with that attack within the 3-minute aggregation is greater than 50%.
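The 50% rule in the last sentence amounts to a simple proportion check, sketched below with illustrative names (per-flow labels within one 3-minute aggregation are assumed to be available):

```python
def label_window(flow_labels, attack, threshold=0.5):
    """A data point (one source IP in one 3-minute window) is assigned an
    attack type when more than `threshold` of its flows carry that label."""
    flow_labels = list(flow_labels)
    frac = sum(1 for lbl in flow_labels if lbl == attack) / len(flow_labels)
    return frac > threshold
```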
  • Figures 6(a)-(f) illustrate the distribution of the reconstruction errors that were generated for the training data.
  • the ground truth labels were used to separate the anomalies from the background flows, which allows the anomalies to be examined to determine whether they behave differently from what was expected.
  • as shown in Figures 6(a)-(f), there is some overlap in the reconstruction error plots with the background plots for spam (Figure 6(a)), botnet (Figure 6(b)), DoS (Figure 6(c)), and scanning activities (Figures 6(d)-(f)).
  • a cut-off point may be used to distinguish between background noise versus an attack such as spam, botnet, DoS, and scanning activities.
  • care should be taken for blacklist type anomalies as the behaviour for blacklist appears to be indistinguishable from the background traffic.
  • the performance of the VAE model for anomaly detection was compared against that of AE and also a Gaussian-based thresholding (GBT) approach.
  • the baseline AE was configured to share the same architecture as the VAE.
  • the AE was implemented using Keras, a high level open source neural network library and the same TensorFlow backend was used for training the AE, i.e. by iteratively minimizing the reconstruction error (mean square error) using the stochastic optimizer Adam, with a mini-batch size chosen to be 256. Similar to the VAE, data points that have large reconstruction errors were then flagged as anomalies.
  • threshold-based approaches such as GBT work well for attacks that cause the volume of certain categories of traffic to increase significantly, such as spam and port scanning type attacks.
  • however, GBT does not appear to work as well for botnet and low-rate DoS type attacks in this example.
  • AE does not work as well for spam and low rate DoS type attacks.
  • the AE model may be unable to differentiate spam from regular email traffic because of the high volume of such traffic, and may be unable to detect low-rate DoS attacks due to the low volume of data.
  • the VAE model is the most robust and provides the best performance for detecting all attack modes.
  • Table II summarizes the individual and average area under the ROC curve (AUC) for the various types of attacks (except blacklist) and algorithms, and shows that VAE provides the best overall performance in terms of AUC.
  • the gradients for all the features for the various attacks may then be computed based on the VAE model.
  • Figure 9 illustrates the gradients of the features for spam, Scan11, and Scan44, whereby the black bars reflect one standard error for the gradients (which is useful for assessing the significance of the gradients, i.e., whether they are due to noise or not). Features that are not significant are also presented as contrast.
  • the ROC curves for the various attacks detected are plotted in the following way.
  • let L2 denote the Euclidean distance between the average gradient fingerprint obtained from data with labelled attacks and the gradient of the VAE's objective function w.r.t. each test data point;
  • let the normalized L2 distance denote the same distance metric computed on normalized gradient vectors, i.e. with each gradient vector scaled to unit length. This normalized distance is proportional to the cosine distance, since ||u/||u|| - v/||v||||² = 2(1 - cos(u, v)), where ||·|| denotes the Euclidean norm;
  • the ROC is then produced by varying the threshold on the L2 or normalized distance.
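The two distances can be sketched as below; the cosine relation follows from the standard identity for unit vectors (the exact formula in the source is not fully legible, so this is a hedged reconstruction, and the names are illustrative):

```python
import numpy as np

def l2_distance(g, fp):
    """Euclidean distance between a test gradient g and an attack fingerprint fp."""
    return float(np.linalg.norm(g - fp))

def normalized_distance(g, fp):
    """Distance between unit-length gradient vectors; proportional to cosine
    distance, since ||u/||u|| - v/||v||||^2 = 2 * (1 - cos(u, v))."""
    u = g / np.linalg.norm(g)
    v = fp / np.linalg.norm(fp)
    return float(np.linalg.norm(u - v))

def detect(gradients, fingerprint, threshold, metric=l2_distance):
    """Flag test points whose gradient lies within `threshold` of the fingerprint;
    sweeping `threshold` traces out the ROC curve."""
    return [metric(g, fingerprint) <= threshold for g in gradients]
```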
  • the gradients generated by the VAE module 210 may be used for clustering.
  • the idea is that if the clustering is effective, the identification of the types of attacks may be limited to a smaller number of clusters thereby increasing the overall identification speed of the system.
  • Figure 11 illustrates the clusters that four of the attacks belong to when they are clustered accordingly. For example, 92.4% of the DoS attacks appear in only two clusters (c82 and c84), with the other 7.6% appearing in four other clusters. For spam attacks, 74.3% appeared in two clusters (c11 and c15), while 25.7% appeared in another 11 clusters.
  • clustering is an effective tool that allows analysts to focus on a much smaller number of clusters for a particular type of attack.
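A sketch of the clustering step using plain Lloyd's k-means over gradient fingerprints (the patent does not name a specific clustering algorithm, so k-means is an assumption; farthest-point initialisation keeps the sketch deterministic):

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Cluster gradient fingerprints X (n points x d features) into k groups."""
    # farthest-point initialisation: robust when clusters are well separated
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # assign each fingerprint to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```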


Abstract

This document describes a system and method for detecting and identifying network anomalies. In particular, this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic using deep learning techniques and a gradient-based fingerprinting technique.

Description

SYSTEM AND METHOD FOR NETWORK ANOMALY DETECTION AND ANALYSIS
Field of the Invention
This invention relates to a system and method for detecting and identifying network anomalies. In particular, this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic. This is done using deep learning techniques and a gradient-based fingerprinting technique.
Summary of Prior Art
Anomalies in network traffic may occur due to various types of cyber-related attacks and threats, such as Distributed Denial-of-Service (DDoS) attacks (e.g., TCP SYN flooding, DNS amplification attacks, etc.), brute force attempts, botnet communications, spam campaigns, network/port scans, etc. Such network anomalies may also occur due to non-malicious causes, such as faults that occur in the network, misconfigurations, improper Border Gateway Protocol (BGP) policy updates, changes in user behaviours, etc. Hence, in order to ensure minimal disruptions to the network and to maintain the security of the network, these anomalies should be detected, identified and categorized in real time so that they may be rapidly addressed and resolved as required. Maintaining the network’s integrity is of increasing importance, especially with the rapid adoption of Internet of Things (IoT) technology. With wider implementations of IoT devices, cyber criminals are able to target and harness more resources by exploiting the vulnerabilities found in IoT devices (e.g. by using the Mirai attack).
The detection of network anomalies remains challenging for a number of reasons. First, the characteristics of network data are dependent on a number of factors, such as the end-user’s behaviour, the customer’s types of business (e.g., banking, retail), the types of applications, the location, the time of the day, and are expected to change with time. Such diversity and dynamism limit the application of rule-based systems for the detection of anomalies in network traffic.
Further, as capturing, storing and processing raw traffic from high capacity networks is not practical, Internet routers today typically extract and export meta data in the form of, but not limited to, NetFlow records. A typical NetFlow record represents meta data of a set of related packets, and is often generated from sampled packets. A downside of using NetFlow records is that useful information, including suspicious keywords in payloads, Transmission Control Protocol (TCP) state transitions, Transport Layer Security (TLS) connection handshakes, sizes of each packet, time between consecutive packets, etc., is unavailable. Therefore, anomaly detection solutions that utilize NetFlow records may contain incomplete and/or lossy information.
Finally, as security operation centre (SOC) analysts usually have a limited budget for analysing alerts raised by an anomaly detector (including alert escalation, threat and attack mitigation, intelligence gathering, forensic analysis, etc.), it is desirable that anomaly detectors go beyond merely indicating the presence of anomalies, and also seek to provide other information such as the time of the anomaly, the anomaly type, and the corresponding set of suspicious flows. In general, the more information that is passed (along with the alerts) to the analysts, the faster the analysis of the threat and the quicker the decision process.
Those skilled in the art have proposed that statistical models and machine learning algorithms be used to address the problems mentioned above. In other words, when the detection of anomalies is treated as a binary classification problem, a supervised machine learning model can be built using normal and anomalous data, and this may be used to classify anomalies. However, existing approaches have the following limitations.
First, many of the approaches exploit only a small number of features (e.g. traffic volume, flow rates, or entropy). Such approaches require the users to use the appropriate domain knowledge to select the right set of features, which may not always be feasible or optimal.
Second, supervised machine learning approaches require large sets of data with ground truth for training. As the network’s characteristics and attacks evolve, models have to be retrained and the corresponding labelled datasets have to be made available. This requires costly and laborious manual efforts, and yet, given the size of traffic flowing through a backbone network, it is highly impractical to assume that all data records are correctly labelled. Further, supervised machine learning approaches are unlikely to detect unseen and zero-day attack traffic.
In view of the above, it is most desirable to employ an unsupervised network anomaly detection method that is scalable in terms of both network data size and feature dimension. Such a method would be useful in detecting anomalies in large scale networks without the need to rely on domain knowledge. Further, in addition to detecting anomalies, the method should be able to analyse the detected anomalies and ascertain the type of attacks by identifying the main features that cause the anomalies. For the above reasons, those skilled in the art are constantly striving to come up with a system and method that is capable of receiving and quantitatively unifying unstructured and/or unlabelled information security threat data from any source or system, whereby the processed information is then provided back to all the upstream systems to actively tune and improve the security postures of these systems in near-real-time.
Summary of the Invention
The above and other problems are solved and an advance in the art is made by systems and methods provided by embodiments in accordance with the invention.
A first advantage of embodiments of systems and methods in accordance with the invention is that based on gradient information obtained from the deep neural network model, flagged anomalies may be effectively and efficiently identified.
A second advantage of embodiments of systems and methods in accordance with the invention is that gradient information obtained from the deep neural network model may be employed to detect future anomalies that have yet to be labelled or identified.
A third advantage of embodiments of systems and methods in accordance with the invention is that the invention does not require labelled data as training data and as a result, is likely to detect zero-day type network attacks.
A fourth advantage of embodiments of systems and methods in accordance with the invention is that the invention provides an efficient way of explaining and/or identifying attacks from detected anomalous traffic, as not all anomalous traffic may comprise cyber-attacks on the network.
The above advantages are provided by embodiments of a method in accordance with the invention operating in the following manner.
According to a first aspect of the invention, a method for detecting and analysing anomalies in network traffic is disclosed whereby the method performed by a computer system comprises: collecting network data from the network traffic; extracting one or more features from the network data to form a dataset; providing the dataset to a deep neural network model to train the deep neural network model; detecting anomalies in the network traffic using the trained deep neural network model; and computing gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies. With reference to the first aspect, the step of extracting one or more features from the network data to form the dataset comprises: grouping the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses, wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
With reference to the first aspect, the one or more features or aggregated features extracted from the network data comprises one or more from the group comprising of: mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate; entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and proportion of ports used for common applications including Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
With reference to the first aspect, the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
With reference to the first aspect, the training of the VAE model is performed using Adam as an optimisation algorithm.
With reference to the first aspect, the step of using the trained deep neural network model to detect anomalies in the network traffic comprises: obtaining a reconstruction error for each data point; and comparing the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
With reference to the first aspect, the method further comprises the step of clustering the computed gradients for the one or more features.
With reference to the first aspect, the method further comprises the step of identifying types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
According to a second aspect of the invention, a computer system for detecting and analysing anomalies in network traffic is disclosed, the system comprising circuitry configured to collect network data from the network traffic; circuitry configured to extract one or more features from the network data to form a dataset; circuitry configured to provide the dataset to a deep neural network model to train the deep neural network model; circuitry configured to detect anomalies in the network traffic using the trained deep neural network model; and circuitry configured to compute gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
With reference to the second aspect, the circuitry configured to extract the one or more features from the network data to form the dataset further comprises: circuitry configured to group the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses, wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
With reference to the second aspect, the one or more features or aggregated features extracted from the network data comprises one or more from the group comprising of: mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate; entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and proportion of ports used for common applications including Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
With reference to the second aspect, the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
With reference to the second aspect, the training of the VAE model is performed using Adam as an optimisation algorithm.
With reference to the second aspect, the circuitry configured to use the trained deep neural network model to detect anomalies in the network traffic comprises: circuitry configured to obtain a reconstruction error for each data point; and circuitry configured to compare the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
With reference to the second aspect, the computer system further comprises circuitry configured to cluster the computed gradients for the one or more features.
With reference to the second aspect, the computer system further comprises circuitry configured to identify types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.

Brief Description of the Drawings
The above and other problems are solved by features and advantages of a system and method in accordance with the present invention described in the detailed description and shown in the following drawings.
Figure 1 illustrating a process or method for detecting and analysing anomalies in a network in accordance with embodiments of the invention;
Figure 2 illustrating a block diagram of modules that may be used to implement the method for detecting and analysing network anomalies in accordance with embodiments of the invention;
Figure 3 illustrating a block diagram representative of processing systems providing embodiments in accordance with embodiments of the invention;
Figure 4 illustrating an exemplary architecture of an Autoencoder (AE) and a Variational Autoencoder (VAE) in accordance with embodiments of the invention;
Figure 5 illustrating plots of VAE loss and Kullback-Leibler (KL) loss from a recovered Gaussian distribution;
Figure 6 illustrating distributions of reconstruction errors of training data for various anomalies as obtained using the VAE model for Example 1;
Figure 7 illustrating the receiver operation characteristic (ROC) curves for VAE, AE and Gaussian-based thresholding (GBT) in detecting anomalies in the training dataset for Example 1;
Figure 8 illustrating the ROC curves for VAE, AE and GBT in detecting anomalies in the test dataset for Example 1;
Figure 9 illustrating the normalized gradients of spam, Scan11 and Scan44 based on selected features using the VAE model for Example 1;
Figure 10 illustrating the ROC curves for anomaly detection using fingerprints for Example 1; and
Figure 11 illustrating the distribution of clusters for each attack type for Example 1.

Detailed Description
This invention relates to a system and method for detecting and identifying network anomalies. In particular, this invention relates to a system and method for detecting and identifying the network anomalies contained within network traffic using deep learning techniques and a gradient-based fingerprinting technique. Still more particularly, this invention relates to a system and method for processing a network’s meta data, such as NetFlow records obtained from network monitoring activities, using a deep neural network model to detect network anomalies whereby the anomalies are subsequently analysed based on gradient information obtained from the deep neural network model.
The present invention will now be described in detail with reference to several embodiments thereof as illustrated in the accompanying drawings. In the following description, numerous specific features are set forth in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art, that embodiments may be realised without some or all of the specific features. Such embodiments should also fall within the scope of the current invention. Further, certain process steps and/or structures in the following may not be described in detail, and the reader will be referred to a corresponding citation, so as to not obscure the present invention unnecessarily.
Further, one skilled in the art will recognize that many functional units in this description have been labelled as modules throughout the specification. The person skilled in the art will also recognize that a module may be implemented as circuits, logic chips or any sort of discrete component. Still further, one skilled in the art will also recognize that a module may be implemented in software which may then be executed by a variety of processors. In embodiments of the invention, a module may also comprise computer instructions or executable code that may instruct a computer processor to carry out a sequence of events based on instructions received. The choice of the implementation of the modules is left as a design choice to a person skilled in the art and does not limit the scope of this invention in any way.
Figure 1 sets out an exemplary flowchart of process 100 for detecting, analysing and identifying a network anomaly in accordance with embodiments of the invention. Process 100 comprises the following steps:
Step 105: collecting network data;
Step 110: extracting one or more features from the network data that is collected in step 105 to form a training dataset;
Step 115: feeding the training dataset from step 110 into a deep neural network model, thereby training the deep neural network model to learn normal behaviour of the network;
Step 120: using the trained deep neural network model to detect an anomaly; and
Step 125: obtaining gradient information of the anomaly from the deep neural network to analyse and identify the anomaly.
The steps of process 100 may be performed by modules contained within system 200, as illustrated in Figure 2, whereby system 200 comprises feature extraction module 205, gradient fingerprinting module 225 and VAE module 210 (which in turn comprises training module 215 and anomaly detection module 220). In embodiments of the invention, system 200 may comprise a computer system.
In accordance with embodiments of the invention, a block diagram representative of components of processing system 300 that may be provided within modules 205, 210, 215, 220 and 225 for implementing embodiments in accordance with embodiments of the invention is illustrated in Figure 3. One skilled in the art will recognize that the exact configuration of each processing system provided within these modules may be different, that the exact configuration of processing system 300 may vary, and that Figure 3 is provided by way of example only.
In embodiments of the invention, each of modules 205, 210, 215, 220 and 225 may comprise controller 301 and user interface 302. User interface 302 is arranged to enable manual interactions between a user and each of these modules as required and for this purpose includes the input/output components required for the user to enter instructions to provide updates to each of these modules. A person skilled in the art will recognize that components of user interface 302 may vary from embodiment to embodiment but will typically include one or more of display 340, keyboard 335 and track-pad 336.
Controller 301 is in data communication with user interface 302 via bus 315 and includes memory 320, processor 305 mounted on a circuit board that processes instructions and data for performing the method of this embodiment, an operating system 306, an input/output (I/O) interface 330 for communicating with user interface 302 and a communications interface, in this embodiment in the form of a network card 350. Network card 350 may, for example, be utilized to send data from these modules via a wired or wireless network to other processing devices or to receive data via the wired or wireless network. Wireless networks that may be utilized by network card 350 include, but are not limited to, Wireless-Fidelity (Wi-Fi), Bluetooth, Near Field Communication (NFC), cellular networks, satellite networks, telecommunication networks, Wide Area Networks (WAN), etc.
Memory 320 and operating system 306 are in data communication with CPU 305 via bus 310. The memory components include both volatile and non-volatile memory, and more than one of each type of memory, including Random Access Memory (RAM) 320, Read Only Memory (ROM) 325 and a mass storage device 345, the last comprising one or more solid-state drives (SSDs). Memory 320 also includes secure storage 346 for securely storing secret keys, or private keys. One skilled in the art will recognize that the memory components described above comprise non-transitory computer-readable media and shall be taken to comprise all computer-readable media except for a transitory, propagating signal. Typically, the instructions are stored as program code in the memory components but can also be hardwired. Memory 320 may include a kernel and/or programming modules such as a software application that may be stored in either volatile or non-volatile memory.
Herein, the term "processor" is used to refer generically to any device or component that can process such instructions and may include: a microprocessor, microcontroller, programmable logic device or other computational device. That is, processor 305 may be provided by any suitable logic circuitry for receiving inputs, processing them in accordance with instructions stored in memory and generating outputs (for example to the memory components or on display 340). In this embodiment, processor 305 may be a single core or multi-core processor with memory addressable space. In one example, processor 305 may be multi-core, comprising, for example, an 8-core CPU.
With reference to Figure 1, at step 105, network data is constantly collected by network routers within a monitored network. In accordance with embodiments of the invention, step 105 may be performed by feature extraction module 205 as illustrated in Figure 2.
In the following description, for illustration purposes, it shall be assumed that the network data collected by module 205 comprises NetFlow records. It should be appreciated that other forms of aggregate data that are collectable by routers in an Internet Service Provider (ISP) network may be collected at step 105 by module 205 and that the type of data collected at step 105 is not specifically limited only to NetFlow type records.
As known to one skilled in the art, a NetFlow record comprises a set of packets that has the same five-tuple of source and destination IP addresses, source and destination ports, and protocol. In addition to the above, some of the important fields collected in the NetFlow records include, but are not limited to, the start time of the flow (based on the first sampled packet), duration, number of packets and Transmission Control Protocol (TCP) flag.
At step 110, using module 205, one or more features may be extracted from the NetFlow records collected at step 105 to form a training dataset. In an embodiment of the invention, the NetFlow records are first grouped into sliding windows of a predetermined duration based on source IP addresses before one or more aggregated features are extracted from each window and statistically analysed. In a preferred embodiment, the NetFlow records are grouped into 3-minute long sliding windows based on the source IP addresses. This means that each data point in the training dataset corresponds to network statistics from a single source IP address within a 3-minute period. This enables identification of an offending IP address that is responsible for an anomaly and the time window to which the anomaly belongs. The period of 3 minutes is chosen as a balance between the practicality and quality of statistics from the aggregated features, whereby the statistics will be insignificant if the window period is too short, and the capability for real-time analysis is lost if the window period is too long. Hence, as would be understood by the skilled person, the duration of a 3-minute sliding window is not meant to be limiting to the present invention and other durations may be used without departing from this invention.
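For illustration purposes, the grouping step described above may be sketched in Python as follows. The record field names (`src_ip`, `start`) are assumptions of this sketch, not a NetFlow schema, and non-overlapping (tumbling) windows are used for simplicity; a full implementation would place each flow into every overlapping sliding window.

```python
from collections import defaultdict

WINDOW_SECONDS = 180  # 3-minute aggregation window, as in the embodiment


def group_flows(flows):
    """Group NetFlow-like records into (source IP, window) buckets.

    Each record is a dict with at least 'src_ip' and 'start' (epoch seconds).
    One bucket corresponds to one data point in the training dataset.
    """
    buckets = defaultdict(list)
    for flow in flows:
        window_id = int(flow["start"] // WINDOW_SECONDS)
        buckets[(flow["src_ip"], window_id)].append(flow)
    return buckets


flows = [
    {"src_ip": "10.0.0.1", "start": 10.0},
    {"src_ip": "10.0.0.1", "start": 170.0},
    {"src_ip": "10.0.0.1", "start": 190.0},  # falls in the next window
    {"src_ip": "10.0.0.2", "start": 20.0},
]
buckets = group_flows(flows)
```

Each bucket would then be passed to the feature-extraction routines to compute the aggregated statistics described below.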
In a preferred embodiment, 53 aggregated features are extracted from the NetFlow records to form the training dataset. These include:
• mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate;
• entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and
• proportion of ports used for common applications (e.g. Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3).
To ensure that meaningful statistics are captured, data points that contain too few flows (e.g. fewer than 10) may be removed from the training dataset to reduce noise. The statistics may be further normalised into a Z-score or scaled to a value between 0 and 1.
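The filtering and normalisation described above may be sketched as follows; the function names and the `n_flows` field are illustrative only.

```python
def filter_sparse(points, min_flows=10):
    """Drop data points whose window contains too few flows."""
    return [p for p in points if p["n_flows"] >= min_flows]


def zscore(values):
    """Normalise a feature column into Z-scores (zero mean, unit variance)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


def minmax(values):
    """Scale a feature column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in values]
```

Either normalisation may be applied per feature; both leave the data dimensionless so that no single feature dominates the reconstruction error.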
At step 115, the training dataset from step 110 is fed into a deep neural network model found in training module 215 to learn the normal behaviour of the network. Various unsupervised deep neural network models may be used. These models do not require labelled information (i.e. normal or anomalous) for training and instead exploit the fact that anomalous behaviours tend to differ greatly from the standard or normal behaviour of the network. In an embodiment of the invention, the unsupervised deep neural network model may comprise an Autoencoder (AE) model or a Variational Autoencoder (VAE) model that is a probabilistic generalisation of the AE model.
An exemplary architecture 400 for an AE or VAE model is illustrated in Figure 4. In general, the AE or VAE model is made of three main layers: an input layer 401 to take in the features; a latent representation layer 402 of the features; and an output layer 403 that is a reconstruction of the features. As illustrated, the AE or VAE model comprises two parts, an encoder 405 and a decoder 410. The function of encoder 405 is to map a feature from the input layer 401 into its latent representation in the latent representation layer 402 while decoder 410 derives an output in the output layer 403 that is a reconstruction of the feature from the input layer 401 based on the latent representation.
The encoder 405 may be considered as part of a deep neural network in the sense that information from the input is passed through several mappings (and hidden layers) similar to the deep architecture in a supervised deep learning model; and likewise for the decoder 410.
As an illustrative example, the latent representation layer 402 may be set to a size of 100. In addition, the encoder 405 and the decoder 410 may each have three hidden layers with sizes 512, 512, and 1,024 respectively, as illustrated in Figure 4. In this illustration, nodes that are shaded represent the observed data (used as both inputs and outputs), while the unshaded nodes represent unobserved latent variables that correspond to the hidden layers. The exact sizes or dimensions of the layers are shown above the nodes in this illustration. Additionally, the links between the layers show how the values of the next layer can be computed. Commonly, for an AE model, the value of one hidden layer can be computed as:

h_{l+1} = g(W h_l + b) (1)

where h_l is a vector of values for the previous layer, W is a matrix of weights that signifies the relationship from the previous layer, and b is a vector of bias terms. Both W and b are parameters to be learned through training and optimising the model using the training dataset from step 110. The function g(·) is known as the activation function that transforms the computation in a non-linear way and allows complex relationships to be learned. Popularly used activation functions include the sigmoid function g(x) = 1/(1 + e^{−x}) and the Rectified Linear Unit (ReLU), g(x) = max(0, x), which is preferably used in the present invention.
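Equation (1) may be illustrated with a minimal NumPy sketch. The layer sizes follow Figure 4 (53 input features, a first hidden layer of 512); the random weights are placeholders standing in for trained parameters.

```python
import numpy as np


def relu(x):
    """Rectified Linear Unit activation, g(x) = max(0, x)."""
    return np.maximum(0.0, x)


def dense(h_prev, W, b, g=relu):
    """Equation (1): compute the next layer from the previous layer's values."""
    return g(W @ h_prev + b)


rng = np.random.default_rng(0)
h0 = rng.standard_normal(53)             # one data point of 53 features
W1 = rng.standard_normal((512, 53)) * 0.05  # placeholder, not trained weights
b1 = np.zeros(512)
h1 = dense(h0, W1, b1)                   # first hidden layer of the encoder
```

Stacking several such layers in the encoder and decoder yields the full AE; training then fits every W and b jointly.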
The learning of the parameters is generally achieved by minimizing the reconstruction errors (e.g. mean square errors) via backpropagation with random initialization, and can be optimized with a variety of optimizers such as the stochastic gradient descent optimizer. The detailed optimization steps are omitted here for brevity, and the person skilled in the art may obtain more details from the publication by Y. Bengio titled "Practical recommendations for gradient-based training of deep architectures", Neural Networks: Tricks of the Trade, Springer, 2012, pages 437-478, which is incorporated herein in its entirety for details on optimisation.
Based on the above, it can be said that the AE model may be viewed as a deterministic model that maps a set of inputs into their reconstructed outputs. On the other hand, the VAE model is a generative model that treats the latent representation layer as random variables that are conditional on the inputs. While the encoder and decoder in the VAE follow the same computational model as the AE (i.e. as set out in equation (1)), the encoding process is instead used to compute the parameters for the conditional distributions of the latent representation layer. The parameters that are obtained from model training are then used to generate or sample the latent representation for decoding. A detailed explanation on the working of the VAE is set out in the later sections.
In general, the conditional distributions in the VAE model are assumed to be Gaussian for real-valued nodes. For example, when z_l is denoted as the value of the latent representation layer, then it can be written as

z_l ~ N(μ_z, diag(σ_z²)) (2)

where diag(·) denotes a function that transforms a vector into a diagonal matrix, and μ_z and σ_z² are the mean and variance for the conditional Gaussian distribution obtained from the output of the encoder:

(μ_z, σ_z²) = f_encode(x) (3)

The parameters are interpreted in the same way as in the AE
model. Treatment of the hidden layers is also identical to that of the AE. The probabilistic nature of the VAE also means that the usual learning algorithm on a standard objective function (e.g. mean square error) may not be used to train the model. Instead, a class of approximate statistical inference methods known as Variational Bayes is used. The detailed steps are omitted here for brevity, and the person skilled in the art may obtain more details from the publication by D.P. Kingma and M. Welling titled "Auto-encoding variational Bayes", arXiv preprint arXiv:1312.6114, 2013, which is incorporated herein in its entirety. In essence, where the VAE model is used, an alternative objective function known as the variational lower bound is optimized, and stochastic sampling is used for approximation. In terms of architecture, the architecture of a VAE is similar to that of an AE as illustrated in Figure 4. The ReLU activation function is used by the encoder and the decoder in all of the intermediate layers, and the linear activation g(x) = x is used for the output.
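The sampling of the latent representation from the conditional Gaussian of Equation (2) may be sketched as follows, using the standard reparameterisation z = μ + σ ⊙ ε with ε ~ N(0, I). Working with log σ² and the latent size of 100 from Figure 4 are assumptions of this sketch; the encoder network producing μ and log σ² is omitted.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(sigma^2)) via the reparameterisation trick.

    Keeping the noise eps separate from (mu, sigma) is what makes the
    sampling step differentiable with respect to the encoder outputs.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


rng = np.random.default_rng(42)
mu = np.zeros(100)       # latent representation layer of size 100 (Figure 4)
log_var = np.zeros(100)  # i.e. sigma = 1
z = sample_latent(mu, log_var, rng)   # one stochastic latent sample
```

The sampled z is then fed to the decoder to produce the reconstruction, exactly as a deterministic AE would use its latent layer.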
In this regard, the training algorithm of the VAE model may be implemented using TensorFlow. In summary, before starting the training procedure, the parameters in the VAE are randomly initialized. A forward pass on the encoder is subsequently performed by computing the distribution of the latent representation layer via equation (2). With this, several samples can be generated from the Gaussian distribution which are used to compute the variational lower bound, which consists of a Kullback-Leibler (KL) divergence term and an expectation term:
L(x) = E_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ p(z)) (4)
where z is the latent representation of the input features x. Here, the distribution p(·) corresponds to the Gaussian prior and conditional distribution of the VAE model, while q(·) is a variational approximation of p(·), generally chosen to be Gaussian as well (as is common in auto-encoding variational Bayes). Fortunately, this objective function can be maximized with stochastic optimization techniques since the gradients are readily available via automatic differentiation.
As an example, an optimisation algorithm that enables training to be performed in mini-batches, as outlined in the publication by D.P. Kingma and J. Ba titled "Adam: A method for stochastic optimization", Third International Conference on Learning Representations, 2015, herein incorporated in its entirety, may be used. This generally allows for real-time training by choosing a small mini-batch size and discarding the data after one epoch. Further, label information is not required during the training of the model.
Returning to Figure 1, at step 120 and using module 220, once the parameters have been optimised by module 215 at step 115, the deep neural network model is then used for anomaly detection. As an example, when the VAE model is used in module 220, an IP address and its time window may be recognised by module 220 as abnormal when a reconstruction error of its input features is high. For clarity, the reconstruction error comprises the mean square difference between the observed features and the expectation of their reconstruction. A high reconstruction error is generally observed when the network behaviour differs greatly from the normal behaviour that was learned by the VAE. In embodiments of the invention, a threshold value may be selected such that a small percentage (say 5%) of the data is treated as anomalies. Otherwise, labelled information may be used to select the threshold to maximize the detection rate while keeping the false positive rate small. Note that although the anomalies correspond to unusual or unexpected network behaviours demonstrated by a particular source IP address, they may not necessarily be malicious in nature.
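The threshold selection described above may be sketched as follows. The function names are illustrative; the 95th-percentile cut-off corresponds to treating roughly 5% of the data as anomalies, as in the embodiment.

```python
import numpy as np


def fit_threshold(train_errors, alpha=95.0):
    """Pick delta so that alpha% of training reconstruction errors fall below it."""
    return np.percentile(train_errors, alpha)


def is_anomaly(error, delta):
    """Flag a data point whose reconstruction error reaches the threshold."""
    return error >= delta


# Synthetic reconstruction errors: mostly small, a few large outliers.
train_errors = np.array([0.1] * 95 + [5.0] * 5)
delta = fit_threshold(train_errors, 95.0)
```

At test time, each (source IP, window) data point is scored once and compared against the fixed delta, so detection is a constant-time operation per data point.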
At step 125, an anomaly is analysed by module 225 by obtaining gradient information from the deep neural network. This is done by analysing the gradient that is contributed by each feature at the anomalous data point, i.e. how the VAE's objective function varies if a feature in the anomalous data point increases or decreases by a small amount. Intuitively, given the trained VAE and an anomalous data point, if the objective function (reconstruction error) changes significantly when a particular feature of the data point is varied by a small amount, then this feature at its current value is significantly abnormal, since it would likely perturb the VAE model (through optimization) to fit itself better.
Gradients, or more technically the derivatives of the variational lower bound, are computed at step 125 for each feature j of each data point i. The gradient ∂L/∂x_{ij} may be obtained as follows:

g_{ij} = ∂L(x_i)/∂x_{ij} (5)
Two applications of the gradient can be immediately derived. In an embodiment of the invention, even without employing ground truth labels, the flagged anomalies can be clustered based on their gradients ∇_x L(x) into groups that share similar behaviour, making it easier for analysts to investigate. In another embodiment of the invention, if labelled information on certain types of attacks were used to train the deep neural network model, gradient-based fingerprints associated with the attacks may be derived. These fingerprints may then be used to identify similar future attacks. The anomalies that are identified through the fingerprinting approach are more accurate as labelled information was indirectly used in a way similar to semi-supervised learning.
Variational Autoencoder (VAE)
The VAE model that may be utilized in accordance with embodiments of the invention is discussed in greater detail in this section. The VAE model may be used to represent the probability of a data point p(x) through a latent variable z ∈ R^{d_z}, where the probability of a data point x is:

p(x) = ∫ p(x|z) p(z) dz (6)

where the conditional distribution p(x|z) is parameterised by a neural network f_decode, and the probability p(z) is the prior probability of the latent random variable z, which is assumed to be a standard Gaussian distribution, i.e. p(z) = N(z; 0, I), where 0 is a vector of all 0s of length d_z, and I is an identity square matrix with the number of columns equal to d_z.
Given a data point x, the VAE is able to learn about the posterior distribution of the latent random variable, p(z|x), which is done via variational inference (VI). VI learns an approximated posterior, q(z|x), such that it minimizes the Kullback-Leibler (KL) divergence between the approximate distribution q(z|x) and the exact posterior distribution p(z|x), denoted as D_KL(q(z|x) ‖ p(z|x)). This is done by minimizing the following objective function:

L(x) = −E_{q(z|x)}[log p(x|z)] + D_KL(q(z|x) ‖ p(z)) (8)
where L(x) = D_KL(q(z|x) ‖ p(z|x)) − log p(x), so that minimizing L(x) over q is equivalent to minimizing the KL divergence to the exact posterior, since log p(x) does not depend on q. The approximate probability q(z|x) is assumed to be a Gaussian distribution with the mean being a column vector of length d_z, denoted as μ_z, and the covariance matrix being a square matrix whose diagonal is a column vector of length d_z, denoted as σ_z². The mean and the variance are modeled with a neural network whose input is x and outputs are (μ_z, σ_z²). We denote this neural network as f_encode.
It is noted that as q(z|x) and p(z) are both Gaussian distributions, the second term in Equation (8) has a closed-form expression:

D_KL(q(z|x) ‖ p(z)) = ½ Σ(σ_z² + μ_z² − 1 − log σ_z²) (9)

where Σ(a) denotes the sum of all elements in the vector a. The first term in Equation (8), i.e. −E_{q(z|x)}[log p(x|z)], can be replaced with the mean squared error (MSE) for a continuous random variable x, i.e. ‖x − f_decode(z)‖₂², where z is a sample drawn from q(z|x). Hence, the objective function may be derived as:

L(x) = ‖x − f_decode(z)‖₂² + ½ Σ(σ_z² + μ_z² − 1 − log σ_z²) (10)
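Equations (9) and (10) may be illustrated with a minimal NumPy sketch. Working with log σ_z² (rather than σ_z² directly) for numerical stability is an implementation choice of this sketch, not mandated by the description.

```python
import numpy as np


def kl_term(mu_z, log_var_z):
    """Closed-form KL between N(mu_z, diag(sigma_z^2)) and N(0, I), Equation (9)."""
    var = np.exp(log_var_z)
    return 0.5 * np.sum(var + mu_z ** 2 - 1.0 - log_var_z)


def vae_loss(x, x_recon, mu_z, log_var_z):
    """Equation (10): squared reconstruction error plus the KL regulariser."""
    mse = np.sum((x - x_recon) ** 2)
    return mse + kl_term(mu_z, log_var_z)
```

Note that when the approximate posterior equals the prior (μ_z = 0, σ_z² = 1), the KL term vanishes, so the loss reduces to the plain AE reconstruction error.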
It should be noted that the training of the VAE is iterative. At each iteration, a random mini-batch is drawn from the dataset, over which the average of L(x) is minimized. This is done using any available optimization package, such as Adam, which is available in the Tensorflow library. In particular, when this library is adopted, the method minimize can be called in the class tf.keras.optimizers.Adam with the average of Equation (10) over a mini-batch as the objective function, and the parameters of f_encode and f_decode as the trainable variables.
Anomaly Detection
Intuitively, following on from the section on VAE above, if L(x) in Equation (10) is large, it can be assumed that x is anomalous. For example, let δ be a threshold value such that L(x) < δ for α% of the dataset. For example, if α = 95, it means that 95% of the dataset have the loss value L(x) less than the threshold value δ. Then given a test data point x*, we can compute the loss L(x*), and classify x* as an anomaly if L(x*) ≥ δ.
Anomaly Explanation using Gradient Analysis
The gradient of the loss L(x) may be computed with respect to (w.r.t.) a test data point x*:

g* = ∇_x L(x)|_{x=x*} = (∂L(x)/∂x_1, ..., ∂L(x)/∂x_{d_x})ᵀ|_{x=x*} (11)

where L(x) is defined in Equation (10), the data point x* is represented as a d_x-dimensional vector x* = (x*_1, ..., x*_{d_x})ᵀ, and ∂L(x)/∂x_1 denotes the partial derivative of L(x) w.r.t. the first feature/dimension of x (these d_x dimensions are also referred to as the features). These gradients can be obtained with tf.gradients in the Tensorflow library by specifying Equation (10) as the ys and x* as the xs. As such,

x* ← x* − ε g* (12)

where g* is the column vector of gradients of length d_x defined in Equation (11). The i-th element of g* represents the adjustment to the i-th element of x*. Hence, we can make an anomaly x* less anomalous by adjusting x* ← x* − ε g* for a small positive number ε, i.e., a small adjustment in the direction of the negative gradient of the loss w.r.t. the data point. Therefore, we use this gradient of the loss as an explanation of the anomaly. For example:

• If ∂L(x)/∂x_1|_{x=x*} ≫ 0, then it means that the first dimension/feature of x*, i.e. x*_1, is too large to be a normal data point in the dataset. This is because increasing the first dimension/feature of x* by a little increases the loss significantly.

• If ∂L(x)/∂x_1|_{x=x*} ≪ 0, then it means that the first dimension/feature of x* is too small to be a normal data point in the dataset. This is because decreasing the first dimension/feature of x* by a little increases the loss significantly.

• If ∂L(x)/∂x_1|_{x=x*} ≈ 0, then it means that the first dimension/feature of x* does not explain why x* is anomalous. This is because changing the first dimension/feature of x* by a little has no effect on the loss.
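The gradient-based explanation above may be illustrated with a toy example. A simple squared distance to a "normal" point stands in for the VAE loss L(x), and central finite differences stand in for tf.gradients; both substitutions are assumptions of this sketch.

```python
import numpy as np

mu = np.array([10.0, 50.0])  # toy "normal" behaviour learned from data


def loss(x):
    """Stand-in for the VAE loss L(x): large far from normal behaviour."""
    return np.sum((x - mu) ** 2)


def gradient(x, h=1e-5):
    """Central finite-difference gradient of loss() w.r.t. each feature."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (loss(x + e) - loss(x - e)) / (2 * h)
    return g


x_star = np.array([30.0, 50.0])  # anomalous: first feature far above normal
g = gradient(x_star)             # g[0] >> 0 explains the anomaly; g[1] ~ 0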
Such an example is illustrated in Figure 5, where the dataset consists of 2-dimensional data points following a Gaussian distribution (plots 505 and 510). We assume that data points that are more than 2 standard deviations from the mean may be treated as anomalies (plotted as dots in plots 505 and 510). The bottom two plots 515 and 520 show the negative values of the gradients of both the VAE loss, L(x), and the KL loss, D_KL(q(z|x) ‖ p(z)), at different data points in the domain. The negative gradients of the VAE loss or the KL loss show the direction of the adjustment to these data points to make them normal (i.e., towards the mean/center in this case). As an example, reference is made to the data point plotted as the red star in plots 515 and 520 (near the location (30, 200)). Its negative gradient vector points to the left, so it means that the feature on the horizontal axis is too large and the feature on the vertical axis plays no role in explaining the anomaly.
The present invention is further explained by way of an example.
Example 1
Dataset and evaluation
The UGR16 dataset, which contains anonymized NetFlow traces captured from a real network of a Tier 3 ISP, is used herein as an example. The ISP provides cloud services and is used by many client companies of different sizes and markets. The UGR trace is a fairly recent and large scale data trace that contains real background traffic from a wide range of Internet users, rather than specific traffic patterns from synthetically generated data (e.g., DARPA'98 and DARPA'99, UNB ISCX 2012, UNSW-NB15, CTU13).
The UGR contains traffic for the whole day and traces over a 4-month period. Furthermore, the UGR's attack traffic data is a mixture of generated attacks, labelled real attacks, and botnet attacks from a controlled environment. Specifically, the labelled attacks comprise:

• Low-rate Denial-of-Service (DoS): TCP SYN packets are sent to victims, with packets of size 1280 bits at a rate of 100 packets/s to port 80. The rate of the attack is sufficiently low such that the normal operation of the network is not affected.

• Port scanning: a continuous SYN scan to common ports of victims. There are two kinds of scanning: one-to-one scan attack (Scan11) and four-to-four (Scan44).
• Botnet: A simulated botnet traffic obtained from the execution of the malware known as Neris. This data comes from the CTU13 trace.
• Spam: Peaks of SMTP traffic data that were determined as a spam campaign.
• Blacklist: IP addresses published in the public blacklists.
A total of five days of the UGR data were selected for use herein. Two Saturdays were used as training data while three other days (Friday and Sundays) were chosen as test days.
The statistics for the data are presented in Table I below whereby NetFlow records that are not given any labels make up the background data. Note that the data for March 18 is smaller as it does not comprise a full day’s data.
[Table I: statistics of the selected UGR16 data]
After the data was processed by feature extraction module 205, a training dataset of 5,990,295 data points was obtained. The data was then used by training module 215 (contained within module 210) to train a VAE model by stochastic optimization with 50 epochs and a mini-batch of size 300, whereby the weight decay was set to 0.01. As a result, a total of 1,957,711 data points on March 18, 2,954,983 data points on March 20, and 2,878,422 data points on July 31 were processed. It should be noted that a data point belongs to an attack type if the proportion of flows that are labelled with such attack within the 3-minute aggregation is greater than 50%. Figures 6(a)-(f) illustrate the distribution of the reconstruction errors that were generated for the training data. The ground truth labels were used to separate the anomalies from the background flows, which allows the anomalies to be examined to determine whether they behave differently from what was expected. As can be seen from Figures 6(a)-(f), there is some overlap in the reconstruction error plots with the background plots for spam (Figure 6(a)), botnet (Figure 6(b)), DoS (Figure 6(c)), and scanning activities (Figures 6(d)-(f)). A cut-off point may be used to distinguish between background noise and an attack such as spam, botnet, DoS, and scanning activities. However, from Figure 6(f), care should be taken for blacklist type anomalies as the behaviour for blacklist appears to be indistinguishable from the background traffic.
Comparing Anomaly Detection performance
As a reference, the performance of the VAE model for anomaly detection was compared against that of AE and also a Gaussian-based thresholding (GBT) approach. For a fairer comparison, the baseline AE was configured to share the same architecture as the VAE. Further, as illustrated in Figure 4, the AE was implemented using Keras, a high level open source neural network library, and the same TensorFlow backend was used for training the AE, i.e. by iteratively minimizing the reconstruction error (mean square error) using the stochastic optimizer Adam, with a mini-batch size chosen to be 256. Similar to the VAE, data points that have large reconstruction errors were then flagged as anomalies.
For the GBT, independent but non-identical Gaussian distribution models were fitted to the features to learn the standard behaviours of the data. Z-scores were then calculated for all features in the testing dataset and the product of the average, standard deviation, and the maximum of the Z-scores were used as final score for anomaly detection. The data points with scores that exceed a certain threshold were then considered as anomalies.
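The GBT scoring described above may be sketched as follows. The combination of statistics follows the description; taking absolute Z-scores before combining them is an assumption of this sketch.

```python
import numpy as np


def gbt_score(x, mu, sigma):
    """Gaussian-based thresholding score for one data point.

    Per-feature Z-scores are computed against the fitted Gaussians, and the
    final score is the product of their average, standard deviation and
    maximum, as described above. Larger scores suggest anomalies.
    """
    z = np.abs((x - mu) / sigma)
    return z.mean() * z.std() * z.max()


# Toy fitted model: each feature ~ N(0, 1).
mu = np.zeros(3)
sigma = np.ones(3)
typical = gbt_score(np.array([0.1, 0.2, 0.0]), mu, sigma)
outlier = gbt_score(np.array([5.0, 0.2, 0.0]), mu, sigma)
```

Data points whose score exceeds a chosen threshold would then be flagged, mirroring the reconstruction-error thresholding used for the AE and VAE.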
The performance of the three methods (VAE, AE and GBT) were evaluated using Receiver Operating Characteristic (ROC) curves, where the true positive rates were plotted against the false positive rates by varying the sensitivity or threshold for anomaly detection. A good performance occurs when the true positive values are high while the false positive values are low. The ROC results obtained for the training datasets are plotted at Figures 7(a)-(f) and results obtained for the testing datasets are plotted at Figures 8(a)-(d).
From the plots in these two figures, it can be seen that threshold-based approaches such as GBT work well for attacks that cause the volume of certain categories of traffic to increase significantly, such as spam and port scanning type attacks. However, such approaches do not appear to work as well for botnet and low rate spam type attacks in this example.
On the other hand, AE does not work as well for spam and low-rate DoS type attacks. For these two types of attacks, the AE model may be unable to differentiate spam from regular email traffic because of the high volume of data, and would be unable to detect the DoS attacks due to their low rate.
From these plots, it can be seen that the VAE model is the most robust and provides the best performance for detecting all attack modes. Table II below summarizes the individual and average area under the ROC curve (AUC) for the various types of attacks (except blacklist) and algorithms, and shows that VAE provides the best overall performance in terms of AUC.
[Table II: area under the ROC curve (AUC) for the various attacks and algorithms]
Identifying Anomalies and Using Gradient Fingerprints
Using the gradient fingerprinting module 225, the gradients for all the features for the various attacks may then be computed based on the VAE model. Figure 9 illustrates the gradients of the features for spam, Scan11, and Scan44, whereby the black bars reflect one standard error for the gradients (which are useful for assessing the significance of the gradients, i.e., whether they are due to noise or not). Features that are not significant are also presented as contrast.
From Figure 9, it is observed that only a small subset of the features exhibit large gradients. These features with the greatest absolute gradients provide an explanation for why these flows of an IP are detected as anomalies. For example, in the case of spam attacks 902 (which include spam emails), five features were found to have more positive gradients (higher than the learned normal) while four features were found to have much more negative gradients (lower than the learned normal). Critically, these combinations of gradients and features can be used as a fingerprint to identify or cluster similar attacks. For example, it can be observed from Figure 9 that Scan11 and Scan44 (plots 904 and 906 respectively) type attacks exhibit similar gradient fingerprints.
To further validate the analysis that the gradient fingerprints can be used to identify similar attacks, the ROC for various attacks detected are plotted in the following way. First, let L2 denote the Euclidean distance between the average gradient fingerprint obtained from data with labelled attacks and the gradient of the VAE's objective function w.r.t. each test data point. Similarly, L̄2 is defined as the same distance metric but computed based on normalized gradient vectors, i.e. g/‖g‖₂. The normalized distance is taken to be proportional to the cosine distance and may be computed as:

L̄2(a, b) = ‖ a/‖a‖₂ − b/‖b‖₂ ‖₂ = √(2 − 2 cos(a, b))

where ‖·‖₂ is the Euclidean norm.
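The normalized distance above may be checked with a short sketch; the function name is illustrative.

```python
import numpy as np


def normalized_l2(a, b):
    """Euclidean distance between unit-normalised gradient vectors.

    Equals sqrt(2 - 2*cos(a, b)), so it depends only on the angle between
    the two fingerprints, not on their magnitudes.
    """
    a_hat = a / np.linalg.norm(a)
    b_hat = b / np.linalg.norm(b)
    return np.linalg.norm(a_hat - b_hat)
```

Parallel fingerprints give distance 0, orthogonal ones give √2, and opposite ones give 2, so thresholding this distance amounts to thresholding the angle between a test gradient and a known attack fingerprint.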
The ROC is then produced by varying the threshold on the L2 or L̄2 distance. If the gradient fingerprint is to be a good measure of an attack, the ROC is expected to have high true positive values and low false positive values. The results of this analysis are plotted in Figure 10. It can be seen that with the use of L̄2, the gradient fingerprints learned were found to be indeed good representations of these attacks. In fact, these plots show that this approach is an improvement over the simple use of reconstruction errors for anomaly detection. The AUC results of this approach are listed in the last row of Table II.
Clustering of Anomalies
In other embodiments of the invention, the gradients generated by the VAE module 210 may be used for clustering. The idea is that if the clustering is effective, the identification of the types of attacks may be limited to a smaller number of clusters thereby increasing the overall identification speed of the system.
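The clustering idea may be sketched with a minimal k-means on synthetic fingerprints. Lloyd's algorithm with farthest-point initialisation is used here for a deterministic, self-contained illustration; the embodiment below uses a library implementation with k = 100 and random seeding.

```python
import numpy as np


def kmeans(points, k, iters=20):
    """Minimal k-means for grouping gradient fingerprints (illustrative only)."""
    # Farthest-point initialisation: start from the first point, then
    # repeatedly add the point farthest from the chosen centers.
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dists.argmax()])
    centers = np.array(centers)
    # Lloyd iterations: assign each point to its nearest center, then
    # move each center to the mean of its assigned points.
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers


# Two well-separated synthetic "fingerprint" clouds of 5 features each.
rng = np.random.default_rng(1)
cloud_a = rng.normal(0.0, 0.1, size=(20, 5))   # fingerprints of one attack type
cloud_b = rng.normal(5.0, 0.1, size=(20, 5))   # fingerprints of another
points = np.vstack([cloud_a, cloud_b])
labels, centers = kmeans(points, k=2)
```

If fingerprints of the same attack type cluster together, an analyst need only inspect a few clusters per attack rather than every flagged anomaly.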
In this embodiment, k-means clustering is performed with a random initial seed on the dataset (over the training set) with k = 100. Figure 11 illustrates the clusters that four of the attacks belong to when they are clustered accordingly. For example, 92.4% of the DoS attacks appear in only two clusters (c82 and c84), with the other 7.6% appearing in four other clusters. For spam attacks, 74.3% of them appeared in two clusters (c11 and c15), while 25.7% appeared in another 11 clusters. Hence, clustering is an effective tool for analysts to focus on a much smaller number of clusters for particular types of attacks.

Numerous other changes, substitutions, variations and modifications may be ascertained by one skilled in the art and it is intended that the present invention encompass all such changes, substitutions, variations and modifications as falling within the scope of the appended claims.

Claims

CLAIMS:
1. A method for detecting and analysing anomalies in network traffic, the method to be performed by a computer system comprising:
collecting network data from the network traffic;
extracting one or more features from the network data to form a dataset;
providing the dataset to a deep neural network model to train the deep neural network model;
detecting anomalies in the network traffic using the trained deep neural network model; and
computing gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
2. The method according to claim 1, wherein the step of extracting one or more features from the network data to form the dataset comprises:
grouping the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses,
wherein the step of extracting one or more features from the network data to form a dataset comprises extracting one or more aggregated features from each of the plurality of sliding windows.
3. The method according to claim 1 or 2, wherein the one or more features or aggregated features extracted from the network data comprise one or more from the group comprising:
mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate;
entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and
proportion of ports used for common applications including Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
4. The method according to any one of claims 1 to 3, wherein the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
5. The method according to claim 4, wherein the training of the VAE model is performed using Adam as an optimisation algorithm.
6. The method according to any one of claims 4 to 5, wherein the step of using the trained deep neural network model to detect anomalies in the network traffic comprises:
obtaining a reconstruction error for each data point; and
comparing the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
7. The method according to any one of claims 1 to 6, further comprising:
clustering the computed gradients for the one or more features.
8. The method according to claim 7, further comprising:
identifying types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
9. A computer system for detecting and analysing anomalies in network traffic, the system comprising:
circuitry configured to collect network data from the network traffic;
circuitry configured to extract one or more features from the network data to form a dataset;
circuitry configured to provide the dataset to a deep neural network model to train the deep neural network model;
circuitry configured to detect anomalies in the network traffic using the trained deep neural network model; and
circuitry configured to compute gradients for one or more features associated with each of the detected anomalies, whereby the computed gradients are used to identify the detected anomalies.
10. The computer system according to claim 9, wherein the circuitry configured to extract the one or more features from the network data to form the dataset further comprises:
circuitry configured to group the network data into a plurality of sliding windows of a predetermined duration based on source IP addresses,
wherein the circuitry configured to extract one or more features from the network data to form a dataset comprises circuitry configured to extract one or more aggregated features from each of the plurality of sliding windows.
11. The computer system according to claim 9 or 10, wherein the one or more features or aggregated features extracted from the network data comprises one or more from the group comprising:
mean and standard deviation of flow durations, number of packets, number of bytes, packet rate and byte rate;
entropy of protocol type, destination IP addresses, source ports, destination ports, and TCP flags; and
proportion of ports used for common applications including Win RPC, Telnet, DNS, SSH, HTTP, FTP, and POP3.
12. The computer system according to any one of claims 9 to 11, wherein the deep neural network model comprises an Autoencoder (AE) or Variational Autoencoder (VAE) model.
13. The computer system according to claim 12, wherein the training of the VAE model is performed using Adam as an optimisation algorithm.
14. The computer system according to any one of claims 12 to 13, wherein the circuitry configured to use the trained deep neural network model to detect anomalies in the network traffic comprises:
circuitry configured to obtain a reconstruction error for each data point; and
circuitry configured to compare the reconstruction error with a predetermined threshold for each data point, wherein a data point is determined to be an anomaly when the reconstruction error exceeds the predetermined threshold.
15. The computer system according to any one of claims 9 to 14, further comprising:
circuitry configured to cluster the computed gradients for the one or more features.
16. The computer system according to claim 15, further comprising:
circuitry configured to identify types of attacks associated with each cluster of computed gradients by comparing the clustered gradients with gradients of labelled attacks.
PCT/SG2020/050033 2019-01-29 2020-01-22 System and method for network anomaly detection and analysis WO2020159439A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201900840P 2019-01-29

Publications (1)

Publication Number Publication Date
WO2020159439A1 true WO2020159439A1 (en) 2020-08-06

Family

ID=71842475

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2020/050033 WO2020159439A1 (en) 2019-01-29 2020-01-22 System and method for network anomaly detection and analysis

Country Status (1)

Country Link
WO (1) WO2020159439A1 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180337836A1 (en) * 2011-11-07 2018-11-22 Netflow Logic Corporation Method and system for confident anomaly detection in computer network traffic
US20190020669A1 (en) * 2017-07-11 2019-01-17 The Boeing Company Cyber security system with adaptive machine learning features
CN109274673A (en) * 2018-09-26 2019-01-25 广东工业大学 A kind of detection of exception of network traffic and defence method


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AMARASINGHE K., KENNEY KEVIN; MANIC MILOS: "Toward Explainable Deep Neural Network Based Anomaly Detection", 2018 11TH INTERNATIONAL CONFERENCE ON HUMAN SYSTEM INTERACTION (HSI, 6 July 2018 (2018-07-06), pages 1 - 7, XP033384140 *
AN JINWON; SUNGZOON CHO: "Variational Autoencoder based Anomaly Detection using Reconstruction Probability", 27 December 2015 (2015-12-27), XP0055688241, Retrieved from the Internet <URL:http://dm.snu.ac.kr/static/docs/TR/SNUDM-TR-2015-03.pdf> [retrieved on 20200318] *
CHALAPATHY RAGHAVENDRA; SANJAY CHAWLA: "Deep Learning for Anomaly Detection: A Survey", 24 January 2019 (2019-01-24), XP055672228, Retrieved from the Internet <URL:https://arxiv.org/pdf/1901.03407.pdf> [retrieved on 20200318] *
SHEN S., SHRUTI TOPLE; PRATEEK SAXENA: "AUROR: Defending Against Poisoning Attacks in Collaborative Deep Learning Systems", ACSAC '16: PROCEEDINGS OF THE 32ND ANNUAL CONFERENCE ON COMPUTER SECURITY APPLICATIONS DECEMBER 2016, 31 December 2016 (2016-12-31), pages 508 - 519, XP058306858 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11909482B2 (en) * 2020-08-18 2024-02-20 Qualcomm Incorporated Federated learning for client-specific neural network parameter generation for wireless communication
US20220060235A1 (en) * 2020-08-18 2022-02-24 Qualcomm Incorporated Federated learning for client-specific neural network parameter generation for wireless communication
CN112134847A (en) * 2020-08-26 2020-12-25 郑州轻工业大学 Attack detection method based on user flow behavior baseline
CN114124420A (en) * 2020-08-28 2022-03-01 哈尔滨理工大学 Network flow abnormity detection method based on deep neural network
JP7444271B2 (en) 2020-09-18 2024-03-06 日本電信電話株式会社 Learning devices, learning methods and learning programs
CN112598111A (en) * 2020-12-04 2021-04-02 光大科技有限公司 Abnormal data identification method and device
CN114615003A (en) * 2020-12-07 2022-06-10 ***通信有限公司研究院 Verification method and device for command and control C & C domain name and electronic equipment
US11570046B2 (en) 2020-12-17 2023-01-31 Nokia Solutions And Networks Oy Method and apparatus for anomaly detection in a network
CN112953924A (en) * 2021-02-04 2021-06-11 西安电子科技大学 Network abnormal flow detection method, system, storage medium, terminal and application
CN112953924B (en) * 2021-02-04 2022-10-21 西安电子科技大学 Network abnormal flow detection method, system, storage medium, terminal and application
CN113259332A (en) * 2021-04-29 2021-08-13 上海电力大学 Multi-type network flow abnormity detection method and system based on end-to-end
CN113255750B (en) * 2021-05-17 2022-11-08 安徽大学 VCC vehicle attack detection method based on deep learning
CN113255750A (en) * 2021-05-17 2021-08-13 安徽大学 VCC vehicle attack detection method based on deep learning
CN113965393A (en) * 2021-10-27 2022-01-21 浙江网安信创电子技术有限公司 Botnet detection method based on complex network and graph neural network
CN113965393B (en) * 2021-10-27 2023-08-01 浙江网安信创电子技术有限公司 Botnet detection method based on complex network and graph neural network
CN114448661A (en) * 2021-12-16 2022-05-06 北京邮电大学 Slow denial of service attack detection method and related equipment
CN114273981A (en) * 2022-03-04 2022-04-05 苏州古田自动化科技有限公司 Horizontal five-axis numerical control machining center with error compensation function
CN114301719A (en) * 2022-03-10 2022-04-08 中国人民解放军国防科技大学 Malicious update detection method and model based on variational self-encoder
CN114301719B (en) * 2022-03-10 2022-05-13 中国人民解放军国防科技大学 Malicious update detection method and system based on variational self-encoder
CN116436819A (en) * 2023-02-22 2023-07-14 深圳市昆腾电源科技有限公司 Parallel operation UPS communication abnormality detection method and device and parallel operation UPS system
CN116628554B (en) * 2023-05-31 2023-11-03 烟台大学 Industrial Internet data anomaly detection method, system and equipment
CN116628554A (en) * 2023-05-31 2023-08-22 烟台大学 Industrial Internet data anomaly detection method, system and equipment
CN116633809B (en) * 2023-06-26 2024-01-23 中国信息通信研究院 Detection method and system based on artificial intelligence
CN116633809A (en) * 2023-06-26 2023-08-22 中国信息通信研究院 Detection method and system based on artificial intelligence
CN116915512A (en) * 2023-09-14 2023-10-20 国网江苏省电力有限公司常州供电分公司 Method and device for detecting communication flow in power grid
CN116915512B (en) * 2023-09-14 2023-12-01 国网江苏省电力有限公司常州供电分公司 Method and device for detecting communication flow in power grid
CN117633665A (en) * 2024-01-26 2024-03-01 深圳市互盟科技股份有限公司 Network data monitoring method and system
CN117633665B (en) * 2024-01-26 2024-05-28 深圳市互盟科技股份有限公司 Network data monitoring method and system

Similar Documents

Publication Publication Date Title
WO2020159439A1 (en) System and method for network anomaly detection and analysis
Nguyen et al. Gee: A gradient-based explainable variational autoencoder for network anomaly detection
Idhammad et al. Detection system of HTTP DDoS attacks in a cloud environment based on information theoretic entropy and random forest
Janarthanan et al. Feature selection in UNSW-NB15 and KDDCUP'99 datasets
Cordero et al. Analyzing flow-based anomaly intrusion detection using replicator neural networks
Yang et al. TLS/SSL encrypted traffic classification with autoencoder and convolutional neural network
Catak et al. Distributed denial of service attack detection using autoencoder and deep neural networks
Ahmad et al. A comprehensive deep learning benchmark for IoT IDS
Tufan et al. Anomaly-based intrusion detection by machine learning: A case study on probing attacks to an institutional network
Abdelaty et al. Gadot: Gan-based adversarial training for robust ddos attack detection
Salahuddin et al. Chronos: Ddos attack detection using time-based autoencoder
Monshizadeh et al. Performance evaluation of a combined anomaly detection platform
Labonne Anomaly-based network intrusion detection using machine learning
Hossain et al. Ensuring network security with a robust intrusion detection system using ensemble-based machine learning
Bodström et al. State of the art literature review on network anomaly detection with deep learning
Atli Anomaly-based intrusion detection by modeling probability distributions of flow characteristics
Al-Fawa'reh et al. Detecting stealth-based attacks in large campus networks
Almarshdi et al. Hybrid Deep Learning Based Attack Detection for Imbalanced Data Classification.
Borisenko et al. Intrusion detection using multilayer perceptron and neural networks with long short-term memory
Babbar et al. Evaluation of deep learning models in its software-defined intrusion detection systems
de Campos et al. Network intrusion detection system using data mining
Koniki et al. An anomaly based network intrusion detection system using LSTM and GRU
Wei et al. Reconstruction-based LSTM-Autoencoder for Anomaly-based DDoS Attack Detection over Multivariate Time-Series Data
Shaik et al. capsAEUL: Slow http DoS attack detection using autoencoders through unsupervised learning
Gouveia et al. Deep Learning for Network Intrusion Detection: An Empirical Assessment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20749242

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20749242

Country of ref document: EP

Kind code of ref document: A1