WO2023179861A1 - Aggregation of anomalies in a network - Google Patents

Aggregation of anomalies in a network

Publication number
WO2023179861A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
events
network
aggregating
class
Application number
PCT/EP2022/057757
Other languages
French (fr)
Inventor
Jose Manuel NAVARRO GONZALEZ
Alexis HUET
Dario Rossi
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2022/057757
Publication of WO2023179861A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0604: Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
    • H04L41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L43/00: Arrangements for monitoring or testing data switching networks

Definitions

  • the invention relates to anomaly aggregation; more particularly, the invention relates to a method and a system for aggregation of anomalies in a network.
  • KPIs Key Performance Indicators
  • network devices measured features over time (time series) in order to troubleshoot them.
  • troubleshooting is performed using (a) univariate time series detection on each individual time series, giving anomalous dates for each KPI, and (b) optionally, manually aggregating the anomalies obtained for each single KPI series.
  • the disadvantage is that, because the number of KPIs is typically more than 1000, a large number of anomalies are detected. This can overload the capacity of a manual operator to diagnose the system in real time.
  • Recent anomaly detection algorithms are multivariate. They take multivariate time series inputs but are only able to identify moments in time which contain anomalous data. Additional techniques must be used in order to determine which KPIs are anomalous. This still yields an excessive number of anomalies, as each device is considered separately.
  • the present invention aims to provide a method for aggregating anomalies in a network having multiple devices, in order to reduce the time taken by network engineers to troubleshoot faults in their monitored systems.
  • the invention provides a method and a system for aggregating events in a network.
  • a method for aggregating events in a network including: receiving an event from a device in the network; assigning a class to the received event based on information stored in an anomaly database; aggregating the event with one or more other events present in an event cache, wherein aggregation includes classifying events corresponding to a plurality of anomalies identified in the network; and generating a summarised list of aggregated events for assessment.
  • the method for aggregating events in the network is adapted to work with events, as opposed to anomalies. Based on the assumption that events represent the status of a device, as opposed to the narrow view that an anomaly provides, aggregating events across devices makes it possible to quickly understand the status and diagnosis of a system as a whole. Thus, such aggregation of events provides improved diagnostics and efficiency.
  • the method works in both an online and an offline manner, and it can aggregate similar and dissimilar events across devices, which is not possible in the prior art.
  • the method reduces the time taken by network engineers to troubleshoot faults in their monitored systems.
  • aggregating the event with the one or more other events includes comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache.
  • aggregating the event with the one or more other events includes querying the anomaly database to find events with different classes that occurred in the past together with events with the class assigned to the received event.
  • the event is a grouping of one or more alarms occurring in the device at a time instant or within a time window.
  • the event includes a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device.
  • the assignment of the class to the received event is based on grouping events by similar KPIs.
  • steps of the method for aggregating the events are done either in a batch mode or in a streaming mode.
  • a system for aggregating events in a network including: an input module configured to receive an event from a device in the network; a class assignment module configured to assign a class to the received event based on information stored in an anomaly database; an event aggregator module configured to aggregate the event with one or more other events present in an event cache, wherein aggregation includes classifying events corresponding to a plurality of anomalies identified in the network; and an output module configured to generate a summarised list of aggregated events for assessment.
  • a computer program including instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method or any of the preceding preferences.
  • the disclosed method is adapted to work with events, as opposed to anomalies. Based on the assumption that events represent the status of a device, as opposed to the narrow view that an anomaly provides, aggregating events across devices makes it possible to quickly understand the status and diagnosis of a system as a whole. Additionally, the method works in both an online and an offline manner, and it is able to aggregate similar and dissimilar events across devices, which is not possible in the prior art.
  • FIG. 1 illustrates a block diagram of a system for aggregating events in a network in accordance with an implementation of the invention
  • FIG. 2 illustrates a block diagram of a system for aggregating similar events in a network in accordance with an implementation of the invention
  • FIG. 3 illustrates a block diagram of a system for aggregating dissimilar events in a network in accordance with an implementation of the invention
  • FIG. 4 is a flow diagram that illustrates a method for aggregating events in a network in accordance with an implementation of the invention.
  • FIG. 5 is an illustration of a computer system (e.g. a system) in which the various architectures and functionalities of the various previous implementations may be implemented.
  • Implementations of the invention provide a method for aggregating anomalies in a network having multiple devices, in order to reduce the time taken by network engineers to troubleshoot faults in their monitored systems.
  • a process, a method, a system, a product, or a device that includes a series of steps or units is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
  • FIG. 1 illustrates a block diagram of a system 100 for aggregating events in a network 104 in accordance with an implementation of the invention.
  • the system 100 includes an input module 106, a co-occurrence module 110, a class assignment module 112, an event aggregator module 116, and an output module 118.
  • the input module 106 is configured to receive an event from a device 102.
  • the class assignment module 112 is configured to assign a class to the received event based on information stored in an anomaly database 108.
  • the event aggregator module 116 is configured to aggregate the event with one or more other events present in an event cache. Aggregation includes classifying events corresponding to one or more anomalies identified in the network 104.
  • the output module 118 is configured to generate a summarised list of aggregated events for assessment.
  • the system 100 for aggregating events in the network 104 is adapted to work with events, as opposed to anomalies. Based on the assumption that events represent the status of the device 102, as opposed to the narrow view that an anomaly provides, aggregating events across devices makes it possible to quickly understand the status and diagnosis of the system 100 as a whole.
  • the system 100 includes a co-occurrence module 110 that includes the class assignment module 112 and a co-occurrence query module 114.
  • the class assignment module 112 and the co-occurrence query module 114 are configured to aggregate similar and dissimilar events respectively.
  • the co-occurrence query module 114 is configured to determine, when a new event arrives at the input module 106, whether events of dissimilar classes should be aggregated together based on the past behaviour of the events.
  • the co-occurrence module 110 is connected to the event aggregator module 116 that is configured to aggregate the one or more events with one or more other events present in an event cache.
  • the event cache keeps a record of events that happened previously. Aggregation includes classifying the one or more events corresponding to a plurality of anomalies identified in the network 104.
  • the event aggregator module 116 is connected to the output module 118 that is configured to generate a summarized list of aggregated events for assessment.
  • An event is defined as a grouping of all the individual anomalies happening in the device 102 in a specific time window.
  • An anomaly is defined as a tuple that includes a device identifier, a time, and an error type, where the tuple indicates that a faulty behaviour was detected in a specific key performance indicator (KPI) or service.
  • the event may be expressed using the device identifier and identifiers of KPIs that exhibit anomalous behaviour. As an example, the event may be expressed as “{device, time, {anomaly type 1, anomaly type 2, ..., anomaly type N}}”. Events may be pre-aggregated with each KPI that is relevant to the anomaly.
  • the anomaly database 108 is configured to store a record of past anomalies and their related KPIs.
  • aggregating the event with the one or more other events includes comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache.
  • aggregating the event with the one or more other events includes querying the anomaly database 108 to find events with different classes that occurred in the past together with events with the class assigned to the received event.
  • the event is a grouping of one or more alarms occurring in the device 102 at a time instant or within a time window.
  • the event includes a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device 102.
  • the assignment of the class to the received event is based on grouping events by similar KPIs.
  • steps of the method for aggregating the events are done either in a batch mode or in a streaming mode.
  • FIG. 2 illustrates a block diagram of a system 200 for aggregating similar events in a network 204 in accordance with an implementation of the invention.
  • the system 200 includes a device 202, an input module 206, an anomaly database 208, a co-occurrence module 210, a class assignment module 212, an event aggregator module 216, and an output module 218.
  • the system 200 receives, at the input module 206, an event “E1” that happened at a device “Device 1”.
  • the event cache includes an event similar to the event “E1”.
  • the event “E1” is passed to a class assignment module 212 that assigns the class “Routing issue” to the event “E1” and passes it to the event aggregator module 216.
  • the event aggregator module 216 compares the event “E1” with the event already existing in the event cache.
  • KPIs {TOTAL ROUTE COUNT, BGP ROUTE COUNT, Memory Usage}
  • As both events share the class “Routing issue”, they are aggregated together and can be expressed in a collapsed form in the output, even though the event already existing in the event cache has another anomalous KPI, “Memory Usage”.
  • FIG. 3 illustrates a block diagram of a system 300 for aggregating dissimilar events in a network 304 in accordance with an implementation of the invention.
  • the system 300 includes a device 302, an input module 306, an anomaly database 308, a class assignment module 312, an event aggregator module 316, and an output module 318.
  • the system 300 further includes a co-occurrence module 310 that includes the class assignment module 312 and a co-occurrence query module 314. As an example, the system 300 receives, at an input module 306, an event “E3” that happened at a device “Device 3”.
  • the event “E3” is passed to the class assignment module 312 and receives a class “CPU Issue”.
  • the event “E3” is sent to the event aggregator module 316.
  • the event aggregator module 316 compares the event “E3” with the events already existing in the event cache.
  • a co-occurrence query is launched using the co-occurrence query module 314 that asks whether class “CPU Issue” should be aggregated with class “Routing issue”. The answer is yes because, based on the data in the anomaly database 308, a CPU issue is observed 46% of the time after a Routing issue has appeared. Based on this answer, the event “E3” and the previous ones are aggregated together and presented as a single anomaly.
  • FIG. 4 is a flow diagram that illustrates a method of aggregating events in a network in accordance with an implementation of the invention.
  • an event is received from a device in the network.
  • a class is assigned to the received event based on information stored in an anomaly database.
  • the event is aggregated with one or more other events present in an event cache. Aggregation includes classifying events corresponding to one or more anomalies identified in the network.
  • a summarised list of aggregated events is generated for assessment.
  • aggregating the event with the one or more other events includes comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache. For example, an input from a computer network dataset including 299 hand-labelled multi-dimensional events across two weeks, 22 devices in the network, and 35 KPI types is received in the input module. A binary matrix is created, where every row is an event and each column is a KPI type, and a “1” entry in a cell indicates that a specific KPI was anomalous in that specific event. A binary clustering algorithm is applied that groups the events based on shared KPI patterns. As an example output, for almost 300 events with 35 different KPI types, the results obtained are described in a table as follows.
  • aggregating the event with the one or more other events includes querying the anomaly database to find events with different classes that occurred in the past together with events with the class assigned to the received event.
  • hand-labelled events across three weeks with 22 devices in the network and having 134 KPI types are received as an input in the input module.
  • Each event is labelled using extracted clusters. Jaccard similarity process may be used, which is a measure of proximity between data points. The following set of steps are performed:
  • P(B|A): proportion of times B appears between t(A)+1 and t(A)+TTL (Time To Live threshold).
  • A→B indicates that P(B|A) [...]
  • the event aggregator module queries the anomaly database to obtain P(B|A) [...]
  • [...] P(B|A) [...] 0.4)” is obtained. 504 events where 345 (-31.5%) aggregated similar events and 311 (-38.3%) aggregated similar and dissimilar events.
  • an event of associated class “CPU Usage (B)” happens at a device “Device 3”.
  • the event cache is examined and there is an ongoing “Routing issue (A)” event class for devices “Device 1” and “Device 2”.
  • the event is a grouping of a plurality of alarms occurring in the device at a time instant or within a time window.
  • the event includes a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device.
  • the assignment of the class to the received event is based on grouping events by similar KPIs.
  • steps of the method for aggregating the events are done either in a batch mode or in a streaming mode.
  • a computer program includes instructions which, when the program is executed by a computer, cause the computer to carry out the steps of a method or any of the preceding preferences.
  • FIG. 5 is an illustration of a computer system (e.g. a system) in which the various architectures and functionalities of the various previous implementations may be implemented.
  • the computer system 500 includes at least one processor 504 that is connected to a bus 502, wherein the computer system 500 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s).
  • the computer system 500 also includes a memory 506.
  • Control logic (software) and data are stored in the memory 506 which may take a form of random-access memory (RAM).
  • a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
  • the computer system 500 may also include a secondary storage 510.
  • the secondary storage 510 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, digital versatile disk (DVD) drive, recording device, universal serial bus (USB) flash memory.
  • the removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.
  • Computer programs, or computer control logic algorithms may be stored in at least one of the memory 506 and the secondary storage 510. Such computer programs, when executed, enable the computer system 500 to perform various functions as described in the foregoing.
  • the memory 506, the secondary storage 510, and any other storage are possible examples of computer-readable media.
  • the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 504, a graphics processor coupled to a communication interface 512, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 504 and a graphics processor, a chipset (namely, a group of integrated circuits designed to work and sold as a unit for performing related functions), and so forth.
  • the architectures and functionalities depicted in the various previous-described figures may be implemented in a context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system.
  • the computer system 500 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, an embedded system.
  • the computer system 500 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, and so forth. Additionally, although not shown, the computer system 500 may be coupled to a network (for example, a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 508.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Provided is a method for aggregating events in a network (104, 204, 304). The method includes receiving an event from a device (102, 202, 302) in the network (104, 204, 304). The method includes assigning a class to the received event based on information stored in an anomaly database (108, 208, 308). The method includes aggregating the event with one or more other events present in an event cache. Aggregation includes classifying events corresponding to one or more anomalies identified in the network and generating a summarised list of aggregated events for assessment.

Description

AGGREGATION OF ANOMALIES IN A NETWORK
TECHNICAL FIELD
The invention relates to anomaly aggregation; more particularly, the invention relates to a method and a system for aggregation of anomalies in a network.
BACKGROUND
For each device in a monitored network, experts need to analyse the evolution of the KPIs (Key Performance Indicators) or network devices’ measured features over time (time series) in order to troubleshoot them.
Traditionally, troubleshooting is performed using (a) univariate time series detection on each individual time series, giving anomalous dates for each KPI, and (b) optionally, manually aggregating the anomalies obtained for each single KPI series. The disadvantage is that, because the number of KPIs is typically more than 1000, a large number of anomalies are detected. This can overload the capacity of a manual operator to diagnose the system in real time.
Recent anomaly detection algorithms are multivariate. They take multivariate time series inputs but are only able to identify moments in time which contain anomalous data. Additional techniques must be used in order to determine which KPIs are anomalous. This still yields an excessive number of anomalies, as each device is considered separately.
The traditional solutions are limited to working with anomalies individually and produce results that are lengthy and difficult for a network engineer to interpret.
Therefore, the present invention aims to provide a method for aggregating anomalies in a network having multiple devices, in order to reduce the time taken by network engineers to troubleshoot faults in their monitored systems.
SUMMARY
It is an object of the invention to provide a method for aggregating anomalies in a network having multiple devices, in order to reduce the time taken by network engineers to troubleshoot faults in their monitored systems.
This object is achieved by the features of the independent claims. Further implementations are apparent from the dependent claims, the description, and the figures.
The invention provides a method and a system for aggregating events in a network.
According to a first aspect, there is provided a method for aggregating events in a network including: receiving an event from a device in the network; assigning a class to the received event based on information stored in an anomaly database; aggregating the event with one or more other events present in an event cache, wherein aggregation includes classifying events corresponding to a plurality of anomalies identified in the network; and generating a summarised list of aggregated events for assessment.
The method for aggregating events in the network is adapted to work with events, as opposed to anomalies. Based on the assumption that events represent the status of a device, as opposed to the narrow view that an anomaly provides, aggregating events across devices makes it possible to quickly understand the status and diagnosis of a system as a whole. Thus, such aggregation of events provides improved diagnostics and efficiency.
Additionally, the method works in both an online and an offline manner, and it can aggregate similar and dissimilar events across devices, which is not possible in the prior art. The method reduces the time taken by network engineers to troubleshoot faults in their monitored systems.
Preferably, aggregating the event with the one or more other events includes comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache.
Preferably, aggregating the event with the one or more other events includes querying the anomaly database to find events with different classes that occurred in the past together with events with the class assigned to the received event.
Preferably, the event is a grouping of one or more alarms occurring in the device at a time instant or within a time window.
Preferably, the event includes a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device.
Preferably, the assignment of the class to the received event is based on grouping events by similar KPIs.
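One way to picture grouping events by similar KPIs is to compare the set of anomalous KPIs in a new event against a representative KPI set for each known class, using Jaccard similarity (a measure this document mentions elsewhere). The Python sketch below is only an illustration under stated assumptions: the class names, the reference KPI sets, the `min_similarity` threshold, and the function names `jaccard` and `assign_class` are hypothetical and stand in for whatever information the anomaly database actually stores.

```python
from typing import Dict, FrozenSet, Optional


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_class(
    event_kpis: FrozenSet[str],
    class_kpis: Dict[str, FrozenSet[str]],
    min_similarity: float = 0.5,  # hypothetical threshold, not from the source
) -> Optional[str]:
    """Return the class whose reference KPI set is most similar to the event,
    or None when no class reaches the (assumed) similarity threshold."""
    best_class, best_score = None, 0.0
    for name, kpis in class_kpis.items():
        score = jaccard(event_kpis, kpis)
        if score > best_score:
            best_class, best_score = name, score
    return best_class if best_score >= min_similarity else None


# Hypothetical reference KPI sets for two classes.
classes = {
    "Routing issue": frozenset({"TOTAL ROUTE COUNT", "BGP ROUTE COUNT"}),
    "CPU issue": frozenset({"CPU Usage"}),
}
event = frozenset({"TOTAL ROUTE COUNT", "BGP ROUTE COUNT", "Memory Usage"})
print(assign_class(event, classes))
```

In this sketch an event with an extra anomalous KPI can still land in the same class as events without it, because the comparison is on overlap rather than exact equality.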
Preferably, steps of the method for aggregating the events are done either in a batch mode or in a streaming mode.
According to a second aspect, there is provided a system for aggregating events in a network including: an input module configured to receive an event from a device in the network; a class assignment module configured to assign a class to the received event based on information stored in an anomaly database; an event aggregator module configured to aggregate the event with one or more other events present in an event cache, wherein aggregation includes classifying events corresponding to a plurality of anomalies identified in the network; and an output module configured to generate a summarised list of aggregated events for assessment.
According to a third aspect, there is provided a computer program including instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method or any of the preceding preferences.
Therefore, in contradistinction to the prior art, the disclosed method is adapted to work with events, as opposed to anomalies. Based on the assumption that events represent the status of a device, as opposed to the narrow view that an anomaly provides, aggregating events across devices makes it possible to quickly understand the status and diagnosis of a system as a whole. Additionally, the method works in both an online and an offline manner, and it is able to aggregate similar and dissimilar events across devices, which is not possible in the prior art.
These and other aspects of the invention will be apparent from the implementations described below.
BRIEF DESCRIPTION OF DRAWINGS
Implementations of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 illustrates a block diagram of a system for aggregating events in a network in accordance with an implementation of the invention;
FIG. 2 illustrates a block diagram of a system for aggregating similar events in a network in accordance with an implementation of the invention;
FIG. 3 illustrates a block diagram of a system for aggregating dissimilar events in a network in accordance with an implementation of the invention;
FIG. 4 is a flow diagram that illustrates a method for aggregating events in a network in accordance with an implementation of the invention; and
FIG. 5 is an illustration of a computer system (e.g. a system) in which the various architectures and functionalities of the various previous implementations may be implemented.
DETAILED DESCRIPTION OF THE DRAWINGS
Implementations of the invention provide a method for aggregating anomalies in a network having multiple devices, in order to reduce the time taken by network engineers to troubleshoot faults in their monitored systems. To make solutions of the invention more comprehensible for a person skilled in the art, the following implementations of the invention are described with reference to the accompanying drawings.
Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the invention are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the invention described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
FIG. 1 illustrates a block diagram of a system 100 for aggregating events in a network 104 in accordance with an implementation of the invention. The system 100 includes an input module 106, a co-occurrence module 110, a class assignment module 112, an event aggregator module 116, and an output module 118. The input module 106 is configured to receive an event from a device 102. The class assignment module 112 is configured to assign a class to the received event based on information stored in an anomaly database 108. The event aggregator module 116 is configured to aggregate the event with one or more other events present in an event cache. Aggregation includes classifying events corresponding to one or more anomalies identified in the network 104. The output module 118 is configured to generate a summarised list of aggregated events for assessment.
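The flow through these modules (receive an event, assign a class, aggregate with cached events, produce a summary) can be sketched in Python roughly as follows. This is a minimal illustration, not the patented implementation: the `EventAggregator` class, the `(device, KPIs)` event format, the keyword-based `classify` function, and the summary wording are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical event format: (device identifier, set of anomalous KPIs).
RawEvent = Tuple[str, FrozenSet[str]]


class EventAggregator:
    """Toy stand-in for the input, class assignment, aggregator and output modules."""

    def __init__(self, classify: Callable[[FrozenSet[str]], str]) -> None:
        self.classify = classify  # plays the role of the class assignment module
        self.cache: Dict[str, List[RawEvent]] = defaultdict(list)  # event cache

    def receive(self, event: RawEvent) -> str:
        """Assign a class to the event and store it next to same-class events."""
        _, kpis = event
        event_class = self.classify(kpis)
        self.cache[event_class].append(event)
        return event_class

    def summary(self) -> List[str]:
        """One collapsed line per class, listing the devices involved."""
        lines = []
        for event_class, events in self.cache.items():
            devices = sorted({device for device, _ in events})
            lines.append(
                f"{event_class}: {len(events)} event(s) on {', '.join(devices)}"
            )
        return lines


def classify(kpis: FrozenSet[str]) -> str:
    # Assumed rule for the example only: any route-count KPI means a routing issue.
    return "Routing issue" if any("ROUTE" in k for k in kpis) else "Other"


agg = EventAggregator(classify)
agg.receive(("Device 1", frozenset({"TOTAL ROUTE COUNT", "BGP ROUTE COUNT"})))
agg.receive(("Device 2", frozenset({"BGP ROUTE COUNT", "Memory Usage"})))
agg.receive(("Device 3", frozenset({"CPU Usage"})))
for line in agg.summary():
    print(line)
```

Keying the cache by class keeps the comparison between a new event and cached events to a single dictionary lookup, which is one simple way to collapse same-class events across devices.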
The system 100 for aggregating events in the network 104 is adapted to work with events, as opposed to anomalies. Based on the assumption that events represent the status of the device 102, as opposed to the narrow view that an anomaly provides, aggregating events across devices makes it possible to quickly understand the status and diagnosis of the system 100 as a whole.
Additionally, the system 100 works in both an online and an offline manner and can aggregate similar and dissimilar events across devices, which is not possible as per the prior art. This reduces the time taken by network engineers to troubleshoot faults in their monitored systems. The system 100 includes a co-occurrence module 110 that includes the class assignment module 112 and a co-occurrence query module 114. The class assignment module 112 and the co-occurrence query module 114 are configured to aggregate similar and dissimilar events respectively. When a new event arrives at the input module 106, the co-occurrence query module 114 determines, based on the past behaviour of the events, whether events of dissimilar classes should be aggregated together. The co-occurrence module 110 is connected to the event aggregator module 116 that is configured to aggregate the one or more events with one or more other events present in an event cache. The event cache keeps a record of events that happened previously. Aggregation includes classifying the one or more events corresponding to a plurality of anomalies identified in the network 104. The event aggregator module 116 is connected to the output module 118 that is configured to generate a summarized list of aggregated events for assessment.
An event is defined as a grouping of all the individual anomalies happening in the device 102 in a specific time window. An anomaly is defined as a tuple that includes a device identifier, a time, and an error type, where the tuple indicates that a faulty behaviour was detected in a specific key performance indicator (KPI) or service. The event may be expressed using the device identifier and identifiers of KPIs that exhibit anomalous behaviour. As an example, the event may be expressed as “{device, time, {anomaly type 1, anomaly type 2, ..., anomaly type N}}”. Events may be pre-aggregated with each KPI that is relevant to the anomaly. The anomaly database 108 is configured to store a record of past anomalies and their related KPIs.
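The definitions above can be illustrated with a minimal Python sketch. The names `Anomaly`, `Event`, and `group_into_events`, the integer time unit, and the choice of the earliest anomaly time as the event time are assumptions made for illustration only; they do not appear in the description.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Anomaly:
    """One anomaly tuple: device identifier, time, and error type (KPI or service)."""
    device_id: str
    time: int
    error_type: str


@dataclass(frozen=True)
class Event:
    """All anomaly types seen on one device within one time window."""
    device_id: str
    time: int
    anomaly_types: FrozenSet[str]


def group_into_events(anomalies: Iterable[Anomaly], window: int) -> List[Event]:
    """Group anomalies per device and per fixed window of `window` time units."""
    buckets: Dict[Tuple[str, int], List[Anomaly]] = {}
    for a in anomalies:
        buckets.setdefault((a.device_id, a.time // window), []).append(a)
    events = []
    for (device_id, _), items in sorted(buckets.items()):
        # Assumption: the event time is the earliest anomaly time in the window.
        event_time = min(x.time for x in items)
        events.append(Event(device_id, event_time, frozenset(x.error_type for x in items)))
    return events
```

With a window of 5 time units, two anomalies on "Device 1" at time 2 would collapse into one event carrying both anomaly types.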
Preferably, aggregating the event with the one or more other events includes comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache.
Preferably, aggregating the event with the one or more other events includes querying the anomaly database 108 to find events with different classes that occurred in the past together with events with the class assigned to the received event.
Preferably, the event is a grouping of one or more alarms occurring in the device 102 at a time instant or within a time window.
Preferably, the event includes a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device 102. Preferably, the assignment of the class to the received event is based on grouping events by similar KPIs.
Preferably, steps of the method for aggregating the events are done either in a batch mode or in a streaming mode.
FIG. 2 illustrates a block diagram of a system 200 for aggregating similar events in a network 204 in accordance with an implementation of the invention. The system 200 includes a device 202, an input module 206, an anomaly database 208, a co-occurrence module 210, a class assignment module 212, an event aggregator module 216, and an output module 218. As an example, the system 200 receives, at the input module 206, an event “E1” that happened at a device “Device 1”. The event “E1” is expressed as “{Device 1, T = 2, KPIs: {TOTAL ROUTE COUNT, BGP ROUTE COUNT}}”, meaning that the event “E1” happened at “Device 1” at time = 2 and shows the anomalous KPIs TOTAL ROUTE COUNT and BGP ROUTE COUNT. The event cache includes an event similar to the event “E1”. The event “E1” is passed to the class assignment module 212, which assigns the class “Routing issue” to the event “E1” and passes it to the event aggregator module 216.
The event aggregator module 216 compares the event “E1” with the event already existing in the event cache. The event already existing in the event cache is expressed as “{Device 2, T = 1, KPIs: {TOTAL ROUTE COUNT, BGP ROUTE COUNT, Memory Usage}, Assigned class: Routing issue}”. As both events share the class “Routing issue”, they are aggregated together and can be expressed in a collapsed form in the output, even though the event already existing in the event cache has another anomalous KPI, “Memory Usage”. The output module 218 generates a summarized list of aggregated events as “{Devices 1 & 2, T = 1-2, Class: Routing issue}”.
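The collapsing of same-class events in this example can be sketched as follows. The `ClassifiedEvent` type, the `aggregate_same_class` function, and the dictionary layout of the summary are hypothetical names chosen for illustration, not structures defined in the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ClassifiedEvent:
    device: str
    time: int
    kpis: Set[str]
    assigned_class: str


def aggregate_same_class(events: List[ClassifiedEvent]) -> List[dict]:
    """Collapse events that share an assigned class into one summary entry."""
    groups: Dict[str, List[ClassifiedEvent]] = {}
    for e in events:
        groups.setdefault(e.assigned_class, []).append(e)
    summary = []
    for cls, members in groups.items():
        summary.append({
            "devices": sorted({m.device for m in members}),
            "t_start": min(m.time for m in members),
            "t_end": max(m.time for m in members),
            "class": cls,
        })
    return summary


# The cached event and the newly received event from the example above.
cached = ClassifiedEvent("Device 2", 1,
                         {"TOTAL ROUTE COUNT", "BGP ROUTE COUNT", "Memory Usage"},
                         "Routing issue")
e1 = ClassifiedEvent("Device 1", 2,
                     {"TOTAL ROUTE COUNT", "BGP ROUTE COUNT"},
                     "Routing issue")
summary = aggregate_same_class([cached, e1])
```

The extra "Memory Usage" KPI on the cached event does not prevent aggregation here, because only the assigned class is compared.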
FIG. 3 illustrates a block diagram of a system 300 for aggregating dissimilar events in a network 304 in accordance with an implementation of the invention. The system 300 includes a device 302, an input module 306, an anomaly database 308, a class assignment module 312, an event aggregator module 316, and an output module 318. The system 300 further includes a co-occurrence module 310 that includes the class assignment module 312 and a co-occurrence query module 314. As an example, the system 300 receives, at the input module 306, an event “E3” that happened at a device “Device 3”. The event “E3” is expressed as “{Device 3, T = 3, KPIs: CPU Usage}”, meaning that the event “E3” happened at “Device 3” at time = 3 and shows the anomalous KPI “CPU Usage”. The event “E3” is passed to the class assignment module 312 and receives the class “CPU issue”. The event “E3” is then sent to the event aggregator module 316.
The event aggregator module 316 compares the event “E3” with the events already existing in the event cache. The event already existing in the event cache is expressed as “{Devices 1 & 2, T = 1-2, Class: Routing issue}”. As the classes are different, a co-occurrence query is launched using the co-occurrence query module 314, which asks whether the class “CPU issue” should be aggregated with the class “Routing issue”. The answer is yes because, based on the data in the anomaly database 308, a CPU issue is observed 46% of the time after a Routing issue has appeared. Based on this answer, the event “E3” and the previous ones are aggregated together and presented as a single anomaly. The output module 318 generates a summarized list of aggregated events as “T = 1-3, {Devices 1 & 2, Class: Routing issue}, {Device 3, Class: CPU issue}”.
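The decision made in this example can be sketched as a small lookup. The function name `should_aggregate` and the dictionary keyed by (cached class, new class) are assumptions made for illustration; the 0.46 value is the 46% figure quoted above, and the 0.4 default threshold is borrowed from the example parameters given elsewhere in the description.

```python
from typing import Dict, Iterable, Tuple


def should_aggregate(new_class: str,
                     cached_classes: Iterable[str],
                     cooccurrence: Dict[Tuple[str, str], float],
                     threshold: float = 0.4) -> bool:
    """Return True if the new class matches a cached class, or if the stored
    probability P(new_class | cached_class) exceeds the threshold."""
    for cached in cached_classes:
        if cached == new_class:
            return True  # similar events: same class
        if cooccurrence.get((cached, new_class), 0.0) > threshold:
            return True  # dissimilar events that co-occurred in the past
    return False


# Hypothetical content of the anomaly database for this example:
# a CPU issue follows a Routing issue 46% of the time.
cooccurrence_table = {("Routing issue", "CPU issue"): 0.46}
decision = should_aggregate("CPU issue", ["Routing issue"], cooccurrence_table)
```

A class with no stored entry for the cached class is treated as having probability 0 and is therefore not aggregated.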
FIG. 4 is a flow diagram that illustrates a method of aggregating events in a network in accordance with an implementation of the invention. At step 402, an event is received from a device in the network. At step 404, a class is assigned to the received event based on information stored in an anomaly database. At step 406, the event is aggregated with one or more other events present in an event cache. Aggregation includes classifying events corresponding to one or more anomalies identified in the network. At step 408, a summarised list of aggregated events is generated for assessment.
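As a rough illustration of steps 402 to 408 processed one event at a time, the following sketch keeps an in-memory event cache. The class `StreamingAggregator`, the per-class KPI signatures standing in for the anomaly database, and the largest-overlap rule for class assignment are all assumptions made for this sketch; the description does not specify how the class lookup is implemented.

```python
from typing import Dict, List, Set


class StreamingAggregator:
    def __init__(self, class_signatures: Dict[str, Set[str]]):
        # Stand-in for the anomaly database: class name -> typical KPIs.
        self.class_signatures = class_signatures
        self.event_cache: List[dict] = []

    def assign_class(self, kpis: Set[str]) -> str:
        """Step 404: pick the class whose signature shares the most KPIs."""
        best_class, best_overlap = "Unknown", 0
        for name, signature in self.class_signatures.items():
            overlap = len(signature & kpis)
            if overlap > best_overlap:
                best_class, best_overlap = name, overlap
        return best_class

    def receive(self, device: str, time: int, kpis: Set[str]) -> List[dict]:
        """Steps 402 to 408 for a single incoming event."""
        event_class = self.assign_class(kpis)
        for group in self.event_cache:  # step 406: compare with cached classes
            if group["class"] == event_class:
                group["devices"].add(device)
                group["t_start"] = min(group["t_start"], time)
                group["t_end"] = max(group["t_end"], time)
                break
        else:
            self.event_cache.append({"class": event_class, "devices": {device},
                                     "t_start": time, "t_end": time})
        return self.summary()  # step 408

    def summary(self) -> List[dict]:
        return [{"class": g["class"], "devices": sorted(g["devices"]),
                 "t_start": g["t_start"], "t_end": g["t_end"]}
                for g in self.event_cache]
```

This sketch only merges events of the same class; merging dissimilar classes would additionally need a co-occurrence lookup against past data.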
Preferably, aggregating the event with the one or more other events includes comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache. For example, an input from a computer network dataset including 299 hand-labelled multi-dimensional events across two weeks, 22 devices in the network, and 35 KPI types is received in the input module. A binary matrix is created, where every row is an event and each column is a KPI type, and where a “1” entry in a cell indicates that a specific KPI was anomalous in that specific event. A binary clustering algorithm is applied that groups the events based on shared KPI patterns. As an example output, for almost 300 events with 35 different KPI types, the result obtained is described in the following table.
[Table, image imgf000011_0001: clustering results; not reproduced in the text]
As described in the table, just 10 clusters can summarize 90% of the samples.
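The binary matrix and the grouping by shared KPI patterns can be sketched as below. The binary clustering algorithm itself is not specified in the description, so this sketch substitutes the simplest possible stand-in, grouping rows with identical KPI patterns; the function names and the `coverage` helper are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def build_binary_matrix(events: Sequence[Set[str]],
                        kpi_types: Sequence[str]) -> List[List[int]]:
    """One row per event, one column per KPI type; 1 means the KPI was anomalous."""
    return [[1 if kpi in event else 0 for kpi in kpi_types] for event in events]


def group_by_pattern(matrix: List[List[int]]) -> Dict[Tuple[int, ...], List[int]]:
    """Stand-in for binary clustering: rows with identical patterns form a cluster."""
    clusters: Dict[Tuple[int, ...], List[int]] = {}
    for index, row in enumerate(matrix):
        clusters.setdefault(tuple(row), []).append(index)
    return clusters


def coverage(clusters: Dict[Tuple[int, ...], List[int]], top_n: int) -> float:
    """Fraction of events covered by the `top_n` largest clusters."""
    sizes = sorted((len(members) for members in clusters.values()), reverse=True)
    total = sum(sizes)
    return sum(sizes[:top_n]) / total if total else 0.0


# Toy data with three KPI types (not the dataset from the description).
kpi_types = ["BGP ROUTE COUNT", "TOTAL ROUTE COUNT", "CPU Usage"]
events = [
    {"BGP ROUTE COUNT", "TOTAL ROUTE COUNT"},
    {"BGP ROUTE COUNT", "TOTAL ROUTE COUNT"},
    {"CPU Usage"},
    {"BGP ROUTE COUNT"},
]
matrix = build_binary_matrix(events, kpi_types)
clusters = group_by_pattern(matrix)
```

A `coverage` call with `top_n=10` on real data is the kind of figure the statement about 10 clusters summarising 90% of the samples refers to; the toy data here is far too small to reproduce it.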
Preferably, aggregating the event with the one or more other events includes querying the anomaly database to find events with different classes that occurred in the past together with events with the class assigned to the received event.
This is performed to aggregate, across two or more network devices, two or more dissimilar anomalous multi-dimensional events that have shown a temporal correlation in the past. As an example, hand-labelled events across three weeks, with 22 devices in the network and 134 KPI types, are received as an input in the input module. Each event is labelled using extracted clusters. A Jaccard similarity measure, which is a measure of proximity between data points, may be used. The following steps are performed:
1. For each cluster A:
   For each other cluster B:
      Extract P(B|A): the proportion of times B appears between t(A)+1 and t(A)+TTL (Time To Live threshold).
2. Build a “co-occurrence” matrix from the extracted probabilities.
   Represented as a table: A → B indicates that P(B|A) > threshold (0.4 in the example).
3. In the system:
   (a) When a new event B is added to the event cache, the event aggregator module queries the anomaly database to obtain P(B|A) for each event class A in the event cache.
   (b) If any P(B|A) > threshold, the events are aggregated together.
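Steps 1 and 2 above can be sketched as follows. The sketch assumes that the history is a list of (time, class) pairs with integer timestamps, and it reads P(B|A) as the fraction of occurrences of A that are followed by at least one occurrence of B in the interval from t(A)+1 to t(A)+TTL; both the data layout and that reading are assumptions, as are the function names.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def cooccurrence_matrix(history: List[Tuple[int, str]],
                        ttl: int) -> Dict[Tuple[str, str], float]:
    """Estimate P(B|A) for every ordered pair of distinct classes (A, B)."""
    times_by_class: Dict[str, List[int]] = defaultdict(list)
    for t, cls in history:
        times_by_class[cls].append(t)

    probs: Dict[Tuple[str, str], float] = {}
    for a, a_times in times_by_class.items():
        for b, b_times in times_by_class.items():
            if a == b:
                continue
            # Count occurrences of A followed by B within [t(A)+1, t(A)+TTL].
            hits = sum(
                1 for ta in a_times
                if any(ta + 1 <= tb <= ta + ttl for tb in b_times)
            )
            probs[(a, b)] = hits / len(a_times)
    return probs


def edges_above_threshold(probs: Dict[Tuple[str, str], float],
                          threshold: float = 0.4) -> Set[Tuple[str, str]]:
    """Pairs (A, B) for which P(B|A) exceeds the threshold."""
    return {pair for pair, p in probs.items() if p > threshold}


# Toy history (not data from the description).
history = [(0, "Routing issue"), (2, "CPU issue"), (10, "Routing issue"),
           (20, "Routing issue"), (22, "CPU issue")]
probs = cooccurrence_matrix(history, ttl=5)
edges = edges_above_threshold(probs)
```

In the toy history, two of the three Routing issue occurrences are followed by a CPU issue within the TTL, so only the pair (Routing issue, CPU issue) passes the 0.4 threshold. A production version would probably also enforce a minimum number of samples per cluster before trusting an estimate.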
With the parameters TTL = 5 minutes, minimum samples per cluster = 19, and minimum P(B|A) = 0.4, the following result is obtained: 504 events are reduced to 345 (-31.5%) when similar events are aggregated, and to 311 (-38.3%) when both similar and dissimilar events are aggregated.
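The quoted percentages are consistent with reading 345 and 311 as the number of entries remaining out of the original 504 events. A short check, with a hypothetical helper name, is:

```python
def reduction_percent(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent, rounded to one decimal."""
    return round(100.0 * (after - before) / before, 1)


similar_only = reduction_percent(504, 345)
similar_and_dissimilar = reduction_percent(504, 311)
```

Under this reading, adding the aggregation of dissimilar events removes a further 34 entries beyond what same-class aggregation achieves.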
As an example, it is observed that an event of associated class “CPU Usage (B)” happens at a device “Device 3”. The event cache is examined and there is an ongoing “Routing issue (A)” event class for devices “Device 1” and “Device 2”. A query is performed to the anomaly database, with A = Routing issue, B = CPU Usage, using the following information in the anomaly database:
[Table, image imgf000012_0001: information in the anomaly database used for the query; not reproduced in the text]
Since B appears in the anomaly database for A, the query is positive and they are aggregated together.
Preferably, the event is a grouping of a plurality of alarms occurring in the device at a time instant or within a time window.
Preferably, the event includes a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device.
Preferably, the assignment of the class to the received event is based on grouping events by similar KPIs.
Preferably, steps of the method for aggregating the events are done either in a batch mode or in a streaming mode.
In an implementation, a computer program includes instructions which, when the program is executed by a computer, cause the computer to carry out the steps of a method or any of the preceding preferences.
FIG. 5 is an illustration of a computer system (e.g. a system) in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computer system 500 includes at least one processor 504 that is connected to a bus 502, wherein the bus 502 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The computer system 500 also includes a memory 506.
Control logic (software) and data are stored in the memory 506, which may take the form of random-access memory (RAM). In the disclosure, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user. The computer system 500 may also include a secondary storage 510. The secondary storage 510 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 506 and the secondary storage 510. Such computer programs, when executed, enable the computer system 500 to perform various functions as described in the foregoing. The memory 506, the secondary storage 510, and any other storage are possible examples of computer-readable media.
In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 504, a graphics processor coupled to a communication interface 512, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 504 and a graphics processor, a chipset (namely, a group of integrated circuits designed to work and be sold as a unit for performing related functions), and so forth.
Furthermore, the architectures and functionalities depicted in the various previously described figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, or an application-specific system. For example, the computer system 500 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, or an embedded system.
Furthermore, the computer system 500 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, and so forth. Additionally, although not shown, the computer system 500 may be coupled to a network (for example, a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 508.
It should be understood that the arrangement of components illustrated in the figures described is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements illustrated in the described figures. In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that when included in an execution environment constitutes a machine, hardware, or a combination of software and hardware.

Claims

1. A method for aggregating events in a network (104, 204, 304) comprising: receiving an event from a device (102, 202, 302) in the network (104, 204, 304); assigning a class to the received event based on information stored in an anomaly database (108, 208, 308); aggregating the event with one or more other events present in an event cache, wherein aggregation comprises classifying events corresponding to a plurality of anomalies identified in the network (104, 204, 304); and generating a summarised list of aggregated events for assessment.
2. The method of claim 1, wherein aggregating the event with the one or more other events comprises comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache.
3. The method of claim 1, wherein aggregating the event with the one or more other events comprises querying the anomaly database (108, 208, 308) to find events with different classes that occurred in the past together with events with the class assigned to the received event.
4. The method of any one of preceding claims, wherein the event is a grouping of a plurality of alarms occurring in the device (102, 202, 302) at a time instant or within a time window.
5. The method of claim 4, wherein the event comprises a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device (102, 202, 302).
6. The method of claim 5, wherein the assignment of the class to the received event is based on grouping events by similar KPIs.
7. The method of any one of preceding claims, wherein steps of the method for aggregating the events are done either in a batch mode or in a streaming mode.
8. A system (100, 200, 300) for aggregating events in a network (104, 204, 304) comprising: an input module (106, 206, 306) configured to receive an event from a device (102, 202, 302) in the network (104, 204, 304); a class assignment module (112, 212, 312) configured to assign a class to the received event based on information stored in an anomaly database (108, 208, 308); an event aggregator module (116, 216, 316) configured to aggregate the event with one or more other events present in an event cache, wherein aggregation comprises classifying events corresponding to a plurality of anomalies identified in the network (104, 204, 304); and an output module (118, 218, 318) configured to generate a summarised list of aggregated events for assessment.
9. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/057757 WO2023179861A1 (en) 2022-03-24 2022-03-24 Aggregation of anomalies in a network

Publications (1)

Publication Number Publication Date
WO2023179861A1 true WO2023179861A1 (en) 2023-09-28

Family

ID=81384541

Country Status (1)

Country Link
WO (1) WO2023179861A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22718130

Country of ref document: EP

Kind code of ref document: A1