WO2023179861A1

WO2023179861A1 - Aggregation of anomalies in a network

Info

Publication number: WO2023179861A1
Application number: PCT/EP2022/057757
Authority: WO
Inventors: Jose Manuel NAVARRO GONZALEZ; Alexis HUET; Dario Rossi
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2022-03-24
Filing date: 2022-03-24
Publication date: 2023-09-28

Abstract

Provided is a method for aggregating events in a network (104, 204, 304). The method includes receiving an event from a device (102, 202, 302) in the network (104, 204, 304). The method includes assigning a class to the received event based on information stored in an anomaly database (108, 208, 308). The method includes aggregating the event with one or more other events present in an event cache. Aggregation includes classifying events corresponding to one or more anomalies identified in the network and generating a summarised list of aggregated events for assessment.

Description

AGGREGATION OF ANOMALIES IN A NETWORK

TECHNICAL FIELD

The invention relates to anomaly aggregation, more particularly, the invention relates to a method and a system for aggregation of anomalies in a network.

BACKGROUND

For each device in a monitored network, experts need to analyse the evolution of the KPIs (Key Performance Indicators) or network devices’ measured features over time (time series) in order to troubleshoot them.

Traditionally, troubleshooting is performed using (a) Univariate time series detection on each individual time series, giving anomalous dates for each KPI, and (b) optionally, manually aggregating those anomalies obtained for each single KPI series together. The disadvantage is that where the number of KPIs is typically more than 1000, a large number of anomalies are detected. Further, this can overload the capacity of a manual operator to diagnose the system in real time.

Recent anomaly detection algorithms are multivariate. They take multivariate time series inputs but are only able to identify moments in time which contain anomalous data. Additional techniques must be used in order to determine which KPIs are anomalous. This still yields an excessive number of anomalies, as each device is considered separately.

The traditional solutions are limited to work with anomalies individually and create results that are lengthy and difficult to interpret by a network engineer.

Therefore, the present invention aims to provide a method for aggregating anomalies in a network having multiple devices, thereby reducing the time taken by network engineers to troubleshoot faults in their monitored systems.

SUMMARY

It is an object of the invention to provide a method for aggregating anomalies in a network having multiple devices, thereby reducing the time taken by network engineers to troubleshoot faults in their monitored systems.

This object is achieved by the features of the independent claims. Further implementations are apparent from the dependent claims, the description, and the figures.

The invention provides a method and a system for aggregating events in a network.

According to a first aspect, there is provided a method for aggregating events in a network including: receiving an event from a device in the network; assigning a class to the received event based on information stored in an anomaly database; aggregating the event with one or more other events present in an event cache, wherein aggregation includes classifying events corresponding to a plurality of anomalies identified in the network; and generating a summarised list of aggregated events for assessment.

The method for aggregating events in the network is adapted to work with events, as opposed to anomalies. Based on the assumption that events represent the status of a device, as opposed to the narrow view that an anomaly provides, aggregating events across devices makes it possible to quickly understand the status and diagnosis of a system as a whole. Thus, such aggregation of events provides improved diagnostics and efficiency.

Additionally, the method works in both an online and an offline manner, and it can aggregate similar and dissimilar events across devices, which is not possible in the prior art. The method reduces the time taken by network engineers to troubleshoot faults in their monitored systems.

Preferably, aggregating the event with the one or more other events includes comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache. Preferably, aggregating the event with the one or more other events includes querying the anomaly database to find events with different classes that occurred in the past together with events with the class assigned to the received event.

Preferably, the event is a grouping of one or more alarms occurring in the device at a time instant or within a time window.

Preferably, the event includes a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device.

Preferably, the assignment of the class to the received event is based on grouping events by similar KPIs.

Preferably, steps of the method for aggregating the events are done either in a batch mode or in a streaming mode.

According to a second aspect, there is provided a system for aggregating events in a network including: an input module configured to receive an event from a device in the network; a class assignment module configured to assign a class to the received event based on information stored in an anomaly database; an event aggregator module configured to aggregate the event with one or more other events present in an event cache, wherein aggregation includes classifying events corresponding to a plurality of anomalies identified in the network; and an output module configured to generate a summarised list of aggregated events for assessment.

According to a third aspect, there is provided a computer program including instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method or any of the preceding preferences.

Therefore, in contradistinction to the prior art, the disclosed method is adapted to work with events, as opposed to anomalies. Based on the assumption that events represent the status of a device, as opposed to the narrow view that an anomaly provides, aggregating events across devices makes it possible to quickly understand the status and diagnosis of a system as a whole. Additionally, the method works in both an online and an offline manner, and it is able to aggregate similar and dissimilar events across devices, which is not possible in the prior art.

These and other aspects of the invention will be apparent from the implementations described below.

BRIEF DESCRIPTION OF DRAWINGS

Implementations of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 illustrates a block diagram of a system for aggregating events in a network in accordance with an implementation of the invention;

FIG. 2 illustrates a block diagram of a system for aggregating similar events in a network in accordance with an implementation of the invention;

FIG. 3 illustrates a block diagram of a system for aggregating dissimilar events in a network in accordance with an implementation of the invention;

FIG. 4 is a flow diagram that illustrates a method for aggregating events in a network in accordance with an implementation of the invention; and

FIG. 5 is an illustration of a computer system (e.g. a system) in which the various architectures and functionalities of the various previous implementations may be implemented.

DETAILED DESCRIPTION OF THE DRAWINGS

Implementations of the invention provide a method for aggregating anomalies in a network having multiple devices, thereby reducing the time taken by network engineers to troubleshoot faults in their monitored systems. To make solutions of the invention more comprehensible for a person skilled in the art, the following implementations of the invention are described with reference to the accompanying drawings.

Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the invention are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the invention described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.

FIG. 1 illustrates a block diagram of a system 100 for aggregating events in a network 104 in accordance with an implementation of the invention. The system 100 includes an input module 106, a co-occurrence module 110, a class assignment module 112, an event aggregator module 116, and an output module 118. The input module 106 is configured to receive an event from a device 102. The class assignment module 112 is configured to assign a class to the received event based on information stored in an anomaly database 108. The event aggregator module 116 is configured to aggregate the event with one or more other events present in an event cache. Aggregation includes classifying events corresponding to one or more anomalies identified in the network 104. The output module 118 is configured to generate a summarised list of aggregated events for assessment.

The system 100 for aggregating events in the network 104 is adapted to work with events, as opposed to anomalies. Based on the assumption that events represent the status of the device 102, as opposed to the narrow view that an anomaly provides, aggregating events across devices makes it possible to quickly understand the status and diagnosis of the system 100 as a whole.

Additionally, the system 100 works in both an online and an offline manner, and it can aggregate similar and dissimilar events across devices, which is not possible in the prior art. The system 100 reduces the time taken by network engineers to troubleshoot faults in their monitored systems.

The system 100 includes a co-occurrence module 110 that includes the class assignment module 112 and a co-occurrence query module 114. The class assignment module 112 and the co-occurrence query module 114 are configured to aggregate similar and dissimilar events respectively. When a new event arrives at the input module 106, the co-occurrence query module 114 is configured to determine whether events of dissimilar classes should be aggregated together based on the past behaviour of the events. The co-occurrence module 110 is connected to the event aggregator module 116, which is configured to aggregate the one or more events with one or more other events present in an event cache. The event cache keeps a record of events that happened previously. Aggregation includes classifying the one or more events corresponding to a plurality of anomalies identified in the network 104. The event aggregator module 116 is connected to the output module 118, which is configured to generate a summarised list of aggregated events for assessment.

An event is defined as a grouping of all the individual anomalies happening in the device 102 in a specific time window. An anomaly is defined as a tuple that includes a device identifier, a time, and an error type, where the tuple indicates that a faulty behaviour was detected in a specific key performance indicator (KPI) or service. The event may be expressed using the device identifier and identifiers of KPIs that exhibit anomalous behaviour. As an example, the event may be expressed as “{device, time, {anomaly type 1, anomaly type 2, ..., anomaly type N}}”. Events may be pre-aggregated with each KPI that is relevant to the anomaly. The anomaly database 108 is configured to store a record of past anomalies and their related KPIs.
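The definitions above can be illustrated with a minimal Python sketch. The class names `Anomaly` and `Event`, the helper `group_into_events`, and the use of fixed, non-overlapping windows of 5 time units are assumptions made for illustration only; the description does not prescribe a data layout or a windowing scheme.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    device: str  # device identifier
    time: int    # detection time, in arbitrary units
    kpi: str     # KPI or service in which faulty behaviour was detected


@dataclass(frozen=True)
class Event:
    device: str
    window_start: int
    kpis: frozenset  # identifiers of the KPIs that were anomalous in the window


def group_into_events(anomalies, window=5):
    """Group anomalies per device into fixed, non-overlapping time windows."""
    buckets = defaultdict(set)
    for a in anomalies:
        start = (a.time // window) * window
        buckets[(a.device, start)].add(a.kpi)
    return [Event(device, start, frozenset(kpis))
            for (device, start), kpis in sorted(buckets.items())]


anomalies = [
    Anomaly("Device 1", 2, "TOTAL ROUTE COUNT"),
    Anomaly("Device 1", 3, "BGP ROUTE COUNT"),
    Anomaly("Device 2", 1, "Memory Usage"),
]
events = group_into_events(anomalies)
```

Here the two anomalies on "Device 1" fall into the same window and form a single event with two anomalous KPIs, while "Device 2" yields a separate event.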

Preferably, aggregating the event with the one or more other events includes comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache.

Preferably, aggregating the event with the one or more other events includes querying the anomaly database 108 to find events with different classes that occurred in the past together with events with the class assigned to the received event.

Preferably, the event is a grouping of one or more alarms occurring in the device 102 at a time instant or within a time window.

Preferably, the event includes a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device 102. Preferably, the assignment of the class to the received event is based on grouping events by similar KPIs.

FIG. 2 illustrates a block diagram of a system 200 for aggregating similar events in a network 204 in accordance with an implementation of the invention. The system 200 includes a device 202, an input module 206, an anomaly database 208, a co-occurrence module 210, a class assignment module 212, an event aggregator module 216, and an output module 218. As an example, the system 200 receives, at the input module 206, an event “E1” that happened at a device “Device 1”. The event “E1” is expressed as “{Device 1, T = 2, KPIs: {TOTAL ROUTE COUNT, BGP ROUTE COUNT}}”, meaning that the event “E1” happened at “Device 1” at time = 2 and shows anomalies in the KPIs TOTAL ROUTE COUNT and BGP ROUTE COUNT. The event cache includes an event similar to the event “E1”. The event “E1” is passed to the class assignment module 212, which assigns the class “Routing issue” to the event “E1” and passes it to the event aggregator module 216.

The event aggregator module 216 compares the event “E1” with the event already existing in the event cache. The event already existing in the event cache is expressed as “{Device 2, T = 1, KPIs: {TOTAL ROUTE COUNT, BGP ROUTE COUNT, Memory Usage}, Assigned class: Routing issue}”. As both events share the class “Routing issue”, they are aggregated together and can be expressed in a collapsed form in the output, even though the event already existing in the event cache has an additional anomalous KPI, “Memory Usage”. The output module 218 generates a summarised list of aggregated events as “{Devices 1 & 2, T = 1-2, Class: Routing issue}”.
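The FIG. 2 example can be sketched as follows. The rule table `CLASS_RULES` and the functions `assign_class` and `aggregate_similar` are hypothetical; in particular, assigning a class when all of its defining KPIs are anomalous is only one possible reading of "grouping events by similar KPIs", and the "CPU issue" rule is included only to show a second class.

```python
# Hypothetical class rules: a class applies when all of its KPIs are anomalous.
CLASS_RULES = {
    "Routing issue": {"TOTAL ROUTE COUNT", "BGP ROUTE COUNT"},
    "CPU issue": {"CPUUsage"},
}


def assign_class(kpis, rules=CLASS_RULES):
    """Return the first class whose defining KPIs are all present in the event."""
    for name, required in rules.items():
        if required <= set(kpis):
            return name
    return "Unclassified"


def aggregate_similar(event, cache):
    """Merge the event into a cached group of the same class, or open a new group."""
    for group in cache:
        if group["cls"] == event["cls"]:
            group["devices"].add(event["device"])
            group["t_min"] = min(group["t_min"], event["time"])
            group["t_max"] = max(group["t_max"], event["time"])
            return group
    group = {"devices": {event["device"]}, "t_min": event["time"],
             "t_max": event["time"], "cls": event["cls"]}
    cache.append(group)
    return group


# Event cache already holding the earlier event from Device 2 (T = 1).
cache = [{"devices": {"Device 2"}, "t_min": 1, "t_max": 1, "cls": "Routing issue"}]

e1 = {"device": "Device 1", "time": 2,
      "kpis": {"TOTAL ROUTE COUNT", "BGP ROUTE COUNT"}}
e1["cls"] = assign_class(e1["kpis"])
summary = aggregate_similar(e1, cache)
```

After the call, `summary` covers Devices 1 and 2 over T = 1-2 with the class "Routing issue", and the cache still holds a single group.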

FIG. 3 illustrates a block diagram of a system 300 for aggregating dissimilar events in a network 304 in accordance with an implementation of the invention. The system 300 includes a device 302, an input module 306, an anomaly database 308, a class assignment module 312, an event aggregator module 316, and an output module 318. The system 300 further includes a co-occurrence module 310 that includes the class assignment module 312 and a co-occurrence query module 314. As an example, the system 300 receives, at the input module 306, an event “E3” that happened at a device “Device 3”. The event “E3” is expressed as “{Device 3, T = 3, KPIs: CPUUsage}”, meaning that the event “E3” happened at “Device 3” at time = 3 and shows an anomalous KPI “CPU usage”. The event “E3” is passed to the class assignment module 312 and receives the class “CPU issue”. The event “E3” is then sent to the event aggregator module 316.

The event aggregator module 316 compares the event “E3” with the events already existing in the event cache. The event already existing in the event cache is expressed as “{Devices 1 & 2, T = 1-2, Class: Routing issue}”. As the classes are different, a co-occurrence query is launched using the co-occurrence query module 314, which asks whether the class “CPU issue” should be aggregated with the class “Routing issue”. The answer is yes because, based on the data in the anomaly database 308, a CPU issue is observed 46% of the time after a Routing issue has appeared. Based on this answer, the event “E3” and the previous events are aggregated together and presented as a single anomaly. The output module 318 generates a summarised list of aggregated events as “T = 1-3, {Devices 1 & 2, Class: Routing issue}, {Device 3, Class: CPU issue}”.
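The FIG. 3 decision can be sketched as a lookup followed by a merge. The value 0.46 is the figure quoted above and 0.4 is the threshold used in the document's later example; the table layout `CO_OCCURRENCE` and the functions `co_occurrence_query` and `aggregate` are assumptions made for illustration.

```python
# P(B|A) keyed by (A, B); 0.46 is the figure quoted for CPU issue after Routing issue.
CO_OCCURRENCE = {("Routing issue", "CPU issue"): 0.46}
THRESHOLD = 0.4  # threshold value used in the document's example


def co_occurrence_query(new_cls, cached_cls, table=CO_OCCURRENCE, threshold=THRESHOLD):
    """Return True if new_cls followed cached_cls often enough in the past."""
    return table.get((cached_cls, new_cls), 0.0) > threshold


def aggregate(event, cache):
    """Merge the event into the first compatible cached group, else open a new one."""
    for group in cache:
        # Iterate over a copy of the keys because the group is modified on a match.
        for cls in list(group["members"]):
            if cls == event["cls"] or co_occurrence_query(event["cls"], cls):
                group["members"].setdefault(event["cls"], set()).add(event["device"])
                group["t_min"] = min(group["t_min"], event["time"])
                group["t_max"] = max(group["t_max"], event["time"])
                return group
    group = {"members": {event["cls"]: {event["device"]}},
             "t_min": event["time"], "t_max": event["time"]}
    cache.append(group)
    return group


# Cache state after the similar-event aggregation of Devices 1 and 2.
cache = [{"members": {"Routing issue": {"Device 1", "Device 2"}},
          "t_min": 1, "t_max": 2}]

e3 = {"device": "Device 3", "time": 3, "cls": "CPU issue"}
summary = aggregate(e3, cache)
```

The query is directional in this sketch: "CPU issue" after "Routing issue" passes the threshold, whereas the reverse pair is absent from the table and would not be merged.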

FIG. 4 is a flow diagram that illustrates a method of aggregating events in a network in accordance with an implementation of the invention. At step 402, an event is received from a device in the network. At step 404, a class is assigned to the received event based on information stored in an anomaly database. At step 406, the event is aggregated with one or more other events present in an event cache. Aggregation includes classifying events corresponding to one or more anomalies identified in the network. At step 408, a summarised list of aggregated events is generated for assessment.

Preferably, aggregating the event with the one or more other events includes comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache. For example, an input from a computer network dataset including 299 hand-labelled multi-dimensional events across two weeks, 22 devices in the network, and 35 KPI types is received in the input module. A binary matrix is created in which every row is an event and every column is a KPI type, and a “1” entry in a cell indicates that the specific KPI was anomalous in that specific event. A binary clustering algorithm is then applied that groups the events based on shared KPI patterns. As an example output, for almost 300 events with 35 different KPI types, the results obtained are described in a table as follows.

As described in the table, just 10 clusters can summarize 90% of the samples.
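The binary matrix and the coverage statement can be sketched as follows. The document does not name the binary clustering algorithm, so this sketch substitutes the simplest possible stand-in, grouping events with identical KPI patterns; the function names and the four toy events are hypothetical and are not the dataset described above.

```python
from collections import Counter


def binary_matrix(events, kpi_types):
    """One row per event, one column per KPI type; 1 marks an anomalous KPI."""
    return [[1 if kpi in event else 0 for kpi in kpi_types] for event in events]


def cluster_by_pattern(matrix):
    """Stand-in clustering: events with identical rows share a cluster."""
    return Counter(tuple(row) for row in matrix)


def coverage(clusters, top_k):
    """Fraction of events covered by the top_k largest clusters."""
    total = sum(clusters.values())
    covered = sum(count for _, count in clusters.most_common(top_k))
    return covered / total


kpi_types = ["TOTAL ROUTE COUNT", "BGP ROUTE COUNT", "Memory Usage", "CPUUsage"]
events = [
    {"TOTAL ROUTE COUNT", "BGP ROUTE COUNT"},
    {"TOTAL ROUTE COUNT", "BGP ROUTE COUNT"},
    {"TOTAL ROUTE COUNT", "BGP ROUTE COUNT", "Memory Usage"},
    {"CPUUsage"},
]
matrix = binary_matrix(events, kpi_types)
clusters = cluster_by_pattern(matrix)
share = coverage(clusters, top_k=1)
```

A real clustering algorithm would also merge near-identical patterns, such as the third event here, which differs from the first two only by "Memory Usage".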

Preferably, aggregating the event with the one or more other events includes querying the anomaly database to find events with different classes that occurred in the past together with events with the class assigned to the received event.

This is performed to aggregate, across two or more network devices, two or more dissimilar anomalous multi-dimensional events that have shown a temporal correlation in the past. As an example, hand-labelled events across three weeks with 22 devices in the network and 134 KPI types are received as an input in the input module. Each event is labelled using the extracted clusters. Jaccard similarity, which is a measure of proximity between data points, may be used. The following steps are performed:

1. For each cluster A:

   For each other cluster B:

      Extract P(B|A): the proportion of times B appears between t(A)+1 and t(A)+TTL (Time To Live threshold).

2. Build a “co-occurrence” matrix from the extracted probabilities.

   Represented as a table: A → B indicates that P(B|A) > threshold (0.4 in the example).

3. In the system:

   (a) When a new event B is added to the event cache, the event aggregator module queries the anomaly database to obtain P(B|A) for each event class A in the event cache.

   (b) If any P(B|A) > threshold, they are aggregated together.
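The steps above can be sketched as follows, under stated assumptions: events are labelled with the cluster whose KPI pattern has the highest Jaccard similarity, and P(B|A) is estimated as the fraction of occurrences of A that are followed by at least one B within the interval from t(A)+1 to t(A)+TTL. All function names, the cluster patterns, and the toy timeline are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two KPI sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def label_event(kpis, cluster_patterns):
    """Label an event with the cluster whose KPI pattern is most similar."""
    return max(cluster_patterns, key=lambda name: jaccard(kpis, cluster_patterns[name]))


def co_occurrence(labelled, ttl):
    """labelled: list of (time, cluster). Returns {(A, B): P(B|A)} for A != B."""
    times = defaultdict(list)
    for t, cluster in labelled:
        times[cluster].append(t)
    probs = {}
    for a, a_times in times.items():
        for b, b_times in times.items():
            if a == b:
                continue
            # Count occurrences of A followed by some B in [t + 1, t + ttl].
            hits = sum(any(t + 1 <= tb <= t + ttl for tb in b_times) for t in a_times)
            probs[(a, b)] = hits / len(a_times)
    return probs


def edges(probs, threshold=0.4):
    """Pairs (A, B) with P(B|A) above the threshold, i.e. the A -> B table."""
    return {pair for pair, p in probs.items() if p > threshold}


patterns = {"Routing issue": {"TOTAL ROUTE COUNT", "BGP ROUTE COUNT"},
            "CPU issue": {"CPUUsage"}}
label = label_event({"TOTAL ROUTE COUNT", "BGP ROUTE COUNT", "Memory Usage"}, patterns)

timeline = [(0, "Routing issue"), (2, "CPU issue"), (10, "Routing issue"),
            (20, "Routing issue"), (23, "CPU issue")]
probs = co_occurrence(timeline, ttl=5)
table = edges(probs)
```

In this toy timeline two of the three "Routing issue" occurrences are followed by a "CPU issue" within the TTL, so only the pair ("Routing issue", "CPU issue") exceeds the 0.4 threshold.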

With TTL = 5 minutes, a minimum of 19 samples per cluster, and a minimum P(B|A) of 0.4, the following result is obtained: 504 events are reduced to 345 (-31.5%) when similar events are aggregated, and to 311 (-38.3%) when both similar and dissimilar events are aggregated.
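Reading the figures above as event counts before and after aggregation (an interpretation, as the source does not spell this out), the quoted percentages are consistent with the relative change with respect to the 504 original events:

```latex
\[
\frac{345 - 504}{504} = -\frac{159}{504} \approx -31.5\,\%,
\qquad
\frac{311 - 504}{504} = -\frac{193}{504} \approx -38.3\,\%
\]
```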

As an example, it is observed that an event of associated class “CPU Usage (B)” happens at a device “Device 3”. The event cache is examined and there is an ongoing “Routing issue (A)” event class for devices “Device 1” and “Device 2”. A query is performed to the anomaly database, with A = Routing issue, B = CPU Usage, using the following information in the anomaly database:

Since B appears in the anomaly database for A, the query is positive and they are aggregated together.

Preferably, the event is a grouping of a plurality of alarms occurring in the device at a time instant or within a time window.

In an implementation, a computer program includes instructions which, when the program is executed by a computer, cause the computer to carry out the steps of a method or any of the preceding preferences.

FIG. 5 is an illustration of a computer system (e.g. a system) in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computer system 500 includes at least one processor 504 that is connected to a bus 502, wherein the bus 502 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The computer system 500 also includes a memory 506.

Control logic (software) and data are stored in the memory 506, which may take the form of random-access memory (RAM). In the disclosure, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user. The computer system 500 may also include a secondary storage 510. The secondary storage 510 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.

Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 506 and the secondary storage 510. Such computer programs, when executed, enable the computer system 500 to perform various functions as described in the foregoing. The memory 506, the secondary storage 510, and any other storage are possible examples of computer-readable media.

In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 504, a graphics processor coupled to a communication interface 512, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 504 and a graphics processor, a chipset (namely, a group of integrated circuits designed to work and sold as a unit for performing related functions), and so forth.

Furthermore, the architectures and functionalities depicted in the various previously described figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, or an application-specific system. For example, the computer system 500 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, or an embedded system.

Furthermore, the computer system 500 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, and so forth. Additionally, although not shown, the computer system 500 may be coupled to a network (for example, a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 508.

It should be understood that the arrangement of components illustrated in the figures described is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements illustrated in the described figures. In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that when included in an execution environment constitutes a machine, hardware, or a combination of software and hardware.

Claims

1. A method for aggregating events in a network (104, 204, 304) comprising: receiving an event from a device (102, 202, 302) in the network (104, 204, 304); assigning a class to the received event based on information stored in an anomaly database (108, 208, 308); aggregating the event with one or more other events present in an event cache, wherein aggregation comprises classifying events corresponding to a plurality of anomalies identified in the network (104, 204, 304); and generating a summarised list of aggregated events for assessment.

2. The method of claim 1, wherein aggregating the event with the one or more other events comprises comparing the class assigned to the received event with classes previously assigned to the other events present in the event cache.

3. The method of claim 1, wherein aggregating the event with the one or more other events comprises querying the anomaly database (108, 208, 308) to find events with different classes that occurred in the past together with events with the class assigned to the received event.

4. The method of any one of the preceding claims, wherein the event is a grouping of a plurality of alarms occurring in the device (102, 202, 302) at a time instant or within a time window.

5. The method of claim 4, wherein the event comprises a device identifier and one or more anomaly identifiers for Key Performance Indicators, KPIs, of the device (102, 202, 302).

6. The method of claim 5, wherein the assignment of the class to the received event is based on grouping events by similar KPIs.

7. The method of any one of the preceding claims, wherein steps of the method for aggregating the events are done either in a batch mode or in a streaming mode.

8. A system (100, 200, 300) for aggregating events in a network (104, 204, 304) comprising: an input module (106, 206, 306) configured to receive an event from a device (102, 202, 302) in the network (104, 204, 304); a class assignment module (112, 212, 312) configured to assign a class to the received event based on information stored in an anomaly database (108, 208, 308); an event aggregator module (116, 216, 316) configured to aggregate the event with one or more other events present in an event cache, wherein aggregation comprises classifying events corresponding to a plurality of anomalies identified in the network (104, 204, 304); and an output module (118, 218, 318) configured to generate a summarised list of aggregated events for assessment.

9. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of claims 1 to 7.