WO2003039070A2 - Method and apparatus for analysing network robustness - Google Patents

Method and apparatus for analysing network robustness

Info

Publication number
WO2003039070A2
WO2003039070A2 (PCT/GB2002/005029)
Authority
WO
WIPO (PCT)
Prior art keywords
network
node failure
performance
simulations
nodes
Prior art date
Application number
PCT/GB2002/005029
Other languages
French (fr)
Other versions
WO2003039070A3 (en)
Inventor
Fabrice Tristan Pierre Saffre
Robert Alan Ghanea-Hercock
Original Assignee
British Telecommunications Public Limited Company
Priority date
Filing date
Publication date
Application filed by British Telecommunications Public Limited Company filed Critical British Telecommunications Public Limited Company
Publication of WO2003039070A2 publication Critical patent/WO2003039070A2/en
Publication of WO2003039070A3 publication Critical patent/WO2003039070A3/en


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0645: Management of faults, events, alarms or notifications using root cause analysis, by additionally acting on or stimulating the network after receiving notifications
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/22: Arrangements for maintenance, administration or management of data switching networks comprising specially adapted graphical user interfaces [GUI]

Definitions

  • The network designer can use the analyser to compute statistics about the network's resilience to node failure, in terms of the cohesion of its largest component (which initially includes all nodes).
  • The GUI entries used to do this are shown in figure 8.
  • The analyser then displays the results window shown in figure 9, from which it can be seen that the r2 value for Option 2 correlates closest to the simulation results, indicating that expression [1b] models the network best.
  • The analyser shows that, on average, removing only about 14% of all nodes (equivalent to severing all their links) is enough to reduce the size of the largest component to 50% of the surviving population (Xc ≈ 0.14).
  • In other words, the analyser tells the designer that if 500 nodes out of 3000 are malfunctioning, the largest sub-set of relays that are still interconnected is likely to contain less than half of the 2500 surviving nodes. It is therefore likely that, in this situation, around 1250 operational nodes are cut off from (and unable to exchange any information with) the core of the network.
  • A straightforward way of increasing robustness is to add at least some backup links, so that alternative routes are available between nodes in case the primary (presumably most efficient) path becomes unavailable due to node failure(s).
  • The designer may want to test the influence of doubling the total number of connections (raising it to 5999 links).
  • Doubling the number of links may, however, be an unacceptable solution for financial reasons.
  • The network designer may instead look for alternative ways of improving robustness, perhaps by testing the benefit of partial route redundancy. Again, the analyser allows projections to be made on the basis of another blueprint: for example, adding only 1000 extra connections to the original topology, bringing the total to 3999. The results of this are shown in figure 12. As can be seen, the robustness is not increased in the same proportion as before. However, even though only 33% extra links were created instead of 100%, the critical value Xc is shifted to about 0.44. In other words, the modified network is roughly three times more robust on this measure relative to the original blueprint. Since doubling the number of connections only results in the robustness increasing four times, the second choice may be more cost-effective.
  • The apparatus described above is a combined simulation and analysis tool designed to study topological robustness. It does not take into account other critical aspects of network operation such as traffic or routing management. Its purpose is to provide a suitable way of estimating the speed and profile of the largest component's decay under cumulative node failure, a necessary step in assessing a system's ability to withstand damage.
  • The apparatus that embodies the invention could be a general purpose device having software arranged to provide an embodiment of the invention.
  • The device could be a single device or a group of devices, and the software could be a single program or a set of programs.
  • Any or all of the software used to implement the invention can be contained on various transmission and/or storage media such as a floppy disc, CD-ROM or magnetic tape, so that the program can be loaded onto one or more general purpose devices, or could be downloaded over a network using a suitable transmission medium.
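The blueprint variations discussed above (adding 1000 or 3000 backup links before re-running the analysis) could be prepared programmatically. The following is a minimal sketch, not from the patent; the function name and random placement strategy are assumptions:

```python
import random

def add_backup_links(n_nodes, edges, extra, seed=0):
    """Add `extra` randomly placed backup links (no duplicates or
    self-loops) to an existing edge list, mimicking the partial
    route redundancy experiment described above."""
    rng = random.Random(seed)
    have = {frozenset(e) for e in edges}
    new_edges = list(edges)
    while len(new_edges) < len(edges) + extra:
        a, b = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if a != b and frozenset((a, b)) not in have:
            have.add(frozenset((a, b)))
            new_edges.append((a, b))
    return new_edges
```

The modified edge list would then be written back out in the topology file format and fed to the analyser for comparison against the original blueprint.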


Abstract

A method and apparatus are disclosed which enable the robustness of a network to cumulative node failure to be determined. The system takes as its input a description of a network topology, simulates cumulative node failure, and produces a model of the robustness of the network that can be used as a relative measure against revised or alternative network topologies.

Description

METHOD AND APPARATUS FOR ANALYSING NETWORK ROBUSTNESS
The present invention relates to analysis of the structure of networks. In particular, but not exclusively, the invention relates to assessing the robustness of a network topology when exposed to cumulative node failure. Such node failure may result from a node going out of service as a result of maintenance or a directed attack. In an ad-hoc mobile network, a node may go out of service because it is out of range of nodes that it was previously in communication with.
Different network topologies react differently to node failure and/or broken links (see R Albert, H Jeong, and A-L Barabasi, "Error and attack tolerance of complex networks", Nature 406, pages 376-382 and R Cohen, K Erez, D ben-Avraham and S Havlin, "Resilience of the Internet to random breakdowns", Physical Review Letters 85, pages 4626-4628), and mathematical techniques used in statistical physics can be used to describe the behaviour of such networks in this situation (see D S Callaway, M E J Newman, S H Strogatz, and D J Watts, "Network Robustness and Fragility: Percolation on Random Graphs", Physical Review Letters 85, pages 5468-5471). Also, most artificial networks, including the Internet and the World Wide Web, can be described as complex systems, often featuring "scale-free" properties (see R Albert, H Jeong, and A-L Barabasi, "Diameter of the World-Wide Web", Nature 401, pages 130-131, 1999; M Faloutsos, P Faloutsos, and C Faloutsos, "On Power-Law Relationships of the Internet Topology", ACM SIGCOMM '99, Computer Communications Review 29, pages 251-263; and B Tadic, "Dynamics of directed graphs: the world-wide Web", Physica A 293/1-2, pages 273-284).
As will be appreciated by those skilled in the art, robustness of a wide variety of real distributed architectures (telecommunication and transportation networks, power grids etc.) is a function of their topology and could therefore be evaluated on the basis of their blueprint. Similarly, several alternative designs could be compared before their actual implementation, in order, for example, to balance redundancy costs against increased resilience.
However, one problem that occurs is that efficient quantification and comparison require selecting a consistent set of measurements that can be considered a suitable summary of network behaviour under stress.
According to an embodiment of the present invention there is provided apparatus for determining the response of a network to node failure, said apparatus comprising: means for inputting a representation of a network; means for measuring the performance of the network in simulations of node failure; and means for comparing the performance of the network in simulations to one or more models of network response to node failure.
Embodiments of the present invention provide a network analyser that quantifies a complex network's behaviour when subjected to cumulative node failure. The analyser tests the robustness of any given network topology in an automated fashion, computing the values for a set of global variables after performing a statistical analysis of simulation results. Those variables characterise the decay of the network's largest component and effectively summarise the system's resilience to stress. In addition, the analyser provides a user-friendly interface to specify key simulation parameters and a graphical representation of the results. The results are also made available as text files.
Embodiments of the invention will now be described with reference to the accompanying drawings, in which:
Figure 1 is a representation of the topology of a network; Figure 2 is a representation of the topology of the network of figure 1 after being subjected to cumulative node failure;
Figure 3 is a flow diagram illustrating the analysis method used by the analysis apparatus according to an embodiment of the present invention;
Figure 4a & 4b are graphs illustrating specific steps in the analysis illustrated in figure 3; Figure 5 is an annotated screen shot of the graphical user interface (GUI) of the analysis apparatus;
Figures 6 and 9 to 12 are screen shots of the display by the analysis apparatus of the results of its analysis;
Figure 7 is a graph representing features of the network whose analysis results are shown in figure 6; and
Figure 8 is a further screen shot of the GUI showing the inputs used to generate the analysis shown in figure 9.
With reference to figure 1, a network 101 is made up of a number of nodes 103 interconnected by links 105 (nodes 103 may also be referred to as vertices and links 105 as edges). One measure of the effect of cumulative node failures on a network is the relative size (S) of the largest intact component compared to the total number of nodes. For example, the network 101 of figure 1 is fully intact since there is a path between each node and every other node, and so S=1. Figure 2 illustrates the same network 101 after cumulative node failure. This node failure has resulted in only 50% of the remaining nodes (i.e. the nodes left after the failed nodes have been removed) still being connected together in the largest component. The other 50% of the nodes are attached in smaller groups or not attached at all. As a result, S=0.5 (for clarity only the largest component is illustrated in figure 2). For example, a network may have one hundred nodes of which five fail, leaving 95 nodes in the network. For a relatively resilient network topology this could result in a value of S of 0.80. In other words, 80%, or 76, of the remaining 95 nodes would still be connected together. For a relatively brittle network topology this could result in a value of S of 0.20. In other words, 20%, or 19, of the remaining 95 nodes would still be connected together.
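For illustration, the relative size S of the largest component can be computed with a breadth-first search over the surviving nodes. The following is a minimal sketch; the function name and the edge-list representation are not from the patent:

```python
from collections import deque

def relative_largest_component(nodes, edges):
    """Return S: size of the largest connected component divided by
    the number of surviving nodes."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a in adj and b in adj:   # links touching failed nodes are ignored
            adj[a].add(b)
            adj[b].add(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                # BFS over one component
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best / len(nodes) if nodes else 0.0

# An intact 4-node path: every node reachable, so S = 1.0
print(relative_largest_component([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]))  # → 1.0
```

Passing only the surviving node list (with the original edge list unchanged) gives S after a round of failures, since links to removed nodes are simply ignored.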
The decay of the average relative size of the largest component <S> of a given network can be modelled using one of two basic non-linear equations. The first equation performs better when modelling networks with a relatively resilient topology and has the form:
<S> = X / (X + e^(βx))    [1a]
where X and β are constants and x is the fraction of nodes which have been disconnected or removed from the original network. If the topology of the network is such that it has a relatively brittle response to cumulative node loss, then it may be modelled better by the expression:
<S> = X / (X + x^β)    [1b]
Equations [1a] and [1b] obey a very similar logic and are relatively efficient in describing the network's behaviour under cumulative node failure. They can be used to discriminate between two qualitatively different categories of architecture. Since expressions [1a] and [1b] give an approximation of the decay of a given network's largest component, the corresponding X and β global variables are a suitable measurement for quantifying its resilience to cumulative node failure.
A further useful indicator derived from an adjusted value of X is Xc. This is defined as the value of x for which the average relative size (<S>) of the largest component is equal to 0.5. In other words Xc is the critical fraction of "missing" nodes above which, on average, less than 50% of the surviving nodes are still interconnected. The value of β provides an approximation of the slope of the curve around the critical value Xc. Xc is defined for networks described by equation [1a] as:
Xc = ln(X) / β    [2a]
and for networks described by equation [1b] as:
Xc = X^(1/β)    [2b]
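As a quick check of these definitions, the two model curves <S> = X / (X + e^(βx)) and <S> = X / (X + x^β), together with their critical fractions, can be evaluated directly. The helper names below are assumptions; by construction each model should return <S> = 0.5 at x = Xc:

```python
import math

def model_resilient(x, X, beta):
    # Expression [1a]: <S> = X / (X + e^(beta * x))
    return X / (X + math.exp(beta * x))

def model_brittle(x, X, beta):
    # Expression [1b]: <S> = X / (X + x ** beta)
    return X / (X + x ** beta)

def xc_resilient(X, beta):
    # Expression [2a]: Xc = ln(X) / beta
    return math.log(X) / beta

def xc_brittle(X, beta):
    # Expression [2b]: Xc = X ** (1 / beta)
    return X ** (1.0 / beta)

# At x = Xc each model returns <S> = 0.5, as the definition requires.
X, beta = 50.0, 8.0
print(model_resilient(xc_resilient(X, beta), X, beta))  # ≈ 0.5
print(model_brittle(xc_brittle(X, beta), X, beta))      # ≈ 0.5
```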
Figure 3 is a flow diagram illustrating the analysis method carried out by a computer program embodying the present invention running on a general purpose computer. The program provides an analysis apparatus that takes as input a description of the topology of the network to be analysed. The topology is described in a text file that lists the total number of nodes, the total number of links and paired node identification numbers (IDs) thereby specifying which nodes are directly connected to which other nodes. An example network of 1000 nodes interconnected by 999 links is set out in Table 1 below (not all the connections are shown).
Table 1: Topology file format
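A topology file of the kind described above could be read with a short helper. The sketch below assumes a whitespace-separated layout (counts first, then ID pairs); the patent does not give the exact file grammar, so the function name and format details are assumptions:

```python
def read_topology(path):
    """Parse a topology file: the first two values are taken to be the
    total node and link counts, followed by pairs of node IDs, one
    pair per link (whitespace-separated layout assumed)."""
    with open(path) as f:
        tokens = f.read().split()
    n_nodes, n_links = int(tokens[0]), int(tokens[1])
    pairs = [(int(tokens[i]), int(tokens[i + 1]))
             for i in range(2, 2 + 2 * n_links, 2)]
    return n_nodes, pairs
```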
After a properly formatted topology file has been generated for analysis, the user can launch the program to perform robustness tests which will start by displaying a graphical user interface (GUI) allowing simulation parameters and topology file name to be entered/modified by the user. The purpose of the simulations is to enable the calculation of the global variables β, X and Xc by performing a statistical analysis on data produced using Monte Carlo techniques for both the random failure and directed attack simulation techniques which will be described in further detail below.
A representation of the GUI is shown in figure 5. The GUI 501 comprises a number of user definable fields, a check box and two buttons in addition to the standard Windows ™ control buttons. The #Sims box 503 enables the user to determine how many simulations should be performed on the supplied topology data. The Sample box 505 enables the user to determine how many points there should be during each simulation where the effect of node losses should be calculated i.e. S measured. The File box 507 enables the user to define the file in which the topology of the network to be analysed is stored. The Seed box 509 is used to define a number that is used by the analyser to initialise its random number generator. The Attack check box 511 enables the user to choose between a random node failure simulation or a directed attack simulation. The Start button 513 begins the simulation process while the Exit button 515 closes the program.
After the simulation phase is over, the analyser creates two separate text files, bearing the same name as the original topology file, but with different extensions. One contains the values for the global variables and a measurement of fitting quality (r2) and has a ".gvr" suffix. The other contains a table of numerical values as shown in table 2 below and has a ".rst" suffix. The first column of table 2 contains the fraction of nodes that have failed, the second contains the corresponding average relative size <S> of the largest component, and the third is the standard deviation for S. The fourth column is the value of <S> as predicted by expression [1a], and the fifth is the value of <S> as predicted by [1b].
Table 2: Example ".rst" file
The method used by the program to produce data of the type shown in table 2 will now be described with reference to figure 3. At step 301, the program is initiated and extracts the topology data from the topology file described above. Processing then moves to step 303, at which the topology data is used to simulate network decay either by random node loss or by directed attack, as determined by the user via the GUI as noted above.
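The random-failure loop of steps 303 and 305 might look like the following sketch. The function names, the sampling scheme (measuring S at evenly spaced removal counts), and the defaults are assumptions, not the patent's implementation:

```python
import random
import statistics
from collections import deque

def largest_component_size(nodes, edges):
    """Size of the largest connected component among the surviving nodes."""
    adj = {n: [] for n in nodes}
    for a, b in edges:
        if a in adj and b in adj:      # links touching failed nodes vanish
            adj[a].append(b)
            adj[b].append(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

def random_failure(nodes, edges, n_sims=5, samples=10, seed=42):
    """Monte Carlo random-failure runs; returns rows of (x, <S>, sdev)
    in the style of table 2, where x is the fraction of nodes removed."""
    rng = random.Random(seed)
    n = len(nodes)
    step = max(1, n // samples)
    rows = {}
    for _ in range(n_sims):
        order = list(nodes)
        rng.shuffle(order)             # one random removal sequence per run
        for removed in range(0, n, step):
            survivors = order[removed:]
            s = largest_component_size(survivors, edges) / len(survivors)
            rows.setdefault(removed / n, []).append(s)
    return [(x, statistics.mean(v), statistics.pstdev(v))
            for x, v in sorted(rows.items())]

# A 10-node ring: before any removal every run gives S = 1 exactly.
ring = [(i, (i + 1) % 10) for i in range(10)]
print(random_failure(list(range(10)), ring)[0])  # → (0.0, 1.0, 0.0)
```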
The random node loss is simulated by the system randomly choosing one or more nodes from the supplied topology and removing them from the network. This is repeated until all nodes have been removed. The number of simulations that are carried out, and the number of nodes that are removed at each iteration, can be varied by the user via the GUI described above. After each round of node removal, the size S of the largest component of the remaining network is calculated by known methods and stored. At step 305, the average value of S between simulations (where there is more than one) is calculated along with its standard deviation SDEV and stored in the manner noted above with reference to table 2. Figure 4a is a graph of <S> (relative size of the largest component) derived from the simulations plotted against x (proportion of the original number of nodes removed). At step 307, for each of the expressions [1a] & [1b] a linear transformation is applied to the values of S. For both expressions [1a] & [1b] the transform is:
S' = ln(1 - S) - ln(S)    [3a]
For expression [1a], the linear transformation given by [3a] above must be plotted against x. For expression [1b], the linear transformation must be plotted against a modified version of x, namely x' where:
x' = ln(x)    [3b]
The points then fall along a straight line for the model that best fits the numerical data. After the transformation provided by expression [3a], the data shown in figure 4a appears as shown in figure 4b. At step 309, the regression of the points shown in figure 4b is calculated, which provides the expression:
S' = Ax + Const [4a]
where A is in fact the constant β and X = exp(-Const). The same regression is also applied to the points produced by expression [3b], which similarly yields the constants β and X via expression [4a]. Having calculated the constants β and X, the analyser then proceeds to calculate S using expressions [1a] & [1b] and stores the results as shown in the fourth and fifth columns of table 2 above. Xc is also calculated using expressions [2a] & [2b] and the results stored as described above.
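Steps 307 to 309 amount to an ordinary least-squares fit on the linearised data. The sketch below (the function name is an assumption) applies transform [3a], regresses against x or x' = ln(x), and recovers β and X = exp(-Const) from noiseless synthetic data generated with model [1a]:

```python
import math

def fit_model(xs, ss, brittle=False):
    """Linearise via S' = ln(1 - S) - ln(S) (expression [3a]) and
    regress against x (model [1a]) or ln(x) (model [1b], expression
    [3b]); returns (beta, X) with X = exp(-Const)."""
    pts = [(math.log(x) if brittle else x, math.log(1 - s) - math.log(s))
           for x, s in zip(xs, ss)
           if 0 < s < 1 and (x > 0 or not brittle)]
    n = len(pts)
    mx = sum(p for p, _ in pts) / n
    my = sum(q for _, q in pts) / n
    beta = (sum((p - mx) * (q - my) for p, q in pts)
            / sum((p - mx) ** 2 for p, _ in pts))
    const = my - beta * mx
    return beta, math.exp(-const)

# Noiseless data from model [1a] with beta = 8, X = 50:
# the fit recovers those values almost exactly.
beta0, X0 = 8.0, 50.0
xs = [i / 20 for i in range(1, 20)]
ss = [X0 / (X0 + math.exp(beta0 * x)) for x in xs]
print(fit_model(xs, ss))
```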
Processing then moves to step 311, where a fitting function is used to compare the results of the calculations of S using expressions [1a] & [1b] against the empirical results for S from the simulations carried out in step 303. The fitting function gives a measure (r2) for each of the curves derived from expressions [1a] & [1b] relative to the curve from the simulation. At step 313, the analyser displays the data it has calculated. An example display is shown in figure 6. The results window 601 is displayed, which includes values for all global variables and a graph showing simulation data (average S +/- standard deviation) as well as the results from each of the expressions [1a] & [1b] (referred to in the window 601 as Option 1 and Option 2). In this example (1000 nodes, 999 links, scale-free) the value for r2 is highest for expression [1b] (Option 2), indicating that expression [1b] is the better model for the network being analysed. As noted above, expression [1b] is typical of a brittle network.
If the "Attack" option is selected using the check box 511 of the GUI 501 shown in figure 5 then the analyser is arranged to remove nodes using a "best guess" strategy in the simulation carried out at step 303 of figure 3. This strategy emulates an attacker who possesses partial information about the network topology, which is used to choose which node to target next. It is modelled by attributing to each surviving node a probability of being selected that is linearly proportional to its degree k (i.e. the number of links it has to other nodes):
Pi = ki / Σj kj [5]
Using equation [5], the analyser recalculates Pi after each attack in order to take into account the changing probability distribution caused by the elimination of one of the nodes. This increased complexity means that testing a network's resilience to directed attack is more computationally intensive and time-consuming than testing for random failure.
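A minimal sketch of this degree-weighted selection is shown below. The function names are illustrative, and the sketch simplifies by tracking only a degree table: a full implementation would also decrement the degrees of the removed node's neighbours, which requires the adjacency list.

```python
import random

def pick_target(degrees, rng):
    """Choose the next node to attack with probability Pi = ki / sum_j(kj) [5]."""
    nodes = list(degrees)
    return rng.choices(nodes, weights=[degrees[n] for n in nodes], k=1)[0]

def simulate_attack(degrees, kills, seed=0):
    """Remove `kills` nodes under the "best guess" strategy, recomputing the
    selection probabilities after every removal, since eliminating a node
    changes the surviving degree distribution."""
    rng = random.Random(seed)
    degrees = dict(degrees)   # work on a copy of the degree table
    removed = []
    for _ in range(min(kills, len(degrees))):
        target = pick_target(degrees, rng)
        removed.append(target)
        del degrees[target]
    return removed
```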
The "Attack" scenario, because of its stochastic nature, can also be used to model special forms of accidental damage where connectivity level is involved. For example, in a network where congestion is a cause for node failure, key relays (highly connected nodes) are more likely to suffer breakdown, which can be modelled using expression [5].
The use of the analyser as a design tool when planning network architecture will now be described with reference to worked examples illustrated in figures 7 to 12. The example network is a relatively large 3000-node system. The cheapest way (from a topological point of view) to have all such nodes interconnected involves 2999 links. They could all be arranged in a single "star" or in a closed "loop", but more realistic architectures would involve inter-connected sub-domains of different sizes and/or topologies. The network used for this example is a scale-free network of the appropriate size (3000 nodes, one link per node except the first) which serves as the basic blueprint. Figure 7 indicates that the example network's topology is scale-free (power law relationship between node frequency and degree). The most highly connected node has a direct link with 45 other nodes, 9 "secondary hubs" have more than 20 connections, and 28 have between 10 and 20 direct "affiliates".
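A blueprint of this kind (every node after the first contributes exactly one link) can be grown by preferential attachment, which is one standard way, assumed here for illustration, of producing a scale-free tree with a few highly connected hubs:

```python
import random
from collections import Counter

def scale_free_tree(n, seed=0):
    """Grow an n-node scale-free tree (one link per node except the first):
    each newcomer attaches to an existing node with probability proportional
    to that node's current degree (preferential attachment)."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]   # each node appears once per link it holds, so a
                         # uniform draw from this list is degree-proportional
    for new in range(2, n):
        target = rng.choice(endpoints)
        edges.append((target, new))
        endpoints.extend((target, new))
    return edges

edges = scale_free_tree(3000)                      # 2999 links, as in the example
degree = Counter(v for e in edges for v in e)      # node -> number of links
```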
Using this topology the network designer can use the analyser to compute statistics about its resilience to node failure, in terms of the cohesion of its largest component (initially including all nodes). In this example, the designer wants the analyser to conduct statistics on a series of 100 simulations, "killing" 100 randomly selected nodes (a fraction 1/30 ≈ 0.033 of the population) between successive sample values. The GUI entries to provide this are shown in figure 8. Once the process is complete the analyser displays the results window shown in figure 9, from which it can be seen that the r2 value for Option 2 correlates closest to the simulation results, indicating that expression [1b] models the network best.
As a result of the example network having a tree-like hierarchical structure with no built-in redundancy (1 link per node), it is not very robust to node failure. Indeed, the analyser shows that, on average, removing only about 14% of all nodes (equivalent to severing all their links) is enough to reduce the size of the largest component to 50% of the surviving population (Xc ≈ 0.14). The analyser tells the designer that if 500 nodes out of 3000 are malfunctioning, chances are the largest sub-set of relays that are still interconnected contains fewer than half of the 2500 surviving nodes. In other words, it is likely that in this situation, around 1250 operational nodes are cut off from (and unable to exchange any information with) the core of the network.
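The underlying measurement can be sketched as a single simulation run: repeatedly "kill" a batch of random nodes, track the largest connected component among the survivors with a union-find structure, and record the removed fraction at which that component first drops below half the surviving population. This is an estimate of Xc from one run; the analyser averages over many runs (100 in the example). Function names are illustrative.

```python
import random
from collections import Counter

def largest_component(n, edges, dead):
    """Size of the largest connected component among surviving nodes (union-find)."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    for u, v in edges:
        if u not in dead and v not in dead:
            parent[find(u)] = find(v)
    sizes = Counter(find(i) for i in range(n) if i not in dead)
    return max(sizes.values()) if sizes else 0

def critical_fraction(n, edges, step=100, seed=0):
    """Kill `step` random nodes at a time; return the fraction removed (an
    estimate of Xc) when the largest component first falls below half of
    the surviving population."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    dead = set()
    while order:
        for _ in range(min(step, len(order))):
            dead.add(order.pop())
        survivors = n - len(dead)
        if survivors == 0 or largest_component(n, edges, dead) < survivors / 2:
            return len(dead) / n
```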
Testing the same architecture for "attack" (by checking the box 511 in the GUI 501) gives even more concerning results. For example, killing only about 2% of the population (Xc ≈ 0.02), but this time selecting preferentially highly connected nodes, is enough to reach the same situation. So when applied to a typical scale-free architecture, the analyser correctly and automatically predicts the type of network behaviour and summarises it using a set of global variables. When the designer wants to increase the robustness of a planned network, alternative blueprints are produced, then fed into the analyser in order to compare their performance against that of an original or control structure. For example, a straightforward way of increasing robustness is to add at least some backup links, so that alternative routes are available between nodes in case the primary (presumably most efficient) path becomes unavailable due to node failure(s). Continuing the above example, the designer might want to test the influence of doubling the total number of connections (raising it to 5999 links).
The results of this are illustrated in figure 10: with 3000 new connections added to the original blueprint, the network becomes much more resilient to node failure. It now takes about 60% of the nodes to be missing before more than half of the surviving population is cut from the largest component. It is also clear that Option 1 now gives a much better fit than Option 2, which suggests a "qualitative" change in network behaviour. Moreover, the analyser provides additional information in the form of the evolution of the standard deviation around the average value. Indeed, until up to 50% of nodes have failed, the relative size of the largest component appears extremely stable relative to the simulation shown in figure 9. This indicates that the changes to the architecture (doubling the number of links between nodes) have made the reaction of the network to cumulative stress more predictable.
The ability of the network to withstand directed attack is also increased, as shown in figure 11, which illustrates the analysis of the same doubled-link network except with the Attack box 511 checked. Instead of requiring the removal of only about 2% of the nodes, it is now necessary to kill up to 40% to break the largest component, even though the most highly connected vertices are still specifically targeted.
Doubling the number of links may however be an unacceptable solution for financial reasons. The network designer may look for alternative ways of improving robustness, perhaps by testing the benefit of partial route redundancy. Again, the analyser allows projections to be made on the basis of another blueprint, for example one in which only 1000 extra connections are added to the original topology, bringing the total to 3999. The results of this are shown in figure 12. As can be seen, the robustness is not increased in the same proportion as before. However, even though 33% extra links were created instead of 100%, the critical size Xc is shifted to ≈ 0.44. In other words, the modified network is about three times more robust on this measure relative to the original blueprint. Since doubling the number of connections as described above only results in the robustness increasing about four times, the second choice may be more cost-effective.
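The cost-effectiveness argument can be made explicit with the Xc values quoted in the worked example (0.14 for the original 2999-link blueprint, 0.60 after adding 3000 links, 0.44 after adding only 1000):

```python
# Critical fractions (Xc) from the three blueprints in the example:
baseline, doubled, partial = 0.14, 0.60, 0.44

gain_doubled = doubled / baseline   # robustness multiplier for +100% links
gain_partial = partial / baseline   # robustness multiplier for +33% links

# Robustness gained per extra link favours the cheaper option:
per_link_doubled = (doubled - baseline) / 3000
per_link_partial = (partial - baseline) / 1000
```

On this crude per-link measure the 1000-link option delivers roughly twice the robustness gain per link of the 3000-link option, which is the sense in which it may be more cost-effective.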
These results demonstrate that the analyser enables the network designer to obtain valuable and detailed information quickly (including the value of β, which was not discussed in the example but gives a useful indication of how fast the network is likely to collapse when approaching the critical size). The apparatus described above is a combined simulation and analysis tool designed to study topological robustness. It does not take into account other critical aspects of network operation such as traffic or routing management. Its purpose is to provide a suitable way of estimating the speed and profile of the largest component's decay under cumulative node failure, a necessary step in assessing a system's ability to withstand damage.
It will be understood by those skilled in the art that the apparatus that embodies the invention could be a general purpose device having software arranged to provide an embodiment of the invention. The device could be a single device or a group of devices and the software could be a single program or a set of programs. Furthermore, any or all of the software used to implement the invention can be contained on various transmission and/or storage media such as a floppy disc, CD-ROM, or magnetic tape, so that the program can be loaded onto one or more general purpose devices or downloaded over a network using a suitable transmission medium.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising" and the like are to be construed in an inclusive as opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including, but not limited to".

Claims
1. Apparatus for determining the response of a network to node failure, said apparatus comprising: means for inputting a representation of a network; means for measuring the performance of the network in simulations of node failure; and means for comparing the performance of the network in simulations to one or more models of network response to node failure.
2. Apparatus according to claim 1 in which the means for measuring the performance of the network is operable to determine two characteristics of the network.
3. Apparatus according to any preceding claim in which the means for measuring the performance of the network is operable to determine a measure (X, Xc) of the decay of the largest component of the network in response to node failure.
4. Apparatus according to any preceding claim in which the means for measuring the performance of the network is operable to determine a measure (β) of the robustness of the network in response to node failure.
5. Apparatus according to any preceding claim in which the means for measuring the performance of the network in simulations of node failure is operable to carry out simulations for a plurality of types of node failure.
6. Apparatus according to claim 5 in which the plurality of types of node failure include node failure resulting from directed attack or from random failure.
7. Apparatus according to any preceding claim in which the means for measuring the performance of the network in simulations of node failure is operable to carry out simulations for a plurality of network types, such as a brittle network or a resilient network.
8. Apparatus according to any preceding claim further comprising means for choosing one of the models as modelling the performance of the network.
9. A method for determining the response of a network to node failure, said method comprising the steps of: determining a representation of a network; measuring the performance of the network in simulations of node failure; and comparing the performance of the network in simulations to one or more models of network response to node failure.
10. A method according to claim 9 in which the measuring step includes measuring the performance of the network to determine two characteristics of the network.
11. A method according to claim 9 or 10 in which the performance of the network is measured to determine a measure (X, Xc) of the decay of the largest component of the network in response to node failure.
12. A method according to any of claims 9 to 11 in which the performance of the network is measured to determine the robustness (β) of the network in response to node failure.
13. A method according to any of claims 9 to 12 in which the performance of the network is simulated for a plurality of types of node failure.
14. A method according to claim 13 in which the plurality of types of node failure include node failure resulting from directed attack or from random failure.
15. A method according to any of claims 9 to 14 in which the simulations of node failure are carried out for a plurality of network types, such as a brittle network or a resilient network.
16. A method according to any of claims 9 to 15 comprising the further step of choosing one of the models as modelling the performance of the network.
17. A computer program or suite of computer programs arranged to enable a computer or computers to provide the functions of the method or apparatus of any preceding claim.
PCT/GB2002/005029 2001-11-01 2002-11-01 Method and apparatus for analysing network robustness WO2003039070A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP01309304.2 2001-11-01
EP01309304 2001-11-01

Publications (2)

Publication Number Publication Date
WO2003039070A2 true WO2003039070A2 (en) 2003-05-08
WO2003039070A3 WO2003039070A3 (en) 2003-08-14

Family

ID=8182413

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2002/005029 WO2003039070A2 (en) 2001-11-01 2002-11-01 Method and apparatus for analysing network robustness

Country Status (1)

Country Link
WO (1) WO2003039070A2 (en)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEMPSEY S ET AL: "Predicting FDDI Computer Network Performance Using A Calibrated Software Simulation Model" PERFORMANCE, COMPUTING, AND COMMUNICATIONS CONFERENCE, 1997. IPCCC 1997., IEEE INTERNATIONAL PHOENIX, TEMPE, AZ, USA 5-7 FEB. 1997, NEW YORK, NY, USA,IEEE, US, 5 February 1997 (1997-02-05), pages 1-9, XP010217039 ISBN: 0-7803-3873-1 *
KANT L ET AL: "Modeling and simulation study of the survivability performance of ATM-based restoration strategies for the next generation high-speed networks" COMPUTER COMMUNICATIONS AND NETWORKS, 1999. PROCEEDINGS. EIGHT INTERNATIONAL CONFERENCE ON BOSTON, MA, USA 11-13 OCT. 1999, PISCATAWAY, NJ, USA,IEEE, US, 11 October 1999 (1999-10-11), pages 469-473, XP010359621 ISBN: 0-7803-5794-9 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8018860B1 (en) * 2003-03-12 2011-09-13 Sprint Communications Company L.P. Network maintenance simulator with path re-route prediction
EP1624397A1 (en) * 2004-08-02 2006-02-08 Microsoft Corporation Automatic validation and calibration of transaction-based performance models
CN100465918C (en) * 2004-08-02 2009-03-04 微软公司 Automatic configuration of transaction-based performance models
US7797425B2 (en) 2005-12-22 2010-09-14 Amdocs Systems Limited Method, system and apparatus for communications circuit design
US20180351814A1 (en) * 2015-03-23 2018-12-06 Utopus Insights, Inc. Network management based on assessment of topological robustness and criticality of assets
US10778529B2 (en) * 2015-03-23 2020-09-15 Utopus Insights, Inc. Network management based on assessment of topological robustness and criticality of assets
US11552854B2 (en) 2015-03-23 2023-01-10 Utopus Insights, Inc. Network management based on assessment of topological robustness and criticality of assets
CN112350312A (en) * 2020-10-29 2021-02-09 广东稳峰电力科技有限公司 Power line robustness analysis method and device
CN112350312B (en) * 2020-10-29 2022-10-04 广东稳峰电力科技有限公司 Power line robustness analysis method and device
US20230214304A1 (en) * 2021-12-30 2023-07-06 Juniper Networks, Inc. Dynamic prediction of system resource requirement of network software in a live network using data driven models
US11797408B2 (en) * 2021-12-30 2023-10-24 Juniper Networks, Inc. Dynamic prediction of system resource requirement of network software in a live network using data driven models
US11855866B1 (en) 2022-09-29 2023-12-26 The Mitre Corporation Systems and methods for assessing a computing network's physical robustness

Also Published As

Publication number Publication date
WO2003039070A3 (en) 2003-08-14


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP