WO2017072717A1 - Learning of the structure of Bayesian networks from a complete data set - Google Patents


Info

Publication number
WO2017072717A1
Authority
WO
WIPO (PCT)
Prior art keywords
variables
parent
score
subsets
learning
Prior art date
Application number
PCT/IB2016/056512
Other languages
French (fr)
Inventor
Mauro SCANAGATTA
Cassio POLPO DE CAMPOS
Giorgio CORANI
Marco ZAFFALON
Original Assignee
Supsi
Priority date
Filing date
Publication date
Application filed by Supsi filed Critical Supsi
Publication of WO2017072717A1 publication Critical patent/WO2017072717A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks


Abstract

A device (1) for learning the structure of a Bayesian network related to a plurality of variables, each of the variables being able to assume a plurality of finite states, which comprises: - an initialization module (30) configured to calculate exact scores in the relationships between each one of the variables and the remaining variables; - a computing engine (32) configured to calculate heuristic scores in the relationships of each one of the variables with respect to the pairs of the remaining variables, to select parent subsets having highest score, and to calculate heuristic scores for the parent subsets integrated with the score for one of the remaining variables, thus generating new subsets; - an ordering module (34) configured to order the parent subsets into a list, ordered by score; - an optimization module (40) configured to generate iteratively, for a selectable period of time, a random sequence of the items that are present in the list of the parent subsets, and to calculate an overall score for the parent set that covers the entire set of the variables, while holding the parent set to which the highest score corresponds.

Description

LEARNING OF THE STRUCTURE OF BAYESIAN NETWORKS FROM A COMPLETE DATA SET
The present invention relates to a device and a method for learning the structure of Bayesian networks from a complete data set, particularly but not exclusively useful and practical for learning the structure of Bayesian networks with a large number of variables, which can be used for analysis and correlation of large quantities of data. The device and the method according to the present invention are applicable, for example, to the diagnostics of mobile networks, to the analysis of financial data, to decision analysis, to weather forecasts and to behavior prediction in general.
A Bayesian network is the graphical representation of a probabilistic model constituted by a set of random variables and of their conditional dependencies, represented with the aid of a directed acyclic graph (DAG), also known as a network structure.
A directed acyclic graph is a graph that has nodes and arcs which are directed, but lacks directed cycles. This means that starting from any node of the graph it is not possible to visit the initial node again by moving along the direction of the arcs of the network.
The nodes represent random variables, where each variable can assume a given state or value.
A probability value is associated with the possible states of each variable, which must be mutually exclusive, while the arcs between nodes indicate a conditional dependence relationship between the variables represented by the nodes.
Conditional probability tables (CPT), i.e., tables that contain the probabilities of the values of the node conditioned by the possible combinations of values of the parent nodes, are associated with the nodes that have parents, i.e., nodes that are connected to at least one arc that points to them.
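For illustration only (an invented example, not taken from the patent text): a binary node Rain with a single binary parent Cloudy would have a CPT of the form

Cloudy | P(Rain = true) | P(Rain = false)
false  | 0.1            | 0.9
true   | 0.6            | 0.4

where each row sums to 1 over the mutually exclusive states of Rain.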
In greater detail, a Bayesian network is the graphical representation of the probabilistic model, and makes it possible to represent and analyze a phenomenon being studied in conditions of uncertainty. The probabilistic model represents a probability distribution on a set of discrete random variables X = {X1, X2, ..., Xn}. Bayesian networks, typically denoted by B = (G, Θ), are defined by specifying two components: a qualitative component and a quantitative component.
The qualitative component consists of the above mentioned directed acyclic graph (DAG), denoted by G = (V, ε), in which the nodes V have a one-to-one correspondence with the set of discrete random variables X = {X1, X2, ..., Xn} and the directed arcs ε are ordered pairs of elements of V. Each arc represents the conditional dependence between the nodes that it connects.
The quantitative component consists of a set of conditional probability distributions (CPD) of X, Θ, which are defined as the network parameters. The local conditional probability distributions, associated with each random variable and conditioned by every possible combination of the values assumed by the parent set of the variable, are specified by means of a set of parameters.
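In standard Bayesian network theory (recalled here for reference; the formula is not quoted from the patent text), the two components together encode the joint probability distribution as the product of the local conditional distributions:

P(X1, X2, ..., Xn) = ∏ i=1..n P(Xi | Pa(Xi)),

where Pa(Xi) denotes the parent set of the variable Xi in the graph G.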
In practice, the structure of a Bayesian network codifies the qualitative relationships, represented by means of arcs, that exist between the discrete random variables, represented by means of nodes.
The strength of the relationships that exist between the discrete random variables, i.e., the weight of the arcs that connect the nodes, is quantified by the conditional probability distributions associated with each node.
If a node has many parents or if the parents of a node have many possible states, the associated conditional probability table (CPT) can be very large. The size of a conditional probability table is in fact exponential with respect to the number of parents: for example, a binary node with k binary parents requires a CPT with 2^k rows.
The learning process of a Bayesian network substantially comprises two distinct steps. The first step relates to the learning of the structure of the network, or structural learning, i.e., the relationships between the variables. The second step relates to the learning of the network parameters, or parameter learning, i.e., the conditional probabilities.
In particular, the learning of the structure of a Bayesian network starting from a complete data set is an NP-hard problem.
Various exact algorithms are currently known for learning the structure of a Bayesian network as a function of a score, i.e., algorithms that have the task of finding the structure of the Bayesian network that maximizes the score that depends on the data set. These algorithms are based on different methods that are known in the background art, including for example dynamic programming, the branch and bound method, linear and integer programming, or shortest path heuristics.
Typically, the learning of the structure of the network is performed during two distinct steps. First one proceeds with an identification of the parent set and then with the optimization of the structure.
The parent set identification step produces, for each random variable, i.e., for each node, a list of candidate parent sets, i.e., a list of parent sets suitable to maximize the score of the Bayesian network.
The structure optimization step instead selects, for each node, a parent set among the ones listed in the above cited list and assigns it to the corresponding node, maximizing the score of the resulting structure without introducing cycles.
The typical problem that one must deal with in the parent set identification step is that it is unlikely that this step will admit a polynomial-time algorithm with a guarantee of good quality.
This problem has pushed research toward the development of effective search heuristics. However, usually the in-degree k, which indicates the number of parents per node, is defined beforehand and then the score of all the parent sets is calculated.
It should be noted that a higher in-degree entails a larger search space and consequently makes it possible to obtain a higher score.
These known solutions are not free from drawbacks, among which, in particular, one must include the fact that they require a very long computing time, which is exponential with respect to the in-degree defined beforehand. In practice, the larger the in-degree k, the longer the computing time and the greater the computational burden.
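To make the exponential dependence concrete (a standard combinatorial count, not a figure stated in the patent): with n variables and a fixed in-degree k, the number of candidate parent sets that must be scored for each node is

C(n-1, 1) + C(n-1, 2) + ... + C(n-1, k), which is of the order of n^k,

so every unit increase of k multiplies the scoring work by roughly a factor of n.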
By choosing the in-degree beforehand, the user accepts a compromise between two different goals: minimizing computing time and maximizing the score of the Bayesian network. However, the user can only approximately estimate the impact of the in-degree on computing time and on the score of the Bayesian network. When the number of random variables, and therefore of nodes, is very large, the in-degree is generally set to a small value in order to maintain the feasibility of the structure optimization step.
The aim of the present invention is to overcome the limitations of the background art described above, by devising a device and a method for learning the structure of Bayesian networks from a complete data set that make it possible to obtain effects that are similar to, or better than, those obtainable with solutions of the known type, by making it possible to identify a structure of a Bayesian network of good quality, i.e., optimized as much as possible and with a maximized score, while maintaining computing time at a modest level.
Within the scope of this aim, an object of the present invention is to conceive of a device and a method for learning the structure of Bayesian networks from a complete data set that make it possible to define beforehand the computing time available for the identification of a structure of a Bayesian network of good quality.
Another object of the present invention is to devise a device and method for learning the structure of Bayesian networks from a complete data set that can be applied easily to massive data sets with a large number of variables, for example financial data, commodity data, weather data, webpage access data, and so forth.
Another object of the present invention is to provide a device and method for learning the structure of Bayesian networks from a complete data set that are highly reliable, relatively simple to provide and low cost.
This aim, as well as these and other objects that will become better apparent hereinafter, are achieved by a device for learning the structure of a Bayesian network related to a plurality of variables, each of said variables being able to assume a plurality of finite states, which comprises:
- an initialization module configured to calculate exact scores in the relationships between each one of said variables and the remaining variables;
- a computing engine configured to calculate heuristic scores in the relationships of each one of said variables with respect to the pairs of said remaining variables, to select parent subsets having highest score, and to calculate heuristic scores for said parent subsets integrated with the score for one of said remaining variables, thus generating new subsets;
- an ordering module configured to order said parent subsets into a list, ordered by score;
- an optimization module configured to generate iteratively, for a selectable period of time, a random sequence of the items that are present in said list of said parent subsets, and to calculate an overall score for the parent set that covers the entire set of said variables, while holding the parent set to which the highest score corresponds.
The intended aim and objects are furthermore achieved by a method for learning the structure of a Bayesian network related to a plurality of variables, each one of said variables being able to assume a plurality of finite states, which comprises the steps that consist in: - initializing the exact scores in the relationships between each one of said variables and the remaining variables;
- calculating the heuristic scores in the relationships of each one of said variables with respect to the pairs of said remaining variables, in order to select parent subsets having highest score;
- calculating the heuristic scores for said parent subsets integrated with the score for one of said remaining variables, thus generating new subsets;
- ordering said parent subsets into a list, ordered by score;
- optimizing the structure of the Bayesian network by generating iteratively, for a selectable period of time, a random sequence of the items that are present in said list of said parent subsets, and calculating an overall score for the parent set that covers the entire set of said variables, while holding the parent set to which the highest score corresponds.
Further characteristics and advantages of the invention will become better apparent from the description of a preferred but not exclusive embodiment of the device and of the method for learning the structure of Bayesian networks from a complete data set according to the invention, illustrated by way of nonlimiting example in the accompanying drawings, wherein:
Figure 1 is a block diagram showing schematically the components of an embodiment of the device for learning the structure of Bayesian networks, according to the present invention;
Figure 2 is a view of a known Bayesian network built manually by experts in biology and relating to the diagnosis of pulmonary diseases in children;
Figure 3 is a view of a Bayesian network generated by the device and by the method according to the present invention starting from a complete data set, which can be compared with the network of Figure 2.
With reference to the figures, the device, designated generally by the reference numeral 1, and the method for learning the structure of Bayesian networks from a complete data set according to the invention will now be described in relation to the steps that compose said method.
The first step of the method for learning the structure of Bayesian networks according to the present invention is described hereinafter. The parent set identification step according to the present invention, which uses heuristic assessment of the score of the structure of a Bayesian network, is performed by means of an identification module 20.
Within the scope of this first step, mainly two lists 10, 12 are used. The first list 10 is termed the open list, and is a list that lists the parent sets still to be explored, ordered by their heuristic score.
The second list 12 is termed the closed list, and is a list that lists the parent sets that have already been explored, together with their exact score.
The following functions are then used:
a first function score(P, X) 21, which calculates an exact score according to mutually alternative known methods, for example a BDeu (Bayesian Dirichlet equivalent uniform) score or a BIC (Bayesian information criterion) score which is exact for the set of variables P as parents of the variable X;
a second function c(Y, sk) 22, which stores in a temporary memory area or cache the score sk for a variable Y as a single parent;
a third function s(Y) 23, which retrieves from the cache the score for the variable Y as a single parent;
a fourth function pop(L) 24, which extracts the parent set with the best score from the list L;
a fifth function add(L, P, s) 25, which adds the parent set P to the list L, with the score s;
a sixth function time() 26, which returns the Boolean value true if there is still time available for computational calculation.
Knowing a target variable X and a set of candidate variables Y, the goal of the parent set identification module 20 is to find the subset of Y that yields the best scores as a parent set for X.
As a first step, an initialization module 30 calculates the exact score for each candidate variable Y as a single parent of X with score({Y}, X), adds the set and the score to the closed list 12, and saves the result in the cache memory for retrieval at a later time.
Subsequently, the initialization module 30 proceeds with the addition to the open list 10 of all the pairs of candidate variables Y1 and Y2, assigning to them the heuristic score obtained as s(Y1, X) + s(Y2, X).
At this point the open list 10 is initialized and a computing engine 32 can proceed with the execution of the main cycle, repeating the following steps until all the elements in the open list 10 have been processed, or until the time available for the computational calculation has ended.
Initially, the computing engine 32 extracts from the open list 10 the subset P with the best heuristic score, calculates the exact score for the parent set for said subset P, adding the set and the score to the closed list 12.
The computing engine 32 then proceeds by searching for all the possible expansions of P that are the result of the addition of a single candidate variable Y, after checking that the candidate Y is not already part of a subset P that has already been processed, i.e., that the set of variables P ∪ Y does not already exist in the open list 10 or in the closed list 12.
The computing engine 32 then assigns a heuristic score obtained as s(P,X) + s(Y,X) to each one of these candidate parent sets, and adds these sets to the open list 10.
Finally, a module 34 for ordering the parent set returns the content of the closed list 12, ordered according to the exact score calculated previously.
The operations performed by the identification module 20 are summarized, for greater clarity, in the following lines of pseudocode.
PARENT SET IDENTIFICATION
1: function parentSetIdentification(variable X)
2:   open ← ∅
3:   closed ← ∅
4:   for all candidate variables Y do
5:     sk ← score({Y}, X)
6:     c(Y, sk)
7:     add(closed, Y, sk)
8:   end for
9:   for all pairs of candidate variables Y1 and Y2 do
10:    h ← s(Y1, X) + s(Y2, X)
11:    add(open, {Y1, Y2}, h)
12:  end for
13:  while open ≠ ∅ and time() do
14:    P ← pop(open)
15:    sk ← score(P, X)
16:    for all candidate variables Y ∉ P such that P ∪ Y ∉ open ∪ closed do
17:      h ← sk + s(Y)
18:      add(open, P ∪ Y, h)
19:    end for
20:  end while
21:  return ordered(closed)
22: end function
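For concreteness, the following is a minimal Python sketch of the parent set identification cycle described above. It is an illustrative reading of the pseudocode, not the patented implementation: the functions score (an exact scorer such as BIC or BDeu) and time_left (the time-budget test) are assumed to be supplied by the caller, and the open list is realized here as a binary heap.

import heapq
from itertools import combinations, count

def parent_set_identification(x, candidates, score, time_left):
    # closed: explored parent sets mapped to their exact score.
    closed = {}
    # single: cached score of each candidate Y as single parent of x.
    single = {}
    for y in candidates:
        sk = score(frozenset([y]), x)
        single[y] = sk
        closed[frozenset([y])] = sk

    # The open list is kept as a max-heap (scores negated for Python's
    # min-heap); the counter breaks ties between equal scores.
    tie = count()
    open_list = []
    seen = set(closed)
    for y1, y2 in combinations(candidates, 2):
        p = frozenset([y1, y2])
        heapq.heappush(open_list, (-(single[y1] + single[y2]), next(tie), p))
        seen.add(p)

    while open_list and time_left():
        _, _, p = heapq.heappop(open_list)   # subset P with best heuristic score
        sk = score(p, x)                     # exact score for the extracted subset
        closed[p] = sk
        for y in candidates:
            if y in p:
                continue
            expanded = p | {y}
            if expanded in seen:             # already in open or closed
                continue
            # Heuristic score s(P, X) + s(Y, X) for the expansion.
            heapq.heappush(open_list, (-(sk + single[y]), next(tie), expanded))
            seen.add(expanded)

    # Content of the closed list, ordered by exact score (best first).
    return sorted(closed.items(), key=lambda kv: kv[1], reverse=True)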
The second step of the method for learning the structure of Bayesian networks according to the present invention is described hereinafter. The step of structure optimization of a Bayesian network according to the present invention is performed by means of an optimization module 40.
The optimization module 40 implements the following functions: a first function ancestor(o, p, a) 42, which, given a vector or array a of the ancestors, checks whether p is an ancestor of o; a second function update(o, p, a) 44, which updates the array a of the ancestors or descendants with the parent set p for o;
a third function score(p) 46, which calculates the score of the parent set p; a fourth function randomize(L) 48, which arranges in a random order the items of the list L;
a fifth function time() 50, which returns true if there is still time available for computational calculation.
Knowing, for each variable or node, a list of the most promising parent sets, together with their scores, the purpose of the optimization module 40 of the structure is to find and assign the best possible parent set to each variable or node.
It is noted that, according to the invention, it is necessary to ensure the acyclic nature of the graph, i.e., there can be no directed cycle in the graph, and to maximize the resulting score, which can be broken down as the sum of the scores of the individual parent sets.
In order to fully understand acyclic control, it can be useful to think in terms of ancestors of a variable, i.e., its parents, grandparents, and so forth, right down to the roots of the graph. A cycle is in fact introduced only when a variable or node is connected to one of its ancestors in the graph.
In order to ensure compliance with this constraint, i.e., in order to ensure the absence of directed cycles in the graph, the optimization module 40 can use, in a particularly effective embodiment, a bit array, in particular one bit array per variable or node. Given the variable i, the bit array for the variable i contains, for each position j, the information that indicates whether the variable j is an ancestor of the variable i in the graph or not.
The optimization module 40 repeats the same operation for the descendants of a variable, recursively for all the children of that variable. The optimization module 40 uses the array of descendants to update the array of ancestors following each assignment of a parent set, this assignment being executable by means of binary operations that are very quick in terms of execution.
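As an illustration of this bookkeeping, here is a minimal Python sketch in which each bit array is held as an arbitrary-precision integer, one per variable; the concrete representation is an assumption of the sketch, since the patent does not prescribe a data type.

def bit_positions(mask):
    # Yield the positions of the set bits of an integer bit array.
    pos = 0
    while mask:
        if mask & 1:
            yield pos
        mask >>= 1
        pos += 1

def is_ancestor(i, j, anc):
    # True if variable j is marked as an ancestor of variable i.
    return (anc[i] >> j) & 1 == 1

def assign_parent_set(x, parents, anc, desc):
    # New ancestors of x: the chosen parents and all their ancestors.
    new_anc = 0
    for p in parents:
        new_anc |= (1 << p) | anc[p]
    # x and every descendant of x acquire these ancestors.
    for i in [x] + list(bit_positions(desc[x])):
        anc[i] |= new_anc
    # Every new ancestor acquires x and x's descendants as descendants.
    new_desc = (1 << x) | desc[x]
    for j in bit_positions(new_anc):
        desc[j] |= new_desc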
The main cycle of the method according to the invention can apply the Monte Carlo method, which is known in the background art, to the possible orders of the variables or nodes. At each iteration, the optimization module 40 performs a random permutation, selecting a new order.
Following the selection of this order, the optimization module 40 selects, for each variable, the parent sets that have the best score and which at the same time do not introduce cycles. At the end of the iteration, the optimization module 40 compares the resulting structure, i.e., the choice of a parent set for each variable or node, with the best structure identified up to that moment. When the time available has ended, the optimization module 40 returns that structure with the best score that has been found.
The optimization module 40 deems acceptable a parent set p for a variable x if none of the parent variables is an ancestor of x, i.e., the parent set p does not introduce directed cycles into the graph. This check can be performed rapidly where implemented by means of binary operations on an array of ancestors.
When the optimization module 40 has completed the choice of the parent set, it deals with updating the arrays of the ancestors and descendants.
For this purpose, the optimization module 40 sets, in the ancestor array of the variable x and in those of its descendants, the bits of every parent in p and of its ancestors. It then sets, in the descendant array of every parent in p and in those of its ancestors, the bits of x and of its descendants. In this case also, these operations can be performed rapidly by means of binary operations.
The operations performed by the optimization module 40 are summarized for greater clarity in the following lines of pseudocode.
STRUCTURE OPTIMIZATION
1: function search(parent sets P)
2:   order ← 1:N
3:   bestStructure ← nil
4:   bestScore ← −∞
5:   while time() do
6:     randomize(order)
7:     structure ← empty
8:     score ← 0
9:     for all o ∈ order do
10:      p ← bestParent(o, P, ancestors, descendants)
11:      structure(o) ← p
12:      score += score(p)
13:    end for
14:    if score > bestScore then
15:      bestScore ← score
16:      bestStructure ← structure
17:    end if
18:  end while
19:  return bestStructure
20: end function

1: function bestParent(variable o, parent sets P, ancestors array a, descendants array d)
2:   best ← ∅
3:   for all parentSet ∈ P(o) do
4:     admissible ← true
5:     for all parent ∈ parentSet do
6:       if ancestor(o, parent, a) then
7:         admissible ← false
8:       end if
9:     end for
10:    if admissible then
11:      best ← parentSet
12:      break
13:    end if
14:  end for
15:  update(o, best, a)
16:  update(o, best, d)
17:  return best
18: end function
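Finally, a minimal Python sketch of the Monte Carlo cycle over variable orders, reusing is_ancestor and assign_parent_set from the earlier sketch. The names parent_sets (candidate sets per node, best score first, as frozensets) and scores are assumptions made for illustration; the real module operates on the output of the parent set identification step.

import random
import time

def search(parent_sets, scores, n, budget_seconds):
    deadline = time.monotonic() + budget_seconds
    best_structure, best_score = None, float("-inf")
    order = list(range(n))
    while time.monotonic() < deadline:
        random.shuffle(order)            # new random permutation of the nodes
        anc = [0] * n                    # ancestor bit arrays
        desc = [0] * n                   # descendant bit arrays
        structure, total = {}, 0.0
        for o in order:
            # First admissible candidate: no member may be an ancestor of o.
            # (In practice the empty parent set, always admissible, is
            # assumed to be among the candidates, so a choice always exists.)
            for ps in parent_sets[o]:
                if all(not is_ancestor(o, p, anc) for p in ps):
                    structure[o] = ps
                    total += scores[(o, ps)]
                    assign_parent_set(o, ps, anc, desc)
                    break
        if total > best_score:
            best_score, best_structure = total, dict(structure)
    return best_structure, best_score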
In practice it has been found that the invention fully achieves the intended aim and objects. In particular, it has been found that the device and the method for learning the structure of Bayesian networks from a complete data set thus conceived make it possible to overcome the qualitative limitations of the background art, since they make it possible to identify a structure of a Bayesian network of good quality, i.e., as optimized as possible and with a maximized score, while maintaining computing time at a modest level.
Another advantage of the device and the method for learning the structure of Bayesian networks from a complete data set is that they make it possible to define beforehand the computing time available for identifying a structure of a Bayesian network of good quality.
A further advantage of the device and of the method for learning the structure of Bayesian networks from a complete data set is that they can be applied easily to massive data sets with a large number of variables, for example financial data, commodity data, weather data, webpage access data, and so forth.
Although the device and the method for learning the structure of Bayesian networks from a complete data set according to the invention have been conceived in particular for learning the structure of Bayesian networks with a large number of variables, which can be used for the analysis and correlation of large quantities of data, they can in any case be used more generally for learning the structure of Bayesian networks with any number of variables, even fewer than ten variables if necessary.
The invention thus conceived is susceptible of numerous modifications and variations, all of which are within the scope of the inventive concept. All the details may furthermore be replaced with other technically equivalent elements.
In practice, the materials used, as well as the contingent shapes and dimensions, may be any according to the requirements and the state of the art.
To conclude, the scope of protection of the claims must not be limited by the illustrations or by the preferred embodiments presented in the description as examples, but rather the claims must comprise all the characteristics of patentable novelty that reside within the present invention, including all the characteristics that would be treated as equivalent by the person skilled in the art.
Where technical features mentioned in any claim are followed by reference signs, those reference signs have been included for the sole purpose of increasing the intelligibility of the claims and accordingly such reference signs do not have any limiting effect on the interpretation of each element identified by way of example by such reference signs.

Claims

1. A device (1) for learning the structure of a Bayesian network related to a plurality of variables, each of said variables being able to assume a plurality of finite states, which comprises:
- an initialization module (30) configured to calculate exact scores in the relationships between each one of said variables and the remaining variables;
- a computing engine (32) configured to calculate heuristic scores in the relationships of each one of said variables with respect to the pairs of said remaining variables, to select parent subsets having highest score, and to calculate heuristic scores for said parent subsets integrated with the score for one of said remaining variables, thus generating new subsets;
- an ordering module (34) configured to order said parent subsets into a list, ordered by score;
- an optimization module (40) configured to generate iteratively, for a selectable period of time, a random sequence of the items that are present in said list of said parent subsets, and to calculate an overall score for the parent set that covers the entire set of said variables, while holding the parent set to which the highest score corresponds.
2. The device (1) for learning the structure of a Bayesian network according to claim 1, characterized in that said exact scores are calculated by means of the BDeu, Bayesian Dirichlet equivalent uniform, method or the BIC, Bayesian information criterion, method or other methods that operate according to the same principles.
3. A method for learning the structure of a Bayesian network related to a plurality of variables, each one of said variables being able to assume a plurality of finite states, comprising the steps that consist in:
- initializing the exact scores in the relationships between each one of said variables and the remaining variables;
- calculating the heuristic scores in the relationships of each one of said variables with respect to the pairs of said remaining variables, in order to select parent subsets having highest score;
- calculating the heuristic scores for said parent subsets integrated with the score for one of said remaining variables, thus generating new subsets;
- ordering said parent subsets into a list, ordered by score;
- optimizing the structure of the Bayesian network by generating iteratively, for a selectable period of time, a random sequence of the items that are present in said list of said parent subsets, and calculating an overall score for the parent set that covers the entire set of said variables, while holding the parent set to which the highest score corresponds.
4. The method for learning the structure of a Bayesian network according to claim 3, characterized in that said exact scores are calculated by means of the BDeu, Bayesian Dirichlet equivalent uniform, method or the BIC, Bayesian information criterion, method or other methods that operate according to the same principles.
PCT/IB2016/056512 2015-10-29 2016-10-28 Learning of the structure of bayesian networks from a complete data set WO2017072717A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CH01583/15 2015-10-29
CH01583/15A CH711716A1 (en) 2015-10-29 2015-10-29 Learning the structure of Bayesian networks from a complete data set

Publications (1)

Publication Number Publication Date
WO2017072717A1 true WO2017072717A1 (en) 2017-05-04

Family

ID=57543086

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2016/056512 WO2017072717A1 (en) 2015-10-29 2016-10-28 Learning of the structure of bayesian networks from a complete data set

Country Status (2)

Country Link
CH (1) CH711716A1 (en)
WO (1) WO2017072717A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117421565A (en) * 2023-12-18 2024-01-19 中国人民解放军国防科技大学 Markov blanket-based equipment assessment method and device and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090307160A1 (en) * 2008-06-09 2009-12-10 Microsoft Corporation Parallel generation of a bayesian network
US20100198761A1 (en) * 2009-01-30 2010-08-05 Meng Teresa H Systems, methods and circuits for learning of relation-based networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7324981B2 (en) * 2002-05-16 2008-01-29 Microsoft Corporation System and method of employing efficient operators for Bayesian network search
JP5135831B2 (en) * 2007-03-15 2013-02-06 富士ゼロックス株式会社 Computing device
US20120185424A1 (en) * 2009-07-01 2012-07-19 Quantum Leap Research, Inc. FlexSCAPE: Data Driven Hypothesis Testing and Generation System
US9864953B2 (en) * 2013-05-30 2018-01-09 President And Fellows Of Harvard College Systems and methods for Bayesian optimization using integrated acquisition functions
US20150142709A1 (en) * 2013-11-19 2015-05-21 Sikorsky Aircraft Corporation Automatic learning of bayesian networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090307160A1 (en) * 2008-06-09 2009-12-10 Microsoft Corporation Parallel generation of a bayesian network
US20100198761A1 (en) * 2009-01-30 2010-08-05 Meng Teresa H Systems, methods and circuits for learning of relation-based networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MALONE BRANDON ET AL: "A Depth-First Branch and Bound Algorithm for Learning Optimal Bayesian Networks", 3 August 2013, NETWORK AND PARALLEL COMPUTING; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CHAM, PAGE(S) 111 - 122, ISBN: 978-3-642-28938-5, ISSN: 0302-9743, XP047267119 *
MARC TEYSSIER ET AL: "Ordering-Based Search: A Simple and Effective Algorithm for Learning Bayesian Networks", CORR (ARXIV), no. arXiv:1207.1429, 4 July 2012 (2012-07-04), pages 1 - 7, XP055348338 *
MAURO SCANAGATTA ET AL: "Learning Bounded Treewidth Bayesian Networks with Thousands of Variables", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS (NIPS 2015), vol. 28, 7 December 2015 (2015-12-07), pages 1 - 9, XP055348145 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117421565A (en) * 2023-12-18 2024-01-19 中国人民解放军国防科技大学 Markov blanket-based equipment assessment method and device and computer equipment
CN117421565B (en) * 2023-12-18 2024-03-12 中国人民解放军国防科技大学 Markov blanket-based equipment assessment method and device and computer equipment

Also Published As

Publication number Publication date
CH711716A1 (en) 2017-05-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16810041

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16810041

Country of ref document: EP

Kind code of ref document: A1