bnlearn - Learning Bayesian Networks from Incomplete Data.

A quick introduction

In this lecture we will introduce how to model data containing missing values with Bayesian networks. Missing values can arise in different ways, which were originally codified by Rubin. Bayesian networks give graphical form to Rubin's work using a directed graph where the variables and the missingness mechanisms are represented as nodes and relationships as arcs. The two key points of the lecture will be:

Reviewing Rubin's classification of missing data and its graphical representation following the work of Mohan and Pearl.
Adapting parameter learning and structure learning to learn Bayesian networks from incomplete data using techniques like the Expectation-Maximisation algorithm.

Bayesian networks

Definitions

Bayesian networks (BNs)¹ are probabilistic graphical models defined by:

a network structure, a directed acyclic graph (DAG) in which each node corresponds to a random variable ;
a global probability distribution over (with parameters ), which can be factorised into smaller local probability distributions according to the arcs present in the graph.

The main role of the network structure is to express the conditional independence relationships among the variables in the model through graphical separation: nodes that are not connected by an arc are independent conditional on some other nodes. In other words,

graphical separation implies causation

Nodes that are connected by an arc cannot be graphically separated, and therefore they depend directly on each other in probability.

The most important consequence of this property is the factorisation of the global distribution of :

local distributions

Each local distribution has its own parameter set Theta Xi ; and all Theta Xi is much smaller than Theta because many parameter are fixed by the fact that the variables they belong to are independent. This makes it possible to work with a high-dimensional distribution through a set of much smaller and simpler distributions whose dimensions depends only on the respective number of parents P Xi .

The first component of a BN is a DAG. Graphically, it looks like this: all arcs have directions, and no cycles are allowed.

> string = "[X1][X2][X3|X1:X2][X4|X3][X5|X3][X6|X4:X5]"
> sample.dag = model2network(string)
> graphviz.plot(sample.dag, shape = "rectangle")

And the implication is that:

decomposition

The second component of a BN is the probability distribution Prob X . The choice should be such that the BN:

can be learned efficiently from data;
is flexible (it can encode a reasonable variety of phenomena);
is easy to query to perform inference.

The three most common choices in the literature and in practical applications (by far) are:

Discrete BNs, in which

and the

are multinomial. The parameters Theta Xi

of the local distributions are the conditional probabilities pi i k j

> dag = model2network("[X1][X2][X3|X2:X1]")
> cptX1 = matrix(c(0.53, 0.47), ncol = 2, dimnames = list(NULL, c("a", "b")))
> cptX2 = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("c", "d")))
> cptX3 = c(0.15, 0.85, 0.4, 0.6, 0.75, 0.25, 0.80, 0.20)
> dim(cptX3) = c(2, 2, 2)
> dimnames(cptX3) = list("X3" = c("e", "f"), "X1" = c("a", "b"), "X2" = c("c", "d"))
> bn = custom.fit(dag, list("X1" = cptX1, "X2" = cptX2, "X3" = cptX3))
> bn


  Bayesian network parameters

  Parameters of node X1 (multinomial distribution)

Conditional probability table:

   a    b
0.53 0.47

  Parameters of node X2 (multinomial distribution)

Conditional probability table:

   c    d
0.34 0.66

  Parameters of node X3 (multinomial distribution)

Conditional probability table:

, , X2 = c

   X1
X3     a    b
  e 0.15 0.40
  f 0.85 0.60

, , X2 = d

   X1
X3     a    b
  e 0.75 0.80
  f 0.25 0.20

Gaussian BNs (GBNs), in which

follows a multivariate normal distribution and the local distributions

are defined by the linear regression models for each

against its parents Pi Xi

regression

> bn = custom.fit(dag, list(
+   "X1" = list(coef = c("(Intercept)" = 2.4), sd = sqrt(0.8)),
+   "X2" = list(coef = c("(Intercept)" = 1.8), sd = sqrt(0.6)),
+   "X3" = list(coef = c("(Intercept)" = 2.1, "X1" = 1.2, "X2" = -0.4), sd = sqrt(0.9))
+ ))
> bn


  Bayesian network parameters

  Parameters of node X1 (Gaussian distribution)

Conditional density: X1
Coefficients:
(Intercept)
        2.4
Standard deviation of the residuals: 0.8944272

  Parameters of node X2 (Gaussian distribution)

Conditional density: X2
Coefficients:
(Intercept)
        1.8
Standard deviation of the residuals: 0.7745967

  Parameters of node X3 (Gaussian distribution)

Conditional density: X3 | X1 + X2
Coefficients:
(Intercept)           X1           X2
        2.1          1.2         -0.4
Standard deviation of the residuals: 0.9486833

Conditional linear Gaussian BNs (CLGBNs or CGBNs), in which

is a mixture of multivariate normal distributions and the

are either multinomial, univariate normal or mixtures of normals.

Discrete are only allowed to have discrete parents (denoted ), are assumed to follow a multinomial distribution parameterised with conditional probability tables.
Continuous are allowed to have both discrete and continuous parents (denoted , ), and their local distributions are $normal$ which can be written as a mixture of linear regressions $mixture$ against the continuous parents with one component for each configuration of the discrete parents. If has no discrete parents, the mixture reverts to a single linear regression.

> bn = custom.fit(dag, list(
+   "X1" = list(coef = c("(Intercept)" = 3.1), sd = sqrt(0.9)),
+   "X2" = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("a", "b"))),
+   "X3" = list(coef = matrix(c(1:4), ncol = 2), sd = sqrt(c(0.4, 0.3)))
+ ))
> bn


  Bayesian network parameters

  Parameters of node X1 (Gaussian distribution)

Conditional density: X1
Coefficients:
(Intercept)
        3.1
Standard deviation of the residuals: 0.9486833

  Parameters of node X2 (multinomial distribution)

Conditional probability table:

   a    b
0.34 0.66

  Parameters of node X3 (conditional Gaussian distribution)

Conditional density: X3 | X1 + X2
Coefficients:
             0  1
(Intercept)  1  3
X1           2  4
Standard deviation of the residuals:
        0          1
0.6324555  0.5477226
Discrete parents' configurations:
   X2
0   a
1   b

Learning

Model selection and model estimation are collectively known as learning, in the case of BNs. They are usually performed as a two-step process:

structure learning, learning the structure of the DAG from the data;
parameter learning, learning the local distributions implied by the DAG learned in the previous step.

This workflow is Bayesian: given a data set and a BN , we have

structure and parameter learning

Structure learning can be done in practice as

marginal likelihood

where we use Bayes' theorem to move from graph probability to loglikelihood because the latter is much easier to define. Assuming that the marginal likelihood decomposes along local distributions,

local marginal likelihood

which can be computed efficiently. This approach is called score-based structure learning.

Once we have learned in structure learning, we know the parents of each node and therefore we can identify the local distributions . This makes it possible to estimate their parameter sets Theta Xi independently from each other. Assuming that is sparse, each . involves only few variables and thus estimating its parameters is computationally simple.

In practice, this means that we first use some structure learning algorithm to learn the DAG from the data,

> dag = hc(data, score = "bde")

using the marginal likelihood above (which is available in closed form for discrete BNs, GBNs and CLGBNs) or some other goodness-of-fit score like BIC.

> hc.dag = hc(data, score = "bic")
> graphviz.plot(hc.dag, shape = "rectangle")

plot of chunk example_of_structure_learning_ctd_2

We then take the hc.dag together with the data and use it for parameter learning with some estimator (maximum likelihood, Bayesian posterior or something else).

> fitted = bn.fit(hc.dag, data, method = "mle")
> fitted


  Bayesian network parameters

  Parameters of node X1 (multinomial distribution)

Conditional probability table:
      a      b      c
0.3336 0.3340 0.3324

  Parameters of node X2 (multinomial distribution)

Conditional probability table:

   X1
X2           a          b          c
  a 0.85611511 0.44491018 0.11492178
  b 0.02517986 0.22095808 0.09446450
  c 0.11870504 0.33413174 0.79061372

  Parameters of node X3 (multinomial distribution)

Conditional probability table:
      a      b      c
0.7434 0.2048 0.0518

  Parameters of node X4 (multinomial distribution)

Conditional probability table:

, , X3 = a

   X1
X4           a          b          c
  a 0.80081301 0.09251810 0.10530547
  b 0.09024390 0.80209171 0.11173633
  c 0.10894309 0.10539019 0.78295820

, , X3 = b

   X1
X4           a          b          c
  a 0.18079096 0.88304094 0.24695122
  b 0.13276836 0.07017544 0.49390244
  c 0.68644068 0.04678363 0.25914634

, , X3 = c

   X1
X4           a          b          c
  a 0.42857143 0.34117647 0.13333333
  b 0.20238095 0.38823529 0.44444444
  c 0.36904762 0.27058824 0.42222222


  Parameters of node X5 (multinomial distribution)

Conditional probability table:

, , X6 = a

   X2
X5           a          b          c
  a 0.80524979 0.20588235 0.11937378
  b 0.09737511 0.17973856 0.11448141
  c 0.09737511 0.61437908 0.76614481

, , X6 = b

   X2
X5           a          b          c
  a 0.40050804 0.31679389 0.23759542
  b 0.49026249 0.36641221 0.50667939
  c 0.10922947 0.31679389 0.25572519


  Parameters of node X6 (multinomial distribution)

Conditional probability table:
      a      b
0.5018 0.4982

For structure learning, we can consider using conditional independence tests instead of maximising a goodness-of-fit score. BNs are defined such that graphical separation between nodes in the DAG implies (conditional or marginal) independence in the probability distributions of the corresponding variables. But this means that we can use suitable statistical tests to assess (conditional or marginal) independence from data and insert (or not) arcs in the DAG as appropriate. This approach is called constraint-based structure learning.

> pc.dag = pc.stable(data, test = "x2")
> graphviz.plot(pc.dag, shape = "rectangle")

plot of chunk example_of_constraint_based

We will discuss in a minute the reason why the graph learned by the PC algorithm (or that would be learned by any other constraint-based algorithm, for that matter) is only partially directed.

Inference

Inference on BNs usually consists of conditional probability (CP) or maximum a posteriori (MAP) queries. The general idea of a CP query is:

we have some evidence, that is, we know the values of some variables and we fix the nodes accordingly; and
we want to look into the probability of some event involving (a subset of) the other variables conditional on the evidence we have.

Say we start from a discrete BN with the DAG used above.

Graphically a CP query looks like this (evidence node in red, event of interest in blue):

On the other hand, the goal of a MAP query is to find the combination of values for (a subset of) the variables in the network that has the highest probability given some evidence. If the evidence is a partially-observed new individual, then performing a MAP query amounts to a classic prediction task. Graphically, a MAP query looks like this (evidence nodes in red, variables of interest in blue):

This is more computationally challenging than it sounds because MAP queries do not decompose along local distributions.

As an example, performing a CP query using the gRain package with a BN from bnlearn involves the following steps:

> library(gRain)
> # 1) Export the BN from bnlearn to gRain, which transforms it into a junction tree.
> junction.tree = as.grain(sample.bn)
> # 2) Introduce the evidence in the junction tree, modifying the distribution of X2.
> with.evidence = setEvidence(junction.tree, nodes = "X2", states = "b")
> # 3) Query about the distribution of the node containing the event of interest.
> querygrain(with.evidence, node = "X6")

$X6
X6
        a         b
0.4993782 0.5006218

A MAP query goes further and finds the highest probability in the joint distribution of the variables of interest (which is different from the maxima found in the respective marginal distribution, in general). The values of the variables associated with that probabilities are the answer to the MAP query.

> # 2) Introduce the evidence in on X1, X4 and X6 the junction tree.
> with.evidence = setEvidence(junction.tree, nodes = c("X1", "X4", "X6"),
+                             states = c("a", "c", "b"))
> # 3) Query about the distribution of the nodes of interest X2 and X5
> pdist = querygrain(with.evidence, nodes = c("X2", "X5"), type = "joint")
> # 4) Find the configuration of values with the maximum probability.
> which.max = which(pdist == max(pdist), arr.ind = TRUE)
> which.max

  X2 X5
a  1  2

> rownames(pdist)[which.max[, 1]]

[1] "a"

> colnames(pdist)[which.max[, 2]]

[1] "b"

Both types of queries can be computed by:

Applying exact inference algorithms to the BN, changing its data structure into a form (the “junction tree”) that allows for the efficient computation of marginal, joint and conditional probabilities.
Generating random samples form the BN and using them with Monte Carlo algorithms to implement approximate inference.

Exact inference is more computationally demanding on BNs with many nodes and arcs than approximate inference, but approximate inference is affected by stochastic noise from the random samples generation process.

Equivalence classes

BNs use directed acyclic graphs to represent the probabilistic relationships between the data. By default, that is all they can do: they are not inherently causal models even though their arcs have a direction that could be interpreted as a cause-effect relationship. BNs are probabilistic graphical models that represent probabilistic dependencies in graphical form: probabilistic dependence is a symmetric relationship because

conditioning is symmetric

which in turn depends on Bayes' theorem,

Bayes theorem

As a result, we can represent the same probability distribution with different DAGs. Such DAGs are said to belong to the same equivalence class because they are indistinguishable without the use of external information. Consider the following example:

> dag1 = model2network("[X1][X2|X1][X3|X1][X4|X1]")
> dag2 = model2network("[X1|X4][X2|X1][X3|X1][X4]")
> par(mfrow = c(1, 2))
> graphviz.compare(dag1, dag2, diff = "none", shape = "rectangle")

We can apply Bayes' theorem and show that the decompositions into local probabilities in dag1 and dag2 are in fact equivalent:

identical CPDAGs

In other words, the directions of some arcs are not identifiable even though we can tell that the nodes are connected by an arc. The graphical representation of this idea is the completed partially-directed acyclic graph (CPDAG) in which only the arcs whose directions can be uniquely identified are directed.

> par(mfrow = c(1, 2))
> graphviz.plot(cpdag(dag1), shape = "rectangle")
> graphviz.plot(cpdag(dag2), shape = "rectangle")

plot of chunk equivalence_class_representation

Formally, CPDAGs have two key properties:

The CPDAG is defined by the skeleton of the DAG (the underlying undirected graph) and by the v-structures (, where and are not connected by an arc). The arcs belonging to the v-structures and those that would introduce cycles/other v-structures in the CPDAG are the only directed arcs; all other arcs in the CPDAG have no direction.
Each CPDAG is the unique representation of an equivalence class. In other words, DAGs that can be transformed into the same CPDAG belong to the same equivalence class.

Furthermore, DAGs in the same equivalence class are score equivalent: we cannot select one member of the equivalence class over another in structure learning (without the use of external information) because goodness-of-fit scores will assign the same value to both of them. After all, all DAGs in the same equivalence class represent the same probability distribution and the goodness-of-fit scores are a function of that probability distribution. For instance:

> score(dag1, data[, nodes(dag1)], type = "bic")

[1] -17074.39

> score(dag2, data[, nodes(dag2)], type = "bic")

[1] -17074.39

This suggests that we should always compare the structure of BNs through their CPDAGs to be able to assess whether they are effectively different or not.

> cpdag1 = cpdag(dag1)
> cpdag2 = cpdag(dag2)
> unlist(compare(cpdag1, cpdag2))

tp fp fn
 3  0  0

> all.equal(cpdag1, cpdag1)

[1] TRUE

Sometimes we need the inverse operation: we have a CPDAG and we need to find out one of the DAGs that are part of the equivalence class that the CPDAG represents. This operation is called finding a consistent extension of the CPDAG.

> cextend(cpdag1)


  Random/Generated Bayesian network

  model:
   [X1][X2|X1][X3|X1][X4|X1]
  nodes:                                 4
  arcs:                                  3
    undirected arcs:                     0
    directed arcs:                       3
  average markov blanket size:           1.50
  average neighbourhood size:            1.50
  average branching factor:              0.75

  generation algorithm:                  Empty

Causal interpretation of BNs

What do we need to do to learn BNs as causal network models, that is, a graphical model in which arcs represent direct cause-effect relationship? Firstly, we need to make some assumptions:

Causal Markov assumption: Each variable is conditionally independent of its non-effects, both direct and indirect, given its direct causes. This represents the causal equivalent of the decomposition into local distributions.
Faithfulness: There must exist a DAG which is faithful to the probability distribution of , so that the only dependencies are those arising from d-separation in the DAG.
Causal sufficiency: There must be no latent variables (unobserved variables influencing the variables in the network) acting as confounding factors because they are parents to at least two observed variables. Such variables may induce spurious correlations between the observed variables, thus introducing bias in the causal network.

The last assumption is really a corollary of the first two: the presence of unobserved variables violates the faithfulness assumption (because the network structure cannot include them) and possibly the causal Markov property (if an arc is wrongly added between the observed variables due to the influence of the latent one). As an example, consider the following BN from which generate a sample of 100 observations.

> causal.bn = custom.fit(model2network("[X1][X2|X1][X3|X1]"),
+   list("X1" = list(coef = c("(Intercept)" = 0), sd = 1),
+        "X2" = list(coef = c("(Intercept)" = 0, "X1" = 3), sd = 0.5),
+        "X3" = list(coef = c("(Intercept)" = 0, "X1" = 2), sd = 0.5))
+ )
> graphviz.plot(causal.bn, shape = "rectangle")

plot of chunk causal_network_and_latent_variables

> complete.data = rbn(causal.bn, 100)

In this network, we can observe the following graphical separations:

> dsep(causal.bn, "X2", "X3")

[1] FALSE

> dsep(causal.bn, "X2", "X3", "X1")

[1] TRUE

We can successfully to re-learn the DAG of causal.bn from the complete.data because all variables are observed. However, if we omit , we fail to re-learn the DAG of causal.bn because (now a latent variable) is a parent of both and (observed variables). Violating the causal sufficiency assumption results in learning a spurious arc X2 to X3 that is not present in causal.bn.

> relearn = hc(complete.data)
> modelstring(relearn)

[1] "[X1][X2|X1][X3|X1]"

> dsep(relearn, "X2", "X3")

[1] FALSE

> dsep(relearn, "X2", "X3", "X1")

[1] TRUE

> relearn = hc(complete.data[, c("X2", "X3")])
> modelstring(relearn)

[1] "[X2][X3|X2]"

> dsep(relearn, "X2", "X3")

[1] FALSE

Even if these assumptions are satisfied, we may not be able to completely learn the BN as a causal network: we are using probabilistic tools (goodness-of-fit scores or conditional independence tests) that can only identify equivalence classes of BNs. In order to identify the remaining arc directions, we need external information that typically comes from:

Longitudinal (panel) data: causality can only follow the arrow of time. Any arc between a node associated with a given time point and a node associated with a later time point can only go from the former to the latter.
Experimental (randomised) data: data that are collected following an experimental design (as opposed to observational data that are just recorded are they arise) are such that arcs can only point outwards from the randomised variables.
Domain experts: information about arc directions may be either collected by interviewing experts in the domain the data come from, or by consulting the literature to find relevant results from previous investigations.

Missing data

Rubin², Little and Rubin³ have defined the key properties of missing data. We start from:

the complete data of all samples, which we do not observe;
the missingness mechanism (also known as the pattern of missingness) which determines which variables we actually observe for each sample;
the incomplete data, in which the values of some variables are replaced with NA for some samples to indicate that they are missing. This is what we actually observe.

In other words, the complete data are reduced to the incomplete data by the missingness mechanism. The statistical analysis of incomplete data aims to recover the probabilistic model underlying the complete data from the incomplete data by taking into account the missingness mechanism. Not taking into account the missingness mechanism would result in a biased model.

What types of missingness mechanisms are possible? Rubin defined three:

Missing Completely at Random (MCAR): the probability of the value of a variable being missing is the same for all samples because the causes of the missing data are unrelated to the data. In other words, the missingness mechanism does not depend on the variables in the data.
Missing at Random (MAR): the probability of being missing is the same only within groups defined by the observed data. In other words, the missingness mechanism depends on the variables that are completely observed in the data. MCAR implies MAR.
Missing Not at Random (MNAR): the probability of being missing varies for reasons that are unknown to us. In other words, the missingness mechanism depends on variables we do not observe completely or at all. Any missingness mechanism that is not MCAR or MAR is MNAR.

Formally, a typical notation is

for the observed data;
for the missing data;
for the complete data;
is a zero-one matrix with the same dimensions as : each cell is equal to zero if the corresponding cell in the data is a missing value, or it is equal to one if the corresponding cell in the data is observed.

If the data contain no missing values, then D= D O and we can model through D O in a straightforward way. If the data are MCAR, then

does not depend on D O or D M

if the data are MAR, then

does not depend on D M

if the data are MNAR, then

it does not simplify

Consider the following numerical example (from van Buuren⁴) involving two variables and that jointly follow a bivariate normal distribution.

> library(MASS)
> set.seed(80122)
> n = 300
> Y = mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))

Say that only contains missing values and that the missingness mechanism is defined by

general missingness mechanism

If gamma MCAR then

missingness mechanism

and the missingness mechanism does not depend on and at all. If gamma MAR , then

missingness mechanism

and the missingness mechanism depends on which is completely observed. This allows for modelling the missingness mechanism from the data we observe. Finally, if gamma MNAR then

missingness mechanism

and the missingness mechanism depends on which is not completely observed.

> missingness.mechanism = function(Y, gamma) {

+   gamma[1] + exp(Y[, 1]) / (1 + exp(Y[, 1])) * gamma[2] +
+     exp(Y[, 2]) / (1 + exp(Y[, 2])) * gamma[3]

+ }#MISSINGNESS.MECHANISM
> 
> R.MCAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0.5, 0, 0)))
> R.MAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 1, 0)))
> R.MNAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 0, 1)))

If we try to estimate the joint distribution of and under MCAR, that is, when we assume that the missingness mechanism is MCAR, we obtain means that are close to zero for both and . Their correlation is close to 0.5 as well.

> Y.MCAR = Y
> Y.MCAR[as.logical(R.MCAR), 2] = NA
> colMeans(Y.MCAR, na.rm = TRUE)

[1]  0.04083247 -0.03533149

> cor(Y.MCAR, use = "pairwise.complete")

          [,1]      [,2]
[1,] 1.0000000 0.4249437
[2,] 0.4249437 1.0000000

plot of chunk missingness_numeric_example_3_plot

The same is true under MAR.

> Y.MAR = Y
> Y.MAR[as.logical(R.MAR), 2] = NA
> colMeans(Y.MAR, na.rm = TRUE)

[1] 0.04083247 0.26461012

> cor(Y.MAR, use = "pairwise.complete")

          [,1]      [,2]
[1,] 1.0000000 0.5011879
[2,] 0.5011879 1.0000000

plot of chunk missingness_numeric_example_4_plot

However, under MNAR the estimates of the mean of and of the correlation between and are both heavily biased. Intuitively, the bias is unavoidable since it is impossible to correctly model a MNAR missingness mechanism that depends on information we do not have (the values missing in ).

> Y.MNAR = Y
> Y.MNAR[as.logical(R.MNAR), 2] = NA
> colMeans(Y.MNAR, na.rm = TRUE)

[1] 0.04083247 0.43020494

> cor(Y.MNAR, use = "pairwise.complete")

          [,1]      [,2]
[1,] 1.0000000 0.2672833
[2,] 0.2672833 1.0000000

plot of chunk missingness_numeric_example_5_plot

The distributions of D O and D M become more and more different as we move from MCAR to MAR to MNAR.

plot of chunk missingness_numeric_example_bwplot

This behaviour is related to ignorability, which formalises the idea that the distribution of the data is the same whether we observe or not:

same distribution

As a result, we can model the distribution of D M from D O (equivalently, the distribution D given D O and R = 1 ) and use this distribution to work around the missing data. For instance, we can impute them, that is, we can replace missing values with point estimates computed from the observed values.

MCAR is ignorable because

it simplfies

since is independent from both D O and D M and . MAR is also ignorable: the definition of MAR is

R does not depend on D M

and then

derivations

The missing data D M do not depend on , so their distribution (and by extension that of ) is the same for R = 0 and R = 1 .

MNAR is not ignorable: we need to make assumptions on and incorporate it in our models to work around the missing data.

Bayesian networks for missing data

Working with a causal network in the presence of unknown latent variables is difficult because of the issues with learning the structure and the parameters from data. We can partially address this issue by including the latent variables in the network: we define a new DAG G star over O union U where is the set of the fully observed variables and is the set of the completely unobserved variables (that is, the latent variables). Now that the latent variables are included in the network, we can model them and prevent confounding: we add them as columns full of NA to the data we use for learning.

When we have variables that are only partially observed, missing values are present along with observed values and we need to augment the causal network again to model them correctly. In order to do that, Mohan and Pearl define a missingness graph⁵ to extend the ideas of Rubin. A missingness graph is a causal graph where the vertices in are partitioned into five disjoint subsets:

the fully observed nodes ;
the completely unobserved nodes ;
the partially observed variables ;
the proxy variables to the partially observed variables, with the values that are actually observed;
the missingness indicators .

The relationship between , and is as follows:

The variables are the variables for which we would have liked to observe values for all observations.
The variables are the variables which we actually (partially) observe.
The variables are binary indicator variables that control whether the value of is observed for each observation in or a missing value NA is observed instead.

Formally, this means

Si = mi or NA

Since missingness graphs are causal graphs, they can be queried graphically for independencies between variables in the observed-data distribution Prob O S R by using d-separation. The corresponding underlying distribution with no missing values is Prob O M R .

MCAR, MAR and MNAR result in different independence statements which can be assessed through d-separation. Therefore, we can link them to the independency statements they imply in the missingness graph:

MCAR implies : the missingness is random and independent from the fully observed and the partially observed variables.
MAR implies : missingness is random and depends only on the fully observed variables;
MNAR if neither MCAR nor MAR.

The approaches for learning BNs as causal networks or as probabilistic networks from incomplete data have reviewed here.⁶

Consider again the two-variable example we used to illustrate MCAR, MAR and MNAR. As a BN, it would have the following DAG since and are correlated. If the data are completely observed, that is the correct model for the data.

> graphviz.plot(model2network("[Y2][Y1|Y2]"), shape = "rectangle")

Assume now that the data are incomplete and MCAR. The variable is fully observed, so Y1 in O . The variable is only partially observed: in the missingness graph, we replace it with the nodes:

, the mix of values and NAs that we actually observe for the samples.
, the values we would observe for for the samples if there were no missing values. They are identical to the corresponding values in if .
, the binary indicators that determine whether the value of is observed (or not) for each sample.

To obtain the correct model for MCAR, should not have any parents. As for the arc Y2 to Y1 , we replace it with M2 to Y1 to make this directed graph a direct extension of the previous one.

> missingness.graph = model2network("[S2|R2:M2][R2][M2][Y1|M2]")
> graphviz.plot(missingness.graph, shape = "rectangle")

The independency statements of MCAR are satisfied:

> # This checks that O is d-separated from R.
> dsep(missingness.graph, "Y1", "R2")

[1] TRUE

> # This checks that M is d-separated from R.
> dsep(missingness.graph, "M2", "R2")

[1] TRUE

And since MCAR implies MAR, the independency statements of MAR are also satisfied:

> # This checks that M is d-separated from R given O.
> dsep(missingness.graph, "M2", "R2", "Y1")

[1] TRUE

In the MAR case, we had earlier that the missingness mechanism depended on which is a completely-observed variable: then we should add the arc Y1 to R2 to the missingness graph to represent it.

> missingness.graph = model2network("[S2|R2:M2][R2|Y1][M2][Y1|M2]")
> graphviz.plot(missingness.graph, shape = "rectangle")

The independency statements of MAR are satisfied:

> # This checks that M is d-separated from R given O.
> dsep(missingness.graph, "M2", "R2", "Y1")

[1] TRUE

However, the independency statements of MCAR are not longer satisfied.

> # This checks that O is d-separated from R.
> dsep(missingness.graph, "Y1", "R2")

[1] FALSE

> # This checks that M is d-separated from R.
> dsep(missingness.graph, "M2", "R2")

[1] FALSE

In the MNAR case, the missingness mechanism for depended on itself. We do not have a node for in the missingness graph, but we do have which models the (unobserved) complete version of we can then add the arc M2 to R2 to represent this.

> missingness.graph = model2network("[S2|R2:M2][R2|M2][M2][Y1|M2]")
> graphviz.plot(missingness.graph, shape = "rectangle")

Since MNAR excludes MCAR and MAR, the independency statements of both missingness mechanisms are not satisfied.

> # MCAR: O is d-separated from R.
> dsep(missingness.graph, "Y1", "R2")

[1] FALSE

> # MCAR: M is d-separated from R.
> dsep(missingness.graph, "M2", "R2")

[1] FALSE

> # MAR: M is d-separated from R given O.
> dsep(missingness.graph, "M2", "R2", "Y1")

[1] FALSE

What is the role of the missingness graph in structure and parameter learning? In some cases, using it for parameter learning may make it possible to correctly estimate the parameters of the BN: this property is called recoverability, and it is discussed in detail in Mohan and Pearl. As for structure learning, feeding a partially-constructed missingness graph to a structure learning algorithm and let it learn the rest from the data is a viable approach. However, as we will see in the following, many classical approaches to BN modelling just assume that the data are MCAR or MAR to avoid using the missingness graph altogether.

Parameter learning

Firstly, we will assume that the DAG is given and that we only need to learn the parameters of the network from the incomplete data.

Using locally-complete data

The simplest possible way to implement parameter learning is to drop all the observations containing missing values and to use only complete observations.

> colnames(data) = c("X1", "X2", "X3", "X4", "X5", "X6")
> dag = model2network("[X1][X3][X6][X2|X1][X4|X1:X3][X5|X2:X6]")
> incomplete.data = data[1:80, ]
> for (i in seq(ncol(incomplete.data)))
+   incomplete.data[1:10 + 10 * i, i] = NA
> 
> complete.subset = incomplete.data[complete.cases(incomplete.data), ]
> bn.fit(dag, complete.subset, method = "mle")


  Bayesian network parameters

  Parameters of node X1 (multinomial distribution)

Conditional probability table:
    a    b    c
0.25 0.45 0.30

  Parameters of node X2 (multinomial distribution)

Conditional probability table:

   X1
X2          a         b         c
  a 1.0000000 0.4444444 0.1666667
  b 0.0000000 0.3333333 0.1666667
  c 0.0000000 0.2222222 0.6666667

  Parameters of node X3 (multinomial distribution)

Conditional probability table:
    a    b    c
0.60 0.35 0.05

  Parameters of node X4 (multinomial distribution)

Conditional probability table:

, , X3 = a

   X1
X4          a         b         c
  a 1.0000000 0.0000000 0.0000000
  b 0.0000000 0.8000000 0.0000000
  c 0.0000000 0.2000000 1.0000000

, , X3 = b

   X1
X4          a         b         c
  a 0.0000000 1.0000000 0.0000000
  b 0.0000000 0.0000000 0.6666667
  c 1.0000000 0.0000000 0.3333333

, , X3 = c

   X1
X4  a         b c
  a   1.0000000
  b   0.0000000
  c   0.0000000


  Parameters of node X5 (multinomial distribution)

Conditional probability table:

, , X6 = a

   X2
X5          a         b         c
  a 0.7142857 0.0000000 0.0000000
  b 0.2857143 0.0000000 0.2500000
  c 0.0000000 1.0000000 0.7500000

, , X6 = b

   X2
X5          a         b         c
  a 0.0000000 0.0000000 0.0000000
  b 1.0000000 1.0000000 0.5000000
  c 0.0000000 0.0000000 0.5000000


  Parameters of node X6 (multinomial distribution)

Conditional probability table:
    a    b
0.65 0.35

This not an efficient use of the data because it reduces the sample size too much.

> nrow(incomplete.data)

[1] 80

> nrow(complete.subset)

[1] 20

As a result, the estimates of the parameters can be very imprecise.

> bn.fit(dag, complete.subset, method = "mle")$X5


  Parameters of node X5 (multinomial distribution)

Conditional probability table:

, , X6 = a

   X2
X5          a         b         c
  a 0.7142857 0.0000000 0.0000000
  b 0.2857143 0.0000000 0.2500000
  c 0.0000000 1.0000000 0.7500000

, , X6 = b

   X2
X5          a         b         c
  a 0.0000000 0.0000000 0.0000000
  b 1.0000000 1.0000000 0.5000000
  c 0.0000000 0.0000000 0.5000000

> bn.fit(dag, data[1:80, ])$X5


  Parameters of node X5 (multinomial distribution)

Conditional probability table:

, , X6 = a

   X2
X5           a          b          c
  a 0.66666667 0.00000000 0.05882353
  b 0.23809524 0.28571429 0.11764706
  c 0.09523810 0.71428571 0.82352941

, , X6 = b

   X2
X5           a          b          c
  a 0.38461538 0.00000000 0.33333333
  b 0.53846154 0.50000000 0.44444444
  c 0.07692308 0.50000000 0.22222222

A better approach along the same lines is to use locally-complete observations to estimate the parameters of each local distribution: after all, a local distribution only involves one node and its parents so we do not care whether the values of the remaining variables are missing or not.

> # bn.fit() uses locally-complete observations by default for maximum likelihood
> # and Bayesian estimators.
> bn.fit(dag, incomplete.data, method = "mle")$X5


  Parameters of node X5 (multinomial distribution)

Conditional probability table:

, , X6 = a

   X2
X5          a         b         c
  a 0.5714286 0.0000000 0.0000000
  b 0.2857143 0.5000000 0.1666667
  c 0.1428571 0.5000000 0.8333333

, , X6 = b

   X2
X5          a         b         c
  a 0.1428571 0.0000000 0.3636364
  b 0.7142857 1.0000000 0.3636364
  c 0.1428571 0.0000000 0.2727273

The Expectation-Maximisation algorithm

In the context of parameter learning, the Expectation-Maximisation (EM) algorithm retains its classic structure:

In the expectation step (E), we compute the expected values of the sufficient statistics (such as the counts in discrete BNs, partial correlations in GBNs), using exact inference along the lines described above to make use of incomplete as well as complete samples.
In the maximisation step (M), we take the sufficient statistics from the E-step and estimate the parameters of the BN, either using maximum likelihood or Bayesian posterior estimators.

We iterate over these two steps by using the parameter estimates to update the expected values of the sufficient statistics in th next E-step. Repeated iterations of these two steps will return the maximum likelihood or maximum a posteriori estimates for the parameters in the limit. Note that the EM algorithm is only valid for MCAR and MAR missing data.

> start = bn.fit(empty.graph(colnames(data)), incomplete.data)
> fitted = bn.fit(dag, incomplete.data, method = "hard-em", fit = "mle",
+            impute = "exact", threshold = 1e-5, debug = TRUE,
+            start = start)

* expectation-maximization iteration 1 .
  > the relative difference in log-likelihood is: 0.1993889 .
* expectation-maximization iteration 2 .
  > the relative difference in log-likelihood is: 0.04342052 .
* expectation-maximization iteration 3 .
  > the relative difference in log-likelihood is: 0.002834975 .
* expectation-maximization iteration 4 .
  > the relative difference in log-likelihood is: 0.001260253 .
* expectation-maximization iteration 5 .
  > the relative difference in log-likelihood is: 0 .
  @ the difference is smaller than the threshold, stopping.

> fitted


  Bayesian network parameters

  Parameters of node X1 (multinomial distribution)

Conditional probability table:
      a      b      c
0.2250 0.2875 0.4875

  Parameters of node X2 (multinomial distribution)

Conditional probability table:

   X1
X2           a          b          c
  a 0.94444444 0.60869565 0.05128205
  b 0.05555556 0.21739130 0.05128205
  c 0.00000000 0.17391304 0.89743590

  Parameters of node X3 (multinomial distribution)

Conditional probability table:
      a      b      c
0.6375 0.3000 0.0625

  Parameters of node X4 (multinomial distribution)

Conditional probability table:

, , X3 = a

   X1
X4           a          b          c
  a 0.85714286 0.00000000 0.04166667
  b 0.00000000 0.92307692 0.20833333
  c 0.14285714 0.07692308 0.75000000

, , X3 = b

   X1
X4           a          b          c
  a 0.00000000 1.00000000 0.33333333
  b 0.00000000 0.00000000 0.58333333
  c 1.00000000 0.00000000 0.08333333

, , X3 = c

   X1
X4  a          b          c
  a   0.50000000 0.00000000
  b   0.00000000 0.33333333
  c   0.50000000 0.66666667


  Parameters of node X5 (multinomial distribution)

Conditional probability table:

, , X6 = a

   X2
X5           a          b          c
  a 0.68181818 0.00000000 0.00000000
  b 0.22727273 0.40000000 0.09523810
  c 0.09090909 0.60000000 0.90476190

, , X6 = b

   X2
X5           a          b          c
  a 0.09090909 0.00000000 0.27777778
  b 0.81818182 1.00000000 0.33333333
  c 0.09090909 0.00000000 0.38888889


  Parameters of node X6 (multinomial distribution)

Conditional probability table:
   a   b
0.6 0.4

How does EM work in practice? One way of implementing it to impute the missing values in the E-step is the so-called hard EM.

> # 1) E-step: start from an empty network, so that each node encodes the marginal
> #    distribution of the corresponding variable, estimated from the observed
> #    values.
> current.bn = bn.fit(empty.graph(nodes(dag)), incomplete.data)
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> all(complete.cases(completed.data))

[1] TRUE

> # 2) Compute an initial estimate of the log-likelihood of the data.
> current.loglik = logLik(current.bn, completed.data)
> current.loglik

[1] -447.0964

> # 3) M-step: use the completed data to estimate the parameters of the BN.
> new.bn = bn.fit(dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik

[1] -387.1627

> # 4) Compute the relative improvement in the log-likelihood: is it large enough
> #    to make it worth to perform more iterations?
> (new.loglik - current.loglik) / abs(current.loglik)

[1] 0.1340509

> # 5) Reset the current state of the algorithm.
> current.bn = new.bn
> current.loglik = new.loglik
> # 6) E-step.
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> # 7) M-step.
> new.bn = bn.fit(dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik

[1] -337.8896

> # 8) Compute the relative improvement in the log-likelihood: is it large enough
> #    to make it worth to perform more iterations?
> (new.loglik - current.loglik) / abs(current.loglik)

[1] 0.1272672

> par(mfrow = c(1, 2))
> graphviz.chart(current.bn, type = "barprob", main = "Current BN")
> graphviz.chart(new.bn, type = "barprob", main = "New BN")

> # ... And so on until convergence.

Structure learning

Learning the structure of a BN from incomplete data is, in general, computationally unfeasible because we need to perform a joint optimisation over the missing values and the parameters to score each candidate network. We can make this apparent by rewriting P D given G as a function of D O (the observed data) and D M (the missing data):

intergral decomposition

From this expression we can see that in order to maximise P D given G we should jointly maximise the probability of the observed data D O and the probability of the missing data D M given the observed data, for each candidate and averaging over all possible Theta . This gives us the maximum a posteriori solution to structure learning.

A full Bayesian approach would require averaging over all the possible configurations of the missing data as well:

Bayesian integral

This implies one extra dimension of integration for each missing value in addition to one dimension for each parameter in Theta , and Prob D M given D O does not decompose along with local distributions in the general case.

Using locally-complete data

Earlier we saw how we can use locally-complete data to learn the parameters of each local distribution. The next logical step is to use the locally-complete data and the parameters we estimated from them to compute the goodness-of-fit scores we need for structure learning. The main issue with this approach is that local distributions (for the same node) with different parent sets will be learned from different subsets of the data, so:

The respective log-likelihoods may be on different scales, because they may sum over different numbers of observations.
The respective log-likelihoods may be estimated from non-nested subsets of the data, which in the worst case may not overlap at all.

The node-average log-likelihood (NAL)⁷ addresses both problems. If the data were complete, we could compute a penalised score such as BIC as

BIC

which is numerically equivalent to computing

average BIC

The quantity average log-likelihood is the empirical estimate of the expected value of the log-likelihood of a single observation: its scale does not depend on the number of observations. Furthermore, we can seamlessly replace it with the average over the locally-complete observations average BIC in the incomplete data:

PNAL

Tjebbe Bodewes and Marco Scutari proved the NAL ensures both model identifiability (up to the equivalence class) and consistency in structure learning under MCAR as long as n lambda n diverges and sqrt n lambda n diverges as . These conditions hold for neither AIC nor BIC because their penalties are not large enough when transformed into .

In practice, this means we can compute the node-average log-likelihood as we would any other score function, regardless of whether the data are complete or not.

> score(dag, incomplete.data, type = "nal", by.node = TRUE)

        X1         X2         X3         X4         X5         X6
-1.0561521 -0.6071982 -0.8415452 -0.5380200 -0.7824792 -0.6829081

For instance, the NAL for node (with parent in dag) is computed as follows.

> # 1) Identify the parents of node X2.
> node = "X2"
> pars = parents(dag, node)
> # 2) Identify which observations are locally complete for X2 and its parents.
> locally.complete = complete.cases(incomplete.data[, c(node, pars)])
> # 3) Subset the data.
> locally.complete.data = incomplete.data[locally.complete, c(node, pars)]
> # 4) Compute the counts and then the conditional probabilities of X2 given the parents.
> counts = table(locally.complete.data)
> probs = prop.table(counts, margin = "X1")
> # 5) Compute the log-likelihood and divide it by the number of locally-complete observations.
> nal = sum(counts * log(probs), na.rm = TRUE) / nrow(locally.complete.data)
> nal

[1] -0.6071982

Computing the PNAL from the NAL is then just a matter of subtracting the penalty term (the coefficient lambda n , here with the default value, multiplied by the number of free parameters in the local distribution).

> score(dag, incomplete.data, type = "pnal", by.node = TRUE)

        X1         X2         X3         X4         X5         X6
-1.0575453 -0.6113779 -0.8429384 -0.5505589 -0.7908384 -0.6836047

> lambda_n = (nrow(incomplete.data)^(-0.25)) / nnodes(dag)
> pnal = nal - lambda_n/nrow(incomplete.data) * (nrow(probs) - 1) * ncol(probs)
> pnal

[1] -0.6113779

In order to perform structure learning from incomplete data with PNAL, we just need to use score = "pnal" with a score-based learning algorithm like hill-climbing. How do the network structure learned with PNAL, that learned from the complete data (before we introduced the NAs) and the reference network structure?

> pnal.dag = hc(incomplete.data, score = "pnal", k = 1 / (nnodes(dag) * 0.2))
> complete.data.dag = hc(data[1:80, ], score = "bic")
> par(mfrow = c(1, 3))
> graphviz.compare(dag, complete.data.dag, pnal.dag, shape = "rectangle",
+   main = c("Reference DAG", "Complete Data DAG", "Learned with PNAL"))

As we mentioned earlier, a better way of comparing network structures is through their CPDAGs.

> par(mfrow = c(1, 3))
> graphviz.compare(cpdag(dag), cpdag(complete.data.dag), cpdag(pnal.dag),
+   shape = "rectangle", main = c("Reference DAG", "Complete Data DAG", "Learned with PNAL"))

While the three network structures in the first figure look very similar, the CPDAG of the network we learned with BIC is in fact very different from that of the reference network structure, while that of the network learned with PNAL is fairly close.

The Expectation-Maximisation algorithm

In the context of structure learning, the EM algorithm is called structural EM (SEM). It makes structure learning computationally feasible by searching for the best structure inside of EM, instead of embedding EM inside a structure learning algorithm. It consists of two steps like the classic EM:

In the expectation step (E), we complete the data by computing the expected sufficient statistics using the current network structure.
In the maximisation step (M), we find the structure that maximises the expected score function for the completed data.

Since the scoring in the M-step uses the completed data, structure learning can be implemented efficiently using standard algorithms for complete data. The original proposal by Nir Friedman used BIC and greedy search; he later extended SEM to a fully Bayesian approach based posterior scores, and proved the convergence of the resulting algorithm under MCAR and MAR.

> start = bn.fit(empty.graph(colnames(data)), incomplete.data)
> learned.dag = structural.em(incomplete.data, impute = "exact", fit = "mle",
+                 start = start)
> learned.dag


  Bayesian network learned from Missing Data

  model:
   [X1][X3][X6][X2|X1][X4|X1][X5|X1:X6]
  nodes:                                 6
  arcs:                                  4
    undirected arcs:                     0
    directed arcs:                       4
  average markov blanket size:           1.67
  average neighbourhood size:            1.33
  average branching factor:              0.67

  learning algorithm:                    Structural EM
  score-based method:                    Hill-Climbing
  parameter learning method:             Maximum Likelihood (disc.)
  imputation method:                     Exact Inference
  penalization coefficient:              2.191013
  tests used in the learning procedure:  140
  optimized:                             TRUE

Between the limited amounts of data and the missing values, structural.em() is unable to learn the correct DAG.

> par(mfrow = c(1, 2))
> graphviz.compare(dag, learned.dag, main = c("correct DAG", "learned DAG"),
+   shape = "rectangle")

But what is structural.em() doing in practice?

> # 1) E-step: start from an empty network, so that each node encodes the marginal
> #    distribution of the corresponding variable, estimated from the observed
> #    values.
> current.dag = empty.graph(colnames(data))
> current.bn = bn.fit(current.dag, incomplete.data)
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> # 2) Compute an initial estimate of the log-likelihood of the data.
> current.loglik = logLik(current.bn, completed.data)
> current.loglik

[1] -447.0964

> # 3) M-Step: learn the structure of the DAG from the completed data.
> new.dag = tabu(completed.data)
> # 4) Is the DAG we just learned different from the previous one?
> all.equal(current.dag, new.dag)

[1] "Different number of directed/undirected arcs"

> par(mfrow = c(1, 2))
> graphviz.compare(current.dag, new.dag, main = c("Current DAG", "New DAG"),
+   shape = "rectangle")

plot of chunk structural_em_step_by_step

> # 5) Learn the parameters as well and compute the relative improvement in the
> # log-likelihood: is it large enough to make it worth to perform more
> # iterations?
> new.bn = bn.fit(new.dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik

[1] -404.8411

> (new.loglik - current.loglik) / abs(current.loglik)

[1] 0.09451047

> # 6) Reset the current state of the algorithm.
> current.dag = new.dag
> current.bn = new.bn
> current.loglik = new.loglik
> # 7) E-step.
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> # 8) M-Step.
> new.dag = tabu(completed.data)
> # 9) Is the DAG we just learned different from the previous one?
> all.equal(current.dag, new.dag)

[1] "Different arc sets"

> par(mfrow = c(1, 2))
> graphviz.compare(current.dag, new.dag, main = c("Current DAG", "New DAG"),
+   shape = "rectangle")

> # 10) Did the log-likelihood improve?
> new.bn = bn.fit(new.dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik

[1] -379.1162

> (new.loglik - current.loglik) / abs(current.loglik)

[1] 0.06354313

> # ... And so on until convergence.

The PC algorithm

Using locally-complete data is a principled way of learning BNs using constraint-based algorithms, as shown by Eric Strobl, Shyam Visweswaran and Peter Spirtes.⁸ The trick is that we should perform each conditional independence test twice, as follows:

Test whether using locally-complete observations, and record resulting the p-value.
If the p-value smaller than the chosen threshold (typically 0.01)?
- If the answer is no, then reject the null hypothesis of (conditional) independence.
- If the answer is yes, test whether using globally-complete observations and record the resulting p-value.
Is the minimum between the p-values of the two tests smaller than the chosen threshold?
- If the answer is no, then accept the null hypothesis of (conditional) independence.
- If the answer is yes, then reject the null hypothesis of (conditional) independence.

This double-testing ensures that we can apply algorithms like the PC to incomplete data that are MCAR or MAR. Interestingly, it works even if the data are MNAR! Furthermore, Strobl, Visweswaran and Spirtes show that even just performing a single test on the locally-complete data works very well in practice.

> # constraint-based algorithms in bnlearn all use locally-complete observations
> # for constraint-based learning, by default.
> learned.dag = pc.stable(incomplete.data)
> par(mfrow = c(1, 2))
> graphviz.compare(dag, learned.dag, main = c("correct DAG", "learned DAG"),
+   shape = "rectangle")

As an example consider the following conditional independence test, which is the first of the two tests.

> # ci.test() uses locally-complete data by default.
> ci.test("X1", "X6", "X4", data = incomplete.data, test = "x2")


	Pearson's X^2

data:  X1 ~ X6 | X4
x2 = 12.368, df = 6, p-value = 0.05425
alternative hypothesis: true value is greater than 0

We can compute it manually from the locally-complete data.

> local.data = incomplete.data[, c("X1", "X6", "X4")]
> locally.complete = local.data[complete.cases(local.data), ]
> nrow(locally.complete)

[1] 50

> prop.table(table(locally.complete), margin = "X4")

, , X4 = a

   X6
X1           a          b
  a 0.23529412 0.17647059
  b 0.17647059 0.17647059
  c 0.23529412 0.00000000

, , X4 = b

   X6
X1           a          b
  a 0.00000000 0.00000000
  b 0.46666667 0.20000000
  c 0.00000000 0.33333333

, , X4 = c

   X6
X1           a          b
  a 0.11111111 0.00000000
  b 0.05555556 0.00000000
  c 0.38888889 0.44444444

> test.statistic = as.numeric(
+   chisq.test(table(locally.complete)[, , "a"])$stat  +
+   chisq.test(table(locally.complete)[, , "b"] + 1e-6)$stat  +
+   chisq.test(table(locally.complete)[, , "c"])$stat
+ )
> test.statistic

[1] 12.36782

> pchisq(test.statistic, df = 6, lower.tail = FALSE)

[1] 0.05424854

Its p-value is large, so we should compute the second test as well. The second test involves the same variables, but uses only globally-complete data.

> globally.complete = local.data[complete.cases(incomplete.data), ]
> nrow(globally.complete)

[1] 20

> prop.table(table(globally.complete), margin = "X4")

, , X4 = a

   X6
X1          a         b
  a 0.3750000 0.1250000
  b 0.2500000 0.2500000
  c 0.0000000 0.0000000

, , X4 = b

   X6
X1          a         b
  a 0.0000000 0.0000000
  b 0.5000000 0.1666667
  c 0.0000000 0.3333333

, , X4 = c

   X6
X1          a         b
  a 0.1666667 0.0000000
  b 0.1666667 0.0000000
  c 0.5000000 0.1666667

> test.statistic = as.numeric(
+   chisq.test(table(globally.complete)[, , "a"] + 1e-6)$stat  +
+   chisq.test(table(globally.complete)[, , "b"] + 1e-6)$stat  +
+   chisq.test(table(globally.complete)[, , "c"])$stat
+ )
> test.statistic

[1] 4.133331

> pchisq(test.statistic, df = 6, lower.tail = FALSE)

[1] 0.6586386

Equivalently:

> ci.test("X1", "X6", "X4",
+   data = incomplete.data[complete.cases(incomplete.data), ],
+   test = "x2")


	Pearson's X^2

data:  X1 ~ X6 | X4
x2 = 4.1333, df = 6, p-value = 0.6586
alternative hypothesis: true value is greater than 0

The second test produces an even larger p-value: since both are larger than the threshold we accept the null hypothesis of conditional independence between X1 and X6 given X4.

Inference

Incomplete data are not a concern when performing general inference on BNs: as long as we can define the evidence we are considering and the event (for CP queries) or the target nodes (for MAP queries), we can obtain the answers we need in the same way as when data are complete. The only influence missing data may have is on the quality of the BN we learn from them and that we submit our queries to.

There are, however, two inference tasks that are interesting to discuss for incomplete data: imputation and prediction. Both take data as input, as opposed to user-provided evidence and events, and therefore interact with the missing values.

Imputation

Imputing missing data means to replace them with values estimated from the observed data. We can do that with:

Single Imputation: we replace the missing values with their expectations (for continuous variables) or MAP estimates (for discrete variables).
Multiple Imputation: we replace the missing values with values sampled from their posterior distribution given the observed data, that is, . We do that multiple times to capture the variability of the imputation process and to incorporate it in the variability of our statistical results.

Multiple imputation is what Rubin calls a proper imputation because of this, while single imputation is not. However, it is also much more computationally expensive, more so for multivariate models such as BNs. For this reason, single imputation is more common in practical applications.

The simplest way of performing single imputation is to take the local distribution of the node whose value we are not observing and condition on the values of its parents. In a causal setting, we are effectively imputing a variable from its direct causes. If there are multiple missing values in a single observations, we must do this for each of them in the topological ordering implied by the DAG so that the parents of each node are imputed (if needed) before the node itself.

> dag = model2network("[X1][X2][X3|X1:X2][X4|X3][X5|X3][X6|X4:X5]")
> bn = bn.fit(dag, data)
> incomplete.data[c(11, 21, 31, 41, 51, 61), ]

     X1   X2   X3   X4   X5   X6
11 <NA>    a    a    b    c    a
21    c <NA>    a    c    b    b
31    a    a <NA>    a    b    b
41    c    c    a <NA>    a    b
51    b    a    a    b <NA>    b
61    c    c    a    c    c <NA>

> imputed.data = impute(bn, data = incomplete.data, method = "parents")
> imputed.data[c(11, 21, 31, 41, 51, 61), ]

   X1 X2 X3 X4 X5 X6
11  b  a  a  b  c  a
21  c  a  a  c  b  b
31  a  a  a  a  b  b
41  c  c  a  b  a  b
51  b  a  a  b  a  b
61  c  c  a  c  c  a

Prediction

We can make similar considerations for prediction. Unlike imputation, three types of variables are involved: the target variables we would like to predict, the observed values D O we would like to predict from, and the missing values D M we should average over (since we cannot condition on them, for obvious reasons). Therefore, following the same notation that we used to describe structure learning, we either find the value with the highest probability

MAP

for a discrete variable, or the expectation

expectation

for a continuous variable. We can also predict just from the parents of the target node but, as we have seen in the case of imputation, this approach can be suboptimal or even unsuitable (for root nodes).

Computing predictions with either exact inference or Monte Carlo performs similarly in typical scenarios: we are only estimating the value of a single variable, so it is a low-dimensional problem. Imputation is not necessarily a low-dimensional problem: its dimensionality depends on the number of missing values in individual observations.

> pred.ex = predict(bn, data = incomplete.data, node = "X5", method = "exact")
> pred.lw = predict(bn, data = incomplete.data, node = "X5", method = "bayes-lw")
> table(exact = pred.ex, lw = pred.lw)

     lw
exact  a  b  c
    a 28  1  8
    b  0 30  0
    c  3  0 10

More in detail (using the gRain package to make explicit the steps followed by bnlearn):

> # 0) Consider an observation with multiple missing values.
> incomplete.data[1, c("X1", "X3", "X5")] = NA
> # 1) Identify the target, missing and observed variables in the observation.
> target = "X6"
> observed.values = incomplete.observation[, which(!is.na(incomplete.data[1, ]))]
> observed.values = observed.values[, names(observed.values) != target]
> missing.values = setdiff(colnames(data), c(target, observed.values))
> # 2) Compute the distribution of the target conditional on all observed values,
> #    marginalising the variables we do not observe.
> junction.tree = as.grain(bn)
> with.evidence = setEvidence(junction.tree, nodes = names(observed.values),
+                             states = sapply(observed.values, as.character))
> condprob = querygrain(with.evidence, node = target)[[target]]
> condprob

X6
        a         b
0.5067387 0.4932613

> # 3) Find the value with the highest probability.
> names(which.max(condprob))

[1] "a"

The code above is equivalent to the following, more explicit code.

> jointprob = querygrain(with.evidence, node = c(target, missing.values),
+                        type = "joint")
> margin.table(jointprob, target)

X6
        a         b
0.5067387 0.4932613

Scutari M and Denis J-B (2021). Bayesian Networks with Examples in R. CRC Press, 2nd edition. link ↩︎
Rubin DB (1976). “Inference and Missing Data.” Biometrika, 63:581–592. link ↩︎
Little RJA and Rubin DB (2019). Statistical Analysis with Missing Data. Wiley, 3rd edition. link ↩︎
van Buuren (2018). Flexible Imputation of Missing Data. CRC Press, 2nd edition. link ↩︎
Mohan K and Pearl J (2018). “Graphical Models for Processing Missing Data.” Journal of the American Statistical Association, 534(116):1023–1037. link ↩︎
Scutari M (2020). “Bayesian Network Models for Incomplete and Dynamic Data.” Statistica Neerlandica, 74(3):397–419. link ↩︎
Bodewes T and Scutari M (2021). “Learning Bayesian Networks from Incomplete Data with the Node-Averaged Likelihood.” International Journal of Approximate Reasoning, 138:145–160. link ↩︎
Strobl E, Visweswaran S and Spirtes PL (2018). “Fast Causal Inference With Non-Random Missingness by Test-Wise Deletion.” International Journal of Data Science and Analytics, 6:47–62. link ↩︎

Last updated on Tue Sep 12 09:47:40 2023 with bnlearn 4.9 and R version 4.2.2 (2022-10-31).