A quick introduction

In this lecture we will introduce how to model data containing missing values with Bayesian networks. Missing values can arise in different ways, which were originally codified by Rubin. Bayesian networks give graphical form to Rubin's work using a directed graph where the variables and the missingness mechanisms are represented as nodes and relationships as arcs. The two key points of the lecture will be:

  • Reviewing Rubin's classification of missing data and its graphical representation following the work of Mohan and Pearl.
  • Adapting parameter learning and structure learning to learn Bayesian networks from incomplete data using techniques like the Expectation-Maximisation algorithm.

Bayesian networks


Bayesian networks (BNs)1 are probabilistic graphical models defined by:

  • a network structure, a directed acyclic graph (DAG) G in which each node corresponds to a random variable Xi ;
  • a global probability distribution over X (with parameters Theta), which can be factorised into smaller local probability distributions according to the arcs present in the graph.

The main role of the network structure is to express the conditional independence relationships among the variables in the model through graphical separation: nodes that are not connected by an arc are independent conditional on some other nodes. In other words,

graphical separation implies causation

Nodes that are connected by an arc cannot be graphically separated, and therefore they depend directly on each other in probability.

The most important consequence of this property is the factorisation of the global distribution of X:

local distributions

Each local distribution has its own parameter set Theta Xi; and all Theta Xi is much smaller than Theta because many parameter are fixed by the fact that the variables they belong to are independent. This makes it possible to work with a high-dimensional distribution X through a set of much smaller and simpler distributions Xi whose dimensions depends only on the respective number of parents P Xi.

The first component of a BN is a DAG. Graphically, it looks like this: all arcs have directions, and no cycles are allowed.

> string = "[X1][X2][X3|X1:X2][X4|X3][X5|X3][X6|X4:X5]"
> sample.dag = model2network(string)
> graphviz.plot(sample.dag, shape = "rectangle")
plot of chunk sample-dag

And the implication is that:


The second component of a BN is the probability distribution Prob X. The choice should be such that the BN:

  • can be learned efficiently from data;
  • is flexible (it can encode a reasonable variety of phenomena);
  • is easy to query to perform inference.

The three most common choices in the literature and in practical applications (by far) are:

  • Discrete BNs, in which X and the Xi are multinomial. The parameters Theta Xi of the local distributions are the conditional probabilities pi i k j
    > dag = model2network("[X1][X2][X3|X2:X1]")
    > cptX1 = matrix(c(0.53, 0.47), ncol = 2, dimnames = list(NULL, c("a", "b")))
    > cptX2 = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("c", "d")))
    > cptX3 = c(0.15, 0.85, 0.4, 0.6, 0.75, 0.25, 0.80, 0.20)
    > dim(cptX3) = c(2, 2, 2)
    > dimnames(cptX3) = list("X3" = c("e", "f"), "X1" = c("a", "b"), "X2" = c("c", "d"))
    > bn = custom.fit(dag, list("X1" = cptX1, "X2" = cptX2, "X3" = cptX3))
    > bn
      Bayesian network parameters
      Parameters of node X1 (multinomial distribution)
    Conditional probability table:
       a    b 
    0.53 0.47 
      Parameters of node X2 (multinomial distribution)
    Conditional probability table:
       c    d 
    0.34 0.66 
      Parameters of node X3 (multinomial distribution)
    Conditional probability table:
    , , X2 = c
    X3     a    b
      e 0.15 0.40
      f 0.85 0.60
    , , X2 = d
    X3     a    b
      e 0.75 0.80
      f 0.25 0.20
  • Gaussian BNs (GBNs), in which X follows a multivariate normal distribution and the local distributions Xi are defined by the linear regression models for each Xi against its parents Pi Xi: regression
    > bn = custom.fit(dag, list(
    +   "X1" = list(coef = c("(Intercept)" = 2.4), sd = sqrt(0.8)),
    +   "X2" = list(coef = c("(Intercept)" = 1.8), sd = sqrt(0.6)),
    +   "X3" = list(coef = c("(Intercept)" = 2.1, "X1" = 1.2, "X2" = -0.4), sd = sqrt(0.9))
    + ))
    > bn
      Bayesian network parameters
      Parameters of node X1 (Gaussian distribution)
    Conditional density: X1
    Standard deviation of the residuals: 0.8944272 
      Parameters of node X2 (Gaussian distribution)
    Conditional density: X2
    Standard deviation of the residuals: 0.7745967 
      Parameters of node X3 (Gaussian distribution)
    Conditional density: X3 | X1 + X2
    (Intercept)           X1           X2  
            2.1          1.2         -0.4  
    Standard deviation of the residuals: 0.9486833
  • Conditional linear Gaussian BNs (CLGBNs or CGBNs), in which X is a mixture of multivariate normal distributions and the Xi are either multinomial, univariate normal or mixtures of normals.
    • Discrete Xi are only allowed to have discrete parents (denoted Delta Xi ), are assumed to follow a multinomial distribution parameterised with conditional probability tables.
    • Continuous Xi are allowed to have both discrete and continuous parents (denoted Gamma Xi, combined Pi Xi ), and their local distributions are normal which can be written as a mixture of linear regressions mixture against the continuous parents with one component for each configuration d Xi of the discrete parents. If Xi has no discrete parents, the mixture reverts to a single linear regression.
    > bn = custom.fit(dag, list(
    +   "X1" = list(coef = c("(Intercept)" = 3.1), sd = sqrt(0.9)),
    +   "X2" = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("a", "b"))),
    +   "X3" = list(coef = matrix(c(1:4), ncol = 2), sd = sqrt(c(0.4, 0.3)))
    + ))
    > bn
      Bayesian network parameters
      Parameters of node X1 (Gaussian distribution)
    Conditional density: X1
    Standard deviation of the residuals: 0.9486833 
      Parameters of node X2 (multinomial distribution)
    Conditional probability table:
       a    b 
    0.34 0.66 
      Parameters of node X3 (conditional Gaussian distribution)
    Conditional density: X3 | X1 + X2
                 0  1
    (Intercept)  1  3
    X1           2  4
    Standard deviation of the residuals:
            0          1  
    0.6324555  0.5477226  
    Discrete parents' configurations:
    0   a
    1   b


Model selection and model estimation are collectively known as learning, in the case of BNs. They are usually performed as a two-step process:

  1. structure learning, learning the structure of the DAG from the data;
  2. parameter learning, learning the local distributions implied by the DAG learned in the previous step.

This workflow is Bayesian: given a data set D and a BN B , we have

structure and parameter learning

Structure learning can be done in practice as

marginal likelihood

where we use Bayes' theorem to move from graph probability to loglikelihood because the latter is much easier to define. Assuming that the marginal likelihood marginal likelihood decomposes along local distributions,

local marginal likelihood

which can be computed efficiently. This approach is called score-based structure learning.

Once we have learned G in structure learning, we know the parents of each node and therefore we can identify the local distributions Xi. This makes it possible to estimate their parameter sets Theta Xi independently from each other. Assuming that G is sparse, each Theta Xi. involves only few variables and thus estimating its parameters is computationally simple.

In practice, this means that we first use some structure learning algorithm to learn the DAG G from the data,

> dag = hc(data, score = "bde")

using the marginal likelihood above (which is available in closed form for discrete BNs, GBNs and CLGBNs) or some other goodness-of-fit score like BIC.

> hc.dag = hc(data, score = "bic")
> graphviz.plot(hc.dag, shape = "rectangle")
plot of chunk example_of_structure_learning_ctd_2

We then take the hc.dag together with the data and use it for parameter learning with some estimator (maximum likelihood, Bayesian posterior or something else).

> fitted = bn.fit(hc.dag, data, method = "mle")
> fitted

  Bayesian network parameters

  Parameters of node X1 (multinomial distribution)

Conditional probability table:
      a      b      c 
0.3336 0.3340 0.3324 

  Parameters of node X2 (multinomial distribution)

Conditional probability table:
X2           a          b          c
  a 0.85611511 0.44491018 0.11492178
  b 0.02517986 0.22095808 0.09446450
  c 0.11870504 0.33413174 0.79061372

  Parameters of node X3 (multinomial distribution)

Conditional probability table:
      a      b      c 
0.7434 0.2048 0.0518 

  Parameters of node X4 (multinomial distribution)

Conditional probability table:
, , X3 = a

X4           a          b          c
  a 0.80081301 0.09251810 0.10530547
  b 0.09024390 0.80209171 0.11173633
  c 0.10894309 0.10539019 0.78295820

, , X3 = b

X4           a          b          c
  a 0.18079096 0.88304094 0.24695122
  b 0.13276836 0.07017544 0.49390244
  c 0.68644068 0.04678363 0.25914634

, , X3 = c

X4           a          b          c
  a 0.42857143 0.34117647 0.13333333
  b 0.20238095 0.38823529 0.44444444
  c 0.36904762 0.27058824 0.42222222

  Parameters of node X5 (multinomial distribution)

Conditional probability table:
, , X6 = a

X5           a          b          c
  a 0.80524979 0.20588235 0.11937378
  b 0.09737511 0.17973856 0.11448141
  c 0.09737511 0.61437908 0.76614481

, , X6 = b

X5           a          b          c
  a 0.40050804 0.31679389 0.23759542
  b 0.49026249 0.36641221 0.50667939
  c 0.10922947 0.31679389 0.25572519

  Parameters of node X6 (multinomial distribution)

Conditional probability table:
      a      b 
0.5018 0.4982

For structure learning, we can consider using conditional independence tests instead of maximising a goodness-of-fit score. BNs are defined such that graphical separation between nodes in the DAG implies (conditional or marginal) independence in the probability distributions of the corresponding variables. But this means that we can use suitable statistical tests to assess (conditional or marginal) independence from data and insert (or not) arcs in the DAG as appropriate. This approach is called constraint-based structure learning.

> pc.dag = pc.stable(data, test = "x2")
> graphviz.plot(pc.dag, shape = "rectangle")
plot of chunk example_of_constraint_based

We will discuss in a minute the reason why the graph learned by the PC algorithm (or that would be learned by any other constraint-based algorithm, for that matter) is only partially directed.


Inference on BNs usually consists of conditional probability (CP) or maximum a posteriori (MAP) queries. The general idea of a CP query is:

  1. we have some evidence, that is, we know the values of some variables and we fix the nodes accordingly; and
  2. we want to look into the probability of some event involving (a subset of) the other variables conditional on the evidence we have.

Say we start from a discrete BN with the DAG used above.

plot of chunk figure-survey-base

Graphically a CP query looks like this (evidence node in red, event of interest in blue):

plot of chunk figure-cpquery

On the other hand, the goal of a MAP query is to find the combination of values for (a subset of) the variables in the network that has the highest probability given some evidence. If the evidence is a partially-observed new individual, then performing a MAP query amounts to a classic prediction task. Graphically, a MAP query looks like this (evidence nodes in red, variables of interest in blue):

plot of chunk figure-mapquery

This is more computationally challenging than it sounds because MAP queries do not decompose along local distributions.

As an example, performing a CP query using the gRain package with a BN from bnlearn involves the following steps:

> library(gRain)
> # 1) Export the BN from bnlearn to gRain, which transforms it into a junction tree.
> junction.tree = as.grain(sample.bn)
> # 2) Introduce the evidence in the junction tree, modifying the distribution of X2.
> with.evidence = setEvidence(junction.tree, nodes = "X2", states = "b")
> # 3) Query about the distribution of the node containing the event of interest.
> querygrain(with.evidence, node = "X6")
        a         b 
0.4993782 0.5006218

A MAP query goes further and finds the highest probability in the joint distribution of the variables of interest (which is different from the maxima found in the respective marginal distribution, in general). The values of the variables associated with that probabilities are the answer to the MAP query.

> # 2) Introduce the evidence in on X1, X4 and X6 the junction tree.
> with.evidence = setEvidence(junction.tree, nodes = c("X1", "X4", "X6"),
+                             states = c("a", "c", "b"))
> # 3) Query about the distribution of the nodes of interest X2 and X5
> pdist = querygrain(with.evidence, nodes = c("X2", "X5"), type = "joint")
> # 4) Find the configuration of values with the maximum probability.
> which.max = which(pdist == max(pdist), arr.ind = TRUE)
> which.max
  X2 X5
a  1  2
> rownames(pdist)[which.max[, 1]]
[1] "a"
> colnames(pdist)[which.max[, 2]]
[1] "b"

Both types of queries can be computed by:

  • Applying exact inference algorithms to the BN, changing its data structure into a form (the “junction tree”) that allows for the efficient computation of marginal, joint and conditional probabilities.
  • Generating random samples form the BN and using them with Monte Carlo algorithms to implement approximate inference.

Exact inference is more computationally demanding on BNs with many nodes and arcs than approximate inference, but approximate inference is affected by stochastic noise from the random samples generation process.

Equivalence classes

BNs use directed acyclic graphs to represent the probabilistic relationships between the data. By default, that is all they can do: they are not inherently causal models even though their arcs have a direction that could be interpreted as a cause-effect relationship. BNs are probabilistic graphical models that represent probabilistic dependencies in graphical form: probabilistic dependence is a symmetric relationship because

conditioning is symmetric

which in turn depends on Bayes' theorem,

Bayes theorem

As a result, we can represent the same probability distribution with different DAGs. Such DAGs are said to belong to the same equivalence class because they are indistinguishable without the use of external information. Consider the following example:

> dag1 = model2network("[X1][X2|X1][X3|X1][X4|X1]")
> dag2 = model2network("[X1|X4][X2|X1][X3|X1][X4]")
> par(mfrow = c(1, 2))
> graphviz.compare(dag1, dag2, diff = "none", shape = "rectangle")
plot of chunk equivalence_class_example

We can apply Bayes' theorem and show that the decompositions into local probabilities in dag1 and dag2 are in fact equivalent:

identical CPDAGs

In other words, the directions of some arcs are not identifiable even though we can tell that the nodes are connected by an arc. The graphical representation of this idea is the completed partially-directed acyclic graph (CPDAG) in which only the arcs whose directions can be uniquely identified are directed.

> par(mfrow = c(1, 2))
> graphviz.plot(cpdag(dag1), shape = "rectangle")
> graphviz.plot(cpdag(dag2), shape = "rectangle")
plot of chunk equivalence_class_representation

Formally, CPDAGs have two key properties:

  • The CPDAG is defined by the skeleton of the DAG (the underlying undirected graph) and by the v-structures (converging arcs, where Xj and Xk are not connected by an arc). The arcs belonging to the v-structures and those that would introduce cycles/other v-structures in the CPDAG are the only directed arcs; all other arcs in the CPDAG have no direction.
  • Each CPDAG is the unique representation of an equivalence class. In other words, DAGs that can be transformed into the same CPDAG belong to the same equivalence class.

Furthermore, DAGs in the same equivalence class are score equivalent: we cannot select one member of the equivalence class over another in structure learning (without the use of external information) because goodness-of-fit scores will assign the same value to both of them. After all, all DAGs in the same equivalence class represent the same probability distribution and the goodness-of-fit scores are a function of that probability distribution. For instance:

> score(dag1, data[, nodes(dag1)], type = "bic")
[1] -17074.39
> score(dag2, data[, nodes(dag2)], type = "bic")
[1] -17074.39

This suggests that we should always compare the structure of BNs through their CPDAGs to be able to assess whether they are effectively different or not.

> cpdag1 = cpdag(dag1)
> cpdag2 = cpdag(dag2)
> unlist(compare(cpdag1, cpdag2))
tp fp fn 
 3  0  0
> all.equal(cpdag1, cpdag1)
[1] TRUE

Sometimes we need the inverse operation: we have a CPDAG and we need to find out one of the DAGs that are part of the equivalence class that the CPDAG represents. This operation is called finding a consistent extension of the CPDAG.

> cextend(cpdag1)

  Random/Generated Bayesian network

  nodes:                                 4 
  arcs:                                  3 
    undirected arcs:                     0 
    directed arcs:                       3 
  average markov blanket size:           1.50 
  average neighbourhood size:            1.50 
  average branching factor:              0.75 

  generation algorithm:                  Empty

Causal interpretation of BNs

What do we need to do to learn BNs as causal network models, that is, a graphical model in which arcs represent direct cause-effect relationship? Firstly, we need to make some assumptions:

  • Causal Markov assumption: Each variable Xi in X is conditionally independent of its non-effects, both direct and indirect, given its direct causes. This represents the causal equivalent of the decomposition into local distributions.
  • Faithfulness: There must exist a DAG which is faithful to the probability distribution of Prob X, so that the only dependencies are those arising from d-separation in the DAG.
  • Causal sufficiency: There must be no latent variables (unobserved variables influencing the variables in the network) acting as confounding factors because they are parents to at least two observed variables. Such variables may induce spurious correlations between the observed variables, thus introducing bias in the causal network.

The last assumption is really a corollary of the first two: the presence of unobserved variables violates the faithfulness assumption (because the network structure cannot include them) and possibly the causal Markov property (if an arc is wrongly added between the observed variables due to the influence of the latent one). As an example, consider the following BN from which generate a sample of 100 observations.

> causal.bn = custom.fit(model2network("[X1][X2|X1][X3|X1]"),
+   list("X1" = list(coef = c("(Intercept)" = 0), sd = 1),
+        "X2" = list(coef = c("(Intercept)" = 0, "X1" = 3), sd = 0.5),
+        "X3" = list(coef = c("(Intercept)" = 0, "X1" = 2), sd = 0.5))
+ )
> graphviz.plot(causal.bn, shape = "rectangle")
plot of chunk causal_network_and_latent_variables
> complete.data = rbn(causal.bn, 100)

In this network, we can observe the following graphical separations:

> dsep(causal.bn, "X2", "X3")
> dsep(causal.bn, "X2", "X3", "X1")
[1] TRUE

We can successfully to re-learn the DAG of causal.bn from the complete.data because all variables are observed. However, if we omit X1 , we fail to re-learn the DAG of causal.bn because X1 (now a latent variable) is a parent of both X2 and X3 (observed variables). Violating the causal sufficiency assumption results in learning a spurious arc X2 to X3 that is not present in causal.bn.

> relearn = hc(complete.data)
> modelstring(relearn)
[1] "[X1][X2|X1][X3|X1]"
> dsep(relearn, "X2", "X3")
> dsep(relearn, "X2", "X3", "X1")
[1] TRUE
> relearn = hc(complete.data[, c("X2", "X3")])
> modelstring(relearn)
[1] "[X2][X3|X2]"
> dsep(relearn, "X2", "X3")

Even if these assumptions are satisfied, we may not be able to completely learn the BN as a causal network: we are using probabilistic tools (goodness-of-fit scores or conditional independence tests) that can only identify equivalence classes of BNs. In order to identify the remaining arc directions, we need external information that typically comes from:

  • Longitudinal (panel) data: causality can only follow the arrow of time. Any arc between a node associated with a given time point and a node associated with a later time point can only go from the former to the latter.
  • Experimental (randomised) data: data that are collected following an experimental design (as opposed to observational data that are just recorded are they arise) are such that arcs can only point outwards from the randomised variables.
  • Domain experts: information about arc directions may be either collected by interviewing experts in the domain the data come from, or by consulting the literature to find relevant results from previous investigations.

Missing data

Rubin2, Little and Rubin3 have defined the key properties of missing data. We start from:

  • the complete data of all samples, which we do not observe;
  • the missingness mechanism (also known as the pattern of missingness) which determines which variables we actually observe for each sample;
  • the incomplete data, in which the values of some variables are replaced with NA for some samples to indicate that they are missing. This is what we actually observe.

In other words, the complete data are reduced to the incomplete data by the missingness mechanism. The statistical analysis of incomplete data aims to recover the probabilistic model underlying the complete data from the incomplete data by taking into account the missingness mechanism. Not taking into account the missingness mechanism would result in a biased model.

What types of missingness mechanisms are possible? Rubin defined three:

  1. Missing Completely at Random (MCAR): the probability of the value of a variable being missing is the same for all samples because the causes of the missing data are unrelated to the data. In other words, the missingness mechanism does not depend on the variables in the data.
  2. Missing at Random (MAR): the probability of being missing is the same only within groups defined by the observed data. In other words, the missingness mechanism depends on the variables that are completely observed in the data. MCAR implies MAR.
  3. Missing Not at Random (MNAR): the probability of being missing varies for reasons that are unknown to us. In other words, the missingness mechanism depends on variables we do not observe completely or at all. Any missingness mechanism that is not MCAR or MAR is MNAR.

Formally, a typical notation is

  • D O for the observed data;
  • D M for the missing data;
  • D = D O and D M for the complete data;
  • R is a zero-one matrix with the same dimensions as D: each cell is equal to zero if the corresponding cell in the data is a missing value, or it is equal to one if the corresponding cell in the data is observed.

If the data contain no missing values, then D= D O and we can model D through D O in a straightforward way. If the data are MCAR, then

does not depend on D O or D M

if the data are MAR, then

does not depend on D M

if the data are MNAR, then

it does not simplify

Consider the following numerical example (from van Buuren4) involving two variables Y1 and Y2 that jointly follow a bivariate normal distribution.

> library(MASS)
> set.seed(80122)
> n = 300
> Y = mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))

Say that only Y2 contains missing values and that the missingness mechanism is defined by

general missingness mechanism

If gamma MCAR then

missingness mechanism

and the missingness mechanism does not depend on Y1 and Y2 at all. If gamma MAR, then

missingness mechanism

and the missingness mechanism depends on Y1 which is completely observed. This allows for modelling the missingness mechanism from the data we observe. Finally, if gamma MNAR then

missingness mechanism

and the missingness mechanism depends on Y2 which is not completely observed.

> missingness.mechanism = function(Y, gamma) {

+   gamma[1] + exp(Y[, 1]) / (1 + exp(Y[, 1])) * gamma[2] +
+     exp(Y[, 2]) / (1 + exp(Y[, 2])) * gamma[3]

> R.MCAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0.5, 0, 0)))
> R.MAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 1, 0)))
> R.MNAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 0, 1)))

If we try to estimate the joint distribution of Y1 and Y2 under MCAR, that is, when we assume that the missingness mechanism is MCAR, we obtain means that are close to zero for both Y1 and Y2 . Their correlation is close to 0.5 as well.

> Y.MCAR = Y
> Y.MCAR[as.logical(R.MCAR), 2] = NA
> colMeans(Y.MCAR, na.rm = TRUE)
[1]  0.04083247 -0.03533149
> cor(Y.MCAR, use = "pairwise.complete")
          [,1]      [,2]
[1,] 1.0000000 0.4249437
[2,] 0.4249437 1.0000000
plot of chunk missingness_numeric_example_3_plot

The same is true under MAR.

> Y.MAR = Y
> Y.MAR[as.logical(R.MAR), 2] = NA
> colMeans(Y.MAR, na.rm = TRUE)
[1] 0.04083247 0.26461012
> cor(Y.MAR, use = "pairwise.complete")
          [,1]      [,2]
[1,] 1.0000000 0.5011879
[2,] 0.5011879 1.0000000
plot of chunk missingness_numeric_example_4_plot

However, under MNAR the estimates of the mean of Y2 and of the correlation between Y1 and Y2 are both heavily biased. Intuitively, the bias is unavoidable since it is impossible to correctly model a MNAR missingness mechanism that depends on information we do not have (the values missing in Y2 ).

> Y.MNAR = Y
> Y.MNAR[as.logical(R.MNAR), 2] = NA
> colMeans(Y.MNAR, na.rm = TRUE)
[1] 0.04083247 0.43020494
> cor(Y.MNAR, use = "pairwise.complete")
          [,1]      [,2]
[1,] 1.0000000 0.2672833
[2,] 0.2672833 1.0000000
plot of chunk missingness_numeric_example_5_plot

The distributions of D O and D M become more and more different as we move from MCAR to MAR to MNAR.

plot of chunk missingness_numeric_example_bwplot

This behaviour is related to ignorability, which formalises the idea that the distribution of the data is the same whether we observe or not:

same distribution

As a result, we can model the distribution of D M from D O (equivalently, the distribution D given D O and R = 1) and use this distribution to work around the missing data. For instance, we can impute them, that is, we can replace missing values with point estimates computed from the observed values.

MCAR is ignorable because

it simplfies

since R is independent from both D O and D M and D. MAR is also ignorable: the definition of MAR is