A quick introduction

In this lecture we will introduce how to model data containing missing values with Bayesian networks. Missing values can arise in different ways, which were originally codified by Rubin. Bayesian networks give graphical form to Rubin's work using a directed graph where the variables and the missingness mechanisms are represented as nodes and relationships as arcs. The two key points of the lecture will be:

  • Reviewing Rubin's classification of missing data and its graphical representation following the work of Mohan and Pearl.
  • Adapting parameter learning and structure learning to learn Bayesian networks from incomplete data using techniques like the Expectation-Maximisation algorithm.

Bayesian networks

Definitions

Bayesian networks (BNs)1 are probabilistic graphical models defined by:

  • a network structure, a directed acyclic graph (DAG) G in which each node corresponds to a random variable Xi ;
  • a global probability distribution over X (with parameters Theta), which can be factorised into smaller local probability distributions according to the arcs present in the graph.

The main role of the network structure is to express the conditional independence relationships among the variables in the model through graphical separation: nodes that are not connected by an arc are independent conditional on some other nodes. In other words,

graphical separation implies (conditional) independence

Nodes that are connected by an arc cannot be graphically separated, and therefore they depend directly on each other in probability.

The most important consequence of this property is the factorisation of the global distribution of X:

$$\Pr(\mathbf{X}; \Theta) = \prod_{i=1}^{N} \Pr(X_i \mid \Pi_{X_i}; \Theta_{X_i}), \qquad \text{where } \Pi_{X_i} \text{ are the parents of } X_i.$$

Each local distribution has its own parameter set $\Theta_{X_i}$, and the $\Theta_{X_i}$ are collectively much smaller than $\Theta$ because many parameters are fixed by the conditional independence relationships encoded in the graph. This makes it possible to work with the high-dimensional distribution of X through a set of much smaller and simpler local distributions, each of whose dimension depends only on the number of parents of the corresponding Xi.

The first component of a BN is a DAG. Graphically, it looks like this: all arcs have directions, and no cycles are allowed.

> string = "[X1][X2][X3|X1:X2][X4|X3][X5|X3][X6|X4:X5]"
> sample.dag = model2network(string)
> graphviz.plot(sample.dag, shape = "rectangle")
[plot of chunk sample-dag]

And the implication is that:

$$\Pr(X_1, \ldots, X_6) = \Pr(X_1)\, \Pr(X_2)\, \Pr(X_3 \mid X_1, X_2)\, \Pr(X_4 \mid X_3)\, \Pr(X_5 \mid X_3)\, \Pr(X_6 \mid X_4, X_5).$$
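A quick way to see graphical separation at work is bnlearn's dsep() function; a small sketch on sample.dag (the expected results in the comments follow from d-separation in the DAG above):

> # X1 and X2 are both root nodes and only meet at the collider X3, so they are
> # d-separated marginally...
> dsep(sample.dag, "X1", "X2")         # expected: TRUE
> # ... but conditioning on the collider X3 opens the path between them.
> dsep(sample.dag, "X1", "X2", "X3")   # expected: FALSE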

The second component of a BN is the probability distribution Prob X. The choice should be such that the BN:

  • can be learned efficiently from data;
  • is flexible (it can encode a reasonable variety of phenomena);
  • is easy to query to perform inference.

The three most common choices in the literature and in practical applications (by far) are:

  • Discrete BNs, in which X and the Xi are multinomial. The parameters $\Theta_{X_i}$ of the local distributions are the conditional probabilities

    $$\pi_{ik \mid j} = \Pr(X_i = k \mid \Pi_{X_i} = j),$$

    one for each value k of Xi and each configuration j of its parents; a numeric check of this factorisation using the CPTs below follows this list.
    > dag = model2network("[X1][X2][X3|X2:X1]")
    > cptX1 = matrix(c(0.53, 0.47), ncol = 2, dimnames = list(NULL, c("a", "b")))
    > cptX2 = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("c", "d")))
    > cptX3 = c(0.15, 0.85, 0.4, 0.6, 0.75, 0.25, 0.80, 0.20)
    > dim(cptX3) = c(2, 2, 2)
    > dimnames(cptX3) = list("X3" = c("e", "f"), "X1" = c("a", "b"), "X2" = c("c", "d"))
    > bn = custom.fit(dag, list("X1" = cptX1, "X2" = cptX2, "X3" = cptX3))
    > bn
    
    
      Bayesian network parameters
    
      Parameters of node X1 (multinomial distribution)
    
    Conditional probability table:
     
       a    b 
    0.53 0.47 
    
      Parameters of node X2 (multinomial distribution)
    
    Conditional probability table:
     
       c    d 
    0.34 0.66 
    
      Parameters of node X3 (multinomial distribution)
    
    Conditional probability table:
     
    , , X2 = c
    
       X1
    X3     a    b
      e 0.15 0.40
      f 0.85 0.60
    
    , , X2 = d
    
       X1
    X3     a    b
      e 0.75 0.80
      f 0.25 0.20
    
  • Gaussian BNs (GBNs), in which X follows a multivariate normal distribution and the local distributions Xi are defined by the linear regression of each Xi against its parents $\Pi_{X_i}$:

    $$X_i = \mu_{X_i} + \Pi_{X_i} \boldsymbol{\beta}_{X_i} + \varepsilon_{X_i}, \qquad \varepsilon_{X_i} \sim N(0, \sigma^2_{X_i}).$$
    > bn = custom.fit(dag, list(
    +   "X1" = list(coef = c("(Intercept)" = 2.4), sd = sqrt(0.8)),
    +   "X2" = list(coef = c("(Intercept)" = 1.8), sd = sqrt(0.6)),
    +   "X3" = list(coef = c("(Intercept)" = 2.1, "X1" = 1.2, "X2" = -0.4), sd = sqrt(0.9))
    + ))
    > bn
    
    
      Bayesian network parameters
    
      Parameters of node X1 (Gaussian distribution)
    
    Conditional density: X1
    Coefficients:
    (Intercept)  
            2.4  
    Standard deviation of the residuals: 0.8944272 
    
      Parameters of node X2 (Gaussian distribution)
    
    Conditional density: X2
    Coefficients:
    (Intercept)  
            1.8  
    Standard deviation of the residuals: 0.7745967 
    
      Parameters of node X3 (Gaussian distribution)
    
    Conditional density: X3 | X1 + X2
    Coefficients:
    (Intercept)           X1           X2  
            2.1          1.2         -0.4  
    Standard deviation of the residuals: 0.9486833
    
  • Conditional linear Gaussian BNs (CLGBNs or CGBNs), in which X is a mixture of multivariate normal distributions and the Xi are either multinomial, univariate normal or mixtures of normals.
    • Discrete Xi are only allowed to have discrete parents (denoted $\Delta_{X_i}$) and are assumed to follow a multinomial distribution parameterised with conditional probability tables.
    • Continuous Xi are allowed to have both discrete and continuous parents (denoted $\Gamma_{X_i}$, with $\Pi_{X_i} = \Delta_{X_i} \cup \Gamma_{X_i}$), and their local distributions are normal and can be written as mixtures of linear regressions against the continuous parents,

      $$X_i = \mu_{X_i, \delta} + \Gamma_{X_i} \boldsymbol{\beta}_{X_i, \delta} + \varepsilon_{X_i, \delta}, \qquad \varepsilon_{X_i, \delta} \sim N(0, \sigma^2_{X_i, \delta}),$$

      with one component for each configuration $\delta$ of the discrete parents $\Delta_{X_i}$. If Xi has no discrete parents, the mixture reverts to a single linear regression.
    > bn = custom.fit(dag, list(
    +   "X1" = list(coef = c("(Intercept)" = 3.1), sd = sqrt(0.9)),
    +   "X2" = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("a", "b"))),
    +   "X3" = list(coef = matrix(c(1:4), ncol = 2), sd = sqrt(c(0.4, 0.3)))
    + ))
    > bn
    
    
      Bayesian network parameters
    
      Parameters of node X1 (Gaussian distribution)
    
    Conditional density: X1
    Coefficients:
    (Intercept)  
            3.1  
    Standard deviation of the residuals: 0.9486833 
    
      Parameters of node X2 (multinomial distribution)
    
    Conditional probability table:
     
       a    b 
    0.34 0.66 
    
      Parameters of node X3 (conditional Gaussian distribution)
    
    Conditional density: X3 | X1 + X2
    Coefficients:
                 0  1
    (Intercept)  1  3
    X1           2  4
    Standard deviation of the residuals:
            0          1  
    0.6324555  0.5477226  
    Discrete parents' configurations:
       X2
    0   a
    1   b
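As a quick sanity check of the factorisation above (a small sketch reusing the conditional probability tables of the discrete BN defined in the first bullet), the joint probability of a configuration is just the product of the corresponding local probabilities:

> # Pr(X1 = a, X2 = c, X3 = e) = Pr(X1 = a) Pr(X2 = c) Pr(X3 = e | X1 = a, X2 = c)
> cptX1[1, "a"] * cptX2[1, "c"] * cptX3["e", "a", "c"]   # 0.53 * 0.34 * 0.15 = 0.02703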
    

Learning

In the case of BNs, model selection and model estimation are collectively known as learning. They are usually performed as a two-step process:

  1. structure learning, learning the structure of the DAG from the data;
  2. parameter learning, learning the local distributions implied by the DAG learned in the previous step.

This workflow is Bayesian: given a data set D and a BN B = (G, Theta), we have

$$\underbrace{\Pr(\mathcal{B} \mid \mathcal{D})}_{\text{learning}} \;=\; \underbrace{\Pr(\mathcal{G} \mid \mathcal{D})}_{\text{structure learning}} \;\cdot\; \underbrace{\Pr(\Theta \mid \mathcal{G}, \mathcal{D})}_{\text{parameter learning}}.$$

Structure learning can be done in practice as

$$\Pr(\mathcal{G} \mid \mathcal{D}) \propto \Pr(\mathcal{G}) \Pr(\mathcal{D} \mid \mathcal{G}), \qquad \Pr(\mathcal{D} \mid \mathcal{G}) = \int \Pr(\mathcal{D} \mid \mathcal{G}, \Theta) \Pr(\Theta \mid \mathcal{G}) \, d\Theta,$$

where we use Bayes' theorem to move from $\Pr(\mathcal{G} \mid \mathcal{D})$ to the marginal likelihood $\Pr(\mathcal{D} \mid \mathcal{G})$ because the latter is much easier to define. Assuming that the marginal likelihood decomposes along the local distributions,

$$\Pr(\mathcal{D} \mid \mathcal{G}) = \prod_{i=1}^{N} \int \Pr(X_i \mid \Pi_{X_i}, \Theta_{X_i}) \Pr(\Theta_{X_i} \mid \mathcal{G}) \, d\Theta_{X_i},$$

which can be computed efficiently. This approach is called score-based structure learning.

Once we have learned G in structure learning, we know the parents of each node and therefore we can identify the local distributions of the Xi. This makes it possible to estimate their parameter sets $\Theta_{X_i}$ independently from each other. Assuming that G is sparse, each $\Theta_{X_i}$ involves only a few variables and thus estimating its parameters is computationally simple.

In practice, this means that we first use some structure learning algorithm to learn the DAG G from the data,

> dag = hc(data, score = "bde")

using the marginal likelihood above (which is available in closed form for discrete BNs, GBNs and CLGBNs) or some other goodness-of-fit score like BIC.

> hc.dag = hc(data, score = "bic")
> graphviz.plot(hc.dag, shape = "rectangle")
[plot of chunk example_of_structure_learning_ctd_2]

We then take the hc.dag together with the data and use it for parameter learning with some estimator (maximum likelihood, Bayesian posterior or something else).

> fitted = bn.fit(hc.dag, data, method = "mle")
> fitted

  Bayesian network parameters

  Parameters of node X1 (multinomial distribution)

Conditional probability table:
      a      b      c 
0.3336 0.3340 0.3324 

  Parameters of node X2 (multinomial distribution)

Conditional probability table:
 
   X1
X2           a          b          c
  a 0.85611511 0.44491018 0.11492178
  b 0.02517986 0.22095808 0.09446450
  c 0.11870504 0.33413174 0.79061372

  Parameters of node X3 (multinomial distribution)

Conditional probability table:
      a      b      c 
0.7434 0.2048 0.0518 

  Parameters of node X4 (multinomial distribution)

Conditional probability table:
 
, , X3 = a

   X1
X4           a          b          c
  a 0.80081301 0.09251810 0.10530547
  b 0.09024390 0.80209171 0.11173633
  c 0.10894309 0.10539019 0.78295820

, , X3 = b

   X1
X4           a          b          c
  a 0.18079096 0.88304094 0.24695122
  b 0.13276836 0.07017544 0.49390244
  c 0.68644068 0.04678363 0.25914634

, , X3 = c

   X1
X4           a          b          c
  a 0.42857143 0.34117647 0.13333333
  b 0.20238095 0.38823529 0.44444444
  c 0.36904762 0.27058824 0.42222222


  Parameters of node X5 (multinomial distribution)

Conditional probability table:
 
, , X6 = a

   X2
X5           a          b          c
  a 0.80524979 0.20588235 0.11937378
  b 0.09737511 0.17973856 0.11448141
  c 0.09737511 0.61437908 0.76614481

, , X6 = b

   X2
X5           a          b          c
  a 0.40050804 0.31679389 0.23759542
  b 0.49026249 0.36641221 0.50667939
  c 0.10922947 0.31679389 0.25572519


  Parameters of node X6 (multinomial distribution)

Conditional probability table:
      a      b 
0.5018 0.4982

For structure learning, we can consider using conditional independence tests instead of maximising a goodness-of-fit score. BNs are defined such that graphical separation between nodes in the DAG implies (conditional or marginal) independence in the probability distributions of the corresponding variables. But this means that we can use suitable statistical tests to assess (conditional or marginal) independence from data and insert (or not) arcs in the DAG as appropriate. This approach is called constraint-based structure learning.

> pc.dag = pc.stable(data, test = "x2")
> graphviz.plot(pc.dag, shape = "rectangle")
[plot of chunk example_of_constraint_based]

We will discuss in a minute the reason why the graph learned by the PC algorithm (or that would be learned by any other constraint-based algorithm, for that matter) is only partially directed.

Inference

Inference on BNs usually consists of conditional probability (CP) or maximum a posteriori (MAP) queries. The general idea of a CP query is:

  1. we have some evidence, that is, we know the values of some variables and we fix the nodes accordingly; and
  2. we want to look into the probability of some event involving (a subset of) the other variables conditional on the evidence we have.

Say we start from a discrete BN with the DAG used above.

[plot of chunk figure-survey-base]

Graphically a CP query looks like this (evidence node in red, event of interest in blue):

[plot of chunk figure-cpquery]

On the other hand, the goal of a MAP query is to find the combination of values for (a subset of) the variables in the network that has the highest probability given some evidence. If the evidence is a partially-observed new individual, then performing a MAP query amounts to a classic prediction task. Graphically, a MAP query looks like this (evidence nodes in red, variables of interest in blue):

[plot of chunk figure-mapquery]

This is more computationally challenging than it sounds because MAP queries do not decompose along local distributions.

As an example, performing a CP query using the gRain package with a BN from bnlearn involves the following steps:

> library(gRain)
> # 1) Export the BN from bnlearn to gRain, which transforms it into a junction tree.
> junction.tree = as.grain(sample.bn)
> # 2) Introduce the evidence in the junction tree, modifying the distribution of X2.
> with.evidence = setEvidence(junction.tree, nodes = "X2", states = "b")
> # 3) Query about the distribution of the node containing the event of interest.
> querygrain(with.evidence, node = "X6")
$X6
X6
        a         b 
0.4993782 0.5006218

A MAP query goes further and finds the highest probability in the joint distribution of the variables of interest (which, in general, is different from the maxima found in the respective marginal distributions). The values of the variables associated with that probability are the answer to the MAP query.

> # 2) Introduce the evidence on X1, X4 and X6 in the junction tree.
> with.evidence = setEvidence(junction.tree, nodes = c("X1", "X4", "X6"),
+                             states = c("a", "c", "b"))
> # 3) Query about the distribution of the nodes of interest X2 and X5
> pdist = querygrain(with.evidence, nodes = c("X2", "X5"), type = "joint")
> # 4) Find the configuration of values with the maximum probability.
> which.max = which(pdist == max(pdist), arr.ind = TRUE)
> which.max
  X2 X5
a  1  2
> rownames(pdist)[which.max[, 1]]
[1] "a"
> colnames(pdist)[which.max[, 2]]
[1] "b"

Both types of queries can be computed by:

  • Applying exact inference algorithms to the BN, changing its data structure into a form (the “junction tree”) that allows for the efficient computation of marginal, joint and conditional probabilities.
  • Generating random samples from the BN and using them with Monte Carlo algorithms to implement approximate inference.

Exact inference is more computationally demanding on BNs with many nodes and arcs than approximate inference, but approximate inference is affected by stochastic noise from the random samples generation process.
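For instance, the CP query above can also be answered approximately with cpquery() in bnlearn (a minimal sketch, assuming sample.bn is the fitted BN queried through gRain earlier; being based on random samples, the estimate will be close to, but not exactly equal to, the exact answer):

> set.seed(42)
> # Estimate Pr(X6 = b | X2 = b) by logic sampling: generate random samples from the
> # BN, keep those matching the evidence and compute the frequency of the event.
> cpquery(sample.bn, event = (X6 == "b"), evidence = (X2 == "b"), n = 10^6)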

Equivalence classes

BNs use directed acyclic graphs to represent the probabilistic relationships among the variables in the data. By default, that is all they can do: they are not inherently causal models, even though their arcs have a direction that could be interpreted as a cause-effect relationship. BNs are probabilistic graphical models that represent probabilistic dependencies in graphical form, and probabilistic dependence is a symmetric relationship because

$$\Pr(X_1 \mid X_2) \neq \Pr(X_1) \;\Longleftrightarrow\; \Pr(X_2 \mid X_1) \neq \Pr(X_2),$$

which in turn depends on Bayes' theorem,

$$\Pr(X_1 \mid X_2) = \frac{\Pr(X_2 \mid X_1) \Pr(X_1)}{\Pr(X_2)}.$$

As a result, we can represent the same probability distribution with different DAGs. Such DAGs are said to belong to the same equivalence class because they are indistinguishable without the use of external information. Consider the following example:

> dag1 = model2network("[X1][X2|X1][X3|X1][X4|X1]")
> dag2 = model2network("[X1|X4][X2|X1][X3|X1][X4]")
> par(mfrow = c(1, 2))
> graphviz.compare(dag1, dag2, diff = "none", shape = "rectangle")
[plot of chunk equivalence_class_example]

We can apply Bayes' theorem and show that the decompositions into local probabilities in dag1 and dag2 are in fact equivalent:

$$\Pr(X_1) \Pr(X_2 \mid X_1) \Pr(X_3 \mid X_1) \Pr(X_4 \mid X_1) = \Pr(X_4) \Pr(X_1 \mid X_4) \Pr(X_2 \mid X_1) \Pr(X_3 \mid X_1),$$

since $\Pr(X_1) \Pr(X_4 \mid X_1) = \Pr(X_4) \Pr(X_1 \mid X_4)$ by Bayes' theorem.

In other words, the directions of some arcs are not identifiable even though we can tell that the nodes are connected by an arc. The graphical representation of this idea is the completed partially-directed acyclic graph (CPDAG) in which only the arcs whose directions can be uniquely identified are directed.

> par(mfrow = c(1, 2))
> graphviz.plot(cpdag(dag1), shape = "rectangle")
> graphviz.plot(cpdag(dag2), shape = "rectangle")
[plot of chunk equivalence_class_representation]

Formally, CPDAGs have two key properties:

  • The CPDAG is defined by the skeleton of the DAG (the underlying undirected graph) and by the v-structures (patterns of converging arcs $X_j \rightarrow X_i \leftarrow X_k$ in which $X_j$ and $X_k$ are not themselves connected by an arc). The arcs belonging to the v-structures, and those whose direction is forced because reversing them would introduce cycles or additional v-structures, are the only directed arcs in the CPDAG; all other arcs have no direction (see the short example after this list).
  • Each CPDAG is the unique representation of an equivalence class. In other words, DAGs that can be transformed into the same CPDAG belong to the same equivalence class.
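As a quick illustration (a sketch using sample.dag, the six-node DAG from the beginning of the lecture), bnlearn exposes both building blocks directly:

> # the v-structures are the only arc directions identifiable from the data alone ...
> vstructs(sample.dag)
> # ... while the skeleton drops all arc directions.
> skeleton(sample.dag)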

Furthermore, DAGs in the same equivalence class are score equivalent: we cannot select one member of the equivalence class over another in structure learning (without the use of external information) because goodness-of-fit scores will assign the same value to both of them. After all, all DAGs in the same equivalence class represent the same probability distribution and the goodness-of-fit scores are a function of that probability distribution. For instance:

> score(dag1, data[, nodes(dag1)], type = "bic")
[1] -17074.39
> score(dag2, data[, nodes(dag2)], type = "bic")
[1] -17074.39

This suggests that we should always compare the structure of BNs through their CPDAGs to be able to assess whether they are effectively different or not.

> cpdag1 = cpdag(dag1)
> cpdag2 = cpdag(dag2)
> unlist(compare(cpdag1, cpdag2))
tp fp fn 
 3  0  0
> all.equal(cpdag1, cpdag2)
[1] TRUE

Sometimes we need the inverse operation: we have a CPDAG and we need to find out one of the DAGs that are part of the equivalence class that the CPDAG represents. This operation is called finding a consistent extension of the CPDAG.

> cextend(cpdag1)

  Random/Generated Bayesian network

  model:
   [X1][X2|X1][X3|X1][X4|X1] 
  nodes:                                 4 
  arcs:                                  3 
    undirected arcs:                     0 
    directed arcs:                       3 
  average markov blanket size:           1.50 
  average neighbourhood size:            1.50 
  average branching factor:              0.75 

  generation algorithm:                  Empty

Causal interpretation of BNs

What do we need to do to learn BNs as causal network models, that is, as graphical models in which arcs represent direct cause-effect relationships? Firstly, we need to make some assumptions:

  • Causal Markov assumption: Each variable Xi in X is conditionally independent of its non-effects, both direct and indirect, given its direct causes. This represents the causal equivalent of the decomposition into local distributions.
  • Faithfulness: There must exist a DAG which is faithful to the probability distribution of X, so that the only dependencies are those arising from d-separation in the DAG.
  • Causal sufficiency: There must be no latent variables (unobserved variables influencing the variables in the network) acting as confounding factors because they are parents to at least two observed variables. Such variables may induce spurious correlations between the observed variables, thus introducing bias in the causal network.

The last assumption is really a corollary of the first two: the presence of unobserved variables violates the faithfulness assumption (because the network structure cannot include them) and possibly the causal Markov property (if an arc is wrongly added between the observed variables due to the influence of the latent one). As an example, consider the following BN, from which we generate a sample of 100 observations.

> causal.bn = custom.fit(model2network("[X1][X2|X1][X3|X1]"),
+   list("X1" = list(coef = c("(Intercept)" = 0), sd = 1),
+        "X2" = list(coef = c("(Intercept)" = 0, "X1" = 3), sd = 0.5),
+        "X3" = list(coef = c("(Intercept)" = 0, "X1" = 2), sd = 0.5))
+ )
> graphviz.plot(causal.bn, shape = "rectangle")
[plot of chunk causal_network_and_latent_variables]
> complete.data = rbn(causal.bn, 100)

In this network, we can observe the following graphical separations:

> dsep(causal.bn, "X2", "X3")
[1] FALSE
> dsep(causal.bn, "X2", "X3", "X1")
[1] TRUE

We can successfully re-learn the DAG of causal.bn from complete.data because all variables are observed. However, if we omit X1, we fail to re-learn the DAG of causal.bn because X1 (now a latent variable) is a parent of both X2 and X3 (observed variables). Violating the causal sufficiency assumption results in learning a spurious arc from X2 to X3 that is not present in causal.bn.

> relearn = hc(complete.data)
> modelstring(relearn)
[1] "[X1][X2|X1][X3|X1]"
> dsep(relearn, "X2", "X3")
[1] FALSE
> dsep(relearn, "X2", "X3", "X1")
[1] TRUE
> relearn = hc(complete.data[, c("X2", "X3")])
> modelstring(relearn)
[1] "[X2][X3|X2]"
> dsep(relearn, "X2", "X3")
[1] FALSE

Even if these assumptions are satisfied, we may not be able to completely learn the BN as a causal network: we are using probabilistic tools (goodness-of-fit scores or conditional independence tests) that can only identify equivalence classes of BNs. In order to identify the remaining arc directions, we need external information that typically comes from:

  • Longitudinal (panel) data: causality can only follow the arrow of time. Any arc between a node associated with a given time point and a node associated with a later time point can only go from the former to the latter.
  • Experimental (randomised) data: data that are collected following an experimental design (as opposed to observational data that are just recorded as they arise) are such that arcs can only point outwards from the randomised variables.
  • Domain experts: information about arc directions may be either collected by interviewing experts in the domain the data come from, or by consulting the literature to find relevant results from previous investigations.
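In bnlearn, this kind of external information can be encoded as whitelists (arcs that must be present, with the given direction) and blacklists (arcs that may not be present) and passed to any structure learning algorithm. A minimal sketch, with hypothetical prior knowledge about the arc between X3 and X4:

> # Suppose an expert tells us that X3 causes X4: we whitelist the arc from X3 to X4
> # and blacklist the reverse direction before learning the rest of the structure.
> wl = data.frame(from = "X3", to = "X4")
> bl = data.frame(from = "X4", to = "X3")
> expert.dag = hc(data, score = "bic", whitelist = wl, blacklist = bl)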

Missing data

Rubin2, Little and Rubin3 have defined the key properties of missing data. We start from:

  • the complete data of all samples, which we do not observe;
  • the missingness mechanism (also known as the pattern of missingness) which determines which variables we actually observe for each sample;
  • the incomplete data, in which the values of some variables are replaced with NA for some samples to indicate that they are missing. This is what we actually observe.

In other words, the complete data are reduced to the incomplete data by the missingness mechanism. The statistical analysis of incomplete data aims to recover the probabilistic model underlying the complete data from the incomplete data by taking into account the missingness mechanism. Not taking into account the missingness mechanism would result in a biased model.

What types of missingness mechanisms are possible? Rubin defined three:

  1. Missing Completely at Random (MCAR): the probability of the value of a variable being missing is the same for all samples because the causes of the missing data are unrelated to the data. In other words, the missingness mechanism does not depend on the variables in the data.
  2. Missing at Random (MAR): the probability of being missing is the same only within groups defined by the observed data. In other words, the missingness mechanism depends on the variables that are completely observed in the data. MCAR implies MAR.
  3. Missing Not at Random (MNAR): the probability of being missing varies for reasons that are unknown to us. In other words, the missingness mechanism depends on variables we do not observe completely or at all. Any missingness mechanism that is not MCAR or MAR is MNAR.

Formally, a typical notation is

  • $\mathcal{D}_O$ for the observed data;
  • $\mathcal{D}_M$ for the missing data;
  • $\mathcal{D} = \mathcal{D}_O \cup \mathcal{D}_M$ for the complete data;
  • $R$ for a zero-one matrix with the same dimensions as $\mathcal{D}$: each cell of $R$ is equal to zero if the corresponding cell in the data is a missing value, and equal to one if the corresponding cell in the data is observed.

If the data contain no missing values, then $\mathcal{D} = \mathcal{D}_O$ and we can model $\mathcal{D}$ through $\mathcal{D}_O$ in a straightforward way. If the data are MCAR, then

$$\Pr(R \mid \mathcal{D}_O, \mathcal{D}_M) = \Pr(R),$$

that is, $R$ does not depend on either $\mathcal{D}_O$ or $\mathcal{D}_M$; if the data are MAR, then

$$\Pr(R \mid \mathcal{D}_O, \mathcal{D}_M) = \Pr(R \mid \mathcal{D}_O),$$

that is, $R$ does not depend on $\mathcal{D}_M$; and if the data are MNAR, then $\Pr(R \mid \mathcal{D}_O, \mathcal{D}_M)$ does not simplify at all.

Consider the following numerical example (from van Buuren4) involving two variables Y1 and Y2 that jointly follow a bivariate normal distribution.

> library(MASS)
> set.seed(80122)
> n = 300
> Y = mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))

Say that only Y2 contains missing values and that whether Y2 is missing for a given sample is determined by the missingness mechanism

$$\pi(Y_1, Y_2; \gamma) = \gamma_1 + \frac{e^{Y_1}}{1 + e^{Y_1}}\,\gamma_2 + \frac{e^{Y_2}}{1 + e^{Y_2}}\,\gamma_3,$$

which is mirrored by the missingness.mechanism() function in the code below. If $\gamma = (0.5, 0, 0)$ (MCAR) then

$$\pi(Y_1, Y_2; \gamma) = 0.5$$

and the missingness mechanism does not depend on Y1 and Y2 at all. If $\gamma = (0, 1, 0)$ (MAR), then

$$\pi(Y_1, Y_2; \gamma) = \frac{e^{Y_1}}{1 + e^{Y_1}}$$

and the missingness mechanism depends on Y1, which is completely observed. This allows for modelling the missingness mechanism from the data we observe. Finally, if $\gamma = (0, 0, 1)$ (MNAR) then

$$\pi(Y_1, Y_2; \gamma) = \frac{e^{Y_2}}{1 + e^{Y_2}}$$

and the missingness mechanism depends on Y2 which is not completely observed.

> missingness.mechanism = function(Y, gamma) {

+   gamma[1] + exp(Y[, 1]) / (1 + exp(Y[, 1])) * gamma[2] +
+     exp(Y[, 2]) / (1 + exp(Y[, 2])) * gamma[3]

+ }#MISSINGNESS.MECHANISM
> 
> R.MCAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0.5, 0, 0)))
> R.MAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 1, 0)))
> R.MNAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 0, 1)))

If we try to estimate the joint distribution of Y1 and Y2 under MCAR, that is, when we assume that the missingness mechanism is MCAR, we obtain means that are close to zero for both Y1 and Y2 . Their correlation is close to 0.5 as well.

> Y.MCAR = Y
> Y.MCAR[as.logical(R.MCAR), 2] = NA
> colMeans(Y.MCAR, na.rm = TRUE)
[1]  0.04083247 -0.03533149
> cor(Y.MCAR, use = "pairwise.complete")
          [,1]      [,2]
[1,] 1.0000000 0.4249437
[2,] 0.4249437 1.0000000
[plot of chunk missingness_numeric_example_3_plot]

The same is true under MAR.

> Y.MAR = Y
> Y.MAR[as.logical(R.MAR), 2] = NA
> colMeans(Y.MAR, na.rm = TRUE)
[1] 0.04083247 0.26461012
> cor(Y.MAR, use = "pairwise.complete")
          [,1]      [,2]
[1,] 1.0000000 0.5011879
[2,] 0.5011879 1.0000000
[plot of chunk missingness_numeric_example_4_plot]

However, under MNAR the estimates of the mean of Y2 and of the correlation between Y1 and Y2 are both heavily biased. Intuitively, the bias is unavoidable since it is impossible to correctly model a MNAR missingness mechanism that depends on information we do not have (the values missing in Y2 ).

> Y.MNAR = Y
> Y.MNAR[as.logical(R.MNAR), 2] = NA
> colMeans(Y.MNAR, na.rm = TRUE)
[1] 0.04083247 0.43020494
> cor(Y.MNAR, use = "pairwise.complete")
          [,1]      [,2]
[1,] 1.0000000 0.2672833
[2,] 0.2672833 1.0000000
[plot of chunk missingness_numeric_example_5_plot]

The distributions of $\mathcal{D}_O$ and $\mathcal{D}_M$ become more and more different as we move from MCAR to MAR to MNAR.

[plot of chunk missingness_numeric_example_bwplot]

This behaviour is related to ignorability, which formalises the idea that the distribution of the data is the same whether a value is observed or not:

$$\Pr(\mathcal{D}_M \mid \mathcal{D}_O, R = 1) = \Pr(\mathcal{D}_M \mid \mathcal{D}_O, R = 0).$$

As a result, we can model the distribution of $\mathcal{D}_M$ from $\mathcal{D}_O$ (equivalently, the distribution of $\mathcal{D}$ given $\mathcal{D}_O$ and $R = 1$) and use this distribution to work around the missing data. For instance, we can impute them, that is, we can replace missing values with point estimates computed from the observed values.

MCAR is ignorable because the conditional distribution of the missing data simplifies to

$$\Pr(\mathcal{D}_M \mid \mathcal{D}_O, R) = \Pr(\mathcal{D}_M \mid \mathcal{D}_O),$$

since $R$ is independent of both $\mathcal{D}_O$ and $\mathcal{D}_M$ (and therefore of $\mathcal{D}$). MAR is also ignorable: the definition of MAR is

$$\Pr(R \mid \mathcal{D}_O, \mathcal{D}_M) = \Pr(R \mid \mathcal{D}_O)$$

and then

$$\Pr(\mathcal{D}_M \mid \mathcal{D}_O, R) = \frac{\Pr(R \mid \mathcal{D}_O, \mathcal{D}_M) \Pr(\mathcal{D}_M \mid \mathcal{D}_O)}{\Pr(R \mid \mathcal{D}_O)} = \Pr(\mathcal{D}_M \mid \mathcal{D}_O).$$

The missing data $\mathcal{D}_M$ do not depend on $R$, so their distribution (and by extension that of $\mathcal{D}$) is the same for $R = 0$ and $R = 1$.

MNAR is not ignorable: we need to make assumptions on R and incorporate it in our models to work around the missing data.
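A minimal sketch of what ignorability buys us in the bivariate example above (reusing Y.MAR and Y.MNAR, and plain linear regression as the model for Y2 given Y1): under MAR we can model Y2 from the fully observed Y1 and impute the missing values, which should give an approximately unbiased estimate of the mean of Y2; under MNAR the same recipe remains biased because the missingness also depends on the values of Y2 that we do not observe.

> impute.from.Y1 = function(Y.incomplete) {
+   observed = !is.na(Y.incomplete[, 2])
+   # regress Y2 on Y1 using the observed cases only ...
+   fit = lm(Y.incomplete[observed, 2] ~ Y.incomplete[observed, 1])
+   # ... and replace the missing Y2 values with their fitted conditional means.
+   imputed = Y.incomplete[, 2]
+   imputed[!observed] = coef(fit)[1] + coef(fit)[2] * Y.incomplete[!observed, 1]
+   mean(imputed)
+ }#IMPUTE.FROM.Y1
> impute.from.Y1(Y.MAR)    # expected to be close to the true mean of zero
> impute.from.Y1(Y.MNAR)   # expected to remain noticeably biased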

Bayesian networks for missing data

Working with a causal network in the presence of unknown latent variables is difficult because of the issues with learning the structure and the parameters from data. We can partially address this issue by including the latent variables in the network: we define a new DAG $\mathcal{G}^{\star}$ over $\mathbf{O} \cup \mathbf{U}$, where $\mathbf{O}$ is the set of fully observed variables and $\mathbf{U}$ is the set of completely unobserved variables (that is, the latent variables). Now that the latent variables are included in the network, we can model them and prevent confounding: we add them as columns full of NAs to the data we use for learning.

When we have variables that are only partially observed, missing values are present along with observed values and we need to augment the causal network again to model them correctly. In order to do that, Mohan and Pearl define a missingness graph5 to extend the ideas of Rubin. A missingness graph is a causal graph whose vertices are partitioned into five disjoint subsets:

  • the fully observed nodes O;
  • the completely unobserved nodes U;
  • the partially observed variables M;
  • the proxy variables S to the partially observed variables, with the values that are actually observed;
  • the missingness indicators R.

The relationship between M, S and R is as follows:

  • The variables Mi in M are the variables for which we would have liked to observe values for all observations.
  • The variables Si in S are the variables which we actually (partially) observe.
  • The variables Ri in R are binary indicator variables that control whether the value of Mi is observed for each observation in Si or a missing value NA is observed instead.

Formally, this means

$$S_i = \begin{cases} M_i & \text{if } R_i = 0, \\ \texttt{NA} & \text{if } R_i = 1. \end{cases}$$
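As a small illustration (a sketch reusing the bivariate example from the previous section, with the convention that Ri = 0 means observed):

> M2 = Y[, 2]                          # the complete values we would like to observe
> R2 = as.numeric(is.na(Y.MAR[, 2]))   # the missingness indicators (0 = observed)
> S2 = ifelse(R2 == 1, NA, M2)         # the proxy variable we actually observe
> identical(S2, Y.MAR[, 2])            # expected: TRUE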

Since missingness graphs are causal graphs, they can be queried graphically for independencies between variables in the observed-data distribution $\Pr(\mathbf{O}, \mathbf{S}, \mathbf{R})$ by using d-separation. The corresponding underlying distribution with no missing values is $\Pr(\mathbf{O}, \mathbf{M}, \mathbf{R})$.

MCAR, MAR and MNAR result in different independence statements which can be assessed through d-separation. Therefore, we can link them to the independency statements they imply in the missingness graph:

  • MCAR implies $\mathbf{O} \cup \mathbf{U} \cup \mathbf{M} \perp\!\!\!\perp \mathbf{R}$: the missingness is random and independent of both the fully observed and the partially observed variables.
  • MAR implies $\mathbf{U} \cup \mathbf{M} \perp\!\!\!\perp \mathbf{R} \mid \mathbf{O}$: missingness is random and depends only on the fully observed variables.
  • The data are MNAR if they are neither MCAR nor MAR.

The approaches for learning BNs as causal networks or as probabilistic networks from incomplete data are reviewed here.6

Consider again the two-variable example we used to illustrate MCAR, MAR and MNAR. As a BN, it would have the following DAG since Y1 and Y2 are correlated. If the data are completely observed, that is the correct model for the data.

> graphviz.plot(model2network("[Y2][Y1|Y2]"), shape = "rectangle")
[plot of chunk missingness_graph_1]

Assume now that the data are incomplete and MCAR. The variable Y1 is fully observed, so Y1 in O. The variable Y2 is only partially observed: in the missingness graph, we replace it with the nodes:

  • S2 , the mix of values and NAs that we actually observe for the samples.
  • M2, the values we would observe for Y2 for the samples if there were no missing values. They are identical to the corresponding values in S2 if R2 is equal to 0.
  • R2 , the binary indicators that determine whether the value of Y2 is observed (or not) for each sample.

To obtain the correct model for MCAR, R2 should not have any parents. As for the arc from Y2 to Y1, we replace it with an arc from M2 to Y1 to make this directed graph a direct extension of the previous one.

> missingness.graph = model2network("[S2|R2:M2][R2][M2][Y1|M2]")
> graphviz.plot(missingness.graph, shape = "rectangle")
[plot of chunk missingness_graph_mcar]

The independency statements of MCAR are satisfied:

> # This checks that O is d-separated from R.
> dsep(missingness.graph, "Y1", "R2")
[1] TRUE
> # This checks that M is d-separated from R.
> dsep(missingness.graph, "M2", "R2")
[1] TRUE

And since MCAR implies MAR, the independency statements of MAR are also satisfied:

> # This checks that M is d-separated from R given O.
> dsep(missingness.graph, "M2", "R2", "Y1")
[1] TRUE

In the MAR case, we saw earlier that the missingness mechanism depends on Y1, which is a completely observed variable: we should therefore add the arc from Y1 to R2 to the missingness graph to represent it.

> missingness.graph = model2network("[S2|R2:M2][R2|Y1][M2][Y1|M2]")
> graphviz.plot(missingness.graph, shape = "rectangle")
[plot of chunk missingness_graph_mar]

The independency statements of MAR are satisfied:

> # This checks that M is d-separated from R given O.
> dsep(missingness.graph, "M2", "R2", "Y1")
[1] TRUE

However, the independency statements of MCAR are no longer satisfied.

> # This checks that O is d-separated from R.
> dsep(missingness.graph, "Y1", "R2")
[1] FALSE
> # This checks that M is d-separated from R.
> dsep(missingness.graph, "M2", "R2")
[1] FALSE

In the MNAR case, the missingness mechanism for Y2 depends on Y2 itself. We do not have a node for Y2 in the missingness graph, but we do have M2, which models the (unobserved) complete version of Y2; we can then add the arc from M2 to R2 to represent this.

> missingness.graph = model2network("[S2|R2:M2][R2|M2][M2][Y1|M2]")
> graphviz.plot(missingness.graph, shape = "rectangle")
[plot of chunk missingness_graph_mnar]

Since MNAR excludes MCAR and MAR, the independency statements of both missingness mechanisms are not satisfied.

> # MCAR: O is d-separated from R.
> dsep(missingness.graph, "Y1", "R2")
[1] FALSE
> # MCAR: M is d-separated from R.
> dsep(missingness.graph, "M2", "R2")
[1] FALSE
> # MAR: M is d-separated from R given O.
> dsep(missingness.graph, "M2", "R2", "Y1")
[1] FALSE

What is the role of the missingness graph in structure and parameter learning? In some cases, using it for parameter learning may make it possible to correctly estimate the parameters of the BN: this property is called recoverability, and it is discussed in detail in Mohan and Pearl. As for structure learning, feeding a partially-constructed missingness graph to a structure learning algorithm and letting it learn the rest from the data is a viable approach. However, as we will see in the following, many classical approaches to BN modelling just assume that the data are MCAR or MAR to avoid using the missingness graph altogether.

Parameter learning

Firstly, we will assume that the DAG G is given and that we only need to learn the parameters of the network from the incomplete data.

Using locally-complete data

The simplest possible way to implement parameter learning is to drop all the observations containing missing values and to use only complete observations.

> colnames(data) = c("X1", "X2", "X3", "X4", "X5", "X6")
> dag = model2network("[X1][X3][X6][X2|X1][X4|X1:X3][X5|X2:X6]")
> incomplete.data = data[1:80, ]
> for (i in seq(ncol(incomplete.data)))
+   incomplete.data[1:10 + 10 * i, i] = NA
> 
> complete.subset = incomplete.data[complete.cases(incomplete.data), ]
> bn.fit(dag, complete.subset, method = "mle")

  Bayesian network parameters

  Parameters of node X1 (multinomial distribution)

Conditional probability table:
    a    b    c 
0.25 0.45 0.30 

  Parameters of node X2 (multinomial distribution)

Conditional probability table:
 
   X1
X2          a         b         c
  a 1.0000000 0.4444444 0.1666667
  b 0.0000000 0.3333333 0.1666667
  c 0.0000000 0.2222222 0.6666667

  Parameters of node X3 (multinomial distribution)

Conditional probability table:
    a    b    c 
0.60 0.35 0.05 

  Parameters of node X4 (multinomial distribution)

Conditional probability table:
 
, , X3 = a

   X1
X4          a         b         c
  a 1.0000000 0.0000000 0.0000000
  b 0.0000000 0.8000000 0.0000000
  c 0.0000000 0.2000000 1.0000000

, , X3 = b

   X1
X4          a         b         c
  a 0.0000000 1.0000000 0.0000000
  b 0.0000000 0.0000000 0.6666667
  c 1.0000000 0.0000000 0.3333333

, , X3 = c

   X1
X4  a         b c
  a   1.0000000  
  b   0.0000000  
  c   0.0000000  


  Parameters of node X5 (multinomial distribution)

Conditional probability table:
 
, , X6 = a

   X2
X5          a         b         c
  a 0.7142857 0.0000000 0.0000000
  b 0.2857143 0.0000000 0.2500000
  c 0.0000000 1.0000000 0.7500000

, , X6 = b

   X2
X5          a         b         c
  a 0.0000000 0.0000000 0.0000000
  b 1.0000000 1.0000000 0.5000000
  c 0.0000000 0.0000000 0.5000000


  Parameters of node X6 (multinomial distribution)

Conditional probability table:
    a    b 
0.65 0.35

This is not an efficient use of the data because it reduces the sample size too much.

> nrow(incomplete.data)
[1] 80
> nrow(complete.subset)
[1] 20

As a result, the estimates of the parameters can be very imprecise.

> bn.fit(dag, complete.subset, method = "mle")$X5

  Parameters of node X5 (multinomial distribution)

Conditional probability table:
 
, , X6 = a

   X2
X5          a         b         c
  a 0.7142857 0.0000000 0.0000000
  b 0.2857143 0.0000000 0.2500000
  c 0.0000000 1.0000000 0.7500000

, , X6 = b

   X2
X5          a         b         c
  a 0.0000000 0.0000000 0.0000000
  b 1.0000000 1.0000000 0.5000000
  c 0.0000000 0.0000000 0.5000000
> bn.fit(dag, data[1:80, ])$X5

  Parameters of node X5 (multinomial distribution)

Conditional probability table:
 
, , X6 = a

   X2
X5           a          b          c
  a 0.66666667 0.00000000 0.05882353
  b 0.23809524 0.28571429 0.11764706
  c 0.09523810 0.71428571 0.82352941

, , X6 = b

   X2
X5           a          b          c
  a 0.38461538 0.00000000 0.33333333
  b 0.53846154 0.50000000 0.44444444
  c 0.07692308 0.50000000 0.22222222

A better approach along the same lines is to use locally-complete observations to estimate the parameters of each local distribution: after all, a local distribution only involves one node and its parents so we do not care whether the values of the remaining variables are missing or not.

> # bn.fit() uses locally-complete observations by default for maximum likelihood
> # and Bayesian estimators.
> bn.fit(dag, incomplete.data, method = "mle")$X5

  Parameters of node X5 (multinomial distribution)

Conditional probability table:
 
, , X6 = a

   X2
X5          a         b         c
  a 0.5714286 0.0000000 0.0000000
  b 0.2857143 0.5000000 0.1666667
  c 0.1428571 0.5000000 0.8333333

, , X6 = b

   X2
X5          a         b         c
  a 0.1428571 0.0000000 0.3636364
  b 0.7142857 1.0000000 0.3636364
  c 0.1428571 0.0000000 0.2727273

The Expectation-Maximisation algorithm

In the context of parameter learning, the Expectation-Maximisation (EM) algorithm retains its classic structure:

  • In the expectation step (E), we compute the expected values of the sufficient statistics (such as the counts $n_{ijk}$ in discrete BNs, or the partial correlations in GBNs), using exact inference along the lines described above to make use of incomplete as well as complete samples.
  • In the maximisation step (M), we take the sufficient statistics from the E-step and estimate the parameters of the BN, either using maximum likelihood or Bayesian posterior estimators.

We iterate over these two steps, using the parameter estimates to update the expected values of the sufficient statistics in the next E-step. Repeated iterations of these two steps will return the maximum likelihood or maximum a posteriori estimates of the parameters in the limit. Note that the EM algorithm is only valid for MCAR and MAR missing data.

> start = bn.fit(empty.graph(colnames(data)), incomplete.data)
> fitted = bn.fit(dag, incomplete.data, method = "hard-em", fit = "mle",
+            impute = "exact", threshold = 1e-5, debug = TRUE,
+            start = start)
* expectation-maximization iteration 1 .
  > the relative difference in log-likelihood is: 0.1993889 .
* expectation-maximization iteration 2 .
  > the relative difference in log-likelihood is: 0.04342052 .
* expectation-maximization iteration 3 .
  > the relative difference in log-likelihood is: 0.002834975 .
* expectation-maximization iteration 4 .
  > the relative difference in log-likelihood is: 0.001260253 .
* expectation-maximization iteration 5 .
  > the relative difference in log-likelihood is: 0 .
  @ the difference is smaller than the threshold, stopping.
> fitted

  Bayesian network parameters

  Parameters of node X1 (multinomial distribution)

Conditional probability table:
      a      b      c 
0.2250 0.2875 0.4875 

  Parameters of node X2 (multinomial distribution)

Conditional probability table:
 
   X1
X2           a          b          c
  a 0.94444444 0.60869565 0.05128205
  b 0.05555556 0.21739130 0.05128205
  c 0.00000000 0.17391304 0.89743590

  Parameters of node X3 (multinomial distribution)

Conditional probability table:
      a      b      c 
0.6375 0.3000 0.0625 

  Parameters of node X4 (multinomial distribution)

Conditional probability table:
 
, , X3 = a

   X1
X4           a          b          c
  a 0.85714286 0.00000000 0.04166667
  b 0.00000000 0.92307692 0.20833333
  c 0.14285714 0.07692308 0.75000000

, , X3 = b

   X1
X4           a          b          c
  a 0.00000000 1.00000000 0.33333333
  b 0.00000000 0.00000000 0.58333333
  c 1.00000000 0.00000000 0.08333333

, , X3 = c

   X1
X4  a          b          c
  a   0.50000000 0.00000000
  b   0.00000000 0.33333333
  c   0.50000000 0.66666667


  Parameters of node X5 (multinomial distribution)

Conditional probability table:
 
, , X6 = a

   X2
X5           a          b          c
  a 0.68181818 0.00000000 0.00000000
  b 0.22727273 0.40000000 0.09523810
  c 0.09090909 0.60000000 0.90476190

, , X6 = b

   X2
X5           a          b          c
  a 0.09090909 0.00000000 0.27777778
  b 0.81818182 1.00000000 0.33333333
  c 0.09090909 0.00000000 0.38888889


  Parameters of node X6 (multinomial distribution)

Conditional probability table:
   a   b 
0.6 0.4

How does EM work in practice? One way of implementing it is to impute the missing values in the E-step: this is the so-called hard EM.

> # 1) E-step: start from an empty network, so that each node encodes the marginal
> #    distribution of the corresponding variable, estimated from the observed
> #    values.
> current.bn = bn.fit(empty.graph(nodes(dag)), incomplete.data)
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> all(complete.cases(completed.data))
[1] TRUE
> # 2) Compute an initial estimate of the log-likelihood of the data.
> current.loglik = logLik(current.bn, completed.data)
> current.loglik
[1] -447.0964
> # 3) M-step: use the completed data to estimate the parameters of the BN.
> new.bn = bn.fit(dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik
[1] -387.1627
> # 4) Compute the relative improvement in the log-likelihood: is it large enough
> #    to make it worth to perform more iterations?
> (new.loglik - current.loglik) / abs(current.loglik)
[1] 0.1340509
> # 5) Reset the current state of the algorithm.
> current.bn = new.bn
> current.loglik = new.loglik
> # 6) E-step.
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> # 7) M-step.
> new.bn = bn.fit(dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik
[1] -337.8896
> # 8) Compute the relative improvement in the log-likelihood: is it large enough
> #    to make it worth to perform more iterations?
> (new.loglik - current.loglik) / abs(current.loglik)
[1] 0.1272672
> par(mfrow = c(1, 2))
> graphviz.chart(current.bn, type = "barprob", main = "Current BN")
> graphviz.chart(new.bn, type = "barprob", main = "New BN")
[plot of chunk hard_em_step_by_step]
> # ... And so on until convergence.
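The steps above can be wrapped into a loop that iterates until the relative improvement in the log-likelihood becomes negligible; a compact sketch of the same procedure, reusing the objects above and the 1e-5 threshold from the earlier bn.fit() call:

> current.bn = bn.fit(empty.graph(nodes(dag)), incomplete.data)
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> current.loglik = logLik(current.bn, completed.data)
> repeat {
+   # M-step: re-estimate the parameters of the BN from the completed data.
+   new.bn = bn.fit(dag, completed.data)
+   new.loglik = logLik(new.bn, completed.data)
+   # stop when the relative improvement in the log-likelihood is below the threshold.
+   if ((new.loglik - current.loglik) / abs(current.loglik) < 1e-5)
+     break
+   current.bn = new.bn
+   current.loglik = new.loglik
+   # E-step: re-impute the missing values using the updated parameters.
+   completed.data = impute(current.bn, incomplete.data, method = "exact")
+ }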

Structure learning

Learning the structure of a BN from incomplete data is, in general, computationally unfeasible because we need to perform a joint optimisation over the missing values and the parameters to score each candidate network. We can make this apparent by rewriting $\Pr(\mathcal{D} \mid \mathcal{G})$ as a function of $\mathcal{D}_O$ (the observed data) and $\mathcal{D}_M$ (the missing data):

$$\Pr(\mathcal{D} \mid \mathcal{G}) = \int \Pr(\mathcal{D}_O \mid \mathcal{G}, \Theta) \Pr(\mathcal{D}_M \mid \mathcal{D}_O, \mathcal{G}, \Theta) \Pr(\Theta \mid \mathcal{G}) \, d\Theta.$$

From this expression we can see that in order to maximise $\Pr(\mathcal{D} \mid \mathcal{G})$ we should jointly maximise the probability of the observed data $\mathcal{D}_O$ and the probability of the missing data $\mathcal{D}_M$ given the observed data, for each candidate $\mathcal{G}$, averaging over all possible $\Theta$. This gives us the maximum a posteriori solution to structure learning.

A full Bayesian approach would require averaging over all the possible configurations of the missing data as well:

$$\Pr(\mathcal{D}_O \mid \mathcal{G}) = \int \!\! \int \Pr(\mathcal{D}_O \mid \mathcal{G}, \Theta) \Pr(\mathcal{D}_M \mid \mathcal{D}_O, \mathcal{G}, \Theta) \Pr(\Theta \mid \mathcal{G}) \, d\mathcal{D}_M \, d\Theta.$$

This implies one extra dimension of integration for each missing value, in addition to one dimension for each parameter in $\Theta$; and $\Pr(\mathcal{D}_M \mid \mathcal{D}_O)$ does not decompose along the local distributions in the general case.

Using locally-complete data

Earlier we saw how we can use locally-complete data to learn the parameters of each local distribution. The next logical step is to use the locally-complete data and the parameters we estimated from them to compute the goodness-of-fit scores we need for structure learning. The main issue with this approach is that local distributions (for the same node) with different parent sets will be learned from different subsets of the data, so:

  1. The respective log-likelihoods may be on different scales, because they may sum over different numbers of observations.
  2. The respective log-likelihoods may be estimated from non-nested subsets of the data, which in the worst case may not overlap at all.

The node-average log-likelihood (NAL)7 addresses both problems. If the data were complete, we could compute a penalised score such as BIC for each node as

$$\mathrm{BIC}(X_i \mid \Pi_{X_i}) = \sum_{j=1}^{n} \log \Pr(x_{ij} \mid \pi_{ij}) - \frac{\log n}{2}\, |\Theta_{X_i}|,$$

which is numerically equivalent to computing

$$\frac{1}{n} \sum_{j=1}^{n} \log \Pr(x_{ij} \mid \pi_{ij}) - \frac{\log n}{2n}\, |\Theta_{X_i}|.$$

The quantity $\frac{1}{n} \sum_{j} \log \Pr(x_{ij} \mid \pi_{ij})$ is the empirical estimate of the expected value of the log-likelihood of a single observation: its scale does not depend on the number of observations. Furthermore, we can seamlessly replace it with the average over the $n_{X_i}$ locally-complete observations in the incomplete data, which (after penalisation) gives the penalised node-average log-likelihood (PNAL):

$$\mathrm{PNAL}(X_i \mid \Pi_{X_i}) = \frac{1}{n_{X_i}} \sum_{j\,:\,\text{locally complete}} \log \Pr(x_{ij} \mid \pi_{ij}) - \lambda_n\, |\Theta_{X_i}|.$$

Tjebbe Bodewes and Marco Scutari proved that the NAL ensures both model identifiability (up to the equivalence class) and consistency in structure learning under MCAR as long as $n \lambda_n$ and $\sqrt{n}\, \lambda_n$ both diverge as $n \to \infty$. These conditions hold for neither AIC nor BIC because their penalties are not large enough when rewritten in terms of $\lambda_n$.

In practice, this means we can compute the node-average log-likelihood as we would any other score function, regardless of whether the data are complete or not.

> score(dag, incomplete.data, type = "nal", by.node = TRUE)
        X1         X2         X3         X4         X5         X6 
-1.0561521 -0.6071982 -0.8415452 -0.5380200 -0.7824792 -0.6829081

For instance, the NAL for node X2 (with parent X1 in dag) is computed as follows.

> # 1) Identify the parents of node X2.
> node = "X2"
> pars = parents(dag, node)
> # 2) Identify which observations are locally complete for X2 and its parents.
> locally.complete = complete.cases(incomplete.data[, c(node, pars)])
> # 3) Subset the data.
> locally.complete.data = incomplete.data[locally.complete, c(node, pars)]
> # 4) Compute the counts and then the conditional probabilities of X2 given the parents.
> counts = table(locally.complete.data)
> probs = prop.table(counts, margin = "X1")
> # 5) Compute the log-likelihood and divide it by the number of locally-complete observations.
> nal = sum(counts * log(probs), na.rm = TRUE) / nrow(locally.complete.data)
> nal
[1] -0.6071982

Computing the PNAL from the NAL is then just a matter of subtracting the penalty term (the coefficient $\lambda_n$, here with its default value, multiplied by the number of free parameters in the local distribution).

> score(dag, incomplete.data, type = "pnal", by.node = TRUE)
        X1         X2         X3         X4         X5         X6 
-1.0575453 -0.6113779 -0.8429384 -0.5505589 -0.7908384 -0.6836047
> lambda_n = (nrow(incomplete.data)^(-0.25)) / nnodes(dag)
> pnal = nal - lambda_n/nrow(incomplete.data) * (nrow(probs) - 1) * ncol(probs)
> pnal
[1] -0.6113779

In order to perform structure learning from incomplete data with PNAL, we just need to use score = "pnal" with a score-based learning algorithm like hill-climbing. How do the network structure learned with PNAL and that learned from the complete data (before we introduced the NAs) compare with the reference network structure?

> pnal.dag = hc(incomplete.data, score = "pnal", k = 1 / (nnodes(dag) * 0.2))
> complete.data.dag = hc(data[1:80, ], score = "bic")
> par(mfrow = c(1, 3))
> graphviz.compare(dag, complete.data.dag, pnal.dag, shape = "rectangle",
+   main = c("Reference DAG", "Complete Data DAG", "Learned with PNAL"))
[plot of chunk hc_with_pnal]

As we mentioned earlier, a better way of comparing network structures is through their CPDAGs.

> par(mfrow = c(1, 3))
> graphviz.compare(cpdag(dag), cpdag(complete.data.dag), cpdag(pnal.dag),
+   shape = "rectangle", main = c("Reference DAG", "Complete Data DAG", "Learned with PNAL"))
[plot of chunk cpdag_for_pnal]

While the three network structures in the first figure look very similar, the CPDAG of the network we learned with BIC is in fact very different from that of the reference network structure, while that of the network learned with PNAL is fairly close.

The Expectation-Maximisation algorithm

In the context of structure learning, the EM algorithm is called structural EM (SEM). It makes structure learning computationally feasible by searching for the best structure inside of EM, instead of embedding EM inside a structure learning algorithm. It consists of two steps like the classic EM:

  • In the expectation step (E), we complete the data by computing the expected sufficient statistics using the current network structure.
  • In the maximisation step (M), we find the structure that maximises the expected score function for the completed data.

Since the scoring in the M-step uses the completed data, structure learning can be implemented efficiently using standard algorithms for complete data. The original proposal by Nir Friedman used BIC and greedy search; he later extended SEM to a fully Bayesian approach based on posterior scores, and proved the convergence of the resulting algorithm under MCAR and MAR.

> start = bn.fit(empty.graph(colnames(data)), incomplete.data)
> learned.dag = structural.em(incomplete.data, impute = "exact", fit = "mle",
+                 start = start)
> learned.dag

  Bayesian network learned from Missing Data

  model:
   [X1][X3][X6][X2|X1][X4|X1][X5|X1:X6] 
  nodes:                                 6 
  arcs:                                  4 
    undirected arcs:                     0 
    directed arcs:                       4 
  average markov blanket size:           1.67 
  average neighbourhood size:            1.33 
  average branching factor:              0.67 

  learning algorithm:                    Structural EM 
  score-based method:                    Hill-Climbing 
  parameter learning method:             Maximum Likelihood (disc.) 
  imputation method:                     Exact Inference 
  penalization coefficient:              2.191013 
  tests used in the learning procedure:  140 
  optimized:                             TRUE

Between the limited amount of data and the missing values, structural.em() is unable to learn the correct DAG.

> par(mfrow = c(1, 2))
> graphviz.compare(dag, learned.dag, main = c("correct DAG", "learned DAG"),
+   shape = "rectangle")
[plot of chunk structural_em_compare]

But what is structural.em() doing in practice?

> # 1) E-step: start from an empty network, so that each node encodes the marginal
> #    distribution of the corresponding variable, estimated from the observed
> #    values.
> current.dag = empty.graph(colnames(data))
> current.bn = bn.fit(current.dag, incomplete.data)
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> # 2) Compute an initial estimate of the log-likelihood of the data.
> current.loglik = logLik(current.bn, completed.data)
> current.loglik
[1] -447.0964
> # 3) M-Step: learn the structure of the DAG from the completed data.
> new.dag = tabu(completed.data)
> # 4) Is the DAG we just learned different from the previous one?
> all.equal(current.dag, new.dag)
[1] "Different number of directed/undirected arcs"
> par(mfrow = c(1, 2))
> graphviz.compare(current.dag, new.dag, main = c("Current DAG", "New DAG"),
+   shape = "rectangle")
[plot of chunk structural_em_step_by_step]
> # 5) Learn the parameters as well and compute the relative improvement in the
> # log-likelihood: is it large enough to make it worth to perform more
> # iterations?
> new.bn = bn.fit(new.dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik
[1] -404.8411
> (new.loglik - current.loglik) / abs(current.loglik)
[1] 0.09451047
> # 6) Reset the current state of the algorithm.
> current.dag = new.dag
> current.bn = new.bn
> current.loglik = new.loglik
> # 7) E-step.
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> # 8) M-Step.
> new.dag = tabu(completed.data)
> # 9) Is the DAG we just learned different from the previous one?
> all.equal(current.dag, new.dag)
[1] "Different arc sets"
> par(mfrow = c(1, 2))
> graphviz.compare(current.dag, new.dag, main = c("Current DAG", "New DAG"),
+   shape = "rectangle")
[plot of chunk structural_em_step_by_step]
> # 10) Did the log-likelihood improve?
> new.bn = bn.fit(new.dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik
[1] -379.1162
> (new.loglik - current.loglik) / abs(current.loglik)
[1] 0.06354313
> # ... And so on until convergence.
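As with hard EM, the steps above can be wrapped into a loop; a compact sketch (reusing the objects above) that stops as soon as the learned DAG no longer changes between iterations:

> repeat {
+   # E-step: impute the missing values with the current BN.
+   completed.data = impute(current.bn, incomplete.data, method = "exact")
+   # M-step: re-learn the structure, then the parameters, from the completed data.
+   new.dag = tabu(completed.data)
+   if (isTRUE(all.equal(current.dag, new.dag)))
+     break
+   current.dag = new.dag
+   current.bn = bn.fit(new.dag, completed.data)
+ }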

The PC algorithm

Using locally-complete data is a principled way of learning BNs using constraint-based algorithms, as shown by Eric Strobl, Shyam Visweswaran and Peter Spirtes.8 The trick is that we should perform each conditional independence test twice, as follows:

  1. Test whether Xi is independent of Xj given the conditioning set Sij using locally-complete observations, and record the resulting p-value.
  2. Is the p-value smaller than the chosen threshold (typically 0.01)?
    • If the answer is yes, then reject the null hypothesis of (conditional) independence.
    • If the answer is no, test whether Xi is independent of Xj given Sij using globally-complete observations and record the resulting p-value.
  3. Is the minimum between the p-values of the two tests smaller than the chosen threshold?
    • If the answer is no, then accept the null hypothesis of (conditional) independence.
    • If the answer is yes, then reject the null hypothesis of (conditional) independence.

This double-testing ensures that we can apply algorithms like the PC to incomplete data that are MCAR or MAR. Interestingly, it works even if the data are MNAR! Furthermore, Strobl, Visweswaran and Spirtes show that even just performing a single test on the locally-complete data works very well in practice.

> # constraint-based algorithms in bnlearn all use locally-complete observations
> # for constraint-based learning, by default.
> learned.dag = pc.stable(incomplete.data)
> par(mfrow = c(1, 2))
> graphviz.compare(dag, learned.dag, main = c("correct DAG", "learned DAG"),
+   shape = "rectangle")
[plot of chunk pc_with_mcar_example]

As an example consider the following conditional independence test, which is the first of the two tests.

> # ci.test() uses locally-complete data by default.
> ci.test("X1", "X6", "X4", data = incomplete.data, test = "x2")

	Pearson's X^2

data:  X1 ~ X6 | X4
x2 = 12.368, df = 6, p-value = 0.05425
alternative hypothesis: true value is greater than 0

We can compute it manually from the locally-complete data.

> local.data = incomplete.data[, c("X1", "X6", "X4")]
> locally.complete = local.data[complete.cases(local.data), ]
> nrow(locally.complete)
[1] 50
> prop.table(table(locally.complete), margin = "X4")
, , X4 = a

   X6
X1           a          b
  a 0.23529412 0.17647059
  b 0.17647059 0.17647059
  c 0.23529412 0.00000000

, , X4 = b

   X6
X1           a          b
  a 0.00000000 0.00000000
  b 0.46666667 0.20000000
  c 0.00000000 0.33333333

, , X4 = c

   X6
X1           a          b
  a 0.11111111 0.00000000
  b 0.05555556 0.00000000
  c 0.38888889 0.44444444
> test.statistic = as.numeric(
+   chisq.test(table(locally.complete)[, , "a"])$stat  +
+   chisq.test(table(locally.complete)[, , "b"] + 1e-6)$stat  +
+   chisq.test(table(locally.complete)[, , "c"])$stat
+ )
> test.statistic
[1] 12.36782
> pchisq(test.statistic, df = 6, lower.tail = FALSE)
[1] 0.05424854

Its p-value is large, so we should compute the second test as well. The second test involves the same variables, but uses only globally-complete data.

> globally.complete = local.data[complete.cases(incomplete.data), ]
> nrow(globally.complete)
[1] 20
> prop.table(table(globally.complete), margin = "X4")
, , X4 = a

   X6
X1          a         b
  a 0.3750000 0.1250000
  b 0.2500000 0.2500000
  c 0.0000000 0.0000000

, , X4 = b

   X6
X1          a         b
  a 0.0000000 0.0000000
  b 0.5000000 0.1666667
  c 0.0000000 0.3333333

, , X4 = c

   X6
X1          a         b
  a 0.1666667 0.0000000
  b 0.1666667 0.0000000
  c 0.5000000 0.1666667
> test.statistic = as.numeric(
+   chisq.test(table(globally.complete)[, , "a"] + 1e-6)$stat  +
+   chisq.test(table(globally.complete)[, , "b"] + 1e-6)$stat  +
+   chisq.test(table(globally.complete)[, , "c"])$stat
+ )
> test.statistic
[1] 4.133331
> pchisq(test.statistic, df = 6, lower.tail = FALSE)
[1] 0.6586386

Equivalently:

> ci.test("X1", "X6", "X4",
+   data = incomplete.data[complete.cases(incomplete.data), ],
+   test = "x2")

	Pearson's X^2

data:  X1 ~ X6 | X4
x2 = 4.1333, df = 6, p-value = 0.6586
alternative hypothesis: true value is greater than 0

The second test produces an even larger p-value: since both are larger than the threshold we accept the null hypothesis of conditional independence between X1 and X6 given X4.
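
In terms of the decision rule above, using the two p-values just computed:

> p.local = 0.05424854
> p.global = 0.6586386
> # the minimum is not below the 0.01 threshold, so independence is accepted.
> min(p.local, p.global) < 0.01
[1] FALSE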

Inference

Incomplete data are not a concern when performing general inference on BNs: as long as we can define the evidence we are considering and the event (for CP queries) or the target nodes (for MAP queries), we can obtain the answers we need in the same way as when data are complete. The only influence missing data may have is on the quality of the BN we learn from them and then query.

There are, however, two inference tasks that are interesting to discuss for incomplete data: imputation and prediction. Both take data as input, as opposed to user-provided evidence and events, and therefore interact with the missing values.

Imputation

Imputing missing data means replacing them with values estimated from the observed data. We can do that with:

  1. Single Imputation: we replace the missing values with their expectations (for continuous variables) or MAP estimates (for discrete variables).
  2. Multiple Imputation: we replace the missing values with values sampled from their posterior distribution given the observed data, that is, Prob D M given D O. We do that multiple times to capture the variability of the imputation process and to incorporate it in the variability of our statistical results.

Because it propagates the uncertainty in the imputed values, multiple imputation is what Rubin calls a proper imputation, while single imputation is not. However, it is also much more computationally expensive, even more so for multivariate models such as BNs. For this reason, single imputation is more common in practical applications.

The simplest way of performing single imputation is to take the local distribution of the node whose value we are not observing and condition on the values of its parents. In a causal setting, we are effectively imputing a variable from its direct causes. If there are multiple missing values in a single observation, we must do this for each of them in the topological ordering implied by the DAG, so that the parents of each node are imputed (if needed) before the node itself (see the code sketch after the step-by-step example below).

> dag = model2network("[X1][X2][X3|X1:X2][X4|X3][X5|X3][X6|X4:X5]")
> bn = bn.fit(dag, data)
> incomplete.data[c(11, 21, 31, 41, 51, 61), ]
     X1   X2   X3   X4   X5   X6
11 <NA>    a    a    b    c    a
21    c <NA>    a    c    b    b
31    a    a <NA>    a    b    b
41    c    c    a <NA>    a    b
51    b    a    a    b <NA>    b
61    c    c    a    c    c <NA>
> imputed.data = impute(bn, data = incomplete.data, method = "parents")
> imputed.data[c(11, 21, 31, 41, 51, 61), ]
   X1 X2 X3 X4 X5 X6
11  b  a  a  b  c  a
21  c  a  a  c  b  b
31  a  a  a  a  b  b
41  c  c  a  b  a  b
51  b  a  a  b  a  b
61  c  c  a  c  c  a

More in detail:

> # 1) Identify the missing values in the observation, encoded as NA.
> incomplete.observation = incomplete.data[61, ]
> incomplete.observation
   X1 X2 X3 X4 X5   X6
61  c  c  a  c  c <NA>
> # 2) Extract the local distribution of the node with missing values from the BN.
> cpt = coef(bn[[which(is.na(incomplete.observation))]])
> cpt
, , X5 = a

   X4
X6          a         b         c
  a 0.6163216 0.5942029 0.5316901
  b 0.3836784 0.4057971 0.4683099

, , X5 = b

   X4
X6          a         b         c
  a 0.1778656 0.2016632 0.1976285
  b 0.8221344 0.7983368 0.8023715

, , X5 = c

   X4
X6          a         b         c
  a 0.6635514 0.6890130 0.7188020
  b 0.3364486 0.3109870 0.2811980
> # 3) Find the distribution of the node given the parent values.
> condprob = cpt[, X4 = incomplete.observation$X4, X5 = incomplete.observation$X5]
> condprob
       a        b 
0.718802 0.281198
> # 4) Find the value with the highest probability.
> names(which.max(condprob))
[1] "a"

However, this approach is clearly suboptimal for missing values in nodes that have no parents.

> # 1) Identify the missing values in the observation, encoded as NA.
> incomplete.observation = incomplete.data[11, ]
> # 2) Extract the local distribution of the node with missing values from the BN.
> prob = coef(bn[[which(is.na(incomplete.observation))]])
> prob
     a      b      c 
0.3336 0.3340 0.3324
> # 3) Find the value with the highest probability.
> names(which.max(prob))
[1] "b"

The values of the other nodes are not taken into consideration at all. Conditioning on the values of all the nodes we observe for that observation makes better use of the available information. This can be done exactly or by Monte Carlo, with different computational trade-offs as the number of observations increases and as the BN becomes larger.

> imputed.data = impute(bn, data = incomplete.data, method = "exact")
> imputed.data[c(11, 21, 31, 41, 51, 61), ]
   X1 X2 X3 X4 X5 X6
11  c  a  a  b  c  a
21  c  a  a  c  b  b
31  a  a  a  a  b  b
41  c  c  a  c  a  b
51  b  a  a  b  b  b
61  c  c  a  c  c  a
> imputed.data = impute(bn, data = incomplete.data, method = "bayes-lw")
> imputed.data[c(11, 21, 31, 41, 51, 61), ]
   X1 X2 X3 X4 X5 X6
11  c  a  a  b  c  a
21  c  a  a  c  b  b
31  a  a  a  a  b  b
41  c  c  a  c  a  b
51  b  a  a  b  b  b
61  c  c  a  c  c  a

More in detail (using the gRain package to make explicit the steps followed by bnlearn):

> # 1) Identify the missing and observed values in the observation.
> incomplete.observation = incomplete.data[11, ]
> observed.values = incomplete.observation[, which(!is.na(incomplete.observation))]
> # 2) Compute the distribution of the node with the missing values conditional on
> #    all observed values.
> junction.tree = as.grain(bn)
> with.evidence = setEvidence(junction.tree, nodes = names(observed.values),
+                             states = sapply(observed.values, as.character))
> condprob = querygrain(with.evidence, node = "X1")$X1
> condprob
X1
        a         b         c 
0.3330230 0.3305074 0.3364696
> # 3) Find the value with the highest probability.
> names(which.max(condprob))
[1] "c"

The result we obtain with method = "exact" ("c") is different from the one we obtained earlier with method = "parents" ("b"). This highlights an important limitation of single imputation: small fluctuations in the probabilities can change the imputed values. Multiple imputation would instead draw several values and thus produce a variety of imputed values.
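
For comparison, here is a minimal sketch of multiple imputation for the missing X1 in observation 11: we draw weighted particles from Prob D M given D O with likelihood weighting and then resample a handful of completions from them. The use of cpdist() for this purpose and the choice of 5 imputations are our own.

> particles = cpdist(bn, nodes = "X1", method = "lw",
+               evidence = as.list(sapply(observed.values, as.character)))
> w = attr(particles, "weights")
> # resample 5 values of X1: each one is a separate imputation of the missing
> # value, and their spread reflects the imputation uncertainty (the result
> # changes from run to run).
> sample(particles$X1, size = 5, replace = TRUE, prob = w)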

Prediction

We can make similar considerations for prediction. Unlike in imputation, three types of variables are involved: the target variables T we would like to predict, the observed values D O we would like to predict from, and the missing values D M we should average over (since we cannot condition on them, for obvious reasons). Therefore, following the same notation we used to describe structure learning, we either find the value with the highest probability

MAP

for a discrete variable, or the expectation

expectation

for a continuous variable. We can also predict just from the parents of the target node but, as we have seen in the case of imputation, this approach can be suboptimal or even unsuitable (for root nodes).

Computing predictions with either exact inference or Monte Carlo performs similarly in typical scenarios: we are only estimating the value of a single variable, so it is a low-dimensional problem. Imputation is not necessarily a low-dimensional problem: its dimensionality depends on the number of missing values in individual observations.

> pred.ex = predict(bn, data = incomplete.data, node = "X5", method = "exact")
> pred.lw = predict(bn, data = incomplete.data, node = "X5", method = "bayes-lw")
> table(exact = pred.ex, lw = pred.lw)
     lw
exact  a  b  c
    a 28  1  8
    b  0 30  0
    c  3  0 10
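
Reading off the table, the two methods agree on 68 of the 80 predictions:

> (28 + 30 + 10) / (28 + 1 + 8 + 30 + 3 + 10)
[1] 0.85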

More in detail (using the gRain package to make explicit the steps followed by bnlearn):

> # 0) Consider an observation with multiple missing values.
> incomplete.data[1, c("X1", "X3", "X5")] = NA
> # 1) Identify the target, missing and observed variables in the observation.
> target = "X6"
> observed.values = incomplete.observation[, which(!is.na(incomplete.data[1, ]))]
> observed.values = observed.values[, names(observed.values) != target]
> missing.values = setdiff(colnames(data), c(target, names(observed.values)))
> # 2) Compute the distribution of the target conditional on all observed values,
> #    marginalising the variables we do not observe.
> junction.tree = as.grain(bn)
> with.evidence = setEvidence(junction.tree, nodes = names(observed.values),
+                             states = sapply(observed.values, as.character))
> condprob = querygrain(with.evidence, node = target)[[target]]
> condprob
X6
        a         b 
0.5067387 0.4932613
> # 3) Find the value with the highest probability.
> names(which.max(condprob))
[1] "a"

The code above is equivalent to the following, more explicit code.

> jointprob = querygrain(with.evidence, node = c(target, missing.values),
+                        type = "joint")
> margin.table(jointprob, target)
X6
        a         b 
0.5067387 0.4932613

  1. Scutari M and Denis J-B (2021). Bayesian Networks with Examples in R. CRC Press, 2nd edition. link ↩︎

  2. Rubin DB (1976). “Inference and Missing Data.” Biometrika, 63:581–592. link ↩︎

  3. Little RJA and Rubin DB (2019). Statistical Analysis with Missing Data. Wiley, 3rd edition. link ↩︎

  4. van Buuren S (2018). Flexible Imputation of Missing Data. CRC Press, 2nd edition. link ↩︎

  5. Mohan K and Pearl J (2021). “Graphical Models for Processing Missing Data.” Journal of the American Statistical Association, 116(534):1023–1037. link ↩︎

  6. Scutari M (2020). “Bayesian Network Models for Incomplete and Dynamic Data.” Statistica Neerlandica, 74(3):397–419. link ↩︎

  7. Bodewes T and Scutari M (2021). “Learning Bayesian Networks from Incomplete Data with the Node-Averaged Likelihood.” International Journal of Approximate Reasoning, 138:145–160. link ↩︎

  8. Strobl E, Visweswaran S and Spirtes PL (2018). “Fast Causal Inference With Non-Random Missingness by Test-Wise Deletion.” International Journal of Data Science and Analytics, 6:47–62. link ↩︎

Last updated on Tue Sep 12 09:47:40 2023 with bnlearn 4.9 and R version 4.2.2 (2022-10-31).