A quick introduction
In this lecture we will introduce how to model data containing missing values with Bayesian networks. Missing values can arise in different ways, which were originally codified by Rubin. Bayesian networks give graphical form to Rubin's work using a directed graph where the variables and the missingness mechanisms are represented as nodes and relationships as arcs. The two key points of the lecture will be:
- Reviewing Rubin's classification of missing data and its graphical representation following the work of Mohan and Pearl.
- Adapting parameter learning and structure learning to learn Bayesian networks from incomplete data using techniques like the Expectation-Maximisation algorithm.
Bayesian networks
Definitions
Bayesian networks (BNs)1 are probabilistic graphical models defined by:
- a network structure, a directed acyclic graph (DAG)
in which each node corresponds to a random variable
;
- a global probability distribution over
(with parameters
), which can be factorised into smaller local probability distributions according to the arcs present in the graph.
The main role of the network structure is to express the conditional independence relationships among the variables in the model through graphical separation: nodes that are not connected by an arc are independent conditional on some other nodes. In other words,
Nodes that are connected by an arc cannot be graphically separated, and therefore they depend directly on each other in probability.
The most important consequence of this property is the factorisation of the global distribution of
:
Each local distribution has its own parameter set
;
and
is much smaller than
because many parameter are fixed by the fact that the variables they belong to are independent. This makes it possible
to work with a high-dimensional distribution
through a set of much smaller and simpler distributions
whose dimensions depends only on the respective number of parents
.
The first component of a BN is a DAG. Graphically, it looks like this: all arcs have directions, and no cycles are allowed.
> string = "[X1][X2][X3|X1:X2][X4|X3][X5|X3][X6|X4:X5]" > sample.dag = model2network(string) > graphviz.plot(sample.dag, shape = "rectangle")
And the implication is that:
The second component of a BN is the probability distribution
.
The choice should be such that the BN:
- can be learned efficiently from data;
- is flexible (it can encode a reasonable variety of phenomena);
- is easy to query to perform inference.
The three most common choices in the literature and in practical applications (by far) are:
- Discrete BNs, in which
and the
are multinomial. The parameters
of the local distributions are the conditional probabilities
> dag = model2network("[X1][X2][X3|X2:X1]") > cptX1 = matrix(c(0.53, 0.47), ncol = 2, dimnames = list(NULL, c("a", "b"))) > cptX2 = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("c", "d"))) > cptX3 = c(0.15, 0.85, 0.4, 0.6, 0.75, 0.25, 0.80, 0.20) > dim(cptX3) = c(2, 2, 2) > dimnames(cptX3) = list("X3" = c("e", "f"), "X1" = c("a", "b"), "X2" = c("c", "d")) > bn = custom.fit(dag, list("X1" = cptX1, "X2" = cptX2, "X3" = cptX3)) > bn
Bayesian network parameters Parameters of node X1 (multinomial distribution) Conditional probability table: a b 0.53 0.47 Parameters of node X2 (multinomial distribution) Conditional probability table: c d 0.34 0.66 Parameters of node X3 (multinomial distribution) Conditional probability table: , , X2 = c X1 X3 a b e 0.15 0.40 f 0.85 0.60 , , X2 = d X1 X3 a b e 0.75 0.80 f 0.25 0.20
- Gaussian BNs (GBNs), in which
follows a multivariate normal distribution and the local distributions
are defined by the linear regression models for each
against its parents
:
> bn = custom.fit(dag, list( + "X1" = list(coef = c("(Intercept)" = 2.4), sd = sqrt(0.8)), + "X2" = list(coef = c("(Intercept)" = 1.8), sd = sqrt(0.6)), + "X3" = list(coef = c("(Intercept)" = 2.1, "X1" = 1.2, "X2" = -0.4), sd = sqrt(0.9)) + )) > bn
Bayesian network parameters Parameters of node X1 (Gaussian distribution) Conditional density: X1 Coefficients: (Intercept) 2.4 Standard deviation of the residuals: 0.8944272 Parameters of node X2 (Gaussian distribution) Conditional density: X2 Coefficients: (Intercept) 1.8 Standard deviation of the residuals: 0.7745967 Parameters of node X3 (Gaussian distribution) Conditional density: X3 | X1 + X2 Coefficients: (Intercept) X1 X2 2.1 1.2 -0.4 Standard deviation of the residuals: 0.9486833
- Conditional linear Gaussian BNs (CLGBNs or CGBNs), in which
is a mixture of multivariate normal distributions and the
are either multinomial, univariate normal or mixtures of normals.
- Discrete
are only allowed to have discrete parents (denoted
), are assumed to follow a multinomial distribution parameterised with conditional probability tables.
- Continuous
are allowed to have both discrete and continuous parents (denoted
,
), and their local distributions are
which can be written as a mixture of linear regressions
against the continuous parents with one component for each configuration
of the discrete parents. If
has no discrete parents, the mixture reverts to a single linear regression.
> bn = custom.fit(dag, list( + "X1" = list(coef = c("(Intercept)" = 3.1), sd = sqrt(0.9)), + "X2" = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("a", "b"))), + "X3" = list(coef = matrix(c(1:4), ncol = 2), sd = sqrt(c(0.4, 0.3))) + )) > bn
Bayesian network parameters Parameters of node X1 (Gaussian distribution) Conditional density: X1 Coefficients: (Intercept) 3.1 Standard deviation of the residuals: 0.9486833 Parameters of node X2 (multinomial distribution) Conditional probability table: a b 0.34 0.66 Parameters of node X3 (conditional Gaussian distribution) Conditional density: X3 | X1 + X2 Coefficients: 0 1 (Intercept) 1 3 X1 2 4 Standard deviation of the residuals: 0 1 0.6324555 0.5477226 Discrete parents' configurations: X2 0 a 1 b
- Discrete
Learning
Model selection and model estimation are collectively known as learning, in the case of BNs. They are usually performed as a two-step process:
- structure learning, learning the structure of the DAG from the data;
- parameter learning, learning the local distributions implied by the DAG learned in the previous step.
This workflow is Bayesian: given a data set
and a BN
,
we have
Structure learning can be done in practice as
where we use Bayes' theorem to move from
to
because the latter is much easier to define. Assuming that the marginal likelihood
decomposes along local distributions,
which can be computed efficiently. This approach is called score-based structure learning.
Once we have learned
in structure learning, we know the parents of each node and therefore we can identify the local distributions
.
This makes it possible to estimate their parameter sets
independently from each other. Assuming that
is sparse, each
.
involves only few variables and thus estimating its parameters is computationally simple.
In practice, this means that we first use some structure learning algorithm to learn the DAG
from the data,
> dag = hc(data, score = "bde")
using the marginal likelihood above (which is available in closed form for discrete BNs, GBNs and CLGBNs) or some other goodness-of-fit score like BIC.
> hc.dag = hc(data, score = "bic") > graphviz.plot(hc.dag, shape = "rectangle")
We then take the hc.dag
together with the data
and use it for parameter learning with some
estimator (maximum likelihood, Bayesian posterior or something else).
> fitted = bn.fit(hc.dag, data, method = "mle") > fitted
Bayesian network parameters Parameters of node X1 (multinomial distribution) Conditional probability table: a b c 0.3336 0.3340 0.3324 Parameters of node X2 (multinomial distribution) Conditional probability table: X1 X2 a b c a 0.85611511 0.44491018 0.11492178 b 0.02517986 0.22095808 0.09446450 c 0.11870504 0.33413174 0.79061372 Parameters of node X3 (multinomial distribution) Conditional probability table: a b c 0.7434 0.2048 0.0518 Parameters of node X4 (multinomial distribution) Conditional probability table: , , X3 = a X1 X4 a b c a 0.80081301 0.09251810 0.10530547 b 0.09024390 0.80209171 0.11173633 c 0.10894309 0.10539019 0.78295820 , , X3 = b X1 X4 a b c a 0.18079096 0.88304094 0.24695122 b 0.13276836 0.07017544 0.49390244 c 0.68644068 0.04678363 0.25914634 , , X3 = c X1 X4 a b c a 0.42857143 0.34117647 0.13333333 b 0.20238095 0.38823529 0.44444444 c 0.36904762 0.27058824 0.42222222 Parameters of node X5 (multinomial distribution) Conditional probability table: , , X6 = a X2 X5 a b c a 0.80524979 0.20588235 0.11937378 b 0.09737511 0.17973856 0.11448141 c 0.09737511 0.61437908 0.76614481 , , X6 = b X2 X5 a b c a 0.40050804 0.31679389 0.23759542 b 0.49026249 0.36641221 0.50667939 c 0.10922947 0.31679389 0.25572519 Parameters of node X6 (multinomial distribution) Conditional probability table: a b 0.5018 0.4982
For structure learning, we can consider using conditional independence tests instead of maximising a goodness-of-fit score. BNs are defined such that graphical separation between nodes in the DAG implies (conditional or marginal) independence in the probability distributions of the corresponding variables. But this means that we can use suitable statistical tests to assess (conditional or marginal) independence from data and insert (or not) arcs in the DAG as appropriate. This approach is called constraint-based structure learning.
> pc.dag = pc.stable(data, test = "x2") > graphviz.plot(pc.dag, shape = "rectangle")
We will discuss in a minute the reason why the graph learned by the PC algorithm (or that would be learned by any other constraint-based algorithm, for that matter) is only partially directed.
Inference
Inference on BNs usually consists of conditional probability (CP) or maximum a posteriori (MAP) queries. The general idea of a CP query is:
- we have some evidence, that is, we know the values of some variables and we fix the nodes accordingly; and
- we want to look into the probability of some event involving (a subset of) the other variables conditional on the evidence we have.
Say we start from a discrete BN with the DAG used above.
Graphically a CP query looks like this (evidence node in red, event of interest in blue):
On the other hand, the goal of a MAP query is to find the combination of values for (a subset of) the variables in the network that has the highest probability given some evidence. If the evidence is a partially-observed new individual, then performing a MAP query amounts to a classic prediction task. Graphically, a MAP query looks like this (evidence nodes in red, variables of interest in blue):
This is more computationally challenging than it sounds because MAP queries do not decompose along local distributions.
As an example, performing a CP query using the gRain package with a BN from bnlearn involves the following steps:
> library(gRain) > # 1) Export the BN from bnlearn to gRain, which transforms it into a junction tree. > junction.tree = as.grain(sample.bn) > # 2) Introduce the evidence in the junction tree, modifying the distribution of X2. > with.evidence = setEvidence(junction.tree, nodes = "X2", states = "b") > # 3) Query about the distribution of the node containing the event of interest. > querygrain(with.evidence, node = "X6")
$X6 X6 a b 0.4993782 0.5006218
A MAP query goes further and finds the highest probability in the joint distribution of the variables of interest (which is different from the maxima found in the respective marginal distribution, in general). The values of the variables associated with that probabilities are the answer to the MAP query.
> # 2) Introduce the evidence in on X1, X4 and X6 the junction tree. > with.evidence = setEvidence(junction.tree, nodes = c("X1", "X4", "X6"), + states = c("a", "c", "b")) > # 3) Query about the distribution of the nodes of interest X2 and X5 > pdist = querygrain(with.evidence, nodes = c("X2", "X5"), type = "joint") > # 4) Find the configuration of values with the maximum probability. > which.max = which(pdist == max(pdist), arr.ind = TRUE) > which.max
X2 X5 a 1 2
> rownames(pdist)[which.max[, 1]]
[1] "a"
> colnames(pdist)[which.max[, 2]]
[1] "b"
Both types of queries can be computed by:
- Applying exact inference algorithms to the BN, changing its data structure into a form (the “junction tree”) that allows for the efficient computation of marginal, joint and conditional probabilities.
- Generating random samples form the BN and using them with Monte Carlo algorithms to implement approximate inference.
Exact inference is more computationally demanding on BNs with many nodes and arcs than approximate inference, but approximate inference is affected by stochastic noise from the random samples generation process.
Equivalence classes
BNs use directed acyclic graphs to represent the probabilistic relationships between the data. By default, that is all they can do: they are not inherently causal models even though their arcs have a direction that could be interpreted as a cause-effect relationship. BNs are probabilistic graphical models that represent probabilistic dependencies in graphical form: probabilistic dependence is a symmetric relationship because
which in turn depends on Bayes' theorem,
As a result, we can represent the same probability distribution with different DAGs. Such DAGs are said to belong to the same equivalence class because they are indistinguishable without the use of external information. Consider the following example:
> dag1 = model2network("[X1][X2|X1][X3|X1][X4|X1]") > dag2 = model2network("[X1|X4][X2|X1][X3|X1][X4]") > par(mfrow = c(1, 2)) > graphviz.compare(dag1, dag2, diff = "none", shape = "rectangle")
We can apply Bayes' theorem and show that the decompositions into local probabilities in dag1
and
dag2
are in fact equivalent:
In other words, the directions of some arcs are not identifiable even though we can tell that the nodes are connected by an arc. The graphical representation of this idea is the completed partially-directed acyclic graph (CPDAG) in which only the arcs whose directions can be uniquely identified are directed.
> par(mfrow = c(1, 2)) > graphviz.plot(cpdag(dag1), shape = "rectangle") > graphviz.plot(cpdag(dag2), shape = "rectangle")
Formally, CPDAGs have two key properties:
- The CPDAG is defined by the skeleton of the DAG (the underlying undirected graph) and by the
v-structures
(
, where
and
are not connected by an arc). The arcs belonging to the v-structures and those that would introduce cycles/other v-structures in the CPDAG are the only directed arcs; all other arcs in the CPDAG have no direction.
- Each CPDAG is the unique representation of an equivalence class. In other words, DAGs that can be transformed into the same CPDAG belong to the same equivalence class.
Furthermore, DAGs in the same equivalence class are score equivalent: we cannot select one member of the equivalence class over another in structure learning (without the use of external information) because goodness-of-fit scores will assign the same value to both of them. After all, all DAGs in the same equivalence class represent the same probability distribution and the goodness-of-fit scores are a function of that probability distribution. For instance:
> score(dag1, data[, nodes(dag1)], type = "bic")
[1] -17074.39
> score(dag2, data[, nodes(dag2)], type = "bic")
[1] -17074.39
This suggests that we should always compare the structure of BNs through their CPDAGs to be able to assess whether they are effectively different or not.
> cpdag1 = cpdag(dag1) > cpdag2 = cpdag(dag2) > unlist(compare(cpdag1, cpdag2))
tp fp fn 3 0 0
> all.equal(cpdag1, cpdag1)
[1] TRUE
Sometimes we need the inverse operation: we have a CPDAG and we need to find out one of the DAGs that are part of the equivalence class that the CPDAG represents. This operation is called finding a consistent extension of the CPDAG.
> cextend(cpdag1)
Random/Generated Bayesian network model: [X1][X2|X1][X3|X1][X4|X1] nodes: 4 arcs: 3 undirected arcs: 0 directed arcs: 3 average markov blanket size: 1.50 average neighbourhood size: 1.50 average branching factor: 0.75 generation algorithm: Empty
Causal interpretation of BNs
What do we need to do to learn BNs as causal network models, that is, a graphical model in which arcs represent direct cause-effect relationship? Firstly, we need to make some assumptions:
- Causal Markov assumption: Each variable
is conditionally independent of its non-effects, both direct and indirect, given its direct causes. This represents the causal equivalent of the decomposition into local distributions.
- Faithfulness: There must exist a DAG which is faithful to the probability distribution of
, so that the only dependencies are those arising from d-separation in the DAG.
- Causal sufficiency: There must be no latent variables (unobserved variables influencing the variables in the network) acting as confounding factors because they are parents to at least two observed variables. Such variables may induce spurious correlations between the observed variables, thus introducing bias in the causal network.
The last assumption is really a corollary of the first two: the presence of unobserved variables violates the faithfulness assumption (because the network structure cannot include them) and possibly the causal Markov property (if an arc is wrongly added between the observed variables due to the influence of the latent one). As an example, consider the following BN from which generate a sample of 100 observations.
> causal.bn = custom.fit(model2network("[X1][X2|X1][X3|X1]"), + list("X1" = list(coef = c("(Intercept)" = 0), sd = 1), + "X2" = list(coef = c("(Intercept)" = 0, "X1" = 3), sd = 0.5), + "X3" = list(coef = c("(Intercept)" = 0, "X1" = 2), sd = 0.5)) + ) > graphviz.plot(causal.bn, shape = "rectangle")
> complete.data = rbn(causal.bn, 100)
In this network, we can observe the following graphical separations:
> dsep(causal.bn, "X2", "X3")
[1] FALSE
> dsep(causal.bn, "X2", "X3", "X1")
[1] TRUE
We can successfully to re-learn the DAG of causal.bn
from the complete.data
because all
variables are observed. However, if we omit
,
we fail to re-learn the DAG of
causal.bn
because
(now a latent variable) is a parent of both
and
(observed variables). Violating the causal sufficiency assumption results in learning a spurious arc
that is not present in
causal.bn
.
> relearn = hc(complete.data) > modelstring(relearn)
[1] "[X1][X2|X1][X3|X1]"
> dsep(relearn, "X2", "X3")
[1] FALSE
> dsep(relearn, "X2", "X3", "X1")
[1] TRUE
> relearn = hc(complete.data[, c("X2", "X3")]) > modelstring(relearn)
[1] "[X2][X3|X2]"
> dsep(relearn, "X2", "X3")
[1] FALSE
Even if these assumptions are satisfied, we may not be able to completely learn the BN as a causal network: we are using probabilistic tools (goodness-of-fit scores or conditional independence tests) that can only identify equivalence classes of BNs. In order to identify the remaining arc directions, we need external information that typically comes from:
- Longitudinal (panel) data: causality can only follow the arrow of time. Any arc between a node associated with a given time point and a node associated with a later time point can only go from the former to the latter.
- Experimental (randomised) data: data that are collected following an experimental design (as opposed to observational data that are just recorded are they arise) are such that arcs can only point outwards from the randomised variables.
- Domain experts: information about arc directions may be either collected by interviewing experts in the domain the data come from, or by consulting the literature to find relevant results from previous investigations.
Missing data
Rubin2, Little and Rubin3 have defined the key properties of missing data. We start from:
- the complete data of all samples, which we do not observe;
- the missingness mechanism (also known as the pattern of missingness) which determines which variables we actually observe for each sample;
- the incomplete data, in which the values of some variables are replaced with
NA
for some samples to indicate that they are missing. This is what we actually observe.
In other words, the complete data are reduced to the incomplete data by the missingness mechanism. The statistical analysis of incomplete data aims to recover the probabilistic model underlying the complete data from the incomplete data by taking into account the missingness mechanism. Not taking into account the missingness mechanism would result in a biased model.
What types of missingness mechanisms are possible? Rubin defined three:
- Missing Completely at Random (MCAR): the probability of the value of a variable being missing is the same for all samples because the causes of the missing data are unrelated to the data. In other words, the missingness mechanism does not depend on the variables in the data.
- Missing at Random (MAR): the probability of being missing is the same only within groups defined by the observed data. In other words, the missingness mechanism depends on the variables that are completely observed in the data. MCAR implies MAR.
- Missing Not at Random (MNAR): the probability of being missing varies for reasons that are unknown to us. In other words, the missingness mechanism depends on variables we do not observe completely or at all. Any missingness mechanism that is not MCAR or MAR is MNAR.
Formally, a typical notation is
for the observed data;
for the missing data;
for the complete data;
is a zero-one matrix with the same dimensions as
: each cell is equal to zero if the corresponding cell in the data is a missing value, or it is equal to one if the corresponding cell in the data is observed.
If the data contain no missing values, then
and we can model
through
in a straightforward way. If the data are MCAR, then
if the data are MAR, then
if the data are MNAR, then
Consider the following numerical example
(from van Buuren4) involving two variables
and
that jointly follow a bivariate normal distribution.
> library(MASS) > set.seed(80122) > n = 300 > Y = mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
Say that only
contains missing values and that the missingness mechanism is defined by
If
then
and the missingness mechanism does not depend on
and
at all. If
, then
and the missingness mechanism depends on
which is completely observed. This allows for modelling the missingness mechanism from the data we observe. Finally, if
then
and the missingness mechanism depends on
which is not completely observed.
> missingness.mechanism = function(Y, gamma) { + gamma[1] + exp(Y[, 1]) / (1 + exp(Y[, 1])) * gamma[2] + + exp(Y[, 2]) / (1 + exp(Y[, 2])) * gamma[3] + }#MISSINGNESS.MECHANISM > > R.MCAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0.5, 0, 0))) > R.MAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 1, 0))) > R.MNAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 0, 1)))
If we try to estimate the joint distribution of
and
under MCAR, that is, when we assume that the missingness mechanism is MCAR, we obtain means that are close to zero for
both
and
.
Their correlation is close to 0.5 as well.
> Y.MCAR = Y > Y.MCAR[as.logical(R.MCAR), 2] = NA > colMeans(Y.MCAR, na.rm = TRUE)
[1] 0.04083247 -0.03533149
> cor(Y.MCAR, use = "pairwise.complete")
[,1] [,2] [1,] 1.0000000 0.4249437 [2,] 0.4249437 1.0000000
The same is true under MAR.
> Y.MAR = Y > Y.MAR[as.logical(R.MAR), 2] = NA > colMeans(Y.MAR, na.rm = TRUE)
[1] 0.04083247 0.26461012
> cor(Y.MAR, use = "pairwise.complete")
[,1] [,2] [1,] 1.0000000 0.5011879 [2,] 0.5011879 1.0000000
However, under MNAR the estimates of the mean of
and of the correlation between
and
are both heavily biased. Intuitively, the bias is unavoidable since it is impossible to correctly model a MNAR
missingness mechanism that depends on information we do not have (the values missing in
).
> Y.MNAR = Y > Y.MNAR[as.logical(R.MNAR), 2] = NA > colMeans(Y.MNAR, na.rm = TRUE)
[1] 0.04083247 0.43020494
> cor(Y.MNAR, use = "pairwise.complete")
[,1] [,2] [1,] 1.0000000 0.2672833 [2,] 0.2672833 1.0000000
The distributions of
and
become more and more different as we move from MCAR to MAR to MNAR.
This behaviour is related to ignorability, which formalises the idea that the distribution of the data is the same whether we observe or not:
As a result, we can model the distribution of
from
(equivalently, the distribution
)
and use this distribution to work around the missing data. For instance, we can impute them, that is, we can replace
missing values with point estimates computed from the observed values.
MCAR is ignorable because
since
is independent from both
and
and
.
MAR is also ignorable: the definition of MAR is