A quick introduction
In this lecture we will introduce how to model data containing missing values with Bayesian networks. Missing values can arise in different ways, which were originally codified by Rubin. Bayesian networks give graphical form to Rubin's work using a directed graph where the variables and the missingness mechanisms are represented as nodes and relationships as arcs. The two key points of the lecture will be:
- Reviewing Rubin's classification of missing data and its graphical representation following the work of Mohan and Pearl.
- Adapting parameter learning and structure learning to learn Bayesian networks from incomplete data using techniques like the Expectation-Maximisation algorithm.
Bayesian networks
Definitions
Bayesian networks (BNs)1 are probabilistic graphical models defined by:
- a network structure, a directed acyclic graph (DAG) \(\mathcal{G}\), in which each node corresponds to a random variable \(X_i\);
- a global probability distribution over \(\mathbf{X} = \{X_1, \ldots, X_N\}\) (with parameters \(\Theta\)), which can be factorised into smaller local probability distributions according to the arcs present in the graph.
The main role of the network structure is to express the conditional independence relationships among the variables in the model through graphical separation: nodes that are not connected by an arc are independent conditional on some other nodes. In other words,
\[ X_i \perp\!\!\!\perp_{\mathcal{G}} X_j \mid \mathbf{X}_S \Longrightarrow X_i \perp\!\!\!\perp_{P} X_j \mid \mathbf{X}_S. \]
Nodes that are connected by an arc cannot be graphically separated, and therefore they depend directly on each other in probability.
The most important consequence of this property is the factorisation of the global distribution of \(\mathbf{X}\):
\[ P(\mathbf{X} \mid \Theta) = \prod_{i=1}^{N} P(X_i \mid \Pi_{X_i}; \Theta_{X_i}), \qquad \text{where } \Pi_{X_i} \text{ denotes the parents of } X_i \text{ in } \mathcal{G}. \]
Each local distribution \(X_i \mid \Pi_{X_i}\) has its own parameter set \(\Theta_{X_i}\), and \(\bigcup_i \Theta_{X_i}\) is much smaller than \(\Theta\) because many parameters are fixed by the fact that the variables they belong to are independent. This makes it possible to work with a high-dimensional distribution \(P(\mathbf{X})\) through a set of much smaller and simpler distributions \(P(X_i \mid \Pi_{X_i})\) whose dimensions depend only on the respective numbers of parents.
The first component of a BN is a DAG. Graphically, it looks like this: all arcs have directions, and no cycles are allowed.
> string = "[X1][X2][X3|X1:X2][X4|X3][X5|X3][X6|X4:X5]"
> sample.dag = model2network(string)
> graphviz.plot(sample.dag, shape = "rectangle")
And the implication is that:
\[ P(X_1, \ldots, X_6) = P(X_1)\, P(X_2)\, P(X_3 \mid X_1, X_2)\, P(X_4 \mid X_3)\, P(X_5 \mid X_3)\, P(X_6 \mid X_4, X_5). \]
The second component of a BN is the probability distribution \(P(\mathbf{X})\). The choice should be such that the BN:
- can be learned efficiently from data;
- is flexible (it can encode a reasonable variety of phenomena);
- is easy to query to perform inference.
The three most common choices in the literature and in practical applications (by far) are:
- Discrete BNs, in which \(\mathbf{X}\) and the \(X_i \mid \Pi_{X_i}\) are multinomial. The parameters of the local distributions are the conditional probabilities
\[ \pi_{ik \mid j} = P(X_i = k \mid \Pi_{X_i} = j), \]
which are usually organised in conditional probability tables.
> dag = model2network("[X1][X2][X3|X2:X1]")
> cptX1 = matrix(c(0.53, 0.47), ncol = 2, dimnames = list(NULL, c("a", "b")))
> cptX2 = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("c", "d")))
> cptX3 = c(0.15, 0.85, 0.4, 0.6, 0.75, 0.25, 0.80, 0.20)
> dim(cptX3) = c(2, 2, 2)
> dimnames(cptX3) = list("X3" = c("e", "f"), "X1" = c("a", "b"), "X2" = c("c", "d"))
> bn = custom.fit(dag, list("X1" = cptX1, "X2" = cptX2, "X3" = cptX3))
> bn
  Bayesian network parameters

  Parameters of node X1 (multinomial distribution)

Conditional probability table:
    a    b 
 0.53 0.47 

  Parameters of node X2 (multinomial distribution)

Conditional probability table:
    c    d 
 0.34 0.66 

  Parameters of node X3 (multinomial distribution)

Conditional probability table:

, , X2 = c

   X1
X3     a    b
  e 0.15 0.40
  f 0.85 0.60

, , X2 = d

   X1
X3     a    b
  e 0.75 0.80
  f 0.25 0.20
- Gaussian BNs (GBNs), in which \(\mathbf{X}\) follows a multivariate normal distribution and the local distributions \(X_i \mid \Pi_{X_i}\) are defined by the linear regression models of each \(X_i\) against its parents \(\Pi_{X_i}\):
\[ X_i = \mu_{X_i} + \Pi_{X_i} \boldsymbol{\beta}_{X_i} + \varepsilon_{X_i}, \qquad \varepsilon_{X_i} \sim N(0, \sigma^2_{X_i}). \]
> bn = custom.fit(dag, list(
+   "X1" = list(coef = c("(Intercept)" = 2.4), sd = sqrt(0.8)),
+   "X2" = list(coef = c("(Intercept)" = 1.8), sd = sqrt(0.6)),
+   "X3" = list(coef = c("(Intercept)" = 2.1, "X1" = 1.2, "X2" = -0.4), sd = sqrt(0.9))
+ ))
> bn
  Bayesian network parameters

  Parameters of node X1 (Gaussian distribution)

Conditional density: X1
Coefficients:
(Intercept)  
        2.4  
Standard deviation of the residuals: 0.8944272 

  Parameters of node X2 (Gaussian distribution)

Conditional density: X2
Coefficients:
(Intercept)  
        1.8  
Standard deviation of the residuals: 0.7745967 

  Parameters of node X3 (Gaussian distribution)

Conditional density: X3 | X1 + X2
Coefficients:
(Intercept)           X1           X2  
        2.1          1.2         -0.4  
Standard deviation of the residuals: 0.9486833 
- Conditional linear Gaussian BNs (CLGBNs or CGBNs), in which \(\mathbf{X}\) is a mixture of multivariate normal distributions and the \(X_i \mid \Pi_{X_i}\) are either multinomial, univariate normal or mixtures of normals.
- Discrete \(X_i\) are only allowed to have discrete parents (denoted \(\Delta_{X_i}\)), and are assumed to follow a multinomial distribution parameterised with conditional probability tables.
- Continuous \(X_i\) are allowed to have both discrete and continuous parents (denoted \(\Delta_{X_i}\) and \(\Gamma_{X_i}\)), and their local distributions are
\[ X_i \mid \Pi_{X_i} \sim N\!\left(\mu_{X_i, \delta} + \Gamma_{X_i} \boldsymbol{\beta}_{X_i, \delta},\; \sigma^2_{X_i, \delta}\right), \]
which can be written as a mixture of linear regressions against the continuous parents with one component for each configuration \(\delta\) of the discrete parents \(\Delta_{X_i}\). If \(X_i\) has no discrete parents, the mixture reverts to a single linear regression.
> bn = custom.fit(dag, list(
+   "X1" = list(coef = c("(Intercept)" = 3.1), sd = sqrt(0.9)),
+   "X2" = matrix(c(0.34, 0.66), ncol = 2, dimnames = list(NULL, c("a", "b"))),
+   "X3" = list(coef = matrix(c(1:4), ncol = 2), sd = sqrt(c(0.4, 0.3)))
+ ))
> bn
  Bayesian network parameters

  Parameters of node X1 (Gaussian distribution)

Conditional density: X1
Coefficients:
(Intercept)  
        3.1  
Standard deviation of the residuals: 0.9486833 

  Parameters of node X2 (multinomial distribution)

Conditional probability table:
    a    b 
 0.34 0.66 

  Parameters of node X3 (conditional Gaussian distribution)

Conditional density: X3 | X1 + X2
Coefficients:
             0  1
(Intercept)  1  3
X1           2  4
Standard deviation of the residuals:
        0          1  
0.6324555  0.5477226  
Discrete parents' configurations:
   X2
0   a
1   b
Learning
In the case of BNs, model selection and model estimation are collectively known as learning. They are usually performed as a two-step process:
- structure learning, learning the structure of the DAG from the data;
- parameter learning, learning the local distributions implied by the DAG learned in the previous step.
This workflow is Bayesian: given a data set \(\mathcal{D}\) and a BN \(\mathcal{B} = (\mathcal{G}, \Theta)\), we have
\[ \underbrace{P(\mathcal{B} \mid \mathcal{D})}_{\text{learning}} = \underbrace{P(\mathcal{G} \mid \mathcal{D})}_{\text{structure learning}} \cdot \underbrace{P(\Theta \mid \mathcal{G}, \mathcal{D})}_{\text{parameter learning}}. \]
Structure learning can be done in practice as
\[ P(\mathcal{G} \mid \mathcal{D}) \propto P(\mathcal{G})\, P(\mathcal{D} \mid \mathcal{G}) = P(\mathcal{G}) \int P(\mathcal{D} \mid \mathcal{G}, \Theta)\, P(\Theta \mid \mathcal{G})\, d\Theta, \]
where we use Bayes' theorem to move from \(P(\mathcal{G} \mid \mathcal{D})\) to \(P(\mathcal{D} \mid \mathcal{G})\) because the latter is much easier to define. Assuming that the marginal likelihood \(P(\mathcal{D} \mid \mathcal{G})\) decomposes along the local distributions,
\[ P(\mathcal{D} \mid \mathcal{G}) = \prod_{i=1}^{N} \left[ \int P(X_i \mid \Pi_{X_i}, \Theta_{X_i})\, P(\Theta_{X_i})\, d\Theta_{X_i} \right], \]
which can be computed efficiently. This approach is called score-based structure learning.
Once we have learned \(\mathcal{G}\) in structure learning, we know the parents of each node and therefore we can identify the local distributions \(X_i \mid \Pi_{X_i}\). This makes it possible to estimate their parameter sets \(\Theta_{X_i}\) independently of each other. Assuming that \(\mathcal{G}\) is sparse, each \(X_i \mid \Pi_{X_i}\) involves only a few variables and thus estimating its parameters is computationally simple.
In practice, this means that we first use some structure learning algorithm to learn the DAG \(\mathcal{G}\) from the data,
> dag = hc(data, score = "bde")
using the marginal likelihood above (which is available in closed form for discrete BNs, GBNs and CLGBNs) or some other goodness-of-fit score like BIC.
> hc.dag = hc(data, score = "bic")
> graphviz.plot(hc.dag, shape = "rectangle")
We then take hc.dag together with the data and use them for parameter learning with some estimator (maximum likelihood, Bayesian posterior or something else).
> fitted = bn.fit(hc.dag, data, method = "mle")
> fitted
Bayesian network parameters Parameters of node X1 (multinomial distribution) Conditional probability table: a b c 0.3336 0.3340 0.3324 Parameters of node X2 (multinomial distribution) Conditional probability table: X1 X2 a b c a 0.85611511 0.44491018 0.11492178 b 0.02517986 0.22095808 0.09446450 c 0.11870504 0.33413174 0.79061372 Parameters of node X3 (multinomial distribution) Conditional probability table: a b c 0.7434 0.2048 0.0518 Parameters of node X4 (multinomial distribution) Conditional probability table: , , X3 = a X1 X4 a b c a 0.80081301 0.09251810 0.10530547 b 0.09024390 0.80209171 0.11173633 c 0.10894309 0.10539019 0.78295820 , , X3 = b X1 X4 a b c a 0.18079096 0.88304094 0.24695122 b 0.13276836 0.07017544 0.49390244 c 0.68644068 0.04678363 0.25914634 , , X3 = c X1 X4 a b c a 0.42857143 0.34117647 0.13333333 b 0.20238095 0.38823529 0.44444444 c 0.36904762 0.27058824 0.42222222 Parameters of node X5 (multinomial distribution) Conditional probability table: , , X6 = a X2 X5 a b c a 0.80524979 0.20588235 0.11937378 b 0.09737511 0.17973856 0.11448141 c 0.09737511 0.61437908 0.76614481 , , X6 = b X2 X5 a b c a 0.40050804 0.31679389 0.23759542 b 0.49026249 0.36641221 0.50667939 c 0.10922947 0.31679389 0.25572519 Parameters of node X6 (multinomial distribution) Conditional probability table: a b 0.5018 0.4982
For structure learning, we can consider using conditional independence tests instead of maximising a goodness-of-fit score. BNs are defined such that graphical separation between nodes in the DAG implies (conditional or marginal) independence in the probability distributions of the corresponding variables. But this means that we can use suitable statistical tests to assess (conditional or marginal) independence from data and insert (or not) arcs in the DAG as appropriate. This approach is called constraint-based structure learning.
> pc.dag = pc.stable(data, test = "x2")
> graphviz.plot(pc.dag, shape = "rectangle")
We will discuss in a minute the reason why the graph learned by the PC algorithm (or that would be learned by any other constraint-based algorithm, for that matter) is only partially directed.
Inference
Inference on BNs usually consists of conditional probability (CP) or maximum a posteriori (MAP) queries. The general idea of a CP query is:
- we have some evidence, that is, we know the values of some variables and we fix the nodes accordingly; and
- we want to look into the probability of some event involving (a subset of) the other variables conditional on the evidence we have.
Say we start from a discrete BN with the DAG used above.
Graphically a CP query looks like this (evidence node in red, event of interest in blue):
On the other hand, the goal of a MAP query is to find the combination of values for (a subset of) the variables in the network that has the highest probability given some evidence. If the evidence is a partially-observed new individual, then performing a MAP query amounts to a classic prediction task. Graphically, a MAP query looks like this (evidence nodes in red, variables of interest in blue):
This is more computationally challenging than it sounds because MAP queries do not decompose along local distributions.
As an example, performing a CP query using the gRain package with a BN from bnlearn involves the following steps:
> library(gRain)
> # 1) Export the BN from bnlearn to gRain, which transforms it into a junction tree.
> junction.tree = as.grain(sample.bn)
> # 2) Introduce the evidence in the junction tree, modifying the distribution of X2.
> with.evidence = setEvidence(junction.tree, nodes = "X2", states = "b")
> # 3) Query about the distribution of the node containing the event of interest.
> querygrain(with.evidence, node = "X6")
$X6
X6
        a         b 
0.4993782 0.5006218 
A MAP query goes further and finds the highest probability in the joint distribution of the variables of interest (which is different, in general, from the maxima found in the respective marginal distributions). The values of the variables associated with that probability are the answer to the MAP query.
> # 2) Introduce the evidence on X1, X4 and X6 in the junction tree.
> with.evidence = setEvidence(junction.tree, nodes = c("X1", "X4", "X6"),
+                             states = c("a", "c", "b"))
> # 3) Query about the distribution of the nodes of interest X2 and X5.
> pdist = querygrain(with.evidence, nodes = c("X2", "X5"), type = "joint")
> # 4) Find the configuration of values with the maximum probability.
> which.max = which(pdist == max(pdist), arr.ind = TRUE)
> which.max
  X2 X5
a  1  2
> rownames(pdist)[which.max[, 1]]
[1] "a"
> colnames(pdist)[which.max[, 2]]
[1] "b"
Both types of queries can be computed by:
- Applying exact inference algorithms to the BN, changing its data structure into a form (the “junction tree”) that allows for the efficient computation of marginal, joint and conditional probabilities.
- Generating random samples from the BN and using them with Monte Carlo algorithms to implement approximate inference.
Exact inference is more computationally demanding on BNs with many nodes and arcs than approximate inference, but approximate inference is affected by stochastic noise from the random samples generation process.
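For example, the CP query above, which we computed exactly with gRain, can also be approximated with bnlearn's cpquery() function. The sketch below is ours and the sample size is an arbitrary choice; the Monte Carlo estimates should be close to (but not exactly equal to) the exact value of about 0.50.
> # Approximate P(X6 = "b" | X2 = "b") by logic sampling; n is the number of
> # random samples to generate.
> cpquery(sample.bn, event = (X6 == "b"), evidence = (X2 == "b"),
+         method = "ls", n = 10^6)
> # The same query by likelihood weighting, which takes the evidence as a named
> # list instead of an expression.
> cpquery(sample.bn, event = (X6 == "b"), evidence = list(X2 = "b"),
+         method = "lw")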
Equivalence classes
BNs use directed acyclic graphs to represent the probabilistic relationships between the data. By default, that is all they can do: they are not inherently causal models even though their arcs have a direction that could be interpreted as a cause-effect relationship. BNs are probabilistic graphical models that represent probabilistic dependencies in graphical form: probabilistic dependence is a symmetric relationship because
\[ X_i \not\perp\!\!\!\perp X_j \Longleftrightarrow X_j \not\perp\!\!\!\perp X_i, \]
which in turn depends on Bayes' theorem,
\[ P(X_i \mid X_j) = \frac{P(X_j \mid X_i)\, P(X_i)}{P(X_j)}. \]
As a result, we can represent the same probability distribution with different DAGs. Such DAGs are said to belong to the same equivalence class because they are indistinguishable without the use of external information. Consider the following example:
> dag1 = model2network("[X1][X2|X1][X3|X1][X4|X1]")
> dag2 = model2network("[X1|X4][X2|X1][X3|X1][X4]")
> par(mfrow = c(1, 2))
> graphviz.compare(dag1, dag2, diff = "none", shape = "rectangle")
We can apply Bayes' theorem and show that the decompositions into local probabilities of dag1 and dag2 are in fact equivalent:
\[ \underbrace{P(X_1)\, P(X_4 \mid X_1)}_{\texttt{dag1}}\, P(X_2 \mid X_1)\, P(X_3 \mid X_1) = P(X_1, X_4)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1) = \underbrace{P(X_4)\, P(X_1 \mid X_4)}_{\texttt{dag2}}\, P(X_2 \mid X_1)\, P(X_3 \mid X_1). \]
In other words, the directions of some arcs are not identifiable even though we can tell that the nodes are connected by an arc. The graphical representation of this idea is the completed partially-directed acyclic graph (CPDAG) in which only the arcs whose directions can be uniquely identified are directed.
> par(mfrow = c(1, 2))
> graphviz.plot(cpdag(dag1), shape = "rectangle")
> graphviz.plot(cpdag(dag2), shape = "rectangle")
Formally, CPDAGs have two key properties:
- The CPDAG is defined by the skeleton of the DAG (the underlying undirected graph) and by the v-structures (patterns \(X_i \rightarrow X_k \leftarrow X_j\) in which \(X_i\) and \(X_j\) are not connected by an arc). The arcs belonging to the v-structures, and those whose direction is compelled because reversing them would introduce cycles or additional v-structures, are the only directed arcs in the CPDAG; all other arcs have no direction.
- Each CPDAG is the unique representation of an equivalence class. In other words, DAGs that can be transformed into the same CPDAG belong to the same equivalence class.
Furthermore, DAGs in the same equivalence class are score equivalent: we cannot select one member of the equivalence class over another in structure learning (without the use of external information) because goodness-of-fit scores will assign the same value to both of them. After all, all DAGs in the same equivalence class represent the same probability distribution and the goodness-of-fit scores are a function of that probability distribution. For instance:
> score(dag1, data[, nodes(dag1)], type = "bic")
[1] -17074.39
> score(dag2, data[, nodes(dag2)], type = "bic")
[1] -17074.39
This suggests that we should always compare the structure of BNs through their CPDAGs to be able to assess whether they are effectively different or not.
> cpdag1 = cpdag(dag1)
> cpdag2 = cpdag(dag2)
> unlist(compare(cpdag1, cpdag2))
tp fp fn 
 3  0  0 
> all.equal(cpdag1, cpdag2)
[1] TRUE
Sometimes we need the inverse operation: we have a CPDAG and we need to find out one of the DAGs that are part of the equivalence class that the CPDAG represents. This operation is called finding a consistent extension of the CPDAG.
> cextend(cpdag1)
  Random/Generated Bayesian network

  model:
   [X1][X2|X1][X3|X1][X4|X1] 
  nodes:                                 4 
  arcs:                                  3 
    undirected arcs:                     0 
    directed arcs:                       3 
  average markov blanket size:           1.50 
  average neighbourhood size:            1.50 
  average branching factor:              0.75 

  generation algorithm:                  Empty 
Causal interpretation of BNs
What do we need to do to learn BNs as causal network models, that is, a graphical model in which arcs represent direct cause-effect relationship? Firstly, we need to make some assumptions:
- Causal Markov assumption: Each variable \(X_i\) is conditionally independent of its non-effects, both direct and indirect, given its direct causes. This represents the causal equivalent of the decomposition into local distributions.
- Faithfulness: There must exist a DAG which is faithful to the probability distribution of \(\mathbf{X}\), so that the only dependencies are those arising from d-separation in the DAG.
- Causal sufficiency: There must be no latent variables (unobserved variables influencing the variables in the network) acting as confounding factors because they are parents to at least two observed variables. Such variables may induce spurious correlations between the observed variables, thus introducing bias in the causal network.
The last assumption is really a corollary of the first two: the presence of unobserved variables violates the faithfulness assumption (because the network structure cannot include them) and possibly the causal Markov property (if an arc is wrongly added between the observed variables due to the influence of the latent one). As an example, consider the following BN, from which we generate a sample of 100 observations.
> causal.bn = custom.fit(model2network("[X1][X2|X1][X3|X1]"),
+              list("X1" = list(coef = c("(Intercept)" = 0), sd = 1),
+                   "X2" = list(coef = c("(Intercept)" = 0, "X1" = 3), sd = 0.5),
+                   "X3" = list(coef = c("(Intercept)" = 0, "X1" = 2), sd = 0.5))
+            )
> graphviz.plot(causal.bn, shape = "rectangle")
> complete.data = rbn(causal.bn, 100)
In this network, we can observe the following graphical separations:
> dsep(causal.bn, "X2", "X3")
[1] FALSE
> dsep(causal.bn, "X2", "X3", "X1")
[1] TRUE
We can successfully re-learn the DAG of causal.bn from complete.data because all the variables are observed. However, if we omit X1, we fail to re-learn the DAG of causal.bn because X1 (now a latent variable) is a parent of both X2 and X3 (the observed variables). Violating the causal sufficiency assumption results in learning a spurious arc between X2 and X3 that is not present in causal.bn.
> relearn = hc(complete.data)
> modelstring(relearn)
[1] "[X1][X2|X1][X3|X1]"
> dsep(relearn, "X2", "X3")
[1] FALSE
> dsep(relearn, "X2", "X3", "X1")
[1] TRUE
> relearn = hc(complete.data[, c("X2", "X3")])
> modelstring(relearn)
[1] "[X2][X3|X2]"
> dsep(relearn, "X2", "X3")
[1] FALSE
Even if these assumptions are satisfied, we may not be able to completely learn the BN as a causal network: we are using probabilistic tools (goodness-of-fit scores or conditional independence tests) that can only identify equivalence classes of BNs. In order to identify the remaining arc directions, we need external information that typically comes from:
- Longitudinal (panel) data: causality can only follow the arrow of time. Any arc between a node associated with a given time point and a node associated with a later time point can only go from the former to the latter.
- Experimental (randomised) data: data that are collected following an experimental design (as opposed to observational data that are just recorded as they arise) are such that arcs can only point outwards from the randomised variables.
- Domain experts: information about arc directions may be either collected by interviewing experts in the domain the data come from, or by consulting the literature to find relevant results from previous investigations.
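When such external information is available, one common way of using it in practice is to pass whitelists and blacklists to a structure learning algorithm. The sketch below uses hypothetical arcs that are not part of any example in this lecture.
> # Hypothetical expert knowledge: X1 -> X2 must be present, X5 -> X2 is impossible.
> wl = data.frame(from = "X1", to = "X2")
> bl = data.frame(from = "X5", to = "X2")
> expert.dag = hc(data, score = "bic", whitelist = wl, blacklist = bl)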
Missing data
Rubin2, Little and Rubin3 have defined the key properties of missing data. We start from:
- the complete data of all samples, which we do not observe;
- the missingness mechanism (also known as the pattern of missingness) which determines which variables we actually observe for each sample;
- the incomplete data, in which the values of some variables are replaced with
NA
for some samples to indicate that they are missing. This is what we actually observe.
In other words, the complete data are reduced to the incomplete data by the missingness mechanism. The statistical analysis of incomplete data aims to recover the probabilistic model underlying the complete data from the incomplete data by taking into account the missingness mechanism. Not taking into account the missingness mechanism would result in a biased model.
What types of missingness mechanisms are possible? Rubin defined three:
- Missing Completely at Random (MCAR): the probability of the value of a variable being missing is the same for all samples because the causes of the missing data are unrelated to the data. In other words, the missingness mechanism does not depend on the variables in the data.
- Missing at Random (MAR): the probability of being missing is the same only within groups defined by the observed data. In other words, the missingness mechanism depends on the variables that are completely observed in the data. MCAR implies MAR.
- Missing Not at Random (MNAR): the probability of being missing varies for reasons that are unknown to us. In other words, the missingness mechanism depends on variables we do not observe completely or at all. Any missingness mechanism that is not MCAR or MAR is MNAR.
Formally, a typical notation is \(\mathbf{Y}_{obs}\) for the observed data; \(\mathbf{Y}_{mis}\) for the missing data; \(\mathbf{Y} = (\mathbf{Y}_{obs}, \mathbf{Y}_{mis})\) for the complete data; and \(\mathbf{R}\) for a zero-one matrix with the same dimensions as \(\mathbf{Y}\): each cell is equal to zero if the corresponding cell in the data is a missing value, or it is equal to one if the corresponding cell in the data is observed.
If the data contain no missing values, then \(\mathbf{Y} = \mathbf{Y}_{obs}\) and we can model \(\mathbf{Y}\) through its probability distribution \(P(\mathbf{Y})\) in a straightforward way. Denoting with \(\psi\) the parameters of the missingness mechanism, if the data are MCAR, then
\[ P(\mathbf{R} \mid \mathbf{Y}_{obs}, \mathbf{Y}_{mis}, \psi) = P(\mathbf{R} \mid \psi); \]
if the data are MAR, then
\[ P(\mathbf{R} \mid \mathbf{Y}_{obs}, \mathbf{Y}_{mis}, \psi) = P(\mathbf{R} \mid \mathbf{Y}_{obs}, \psi); \]
if the data are MNAR, then
\[ P(\mathbf{R} \mid \mathbf{Y}_{obs}, \mathbf{Y}_{mis}, \psi) \]
does not simplify and depends on \(\mathbf{Y}_{mis}\) as well.
Consider the following numerical example (from van Buuren4) involving two variables \(Y_1\) and \(Y_2\) that jointly follow a bivariate normal distribution.
> library(MASS)
> set.seed(80122)
> n = 300
> Y = mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
Say that only \(Y_2\) contains missing values and that the missingness mechanism is defined by
\[ \gamma_1 + \gamma_2 \frac{e^{Y_1}}{1 + e^{Y_1}} + \gamma_3 \frac{e^{Y_2}}{1 + e^{Y_2}}. \]
If \(\boldsymbol{\gamma} = (0.5, 0, 0)\), then the probability of being missing is constant and the missingness mechanism does not depend on \(Y_1\) and \(Y_2\) at all. If \(\boldsymbol{\gamma} = (0, 1, 0)\), then the probability of being missing is a function of \(Y_1\) and the missingness mechanism depends on \(Y_1\), which is completely observed. This allows for modelling the missingness mechanism from the data we observe. Finally, if \(\boldsymbol{\gamma} = (0, 0, 1)\), then the probability of being missing is a function of \(Y_2\) and the missingness mechanism depends on \(Y_2\), which is not completely observed.
> missingness.mechanism = function(Y, gamma) {
+   gamma[1] + exp(Y[, 1]) / (1 + exp(Y[, 1])) * gamma[2] +
+     exp(Y[, 2]) / (1 + exp(Y[, 2])) * gamma[3]
+ }#MISSINGNESS.MECHANISM
> 
> R.MCAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0.5, 0, 0)))
> R.MAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 1, 0)))
> R.MNAR = 1 - rbinom(n, size = 1, prob = missingness.mechanism(Y, gamma = c(0, 0, 1)))
If we try to estimate the joint distribution of \(Y_1\) and \(Y_2\) under MCAR, that is, when we assume that the missingness mechanism is MCAR, we obtain means that are close to zero for both \(Y_1\) and \(Y_2\). Their correlation is close to 0.5 as well.
> Y.MCAR = Y
> Y.MCAR[as.logical(R.MCAR), 2] = NA
> colMeans(Y.MCAR, na.rm = TRUE)
[1] 0.04083247 -0.03533149
> cor(Y.MCAR, use = "pairwise.complete")
          [,1]      [,2]
[1,] 1.0000000 0.4249437
[2,] 0.4249437 1.0000000
Under MAR, the correlation is still close to 0.5, but the estimated mean of \(Y_2\) is already visibly biased because we are ignoring the missingness mechanism.
> Y.MAR = Y
> Y.MAR[as.logical(R.MAR), 2] = NA
> colMeans(Y.MAR, na.rm = TRUE)
[1] 0.04083247 0.26461012
> cor(Y.MAR, use = "pairwise.complete")
          [,1]      [,2]
[1,] 1.0000000 0.5011879
[2,] 0.5011879 1.0000000
However, under MNAR the estimates of the mean of \(Y_2\) and of the correlation between \(Y_1\) and \(Y_2\) are both heavily biased. Intuitively, the bias is unavoidable since it is impossible to correctly model a MNAR missingness mechanism that depends on information we do not have (the values missing in \(Y_2\)).
> Y.MNAR = Y
> Y.MNAR[as.logical(R.MNAR), 2] = NA
> colMeans(Y.MNAR, na.rm = TRUE)
[1] 0.04083247 0.43020494
> cor(Y.MNAR, use = "pairwise.complete")
          [,1]      [,2]
[1,] 1.0000000 0.2672833
[2,] 0.2672833 1.0000000
The distributions of the observed and of the missing values of \(Y_2\) become more and more different as we move from MCAR to MAR to MNAR.
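We can check this directly by comparing the mean of \(Y_2\) among the samples in which it is observed and among those in which it is missing, reusing the objects created above (the exact numbers depend on the simulated sample, but the gap between the two groups should widen from MCAR to MAR to MNAR).
> # Mean of the true Y2 values, split by whether Y2 is observed or missing.
> observed.vs.missing = function(Y.incomplete)
+   tapply(Y[, 2], is.na(Y.incomplete[, 2]), mean)
> observed.vs.missing(Y.MCAR)
> observed.vs.missing(Y.MAR)
> observed.vs.missing(Y.MNAR)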
This behaviour is related to ignorability, which formalises the idea that the distribution of the data is the same whether we observe it or not:
\[ P(\mathbf{Y}_{mis} \mid \mathbf{Y}_{obs}, \mathbf{R}) = P(\mathbf{Y}_{mis} \mid \mathbf{Y}_{obs}). \]
As a result, we can model the distribution of the complete data \(\mathbf{Y}\) from the observed data \(\mathbf{Y}_{obs}\) (equivalently, the distribution \(P(\mathbf{Y}_{mis} \mid \mathbf{Y}_{obs})\)) and use this distribution to work around the missing data. For instance, we can impute them, that is, we can replace missing values with point estimates computed from the observed values.
MCAR is ignorable because
\[ P(\mathbf{Y}_{mis} \mid \mathbf{Y}_{obs}, \mathbf{R}) = P(\mathbf{Y}_{mis} \mid \mathbf{Y}_{obs}), \]
since \(\mathbf{R}\) is independent from both \(\mathbf{Y}_{obs}\) and \(\mathbf{Y}_{mis}\) and \(P(\mathbf{R} \mid \mathbf{Y}) = P(\mathbf{R})\).
MAR is also ignorable: the definition of MAR is
\[ P(\mathbf{R} \mid \mathbf{Y}_{obs}, \mathbf{Y}_{mis}) = P(\mathbf{R} \mid \mathbf{Y}_{obs}) \]
and then
\[ P(\mathbf{Y}_{mis} \mid \mathbf{Y}_{obs}, \mathbf{R}) = P(\mathbf{Y}_{mis} \mid \mathbf{Y}_{obs}). \]
The missing data \(\mathbf{Y}_{mis}\) do not depend on \(\mathbf{R}\) given the observed data, so their distribution (and by extension that of \(\mathbf{Y}\)) is the same for \(\mathbf{R} = 0\) and \(\mathbf{R} = 1\).
MNAR is not ignorable: we need to make assumptions on \(P(\mathbf{R} \mid \mathbf{Y}_{obs}, \mathbf{Y}_{mis})\) and incorporate it in our models to work around the missing data.
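As a minimal sketch of what ignorability buys us under MAR, we can model \(Y_2\) given \(Y_1\) from the complete cases of Y.MAR and use that model for the missing values: the resulting estimate of the mean of \(Y_2\) should move back towards its true value of zero, unlike the naive estimate of about 0.26 we obtained earlier.
> observed = !is.na(Y.MAR[, 2])
> # Model Y2 | Y1 on the complete cases: under MAR, this conditional distribution
> # is the same for the observed and the missing samples.
> fit = lm(Y.MAR[observed, 2] ~ Y.MAR[observed, 1])
> # Replace the missing Y2 values with their conditional expectations given Y1.
> imputed = Y.MAR[, 2]
> imputed[!observed] = coef(fit)[1] + coef(fit)[2] * Y.MAR[!observed, 1]
> mean(imputed)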
Bayesian networks for missing data
Working with a causal network in the presence of unknown latent variables is difficult because of the issues with learning the structure and the parameters from data. We can partially address this issue by including the latent variables in the network: we define a new DAG over \(\mathbf{V}_o \cup \mathbf{U}\), where \(\mathbf{V}_o\) is the set of the fully observed variables and \(\mathbf{U}\) is the set of the completely unobserved variables (that is, the latent variables). Now that the latent variables are included in the network, we can model them and prevent confounding: we add them as columns full of NA to the data we use for learning.
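A minimal sketch of this idea, reusing the causal sufficiency example from earlier: we drop the latent X1 and add it back as a column of NAs. Note that actually estimating such a model also requires a learning method that handles missing values, such as the EM algorithms discussed below, and typically some care with starting values; this snippet only prepares the data.
> # X2 and X3 are observed; X1 is treated as latent and included as a column of NAs.
> with.latent = complete.data[, c("X2", "X3")]
> with.latent$X1 = NA_real_
> summary(with.latent)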
When we have variables that are only partially observed, missing values are present along with observed values and we need to augment the causal network again to model them correctly. In order to do that, Mohan and Pearl define a missingness graph5 to extend the ideas of Rubin. A missingness graph is a causal graph whose vertices are partitioned into five disjoint subsets:
- the fully observed nodes \(\mathbf{V}_o\);
- the completely unobserved (latent) nodes \(\mathbf{U}\);
- the partially observed variables \(\mathbf{V}_m\);
- the proxy variables \(\mathbf{V}^*\) of the partially observed variables, which contain the values that are actually observed;
- the missingness indicators \(\mathbf{R}\).
The relationship between \(\mathbf{V}_m\), \(\mathbf{V}^*\) and \(\mathbf{R}\) is as follows:
- The variables \(\mathbf{V}_m\) are the variables for which we would have liked to observe values for all observations.
- The variables \(\mathbf{V}^*\) are the variables which we actually (partially) observe.
- The variables \(\mathbf{R}\) are binary indicator variables that control whether the value of the corresponding partially observed variable is actually observed for each observation in the data, or a missing value NA is observed instead.
Formally, this means that for each partially observed variable \(V_m\) with proxy \(V^*\) and indicator \(R\),
\[ V^* = \begin{cases} V_m & \text{if } R = 1 \text{ (observed)}, \\ \texttt{NA} & \text{if } R = 0 \text{ (missing)}. \end{cases} \]
Since missingness graphs are causal graphs, they can be queried graphically for independencies between variables in the observed-data distribution \(P(\mathbf{V}_o, \mathbf{V}^*, \mathbf{R})\) by using d-separation. The corresponding underlying distribution with no missing values is \(P(\mathbf{V}_o, \mathbf{V}_m)\).
MCAR, MAR and MNAR result in different independence statements which can be assessed through d-separation. Therefore, we can link them to the independency statements they imply in the missingness graph:
- MCAR implies \(\mathbf{R} \perp\!\!\!\perp (\mathbf{V}_o \cup \mathbf{V}_m)\): the missingness is random and independent from both the fully observed and the partially observed variables.
- MAR implies \(\mathbf{R} \perp\!\!\!\perp \mathbf{V}_m \mid \mathbf{V}_o\): the missingness is random and depends only on the fully observed variables.
- MNAR holds if the data are neither MCAR nor MAR.
The approaches for learning BNs as causal networks or as probabilistic networks from incomplete data have been reviewed here.6
Consider again the two-variable example we used to illustrate MCAR, MAR and MNAR. As a BN, it would have the following DAG since \(Y_1\) and \(Y_2\) are correlated. If the data are completely observed, that is the correct model for the data.
> graphviz.plot(model2network("[Y2][Y1|Y2]"), shape = "rectangle")
Assume now that the data are incomplete and MCAR. The variable \(Y_1\) is fully observed, so it belongs to \(\mathbf{V}_o\). The variable \(Y_2\) is only partially observed: in the missingness graph, we replace it with the nodes:
- S2, the mix of values and NAs that we actually observe for the samples;
- M2, the values we would observe for \(Y_2\) for the samples if there were no missing values; they are identical to the corresponding values in S2 for the samples whose values are observed;
- R2, the binary indicators that determine whether the value of \(Y_2\) is observed (or not) for each sample.
To obtain the correct model for MCAR, R2 should not have any parents. As for the arc \(Y_2 \rightarrow Y_1\) in the original DAG, we replace it with \(M_2 \rightarrow Y_1\) to make this directed graph a direct extension of the previous one.
> missingness.graph = model2network("[S2|R2:M2][R2][M2][Y1|M2]")
> graphviz.plot(missingness.graph, shape = "rectangle")
The independency statements of MCAR are satisfied:
> # This checks that O is d-separated from R.
> dsep(missingness.graph, "Y1", "R2")
[1] TRUE
> # This checks that M is d-separated from R.
> dsep(missingness.graph, "M2", "R2")
[1] TRUE
And since MCAR implies MAR, the independency statements of MAR are also satisfied:
> # This checks that M is d-separated from R given O.
> dsep(missingness.graph, "M2", "R2", "Y1")
[1] TRUE
In the MAR case, we had earlier that the missingness mechanism depended on \(Y_1\), which is a completely-observed variable: we should then add the arc \(Y_1 \rightarrow R_2\) to the missingness graph to represent it.
> missingness.graph = model2network("[S2|R2:M2][R2|Y1][M2][Y1|M2]")
> graphviz.plot(missingness.graph, shape = "rectangle")
The independency statements of MAR are satisfied:
> # This checks that M is d-separated from R given O.
> dsep(missingness.graph, "M2", "R2", "Y1")
[1] TRUE
However, the independency statements of MCAR are no longer satisfied.
> # This checks that O is d-separated from R.
> dsep(missingness.graph, "Y1", "R2")
[1] FALSE
> # This checks that M is d-separated from R.
> dsep(missingness.graph, "M2", "R2")
[1] FALSE
In the MNAR case, the missingness mechanism for \(Y_2\) depended on \(Y_2\) itself. We do not have a node for \(Y_2\) in the missingness graph, but we do have M2, which models the (unobserved) complete version of \(Y_2\): we can then add the arc \(M_2 \rightarrow R_2\) to represent this.
> missingness.graph = model2network("[S2|R2:M2][R2|M2][M2][Y1|M2]")
> graphviz.plot(missingness.graph, shape = "rectangle")
Since MNAR excludes MCAR and MAR, the independency statements of both missingness mechanisms are not satisfied.
> # MCAR: O is d-separated from R.
> dsep(missingness.graph, "Y1", "R2")
[1] FALSE
> # MCAR: M is d-separated from R.
> dsep(missingness.graph, "M2", "R2")
[1] FALSE
> # MAR: M is d-separated from R given O.
> dsep(missingness.graph, "M2", "R2", "Y1")
[1] FALSE
What is the role of the missingness graph in structure and parameter learning? In some cases, using it for parameter learning may make it possible to correctly estimate the parameters of the BN: this property is called recoverability, and it is discussed in detail in Mohan and Pearl. As for structure learning, feeding a partially-constructed missingness graph to a structure learning algorithm and letting it learn the rest from the data is a viable approach. However, as we will see in the following, many classical approaches to BN modelling just assume that the data are MCAR or MAR to avoid using the missingness graph altogether.
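As an informal illustration of the kind of statements a missingness graph encodes, we can go back to the bivariate example and check whether the missingness indicator of \(Y_2\) depends on the fully observed \(Y_1\): under MCAR it should not, while under MAR (as simulated above) it should. This quick check is ours and uses a plain t test rather than bnlearn.
> # Compare Y1 between the samples in which Y2 is observed and those in which it
> # is missing: a small p-value suggests that the indicator depends on Y1.
> t.test(Y.MCAR[, 1] ~ is.na(Y.MCAR[, 2]))$p.value
> t.test(Y.MAR[, 1] ~ is.na(Y.MAR[, 2]))$p.value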
Parameter learning
Firstly, we will assume that the DAG \(\mathcal{G}\) is given and that we only need to learn the parameters of the network from the incomplete data.
Using locally-complete data
The simplest possible way to implement parameter learning is to drop all the observations containing missing values and to use only complete observations.
> colnames(data) = c("X1", "X2", "X3", "X4", "X5", "X6")
> dag = model2network("[X1][X3][X6][X2|X1][X4|X1:X3][X5|X2:X6]")
> incomplete.data = data[1:80, ]
> for (i in seq(ncol(incomplete.data)))
+   incomplete.data[1:10 + 10 * i, i] = NA
> 
> complete.subset = incomplete.data[complete.cases(incomplete.data), ]
> bn.fit(dag, complete.subset, method = "mle")
Bayesian network parameters Parameters of node X1 (multinomial distribution) Conditional probability table: a b c 0.25 0.45 0.30 Parameters of node X2 (multinomial distribution) Conditional probability table: X1 X2 a b c a 1.0000000 0.4444444 0.1666667 b 0.0000000 0.3333333 0.1666667 c 0.0000000 0.2222222 0.6666667 Parameters of node X3 (multinomial distribution) Conditional probability table: a b c 0.60 0.35 0.05 Parameters of node X4 (multinomial distribution) Conditional probability table: , , X3 = a X1 X4 a b c a 1.0000000 0.0000000 0.0000000 b 0.0000000 0.8000000 0.0000000 c 0.0000000 0.2000000 1.0000000 , , X3 = b X1 X4 a b c a 0.0000000 1.0000000 0.0000000 b 0.0000000 0.0000000 0.6666667 c 1.0000000 0.0000000 0.3333333 , , X3 = c X1 X4 a b c a 1.0000000 b 0.0000000 c 0.0000000 Parameters of node X5 (multinomial distribution) Conditional probability table: , , X6 = a X2 X5 a b c a 0.7142857 0.0000000 0.0000000 b 0.2857143 0.0000000 0.2500000 c 0.0000000 1.0000000 0.7500000 , , X6 = b X2 X5 a b c a 0.0000000 0.0000000 0.0000000 b 1.0000000 1.0000000 0.5000000 c 0.0000000 0.0000000 0.5000000 Parameters of node X6 (multinomial distribution) Conditional probability table: a b 0.65 0.35
This is not an efficient use of the data because it reduces the sample size too much.
> nrow(incomplete.data)
[1] 80
> nrow(complete.subset)
[1] 20
As a result, the estimates of the parameters can be very imprecise.
> bn.fit(dag, complete.subset, method = "mle")$X5
Parameters of node X5 (multinomial distribution) Conditional probability table: , , X6 = a X2 X5 a b c a 0.7142857 0.0000000 0.0000000 b 0.2857143 0.0000000 0.2500000 c 0.0000000 1.0000000 0.7500000 , , X6 = b X2 X5 a b c a 0.0000000 0.0000000 0.0000000 b 1.0000000 1.0000000 0.5000000 c 0.0000000 0.0000000 0.5000000
> bn.fit(dag, data[1:80, ])$X5
Parameters of node X5 (multinomial distribution) Conditional probability table: , , X6 = a X2 X5 a b c a 0.66666667 0.00000000 0.05882353 b 0.23809524 0.28571429 0.11764706 c 0.09523810 0.71428571 0.82352941 , , X6 = b X2 X5 a b c a 0.38461538 0.00000000 0.33333333 b 0.53846154 0.50000000 0.44444444 c 0.07692308 0.50000000 0.22222222
A better approach along the same lines is to use locally-complete observations to estimate the parameters of each local distribution: after all, a local distribution only involves one node and its parents so we do not care whether the values of the remaining variables are missing or not.
> # bn.fit() uses locally-complete observations by default for maximum likelihood
> # and Bayesian estimators.
> bn.fit(dag, incomplete.data, method = "mle")$X5
Parameters of node X5 (multinomial distribution) Conditional probability table: , , X6 = a X2 X5 a b c a 0.5714286 0.0000000 0.0000000 b 0.2857143 0.5000000 0.1666667 c 0.1428571 0.5000000 0.8333333 , , X6 = b X2 X5 a b c a 0.1428571 0.0000000 0.3636364 b 0.7142857 1.0000000 0.3636364 c 0.1428571 0.0000000 0.2727273
The Expectation-Maximisation algorithm
In the context of parameter learning, the Expectation-Maximisation (EM) algorithm retains its classic structure:
- In the expectation step (E), we compute the expected values of the sufficient statistics (such as the counts in discrete BNs and the partial correlations in GBNs), using exact inference along the lines described above to make use of incomplete as well as complete samples.
- In the maximisation step (M), we take the sufficient statistics from the E-step and estimate the parameters of the BN, either using maximum likelihood or Bayesian posterior estimators.
We iterate over these two steps by using the parameter estimates to update the expected values of the sufficient statistics in the next E-step. Repeated iterations of these two steps will return the maximum likelihood or maximum a posteriori estimates for the parameters in the limit. Note that the EM algorithm is only valid for MCAR and MAR missing data.
> start = bn.fit(empty.graph(colnames(data)), incomplete.data)
> fitted = bn.fit(dag, incomplete.data, method = "hard-em", fit = "mle",
+                 impute = "exact", threshold = 1e-5, debug = TRUE,
+                 start = start)
* expectation-maximization iteration 1 .
  > the relative difference in log-likelihood is: 0.1993889 .
* expectation-maximization iteration 2 .
  > the relative difference in log-likelihood is: 0.04342052 .
* expectation-maximization iteration 3 .
  > the relative difference in log-likelihood is: 0.002834975 .
* expectation-maximization iteration 4 .
  > the relative difference in log-likelihood is: 0.001260253 .
* expectation-maximization iteration 5 .
  > the relative difference in log-likelihood is: 0 .
  @ the difference is smaller than the threshold, stopping.
> fitted
Bayesian network parameters Parameters of node X1 (multinomial distribution) Conditional probability table: a b c 0.2250 0.2875 0.4875 Parameters of node X2 (multinomial distribution) Conditional probability table: X1 X2 a b c a 0.94444444 0.60869565 0.05128205 b 0.05555556 0.21739130 0.05128205 c 0.00000000 0.17391304 0.89743590 Parameters of node X3 (multinomial distribution) Conditional probability table: a b c 0.6375 0.3000 0.0625 Parameters of node X4 (multinomial distribution) Conditional probability table: , , X3 = a X1 X4 a b c a 0.85714286 0.00000000 0.04166667 b 0.00000000 0.92307692 0.20833333 c 0.14285714 0.07692308 0.75000000 , , X3 = b X1 X4 a b c a 0.00000000 1.00000000 0.33333333 b 0.00000000 0.00000000 0.58333333 c 1.00000000 0.00000000 0.08333333 , , X3 = c X1 X4 a b c a 0.50000000 0.00000000 b 0.00000000 0.33333333 c 0.50000000 0.66666667 Parameters of node X5 (multinomial distribution) Conditional probability table: , , X6 = a X2 X5 a b c a 0.68181818 0.00000000 0.00000000 b 0.22727273 0.40000000 0.09523810 c 0.09090909 0.60000000 0.90476190 , , X6 = b X2 X5 a b c a 0.09090909 0.00000000 0.27777778 b 0.81818182 1.00000000 0.33333333 c 0.09090909 0.00000000 0.38888889 Parameters of node X6 (multinomial distribution) Conditional probability table: a b 0.6 0.4
How does EM work in practice? One way of implementing it is to impute the missing values in the E-step: this is the so-called hard EM.
> # 1) E-step: start from an empty network, so that each node encodes the marginal
> #    distribution of the corresponding variable, estimated from the observed
> #    values.
> current.bn = bn.fit(empty.graph(nodes(dag)), incomplete.data)
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> all(complete.cases(completed.data))
[1] TRUE
> # 2) Compute an initial estimate of the log-likelihood of the data.
> current.loglik = logLik(current.bn, completed.data)
> current.loglik
[1] -447.0964
> # 3) M-step: use the completed data to estimate the parameters of the BN.
> new.bn = bn.fit(dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik
[1] -387.1627
> # 4) Compute the relative improvement in the log-likelihood: is it large enough
> #    to make it worth to perform more iterations?
> (new.loglik - current.loglik) / abs(current.loglik)
[1] 0.1340509
> # 5) Reset the current state of the algorithm.
> current.bn = new.bn
> current.loglik = new.loglik
> # 6) E-step.
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> # 7) M-step.
> new.bn = bn.fit(dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik
[1] -337.8896
> # 8) Compute the relative improvement in the log-likelihood: is it large enough
> #    to make it worth to perform more iterations?
> (new.loglik - current.loglik) / abs(current.loglik)
[1] 0.1272672
> par(mfrow = c(1, 2))
> graphviz.chart(current.bn, type = "barprob", main = "Current BN")
> graphviz.chart(new.bn, type = "barprob", main = "New BN")
> # ... And so on until convergence.
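A minimal sketch of how the steps above can be wrapped into a loop that stops when the relative improvement in the log-likelihood falls below a threshold, mirroring in simplified form what bn.fit(method = "hard-em") does internally:
> threshold = 1e-5
> repeat {
+   completed.data = impute(current.bn, incomplete.data, method = "exact")  # E-step
+   new.bn = bn.fit(dag, completed.data)                                    # M-step
+   new.loglik = logLik(new.bn, completed.data)
+   improvement = (new.loglik - current.loglik) / abs(current.loglik)
+   current.bn = new.bn
+   current.loglik = new.loglik
+   if (improvement < threshold)
+     break
+ }#REPEAT
> current.loglik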
Structure learning
Learning the structure of a BN from incomplete data is, in general, computationally unfeasible because we need to perform a joint optimisation over the missing values and the parameters to score each candidate network. We can make this apparent by rewriting \(P(\mathcal{D} \mid \mathcal{G})\) as a function of \(\mathcal{D}_{obs}\) (the observed data) and \(\mathcal{D}_{mis}\) (the missing data):
\[ P(\mathcal{D} \mid \mathcal{G}) = \int P(\mathcal{D}_{obs}, \mathcal{D}_{mis} \mid \mathcal{G}, \Theta)\, P(\Theta \mid \mathcal{G})\, d\Theta = \int P(\mathcal{D}_{obs} \mid \mathcal{G}, \Theta)\, P(\mathcal{D}_{mis} \mid \mathcal{D}_{obs}, \mathcal{G}, \Theta)\, P(\Theta \mid \mathcal{G})\, d\Theta. \]
From this expression we can see that, in order to maximise \(P(\mathcal{D} \mid \mathcal{G})\), we should jointly maximise the probability of the observed data \(P(\mathcal{D}_{obs} \mid \mathcal{G}, \Theta)\) and the probability of the missing data given the observed data \(P(\mathcal{D}_{mis} \mid \mathcal{D}_{obs}, \mathcal{G}, \Theta)\), for each candidate \(\mathcal{G}\) and averaging over all possible \(\Theta\). This gives us the maximum a posteriori solution to structure learning.
A full Bayesian approach would require averaging over all the possible configurations of the missing data as well:
\[ P(\mathcal{G} \mid \mathcal{D}_{obs}) \propto P(\mathcal{G}) \int\!\!\int P(\mathcal{D}_{obs}, \mathcal{D}_{mis} \mid \mathcal{G}, \Theta)\, P(\Theta \mid \mathcal{G})\, d\Theta\, d\mathcal{D}_{mis}. \]
This implies one extra dimension of integration for each missing value in addition to one dimension for each parameter in \(\Theta\), and \(P(\mathcal{G} \mid \mathcal{D}_{obs})\) does not decompose along the local distributions in the general case.
Using locally-complete data
Earlier we saw how we can use locally-complete data to learn the parameters of each local distribution. The next logical step is to use the locally-complete data and the parameters we estimated from them to compute the goodness-of-fit scores we need for structure learning. The main issue with this approach is that local distributions (for the same node) with different parent sets will be learned from different subsets of the data, so:
- The respective log-likelihoods may be on different scales, because they may sum over different numbers of observations.
- The respective log-likelihoods may be estimated from non-nested subsets of the data, which in the worst case may not overlap at all.
The node-average log-likelihood (NAL)7 addresses both problems. If the data were complete, we could compute a penalised score such as BIC as
\[ \mathrm{BIC}(\mathcal{G}; \mathcal{D}) = \sum_{i=1}^{N} \left[ \sum_{j=1}^{n} \log P(x_{ij} \mid \pi_{ij}) - \frac{\log n}{2}\, |\Theta_{X_i}| \right], \]
where \(x_{ij}\) and \(\pi_{ij}\) are the values of \(X_i\) and of its parents \(\Pi_{X_i}\) in the \(j\)th observation, which is equivalent (for the purpose of comparing network structures) to computing
\[ \frac{1}{n}\,\mathrm{BIC}(\mathcal{G}; \mathcal{D}) = \sum_{i=1}^{N} \left[ \frac{1}{n} \sum_{j=1}^{n} \log P(x_{ij} \mid \pi_{ij}) - \frac{\log n}{2n}\, |\Theta_{X_i}| \right]. \]
The quantity
\[ \frac{1}{n} \sum_{j=1}^{n} \log P(x_{ij} \mid \pi_{ij}) \]
is the empirical estimate of the expected value of the log-likelihood of a single observation: its scale does not depend on the number of observations. Furthermore, we can seamlessly replace it with the average over the \(n_{X_i}\) locally-complete observations in the incomplete data:
\[ \overline{\ell}(X_i \mid \Pi_{X_i}) = \frac{1}{n_{X_i}} \sum_{j\,:\; x_{ij},\, \pi_{ij} \text{ observed}} \log P(x_{ij} \mid \pi_{ij}). \]
Tjebbe Bodewes and Marco Scutari proved that the NAL ensures both model identifiability (up to the equivalence class) and consistency in structure learning under MCAR as long as the penalty coefficient satisfies \(\lambda_n \to 0\) and \(\lambda_n \sqrt{n} \to \infty\) as \(n \to \infty\). These conditions hold for neither AIC nor BIC because their penalties are not large enough when transformed into per-observation penalty coefficients \(\lambda_n\).
In practice, this means we can compute the node-average log-likelihood as we would any other score function, regardless of whether the data are complete or not.
> score(dag, incomplete.data, type = "nal", by.node = TRUE)
        X1         X2         X3         X4         X5         X6 
-1.0561521 -0.6071982 -0.8415452 -0.5380200 -0.7824792 -0.6829081 
For instance, the NAL for node X2 (whose only parent in dag is X1) is computed as follows.
> # 1) Identify the parents of node X2.
> node = "X2"
> pars = parents(dag, node)
> # 2) Identify which observations are locally complete for X2 and its parents.
> locally.complete = complete.cases(incomplete.data[, c(node, pars)])
> # 3) Subset the data.
> locally.complete.data = incomplete.data[locally.complete, c(node, pars)]
> # 4) Compute the counts and then the conditional probabilities of X2 given the parents.
> counts = table(locally.complete.data)
> probs = prop.table(counts, margin = "X1")
> # 5) Compute the log-likelihood and divide it by the number of locally-complete observations.
> nal = sum(counts * log(probs), na.rm = TRUE) / nrow(locally.complete.data)
> nal
[1] -0.6071982
Computing the PNAL from the NAL is then just a matter of subtracting the penalty term (the coefficient \(\lambda_n\), here with the default value, multiplied by the number of free parameters in the local distribution).
> score(dag, incomplete.data, type = "pnal", by.node = TRUE)
        X1         X2         X3         X4         X5         X6 
-1.0575453 -0.6113779 -0.8429384 -0.5505589 -0.7908384 -0.6836047 
> lambda_n = (nrow(incomplete.data)^(-0.25)) / nnodes(dag)
> pnal = nal - lambda_n/nrow(incomplete.data) * (nrow(probs) - 1) * ncol(probs)
> pnal
[1] -0.6113779
In order to perform structure learning from incomplete data with PNAL, we just need to use score = "pnal" with a score-based learning algorithm like hill-climbing. How do the network structure learned with PNAL and that learned from the complete data (before we introduced the NAs) compare with the reference network structure?
> pnal.dag = hc(incomplete.data, score = "pnal", k = 1 / (nnodes(dag) * 0.2))
> complete.data.dag = hc(data[1:80, ], score = "bic")
> par(mfrow = c(1, 3))
> graphviz.compare(dag, complete.data.dag, pnal.dag, shape = "rectangle",
+                  main = c("Reference DAG", "Complete Data DAG", "Learned with PNAL"))
As we mentioned earlier, a better way of comparing network structures is through their CPDAGs.
> par(mfrow = c(1, 3))
> graphviz.compare(cpdag(dag), cpdag(complete.data.dag), cpdag(pnal.dag),
+     shape = "rectangle",
+     main = c("Reference DAG", "Complete Data DAG", "Learned with PNAL"))
While the three network structures in the first figure look very similar, the CPDAG of the network we learned with BIC is in fact very different from that of the reference network structure, while that of the network learned with PNAL is fairly close.
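The same comparison can be summarised numerically with the structural Hamming distance between the CPDAGs, as computed by shd() in bnlearn (smaller values mean closer structures); we would expect the network learned with PNAL to be the closer of the two to the reference network.
> shd(cpdag(complete.data.dag), cpdag(dag))
> shd(cpdag(pnal.dag), cpdag(dag))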
The Expectation-Maximisation algorithm
In the context of structure learning, the EM algorithm is called structural EM (SEM). It makes structure learning computationally feasible by searching for the best structure inside of EM, instead of embedding EM inside a structure learning algorithm. It consists of two steps like the classic EM:
- In the expectation step (E), we complete the data by computing the expected sufficient statistics using the current network structure.
- In the maximisation step (M), we find the structure that maximises the expected score function for the completed data.
Since the scoring in the M-step uses the completed data, structure learning can be implemented efficiently using standard algorithms for complete data. The original proposal by Nir Friedman used BIC and greedy search; he later extended SEM to a fully Bayesian approach based on posterior scores, and proved the convergence of the resulting algorithm under MCAR and MAR.
> start = bn.fit(empty.graph(colnames(data)), incomplete.data)
> learned.dag = structural.em(incomplete.data, impute = "exact", fit = "mle",
+                             start = start)
> learned.dag
  Bayesian network learned from Missing Data

  model:
   [X1][X3][X6][X2|X1][X4|X1][X5|X1:X6] 
  nodes:                                 6 
  arcs:                                  4 
    undirected arcs:                     0 
    directed arcs:                       4 
  average markov blanket size:           1.67 
  average neighbourhood size:            1.33 
  average branching factor:              0.67 

  learning algorithm:                    Structural EM 
  score-based method:                    Hill-Climbing 
  parameter learning method:             Maximum Likelihood (disc.) 
  imputation method:                     Exact Inference 
  penalization coefficient:              2.191013 
  tests used in the learning procedure:  140 
  optimized:                             TRUE 
Between the limited amount of data and the missing values, structural.em() is unable to learn the correct DAG.
> par(mfrow = c(1, 2))
> graphviz.compare(dag, learned.dag, main = c("correct DAG", "learned DAG"),
+                  shape = "rectangle")
But what is structural.em() doing in practice?
> # 1) E-step: start from an empty network, so that each node encodes the marginal
> #    distribution of the corresponding variable, estimated from the observed
> #    values.
> current.dag = empty.graph(colnames(data))
> current.bn = bn.fit(current.dag, incomplete.data)
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> # 2) Compute an initial estimate of the log-likelihood of the data.
> current.loglik = logLik(current.bn, completed.data)
> current.loglik
[1] -447.0964
> # 3) M-Step: learn the structure of the DAG from the completed data.
> new.dag = tabu(completed.data)
> # 4) Is the DAG we just learned different from the previous one?
> all.equal(current.dag, new.dag)
[1] "Different number of directed/undirected arcs"
> par(mfrow = c(1, 2))
> graphviz.compare(current.dag, new.dag, main = c("Current DAG", "New DAG"),
+                  shape = "rectangle")
> # 5) Learn the parameters as well and compute the relative improvement in the
> #    log-likelihood: is it large enough to make it worth to perform more
> #    iterations?
> new.bn = bn.fit(new.dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik
[1] -404.8411
> (new.loglik - current.loglik) / abs(current.loglik)
[1] 0.09451047
> # 6) Reset the current state of the algorithm.
> current.dag = new.dag
> current.bn = new.bn
> current.loglik = new.loglik
> # 7) E-step.
> completed.data = impute(current.bn, incomplete.data, method = "exact")
> # 8) M-Step.
> new.dag = tabu(completed.data)
> # 9) Is the DAG we just learned different from the previous one?
> all.equal(current.dag, new.dag)
[1] "Different arc sets"
> par(mfrow = c(1, 2))
> graphviz.compare(current.dag, new.dag, main = c("Current DAG", "New DAG"),
+                  shape = "rectangle")
> # 10) Did the log-likelihood improve?
> new.bn = bn.fit(new.dag, completed.data)
> new.loglik = logLik(new.bn, completed.data)
> new.loglik
[1] -379.1162
> (new.loglik - current.loglik) / abs(current.loglik)
[1] 0.06354313
> # ... And so on until convergence.
The PC algorithm
Using locally-complete data is a principled way of learning BNs using constraint-based algorithms, as shown by Eric Strobl, Shyam Visweswaran and Peter Spirtes.8 The trick is that we should perform each conditional independence test twice, as follows:
- Test whether \(X \perp\!\!\!\perp Y \mid \mathbf{Z}\) using locally-complete observations, and record the resulting p-value.
- Is that p-value smaller than the chosen threshold (typically 0.01)?
- If the answer is yes, then reject the null hypothesis of (conditional) independence.
- If the answer is no, test whether \(X \perp\!\!\!\perp Y \mid \mathbf{Z}\) again using globally-complete observations and record the resulting p-value.
- Is the minimum of the two p-values smaller than the chosen threshold?
- If the answer is yes, then reject the null hypothesis of (conditional) independence.
- If the answer is no, then accept the null hypothesis of (conditional) independence.
This double-testing ensures that we can apply algorithms like the PC to incomplete data that are MCAR or MAR. Interestingly, it works even if the data are MNAR! Furthermore, Strobl, Visweswaran and Spirtes show that even just performing a single test on the locally-complete data works very well in practice.
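The following helper function implements the double-testing rule above as a sketch: the function name, the default threshold and the choice of the X² test are ours, not bnlearn's.
> double.test = function(x, y, z, data, alpha = 0.01) {
+   # First test: locally-complete observations (the ci.test() default).
+   p1 = ci.test(x, y, z, data = data, test = "x2")$p.value
+   if (p1 < alpha)
+     return("dependent")
+   # Second test: globally-complete observations only.
+   complete = data[complete.cases(data), ]
+   p2 = ci.test(x, y, z, data = complete, test = "x2")$p.value
+   if (min(p1, p2) < alpha)
+     return("dependent")
+   else
+     return("independent")
+ }#DOUBLE.TEST
> double.test("X1", "X6", "X4", data = incomplete.data)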
> # constraint-based algorithms in bnlearn all use locally-complete observations
> # for constraint-based learning, by default.
> learned.dag = pc.stable(incomplete.data)
> par(mfrow = c(1, 2))
> graphviz.compare(dag, learned.dag, main = c("correct DAG", "learned DAG"),
+                  shape = "rectangle")
As an example consider the following conditional independence test, which is the first of the two tests.
> # ci.test() uses locally-complete data by default.
> ci.test("X1", "X6", "X4", data = incomplete.data, test = "x2")
	Pearson's X^2

data:  X1 ~ X6 | X4
x2 = 12.368, df = 6, p-value = 0.05425
alternative hypothesis: true value is greater than 0
We can compute it manually from the locally-complete data.
> local.data = incomplete.data[, c("X1", "X6", "X4")]
> locally.complete = local.data[complete.cases(local.data), ]
> nrow(locally.complete)
[1] 50
> prop.table(table(locally.complete), margin = "X4")
, , X4 = a

   X6
X1           a          b
  a 0.23529412 0.17647059
  b 0.17647059 0.17647059
  c 0.23529412 0.00000000

, , X4 = b

   X6
X1           a          b
  a 0.00000000 0.00000000
  b 0.46666667 0.20000000
  c 0.00000000 0.33333333

, , X4 = c

   X6
X1           a          b
  a 0.11111111 0.00000000
  b 0.05555556 0.00000000
  c 0.38888889 0.44444444
> test.statistic = as.numeric(
+     chisq.test(table(locally.complete)[, , "a"])$stat +
+     chisq.test(table(locally.complete)[, , "b"] + 1e-6)$stat +
+     chisq.test(table(locally.complete)[, , "c"])$stat
+   )
> test.statistic
[1] 12.36782
> pchisq(test.statistic, df = 6, lower.tail = FALSE)
[1] 0.05424854
Its p-value is large, so we should compute the second test as well. The second test involves the same variables, but uses only globally-complete data.
> globally.complete = local.data[complete.cases(incomplete.data), ]
> nrow(globally.complete)
[1] 20
> prop.table(table(globally.complete), margin = "X4")
, , X4 = a

   X6
X1          a         b
  a 0.3750000 0.1250000
  b 0.2500000 0.2500000
  c 0.0000000 0.0000000

, , X4 = b

   X6
X1          a         b
  a 0.0000000 0.0000000
  b 0.5000000 0.1666667
  c 0.0000000 0.3333333

, , X4 = c

   X6
X1          a         b
  a 0.1666667 0.0000000
  b 0.1666667 0.0000000
  c 0.5000000 0.1666667
> test.statistic = as.numeric(
+     chisq.test(table(globally.complete)[, , "a"] + 1e-6)$stat +
+     chisq.test(table(globally.complete)[, , "b"] + 1e-6)$stat +
+     chisq.test(table(globally.complete)[, , "c"])$stat
+   )
> test.statistic
[1] 4.133331
> pchisq(test.statistic, df = 6, lower.tail = FALSE)
[1] 0.6586386
Equivalently:
> ci.test("X1", "X6", "X4",
+         data = incomplete.data[complete.cases(incomplete.data), ],
+         test = "x2")
	Pearson's X^2

data:  X1 ~ X6 | X4
x2 = 4.1333, df = 6, p-value = 0.6586
alternative hypothesis: true value is greater than 0
The second test produces an even larger p-value: since both are larger than the threshold, we accept the null hypothesis of conditional independence between X1 and X6 given X4.
Inference
Incomplete data are not a concern when performing general inference on BNs: as long as we can define the evidence we are considering and the event (for CP queries) or the target nodes (for MAP queries), we can obtain the answers we need in the same way as when data are complete. The only influence missing data may have is on the quality of the BN we learn from them and that we submit our queries to.
There are, however, two inference tasks that are interesting to discuss for incomplete data: imputation and prediction. Both take data as input, as opposed to user-provided evidence and events, and therefore interact with the missing values.
Imputation
Imputing missing data means to replace them with values estimated from the observed data. We can do that with:
- Single Imputation: we replace the missing values with their expectations (for continuous variables) or MAP estimates (for discrete variables).
- Multiple Imputation: we replace the missing values with values sampled from their posterior distribution given the observed data, that is, \(P(\mathbf{Y}_{mis} \mid \mathbf{Y}_{obs})\). We do that multiple times to capture the variability of the imputation process and to incorporate it in the variability of our statistical results.
Multiple imputation is what Rubin calls a proper imputation because of this, while single imputation is not. However, it is also much more computationally expensive, more so for multivariate models such as BNs. For this reason, single imputation is more common in practical applications.
The simplest way of performing single imputation is to take the local distribution of the node whose value we are not observing and condition on the values of its parents. In a causal setting, we are effectively imputing a variable from its direct causes. If there are multiple missing values in a single observation, we must do this for each of them in the topological ordering implied by the DAG, so that the parents of each node are imputed (if needed) before the node itself.
> dag = model2network("[X1][X2][X3|X1:X2][X4|X3][X5|X3][X6|X4:X5]")
> bn = bn.fit(dag, data)
> incomplete.data[c(11, 21, 31, 41, 51, 61), ]
     X1   X2   X3   X4   X5   X6
11 <NA>    a    a    b    c    a
21    c <NA>    a    c    b    b
31    a    a <NA>    a    b    b
41    c    c    a <NA>    a    b
51    b    a    a    b <NA>    b
61    c    c    a    c    c <NA>
> imputed.data = impute(bn, data = incomplete.data, method = "parents")
> imputed.data[c(11, 21, 31, 41, 51, 61), ]
   X1 X2 X3 X4 X5 X6
11  b  a  a  b  c  a
21  c  a  a  c  b  b
31  a  a  a  a  b  b
41  c  c  a  b  a  b
51  b  a  a  b  a  b
61  c  c  a  c  c  a
More in detail:
> # 1) Identify the missing values in the observation, encoded as NA.
> incomplete.observation = incomplete.data[61, ]
> incomplete.observation
   X1 X2 X3 X4 X5   X6
61  c  c  a  c  c <NA>
> # 2) Extract the local distribution of the node with missing values from the BN.
> cpt = coef(bn[[which(is.na(incomplete.observation))]])
> cpt
, , X5 = a

   X4
X6          a         b         c
  a 0.6163216 0.5942029 0.5316901
  b 0.3836784 0.4057971 0.4683099

, , X5 = b

   X4
X6          a         b         c
  a 0.1778656 0.2016632 0.1976285
  b 0.8221344 0.7983368 0.8023715

, , X5 = c

   X4
X6          a         b         c
  a 0.6635514 0.6890130 0.7188020
  b 0.3364486 0.3109870 0.2811980
> # 3) Find the distribution of the node given the parent values.
> condprob = cpt[, X4 = incomplete.observation$X4, X5 = incomplete.observation$X5]
> condprob
       a        b 
0.718802 0.281198 
> # 4) Find the value with the highest probability.
> names(which.max(condprob))
[1] "a"
However, this approach is clearly suboptimal for missing values in nodes that have no parents.
> # 1) Identify the missing values in the observation, encoded as NA.
> incomplete.observation = incomplete.data[11, ]
> # 2) Extract the local distribution of the node with missing values from the BN.
> prob = coef(bn[[which(is.na(incomplete.observation))]])
> prob
     a      b      c 
0.3336 0.3340 0.3324 
> # 3) Find the value with the highest probability.
> names(which.max(prob))
[1] "b"
The values of the other nodes are not taken into consideration at all. Conditioning on the values of all the nodes we observe for that observation makes better use of the available information. This can be done exactly or by Monte Carlo, with different computational trade-offs as the number of observations increases and as the BN becomes increasingly large.
> imputed.data = impute(bn, data = incomplete.data, method = "exact")
> imputed.data[c(11, 21, 31, 41, 51, 61), ]
   X1 X2 X3 X4 X5 X6
11  c  a  a  b  c  a
21  c  a  a  c  b  b
31  a  a  a  a  b  b
41  c  c  a  c  a  b
51  b  a  a  b  b  b
61  c  c  a  c  c  a
> imputed.data = impute(bn, data = incomplete.data, method = "bayes-lw")
> imputed.data[c(11, 21, 31, 41, 51, 61), ]
   X1 X2 X3 X4 X5 X6
11  c  a  a  b  c  a
21  c  a  a  c  b  b
31  a  a  a  a  b  b
41  c  c  a  c  a  b
51  b  a  a  b  b  b
61  c  c  a  c  c  a
More in detail (using the gRain package to make explicit the steps followed by bnlearn):
> # 1) Identify the missing and observed values in the observation.
> incomplete.observation = incomplete.data[11, ]
> observed.values = incomplete.observation[, which(!is.na(incomplete.observation))]
> # 2) Compute the distribution of the node with the missing values conditional on
> #    all observed values.
> junction.tree = as.grain(bn)
> with.evidence = setEvidence(junction.tree, nodes = names(observed.values),
+                             states = sapply(observed.values, as.character))
> condprob = querygrain(with.evidence, node = "X1")$X1
> condprob
X1
        a         b         c 
0.3330230 0.3305074 0.3364696 
> # 3) Find the value with the highest probability.
> names(which.max(condprob))
[1] "c"
The result we obtain with method = "exact" ("c") is different from the one we obtained earlier with method = "parents" ("b"). This highlights an important limit of single imputation: small fluctuations in the probabilities can change the imputed values. Multiple imputations would draw several values and thus produce a variety of imputed values.
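A rough sketch of that idea using the stochastic "bayes-lw" method: repeating the imputation several times produces several completed data sets whose imputed values can differ. This is not full multiple imputation in Rubin's sense, because the variability comes from the Monte Carlo estimation rather than from proper draws from the posterior predictive distribution, but it conveys the intuition.
> imputations = lapply(1:5, function(i)
+   impute(bn, data = incomplete.data, method = "bayes-lw"))
> # The values imputed for observation 11 in each of the five completed data sets.
> t(sapply(imputations, function(d) sapply(d[11, ], as.character)))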
Prediction
We can make similar considerations for prediction. Unlike imputation, three types of variables are involved: the target variables we would like to predict, the observed values \(\mathcal{D}_{obs}\) we would like to predict from, and the missing values \(\mathcal{D}_{mis}\) we should average over (since we cannot condition on them, for obvious reasons). Therefore, following the same notation that we used to describe structure learning, for a discrete target variable \(X\) we find the value with the highest probability,
\[ \hat{x} = \arg\max_{x} P(X = x \mid \mathcal{D}_{obs}), \]
while for a continuous target variable we compute the expectation
\[ \hat{x} = \operatorname{E}(X \mid \mathcal{D}_{obs}), \]
with \(\mathcal{D}_{mis}\) averaged over in both cases. We can also predict just from the parents of the target node but, as we have seen in the case of imputation, this approach can be suboptimal or even unsuitable (for root nodes).
Computing predictions with either exact inference or Monte Carlo performs similarly in typical scenarios: we are only estimating the value of a single variable, so it is a low-dimensional problem. Imputation is not necessarily a low-dimensional problem: its dimensionality depends on the number of missing values in individual observations.
> pred.ex = predict(bn, data = incomplete.data, node = "X5", method = "exact")
> pred.lw = predict(bn, data = incomplete.data, node = "X5", method = "bayes-lw")
> table(exact = pred.ex, lw = pred.lw)
     lw
exact  a  b  c
    a 28  1  8
    b  0 30  0
    c  3  0 10
More in detail (using the gRain package to make explicit the steps followed by bnlearn):
> # 0) Consider an observation with multiple missing values.
> incomplete.data[1, c("X1", "X3", "X5")] = NA
> # 1) Identify the target, missing and observed variables in the observation.
> target = "X6"
> observed.values = incomplete.observation[, which(!is.na(incomplete.data[1, ]))]
> observed.values = observed.values[, names(observed.values) != target]
> missing.values = setdiff(colnames(data), c(target, observed.values))
> # 2) Compute the distribution of the target conditional on all observed values,
> #    marginalising the variables we do not observe.
> junction.tree = as.grain(bn)
> with.evidence = setEvidence(junction.tree, nodes = names(observed.values),
+                             states = sapply(observed.values, as.character))
> condprob = querygrain(with.evidence, node = target)[[target]]
> condprob
X6
        a         b 
0.5067387 0.4932613 
> # 3) Find the value with the highest probability.
> names(which.max(condprob))
[1] "a"
The code above is equivalent to the following, more explicit code.
> jointprob = querygrain(with.evidence, node = c(target, missing.values),
+                        type = "joint")
> margin.table(jointprob, target)
X6
        a         b 
0.5067387 0.4932613 
Scutari M and Denis J-B (2021). Bayesian Networks with Examples in R. CRC Press, 2nd edition. link ↩︎
Rubin DB (1976). “Inference and Missing Data.” Biometrika, 63:581–592. link ↩︎
Little RJA and Rubin DB (2019). Statistical Analysis with Missing Data. Wiley, 3rd edition. link ↩︎
van Buuren (2018). Flexible Imputation of Missing Data. CRC Press, 2nd edition. link ↩︎
Mohan K and Pearl J (2018). “Graphical Models for Processing Missing Data.” Journal of the American Statistical Association, 534(116):1023–1037. link ↩︎
Scutari M (2020). “Bayesian Network Models for Incomplete and Dynamic Data.” Statistica Neerlandica, 74(3):397–419. link ↩︎
Bodewes T and Scutari M (2021). “Learning Bayesian Networks from Incomplete Data with the Node-Averaged Likelihood.” International Journal of Approximate Reasoning, 138:145–160. link ↩︎
Strobl E, Visweswaran S and Spirtes PL (2018). “Fast Causal Inference With Non-Random Missingness by Test-Wise Deletion.” International Journal of Data Science and Analytics, 6:47–62. link ↩︎
Tue Sep 12 09:47:40 2023 with bnlearn 4.9 and R version 4.2.2 (2022-10-31).