Imputing missing values from a Bayesian network
Imputing missing values is essential to make it possible to apply methods designed for complete data (that is, most of them) to incomplete data. Conceptually, imputation is similar to prediction: both are most probable explanation queries in which we observe a subset of the variables in the data and infer the values of some of the remaining variables. For this reason, the implementation of impute() in bnlearn has the same interface as predict() (both documented here).
Like predict() (illustrated here), impute() takes as arguments a fitted Bayesian network, a data frame with missing data to impute and the label of the method used to perform the imputation. Available methods are the same as in predict(): "parents", "bayes-lw" and "exact".
Unlike predict(), impute() by default produces an error instead of returning data containing NAs, because that would mean that the imputation was (at least partially) unsuccessful. This behaviour can be overridden by setting strict = FALSE, which makes impute() produce a warning instead.
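For instance, the following sketch shows the call pattern with strict = FALSE; the small network and the variable names used here are placeholders for illustration and are not part of the original example.

library(bnlearn)
# A small discrete network fitted to the first two variables of learning.test.
dag.small = model2network("[A][B|A]")
fitted.small = bn.fit(dag.small, learning.test[, c("A", "B")])
# Introduce some missing values in the root node A.
incomplete.small = learning.test[, c("A", "B")]
incomplete.small[sample(nrow(incomplete.small), 50), "A"] = NA
# With strict = FALSE, impute() only warns if some values are left missing.
completed.small = impute(fitted.small, data = incomplete.small,
                         method = "parents", strict = FALSE)
all(complete.cases(completed.small))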
Imputing from the parents
With method = "parents", the missing values in each variable are imputed from the parents of that variable in the Bayesian network. The imputation is performed in topological order, starting from the root nodes, so that the parents are completed by the time they are needed to impute their children.
> library(bnlearn)
>
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> dfitted = bn.fit(dag, learning.test)
>
> incomplete = learning.test
> missing = matrix(c(sample(nrow(incomplete), 1000),
+                    sample(ncol(incomplete), 1000, replace = TRUE)),
+                  ncol = 2)
> incomplete[missing] = NA
> head(incomplete, n = 10)
   A B    C    D E    F
1  b c    b    a b    b
2  b a    c    a b <NA>
3  a a    a    a a    a
4  a a    a    a b    b
5  a a    b    c a    a
6  c c    a    c c    a
7  c c    b    c c    a
8  b b    a <NA> b    b
9  b b    b    a c    a
10 b a <NA>    a a    a
> completed = impute(dfitted, data = incomplete, method = "parents")
> all(complete.cases(completed))
[1] TRUE
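The topological order followed by the imputation can be inspected with node.ordering(), which returns the nodes of the network sorted so that every parent precedes its children; this is a quick aside, not part of the original example.

# Nodes sorted so that parents come before their children: this is the order
# in which method = "parents" completes the variables.
node.ordering(dag)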
As in predict(), method = "parents" is ill-suited to impute missing values in root nodes. Those nodes do not have any parents, so all missing values are imputed with either the average (for Gaussian nodes) or the mode (for discrete nodes) of the respective local distributions. This issue will impact the quality of the imputation of other variables that are descendants of those nodes.
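As a quick check of this behaviour, building on the objects created above, all the values imputed for the root node A should coincide with the single most common level of A.

# Rows of the root node A that were missing before imputation ...
was.missing = is.na(incomplete$A)
# ... are all filled in with the mode of A's marginal distribution, so only
# one level should have a non-zero count in the table below.
table(completed$A[was.missing])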
Imputing with Monte Carlo posterior inference
With method = "bayes-lw", the missing values in each observation are imputed from their joint posterior distribution conditional on the variables that are observed. The posterior distribution is estimated empirically using likelihood weighting, and the imputed values are either the mean or the mode of the distribution. Therefore, the imputed values may vary between different runs of impute(). As in predict(), method = "bayes-lw" takes as an optional argument the number n of particles produced by likelihood weighting for each observation in the data.
> completed = impute(dfitted, data = incomplete, method = "bayes-lw")
> completed = impute(dfitted, data = incomplete, method = "bayes-lw", n = 5000)
> head(completed, n = 10)
   A B C D E F
1  b c b a b b
2  b a c a b b
3  a a a a a a
4  a a a a b b
5  a a b c a a
6  c c a c c a
7  c c b c c a
8  b b a b b b
9  b b b a c a
10 b a b a a a
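The simulation variability can be seen directly by repeating the imputation: two independent runs of impute() with method = "bayes-lw" are not guaranteed to return identical completed data sets. A small check along these lines (not part of the original example):

# Two independent likelihood-weighting imputations of the same data ...
run1 = impute(dfitted, data = incomplete, method = "bayes-lw")
run2 = impute(dfitted, data = incomplete, method = "bayes-lw")
# ... may disagree on some imputed cells because of the Monte Carlo error.
identical(run1, run2)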
Imputing with exact inference
Similarly, with method = "exact" the missing values in each observation are imputed from their joint posterior distribution conditional on the variables that are observed. However, the posterior distribution is reconstructed using exact inference and therefore the imputed values have no simulation variability.
> completed = impute(dfitted, data = incomplete, method = "exact")
> head(completed, n = 10)
   A B C D E F
1  b c b a b b
2  b a c a b b
3  a a a a a a
4  a a a a b b
5  a a b c a a
6  c c a c c a
7  c c b c c a
8  b b a b b b
9  b b b a c a
10 b a b a a a
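In contrast, repeating the imputation with exact inference always produces the same completed data, since no simulation is involved; again a small check on the objects above, not part of the original example.

# Exact inference is deterministic: repeated calls give identical results.
exact1 = impute(dfitted, data = incomplete, method = "exact")
exact2 = impute(dfitted, data = incomplete, method = "exact")
identical(exact1, exact2)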
Mon Aug 5 02:43:52 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).