Imputing missing values from a Bayesian network

Imputing missing values makes it possible to apply methods designed for complete data (that is, most of them) to incomplete data. Conceptually, imputation is similar to prediction: both are most probable explanation queries in which we observe a subset of the variables in the data and infer the values of some of the remaining variables. For this reason, the implementation of impute() in bnlearn has the same interface as predict() (both documented here).

Like predict() (illustrated here), impute() takes as arguments a fitted Bayesian network, a data frame with missing data to impute and the label of the method used to perform the imputation. Available methods are the same as in predict(): "parents", "bayes-lw" and "exact".

Unlike predict(), impute() by default produces an error instead of returning data that still contain NAs, because leftover NAs mean that the imputation was (at least partially) unsuccessful. This behaviour can be overridden by setting strict = FALSE, which makes impute() produce a warning instead.
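As a minimal sketch of the non-strict mode, assuming the strict argument behaves as described above, the call below lets impute() return even if some values could not be imputed and then counts the NAs that are left. The choice of column E, the 100 missing values and the name still.missing are arbitrary and only for illustration.

```r
library(bnlearn)

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, learning.test)

# remove some values from a single column (an arbitrary choice).
set.seed(1)
incomplete = learning.test
incomplete[sample(nrow(incomplete), 100), "E"] = NA

# strict = FALSE: leftover NAs produce a warning instead of an error.
completed = impute(dfitted, data = incomplete, method = "parents",
              strict = FALSE)

# check how many values are still missing in each column before moving on.
still.missing = colSums(is.na(completed))
still.missing
```

Checking the counts explicitly is a safer habit than relying on the warning alone, since warnings are easy to miss in long scripts.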

Imputing from the parents

With method = "parents", the missing values in each variable are imputed from the parents of that variable in the Bayesian network. The imputation is performed in topological order, starting from the root nodes, so that the parents are completed by the time they are needed to impute their children.

> library(bnlearn)
> 
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> dfitted = bn.fit(dag, learning.test)
> 
> incomplete = learning.test
> missing = matrix(c(sample(nrow(incomplete), 1000),
+                    sample(ncol(incomplete), 1000, replace = TRUE)),
+             ncol = 2)
> incomplete[missing] = NA
> head(incomplete, n = 10)
   A B    C    D E    F
1  b c    b    a b    b
2  b a    c    a b <NA>
3  a a    a    a a    a
4  a a    a    a b    b
5  a a    b    c a    a
6  c c    a    c c    a
7  c c    b    c c    a
8  b b    a <NA> b    b
9  b b    b    a c    a
10 b a <NA>    a a    a
> completed = impute(dfitted, data = incomplete, method = "parents")
> all(complete.cases(completed))
[1] TRUE

As in predict(), method = "parents" is ill-suited to impute missing values in root nodes. Those nodes do not have any parents, so all their missing values are imputed with the same value: the mean (for Gaussian nodes) or the mode (for discrete nodes) of the respective local distribution. This also affects the quality of the imputation of the descendants of those nodes, which are imputed conditional on the values filled in for their parents.
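The behaviour for root nodes can be checked directly. The sketch below removes values only from the root node A, imputes them from the parents, and compares the imputed values with the mode of the fitted marginal distribution of A. The number of missing values and the names mode.A and imputed.A are arbitrary choices made for this illustration.

```r
library(bnlearn)

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, learning.test)

# remove values only from A, which is a root node in this network.
set.seed(42)
incomplete = learning.test
incomplete[sample(nrow(incomplete), 200), "A"] = NA

completed = impute(dfitted, data = incomplete, method = "parents")

# the most probable level in the fitted marginal distribution of A.
mode.A = names(which.max(dfitted$A$prob))
# the values filled in for the cells that were missing.
imputed.A = as.character(completed$A[is.na(incomplete$A)])

table(imputed.A)
all(imputed.A == mode.A)
```

If the description above holds, every missing value in A receives the same level regardless of what is observed in the rest of the row.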

Imputing with Monte Carlo posterior inference

With method = "bayes-lw", the missing values in each observation are imputed from their joint posterior distribution conditional on the variables that are observed. The posterior distribution is estimated empirically using likelihood weighting, and the imputed values are either its mean (for Gaussian nodes) or its mode (for discrete nodes). Because likelihood weighting is a Monte Carlo method, the imputed values may vary between different runs of impute(). As in predict(), method = "bayes-lw" takes as an optional argument the number n of particles produced by likelihood weighting for each observation in the data.

> completed = impute(dfitted, data = incomplete, method = "bayes-lw")
> completed = impute(dfitted, data = incomplete, method = "bayes-lw", n = 5000)
> head(completed, n = 10)
   A B C D E F
1  b c b a b b
2  b a c a b b
3  a a a a a a
4  a a a a b b
5  a a b c a a
6  c c a c c a
7  c c b c c a
8  b b a b b b
9  b b b a c a
10 b a b a a a
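The run-to-run variability of likelihood weighting can be made visible, and controlled, along the following lines. This is a sketch that assumes impute() draws from R's own random number generator and that no cluster is used, so that set.seed() makes a run reproducible; the small value n = 50 is chosen only to make differences between runs more likely.

```r
library(bnlearn)

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, learning.test)

set.seed(1)
incomplete = learning.test
missing = cbind(sample(nrow(incomplete), 1000),
                sample(ncol(incomplete), 1000, replace = TRUE))
incomplete[missing] = NA

# two unseeded runs with few particles: some imputed cells may differ.
run1 = impute(dfitted, data = incomplete, method = "bayes-lw", n = 50)
run2 = impute(dfitted, data = incomplete, method = "bayes-lw", n = 50)
sum(as.matrix(run1) != as.matrix(run2))

# two runs started from the same seed should give the same imputations.
set.seed(123)
seeded1 = impute(dfitted, data = incomplete, method = "bayes-lw", n = 50)
set.seed(123)
seeded2 = impute(dfitted, data = incomplete, method = "bayes-lw", n = 50)
same = all(as.matrix(seeded1) == as.matrix(seeded2))
same
```

Increasing n reduces the variability at the cost of a longer running time, while setting the seed only makes a particular run repeatable.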

Imputing with exact inference

Similarly, with method = "exact" the missing values in each observation are imputed from their joint posterior distribution conditional on the variables that are observed. However, the posterior distribution is computed using exact inference, and therefore the imputed values have no simulation variability.

> completed = impute(dfitted, data = incomplete, method = "exact")
Loading required namespace: gRain

Attaching package: 'gRbase'
The following objects are masked from 'package:bnlearn':

    ancestors, children, nodes, parents
> head(completed, n = 10)
   A B C D E F
1  b c b a b b
2  b a c a b b
3  a a a a a a
4  a a a a b b
5  a a b c a a
6  c c a c c a
7  c c b c c a
8  b b a b b b
9  b b b a c a
10 b a b a a a
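Since the missing values were introduced artificially, the imputed values can be compared with the original ones in learning.test. The sketch below computes the proportion of correctly imputed cells for each method. The accuracy() helper and the seed are hypothetical choices made for this illustration, and method = "exact" is only included if the gRain package is installed.

```r
library(bnlearn)

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
dfitted = bn.fit(dag, learning.test)

set.seed(2024)
incomplete = learning.test
missing = cbind(sample(nrow(incomplete), 1000),
                sample(ncol(incomplete), 1000, replace = TRUE))
incomplete[missing] = NA

# the original values of the cells that were set to NA.
truth = as.matrix(learning.test)[missing]

# hypothetical helper: proportion of missing cells imputed correctly.
accuracy = function(method) {

  completed = impute(dfitted, data = incomplete, method = method)
  mean(as.matrix(completed)[missing] == truth)

}

methods = c("parents", "bayes-lw")
# exact inference relies on gRain, so only try it if it is available.
if (requireNamespace("gRain", quietly = TRUE))
  methods = c(methods, "exact")

acc = sapply(methods, accuracy)
acc
```

The proportions depend on the seed and, for "bayes-lw", on the simulation itself, so they are best read as a rough comparison rather than as a benchmark.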

Last updated on Sat Feb 17 23:44:22 2024 with bnlearn 5.0-20240208 and R version 4.3.2 (2023-10-31).