## Imputing missing values from a Bayesian network

Imputing missing values makes it possible to apply methods designed for complete data (that is, most of them) to incomplete data. Conceptually, imputation is similar to prediction: both are most probable explanation queries in which we observe a subset of the variables in the data and infer the values of some of the remaining variables. For this reason, the implementation of `impute()` in bnlearn has the same interface as `predict()` (both documented here).

Like `predict()` (illustrated here), `impute()` takes as arguments a fitted Bayesian network, a data frame with missing data to impute and the label of the method used to perform the imputation. Available methods are the same as in `predict()`: `"parents"`, `"bayes-lw"` and `"exact"`.

### Imputing from the parents

With `method = "parents"`, the missing values in each variable are imputed from the parents of that variable in the Bayesian network. The imputation is performed in topological order, starting from the root nodes, so that the parents are completed by the time they are needed to impute their children.

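```
> library(bnlearn)
>
> # network structure, with parameters fitted on the complete data
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> dfitted = bn.fit(dag, learning.test)
>
> # set one randomly chosen cell to NA in each of 1000 distinct rows
> incomplete = learning.test
> missing = matrix(c(sample(nrow(incomplete), 1000),
+                    sample(ncol(incomplete), 1000, replace = TRUE)),
+             ncol = 2)
> incomplete[missing] = NA
> head(incomplete, n = 10)
```
```
   A B    C    D E    F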
1  b c    b    a b    b
2  b a    c    a b <NA>
3  a a    a    a a    a
4  a a    a    a b    b
5  a a    b    c a    a
6  c c    a    c c    a
7  c c    b    c c    a
8  b b    a <NA> b    b
9  b b    b    a c    a
10 b a <NA>    a a    a
```
```
> completed = impute(dfitted, data = incomplete, method = "parents")
> all(complete.cases(completed))
```
```
[1] TRUE
```
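The topological order used by `method = "parents"` can be inspected directly. The following is a minimal sketch that uses `node.ordering()`, `root.nodes()` and `arcs()` from bnlearn on the same network structure; the variable names `ordering` and `roots` are chosen here for illustration only.

```r
library(bnlearn)

# same network structure as in the example above
dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")

# nodes sorted so that every parent comes before its children
ordering = node.ordering(dag)
# nodes without parents, which are imputed first
roots = root.nodes(dag)

print(ordering)
print(roots)

# check that every arc points from an earlier node to a later one
a = arcs(dag)
print(all(match(a[, "from"], ordering) < match(a[, "to"], ordering)))
```

Any order in which parents precede their children would work equally well for the imputation; the one returned by `node.ordering()` is just one such order.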

As in `predict()`, `method = "parents"` is ill-suited to impute missing values in root nodes. Those nodes do not have any parents, so every missing value in a root node is imputed with the same value: the mean (for Gaussian nodes) or the mode (for discrete nodes) of its local distribution. This issue also degrades the quality of the imputation of the variables that are descendants of those nodes, since they are imputed from the imputed values of their parents.
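The behaviour on root nodes can be checked with a small experiment. This is a sketch under stated assumptions: hiding `A` in the first 200 rows is an arbitrary choice made here, and `hidden` and `mode.A` are illustrative names that do not come from bnlearn.

```r
library(bnlearn)

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
fitted = bn.fit(dag, learning.test)

# hide the root node A in the first 200 rows (arbitrary choice)
incomplete = learning.test
hidden = 1:200
incomplete[hidden, "A"] = NA

completed = impute(fitted, data = incomplete, method = "parents")

# most probable level in the fitted marginal distribution of A
mode.A = names(which.max(coef(fitted$A)))

# all hidden cells should receive that single level
print(table(completed[hidden, "A"]))
print(mode.A)

# proportion of imputed values that match the original data
print(mean(completed[hidden, "A"] == learning.test[hidden, "A"]))
```

Since the imputed value ignores everything else observed in the same row, the proportion of correct imputations cannot be expected to exceed the relative frequency of the mode in those rows.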

### Imputing with Monte Carlo posterior inference

With `method = "bayes-lw"`, the missing values in each observation are imputed from their joint posterior distribution conditional on the variables that are observed. The posterior distribution is estimated empirically using likelihood weighting, and the imputed values are either the mean or the mode of that distribution. Since likelihood weighting is a simulation method, the imputed values may vary between different runs of `impute()`. As in `predict()`, `method = "bayes-lw"` takes as an optional argument the number `n` of particles produced by likelihood weighting for each observation in the data.

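```
> # default number of particles
> completed = impute(dfitted, data = incomplete, method = "bayes-lw")
> # more particles reduce the simulation variability
> completed = impute(dfitted, data = incomplete, method = "bayes-lw", n = 5000)
> head(completed, n = 10)
```
```
   A B C D E F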
1  b c b a b b
2  b a c a b b
3  a a a a a a
4  a a a a b b
5  a a b c a a
6  c c a c c a
7  c c b c c a
8  b b a b b b
9  b b b a c a
10 b a b a a a
```
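Run-to-run variability can be controlled by fixing the seed of the random number generator before each call. The sketch below assumes that likelihood weighting in bnlearn draws from R's own generator, so that `set.seed()` applies; the seeds, the choice of `n = 1000` and the names `first` and `second` are arbitrary choices made here.

```r
library(bnlearn)

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
fitted = bn.fit(dag, learning.test)

# one missing cell in each of 1000 randomly chosen rows
set.seed(1)
incomplete = learning.test
missing = cbind(sample(nrow(incomplete), 1000),
                sample(ncol(incomplete), 1000, replace = TRUE))
incomplete[missing] = NA

# same seed before each call, hence the same particles
set.seed(42)
first = impute(fitted, data = incomplete, method = "bayes-lw", n = 1000)
set.seed(42)
second = impute(fitted, data = incomplete, method = "bayes-lw", n = 1000)
print(identical(first, second))

# proportion of imputed cells that match the original data
print(mean(first[missing] == learning.test[missing]))
```

Without the calls to `set.seed()`, the two completed data sets may differ in a few cells, typically those in which two levels have similar posterior probabilities.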

### Imputing with exact inference

Similarly, with `method = "exact"` the missing values in each observation are imputed from their joint posterior distribution conditional on the variables that are observed. However, the posterior distribution is reconstructed using exact inference and therefore the imputed values have no simulation variability.

```
> completed = impute(dfitted, data = incomplete, method = "exact")
```
```
Loading required namespace: gRain

Attaching package: 'gRbase'

The following objects are masked from 'package:bnlearn':

    ancestors, children, nodes, parents
```
```
> head(completed, n = 10)
```
```
   A B C D E F
1  b c b a b b
2  b a c a b b
3  a a a a a a
4  a a a a b b
5  a a b c a a
6  c c a c c a
7  c c b c c a
8  b b a b b b
9  b b b a c a
10 b a b a a a
```
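The absence of simulation variability can be checked by running the exact imputation twice without fixing any seed in between. This is a sketch under stated assumptions: it needs the gRain package to be installed, it uses only 200 missing cells to keep the running time short, and the names `exact1` and `exact2` are illustrative. The two runs should coincide unless two levels have exactly the same posterior probability for some cell.

```r
library(bnlearn)

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
fitted = bn.fit(dag, learning.test)

# one missing cell in each of 200 randomly chosen rows
set.seed(1)
incomplete = learning.test
missing = cbind(sample(nrow(incomplete), 200),
                sample(ncol(incomplete), 200, replace = TRUE))
incomplete[missing] = NA

# exact inference relies on the gRain package
exact1 = impute(fitted, data = incomplete, method = "exact")
exact2 = impute(fitted, data = incomplete, method = "exact")

# do the two runs produce the same completed data?
print(identical(exact1, exact2))

# proportion of imputed cells that match the original data
print(mean(exact1[missing] == learning.test[missing]))
```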
Last updated on `Fri Nov 11 18:55:34 2022` with bnlearn `4.9-20221107` and `R version 4.2.2 (2022-10-31)`.