## Parameter learning from data with missing values

### Parameter estimators for complete data

Most approaches to parameter learning assume that the local distributions are independent of each other, so that their parameters can be estimated individually: the parameter estimates for any one local distribution are computed without reference to those of the others. As a result, estimating the parameters of a local distribution only requires the data observed for the variables involved in that specific local distribution (the node and its parents).

This allows `bn.fit()` to handle missing values transparently, by using only the locally complete observations for each local distribution. For instance, in the case of discrete Bayesian networks we have:

```
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> missing = matrix(FALSE, 4000, ncol(learning.test))
> missing[sample(length(missing), 1000)] = TRUE
> incomplete = learning.test[1:4000, ]
> incomplete[missing] = NA
> fitted = bn.fit(dag, incomplete)
```
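
To make the decomposition concrete, the variables that each local distribution draws on can be read off the network structure; the snippet below (a small illustration using bnlearn's `nodes()` and `parents()`) lists, for every node, the columns of the data involved in estimating its parameters.

```
> # each local distribution involves only the node and its parents
> sapply(nodes(dag), function(node) c(node, parents(dag, node)))
```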

Consider the estimated conditional probability table of `B` given `A`:

```
> fitted$B
```
```
Parameters of node B (multinomial distribution)

Conditional probability table:

A
B            a          b          c
a 0.85714286 0.44426363 0.11822660
b 0.02435065 0.22375915 0.09852217
c 0.11850649 0.33197722 0.78325123
```

The relevant data can be summarized as follows:

```
> table(incomplete[, c("B", "A")], useNA = "always")
```
```
      A
B         a    b    c <NA>
a    1056  546  144   69
b      30  275  120   19
c     146  408  954   56
<NA>   48   55   64   10
```

The maximum likelihood estimates of the conditional probabilities are then computed from the counts arising from the observations in which both `A` and `B` are observed:

```
> prop.table(table(incomplete[, c("B", "A")], useNA = "no"), "A")
```
```
   A
B            a          b          c
a 0.85714286 0.44426363 0.11822660
b 0.02435065 0.22375915 0.09852217
c 0.11850649 0.33197722 0.78325123
```

Bayesian estimates of the conditional probabilities are computed in the same way, but they also involve a prior distribution.
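
For instance, a posterior estimate with a uniform prior can be obtained by passing `method = "bayes"` and an imaginary sample size to `bn.fit()`; this is a brief sketch, and the `iss` value of 10 is an arbitrary choice.

```
> # posterior estimates: observed counts are smoothed towards a uniform prior
> fitted.bayes = bn.fit(dag, incomplete, method = "bayes", iss = 10)
> fitted.bayes$B
```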

As a second example, in the case of Gaussian Bayesian networks we have:

```
> dagG = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
> missing = matrix(FALSE, nrow(gaussian.test), ncol(gaussian.test))
> missing[sample(length(missing), 1000)] = TRUE
> incompleteG = gaussian.test
> incompleteG[missing] = NA
> fitted = bn.fit(dagG, incompleteG)
```

The maximum likelihood estimates of the regression coefficients in the local distribution of `C` against `A` and `B` are then identical to those produced by `lm()` with `na.action = na.omit`:

```
> fitted$C
```
```
Parameters of node C (Gaussian distribution)

Conditional density: C | A + B
Coefficients:
(Intercept)            A            B
2.001521     1.997936     1.997536
Standard deviation of the residuals: 0.5100404
```
```
> lm(C ~ A + B, data = incompleteG, na.action = na.omit)
```
```
Call:
lm(formula = C ~ A + B, data = incompleteG, na.action = na.omit)

Coefficients:
(Intercept)            A            B
2.002        1.998        1.998
```

### The Expectation-Maximization (EM) algorithm

On the other hand, the Expectation-Maximization (EM) algorithm is explicitly designed to incorporate incomplete data into parameter estimates. In its hard-EM form, the steps are implemented as follows:

* missing data are imputed in the expectation step, using the current parameter estimates;
* parameters are estimated in the maximization step, using the completed data from the expectation step.
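
Conceptually, these two steps can be hand-rolled with `impute()` and `bn.fit()`; the sketch below alternates them for an arbitrary, fixed number of iterations instead of performing a proper convergence check.

```
> # initial estimates from the locally complete observations
> current = bn.fit(dag, incomplete)
> for (iter in 1:5) {
+   completed = impute(current, data = incomplete, method = "bayes-lw")  # expectation step
+   current = bn.fit(dag, completed)                                     # maximization step
+ }
```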

These two steps are reflected in the EM implementations for discrete (`method = "hard-em"`), Gaussian (`method = "hard-em-g"`) and conditional Gaussian networks (`method = "hard-em-cg"`). Their optional arguments `impute` and `fit` control the methods used in the expectation and maximization steps:

```
> fitted = bn.fit(dag, incomplete, method = "hard-em", impute = "bayes-lw", fit = "bayes")
> fitted$B
```
```
Parameters of node B (multinomial distribution)

Conditional probability table:

A
B            a          b          c
a 0.86118744 0.45281450 0.11659567
b 0.02331585 0.21484992 0.09479663
c 0.11549671 0.33233558 0.78860770
```

Optional arguments can be passed to either method using the `impute.args` and `fit.args` arguments; if none are provided, the default values from `impute()` and `bn.fit()` are used.

```
> fitted = bn.fit(dag, incomplete, method = "hard-em",
+            impute = "bayes-lw", impute.args = list(n = 1000),
+            fit = "bayes", fit.args = list(iss = 2))
> fitted$B
```
```
Parameters of node B (multinomial distribution)

Conditional probability table:

A
B            a          b          c
a 0.86041095 0.45210823 0.11540715
b 0.02262707 0.21530544 0.09507072
c 0.11696198 0.33258632 0.78952214
```

The number of iterations of the EM algorithm is controlled by the `max.iter` argument (defaulting to 5 iterations) and by the `threshold` argument (defaulting to a minimum increase of 0.001 in the relative log-likelihood, after dividing by the sample size).

```
> flat.prior = bn.fit(dag, incomplete[1:5, ], method = "bayes", iss = 10^4)
> fitted = bn.fit(dag, incomplete, method = "hard-em", start = flat.prior, max.iter = 1, debug = TRUE)
```
```
* expectation-maximization iteration 1 .
> the relative difference in log-likelihood is: 0.2933575 .
```
```
> fitted = bn.fit(dag, incomplete, method = "hard-em", start = flat.prior, threshold = 1e-10, debug = TRUE)
```
```
* expectation-maximization iteration 1 .
> the relative difference in log-likelihood is: 0.2935193 .
* expectation-maximization iteration 2 .
> the relative difference in log-likelihood is: 0.001798532 .
* expectation-maximization iteration 3 .
> the relative difference in log-likelihood is: -0.0001061403 .
@ the difference is smaller than the threshold, stopping.
```

As suggested in the book by Koller & Friedman, convergence should be assessed by computing the likelihood of a data set other than the one used to estimate the parameters, in order to avoid overfitting. This is possible with the `newdata` argument:

```
> fitted = bn.fit(dag, incomplete, method = "hard-em", newdata = learning.test[4001:5000, ])
```
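
The same idea can be applied manually after the fact: `logLik()` has a method for `bn.fit` objects, so the in-sample and held-out log-likelihoods can be compared directly (an informal check; the two values should be scaled by the respective sample sizes to be comparable).

```
> # log-likelihoods on the training data and on the held-out data
> logLik(fitted, learning.test[1:4000, ])
> logLik(fitted, learning.test[4001:5000, ])
```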

Furthermore, Koller & Friedman suggest initializing the EM algorithm with different sets of parameter values to avoid converging to a local maximum. The `start` argument can be used to pass a `bn.fit` object that will be used to perform the initial imputation and to compute the initial value of the log-likelihood. Note that this `bn.fit` object can encode a network with a different structure from that of the network whose parameters we are estimating.

```
> start.dag = pdag2dag(chow.liu(learning.test), ordering = names(learning.test))
> start.bn = bn.fit(start.dag, learning.test)
> fitted = bn.fit(dag, incomplete, method = "hard-em", start = start.bn)
```
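
One way of acting on this advice is to run EM from several different starting networks and keep the fit with the best held-out log-likelihood; the sketch below uses `random.graph()` to generate arbitrary starting structures as a stand-in for random parameter initializations.

```
> # run hard EM from a few random starting networks, keep the best fit
> heldout = learning.test[4001:5000, ]
> best = NULL
> for (i in 1:3) {
+   rnd = bn.fit(random.graph(names(learning.test)), learning.test)
+   cand = bn.fit(dag, incomplete, method = "hard-em", start = rnd)
+   if (is.null(best) || logLik(cand, heldout) > logLik(best, heldout))
+     best = cand
+ }
```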

The `start` argument is also required to initialize EM when at least one of the variables in the data is latent, that is, when it takes the value `NA` for all observations. By default, the starting network would be a `bn.fit` object estimated from the locally complete observations, but no such observations exist for a latent variable.
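
For illustration, a latent variable can be simulated by blanking out an entire column and initializing EM from, say, the `start.bn` network fitted above; this is only a sketch, and how informative the resulting estimates are depends on the model and the data.

```
> # make E latent: it takes value NA for all observations
> latent = incomplete
> latent$E = factor(NA, levels = levels(learning.test$E))
> fitted = bn.fit(dag, latent, method = "hard-em", start = start.bn)
```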

Last updated on `Tue Mar 14 12:20:25 2023` with bnlearn `4.9-20230309` and `R version 4.2.2 (2022-10-31)`.