## Log-likelihood of data with missing values

The behaviour of `logLik()` when data are complete is described here here. When data are incomplete, `logLik()` will return `NA` for all observations containing missing values if `by.sample = TRUE`.

```> library(bnlearn)
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> bn = bn.fit(dag, learning.test)
> incomplete = rbn(bn, 20)
> incomplete[1, "A"] = NA
> incomplete[2, "B"] = NA
> incomplete[3, "C"] = NA
> incomplete[4, "D"] = NA
> incomplete[5, "E"] = NA
> incomplete[6, "F"] = NA
> logLik(bn, incomplete, by.sample = TRUE, debug = TRUE)
```
```> computing the log-likelihood of a discrete network.
* processing node A.
* processing node B.
* processing node C.
* processing node D.
* processing node E.
* processing node F.
```
```         NA        NA        NA        NA        NA        NA -3.383587
 -2.677968 -5.702894 -5.862672 -2.833497 -5.687906 -2.833497 -2.833497
 -4.121276 -6.822133 -5.004131 -6.459965 -4.672770 -5.004131
```

If we are computing the log-likelihood only for a subset of nodes, `logLik()` will only return `NA` for those observations that are not locally complete (that is, either the nodes or some of their parents are missing).

```> logLik(bn, incomplete, by.sample = TRUE, nodes = "B")
```
```          NA         NA -1.5097823 -0.1553504 -0.1553504 -0.1553504
 -0.1553504 -0.1553504 -2.3595312 -0.8098829 -0.2349458 -0.2349458
 -0.2349458 -0.2349458 -0.1553504 -0.1553504 -0.2349458 -1.5097823
 -0.1553504 -0.2349458
```

If `by.sample = FALSE` (the default) and we are computing the log-likelihood of the whole sample, `logLik()` will return `NA` if the data are incomplete at all.

```> logLik(bn, incomplete, by.sample = FALSE, debug = TRUE)
```
```> computing the log-likelihood of a discrete network.
* incomplete data for node A, the log-likelihood is NA.
```
``` NA
```

We can prevent `logLik()` from returning `NA` when `by.sample = FALSE` by setting `na.rm = TRUE`. `logLik()` will then compute the log-likelihood as the node-average log-likelihood scaled by the sample size for each node.

```> logLik(bn, incomplete, by.sample = FALSE, na.rm = TRUE, debug = TRUE)
```
```> computing the log-likelihood of a discrete network.
* processing node A.
> 19 locally-complete observations out of 20.
> log-likelihood is -21.982814.
* processing node B.
> 18 locally-complete observations out of 20.
> log-likelihood is -9.823841.
* processing node C.
> 19 locally-complete observations out of 20.
> log-likelihood is -16.876785.
* processing node D.
> 17 locally-complete observations out of 20.
> log-likelihood is -16.938355.
* processing node E.
> 17 locally-complete observations out of 20.
> log-likelihood is -12.396791.
* processing node F.
> 19 locally-complete observations out of 20.
> log-likelihood is -13.866863.
```
``` -91.88545
```

In practice, this means that:

```> logLik(bn, incomplete, node = "B", na.rm = TRUE)
```
``` -9.823841
```
```> locally.complete = complete.cases(incomplete[, c("B", parents(bn, "B"))])
> ncomplete = sum(locally.complete)
> logLik(bn, incomplete[locally.complete, ], node = "B") / ncomplete * nrow(incomplete)
```
``` -9.823841
```

Note that the log-likelihood can still be `NA` if any of the parameters of the Bayesian network involved in its computation is equal to `NA`: `na.rm = TRUE` only prevents the missing values in the data from propagating, not those in the parameters.

Last updated on `Tue Mar 14 12:05:13 2023` with bnlearn `4.9-20230309` and `R version 4.2.2 (2022-10-31)`.