## Log-likelihood of data with missing values

The behaviour of `logLik()`

when data are complete is described here
here. When data are incomplete, `logLik()`

will return `NA`

for all observations containing missing values if `by.sample = TRUE`

.

> library(bnlearn) > dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]") > bn = bn.fit(dag, learning.test) > incomplete = rbn(bn, 20) > incomplete[1, "A"] = NA > incomplete[2, "B"] = NA > incomplete[3, "C"] = NA > incomplete[4, "D"] = NA > incomplete[5, "E"] = NA > incomplete[6, "F"] = NA > logLik(bn, incomplete, by.sample = TRUE, debug = TRUE)

> computing the log-likelihood of a discrete network. * processing node A. * processing node B. * processing node C. * processing node D. * processing node E. * processing node F.

[1] NA NA NA NA NA NA -3.383587 [8] -2.677968 -5.702894 -5.862672 -2.833497 -5.687906 -2.833497 -2.833497 [15] -4.121276 -6.822133 -5.004131 -6.459965 -4.672770 -5.004131

If we are computing the log-likelihood only for a subset of nodes, `logLik()`

will only return
`NA`

for those observations that are not locally complete (that is, either the nodes or some of their parents
are missing).

> logLik(bn, incomplete, by.sample = TRUE, nodes = "B")

[1] NA NA -1.5097823 -0.1553504 -0.1553504 -0.1553504 [7] -0.1553504 -0.1553504 -2.3595312 -0.8098829 -0.2349458 -0.2349458 [13] -0.2349458 -0.2349458 -0.1553504 -0.1553504 -0.2349458 -1.5097823 [19] -0.1553504 -0.2349458

If `by.sample = FALSE`

(the default) and we are computing the log-likelihood of the whole sample,
`logLik()`

will return `NA`

if the data are incomplete at all.

> logLik(bn, incomplete, by.sample = FALSE, debug = TRUE)

> computing the log-likelihood of a discrete network. * incomplete data for node A, the log-likelihood is NA.

[1] NA

We can prevent `logLik()`

from returning `NA`

when `by.sample = FALSE`

by setting
`na.rm = TRUE`

. `logLik()`

will then compute the log-likelihood as the node-average log-likelihood
scaled by the sample size for each node.

> logLik(bn, incomplete, by.sample = FALSE, na.rm = TRUE, debug = TRUE)

> computing the log-likelihood of a discrete network. * processing node A. > 19 locally-complete observations out of 20. > log-likelihood is -21.982814. * processing node B. > 18 locally-complete observations out of 20. > log-likelihood is -9.823841. * processing node C. > 19 locally-complete observations out of 20. > log-likelihood is -16.876785. * processing node D. > 17 locally-complete observations out of 20. > log-likelihood is -16.938355. * processing node E. > 17 locally-complete observations out of 20. > log-likelihood is -12.396791. * processing node F. > 19 locally-complete observations out of 20. > log-likelihood is -13.866863.

[1] -91.88545

In practice, this means that:

> logLik(bn, incomplete, node = "B", na.rm = TRUE)

[1] -9.823841

> locally.complete = complete.cases(incomplete[, c("B", parents(bn, "B"))]) > ncomplete = sum(locally.complete) > logLik(bn, incomplete[locally.complete, ], node = "B") / ncomplete * nrow(incomplete)

[1] -9.823841

Note that the log-likelihood can still be `NA`

if any of the parameters of the Bayesian network involved
in its computation is equal to `NA`

: `na.rm = TRUE`

only prevents the missing values in the data
from propagating, not those in the parameters.

`Tue Mar 14 12:05:13 2023`

with **bnlearn**

`4.9-20230309`

and `R version 4.2.2 (2022-10-31)`

.