## Log-likelihood of data with missing values

The behaviour of `logLik()` when data are complete is described separately. When data are incomplete and `by.sample = TRUE`, `logLik()` returns `NA` for every observation that contains at least one missing value.

    > library(bnlearn)
    > dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
    > bn = bn.fit(dag, learning.test)
    > incomplete = rbn(bn, 20)
    > incomplete[1, "A"] = NA
    > incomplete[2, "B"] = NA
    > incomplete[3, "C"] = NA
    > incomplete[4, "D"] = NA
    > incomplete[5, "E"] = NA
    > incomplete[6, "F"] = NA
    > logLik(bn, incomplete, by.sample = TRUE, debug = TRUE)

    > computing the log-likelihood of a discrete network.
    * processing node A.
    * processing node B.
    * processing node C.
    * processing node D.
    * processing node E.
    * processing node F.

    'log Lik.' NA, NA, NA, NA, NA, NA, -3.937965, -6.162759, -3.383587, -5.004131, -2.677968, -6.442550, -2.833497, -6.246337, -4.861080, -6.332851, -3.254190, -3.329707, -5.455455, -3.383587 (df=41)

If we compute the log-likelihood only for a subset of nodes, `logLik()` returns `NA` only for those observations that are not locally complete, that is, observations in which one of those nodes or one of its parents is missing.

    > logLik(bn, incomplete, by.sample = TRUE, nodes = "B")

    'log Lik.' NA, NA, -1.0962199, -0.2349458, -0.1553504, -0.1553504, -0.2349458, -1.5097823, -0.1553504, -0.2349458, -0.1553504, -0.2349458, -0.2349458, -1.0962199, -0.1553504, -0.2349458, -0.2349458, -0.8098829, -0.1553504, -0.1553504 (df=41)
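The notion of a locally-complete observation can be sketched in base R without calling **bnlearn** at all. The snippet below is only an illustration: `parent.sets` is a hand-written list read off the model string used above, and `toy` is a hypothetical stand-in for `incomplete` that has missing values in the same positions but arbitrary observed values.

```r
# Parent sets read off the model string "[A][C][F][B|A][D|A:C][E|B:F]".
parent.sets = list(A = character(0), B = "A", C = character(0),
                   D = c("A", "C"), E = c("B", "F"), F = character(0))

# Stand-in data (an assumption for illustration): 20 rows, with one missing
# value per variable in rows 1 to 6, placed as in `incomplete`.
toy = as.data.frame(matrix("a", nrow = 20, ncol = 6,
                           dimnames = list(NULL, LETTERS[1:6])))
for (i in 1:6)
  toy[i, LETTERS[i]] = NA

# An observation is locally complete for a node when the node itself and all
# of its parents are observed.
locally.complete = sapply(names(parent.sets), function(node)
  complete.cases(toy[, c(node, parent.sets[[node]]), drop = FALSE]))

# Number of locally-complete observations for each node.
colSums(locally.complete)

# Observations that are not locally complete for node B.
which(!locally.complete[, "B"])
```

For node `B` the rows flagged are 1 and 2, which are the two positions where the per-sample log-likelihood above is `NA`.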

If `by.sample = FALSE` (the default), `logLik()` computes the log-likelihood of the whole sample and returns `NA` if the data contain any missing values.

    > logLik(bn, incomplete, by.sample = FALSE, debug = TRUE)

    > computing the log-likelihood of a discrete network.
    * incomplete data for node A, the log-likelihood is NA.

    'log Lik.' NA (df=41)

We can prevent `logLik()` from returning `NA` when `by.sample = FALSE` by setting `na.rm = TRUE`. For each node, `logLik()` then averages the log-likelihood over the locally-complete observations and rescales that average by the full sample size; the log-likelihood of the network is the sum of these node contributions.

    > logLik(bn, incomplete, by.sample = FALSE, na.rm = TRUE, debug = TRUE)

    > computing the log-likelihood of a discrete network.
    * processing node A.
      > 19 locally-complete observations out of 20.
      > log-likelihood is -21.981553.
    * processing node B.
      > 18 locally-complete observations out of 20.
      > log-likelihood is -8.049087.
    * processing node C.
      > 19 locally-complete observations out of 20.
      > log-likelihood is -15.429792.
    * processing node D.
      > 17 locally-complete observations out of 20.
      > log-likelihood is -13.754659.
    * processing node E.
      > 17 locally-complete observations out of 20.
      > log-likelihood is -14.373432.
    * processing node F.
      > 19 locally-complete observations out of 20.
      > log-likelihood is -13.859284.

    'log Lik.' -87.44781 (df=41)
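As a quick sanity check, the value returned for the whole network should be the sum of the per-node values printed in the debugging output. The snippet below simply copies those six numbers by hand (`node.loglik` is a name introduced here for illustration) and adds them up in base R.

```r
# Per-node log-likelihoods copied from the debugging output above.
node.loglik = c(A = -21.981553, B = -8.049087, C = -15.429792,
                D = -13.754659, E = -14.373432, F = -13.859284)

# The log-likelihood of the network is the sum of the node contributions.
total = sum(node.loglik)
print(total, digits = 7)
```

The sum agrees with the `-87.44781` reported by `logLik()` up to rounding.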

In practice, this means that the following two computations return the same value for node `B`:

    > logLik(bn, incomplete, nodes = "B", na.rm = TRUE)

    'log Lik.' -8.049087 (df=41)

    > locally.complete = complete.cases(incomplete[, c("B", parents(bn, "B"))])
    > ncomplete = sum(locally.complete)
    > logLik(bn, incomplete[locally.complete, ], nodes = "B") / ncomplete * nrow(incomplete)

    'log Lik.' -8.049087 (df=41)
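The same figure can be recovered, at least numerically, from the per-observation values for node `B` shown earlier with `by.sample = TRUE`. This is a base R sketch of the rescaling rather than a description of how **bnlearn** implements it internally; `ll.B` is a hand-copied vector, not an object produced by the code above.

```r
# Per-observation log-likelihoods for node B, copied from the output of
# logLik(bn, incomplete, by.sample = TRUE, nodes = "B") earlier in this section.
ll.B = c(NA, NA, -1.0962199, -0.2349458, -0.1553504, -0.1553504,
         -0.2349458, -1.5097823, -0.1553504, -0.2349458, -0.1553504,
         -0.2349458, -0.2349458, -1.0962199, -0.1553504, -0.2349458,
         -0.2349458, -0.8098829, -0.1553504, -0.1553504)

# Average over the locally-complete observations only...
n.complete = sum(!is.na(ll.B))
avg = mean(ll.B, na.rm = TRUE)

# ...then rescale that average by the full sample size.
rescaled = avg * length(ll.B)
print(rescaled, digits = 7)
```

Here `n.complete` is 18, matching the debugging output for node `B`, and `rescaled` agrees with `-8.049087` up to rounding.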

Note that the log-likelihood can still be `NA` if any of the parameters of the Bayesian network involved in its computation is equal to `NA`: `na.rm = TRUE` only prevents the missing values in the data from propagating, not those in the parameters.

`Tue Aug 6 11:20:45 2024` with **bnlearn** `5.0` and `R version 4.4.1 (2024-06-14)`.