Log-likelihood of data with missing values
The behaviour of logLik() when data are complete is described here. When data are incomplete, logLik() will return NA for all observations containing missing values if by.sample = TRUE.
> library(bnlearn)
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> bn = bn.fit(dag, learning.test)
> incomplete = rbn(bn, 20)
> incomplete[1, "A"] = NA
> incomplete[2, "B"] = NA
> incomplete[3, "C"] = NA
> incomplete[4, "D"] = NA
> incomplete[5, "E"] = NA
> incomplete[6, "F"] = NA
> logLik(bn, incomplete, by.sample = TRUE, debug = TRUE)
> computing the log-likelihood of a discrete network.
* processing node A.
* processing node B.
* processing node C.
* processing node D.
* processing node E.
* processing node F.
'log Lik.' NA, NA, NA, NA, NA, NA, -3.937965, -6.162759, -3.383587, -5.004131, -2.677968, -6.442550, -2.833497, -6.246337, -4.861080, -6.332851, -3.254190, -3.329707, -5.455455, -3.383587 (df=41)
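As a quick sanity check on the output above, the NA entries should line up with the rows that contain at least one missing value. The sketch below rebuilds the example with a fixed seed; the seed and the helper name na.rows are assumptions made for illustration only.

```r
library(bnlearn)

# Rebuild the fitted network and the incomplete sample from the example.
dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
bn = bn.fit(dag, learning.test)

set.seed(42)  # assumption: fixed only to make the sketch reproducible
incomplete = rbn(bn, 20)
# One missing value per row, in a different variable each time (A to F).
for (i in 1:6)
  incomplete[i, LETTERS[i]] = NA

# Per-observation log-likelihood: NA wherever the row is incomplete.
ll = logLik(bn, incomplete, by.sample = TRUE)
na.rows = is.na(as.numeric(ll))

# The NA pattern should coincide with the incomplete rows.
identical(na.rows, !complete.cases(incomplete))
```

complete.cases() comes from base R, so the comparison does not rely on anything beyond the calls already used in the example.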
If we are computing the log-likelihood only for a subset of nodes, logLik() will only return NA for those observations that are not locally complete (that is, observations in which the nodes themselves or some of their parents have missing values).
> logLik(bn, incomplete, by.sample = TRUE, nodes = "B")
'log Lik.' NA, NA, -1.0962199, -0.2349458, -0.1553504, -0.1553504, -0.2349458, -1.5097823, -0.1553504, -0.2349458, -0.1553504, -0.2349458, -0.2349458, -1.0962199, -0.1553504, -0.2349458, -0.2349458, -0.8098829, -0.1553504, -0.1553504 (df=41)
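Local completeness can also be checked by hand. The following is a minimal sketch, not part of bnlearn itself: for each node it counts the observations in which the node and all of its parents are observed. The name local.counts and the fixed seed are assumptions made for illustration.

```r
library(bnlearn)

# Rebuild the fitted network and the incomplete sample from the example.
dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
bn = bn.fit(dag, learning.test)

set.seed(42)  # assumption: fixed only to make the sketch reproducible
incomplete = rbn(bn, 20)
for (i in 1:6)
  incomplete[i, LETTERS[i]] = NA

# An observation is locally complete for a node when the node and all of
# its parents are observed in that row.
local.counts = sapply(nodes(bn), function(node) {
  cols = c(node, parents(bn, node))
  sum(complete.cases(incomplete[, cols, drop = FALSE]))
})

local.counts[c("A", "B", "C", "D", "E", "F")]
```

With the missing values placed as in the example, B has 18 locally-complete observations: rows 1 and 2 are excluded because either its parent A or B itself is missing.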
If by.sample = FALSE (the default) and we are computing the log-likelihood of the whole sample, logLik() will return NA if the data contain any missing values.
> logLik(bn, incomplete, by.sample = FALSE, debug = TRUE)
> computing the log-likelihood of a discrete network.
* incomplete data for node A, the log-likelihood is NA.
'log Lik.' NA (df=41)
We can prevent logLik() from returning NA when by.sample = FALSE by setting na.rm = TRUE. For each node, logLik() will then compute the average log-likelihood over the locally-complete observations and scale it by the sample size; the log-likelihood of the network is the sum of these per-node values.
> logLik(bn, incomplete, by.sample = FALSE, na.rm = TRUE, debug = TRUE)
> computing the log-likelihood of a discrete network.
* processing node A.
  > 19 locally-complete observations out of 20.
  > log-likelihood is -21.981553.
* processing node B.
  > 18 locally-complete observations out of 20.
  > log-likelihood is -8.049087.
* processing node C.
  > 19 locally-complete observations out of 20.
  > log-likelihood is -15.429792.
* processing node D.
  > 17 locally-complete observations out of 20.
  > log-likelihood is -13.754659.
* processing node E.
  > 17 locally-complete observations out of 20.
  > log-likelihood is -14.373432.
* processing node F.
  > 19 locally-complete observations out of 20.
  > log-likelihood is -13.859284.
'log Lik.' -87.44781 (df=41)
In practice, this means that:
> logLik(bn, incomplete, nodes = "B", na.rm = TRUE)
'log Lik.' -8.049087 (df=41)
> locally.complete = complete.cases(incomplete[, c("B", parents(bn, "B"))])
> ncomplete = sum(locally.complete)
> logLik(bn, incomplete[locally.complete, ], nodes = "B") / ncomplete * nrow(incomplete)
'log Lik.' -8.049087 (df=41)
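The same check can be extended to the whole network. The sketch below, written under the assumption that the rescaling works as described above, rebuilds each per-node value from the per-observation log-likelihoods and compares their sum with the value returned when na.rm = TRUE. The names per.node and total and the fixed seed are illustrative assumptions.

```r
library(bnlearn)

# Rebuild the fitted network and the incomplete sample from the example.
dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
bn = bn.fit(dag, learning.test)

set.seed(42)  # assumption: fixed only to make the sketch reproducible
incomplete = rbn(bn, 20)
for (i in 1:6)
  incomplete[i, LETTERS[i]] = NA

n = nrow(incomplete)

# For each node: average the per-observation log-likelihoods over the
# locally-complete observations, then rescale by the full sample size.
per.node = sapply(nodes(bn), function(node) {
  values = as.numeric(logLik(bn, incomplete, nodes = node, by.sample = TRUE))
  mean(values, na.rm = TRUE) * n
})

# Log-likelihood of the whole network as computed with na.rm = TRUE.
total = as.numeric(logLik(bn, incomplete, na.rm = TRUE))

all.equal(sum(per.node), total)
```

Averaging before rescaling puts every node on the same footing (as if all 20 observations had been available), even though each node has a different number of locally-complete observations.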
Note that the log-likelihood can still be NA if any of the parameters of the Bayesian network involved in its computation is equal to NA: na.rm = TRUE only prevents the missing values in the data from propagating, not those in the parameters.
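One way to find out whether this applies to a given network is to look for missing parameters before calling logLik(). The sketch below is only an illustration: it deliberately produces a network with missing parameters by fitting it (with the default maximum likelihood estimator) on a subset of learning.test from which one parent configuration of D has been removed. The names keep, partial.fit and has.na are assumptions introduced here.

```r
library(bnlearn)

dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")

# Drop every row with A = "a" and C = "a", so that this configuration of
# the parents of D is never observed when fitting the parameters.
keep = !(learning.test$A == "a" & learning.test$C == "a")
partial.fit = bn.fit(dag, learning.test[keep, ])

# Flag the nodes whose conditional probability table contains missing
# values (anyNA() also catches NaN).
has.na = sapply(nodes(partial.fit), function(node) {
  anyNA(coef(partial.fit[[node]]))
})

has.na
```

Nodes flagged here are the ones whose log-likelihood may come out as NA regardless of na.rm.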
Tue Aug 6 11:20:45 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).