Log-likelihood of data with missing values
The behaviour of logLik()
when data are complete is described here
here. When data are incomplete, logLik()
will return NA
for all observations containing missing values if by.sample = TRUE
.
> library(bnlearn) > dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]") > bn = bn.fit(dag, learning.test) > incomplete = rbn(bn, 20) > incomplete[1, "A"] = NA > incomplete[2, "B"] = NA > incomplete[3, "C"] = NA > incomplete[4, "D"] = NA > incomplete[5, "E"] = NA > incomplete[6, "F"] = NA > logLik(bn, incomplete, by.sample = TRUE, debug = TRUE)
> computing the log-likelihood of a discrete network. * processing node A. * processing node B. * processing node C. * processing node D. * processing node E. * processing node F.
[1] NA NA NA NA NA NA -3.383587 [8] -2.677968 -5.702894 -5.862672 -2.833497 -5.687906 -2.833497 -2.833497 [15] -4.121276 -6.822133 -5.004131 -6.459965 -4.672770 -5.004131
If we are computing the log-likelihood only for a subset of nodes, logLik()
will only return
NA
for those observations that are not locally complete (that is, either the nodes or some of their parents
are missing).
> logLik(bn, incomplete, by.sample = TRUE, nodes = "B")
[1] NA NA -1.5097823 -0.1553504 -0.1553504 -0.1553504 [7] -0.1553504 -0.1553504 -2.3595312 -0.8098829 -0.2349458 -0.2349458 [13] -0.2349458 -0.2349458 -0.1553504 -0.1553504 -0.2349458 -1.5097823 [19] -0.1553504 -0.2349458
If by.sample = FALSE
(the default) and we are computing the log-likelihood of the whole sample,
logLik()
will return NA
if the data are incomplete at all.
> logLik(bn, incomplete, by.sample = FALSE, debug = TRUE)
> computing the log-likelihood of a discrete network. * incomplete data for node A, the log-likelihood is NA.
[1] NA
We can prevent logLik()
from returning NA
when by.sample = FALSE
by setting
na.rm = TRUE
. logLik()
will then compute the log-likelihood as the node-average log-likelihood
scaled by the sample size for each node.
> logLik(bn, incomplete, by.sample = FALSE, na.rm = TRUE, debug = TRUE)
> computing the log-likelihood of a discrete network. * processing node A. > 19 locally-complete observations out of 20. > log-likelihood is -21.982814. * processing node B. > 18 locally-complete observations out of 20. > log-likelihood is -9.823841. * processing node C. > 19 locally-complete observations out of 20. > log-likelihood is -16.876785. * processing node D. > 17 locally-complete observations out of 20. > log-likelihood is -16.938355. * processing node E. > 17 locally-complete observations out of 20. > log-likelihood is -12.396791. * processing node F. > 19 locally-complete observations out of 20. > log-likelihood is -13.866863.
[1] -91.88545
In practice, this means that:
> logLik(bn, incomplete, node = "B", na.rm = TRUE)
[1] -9.823841
> locally.complete = complete.cases(incomplete[, c("B", parents(bn, "B"))]) > ncomplete = sum(locally.complete) > logLik(bn, incomplete[locally.complete, ], node = "B") / ncomplete * nrow(incomplete)
[1] -9.823841
Note that the log-likelihood can still be NA
if any of the parameters of the Bayesian network involved
in its computation is equal to NA
: na.rm = TRUE
only prevents the missing values in the data
from propagating, not those in the parameters.
Tue Mar 14 12:05:13 2023
with bnlearn
4.9-20230309
and R version 4.2.2 (2022-10-31)
.