Log-likelihood of data with missing values

The behaviour of logLik() on complete data is documented separately. When the data are incomplete and by.sample = TRUE, logLik() returns NA for every observation that contains at least one missing value.

> library(bnlearn)
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> bn = bn.fit(dag, learning.test)
> incomplete = rbn(bn, 20)
> incomplete[1, "A"] = NA
> incomplete[2, "B"] = NA
> incomplete[3, "C"] = NA
> incomplete[4, "D"] = NA
> incomplete[5, "E"] = NA
> incomplete[6, "F"] = NA
> logLik(bn, incomplete, by.sample = TRUE, debug = TRUE)
> computing the log-likelihood of a discrete network.
* processing node A.
* processing node B.
* processing node C.
* processing node D.
* processing node E.
* processing node F.
'log Lik.'        NA,        NA,        NA,        NA,        NA,        NA, -3.937965, -6.162759, -3.383587, -5.004131, -2.677968, -6.442550, -2.833497, -6.246337, -4.861080, -6.332851, -3.254190, -3.329707, -5.455455, -3.383587 (df=41)

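If we compute the log-likelihood only for a subset of nodes, logLik() returns NA only for the observations that are not locally complete, that is, those in which one of the selected nodes or one of its parents is missing. For node B, whose only parent is A, these are the first observation (A is missing) and the second (B is missing).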

> logLik(bn, incomplete, by.sample = TRUE, nodes = "B")
'log Lik.'         NA,         NA, -1.0962199, -0.2349458, -0.1553504, -0.1553504, -0.2349458, -1.5097823, -0.1553504, -0.2349458, -0.1553504, -0.2349458, -0.2349458, -1.0962199, -0.1553504, -0.2349458, -0.2349458, -0.8098829, -0.1553504, -0.1553504 (df=41)
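
The notion of local completeness can be checked by hand. The following is a minimal sketch, assuming bnlearn is installed; the seed and the names local.vars and na.pattern are introduced here purely for illustration and are not part of bnlearn.

```r
# Sketch: recompute which observations are locally complete for node B
# and compare them with the NA pattern returned by logLik().
library(bnlearn)

set.seed(42)  # arbitrary seed, only to make the simulated data reproducible
dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
bn = bn.fit(dag, learning.test)
incomplete = rbn(bn, 20)
incomplete[1, "A"] = NA
incomplete[2, "B"] = NA

# An observation is locally complete for B if B and all its parents are observed.
local.vars = c("B", parents(bn, "B"))
locally.complete = complete.cases(incomplete[, local.vars])

# Per-observation log-likelihood restricted to node B.
ll = logLik(bn, incomplete, by.sample = TRUE, nodes = "B")
na.pattern = is.na(as.numeric(ll))

# Indices of the observations that are not locally complete, and of the NAs.
which(!locally.complete)
which(na.pattern)
```

Given the behaviour described above, both calls to which() should list the same observations, namely the two in which A or B was set to NA.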

If by.sample = FALSE (the default) and we are computing the log-likelihood of the whole sample, logLik() returns NA as soon as the data contain any missing value.

> logLik(bn, incomplete, by.sample = FALSE, debug = TRUE)
> computing the log-likelihood of a discrete network.
* incomplete data for node A, the log-likelihood is NA.
'log Lik.' NA (df=41)

We can prevent logLik() from returning NA when by.sample = FALSE by setting na.rm = TRUE. For each node, logLik() then averages the log-likelihood over the locally-complete observations and multiplies that average by the full sample size; the log-likelihood of the network is the sum of these per-node values.

> logLik(bn, incomplete, by.sample = FALSE, na.rm = TRUE, debug = TRUE)
> computing the log-likelihood of a discrete network.
* processing node A.
  > 19 locally-complete observations out of 20.
  > log-likelihood is -21.981553.
* processing node B.
  > 18 locally-complete observations out of 20.
  > log-likelihood is -8.049087.
* processing node C.
  > 19 locally-complete observations out of 20.
  > log-likelihood is -15.429792.
* processing node D.
  > 17 locally-complete observations out of 20.
  > log-likelihood is -13.754659.
* processing node E.
  > 17 locally-complete observations out of 20.
  > log-likelihood is -14.373432.
* processing node F.
  > 19 locally-complete observations out of 20.
  > log-likelihood is -13.859284.
'log Lik.' -87.44781 (df=41)
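
As a quick sanity check, the total reported above is the sum of the per-node values printed in the debug output. The short sketch below uses only base R; the values are copied from the output above and the vector name node.loglik is chosen here for illustration.

```r
# Per-node log-likelihoods copied from the debug output above.
node.loglik = c(A = -21.981553, B = -8.049087, C = -15.429792,
                D = -13.754659, E = -14.373432, F = -13.859284)

# The network log-likelihood is the sum over the nodes.
total = sum(node.loglik)
print(total)  # -87.44781, the value returned by logLik() above
```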

In practice, this means that, for a single node such as B, the following two computations give the same value:

> logLik(bn, incomplete, nodes = "B", na.rm = TRUE)
'log Lik.' -8.049087 (df=41)
> locally.complete = complete.cases(incomplete[, c("B", parents(bn, "B"))])
> ncomplete = sum(locally.complete)
> logLik(bn, incomplete[locally.complete, ], nodes = "B") / ncomplete * nrow(incomplete)
'log Lik.' -8.049087 (df=41)
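
The equality above can also be written as a formula. This is only a restatement of the computation shown in this section, in notation introduced here for illustration: n is the number of observations, C_i is the set of observations that are locally complete for node X_i, n_i is the size of C_i, and x_{ki} and \pi_{ki} are the values of X_i and of its parents in observation k.

```latex
\ell_{\mathrm{na.rm}}(X_i) = \frac{n}{n_i} \sum_{k \in C_i}
    \log P\left(X_i = x_{ki} \mid \Pi_{X_i} = \pi_{ki}\right),
\qquad
\ell_{\mathrm{na.rm}} = \sum_{i} \ell_{\mathrm{na.rm}}(X_i)
```

For node B in the example, n = 20 and n_i = 18, as reported in the debug output.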

Note that the log-likelihood can still be NA if any of the parameters of the Bayesian network involved in its computation is equal to NA: na.rm = TRUE only prevents the missing values in the data from propagating, not those in the parameters.

Last updated on Tue Aug 6 11:20:45 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).