Bootstrap and cross-validation from data with missing values

Resampling does not require any modification to handle incomplete data: missing values are carried over together with observed values when the data points are resampled. This is how bn.boot(), boot.strength() and bn.cv() handle incomplete data.
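As a minimal sketch of this behaviour in base R (not bnlearn's own code; the data frame toy and the variable names are made up for illustration), resampling rows with replacement copies each row as it is, so the missing values follow the observations they belong to:

```r
# Toy data frame with missing values (hypothetical data, for illustration only).
set.seed(42)
toy <- data.frame(
  A = factor(c("a", "b", NA, "a", "b", "a")),
  B = factor(c("x", NA, "y", "y", "x", "x"))
)

# Nonparametric bootstrap: sample row indices with replacement.
idx <- sample(nrow(toy), size = nrow(toy), replace = TRUE)
boot_sample <- toy[idx, , drop = FALSE]

# Each resampled row is a copy of an original row, NAs included:
# the missingness pattern of the bootstrap sample is that of the selected rows.
stopifnot(identical(is.na(boot_sample$A), is.na(toy$A)[idx]))
stopifnot(identical(is.na(boot_sample$B), is.na(toy$B)[idx]))

print(boot_sample)
```

No special handling is needed at the resampling step; any treatment of the NAs is left to whatever is run on boot_sample afterwards.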

bn.boot(), boot.strength() and bn.cv() no longer check that the data are complete. The structure learning, parameter learning and inference methods that they call on the resampled data must either handle missing values themselves or generate an error.

The loss functions called by bn.cv() have been modified to propagate missing values correctly:

  • For log-likelihood loss functions ("logl", "logl-g", "logl-cg"): the node and all its parents must be observed for the node's log-likelihood to be well-defined. If any of them is NA, the log-likelihood is NA and is not used to compute the loss function. The average log-likelihood for each node, which is an empirical estimate of the node's entropy, is therefore computed only from the locally-complete observations, that is, those whose log-likelihood is not NA.
  • For all other losses: we use the same code as predict() to produce the value of the target node. If the prediction is NA, it is not used to compute the loss function.
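The two rules above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used by bn.cv(): the conditional probability table cpt, the test set, the predictions and the helper node_loglik() are all hypothetical.

```r
# Hypothetical conditional probability table P(B | A):
# rows index the values of B, columns index the values of its parent A.
cpt <- matrix(c(0.8, 0.2,
                0.3, 0.7),
              nrow = 2,
              dimnames = list(B = c("x", "y"), A = c("a", "b")))

# Hypothetical test set with missing values in both the node and its parent.
test <- data.frame(
  A = c("a", "b", NA, "a", "b"),
  B = c("x", NA, "y", "y", "y"),
  stringsAsFactors = FALSE
)

# Per-observation log-likelihood of node B: NA unless B and its parent A
# are both observed (a "locally complete" observation).
node_loglik <- function(data, cpt) {
  ll <- rep(NA_real_, nrow(data))
  ok <- !is.na(data$A) & !is.na(data$B)
  ll[ok] <- log(cpt[cbind(data$B[ok], data$A[ok])])
  ll
}

ll <- node_loglik(test, cpt)

# Average log-likelihood over the locally complete observations only.
avg_loglik <- mean(ll, na.rm = TRUE)

# Other losses: predictions that are NA are dropped before computing the loss.
# Both vectors below are made up for illustration.
observed  <- c("x", "y", "y", "y", "y")
predicted <- c("x", NA,  "y", "x", "y")
keep <- !is.na(predicted)
class_err <- mean(predicted[keep] != observed[keep])

print(ll)
print(avg_loglik)
print(class_err)
```

In this sketch the second and third observations contribute nothing to the average log-likelihood, and the second prediction contributes nothing to the classification error, so each loss is averaged over fewer observations than the test set contains.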