Bootstrap and cross-validation from data with missing values

Resampling requires no modification to handle incomplete data: missing values are carried over together with observed values when the data points are resampled. This is how bn.boot(), boot.strength() and bn.cv() handle incomplete data.
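To make "carried over" concrete, here is a minimal base-R sketch of one bootstrap draw from a data frame containing NAs. It is only an illustration of the idea, not bnlearn's internal code; the toy data frame and all variable names are made up for the example.

```r
# Toy data frame with a few missing values (made up for illustration).
set.seed(1)
toy = data.frame(
  x = c(1.2, NA, 3.1, 4.8, 5.0, NA),
  y = factor(c("a", "b", NA, "a", "b", "b"))
)

# One nonparametric bootstrap sample: draw row indices with replacement.
idx = sample(nrow(toy), replace = TRUE)
resampled = toy[idx, ]

# Every resampled row is a copy of an original row, so its pattern of
# missing values is exactly that of the row it was drawn from.
same.pattern = all(is.na(resampled) == is.na(toy)[idx, , drop = FALSE])
print(same.pattern)
```

No row is dropped or imputed at this stage; dealing with the NAs is left to whatever is run on the resampled data.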

bn.boot(), boot.strength() and bn.cv() no longer check that the data are complete. The structure learning, parameter learning and inference methods that they call on the resampled data must either be able to handle missing values or generate errors.
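As a rough sketch of what this looks like in practice, the code below runs boot.strength() on an incomplete copy of the learning.test data set shipped with bnlearn. The 5% missingness is introduced artificially for the example, and the choice of hc() with the "pnal" score is an assumption about one structure learning setup that accepts missing values; any other method that does should work in its place.

```r
library(bnlearn)

set.seed(42)
# Hypothetical incomplete data: blank out 5% of the values of each variable.
incomplete = learning.test
n.missing = round(0.05 * nrow(incomplete))
for (v in names(incomplete))
  incomplete[sample(nrow(incomplete), n.missing), v] = NA

# Assumption: hc() with the penalized node-average likelihood ("pnal")
# can learn from incomplete data, so each bootstrap sample is passed to
# it as is, NAs included. R is kept small to keep the example quick.
arcs = boot.strength(incomplete, R = 10, algorithm = "hc",
         algorithm.args = list(score = "pnal"))

# Arcs that appear in most of the bootstrapped networks.
print(arcs[arcs$strength > 0.8, ])
```

If the learning method passed as algorithm cannot handle missing values, the error it raises should surface from the boot.strength() call itself.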

The loss functions called by bn.cv() have been modified to propagate missing values correctly:

  • The log-likelihood loss function ("logl") is based on the node-average log-likelihood.
  • The predictive accuracy ("pred"), F1 score ("f1"), predictive correlation ("cor") and predictive mean square error ("mse") use whatever complete (observed, predicted) pairs are available. Whether non-NA predictions can be produced depends on the prediction method specified in the loss.args argument.
  • The area under the ROC curve ("auroc") will use whatever complete (observed, probability) pairs are available. Note that if the predictors are not complete, exact and posterior predictions will call impute() to produce values for both the target and the missing predictors. As a result, it will be impossible to compute the "auroc" loss because impute() does not produce prediction probabilities just for the target node. However, predicting from parents will produce NA prediction probabilities only for observations without complete parents.
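The following sketch shows how two of these losses might be requested on incomplete data. It assumes a fixed network structure (the one usually associated with learning.test), artificially introduced missing values, and that "predict" is accepted in loss.args to select prediction from the parents; the target node "E" is an arbitrary choice for the example.

```r
library(bnlearn)

set.seed(42)
# Hypothetical incomplete data: blank out 5% of the values of each variable.
incomplete = learning.test
n.missing = round(0.05 * nrow(incomplete))
for (v in names(incomplete))
  incomplete[sample(nrow(incomplete), n.missing), v] = NA

# Fixed structure, so that only parameter learning and the loss functions
# have to deal with the missing values.
dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")

# Log-likelihood loss on each of the 5 folds.
cv.logl = bn.cv(incomplete, bn = dag, loss = "logl", k = 5)

# Classification error for E, predicted from its parents (B and F):
# observations with an incomplete (observed, predicted) pair are expected
# to be left out of the loss.
cv.pred = bn.cv(incomplete, bn = dag, loss = "pred", k = 5,
            loss.args = list(target = "E", predict = "parents"))

print(loss(cv.logl))
print(loss(cv.pred))
```

Switching predict to a method that relies on impute() would change which observations receive non-NA predictions, as described in the list above.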

Last updated on Mon Aug 5 03:09:14 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).