bn.cv {bnlearn} R Documentation

Cross-validation for Bayesian networks

Description

Perform a k-fold or hold-out cross-validation for a learning algorithm or a fixed network structure.

Usage

bn.cv(data, bn, loss = NULL, ..., algorithm.args = list(),
  loss.args = list(), fit, fit.args = list(), method = "k-fold",
  cluster, debug = FALSE)

## S3 method for class 'bn.kcv'
plot(x, ..., main, xlab, ylab, connect = FALSE)
## S3 method for class 'bn.kcv.list'
plot(x, ..., main, xlab, ylab, connect = FALSE)

loss(x)

Arguments

data

a data frame containing the variables in the model.

bn

either a character string (the label of the learning algorithm to be applied to the training data in each iteration) or an object of class bn (a fixed network structure).

loss

a character string, the label of a loss function. If none is specified, the default is the Classification Error for Bayesian network classifiers and the Log-Likelihood Loss otherwise, for both discrete and continuous data sets. See below for additional details.

algorithm.args

a list of extra arguments to be passed to the learning algorithm.

loss.args

a list of extra arguments to be passed to the loss function specified by loss.

fit

a character string, the label of the method used to fit the parameters of the network. See bn.fit for details.

fit.args

additional arguments for the parameter estimation procedure, see again bn.fit for details.

method

a character string, either "k-fold", "custom-folds" or "hold-out". See below for details.

cluster

an optional cluster object from package parallel.

debug

a boolean value. If TRUE, a lot of debugging output is printed; otherwise, the function is completely silent.

x

an object of class bn.kcv or bn.kcv.list returned by bn.cv().

...

additional objects of class bn.kcv or bn.kcv.list to plot alongside the first.

main, xlab, ylab

the title of the plot, a character vector of labels for the boxplots, and the label for the y axis.

connect

a logical value. If TRUE, the median points in the boxplots will be connected by a segmented line.

Value

bn.cv() returns an object of class bn.kcv.list if runs is at least 2, an object of class bn.kcv if runs is equal to 1.

loss() returns a numeric vector with a length equal to runs.

Cross-Validation Strategies

The following cross-validation methods are implemented:

  • k-fold: the data are split into k subsets of equal size. For each subset in turn, bn is fitted (and possibly learned as well) on the other k - 1 subsets and the loss function is then computed using that subset. Loss estimates for each of the k subsets are then combined to give an overall loss for data.

  • custom-folds: the data are manually partitioned by the user into subsets, which are then used as in k-fold cross-validation. Subsets are not constrained to have the same size, and every observation must be assigned to one subset.

  • hold-out: k subsamples of size m are sampled independently without replacement from the data. For each subsample, bn is fitted (and possibly learned) on the remaining nrow(data) - m samples and the loss function is computed on the m observations in the subsample. The overall loss estimate is the average of the k loss estimates from the subsamples.

If cross-validation is used with multiple runs, the overall loss is the average of the loss estimates from the different runs.

Cross-validation methods accept the following optional arguments (illustrated in the sketch after this list):

  • k: a positive integer number, the number of groups into which the data will be split (in k-fold cross-validation) or the number of times the data will be split in training and test samples (in hold-out cross-validation).

  • m: a positive integer number, the size of the test set in hold-out cross-validation.

  • runs: a positive integer number, the number of times k-fold or hold-out cross-validation will be run.

  • folds: a list in which each element corresponds to one fold and contains the indices of the observations included in that fold; or a list with an element for each run, in which each element is itself a list of the folds to be used for that run.
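
For instance, both strategies can be invoked as follows (a minimal sketch using the learning.test data set shipped with bnlearn; the hill-climbing algorithm and the argument values are arbitrary choices made for illustration):

# 10-fold cross-validation, repeated 3 times.
bn.cv(learning.test, 'hc', method = "k-fold", k = 10, runs = 3)
# hold-out cross-validation: 5 independently sampled test sets of 100 observations.
bn.cv(learning.test, 'hc', method = "hold-out", k = 5, m = 100)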

Loss Functions

The following loss functions are implemented:

  • Log-Likelihood Loss (logl): also known as negative entropy or negentropy, it is the negated expected log-likelihood of the test set for the Bayesian network fitted from the training set. Lower values are better.

  • Classification Error (pred): the prediction error for a single discrete node. Lower values are better.

  • Exact Classification Error (pred-exact): closed-form exact posterior predictions are available for Bayesian network classifiers. Lower values are better.

  • Predictive Correlation (cor): the correlation between the observed and the predicted values for a single continuous node. Higher values are better.

  • Mean Squared Error (mse): the mean squared error between the observed and the predicted values for a single continuous node. Lower values are better.

  • F1 score (f1): the F1 score between observed and predicted values for both binary and multiclass target variables. Higher values are better.

  • AUROC (auroc): the area under the ROC curve for both binary and multiclass target variables. The multiclass AUROC score is computed as one-vs-rest by averaging the AUROC for each level of the target variable. Higher values are better.

Optional arguments that can be specified in loss.args (see the example after this list) are:

  • predict: a character string, the label of the method used to predict the observations in the test set. The default is "parents". Other possible values are the same as in predict().

  • predict.args: a list containing the optional arguments for the prediction method. See the documentation for predict() for more details.

  • target: a character string, the label of the target node for prediction in all loss functions except logl, logl-g and logl-cg.
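
For example (a hedged sketch using the gaussian.test data set shipped with bnlearn; the target node "F" and the "bayes-lw" prediction method are assumptions chosen for illustration):

# predictive correlation for node F, predicting with likelihood weighting.
bn.cv(gaussian.test, 'hc', loss = "cor",
  loss.args = list(predict = "bayes-lw", target = "F"))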

Plotting Results from Cross-Validation

Both plot methods accept any combination of objects of class bn.kcv or bn.kcv.list (the first as the x argument, the rest as the ... argument) and plot the respective expected loss values side by side. For a bn.kcv object, this means a single point; for a bn.kcv.list object, a boxplot.
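
For example (a minimal sketch; the two learning algorithms and the number of runs are arbitrary choices for illustration):

# compare two structure learning algorithms over 10 runs of cross-validation.
cv.hc = bn.cv(learning.test, 'hc', runs = 10)
cv.tabu = bn.cv(learning.test, 'tabu', runs = 10)
plot(cv.hc, cv.tabu, xlab = c("HC", "TABU"), connect = TRUE)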

Author(s)

Marco Scutari

References

Koller D, Friedman N (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

See Also

bn.boot, rbn, bn.kcv-class.

Examples

bn.cv(learning.test, 'hc', loss = "pred",
  loss.args = list(predict = "bayes-lw", target = "F"))

folds = list(1:2000, 2001:3000, 3001:5000)
bn.cv(learning.test, 'hc', loss = "logl", method = "custom-folds",
  folds = folds)

xval = bn.cv(gaussian.test, 'mmhc', method = "hold-out",
         k = 5, m = 50, runs = 2)
xval
loss(xval)

## Not run: 
# comparing algorithms with multiple runs of cross-validation.
gaussian.subset = gaussian.test[1:50, ]
cv.gs = bn.cv(gaussian.subset, 'gs', runs = 10)
cv.iamb = bn.cv(gaussian.subset, 'iamb', runs = 10)
cv.inter = bn.cv(gaussian.subset, 'inter.iamb', runs = 10)
plot(cv.gs, cv.iamb, cv.inter,
  xlab = c("Grow-Shrink", "IAMB", "Inter-IAMB"), connect = TRUE)

# use custom folds.
folds = split(sample(nrow(gaussian.subset)), seq(5))
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)

# multiple runs, with custom folds.
folds = replicate(5, split(sample(nrow(gaussian.subset)), seq(5)),
          simplify = FALSE)
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)

## End(Not run)
