bn.cv {bnlearn}    R Documentation
Cross-validation for Bayesian networks
Description
Perform k-fold, custom-folds or hold-out cross-validation for a learning algorithm or a fixed network structure.
Usage
bn.cv(data, bn, loss = NULL, ..., algorithm.args = list(),
loss.args = list(), fit, fit.args = list(), method = "k-fold",
cluster, debug = FALSE)
## S3 method for class 'bn.kcv'
plot(x, ..., main, xlab, ylab, connect = FALSE)
## S3 method for class 'bn.kcv.list'
plot(x, ..., main, xlab, ylab, connect = FALSE)
loss(x)
Arguments
data: a data frame containing the variables in the model.

bn: either a character string (the label of the learning algorithm to be applied to the training data in each iteration) or an object of class bn (a fixed network structure).

loss: a character string, the label of a loss function. If none is specified, the default loss function is the Classification Error for Bayesian network classifiers and the Log-Likelihood Loss for both discrete and continuous data sets otherwise. See below for additional details.

algorithm.args: a list of extra arguments to be passed to the learning algorithm.

loss.args: a list of extra arguments to be passed to the loss function specified by loss.

fit: a character string, the label of the method used to fit the parameters of the network. See bn.fit() for details.

fit.args: additional arguments for the parameter estimation procedure, see again bn.fit() for details.

method: a character string, either k-fold, custom-folds or hold-out. See below for details.

cluster: an optional cluster object from package parallel.

debug: a boolean value. If TRUE a lot of debugging output is printed; otherwise the function is completely silent.

x: an object of class bn.kcv or bn.kcv.list.

...: additional objects of class bn.kcv or bn.kcv.list to plot alongside x.

main, xlab, ylab: the title of the plot, an array of labels for the boxplots, and the label for the y axis.

connect: a logical value. If TRUE, the expected loss values in the different boxplots are connected by a line.
Value
bn.cv() returns an object of class bn.kcv.list if runs is at least 2, or an object of class bn.kcv if runs is equal to 1.

loss() returns a numeric vector with a length equal to runs.
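A minimal sketch (using the gaussian.test data set and the hill-climbing algorithm, as in the Examples below) of how the return value depends on runs:

xval = bn.cv(gaussian.test, "hc", runs = 2)
class(xval)  # "bn.kcv.list", since runs is at least 2.
loss(xval)   # a numeric vector of length 2, one loss estimate per run.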
Cross-Validation Strategies
The following cross-validation methods are implemented:

- k-fold: the data are split in k subsets of equal size. For each subset in turn, bn is fitted (and possibly learned as well) on the other k - 1 subsets and the loss function is then computed using that subset. Loss estimates for each of the k subsets are then combined to give an overall loss for data.

- custom-folds: the data are manually partitioned by the user into subsets, which are then used as in k-fold cross-validation. Subsets are not constrained to have the same size, and every observation must be assigned to one subset.

- hold-out: k subsamples of size m are sampled independently without replacement from the data. For each subsample, bn is fitted (and possibly learned) on the remaining nrow(data) - m samples and the loss function is computed on the m observations in the subsample. The overall loss estimate is the average of the k loss estimates from the subsamples.

If cross-validation is used with multiple runs, the overall loss is the average of the loss estimates from the different runs.

To clarify, cross-validation methods accept the following optional arguments (illustrated in the sketch after this list):

- k: a positive integer number, the number of groups into which the data will be split (in k-fold cross-validation) or the number of times the data will be split in training and test samples (in hold-out cross-validation).

- m: a positive integer number, the size of the test set in hold-out cross-validation.

- runs: a positive integer number, the number of times k-fold or hold-out cross-validation will be run.

- folds: a list in which each element corresponds to one fold and contains the indices of the observations included in that fold; or a list with an element for each run, in which each element is itself a list of the folds to be used for that run.
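As a minimal sketch of how these arguments are passed (reusing the learning.test and gaussian.test data sets and the hill-climbing algorithm from the Examples below; the specific values of k, m, runs and folds are arbitrary):

# k-fold cross-validation with 5 folds, repeated 3 times.
bn.cv(learning.test, "hc", method = "k-fold", k = 5, runs = 3)
# hold-out cross-validation with 10 subsamples of 100 observations each.
bn.cv(gaussian.test, "hc", method = "hold-out", k = 10, m = 100)
# custom-folds cross-validation with 5 user-defined folds of 1000 observations.
folds = split(seq_len(nrow(learning.test)), rep(1:5, each = 1000))
bn.cv(learning.test, "hc", method = "custom-folds", folds = folds)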
Loss Functions
The following loss functions are implemented:

- Log-Likelihood Loss (logl): also known as negative entropy or negentropy, it is the negated expected log-likelihood of the test set for the Bayesian network fitted from the training set. Lower values are better.

- Classification Error (pred): the prediction error for a single discrete node. Lower values are better.

- Exact Classification Error (pred-exact): closed-form exact posterior predictions are available for Bayesian network classifiers. Lower values are better.

- Predictive Correlation (cor): the correlation between the observed and the predicted values for a single continuous node. Higher values are better.

- Mean Squared Error (mse): the mean squared error between the observed and the predicted values for a single continuous node. Lower values are better.

- F1 score (f1): the F1 score between observed and predicted values for both binary and multiclass target variables. Higher values are better.

- AUROC (auroc): the area under the ROC curve for both binary and multiclass target variables. The multiclass AUROC score is computed as one-vs-rest by averaging the AUROC for each level of the target variable. Higher values are better.
Optional arguments that can be specified in loss.args (illustrated in the sketch after this list) are:

- predict: a character string, the label of the method used to predict the observations in the test set. The default is "parents". Other possible values are the same as in predict().

- predict.args: a list containing the optional arguments for the prediction method. See the documentation for predict() for more details.

- target: a character string, the label of the target node for prediction in all loss functions but logl, logl-g and logl-cg.
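As a minimal sketch (reusing the gaussian.test data set from the Examples below; the choice of node F as the target is arbitrary), the loss function and its optional arguments fit together as follows:

# predictive correlation for node F, predicted by likelihood weighting
# from the rest of the network instead of just the parents of F.
bn.cv(gaussian.test, "hc", loss = "cor",
      loss.args = list(target = "F", predict = "bayes-lw"))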
Plotting Results from Cross-Validation
Both plot methods accept any combination of objects of class bn.kcv or bn.kcv.list (the first as the x argument, the remaining ones as the ... argument) and plot the respective expected loss values side by side. For a bn.kcv object, this means a single point; for a bn.kcv.list object, this means a boxplot.
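A minimal sketch, assuming cv1 and cv2 are objects of class bn.kcv.list returned by two earlier bn.cv() calls with multiple runs:

# plot the expected loss values of cv1 and cv2 side by side as boxplots.
plot(cv1, cv2, xlab = c("model 1", "model 2"), connect = TRUE)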
Author(s)
Marco Scutari
References
Koller D, Friedman N (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
See Also
Examples
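# k-fold cross-validation of a hill-climbing network, scored by the
# prediction error for node F.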
bn.cv(learning.test, 'hc', loss = "pred",
loss.args = list(predict = "bayes-lw", target = "F"))
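# cross-validation with user-defined folds.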
folds = list(1:2000, 2001:3000, 3001:5000)
bn.cv(learning.test, 'hc', loss = "logl", method = "custom-folds",
folds = folds)
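# hold-out cross-validation with multiple runs.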
xval = bn.cv(gaussian.test, 'mmhc', method = "hold-out",
k = 5, m = 50, runs = 2)
xval
loss(xval)
## Not run:
# comparing algorithms with multiple runs of cross-validation.
gaussian.subset = gaussian.test[1:50, ]
cv.gs = bn.cv(gaussian.subset, 'gs', runs = 10)
cv.iamb = bn.cv(gaussian.subset, 'iamb', runs = 10)
cv.inter = bn.cv(gaussian.subset, 'inter.iamb', runs = 10)
plot(cv.gs, cv.iamb, cv.inter,
xlab = c("Grow-Shrink", "IAMB", "Inter-IAMB"), connect = TRUE)
# use custom folds.
folds = split(sample(nrow(gaussian.subset)), seq(5))
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)
# multiple runs, with custom folds.
folds = replicate(5, split(sample(nrow(gaussian.subset)), seq(5)),
simplify = FALSE)
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)
## End(Not run)