Parallel structure learning benchmarking in Scutari, Journal of Statistical Software (2017)
This is a short HOWTO describing the simulation setup used to benchmark the performance and the accuracy of constraint-based structure learning in “Bayesian Network Constraint-Based Structure Learning Algorithms: Parallel and Optimised Implementations in the bnlearn R Package” by Scutari (Journal of Statistical Software, 2017).
All the code and data sets discussed below is attached to the paper as supplementary material, and only slightly edited here for convenience. Note that even though the paper has been published in 2017, it was originally submitted to the Journal of Statistical Software in 2014 and accepted in 2015, and bnlearn has seen many updates and bug fixes since then. (Including a complete re-implementation of all conditional independence tests and most of the code behind constraint-based algorithms in C.) Hence results (especially running times) may not be reproducible with current versions of R and bnlearn.
The simulations performed to generate the results the paper is based on can be divided in two groups:
- parallel computing benchmarks to measure the speed-ups that can be obtained by splitting constraint-based structure learning across multiple processors; and
- benchmarks of the effects symmetry correction applied via backtracking to make sure Markov blankets and neighbourhoods of various nodes learned in constraint-based algorithms are symmetric.
The former are implemented in a series of scripts in the benchmarks
directory:
cd benchmarks Rscript --vanilla link.gs20k.R Rscript --vanilla link.hiton20k.R Rscript --vanilla link.iamb20k.R Rscript --vanilla link.mmpc20k.R Rscript --vanilla link.opt.R Rscript --vanilla munin.gs20k.R Rscript --vanilla munin.hiton20k.R Rscript --vanilla munin.iamb20k.R Rscript --vanilla munin.mmpc20k.R Rscript --vanilla munin.opt.R Rscript --vanilla carcinoma.hiton.R Rscript --vanilla mice.hiton.R
The latter are implemented in a single script, symmetry.R
, that takes the
label of a constraint-based structure learning algorithm (algorithm
) and the
ratio between sample size and the number of parameters of the golden standard network
(p
).
cd symmetry for algorithm in gs inter.iamb mmpc si.hiton.pc do for p in 0.1 0.2 0.5 1.0 2.0 5.0 do Rscript --vanilla symmetry.R $algorithm $p done done
The golden standard networks used in the scripts are available from the
Bayesian network repository; the mice
and
carcinoma
are included in the supplementary material and are originally
from literature papers.
Parallel computing benchmarks
The R scripts in benchmarks
all have the same general structure, differing
only in the structure learning algorithm being called and the data. Below is the R script
for si.hiton.pc()
and the carcinoma
data set.
First, we load the data and remove highly correlated variables with dedup()
,
sourced from dedup.R
.
This preprocessing step is strongly recommended to improve the numeric stability of
structure learning: the conditional independence tests used by si.hiton.pc()
require the (Moore-Penrose generalized) inversion of the covariance matrices of the
variables involved, so we need those covariance matrices to be well behaved (or at least
not perfectly singular). It also helps in keeping running times bounded, because the
preprocessing reduces highly connected clusters of nodes whose structure takes an
exponential time to learn.
> library(bnlearn) > library(parallel) > > source("dedup.R") > > xzfd = xzfile("prepd-carcinoma.txt.xz") > data = read.table(xzfd, header = TRUE, colClasses = "numeric") > data = dedup(data[, -(1:2)], threshold = 0.95)
Then we set the tuning parameters of the simulation: the number of parallel processes
(nprocs
) and the number of runs for each number of parallel processes
(run
).
> nprocs = c(1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 18, 20) > run = 1:5 > elapsed = numeric(0)
Then, we start recording the running times with system.time()
. Running
with a single processor is a special case, since it does not require “real”
parallel computing nor the parallel package.
> for (r in run) { + # get the running time. + t = system.time(si.hiton.pc(data, optimized = FALSE, alpha = 0.01)) + # save the elapsed time. + elapsed = c(elapsed, t["elapsed"]) + # debugging output. + cat("NPROCS: ", 1, "TIME:", t["elapsed"], "\n") + }#FOR
Running with more than one processor is essentially the same; the only difference
is that we create a cluster of slave processes with makeCluster()
and
pass it as an argument to si.hiton.pc()
.
> # these are for parallel execution. > for (np in nprocs[-1]) { + # start the cluster with the correct number of slave processes. + cl = makeCluster(np, type = "SOCK") + for (r in run) { + # get the running time. + t = system.time(si.hiton.pc(data, cluster = cl, alpha = 0.01)) + # save the elapsed time. + elapsed = c(elapsed, t["elapsed"]) + # debugging output. + cat("NPROCS: ", np, "TIME:", t["elapsed"], "\n") + }#FOR + # stop the cluster. + stopCluster(cl) + }#FOR
Note that to obtain smooth results it is important to run all the slave processes on the same machine. Starting with one machine and then moving to two, three, etc. as the number of slave processes grows introduces bias in the timings because inter-process communications become slower when slave processes are spread over multiple machines.
Finally, we save the results in a data frame which we write to a file for later use.
> timings = data.frame( + PROCS = c(rep("OPT", length(run)), + rep(1, length(run)), rep(2, length(run)), rep(3, length(run)), + rep(4, length(run)), rep(6, length(run)), rep(8, length(run)), + rep(10, length(run)), rep(12, length(run)), rep(14, length(run)), + rep(16, length(run)), rep(18, length(run)), rep(20, length(run))), + RUN = rep(run, length(nprocs)), + SECS = round(elapsed, digits = 3) + ) > > write.table(timings, file = "carcinoma.hiton.timings.txt", row.names = FALSE)
Symmetry correction benchmarks
This second R scripts is different in that it takes the structure learning algorithm
and the number of samples to generate relative to the number of parameters of the
golden standard network as commandline arguments from Rscript
.
The golden standard networks are loaded from the current directory, where they are
assumed to be stored as *.rda
files.
> library(bnlearn) > > rda = system("ls -1 *.rda", intern = TRUE) > > options("stringsAsFactors" = FALSE) > > algo = get(commandArgs()[7]) > mult = as.numeric(commandArgs()[8])
First, we create a zero-rows data frame to hold the results of the simulations, which will be appended in batches later. This ensures that variable types are consistent. The meaning of the columns is as follows:
- algorithm: the label of the constraint-based algorithm (
gs
,inter.iamb
, etc.); - bn: the file the golden standard network has been read from
(
alarm.rda
, etc.); - sample: the size of the sample generated from the golden standard network and passed to the constraint-based algorithm;
- flip: the Hamming distance between the network learned from the generated
data, and the same data with the columns in the opposite order, without backtracking
and symmetry correction (
optimized = FALSE
); - flip2: the same, but with backtracking and symmetry correction
(
optimized = TRUE
);
> d = data.frame( + algorithm = character(0), + bn = character(0), + sample = numeric(0), + flip = numeric(0), + flip2 = numeric(0) + )
For each of 20 reps
, we then learn the skeletons of 4 networks with the
same algorithm but with/without backtracking and with/without reversing the number of the
variables. We use the 4 skeletons to compute flip
and flip2
and
we save them in the data frame.
> reps = 20 > > for (file in rda) { + load(file) + n = ceiling(min(nparams(bn) * mult, 2 * 10^5)) + print(n) + print(file) + for (i in seq(reps)) { + sim = rbn(bn, n = n) + skel.orig = skeleton(algo(sim, optimized = FALSE, alpha = 0.01)) + skel.back = skeleton(algo(sim, optimized = TRUE, alpha = 0.01)) + rskel.orig = skeleton(algo(sim[, rev(names(sim))], optimized = FALSE, alpha = 0.01)) + rskel.back = skeleton(algo(sim[, rev(names(sim))], optimized = TRUE, alpha = 0.01)) + d = rbind(d, list(algorithm = toupper(algo), + bn = file, sample = n, + flip = hamming(skel.orig, rskel.orig), + flip2 = hamming(skel.back, rskel.back) + )) + write.table(d, file = paste(toupper(algo), mult, "log", sep = "."), + row.names = FALSE) + }#FOR + }#FOR
Note that we save the data frame with the results at every iteration, so that if the R script crashes for some reason we can still recover all the results produced up to that time.
Tue Apr 25 11:38:44 2017
with bnlearn
4.2-20170416
and R version 3.0.2 (2013-09-25)
.