Structure learning benchmarks in Scutari, Graafland and Gutiérrez, International Journal of Approximate Reasoning (2019)
This is a short HOWTO describing the simulation setup in “Who Learns Better Bayesian Network Structures: Accuracy and Speed of Structure Learning Algorithms” by Scutari, Graafland and Gutiérrez (2019, IJAR), which is an extended version of “Who Learns Better Bayesian Network Structures: Constraint-Based, Score-Based or Hybrid Algorithms?” by the same authors (2018, PGM). The code below includes several incremental improvements over that used for the PGM 2018 paper.
The makefile
An abridged version of the makefile
that drives the simulation study is below.
SHELL := /bin/bash
REPS := 20
NETWORKS := $(sort $(patsubst %.rds,%.rds.done,$(wildcard networks/*.rds)))

.PHONY: setup plots clean learn-all

# perform the whole simulation.
learn-all: pc-bict ...

setup: $(NETWORKS)

# generating data.
networks/%.rds.done: networks/%.rds
	mkdir -p data results
	Rscript --vanilla setup.R $< $(REPS)
	touch $@

# learn the network using the BIC test and PC.
pc-bict: setup
	@for DATA in `ls -1 data`; \
	do \
	  TARGET="$$TARGET results/`sed -e s/rds/pc.bict.rds/ <<< $$DATA`"; \
	done; \
	$(MAKE) $$TARGET;

results/%.pc.bict.rds: data/%.rds
	Rscript pc.bict.R $< $@

...

# summarise.
summaries.rds: pc-bict ...
	Rscript tabulate.R

# plot.
plots: summaries.rds
	Rscript plot.R

# clean up.
clean:
	rm -rf data results networks/*.done
The targets perform the following actions:
learn-all
: runs all the structure learning algorithms.

networks/%.rds.done
: generates 20 data sets for each sample size, from each of the networks.

pc-bict
: runs the PC algorithm with the BIC test described in the paper. It dynamically generates the call to the target below (results/%.pc.bict.rds), populating it with the file names of the individual simulations (one for each combination of network, sample size and replicate).

summaries.rds
: extracts the statistics of interest from the RDS files containing the results of the individual simulations.

plots
: creates the figures in the paper from the files created by summaries.rds.

clean
: removes the results and the generated data.
There is one target like pc-bict (identical in all but the name and the R script it calls) for each of the algorithms in the simulation study. Calling $(MAKE) with a dynamically generated list of targets makes it possible to run individual simulations in parallel; and saving each simulation in a separate file makes it easy to interrupt and resume the study without losing too much work.
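The mapping from data files to result files that pc-bict builds with sed can be checked directly in R; below is a minimal sketch with hypothetical file names, where an anchored sub() call plays the role of sed's s/rds/pc.bict.rds/ substitution.

```r
# map data file names to the result files that make should build; the
# file names below are hypothetical examples of those found in data/.
data.files = c("ALARM[0.1].01.rds", "ALARM[0.1].02.rds")
# replace the trailing "rds" extension with "pc.bict.rds".
result.files = paste("results/", sub("rds$", "pc.bict.rds", data.files),
                     sep = "")
print(result.files)
# "results/ALARM[0.1].01.pc.bict.rds" "results/ALARM[0.1].02.pc.bict.rds"
```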
Generating the data
The setup.R script reads the RDS files containing the reference networks from the Bayesian network repository (stored in the networks directory) and generates 20 data sets for each sample size ceiling(np * nparams(bn)), with the ratio np taking the values 0.1, 0.2, 0.5, 1, 2 and 5.
> library(bnlearn, quietly = TRUE, warn.conflict = FALSE)
>
> rds.file = commandArgs()[7]
> reps = as.integer(commandArgs()[8])
> np.ratios = c(0.1, 0.2, 0.5, 1, 2, 5)
>
> bn = readRDS(rds.file)
> label = toupper(gsub("networks/(.+)\\.rds$", "\\1", rds.file))
>
> for (np in np.ratios) {
+   for (r in seq(reps)) {
+     id = sprintf("%02d", r)
+     npr = sprintf("%.1f", np)
+     filename = paste("data/", label, "[", npr, "]", ".", id, ".rds", sep = "")
+     if (file.exists(filename))
+       next
+     data = rbn(bn, ceiling(np * nparams(bn)))
+     attr(data, "network") = rds.file
+     attr(data, "label") = label
+     attr(data, "np") = np
+     saveRDS(data, file = filename)
+   }#FOR
+ }#FOR
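As a sanity check, the sample sizes implied by the np ratios can be computed by hand; for instance, for a network with 509 parameters (the size of the ALARM network):

```r
# sample sizes generated for a network with 509 parameters (e.g. ALARM).
np.ratios = c(0.1, 0.2, 0.5, 1, 2, 5)
sample.sizes = ceiling(np.ratios * 509)
print(sample.sizes)
# 51 102 255 509 1018 2545
```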
Performing structure learning
The scripts that implement structure learning all follow the same template; pc.bict.R is shown below as an example. In addition to saving the bn object for the learned network, they store the sample size ratio and the network label to make it easier to label the results later in tabulate.R.
> library(bnlearn, quietly = TRUE, warn.conflict = FALSE)
>
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
>
> if (bnlearn:::data.type(data) == "factor")
+   test = "bic-t"
> if (bnlearn:::data.type(data) == "continuous")
+   test = "bic-gt"
>
> net = pc.stable(data, test = test)
> net$learning$from = attr(data, "network")
> net$learning$label = attr(data, "label")
> net$learning$np = attr(data, "np")
>
> saveRDS(net, file = out.file)
The Greedy Equivalence Search (GES) is performed using the rcausal package
(github), which wraps the TETRAD Bayesian network suite. The
fges.bde.R
script is as follows; note that the number of model comparisons is measured separately by
instrumenting the TETRAD code.
> library(bnlearn, quietly = TRUE, warn.conflict = FALSE)
> library(rcausal)
>
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
>
> if (bnlearn:::data.type(data) == "factor") {
+   sink("/dev/null")
+   net = fges.discrete(data, structurePrior = 1, samplePrior = 1)
+   sink()
+   directed = do.call(rbind,
+     strsplit(grep("-->", net$edges, value = TRUE), " --> "))
+   undirected = do.call(rbind,
+     strsplit(grep("---", net$edges, value = TRUE), " --- "))
+   net = empty.graph(names(data))
+   arcs(net) = rbind(directed, undirected, undirected[, 2:1])
+   net$learning$from = attr(data, "network")
+   net$learning$label = attr(data, "label")
+   net$learning$np = attr(data, "np")
+   net$learning$algo = "fges"
+   net$learning$test = "bde"
+   saveRDS(net, file = out.file)
+ }#THEN
>
> if (bnlearn:::data.type(data) == "continuous")
+   saveRDS(NA, file = out.file)
Simulated annealing is implemented using the catnet package in sann.bic.R. After the
node ordering is learned with cnSearchSA() and cnFindBIC(), we refine the structure by running
tabu() from bnlearn with a blacklist that enforces that ordering. The number of model comparisons is again measured separately by instrumenting the
code in catnet.
> library(bnlearn, quietly = TRUE, warn.conflict = FALSE)
> library(catnet)
>
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
>
> if (bnlearn:::data.type(data) == "factor")
+   score = "bic"
> if (bnlearn:::data.type(data) == "continuous")
+   score = "bic-g"
>
> netlist = cnSearchSA(data = data, maxParentSet = 1,
+   tempCheckOrders = 10, maxIter = 10, numThreads = 10)
> net = cnFindBIC(netlist)
> arcs = bnlearn:::elist2arcs(cnEdges(net))
> net = empty.graph(names(data))
> arcs(net) = arcs
> net = tabu(data, score = score,
+   blacklist = ordering2blacklist(node.ordering(net)))
>
> net$learning$from = attr(data, "network")
> net$learning$label = attr(data, "label")
> net$learning$np = attr(data, "np")
> net$learning$algo = "sann"
> net$learning$test = score
>
> saveRDS(net, file = out.file)
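The blacklist passed to tabu() forbids all arcs that point against the node ordering found by simulated annealing, so the tabu search can only choose among structures consistent with that ordering. A small illustration with three hypothetical nodes (requires bnlearn):

```r
library(bnlearn)

# for the ordering A < B < C, all arcs pointing from later to earlier
# nodes (B -> A, C -> A, C -> B) end up in the blacklist.
bl = ordering2blacklist(c("A", "B", "C"))
print(bl)
```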
Collecting summary statistics
The script tabulate.R
creates a data frame with the statistics of interest: SHD from the true network
structure and number of model evaluations. It iterates over all the files with the simulation results stored in the
results
directory and appends the statistics of each individual simulation to the data frame, which
admittedly is not the most efficient way of going about it.
> library(bnlearn)
>
> shd = data.frame(
+   label = character(0),
+   np = character(0),
+   algo = character(0),
+   criterion = character(0),
+   shd = numeric(0)
+ )
>
> learned = list.files("results", full.names = TRUE)
>
> for (l in learned) {
+   bn = readRDS(l)
+   true = bn.net(readRDS(bn$learning$from))
+   shd = rbind(shd, data.frame(
+     label = bn$learning$label,
+     np = as.character(bn$learning$np),
+     algo = bn$learning$algo,
+     criterion = gsub("-.*", "", bn$learning$test),
+     shd = shd(bn, true)
+   ))
+ }#FOR
>
> saveRDS(shd, file = "summaries.rds")
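Growing the data frame with rbind() at every iteration copies it in full each time, which is what makes the loop above inefficient. A common alternative, sketched here with placeholder rows standing in for the values read from the files in results/, collects one row per file in a list and binds them all at once:

```r
# one data.frame row per simulation result; the values are placeholders
# for those extracted from the RDS files in results/.
extract.row = function(i)
  data.frame(label = "ALARM", np = "0.1", algo = "pc", criterion = "bic",
             shd = i)
rows = lapply(1:100, extract.row)
shd = do.call(rbind, rows)   # a single rbind() instead of one per file
print(nrow(shd))   # 100
```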
Generating the figures
The figures in the paper are generated from the tabulated statistics stored in summaries.rds using the lattice package.
> library(lattice)
> library(latticeExtra)
>
> summaries = readRDS("summaries.rds")
>
> # separate discrete and continuous data sets.
> discrete = !grepl("MAGIC", summaries$label)
> bic = grepl("bic", summaries$criterion)
> dsummaries = summaries[discrete & bic, ]
> gsummaries = summaries[!discrete & bic, ]
>
> p1 = xyplot(shd ~ np | label, group = droplevels(interaction(algo, criterion)),
+   data = dsummaries, auto.key = TRUE, as.table = TRUE,
+   scales = list(alternating = FALSE, y = list(relation = "free")),
+   panel = function(x, y, groups, ...) {
+     panel.xyplot(x = x, y = y, groups = groups, ..., pch = 19, alpha = 0.10)
+     panel.superpose(x = x, y = y, groups = groups,
+       panel.group = "panel.smoother", ..., span = 0.99)
+   })
>
> p2 = xyplot(shd ~ np | label, group = droplevels(interaction(algo, criterion)),
+   data = gsummaries, auto.key = TRUE, as.table = TRUE,
+   scales = list(alternating = FALSE, y = list(relation = "free")),
+   panel = function(x, y, groups, ...) {
+     panel.xyplot(x = x, y = y, groups = groups, ..., pch = 19, alpha = 0.10)
+     panel.superpose(x = x, y = y, groups = groups,
+       panel.group = "panel.smoother", ..., span = 0.99)
+   })
>
> pdf(file = "shd-discrete-bic.pdf", width = 9, height = 12)
> print(p1)
> dev.off()
>
> pdf(file = "shd-gaussian-bic.pdf", width = 9, height = 12)
> print(p2)
> dev.off()
Thu Aug 13 13:16:10 2020, with bnlearn 4.6-20200410 and R version 4.0.2 (2020-06-22).