## Structure learning benchmarks in Scutari, Graafland and Gutiérrez, International Journal of Approximate Reasoning (2019)

This is a short HOWTO describing the simulation setup in “Who Learns Better Bayesian Network Structures: Accuracy and Speed of Structure Learning Algorithms” by Scutari, Graafland and Gutiérrez (2019, IJAR), which is an extended version of “Who Learns Better Bayesian Network Structures: Constraint-Based, Score-Based or Hybrid Algorithms?” by the same authors (2018, PGM). The code below includes several incremental improvements over that used for the 2018 PGM paper.

### The `makefile`

An abridged version of the `makefile` that drives the simulation study is below.

```make
SHELL := /bin/bash
REPS := 20
NETWORKS := $(sort $(patsubst %.rds,%.rds.done,$(wildcard networks/*.rds)))

.PHONY: setup plots clean learn-all

# perform the whole simulation.
learn-all: pc-bict ...

setup: $(NETWORKS)

# generating data.
networks/%.rds.done: networks/%.rds
	mkdir -p data results
	Rscript --vanilla setup.R $< $(REPS)
	touch $@

# learn the network using the BIC test and PC.
pc-bict: setup
	@for DATA in `ls -1 data`; \
	do \
	  TARGET="$$TARGET results/`sed -e s/rds/pc.bict.rds/ <<< $$DATA`"; \
	done; \
	$(MAKE) $$TARGET;

results/%.pc.bict.rds: data/%.rds
	Rscript pc.bict.R $< $@

...

# summarise.
summaries.rds: pc-bict ...
	Rscript tabulate.R

# plot.
plots: summaries.rds
	Rscript plot.R

# clean up.
clean:
	rm -rf data results networks/*.done
```

The targets perform the following actions:

`learn-all`
: runs all the structure learning algorithms.

`networks/%.rds.done`
: generates 20 data sets for each sample size, from each of the networks.

`pc-bict`
: runs the PC algorithm with the BIC test described in the paper. It dynamically generates the call to the target in the following line (`results/%.pc.bict.rds`), populating it with the file names of the individual simulations (one for each combination of network, sample size and repetition).

`summaries.rds`
: extracts the statistics of interest from the RDS files containing the results of the individual simulations.

`plots`
: creates the figures in the paper from the files created by `summaries.rds`.

`clean`
: removes the results and the generated data.

There is one target like `pc-bict` (identical in all but the name and the R script it calls) for each of
the algorithms in the simulation study. Calling `$(MAKE)` with a dynamically generated argument set makes
it possible to run individual simulations in parallel; and saving each of them in a separate file makes it
easy to interrupt and resume the simulation without losing too much work.
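The file-name rewriting at the heart of the `pc-bict` target simply maps each data file to the matching result file. The document's scripts are in R and make, but the mapping itself can be sketched in a few lines of Python; the file names below are hypothetical examples, not taken from the repository:

```python
import re

def result_target(data_file: str) -> str:
    """Mimic the makefile's `sed -e s/rds/pc.bict.rds/` rewriting: replace the
    first occurrence of "rds" in the data file name and prefix the results
    directory, yielding one target for the recursive $(MAKE) call."""
    return "results/" + re.sub("rds", "pc.bict.rds", data_file, count=1)

# hypothetical data files, one per (network, sample size, repetition)
data_files = ["ALARM[0.5].01.rds", "ALARM[1.0].02.rds"]
print([result_target(f) for f in data_files])
# → ['results/ALARM[0.5].01.pc.bict.rds', 'results/ALARM[1.0].02.pc.bict.rds']
```

The resulting list of targets is what gets passed, in one batch, to the recursive `$(MAKE)` invocation, so `make -j` can build them in parallel.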

### Generating the data

The `setup.R` script reads the RDS files with the reference networks from the Bayesian network repository
(stored in the `networks` directory) and generates 20 data sets for each sample size
`ceiling(np * nparams(bn))`, with `np` ranging over 0.1, 0.2, 0.5, 1, 2 and 5.

```r
> library(bnlearn, quietly = TRUE, warn.conflict = FALSE)
>
> rds.file = commandArgs()[7]
> reps = as.integer(commandArgs()[8])
> np.ratios = c(0.1, 0.2, 0.5, 1, 2, 5)
>
> bn = readRDS(rds.file)
> label = toupper(gsub("networks/(.+)\\.rds$", "\\1", rds.file))
>
> for (np in np.ratios) {
+   for (r in seq(reps)) {
+     id = sprintf("%02d", r)
+     npr = sprintf("%.1f", np)
+     filename = paste("data/", label, "[", npr, "]", ".", id, ".rds", sep = "")
+     if (file.exists(filename))
+       next
+     data = rbn(bn, ceiling(np * nparams(bn)))
+     attr(data, "network") = rds.file
+     attr(data, "label") = label
+     attr(data, "np") = np
+     saveRDS(data, file = filename)
+   }#FOR
+ }#FOR
```
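To make the sample sizes concrete, consider a network with, say, 509 parameters (a hypothetical figure: the actual values come from `nparams(bn)` for each reference network). A short Python sketch of the same arithmetic as `ceiling(np * nparams(bn))`:

```python
import math

# hypothetical parameter count for one of the reference networks;
# in setup.R this value comes from nparams(bn).
nparams = 509
np_ratios = [0.1, 0.2, 0.5, 1, 2, 5]

# one sample size per ratio, as in ceiling(np * nparams(bn)).
sizes = [math.ceil(np * nparams) for np in np_ratios]
print(sizes)
# → [51, 102, 255, 509, 1018, 2545]
```

With 20 replicates per size, this hypothetical network alone would yield 120 data sets.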

### Performing structure learning

The scripts that implement structure learning all follow the same structure; `pc.bict.R` is shown below as
an example. In addition to saving the `bn` object for the network, they store the sample size and the network
label to make it easier to label the results later in `tabulate.R`.

```r
> library(bnlearn, quietly = TRUE, warn.conflict = FALSE)
>
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
>
> if (bnlearn:::data.type(data) == "factor")
+   test = "bic-t"
> if (bnlearn:::data.type(data) == "continuous")
+   test = "bic-gt"
>
> net = pc.stable(data, test = test)
> net$learning$from = attr(data, "network")
> net$learning$label = attr(data, "label")
> net$learning$np = attr(data, "np")
>
> saveRDS(net, file = out.file)
```

The Greedy Equivalent Search (GES) is performed using the **rcausal** package
(github), which wraps the TETRAD Bayesian network suite. The
`fges.bde.R` script is as follows; note that the number of model comparisons is measured separately by
instrumenting the TETRAD code.

```r
> library(bnlearn, quietly = TRUE, warn.conflict = FALSE)
> library(rcausal)
>
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
>
> if (bnlearn:::data.type(data) == "factor") {
+   sink("/dev/null")
+   net = fges.discrete(data, structurePrior = 1, samplePrior = 1)
+   sink()
+   directed = do.call(rbind, strsplit(grep("-->", net$edges, value = TRUE), " --> "))
+   undirected = do.call(rbind, strsplit(grep("---", net$edges, value = TRUE), " --- "))
+   net = empty.graph(names(data))
+   arcs(net) = rbind(directed, undirected, undirected[, 2:1])
+   net$learning$from = attr(data, "network")
+   net$learning$label = attr(data, "label")
+   net$learning$np = attr(data, "np")
+   net$learning$algo = "fges"
+   net$learning$test = "bde"
+   saveRDS(net, file = out.file)
+ }#THEN
>
> if (bnlearn:::data.type(data) == "continuous")
+   saveRDS(NA, file = out.file)
```
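The string handling in `fges.bde.R` may be easier to follow in isolation: TETRAD reports each edge as text such as `"A --> B"` (directed) or `"A --- B"` (undirected), and the script splits them into arcs, adding each undirected edge in both directions. A Python sketch of the same logic, with hypothetical edge strings:

```python
def split_edges(edges):
    """Split TETRAD-style edge strings into arc pairs, mirroring the
    grep()/strsplit() calls in fges.bde.R: directed edges become one arc,
    undirected edges become arcs in both directions."""
    directed = [tuple(e.split(" --> ")) for e in edges if " --> " in e]
    undirected = [tuple(e.split(" --- ")) for e in edges if " --- " in e]
    return directed + undirected + [(b, a) for (a, b) in undirected]

print(split_edges(["A --> B", "B --- C"]))
# → [('A', 'B'), ('B', 'C'), ('C', 'B')]
```

Representing an undirected edge as a pair of arcs in both directions is how **bnlearn** encodes CPDAGs, which is why the R script binds `undirected` and `undirected[, 2:1]` together before assigning to `arcs(net)`.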

Simulated annealing is implemented using the **catnet** package in `sann.bic.R`. After the
node ordering is learned with `cnSearchSA()` and `cnFindBIC()`, we confirm it by running
`tabu()` from **bnlearn**. The number of model comparisons is again handled separately by instrumenting the
code in **catnet**.

```r
> library(bnlearn, quietly = TRUE, warn.conflict = FALSE)
> library(catnet)
>
> data.file = commandArgs()[6]
> data = readRDS(data.file)
> out.file = commandArgs()[7]
>
> if (bnlearn:::data.type(data) == "factor")
+   score = "bic"
> if (bnlearn:::data.type(data) == "continuous")
+   score = "bic-g"
>
> netlist = cnSearchSA(data = data, maxParentSet = 1,
+            tempCheckOrders = 10, maxIter = 10, numThreads = 10)
> net = cnFindBIC(netlist)
> arcs = bnlearn:::elist2arcs(cnEdges(net))
> net = empty.graph(names(data))
> arcs(net) = arcs
> net = tabu(data, score = score,
+         blacklist = ordering2blacklist(node.ordering(net)))
>
> net$learning$from = attr(data, "network")
> net$learning$label = attr(data, "label")
> net$learning$np = attr(data, "np")
> net$learning$algo = "sann"
> net$learning$test = score
>
> saveRDS(net, file = out.file)
```
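The role of `ordering2blacklist()` above is to constrain `tabu()` to the node ordering found by simulated annealing: every arc that would point against the ordering is forbidden. A Python sketch of that construction (the node names are hypothetical):

```python
def ordering_blacklist(ordering):
    """Forbid every arc that points against a topological node ordering,
    analogous in spirit to bnlearn's ordering2blacklist()."""
    blacklist = []
    for i, later in enumerate(ordering):
        for earlier in ordering[:i]:
            # an arc from a later node to an earlier node would violate
            # the ordering, so it goes on the blacklist.
            blacklist.append((later, earlier))
    return blacklist

print(ordering_blacklist(["A", "B", "C"]))
# → [('B', 'A'), ('C', 'A'), ('C', 'B')]
```

For an ordering of n nodes this blacklists n(n-1)/2 arcs, which is what restricts the tabu search to networks consistent with the ordering learned by **catnet**.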

### Collecting summary statistics

The script `tabulate.R`

creates a data frame with the statistics of interest: SHD from the true network
structure and number of model evaluations. It iterates over all the files with the simulation results stored in the
`results`

directory and appends the statistics of each individual simulation to the data frame, which
admittedly is not the most efficient way of going about it.

```r
> library(bnlearn)
>
> shd = data.frame(
+   label = character(0),
+   np = character(0),
+   algo = character(0),
+   criterion = character(0),
+   shd = numeric(0)
+ )
>
> learned = list.files("results", full.names = TRUE)
>
> for (l in learned) {
+   bn = readRDS(l)
+   true = bn.net(readRDS(bn$learning$from))
+   shd = rbind(shd, data.frame(
+     label = bn$learning$label,
+     np = as.character(bn$learning$np),
+     algo = bn$learning$algo,
+     criterion = gsub("-.*", "", bn$learning$test),
+     shd = shd(bn, true)
+   ))
+ }#FOR
>
> saveRDS(shd, file = "summaries.rds")
```
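As a rough illustration of what the SHD measures, here is a simplified Python sketch: each missing or extra edge counts as one, and each edge present in both graphs but with a different orientation also counts as one. Note this is a deliberate simplification; **bnlearn**'s `shd()` compares the CPDAGs of the two networks, while this sketch compares arc sets directly.

```python
def simple_shd(arcs1, arcs2):
    """A simplified Structural Hamming Distance between two graphs given
    as lists of (from, to) arcs: missing/extra edges and differently
    oriented edges each contribute 1 to the distance."""
    a1, a2 = set(arcs1), set(arcs2)
    skel1 = {frozenset(a) for a in a1}
    skel2 = {frozenset(a) for a in a2}
    # edges present in only one of the two skeletons.
    missing = len(skel1 ^ skel2)
    # edges in both skeletons, but pointing in opposite directions.
    reversed_ = sum(1 for a in a1 if a not in a2 and (a[1], a[0]) in a2)
    return missing + reversed_

print(simple_shd([("A", "B"), ("B", "C")], [("A", "B"), ("C", "B")]))
# → 1
```

A distance of 0 means the two structures are identical, and larger values mean the learned network is further from the reference network.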

### Generating the figures

The figures in the paper are generated from the tabulated statistics stored in `summaries.rds` using
the **lattice** package.

```r
> library(lattice)
> library(latticeExtra)
>
> summaries = readRDS("summaries.rds")
>
> # separate discrete and continuous data sets.
> discrete = !grepl("MAGIC", summaries$label)
> bic = grepl("bic", summaries$criterion)
> dsummaries = summaries[discrete & bic, ]
> gsummaries = summaries[!discrete & bic, ]
>
> p1 = xyplot(shd ~ np | label, group = droplevels(interaction(algo, criterion)),
+        data = dsummaries, auto.key = TRUE, as.table = TRUE,
+        scales = list(alternating = FALSE, y = list(relation = "free")),
+        panel = function(x, y, groups, ...) {
+          panel.xyplot(x = x, y = y, groups = groups, ..., pch = 19, alpha = 0.10)
+          panel.superpose(x = x, y = y, groups = groups, panel.group = "panel.smoother",
+            ..., span = 0.99)
+        })
>
> p2 = xyplot(shd ~ np | label, group = droplevels(interaction(algo, criterion)),
+        data = gsummaries, auto.key = TRUE, as.table = TRUE,
+        scales = list(alternating = FALSE, y = list(relation = "free")),
+        panel = function(x, y, groups, ...) {
+          panel.xyplot(x = x, y = y, groups = groups, ..., pch = 19, alpha = 0.10)
+          panel.superpose(x = x, y = y, groups = groups, panel.group = "panel.smoother",
+            ..., span = 0.99)
+        })
>
> pdf(file = "shd-discrete-bic.pdf", width = 9, height = 12)
> print(p1)
> dev.off()
>
> pdf(file = "shd-gaussian-bic.pdf", width = 9, height = 12)
> print(p2)
> dev.off()
```

`Thu Aug 13 13:16:10 2020` with **bnlearn** `4.6-20200410` and `R version 4.0.2 (2020-06-22)`.