Using custom tests in structure learning
The possibility of using a user-provided "custom-test" conditional independence test in ci.test() and in structure learning algorithms allows for experimenting with new tests appearing in the literature, similarly to "custom-score" (see here). This is essential in applications to structured data that require tweaks to produce correctly-specified and performant models. The built-in tests are documented here.
From the documentation (here), the custom test has two arguments:

fun: a function with five arguments, x, y, z, data and args. x and y will be the labels of the variables that will be tested; z will be a (possibly empty) set of labels of the separating/conditioning nodes; data will contain the complete data set, with all the variables (a data frame); and args will be a list containing any additional arguments to the test.

args: a list containing the optional arguments to fun, for tuning the test function.
In practice, this leads to code like the following, which mirrors that used for a custom score:
> independence.test = function(x, y, z, data, args) {
+
+   # compute a meaningful test statistic and its p-value.
+   return(c(statistic = 42, p.value = 0.05))
+
+ }#INDEPENDENCE.TEST
>
> si.hiton.pc(learning.test, test = "custom-test", fun = independence.test,
+     args = list())
The function fun is required to return a numeric vector of length two containing the value of the test statistic (first element) and the associated p-value (second element). It has no other constraints and is, therefore, completely flexible.
Replicating an existing test
The most basic usage of a custom test is to (re)implement a classical conditional independence test to study and troubleshoot it.
> my.fisher.z = function(x, y, z, data, args) {
+
+   require(corpcor)
+
+   # compute the partial correlation...
+   omega = cor2pcor(cor(data[, c(x, y, z)]))
+   # ... apply the Fisher's Z transform...
+   statistic = 0.5 * log((1 + omega[1, 2]) / (1 - omega[1, 2]))
+   # ... scale it to unit variance...
+   statistic = statistic * sqrt(nrow(data) - ncol(omega) - 1)
+   # ... and compute the (two-sided) p-value.
+   p.value = 2 * pnorm(abs(statistic), lower.tail = FALSE)
+
+   return(c(statistic = statistic, p.value = p.value))
+
+ }#MY.FISHER.Z
>
> ci.test("A", "F", c("D", "E"), data = gaussian.test,
+     test = "custom-test", fun = my.fisher.z)
User-Provided Test Function

data:  A ~ F | D + E
custom-test = 42.534, p-value < 2.2e-16
> ci.test("A", "F", c("D", "E"), data = gaussian.test, test = "zf")
Fisher's Z

data:  A ~ F | D + E
zf = 42.534, df = 4995, p-value < 2.2e-16
alternative hypothesis: true value is not equal to 0
As we can see, the test statistic and the p-values both match. The same code works for unconditional tests, in which case z will be set to NULL before being passed to fun.
> ci.test("A", "F", data = gaussian.test, test = "custom-test", fun = my.fisher.z)
User-Provided Test Function

data:  A ~ F
custom-test = 22.392, p-value < 2.2e-16
> ci.test("A", "F", data = gaussian.test, test = "zf")
Fisher's Z

data:  A ~ F
zf = 22.392, df = 4997, p-value < 2.2e-16
alternative hypothesis: true value is not equal to 0
Instrumenting tests to debug them
It is very common for p-values to take values like NA or NaN if the underlying null distribution is singular or otherwise ill-defined. Being able to instrument tests to debug them and work around such corner cases is a life-saver when we need to make structure learning work! We can wrap ci.test() and browser() in a custom test to allow for interactive debugging, as follows.
> instrumented = function(x, y, z, data, args) {
+
+   test = ci.test(x, y, z, data = data, test = args$test)
+   statistic = test$statistic
+   p.value = test$p.value
+
+   if (is.nan(statistic) || is.nan(p.value))
+     browser()
+
+   return(c(statistic = statistic, p.value = p.value))
+
+ }#INSTRUMENTED
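For instance, we might wrap the built-in "zf" test while learning from gaussian.test; the following invocation is a sketch (not from the original page) that passes the name of the wrapped test through args, matching the function above, and drops into the browser whenever a NaN is produced.

> pc.stable(gaussian.test, test = "custom-test", fun = instrumented,
+     args = list(test = "zf"))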
Extending existing conditional independence tests
As an example, let's try to add observation weights to the Fisher's Z test.
> weighted.fisher.z = function(x, y, z, data, args) {
+
+   require(corpcor)
+
+   # compute the weighted correlation matrix...
+   rho = cov2cor(cov.wt(data[, c(x, y, z)], wt = args$weights)$cov)
+   # ... compute the partial correlation...
+   omega = cor2pcor(rho)
+   # ... apply the Fisher's Z transform...
+   statistic = 0.5 * log((1 + omega[1, 2]) / (1 - omega[1, 2]))
+   # ... scale it to unit variance...
+   statistic = statistic * sqrt(nrow(data) - ncol(omega) - 1)
+   # ... and compute the (two-sided) p-value.
+   p.value = 2 * pnorm(abs(statistic), lower.tail = FALSE)
+
+   return(c(statistic = statistic, p.value = p.value))
+
+ }#WEIGHTED.FISHER.Z
>
> ci.test("A", "F", c("D", "E"), data = gaussian.test,
+     test = "custom-test", fun = weighted.fisher.z,
+     args = list(weights = runif(nrow(gaussian.test))))
User-Provided Test Function

data:  A ~ F | D + E
custom-test = 44.106, p-value < 2.2e-16
Implementing new conditional independence tests
Ordinal logistic regressions as local distributions
Conditional independence tests for discrete Bayesian networks typically assume that variables are categorical. However, ordinal variables arise naturally in many settings and have a very different definition of probabilistic dependence. We can model them using ordered logistic regressions as local distributions and test conditional independence with a log-likelihood ratio/mutual information test.
> polr.test = function(x, y, z, data, args) {
+
+   require(MASS)
+
+   if (is.null(z)) {
+
+     formula = paste(x, "~", y)
+     num = logLik(polr(formula, data = data))
+     formula = paste(x, "~ 1")
+     den = logLik(polr(formula, data = data))
+
+   }#THEN
+   else {
+
+     formula = paste(x, "~", paste(c(y, z), collapse = "+"))
+     num = logLik(polr(formula, data = data))
+     formula = paste(x, "~", paste(z, collapse = "+"))
+     den = logLik(polr(formula, data = data))
+
+   }#ELSE
+
+   # the log-likelihood ratio statistic is twice the difference between the
+   # two log-likelihoods, and is asymptotically chi-square distributed.
+   statistic = 2 * as.numeric(num - den)
+   p.value = pchisq(statistic,
+               df = attr(num, "df") - attr(den, "df"), lower.tail = FALSE)
+
+   return(c(statistic = statistic, p.value = p.value))
+
+ }#POLR.TEST
>
> ordered.test = learning.test
> for (i in seq(ncol(ordered.test)))
+   ordered.test[, i] = as.ordered(ordered.test[, i])
>
> pc.stable(ordered.test, test = "custom-test", fun = polr.test)
This polr.test can easily be adapted to any other local distribution for which we have a logLik() method.
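As a minimal sketch of such an adaptation, the same likelihood-ratio scheme works with Poisson regressions as local distributions, assuming all variables are counts. The glm.test name and the use of glm() below are illustrative choices for this sketch, not part of bnlearn.

> glm.test = function(x, y, z, data, args) {
+
+   if (is.null(z)) {
+
+     num = logLik(glm(as.formula(paste(x, "~", y)),
+             data = data, family = poisson))
+     den = logLik(glm(as.formula(paste(x, "~ 1")),
+             data = data, family = poisson))
+
+   }#THEN
+   else {
+
+     num = logLik(glm(as.formula(paste(x, "~", paste(c(y, z), collapse = "+"))),
+             data = data, family = poisson))
+     den = logLik(glm(as.formula(paste(x, "~", paste(z, collapse = "+"))),
+             data = data, family = poisson))
+
+   }#ELSE
+
+   # same log-likelihood ratio statistic and chi-square p-value as above.
+   statistic = 2 * as.numeric(num - den)
+   p.value = pchisq(statistic,
+               df = attr(num, "df") - attr(den, "df"), lower.tail = FALSE)
+
+   return(c(statistic = statistic, p.value = p.value))
+
+ }#GLM.TEST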
The oracle test
Much like the oracle score, the oracle test supports a conditional independence statement if and only if that statement holds in the true network structure. Therefore, it acts as a conditional independence test that never commits any type-I or type-II errors: it produces neither false positive nor false negative arc inclusions.
We can implement it as follows using dsep() to query the true network structure.
> oracle.test = function(x, y, z, data, args) {
+
+   if (is.null(z))
+     holds = dsep(args$dag, x = x, y = y)
+   else
+     holds = dsep(args$dag, x = x, y = y, z = z)
+
+   if (holds)
+     return(c(statistic = NA, p.value = 1))
+   else
+     return(c(statistic = NA, p.value = 0))
+
+ }#ORACLE.TEST
>
> true.dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
> pc.stable(gaussian.test, test = "custom-test", fun = oracle.test,
+     args = list(dag = true.dag))
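As a sanity check, we would expect the network learned with the oracle test to match the CPDAG of the true network; one possible way to verify this, using bnlearn's cpdag() and compare() (output not shown), is:

> learned = pc.stable(gaussian.test, test = "custom-test", fun = oracle.test,
+     args = list(dag = true.dag))
> compare(cpdag(true.dag), learned)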
Mon Aug 5 02:41:00 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).