Using custom tests in structure learning

A user-provided "custom-test" conditional independence test can be plugged into ci.test() and into structure learning algorithms, making it possible to experiment with new tests from the literature in the same way that "custom-score" does for network scores (see here). This flexibility is essential in applications to structured data that require tweaks to produce correctly-specified and performant models. The built-in tests are documented here.

From the documentation (here), the custom test has two arguments:

  • fun: a function with five arguments, x, y, z, data and args. x and y will be the labels of the variables being tested; z will be a (possibly empty) set of labels of the separating/conditioning nodes; data will contain the complete data set, with all the variables (a data frame); and args will be a list containing any additional arguments to the test.
  • args: a list containing the optional arguments to fun, for tuning the test function.

In practice, this leads to code like the following, mirroring that used for a custom score:

> independence.test = function(x, y, z, data, args) {

+   # compute a meaningful test statistic and its p-value.
+   return(c(statistic = 42, p.value = 0.05))

+ }#INDEPENDENCE.TEST
> 
> si.hiton.pc(learning.test, test = "custom-test", fun = independence.test,
+   args = list())

The function fun is required to return a numeric vector of length two containing the value of the test statistic (first element) and the associated p-value (second element). It has no other constraints and is, therefore, completely flexible.
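
For instance, this skeleton can be called directly from ci.test() to check that it is wired up correctly; the printed test simply reports the hard-coded statistic and p-value above (output not shown here):

> ci.test("A", "B", data = learning.test, test = "custom-test",
+   fun = independence.test, args = list())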

Replicating an existing test

The most basic usage of a custom test is to (re)implement a classical conditional independence test to study and troubleshoot it.

> my.fisher.z = function(x, y, z, data, args) {

+   require(corpcor)

+   # compute the partial correlation...
+   omega = cor2pcor(cor(data[, c(x, y, z)]))
+   # ... apply the Fisher's Z transform...
+   statistic = 0.5 * log((1 + omega[1, 2]) / (1 - omega[1, 2]))
+   # ... scale it to unit variance...
+   statistic = statistic * sqrt(nrow(data) - ncol(omega) - 1)
+   # ... and compute the (two-sided) p-value.
+   p.value = 2 * pnorm(abs(statistic), lower.tail = FALSE)

+   return(c(statistic = statistic, p.value = p.value))

+ }#MY.FISHER.Z
> 
> ci.test("A", "F", c("D", "E"), data = gaussian.test,
+   test = "custom-test", fun = my.fisher.z)
Loading required package: corpcor

	User-Provided Test Function

data:  A ~ F | D + E
custom-test = 42.534, p-value < 2.2e-16
> ci.test("A", "F", c("D", "E"), data = gaussian.test, test = "zf")

	Fisher's Z

data:  A ~ F | D + E
zf = 42.534, df = 4995, p-value < 2.2e-16
alternative hypothesis: true value is not equal to 0

As we can see, the test statistics and the p-values match. The same code works for unconditional tests, in which case z will be set to NULL before being passed to fun.

> ci.test("A", "F", data = gaussian.test, test = "custom-test", fun = my.fisher.z)

	User-Provided Test Function

data:  A ~ F  
custom-test = 22.392, p-value < 2.2e-16
> ci.test("A", "F", data = gaussian.test, test = "zf")

	Fisher's Z

data:  A ~ F  
zf = 22.392, df = 4997, p-value < 2.2e-16
alternative hypothesis: true value is not equal to 0

Instrumenting tests to debug them

It is very common for p-values to take values like NA or NaN if the underlying null distribution is singular or otherwise ill-defined. Being able to instrument tests to debug them and work around such corner cases is a life-saver when we need to make structure learning work! We can wrap ci.test() and browser() in a custom test to allow for interactive debugging, as follows.

> instrumented = function(x, y, z, data, args) {

+   test = ci.test(x, y, z, data = data, test = args$test)
+   statistic = test$statistic
+   p.value = test$p.value

+   # is.na() catches both NA and NaN values.
+   if (is.na(statistic) || is.na(p.value))
+     browser()

+   return(c(statistic = statistic, p.value = p.value))

+ }#INSTRUMENTED
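
We can then plug instrumented into any constraint-based learning algorithm, using args to select the built-in test to wrap; a usage sketch (the output, and the browser() prompt when it triggers, are not shown):

> pc.stable(gaussian.test, test = "custom-test", fun = instrumented,
+   args = list(test = "zf"))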

Extending existing conditional independence tests

As an example, let's try to add observation weights to the Fisher's Z test.

> weighted.fisher.z = function(x, y, z, data, args) {

+   require(corpcor)

+   # compute the weighted correlation matrix...
+   rho = cov2cor(cov.wt(data[, c(x, y, z)], wt = args$weights)$cov)
+   # ... compute the partial correlation...
+   omega = cor2pcor(rho)
+   # ... apply the Fisher's Z transform...
+   statistic = 0.5 * log((1 + omega[1, 2]) / (1 - omega[1, 2]))
+   # ... scale it to unit variance...
+   statistic = statistic * sqrt(nrow(data) - ncol(omega) - 1)
+   # ... and compute the (two-sided) p-value.
+   p.value = 2 * pnorm(abs(statistic), lower.tail = FALSE)

+   return(c(statistic = statistic, p.value = p.value))

+ }#WEIGHTED.FISHER.Z
> 
> ci.test("A", "F", c("D", "E"), data = gaussian.test,
+   test = "custom-test", fun = weighted.fisher.z,
+   args = list(weights = runif(nrow(gaussian.test))))

	User-Provided Test Function

data:  A ~ F | D + E
custom-test = 44.106, p-value < 2.2e-16
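
The weighted test plugs into structure learning in the same way, with the weights passed through args; a usage sketch (output not shown):

> pc.stable(gaussian.test, test = "custom-test", fun = weighted.fisher.z,
+   args = list(weights = runif(nrow(gaussian.test))))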

Implementing new conditional independence tests

Ordinal logistic regressions as local distributions

Conditional independence tests for discrete Bayesian networks typically assume that variables are categorical. However, ordinal variables arise naturally in many settings and imply a different notion of probabilistic dependence. We can model them using ordered logistic regressions as local distributions and test for conditional independence with a log-likelihood ratio (mutual information) test.

> polr.test = function(x, y, z, data, args) {

+   require(MASS)

+   if (is.null(z)) {

+     formula = paste(x, "~", y)
+     num = logLik(polr(formula, data = data))
+     formula = paste(x, "~ 1")
+     den = logLik(polr(formula, data = data))

+   }#THEN
+   else {

+     formula = paste(x, "~", paste(c(y, z), collapse = "+"))
+     num = logLik(polr(formula, data = data))
+     formula = paste(x, "~", paste(z, collapse = "+"))
+     den = logLik(polr(formula, data = data))

+   }#ELSE

+   # twice the log-likelihood ratio is asymptotically chi-squared.
+   statistic = 2 * as.numeric(num - den)
+   p.value = pchisq(statistic,
+               df = attr(num, "df") - attr(den, "df"), lower.tail = FALSE)

+   return(c(statistic = statistic, p.value = p.value))

+ }#POLR.TEST
> 
> ordered.test = learning.test
> for (i in seq(ncol(ordered.test)))
+   ordered.test[, i] = as.ordered(ordered.test[, i])
> 
> pc.stable(ordered.test, test = "custom-test", fun = polr.test)
Loading required package: MASS

This polr.test can easily be adapted to any other local distribution for which we have a logLik() method.
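
For instance, a hypothetical poisson.lrt() for count variables (the name is ours, chosen to avoid masking stats::poisson.test()) can swap polr() for a Poisson regression fitted with glm(), which also has a logLik() method; a minimal sketch, assuming the variables in data are numeric counts:

> poisson.lrt = function(x, y, z, data, args) {

+   # fit the local distribution of x with and without y among the regressors.
+   if (is.null(z)) {

+     num = logLik(glm(paste(x, "~", y), family = poisson, data = data))
+     den = logLik(glm(paste(x, "~ 1"), family = poisson, data = data))

+   }#THEN
+   else {

+     num = logLik(glm(paste(x, "~", paste(c(y, z), collapse = "+")),
+             family = poisson, data = data))
+     den = logLik(glm(paste(x, "~", paste(z, collapse = "+")),
+             family = poisson, data = data))

+   }#ELSE

+   # the same log-likelihood ratio test as in polr.test.
+   statistic = 2 * as.numeric(num - den)
+   p.value = pchisq(statistic,
+               df = attr(num, "df") - attr(den, "df"), lower.tail = FALSE)

+   return(c(statistic = statistic, p.value = p.value))

+ }#POISSON.LRT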

The oracle test

Much like the oracle score, the oracle test supports a conditional independence statement if and only if that statement holds in the true network structure. It therefore acts as a conditional independence test that never commits type-I or type-II errors, producing neither false positive nor false negative inclusions.

We can implement it as follows using dsep() to query the true network structure.

> oracle.test = function(x, y, z, data, args) {

+   if (is.null(z))
+     holds = dsep(args$dag, x = x, y = y)
+   else
+     holds = dsep(args$dag, x = x, y = y, z = z)

+   if (holds)
+     return(c(statistic = NA, p.value = 1))
+   else
+     return(c(statistic = NA, p.value = 0))

+ }#ORACLE.TEST
> 
> true.dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
> pc.stable(gaussian.test, test = "custom-test", fun = oracle.test,
+     args = list(dag = true.dag))
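
Since the oracle test makes no errors, we expect the learned network to match the equivalence class of true.dag; a quick check (a sketch, output not shown):

> learned = pc.stable(gaussian.test, test = "custom-test", fun = oracle.test,
+     args = list(dag = true.dag))
> all.equal(cpdag(true.dag), learned)
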
Last updated on Mon Aug 5 02:41:00 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).