## Using custom tests in structure learning

The possibility of using a user-provided `"custom-test"` conditional independence test in `ci.test()` and in structure learning algorithms allows for experimenting with new tests appearing in the literature, similarly to `"custom-score"` (see here). This is essential in applications to structured data that require tweaks to produce correctly-specified and performant models. Built-in tests are documented here.

From the documentation (here), the custom test has two arguments:

`fun`

: a function with five arguments, `x`, `y`, `z`, `data` and `args`. `x` and `y` will be the labels of the variables that will be tested; `z` will be a (possibly empty) set of labels of the separating/conditioning nodes; `data` will contain the complete data set, with all the variables (a data frame); and `args` will be a list containing any additional arguments to the test.

`args`

: a list containing the optional arguments to `fun`, for tuning the test function.

In practice, this leads to code like the following, analogous to that used for a custom score:

```
> independence.test = function(x, y, z, data, args) {
+   # compute a meaningful test statistic and its p-value.
+   return(c(statistic = 42, p.value = 0.05))
+ }#INDEPENDENCE.TEST
>
> si.hiton.pc(learning.test, test = "custom-test", fun = independence.test,
+   args = list())
```

The function `fun` is required to return a numeric vector of length two containing the value of the test statistic (first element) and the associated p-value (second element). It has no other constraints and is, therefore, completely flexible.
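
Since any function with the right signature and return value will do, this even covers tests whose null distribution has no closed form. As a minimal sketch, valid only for unconditional tests because it ignores `z` (both `perm.cor.test` and the `nperm` argument are illustrative names, not part of **bnlearn**), a Monte Carlo permutation test of the linear correlation could look like this:

```r
# a minimal sketch of a permutation test of linear correlation;
# perm.cor.test and nperm are illustrative names, not part of bnlearn.
perm.cor.test = function(x, y, z, data, args) {

  # observed correlation between the two tested variables.
  observed = cor(data[, x], data[, y])
  # simulate the null distribution by permuting one of the variables.
  null = replicate(args$nperm, cor(data[, x], sample(data[, y])))
  # two-sided Monte Carlo p-value.
  p.value = mean(abs(null) >= abs(observed))

  return(c(statistic = observed, p.value = p.value))

}#PERM.COR.TEST

ci.test("A", "F", data = gaussian.test, test = "custom-test",
        fun = perm.cor.test, args = list(nperm = 1000))
```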

### Replicating an existing test

The most basic usage of a custom test is to (re)implement a classical conditional independence test to study and troubleshoot it.

```
> my.fisher.z = function(x, y, z, data, args) {
+
+   require(corpcor)
+
+   # compute the partial correlation...
+   omega = cor2pcor(cor(data[, c(x, y, z)]))
+   # ... apply the Fisher's Z transform...
+   statistic = 0.5 * log((1 + omega[1, 2]) / (1 - omega[1, 2]))
+   # ... scale it to unit variance...
+   statistic = statistic * sqrt(nrow(data) - ncol(omega) - 1)
+   # ... and compute the (two-sided) p-value.
+   p.value = 2 * pnorm(abs(statistic), lower.tail = FALSE)
+
+   return(c(statistic = statistic, p.value = p.value))
+
+ }#MY.FISHER.Z
>
> ci.test("A", "F", c("D", "E"), data = gaussian.test,
+   test = "custom-test", fun = my.fisher.z)
```

```
	User-Provided Test Function

data:  A ~ F | D + E
custom-test = 42.534, p-value < 2.2e-16
```

```
> ci.test("A", "F", c("D", "E"), data = gaussian.test, test = "zf")
```

```
	Fisher's Z

data:  A ~ F | D + E
zf = 42.534, df = 4995, p-value < 2.2e-16
alternative hypothesis: true value is not equal to 0
```

As we can see, the test statistic and the p-value both match. The same code works for unconditional tests, in which case `z` will be set to `NULL` before being passed to `fun`.

```
> ci.test("A", "F", data = gaussian.test, test = "custom-test", fun = my.fisher.z)
```

```
	User-Provided Test Function

data:  A ~ F
custom-test = 22.392, p-value < 2.2e-16
```

```
> ci.test("A", "F", data = gaussian.test, test = "zf")
```

```
	Fisher's Z

data:  A ~ F
zf = 22.392, df = 4997, p-value < 2.2e-16
alternative hypothesis: true value is not equal to 0
```

### Instrumenting tests to debug them

It is very common for p-values to take values like `NA` or `NaN` if the underlying null distribution is singular or otherwise ill-defined. Being able to instrument tests to debug them and work around such corner cases is a life-saver when we need to make structure learning work! We can wrap `ci.test()` and `browser()` in a custom test to allow for interactive debugging, as follows.

```
> instrumented = function(x, y, z, data, args) {
+
+   test = ci.test(x, y, z, data = data, test = args$test)
+   statistic = test$statistic
+   p.value = test$p.value
+
+   if (is.nan(statistic) || is.nan(p.value))
+     browser()
+
+   return(c(statistic = statistic, p.value = p.value))
+
+ }#INSTRUMENTED
```
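
The wrapper is generic: the label of the test to instrument is passed through `args`. As a usage sketch (assuming we want to instrument the Fisher's Z test on the `gaussian.test` data), we can then run structure learning as usual and drop into the debugger whenever a test misbehaves:

```r
# a usage sketch: instrument the Fisher's Z test during structure learning.
pc.stable(gaussian.test, test = "custom-test", fun = instrumented,
          args = list(test = "zf"))
```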

### Extending existing conditional independence tests

As an example, let's try to add observation weights to the Fisher's Z test.

```
> weighted.fisher.z = function(x, y, z, data, args) {
+
+   require(corpcor)
+
+   # compute the weighted correlation matrix...
+   rho = cov2cor(cov.wt(data[, c(x, y, z)], wt = args$weights)$cov)
+   # ... compute the partial correlation...
+   omega = cor2pcor(rho)
+   # ... apply the Fisher's Z transform...
+   statistic = 0.5 * log((1 + omega[1, 2]) / (1 - omega[1, 2]))
+   # ... scale it to unit variance...
+   statistic = statistic * sqrt(nrow(data) - ncol(omega) - 1)
+   # ... and compute the (two-sided) p-value.
+   p.value = 2 * pnorm(abs(statistic), lower.tail = FALSE)
+
+   return(c(statistic = statistic, p.value = p.value))
+
+ }#WEIGHTED.FISHER.Z
>
> ci.test("A", "F", c("D", "E"), data = gaussian.test,
+   test = "custom-test", fun = weighted.fisher.z,
+   args = list(weights = runif(nrow(gaussian.test))))
```

```
	User-Provided Test Function

data:  A ~ F | D + E
custom-test = 44.106, p-value < 2.2e-16
```

### Implementing new conditional independence tests

#### Ordinal logistic regressions as local distributions

Conditional independence tests for discrete Bayesian networks typically assume that variables are categorical. However, ordinal variables arise naturally in many settings and have a very different definition of probabilistic dependence. We can model them using ordered logistic regressions as local distributions and test conditional independence with a log-likelihood/mutual information test.

```
> polr.test = function(x, y, z, data, args) {
+
+   require(MASS)
+
+   if (is.null(z)) {
+
+     formula = paste(x, "~", y)
+     num = logLik(polr(formula, data = data))
+     formula = paste(x, "~ 1")
+     den = logLik(polr(formula, data = data))
+
+   }#THEN
+   else {
+
+     formula = paste(x, "~", paste(c(y, z), collapse = "+"))
+     num = logLik(polr(formula, data = data))
+     formula = paste(x, "~", paste(z, collapse = "+"))
+     den = logLik(polr(formula, data = data))
+
+   }#ELSE
+
+   # the likelihood ratio test statistic is twice the difference of the
+   # two log-likelihoods, asymptotically chi-squared under the null.
+   statistic = as.numeric(2 * (num - den))
+   p.value = pchisq(statistic,
+               df = attr(num, "df") - attr(den, "df"), lower.tail = FALSE)
+
+   return(c(statistic = statistic, p.value = p.value))
+
+ }#POLR.TEST
>
> ordered.test = learning.test
> for (i in seq(ncol(ordered.test)))
+   ordered.test[, i] = as.ordered(ordered.test[, i])
>
> pc.stable(ordered.test, test = "custom-test", fun = polr.test)
```

This `polr.test` can easily be adapted to any other local distribution for which we have a `logLik()` method.
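
For instance, a minimal sketch with linear regressions as local distributions (suitable for continuous data; `lm.test` is an illustrative name, not part of **bnlearn**) only requires swapping `polr()` for `lm()`:

```r
# a minimal sketch adapting polr.test to Gaussian local distributions;
# lm.test is an illustrative name, not part of bnlearn.
lm.test = function(x, y, z, data, args) {

  if (is.null(z)) {
    num = logLik(lm(paste(x, "~", y), data = data))
    den = logLik(lm(paste(x, "~ 1"), data = data))
  }#THEN
  else {
    num = logLik(lm(paste(x, "~", paste(c(y, z), collapse = "+")), data = data))
    den = logLik(lm(paste(x, "~", paste(z, collapse = "+")), data = data))
  }#ELSE

  # likelihood ratio test statistic and asymptotic chi-squared p-value.
  statistic = as.numeric(2 * (num - den))
  p.value = pchisq(statistic,
              df = attr(num, "df") - attr(den, "df"), lower.tail = FALSE)

  return(c(statistic = statistic, p.value = p.value))

}#LM.TEST
```

We could then call, say, `pc.stable(gaussian.test, test = "custom-test", fun = lm.test)` as before.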

#### The oracle test

Much like the oracle score, the *oracle test* supports a conditional independence statement if and only if that statement holds in the true network structure. Therefore, it acts as a conditional independence test that never commits any type-I or type-II errors, since it produces no false positive or false negative inclusions.

We can implement it as follows, using `dsep()` to query the true network structure.

```
> oracle.test = function(x, y, z, data, args) {
+
+   if (is.null(z))
+     holds = dsep(args$dag, x = x, y = y)
+   else
+     holds = dsep(args$dag, x = x, y = y, z = z)
+
+   if (holds)
+     return(c(statistic = NA, p.value = 1))
+   else
+     return(c(statistic = NA, p.value = 0))
+
+ }#ORACLE.TEST
>
> true.dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
> pc.stable(gaussian.test, test = "custom-test", fun = oracle.test,
+   args = list(dag = true.dag))
```

`Mon Aug 5 02:41:00 2024` with **bnlearn** `5.0` and `R version 4.4.1 (2024-06-14)`.