Constraint-based structure learning from data with missing values

In their original formulation in the causal discovery literature, all the constraint-based algorithms implemented in bnlearn assume that the data are complete. However, they can easily be adapted to handle data with missing values. The general idea is as follows:

  • A conditional independence test typically involves only a small subset of the variables in the data. Most observations will be locally complete for those variables, and we can use them to compute the test statistic without losing much in terms of sample size (see the sketch after this list).
  • The scale of the test statistic may vary with the number of locally complete observations, but so does its null distribution (for instance, through its degrees of freedom). This keeps p-values comparable even when different tests are computed from different numbers of locally complete observations.
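
For instance, here is a minimal sketch of what this means in practice, built around bnlearn's ci.test() and base R's complete.cases(), on a copy of learning.test in which a few values of A are blanked out by hand (the variable names, the missingness pattern and the objects partial and rows are purely illustrative):

> # illustrative sketch: copy learning.test and blank out a few values of A
> partial = learning.test
> partial[sample(nrow(partial), 50), "A"] = NA
> # a test of B against E given C involves only those three columns, so the
> # missing values in A do not affect it at all ...
> ci.test("B", "E", "C", data = partial[, c("B", "E", "C")])
> # ... while a test that does involve A simply drops the few rows in which A
> # is missing; the degrees of freedom of the null distribution adjust to the
> # contingency table built from the remaining rows.
> rows = complete.cases(partial[, c("A", "B", "C")])
> ci.test("A", "B", "C", data = partial[rows, c("A", "B", "C")])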

This, of course, does not mean that using incomplete data has no effect on structure learning: if the number of locally complete observations is markedly smaller than the sample size, both the type-I and type-II error rates of the conditional independence test will increase. The accuracy of the network learned by the structure learning algorithm may suffer as a result.

From the user's point of view, bnlearn handles incomplete data sets transparently, as shown below. If the missing values are few and missing completely at random, we may very well learn the same network structure we would have learned from the complete data.

> dag.from.complete.data = pc.stable(learning.test)
> missing = matrix(FALSE, nrow(learning.test), ncol(learning.test))
> missing[sample(length(missing), 100)] = TRUE
> incomplete = learning.test
> incomplete[missing] = NA
> dag.from.incomplete.data = pc.stable(incomplete)
> all.equal(dag.from.complete.data, dag.from.incomplete.data)
[1] TRUE
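
This works largely because, with only 100 missing cells spread over 5000 observations and 6 variables, nearly every observation is still locally complete for any given test. We can check this directly (the choice of variables below is arbitrary):

> # observations that are locally complete for one particular test: with just
> # 100 missing cells, this count stays very close to the full 5000
> sum(complete.cases(incomplete[, c("A", "B", "E")]))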

As the number of missing values grows, the structure learning algorithm eventually becomes unable to recover the correct structure.

> missing[sample(length(missing), 20000)] = TRUE
> table(missing)
missing
FALSE  TRUE 
10000 20000
> incomplete[missing] = NA
> dag.from.incomplete.data = pc.stable(incomplete)
## Warning in check.data(x, allow.missing = TRUE): some observations in the data
## contain only missing values.
> all.equal(dag.from.complete.data, dag.from.incomplete.data)
[1] "Different arc sets"

Note that this approach only works when every variable is at least partially observed. A variable with no observed values behaves like a latent variable, and should be handled with the appropriate structure learning algorithms from the pcalg package.

Last updated on Mon Aug 5 02:44:51 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).