Preprocessing data with missing values
bnlearn provides two functions to carry out the most common preprocessing tasks in the Bayesian
network literature: discretize() and dedup().
Discretizing data
The discretize() function (documented here) takes a
data frame containing at least some continuous variables and returns a second data frame in which those continuous
variables have been transformed into discrete variables. The end goal is to be able to use the returned data set to
learn a discrete Bayesian network.
Any of the variables in the input data frame is allowed to contain missing values, regardless of the method
used to discretize them. In particular:
- For marginal discretization methods ("interval" and "quantile"), the missing values in each variable are ignored when computing the boundaries of the intervals that will be used as levels. Furthermore, they will still appear as NAs in the discretized data. So:

  > library(bnlearn)
  > complete = data.frame(A = rnorm(10))
  > discretize(complete, method = "interval", breaks = 4)
                         A
  1     (0.726863,1.37264]
  2  [-0.564698,0.0810823]
  3   (0.0810823,0.726863]
  4   (0.0810823,0.726863]
  5   (0.0810823,0.726863]
  6  [-0.564698,0.0810823]
  7      (1.37264,2.01842]
  8  [-0.564698,0.0810823]
  9      (1.37264,2.01842]
  10 [-0.564698,0.0810823]

  produces the same intervals (thus, a factor with the same levels) as:

  > incomplete = data.frame(A = c(complete$A, rep(NA, 3)))
  > discretize(incomplete, method = "interval", breaks = 4)
                         A
  1     (0.726863,1.37264]
  2  [-0.564698,0.0810823]
  3   (0.0810823,0.726863]
  4   (0.0810823,0.726863]
  5   (0.0810823,0.726863]
  6  [-0.564698,0.0810823]
  7      (1.37264,2.01842]
  8  [-0.564698,0.0810823]
  9      (1.37264,2.01842]
  10 [-0.564698,0.0810823]
  11                  <NA>
  12                  <NA>
  13                  <NA>

- For joint discretization methods ("hartemink"), variables are considered in pairs. Observations that are not complete for each pair are ignored when computing the boundaries of the intervals that will be used as levels. Missing values are preserved as NAs in the discretized data. So:

  > complete = data.frame(A = rnorm(10), B = rnorm(10))
  > discretize(complete, method = "interval", breaks = 4)
                        A                     B
  1     (1.05087,2.28665] (-0.862183,0.0569425]
  2     (1.05087,2.28665]  [-1.78131,-0.862183]
  3  (-1.42068,-0.184905] (-0.862183,0.0569425]
  4  (-1.42068,-0.184905]    (0.976068,1.89519]
  5   (-0.184905,1.05087]    (0.976068,1.89519]
  6   (-0.184905,1.05087] (-0.862183,0.0569425]
  7  (-1.42068,-0.184905] (-0.862183,0.0569425]
  8   [-2.65646,-1.42068]  [-1.78131,-0.862183]
  9   [-2.65646,-1.42068]  (0.0569425,0.976068]
  10    (1.05087,2.28665] (-0.862183,0.0569425]

  produces the same intervals (thus, a factor with the same levels) as:

  > incomplete = data.frame(
  +   A = c(complete$A, rep(NA, 3)),
  +   B = c(complete$B, rnorm(3))
  + )
  > discretize(incomplete, method = "interval", breaks = 4)
                        A                     B
  1     (1.05087,2.28665] (-0.862183,0.0569425]
  2     (1.05087,2.28665]  [-1.78131,-0.862183]
  3  (-1.42068,-0.184905] (-0.862183,0.0569425]
  4  (-1.42068,-0.184905]    (0.976068,1.89519]
  5   (-0.184905,1.05087]    (0.976068,1.89519]
  6   (-0.184905,1.05087] (-0.862183,0.0569425]
  7  (-1.42068,-0.184905] (-0.862183,0.0569425]
  8   [-2.65646,-1.42068]  [-1.78131,-0.862183]
  9   [-2.65646,-1.42068]  (0.0569425,0.976068]
  10    (1.05087,2.28665] (-0.862183,0.0569425]
  11                 <NA>  (0.0569425,0.976068]
  12                 <NA>  (0.0569425,0.976068]
  13                 <NA>    (0.976068,1.89519]

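The marginal case can be illustrated without bnlearn. The following is only a minimal base-R sketch of the idea, not bnlearn's implementation: the helpers interval_discretize() and quantile_discretize() are hypothetical names introduced here, and they assume that ignoring missing values amounts to computing the boundaries with na.rm = TRUE before calling cut().

```r
# Hypothetical helpers (not part of bnlearn): compute the interval boundaries
# from the observed values only, then let cut() propagate NAs unchanged.
interval_discretize = function(x, breaks) {
  # equal-width intervals between the observed minimum and maximum.
  bounds = seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE),
               length.out = breaks + 1)
  cut(x, breaks = bounds, include.lowest = TRUE)
}

quantile_discretize = function(x, breaks) {
  # intervals delimited by empirical quantiles of the observed values.
  probs = seq(0, 1, length.out = breaks + 1)
  bounds = unname(quantile(x, probs = probs, na.rm = TRUE))
  cut(x, breaks = bounds, include.lowest = TRUE)
}

set.seed(1)
complete = rnorm(10)
incomplete = c(complete, rep(NA, 3))

int_complete = interval_discretize(complete, breaks = 4)
int_incomplete = interval_discretize(incomplete, breaks = 4)
qnt_complete = quantile_discretize(complete, breaks = 4)
qnt_incomplete = quantile_discretize(incomplete, breaks = 4)

# the levels do not depend on the appended missing values...
same_interval_levels = identical(levels(int_complete), levels(int_incomplete))
same_quantile_levels = identical(levels(qnt_complete), levels(qnt_incomplete))
# ... and the missing values are still missing after discretization.
nas_preserved = all(is.na(int_incomplete[11:13]), is.na(qnt_incomplete[11:13]))
```

Because the boundaries are computed only from the observed values, appending NAs cannot move them; this is the same property the discretize() examples above demonstrate.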
Removing highly-correlated variables
The dedup() function (documented here) takes a data
frame containing (only) continuous variables and looks for pairs of variables with strong correlation, regardless of the
sign. It then removes one variable from each such pair. The end goal is to avoid learning Gaussian Bayesian networks
with clusters of highly connected nodes, for both speed and interpretability.
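The general idea of removing one variable from each strongly correlated pair can be sketched in base R. This is an illustrative approximation under stated assumptions, not the algorithm dedup() uses: the helper drop_correlated() and its default threshold of 0.95 are hypothetical, and the sketch always keeps the variable that appears first in the data frame.

```r
# Hypothetical helper (not part of bnlearn): greedily drop the second variable
# of every pair whose absolute correlation exceeds `threshold`.
drop_correlated = function(data, threshold = 0.95) {
  rho = abs(cor(data))          # absolute correlations, the sign is ignored.
  keep = rep(TRUE, ncol(data))
  for (i in seq_len(ncol(data))) {
    if (!keep[i]) next          # already dropped, nothing to compare against.
    for (j in seq_len(ncol(data))) {
      if (j > i && keep[j] && rho[i, j] > threshold)
        keep[j] = FALSE
    }
  }
  data[, keep, drop = FALSE]
}

set.seed(1)
x = rnorm(100)
data = data.frame(A = x, B = 10 * x + rnorm(100), C = rnorm(100))
reduced = drop_correlated(data)
names(reduced)
```

Here B is almost a rescaled copy of A, so it is the one removed, while the unrelated C is retained.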
Observations that are not complete for a pair of variables are ignored when computing the absolute correlation between those two variables. Missing values in the variables that are retained are preserved. So:
> observations = rnorm(10)
> complete = data.frame(
+   A = observations,
+   B = 10 * observations + rnorm(10)
+ )
> dedup(complete, debug = TRUE)
* caching means and variances.
* looking at A with 1 variables still to check.
  A is collinear with B, dropping B with COR = 0.9950
             A
1  -0.60892638
2   0.50495512
3  -1.71700868
4  -0.78445901
5  -0.85090759
6  -2.41420765
7   0.03612261
8   0.20599860
9  -0.36105730
10  0.75816324
will give the same output (modulo the missing values) as:
> incomplete = data.frame(
+   A = c(complete$A, rep(NA, 3)),
+   B = c(complete$B, rnorm(3))
+ )
> dedup(incomplete, debug = TRUE)
* caching means and variances.
* looking at A with 1 variables still to check.
  A is collinear with B, dropping B with COR = 0.9950
             A
1  -0.60892638
2   0.50495512
3  -1.71700868
4  -0.78445901
5  -0.85090759
6  -2.41420765
7   0.03612261
8   0.20599860
9  -0.36105730
10  0.75816324
11          NA
12          NA
13          NA
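The reason both calls report the same correlation can be checked with base R alone. The sketch below assumes that ignoring incomplete observations for a pair is equivalent to calling cor() with use = "pairwise.complete.obs"; it does not reproduce how dedup() caches means and variances internally.

```r
set.seed(1)
observations = rnorm(10)
complete = data.frame(
  A = observations,
  B = 10 * observations + rnorm(10)
)
incomplete = data.frame(
  A = c(complete$A, rep(NA, 3)),
  B = c(complete$B, rnorm(3))
)

# correlation computed from the ten complete observations.
cor_complete = cor(complete$A, complete$B)
# rows 11 to 13 have a missing A, so they are dropped for this pair.
cor_incomplete = cor(incomplete$A, incomplete$B, use = "pairwise.complete.obs")

same_correlation = isTRUE(all.equal(cor_complete, cor_incomplete))
```

The three extra values of B have no counterpart in A, so they never enter the computation and the correlation is unchanged.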
Mon Aug 5 02:47:41 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).