Preprocessing data with missing values

bnlearn provides two functions to carry out the most common preprocessing tasks in the Bayesian network literature: discretize() and dedup().

Discretizing data

The discretize() function (documented here) takes a data frame containing at least some continuous variables and returns a second data frame in which those continuous variables have been transformed into discrete variables. The end goal is to be able to use the returned data set to learn a discrete Bayesian network.
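
As a minimal sketch of that workflow, the snippet below discretizes the gaussian.test data set that ships with bnlearn and then learns a discrete network from the result. The choice of data set, of method = "interval" with breaks = 3, and of hc() as the structure learning algorithm are assumptions made for illustration; none of them comes from the text above.

```r
library(bnlearn)

# gaussian.test is a complete, all-continuous data set included in bnlearn.
data(gaussian.test)
# transform every continuous variable into a factor with 3 levels.
dgauss = discretize(gaussian.test, method = "interval", breaks = 3)
# check that all the variables are now discrete.
all(sapply(dgauss, is.factor))
# learn the structure of a discrete Bayesian network (hill climbing is an
# arbitrary choice made for this sketch).
dag = hc(dgauss)
```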

Any variable in the input data frame may contain missing values, regardless of the method used to discretize it. In particular:

  1. For marginal discretization methods ("interval" and "quantile"), the missing values in each variable are ignored when computing the boundaries of the intervals that will be used as levels. Furthermore, they will still appear as NAs in the discretized data. So:
    > library(bnlearn)
    > complete = data.frame(A = rnorm(10))
    > discretize(complete, method = "interval", breaks = 4)
    
                           A
    1     (0.726863,1.37264]
    2  [-0.564698,0.0810823]
    3   (0.0810823,0.726863]
    4   (0.0810823,0.726863]
    5   (0.0810823,0.726863]
    6  [-0.564698,0.0810823]
    7      (1.37264,2.01842]
    8  [-0.564698,0.0810823]
    9      (1.37264,2.01842]
    10 [-0.564698,0.0810823]
    
    produces the same intervals (and thus a factor with the same levels) as:
    > incomplete = data.frame(A = c(complete$A, rep(NA, 3)))
    > discretize(incomplete, method = "interval", breaks = 4)
    
                           A
    1     (0.726863,1.37264]
    2  [-0.564698,0.0810823]
    3   (0.0810823,0.726863]
    4   (0.0810823,0.726863]
    5   (0.0810823,0.726863]
    6  [-0.564698,0.0810823]
    7      (1.37264,2.01842]
    8  [-0.564698,0.0810823]
    9      (1.37264,2.01842]
    10 [-0.564698,0.0810823]
    11                  <NA>
    12                  <NA>
    13                  <NA>
    
  2. For joint discretization methods ("hartemink"), variables are considered in pairs. Observations that are not complete for each pair are ignored when computing the boundaries of the intervals that will be used as levels. Missing values are preserved as NAs in the discretized data. So:
    > complete = data.frame(A = rnorm(10), B = rnorm(10))
    > discretize(complete, method = "interval", breaks = 4)
    
                          A                     B
    1     (1.05087,2.28665] (-0.862183,0.0569425]
    2     (1.05087,2.28665]  [-1.78131,-0.862183]
    3  (-1.42068,-0.184905] (-0.862183,0.0569425]
    4  (-1.42068,-0.184905]    (0.976068,1.89519]
    5   (-0.184905,1.05087]    (0.976068,1.89519]
    6   (-0.184905,1.05087] (-0.862183,0.0569425]
    7  (-1.42068,-0.184905] (-0.862183,0.0569425]
    8   [-2.65646,-1.42068]  [-1.78131,-0.862183]
    9   [-2.65646,-1.42068]  (0.0569425,0.976068]
    10    (1.05087,2.28665] (-0.862183,0.0569425]
    
    produces the same intervals (and thus a factor with the same levels) as:
    > incomplete = data.frame(
    +   A = c(complete$A, rep(NA, 3)),
    +   B = c(complete$B, rnorm(3))
    + )
    > discretize(incomplete, method = "interval", breaks = 4)
    
                          A                     B
    1     (1.05087,2.28665] (-0.862183,0.0569425]
    2     (1.05087,2.28665]  [-1.78131,-0.862183]
    3  (-1.42068,-0.184905] (-0.862183,0.0569425]
    4  (-1.42068,-0.184905]    (0.976068,1.89519]
    5   (-0.184905,1.05087]    (0.976068,1.89519]
    6   (-0.184905,1.05087] (-0.862183,0.0569425]
    7  (-1.42068,-0.184905] (-0.862183,0.0569425]
    8   [-2.65646,-1.42068]  [-1.78131,-0.862183]
    9   [-2.65646,-1.42068]  (0.0569425,0.976068]
    10    (1.05087,2.28665] (-0.862183,0.0569425]
    11                 <NA>  (0.0569425,0.976068]
    12                 <NA>  (0.0569425,0.976068]
    13                 <NA>    (0.976068,1.89519]
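
The marginal case above can be mimicked in base R to see why the levels do not change when missing values are present. This is only a sketch of the idea, not the code that discretize() runs internally: the toy vector x, the use of seq() and cut(), and all the variable names are assumptions made for illustration.

```r
# a toy variable with two missing values.
x = c(0.5, -1.2, 2.0, NA, 0.1, NA)
breaks = 4

# compute equal-width boundaries from the observed values only.
bounds = seq(from = min(x, na.rm = TRUE), to = max(x, na.rm = TRUE),
             length.out = breaks + 1)
# cut() maps each observed value to its interval and leaves NAs as NAs.
discretized = cut(x, breaks = bounds, include.lowest = TRUE)

# the same discretization after dropping the missing values by hand.
observed = x[!is.na(x)]
reference.bounds = seq(from = min(observed), to = max(observed),
                       length.out = breaks + 1)
reference = cut(observed, breaks = reference.bounds, include.lowest = TRUE)

# both factors have the same levels, and the NAs stay where they were.
identical(levels(discretized), levels(reference))
which(is.na(discretized))
```

Computing the boundaries with na.rm = TRUE is what keeps the levels identical: the missing values contribute nothing to the range of the variable, and cut() simply propagates them.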
    

Removing highly-correlated variables

The dedup() function (documented here) takes a data frame containing (only) continuous variables and looks for pairs of variables with strong correlation, regardless of the sign. It then removes one of the variables in each such pair. The end goal is to avoid learning Gaussian Bayesian networks with clusters of highly-connected nodes, for both speed and interpretability.

Observations that are not complete for a pair of variables are ignored when computing the absolute correlation between those two variables. Missing values in the variables that are retained are preserved. So:

> observations = rnorm(10)
> complete = data.frame(
+   A = observations,
+   B = 10 * observations + rnorm(10)
+ )
> dedup(complete, debug = TRUE)
* caching means and variances.
* looking at A with 1 variables still to check.
A is collinear with B, dropping B with COR = 0.9950
             A
1  -0.60892638
2   0.50495512
3  -1.71700868
4  -0.78445901
5  -0.85090759
6  -2.41420765
7   0.03612261
8   0.20599860
9  -0.36105730
10  0.75816324

will give the same output (modulo the missing values) as:

> incomplete = data.frame(
+   A = c(complete$A, rep(NA, 3)),
+   B = c(complete$B, rnorm(3))
+ )
> dedup(incomplete, debug = TRUE)
* caching means and variances.
* looking at A with 1 variables still to check.
A is collinear with B, dropping B with COR = 0.9950
             A
1  -0.60892638
2   0.50495512
3  -1.71700868
4  -0.78445901
5  -0.85090759
6  -2.41420765
7   0.03612261
8   0.20599860
9  -0.36105730
10  0.75816324
11          NA
12          NA
13          NA
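
The pairwise handling of missing values described above can be illustrated with cor() from base R. This is a sketch under assumptions, not the internals of dedup(): the seed, the use of use = "complete.obs" and the variable names are chosen for illustration only.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible.
observations = rnorm(10)
complete = data.frame(
  A = observations,
  B = 10 * observations + rnorm(10)
)
incomplete = data.frame(
  A = c(complete$A, rep(NA, 3)),
  B = c(complete$B, rnorm(3))
)

# absolute correlation on the complete data.
r.complete = abs(cor(complete$A, complete$B))
# absolute correlation using only the observations where both A and B are
# observed; the last three rows are dropped because A is missing.
r.incomplete = abs(cor(incomplete$A, incomplete$B, use = "complete.obs"))

# the two values agree, so the same variable would be flagged in both cases.
all.equal(r.complete, r.incomplete)
```
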
Last updated on Mon Aug 5 02:47:41 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).