Preprocessing data before structure learning

bnlearn provides two functions to carry out the most common preprocessing tasks in the Bayesian network literature: discretize() and dedup().

Discretizing data

The discretize() function (documented here) takes a data frame containing at least some continuous variables and returns a second data frame in which those continuous variables have been transformed into discrete variables. The end goal is to be able to use the returned data set to learn a discrete Bayesian network.

It supports:

  • Marginal discretization into equal-length intervals spanning the range of each individual variable (method = "interval").
  • Marginal discretization into equi-spaced quantiles from the empirical distribution of each individual variable (method = "quantile").
  • Joint discretization using the mutual information approach from Hartemink, which strives to preserve the pairwise dependence patterns between variables (method = "hartemink").

The breaks argument specifies the number of levels each variable will be discretized into, and the ordered argument determines whether the return value contains ordered or unordered factors.
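As a minimal sketch of the interval method, using the same gaussian.test data set as in the examples below, each variable's range is split into equal-length bins:

> # equal-width intervals: each variable's range is split into 4 bins of equal length
> intervals = discretize(gaussian.test, method = "interval", breaks = 4)
> sapply(intervals, nlevels)
A B C D E F G 
4 4 4 4 4 4 4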

> factors = discretize(gaussian.test, method = "quantile", breaks = 3)
> lapply(factors, levels)
$A
[1] "[-2.24652,0.555839]" "(0.555839,1.44378]"  "(1.44378,4.84739]"  

$B
[1] "[-10.0946,0.78477]" "(0.78477,3.35529]"  "(3.35529,14.2154]" 

$C
[1] "[-15.81,5.33039]"  "(5.33039,10.8336]" "(10.8336,32.4408]"

$D
[1] "[-9.0438,7.13546]" "(7.13546,11.0273]" "(11.0273,26.9773]"

$E
[1] "[-3.55877,2.62531]" "(2.62531,4.38026]"  "(4.38026,11.4944]" 

$F
[1] "[-1.17025,19.3574]" "(19.3574,24.7354]"  "(24.7354,45.8496]" 

$G
[1] "[-1.36582,4.20784]" "(4.20784,5.86491]"  "(5.86491,12.4096]"
> head(factors)
                    A                 B                 C                 D
1  (0.555839,1.44378] (0.78477,3.35529] (5.33039,10.8336] (7.13546,11.0273]
2 [-2.24652,0.555839] (3.35529,14.2154] (10.8336,32.4408] (11.0273,26.9773]
3   (1.44378,4.84739] (0.78477,3.35529] (10.8336,32.4408] (11.0273,26.9773]
4  (0.555839,1.44378] (3.35529,14.2154] (10.8336,32.4408] (11.0273,26.9773]
5 [-2.24652,0.555839] (3.35529,14.2154] (5.33039,10.8336] (11.0273,26.9773]
6   (1.44378,4.84739] (0.78477,3.35529] (5.33039,10.8336] [-9.0438,7.13546]
                   E                 F                  G
1 [-3.55877,2.62531] (19.3574,24.7354]  (5.86491,12.4096]
2  (4.38026,11.4944] (24.7354,45.8496] [-1.36582,4.20784]
3  (2.62531,4.38026] (19.3574,24.7354] [-1.36582,4.20784]
4 [-3.55877,2.62531] (19.3574,24.7354]  (5.86491,12.4096]
5  (2.62531,4.38026] (19.3574,24.7354]  (4.20784,5.86491]
6  (4.38026,11.4944] (24.7354,45.8496]  (5.86491,12.4096]
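By default the returned variables are unordered factors; a minimal sketch of requesting ordered factors via the ordered argument instead:

> ordfactors = discretize(gaussian.test, method = "quantile", breaks = 3, ordered = TRUE)
> is.ordered(ordfactors$A)
[1] TRUE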

The breaks argument can also be a vector specifying a different number of levels for each variable.

> factors = discretize(gaussian.test, method = "quantile", breaks = c(2, 4, 7, 4, 3, 3, 2))
> sapply(factors, nlevels)
A B C D E F G 
2 4 7 4 3 3 2

Variables that are already factors in the input data frame are left unchanged.
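For example (a hypothetical mixed data frame; the factor column H is invented for illustration), the pre-existing factor should come back untouched:

> mixed = data.frame(gaussian.test[, c("A", "B")],
+                    H = factor(sample(c("low", "high"), nrow(gaussian.test), replace = TRUE)))
> discretized = discretize(mixed, method = "quantile", breaks = 3)
> identical(discretized$H, mixed$H)
[1] TRUE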

Hartemink's discretization takes two additional arguments: idisc and ibreaks. These arguments specify the discretization method (either "quantile" or "interval") and the number of levels used in the initial discretization of the data. The resulting factors are then reduced to the number of levels specified by breaks by merging adjacent levels while minimising the loss of mutual information.

> factors = discretize(gaussian.test, method = "hartemink", breaks = 3, idisc = "quantile", ibreaks = 10)
> lapply(factors, levels)
$A
[1] "[-2.24652,0.983649]" "(0.983649,2.29878]"  "(2.29878,4.84739]"  

$B
[1] "[-10.0946,-0.567162]" "(-0.567162,2.81054]"  "(2.81054,14.2154]"   

$C
[1] "[-15.81,4.72992]"  "(4.72992,11.3814]" "(11.3814,32.4408]"

$D
[1] "[-9.0438,5.13117]" "(5.13117,10.2337]" "(10.2337,26.9773]"

$E
[1] "[-3.55877,2.98817]" "(2.98817,5.16809]"  "(5.16809,11.4944]" 

$F
[1] "[-1.17025,18.7884]" "(18.7884,23.5891]"  "(23.5891,45.8496]" 

$G
[1] "[-1.36582,4.03283]" "(4.03283,5.02842]"  "(5.02842,12.4096]"

The data frame returned by discretize() has a custom attribute called "cutpoints" containing the cutpoints computed internally to discretize each variable. These cutpoints can be passed as is to cut() to reproduce the discretization on different data (say, to discretize a validation set in the same way as the corresponding training set).

> cutpoints = attr(factors, "cutpoints")
> cutpoints
$A
[1] -2.2465197  0.9836491  2.2987842  4.8473877

$B
[1] -10.0945581  -0.5671621   2.8105389  14.2153797

$C
[1] -15.809961   4.729918  11.381408  32.440769

$D
[1] -9.043796  5.131174 10.233725 26.977326

$E
[1] -3.558768  2.988170  5.168090 11.494383

$F
[1] -1.170247 18.788360 23.589134 45.849594

$G
[1] -1.365823  4.032827  5.028420 12.409607
> rediscretized = cut(gaussian.test$A, breaks = cutpoints$A)
> levels(factors$A)
[1] "[-2.24652,0.983649]" "(0.983649,2.29878]"  "(2.29878,4.84739]"
> levels(rediscretized)
[1] "(-2.25,0.984]" "(0.984,2.3]"   "(2.3,4.85]"

Labels may differ somewhat due to rounding, but they are easy to reconcile since the underlying levels are the same.

> levels(rediscretized) = levels(factors$A)
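The same idea extends to a whole data frame; a sketch using a hypothetical validation split of gaussian.test, re-discretized column by column with the stored cutpoints (include.lowest = TRUE keeps values equal to each training minimum inside the first interval, matching the closed left endpoint in the levels above):

> validation = gaussian.test[1:100, ]
> redisc = as.data.frame(mapply(cut, validation, cutpoints,
+                               MoreArgs = list(include.lowest = TRUE),
+                               SIMPLIFY = FALSE))
> sapply(redisc, nlevels)
A B C D E F G 
3 3 3 3 3 3 3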
Last updated on Tue Aug 5 15:08:32 2025 with bnlearn 5.1 and R version 4.5.0 (2025-04-11).