Preprocessing data before structure learning
bnlearn provides two functions to carry out the most common preprocessing tasks in the Bayesian network literature: discretize() and dedup().
Discretizing data
The discretize() function (documented here) takes a data frame containing at least some continuous variables and returns a second data frame in which those continuous variables have been transformed into discrete variables. The end goal is to be able to use the returned data set to learn a discrete Bayesian network.
It supports:
- Marginal discretization into equal-length intervals spanning the range of each individual variable (method = "interval").
- Marginal discretization into equi-spaced quantiles from the empirical distribution of each individual variable (method = "quantile").
- Joint discretization using the mutual information approach from Hartemink, which strives to preserve the pairwise dependence patterns between variables (method = "hartemink").
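What the two marginal methods compute can be illustrated with base R alone. The sketch below is an illustration of the idea, not bnlearn's actual implementation: for a single variable and breaks = 3, "interval" places cutpoints at equal distances across the range, while "quantile" places them at equi-spaced quantiles so each level holds roughly the same number of observations.

```r
# Illustration in base R (not bnlearn internals) of the two marginal
# discretization methods for one variable x with breaks = 3.
set.seed(42)
x = rnorm(100)

# method = "interval": three equal-length intervals spanning range(x).
interval.breaks = seq(min(x), max(x), length.out = 4)
interval.factor = cut(x, breaks = interval.breaks, include.lowest = TRUE)

# method = "quantile": cutpoints at equi-spaced empirical quantiles, so
# the three levels contain (roughly) the same number of observations.
quantile.breaks = quantile(x, probs = seq(0, 1, length.out = 4))
quantile.factor = cut(x, breaks = quantile.breaks, include.lowest = TRUE)

table(interval.factor)  # equal-width bins, unequal counts
table(quantile.factor)  # (nearly) equal counts
```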
The breaks argument specifies the number of levels the variables will be discretized into, and the ordered argument determines whether the return value contains ordered or unordered factors.
> factors = discretize(gaussian.test, method = "quantile", breaks = 3)
> lapply(factors, levels)
$A
[1] "[-2.24652,0.555839]" "(0.555839,1.44378]"  "(1.44378,4.84739]"

$B
[1] "[-10.0946,0.78477]" "(0.78477,3.35529]"  "(3.35529,14.2154]"

$C
[1] "[-15.81,5.33039]"   "(5.33039,10.8336]"  "(10.8336,32.4408]"

$D
[1] "[-9.0438,7.13546]"  "(7.13546,11.0273]"  "(11.0273,26.9773]"

$E
[1] "[-3.55877,2.62531]" "(2.62531,4.38026]"  "(4.38026,11.4944]"

$F
[1] "[-1.17025,19.3574]" "(19.3574,24.7354]"  "(24.7354,45.8496]"

$G
[1] "[-1.36582,4.20784]" "(4.20784,5.86491]"  "(5.86491,12.4096]"
> head(factors)
                    A                 B                 C                 D
1  (0.555839,1.44378] (0.78477,3.35529] (5.33039,10.8336] (7.13546,11.0273]
2 [-2.24652,0.555839] (3.35529,14.2154] (10.8336,32.4408] (11.0273,26.9773]
3   (1.44378,4.84739] (0.78477,3.35529] (10.8336,32.4408] (11.0273,26.9773]
4  (0.555839,1.44378] (3.35529,14.2154] (10.8336,32.4408] (11.0273,26.9773]
5 [-2.24652,0.555839] (3.35529,14.2154] (5.33039,10.8336] (11.0273,26.9773]
6   (1.44378,4.84739] (0.78477,3.35529] (5.33039,10.8336] [-9.0438,7.13546]
                   E                 F                  G
1 [-3.55877,2.62531] (19.3574,24.7354]  (5.86491,12.4096]
2  (4.38026,11.4944] (24.7354,45.8496] [-1.36582,4.20784]
3  (2.62531,4.38026] (19.3574,24.7354] [-1.36582,4.20784]
4 [-3.55877,2.62531] (19.3574,24.7354]  (5.86491,12.4096]
5  (2.62531,4.38026] (19.3574,24.7354]  (4.20784,5.86491]
6  (4.38026,11.4944] (24.7354,45.8496]  (5.86491,12.4096]
The breaks argument can also be a vector specifying a different number of levels for each variable.
> factors = discretize(gaussian.test, method = "quantile", breaks = c(2, 4, 7, 4, 3, 3, 2))
> sapply(factors, nlevels)
A B C D E F G 
2 4 7 4 3 3 2 
Variables that are already factors in the input data frame are left unchanged.
Hartemink's discretization takes two additional arguments: idisc and ibreaks. These arguments specify the discretization method (either "quantile" or "interval") and the number of levels in the initial discretization of the data. The resulting factors are then reduced to the number of levels specified by breaks by merging adjacent levels while minimising mutual information loss.
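The idea behind this merging step can be sketched in base R. The following is a simplified illustration of the principle only, not bnlearn's implementation (the real method considers all variables jointly): starting from a fine ibreaks-level quantile discretization, repeatedly collapse the pair of adjacent levels whose merge loses the least mutual information with another variable, until breaks levels remain.

```r
# Simplified sketch (assumption for illustration, not bnlearn's code) of
# Hartemink-style level merging for one variable fa against one other
# variable fb.
mutual.information = function(x, y) {

  joint = table(x, y) / length(x)
  px = rowSums(joint)
  py = colSums(joint)
  nz = joint > 0
  sum(joint[nz] * log(joint[nz] / outer(px, py)[nz]))

}#MUTUAL.INFORMATION

set.seed(1)
a = rnorm(500)
b = a + rnorm(500)
ibreaks = 10
breaks = 3
# initial fine discretization with idisc = "quantile", ibreaks levels.
fa = cut(a, quantile(a, seq(0, 1, length.out = ibreaks + 1)), include.lowest = TRUE)
fb = cut(b, quantile(b, seq(0, 1, length.out = ibreaks + 1)), include.lowest = TRUE)

while (nlevels(fa) > breaks) {

  # for each pair of adjacent levels of fa, how much mutual information
  # with fb would be lost by merging them?
  losses = sapply(seq_len(nlevels(fa) - 1), function(i) {
    merged = fa
    levels(merged)[c(i, i + 1)] = "merged"
    mutual.information(fa, fb) - mutual.information(merged, fb)
  })
  # merge the pair with the smallest loss.
  best = which.min(losses)
  levels(fa)[c(best, best + 1)] = paste(levels(fa)[best], levels(fa)[best + 1])

}#WHILE

nlevels(fa)  # 3
```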
> factors = discretize(gaussian.test, method = "hartemink", breaks = 3, idisc = "quantile", ibreaks = 10)
> lapply(factors, levels)
$A
[1] "[-2.24652,0.983649]" "(0.983649,2.29878]"  "(2.29878,4.84739]"

$B
[1] "[-10.0946,-0.567162]" "(-0.567162,2.81054]"  "(2.81054,14.2154]"

$C
[1] "[-15.81,4.72992]"   "(4.72992,11.3814]"  "(11.3814,32.4408]"

$D
[1] "[-9.0438,5.13117]"  "(5.13117,10.2337]"  "(10.2337,26.9773]"

$E
[1] "[-3.55877,2.98817]" "(2.98817,5.16809]"  "(5.16809,11.4944]"

$F
[1] "[-1.17025,18.7884]" "(18.7884,23.5891]"  "(23.5891,45.8496]"

$G
[1] "[-1.36582,4.03283]" "(4.03283,5.02842]"  "(5.02842,12.4096]"
The data frame returned by discretize() has a custom attribute called "cutpoints" containing the cutpoints computed internally to discretize each variable. These cutpoints can be passed as is to cut() to reproduce the discretization on different data (say, to discretize a validation set in the same way as the corresponding training set).
> cutpoints = attr(factors, "cutpoints")
> cutpoints
$A
[1] -2.2465197  0.9836491  2.2987842  4.8473877

$B
[1] -10.0945581  -0.5671621   2.8105389  14.2153797

$C
[1] -15.809961   4.729918  11.381408  32.440769

$D
[1] -9.043796  5.131174 10.233725 26.977326

$E
[1] -3.558768  2.988170  5.168090 11.494383

$F
[1] -1.170247 18.788360 23.589134 45.849594

$G
[1] -1.365823  4.032827  5.028420 12.409607
> rediscretized = cut(gaussian.test$A, breaks = cutpoints$A)
> levels(factors$A)
[1] "[-2.24652,0.983649]" "(0.983649,2.29878]"  "(2.29878,4.84739]"
> levels(rediscretized)
[1] "(-2.25,0.984]" "(0.984,2.3]"   "(2.3,4.85]"
Labels may differ somewhat due to rounding, but they are easy to reconcile since the underlying levels are the same.
> levels(rediscretized) = levels(factors$A)
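One detail worth keeping in mind when reapplying cutpoints: by default cut() produces intervals that are open on the left, so a new observation exactly equal to the lowest cutpoint is mapped to NA unless include.lowest = TRUE is passed. The following self-contained base R example uses invented cutpoints (not the ones from gaussian.test) to show the difference:

```r
# Hypothetical cutpoints for one variable; the values are made up for
# illustration, not taken from gaussian.test.
cutpoints = c(-2.25, 0.98, 2.30, 4.85)
new.data = c(-2.25, 0.5, 3.1)  # first value sits exactly on the lowest cutpoint

without = cut(new.data, breaks = cutpoints)
with.lowest = cut(new.data, breaks = cutpoints, include.lowest = TRUE)

is.na(without)[1]      # TRUE: the boundary value is dropped
is.na(with.lowest)[1]  # FALSE: it falls in the first interval
```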
Tue Aug 5 15:08:32 2025 with bnlearn 5.1 and R version 4.5.0 (2025-04-11).