Entropy and Kullback-Leibler divergence
Entropy and the Kullback-Leibler divergence are fundamental quantities in information theory and in machine learning, and they play a central role in learning and evaluating probabilistic models. bnlearn implements both for discrete, Gaussian and conditional Gaussian networks. The full mathematical description of how they are computed is in Scutari (2024, Algorithms), available here.
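As a reminder of the standard definitions (see Scutari, 2024, for how they are computed for each class of network), denoting with P and Q the distributions encoded by two Bayesian networks over the variables X,
\[
  \mathrm{H}(P) = - \mathbb{E}_{P}\big[\log P(\mathbf{X})\big],
  \qquad
  \mathrm{KL}(P \,\|\, Q) = \mathbb{E}_{P}\!\left[\log \frac{P(\mathbf{X})}{Q(\mathbf{X})}\right],
\]
with natural logarithms, so the values reported below are expressed in nats.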
Consider three different Bayesian networks for learning.test
: the empty network, the generating model
for the data and the complete network.
> dag1 = empty.graph(names(learning.test))
> bn1 = bn.fit(dag1, learning.test, method = "mle")
> dag2 = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> bn2 = bn.fit(dag2, learning.test, method = "mle")
> dag3 = complete.graph(names(learning.test))
> bn3 = bn.fit(dag3, learning.test, method = "mle", replace.unidentifiable = TRUE)
The entropy, which can be computed with H(), gradually decreases as the Bayesian networks become more informative for the data. Not even the complete network is a singular model, so all three entropies are strictly positive.
> H(bn1)
[1] 5.64615
> H(bn2)
[1] 4.76624
> H(bn3)
[1] 4.721232
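As a sanity check (a sketch, not part of bnlearn's API), the entropy of the empty network fitted by maximum likelihood is simply the sum of the marginal entropies of the empirical frequency distributions of the individual variables:
> # sum of the marginal empirical entropies; this should reproduce
> # H(bn1) above, about 5.646 nats.
> sum(sapply(learning.test, function(x) {
+     p = prop.table(table(x))
+     -sum(p * log(p))
+   }))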
The Kullback-Leibler divergences, computed with KL(), are all positive since the Bayesian networks represent different probability distributions.
> KL(bn2, bn1)
[1] 0.8797299
> KL(bn3, bn2)
[1] 0.04519399
Naturally, the Kullback-Leibler divergence between two Bayesian networks in the same equivalence class will be zero because they encode the same probability distribution. We can verify that by reversing the arc A → B in bn2 to produce bn4, which is score-equivalent.
> dag4 = model2network("[A|B][C][F][B][D|A:C][E|B:F]")
> bn4 = bn.fit(dag4, learning.test, method = "mle")
> KL(bn2, bn4)
[1] 0
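As a cross-check (an extra step, not part of the original example), two DAGs belong to the same equivalence class exactly when they map to the same CPDAG, which can be tested with cpdag() and all.equal():
> # dag2 and dag4 share the same skeleton and v-structures, so this
> # comparison is expected to return TRUE.
> all.equal(cpdag(dag2), cpdag(dag4))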
In the case of Gaussian and conditional Gaussian networks, H() and KL() compute the differential entropy and the Kullback-Leibler divergence. The former can be negative!
> dag1 = model2network("[A][B|A]")
> bn1 = custom.fit(dag1, list(A = list(coef = 0, sd = 1e-3),
+                             B = list(coef = c(0, 1), sd = 1)))
> H(bn1)
[1] -4.069878
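This value can be reproduced by hand: the differential entropy of a linear Gaussian network is the sum of the node-wise terms 0.5 log(2πeσ²), where σ is the conditional standard deviation of each node, and the tiny standard deviation of A makes its term strongly negative.
> # node-wise terms for A (sd = 1e-3) and B (sd = 1).
> 0.5 * log(2 * pi * exp(1) * (1e-3)^2) + 0.5 * log(2 * pi * exp(1) * 1^2)
[1] -4.069878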
Note the following corner cases:
- H() returns -Inf if at least one continuous variable in the network is singular.
> dag1 = model2network("[A][B|A]")
> bn1 = custom.fit(dag1, list(A = list(coef = 0, sd = 0),
+                             B = list(coef = c(0, 1), sd = 1)))
> H(bn1)
[1] -Inf
- H() does not return -Inf even if at least one discrete variable has a zero-one distribution, because the 0 log(0) terms in the entropy contribute nothing (a hand computation after this list reproduces this value).
> dag1 = model2network("[A][B|A]")
> cptA = matrix(c(0.4, 0.6), ncol = 2, dimnames = list(NULL, c("a", "b")))
> cptB = matrix(c(0.8, 0.2, 0, 1), ncol = 2,
+               dimnames = list(c("a", "b"), c("a", "b")))
> bn1 = custom.fit(dag1, list(A = cptA, B = cptB))
> H(bn1)
[1] 0.8731726
- KL() returns Inf or -Inf if at least one Bayesian network is singular. This is true for Gaussian and conditional Gaussian networks. For discrete networks, the second network must contain a zero-one distribution for this to happen.
> dag1 = model2network("[A][B|A]")
> bn1 = custom.fit(dag1, list(A = list(coef = 0, sd = 0),
+                             B = list(coef = c(0, 1), sd = 1)))
> bn2 = custom.fit(dag1, list(A = list(coef = 0, sd = 1),
+                             B = list(coef = c(0, 1), sd = 1)))
> KL(bn1, bn2)
[1] Inf
> KL(bn2, bn1)
[1] -Inf
> dag1 = model2network("[A][B|A]")
> cptA = matrix(c(0.4, 0.6), ncol = 2, dimnames = list(NULL, c("a", "b")))
> cptB = matrix(c(0.8, 0.2, 0, 1), ncol = 2,
+               dimnames = list(c("a", "b"), c("a", "b")))
> bn1 = custom.fit(dag1, list(A = cptA, B = cptB))
> cptB = matrix(c(0.8, 0.2, 0.4, 0.6), ncol = 2,
+               dimnames = list(c("a", "b"), c("a", "b")))
> bn2 = custom.fit(dag1, list(A = cptA, B = cptB))
> KL(bn2, bn1)
[1] Inf
> KL(bn1, bn2)
[1] 0.3064954
- If any parameter is NA, both H() and KL() will return NA.
> class(bn1) = NULL
> bn1$B$prob[, "b"] = c(NA, NA)
> class(bn1) = c("bn.fit", "bn.fit.dnet")
> H(bn1)
[1] NA
> KL(bn1, bn2)
[1] NA
> KL(bn2, bn1)
[1] NA
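Coming back to the zero-one entropy example above, its value can also be checked by hand: the terms involving zero probabilities drop out, and what remains is H(A) plus P(A = a) times H(B | A = a).
> # entropy of A plus the weighted conditional entropies of B given A;
> # the zero-one distribution of B given A = b contributes nothing.
> - (0.4 * log(0.4) + 0.6 * log(0.6)) +
+     0.4 * (- (0.8 * log(0.8) + 0.2 * log(0.2)))
[1] 0.8731726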
Mon Aug 5 02:44:12 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).