Entropy and Kullback-Leibler divergence

Entropy and the Kullback-Leibler divergence are fundamental quantities in information theory and in machine learning, and they play a central role in learning and evaluating probabilistic models. bnlearn implements both for discrete, Gaussian and conditional Gaussian networks. The full mathematical description of how they are computed is in Scutari (2024, Algorithms).
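
To make the two quantities concrete, here is a minimal sketch that computes them for discrete distributions directly from their definitions. The helper functions entropy() and kl() are written here for illustration only (they are not part of bnlearn) and assume natural logarithms.

> # entropy of a discrete distribution, with the convention 0 * log(0) = 0.
> entropy = function(p) -sum(ifelse(p > 0, p * log(p), 0))
> # Kullback-Leibler divergence between two discrete distributions p and q.
> kl = function(p, q) sum(ifelse(p > 0, p * log(p / q), 0))
> entropy(c(0.5, 0.5))
[1] 0.6931472
> kl(c(0.5, 0.5), c(0.9, 0.1))
[1] 0.5108256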

Consider three different Bayesian networks for learning.test: the empty network, the generating model for the data and the complete network.

> dag1 = empty.graph(names(learning.test))
> bn1 = bn.fit(dag1, learning.test, method = "mle")
> dag2 = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> bn2 = bn.fit(dag2, learning.test, method = "mle")
> dag3 = complete.graph(names(learning.test))
> bn3 = bn.fit(dag3, learning.test, method = "mle", replace.unidentifiable = TRUE)
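
As a rough sketch of how different these models are, we can compare their numbers of free parameters with nparams(): the count is smallest for the empty network and largest for the complete network.

> # outputs omitted; the parameter counts increase from bn1 to bn3.
> nparams(bn1)
> nparams(bn2)
> nparams(bn3)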

The entropy, which can be computed with H(), gradually decreases as the Bayesian networks become more informative about the data. None of the networks, not even the complete one, is a singular model, so all three entropies are strictly positive.

> H(bn1)
[1] 5.64615
> H(bn2)
[1] 4.76624
> H(bn3)
[1] 4.721232
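
The first value is easy to verify by hand: bn1 is just the product of the empirical marginal distributions of the variables, so its entropy is the sum of their marginal entropies. A sketch, reusing the entropy() helper defined above and again assuming natural logarithms:

> # should reproduce H(bn1) above, up to the base of the logarithm.
> sum(sapply(learning.test, function(x) entropy(prop.table(table(x)))))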

The Kullback-Leibler divergences, computed with KL(), are all strictly positive since the Bayesian networks represent different probability distributions.

> KL(bn2, bn1)
[1] 0.8797299
> KL(bn3, bn2)
[1] 0.04519399
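
Keep in mind that the Kullback-Leibler divergence is not symmetric: swapping the arguments of KL() is a different query and in general returns a different (still non-negative) value.

> # a different quantity from KL(bn2, bn1) above (output omitted).
> KL(bn1, bn2)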

Naturally, the Kullback-Leibler divergence between two Bayesian networks in the same equivalence class is zero because they encode the same probability distribution. We can verify that by reversing the arc A → B in bn2 to produce bn4, which is score-equivalent.

> dag4 = model2network("[A|B][C][F][B][D|A:C][E|B:F]")
> bn4 = bn.fit(dag4, learning.test, method = "mle")
> 
> KL(bn2, bn4)
[1] 0
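
We can double-check that dag2 and dag4 really are in the same equivalence class by comparing their CPDAGs, since two DAGs are equivalent if and only if they have the same CPDAG.

> all.equal(cpdag(dag2), cpdag(dag4))
[1] TRUE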

In the case of Gaussian and conditional Gaussian networks, H() and KL() compute the differential entropy and the Kullback-Leibler divergence, respectively. The former can be negative!

> dag1 = model2network("[A][B|A]")
> bn1 = custom.fit(dag1, list(A = list(coef = 0, sd = 1e-3),
+                             B = list(coef = c(0, 1), sd = 1)))
> 
> H(bn1)
[1] -4.069878
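
This value can be reproduced from the closed form of the Gaussian differential entropy: a Gaussian with standard deviation s has differential entropy 0.5 * log(2 * pi * e * s^2), and the entropy of this network decomposes as h(A) + h(B | A). The gauss.entropy() helper below is defined here just for illustration.

> gauss.entropy = function(s) 0.5 * log(2 * pi * exp(1) * s^2)
> # the tiny standard deviation of A drives the total below zero.
> gauss.entropy(1e-3) + gauss.entropy(1)
[1] -4.069878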

Note the following corner cases:

  1. H() returns -Inf if at least one continuous variable in the network is singular.
    > dag1 = model2network("[A][B|A]")
    > bn1 = custom.fit(dag1, list(A = list(coef = 0, sd = 0),
    +                             B = list(coef = c(0, 1), sd = 1)))
    > 
    > H(bn1)
    
    [1] -Inf
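    
    > # consistent with the by-hand sketch above: the singular node alone
    > # contributes gauss.entropy(0) = 0.5 * log(0) = -Inf.
    > gauss.entropy(0)
    [1] -Inf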
    
  2. H() does not return -Inf even if at least one discrete variable has a zero-one distribution.
    > dag1 = model2network("[A][B|A]")
    > cptA = matrix(c(0.4, 0.6), ncol = 2, dimnames = list(NULL, c("a", "b")))
    > cptB = matrix(c(0.8, 0.2, 0, 1), ncol = 2,
    +          dimnames = list(c("a", "b"), c("a", "b")))
    > bn1 = custom.fit(dag1, list(A = cptA, B = cptB))
    > 
    > H(bn1)
    
    [1] 0.8731726
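    
    > # by-hand check with the entropy() helper from above, which uses the
    > # convention 0 * log(0) = 0: the zero-one column of B contributes no
    > # entropy instead of -Inf.
    > entropy(c(0.4, 0.6)) + 0.4 * entropy(c(0.8, 0.2)) + 0.6 * entropy(c(0, 1))
    [1] 0.8731726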
    
  3. KL() returns Inf or -Inf if at least one Bayesian network is singular. This is true for Gaussian and conditional Gaussian networks.
    > dag1 = model2network("[A][B|A]")
    > bn1 = custom.fit(dag1, list(A = list(coef = 0, sd = 0),
    +                             B = list(coef = c(0, 1), sd = 1)))
    > bn2 = custom.fit(dag1, list(A = list(coef = 0, sd = 1),
    +                             B = list(coef = c(0, 1), sd = 1)))
    > 
    > KL(bn1, bn2)
    
    [1] Inf
    
    > KL(bn2, bn1)
    
    [1] -Inf
    
    For discrete networks, the network passed as the second argument must contain a zero-one distribution for this to happen.
    > dag1 = model2network("[A][B|A]")
    > cptA = matrix(c(0.4, 0.6), ncol = 2, dimnames = list(NULL, c("a", "b")))
    > cptB = matrix(c(0.8, 0.2, 0, 1), ncol = 2,
    +          dimnames = list(c("a", "b"), c("a", "b")))
    > bn1 = custom.fit(dag1, list(A = cptA, B = cptB))
    > cptB = matrix(c(0.8, 0.2, 0.4, 0.6), ncol = 2,
    +          dimnames = list(c("a", "b"), c("a", "b")))
    > bn2 = custom.fit(dag1, list(A = cptA, B = cptB))
    > 
    > KL(bn2, bn1)
    
    [1] Inf
    
    > KL(bn1, bn2)
    
    [1] 0.3064954
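    
    > # by-hand check with the kl() helper from above: the two networks only
    > # differ in the B | A = "b" column, weighted by P(A = "b") = 0.6, and
    > # 0 * log(0 / 0.4) = 0, so the zero in bn1 does not blow up here.
    > 0.6 * kl(c(0, 1), c(0.4, 0.6))
    [1] 0.3064954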
    
  4. If any parameter is NA, both H() and KL() will return NA.
    > class(bn1) = NULL
    > bn1$B$prob[, "b"] = c(NA, NA)
    > class(bn1) = c("bn.fit", "bn.fit.dnet")
    > 
    > H(bn1)
    
    [1] NA
    
    > KL(bn1, bn2)
    
    [1] NA
    
    > KL(bn2, bn1)
    
    [1] NA
    
Last updated on Mon Aug 5 02:44:12 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).