Score-based structure learning from data with missing values
Score-based structure learning algorithms in the literature are typically defined in terms of a generic score function that compares different network structures. However, most network scores assume that the data are complete.
The Structural Expectation-Maximization (Structural EM) algorithm
One possible approach to sidestep this limitation is the Structural EM algorithm by Nir Friedman (link), which scores candidate network structures on completed data by iterating over two steps (a minimal code sketch follows the list below):
- the expectation step (E), in which we complete the data by imputing the missing values from a fitted Bayesian network;
- the maximization step (M), in which we learn a Bayesian network by maximizing a network score over the completed data.
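For intuition, the loop below is a minimal sketch of this iteration written with bnlearn primitives; it is illustrative only and glosses over what the actual implementation does (convergence checks, handling of latent variables, and so on). It assumes an incomplete data set incomplete.data and a fitted Bayesian network start used for the first imputation.

# Illustrative sketch of the Structural EM iteration (not the actual
# implementation): impute, then re-learn, for a fixed number of iterations.
library(bnlearn)

structural.em.sketch = function(incomplete.data, start, max.iter = 5) {

  fitted = start

  for (i in seq(max.iter)) {

    # E step: complete the data by imputing the missing values from the
    # current fitted Bayesian network.
    completed = impute(fitted, data = incomplete.data, method = "bayes-lw")
    # M step: learn a new structure by maximizing a network score over the
    # completed data, then re-estimate its parameters.
    dag = hc(completed, score = "bic")
    fitted = bn.fit(dag, completed)

  }

  list(dag = dag, imputed = completed, fitted = fitted)
}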
This algorithm is implemented in the structural.em() function in bnlearn (documented here). The arguments of structural.em() reflect its modular nature:
- maximize, the label of a score-based structure learning algorithm, and maximize.args, a list containing its arguments (other than the data);
- fit, the label of a parameter estimator in bn.fit() (documented here), and fit.args, a list containing its arguments (other than the data);
- impute, the label of an imputation method in impute() (documented here), and impute.args, a list containing its arguments (other than the data).
The number of iterations of the E and M steps is controlled by the max.iter argument, which defaults to 5.
With partially observed variables
Consider some simple MCAR data in which 5% of the values are missing for each variable.
> incomplete.data = learning.test
> for (col in seq(ncol(incomplete.data)))
+   incomplete.data[sample(nrow(incomplete.data), 100), col] = NA
With the default arguments, structural.em() uses hill-climbing as the structure learning algorithm, maximum likelihood for estimating the parameters of the Bayesian network, and likelihood weighting to impute the missing values.
> dag = structural.em(incomplete.data)
> dag
  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|B:F]
  nodes:                                 6
  arcs:                                  5
    undirected arcs:                     0
    directed arcs:                       5
  average markov blanket size:           2.33
  average neighbourhood size:            1.67
  average branching factor:              0.83

  learning algorithm:                    Structural EM
  score-based method:                    Hill-Climbing
  parameter learning method:             Maximum Likelihood (disc.)
  imputation method:                     Posterior Expectation (Likelihood Weighting)
  penalization coefficient:              4.258597
  tests used in the learning procedure:  148
  optimized:                             TRUE
We can change these defaults using the arguments listed above.
> dag = structural.em(incomplete.data,
+         maximize = "tabu", maximize.args = list(tabu = 50, max.tabu = 50),
+         fit = "bayes", fit.args = list(iss = 1),
+         impute = "exact", max.iter = 3)
> dag
  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|B:F]
  nodes:                                 6
  arcs:                                  5
    undirected arcs:                     0
    directed arcs:                       5
  average markov blanket size:           2.33
  average neighbourhood size:            1.67
  average branching factor:              0.83

  learning algorithm:                    Structural EM
  score-based method:                    Tabu Search
  parameter learning method:             Bayesian Dirichlet
  imputation method:                     Exact Inference
  penalization coefficient:              4.258597
  tests used in the learning procedure:  1649
  optimized:                             TRUE
In particular, changing the default impute = "bayes-lw" to impute = "exact" may be useful because the convergence of the Structural EM algorithm is not guaranteed if the imputation is performed using approximate Monte Carlo inference. However, exact inference is usually much slower.
In addition, we can set the argument return.all to TRUE to have structural.em() return its complete status at the last iteration: the network structure it has learned, the completed data it was learned from and the fitted Bayesian network used to perform the imputation.
> info = structural.em(incomplete.data, return.all = TRUE,
+          maximize = "tabu", maximize.args = list(tabu = 50, max.tabu = 50),
+          fit = "bayes", fit.args = list(iss = 1),
+          impute = "exact", max.iter = 3)
> names(info)
[1] "dag" "imputed" "fitted"
The network structure is the same as that returned when return.all = FALSE, which is the default.
> info$dag
  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|B:F]
  nodes:                                 6
  arcs:                                  5
    undirected arcs:                     0
    directed arcs:                       5
  average markov blanket size:           2.33
  average neighbourhood size:            1.67
  average branching factor:              0.83

  learning algorithm:                    Structural EM
  score-based method:                    Tabu Search
  parameter learning method:             Bayesian Dirichlet
  imputation method:                     Exact Inference
  penalization coefficient:              4.258597
  tests used in the learning procedure:  1649
  optimized:                             TRUE
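We can also check this programmatically: all.equal() has a method for bn objects and, assuming the dag learned earlier with the same arguments is still in the workspace, it should return TRUE here because the two structures are identical.

> all.equal(dag, info$dag)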
The completed data are stored in a data frame with the same structure as the original data.
> head(info$imputed)
  A B C D E F
1 b c b a b b
2 b a c a b b
3 a a a a a a
4 a a a a b b
5 a a b c a a
6 c c a c c a
The fitted Bayesian network is a bn.fit object.
> info$fitted
  Bayesian network parameters

  Parameters of node A (multinomial distribution)

Conditional probability table:
         a         b         c
 0.3335999 0.3335999 0.3328001

  Parameters of node B (multinomial distribution)

Conditional probability table:

   A
B            a          b          c
  a 0.85361305 0.44721945 0.11723079
  b 0.02584083 0.21585082 0.09319714
  c 0.12054612 0.33692974 0.78957207

  Parameters of node C (multinomial distribution)

Conditional probability table:
         a         b         c
 0.7447177 0.2044258 0.0508565

  Parameters of node D (multinomial distribution)

Conditional probability table:

, , C = a

   A
D            a          b          c
  a 0.81195042 0.08851746 0.09945949
  b 0.08511149 0.81009415 0.10507246
  c 0.10293809 0.10138839 0.79546804

, , C = b

   A
D            a          b          c
  a 0.17433619 0.88710540 0.23316298
  b 0.13149265 0.06944890 0.50914253
  c 0.69417116 0.04344569 0.25769449

, , C = c

   A
D            a          b          c
  a 0.45222369 0.30383895 0.14308943
  b 0.17877587 0.41760300 0.46138211
  c 0.36900044 0.27855805 0.39552846

  Parameters of node E (multinomial distribution)

Conditional probability table:

, , F = a

   B
E            a          b          c
  a 0.81872408 0.20469122 0.11243877
  b 0.09315439 0.17830310 0.10565692
  c 0.08812153 0.61700568 0.78190430

, , F = b

   B
E            a          b          c
  a 0.38532936 0.32157631 0.23733588
  b 0.50466449 0.37644241 0.51958752
  c 0.11000616 0.30198128 0.24307660

  Parameters of node F (multinomial distribution)

Conditional probability table:
         a         b
 0.5053989 0.4946011
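Being a regular bn.fit object, it can also be used outside of structural.em(); for instance, we can reuse it to impute further incomplete observations with impute(). A brief sketch, reusing the same incomplete data purely for illustration:

> head(impute(info$fitted, data = incomplete.data, method = "exact"))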
With completely unobserved (latent) variables
If the data contain a latent variable that we do not observe for any observation, the E step in the first iteration fails because it cannot fit a Bayesian network to impute the missing values. (If all variables are at least partially observed, structural.em() uses locally complete observations for this purpose; bn.fit() does the same, as illustrated here.)
> incomplete.data[, "A"] = factor(rep(NA, nrow(incomplete.data)), levels = levels(incomplete.data[, "A"])) > structural.em(incomplete.data)
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable in the data has no observed
## values.
## Error: the data contain latent variables, so the 'start' argument must be a 'bn.fit' object.
As the error message suggests, we can sidestep this issue by providing a bn.fit object ourselves via the start argument: it will be used to perform the initial imputation.
> start.dag = empty.graph(names(incomplete.data))
> cptA = matrix(c(0.3336, 0.3340, 0.3324), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptB = matrix(c(0.4724, 0.1136, 0.4140), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptC = matrix(c(0.7434, 0.2048, 0.0518), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptD = matrix(c(0.351, 0.314, 0.335), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptE = matrix(c(0.3882, 0.2986, 0.3132), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptF = matrix(c(0.5018, 0.4982), ncol = 2, dimnames = list(NULL, c("a", "b")))
> start = custom.fit(start.dag, list(A = cptA, B = cptB, C = cptC, D = cptD, E = cptE, F = cptF))
> dag = structural.em(incomplete.data, start = start, max.iter = 3)
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable in the data has no observed
## values.
## Warning in check.data(x, allow.missing = TRUE): variable A in the data has
## levels that are not observed in the data.
> dag
  Bayesian network learned from Missing Data

  model:
   [A][B][C][F][D|B:C][E|B:F]
  nodes:                                 6
  arcs:                                  4
    undirected arcs:                     0
    directed arcs:                       4
  average markov blanket size:           2.00
  average neighbourhood size:            1.33
  average branching factor:              0.67

  learning algorithm:                    Structural EM
  score-based method:                    Hill-Climbing
  parameter learning method:             Maximum Likelihood (disc.)
  imputation method:                     Posterior Expectation (Likelihood Weighting)
  penalization coefficient:              4.258597
  tests used in the learning procedure:  83
  optimized:                             TRUE
Unfortunately, the latent variable will almost certainly end up as an isolated node unless we connect it to at least some nodes that are partially observed: the noisiness of Monte Carlo inference can easily overwhelm the dependence relationships we encode in the network passed to the start argument.
> start.dag = model2network("[A][B|A][C][D][E][F]")
> cptA = matrix(c(0.3336, 0.3340, 0.3324), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptB = matrix(c(0.856, 0.025, 0.118, 0.444, 0.221, 0.334, 0.114, 0.094, 0.790), nrow = 3, ncol = 3,
+          dimnames = list(B = c("a", "b", "c"), A = c("a", "b", "c")))
> start = custom.fit(start.dag, list(A = cptA, B = cptB, C = cptC, D = cptD, E = cptE, F = cptF))
> dag = structural.em(incomplete.data, start = start, max.iter = 3)
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable in the data has no observed
## values.
> dag
  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|A:F]
  nodes:                                 6
  arcs:                                  5
    undirected arcs:                     0
    directed arcs:                       5
  average markov blanket size:           2.33
  average neighbourhood size:            1.67
  average branching factor:              0.83

  learning algorithm:                    Structural EM
  score-based method:                    Hill-Climbing
  parameter learning method:             Maximum Likelihood (disc.)
  imputation method:                     Posterior Expectation (Likelihood Weighting)
  penalization coefficient:              4.258597
  tests used in the learning procedure:  94
  optimized:                             TRUE
Passing a whitelist to the structure learning algorithm is the simplest way of ensuring that the latent variable stays connected.
> start.dag = model2network("[A][B|A][C][D][E][F]")
> cptA = matrix(c(0.3336, 0.3340, 0.3324), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptB = matrix(c(0.856, 0.025, 0.118, 0.444, 0.221, 0.334, 0.114, 0.094, 0.790), nrow = 3, ncol = 3,
+          dimnames = list(B = c("a", "b", "c"), A = c("a", "b", "c")))
> start = custom.fit(start.dag, list(A = cptA, B = cptB, C = cptC, D = cptD, E = cptE, F = cptF))
> dag = structural.em(incomplete.data,
+         maximize.args = list(whitelist = data.frame(from = "A", to = "B")),
+         start = start, max.iter = 3)
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable in the data has no observed
## values.
> dag
  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][E|A:F][D|B:C]
  nodes:                                 6
  arcs:                                  5
    undirected arcs:                     0
    directed arcs:                       5
  average markov blanket size:           2.33
  average neighbourhood size:            1.67
  average branching factor:              0.83

  learning algorithm:                    Structural EM
  score-based method:                    Hill-Climbing
  parameter learning method:             Maximum Likelihood (disc.)
  imputation method:                     Posterior Expectation (Likelihood Weighting)
  penalization coefficient:              4.258597
  tests used in the learning procedure:  91
  optimized:                             TRUE
Exact inference does not have this issue because it has no stochastic noise: the imputed values are deterministic given the observed values in each observation.
> dag = structural.em(incomplete.data, start = start, max.iter = 3, impute = "exact")
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable in the data has no observed
## values.
> dag
  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|A:F]
  nodes:                                 6
  arcs:                                  5
    undirected arcs:                     0
    directed arcs:                       5
  average markov blanket size:           2.33
  average neighbourhood size:            1.67
  average branching factor:              0.83

  learning algorithm:                    Structural EM
  score-based method:                    Hill-Climbing
  parameter learning method:             Maximum Likelihood (disc.)
  imputation method:                     Exact Inference
  penalization coefficient:              4.258597
  tests used in the learning procedure:  94
  optimized:                             TRUE
The Node-Average Likelihood
Another approach is to use the node-average likelihood score, originally proposed by Nikolay Balov (link) and later extended by Tjebbe Bodewes and Marco Scutari (link). The key idea behind this score is that scoring local distributions by computing penalized likelihood scores using locally-complete data gives consistency and identifiability as long as the penalty coefficient is larger than that of BIC. In practice, this means we can plug score = "pnal" (discrete networks), score = "pnal-g" (Gaussian networks) or score = "pnal-cg" (conditional Gaussian networks) into any score-based structure learning algorithm and use it without modification. The penalty coefficient is controlled by the k argument, as in BIC and AIC.
> incomplete.data = learning.test
> for (col in seq(ncol(incomplete.data)))
+   incomplete.data[sample(nrow(incomplete.data), 100), col] = NA
> dag = hc(incomplete.data, score = "pnal", k = 10)
> dag
  Bayesian network learned via Score-based methods

  model:
   [B][C][F][A|B][E|B:F][D|A:C]
  nodes:                                 6
  arcs:                                  5
    undirected arcs:                     0
    directed arcs:                       5
  average markov blanket size:           2.33
  average neighbourhood size:            1.67
  average branching factor:              0.83

  learning algorithm:                    Hill-Climbing
  score:                                 Penalized Node-Average Likelihood (disc.)
  penalization coefficient:              10
  tests used in the learning procedure:  55
  optimized:                             TRUE
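Since "pnal" is just another network score, we can also evaluate it for a given candidate structure with score(), assuming it accepts incomplete data for the node-average scores just as hc() does above; the penalty coefficient is again passed through the k argument.

> score(dag, incomplete.data, type = "pnal", k = 10)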
The corresponding (unpenalized) log-likelihood scores score = "nal" (discrete networks), score = "nal-g" (Gaussian networks) and score = "nal-cg" (conditional Gaussian networks) will always overfit and learn complete graphs, like their complete-data equivalents.
> dag = hc(incomplete.data, score = "nal") > dag
  Bayesian network learned via Score-based methods

  model:
   [C][F|C][B|C:F][E|B:C:F][D|B:C:E:F][A|B:C:D:E:F]
  nodes:                                 6
  arcs:                                  15
    undirected arcs:                     0
    directed arcs:                       15
  average markov blanket size:           5.00
  average neighbourhood size:            5.00
  average branching factor:              2.50

  learning algorithm:                    Hill-Climbing
  score:                                 Node-Average Likelihood (disc.)
  tests used in the learning procedure:  115
  optimized:                             TRUE
Mon Aug 5 02:48:08 2024 with bnlearn 5.0 and R version 4.4.1 (2024-06-14).