Score-based structure learning from data with missing values

Score-based algorithms in the literature are typically defined to use a generic score function to compare different network structures. However, for the most part network scores assume that data are complete.

The Structural Expectation-Maximization (Structural EM) algorithm

A possible approach to sidestep this limitation is the Structural EM algorithm from Nir Friedman (link), which scores candidate network structures on completed data by iterating over:

  • an expectation step (E): in which we complete the data by imputing missing values from a fitted Bayesian network;
  • the maximization step (M): in which we learn a Bayesian network by maximizing a network score over the completed data.

This algorithm is implemented in the structural.em() function in bnlearn (documented here). The arguments of structural.em() reflect its modular nature:

  1. maximize, the label of a score-based structure learning learning algorithm, and maximize.args, a list containing its arguments (other than the data);
  2. fit, the label of a parameter estimator in bn.fit() (documented here), and fit.args, a list containing its arguments (other than the data);
  3. impute, the label an imputation method in impute() (documented here), and impute.args, a list containing its arguments (other than the data).

The number of iterations of the E and M steps is controlled by the max.iter argument, which defaults to 5 iterations.

With partially observed variables

Consider some simple MCAR data in which 5% of the values are missing for each variable.

> incomplete.data = learning.test
> for (col in seq(ncol(incomplete.data)))
+   incomplete.data[sample(nrow(incomplete.data), 100), col] = NA

With the default arguments, structural.em() uses hill-climbing as the structure learning algorithm, maximum likelihood for estimating the parameters of the Bayesian network, and likelihood weighting to impute the missing values.

> dag = structural.em(incomplete.data)
> dag

  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|B:F] 
  nodes:                                 6 
  arcs:                                  5 
    undirected arcs:                     0 
    directed arcs:                       5 
  average markov blanket size:           2.33 
  average neighbourhood size:            1.67 
  average branching factor:              0.83 

  learning algorithm:                    Structural EM 
  score-based method:                    Hill-Climbing 
  parameter learning method:             Maximum Likelihood (disc.) 
  imputation method:                     
                                     Posterior Expectation (Likelihood Weighting) 
  penalization coefficient:              4.258597 
  tests used in the learning procedure:  148 
  optimized:                             TRUE

We can change that using the arguments listed above.

> dag = structural.em(incomplete.data,
+         maximize = "tabu", maximize.args = list(tabu = 50, max.tabu = 50),
+         fit = "bayes", fit.args = list(iss = 1),
+         impute = "exact", max.iter = 3)
Loading required namespace: gRain

Attaching package: 'gRbase'
The following objects are masked from 'package:bnlearn':

    ancestors, children, nodes, parents
> dag

  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|B:F] 
  nodes:                                 6 
  arcs:                                  5 
    undirected arcs:                     0 
    directed arcs:                       5 
  average markov blanket size:           2.33 
  average neighbourhood size:            1.67 
  average branching factor:              0.83 

  learning algorithm:                    Structural EM 
  score-based method:                    Tabu Search 
  parameter learning method:             Bayesian Dirichlet 
  imputation method:                     Exact Inference 
  penalization coefficient:              4.258597 
  tests used in the learning procedure:  1627 
  optimized:                             TRUE

In particular, changing the default impute = "bayes-lw" into impute = "exact" may be useful because the convergence of the Structural EM is not guaranteed if the imputation is performed using approximate Monte Carlo inference. However, it is usually much slower.

In addition, we can set the argument return.all to TRUE to have structural.em() return its complete status at the last iteration: the network structure it has learned, the completed data it was learned from and the fitted Bayesian network used to perform the imputation.

> info = structural.em(incomplete.data, return.all = TRUE,
+         maximize = "tabu", maximize.args = list(tabu = 50, max.tabu = 50),
+         fit = "bayes", fit.args = list(iss = 1),
+         impute = "exact", max.iter = 3)
> names(info)
[1] "dag"     "imputed" "fitted"

The network structure is the same as that returned when return.all = FALSE, which is the default.

> info$dag

  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|B:F] 
  nodes:                                 6 
  arcs:                                  5 
    undirected arcs:                     0 
    directed arcs:                       5 
  average markov blanket size:           2.33 
  average neighbourhood size:            1.67 
  average branching factor:              0.83 

  learning algorithm:                    Structural EM 
  score-based method:                    Tabu Search 
  parameter learning method:             Bayesian Dirichlet 
  imputation method:                     Exact Inference 
  penalization coefficient:              4.258597 
  tests used in the learning procedure:  1627 
  optimized:                             TRUE

The completed data are stored in a data frame with the same structure as the original data.

> head(info$imputed)
  A B C D E F
1 b c b a b b
2 b a c a b b
3 a a a a a a
4 a a a a b b
5 a a b c a a
6 c c a c c a

The fitted Bayesian network is a bn.fit object.

> info$fitted

  Bayesian network parameters

  Parameters of node A (multinomial distribution)

Conditional probability table:
         a         b         c 
0.3328001 0.3353996 0.3318003 

  Parameters of node B (multinomial distribution)

Conditional probability table:
 
   A
B            a          b          c
  a 0.85746712 0.44481982 0.11758404
  b 0.02289872 0.21707737 0.09227267
  c 0.11963415 0.33810281 0.79014330

  Parameters of node C (multinomial distribution)

Conditional probability table:
          a          b          c 
0.74551756 0.20382590 0.05065654 

  Parameters of node D (multinomial distribution)

Conditional probability table:
 
, , C = a

   A
D            a          b          c
  a 0.80793106 0.08461902 0.10756427
  b 0.08708492 0.81320565 0.10274929
  c 0.10498403 0.10217533 0.78968643

, , C = b

   A
D            a          b          c
  a 0.17334596 0.89455451 0.22772841
  b 0.12790575 0.06149183 0.51686033
  c 0.69874829 0.04395366 0.25541126

, , C = c

   A
D            a          b          c
  a 0.45766488 0.34145241 0.11391341
  b 0.19295900 0.40234551 0.47709121
  c 0.34937611 0.25620207 0.40899538


  Parameters of node E (multinomial distribution)

Conditional probability table:
 
, , F = a

   B
E            a          b          c
  a 0.81880294 0.20615269 0.10985155
  b 0.09354673 0.17238792 0.11082321
  c 0.08765033 0.62145939 0.77932524

, , F = b

   B
E            a          b          c
  a 0.38949250 0.35905681 0.23665660
  b 0.50039979 0.33976420 0.51428420
  c 0.11010771 0.30117899 0.24905920


  Parameters of node F (multinomial distribution)

Conditional probability table:
         a         b 
0.5023995 0.4976005

With completely unobserved (latent) variables

If the data contain a latent variable which we do not observe for any observation, the E step in the fist iteration fails because it cannot fit a Bayesian network to impute the missing values. (If all variables are at least partially observed, structural.em() uses locally complete observations for this purpose. bn.fit() does the same as illustrated here.)

> incomplete.data[, "A"] = factor(rep(NA, nrow(incomplete.data)), levels = levels(incomplete.data[, "A"]))
> structural.em(incomplete.data)
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable has no observed values.
## Error: the data contain latent variables, so the 'start' argument must be a 'bn.fit' object.

As the error message suggests, we can side-step this issue by providing a bn.fit object ourselves via the start argument: it will be used to perform the initial imputation.

> start.dag = empty.graph(names(incomplete.data))
> cptA = matrix(c(0.3336, 0.3340, 0.3324), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptB = matrix(c(0.4724, 0.1136, 0.4140), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptC = matrix(c(0.7434, 0.2048, 0.0518), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptD = matrix(c(0.351, 0.314, 0.335), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptE = matrix(c(0.3882, 0.2986, 0.3132), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptF = matrix(c(0.5018, 0.4982), ncol = 2, dimnames = list(NULL, c("a", "b")))
> start = custom.fit(start.dag, list(A = cptA, B = cptB, C = cptC, D = cptD, E = cptE, F = cptF))
> dag = structural.em(incomplete.data, start = start, max.iter = 3)
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable has no observed values.
## Warning in check.data(x, allow.missing = TRUE): variable A has levels that are
## not observed in the data.
> dag

  Bayesian network learned from Missing Data

  model:
   [A][B][C][F][D|B:C][E|B:F] 
  nodes:                                 6 
  arcs:                                  4 
    undirected arcs:                     0 
    directed arcs:                       4 
  average markov blanket size:           2.00 
  average neighbourhood size:            1.33 
  average branching factor:              0.67 

  learning algorithm:                    Structural EM 
  score-based method:                    Hill-Climbing 
  parameter learning method:             Maximum Likelihood (disc.) 
  imputation method:                     
                                     Posterior Expectation (Likelihood Weighting) 
  penalization coefficient:              4.258597 
  tests used in the learning procedure:  83 
  optimized:                             TRUE

Unfortunately, the latent variable will almost certainly end up as an isolated nodes unless we connect it to at least some nodes that are partially observed: the noisiness of Monte Carlo inference can easily overwhelm the dependence relationships we encode in the network in start argument.

> start.dag = model2network("[A][B|A][C][D][E][F]")
> cptA = matrix(c(0.3336, 0.3340, 0.3324), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptB = matrix(c(0.856, 0.025, 0.118, 0.444, 0.221, 0.334, 0.114, 0.094, 0.790), nrow = 3, ncol = 3,
+          dimnames = list(B = c("a", "b", "c"), A = c("a", "b", "c")))
> start = custom.fit(start.dag, list(A = cptA, B = cptB, C = cptC, D = cptD, E = cptE, F = cptF))
> dag = structural.em(incomplete.data, start = start, max.iter = 3)
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable has no observed values.
> dag

  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|A:F] 
  nodes:                                 6 
  arcs:                                  5 
    undirected arcs:                     0 
    directed arcs:                       5 
  average markov blanket size:           2.33 
  average neighbourhood size:            1.67 
  average branching factor:              0.83 

  learning algorithm:                    Structural EM 
  score-based method:                    Hill-Climbing 
  parameter learning method:             Maximum Likelihood (disc.) 
  imputation method:                     
                                     Posterior Expectation (Likelihood Weighting) 
  penalization coefficient:              4.258597 
  tests used in the learning procedure:  94 
  optimized:                             TRUE

Passing a whitelist to the structure learning algorithm is the simplest way to do that.

> start.dag = model2network("[A][B|A][C][D][E][F]")
> cptA = matrix(c(0.3336, 0.3340, 0.3324), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
> cptB = matrix(c(0.856, 0.025, 0.118, 0.444, 0.221, 0.334, 0.114, 0.094, 0.790), nrow = 3, ncol = 3,
+          dimnames = list(B = c("a", "b", "c"), A = c("a", "b", "c")))
> start = custom.fit(start.dag, list(A = cptA, B = cptB, C = cptC, D = cptD, E = cptE, F = cptF))
> dag = structural.em(incomplete.data,
+         maximize.args = list(whitelist = data.frame(from = "A", to = "B")),
+         start = start, max.iter = 3)
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable has no observed values.
> dag

  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|A:F] 
  nodes:                                 6 
  arcs:                                  5 
    undirected arcs:                     0 
    directed arcs:                       5 
  average markov blanket size:           2.33 
  average neighbourhood size:            1.67 
  average branching factor:              0.83 

  learning algorithm:                    Structural EM 
  score-based method:                    Hill-Climbing 
  parameter learning method:             Maximum Likelihood (disc.) 
  imputation method:                     
                                     Posterior Expectation (Likelihood Weighting) 
  penalization coefficient:              4.258597 
  tests used in the learning procedure:  91 
  optimized:                             TRUE

Exact inference does not have this issue because it has no stochastic noise: the imputed values are deterministic given the observed values in each observation.

> dag = structural.em(incomplete.data, start = start, max.iter = 3, impute = "exact")
## Warning in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
## warn.if.no.missing = TRUE, : at least one variable has no observed values.
> dag

  Bayesian network learned from Missing Data

  model:
   [A][C][F][B|A][D|A:C][E|A:F] 
  nodes:                                 6 
  arcs:                                  5 
    undirected arcs:                     0 
    directed arcs:                       5 
  average markov blanket size:           2.33 
  average neighbourhood size:            1.67 
  average branching factor:              0.83 

  learning algorithm:                    Structural EM 
  score-based method:                    Hill-Climbing 
  parameter learning method:             Maximum Likelihood (disc.) 
  imputation method:                     Exact Inference 
  penalization coefficient:              4.258597 
  tests used in the learning procedure:  94 
  optimized:                             TRUE
Last updated on Fri Dec 9 03:12:25 2022 with bnlearn 4.9-20221107 and R version 4.2.2 (2022-10-31).