## Predicting new observations from a Bayesian network

Predicting the values of one or more variables is the prototypical task of a machine learning model. In the context of Bayesian networks, we take it to involve just a single variable: producing the best possible instantiation of multiple variables is a most probable explanation (MPE) problem that deserves a separate implementation. Prediction is simpler and it can be optimized more effectively.

bnlearn provides a `predict()` function (documented here) for the fitted Bayesian networks returned by `bn.fit()` (illustrated here). `predict()` provides different methods to compute predictions, with different trade-offs: `"parents"`, `"bayes-lw"` and `"exact"`. For all methods, `predict()` takes

1. a `bn.fit` object encoding the Bayesian network;
2. a `node` that we would like to predict;
3. a `data` set we will predict it from;
4. a `prob` logical argument to return the prediction probabilities for discrete Bayesian networks;

and returns a vector with the predictions. Note that one or more predicted values may be `NA`s if the fitted Bayesian network contains `NA` parameters, which can be avoided by fitting it using posteriors estimators or using maximum likelihood estimators with `replace.unidentifiable = TRUE`.

### Predicting from the parents

The most computationally-efficient way of computing predictions for a variable is to use its local distribution and to find its most probable value conditional on the values of its parents. This does not require any inference: it boils down to a look-up in the conditional probability table (for a discrete node in a discrete or conditional Gaussian network) or to a vector product (for a continuous node in a Gaussian or conditional Gaussian network). However, it does not use the Bayesian network and the information in the observation being predicted to the fullest because it disregards the rest of the Markov blanket of the node that is the target of the prediction.

Consider a Bayesian network learned and fitted from a `training.set`, with an associated `validation.set` for which we want to predict node `E`.

```> library(bnlearn)
> dtraining.set = learning.test[1:4000, ]
> dvalidation.set = learning.test[4001:5000, ]
> dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]")
> dfitted = bn.fit(dag, dtraining.set)
```

The `method = "parents"` is the default method for prediction, and has no optional arguments.

```> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "parents")
```
``` c b b a c b
Levels: a b c
```
```> head(dvalidation.set\$E)
```
``` c a c b c b
Levels: a b c
```
```> table(dvalidation.set\$E, pred)
```
```   pred
a   b   c
a 176 156  30
b  26 258  36
c  23  99 196
```

Like other prediction methods, if the `prob` argument is set to `TRUE` and the network is a discrete Bayesian network the prediction probabilities for all values of the target variables are attached as an attribute to the predictions. The columns correspond to the observations in `validation.set` and the rows to the values of `E`.

```> pred = predict(dfitted, node = "E", data = dvalidation.set, prob = TRUE)
> attr(pred, "prob")[, 1:5]
```
```       [,1]      [,2]      [,3]       [,4]      [,5]
a 0.2301587 0.4125133 0.2345079 0.81066946 0.1191646
b 0.1785714 0.4772004 0.5018226 0.09309623 0.1117936
c 0.5912698 0.1102863 0.2636695 0.09623431 0.7690418
```

Similarly, for continuous data (this time predicting node `F`).

```> gtraining.set = gaussian.test[1:4000, ]
> gvalidation.set = gaussian.test[4001:5000, ]
> dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
> gfitted = bn.fit(dag, gtraining.set)
> pred = predict(gfitted, node = "F", data = gvalidation.set, method = "parents")
```
``` 18.57683 17.26517 21.57541 28.03751 29.75274 24.87890
```
```> head(gvalidation.set\$F)
```
``` 19.61011 16.12044 19.88823 26.99181 29.55846 25.65805
```
```> cor(gvalidation.set\$F, pred)
```
``` 0.9869819
```

The main shortcoming of predicting from parents is that it fails to produce meaningful predictions for root nodes, since they have no parents. In the case of a discrete node, all predictions will be equal to the mode of its distribution.

```> table(predict(dfitted, node = "A", data = dvalidation.set, method = "parents"))
```
```
a    b    c
0 1000    0
```

In the case of a continuous node, all predictions will be equal to the expected values of its distribution.

```> unique(predict(gfitted, node = "A", data = gvalidation.set, method = "parents"))
```
``` 1.002117
```

### Predicting with Monte Carlo posterior inference

A better approach to prediction is to use some form of inference and predict from more nodes than just the parents of the target node. Likelihood weighting is better suited for this task than logic sampling because it guarantees that none of the particles it generates are discarded, no matter how unlikely the values taken by the predictors are. (However, they must be observable.) This is what `method = "bayes-lw"` does for each observation being predicted:

1. generate a sufficiently large number of particles and the associated likelihood weights using the predictors as the evidence;
2. compute the value with the largest weight mass (the posterior mode, for a discrete target node) or the weighted average of the particles (the posterior expectation, for a continuous target node).

Therefore, `predict()` takes two optional arguments when `method = "bayes-lw"`: `n`, the number of particles to generate for each observation, and `from`, the nodes to use as predictors.

```> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw")
```
``` c b b a c b
Levels: a b c
```
```> head(dvalidation.set\$E)
```
``` c a c b c b
Levels: a b c
```
```> table(dvalidation.set\$E, pred)
```
```   pred
a   b   c
a 183 149  30
b  27 257  36
c  24  98 196
```

The predicted values are computed from the empirical posterior distribution of the particles, so in general they may be different each time we invoke `predict()`. Generating larger number of particles makes predictions more stable at the cost of longer running times.

```> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", n = 500))
```
```   user  system elapsed
0.104   0.000   0.104
```
```> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", n = 5000))
```
```   user  system elapsed
0.784   0.000   0.785
```

As for the predictors, at a minimum `from` should contain the nodes that are part of the Markov blanket of the target node to achieve the best possible accuracy. By default, `from` contains all nodes in the network apart the predictor itself to minimize the computational effort of generating the particles. However, we can choose to use any valid set of predictors depending on the application.

```> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", from = c("A", "F"))
```
``` c b b a c b
Levels: a b c
```

### Predicting with exact inference

Similarly, we can use exact inference instead of approximate inference to compute predictions. On the one hand, the predicted values will not have any simulation variability because we no longer rely on Monte Carlo simulations. On the other hand, exact inference is often more computationally expensive than approximate inference, especially for large Bayesian networks.

The `method = "exact"` leverages either junction trees (interfacing with the gRain package) or the closed-form properties of the multivariate normal distribution to produce predictions for discrete and Gaussian Bayesian networks. Conditional Gaussian networks are not supported. It has an optional argument `from` like the `"bayes-lw"` method, which by default contains the nodes in the Markov blanket of the target node.

```> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "exact")
```
```Loading required namespace: gRain
```
```
Attaching package: 'gRbase'
```
```The following objects are masked from 'package:bnlearn':

ancestors, children, nodes, parents
```
```> head(pred)
```
``` c b b a c b
Levels: a b c
```
```> pred = predict(gfitted, node = "F", data = gvalidation.set, method = "parents")
```
``` 18.57683 17.26517 21.57541 28.03751 29.75274 24.87890
```

The number of nodes in `from` and their location in the network relative to the target node has a strong influence the prediction speed. In fact, it can make `method = "exact"` faster than `method "bayes-lw"`.

```> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "exact"))
```
```   user  system elapsed
0.035   0.000   0.035
```
```> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw"))
```
```   user  system elapsed
0.085   0.000   0.085
```
```> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "exact", from = c("A", "F", "B")))
```
```   user  system elapsed
0.034   0.000   0.034
```
```> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", from = c("A", "F", "B")))
```
```   user  system elapsed
0.085   0.000   0.084
```

### Predictions in Bayesian network classifiers

Bayesian network classifiers like `naive.bayes()` and `tree.bayes()` (which implements TAN), have separate `predict()` methods that use the closed-form expressions available for those models. Note this method has no `method` argument to choose how to compute the predicted values, but it still has the `prob` argument as illustrated here.

Last updated on `Thu Nov 17 14:20:58 2022` with bnlearn `4.9-20221107` and `R version 4.2.2 (2022-10-31)`.