## Predicting new observations from a Bayesian network

Predicting the values of one or more variables is the prototypical task of a machine learning model. In the context
of Bayesian networks, we take it to involve *just a single variable*: producing the best possible instantiation
of multiple variables is a *most probable explanation* (MPE) problem that deserves a separate implementation.
Prediction is simpler and it can be optimized more effectively.

**bnlearn** provides a `predict()`

function (documented
here) for the fitted Bayesian networks returned by
`bn.fit()`

(illustrated here). `predict()`

provides different methods to
compute predictions, with different trade-offs: `"parents"`

, `"bayes-lw"`

and
`"exact"`

. For all methods, `predict()`

takes

- a
`bn.fit`

object encoding the Bayesian network; - a
`node`

that we would like to predict; - a
`data`

set we will predict it from; - a
`prob`

logical argument to return the prediction probabilities for discrete Bayesian networks;

and returns a vector with the predictions. Note that one or more predicted values may be `NA`

s if the
fitted Bayesian network contains `NA`

parameters, which can be avoided by fitting it using
posteriors estimators or using maximum likelihood estimators with `replace.unidentifiable = TRUE`

.

### Predicting from the parents

The most computationally-efficient way of computing predictions for a variable is to use its local distribution and to find its most probable value conditional on the values of its parents. This does not require any inference: it boils down to a look-up in the conditional probability table (for a discrete node in a discrete or conditional Gaussian network) or to a vector product (for a continuous node in a Gaussian or conditional Gaussian network). However, it does not use the Bayesian network and the information in the observation being predicted to the fullest because it disregards the rest of the Markov blanket of the node that is the target of the prediction.

Consider a Bayesian network learned and fitted from a `training.set`

, with an associated
`validation.set`

for which we want to predict node `E`

.

> library(bnlearn) > dtraining.set = learning.test[1:4000, ] > dvalidation.set = learning.test[4001:5000, ] > dag = model2network("[A][C][F][B|A][D|A:C][E|B:F]") > dfitted = bn.fit(dag, dtraining.set)

The `method = "parents"`

is the default method for prediction, and has no optional arguments.

> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "parents") > head(pred)

[1] c b b a c b Levels: a b c

> head(dvalidation.set$E)

[1] c a c b c b Levels: a b c

> table(dvalidation.set$E, pred)

pred a b c a 176 156 30 b 26 258 36 c 23 99 196

Like other prediction methods, if the `prob`

argument is set to `TRUE`

and the network is a
discrete Bayesian network the prediction probabilities for all values of the target variables are attached as an
attribute to the predictions. The columns correspond to the observations in `validation.set`

and the rows
to the values of `E`

.

> pred = predict(dfitted, node = "E", data = dvalidation.set, prob = TRUE) > attr(pred, "prob")[, 1:5]

[,1] [,2] [,3] [,4] [,5] a 0.2301587 0.4125133 0.2345079 0.81066946 0.1191646 b 0.1785714 0.4772004 0.5018226 0.09309623 0.1117936 c 0.5912698 0.1102863 0.2636695 0.09623431 0.7690418

Similarly, for continuous data (this time predicting node `F`

).

> gtraining.set = gaussian.test[1:4000, ] > gvalidation.set = gaussian.test[4001:5000, ] > dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]") > gfitted = bn.fit(dag, gtraining.set) > pred = predict(gfitted, node = "F", data = gvalidation.set, method = "parents") > head(pred)

[1] 18.57683 17.26517 21.57541 28.03751 29.75274 24.87890

> head(gvalidation.set$F)

[1] 19.61011 16.12044 19.88823 26.99181 29.55846 25.65805

> cor(gvalidation.set$F, pred)

[1] 0.9869819

The main shortcoming of predicting from parents is that it fails to produce meaningful predictions for root nodes, since they have no parents. In the case of a discrete node, all predictions will be equal to the mode of its distribution.

> table(predict(dfitted, node = "A", data = dvalidation.set, method = "parents"))

a b c 0 1000 0

In the case of a continuous node, all predictions will be equal to the expected values of its distribution.

> unique(predict(gfitted, node = "A", data = gvalidation.set, method = "parents"))

[1] 1.002117

### Predicting with Monte Carlo posterior inference

A better approach to prediction is to use some form of inference and predict from more nodes than just the parents of
the target node. Likelihood weighting is better suited for this task than logic sampling because it guarantees that none
of the particles it generates are discarded, no matter how unlikely the values taken by the predictors are. (However,
they must be observable.) This is what `method = "bayes-lw"`

does for each observation being predicted:

- generate a sufficiently large number of particles and the associated likelihood weights using the predictors as the evidence;
- compute the value with the largest weight mass (the posterior mode, for a discrete target node) or the weighted average of the particles (the posterior expectation, for a continuous target node).

Therefore, `predict()`

takes two optional arguments when `method = "bayes-lw"`

: `n`

,
the number of particles to generate for each observation, and `from`

, the nodes to use as predictors.

> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw") > head(pred)

[1] c b b a c b Levels: a b c

> head(dvalidation.set$E)

[1] c a c b c b Levels: a b c

> table(dvalidation.set$E, pred)

pred a b c a 183 149 30 b 27 257 36 c 24 98 196

The predicted values are computed from the empirical posterior distribution of the particles, so in general they may
be different each time we invoke `predict()`

. Generating larger number of particles makes predictions more
stable at the cost of longer running times.

> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", n = 500))

user system elapsed 0.104 0.000 0.104

> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", n = 5000))

user system elapsed 0.784 0.000 0.785

As for the predictors, at a minimum `from`

should contain the nodes that are part of the Markov blanket of
the target node to achieve the best possible accuracy. By default, `from`

contains all nodes in the network
apart the predictor itself to minimize the computational effort of generating the particles. However, we can choose to
use any valid set of predictors depending on the application.

> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", from = c("A", "F")) > head(pred)

[1] c b b a c b Levels: a b c

### Predicting with exact inference

Similarly, we can use exact inference instead of approximate inference to compute predictions. On the one hand, the predicted values will not have any simulation variability because we no longer rely on Monte Carlo simulations. On the other hand, exact inference is often more computationally expensive than approximate inference, especially for large Bayesian networks.

The `method = "exact"`

leverages either junction trees (interfacing with the **gRain**
package) or the closed-form properties of the multivariate normal distribution to produce predictions for discrete and
Gaussian Bayesian networks. Conditional Gaussian networks are not supported. It has an optional argument
`from`

like the `"bayes-lw"`

method, which by default contains the nodes in the Markov blanket of
the target node.

> pred = predict(dfitted, node = "E", data = dvalidation.set, method = "exact")

> head(pred)

[1] c b b a c b Levels: a b c

> pred = predict(gfitted, node = "F", data = gvalidation.set, method = "parents") > head(pred)

[1] 18.57683 17.26517 21.57541 28.03751 29.75274 24.87890

The number of nodes in `from`

and their location in the network relative to the target node has a strong
influence the prediction speed. In fact, it can make `method = "exact"`

faster than
`method "bayes-lw"`

.

> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "exact"))

user system elapsed 0.035 0.000 0.035

> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw"))

user system elapsed 0.085 0.000 0.085

> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "exact", from = c("A", "F", "B")))

user system elapsed 0.034 0.000 0.034

> system.time(predict(dfitted, node = "E", data = dvalidation.set, method = "bayes-lw", from = c("A", "F", "B")))

user system elapsed 0.085 0.000 0.084

### Predictions in Bayesian network classifiers

Bayesian network classifiers like `naive.bayes()`

and `tree.bayes()`

(which implements TAN),
have separate `predict()`

methods that use the closed-form expressions available for those models. Note
this method has no `method`

argument to choose how to compute the predicted values, but it still has the
`prob`

argument as illustrated here.

`Thu Nov 17 14:20:58 2022`

with **bnlearn**

`4.9-20221107`

and `R version 4.2.2 (2022-10-31)`

.