## Analysis of the MAGIC population in Scutari *et al.*, Genetics (2014)

This is a short HOWTO describing how to fit the averaged Bayesian network in “Multiple Quantitative Trait Analysis Using Bayesian Networks” by Scutari, Howell, Balding, Mackay (Genetics, 2014).

### Reading and preparing the preprocessed data

First we load the data and the required libraries: **lme4** (to adjust
for family structure), **bnlearn** (to learn the model and perform
predictions) and **parallel** (to speed up learning). The preprocessed data
is in a file named `prepd-magic.txt.xz`

, which can be downloaded
here.

> library(lme4)

> library(bnlearn)

> library(parallel) > > # load the data. > xzfd = xzfile("prepd-magic.txt.xz", open = "r") > magic = read.table(xzfd, header = TRUE, + colClasses = c(rep("factor", 4), rep("numeric", 3164))) > close(xzfd)

Afterwards, we substitute short labels to the marker names, which are quite long and make plotting problematic. In addition we identify which variables in the data are traits, which are markers, which contain variety IDs and pedigree information.

> # code the gene names. > names(magic)[12:ncol(magic)] = paste("G", 12:ncol(magic) - 11, sep = "") > > # split the variables. > ids = names(magic)[1:4] > traits = names(magic)[5:11] > genes = names(magic)[12:ncol(magic)]

Finally, we extract observations with incomplete data (in the traits and variety information, missing values in the markers have all been imputed).

> # separate observations with missing data from the rest. > partial = magic[!complete.cases(magic), ] > magic = magic[complete.cases(magic), ]

### Performing cross-validation

The Bayesian networks model is fitted by the `fit.the.model()`

function below, which takes the `data`

and the type I error threshold
`alpha`

to use for structure learning as arguments.

> fit.the.model = function(data, alpha) { + cpc = vector(length(traits), mode = "list") + names(cpc) = traits + # find the parents of each trait (may be genes or other traits). + for (t in seq_along(traits)) { + # BLUP away the family structure. + m = lmer(as.formula(paste(traits[t], "~ (1|FUNNEL:PLANT)")), data = data) + data[!is.na(data[, traits[t]]), traits[t]] = + data[, traits[t]] - ranef(m)[[1]][paste(data$FUNNEL, data$PLANT, sep = ":"), 1] + # find out the parents. + cpc[[t]] = learn.nbr(data[, c(traits, genes)], node = traits[t], debug = FALSE, + method = "si.hiton.pc", test = "cor", alpha = alpha) + }#FOR + # merge the relevant variables to use for learning. + nodes = unique(c(traits, unlist(cpc))) + # yield has no children, and genes cannot depend on traits. + blacklist = tiers2blacklist(list(nodes[!(nodes %in% traits)], + traits[traits != "YLD"], "YLD")) + # build the Bayesian network. + bn = hc(data[, nodes], blacklist = blacklist) + return(bn) + }#FIT.THE.MODEL

The general layout of `fit.the.model()`

is as follows:

- First, we remove family structure using the pedigree information stored in the
`FUNNEL`

and`PLANT`

. This is done for each trait in turn using`lmer()`

to fit a linear mixed model with a random effect for the combination of these variables; the estimated effect is then subtracted from the raw value of the trait. Note that this procedure is specific to the data, others will require different adjustments using either pedigree- or marker-based measures of kinship. - Subsequently, we use
`learn.nbr()`

to run SI-HITON-PC on each of the traits and identify their parents and children in the Bayesian network. We save them in the`cpc`

variable. The parents of a trait can be either markers or other traits; children can only be traits. The test used to assess (conditional) independence is the exact*t*test for Pearson' correlation (`test = "cor"`

) with the specified`alpha`

threshold. - We then merge all the elements of
`cpc`

and the traits into a vector of relevant`nodes`

to be used in learning the Bayesian networks. Markers that are not included in this list have not been found to be directly related to any trait and are therefore discarded from the analysis. - The
`nodes`

are then partitioned in three subsets: markers, traits other than yield, and yield. The`tiers2blacklist()`

function creates a blacklist that prevents arcs from going to yield to any other node, and from traits to the markers. These restrictions restrict the space of the candidate models further and have the aim of forcing causal relationships to flow in the right direction while learning them from the data. - Finally, we learn the structure of the Bayesian network by maximising BIC
with hill-climbing given the
`blacklist`

we just generated.

In order to have multiple models to average and to assess the predictive power
of the Bayesian network model, we run `fit.the.model()`

under 10-fold
cross-validation for 10 times. Since structure learning has been customised for
the analysis, we cannot use `bn.cv()`

from **bnlearn**.
Instead, we implement a custom `xval.the.model()`

function, shown below.

> xval.the.model = function(data, k = 10, cluster, alpha, ridge) { + n = nrow(data) + predcor = numeric(length(traits)) + names(predcor) = traits + postcor = numeric(length(traits)) + names(postcor) = traits + # shuffle the data to get unbiased splits. + kcv = split(sample(n), seq_len(k)) + # store the length of each test set. + kcv.length = sapply(kcv, length) + predicted = parLapply(kcv, cl = cluster, function(test) { + # create a matrix to store the predicted values. + pred = matrix(0, nrow = length(test), ncol = length(traits)) + colnames(pred) = traits + # create a matrix to store posterior estimates. + post = matrix(0, nrow = length(test), ncol = length(traits)) + colnames(post) = traits + cat("* beginning cross-validation fold.\n") + # split training and test. + dtraining = data[-test, ] + dtest = data[test, ] + # fit the model on the training data. + model = fit.the.model(dtraining, alpha = alpha) + fitted = bn.fit(model, dtraining[, nodes(model)]) + # maybe re-fit with ridge regression. + if (ridge) { + library(penalized) + for (no in nodes(fitted)) { + node.parents = parents(fitted, no) + if (length(node.parents) < 3) + next + opt.lambda = optL2(response = dtraining[, no], + penalized = dtraining[, node.parents], + model = "linear", trace = FALSE, + minlambda2 = 10e-5, maxlambda = 500)$lambda + fitted[[no]] = penalized(response = dtraining[, no], + penalized = dtraining[, node.parents], + model = "linear", trace = FALSE, + lambda1 = 0, lambda2 = opt.lambda) + }#FOR + }#THEN + # subset the test data. + dtest = dtest[, nodes(model)] + cat(" > model has", length(nodes(model)), "nodes.\n") + # predict each trait in turn, given all the parents. + for (t in traits) + pred[, t] = predict(fitted, node = t, data = dtest[, nodes(model)]) + for (i in seq(nrow(dtest))) + post[i, traits] = colMeans(cpdist(fitted, nodes = traits, + evidence = as.list(dtest[i, names(dtest) %in% genes]), + method = "lw", n = 1000)) + return(list(model = fitted, pred = pred, post = post)) + }) + # merge all the predicted values. + posterior = do.call(rbind, lapply(predicted, `[[`, "post")) + causal = do.call(rbind, lapply(predicted, `[[`, "pred")) + cat("* overall cross-validated correlations:\n") + for (t in traits) { + predcor[t] = cor(causal[, t], data[unlist(kcv), t]) + cat(" > PREDCOR(", t, "):", predcor[t], "\n") + postcor[t] = cor(posterior[, t], data[unlist(kcv), t]) + cat(" > POSTCOR(", t, "):", postcor[t], "\n") + }#FOR + return(list(predicted = causal, posterior = posterior, + observed = data[unlist(kcv), t], predcor = predcor, postcor = postcor, + models = lapply(predicted, `[[`, "model"))) + }#XVAL.THE.MODEL

The general layout of `xval.the.model()`

is as follows:

- We
`split()`

the sample into 10 folds, allocating observations at random. Folds are saved in a list,`kcv`

, which then passed to`parLapply()`

to run the analysis over multiple processors. - Each fold is used as the test data set (
`dtest`

) while the rest of the data set is used as the training set (`dtrain`

). We learn the structure of the Bayesian network on`dtrain`

with`fit.the.model`

, the we fit its parameters with`bn.fit()`

. This produces ordinary least squares estimates of the regression coefficients in the local distributions. - As an alternative, if
`ridge = TRUE`

the parameters using the ridge regression implementation in the**penalized**package:`optL2()`

finds the optimal λ_{2}penalty and`penalized()`

estimates the regression coefficients using that λ_{2}.**bnlearn**takes care of extracting the coefficients from the models and storing them in the fitted Bayesian network. - Subsequently, we produce predictions for the traits in
`dtest`

.*Causal predictions*are computed with`predict()`

conditional on the parents of the trait, which can be considered putative direct causal effects.*Genetic predictions*are the expected values of the posterior density of the traits given the markers, as computed using`colMeans()`

,`cpdist()`

and likelihood weighting. The former are used to compute ρ_{C}, the latter for ρ_{G}. Note that in general the average should be weighted, but in this case all the weights are identical due to the`blacklist`

used in`fit.the.model()`

.

Using α=0.01 and least squares estimation as an example, we will perform the rest of the analysis below., First, we run the 10 rounds of cross-validation to produce a set of 100 models and the corresponding estimates of predictive ability.

> cl = makeCluster(10) > invisible(clusterEvalQ(cl, library(bnlearn))) > invisible(clusterEvalQ(cl, library(lme4))) > clusterExport(cl = cl, c("traits", "genes", "ids", "fit.the.model")) > pr001 = vector(10, mode = "list") > for (i in 1:10) + pr001[[i]] = xval.the.model(magic, cluster = cl, alpha = 0.01, ridge = FALSE) > stopCluster(cl)

We can compute the average predictive correlation by extracting the values from
the 10 runs of cross-validation with `sapply()`

and averaging them with
`rowMeans()`

.

> pred.summary = sapply(pr001, `[[`, "predcor") > print(rowMeans(pred.summary))

YR.GLASS YLD HT YR.FIELD FUS MIL FT 0.6154429 0.3556396 0.4590911 0.4510041 0.3169979 0.1467224 0.2655817

> post.summary = sapply(pr001, `[[`, "postcor") > print(rowMeans(post.summary))

YR.GLASS YLD HT YR.FIELD FUS MIL FT 0.6043912 0.1775958 0.4570592 0.3789709 0.2477976 0.1313265 0.2643684

### Averaging the network structures

Similarly, we can compute the averaged network structure by extracting all the
Bayesian networks, save their arc sets in a list (`arclist`

) and
passing it to `custom.strength()`

. Note that different networks will
have different node sets, so we need to merge them and pass the result as the
`nodes`

argument. Arcs are identified as significant by
`averaged.network()`

based on their frequency in the cross-validated
networks, which is computed by `custom.strength()`

.

> # gather all the arc lists. > arclist = list() > > for (i in seq_along(pr001)) { + # extract the models. + run = pr001[[i]]$models + for (j in seq_along(run)) + arclist[[length(arclist) + 1]] = arcs(run[[j]]) + }#FOR > > # compute the arc strengths. > nodes = unique(unlist(arclist)) > strength = custom.strength(arclist, nodes = nodes) > # estimate the threshold and average the networks. > averaged = averaged.network(strength)

The averaged network `averaged`

and the arc strengths `strength`

can then be used to produce a plot similar to that included in the paper. For readability,
it is useful to remove isolated nodes beforehand; it may happen that all arcs incident on
a node have been deemed non-significant by `averaged.network()`

and thus the
node has no actual impact on the model.

### Plotting the averaged network

> # subset the network to remove isolated nodes. > relevant.nodes = nodes(averaged)[sapply(nodes, degree, object = averaged) > 0] > averaged2 = subgraph(averaged, relevant.nodes) > strength2 = strength[(strength$from %in% relevant.nodes) & + (strength$to %in% relevant.nodes), ] > gR = strength.plot(averaged2, strength2, shape = "rectangle", layout = "fdp")

We can add colours to the plot using the `gR`

object returned by
`strength.plot()`

; and the `nodeRenderInfo()`

and
`edgeRenderInfo()`

from the **Rgraphviz** package.
`renderGraph()`

can then be called to plot the network again.

> nodeRenderInfo(gR)$fill = "lightblue" > nodeRenderInfo(gR)$fill = "lightblue" > nodeRenderInfo(gR)$col = "darkblue" > nodeRenderInfo(gR)$fill[traits] = "limegreen" > nodeRenderInfo(gR)$col[traits] = "darkgreen" > a = arcs(subgraph(averaged, traits)) > a = as.character(interaction(a[, "from"], a[, "to"], sep = "~")) > edgeRenderInfo(gR)$col = "grey" > edgeRenderInfo(gR)$col[a] = "darkgreen" > renderGraph(gR)

`Sun Jul 17 22:36:55 2016`

with **bnlearn**

`4.1-20160617`

and `R version 3.0.2 (2013-09-25)`

.