Analysis of the MAGIC population in Scutari et al., Genetics (2014)
This is a short HOWTO describing how to fit the averaged Bayesian network in “Multiple Quantitative Trait Analysis Using Bayesian Networks” by Scutari, Howell, Balding, Mackay (Genetics, 2014).
Reading and preparing the preprocessed data
First we load the data and the required libraries: lme4 (to adjust
for family structure), bnlearn (to learn the model and perform
predictions) and parallel (to speed up learning). The preprocessed data
is in a file named prepd-magic.txt.xz
, which can be downloaded
here.
> library(lme4)
> library(bnlearn) > library(parallel) > > # load the data. > xzfd = xzfile("prepd-magic.txt.xz", open = "r") > magic = read.table(xzfd, header = TRUE, + colClasses = c(rep("factor", 4), rep("numeric", 3164))) > close(xzfd)
Afterwards, we substitute short labels to the marker names, which are quite long and make plotting problematic. In addition we identify which variables in the data are traits, which are markers, which contain variety IDs and pedigree information.
> # code the gene names. > names(magic)[12:ncol(magic)] = paste("G", 12:ncol(magic) - 11, sep = "") > > # split the variables. > ids = names(magic)[1:4] > traits = names(magic)[5:11] > genes = names(magic)[12:ncol(magic)]
Finally, we extract observations with incomplete data (in the traits and variety information, missing values in the markers have all been imputed).
> # separate observations with missing data from the rest. > partial = magic[!complete.cases(magic), ] > magic = magic[complete.cases(magic), ]
Performing cross-validation
The Bayesian networks model is fitted by the fit.the.model()
function below, which takes the data
and the type I error threshold
alpha
to use for structure learning as arguments.
> fit.the.model = function(data, alpha) { + cpc = vector(length(traits), mode = "list") + names(cpc) = traits + # find the parents of each trait (may be genes or other traits). + for (t in seq_along(traits)) { + # BLUP away the family structure. + m = lmer(as.formula(paste(traits[t], "~ (1|FUNNEL:PLANT)")), data = data) + data[!is.na(data[, traits[t]]), traits[t]] = + data[, traits[t]] - ranef(m)[[1]][paste(data$FUNNEL, data$PLANT, sep = ":"), 1] + # find out the parents. + cpc[[t]] = learn.nbr(data[, c(traits, genes)], node = traits[t], debug = FALSE, + method = "si.hiton.pc", test = "cor", alpha = alpha) + }#FOR + # merge the relevant variables to use for learning. + nodes = unique(c(traits, unlist(cpc))) + # yield has no children, and genes cannot depend on traits. + blacklist = tiers2blacklist(list(nodes[!(nodes %in% traits)], + traits[traits != "YLD"], "YLD")) + # build the Bayesian network. + bn = hc(data[, nodes], blacklist = blacklist) + return(bn) + }#FIT.THE.MODEL
The general layout of fit.the.model()
is as follows:
- First, we remove family structure using the pedigree information stored in the
FUNNEL
andPLANT
. This is done for each trait in turn usinglmer()
to fit a linear mixed model with a random effect for the combination of these variables; the estimated effect is then subtracted from the raw value of the trait. Note that this procedure is specific to the data, others will require different adjustments using either pedigree- or marker-based measures of kinship. - Subsequently, we use
learn.nbr()
to run SI-HITON-PC on each of the traits and identify their parents and children in the Bayesian network. We save them in thecpc
variable. The parents of a trait can be either markers or other traits; children can only be traits. The test used to assess (conditional) independence is the exact t test for Pearson' correlation (test = "cor"
) with the specifiedalpha
threshold. - We then merge all the elements of
cpc
and the traits into a vector of relevantnodes
to be used in learning the Bayesian networks. Markers that are not included in this list have not been found to be directly related to any trait and are therefore discarded from the analysis. - The
nodes
are then partitioned in three subsets: markers, traits other than yield, and yield. Thetiers2blacklist()
function creates a blacklist that prevents arcs from going to yield to any other node, and from traits to the markers. These restrictions restrict the space of the candidate models further and have the aim of forcing causal relationships to flow in the right direction while learning them from the data. - Finally, we learn the structure of the Bayesian network by maximising BIC
with hill-climbing given the
blacklist
we just generated.
In order to have multiple models to average and to assess the predictive power
of the Bayesian network model, we run fit.the.model()
under 10-fold
cross-validation for 10 times. Since structure learning has been customised for
the analysis, we cannot use bn.cv()
from bnlearn.
Instead, we implement a custom xval.the.model()
function, shown below.
> xval.the.model = function(data, k = 10, cluster, alpha, ridge) { + n = nrow(data) + predcor = numeric(length(traits)) + names(predcor) = traits + postcor = numeric(length(traits)) + names(postcor) = traits + # shuffle the data to get unbiased splits. + kcv = split(sample(n), seq_len(k)) + # store the length of each test set. + kcv.length = sapply(kcv, length) + predicted = parLapply(kcv, cl = cluster, function(test) { + # create a matrix to store the predicted values. + pred = matrix(0, nrow = length(test), ncol = length(traits)) + colnames(pred) = traits + # create a matrix to store posterior estimates. + post = matrix(0, nrow = length(test), ncol = length(traits)) + colnames(post) = traits + cat("* beginning cross-validation fold.\n") + # split training and test. + dtraining = data[-test, ] + dtest = data[test, ] + # fit the model on the training data. + model = fit.the.model(dtraining, alpha = alpha) + fitted = bn.fit(model, dtraining[, nodes(model)]) + # maybe re-fit with ridge regression. + if (ridge) { + library(penalized) + for (no in nodes(fitted)) { + node.parents = parents(fitted, no) + if (length(node.parents) < 3) + next + opt.lambda = optL2(response = dtraining[, no], + penalized = dtraining[, node.parents], + model = "linear", trace = FALSE, + minlambda2 = 10e-5, maxlambda = 500)$lambda + fitted[[no]] = penalized(response = dtraining[, no], + penalized = dtraining[, node.parents], + model = "linear", trace = FALSE, + lambda1 = 0, lambda2 = opt.lambda) + }#FOR + }#THEN + # subset the test data. + dtest = dtest[, nodes(model)] + cat(" > model has", length(nodes(model)), "nodes.\n") + # predict each trait in turn, given all the parents. + for (t in traits) + pred[, t] = predict(fitted, node = t, data = dtest[, nodes(model)]) + for (i in seq(nrow(dtest))) + post[i, traits] = colMeans(cpdist(fitted, nodes = traits, + evidence = as.list(dtest[i, names(dtest) %in% genes]), + method = "lw", n = 1000)) + return(list(model = fitted, pred = pred, post = post)) + }) + # merge all the predicted values. + posterior = do.call(rbind, lapply(predicted, `[[`, "post")) + causal = do.call(rbind, lapply(predicted, `[[`, "pred")) + cat("* overall cross-validated correlations:\n") + for (t in traits) { + predcor[t] = cor(causal[, t], data[unlist(kcv), t]) + cat(" > PREDCOR(", t, "):", predcor[t], "\n") + postcor[t] = cor(posterior[, t], data[unlist(kcv), t]) + cat(" > POSTCOR(", t, "):", postcor[t], "\n") + }#FOR + return(list(predicted = causal, posterior = posterior, + observed = data[unlist(kcv), t], predcor = predcor, postcor = postcor, + models = lapply(predicted, `[[`, "model"))) + }#XVAL.THE.MODEL
The general layout of xval.the.model()
is as follows:
- We
split()
the sample into 10 folds, allocating observations at random. Folds are saved in a list,kcv
, which then passed toparLapply()
to run the analysis over multiple processors. - Each fold is used as the test data set (
dtest
) while the rest of the data set is used as the training set (dtrain
). We learn the structure of the Bayesian network ondtrain
withfit.the.model
, the we fit its parameters withbn.fit()
. This produces ordinary least squares estimates of the regression coefficients in the local distributions. - As an alternative, if
ridge = TRUE
the parameters using the ridge regression implementation in the penalized package:optL2()
finds the optimal λ2 penalty andpenalized()
estimates the regression coefficients using that λ2. bnlearn takes care of extracting the coefficients from the models and storing them in the fitted Bayesian network. - Subsequently, we produce predictions for the traits in
dtest
. Causal predictions are computed withpredict()
conditional on the parents of the trait, which can be considered putative direct causal effects. Genetic predictions are the expected values of the posterior density of the traits given the markers, as computed usingcolMeans()
,cpdist()
and likelihood weighting. The former are used to compute ρC, the latter for ρG. Note that in general the average should be weighted, but in this case all the weights are identical due to theblacklist
used infit.the.model()
.
Using α=0.01 and least squares estimation as an example, we will perform the rest of the analysis below., First, we run the 10 rounds of cross-validation to produce a set of 100 models and the corresponding estimates of predictive ability.
> cl = makeCluster(10) > invisible(clusterEvalQ(cl, library(bnlearn))) > invisible(clusterEvalQ(cl, library(lme4))) > clusterExport(cl = cl, c("traits", "genes", "ids", "fit.the.model")) > pr001 = vector(10, mode = "list") > for (i in 1:10) + pr001[[i]] = xval.the.model(magic, cluster = cl, alpha = 0.01, ridge = FALSE) > stopCluster(cl)
We can compute the average predictive correlation by extracting the values from
the 10 runs of cross-validation with sapply()
and averaging them with
rowMeans()
.
> pred.summary = sapply(pr001, `[[`, "predcor") > print(rowMeans(pred.summary))
YR.GLASS YLD HT YR.FIELD FUS MIL FT 0.6200367 0.3678327 0.4682776 0.4460342 0.3221489 0.1300052 0.2755232
> post.summary = sapply(pr001, `[[`, "postcor") > print(rowMeans(post.summary))
YR.GLASS YLD HT YR.FIELD FUS MIL FT 0.6064308 0.1879596 0.4692378 0.3834637 0.2549408 0.1200143 0.2756023
Averaging the network structures
Similarly, we can compute the averaged network structure by extracting all the
Bayesian networks, save their arc sets in a list (arclist
) and
passing it to custom.strength()
. Note that different networks will
have different node sets, so we need to merge them and pass the result as the
nodes
argument. Arcs are identified as significant by
averaged.network()
based on their frequency in the cross-validated
networks, which is computed by custom.strength()
.
> # gather all the arc lists. > arclist = list() > > for (i in seq_along(pr001)) { + # extract the models. + run = pr001[[i]]$models + for (j in seq_along(run)) + arclist[[length(arclist) + 1]] = arcs(run[[j]]) + }#FOR > > # compute the arc strengths. > nodes = unique(unlist(arclist)) > strength = custom.strength(arclist, nodes = nodes) > # estimate the threshold and average the networks. > averaged = averaged.network(strength)
The averaged network averaged
and the arc strengths strength
can then be used to produce a plot similar to that included in the paper. For readability,
it is useful to remove isolated nodes beforehand; it may happen that all arcs incident on
a node have been deemed non-significant by averaged.network()
and thus the
node has no actual impact on the model.
Plotting the averaged network
> # subset the network to remove isolated nodes. > relevant.nodes = nodes(averaged)[sapply(nodes, degree, object = averaged) > 0] > averaged2 = subgraph(averaged, relevant.nodes) > strength2 = strength[(strength$from %in% relevant.nodes) & + (strength$to %in% relevant.nodes), ] > attr(strength2, "nodes") = relevant.nodes > gR = strength.plot(averaged2, strength2, shape = "rectangle", layout = "fdp")
We can add colours to the plot using the gR
object returned by
strength.plot()
; and the nodeRenderInfo()
and
edgeRenderInfo()
from the Rgraphviz package.
renderGraph()
can then be called to plot the network again.
> nodeRenderInfo(gR)$fill = "lightblue" > nodeRenderInfo(gR)$fill = "lightblue" > nodeRenderInfo(gR)$col = "darkblue" > nodeRenderInfo(gR)$fill[traits] = "limegreen" > nodeRenderInfo(gR)$col[traits] = "darkgreen" > a = arcs(subgraph(averaged, traits)) > a = as.character(interaction(a[, "from"], a[, "to"], sep = "~")) > edgeRenderInfo(gR)$col = "grey" > edgeRenderInfo(gR)$col[a] = "darkgreen" > renderGraph(gR)
Fri Nov 25 19:36:06 2022
with bnlearn
4.9-20221107
and R version 4.2.2 Patched (2022-11-10 r83330)
.