Reproducing the causal signalling network in Sachs et al., Science (2005)
This is a short HOWTO describing how to reproduce the causal signalling network analysis in “Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data” by Sachs, Perez, Pe'er, Lauffenburger and Nolan (2005, Science). The goal of the original analysis was to learn the causal pathways linking a set of 11 proteins from a combination of observational and experimental data subject to various knock-outs and spikings. The pathways linking those proteins were in fact already known from the literature, so it was possible for the authors to compare the model they learned from the data to the "golden standard" model and thus validate Bayesian networks as a powerful tool in systems biology and genetics. This golden standard model is shown below.
The raw data
The data consist in the simultaneous measurements of 11 phosphorylated proteins and phospholipids derived from thousands of individual primary immune system cells, subjected to both general and specific molecular interventions. The former ensure that the relevant signalling pathways are active, while the latter make causal inference possible by elucidating arc directions through stimulatory cues and inhibitory interventions.
> sachs = read.table("sachs.data.txt", header = TRUE)
> head(sachs)
   Raf   Mek  Plcg  PIP2  PIP3   Erk  Akt PKA   PKC  P38  Jnk
1 26.4 13.20  8.82 18.30 58.80  6.61 17.0 414 17.00 44.9 40.0
2 35.9 16.50 12.30 16.80  8.13 18.60 32.5 352  3.37 16.5 61.5
3 59.4 44.10 14.60 10.20 13.00 14.90 32.5 403 11.40 31.9 19.5
4 73.0 82.80 23.10 13.50  1.29  5.83 11.8 528 13.70 28.6 23.1
5 33.7 19.80  5.19  9.73 24.80 21.10 46.1 305  4.66 25.7 81.3
6 18.8  3.75 17.60 22.10 10.90 11.90 25.7 610 13.70 49.1 57.8
The data are continuous, as they represent the concentration of the molecules under investigation. Therefore, the standard approach in the literature is to assume the concentrations follow a Gaussian distribution and to use a Gaussian Bayesian network (GBN) to build the protein-signalling network.
However, the DAGs learned under this parametric assumption are not satisfactory. For example, using Inter-IAMB we obtain a DAG with only 8 arcs and only 2 of them are directed.
> library(bnlearn)
> dag.iamb = inter.iamb(sachs, test = "cor")
> narcs(dag.iamb)
[1] 8
> directed.arcs(dag.iamb)
     from  to   
[1,] "P38" "PKC"
[2,] "Jnk" "PKC"
Other combinations of constraint-based algorithms and conditional independence tests, such as gs() with test = "mc-cor", return the same DAG. The same is true for score-based and hybrid algorithms. If we compare dag.iamb with the validated DAG in the figure (minus the arcs that were missed in the original paper), we can see that they have completely different structures.
> sachs.modelstring <-
+   paste("[PKC][PKA|PKC][Raf|PKC:PKA][Mek|PKC:PKA:Raf]",
+         "[Erk|Mek:PKA][Akt|Erk:PKA][P38|PKC:PKA]",
+         "[Jnk|PKC:PKA][Plcg][PIP3|Plcg][PIP2|Plcg:PIP3]", sep = "")
> dag.sachs = model2network(sachs.modelstring)
> unlist(compare(dag.sachs, dag.iamb))
tp fp fn 
 0  8 17 
Comparing the two DAGs again, but disregarding arc directions, reveals that some of the dependencies are correctly identified by inter.iamb(), but their directions are not.
> unlist(compare(skeleton(dag.sachs), skeleton(dag.iamb)))
tp fp fn 
 8  0  9 
The reason for the discrepancy between dag.sachs and dag.iamb is apparent from a graphical exploratory analysis of the data.

Firstly, the empirical distributions of the molecules' concentrations are markedly different from a normal distribution. As an example, we plotted the distributions of Mek, P38, PIP2 and PIP3 above. All are strongly skewed: concentrations are positive numbers, and many of them are small and therefore cluster around zero. As a result, the variables are not symmetric and clearly violate the distributional assumptions underlying GBNs. Secondly, the dependence relationships in the data are not always linear; this is the case, as shown in the figure, for PKA and PKC. Most conditional independence tests and network scores designed to capture linear relationships have very low power in detecting nonlinear ones. In turn, structure learning algorithms using such statistics are not able to correctly learn the arcs in the DAG.
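The departure from normality can also be checked numerically; the following exploratory sketch (our addition, not part of the original analysis) computes the sample skewness of each variable and a Shapiro-Wilk normality test.

```r
# exploratory check of the GBN assumptions: per-variable sample skewness
# (positive values indicate right-skewed distributions)
skew = sapply(sachs, function(x) mean((x - mean(x))^3) / sd(x)^3)
round(skew, 2)
# Shapiro-Wilk tests: p-values near zero indicate clear departures
# from normality
sapply(sachs, function(x) shapiro.test(x)$p.value)
```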
Since we have concluded that GBNs are not appropriate for these data, we must now consider some alternative parametric assumptions. One possibility is to explore monotone transformations like the logarithm. Another is to specify an appropriate conditional distribution for each variable, thus obtaining a hybrid network like those we explored in Chapter. However, this approach requires substantial prior knowledge of the signalling pathways, which may or may not be available. In the case of Sachs et al., such information was indeed available from the literature. However, the aim of the analysis was to use BNs as an automated probabilistic method to verify such information, not to build a BN with prior information and use it as an expert system.
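The first of these alternatives is easy to sketch: apply a log transformation (shifting by one to keep values near zero finite) and re-run structure learning. This is only an illustration of the idea, not a route pursued in the original paper.

```r
# monotone transformation sketch: log-transform the concentrations and
# re-learn the DAG with the same algorithm as before
sachs.log = log(sachs + 1)
dag.log = inter.iamb(sachs.log, test = "cor")
narcs(dag.log)
```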
Consequently, Sachs et al. decided to discretise the data and to model them with a discrete BN, which can accommodate skewness and nonlinear relationships at the cost of losing the ordering information. Since the variables in the BN represent concentration levels, it is intuitively appealing to discretise them into three levels corresponding to low, average and high concentrations.
To that end, we can use the discretize() function in bnlearn, which implements three common discretisation methods:
- quantile discretisation: each variable is discretised independently into k intervals delimited by its 0, 1/k, 2/k, ..., (k-1)/k, 1 empirical quantiles;
- interval discretisation: each variable is discretised independently into k equally-spaced intervals;
- information-preserving discretisation: variables are jointly discretised while preserving as much of the pairwise mutual information between the variables as possible.
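For comparison, the first two methods can be called as follows; both discretise each variable independently of the others (the analysis below uses the third method instead).

```r
# quantile discretisation into three levels per variable
dsachs.q = discretize(sachs, method = "quantile", breaks = 3)
# interval (equal-width) discretisation into three levels per variable
dsachs.i = discretize(sachs, method = "interval", breaks = 3)
```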
The last approach has been introduced by Hartemink (link). The key idea is to initially discretise each variable into a large number k_1 of intervals, thus losing as little information as possible. Subsequently, the algorithm iterates over the variables and collapses, for each of them, the pair of adjacent intervals that minimises the loss of pairwise mutual information. The algorithm stops when all variables have k_2 << k_1 intervals left. The resulting set of discretised variables reflects the dependence structure of the original data much better than either quantile or interval discretisation would allow, because the discretisation takes pairwise dependencies into account. Clearly some information is always lost in the process; for instance, higher-level dependencies are completely disregarded and therefore are not likely to be preserved.
Hartemink's algorithm is implemented in the discretize() function (method = "hartemink"), along with quantile (method = "quantile") and interval discretisation (method = "interval").
> dsachs = discretize(sachs, method = "hartemink",
+   breaks = 3, ibreaks = 60, idisc = "quantile")
The relevant arguments are idisc and ibreaks, which control how the data are initially discretised, and breaks, which specifies the number of levels of each discretised variable. ibreaks corresponds to k_1 in the algorithm above, while breaks corresponds to k_2. Choosing good values for these arguments is a trade-off between quality and speed; high values of ibreaks preserve the characteristics of the original data to a greater extent, whereas smaller values result in much smaller memory usage and shorter running times.
Model averaging
Following the analysis in Sachs et al., we can improve the quality of the structure learned from the data by
averaging multiple CPDAGs. One possible approach to that end is to apply bootstrap resampling to dsachs
and
learn a set of 500 network structures.
> boot = boot.strength(dsachs, R = 500, algorithm = "hc",
+   algorithm.args = list(score = "bde", iss = 10))
In the code above we learn a CPDAG with hill-climbing from each of the R bootstrap samples. Each variable in dsachs is now a factor with three levels, so we use the BDe score with a very low imaginary sample size.
> options(digits = 6)
The boot object returned by boot.strength() is a data frame containing the strength of all possible arcs (in the strength column) and the probability of their direction (in the direction column), conditional on the from and to nodes being connected by an arc. This structure makes it easy to select the most significant arcs, as below.
> boot[(boot$strength > 0.85) & (boot$direction >= 0.5), ]
    from   to strength direction
1    Raf  Mek    1.000  0.513000
23  Plcg PIP2    1.000  0.515000
24  Plcg PIP3    1.000  0.525000
34  PIP2 PIP3    1.000  0.511000
56   Erk  Akt    1.000  0.554000
57   Erk  PKA    0.990  0.560606
67   Akt  PKA    1.000  0.560000
89   PKC  P38    1.000  0.509000
90   PKC  Jnk    1.000  0.507000
100  P38  Jnk    0.946  0.507400
Arcs are considered significant if they appear in at least 85% of the networks, and in the direction that appears
most frequently. Arcs that are score equivalent in all the CPDAGs are considered to appear 50% of the time in each
direction. Since all the values in the direction
column above are close to 0.5, we can infer that the
direction of the arcs is not well established and that they are probably all score equivalent. Interestingly, lowering
the threshold from 85% to 50% does not change the results of the analysis, which seems to indicate that in this case the
results are not sensitive to its value.
Having computed the significance for all possible arcs, we can now build the averaged network using
averaged.network()
and the 85% threshold.
> avg.boot = averaged.network(boot, threshold = 0.85)
The averaged network avg.boot
contains the same arcs as the network learned from the observational data
in Sachs et al., which is shown in Figure. Even though the probability of the presence of each arc and of its
possible directions are computed separately in boot.strength()
, we are not able to determine with any
confidence which direction has better support from the discretised data. Therefore, we remove the directions from the
arcs in avg.boot
, which amounts to constructing its skeleton.
> avg.boot = skeleton(avg.boot)
As an alternative, we can average the results of several hill-climbing searches, each starting from a different DAG.
Such DAGs can be generated randomly from a uniform distribution over the space of connected graphs with the MCMC
algorithm proposed by Ide and Cozman and implemented in random.graph()
as method = "ic-dag"
. This
ensures that no systematic bias is introduced in the learned DAGs. In addition, keeping only one randomly generated DAG
every 50 ensures that the DAGs are different from each other so that the search space is covered as thoroughly as
possible.
> nodes = names(dsachs)
> start = random.graph(nodes = nodes, method = "ic-dag",
+   num = 500, every = 50)
> netlist = lapply(start, function(net) {
+   hc(dsachs, score = "bde", iss = 10, start = net)
+ })
After using lapply() to iterate easily over the DAGs in the start list and pass each of them to hc() with the start argument, we obtain a list of bn objects, which we call netlist. We can pass such a list to the custom.strength() function to obtain a data frame with the same structure as those returned by boot.strength().
> rnd = custom.strength(netlist, nodes = nodes)
> rnd[(rnd$strength > 0.85) & (rnd$direction >= 0.5), ]
    from   to strength direction
1    Raf  Mek        1     0.510
33  PIP2 Plcg        1     0.503
34  PIP2 PIP3        1     0.501
43  PIP3 Plcg        1     0.502
57   Erk  PKA        1     0.503
66   Akt  Erk        1     0.504
67   Akt  PKA        1     0.507
99   P38  PKC        1     0.522
100  P38  Jnk        1     0.508
109  Jnk  PKC        1     0.514
> avg.start = averaged.network(rnd, threshold = 0.85)
The arcs identified as significant with this approach are the same as in avg.boot
(even though some are
reversed), thus confirming the stability of the averaged network obtained from bootstrap resampling. Arc directions are
again very close to 0.5
, to the point we can safely disregard them. A comparison of the equivalence classes
of avg.boot
and avg.start
shows that the two networks are equivalent.
> all.equal(cpdag(avg.boot), cpdag(avg.start))
[1] TRUE
Furthermore, note how averaged networks, like the networks they are computed from, are not necessarily completely
directed. In that case, it is not possible to compute their score directly. However, we can identify the equivalence
class the averaged network belongs to (with cpdag()
) and then select a DAG within that equivalence class
(with cextend()
).
> score(cextend(cpdag(avg.start)), dsachs, type = "bde",
+   iss = 10)
[1] -8498.88
Since all networks in the same equivalence class have the same score (for score-equivalent functions), the value
returned by score()
is a correct estimate for the original, partially directed network.
Choosing the Significance Threshold
The value of the threshold beyond which an arc is considered significant, which is called the significance threshold, does not seem to have a huge influence on the analysis in Sachs et al. In fact, any value between 0.5 and 0.85 yields exactly the same results. So, for instance:
> all.equal(averaged.network(boot, threshold = 0.50),
+   averaged.network(boot, threshold = 0.70))
[1] TRUE
The same holds for avg.start. However, this is often not the case. Therefore, it is important to use a statistically motivated algorithm for choosing a suitable threshold instead of relying on ad-hoc values. A solution to this problem is presented here and implemented in bnlearn as the default value for the threshold argument in averaged.network(). It is used when we do not specify threshold ourselves, as in the code below.
> averaged.network(boot)
  Random/Generated Bayesian network

  model:
   [Raf][Plcg][Erk][PKC][Mek|Raf][PIP2|Plcg][Akt|Erk]
   [P38|PKC][PIP3|Plcg:PIP2][PKA|Erk:Akt][Jnk|PKC:P38]
  nodes:                                 11 
  arcs:                                  10 
    undirected arcs:                     0 
    directed arcs:                       10 
  average markov blanket size:           1.82 
  average neighbourhood size:            1.82 
  average branching factor:              0.91 

  generation algorithm:                  Model Averaging 
  significance threshold:                0.358 
For the dsachs data, the estimated value for the threshold is 0.358; so, any arc with a strength value strictly greater than that is considered significant. The resulting averaged network is similar to that obtained with the 85% threshold from Sachs et al. Compared to avg.boot, only the arc P38 → Jnk is missing, which is an improvement because it was a false positive not present in the validated DAG shown in the figure.
> unlist(compare(cpdag(dag.sachs), cpdag(avg.boot)))
tp fp fn 
 9  1  8 
> unlist(compare(cpdag(dag.sachs),
+   cpdag(averaged.network(boot))))
tp fp fn 
 9  1  8 
The reason for the insensitivity of the averaged network to the value of the threshold is apparent from the plot of the CDF of the arc strengths: arcs that are well supported by the data are clearly separated from those that are not, with the exception of P38 → Jnk (which has strength 0.358, i.e., exactly the estimated threshold). The arc with the highest strength after P38 → Jnk is Plcg → Akt, with strength 0.328, so any threshold that falls between 0.358 and 0.328 results in the same averaged network.
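The CDF in question can be inspected directly: recent versions of bnlearn provide a plot() method for the bn.strength objects returned by boot.strength() (availability may depend on the bnlearn version installed).

```r
# plot the cumulative distribution of the arc strength values; the plot
# makes the gap between well-supported and weak arcs visible, along with
# the estimated significance threshold
plot(boot)
```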
Handling Interventional Data
Usually, all the observations in a sample are collected under the same general conditions. This is true both for observational data, in which treatment allocation is outside the control of the investigator, and for experimental data, which are collected from randomised controlled trials. As a result, the sample can be easily modelled with a single BN, because all the observations follow the same probability distribution.
However, this is not the case when several samples resulting from different experiments are analysed together with a single, encompassing model. Such an approach is called a meta-analysis. First, environmental conditions and other exogenous factors may differ between those experiments. Furthermore, the experiments themselves may differ; for example, they may explore different treatment regimes or target different populations.
This is the case with the protein-signalling data analysed in Sachs et al.. In addition to the data set we have analysed so far, which is subject only to general stimuli meant to activate the desired paths, 9 other data sets subject to different stimulatory cues and inhibitory interventions were used to elucidate the direction of the causal relationships in the network. Such data are often called interventional, because the values of specific variables in the model are set by an external intervention of the investigator.
Overall, the 10 data sets contain 5,400 observations; in addition to the 11 signalling levels analysed above, the
protein which is activated or inhibited (INT
) is recorded for each sample.
> isachs = read.table("sachs.interventional.txt",
+   header = TRUE, colClasses = "factor")
One intuitive way to model these data sets with a single, encompassing BN is to include the intervention
INT
in the network and to make all variables depend on it. This can be achieved with a
whitelist
containing all possible arcs from INT
to the other nodes, thus forcing these arcs to
be present in the learned DAG.
> wh = matrix(c(rep("INT", 11), names(isachs)[1:11]), ncol = 2)
> dag.wh = tabu(isachs, whitelist = wh, score = "bde",
+   iss = 10, tabu = 50)
Using tabu search instead of hill-climbing improves the stability of the score-based search; once a locally optimum DAG is found, tabu search performs an additional 50 iterations (as specified by the tabu argument) to ensure that no other (and potentially better) local optimum is found.
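As a quick sanity check (our addition, not part of the original analysis), we can compare the network score of the tabu search solution with that of a plain hill-climbing search under the same whitelist.

```r
# hill-climbing with the same whitelist and score, for comparison
dag.hc = hc(isachs, whitelist = wh, score = "bde", iss = 10)
# higher (less negative) is better; tabu search should do at least
# as well as hill-climbing here
score(dag.wh, isachs, type = "bde", iss = 10)
score(dag.hc, isachs, type = "bde", iss = 10)
```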
We can also let the structure learning algorithm decide which arcs connecting INT
to the other nodes
should be included in the DAG. To this end, we can use the tiers2blacklist()
function to blacklist all arcs
toward INT
, thus ensuring that only outgoing arcs may be included in the DAG. In the general case,
tiers2blacklist()
builds a blacklist such that all arcs going from a node in a particular element of the
nodes
argument to a node in one of the previous elements are blacklisted.
> tiers = list("INT", names(isachs)[1:11])
> bl = tiers2blacklist(tiers = tiers)
> dag.tiers = tabu(isachs, blacklist = bl,
+   score = "bde", iss = 1, tabu = 50)
The BNs learned with these two approaches are shown below. Some of the structural features detected in Sachs et al. are present in both dag.wh and dag.tiers. For example, the interplay between Plcg, PIP2 and PIP3 and that between PKC, P38 and Jnk are both modelled correctly. The lack of any direct intervention on PIP2 is also correctly modelled in dag.tiers. The most noticeable feature missing from both DAGs is the pathway linking Raf to Akt through Mek and Erk.
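To quantify how much the two learned DAGs agree with each other, we can reuse compare() on their skeletons, as we did earlier; this check is our addition, not part of the original analysis.

```r
# overlap between the two learned DAGs, disregarding arc directions;
# both networks contain the INT node, so their node sets match
unlist(compare(skeleton(dag.wh), skeleton(dag.tiers)))
```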
The approach used in Sachs et al. yields much better results. Instead of including the interventions in the
network as an additional node, they used a modified BDe score (labelled "mbde"
in bnlearn)
incorporating the effects of the interventions into the score components associated with each node. When we are controlling the value of a node experimentally, its value is not determined by the other nodes in the BN but by the experimenter's intervention. Accordingly, mbde disregards the effects of the parents on a controlled node for the observations that are subject to interventions on that node, while behaving as the standard bde for all other observations.
Since the value of INT
identifies which node is subject to either a stimulatory cue or an inhibitory
intervention for each observation, we can easily construct a named list of which observations are manipulated for each
node.
> INT = sapply(1:11, function(x) {
+   which(isachs$INT == x) })
> nodes = names(isachs)[1:11]
> names(INT) = nodes
We can then pass this list to tabu() as an additional argument for mbde. In addition, we can combine the use of mbde with model averaging and random starting points as discussed above. To improve the stability of the averaged network, we generate a set of random starting networks for the tabu searches, keeping only one generated network every 100 to obtain a more diverse set.
> start = random.graph(nodes = nodes, method = "melancon",
+   num = 500, burn.in = 10^5, every = 100)
> netlist = lapply(start, function(net) {
+   tabu(isachs[, 1:11], score = "mbde", exp = INT,
+     iss = 1, start = net, tabu = 50)
+ })
> intscore = custom.strength(netlist, nodes = nodes,
+   cpdag = FALSE)
Note that we have set cpdag = FALSE
in custom.strength()
, so that the DAGs being averaged
are not transformed into CPDAGs before computing arc strengths and direction probabilities. The reason is that, unlike
the BDe score, mbde
is not score equivalent as a consequence of incorporating the interventions. In fact,
information about the interventions is a form of prior information, and makes it possible to identify arc directions
even when the arc would be score equivalent from the data alone.
Averaging the DAGs with averaged.network() and the default threshold, we are finally able to correctly identify all the arcs in the network from Sachs et al.
> dag.mbde = averaged.network(intscore)
> unlist(compare(dag.sachs, dag.mbde))
tp fp fn 
17  8  0 
As we can see, dag.mbde
is much closer to the validated network from Sachs et al. than any
of the other BNs. All the arcs from the validated network are correctly learned; in the output above we have 17 true
positives and 0 false negatives, and the original network contains 17 arcs. The three arcs that were missing in Sachs
et al. are missing in dag.mbde
as well. The arcs from dag.mbde
that are not present
in the validated network were identified in Sachs et al. and discarded due to their comparatively low strength;
this may imply that the simulated annealing algorithm used in Sachs et al. performs better on this data set
than tabu search.
Querying the Network
In their paper, Sachs et al. used the validated BN to substantiate two claims:
- a direct perturbation of Erk should influence Akt;
- a direct perturbation of Erk should not influence PKA.
The probability distributions of Erk
, Akt
and PKA
were then compared with the
results of two ad-hoc experiments to confirm the validity and the direction of the inferred causal influences.
Given the size of the BN, we can perform the queries corresponding to the two claims above using any exact and
approximate inference algorithm. First, we need to create a bn.fit
object for the validated network
structure from Sachs et al..
> isachs = isachs[, 1:11]
> for (i in names(isachs))
+   levels(isachs[, i]) = c("LOW", "AVG", "HIGH")
> fitted = bn.fit(dag.sachs, isachs, method = "bayes")
The INT
variable, which codifies the intervention applied to each observation, is not needed for
inference and is therefore dropped from the data set. Furthermore, we rename the expression levels of each protein to
make both the subsequent code and its output more readable.
Subsequently, we can perform the two queries using the junction tree algorithm provided by the gRain package.
> library(gRain)
> jtree = compile(as.grain(fitted))
We can introduce the direct perturbation of Erk required by both queries by calling setEvidence() as follows. In causal terms, this would be an ideal intervention.
> jlow = setEvidence(jtree, nodes = "Erk", states = "LOW")
As we can see from the code below, the marginal distribution of Akt
changes depending on whether we take
the evidence (intervention) into account or not.
> querygrain(jtree, nodes = "Akt")$Akt
Akt
      LOW       AVG      HIGH 
0.6093933 0.3103746 0.0802321 
> querygrain(jlow, nodes = "Akt")$Akt
Akt
        LOW         AVG        HIGH 
0.666515564 0.333333333 0.000151103 
The slight inhibition of Akt
induced by the inhibition of Erk
agrees with both the
direction of the arc linking the two nodes and the additional experiments performed by Sachs et al. In causal
terms, the fact that changes in Erk
affect Akt
supports the existence of a causal link from
the former to the latter.
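The complementary query is also easy to run (our own check, not an experiment from the paper): an ideal intervention raising Erk should shift the distribution of Akt in the opposite direction.

```r
# intervention setting Erk to HIGH, mirroring the jlow query above
jhigh = setEvidence(jtree, nodes = "Erk", states = "HIGH")
querygrain(jhigh, nodes = "Akt")$Akt
```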
As far as PKA
is concerned, both the validated network and the additional experimental evidence support
the existence of a causal link from PKA
to Erk
. Therefore, interventions to Erk
cannot affect PKA
. In a causal setting, interventions on Erk
would block all biological
influences from other proteins, which amounts to removing all the parents of Erk
from the DAG.
> causal.sachs = drop.arc(dag.sachs, "PKA", "Erk")
> causal.sachs = drop.arc(causal.sachs, "Mek", "Erk")
> cfitted = bn.fit(causal.sachs, isachs, method = "bayes")
> cjtree = compile(as.grain(cfitted))
> cjlow = setEvidence(cjtree, nodes = "Erk", states = "LOW")
After building the junction tree for this new DAG, called causal.sachs
, we can perform the same query on
both cjtree
and cjlow
as we did above.
> querygrain(cjtree, nodes = "PKA")$PKA
PKA
     LOW      AVG     HIGH 
0.194100 0.696229 0.109671 
> querygrain(cjlow, nodes = "PKA")$PKA
PKA
     LOW      AVG     HIGH 
0.194100 0.696229 0.109671 
Indeed, PKA
has exactly the same distribution in both cases. However, knowledge of the expression level
of Erk
may still alter our expectations on PKA
if we treat it as evidence instead of an ideal
intervention. In practice this implies the use of the original junction trees jtree
and jlow
,
as opposed to the modified cjtree
and cjlow
we used for the previous query.
> querygrain(jtree, nodes = "PKA")$PKA
PKA
     LOW      AVG     HIGH 
0.194100 0.696229 0.109671 
> querygrain(jlow, nodes = "PKA")$PKA
PKA
     LOW      AVG     HIGH 
0.489725 0.451647 0.058628 
All the queries illustrated above can be easily changed to maximum a posteriori queries by finding the largest
element (with which.max()
) in the distribution of the target node.
> names(which.max(querygrain(jlow, nodes = c("PKA"))$PKA))
[1] "LOW"
Clearly, such a simple approach is possible because of the nature of the evidence and the small number of nodes we are exploring. When many nodes are explored simultaneously, inference on their joint conditional distribution quickly becomes very difficult and computationally expensive. In these high-dimensional settings, algorithms specifically designed for MAP queries and ad-hoc approaches are preferable.
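For a small number of nodes, though, the joint conditional distribution is still tractable; querygrain() accepts a type argument for this purpose.

```r
# joint distribution of PKA and Akt given the evidence Erk = LOW,
# using the jlow junction tree built above
querygrain(jlow, nodes = c("PKA", "Akt"), type = "joint")
```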
Wed Nov 30 12:35:56 2022 with bnlearn 4.9-20221107 and R version 4.2.2 Patched (2022-11-10 r83330).