Reproducing the causal signalling network in Sachs et al., Science (2005)

This is a short HOWTO describing how to reproduce the causal signalling network analysis in “Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data” by Sachs, Perez, Pe'er, Lauffenburger and Nolan (2005, Science). The goal of the original analysis was to learn the causal pathways linking a set of 11 proteins from a combination of observational and experimental data subject to various knock-outs and spikings. The pathways linking those proteins were in fact already known from the literature, so the authors could compare the model they learned from the data to the "golden standard" model and thus validate Bayesian networks as a powerful tool in systems biology and genetics. This golden standard model is shown below.

[Figure: the golden standard signalling network from Sachs et al. (chunk sachs-dag-from-the-paper)]

The raw data

The data consist of the simultaneous measurements of 11 phosphorylated proteins and phospholipids derived from thousands of individual primary immune system cells, subjected to both general and specific molecular interventions. The former ensure that the relevant signalling pathways are active, while the latter make causal inference possible by elucidating arc directions through stimulatory cues and inhibitory interventions.

> sachs = read.table("sachs.data.txt", header = TRUE)
> head(sachs)
   Raf   Mek  Plcg  PIP2  PIP3   Erk  Akt PKA   PKC  P38  Jnk
1 26.4 13.20  8.82 18.30 58.80  6.61 17.0 414 17.00 44.9 40.0
2 35.9 16.50 12.30 16.80  8.13 18.60 32.5 352  3.37 16.5 61.5
3 59.4 44.10 14.60 10.20 13.00 14.90 32.5 403 11.40 31.9 19.5
4 73.0 82.80 23.10 13.50  1.29  5.83 11.8 528 13.70 28.6 23.1
5 33.7 19.80  5.19  9.73 24.80 21.10 46.1 305  4.66 25.7 81.3
6 18.8  3.75 17.60 22.10 10.90 11.90 25.7 610 13.70 49.1 57.8

The data are continuous, as they represent the concentrations of the molecules under investigation. Therefore, the standard approach in the literature is to assume that the concentrations follow a Gaussian distribution and to use a Gaussian Bayesian network (GBN) to build the protein-signalling network.

However, the DAGs learned under this parametric assumption are not satisfactory. For example, using Inter-IAMB we obtain a DAG with only 8 arcs, just 2 of which are directed.

> library(bnlearn)
> dag.iamb = inter.iamb(sachs, test = "cor")
> narcs(dag.iamb)
[1] 8
> directed.arcs(dag.iamb)
     from  to   
[1,] "P38" "PKC"
[2,] "Jnk" "PKC"

Other combinations of constraint-based algorithms and conditional independence tests, such as gs() with test = "mc-cor", return the same DAG. The same is true for score-based and hybrid algorithms. If we compare dag.iamb with the golden standard DAG shown above (minus the arcs that were missed in the original paper), we can see that they have completely different structures.

> sachs.modelstring <-
+   paste("[PKC][PKA|PKC][Raf|PKC:PKA][Mek|PKC:PKA:Raf]",
+         "[Erk|Mek:PKA][Akt|Erk:PKA][P38|PKC:PKA]",
+         "[Jnk|PKC:PKA][Plcg][PIP3|Plcg][PIP2|Plcg:PIP3]", sep = "")
> dag.sachs = model2network(sachs.modelstring)
> unlist(compare(dag.sachs, dag.iamb))
tp fp fn 
 0  8 17

Comparing the two DAGs again, but disregarding arc directions, reveals that some of the dependencies are correctly identified by inter.iamb() but their directions are not.

> unlist(compare(skeleton(dag.sachs), skeleton(dag.iamb)))
tp fp fn 
 8  0  9

The reason for the discrepancy between dag.sachs and dag.iamb is apparent from a graphical exploratory analysis of the data.

[Figure: diagnostic plots for the continuous data, showing the distributions of Mek, P38, PIP2 and PIP3 and a scatterplot of PKA against PKC (chunk sachs-continuous-diagnostic-plots)]

Firstly, the empirical distributions of the molecules' concentrations are markedly different from a normal distribution. As an example, we plotted the distributions of Mek, P38, PIP2 and PIP3 above. All are strongly skewed, because concentrations are positive numbers but many of them are small, and therefore cluster around zero. As a result, the variables are not symmetric and clearly violate the distributional assumptions underlying GBNs. Secondly, the dependence relationships in the data are not always linear; this is the case, as the scatterplot above shows, for PKA and PKC. Most conditional independence tests and network scores designed to capture linear relationships have very low power in detecting nonlinear ones. In turn, structure learning algorithms using such statistics are not able to correctly learn the arcs in the DAG.
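
These distributional violations can also be checked numerically. Below is a minimal sketch using simulated gamma-distributed values as a stand-in for the real concentrations (an assumption made purely for illustration; the real check would run on the columns of sachs):

```r
set.seed(42)

# Concentrations are positive and cluster near zero: gamma-distributed
# values with a small shape parameter mimic this kind of skewness.
conc <- rgamma(1000, shape = 1.5, rate = 0.1)

# The Shapiro-Wilk test firmly rejects normality for data of this shape.
shapiro.test(conc)$p.value

# A large positive skewness coefficient confirms the asymmetry.
skew <- mean((conc - mean(conc))^3) / sd(conc)^3
skew
```

For data of this shape the p-value is essentially zero and the skewness is strongly positive, matching what the histograms above show.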

Since we have concluded that GBNs are not appropriate for these data, we must now consider some alternative parametric assumptions. One possibility could be to explore monotone transformations like the logarithm. Another possibility would be to specify an appropriate conditional distribution for each variable, thus obtaining a hybrid Bayesian network. However, this approach requires substantial prior knowledge on the signalling pathways, which may or may not be available. In the case of Sachs et al., such information was indeed available from the literature. However, the aim of the analysis was to use BNs as an automated probabilistic method to verify such information, not to build a BN with prior information and use it as an expert system.

Consequently, Sachs et al. decided to discretise the data and to model them with a discrete BN, which can accommodate skewness and nonlinear relationships at the cost of losing the ordering information. Since the variables in the BN represent concentration levels, it is intuitively appealing to discretise them into three levels corresponding to low, average and high concentrations.

To that end, we can use the discretize() function in bnlearn, which implements three common discretisation methods:

  • quantile discretisation: each variable is discretised independently into k intervals delimited by its 0, 1/k, 2/k, ..., (k-1)/k, 1 empirical quantiles;
  • interval discretisation: each variable is discretised independently into k equally-spaced intervals;
  • information-preserving discretisation: variables are jointly discretised while preserving as much of the pairwise mutual information between the variables as possible.
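
The difference between the first two approaches can be illustrated with base R alone, since cut() implements both. A sketch on simulated skewed data (the variable x is a hypothetical stand-in for one of the concentrations):

```r
set.seed(1)
x <- rgamma(300, shape = 2, rate = 0.1)  # skewed, like the concentrations
k <- 3

# Interval discretisation: k equally-spaced intervals. With skewed data,
# most observations end up in the first interval.
int.disc <- cut(x, breaks = k)
table(int.disc)

# Quantile discretisation: intervals delimited by the 0, 1/3, 2/3, 1
# empirical quantiles, so the levels have (almost) equal frequencies.
qnt.disc <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = k + 1)),
                include.lowest = TRUE)
table(qnt.disc)
```

With skewed data the equal-width intervals produce very unbalanced levels, whereas the quantile intervals balance them by construction.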

The last approach was introduced by Hartemink. The key idea is to initially discretise each variable into a large number k_1 of intervals, thus losing as little information as possible. Subsequently, the algorithm iterates over the variables and collapses, for each of them, the pair of adjacent intervals that minimises the loss of pairwise mutual information. The algorithm stops when all variables have k_2 << k_1 intervals left. The resulting set of discretised variables reflects the dependence structure of the original data much better than either quantile or interval discretisation would allow, because the discretisation takes pairwise dependencies into account. Clearly some information is always lost in the process; for instance, higher-level dependencies are completely disregarded and therefore are not likely to be preserved.
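
A toy version of this collapsing step, for a single pair of variables, can be sketched in base R (the actual implementation in bnlearn works jointly on all pairs of variables; this is only an illustration of the idea):

```r
set.seed(7)

# Two dependent, skewed variables standing in for a pair of concentrations.
x <- rgamma(500, shape = 2)
y <- x + rgamma(500, shape = 2)

# Empirical mutual information between two discrete variables.
mi <- function(a, b) {
  p <- table(a, b) / length(a)
  e <- outer(rowSums(p), colSums(p))
  sum(ifelse(p > 0, p * log(p / e), 0))
}

# Initial quantile discretisation into k1 intervals (integer interval codes).
disc <- function(v, k)
  cut(v, quantile(v, seq(0, 1, length.out = k + 1)),
      include.lowest = TRUE, labels = FALSE)
k1 <- 20; k2 <- 3
xb <- disc(x, k1); yb <- disc(y, k1)

# Repeatedly merge the pair of adjacent intervals of xb whose collapse
# retains the most mutual information with yb, until k2 intervals are left.
while (max(xb) > k2) {
  cand <- sapply(seq_len(max(xb) - 1), function(i) {
    merged <- ifelse(xb > i, xb - 1, xb)  # merge intervals i and i + 1
    mi(merged, yb)
  })
  best <- which.max(cand)
  xb <- ifelse(xb > best, xb - 1, xb)
}
length(unique(xb))
```

At each step, maximising the retained mutual information is equivalent to minimising the information lost by the merge.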

Hartemink's algorithm is implemented in the discretize() function (method = "hartemink"), along with quantile discretisation (method = "quantile") and interval discretisation (method = "interval").

> dsachs = discretize(sachs, method = "hartemink",
+             breaks = 3, ibreaks = 60, idisc = "quantile")

The relevant arguments are idisc and ibreaks, which control how the data are initially discretised, and breaks, which specifies the number of levels of each discretised variable. ibreaks corresponds to k_1 above, while breaks corresponds to k_2. Choosing good values for these arguments is a trade-off between quality and speed: high values of ibreaks preserve the characteristics of the original data to a greater extent, whereas smaller values result in much smaller memory usage and shorter running times.

Model averaging

Following the analysis in Sachs et al., we can improve the quality of the structure learned from the data by averaging multiple CPDAGs. One possible approach to that end is to apply bootstrap resampling to dsachs and learn a set of 500 network structures.

> boot = boot.strength(dsachs, R = 500, algorithm = "hc",
+           algorithm.args = list(score = "bde", iss = 10))

In the code above we learn a CPDAG with hill-climbing from each of the R bootstrap samples. Each variable in dsachs is now a factor with three levels, so we use the BDe score with a very low imaginary sample size.

> options(digits = 6)

The boot object returned by boot.strength() is a data frame containing the strength of all possible arcs (in the strength column) and the probability of their direction (in the direction column) conditional on the fact that the from and to nodes are connected by an arc. This structure makes it easy to select the most significant arcs, as below.
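
Conceptually, the strength of an arc is the fraction of networks in which the two nodes are connected in either direction, and the direction is the fraction of those networks in which the arc points in the stated direction. A minimal base R sketch of these quantities on three hypothetical arc sets (not bnlearn's actual implementation):

```r
# Three hypothetical learned networks, each represented by its arc set
# (a two-column matrix of from-to pairs).
nets <- list(
  matrix(c("A", "B"), ncol = 2, byrow = TRUE),
  matrix(c("A", "B",
           "B", "C"), ncol = 2, byrow = TRUE),
  matrix(c("B", "A"), ncol = 2, byrow = TRUE)
)

arc.summary <- function(from, to, nets) {
  # Strength: fraction of networks containing the arc in either direction.
  present <- sapply(nets, function(arcs)
    any((arcs[, 1] == from & arcs[, 2] == to) |
        (arcs[, 1] == to & arcs[, 2] == from)))
  # Direction: among those, fraction with the arc pointing from -> to.
  forward <- sapply(nets, function(arcs)
    any(arcs[, 1] == from & arcs[, 2] == to))
  c(strength = mean(present), direction = sum(forward) / sum(present))
}

arc.summary("A", "B", nets)
```

Here the A-B connection appears in all three networks (strength 1) but points from A to B in only two of them (direction 2/3).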

> boot[(boot$strength > 0.85) & (boot$direction >= 0.5), ]
    from   to strength direction
1    Raf  Mek    1.000  0.513000
23  Plcg PIP2    1.000  0.515000
24  Plcg PIP3    1.000  0.525000
34  PIP2 PIP3    1.000  0.511000
56   Erk  Akt    1.000  0.554000
57   Erk  PKA    0.990  0.560606
67   Akt  PKA    1.000  0.560000
89   PKC  P38    1.000  0.509000
90   PKC  Jnk    1.000  0.507000
100  P38  Jnk    0.946  0.507400

Arcs are considered significant if they appear in at least 85% of the networks, and in the direction that appears most frequently. Arcs that are score equivalent in all the CPDAGs are considered to appear 50% of the time in each direction. Since all the values in the direction column above are close to 0.5, we can infer that the direction of the arcs is not well established and that they are probably all score equivalent. Interestingly, lowering the threshold from 85% to 50% does not change the results of the analysis, which seems to indicate that in this case the results are not sensitive to its value.

Having computed the significance for all possible arcs, we can now build the averaged network using averaged.network() and the 85% threshold.

> avg.boot = averaged.network(boot, threshold = 0.85)
[Figure: skeleton of the averaged network avg.boot (chunk plot-sachs-averaged-skeleton)]

The averaged network avg.boot contains the same arcs as the network learned from the observational data in Sachs et al. Even though the probability of the presence of each arc and of its possible directions are computed separately in boot.strength(), we are not able to determine with any confidence which direction has better support from the discretised data. Therefore, we remove the directions from the arcs in avg.boot, which amounts to constructing its skeleton.

> avg.boot = skeleton(avg.boot)

As an alternative, we can average the results of several hill-climbing searches, each starting from a different DAG. Such DAGs can be generated randomly from a uniform distribution over the space of connected graphs with the MCMC algorithm proposed by Ide and Cozman, implemented in random.graph() as method = "ic-dag". This ensures that no systematic bias is introduced in the learned DAGs. In addition, keeping only one randomly generated DAG every 50 ensures that the DAGs are different from each other, so that the search space is covered as thoroughly as possible.

> nodes = names(dsachs)
> start = random.graph(nodes = nodes, method = "ic-dag",
+            num = 500, every = 50)
> netlist = lapply(start, function(net) {
+   hc(dsachs, score = "bde", iss = 10, start = net)
+ })

After using lapply() to iterate easily over the DAGs in the start list and pass each of them to hc() with the start argument, we obtain a list of bn objects, which we call netlist. We can pass such a list to the custom.strength() function to obtain a data frame with the same structure as those returned by boot.strength().

> rnd = custom.strength(netlist, nodes = nodes)
> rnd[(rnd$strength > 0.85) & (rnd$direction >= 0.5), ]
    from   to strength direction
1    Raf  Mek        1     0.510
33  PIP2 Plcg        1     0.503
34  PIP2 PIP3        1     0.501
43  PIP3 Plcg        1     0.502
57   Erk  PKA        1     0.503
66   Akt  Erk        1     0.504
67   Akt  PKA        1     0.507
99   P38  PKC        1     0.522
100  P38  Jnk        1     0.508
109  Jnk  PKC        1     0.514
> avg.start = averaged.network(rnd, threshold = 0.85)

The arcs identified as significant with this approach are the same as in avg.boot (even though some are reversed), thus confirming the stability of the averaged network obtained from bootstrap resampling. Arc directions are again very close to 0.5, to the point we can safely disregard them. A comparison of the equivalence classes of avg.boot and avg.start shows that the two networks are equivalent.

> all.equal(cpdag(avg.boot), cpdag(avg.start))
[1] TRUE

Furthermore, note how averaged networks, like the networks they are computed from, are not necessarily completely directed. In that case, it is not possible to compute their score directly. However, we can identify the equivalence class the averaged network belongs to (with cpdag()) and then select a DAG within that equivalence class (with cextend()).

> score(cextend(cpdag(avg.start)), dsachs, type = "bde",
+   iss = 10)
[1] -8498.88

Since all networks in the same equivalence class have the same score (for score-equivalent functions), the value returned by score() is a correct estimate for the original, partially directed network.

Choosing the Significance Threshold

The value of the threshold beyond which an arc is considered significant (the significance threshold) does not seem to have a huge influence on the analysis in Sachs et al.: in fact, any value between 0.5 and 0.85 yields exactly the same results. So, for instance:

> all.equal(averaged.network(boot, threshold = 0.50),
+           averaged.network(boot, threshold = 0.70))
[1] TRUE

The same holds for avg.start. However, this is often not the case. Therefore, it is important to use a statistically motivated algorithm for choosing a suitable threshold instead of relying on ad-hoc values.

A solution to this problem exists in the literature and is implemented in bnlearn as the default value for the threshold argument of averaged.network().

[Figure: empirical CDF of the arc strengths, with the estimated significance threshold (chunk plot-threshold-ecdf)]

It is used when we do not specify threshold ourselves, as in the code below.

> averaged.network(boot)

  Random/Generated Bayesian network

  model:
   [Raf][Plcg][Erk][PKC][Mek|Raf][PIP2|Plcg][Akt|Erk]
   [P38|PKC][PIP3|Plcg:PIP2][PKA|Erk:Akt][Jnk|PKC:P38]
  nodes:                                 11 
  arcs:                                  10 
    undirected arcs:                     0 
    directed arcs:                       10 
  average markov blanket size:           1.82 
  average neighbourhood size:            1.82 
  average branching factor:              0.91 

  generation algorithm:                  Model Averaging 
  significance threshold:                0.358

For the dsachs data, the estimated value for the threshold is 0.358; so, any arc with a strength value strictly greater than that is considered significant. The resulting averaged network is similar to that obtained with the 85% threshold from Sachs et al. Compared to avg.boot, only the arc P38 → Jnk is missing, which is an improvement because it was a false positive not present in the validated DAG shown above.

> unlist(compare(cpdag(dag.sachs), cpdag(avg.boot)))
tp fp fn 
 9  1  8
> unlist(compare(cpdag(dag.sachs),
+                cpdag(averaged.network(boot))))
tp fp fn 
 9  1  8

The reason for the insensitivity of the averaged network to the value of the threshold is apparent from the plot of the CDF of the arc strengths: arcs that are well supported by the data are clearly separated from those that are not, with the exception of P38 → Jnk (which has strength 0.946). The arc with the highest strength after P38 → Jnk is Plcg → Akt, with strength 0.328, so any threshold that falls between the strengths of these two arcs results in the same averaged network.

Handling Interventional Data

Usually, all the observations in a sample are collected under the same general conditions. This is true both for observational data, in which treatment allocation is outside the control of the investigator, and for experimental data, which are collected from randomised controlled trials. As a result, the sample can be easily modelled with a single BN, because all the observations follow the same probability distribution.

However, this is not the case when several samples resulting from different experiments are analysed together with a single, encompassing model. Such an approach is called meta-analysis. First, environmental conditions and other exogenous factors may differ between those experiments. Furthermore, the experiments may be different in themselves; for example, they may explore different treatment regimes or target different populations.

This is the case with the protein-signalling data analysed in Sachs et al. In addition to the data set we have analysed so far, which is subject only to general stimuli meant to activate the desired paths, 9 other data sets subject to different stimulatory cues and inhibitory interventions were used to elucidate the direction of the causal relationships in the network. Such data are often called interventional, because the values of specific variables in the model are set by an external intervention of the investigator.

Overall, the 10 data sets contain 5,400 observations; in addition to the 11 signalling levels analysed above, the protein which is activated or inhibited (INT) is recorded for each sample.

> isachs = read.table("sachs.interventional.txt",
+             header = TRUE, colClasses = "factor")

One intuitive way to model these data sets with a single, encompassing BN is to include the intervention INT in the network and to make all variables depend on it. This can be achieved with a whitelist containing all possible arcs from INT to the other nodes, thus forcing these arcs to be present in the learned DAG.

> wh = matrix(c(rep("INT", 11), names(isachs)[1:11]), ncol = 2)
> dag.wh = tabu(isachs, whitelist = wh, score = "bde",
+            iss = 10, tabu = 50)

Using tabu search instead of hill-climbing improves the stability of the score-based search; once a locally optimum DAG is found, tabu search performs an additional 50 iterations (as specified by the tabu argument) to check that no other (and potentially better) local optimum can be found.

We can also let the structure learning algorithm decide which arcs connecting INT to the other nodes should be included in the DAG. To this end, we can use the tiers2blacklist() function to blacklist all arcs toward INT, thus ensuring that only outgoing arcs may be included in the DAG. In the general case, tiers2blacklist() builds a blacklist such that all arcs going from a node in a particular element of the nodes argument to a node in one of the previous elements are blacklisted.
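
To make the contents of that blacklist explicit, we can construct it by hand with base R; this sketch mirrors what tiers2blacklist() returns for our two tiers, namely every arc from a signalling protein back to INT:

```r
# The 11 signalling proteins from the second tier.
proteins <- c("Raf", "Mek", "Plcg", "PIP2", "PIP3", "Erk",
              "Akt", "PKA", "PKC", "P38", "Jnk")

# Forbid every arc pointing from a protein back to INT; arcs in the
# opposite direction (INT to a protein) remain available to the search.
bl <- cbind(from = proteins, to = rep("INT", length(proteins)))
bl
```

The result is an 11-row, two-column matrix of forbidden arcs, the same shape as the whitelist wh used above.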

> tiers = list("INT", names(isachs)[1:11])
> bl = tiers2blacklist(tiers = tiers)
> dag.tiers = tabu(isachs, blacklist = bl,
+               score = "bde", iss = 1, tabu = 50)

The BNs learned with these two approaches are shown below. Some of the structural features detected in Sachs et al. are present in both dag.wh and dag.tiers. For example, the interplay between Plcg, PIP2 and PIP3 and between PKC, P38 and Jnk are both modelled correctly. The lack of any direct intervention on PIP2 is also correctly modelled in dag.tiers. The most noticeable feature missing from both DAGs is the pathway linking Raf to Akt through Mek and Erk.

The approach used in Sachs et al. yields much better results. Instead of including the interventions in the network as an additional node, they used a modified BDe score (labelled "mbde" in bnlearn) incorporating the effects of the interventions into the score components associated with each node. When we are controlling the value of a node experimentally, its value is not determined by the other nodes in the BN, but by the experimenter's intervention. Accordingly, mbde disregards the effects of the parents on a controlled node for those observations that are subject to interventions (on that node) while otherwise behaving as the standard bde for other observations.
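
The core idea can be sketched with a toy multinomial log-likelihood for a single node: observations in which the node was manipulated contribute nothing to that node's score component. This is a simplified illustration only; the actual mbde is a Bayesian Dirichlet score, not the maximum-likelihood one used here.

```r
set.seed(3)

# A toy node with three expression levels; the first 20 observations are
# assumed to be subject to an intervention on this node.
node <- sample(c("LOW", "AVG", "HIGH"), 100, replace = TRUE)
manipulated <- c(rep(TRUE, 20), rep(FALSE, 80))

# Log-likelihood component for the node, skipping manipulated observations:
# their values were fixed by the experimenter, not generated by the network.
loglik.node <- function(node, manipulated) {
  free <- node[!manipulated]
  p <- table(free) / length(free)  # empirical distribution of the free part
  sum(log(p[free]))
}

loglik.node(node, manipulated)
```

In a full network score, conditioning on the node's parents would work the same way: the manipulated rows are simply dropped from that node's component while still contributing to the components of the other nodes.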

Since the value of INT identifies which node is subject to either a stimulatory cue or an inhibitory intervention for each observation, we can easily construct a named list of which observations are manipulated for each node.

> INT = sapply(1:11, function(x) {
+                           which(isachs$INT == x) })
> nodes = names(isachs)[1:11]
> names(INT) = nodes

We can then pass this list to tabu() as an additional argument for mbde. In addition, we can combine the use of mbde with model averaging and random starting points as discussed above. To improve the stability of the averaged network, we generate a large set of starting networks for the tabu searches, keeping only one generated network every 100 to obtain a more diverse set.

> start = random.graph(nodes = nodes, method = "melancon",
+            num = 500, burn.in = 10^5, every = 100)
> netlist = lapply(start, function(net) {
+   tabu(isachs[, 1:11], score = "mbde", exp = INT,
+      iss = 1, start = net, tabu = 50)
+ })
> intscore = custom.strength(netlist, nodes = nodes,
+               cpdag = FALSE)

Note that we have set cpdag = FALSE in custom.strength(), so that the DAGs being averaged are not transformed into CPDAGs before computing arc strengths and direction probabilities. The reason is that, unlike the BDe score, mbde is not score equivalent as a consequence of incorporating the interventions. In fact, information about the interventions is a form of prior information, and makes it possible to identify arc directions even when the arc would be score equivalent from the data alone.

Averaging the DAGs with averaged.network() and the default threshold, we are finally able to correctly identify all the arcs in the network from Sachs et al..

> dag.mbde = averaged.network(intscore)
> unlist(compare(dag.sachs, dag.mbde))
tp fp fn 
17  8  0
[Figure: comparison of the learned networks (chunk sachs-various-network-comparison)]

As we can see from the comparison above, dag.mbde is much closer to the validated network from Sachs et al. than any of the other BNs. All the arcs from the validated network are correctly learned; in the output above we have 17 true positives and 0 false negatives, and the original network contains 17 arcs. The three arcs that were missing in Sachs et al. are missing in dag.mbde as well. The arcs from dag.mbde that are not present in the validated network were identified in Sachs et al. and discarded due to their comparatively low strength; this may imply that the simulated annealing algorithm used in Sachs et al. performs better on this data set than tabu search.

Querying the Network

In their paper, Sachs et al. used the validated BN to substantiate two claims:

  1. a direct perturbation of Erk should influence Akt;
  2. a direct perturbation of Erk should not influence PKA.

The probability distributions of Erk, Akt and PKA were then compared with the results of two ad-hoc experiments to confirm the validity and the direction of the inferred causal influences.

Given the size of the BN, we can perform the queries corresponding to the two claims above using any exact or approximate inference algorithm. First, we need to create a bn.fit object for the validated network structure from Sachs et al.

> isachs = isachs[, 1:11]
> for (i in names(isachs))
+   levels(isachs[, i]) = c("LOW", "AVG", "HIGH")
> fitted = bn.fit(dag.sachs, isachs, method = "bayes")

The INT variable, which codifies the intervention applied to each observation, is not needed for inference and is therefore dropped from the data set. Furthermore, we rename the expression levels of each protein to make both the subsequent code and its output more readable.

Subsequently, we can perform the two queries using the junction tree algorithm provided by the gRain package.

> library(gRain)
> jtree = compile(as.grain(fitted))

We can introduce the direct perturbation of Erk required by both queries by calling setEvidence as follows. In causal terms, this would be an ideal intervention.

> jlow = setEvidence(jtree, nodes = "Erk", states  = "LOW")

As we can see from the code below, the marginal distribution of Akt changes depending on whether we take the evidence (intervention) into account or not.

> querygrain(jtree, nodes = "Akt")$Akt
Akt
      LOW       AVG      HIGH 
0.6093933 0.3103746 0.0802321
> querygrain(jlow, nodes = "Akt")$Akt
Akt
        LOW         AVG        HIGH 
0.666515564 0.333333333 0.000151103

The slight inhibition of Akt induced by the inhibition of Erk agrees with both the direction of the arc linking the two nodes and the additional experiments performed by Sachs et al. In causal terms, the fact that changes in Erk affect Akt supports the existence of a causal link from the former to the latter.

As far as PKA is concerned, both the validated network and the additional experimental evidence support the existence of a causal link from PKA to Erk. Therefore, interventions on Erk cannot affect PKA. In a causal setting, an intervention on Erk would block all biological influences of the other proteins on Erk, which amounts to removing all the parents of Erk from the DAG.

> causal.sachs = drop.arc(dag.sachs, "PKA", "Erk")
> causal.sachs = drop.arc(causal.sachs, "Mek", "Erk")
> cfitted = bn.fit(causal.sachs, isachs, method = "bayes")
> cjtree = compile(as.grain(cfitted))
> cjlow = setEvidence(cjtree, nodes = "Erk", states  = "LOW")

After building the junction tree for this new DAG, called causal.sachs, we can perform the same query on both cjtree and cjlow as we did above.

> querygrain(cjtree, nodes = "PKA")$PKA
PKA
     LOW      AVG     HIGH 
0.194100 0.696229 0.109671
> querygrain(cjlow, nodes = "PKA")$PKA
PKA
     LOW      AVG     HIGH 
0.194100 0.696229 0.109671
[Figure: comparison of the gRain queries by barplots (chunk compare-gRain-queries-by-barplots)]

Indeed, PKA has exactly the same distribution in both cases. However, knowledge of the expression level of Erk may still alter our expectations on PKA if we treat it as evidence instead of an ideal intervention. In practice this implies the use of the original junction trees jtree and jlow, as opposed to the modified cjtree and cjlow we used for the previous query.

> querygrain(jtree, nodes = "PKA")$PKA
PKA
     LOW      AVG     HIGH 
0.194100 0.696229 0.109671
> querygrain(jlow, nodes = "PKA")$PKA
PKA
     LOW      AVG     HIGH 
0.489725 0.451647 0.058628

All the queries illustrated above can be easily changed to maximum a posteriori queries by finding the largest element (with which.max()) in the distribution of the target node.

> names(which.max(querygrain(jlow, nodes = c("PKA"))$PKA))
[1] "LOW"

Clearly, such a simple approach is possible because of the nature of the evidence and the small number of nodes we are exploring. When many nodes are explored simultaneously, inference on their joint conditional distribution quickly becomes very difficult and computationally expensive. In these high-dimensional settings, algorithms specifically designed for MAP queries and ad-hoc approaches are preferable.

Last updated on Wed Nov 30 12:35:56 2022 with bnlearn 4.9-20221107 and R version 4.2.2 Patched (2022-11-10 r83330).