Plotting arc strengths with Rgraphviz and lattice
Measuring the strength of the relationship represented by each arc in a Bayesian network is a fundamental tool to investigate and validate its structure. Arc strengths are mostly used programmatically for both purposes; nevertheless it is sometimes useful to explore them visually.
We will consider all the different measures of arc strengths implemented in bnlearn:
- p-values from the conditional independence tests that would remove individual arcs present in a network;
> library(bnlearn) > dag = model2network("[A][C][F|C:A][B|A][D|A:C][E|B:F]") > pvalues = arc.strength(dag, data = learning.test, criterion = "x2") > head(pvalues)
from to strength 1 C F 6.564063e-01 2 A F 7.470134e-01 3 A B 0.000000e+00 4 A D 0.000000e+00 5 C D 0.000000e+00 6 B E 6.226919e-289
- differences in network score from removing individual arcs present in a network;
> score.deltas = arc.strength(dag, data = learning.test, criterion = "bic") > head(score.deltas)
from to strength 1 C F 23.47069 2 A F 23.81276 3 A B -1153.87956 4 A D -1938.94993 5 C D -823.76049 6 B E -720.82662
- probabilities of inclusion of all possible arcs, and their directions;
> bagging = boot.strength(learning.test, algorithm = "tabu", R = 200) > head(bagging)
from to strength direction 1 A B 1 0.5 2 A C 0 0.0 3 A D 1 1.0 4 A E 0 0.0 5 A F 0 0.0 6 B A 1 0.5
- probabilities of inclusion of individual arcs present in a network, and their directions.
> bayes.factors = bf.strength(dag, data = learning.test)
> head(bayes.factors)
from to strength direction 1 A B 1.000000e+00 5.000000e-01 2 A C 3.251115e-08 5.000000e-01 3 A D 1.000000e+00 1.000000e+00 4 A E 2.667283e-54 1.000000e+00 5 A F 1.016754e-04 2.065548e-09 6 B A 1.000000e+00 5.000000e-01
As far as plotting is concerned, the bn.strength
objects produced above differ in two key ways:
pvalues
andscore.deltas
contain an entry for each of the arcs indag
, and provide the confidence level in each arc's presence (in thestrength
column);bagging
andbayes.factors
contain an entry for every possible arc (that is, for every pair of nodes), and provide the confidence level in each arc's presence (in thestrength
column) and direction (in thedirection
column).
The strength
values in pvalues
, bagging
and bayes.factors
are
bound in the [0, 1] interval; and the same is true for the direction
values in bagging
and
bayes.factors
. The strength
values in score.deltas
are real numbers.
Plotting the distribution of arc strengths
The bagging
and bayes.factors
encode the complete distribution of the arc strengths, since
they cover all possible arcs. This makes it possible to plot the empirical cumulative distribution function (ECDF) of
the confidence in the arcs' presence (modulo their directions) using the values stored in the strength
column. This is what the default plot()
method does for the bn.strength
objects produced
by boot.strength()
and bf.strength()
.
> plot(bagging)
The plot is defined in [0, 1]2 because both the arc strengths and their cumulative probabilities are
defined in [0, 1]. At the left-hand side, the vertical gap between 0.0
and the first point gives the
proportion of the possible arcs that have strength equal to zero: they are completely unsupported by the data. At the
right hand side, the gap between the horizontal line and 1.0
gives the proportion of arcs that have
strength equal to one, the maximum possible. The vertical dashed line is the default threshold for significance computed
by inclusion.threshold()
(we can choose not to draw it by setting draw.threshold = FALSE
).
Points to the right of the threshold represent arcs that would be included in the Bayesian network produced by
averaged.network()
.
If structure learning is stable, that is, it reliably learns the same set of arcs, then the ECDF looks like the above: arcs cluster around strength zero and around strength one, with a large gap between the two groups. If, on the other hand, structure learning is noisy then the points in the ECDF are more uniformly distributed along the x-axis to represent arcs with varying strength values. Look, for instance, at how the ECDF changes if we reduce the sample size from 5000 to 50.
> weak = boot.strength(learning.test[1:50, ], algorithm = "tabu", R = 200) > plot(weak)
The pvalues
and score.deltas
objects typically contain information on just a few arcs, so
it is not meaningful to plot their ECDF.
Plotting the network structure together with the arc strengths
Another useful type of plot is overlaying arc strengths on a network to get a visual impression of the layout of the
arcs and their presence at the same time. strength.plot()
implements this by increasing each arc's
thickness in proportion to its strength, and by drawing weak arcs with dashed lines. Modes and arcs are laid out using
the Rgraphviz package as in graphviz.plot()
.
When arc strengths are estimated using p-values, by default they are grouped in the following intervals to determine the thickness of the arcs:
> c(0, 0.05 / c(10, 5, 2, 1.5, 1), 1)
[1] 0.00000000 0.00500000 0.01000000 0.02500000 0.03333333 0.05000000 1.00000000
The value of the significance threshold is the 0.05
above by default, but it can be changed using the
threshold
argument of strength.plot()
. Arcs with strengths in the right-most interval are
associated with p-values larger than the threshold and are plotted with dashed lines. Other arcs are drawn with
increasingly thick lines as we move intervals from right to left, that is, the closer to zero the interval the pvalues
fall in is.
Arc strengths estimated using scores are discretized in a similar way by default. Since the scale of the score
differences depends on the specific network and on the data they were learned from, intervals are defined using the
c(0.50, 0.75, 0.90, 0.95, 1)
empirical quantiles of positive score differences. (They correspond to arcs
that are well supported by the data: if we remove one of those arcs, the network score increases.) The significance
threshold is equal to zero by default, but it can be changed with the threshold
argument.
When arc strength is estimated using boot.strength()
or bf.strength()
, it takes values
in [0, 1] again but this time higher values are better. (With p-values, lower values are better.) Hence intervals by
default are defined as:
> c(0, (1 - threshold) / c(10, 5, 2, 1.5, 1), 1)
The resulting plots for pvalues
, score.deltas
, bagging
and
bayes.factors
are below.
> par(mfrow = c(2, 2)) > strength.plot(dag, strength = pvalues, main = "pvalues")
> strength.plot(dag, strength = score.deltas, main = "score.deltas") > strength.plot(dag, strength = bagging, main = "bagging") > strength.plot(dag, strength = bayes.factors, main = "bayes.factors")
The plots produced by strength.plot()
can be customized much in the same way as those produced by
graphviz.plot()
, which we illustrated here. An example:
> strength.plot(dag, strength = bagging, main = "bagging", shape = "rectangle", + highlight = list(nodes = "F", arcs = incident.arcs(dag, "F"), + col = "darkblue", fill = "lightblue"))
Obviously, setting either lwd
or lty
in the highlight
argument will reset the
formatting introduced by strength.plot()
and should be avoided. In addition, we can manually specify both
the threshold
of significance and the cutpoints
used to bin the arc strengths into
intervals.
> strength.plot(dag, strength = pvalues, + threshold = 0.01, cutpoints = c(0, 1e-300, 1e-250, 1e-200, 0.01, 1))
Mon Aug 5 02:51:29 2024
with bnlearn
5.0
and R version 4.4.1 (2024-06-14)
.