Overview of Bayesian Networks With Examples in R
(Scutari and Denis 2015)
• Please install bnlearn in R: install.packages("bnlearn")
• Theory
• Types of Bayesian networks
• Learning Bayesian networks
• Structure learning
• Parameter learning
• Using Bayesian networks
• Queries
• Conditional independence
• Inference based on new evidence
• Hard vs. soft evidence
• Conditional probability vs. most likely outcome (a.k.a. maximum a posteriori)
• Exact
• Approximate
• R packages for Bayesian networks
• Case study: protein signaling network
Bayesian networks (BNs)
• Represent a probability distribution as a probabilistic directed acyclic
graph (DAG)
• Graph = nodes and edges (arcs) denote variables and dependencies, respectively
• Directed = arrows represent the directions of relationships between nodes
• Acyclic = if you trace arrows with a pencil, you cannot traverse back to the same node
without picking up your pencil
• Probabilistic = each node has an associated probability distribution that can be influenced by the values
that other nodes assume, based on the structure of the graph
• The node at the tail of a connection is called the parent, and the node
at the head of the connection is called its child
• Ex. A → B: A is the parent, B is its child
Which is a BN?



Factorization into local distributions

• Factorization of the joint distribution of all variables (global
distribution) into local distributions encoded by the DAG:
1) is intuitively appealing
2) reduces the number of variables and the computational requirements when
using the BN for inference
3) increases power for parameter learning
• The dimensions of local distributions usually do not scale with the
size of the BN
• Each variable (node) depends only on its parents
• For the BN here:
P(A, S, E, O, R, T) = P(A) P(S) P(E | A, S) P(O | E) P(R | E) P(T | O, R)
• A and S are often referred to as root nodes
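The factorization above can be written down directly in bnlearn's model-string syntax. This is a minimal sketch that assumes the DAG on the slide is exactly the one implied by the factorization; each `[node|parents]` term corresponds to one local distribution.

```r
library(bnlearn)

# One [node|parents] term per local distribution in the factorization.
dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R]")

root.nodes(dag)     # nodes without parents
parents(dag, "E")   # E depends only on its parents
parents(dag, "T")   # T depends only on its parents
```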
Fundamental connections
1) Serial connection, e.g. A → E → O
2) Divergent connection, e.g. O ← E → R
3) Convergent connection, e.g. A → E ← S (also
referred to as a v-structure)
• The child of a convergent connection is often referred
to as a collider
• Only (immoral) v-structures uniquely define
probabilistic relationships
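• Ex. The serial connection B → A → C and the divergent connection B ← A → C encode the same distribution:
P(B)P(A|B)P(C|A) = P(A,B)P(C|A) = P(B|A)P(A)P(C|A)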
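The identity above can be checked numerically in base R. The probabilities below are hypothetical values for three binary variables, chosen only for illustration.

```r
# Hypothetical probabilities for three binary variables (illustration only).
p_B   <- c(0.3, 0.7)                          # P(B)
p_AgB <- matrix(c(0.9, 0.1, 0.2, 0.8), 2, 2)  # P(A | B): rows = A, columns = B
p_CgA <- matrix(c(0.6, 0.4, 0.5, 0.5), 2, 2)  # P(C | A): rows = C, columns = A

# Joint distribution of the serial connection B -> A -> C, indexed [a, b, k].
serial <- array(0, dim = c(2, 2, 2))
for (a in 1:2) for (b in 1:2) for (k in 1:2)
  serial[a, b, k] <- p_B[b] * p_AgB[a, b] * p_CgA[k, a]

# Re-express the same joint as the divergent connection B <- A -> C.
p_AB  <- apply(serial, c(1, 2), sum)  # P(A, B), summing out C
p_A   <- rowSums(p_AB)                # P(A)
p_BgA <- p_AB / p_A                   # P(B | A): rows = A, columns = B

divergent <- array(0, dim = c(2, 2, 2))
for (a in 1:2) for (b in 1:2) for (k in 1:2)
  divergent[a, b, k] <- p_A[a] * p_BgA[a, b] * p_CgA[k, a]

# Both factorizations give the same joint distribution.
same_joint <- isTRUE(all.equal(serial, divergent))
same_joint
```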
d-separation
• “d” stands for dependence
• Defines conditional independencies/dependencies
• Determines whether a set of X variables is independent of another
set Y, given a third set Z
• Intuitively important because it reveals how variables are related
• Computationally important because it provides a means for efficient inference
• Reduces the effective dimension of inference problems
• Formal definition: If A, B, and C are three disjoint subsets of nodes in
a DAG G, then C is said to d-separate A from B if along every path
between a node in A and a node in B there is a node v satisfying one
of the following two conditions:
1) v has converging arcs (i.e. there are two arcs pointing to v from the adjacent
nodes in the path) and neither v nor any of its descendants (i.e. the nodes
that can be reached from v) are in C
2) v is in C and does not have converging arcs
d-separation practice
• In R:
> library(bnlearn)
> dag <-
> dsep(bn = dag, x = "A", y = "O", z = "E")
[1] TRUE
• What this says is that given E, A and O are independent
> dsep(bn = dag, x = "A", y = "S")
[1] TRUE
> dsep(bn = dag, x = "A", y = "S", z = "E")
[1] FALSE
• Conditioning on a collider or its descendants (Z) makes the
parent nodes dependent
• Intuitively, if we know E, then certain combinations of A and S are more
likely, and hence A and S are conditionally dependent
• Note that it is impossible for nodes directly linked by an edge
to be independent conditional on any other node
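The queries above can be reproduced end to end with the sketch below. It assumes the DAG is the one given by the factorization P(A)P(S)P(E|A,S)P(O|E)P(R|E)P(T|O,R); the model string is that assumption, not the original slide code.

```r
library(bnlearn)

# Assumed DAG, taken from the factorization of the example BN.
dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R]")

serial_blocked    <- dsep(dag, x = "A", y = "O", z = "E")  # serial path, E observed
divergent_blocked <- dsep(dag, x = "O", y = "R", z = "E")  # common parent E observed
collider_closed   <- dsep(dag, x = "A", y = "S")           # collider E not observed
collider_open     <- dsep(dag, x = "A", y = "S", z = "E")  # conditioning on the collider
descendant_open   <- dsep(dag, x = "A", y = "S", z = "T")  # conditioning on a descendant of E

c(serial_blocked, divergent_blocked, collider_closed,
  collider_open, descendant_open)
```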
Equivalence class = CPDAG
• Two DAGs defined over the same set of
variables are equivalent if and only if they have:
1) the same skeleton (i.e. the same
underlying undirected graph)
2) the same v-structures
• Compelled edges: edges whose directions are
oriented in the equivalence class because
assuming the opposite direction would introduce:
1) new v-structures (and thus a different
equivalence class)
2) cycles (and thus the resulting graph would no
longer be a DAG)
• Note that DAGs can be probabilistically
equivalent but encode very different causal
relationships
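A small sketch of the equivalence check with bnlearn, using three hypothetical three-node DAGs. The helper `same_class` is illustrative and not part of bnlearn; it compares the CPDAGs returned by `cpdag()`.

```r
library(bnlearn)

serial    <- model2network("[B][A|B][C|A]")    # B -> A -> C
divergent <- model2network("[A][B|A][C|A]")    # B <- A -> C
collider  <- model2network("[B][C][A|B:C]")    # B -> A <- C

# Hypothetical helper: two DAGs are equivalent if their CPDAGs coincide.
same_class <- function(g1, g2) isTRUE(all.equal(cpdag(g1), cpdag(g2)))

same_class(serial, divergent)  # same skeleton, no v-structures in either
same_class(serial, collider)   # same skeleton, but the collider has a v-structure
vstructs(collider)             # lists the v-structure
```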
Markov blankets
[Figure: The Markov blanket of node X9]
• Information (evidence) on the values of the
parents, children, and nodes sharing a
child with a given node gives information on
that node
• Inference is most powerful when considering
all these nodes (due to the use of Bayes’
theorem when querying)
• Markov blanket defines this set of nodes
and effectively d-separates a given node
from the rest of the graph
• Symmetry of Markov blankets
• If node A is in the Markov blanket of node B,
then B is in the Markov blanket of A
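Markov blankets and their symmetry can be inspected with `mb()`. The sketch assumes the six-node example DAG from the factorization slide.

```r
library(bnlearn)

# Assumed example DAG (from the factorization of the example BN).
dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R]")

mb(dag, "E")  # parents, children, and nodes sharing a child with E
mb(dag, "O")  # includes R, because O and R share the child T

# Symmetry: whenever b is in the Markov blanket of a, a is in that of b.
symmetric <- all(vapply(nodes(dag), function(a) {
  all(vapply(mb(dag, a), function(b) a %in% mb(dag, b), logical(1)))
}, logical(1)))
symmetric
```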
Beyond dependencies: causal inference
• While a directed graph seems to suggest
causality, in reality additional criteria
must be met
• Ex. The presence of a latent variable
significantly altered the DAG used to represent
the relationships between test scores
• Specially designed perturbation
experiments can be employed to
characterize causal relationships
• Algorithms also exist that attempt to
elucidate causal relationships from
observational data
• Oftentimes, the “high p, small n” nature of
the data results in subsets (equivalence
classes) of possible causal networks
• “If conditional independence judgments
are byproducts of stored causal
relationships, then tapping and
representing those relationships directly
would be a more natural and more
reliable way of expressing what we know
or believe about the world”
Types of BNs
Discrete BNs
• All variables contain discrete data
• Ex. Multinomial distribution
• A = age: young, adult, or old
• S = gender: male or female
• E = education: high or uni
• R = residence: small or big

Conditional probability table of A
young  adult  old
0.3    0.5    0.2

Conditional probability table of R (given E)
R      high  uni
small  0.25  0.20
big    0.75  0.80

Conditional probability table of E (given A and S)
Gender = M
E     young  adult  old
high  0.75   0.72   0.88
uni   0.25   0.28   0.12
Gender = F
E     young  adult  old
high  0.64   0.70   0.90
uni   0.36   0.30   0.10
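The tables above can be turned into a fitted discrete BN with `custom.fit()`. This is a sketch restricted to the four variables whose tables are shown; the distribution of S (0.6 / 0.4) does not appear on the slide and is an assumption made only so the example runs.

```r
library(bnlearn)

A.lv <- c("young", "adult", "old")
S.lv <- c("M", "F")
E.lv <- c("high", "uni")
R.lv <- c("small", "big")

A.prob <- array(c(0.30, 0.50, 0.20), dim = 3, dimnames = list(A = A.lv))
# ASSUMPTION: P(S) is not given on the slide; 0.6 / 0.4 is a placeholder.
S.prob <- array(c(0.60, 0.40), dim = 2, dimnames = list(S = S.lv))
# P(E | A, S): first dimension is E, then A, then S.
E.prob <- array(c(0.75, 0.25, 0.72, 0.28, 0.88, 0.12,
                  0.64, 0.36, 0.70, 0.30, 0.90, 0.10),
                dim = c(2, 3, 2), dimnames = list(E = E.lv, A = A.lv, S = S.lv))
# P(R | E): first dimension is R, then E.
R.prob <- array(c(0.25, 0.75, 0.20, 0.80),
                dim = c(2, 2), dimnames = list(R = R.lv, E = E.lv))

dag <- model2network("[A][S][E|A:S][R|E]")
bn  <- custom.fit(dag, dist = list(A = A.prob, S = S.prob,
                                   E = E.prob, R = R.prob))
bn$E

# Marginal P(E), summing the local distribution over the parents A and S.
p_E <- apply(E.prob, 1, function(tab) sum(tab * outer(A.prob, S.prob)))
p_E
```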
Gaussian BNs (GBNs)
• Assumptions
• Each node follows a normal distribution
• Root nodes are described by the respective marginal distributions
• The conditioning effect of the parent nodes is given by an additive linear term in the
mean, and does not affect the variance
• In other words, each node has a variance that is specific to that node and does not depend on
the values of the parents
• The local distribution of each node can be equivalently expressed as a Gaussian
linear model which includes an intercept and the node’s parents as explanatory
variables, without any interaction terms
• Based on these assumptions, the joint distribution of all nodes (global
distribution) is multivariate normal
Gaussian BNs (GBNs)
E ~ N(50, 10^2)
V | G, E ~ N(-10.35 + 0.5G + 0.77E, 5^2)
W | V ~ N(15 + 0.7V, 5^2)
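Because each local distribution is a Gaussian linear model, simulating from the equations above and refitting with `lm()` should roughly recover the coefficients. In this base-R sketch the distribution of G is not given on the slide, so G ~ N(50, 10^2) is an assumption; the sample size and seed are arbitrary.

```r
set.seed(42)
n <- 5000

# ASSUMPTION: the distribution of G is not shown; N(50, 10^2) is a placeholder.
G <- rnorm(n, mean = 50, sd = 10)
E <- rnorm(n, mean = 50, sd = 10)
V <- rnorm(n, mean = -10.35 + 0.5 * G + 0.77 * E, sd = 5)
W <- rnorm(n, mean = 15 + 0.7 * V, sd = 5)

# Each node is regressed on its parents only, with no interaction terms.
fit_V <- lm(V ~ G + E)
fit_W <- lm(W ~ V)

coef(fit_V)   # intercept and slopes for G and E
sigma(fit_V)  # residual standard deviation, which does not depend on the parents
coef(fit_W)
```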

Hybrid BNs
• Contain both discrete and
continuous variables
• One common class of hybrid
BNs is conditional Gaussian
• Continuous variables cannot be
parents of discrete variables
• The Gaussian distribution of
a continuous variable is
conditional on the configuration
of its discrete parent(s)
• In other words, the variable can
have a unique linear model (i.e.
mean, variance) for each
configuration of its discrete
parent(s)
• Ex. local distributions of a hybrid BN:
CL ~ Beta(3, 1)
G1 | PR, CL ~ Pois(CL × g(PR))
TR | G1 ~ Ber(logit^-1[(G1 - 5)/2.5])
Comparison of BNs
• Discrete BNs
• Local probability distributions can be plotted
with bnlearn
• The iss argument, which adds a weighted prior
during parameter learning in bnlearn, only works
with discrete data
• Discretization produces better BNs than misspecified
distributions and coarse approximations of the
conditional probabilities
• GBNs
• Perform better than hybrid BNs when few
observations are available
• Greater accuracy than discretization for continuous
• Computationally more efficient than hybrid BNs
• Hybrid BNs
• Greater flexibility
• No dedicated R package
• No structure learning
Learning BNs
Structure learning
• All structure learning methods boil down to three approaches:
1) Constraint-based
2) Score-based
3) Hybrid-based
1) Constraint-based
• Constraint-based algorithms rely on conditional independence tests
• All modern algorithms first learn Markov blankets
• Simplifies the identification of neighbors and in turn reduces computational complexity
• Symmetry of Markov blankets also leveraged
• Discrete BNs
• Tests are functions of observed frequencies
• GBNs
• Tests are functions of partial correlation coefficients
• For both cases:
• We are checking the independence of two sets of variables given a third set
• Null hypothesis is conditional independence
• Test statistics are utilized
• Functions in bnlearn include gs, iamb, fast.iamb, inter.iamb, and mmpc, among others
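A minimal sketch of the constraint-based workflow on `learning.test`, a discrete example data set shipped with bnlearn (not the data used elsewhere in these slides). The choice of the mutual information test and of alpha = 0.05 is an assumption for illustration.

```r
library(bnlearn)
data(learning.test)

# Marginal test: are A and B independent? (null hypothesis: independence)
marginal <- ci.test("A", "B", data = learning.test, test = "mi")

# Conditional test: are A and E independent given B?
conditional <- ci.test("A", "E", "B", data = learning.test, test = "mi")

marginal$p.value
conditional$p.value

# Structure learning with IAMB, driven by the same kind of test.
cb <- iamb(learning.test, test = "mi", alpha = 0.05)
cb
```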
2) Score-based
• Candidate BNs are assigned a goodness-of-fit “network score” that heuristic
algorithms then attempt to maximize
• Due to the difficulty of assigning scores, only two options are common:
1) BDe (discrete case) / BGe (continuous case)
2) BIC
• Larger values = better fit
• Classes of heuristic algorithms include greedy search, genetic, and simulated
annealing
• Functions in bnlearn include hc and tabu
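A short score-based sketch on bnlearn's `learning.test` example data. It compares the BIC of an empty graph with the BIC of the structure found by hill climbing; the imaginary sample size `iss = 10` for BDe is an arbitrary choice.

```r
library(bnlearn)
data(learning.test)

empty    <- empty.graph(names(learning.test))
net_hc   <- hc(learning.test, score = "bic")               # greedy hill climbing
net_tabu <- tabu(learning.test, score = "bde", iss = 10)   # tabu search with BDe

# Larger score = better fit.
bic_empty <- score(empty, learning.test, type = "bic")
bic_hc    <- score(net_hc, learning.test, type = "bic")
c(empty = bic_empty, hc = bic_hc)
```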
3) Hybrid-based
• Combine constraint-based and score-based algorithms to offset their respective weaknesses
• Two steps:
1) Restrict
• Constraint-based algorithms are utilized to reduce the set of candidate DAGs
2) Maximize
• Score-based algorithms are utilized to find optimal DAG from the reduced set
• Functions in bnlearn include mmhc and rsmax2 where for rsmax2 you can specify
your own combinations of restrict and maximize algorithms
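A sketch of the two hybrid functions on bnlearn's `learning.test` example data. The particular restrict/maximize combination and their arguments are illustrative choices, not recommendations from the source.

```r
library(bnlearn)
data(learning.test)

# Max-Min Hill Climbing: a fixed restrict/maximize combination.
net_mmhc <- mmhc(learning.test)

# rsmax2: restrict with MMPC, then maximize with tabu search and BIC.
net_rs <- rsmax2(learning.test,
                 restrict = "mmpc", maximize = "tabu",
                 restrict.args = list(alpha = 0.05),
                 maximize.args = list(score = "bic"))

arcs(net_mmhc)
arcs(net_rs)
```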
Parameter learning
• Once the structure of a DAG has been determined, the parameters
can be determined as well
• Two most common approaches are maximum likelihood estimation
and Bayesian estimation (not available for GBNs in bnlearn)
• Parameter estimates are based only on the subset of data spanning
the considered variable and its parents
• bnlearn automatically determines the
type of data and fits the parameters accordingly
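A parameter-learning sketch using `bn.fit()` on bnlearn's `learning.test` example data. The model string is the structure documented for that data set; `iss = 10` is an arbitrary imaginary sample size for the Bayesian estimate.

```r
library(bnlearn)
data(learning.test)

# Structure documented for the learning.test data set.
dag <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")

fit_mle   <- bn.fit(dag, learning.test, method = "mle")
fit_bayes <- bn.fit(dag, learning.test, method = "bayes", iss = 10)

# Local distribution of B, estimated only from the columns B and A.
fit_mle$B
fit_bayes$B
```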
Notes on learning
• Three learning techniques:
1) unsupervised, i.e. from the data set
2) supervised, i.e. from experts in the field of the phenomenon being studied
3) a combination of both
• The arguments blacklist and whitelist can be specified in structure learning
functions to force the absence and presence of specific edges, respectively
• For GBNs, you can easily replace parameter estimates with your own
regression fit
• Ex. The penalized package in R can be used to perform ridge, lasso, or elastic net
regression for biased coefficient estimates
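A sketch of the whitelist and blacklist arguments with hill climbing on bnlearn's `learning.test` example data. The specific arcs are arbitrary choices made to show the effect, not expert knowledge about that data set.

```r
library(bnlearn)
data(learning.test)

# Force the arc A -> F to be present.
wl <- data.frame(from = "A", to = "F", stringsAsFactors = FALSE)
# Forbid any arc between B and E by blacklisting both directions.
bl <- data.frame(from = c("B", "E"), to = c("E", "B"),
                 stringsAsFactors = FALSE)

net <- hc(learning.test, whitelist = wl, blacklist = bl)
a <- arcs(net)

has_A_F <- any(a[, "from"] == "A" & a[, "to"] == "F")
has_B_E <- any((a[, "from"] == "B" & a[, "to"] == "E") |
               (a[, "from"] == "E" & a[, "to"] == "B"))
c(whitelisted_present = has_A_F, blacklisted_present = has_B_E)
```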
Using BNs
• Once a BN has been constructed, it can be used to answer questions about the variables it models
• The term query is derived from computer science terminology and
means to ask questions
• Two main types of queries:
1) conditional independence
• Uses only the DAG structure to explain how variables are associated with one
another, i.e. d-separation
2) inference, a.k.a. probabilistic reasoning or belief updating
• Uses the local distributions
2) Inference
• Investigates the distribution of one or more variables under non-trivial
conditioning
• Variable(s) being conditioned on are the new evidence
• The probability of the variable(s) of interest is then re-evaluated
• Works in the framework of Bayesian statistics because it focuses on the
computation of posterior probabilities or densities
• Based on the basic principle of modifying the joint distributions of nodes to
incorporate a new piece of information
• Uses the fundamental properties of BNs in that only local distributions are
considered when computing posterior probabilities to reduce dimensionality
• The network structure and distributional assumptions of a BN are treated
as fixed when performing inference
Types of evidence
• Hard evidence
• Instantiation of one or more variables in the network
• Soft evidence
• New distribution for one or more variables in the network, i.e. a new set of
parameters
Types of queries
• Conditional probability
• Interested in the marginal posterior probability distribution of variables given
evidence on other variables
• Most likely outcome (a.k.a. maximum a posteriori)
• Interested in finding the configuration of the variables that has the highest
posterior probability (discrete) or maximum posterior density (continuous)
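The two query types can be contrasted with a sampling-based sketch on bnlearn's `learning.test` example data. Here the most likely outcome is only approximated by the most frequent configuration among simulated draws; the evidence A = "a", the seed, and the number of draws are arbitrary.

```r
library(bnlearn)
data(learning.test)

dag <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")
fit <- bn.fit(dag, learning.test)

set.seed(123)
# Draws of (B, E) given the hard evidence A = "a".
draws <- cpdist(fit, nodes = c("B", "E"), evidence = (A == "a"), n = 10^5)

# Conditional probability query: marginal posterior of B given the evidence.
post_B <- prop.table(table(draws$B))
post_B

# Most likely outcome: most frequent joint configuration of (B, E).
joint <- prop.table(table(draws$B, draws$E))
idx <- which(joint == max(joint), arr.ind = TRUE)
map_config <- c(B = rownames(joint)[idx[1, 1]], E = colnames(joint)[idx[1, 2]])
map_config
```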
Types of inference
• Exact inference
• Repeated applications of Bayes’ theorem with local computations to obtain
exact probability values
• Feasible only for small or very simple graphs
• Approximate inference
• Monte Carlo simulations are used to sample from the global distribution and
thus estimate probability values
• Several approaches can be used for both random sampling and weighting
• There are functions in bnlearn to generate random observations and
calculate probability distributions given evidence using these approaches
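A sketch of approximate inference on bnlearn's `learning.test` example data, comparing logic sampling and likelihood weighting with the value read directly from the fitted table. Because A is the only parent of B in this structure, P(B = "b" | A = "a") is available exactly as an entry of B's conditional probability table. The seed and sample size are arbitrary.

```r
library(bnlearn)
data(learning.test)

dag <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")
fit <- bn.fit(dag, learning.test)

set.seed(2024)
# Logic sampling: evidence given as an expression.
p_ls <- cpquery(fit, event = (B == "b"), evidence = (A == "a"), n = 10^5)
# Likelihood weighting: evidence given as a named list.
p_lw <- cpquery(fit, event = (B == "b"), evidence = list(A = "a"),
                method = "lw", n = 10^5)
# Reference value from the local distribution of B.
p_exact <- fit$B$prob["b", "a"]

c(logic_sampling = p_ls, likelihood_weighting = p_lw, table_entry = p_exact)

# Random observations generated from the global distribution.
sim <- rbn(fit, n = 10)
sim
```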
R packages for BNs
R packages
• Two categories:
1) those that implement structure and parameter learning
2) those that focus on parameter learning and inference
• Some packages of note:
• bnlearn (developed by the authors)
• deal
• Can handle conditional Gaussian BNs
• pcalg
• Focuses on causal inference (implements the PC algorithm)
• Other packages include catnet, gRbase, gRain, and rbmn
• Some of these packages augment bnlearn
Case study: protein signaling
• Analysis published in Sachs, K., Perez, O., Pe'er, D., Lauffenburger,
D.A., Nolan, G.P. (2005). Causal Protein-Signaling Networks Derived
from Multiparameter Single-Cell Data. Science, 308(5721):523-529.
• Hypothesis: Machine learning for the automated derivation of a
protein signaling network will elucidate many of the traditionally
reported signaling relationships and predict novel causal pathways
• Methods
• Measure concentrations of pathway molecules in primary immune system cells
• Perturbation experiments to confirm causality
The data
> sachs <- read.table("", header = TRUE)
> head(sachs)
   Raf  Mek Plcg PIP2 PIP3  Erk  Akt  PKA  PKC  P38  Jnk
1 26.4 13.2 8.82 18.3 58.8 6.61   17  414   17 44.9   40
2 35.9 16.5 12.3 16.8 8.13 18.6 32.5  352 3.37 16.5 61.5
3 59.4 44.1 14.6 10.2   13 14.9 32.5  403 11.4 31.9 19.5
4   73 82.8 23.1 13.5 1.29 5.83 11.8  528 13.7 28.6 23.1
5 33.7 19.8 5.19 9.73 24.8 21.1 46.1  305 4.66 25.7 81.3
6 18.8 3.75 17.6 22.1 10.9 11.9 25.7  610 13.7 49.1 57.8

• Continuous data
Data exploration
• Violations of the assumptions of GBNs
• Highly skewed curves
• Concentrations cluster around 0
• Nonlinear correlations
• Makes accurate structure learning difficult
• What can we do?
• Data transformations (log)
• Hybrid network: specify an appropriate
conditional distribution for each variable
• Requires extensive prior knowledge of the
signaling pathway
• Discretize
[Figure: Densities of Mek, P38, PIP2, and PIP3 along with the normal distribution]
[Figure: Concentration of PKA vs. concentration of PKC along with the fitted regression line]
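The effect of a log transformation on skewed concentrations can be illustrated in base R. The data below are simulated log-normal values, not the Sachs measurements, and `skewness` is a small helper defined here for the example.

```r
set.seed(7)

# Simulated, right-skewed "concentrations" (NOT the Sachs data).
conc <- rlnorm(1000, meanlog = 3, sdlog = 1)

# Sample skewness: roughly 0 for symmetric data such as a normal sample.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(conc)
skew_log <- skewness(log(conc))
c(raw = skew_raw, log = skew_log)
```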
Discretizing the data
• Information-preserving discretization algorithm introduced by
Hartemink (2001)
1) Discretizes each variable into a large number of intervals
• idisc argument = type of intervals
• ibreaks argument = number of intervals
2) Iterates over the variables and collapses, for each of them, the pair of
adjacent intervals that minimizes the loss of pairwise mutual information
• Basically does its best to reflect the dependence structure of the original data
> dsachs <- discretize(sachs, method = "hartemink", breaks = 3,
ibreaks = 60, idisc = "quantile")
• breaks = number of desired levels (“low”, “medium”, and “high”)
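Since the Sachs file is not bundled here, the same call can be tried on `gaussian.test`, a continuous example data set shipped with bnlearn. This sketch only shows the shape of the result; the argument values mirror the call above.

```r
library(bnlearn)
data(gaussian.test)

# Hartemink's method: start from 60 quantile intervals per variable,
# then collapse adjacent intervals until 3 levels remain.
dgauss <- discretize(gaussian.test, method = "hartemink",
                     breaks = 3, ibreaks = 60, idisc = "quantile")

# Every variable is now a factor with three levels.
sapply(dgauss, nlevels)
head(dgauss)
```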
Model averaging
• The quality of the structure learned from the data can be improved by
averaging multiple CPDAGs
• Bootstrap resampling as described in Friedman et al. (1999)
• “Perturb” the data
• Frequencies of edges and directions are their confidence measures
> boot <- boot.strength(dsachs, R = 500, algorithm = "hc",
algorithm.args = list(score = "bde", iss = 10))
• R = number of network structures
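A runnable sketch of the bootstrap step on bnlearn's `learning.test` example data, with a smaller R to keep it quick. The seed and R = 50 are arbitrary; the arguments otherwise mirror the call above.

```r
library(bnlearn)
data(learning.test)

set.seed(1)
boot_lt <- boot.strength(learning.test, R = 50, algorithm = "hc",
                         algorithm.args = list(score = "bde", iss = 10))

# One row per ordered pair of nodes, with edge strength and direction frequency.
head(boot_lt)

# Edges that are both frequent and more often oriented from -> to.
strong <- boot_lt[boot_lt$strength > 0.85 & boot_lt$direction >= 0.5, ]
strong
```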
Model averaging results
> boot[boot$strength > 0.85 & boot$direction >= 0.5, ]
    from to   strength direction
24  Plcg PIP3 1        0.519
34  PIP2 PIP3 1        0.508
56  Erk  Akt  1        0.559
57  Erk  PKA  0.984    0.568089
67  Akt  PKA  1        0.566
89  PKC  P38  1        0.508
90  PKC  Jnk  1        0.509
100 P38  Jnk  0.95     0.505263
Note: your numbers may differ since no seed was set, but you
should still have the same edges passing the threshold
• strength = frequency of edge
• direction = frequency of edge direction
conditional on the edge’s presence
• Many score-equivalent edges
• This means the directions are not well
established
> avg.boot <-,
threshold = 0.85)
The network

> avg.boot
• Network learned from the
discretized, observational data
• Since we are not confident in the
directions of any of the edges, we
remove them by constructing the
skeleton
> avg.boot <- skeleton(avg.boot)
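The averaging and skeleton steps can be sketched on bnlearn's `learning.test` example data. `averaged.network()` is used here as one way to build a network from the bootstrap strengths; the seed, R = 50, and the 0.85 threshold are illustrative, and the object names are hypothetical.

```r
library(bnlearn)
data(learning.test)

set.seed(1)
boot_lt <- boot.strength(learning.test, R = 50, algorithm = "hc",
                         algorithm.args = list(score = "bde", iss = 10))

# Keep only the edges whose strength exceeds the threshold.
avg_lt <- averaged.network(boot_lt, threshold = 0.85)

# Drop the directions, keeping the underlying undirected graph.
skel_lt <- skeleton(avg_lt)

arcs(avg_lt)
undirected.arcs(skel_lt)
```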

