
Inference in Bayesian Networks

Arifa Azeez
No:03
Set E of evidence variables that are observed, e.g., {JohnCalls, MaryCalls}.
Query variable X, e.g., Burglary, for which we would like to know the posterior
probability distribution P(X|E), i.e., the distribution conditioned on the
observations made.
Inference In BN
[Figure: the burglary network with evidence MaryCalls = T and JohnCalls = T; query P(B | M = T, J = T)]
Inference Patterns
[Figure: four copies of the burglary network (Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls), illustrating the diagnostic, causal, intercausal, and mixed inference patterns]
Diagnostic inferences (from effects to causes): given that JohnCalls, infer P(Burglary | JohnCalls).
Causal inferences (from causes to effects): given Burglary, infer P(JohnCalls | Burglary).
Intercausal inferences (between causes of a common effect): e.g., given Alarm, learning that Earthquake is true changes P(Burglary | Alarm) (explaining away).
Mixed inferences (combining two or more of the above).
Types Of Nodes On A Path
[Figure: car-diagnosis network with nodes Battery, Radio, SparkPlugs, Gas, Starts, Moves; nodes on a path are labeled linear, converging, or diverging]
Independence Relations In BN
[Figure: the same car network as above, with linear, converging, and diverging nodes labeled]
Given a set E of evidence nodes, two beliefs connected by an undirected path are
independent if one of the following three conditions holds (see the sketch below):
1. A node on the path is linear and in E.
2. A node on the path is diverging and in E.
3. A node on the path is converging, and neither this node nor any of its descendants is in E.
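To make the three conditions concrete, here is a minimal Python sketch (not from the slides) that classifies each interior node of a path as linear, diverging, or converging from the edge directions and reports whether the path is blocked by an evidence set E. The data representation, helper names, and the precomputed descendants argument are assumptions made for illustration.

def is_blocked(path, edges, evidence, descendants):
    """Return True if the undirected path is blocked given evidence set E.

    path        : list of node names along the path
    edges       : set of (parent, child) pairs in the DAG
    evidence    : set of observed node names (E)
    descendants : dict mapping a node to the set of all its descendants
    """
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        in_from_prev = (prev, node) in edges        # prev -> node ?
        in_from_next = (nxt, node) in edges         # nxt  -> node ?
        if in_from_prev and in_from_next:           # converging node
            if node not in evidence and not (descendants[node] & evidence):
                return True                         # condition 3
        elif not in_from_prev and not in_from_next: # diverging node
            if node in evidence:
                return True                         # condition 2
        else:                                       # linear node
            if node in evidence:
                return True                         # condition 1
    return False

# Example with the burglary network: Burglary -> Alarm <- Earthquake,
# Alarm -> JohnCalls, Alarm -> MaryCalls.
edges = {("Burglary", "Alarm"), ("Earthquake", "Alarm"),
         ("Alarm", "JohnCalls"), ("Alarm", "MaryCalls")}
desc = {"Alarm": {"JohnCalls", "MaryCalls"}}
# With no evidence, the converging node Alarm blocks the path:
print(is_blocked(["Burglary", "Alarm", "Earthquake"], edges, set(), desc))          # True
# Observing JohnCalls (a descendant of Alarm) unblocks it (explaining away):
print(is_blocked(["Burglary", "Alarm", "Earthquake"], edges, {"JohnCalls"}, desc))  # False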
BN Inference
Simplest Case:
[Figure: A → B]

P(B) = P(a) P(B|a) + P(~a) P(B|~a)
     = Σ_A P(A) P(B|A)

[Figure: A → B → C]

P(C) = ???
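As a quick illustration of summing out along the chain A → B → C, here is a minimal sketch; the CPT numbers are made up for illustration and are not part of the slides.

# Made-up CPTs for the chain A -> B -> C (illustrative only).
P_A = {True: 0.3, False: 0.7}                   # P(A)
P_B_given_A = {True:  {True: 0.9, False: 0.1},  # P(B=b | A=a), keyed [a][b]
               False: {True: 0.2, False: 0.8}}
P_C_given_B = {True:  {True: 0.6, False: 0.4},  # P(C=c | B=b), keyed [b][c]
               False: {True: 0.5, False: 0.5}}

# P(B) = sum_A P(A) P(B|A)
P_B = {b: sum(P_A[a] * P_B_given_A[a][b] for a in (True, False))
       for b in (True, False)}

# P(C) = sum_B P(B) P(C|B)  -- this answers the "P(C) = ???" question above
P_C = {c: sum(P_B[b] * P_C_given_B[b][c] for b in (True, False))
       for c in (True, False)}

print(P_B)   # {True: 0.41, False: 0.59} with the numbers above
print(P_C)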
Inference Ex. 2
[Figure: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler and Rain → WetGrass]

P(W) = Σ_{r,s,c} P(w|r,s) P(r|c) P(s|c) P(c)
     = Σ_{r,s} P(w|r,s) Σ_c P(r|c) P(s|c) P(c)
     = Σ_{r,s} P(w|r,s) f_C(r,s)

where f_C(R,S) = Σ_c P(r|c) P(s|c) P(c)
The algorithm computes not individual probabilities, but entire tables.
Two ideas are crucial to avoiding exponential blowup:
1. Because of the structure of the BN, some subexpressions in the joint depend only on a small number of variables.
2. By computing them once and caching the results, we can avoid generating them exponentially many times.
Variable Elimination Algorithm
Let X_1, ..., X_m be an ordering on the non-query variables.

For i = m, ..., 1:
Leave in the summation for X_i only the factors mentioning X_i.
Multiply those factors, getting a factor that contains a number for each value of the variables mentioned, including X_i.
Sum out X_i, getting a factor f that contains a number for each value of the variables mentioned, not including X_i.
Replace the multiplied factors in the summation with f.
Σ_{X_m} ... Σ_{X_2} Σ_{X_1} Π_j P(X_j | Parents(X_j))

f_X(y_1, ..., y_k) = Σ_x f'_X(x, y_1, ..., y_k)

f'_X(x, y_1, ..., y_k) = Π_{i=1}^{m} f_i(x, y_{i,1}, ..., y_{i,l_i})
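A minimal Python sketch of this loop, using (variables, table) pairs as factors; the representation and helper names (multiply, sum_out, eliminate) are my own illustration, not an implementation given in the slides.

from itertools import product

# A factor is a pair (variables, table): `variables` is a list of names and
# `table` maps a tuple of values (one per variable, True/False) to a number.

def multiply(f1, f2):
    """Pointwise product of two factors."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for vals in product((True, False), repeat=len(out_vars)):
        assign = dict(zip(out_vars, vals))
        table[vals] = (t1[tuple(assign[v] for v in vars1)]
                       * t2[tuple(assign[v] for v in vars2)])
    return out_vars, table

def sum_out(var, factor):
    """Sum a variable out of a factor."""
    vars_, table = factor
    out_vars = [v for v in vars_ if v != var]
    out = {}
    for vals, p in table.items():
        assign = dict(zip(vars_, vals))
        key = tuple(assign[v] for v in out_vars)
        out[key] = out.get(key, 0.0) + p
    return out_vars, out

def eliminate(factors, order):
    """For each X in order: keep only the factors mentioning X, multiply
    them, sum X out, and put the resulting factor back (the loop above)."""
    for x in order:
        mentioning = [f for f in factors if x in f[0]]
        rest = [f for f in factors if x not in f[0]]
        if not mentioning:
            continue
        prod = mentioning[0]
        for f in mentioning[1:]:
            prod = multiply(prod, f)
        rest.append(sum_out(x, prod))
        factors = rest
    result = factors[0]                 # multiply what is left (query vars only)
    for f in factors[1:]:
        result = multiply(result, f)
    return result

For the Cloudy/Sprinkler/Rain/WetGrass example, the factors would be built from P(c), P(r|c), P(s|c), and P(w|r,s), and eliminate(factors, ["C", "R", "S"]) would return a one-variable factor over WetGrass, i.e. P(W).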
Complexity of variable elimination
Suppose in one elimination step we compute

f_X(y_1, ..., y_k) = Σ_x f'_X(x, y_1, ..., y_k)
f'_X(x, y_1, ..., y_k) = Π_{i=1}^{m} f_i(x, y_{i,1}, ..., y_{i,l_i})

This requires m · |Val(X)| · Π_i |Val(Y_i)| multiplications:
for each value of x, y_1, ..., y_k, we do m multiplications.

It also requires |Val(X)| · Π_i |Val(Y_i)| additions:
for each value of y_1, ..., y_k, we do |Val(X)| additions.

Complexity is exponential in the number of variables in the intermediate factor!
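For instance (numbers chosen here just for illustration), if X and each of k = 4 variables Y_i are binary and m = 3 factors are multiplied, one elimination step costs 3 · 2 · 2^4 = 96 multiplications and 2 · 2^4 = 32 additions; every additional binary Y_i in the intermediate factor doubles both counts.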
Understanding Variable Elimination
We want to select good elimination orderings that reduce complexity.

This can be done by examining a graph-theoretic property of the induced graph.

This reduces the problem of finding a good ordering to a graph-theoretic operation that is well understood; unfortunately, computing it is NP-hard!

Approaches to inference
Exact inference
Inference in Simple Chains
Variable elimination
Clustering / join tree algorithms
Approximate inference
Stochastic simulation / sampling methods
Markov chain Monte Carlo methods
The basic task of an inference system is to compute the posterior probability
distribution for a set of query variables.
The complete set of variables is X = {X} ∪ E ∪ Y.
A typical query asks for the posterior distribution P(X | e), e.g.,
P(Burglary | JohnCalls = true, MaryCalls = true) = <0.284, 0.716>.
Inference by enumeration
Add all of the terms (atomic event
probabilities) from the full joint distribution
If E are the evidence (observed) variables and
Y are the other (unobserved) variables, then:
P(X | e) = α P(X, e) = α Σ_y P(X, e, y)
Each P(X, E, Y) term can be computed using
the chain rule
Computationally expensive!
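A minimal sketch of this enumeration, assuming the full joint is available as a function joint(assignment) that applies the chain rule to one complete assignment; the function and argument names are illustrative, not from the slides.

from itertools import product

def enumerate_query(query_var, evidence, hidden_vars, joint):
    """Compute P(query_var | evidence) by summing the full joint over Y.

    evidence    : dict mapping observed variables E to their values
    hidden_vars : list of the remaining unobserved variables Y
    joint       : function taking a complete assignment dict and returning
                  its atomic-event probability via the chain rule
    """
    dist = {}
    for x in (True, False):                              # each value of the query variable
        total = 0.0
        for values in product((True, False), repeat=len(hidden_vars)):
            assignment = dict(evidence)
            assignment[query_var] = x
            assignment.update(zip(hidden_vars, values))
            total += joint(assignment)                   # one term P(X, e, y)
        dist[x] = total                                  # P(X = x, e)
    norm = sum(dist.values())
    return {x: p / norm for x, p in dist.items()}        # the alpha normalization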
Example: Enumeration
P(x_i) = Σ_{π_i} P(x_i | π_i) P(π_i)   (π_i = an assignment to the parents of X_i)
Suppose we want P(D=true), and only the value of E
is given as true
P(d | e) = α Σ_{A,B,C} P(a, b, c, d, e)
         = α Σ_{A,B,C} P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)
With simple iteration to compute this expression, there's going to be a lot of
repetition (e.g., P(e|c) has to be recomputed every time we iterate over C = true).

[Figure: network with edges a → b, a → c, b → d, c → d, c → e]
Variable elimination
Basically just enumeration, but with caching of
local calculations
Linear for polytrees (singly connected BNs)
Potentially exponential for multiply connected
BNs
Exact inference in Bayesian networks is NP-hard!
Join tree algorithms are an extension of variable
elimination methods that compute posterior
probabilities for all nodes in a BN simultaneously
Variable elimination
General idea:
Write the query in the form

P(X_n, e) = Σ_{x_k} ... Σ_{x_3} Σ_{x_2} Π_i P(x_i | pa_i)

Iteratively:
Move all irrelevant terms outside of the innermost sum
Perform the innermost sum, getting a new term
Insert the new term into the product
Variable elimination: Example
[Figure: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler and Rain → WetGrass]

P(w) = Σ_{r,s,c} P(w|r,s) P(r|c) P(s|c) P(c)
     = Σ_{r,s} P(w|r,s) Σ_c P(r|c) P(s|c) P(c)
     = Σ_{r,s} P(w|r,s) f_1(r,s)

where f_1(r,s) = Σ_c P(r|c) P(s|c) P(c)
Computing factors
R S C   P(R|C)   P(S|C)   P(C)   P(R|C) P(S|C) P(C)
T T T
T T F
T F T
T F F
F T T
F T F
F F T
F F F

R S   f_1(R,S) = Σ_c P(R|c) P(S|c) P(c)
T T
T F
F T
F F
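As a concrete illustration of filling in the second table, here is a short sketch that sums C out; the CPT numbers are placeholders chosen for illustration, not values given in the slides.

# Placeholder CPTs (illustrative numbers only, not given in the slides).
P_C = {True: 0.5, False: 0.5}          # P(C)
P_R_given_C = {True: 0.8, False: 0.2}  # P(R = T | C = c)
P_S_given_C = {True: 0.1, False: 0.5}  # P(S = T | C = c)

def p_r(r, c):                         # P(R = r | C = c)
    return P_R_given_C[c] if r else 1 - P_R_given_C[c]

def p_s(s, c):                         # P(S = s | C = c)
    return P_S_given_C[c] if s else 1 - P_S_given_C[c]

# f1(R, S) = sum_c P(R|c) P(S|c) P(c): one entry per (r, s) row of the table
f1 = {(r, s): sum(p_r(r, c) * p_s(s, c) * P_C[c] for c in (True, False))
      for r in (True, False) for s in (True, False)}

for (r, s), value in f1.items():
    print(f"R={r!s:5} S={s!s:5}  f1 = {value:.3f}")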
INFERENCE IN MULTIPLY CONNECTED
BELIEF NETWORKS
A multiply connected graph is one in which two nodes are connected by
more than one path.
One way this happens is when there are two or more possible causes for
some variable, and the causes share a common ancestor.
There are three basic classes of algorithms for evaluating multiply connected
networks, each with its own areas of applicability:
- Clustering methods transform the network into a probabilistically equivalent (but topologically different) polytree by merging offending nodes.
- Conditioning methods do the transformation by instantiating variables to definite values, and then evaluating a polytree for each possible instantiation.
- Stochastic simulation methods use the network to generate a large number of concrete models of the domain that are consistent with the network distribution. They give an approximation of the exact evaluation.
One way of evaluating the network is to transform it into a polytree by
combining the Sprinkler and Rain nodes into a meganode called Sprinkler+Rain.
The two Boolean nodes are replaced by a meganode that takes on four possible
values: TT, TF, FT, and FF.
The meganode has only one parent, the Boolean variable Cloudy, so there are
two conditioning cases.
Once the network has been converted to a polytree, a
linear-time algorithm can be applied to answer
queries.
Queries on variables that have been clustered can be
answered by averaging over the values of the other
variables in the cluster
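To make the meganode construction concrete, here is a minimal sketch of building its conditional table; the CPT numbers are placeholders, not from the slides. Because Sprinkler and Rain share the single parent Cloudy, P(Sprinkler, Rain | Cloudy) = P(Sprinkler | Cloudy) · P(Rain | Cloudy).

# Placeholder CPTs (illustrative numbers only, not given in the slides).
P_S = {True: 0.1, False: 0.5}          # P(Sprinkler = T | Cloudy = c)
P_R = {True: 0.8, False: 0.2}          # P(Rain = T | Cloudy = c)

def p(table, value, c):                # P(node = value | Cloudy = c)
    return table[c] if value else 1 - table[c]

# Meganode values TT, TF, FT, FF; one conditioning case per value of Cloudy.
mega_cpt = {c: {(s, r): p(P_S, s, c) * p(P_R, r, c)
                for s in (True, False) for r in (True, False)}
            for c in (True, False)}

print(mega_cpt[True])   # distribution over the four meganode values, given Cloudy = T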
Direct Stochastic Simulation
[Figure: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler and Rain → WetGrass]
1. Repeat N times:
1.1. Guess Cloudy at random
1.2. For each guess of Cloudy, guess
Sprinkler and Rain, then WetGrass

2. Compute the ratio of the # runs where
WetGrass and Cloudy are True
over the # runs where Cloudy is True
P(WetGrass | Cloudy)?
P(WetGrass | Cloudy) = P(WetGrass ∧ Cloudy) / P(Cloudy)
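A minimal runnable sketch of this procedure (steps 1 and 2 above); the CPT numbers are placeholders chosen for illustration and are not given in the slides.

import random

# Placeholder CPTs (illustrative numbers only, not given in the slides).
P_C = 0.5                                         # P(Cloudy = T)
P_S = {True: 0.1, False: 0.5}                     # P(Sprinkler = T | Cloudy)
P_R = {True: 0.8, False: 0.2}                     # P(Rain = T | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,   # P(WetGrass = T | Sprinkler, Rain)
       (False, True): 0.90, (False, False): 0.0}

def sample_once():
    """One run: guess Cloudy, then Sprinkler and Rain, then WetGrass."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

N = 100_000
runs_cloudy = runs_both = 0
for _ in range(N):
    c, s, r, w = sample_once()
    runs_cloudy += c                   # runs where Cloudy is true
    runs_both += c and w               # runs where WetGrass and Cloudy are true
print("P(WetGrass | Cloudy) ~", runs_both / runs_cloudy)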
Approximate inference in Bayesian networks
Instead of enumerating all possibilities,
sample to estimate probabilities.

[Figure: variables X_1, X_2, X_3, ..., X_n]
General question: What is P(X|e)?

Notation convention: upper-case letters refer to random variables;
lower-case letters refer to specific values of those variables
Direct Sampling
Suppose we have no evidence, but we want
to determine P(C,S,R,W) for all C,S,R,W.

Direct sampling:
Sample each variable in topological order,
conditioned on values of parents.

I.e., always sample from P(X_i | parents(X_i))

1. Sample from P(Cloudy). Suppose returns true.

2. Sample from P(Sprinkler | Cloudy = true). Suppose returns
false.

3. Sample from P(Rain | Cloudy = true). Suppose returns true.

4. Sample from P(WetGrass | Sprinkler = false, Rain = true).
Suppose returns true.

Here is the sampled event: [true, false, true, true]

Example
Suppose there are N total samples, and let N_S(x_1, ..., x_n) be the observed
frequency of the specific event x_1, ..., x_n.







Suppose N samples, n nodes. Complexity O(Nn).

lim_{N→∞} N_S(x_1, ..., x_n) / N = P(x_1, ..., x_n)

and for finite N,

N_S(x_1, ..., x_n) / N ≈ P(x_1, ..., x_n)
Markov Chain Monte Carlo Sampling
One of the most common methods used in real applications.

Uses the idea of the Markov blanket of a variable X_i: its parents, children, and children's parents.



Illustration of Markov Blanket
[Figure: a node X and its Markov blanket]
Recall that, by construction of a Bayesian network, a node is conditionally
independent of its non-descendants, given its parents.

Proposition: A node X_i is conditionally independent of all other nodes in the
network, given its Markov blanket.

Markov Chain Monte Carlo Sampling
Algorithm
Start with a random sample of the variables: (x_1, ..., x_n). This is the
current state of the algorithm.

Next state: randomly sample a value for one non-evidence variable X_i,
conditioned on the current values in the Markov blanket of X_i.

Example
Query: What is P(Rain | Sprinkler = true, WetGrass
= true)?

MCMC:
Random sample, with evidence variables fixed:
[true, true, false, true]

Repeat:
1. Sample Cloudy, given current values of its Markov blanket: Sprinkler =
true, Rain = false. Suppose result is false. New state:
[false, true, false, true]

2. Sample Rain, given current values of its Markov blanket:
Cloudy = false, Sprinkler = true, WetGrass = true. Suppose
result is true. New state: [false, true, true, true].
Each sample contributes to estimate for query
P(Rain | Sprinkler = true, WetGrass = true)

Suppose we perform 100 such samples, 20 with Rain = true and 80 with Rain
= false.

Then the answer to the query is Normalize(<20, 80>) = <0.20, 0.80>.
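A compact runnable sketch of this loop for the same query; the CPT numbers are placeholders (not from the slides), and each conditional used for resampling follows from the Markov-blanket proposition above, e.g. P(Rain | mb) ∝ P(Rain | Cloudy) · P(WetGrass | Sprinkler, Rain).

import random

# Placeholder CPTs (illustrative numbers only, not given in the slides).
P_C = 0.5                                         # P(Cloudy = T)
P_S = {True: 0.1, False: 0.5}                     # P(Sprinkler = T | Cloudy)
P_R = {True: 0.8, False: 0.2}                     # P(Rain = T | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,   # P(WetGrass = T | Sprinkler, Rain)
       (False, True): 0.90, (False, False): 0.0}

def bern(p):
    return random.random() < p

def sample_cloudy(r):
    # P(C | mb) is proportional to P(C) P(Sprinkler=T | C) P(r | C)
    w_t = P_C * P_S[True] * (P_R[True] if r else 1 - P_R[True])
    w_f = (1 - P_C) * P_S[False] * (P_R[False] if r else 1 - P_R[False])
    return bern(w_t / (w_t + w_f))

def sample_rain(c):
    # P(R | mb) is proportional to P(R | c) P(WetGrass=T | Sprinkler=T, R)
    w_t = P_R[c] * P_W[(True, True)]
    w_f = (1 - P_R[c]) * P_W[(True, False)]
    return bern(w_t / (w_t + w_f))

# Evidence: Sprinkler = true, WetGrass = true.  Random initial values for C, R.
c, r = bern(0.5), bern(0.5)
counts = {True: 0, False: 0}
for _ in range(10_000):
    c = sample_cloudy(r)      # resample Cloudy given its Markov blanket
    r = sample_rain(c)        # resample Rain given its Markov blanket
    counts[r] += 1            # every state contributes to the estimate
print("P(Rain = T | s, w) ~", counts[True] / sum(counts.values()))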
Claim: The sampling process settles into a dynamic equilibrium in which the
long-run fraction of time spent in each state is exactly proportional to its
posterior probability, given the evidence.

That is: for all variables X_i, the probability of the value x_i of X_i
appearing in a sample is equal to P(x_i | e).
Likelihood weighting
Idea: Don't generate samples that need to be rejected in the first place!
Sample only from the unknown variables Z.
Weight each sample according to the likelihood that it would occur, given the
evidence E (see the sketch below).
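A minimal sketch of likelihood weighting for the query P(Rain | Sprinkler = true, WetGrass = true), reusing the placeholder CPT numbers from the sketches above; non-evidence variables are sampled in topological order, and each fixed evidence variable multiplies the sample weight by P(observed value | parents).

import random

# Placeholder CPTs (same illustrative numbers as in the sketches above).
P_C = 0.5                                         # P(Cloudy = T)
P_S = {True: 0.1, False: 0.5}                     # P(Sprinkler = T | Cloudy)
P_R = {True: 0.8, False: 0.2}                     # P(Rain = T | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,   # P(WetGrass = T | Sprinkler, Rain)
       (False, True): 0.90, (False, False): 0.0}

def weighted_sample():
    """Sample the unknown variables in topological order; weight the evidence."""
    w = 1.0
    c = random.random() < P_C            # Cloudy: unknown, sampled
    w *= P_S[c]                           # Sprinkler = true: evidence, weight it
    r = random.random() < P_R[c]          # Rain: unknown, sampled
    w *= P_W[(True, r)]                   # WetGrass = true: evidence, weight it
    return r, w

totals = {True: 0.0, False: 0.0}
for _ in range(100_000):
    r, w = weighted_sample()
    totals[r] += w                        # accumulate weights, not raw counts
print("P(Rain = T | s, w) ~", totals[True] / sum(totals.values()))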
Exercise: Direct sampling
[Figure: nodes smart, study, prepared, fair, pass; smart and study are parents of prepared; smart, prepared, and fair are parents of pass]

p(smart) = .8    p(study) = .6    p(fair) = .9

p(prep | smart, study):
           smart   ¬smart
  study     .9      .7
  ¬study    .5      .1

p(pass | smart, prep, fair):
           smart           ¬smart
           prep   ¬prep    prep   ¬prep
  fair      .9     .7       .7     .2
  ¬fair     .1     .1       .1     .1

Topological order = ?
Random number generator: .35, .76, .51, .44, .08, .28, .03, .92, .02, .42
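A short sketch of how the exercise could be run mechanically, assuming the common convention that a variable is set to true when the next random number is below its (conditional) probability; the convention and the chosen topological order are assumptions, while the probabilities are the ones in the tables above.

# One valid topological order: smart, study, fair, prepared, pass.
randoms = iter([.35, .76, .51, .44, .08, .28, .03, .92, .02, .42])

p_prep = {(True, True): .9, (False, True): .7,    # p(prep | smart, study)
          (True, False): .5, (False, False): .1}
p_pass = {(True, True, True): .9, (True, False, True): .7,    # p(pass | smart, prep, fair)
          (False, True, True): .7, (False, False, True): .2,
          (True, True, False): .1, (True, False, False): .1,
          (False, True, False): .1, (False, False, False): .1}

def draw(p):
    """Set a variable to true when the next random number is below p."""
    return next(randoms) < p

smart = draw(.8)                        # .35 < .8  -> true
study = draw(.6)                        # .76 >= .6 -> false
fair = draw(.9)                         # .51 < .9  -> true
prep = draw(p_prep[(smart, study)])     # uses p(prep | smart, ¬study) = .5
passed = draw(p_pass[(smart, prep, fair)])
print(dict(smart=smart, study=study, fair=fair, prepared=prep, passes=passed))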
Markov chain Monte Carlo algorithm
So called because:
Markov chain: each instance generated in the sample depends on the previous instance.
Monte Carlo: a statistical sampling method.
Perform a random walk through variable
assignment space, collecting statistics as you go
Start with a random instantiation, consistent with
evidence variables
At each step, for some nonevidence variable, randomly
sample its value, consistent with the other current
assignments
Given enough samples, MCMC gives an accurate
estimate of the true distribution of values
