
Artificial Intelligence

Bayesian Networks
Stephan Dreiseitl
FH Hagenberg
Software Engineering & Interactive Media


Overview
Representation of uncertain knowledge
Constructing Bayesian networks
Using Bayesian networks for inference
Algorithmic aspects of inference


A simple Bayesian network example


(diagram: Rain → Worms, Rain → Umbrellas)

P(Rain, Worms, Umbrellas) = P(Worms | Rain) P(Umbrellas | Rain) P(Rain)
With conditional independence, need only right-hand side
to represent joint distribution

A simple Bayesian net example (cont.)


(diagram: Rain → Worms, Rain → Umbrellas)

Intuitively: graphical representation of influence


Mathematically: graphical representation of conditional
independence assertions

A more complicated example


(diagram: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls)

P(b) = 0.001        P(e) = 0.002

B   E   P(a | B, E)
T   T   0.95
T   F   0.94
F   T   0.24
F   F   0.001

A   P(j | A)        A   P(m | A)
T   0.9             T   0.7
F   0.05            F   0.01

Definition of Bayesian networks


A Bayesian network is a directed acyclic graph with
random variables as nodes,
links that specify "directly influences" relationships,
probability distributions P(Xi | parents(Xi )) for each
node Xi
Graph structure asserts conditional independencies:
P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) =
P(MaryCalls | Alarm)


Bayesian networks as joint probabilities


P(X1, ..., Xn) = ∏_{i=1}^n P(Xi | X1, ..., Xi−1)
             = ∏_{i=1}^n P(Xi | parents(Xi))

for parents(Xi) ⊆ {X1, ..., Xi−1}


Burglary example:
P(b, ¬e, a, ¬m, j) = P(b) P(¬e) P(a | b, ¬e) P(¬m | a) P(j | a)
                   = 0.001 · 0.998 · 0.94 · 0.3 · 0.9 ≈ 0.00025
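As an illustration of how the factorization can be evaluated, here is a minimal Python sketch. The CPT values are the ones from the burglary slide; the dictionary layout and function names are illustrative choices of my own.

```python
# Evaluating the Bayesian-network factorization for the burglary example.
# CPT values from the slide; the data layout is just one possible choice.
P_b = 0.001                      # P(Burglary = true)
P_e = 0.002                      # P(Earthquake = true)
P_a = {(True, True): 0.95,       # P(Alarm = true | B, E)
       (True, False): 0.94,
       (False, True): 0.24,
       (False, False): 0.001}
P_j = {True: 0.9, False: 0.05}   # P(JohnCalls = true | A)
P_m = {True: 0.7, False: 0.01}   # P(MaryCalls = true | A)

def prob(p_true, value):
    """Probability of a Boolean variable taking `value`."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, m, j):
    """P(B=b, E=e, A=a, M=m, J=j) as the product of the local CPTs."""
    return (prob(P_b, b) * prob(P_e, e) * prob(P_a[(b, e)], a)
            * prob(P_m[a], m) * prob(P_j[a], j))

# The slide's example: P(b, ¬e, a, ¬m, j) ≈ 0.00025
print(joint(True, False, True, False, True))
```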

Conditional independencies in networks


Use graphical structure to visualize conditional
dependencies and independencies
Nodes are dependent if there is information flow between
them (along at least one path)
Nodes are independent if information flow is blocked
(along all possible paths)
Distinguish situations with and without evidence
(instantiated variables)


Conditional independencies in networks


No evidence: Information flow along a path is blocked iff
there is a head-to-head node (a blocker) on the path

No blockers between A and B (diagram: a path from A to B without a head-to-head node)

Blocker C between A and B (diagram: A → C ← B)

Conditional independencies in networks


Evidence blocks information flow, except at blockers (or
their descendants), where it opens information flow
Information flow between A and B blocked by evidence
(diagram: a path from A to B with an instantiated node on it)


Conditional independencies in networks


Information flow between A and B unblocked by evidence
(diagram: a path from A to B with evidence at a head-to-head node, or at one of its descendants)


Conditional independencies in networks


A node is conditionally independent of its
non-descendants, given its parents

(diagram: node X with parents P1, P2, children C1, C2, and further non-descendants A, B, D)

P(X | P1, P2, A, B, D) = P(X | P1, P2)

Cond. independencies in networks (cont.)


A node is conditionally independent of all other nodes in
the network, given its Markov blanket: its parents,
children, and children's parents

(diagram: node X with parents P1, P2, children C1, C2, children's parents A, B, and a further node D outside the Markov blanket)

P(X | P1, P2, C1, C2, A, B, D) = P(X | P1, P2, C1, C2, A, B)

Noisy OR
For a Boolean node X with n Boolean parents, the conditional
probability table has 2^n entries
The noisy-OR assumption reduces this number to n: assume
each parent may be inhibited independently

(diagram: Flu, Malaria, and Cold are parents of Fever)


Noisy OR (cont.)
Need only specify first three entries of table:
Flu   Malaria   Cold   P(¬fever)
T     F         F      0.2
F     T         F      0.1
F     F         T      0.6
F     F         F      1.0
F     T         T      0.1 · 0.6 = 0.06
T     F         T      0.2 · 0.6 = 0.12
T     T         F      0.2 · 0.1 = 0.02
T     T         T      0.2 · 0.1 · 0.6 = 0.012
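As a sketch of how the full table follows from the three per-cause values, the snippet below assumes the inhibition probabilities from the slide; the dictionary and function names are my own.

```python
from itertools import product

# Noisy-OR sketch: build the full 2^n-row table of P(¬fever | parents)
# from the n per-parent inhibition probabilities given on the slide.
inhibit = {"Flu": 0.2, "Malaria": 0.1, "Cold": 0.6}   # P(¬fever | only that cause)

def p_not_fever(active):
    """P(Fever = false | set of active parents): product of the inhibitors."""
    p = 1.0
    for parent in active:
        p *= inhibit[parent]
    return p

parents = ["Flu", "Malaria", "Cold"]
for values in product([True, False], repeat=3):
    active = [p for p, v in zip(parents, values) if v]
    print(values, round(p_not_fever(active), 3))
```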

Building an example network


When I go home at night, I want to know if my family
is home before I try the doors (perhaps the most
convenient door to enter is double locked when nobody
is home). Now, often when my wife leaves the house she
turns on an outdoor light. However, she sometimes turns
on this light if she is expecting a guest. Also, we have a
dog. When nobody is home, the dog is put in the back
yard. The same is true if the dog has bowel trouble.
Finally, if the dog is in the back yard, I will probably hear
her barking, but sometimes I can be confused by other
dogs barking.
F. Jensen, An introduction to Bayesian networks, UCL Press, 1996.

Building an example network (cont.)


Relevant entities: Boolean random variables FamilyOut,
LightsOn, HearDogBark
Causal structure: FamilyOut has direct influence on both
LightsOn and HearDogBark, so LightsOn and
HearDogBark are conditionally independent given
FamilyOut
(diagram: FamilyOut → LightsOn, FamilyOut → HearDogBark)

Building an example network (cont.)


Numbers in the conditional probability tables are derived from
previous experience or subjective belief
P(familyout) = 0.2
P(lightson | familyout) = 0.99
P(lightson | ¬familyout) = 0.1
Run into a problem with P(heardogbark | familyout): the dog
may be out because of bowel problems, and the barking may
come from other dogs
Network structure needs to be updated to reflect this

Building an example network (cont.)


Introduce a mediating variable DogOut to model the uncertainty
due to bowel problems and due to hearing other dogs bark

(diagram: FamilyOut → LightsOn, FamilyOut → DogOut, BowelProblems → DogOut, DogOut → HearDogBark)

Need: P(DogOut | FamilyOut, BowelProblems) and P(HearDogBark | DogOut)

Building an example network (cont.)


Obtain the following additional probability tables:
FamilyOut: P(f) = 0.2        BowelProblems: P(b) = 0.05

LightsOn:      F   P(l | F)
               T   0.99
               F   0.1

DogOut:        F   B   P(d | F, B)
               T   T   0.99
               T   F   0.88
               F   T   0.96
               F   F   0.2

HearDogBark:   D   P(h | D)
               T   0.6
               F   0.25

Inference in Bayesian networks


Given events (instantiated variables) e and no
information on hidden variables H, calculate distribution
for query variable Q
Algorithmically, calculate P(Q | e) by marginalizing over H
P(Q | e) = α P(Q, e) = α Σ_h P(Q, e, h)

with h ranging over all possible value combinations of H


Distinguish between causal, diagnostic, and intercausal
reasoning

Types of inference
Causal reasoning: query variable is downstream of events
P(heardogbark | familyout) = 0.56
Diagnostic reasoning: query variable upstream of events
P(familyout | heardogbark) = 0.296
Explaining away (intercausal reasoning): knowing the effect
and one possible cause reduces the probability of the other
possible causes
P(familyout | bowelproblems, heardogbark) = 0.203
P(bowelproblems | heardogbark) = 0.078
P(bowelproblems | familyout, heardogbark) = 0.053

Algorithmic aspects of inference


Calculating joint distribution computationally expensive
Several alternatives for inference in Bayesian networks:
Exact inference
  by enumeration
  by variable elimination
Stochastic inference (Monte Carlo methods)
  by sampling from the joint distribution
  by rejection sampling
  by likelihood weighting
  by Markov chain Monte Carlo methods


Inference by enumeration
FamilyOut example (d' ∈ {d, ¬d}, b' ∈ {b, ¬b}):

P(F | l, h) = α P(F, l, h) = α Σ_{d'} Σ_{b'} P(F, l, h, d', b')

P(f | l, h) = α Σ_{d'} Σ_{b'} P(f) P(b') P(l | f) P(d' | f, b') P(h | d')
            = α P(f) P(l | f) Σ_{d'} P(h | d') Σ_{b'} P(b') P(d' | f, b')
            = α · 0.2 · 0.99 · (0.6 · 0.8857 + 0.25 · 0.1143)
            = α · 0.111

Inference by enumeration (cont.)


Similarly, P(¬f | l, h) = α · 0.0267
From P(f | l, h) + P(¬f | l, h) = 1, normalization yields
P(F | l, h) = α (0.111, 0.0267) = (0.806, 0.194)

Burglary example:
P(B | j, m) = α P(B) Σ_{e'} P(e') Σ_{a'} P(a' | B, e') P(j | a') P(m | a')

The last two factors P(j | a') P(m | a') do not depend on e',
but have to be evaluated twice (for e and ¬e)
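A small Python sketch of this enumeration for the FamilyOut query P(F | l, h): the CPT values are the ones from the earlier slides, stored in dictionaries whose layout is my own choice; the hidden variables B and D are summed out of the joint and the result is normalized over F.

```python
from itertools import product

# Inference by enumeration for the FamilyOut network (CPTs from the slides).
P_f, P_b = 0.2, 0.05
P_l = {True: 0.99, False: 0.1}                     # P(l | F)
P_d = {(True, True): 0.99, (True, False): 0.88,    # P(d | F, B)
       (False, True): 0.96, (False, False): 0.2}
P_h = {True: 0.6, False: 0.25}                     # P(h | D)

def pr(p_true, val):
    return p_true if val else 1.0 - p_true

def joint(f, b, l, d, h):
    """Full joint as the product of the local CPTs."""
    return (pr(P_f, f) * pr(P_b, b) * pr(P_l[f], l)
            * pr(P_d[(f, b)], d) * pr(P_h[d], h))

# P(F | l, h): sum the hidden variables B and D out of the joint,
# then normalize over F (the alpha on the slides).
unnorm = {}
for f in (True, False):
    unnorm[f] = sum(joint(f, b, True, d, True)
                    for b, d in product((True, False), repeat=2))
total = sum(unnorm.values())
print({f: p / total for f, p in unnorm.items()})   # approx {True: 0.806, False: 0.194}
```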

Variable elimination
Eliminate repetitive calculations by summing inside out,
storing intermediate results (cf. dynamic programming)
Burglary example, different query:
P(J | b) = α P(b) Σ_{e'} P(e') Σ_{a'} P(a' | b, e') P(J | a') Σ_{m'} P(m' | a')

where the innermost sum Σ_{m'} P(m' | a') = 1
Fact: Any variable that is not an ancestor of the query or


evidence variables is irrelevant and can be dropped
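The effect of summing inside out can be seen on the slide's query P(J | b). The sketch below hand-codes the nested sums for this one query, using the CPT values from the burglary slide; it is not a general variable-elimination engine, and the helper names are my own.

```python
# Summing "inside out" for P(J | b) in the burglary network.
P_e = 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.24, (False, False): 0.001}   # P(a | B, E)
P_j = {True: 0.9, False: 0.05}                        # P(j | A)

def pr(p_true, val):
    return p_true if val else 1.0 - p_true

def p_john_given_b(j):
    # Sum over A is innermost, sum over E outermost; the sum over M
    # equals 1 and has been dropped, as on the slide.
    total = 0.0
    for e in (True, False):
        inner = sum(pr(P_a[(True, e)], a) * pr(P_j[a], j) for a in (True, False))
        total += pr(P_e, e) * inner
    return total

print(p_john_given_b(True))    # approx 0.849 with these CPT values
```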

Sampling from the joint distribution


Straightforward if there is no evidence in the network:
Sample each variable in topological order
For nodes without parents, sample from their
distribution; for nodes with parents, sample from the
conditional distribution
With NS(x1, ..., xn) being the number of times the specific
realization (x1, ..., xn) is generated in N sampling
experiments, obtain

lim_{N→∞} NS(x1, ..., xn) / N = P(x1, ..., xn)

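A sketch of this ancestral sampling for the FamilyOut network (CPTs from the slides; dictionary layout and names are my own): each variable is drawn in topological order, roots first, children conditioned on the parent values just drawn.

```python
import random

# Draw samples from the joint distribution of the FamilyOut network.
P_f, P_b = 0.2, 0.05
P_l = {True: 0.99, False: 0.1}
P_d = {(True, True): 0.99, (True, False): 0.88,
       (False, True): 0.96, (False, False): 0.2}
P_h = {True: 0.6, False: 0.25}

def bernoulli(p):
    return random.random() < p

def prior_sample():
    f = bernoulli(P_f)            # roots first ...
    b = bernoulli(P_b)
    l = bernoulli(P_l[f])         # ... then children, conditioned on
    d = bernoulli(P_d[(f, b)])    #     the parent values just drawn
    h = bernoulli(P_h[d])
    return f, b, l, d, h

# Relative frequencies converge to the joint probabilities, e.g. for
# (¬f, ¬b, ¬l, ¬d, h), whose probability is 0.1368 on the slides:
N = 100_000
count = sum(prior_sample() == (False, False, False, False, True) for _ in range(N))
print(count / N)
```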

Example: joint distribution sampling


(FamilyOut network and CPTs as given on the previous slides)
What is the probability that the family is at home, the dog has
no bowel problems and isn't out, the light is off, and a dog's
barking can be heard?

Example: joint distribution sampling


FamilyOut example: Generate 100000 samples from the
network by
  first sampling from the FamilyOut and BowelProblems variables,
  then sampling from all other variables in turn, given the sampled parent values
Obtain NS(¬f, ¬b, ¬l, ¬d, h) = 13740
Compare with P(¬f, ¬b, ¬l, ¬d, h) =
0.8 · 0.95 · 0.9 · 0.8 · 0.25 = 0.1368

Example: joint distribution sampling


Advantage of sampling: easy to generate estimates for
other probabilities
Standard error of the estimates drops as 1/√N; for
N = 100000 this is 0.00316

NS(¬d)/100000 = 0.63393 ≈ P(¬d) = 0.63246
NS(f, ¬h)/NS(¬h) = 0.1408 ≈ P(f | ¬h) = 0.1416

The last example is a form of rejection sampling



Rejection sampling in Bayesian networks


Method to approximate the conditional probability P(X | e)
of a variable X, given evidence e:

P(X | e) ≈ NS(X, e) / NS(e)

Rejection sampling: take into account only those samples that
are consistent with the evidence
Problem with rejection sampling: the number of samples
consistent with the evidence drops exponentially with the
number of evidence variables, so the method is unusable for
real-life networks
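A minimal sketch of rejection sampling for P(F | h), assuming the FamilyOut CPTs from the slides (names and layout my own): only samples with HearDogBark = true are kept.

```python
import random

# Rejection sampling for P(F | h) in the FamilyOut network.
P_f, P_b = 0.2, 0.05
P_l = {True: 0.99, False: 0.1}
P_d = {(True, True): 0.99, (True, False): 0.88,
       (False, True): 0.96, (False, False): 0.2}
P_h = {True: 0.6, False: 0.25}
bern = lambda p: random.random() < p

def prior_sample():
    f, b = bern(P_f), bern(P_b)
    l, d = bern(P_l[f]), bern(P_d[(f, b)])
    return f, b, l, d, bern(P_h[d])

kept = consistent = 0
for _ in range(100_000):
    f, b, l, d, h = prior_sample()
    if h:                      # evidence: HearDogBark = true
        kept += 1
        consistent += f        # count kept samples with FamilyOut = true
print(consistent / kept)       # approx P(f | h) = 0.296
```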

Likelihood weighting
Fix the evidence and sample all other variables
This overcomes the shortcoming of rejection sampling by
generating only samples consistent with the evidence
Problem: Consider a situation with P(E = e | X = x) = 0.001
and P(X = x) = 0.9. Then 90% of the samples will have
X = x (and fixed E = e), but this combination is very
unlikely, since P(E = e | X = x) = 0.001
Solution: Weight each sample by the product of the conditional
probabilities of the evidence variables, given their parents

Example: Likelihood weighting


FamilyOut example: Calculate P(F | l, d)
Iterate the following:
sample all non-evidence variables, given the evidence
variables, obtaining, e.g., (¬f, ¬b, h)
calculate the weighting factor, e.g.
P(l | ¬f) · P(d | ¬f, ¬b) = 0.1 · 0.2 = 0.02
Finally, sum and normalize the weighting factors for the samples
(f, l, d) and (¬f, l, d)
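A sketch of this likelihood-weighting recipe for P(F | l, d), again with the FamilyOut CPTs from the slides and helper names of my own. The evidence variables are never sampled; they only enter through the weight.

```python
import random
from collections import defaultdict

# Likelihood weighting for P(F | l, d): fix LightsOn = true and
# DogOut = true, sample the remaining variables, weight each sample
# by the likelihood of the evidence given its parents.
P_f, P_b = 0.2, 0.05
P_l = {True: 0.99, False: 0.1}
P_d = {(True, True): 0.99, (True, False): 0.88,
       (False, True): 0.96, (False, False): 0.2}
P_h = {True: 0.6, False: 0.25}
bern = lambda p: random.random() < p

weights = defaultdict(float)
for _ in range(100_000):
    f = bern(P_f)                      # non-evidence variables are sampled ...
    b = bern(P_b)
    h = bern(P_h[True])                # ... given the fixed evidence d = true
    w = P_l[f] * P_d[(f, b)]           # weight = P(l | f) * P(d | f, b)
    weights[f] += w

total = weights[True] + weights[False]
print(weights[True] / total)           # approx P(f | l, d) = 0.902
```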

Example: Likelihood weighting (cont.)


For N = 100000, obtain
  NS(f, l, d) = 20164,   Σ w(f,l,d) = 17676.4
  NS(¬f, l, d) = 79836,  Σ w(¬f,l,d) = 1907.18

P(f | l, d) ≈ 17676.4 / (17676.4 + 1907.18) = 0.90261

Correct value: P(f | l, d) = 0.90206
Disadvantage of likelihood weighting: with many
evidence variables, most samples will have very small
weights, and a few samples with larger weights dominate the estimate

Markov chains
A sequence of discrete random variables X0, X1, ... is called a Markov
chain with state space S iff
P(Xn = xn | X0 = x0, ..., Xn−1 = xn−1) = P(Xn = xn | Xn−1 = xn−1)
for all x0, ..., xn ∈ S.
Thus, Xn is conditionally independent of all variables
before it, given Xn−1
Specify the state transition matrix P with
Pij = P(Xn = xj | Xn−1 = xi)

Markov chain Monte Carlo methods


Want to obtain samples from a given distribution Pd(X)
that is hard to sample from with other methods
Idea: Construct a Markov chain that, for arbitrary initial
state x0 , converges towards a stationary (equilibrium)
distribution Pd (X )
Then, successive realizations xn , xn+1 , . . . are sampled
according to Pd (but are not independent!!)
Often not clear when convergence of chain has taken
place
Therefore, discard initial portion of chain (burn-in phase)

Markov chain example


Let S = {1, 2} with state transition matrix

    P = ( 1/2  1/2 )
        ( 3/4  1/4 )

Simulate the chain for 1000 steps; show NS(1)/N and
NS(2)/N for N = 1, ..., 1000 with starting state 1 (left)
and 2 (right)

(figure: two plots of the relative frequencies NS(1)/N and NS(2)/N against N = 1, ..., 1000; y-axis from 0.2 to 0.8)
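A short simulation of this chain, assuming the transition matrix reconstructed above; both runs settle near the stationary distribution (0.6, 0.4) of that matrix.

```python
import random

# Simulate the two-state chain and track the relative frequency of
# each state over time.
P = {1: {1: 0.5, 2: 0.5},
     2: {1: 0.75, 2: 0.25}}

def simulate(start, steps=1000):
    state, visits = start, {1: 0, 2: 0}
    for _ in range(steps):
        visits[state] += 1
        state = 1 if random.random() < P[state][1] else 2
    return {s: n / steps for s, n in visits.items()}

print(simulate(start=1))   # both starting states give frequencies near
print(simulate(start=2))   # the stationary distribution (0.6, 0.4)
```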

MCMC for Bayesian networks


Given evidence e and non-evidence variables X , use
Markov chains to sample from the distribution P(X | e)
Obtain sequence of states x0 , x1 , . . . , discard initial
portion
After convergence, samples xk , xk+1 , . . . have desired
distribution P(X | e)
Many variants of Markov chain Monte Carlo algorithms
Consider only Gibbs sampling


Gibbs sampling
Fix evidence variables to e, assign arbitrary values to
non-evidence variables X
Recall: the Markov blanket of a variable consists of its parents,
children, and children's parents.
Iterate the following:
pick arbitrary variable Xi from X
sample from P(Xi | MarkovBlanket(Xi ))
new state = old state, with new value of Xi


Gibbs sampling (cont.)


Calculating P(Xi | MarkovBlanket(Xi)):

P(xi | MarkovBlanket(Xi)) ∝ P(xi | parents(Xi)) · ∏_{Yi ∈ children(Xi)} P(yi | parents(Yi))

With this, calculate P(xi | MarkovBlanket(Xi)) and
P(¬xi | MarkovBlanket(Xi)), then normalize to obtain
P(Xi | MarkovBlanket(Xi))
Sample from this for the next value of Xi, and thus the next
state of the Markov chain

Bayesian network MCMC example


FamilyOut example: Calculate P(F | l, d)
Start with arbitrary non-evidence settings (f , b, h)
Pick F , sample from P(F | l, d, b), obtain f
Pick B, sample from P(B | f , d), obtain b
Pick H, sample from P(H | d), obtain h
Iterate last three steps 50000 times, keep last 10000
states
Obtain P(f | l, d) ≈ 0.9016 (correct value: 0.90206)
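A sketch of this Gibbs sampler for P(F | l, d), using the FamilyOut CPTs from the slides; the resampling distributions follow the proportionality on the previous slide, and the burn-in matches the 50000/10000 split described above. Names and layout are my own.

```python
import random

# Gibbs sampling for P(F | l, d): the evidence l, d is fixed and
# F, B, H are resampled in turn from their Markov-blanket distributions.
P_f, P_b = 0.2, 0.05
P_l = {True: 0.99, False: 0.1}
P_d = {(True, True): 0.99, (True, False): 0.88,
       (False, True): 0.96, (False, False): 0.2}
P_h = {True: 0.6, False: 0.25}

def sample_from(p_true_unnorm, p_false_unnorm):
    """Normalize the two unnormalized values and draw True/False."""
    p = p_true_unnorm / (p_true_unnorm + p_false_unnorm)
    return random.random() < p

f, b, h = True, True, True          # arbitrary initial non-evidence values
counts, kept = 0, 0
for step in range(50_000):
    # P(F | l, d, b) is proportional to P(F) P(l | F) P(d | F, b)
    f = sample_from(P_f * P_l[True] * P_d[(True, b)],
                    (1 - P_f) * P_l[False] * P_d[(False, b)])
    # P(B | f, d) is proportional to P(B) P(d | f, B)
    b = sample_from(P_b * P_d[(f, True)], (1 - P_b) * P_d[(f, False)])
    # H has no children, so its Markov blanket is just DogOut = true
    h = random.random() < P_h[True]
    if step >= 40_000:              # keep only the last 10000 states
        kept += 1
        counts += f
print(counts / kept)                # approx P(f | l, d) = 0.902
```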

Comparison of inference algorithms


Inference by enumeration computationally prohibitive
Variable elimination removes all irrelevant variables
Direct sampling from joint distribution: easy when no
evidence present
Use rejection sampling and likelihood weighting for more
efficient calculations
Markov chain Monte Carlo methods are the most efficient
choice for large networks, since new states are computed from
old states, but the samples are no longer independent

Summary
Bayesian networks are graphical representations of causal
influence among random variables
Network structure graphically specifies conditional
independence assumptions
Need the conditional distribution of each node, given its parents
Use noisy OR to reduce number of parameters in tables
Reasoning types in Bayesian networks: causal, diagnostic,
and explaining away
There are exact and approximate inference algorithms
