
Bayesian AI Tutorial
AI'99, Sydney, 6 December 1999
Ann E. Nicholson and Kevin B. Korb
School of Computer Science and Software Engineering
Monash University, Clayton, VIC 3168, AUSTRALIA
{ann,korb}@csse.monash.edu.au

Overview
1. Introduction to Bayesian AI (20 min)
2. Bayesian networks (50 min)
   Break (10 min)
3. Applications (50 min)
   Break (10 min)
4. Learning Bayesian networks (50 min)
5. Current research issues (10 min)
6. Bayesian Net Lab (60 min: optional)
7. Dinner (optional)


Introduction to Bayesian AI
- Reasoning under uncertainty
- Probabilities
- Alternative formalisms: fuzzy logic; MYCIN's certainty factors; default logic
- Bayesian philosophy: Dutch book arguments; Bayes' Theorem; conditionalization; confirmation theory
- Bayesian decision theory
- Towards a Bayesian AI

Reasoning under Uncertainty
- Uncertainty: the quality or state of being not clearly known. This encompasses most of what we understand about the world, and most of what we would like our AI systems to understand.
- Distinguishes deductive knowledge (e.g., mathematics) from inductive belief (e.g., science).
- Sources of uncertainty:
  - Ignorance (which side of this coin is up?)
  - Physical randomness (which side of this coin will land up?)
  - Vagueness (which tribe am I closest to genetically? Picts? Angles? Saxons? Celts?)
Probabilities
- The classic approach to reasoning under uncertainty (Blaise Pascal and Fermat).
- Kolmogorov's Axioms:
  1. P(U) = 1
  2. ∀X ⊆ U, P(X) ≥ 0
  3. ∀X, Y ⊆ U, if X ∩ Y = ∅ then P(X ∨ Y) = P(X) + P(Y)
- Conditional probability: P(X|Y) = P(X ∧ Y) / P(Y)
- Independence: X ⊥ Y iff P(X|Y) = P(X)

Fuzzy Logic
- Designed to cope with vagueness: is Fido a Labrador or a Shepherd? Fuzzy set theory:
  m(Fido ∈ Labrador) = m(Fido ∈ Shepherd) = 0.5
- Extended to fuzzy logic, which takes intermediate truth values: T(Labrador(Fido)) = 0.5. Combination rules:
  T(p ∧ q) = min(T(p), T(q))
  T(p ∨ q) = max(T(p), T(q))
  T(¬p) = 1 - T(p)
- Not suitable for coping with randomness or ignorance. Obviously not:
  Uncertainty(inclement weather) = max(Uncertainty(rain), Uncertainty(hail), ...)
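The combination rules are easy to play with in code. A small sketch (the 0.5 degrees for rain and hail are illustrative, not from the tutorial) of why the max rule is a poor model of a disjunction of independent random events:

```python
# Fuzzy combination rules: conjunction = min, disjunction = max, negation = 1 - T.
def t_and(tp, tq): return min(tp, tq)
def t_or(tp, tq): return max(tp, tq)
def t_not(tp): return 1 - tp

# Illustrative degrees for two independent weather events:
rain, hail = 0.5, 0.5
fuzzy_inclement = t_or(rain, hail)          # max rule
prob_inclement = rain + hail - rain * hail  # P(rain or hail) for independent events

print(fuzzy_inclement)  # 0.5: no more credible than rain alone
print(prob_inclement)   # 0.75: the disjunction of independent risks is more likely
```

The max rule leaves the disjunction no more believable than its most believable disjunct, which is exactly what randomness violates.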


MYCIN's Certainty Factors
- An uncertainty formalism developed for the early expert system MYCIN (Buchanan and Shortliffe, 1984). Elicit for (h, e):
  measure of belief: MB(h, e) ∈ [0, 1]
  measure of disbelief: MD(h, e) ∈ [0, 1]
  CF(h, e) = MB(h, e) - MD(h, e) ∈ [-1, 1]
- Special functions are provided for combining evidence.
- Problems:
  - No semantics was ever given for belief/disbelief.
  - Heckerman (1986) proved that the restrictions required for a probabilistic semantics imply absurd independence assumptions.

Default Logic
- Intended to reflect stereotypical reasoning under uncertainty (Reiter 1980). Example: from Bird(Tweety) and the default rule Bird(x) → Flies(x), conclude Flies(Tweety).
- Problems:
  - The best semantics for default rules are probabilistic (Pearl 1988, Korb 1995).
  - It mishandles combinations of low-probability events. E.g., from ApplyForJob(me) and the default rule ApplyForJob(x) → Reject(x), conclude Reject(me). I.e., the dole always looks better than applying for a job!

Probability Theory
- So, why not use probability theory to represent uncertainty? That's what it was invented for: dealing with physical randomness and degrees of ignorance.
- Furthermore, if you make bets which violate probability theory, you are subject to Dutch books: a Dutch book is a sequence of fair bets which collectively guarantee a loss.
- Fair bets are bets based upon the standard odds-probability relation:
  O(h) = P(h) / (1 - P(h))
  P(h) = O(h) / (1 + O(h))

A Dutch Book
- Payoff table on a bet for h (odds = p/(1 - p); S = betting unit):
  h = T: $(1 - p)S
  h = F: -$pS
- Given a fair bet, the expected value from such a payoff is always $0.
- Now, let's violate the probability axioms. Example: say P(A) = -0.1 (violating A2). Payoff table against A (the inverse of a bet for A), with S = 1:
  ¬A = T: $pS = -$0.10
  ¬A = F: -$(1 - p)S = -$1.10
- The bettor loses whether or not A holds: a Dutch book.
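The payoff tables can be checked mechanically. A sketch encoding the fair-bet payoffs, showing that quoting P(A) = -0.1 loses money whether or not A occurs:

```python
# A fair bet for h at probability p pays (1-p)*S if h is true, -p*S if false,
# so its expected value is p*(1-p)*S - (1-p)*p*S = 0.
def payoff_for(p, S, h_true):
    return (1 - p) * S if h_true else -p * S

# A bet against A is a bet for not-A at probability 1 - p.
def payoff_against(p, S, a_true):
    return payoff_for(1 - p, S, not a_true)

p, S = -0.1, 1.0  # P(A) = -0.1 violates axiom A2
print(round(payoff_against(p, S, True), 2))   # -1.1: lose $1.10 if A is true
print(round(payoff_against(p, S, False), 2))  # -0.1: lose $0.10 if A is false too
```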


Bayes' Theorem; Conditionalization
- Due to Reverend Thomas Bayes (1764):
  P(h|e) = P(e|h) P(h) / P(e)
- Conditionalization: P'(h) = P(h|e)
- Or, read Bayes' theorem as:
  Posterior = Likelihood × Prior / Prob of evidence
- Assumptions:
  1. Joint priors over {h_i} and e exist.
  2. Total evidence: e, and only e, is learned.

Bayesian Decision Theory
- Frank Ramsey (1931).
- Decision making under uncertainty: what action to take (plan to adopt) when the future state of the world is not known.
- Bayesian answer: find the utility of each possible outcome (action-state pair) and take the action that maximizes expected utility.
- Example:
  Action            Rain (p = .4)   Shine (1 - p = .6)
  Take umbrella          30                10
  Leave umbrella        -50               100
  Expected utilities:
  E(Take umbrella) = (30)(.4) + (10)(.6) = 18
  E(Leave umbrella) = (-50)(.4) + (100)(.6) = 40
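The expected-utility arithmetic of the umbrella example, as a few lines of Python (p = 0.4 and the utilities are taken from the table):

```python
# EU(action) = sum over states of P(state) * U(action, state).
p_rain = 0.4
utility = {("take", "rain"): 30, ("take", "shine"): 10,
           ("leave", "rain"): -50, ("leave", "shine"): 100}

def expected_utility(action):
    return (p_rain * utility[(action, "rain")]
            + (1 - p_rain) * utility[(action, "shine")])

print(round(expected_utility("take"), 1))   # 18.0
print(round(expected_utility("leave"), 1))  # 40.0: leaving the umbrella maximizes EU
```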

Bayesian AI
A Bayesian conception of an AI is: an autonomous agent which
- has a utility structure (preferences),
- can learn about its world and the relation between its actions and future states (probabilities), and
- maximizes its expected utility.
The techniques used in learning about the world are (primarily) statistical... hence Bayesian data mining.

Bayesian Networks: Overview
- Syntax
- Semantics
- Evaluation methods
- Influence diagrams (decision networks)
- Dynamic Bayesian networks


Bayesian Networks
- A data structure that represents the dependence between variables and gives a concise specification of the joint probability distribution.
- A Bayesian network is a graph in which the following holds:
  1. A set of random variables makes up the nodes in the network.
  2. A set of directed links or arrows connects pairs of nodes.
  3. Each node has a conditional probability table that quantifies the effects the parents have on the node.
  4. The graph is directed and acyclic (a DAG), i.e., it has no directed cycles.

Example: Earthquake (Pearl, R&N)
- You have a new burglar alarm installed. It is reliable at detecting burglary, but also responds to minor earthquakes.
- Two neighbours (John, Mary) promise to call you at work when they hear the alarm. John always calls when he hears the alarm, but confuses the alarm with the phone ringing (and calls then also). Mary likes loud music and sometimes misses the alarm!
- Given evidence about who has and hasn't called, estimate the probability of a burglary.

Earthquake Example: Network Structure
- Assumptions: John and Mary don't perceive burglary directly, and they do not feel minor earthquakes.
- Note: there is no information about loud music, or about the telephone ringing and confusing John. This is summarised in the uncertainty in the links from Alarm to JohnCalls and MaryCalls.
- Once the topology is specified, we need a conditional probability table (CPT) for each node. Each row contains the conditional probability of each node value for a conditioning case, and each row must sum to 1. A table for a Boolean variable with n Boolean parents contains 2^(n+1) probabilities. A node with no parents has one row (the prior probabilities).
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]
  P(B) = 0.001        P(E) = 0.002
  P(A|B,E):  B=T,E=T: 0.95;  B=T,E=F: 0.94;  B=F,E=T: 0.29;  B=F,E=F: 0.001
  P(J|A):  A=T: 0.90;  A=F: 0.05
  P(M|A):  A=T: 0.70;  A=F: 0.01


Semantics of Bayesian Networks
A Bayesian network can be understood in two ways:
1. As a (more compact) representation of the joint probability distribution; helpful in understanding how to construct the network.
2. As encoding a collection of conditional independence statements; helpful in understanding how to design inference procedures.

Representing the joint probability distribution:
  P(X1 = x1, X2 = x2, ..., Xn = xn) = P(x1, x2, ..., xn)
    = P(x1) P(x2|x1) ... P(xn | x1 ∧ ... ∧ x_{n-1})
    = ∏_i P(xi | x1 ∧ ... ∧ x_{i-1})
    = ∏_i P(xi | π(Xi))
where π(Xi) denotes the parents of Xi.

Example:
  P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B ∧ ¬E) P(¬B) P(¬E)
    = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
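The factorized joint can be checked directly from the CPTs (note the priors used here are P(B) = 0.001 and P(E) = 0.002, the values the product above relies on):

```python
# P(J, M, A, not-B, not-E) = P(J|A) P(M|A) P(A|not-B,not-E) P(not-B) P(not-E)
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=T | B, E)
p_j = {True: 0.90, False: 0.05}  # P(J=T | A)
p_m = {True: 0.70, False: 0.01}  # P(M=T | A)

joint = p_j[True] * p_m[True] * p_a[(False, False)] * (1 - p_b) * (1 - p_e)
print(round(joint, 5))  # 0.00063
```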


Network Construction
1. Choose the set of relevant variables Xi that describe the domain.
2. Choose an ordering for the variables.
3. While there are variables left:
   (a) Pick a variable Xi and add a node to the network for it.
   (b) Set π(Xi) to some minimal set of nodes already in the net such that the conditional independence property is satisfied:
       P(Xi | X_{i-1}, ..., X1) = P(Xi | π(Xi))
   (c) Define the CPT for Xi.

Compactness and Node Ordering
- The compactness of a BN is an example of a locally structured (or sparse) system.
- The correct order in which to add nodes is to add the root causes first, then the variables they influence, and so on until the leaves are reached.
- Examples of wrong orderings (which still represent the same joint distribution):
  1. MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.
     [Figure: the denser network that results from this ordering.]


Compactness and Node Ordering (cont.)
  2. MaryCalls, JohnCalls, Earthquake, Burglary, Alarm.
     [Figure: the resulting network.] This one has more probabilities than the full joint! See below for why.

Conditional Independence: Causal Chains
- Causal chains give rise to conditional independence:
  A → B → C
  P(C | A ∧ B) = P(C | B)
- Example: A = Jack's flu, B = severe cough, C = Jill's flu.


Common Causes
- Common causes (or ancestors) also give rise to conditional independence:
  A ← B → C
  P(C | A ∧ B) = P(C | B)
- Example: A = Jack's flu, B = Joe's flu, C = Jill's flu.

Common Effects
- Common effects (or their descendants) give rise to conditional dependence:
  A → B ← C
  P(A | C ∧ B) ≠ P(A | B)
- Example: A = flu, B = severe cough, C = tuberculosis.
- Given a severe cough, flu explains away tuberculosis.

D-separation
- A graph-theoretic criterion of conditional independence.
- We can determine whether a set of nodes X is independent of another set Y, given a set of evidence nodes E, i.e., X ⊥ Y | E.
- Earthquake example: [Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

Causal Ordering
Why does variable order affect network density? Because:
- using the causal order allows direct representation of conditional independencies, while
- violating the causal order requires new arcs to re-establish conditional independencies.


Causal Ordering (cont'd)
- Flu → Cough ← TB: Flu and TB are marginally independent.
- Given the ordering Cough, Flu, TB: [Figure: Cough → Flu; Cough → TB] the marginal independence of Flu and TB must be re-established by adding Flu → TB or TB → Flu.

Inference in Bayesian Networks
- The basic task for any probabilistic inference system: compute the posterior probability distribution for a set of query variables, given values for some evidence variables. Also called belief updating.
- Types of inference: diagnostic, causal, intercausal (explaining away) and mixed. [Figure: the four query (Q) and evidence (E) configurations relative to the arc directions.]


Kinds of Inference
- Diagnostic inferences: from effects to causes. P(Burglary | JohnCalls)
- Causal inferences: from causes to effects. P(JohnCalls | Burglary), P(MaryCalls | Burglary)
- Intercausal inferences: between causes of a common effect. P(Burglary | Alarm), P(Burglary | Alarm ∧ Earthquake)
- Mixed inference: combining two or more of the above. P(Alarm | JohnCalls ∧ ¬Earthquake), P(Burglary | JohnCalls ∧ ¬Earthquake)

Inference Algorithms: Overview
- Exact inference:
  - trees and polytrees: message-passing algorithm
  - multiply-connected networks: clustering
- Approximate inference for large, complex networks: stochastic simulation and other approximation methods.
- In the general case, both sorts of inference are computationally complex (NP-hard).
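Each of these queries can be answered by brute-force enumeration: sum the factorized joint over the unobserved variables and normalize. A sketch for the diagnostic query P(Burglary | JohnCalls, MaryCalls) on the earthquake network (exponential in the number of hidden variables, so only feasible for small networks); the priors are the P(B) = 0.001 and P(E) = 0.002 used in the joint-probability calculation:

```python
from itertools import product

# CPTs for the earthquake network.
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}
p_m = {True: 0.70, False: 0.01}

def pr(p_true, val):  # P(var = val) from P(var = True)
    return p_true if val else 1 - p_true

def joint(b, e, a, j, m):
    return (pr(p_b, b) * pr(p_e, e) * pr(p_a[(b, e)], a)
            * pr(p_j[a], j) * pr(p_m[a], m))

# P(B | j, m) = P(B, j, m) / P(j, m), summing out E and A.
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(round(num / den, 3))  # 0.284
```

Both calls receive evidence and confirm the classic result: even with both neighbours calling, a burglary is still fairly unlikely.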

Message Passing Example
[Figure: the earthquake network extended with PhoneRings as a second parent of JohnCalls. CPTs: P(B) = 0.001, P(E) = 0.002, P(Ph) = 0.05; P(A|B,E) as before; P(J|Ph,A) = 0.95, 0.5, 0.90, 0.01 for (T,T), (T,F), (F,T), (F,F); P(M|A) = 0.70, 0.01. The figure shows the π and λ message vectors and resulting beliefs at each node, e.g. bel(B) = (.001, .999), bel(E) = (.002, .998), bel(Ph) = (.05, .95), with evidence vectors λ(J) = (1,1) and λ(M) = (1,0).]

Multiply-Connected Networks
- Networks where two nodes are connected by more than one path:
  - two or more possible causes which share a common ancestor;
  - one variable can influence another through more than one causal mechanism.
- Example: the Cancer network: A = metastatic cancer, B = increased total serum calcium, C = brain tumour, D = coma, E = severe headaches; arcs A → B, A → C, B → D, C → D, C → E.
- In such networks simple message passing doesn't work: evidence gets counted twice.



Clustering Methods
- Transform the network into a probabilistically equivalent polytree by merging (clustering) the offending nodes.
- Cancer example: a new node Z combines B and C:
  P(z|a) = P(b, c|a) = P(b|a) P(c|a)
  P(e|z) = P(e|b, c) = P(e|c)
  P(d|z) = P(d|b, c)
- The Jensen join-tree version (Jensen, 1996) is currently the most efficient algorithm in this class (e.g., used in Hugin and Netica).
- Network evaluation is done in two stages:
  - Compile into a join-tree: may be slow, and may require too much memory if the original network is highly connected.
  - Do belief updating in the join-tree (usually fast).
- Caveat: clustered nodes have increased complexity; updates may be computationally complex.

Approximate Inference with Stochastic Simulation
- Use the network to generate a large number of cases that are consistent with the network distribution.
- Evaluation may not converge to the exact values (in reasonable time), but it usually converges close to the exact solution quickly if the evidence is not too unlikely.
- Performs better when evidence is nearer to the root nodes; however, in real domains evidence tends to be near the leaves (Nicholson & Jitnah, 1998).

Making Decisions
- Bayesian networks can be extended to support decision making.
- Preferences between different outcomes of various plans: utility theory.
- Decision theory = utility theory + probability theory.

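A minimal sketch of the idea on the earthquake network: forward ("logic") sampling draws each node in topological order and estimates a marginal by frequency. For comparison, P(JohnCalls = T) is about 0.052 by exact enumeration:

```python
import random

random.seed(0)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def sample_john_calls():
    # Sample ancestors first, then JohnCalls given its parent.
    b = random.random() < 0.001
    e = random.random() < 0.002
    a = random.random() < p_a[(b, e)]
    return random.random() < (0.90 if a else 0.05)

n = 100_000
estimate = sum(sample_john_calls() for _ in range(n)) / n
print(abs(estimate - 0.052) < 0.005)  # True: close to the exact marginal
```

Conditioning on evidence by rejection is where this gets slow: samples inconsistent with unlikely evidence are thrown away, which is the convergence caveat above.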


Decision Networks
- A decision network represents information about:
  - the agent's current state,
  - its possible actions,
  - the state that will result from the agent's action, and
  - the utility of that state.
- Also called influence diagrams (Howard & Matheson, 1981).

Types of Nodes
- Chance nodes (ovals): represent random variables (as in Bayesian networks); each has an associated CPT; parents can be decision nodes and other chance nodes.
- Decision nodes (rectangles): represent points where the decision maker has a choice of actions.
- Utility nodes (diamonds): represent the agent's utility function (also called value nodes in the literature); parents are the variables describing the outcome state that directly affect utility; each has an associated table representing a multi-attribute utility function.


Example: Umbrella
[Network: Weather → Forecast; decision node TakeUmbrella; utility node with parents Weather and TakeUmbrella]
  P(Weather = Rain) = 0.3
  P(Forecast = Rainy | Weather = Rain) = 0.60
  P(Forecast = Cloudy | Weather = Rain) = 0.25
  P(Forecast = Sunny | Weather = Rain) = 0.15
  P(Forecast = Rainy | Weather = NoRain) = 0.10
  P(Forecast = Cloudy | Weather = NoRain) = 0.20
  P(Forecast = Sunny | Weather = NoRain) = 0.70
  U(NoRain, TakeUmbrella) = 20
  U(NoRain, LeaveAtHome) = 100
  U(Rain, TakeUmbrella) = 70
  U(Rain, LeaveAtHome) = 0

Evaluating Decision Networks: Algorithm
1. Set the evidence variables for the current state.
2. For each possible value of the decision node:
   (a) Set the decision node to that value.
   (b) Calculate the posterior probabilities for the parent nodes of the utility node (as for BNs).
   (c) Calculate the resulting expected utility for the action.
3. Return the action with the highest expected utility.
This is simple for a single decision, but less so when executing several actions in sequence (i.e., a plan).
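Steps 1-3 applied to the umbrella network (all numbers from the example): the posterior over Weather comes from Bayes' theorem on the observed forecast, and the action with the higher expected utility is returned.

```python
p_rain = 0.3
p_forecast = {"rain": {"rainy": 0.60, "cloudy": 0.25, "sunny": 0.15},
              "norain": {"rainy": 0.10, "cloudy": 0.20, "sunny": 0.70}}
utility = {("take", "rain"): 70, ("take", "norain"): 20,
           ("leave", "rain"): 0, ("leave", "norain"): 100}

def best_action(forecast):
    # Posterior P(Rain | forecast) by Bayes' theorem.
    num = p_forecast["rain"][forecast] * p_rain
    den = num + p_forecast["norain"][forecast] * (1 - p_rain)
    post_rain = num / den
    # Expected utility of each decision under that posterior.
    eu = {a: post_rain * utility[(a, "rain")]
             + (1 - post_rain) * utility[(a, "norain")]
          for a in ("take", "leave")}
    return max(eu, key=eu.get)

print(best_action("rainy"))  # take  (P(Rain | Rainy) = 0.72)
print(best_action("sunny"))  # leave
```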


Dynamic Belief Networks
[Figure: a DBN unrolled over time slices t-2 ... t+2, with state evolution arcs State t → State t+1 and a sensor-model arc State t → Obs t in each slice.]
- The values of the state variables at time t depend only on the values at t-1.
- Distributions can be calculated for S_{t+1} and further: probabilistic projection.
- This can be done using standard BN updating algorithms.
- This type of DBN gets very large, very quickly, so usually only two time slices of the network are kept.

Dynamic Decision Networks
- Similarly, decision networks can be extended to include temporal aspects.
- A sequence of decisions taken = a plan.
[Figure: a dynamic decision network with decision nodes D_t ... D_{t+2}, states State t ... State t+3, observations Obs t ... Obs t+3, and a utility node U_{t+3}.]


Uses of Bayesian Networks
1. Calculating the belief in query variables given values for evidence variables (above).
2. Predicting values of dependent variables given values for independent variables.
3. Decision making based on probabilities in the network and on the agent's utilities (influence diagrams [Howard and Matheson 1981]).
4. Deciding which additional evidence should be observed in order to gain useful information.
5. Sensitivity analysis to test the impact of changes in probabilities or utilities on decisions.

Summary
- Bayes' rule allows unknown probabilities to be computed from known ones.
- Conditional independence (due to causal relationships) allows efficient updating.
- Bayesian networks are a natural way to represent conditional independence information: the links between nodes capture the qualitative aspects; the conditional probability tables capture the quantitative aspects.
- Inference means computing the probability distribution for a set of query variables, given a set of evidence variables.
- Inference in Bayesian networks is very flexible: evidence can be entered about any node, and beliefs updated in any other nodes.
- The speed of inference in practice depends on the structure of the network: how many loops; the numbers of parents; the location of evidence and query nodes.
- Bayesian networks can be extended with decision nodes and utility nodes to support decision making: decision networks or influence diagrams.
- Bayesian and decision networks can be extended to allow explicit reasoning about changes over time.

Applications: Overview
- (Simple) example networks
- Medical decision making: survey of applications
- Planning and plan recognition
- Natural language generation (NAG)
- Bayesian poker
- Deployed Bayesian networks (see handout for details)
- BN software
- Web resources


Example: Cancer
- Metastatic cancer is a possible cause of a brain tumour and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumour. (Example from (Pearl, 1988).)
[Network: A = metastatic cancer; B = increased total serum calcium; C = brain tumour; D = coma; E = severe headaches; arcs A → B, A → C, B → D, C → D, C → E]
  P(a) = 0.2
  P(b|a) = 0.80     P(b|¬a) = 0.20
  P(c|a) = 0.20     P(c|¬a) = 0.05
  P(d|b,c) = 0.80   P(d|¬b,c) = 0.80
  P(d|b,¬c) = 0.80  P(d|¬b,¬c) = 0.05
  P(e|c) = 0.80     P(e|¬c) = 0.60

Example: Asia
- A patient presents to a doctor with shortness of breath. The doctor considers that the possible causes are tuberculosis, lung cancer and bronchitis. Additional relevant information includes whether the patient has recently visited Asia (where tuberculosis is more prevalent) and whether or not the patient is a smoker (which increases the chances of cancer and bronchitis). A positive X-ray would indicate either TB or lung cancer. (Example from (Lauritzen, 1988).)
[Network: visit to Asia → tuberculosis; smoking → lung cancer; smoking → bronchitis; tuberculosis and lung cancer → "either tub or lung cancer" → positive X-ray and dyspnoea; bronchitis → dyspnoea]
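As a small check on the cancer network's numbers, the prior probability of coma can be computed by summing the factorized joint over A, B and C:

```python
from itertools import product

# P(d) = sum over a, b, c of P(a) P(b|a) P(c|a) P(d|b,c), CPTs from the slide.
p_a = 0.2
p_b = {True: 0.80, False: 0.20}  # P(b | a)
p_c = {True: 0.20, False: 0.05}  # P(c | a)
p_d = {(True, True): 0.80, (True, False): 0.80,
       (False, True): 0.80, (False, False): 0.05}  # P(d | b, c)

def pr(p_true, val):
    return p_true if val else 1 - p_true

p_coma = sum(pr(p_a, a) * pr(p_b[a], b) * pr(p_c[a], c) * p_d[(b, c)]
             for a, b, c in product([True, False], repeat=3))
print(round(p_coma, 2))  # 0.32
```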


Example: A Lecturer's Life
- Dr. Ann Nicholson spends 60% of her work time in her office. The rest of her work time is spent elsewhere. When Ann is in her office, half the time her light is off (when she is trying to hide from students and get some real work done). When she is not in her office, she leaves her light on only 5% of the time. 80% of the time she is in her office, Ann is logged onto the computer. Because she sometimes logs onto the computer from home, 10% of the time she is not in her office, she is still logged onto the computer.
- Suppose a student checks Dr. Nicholson's login status and sees that she is logged on. What effect does this have on the student's belief that Dr. Nicholson's light is on? (Example from (Nicholson, 1999).)
[Network: in-office → lights-on; in-office → logged-on]

Probabilistic Reasoning in Medicine
- See the handout from (Dean et al., 1993).
- The simplest tree-structured network for diagnostic reasoning: H = disease hypothesis; F = findings (symptoms, test results).
- Multiply-connected network (QMR structure): B = background information (e.g., age and sex of patient).
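The lecturer's-life query is small enough to work by hand: lights-on and logged-on are conditionally independent given in-office, so condition on the login and then sum out the office state. Seeing the login raises the belief in the light being on from 0.32 to about 0.47:

```python
p_office = 0.6
p_light = {True: 0.5, False: 0.05}  # P(lights-on | in-office)
p_login = {True: 0.8, False: 0.1}   # P(logged-on | in-office)

# Prior belief in lights-on.
p_light_prior = p_light[True] * p_office + p_light[False] * (1 - p_office)

# Posterior on in-office given logged-on, then sum out in-office.
p_office_post = (p_login[True] * p_office /
                 (p_login[True] * p_office + p_login[False] * (1 - p_office)))
p_light_post = p_light[True] * p_office_post + p_light[False] * (1 - p_office_post)

print(round(p_light_prior, 3))  # 0.32
print(round(p_light_post, 3))   # 0.465
```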


Medical Applications
- Pathfinder case study: see the handout using material from (Russell & Norvig, 1995, pp. 457-458).
- QMR (Quick Medical Reference): 600 diseases, 4,000 findings, 40,000 arcs (Dean & Wellman, 1991).
- MUNIN (Andreassen et al., 1989): neuromuscular disorders, about 1000 nodes; exact computation < 5 seconds.
- Glucose prediction and insulin dose adjustment, a DBN application (Andreassen et al., 1991).
- CPSC project (Pradhan et al., 1994): 448 nodes, 906 links, 8254 conditional probability values; the LW algorithm gave answers in 35 minutes (1994).
- Application of LW to medical diagnosis (Shwe & Cooper, 1990).
- Forecasting sleep apnea (Dagum et al., 1993).
- ALARM (Beinlich et al., 1989): 37 nodes, 42 arcs. (See the Netica examples.)
[Figure: the ALARM network: 37 nodes (MinVolSet, Ventmach, Disconnect, PulmEmbolus, Intubation, VentTube, KinkedTube, PAP, Shunt, Press, VentLung, FiO2, VentAlv, MinVol, PVSat, InsuffAnesth, ArtCO2, Anaphylaxis, ExpCO2, SaO2, TPR, Catechol, LVFailure, Hypovolemia, ErrCauter, HR, ErrLowOutput, History, StrokeVolume, LVEDVolume, HRSat, HREKG, HRBP, CO, CVP, PCPW, BP) and 42 arcs, annotated with node state counts and arc weights.]

Robot Navigation and Tracking
- An example of a Dynamic Decision Network (Dean & Wellman, 1991).


Plan Recognition Applications
- Keyhole plan recognition in an Adventure game (Albrecht et al., 1998).
  [Figure: four DBN variants over action nodes A and location nodes L at successive time slices: (a) mainModel, (b) indepModel, (c) actionModel, (d) locationModel.]
- Traffic plan recognition (Pynadath & Wellman, 1995).

Traffic Monitoring: BATmobile
- (Forbes et al., 1995)
- An example of a DBN.


Natural Language Generation
- NAG (McConachy et al., 1999): A Nice Argument Generator. Uses two Bayesian networks to generate and assess natural language arguments:
  - Normative model: represents our best understanding of the domain; proper (constrained) Bayesian updating, given premises.
  - User model: represents our best understanding of the human; Bayesian updating modified to reflect human biases (e.g., overconfidence; Korb, McConachy, Zukerman, 1997).
- The BNs are embedded in a semantic hierarchy: higher-level concepts like motivation or ability; lower-level concepts like Grade Point Average; and propositions, e.g., [publications authored by person X cited > 5 times].
  [Figure: two semantic-network layers above the Bayesian network; the hierarchy supports attentional modeling and constrained updating.]


Bayesian Poker
- (Korb et al., 1999)
- Poker is ideal for testing automated reasoning under uncertainty:
  - physical randomisation;
  - incomplete hand information;
  - incomplete opponent information (strategies, bluffing, etc.).
- Bayesian networks are a good representation for complex game playing.
- Our Bayesian Poker Player (BPP) plays 5-card stud poker at the level of a good amateur human player.
- To play: telnet indy13.cs.monash.edu.au, login: poker, password: maverick

Bayesian Poker BN
- The Bayesian network provides an estimate of winning at any point in the hand.
- Betting curves based on pot-odds are used to determine the action (bet/call, pass or fold).
[Figure: the poker network: BPP Win depends on OPP Final and BPP Final; the final hands are linked to OPP Current and BPP Current via the matrix M_{C|F}; observation nodes OPP Action and OPP Upcards connect to OPP Current via M_{A|C} and M_{U|C}.]

Bayesian AI Tutorial

Nicholson & Korb

63

Nicholson & Korb

64

Bayesian Poker BN (cont.)
- Different networks (matrices) for each round.
- OPP Current, BPP Current: (partial) hand types with the cards dealt so far.
- OPP Final, BPP Final: hand types after all 5 cards are dealt.
- Observation nodes:
  - OPP Upcards: all the opponent's cards except the first are visible to BPP.
  - OPP Action: BPP knows the opponent's action.

Hand Types
- The initial 9 hand types proved too coarse. We use a finer granularity for the most common hands (busted and a pair): low, medium, Q-high, K-high, A-high; this results in 17 hand types.
- Conditional probability matrices:
  - M_{A|C}: probability of the opponent's action given current hand type; learned from observed showdown data.
  - M_{U|C} and M_{C|F}: estimated by dealing out 10^7 poker hands.
- Belief updating: since the network is a polytree, a simple, fast propagation updating algorithm is used.

Extensions
- BPP outperforms automated opponents, is fairly even with average amateur humans, and loses to experienced humans.
- Learning the OPP Action CPTs does not (yet) appear to improve performance.
- BN improvements: refine the action nodes; further refine the hand types; improve the network structure.
- Add bluffing to the opponent model; improve learning of the opponent model.
- More complex poker: multi-opponent games, table-stake games.
- A DBN model to represent changes over time.

Deployed BNs
From the Web Site database (see handout for details):
- TRACS: predicting reliability of military vehicles.
- Andes: intelligent tutoring system for physics.
- Distributed Virtual Agents advising online users on web sites.
- Information extraction from natural language text.
- DXPLAIN: decision support for medical diagnosis.
- Iliad: teaching tool for medical students.
- Microsoft Health Product: find-by-symptom feature.
- Weapons scheduling.
- Monitoring power generation.
- Processor fault diagnosis.
- Knowledge Industries applications: (a) in medicine: sleep disorders, pathology, trauma care, hand and wrist evaluations, dermatology, and home-based health evaluations; (b) in capital equipment: locomotives, gas-turbine engines for aircraft and land-based power production, the space shuttle, and office equipment.
- Software debugging.
- Vista: decision support system used at NASA Mission Control Center.
- Microsoft: (a) Answer Wizard (Office 95), information retrieval; (b) Print Troubleshooter; (c) Aladdin, troubleshooting customer support.

BN Software: Issues
- Functionality: especially application vs API.
- Price: many are free for demo versions or educational use; commercial licence costs.
- Availability (platforms).
- Quality: GUI; documentation and help; leading edge; robustness of the software company.

BN Software
- Analytica: www.lumina.com
- Hugin: www.hugin.com
- Netica: www.norsys.com
- JavaBayes: http://www.cs.cmu.edu/~javabayes/Home/
- Many other packages (see next slide).
- The first three are available during the tutorial lab session.

Web Resources
- Bayesian Belief Network site (Russell Greiner): www.cs.ualberta.ca/~greiner/bn.html
- Bayesian Network Repository (Nir Friedman): www-nt.cs.berkeley.edu/home/nir/public_html/Repository/index.htm
- Summary of BN software and links to software sites (Kevin Murphy): HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bnsoft.html


Applications: Summary
- Various BN structures are available to compactly and accurately represent certain types of domain features.
- Bayesian networks have been used for a wide range of AI applications.
- Robust and easy-to-use Bayesian network software is now readily available.

Learning Bayesian Networks
- Linear and discrete models
- Learning network parameters: linear coefficients; learning probability tables
- Learning causal structure:
  - conditional independence learning; statistical equivalence; TETRAD II
  - Bayesian learning of Bayesian networks; Cooper & Herskovits: K2; learning variable order; statistical equivalence learners
  - full causal learners
  - minimum encoding methods: Lam & Bacchus's MDL learner; MML metrics; MML search algorithms; MML sampling
  - empirical results

Linear and Discrete Models
- Linear models: used in biology and the social sciences since Sewall Wright (1921).
- Linear models represent causal relationships as sets of linear functions of independent variables:
  [Figure: X1 → X3 ← X2]
  Equivalently (assuming linear parameters): X3 = a13 X1 + a23 X2 + ε
- Discrete models: Bayesian nets replace the vectors of linear coefficients with CPTs.

Learning Linear Parameters
- Maximum likelihood methods have been available since Wright's path model analysis (1921).
- Equivalent methods:
  - the Simon-Blalock method (Simon, 1954; Blalock, 1964)
  - ordinary least squares multiple regression (OLS)
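A sketch of learning the linear parameters by OLS from simulated data (the coefficients 0.8 and -0.5 and the noise level are invented for illustration); with two regressors the normal equations can be solved directly:

```python
import random

# Simulate X3 = a13*X1 + a23*X2 + error, then recover the coefficients by OLS.
random.seed(1)
a13, a23 = 0.8, -0.5
data = []
for _ in range(5000):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    x3 = a13 * x1 + a23 * x2 + random.gauss(0, 0.1)
    data.append((x1, x2, x3))

# Normal equations for two regressors (no intercept; variables are zero-mean).
s11 = sum(x1 * x1 for x1, _, _ in data)
s22 = sum(x2 * x2 for _, x2, _ in data)
s12 = sum(x1 * x2 for x1, x2, _ in data)
s1y = sum(x1 * x3 for x1, _, x3 in data)
s2y = sum(x2 * x3 for _, x2, x3 in data)
det = s11 * s22 - s12 * s12
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
print(abs(b1 - a13) < 0.05, abs(b2 - a23) < 0.05)  # True True
```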


Learning Conditional Probability Tables
- Spiegelhalter & Lauritzen (1990): assume parameter independence; each CPT cell holds a parameter of a Dirichlet distribution over the K values of the node:
  D[α1, ..., αi, ..., αK]
- The probability of outcome i is αi / Σ_{k=1}^{K} αk.
- On observing outcome i, update D to D[α1, ..., αi + 1, ..., αK].
- Dual log-linear and full CPT models (Neil, Wallace, Korb 1999).
- Others are looking at learning without parameter independence, e.g., decision trees to learn structure within CPTs (Boutilier et al. 1996).
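The update rule is just counting. A sketch for a single CPT cell with three outcomes (the uniform prior and the observation sequence are illustrative):

```python
# Dirichlet parameters for one CPT cell of a 3-valued node: start uniform.
alpha = [1, 1, 1]

def prob(alpha, i):
    # Predictive probability of outcome i: alpha_i / sum_k alpha_k.
    return alpha[i] / sum(alpha)

print(round(prob(alpha, 0), 3))  # 0.333 prior
for obs in [0, 0, 2, 0]:         # observing outcome i adds 1 to alpha[i]
    alpha[obs] += 1
print(round(prob(alpha, 0), 3))  # 0.571 posterior predictive (= 4/7)
```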

Learning Causal Structure
- This is the real problem; parameterizing models is essentially numerical computing.
- There are two basic methods:
  - learning from conditional independencies (CI learning);
  - learning using a scoring metric (metric learning).
- CI learning (Verma and Pearl, 1991): suppose you have an Oracle who can answer yes or no to any question of the type X ⊥ Y | S? Then you can learn the correct causal model, up to statistical equivalence.

Statistical Equivalence
- Verma and Pearl's rules identify the set of causal models which are statistically equivalent.
- Two causal models H1 and H2 are statistically equivalent iff they contain the same variables and joint samples over them provide no statistical grounds for preferring one over the other.
- Examples: all fully connected models are equivalent; the chain A → B → C, the chain A ← B ← C and the common cause A ← B → C are equivalent, while the common effect A → B ← C is in an equivalence class of its own.


Statistical Equivalence
Chickering (1995): Any two causal models over the same variables which have the same skeleton (undirected arcs) and the same directed v-structures are statistically equivalent.
If H1 and H2 are statistically equivalent, then they have the same maximum likelihoods relative to any joint samples:

max_θ1 P(e|H1, θ1) = max_θ2 P(e|H2, θ2)

where θi is a parameterization of Hi.

TETRAD II
Spirtes, Glymour and Scheines (1993). Replace the Oracle with statistical tests:
for linear models, a significance test on partial correlation:

X ⊥ Y | S iff ρXY·S = 0

for discrete models, a χ² test on the difference between CPT counts expected with independence (Ei) and observed (Oi):

X ⊥ Y | S iff Σ_i Oi ln(Oi/Ei) ≈ 0
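A sketch of the discrete-case test on a contingency table. The tables are hypothetical, and the factor of 2 (giving the standard G statistic) plus the degrees-of-freedom/threshold step are additions beyond the slide's formula:

```python
from math import log

# Compare observed cell counts O_i for (X, Y) against the counts E_i
# expected under X _||_ Y, via sum_i O_i ln(O_i / E_i).
def g_statistic(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    g = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / n       # expected count under independence
            if o > 0:
                g += o * log(o / e)
    return 2 * g                            # 2 * sum O ln(O/E): the G statistic

independent = [[25, 25], [25, 25]]          # X and Y unrelated
dependent = [[40, 10], [10, 40]]            # strong association
print(g_statistic(independent))             # 0.0: every O_i equals E_i
print(g_statistic(dependent))               # large positive value
```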


TETRAD II
Asymptotically finds causal structure to within the statistical equivalence class of the true model.
Requires larger sample sizes than MML (Dai, Korb, Wallace & Wu, 1997): statistical tests are not robust given weak causal interactions and/or small samples.
Cheap, and easy to use.

Cooper & Herskovits
Cooper & Herskovits (1991, 1992) compute P(hi|e) by brute force, under the assumptions:
1. All variables are discrete.
2. Samples are i.i.d.
3. No missing values.
4. All values of child variables are uniformly distributed.
5. Priors over hypotheses are uniform.
With these assumptions, Cooper & Herskovits reduce the computation of P_CH(h, e) to a polynomial time counting problem.
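The counting problem reduces to products of factorials of CPT counts. A sketch of the per-node contribution (in log form; the data counts below are hypothetical):

```python
from math import lgamma, log

# Per-node term of the Cooper & Herskovits counting formula:
#   prod_j [ (r - 1)! / (N_j + r - 1)! * prod_k N_jk! ]
# where r = number of child states, j ranges over parent configurations,
# N_jk = count of child value k under parent configuration j, N_j = sum_k N_jk.
def log_fact(n):
    return lgamma(n + 1)

def ch_node_score(counts, r):
    """counts: list over parent configs, each a length-r list of N_jk."""
    score = 0.0
    for njk in counts:
        nj = sum(njk)
        score += log_fact(r - 1) - log_fact(nj + r - 1)
        score += sum(log_fact(c) for c in njk)
    return score

# Hypothetical data: binary child, one binary parent (2 parent configs).
counts = [[8, 2], [1, 9]]
print(ch_node_score(counts, r=2))
```

Summing this term over all nodes (plus the log prior) gives the log of P_CH(h, e), which is what makes greedy search over parent sets feasible.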


Cooper & Herskovits
But the hypothesis space is exponential; they go for dramatic simplification:
6. Assume we know the temporal ordering of the variables.
Now for any pair of variables: either they are connected by an arc or they are not. Further, cycles are impossible.
The new hypothesis space has size only 2^{n(n-1)/2} (still exponential).
Algorithm K2 does a greedy search through this reduced space.

Learning Variable Order
Reliance upon a given variable order is a major drawback to K2, and to many other algorithms (Buntine 1991, Bouckaert 1994, Suzuki 1996, Madigan & Raftery 1994). What's wrong with that?
We want autonomous AI (data mining). If experts can order the variables, they can likely supply models.
Determining variable ordering is half the problem. If we know A comes before B, the only remaining issue is whether there is a link between the two.
The number of orderings consistent with dags is apparently exponential (Brightwell & Winkler 1990). So iterating over all possible orderings will not scale up.
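K2's greedy parent search for a single node can be sketched as below. The `score` function stands in for any metric (e.g., the Cooper & Herskovits measure); the toy score used here is purely hypothetical:

```python
# Sketch of K2's greedy parent selection for one node, given a total
# variable ordering: only predecessors in the order are candidate parents.
def k2_parents(node, predecessors, score, max_parents=2):
    parents = []
    best = score(node, parents)
    while len(parents) < max_parents:
        candidates = [p for p in predecessors if p not in parents]
        scored = [(score(node, parents + [p]), p) for p in candidates]
        if not scored:
            break
        top, p = max(scored)
        if top <= best:          # stop when no addition improves the score
            break
        best, parents = top, parents + [p]
    return parents

# Hypothetical stand-in metric that rewards exactly the parent set {'A'}.
toy_score = lambda node, ps: 1.0 if ps == ['A'] else 0.0
print(k2_parents('C', ['A', 'B'], toy_score))  # ['A']
```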

Statistical Equivalence Learners
Heckerman & Geiger (1995) advocate learning only up to statistical equivalence classes (a la TETRAD II). Since observational data cannot distinguish between equivalent models, there's no point trying to go further.
⇒ Madigan, Andersson, Perlman & Volinsky (1996) follow this advice, using a uniform prior over equivalence classes.
⇒ Geiger and Heckerman (1994) define Bayesian metrics for linear and discrete equivalence classes of models (BGe and BDe).

Statistical Equivalence Learners
Wallace & Korb (1999): This is not right!
These are causal models; they are distinguishable on experimental data. Failure to collect some data is no reason to change prior probabilities. E.g., if your thermometer topped out at 35°, you wouldn't treat 35° and 34° as equally likely.
Not all equivalence classes are created equal:
{A ← B → C, A → B → C, A ← B ← C} vs. {A → B ← C}
Within classes some dags should have greater priors than others, e.g.,
LightsOn → InOffice → LoggedOn vs. LightsOn ← InOffice → LoggedOn


Full Causal Learners
So... a full causal learner is an algorithm that:
1. Learns causal connectedness.
2. Learns v-structures. Hence, learns equivalence classes.
3. Learns full variable order. Hence, learns full causal structure (order + connectedness).
TETRAD II: 1, 2.
Madigan et al.: 1, 2.
Cooper & Herskovits K2: 1.
Lam and Bacchus MDL: 1, 2 (partial), 3 (partial).
Wallace, Neil, Korb MML: 1, 2, 3.

MDL
Minimum Description Length (MDL) inference: invented by Rissanen (1978), based upon Minimum Message Length (MML), invented by Wallace (Wallace and Boulton, 1968).
Plays off the trade-off between model simplicity and model fit to the data, by minimizing the length of a joint description of the model and of the data given the model.


Lam & Bacchus
MDL encoding of causal models:
Network:

Σ_{i=1}^n [ki log(n) + d(si − 1) Π_{j∈π(i)} sj]

ki log(n) for specifying the ki parents of the ith node
d(si − 1) Π_{j∈π(i)} sj for specifying the CPT:
d is the fixed bit-length per probability
si is the number of states for node i

Data given network:

N Σ_{i=1}^n H(Xi) − N Σ_{i=1}^n M(Xi; π(i))

H(Xi) is the entropy of variable Xi
M(Xi; π(i)) is the mutual information between Xi and its parent set

(NB: This code is not efficient. E.g., it treats every node as equally likely to be a parent; it assumes knowledge of all ki.)

Search algorithm:
Initial constraints taken from domain expert: partial variable order, direct connections.
Greedy search: every possible arc addition is tested, the best MDL measure used to add one. (Note: no arcs are deleted.)
Local arcs checked for improved MDL via arc reversal.
Iterate until MDL fails to improve.

⇒ Results similar to K2, but without full variable ordering.
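As a rough sketch, the network-cost term above can be computed directly; the 3-node network and the choice of d here are hypothetical:

```python
from math import log2

# Lam & Bacchus network description length:
#   sum_i [ k_i * log2(n) + d * (s_i - 1) * prod_{j in parents(i)} s_j ]
# k_i = number of parents, s_i = number of states of node i,
# d = fixed bit-length per probability (an arbitrary constant here).
def network_dl(parents, states, d=8):
    n = len(states)
    total = 0.0
    for node, ps in parents.items():
        cpt_cells = states[node] - 1          # free probabilities per row
        for p in ps:
            cpt_cells *= states[p]            # one row per parent config
        total += len(ps) * log2(n) + d * cpt_cells
    return total

# Hypothetical 3-node network A -> B -> C, all binary.
parents = {'A': [], 'B': ['A'], 'C': ['B']}
states = {'A': 2, 'B': 2, 'C': 2}
print(network_dl(parents, states))
```

The data term would add N Σ [H(Xi) − M(Xi; π(i))] estimated from sample frequencies; the greedy search then tests each candidate arc addition against the combined total.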


MML
Minimum Message Length (Wallace & Boulton 1968) uses Shannon's measure of information:

I(m) = −log P(m)

Applied in reverse, we can compute P(h, e) from I(h, e).
Given an efficient joint encoding method for the hypothesis & evidence space (i.e., satisfying Shannon's law), MML searches {hi} for that hypothesis h that minimizes I(h) + I(e|h).
Equivalent to that h that maximizes P(h)P(e|h), i.e., P(h|e).
The other significant difference from MDL: MML takes parameter estimation seriously.

MML Metric for Linear Models
Network:

log n! + n(n − 1)/2 − log E

log n! for variable order
n(n − 1)/2 for connectivity
−log E restores efficiency by subtracting the cost of selecting a linear extension

Parameters given dag h:

Σ_{Xj} [−log f(θj|h) + ½ log F(θj)]

where θj are the parameters for Xj and F(θj) is the Fisher information. f(θj|h) is assumed to be N(0, σj). (Cf. with MDL's fixed length for parameters.)

MML Metric for Linear Models
Sample for Xj given h and θj:

−log P(e|h, θj) = −Σ_{k=1}^K log [ (1/(σj √(2π))) e^(−ν²jk / 2σ²j) ]

where K is the number of sample values and νjk is the difference between the observed value of Xj and its linear prediction.

MML Metric for Discrete Models
We can use P_CH(hi, e) (from Cooper & Herskovits) to define an MML metric for discrete models.
Difference between MML and Bayesian metrics: MML partitions the parameter space and selects optimal parameters.
Equivalent to a penalty of ½ log(πe/6) per parameter (Wallace & Freeman 1987); hence:

I(e, hi) = (Σj sj / 2) log(πe/6) − log P_CH(hi, e)    (1)

Applied in the MML Sampling algorithm.


MML search algorithms
MML metrics need to be combined with search. This has been done three ways:
1. Wallace, Korb, Dai (1996): greedy search (linear). Brute force computation of linear extensions (small models only).
2. Neil and Korb (1999): genetic algorithms (linear). Asymptotic estimator of linear extensions. GA chromosomes = causal models; genetic operators manipulate them; selection pressure is based on MML.
3. Wallace and Korb (1999): MML sampling (linear, discrete). Stochastic sampling through the space of totally ordered causal models; no counting of linear extensions required.

MML Sampling
Search space of totally ordered models (TOMs).
Sampled via a Metropolis algorithm (Metropolis et al. 1953). From the current model M, find the next model M′ by:
Randomly select a variable; attempt to swap order with its predecessor.
Or, randomly select a pair; attempt to add/delete an arc.
Attempts succeed whenever P(M′)/P(M) > U (per the MML metric), where U is uniformly random from [0, 1].
Metropolis: this procedure samples TOMs with a frequency proportional to their posterior probability.
To find the posterior of dag h: keep count of visits to all TOMs consistent with h. Estimated by counting visits to all TOMs with identical max likelihoods to h.
Output: probabilities of top dags, top statistical equivalence classes, top MML equivalence classes.

Empirical Results
A weakness in this area, and in AI generally: paper publications are based upon very small models and loose comparisons. The ALARM net is often used; every method gets it to within 1 or 2 arcs.
Neil and Korb (1999) compared MML and BGe (Heckerman & Geiger's Bayesian metric over equivalence classes), using identical GA search over linear models:
On KL distance and topological distance from the true model, MML and BGe performed nearly the same.
On test prediction accuracy on strict effect nodes (those with no children), MML clearly outperformed BGe.
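The acceptance rule P(M′)/P(M) > U can be sketched on a toy state space. Real MML sampling proposes order swaps and arc changes over TOMs; here the "models" are just two labelled states with hypothetical posterior weights:

```python
import random

# Sketch of the Metropolis acceptance rule: a proposed model M' replaces M
# whenever P(M')/P(M) > U, with U uniform on [0, 1]. Visit counts then
# approximate the posterior over states.
def metropolis(weights, steps=20000, seed=0):
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = {s: 0 for s in states}
    for _ in range(steps):
        proposal = rng.choice(states)                        # symmetric proposal
        if weights[proposal] / weights[current] > rng.random():
            current = proposal                               # accept the move
        visits[current] += 1
    return visits

# Visit frequencies approach the normalized weights (here 1:3).
visits = metropolis({'M1': 1.0, 'M2': 3.0})
print(visits)
```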


Current Research Issues
Size and complexity.
Difficulties with elicitation.
Combinations of discrete and continuous (i.e., mixing node types).
Learning issues: missing data; latent variables; experimental data; learning CPT structure.
Multi-structure models: continuous & discrete CPTs, with & without parameter independence.

(Other) Limitations
Inappropriate problems (deterministic systems, legal rules).

Systems. Wiley. CHAPTERS 1 AND 2 COVER SOME OF THE RELEVANT HISTORY.

Introduction to Bayesian AI
T. Bayes (1764) An Essay Towards Solving a Problem in the Doctrine of Chances. Phil Trans of the Royal Soc of London. Reprinted in Biometrika, 45 (1958), 296-315.
B. Buchanan and E. Shortliffe (eds.) (1984) Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.
B. de Finetti (1964) Foresight: Its Logical Laws, Its Subjective Sources, in Kyburg and Smokler (eds.) Studies in Subjective Probability. NY: Wiley.
D. Heckerman (1986) Probabilistic Interpretations for MYCIN's Certainty Factors, in L.N. Kanal and J.F. Lemmer (eds.) Uncertainty in Artificial Intelligence. North-Holland.
C. Howson and P. Urbach (1993) Scientific Reasoning: The Bayesian Approach. Open Court. A MODERN REVIEW OF BAYESIAN THEORY.
K.B. Korb (1995) Inductive learning and defeasible inference, Jrn for Experimental and Theoretical AI, 7, 291-324.
J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.
F.P. Ramsey (1931) Truth and Probability, in The Foundations of Mathematics and Other Essays. NY: Humanities Press. THE ORIGIN OF MODERN BAYESIANISM. INCLUDES LOTTERY-BASED ELICITATION AND DUTCH-BOOK ARGUMENTS FOR THE USE OF PROBABILITIES.
R. Reiter (1980) A logic for default reasoning, Artificial Intelligence, 13, 81-132.
J. von Neumann and O. Morgenstern (1947) Theory of Games and Economic Behavior, 2nd ed. Princeton Univ. STANDARD REFERENCE ON ELICITING UTILITIES VIA LOTTERIES.

Bayesian Networks
E. Charniak (1991) Bayesian Networks Without Tears, Artificial Intelligence Magazine, pp. 50-63, Vol 12. AN ELEMENTARY INTRODUCTION.


B. D'Ambrosio (1999) Inference in Bayesian Networks. Artificial Intelligence Magazine, Vol 20, No. 2.
P. Haddawy (1999) An Overview of Some Recent Developments in Bayesian Problem-Solving Techniques. Artificial Intelligence Magazine, Vol 20, No. 2.
Howard & Matheson (1981) Influence Diagrams, Principles and Applications of Decision Analysis.
F.V. Jensen (1996) An Introduction to Bayesian Networks, Springer.
R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley. SIMILAR COVERAGE TO THAT OF PEARL; MORE EMPHASIS ON PRACTICAL ALGORITHMS FOR NETWORK UPDATING.
J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann. THIS IS THE CLASSIC TEXT INTRODUCING BAYESIAN NETWORKS TO THE AI COMMUNITY.
Poole, D., Mackworth, A., and Goebel, R. (1998) Computational Intelligence: a logical approach. Oxford University Press.
Russell & Norvig (1995) Artificial Intelligence: A Modern Approach, Prentice Hall.

Applications
D.W. Albrecht, I. Zukerman and A.E. Nicholson (1998) Bayesian Models for Keyhole Plan Recognition in an Adventure Game. User Modeling and User-Adapted Interaction, 8(1-2), 5-47, Kluwer Academic Publishers.
S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A.R. Sørensen, A. Rosenfalck and F. Jensen (1989) MUNIN: An Expert EMG Assistant, Computer-Aided Electromyography and Expert Systems, Chapter 21, J.E. Desmedt (Ed.), Elsevier.
S.A. Andreassen, J.J. Benn, R. Hovorka, K.G. Olesen and R.E. Carson (1991) A Probabilistic Approach to Glucose Prediction and Insulin Dose Adjustment: Description of Metabolic Model and Pilot Evaluation Study.
I. Beinlich, H. Suermondt, R. Chavez and G. Cooper (1992) The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks, Proc. of the 2nd European Conf. on Artificial Intelligence in Medicine, pp. 689-693.
T.L. Dean and M.P. Wellman (1991) Planning and Control, Morgan Kaufmann.
T.L. Dean, J. Allen and J. Aloimonos (1994) Artificial Intelligence: Theory and Practice, Benjamin/Cummings.
... Network Models for Forecasting, Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence, pp. 41-48.
J. Forbes, T. Huang, K. Kanazawa and S. Russell (1995) The BATmobile: Towards a Bayesian Automated Taxi, Proceedings of the 14th Int. Joint Conf. on Artificial Intelligence (IJCAI'95), pp. 1878-1885.
S.L. Lauritzen and D.J. Spiegelhalter (1988) Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems, Journal of the Royal Statistical Society, 50(2), pp. 157-224.
McConachy et al. (1999)
A.E. Nicholson (1999) CSE2309/3309 Artificial Intelligence, Monash University, Lecture Notes, http://www.csse.monash.edu.au/~annn/2-3309.html.
M. Pradham, G. Provan, B. Middleton and M. Henrion (1994) Knowledge engineering for large belief networks, Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence.
D. Pynadath and M.P. Wellman (1995) Accounting for Context in Plan Recognition, with Application to Traffic Monitoring, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 472-481.
... Likelihood-Weighting Simulation on a Large, Multiply Connected Belief Network, Proceedings of the Sixth Workshop on Uncertainty in Artificial Intelligence, pp. 498-508, 1990.
L.C. van der Gaag, S. Renooij, C.L.M. Witteman, B.M.P. Aleman, B.G. Taal (1999) How to Elicit Many Probabilities, Laskey & Prade (eds) UAI99, 647-654.
Zukerman, I., McConachy, R., Korb, K. and Pickett, D. (1999) Exploratory Interaction with a Bayesian Argumentation System, in IJCAI-99 Proceedings, the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1294-1299, Stockholm, Sweden, Morgan Kaufmann.

Learning Bayesian Networks
H. Blalock (1964) Causal Inference in Nonexperimental Research. University of North Carolina.
R. Bouckaert (1994) Probabilistic network construction using the minimum description length principle. Technical Report RUU-CS-94-27, Dept of Computer Science, Utrecht University.
C. Boutilier, N. Friedman, M. Goldszmidt, D. Koller (1996) Context-specific independence in Bayesian networks, in Horvitz & Jensen (eds.) UAI 1996, 115-123.

G. Brightwell and P. Winkler (1990) Counting linear extensions is #P-complete. Technical Report DIMACS 90-49, Dept of Computer Science, Rutgers Univ.
W. Buntine (1991) Theory refinement on Bayesian networks, in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991, 52-69.
W. Buntine (1996) A Guide to the Literature on Learning Probabilistic Networks from Data, IEEE Transactions on Knowledge and Data Engineering, 8, 195-210.
D.M. Chickering (1995) A Transformational Characterization of Equivalent Bayesian Network Structures, in P. Besnard and S. Hanks (eds.) Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 87-98). San Francisco: Morgan Kaufmann. STATISTICAL EQUIVALENCE.
G.F. Cooper and E. Herskovits (1991) A Bayesian Method for Constructing Bayesian Belief Networks from Databases, in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991, 86-94.
G.F. Cooper and E. Herskovits (1992) A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning, 9, 309-347. AN EARLY BAYESIAN CAUSAL DISCOVERY METHOD.
H. Dai, K.B. Korb, C.S. Wallace and X. Wu (1997) A study of causal discovery with weak links and small samples. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1304-1309. Morgan Kaufmann.
N. Friedman (1997) The Bayesian Structural EM Algorithm, in D. Geiger and P.P. Shenoy (eds.) Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (pp. 129-138). San Francisco: Morgan Kaufmann.
Geiger and Heckerman (1994) Learning Gaussian networks, in Lopes de Mantaras and Poole (eds.) UAI 1994, 235-243.
D. Heckerman and D. Geiger (1995) Learning Bayesian networks: A unification for discrete and Gaussian domains, in Besnard and Hanks (eds.) UAI 1995, 274-284.
D. Heckerman, D. Geiger, and D.M. Chickering (1995) Learning Bayesian Networks: The Combination of Knowledge and Statistical Data, Machine Learning, 20, 197-243. BAYESIAN LEARNING OF STATISTICAL EQUIVALENCE CLASSES.
K. Korb (1999) Probabilistic Causal Structure, in H. Sankey (ed.) Causation and Laws of Nature:

Science 14. Kluwer Academic. INTRODUCTION TO THE RELEVANT PHILOSOPHY OF CAUSATION FOR LEARNING BAYESIAN NETWORKS.
P. Krause (1998) Learning Probabilistic Networks. http://www.auai.org/bayesUSKrause.ps.gz. BASIC INTRODUCTION TO BNS, PARAMETERIZATION AND LEARNING CAUSAL STRUCTURE.
W. Lam and F. Bacchus (1993) Learning Bayesian belief networks: An approach based on the MDL principle, Jrn Comp Intelligence, 10, 269-293.
D. Madigan, S.A. Andersson, M.D. Perlman & C.T. Volinsky (1996) Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs, Comm in Statistics: Theory and Methods, 25, 2493-2519.
D. Madigan and A.E. Raftery (1994) Model selection and accounting for model uncertainty in graphical models using Occam's window, Jrn Amer Stat Assoc, 89, 1535-1546.
N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller (1953) Equations of state calculations by fast computing machines, Jrn Chemical Physics, 21, 1087-1091.
J.R. Neil and K.B. Korb (1999) The Evolution of Causal Models: A Comparison of Bayesian Metrics and Structure Priors, in Methodologies for Knowledge Discovery and Data Mining: Third Pacific-Asia Conference (pp. 432-437). Springer Verlag. GENETIC ALGORITHMS FOR CAUSAL DISCOVERY; STRUCTURE PRIORS.
J.R. Neil, C.S. Wallace and K.B. Korb (1999) Learning Bayesian networks with restricted causal interactions, in Laskey and Prade (eds.) UAI 99, 486-493.
J. Rissanen (1978) Modeling by shortest data description, Automatica, 14, 465-471.
H. Simon (1954) Spurious Correlation: A Causal Interpretation, Jrn Amer Stat Assoc, 49, 467-479.
D. Spiegelhalter & S. Lauritzen (1990) Sequential Updating of Conditional Probabilities on Directed Graphical Structures, Networks, 20, 579-605.
P. Spirtes, C. Glymour and R. Scheines (1990) Causality from Probability, in J.E. Tiles, G.T. McKee and G.C. Dean, Evolving Knowledge in Natural Science and Artificial Intelligence. London: Pitman. AN ELEMENTARY INTRODUCTION TO STRUCTURE LEARNING VIA CONDITIONAL INDEPENDENCE.
P. Spirtes, C. Glymour and R. Scheines (1993) Causation, Prediction and Search: Lecture Notes in Statistics 81. Springer Verlag. A THOROUGH PRESENTATION OF THE ORTHODOX STATISTICAL APPROACH TO LEARNING CAUSAL STRUCTURE.

J. Suzuki (1996) Learning Bayesian Belief Networks Based on the Minimum Description Length Principle, in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 462-470). San Francisco: Morgan Kaufmann.
T.S. Verma and J. Pearl (1991) Equivalence and Synthesis of Causal Models, in P. Bonissone, M. Henrion, L. Kanal and J.F. Lemmer (eds) Uncertainty in Artificial Intelligence 6 (pp. 255-268). Elsevier. THE GRAPHICAL CRITERION FOR STATISTICAL EQUIVALENCE.
C.S. Wallace and D. Boulton (1968) An information measure for classification, Computer Jrn, 11, 185-194.
C.S. Wallace and P.R. Freeman (1987) Estimation and inference by compact coding, Jrn Royal Stat Soc (Series B), 49, 240-252.
C.S. Wallace and K.B. Korb (1999) Learning Linear Causal Models by MML Sampling, in A. Gammerman (ed.) Causal Models and Intelligent Data Management. Springer Verlag. SAMPLING APPROACH TO LEARNING CAUSAL MODELS; DISCUSSION OF STRUCTURE PRIORS.

C.S. Wallace, K.B. Korb, and H. Dai (1996) Causal Discovery via MML, in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 516-524). San Francisco: Morgan Kaufmann. INTRODUCES AN MML METRIC FOR CAUSAL MODELS.
S. Wright (1921) Correlation and Causation, Jrn Agricultural Research, 20, 557-585.
S. Wright (1934) The Method of Path Coefficients, Annals of Mathematical Statistics, 5, 161-215.

Current Research

Bayesian Network URLs
