
Bayesian AI Tutorial
AI'99, Sydney, 6 December 1999
Ann E. Nicholson and Kevin B. Korb
School of Computer Science and Software Engineering
Monash University, Clayton, VIC 3168, AUSTRALIA
{ann,korb}@csse.monash.edu.au

Overview
1. Introduction to Bayesian AI (20 min)
2. Bayesian networks (50 min)
   Break (10 min)
3. Applications (50 min)
   Break (10 min)
4. Learning Bayesian networks (50 min)
5. Current research issues (10 min)
6. Bayesian Net Lab (60 min: optional)
7. Dinner (optional)


Introduction to Bayesian AI
- Reasoning under uncertainty
- Probabilities
- Alternative formalisms: fuzzy logic; MYCIN's certainty factors; default logic
- Bayesian philosophy: Dutch book arguments; Bayes' Theorem; conditionalization; confirmation theory
- Bayesian decision theory
- Towards a Bayesian AI

Reasoning under Uncertainty
- Uncertainty: the quality or state of being not clearly known. This encompasses most of what we understand about the world, and most of what we would like our AI systems to understand.
- Distinguishes deductive knowledge (e.g., mathematics) from inductive belief (e.g., science).
- Sources of uncertainty:
  - Ignorance (which side of this coin is up?)
  - Physical randomness (which side of this coin will land up?)
  - Vagueness (which tribe am I closest to genetically? Picts? Angles? Saxons? Celts?)
Probabilities
- The classic approach to reasoning under uncertainty (Blaise Pascal and Fermat).
- Kolmogorov's Axioms:
  1. P(U) = 1
  2. ∀X ⊆ U, P(X) ≥ 0
  3. ∀X, Y ⊆ U, if X ∩ Y = ∅ then P(X ∨ Y) = P(X) + P(Y)
- Conditional probability: P(X|Y) = P(X ∧ Y) / P(Y)
- Independence: X ⊥ Y iff P(X|Y) = P(X)

Fuzzy Logic
- Designed to cope with vagueness: is Fido a Labrador or a Shepherd? Fuzzy set theory:
  m(Fido ∈ Labrador) = m(Fido ∈ Shepherd) = 0.5
- Extended to fuzzy logic, which takes intermediate truth values: T(Labrador(Fido)) = 0.5. Combination rules:
  T(p ∧ q) = min(T(p), T(q))
  T(p ∨ q) = max(T(p), T(q))
  T(¬p) = 1 - T(p)
- Not suitable for coping with randomness or ignorance. Obviously not:
  Uncertainty(inclement weather) = max(Uncertainty(rain), Uncertainty(hail), ...)
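The combination rules are easy to play with in code. A small sketch (the 0.5 degrees for rain and hail are illustrative, not from the tutorial) of why the max rule is a poor model of a disjunction of independent random events:

```python
# Fuzzy combination rules: conjunction = min, disjunction = max, negation = 1 - T.
def t_and(tp, tq): return min(tp, tq)
def t_or(tp, tq): return max(tp, tq)
def t_not(tp): return 1 - tp

# Illustrative degrees for two independent weather events:
rain, hail = 0.5, 0.5
fuzzy_inclement = t_or(rain, hail)          # max rule
prob_inclement = rain + hail - rain * hail  # P(rain or hail) for independent events

print(fuzzy_inclement)  # 0.5: no more credible than rain alone
print(prob_inclement)   # 0.75: the disjunction of independent risks is more likely
```

The max rule leaves the disjunction no more believable than its most believable disjunct, which is exactly what randomness violates.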


MYCIN's Certainty Factors
- An uncertainty formalism developed for the early expert system MYCIN (Buchanan and Shortliffe, 1984). Elicit for (h, e):
  measure of belief: MB(h, e) ∈ [0, 1]
  measure of disbelief: MD(h, e) ∈ [0, 1]
  CF(h, e) = MB(h, e) - MD(h, e) ∈ [-1, 1]
- Special functions are provided for combining evidence.
- Problems:
  - No semantics was ever given for belief/disbelief.
  - Heckerman (1986) proved that the restrictions required for a probabilistic semantics imply absurd independence assumptions.

Default Logic
- Intended to reflect stereotypical reasoning under uncertainty (Reiter 1980). Example: from Bird(Tweety) and the default rule Bird(x) → Flies(x), conclude Flies(Tweety).
- Problems:
  - The best semantics for default rules are probabilistic (Pearl 1988, Korb 1995).
  - It mishandles combinations of low-probability events. E.g., from ApplyForJob(me) and the default rule ApplyForJob(x) → Reject(x), conclude Reject(me). I.e., the dole always looks better than applying for a job!

Probability Theory
- So, why not use probability theory to represent uncertainty? That's what it was invented for: dealing with physical randomness and degrees of ignorance.
- Furthermore, if you make bets which violate probability theory, you are subject to Dutch books: a Dutch book is a sequence of fair bets which collectively guarantee a loss.
- Fair bets are bets based upon the standard odds-probability relation:
  O(h) = P(h) / (1 - P(h))
  P(h) = O(h) / (1 + O(h))

A Dutch Book
- Payoff table on a bet for h (odds = p/(1 - p); S = betting unit):
  h = T: $(1 - p)S
  h = F: -$pS
- Given a fair bet, the expected value from such a payoff is always $0.
- Now, let's violate the probability axioms. Example: say P(A) = -0.1 (violating A2). Payoff table against A (the inverse of a bet for A), with S = 1:
  ¬A = T: $pS = -$0.10
  ¬A = F: -$(1 - p)S = -$1.10
- The bettor loses whether or not A holds: a Dutch book.
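The payoff tables can be checked mechanically. A sketch encoding the fair-bet payoffs, showing that quoting P(A) = -0.1 loses money whether or not A occurs:

```python
# A fair bet for h at probability p pays (1-p)*S if h is true, -p*S if false,
# so its expected value is p*(1-p)*S - (1-p)*p*S = 0.
def payoff_for(p, S, h_true):
    return (1 - p) * S if h_true else -p * S

# A bet against A is a bet for not-A at probability 1 - p.
def payoff_against(p, S, a_true):
    return payoff_for(1 - p, S, not a_true)

p, S = -0.1, 1.0  # P(A) = -0.1 violates axiom A2
print(round(payoff_against(p, S, True), 2))   # -1.1: lose $1.10 if A is true
print(round(payoff_against(p, S, False), 2))  # -0.1: lose $0.10 if A is false too
```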


Bayes' Theorem; Conditionalization
- Due to Reverend Thomas Bayes (1764):
  P(h|e) = P(e|h) P(h) / P(e)
- Conditionalization: P'(h) = P(h|e)
- Or, read Bayes' theorem as:
  Posterior = Likelihood × Prior / Prob of evidence
- Assumptions:
  1. Joint priors over {h_i} and e exist.
  2. Total evidence: e, and only e, is learned.

Bayesian Decision Theory
- Frank Ramsey (1931).
- Decision making under uncertainty: what action to take (plan to adopt) when the future state of the world is not known.
- Bayesian answer: find the utility of each possible outcome (action-state pair) and take the action that maximizes expected utility.
- Example:
  Action            Rain (p = .4)   Shine (1 - p = .6)
  Take umbrella          30                10
  Leave umbrella        -50               100
  Expected utilities:
  E(Take umbrella) = (30)(.4) + (10)(.6) = 18
  E(Leave umbrella) = (-50)(.4) + (100)(.6) = 40
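The expected-utility arithmetic of the umbrella example, as a few lines of Python (p = 0.4 and the utilities are taken from the table):

```python
# EU(action) = sum over states of P(state) * U(action, state).
p_rain = 0.4
utility = {("take", "rain"): 30, ("take", "shine"): 10,
           ("leave", "rain"): -50, ("leave", "shine"): 100}

def expected_utility(action):
    return (p_rain * utility[(action, "rain")]
            + (1 - p_rain) * utility[(action, "shine")])

print(round(expected_utility("take"), 1))   # 18.0
print(round(expected_utility("leave"), 1))  # 40.0: leaving the umbrella maximizes EU
```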

Bayesian AI
A Bayesian conception of an AI is: an autonomous agent which
- has a utility structure (preferences),
- can learn about its world and the relation between its actions and future states (probabilities), and
- maximizes its expected utility.
The techniques used in learning about the world are (primarily) statistical... hence Bayesian data mining.

Bayesian Networks: Overview
- Syntax
- Semantics
- Evaluation methods
- Influence diagrams (decision networks)
- Dynamic Bayesian networks


Bayesian Networks
- A data structure that represents the dependence between variables and gives a concise specification of the joint probability distribution.
- A Bayesian network is a graph in which the following holds:
  1. A set of random variables makes up the nodes in the network.
  2. A set of directed links or arrows connects pairs of nodes.
  3. Each node has a conditional probability table that quantifies the effects the parents have on the node.
  4. The graph is directed and acyclic (a DAG), i.e., it has no directed cycles.

Example: Earthquake (Pearl, R&N)
- You have a new burglar alarm installed. It is reliable at detecting burglary, but also responds to minor earthquakes.
- Two neighbours (John, Mary) promise to call you at work when they hear the alarm. John always calls when he hears the alarm, but confuses the alarm with the phone ringing (and calls then also). Mary likes loud music and sometimes misses the alarm!
- Given evidence about who has and hasn't called, estimate the probability of a burglary.

Earthquake Example: Network Structure
- Assumptions: John and Mary don't perceive burglary directly, and they do not feel minor earthquakes.
- Note: there is no information about loud music, or about the telephone ringing and confusing John. This is summarised in the uncertainty in the links from Alarm to JohnCalls and MaryCalls.
- Once the topology is specified, we need a conditional probability table (CPT) for each node. Each row contains the conditional probability of each node value for a conditioning case, and each row must sum to 1. A table for a Boolean variable with n Boolean parents contains 2^(n+1) probabilities. A node with no parents has one row (the prior probabilities).
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]
  P(B) = 0.001        P(E) = 0.002
  P(A|B,E):  B=T,E=T: 0.95;  B=T,E=F: 0.94;  B=F,E=T: 0.29;  B=F,E=F: 0.001
  P(J|A):  A=T: 0.90;  A=F: 0.05
  P(M|A):  A=T: 0.70;  A=F: 0.01


Semantics of Bayesian Networks
A Bayesian network can be understood in two ways:
1. As a (more compact) representation of the joint probability distribution; helpful in understanding how to construct the network.
2. As encoding a collection of conditional independence statements; helpful in understanding how to design inference procedures.

Representing the joint probability distribution:
  P(X1 = x1, X2 = x2, ..., Xn = xn) = P(x1, x2, ..., xn)
    = P(x1) P(x2|x1) ... P(xn | x1 ∧ ... ∧ x_{n-1})
    = ∏_i P(xi | x1 ∧ ... ∧ x_{i-1})
    = ∏_i P(xi | π(Xi))
where π(Xi) denotes the parents of Xi.

Example:
  P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B ∧ ¬E) P(¬B) P(¬E)
    = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
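The factorized joint can be checked directly from the CPTs (note the priors used here are P(B) = 0.001 and P(E) = 0.002, the values the product above relies on):

```python
# P(J, M, A, not-B, not-E) = P(J|A) P(M|A) P(A|not-B,not-E) P(not-B) P(not-E)
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=T | B, E)
p_j = {True: 0.90, False: 0.05}  # P(J=T | A)
p_m = {True: 0.70, False: 0.01}  # P(M=T | A)

joint = p_j[True] * p_m[True] * p_a[(False, False)] * (1 - p_b) * (1 - p_e)
print(round(joint, 5))  # 0.00063
```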


Network Construction
1. Choose the set of relevant variables Xi that describe the domain.
2. Choose an ordering for the variables.
3. While there are variables left:
   (a) Pick a variable Xi and add a node to the network for it.
   (b) Set π(Xi) to some minimal set of nodes already in the net such that the conditional independence property is satisfied:
       P(Xi | X_{i-1}, ..., X1) = P(Xi | π(Xi))
   (c) Define the CPT for Xi.

Compactness and Node Ordering
- The compactness of a BN is an example of a locally structured (or sparse) system.
- The correct order in which to add nodes is to add the root causes first, then the variables they influence, and so on until the leaves are reached.
- Examples of wrong orderings (which still represent the same joint distribution):
  1. MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.
     [Figure: the denser network that results from this ordering.]


Compactness and Node Ordering (cont.)
  2. MaryCalls, JohnCalls, Earthquake, Burglary, Alarm.
     [Figure: the resulting network.] This one has more probabilities than the full joint! See below for why.

Conditional Independence: Causal Chains
- Causal chains give rise to conditional independence:
  A → B → C
  P(C | A ∧ B) = P(C | B)
- Example: A = Jack's flu, B = severe cough, C = Jill's flu.


Common Causes
- Common causes (or ancestors) also give rise to conditional independence:
  A ← B → C
  P(C | A ∧ B) = P(C | B)
- Example: A = Jack's flu, B = Joe's flu, C = Jill's flu.

Common Effects
- Common effects (or their descendants) give rise to conditional dependence:
  A → B ← C
  P(A | C ∧ B) ≠ P(A | B)
- Example: A = flu, B = severe cough, C = tuberculosis.
- Given a severe cough, flu explains away tuberculosis.

D-separation
- A graph-theoretic criterion of conditional independence.
- We can determine whether a set of nodes X is independent of another set Y, given a set of evidence nodes E, i.e., X ⊥ Y | E.
- Earthquake example: [Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

Causal Ordering
Why does variable order affect network density? Because:
- using the causal order allows direct representation of conditional independencies, while
- violating the causal order requires new arcs to re-establish conditional independencies.


Causal Ordering (cont'd)
- Flu → Cough ← TB: Flu and TB are marginally independent.
- Given the ordering Cough, Flu, TB: [Figure: Cough → Flu; Cough → TB] the marginal independence of Flu and TB must be re-established by adding Flu → TB or TB → Flu.

Inference in Bayesian Networks
- The basic task for any probabilistic inference system: compute the posterior probability distribution for a set of query variables, given values for some evidence variables. Also called belief updating.
- Types of inference: diagnostic, causal, intercausal (explaining away) and mixed. [Figure: the four query (Q) and evidence (E) configurations relative to the arc directions.]


Kinds of Inference
- Diagnostic inferences: from effects to causes. P(Burglary | JohnCalls)
- Causal inferences: from causes to effects. P(JohnCalls | Burglary), P(MaryCalls | Burglary)
- Intercausal inferences: between causes of a common effect. P(Burglary | Alarm), P(Burglary | Alarm ∧ Earthquake)
- Mixed inference: combining two or more of the above. P(Alarm | JohnCalls ∧ ¬Earthquake), P(Burglary | JohnCalls ∧ ¬Earthquake)

Inference Algorithms: Overview
- Exact inference:
  - trees and polytrees: message-passing algorithm
  - multiply-connected networks: clustering
- Approximate inference for large, complex networks: stochastic simulation and other approximation methods.
- In the general case, both sorts of inference are computationally complex (NP-hard).
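Each of these queries can be answered by brute-force enumeration: sum the factorized joint over the unobserved variables and normalize. A sketch for the diagnostic query P(Burglary | JohnCalls, MaryCalls) on the earthquake network (exponential in the number of hidden variables, so only feasible for small networks); the priors are the P(B) = 0.001 and P(E) = 0.002 used in the joint-probability calculation:

```python
from itertools import product

# CPTs for the earthquake network.
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}
p_m = {True: 0.70, False: 0.01}

def pr(p_true, val):  # P(var = val) from P(var = True)
    return p_true if val else 1 - p_true

def joint(b, e, a, j, m):
    return (pr(p_b, b) * pr(p_e, e) * pr(p_a[(b, e)], a)
            * pr(p_j[a], j) * pr(p_m[a], m))

# P(B | j, m) = P(B, j, m) / P(j, m), summing out E and A.
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(round(num / den, 3))  # 0.284
```

Both calls receive evidence and confirm the classic result: even with both neighbours calling, a burglary is still fairly unlikely.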

Message Passing Example
[Figure: the earthquake network extended with PhoneRings as a second parent of JohnCalls. CPTs: P(B) = 0.001, P(E) = 0.002, P(Ph) = 0.05; P(A|B,E) as before; P(J|Ph,A) = 0.95, 0.5, 0.90, 0.01 for (T,T), (T,F), (F,T), (F,F); P(M|A) = 0.70, 0.01. The figure shows the π and λ message vectors and resulting beliefs at each node, e.g. bel(B) = (.001, .999), bel(E) = (.002, .998), bel(Ph) = (.05, .95), with evidence vectors λ(J) = (1,1) and λ(M) = (1,0).]

Multiply-Connected Networks
- Networks where two nodes are connected by more than one path:
  - two or more possible causes which share a common ancestor;
  - one variable can influence another through more than one causal mechanism.
- Example: the Cancer network: A = metastatic cancer, B = increased total serum calcium, C = brain tumour, D = coma, E = severe headaches; arcs A → B, A → C, B → D, C → D, C → E.
- In such networks simple message passing doesn't work: evidence gets counted twice.



Clustering Methods
- Transform the network into a probabilistically equivalent polytree by merging (clustering) the offending nodes.
- Cancer example: a new node Z combines B and C:
  P(z|a) = P(b, c|a) = P(b|a) P(c|a)
  P(e|z) = P(e|b, c) = P(e|c)
  P(d|z) = P(d|b, c)
- The Jensen join-tree version (Jensen, 1996) is currently the most efficient algorithm in this class (e.g., used in Hugin and Netica).
- Network evaluation is done in two stages:
  - Compile into a join-tree: may be slow, and may require too much memory if the original network is highly connected.
  - Do belief updating in the join-tree (usually fast).
- Caveat: clustered nodes have increased complexity; updates may be computationally complex.

Approximate Inference with Stochastic Simulation
- Use the network to generate a large number of cases that are consistent with the network distribution.
- Evaluation may not converge to the exact values (in reasonable time), but it usually converges close to the exact solution quickly if the evidence is not too unlikely.
- Performs better when evidence is nearer to the root nodes; however, in real domains evidence tends to be near the leaves (Nicholson & Jitnah, 1998).

Making Decisions
- Bayesian networks can be extended to support decision making.
- Preferences between different outcomes of various plans: utility theory.
- Decision theory = utility theory + probability theory.

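A minimal sketch of the idea on the earthquake network: forward ("logic") sampling draws each node in topological order and estimates a marginal by frequency. For comparison, P(JohnCalls = T) is about 0.052 by exact enumeration:

```python
import random

random.seed(0)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def sample_john_calls():
    # Sample ancestors first, then JohnCalls given its parent.
    b = random.random() < 0.001
    e = random.random() < 0.002
    a = random.random() < p_a[(b, e)]
    return random.random() < (0.90 if a else 0.05)

n = 100_000
estimate = sum(sample_john_calls() for _ in range(n)) / n
print(abs(estimate - 0.052) < 0.005)  # True: close to the exact marginal
```

Conditioning on evidence by rejection is where this gets slow: samples inconsistent with unlikely evidence are thrown away, which is the convergence caveat above.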


Decision Networks
- A decision network represents information about:
  - the agent's current state,
  - its possible actions,
  - the state that will result from the agent's action, and
  - the utility of that state.
- Also called influence diagrams (Howard & Matheson, 1981).

Types of Nodes
- Chance nodes (ovals): represent random variables (as in Bayesian networks); each has an associated CPT; parents can be decision nodes and other chance nodes.
- Decision nodes (rectangles): represent points where the decision maker has a choice of actions.
- Utility nodes (diamonds): represent the agent's utility function (also called value nodes in the literature); parents are the variables describing the outcome state that directly affect utility; each has an associated table representing a multi-attribute utility function.


Example: Umbrella
[Network: Weather → Forecast; decision node TakeUmbrella; utility node with parents Weather and TakeUmbrella]
  P(Weather = Rain) = 0.3
  P(Forecast = Rainy | Weather = Rain) = 0.60
  P(Forecast = Cloudy | Weather = Rain) = 0.25
  P(Forecast = Sunny | Weather = Rain) = 0.15
  P(Forecast = Rainy | Weather = NoRain) = 0.10
  P(Forecast = Cloudy | Weather = NoRain) = 0.20
  P(Forecast = Sunny | Weather = NoRain) = 0.70
  U(NoRain, TakeUmbrella) = 20
  U(NoRain, LeaveAtHome) = 100
  U(Rain, TakeUmbrella) = 70
  U(Rain, LeaveAtHome) = 0

Evaluating Decision Networks: Algorithm
1. Set the evidence variables for the current state.
2. For each possible value of the decision node:
   (a) Set the decision node to that value.
   (b) Calculate the posterior probabilities for the parent nodes of the utility node (as for BNs).
   (c) Calculate the resulting expected utility for the action.
3. Return the action with the highest expected utility.
This is simple for a single decision, but less so when executing several actions in sequence (i.e., a plan).
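Steps 1-3 applied to the umbrella network (all numbers from the example): the posterior over Weather comes from Bayes' theorem on the observed forecast, and the action with the higher expected utility is returned.

```python
p_rain = 0.3
p_forecast = {"rain": {"rainy": 0.60, "cloudy": 0.25, "sunny": 0.15},
              "norain": {"rainy": 0.10, "cloudy": 0.20, "sunny": 0.70}}
utility = {("take", "rain"): 70, ("take", "norain"): 20,
           ("leave", "rain"): 0, ("leave", "norain"): 100}

def best_action(forecast):
    # Posterior P(Rain | forecast) by Bayes' theorem.
    num = p_forecast["rain"][forecast] * p_rain
    den = num + p_forecast["norain"][forecast] * (1 - p_rain)
    post_rain = num / den
    # Expected utility of each decision under that posterior.
    eu = {a: post_rain * utility[(a, "rain")]
             + (1 - post_rain) * utility[(a, "norain")]
          for a in ("take", "leave")}
    return max(eu, key=eu.get)

print(best_action("rainy"))  # take  (P(Rain | Rainy) = 0.72)
print(best_action("sunny"))  # leave
```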


Dynamic Belief Networks
[Figure: a DBN unrolled over time slices t-2 ... t+2, with state evolution arcs State t → State t+1 and a sensor-model arc State t → Obs t in each slice.]
- The values of the state variables at time t depend only on the values at t-1.
- Distributions can be calculated for S_{t+1} and further: probabilistic projection.
- This can be done using standard BN updating algorithms.
- This type of DBN gets very large, very quickly, so usually only two time slices of the network are kept.

Dynamic Decision Networks
- Similarly, decision networks can be extended to include temporal aspects.
- A sequence of decisions taken = a plan.
[Figure: a dynamic decision network with decision nodes D_t ... D_{t+2}, states State t ... State t+3, observations Obs t ... Obs t+3, and a utility node U_{t+3}.]


Uses of Bayesian Networks
1. Calculating the belief in query variables given values for evidence variables (above).
2. Predicting values of dependent variables given values for independent variables.
3. Decision making based on probabilities in the network and on the agent's utilities (influence diagrams [Howard and Matheson 1981]).
4. Deciding which additional evidence should be observed in order to gain useful information.
5. Sensitivity analysis to test the impact of changes in probabilities or utilities on decisions.

Summary
- Bayes' rule allows unknown probabilities to be computed from known ones.
- Conditional independence (due to causal relationships) allows efficient updating.
- Bayesian networks are a natural way to represent conditional independence information: the links between nodes capture the qualitative aspects; the conditional probability tables capture the quantitative aspects.
- Inference means computing the probability distribution for a set of query variables, given a set of evidence variables.
- Inference in Bayesian networks is very flexible: evidence can be entered about any node, and beliefs updated in any other nodes.
- The speed of inference in practice depends on the structure of the network: how many loops; the numbers of parents; the location of evidence and query nodes.
- Bayesian networks can be extended with decision nodes and utility nodes to support decision making: decision networks or influence diagrams.
- Bayesian and decision networks can be extended to allow explicit reasoning about changes over time.

Applications: Overview
- (Simple) example networks
- Medical decision making: survey of applications
- Planning and plan recognition
- Natural language generation (NAG)
- Bayesian poker
- Deployed Bayesian networks (see handout for details)
- BN software
- Web resources


Example: Cancer
- Metastatic cancer is a possible cause of a brain tumour and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumour. (Example from (Pearl, 1988).)
[Network: A = metastatic cancer; B = increased total serum calcium; C = brain tumour; D = coma; E = severe headaches; arcs A → B, A → C, B → D, C → D, C → E]
  P(a) = 0.2
  P(b|a) = 0.80     P(b|¬a) = 0.20
  P(c|a) = 0.20     P(c|¬a) = 0.05
  P(d|b,c) = 0.80   P(d|¬b,c) = 0.80
  P(d|b,¬c) = 0.80  P(d|¬b,¬c) = 0.05
  P(e|c) = 0.80     P(e|¬c) = 0.60

Example: Asia
- A patient presents to a doctor with shortness of breath. The doctor considers that the possible causes are tuberculosis, lung cancer and bronchitis. Additional relevant information includes whether the patient has recently visited Asia (where tuberculosis is more prevalent) and whether or not the patient is a smoker (which increases the chances of cancer and bronchitis). A positive X-ray would indicate either TB or lung cancer. (Example from (Lauritzen, 1988).)
[Network: visit to Asia → tuberculosis; smoking → lung cancer; smoking → bronchitis; tuberculosis and lung cancer → "either tub or lung cancer" → positive X-ray and dyspnoea; bronchitis → dyspnoea]
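As a small check on the cancer network's numbers, the prior probability of coma can be computed by summing the factorized joint over A, B and C:

```python
from itertools import product

# P(d) = sum over a, b, c of P(a) P(b|a) P(c|a) P(d|b,c), CPTs from the slide.
p_a = 0.2
p_b = {True: 0.80, False: 0.20}  # P(b | a)
p_c = {True: 0.20, False: 0.05}  # P(c | a)
p_d = {(True, True): 0.80, (True, False): 0.80,
       (False, True): 0.80, (False, False): 0.05}  # P(d | b, c)

def pr(p_true, val):
    return p_true if val else 1 - p_true

p_coma = sum(pr(p_a, a) * pr(p_b[a], b) * pr(p_c[a], c) * p_d[(b, c)]
             for a, b, c in product([True, False], repeat=3))
print(round(p_coma, 2))  # 0.32
```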


Example: A Lecturer's Life
- Dr. Ann Nicholson spends 60% of her work time in her office. The rest of her work time is spent elsewhere. When Ann is in her office, half the time her light is off (when she is trying to hide from students and get some real work done). When she is not in her office, she leaves her light on only 5% of the time. 80% of the time she is in her office, Ann is logged onto the computer. Because she sometimes logs onto the computer from home, 10% of the time she is not in her office, she is still logged onto the computer.
- Suppose a student checks Dr. Nicholson's login status and sees that she is logged on. What effect does this have on the student's belief that Dr. Nicholson's light is on? (Example from (Nicholson, 1999).)
[Network: in-office → lights-on; in-office → logged-on]

Probabilistic Reasoning in Medicine
- See the handout from (Dean et al., 1993).
- The simplest tree-structured network for diagnostic reasoning: H = disease hypothesis; F = findings (symptoms, test results).
- Multiply-connected network (QMR structure): B = background information (e.g., age and sex of patient).
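The lecturer's-life query is small enough to work by hand: lights-on and logged-on are conditionally independent given in-office, so condition on the login and then sum out the office state. Seeing the login raises the belief in the light being on from 0.32 to about 0.47:

```python
p_office = 0.6
p_light = {True: 0.5, False: 0.05}  # P(lights-on | in-office)
p_login = {True: 0.8, False: 0.1}   # P(logged-on | in-office)

# Prior belief in lights-on.
p_light_prior = p_light[True] * p_office + p_light[False] * (1 - p_office)

# Posterior on in-office given logged-on, then sum out in-office.
p_office_post = (p_login[True] * p_office /
                 (p_login[True] * p_office + p_login[False] * (1 - p_office)))
p_light_post = p_light[True] * p_office_post + p_light[False] * (1 - p_office_post)

print(round(p_light_prior, 3))  # 0.32
print(round(p_light_post, 3))   # 0.465
```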


Medical Applications
- Pathfinder case study: see the handout using material from (Russell & Norvig, 1995, pp. 457-458).
- QMR (Quick Medical Reference): 600 diseases, 4,000 findings, 40,000 arcs (Dean & Wellman, 1991).
- MUNIN (Andreassen et al., 1989): neuromuscular disorders, about 1000 nodes; exact computation < 5 seconds.
- Glucose prediction and insulin dose adjustment, a DBN application (Andreassen et al., 1991).
- CPSC project (Pradhan et al., 1994): 448 nodes, 906 links, 8254 conditional probability values; the LW algorithm gave answers in 35 minutes (1994).
- Application of LW to medical diagnosis (Shwe & Cooper, 1990).
- Forecasting sleep apnea (Dagum et al., 1993).
- ALARM (Beinlich et al., 1989): 37 nodes, 42 arcs. (See the Netica examples.)
[Figure: the ALARM network: 37 nodes (MinVolSet, Ventmach, Disconnect, PulmEmbolus, Intubation, VentTube, KinkedTube, PAP, Shunt, Press, VentLung, FiO2, VentAlv, MinVol, PVSat, InsuffAnesth, ArtCO2, Anaphylaxis, ExpCO2, SaO2, TPR, Catechol, LVFailure, Hypovolemia, ErrCauter, HR, ErrLowOutput, History, StrokeVolume, LVEDVolume, HRSat, HREKG, HRBP, CO, CVP, PCPW, BP) and 42 arcs, annotated with node state counts and arc weights.]

Robot Navigation and Tracking
- An example of a Dynamic Decision Network (Dean & Wellman, 1991).


Plan Recognition Applications
- Keyhole plan recognition in an Adventure game (Albrecht et al., 1998).
  [Figure: four DBN variants over action nodes A and location nodes L at successive time slices: (a) mainModel, (b) indepModel, (c) actionModel, (d) locationModel.]
- Traffic plan recognition (Pynadath & Wellman, 1995).

Traffic Monitoring: BATmobile
- (Forbes et al., 1995)
- An example of a DBN.


Natural Language Generation
- NAG (McConachy et al., 1999): A Nice Argument Generator. Uses two Bayesian networks to generate and assess natural language arguments:
  - Normative model: represents our best understanding of the domain; proper (constrained) Bayesian updating, given premises.
  - User model: represents our best understanding of the human; Bayesian updating modified to reflect human biases (e.g., overconfidence; Korb, McConachy, Zukerman, 1997).
- The BNs are embedded in a semantic hierarchy: higher-level concepts like motivation or ability; lower-level concepts like Grade Point Average; and propositions, e.g., [publications authored by person X cited > 5 times].
  [Figure: two semantic-network layers above the Bayesian network; the hierarchy supports attentional modeling and constrained updating.]


Bayesian Poker
- (Korb et al., 1999)
- Poker is ideal for testing automated reasoning under uncertainty:
  - physical randomisation;
  - incomplete hand information;
  - incomplete opponent information (strategies, bluffing, etc.).
- Bayesian networks are a good representation for complex game playing.
- Our Bayesian Poker Player (BPP) plays 5-card stud poker at the level of a good amateur human player.
- To play: telnet indy13.cs.monash.edu.au, login: poker, password: maverick

Bayesian Poker BN
- The Bayesian network provides an estimate of winning at any point in the hand.
- Betting curves based on pot-odds are used to determine the action (bet/call, pass or fold).
[Figure: the poker network: BPP Win depends on OPP Final and BPP Final; the final hands are linked to OPP Current and BPP Current via the matrix M_{C|F}; observation nodes OPP Action and OPP Upcards connect to OPP Current via M_{A|C} and M_{U|C}.]

Bayesian AI Tutorial

Nicholson & Korb

63

Nicholson & Korb

64

Bayesian Poker BN (cont.)
- Different networks (matrices) for each round.
- OPP Current, BPP Current: (partial) hand types with the cards dealt so far.
- OPP Final, BPP Final: hand types after all 5 cards are dealt.
- Observation nodes:
  - OPP Upcards: all the opponent's cards except the first are visible to BPP.
  - OPP Action: BPP knows the opponent's action.

Hand Types
- The initial 9 hand types proved too coarse. We use a finer granularity for the most common hands (busted and a pair): low, medium, Q-high, K-high, A-high; this results in 17 hand types.
- Conditional probability matrices:
  - M_{A|C}: probability of the opponent's action given current hand type; learned from observed showdown data.
  - M_{U|C} and M_{C|F}: estimated by dealing out 10^7 poker hands.
- Belief updating: since the network is a polytree, a simple, fast propagation updating algorithm is used.

Extensions
- BPP outperforms automated opponents, is fairly even with average amateur humans, and loses to experienced humans.
- Learning the OPP Action CPTs does not (yet) appear to improve performance.
- BN improvements: refine the action nodes; further refine the hand types; improve the network structure.
- Add bluffing to the opponent model; improve learning of the opponent model.
- More complex poker: multi-opponent games, table-stake games.
- A DBN model to represent changes over time.

Deployed BNs
From the Web Site database (see handout for details):
- TRACS: predicting reliability of military vehicles.
- Andes: intelligent tutoring system for physics.
- Distributed Virtual Agents advising online users on web sites.
- Information extraction from natural language text.
- DXPLAIN: decision support for medical diagnosis.
- Iliad: teaching tool for medical students.
- Microsoft Health Product: find-by-symptom feature.
- Weapons scheduling.
- Monitoring power generation.
- Processor fault diagnosis.
- Knowledge Industries applications: (a) in medicine: sleep disorders, pathology, trauma care, hand and wrist evaluations, dermatology, and home-based health evaluations; (b) in capital equipment: locomotives, gas-turbine engines for aircraft and land-based power production, the space shuttle, and office equipment.
- Software debugging.
- Vista: decision support system used at NASA Mission Control Center.
- Microsoft: (a) Answer Wizard (Office 95), information retrieval; (b) Print Troubleshooter; (c) Aladdin, troubleshooting customer support.

BN Software: Issues
- Functionality: especially application vs API.
- Price: many are free for demo versions or educational use; commercial licence costs.
- Availability (platforms).
- Quality: GUI; documentation and help; leading edge; robustness of the software company.

BN Software
- Analytica: www.lumina.com
- Hugin: www.hugin.com
- Netica: www.norsys.com
- JavaBayes: http://www.cs.cmu.edu/~javabayes/Home/
- Many other packages (see next slide).
- The first three are available during the tutorial lab session.

Web Resources
- Bayesian Belief Network site (Russell Greiner): www.cs.ualberta.ca/~greiner/bn.html
- Bayesian Network Repository (Nir Friedman): www-nt.cs.berkeley.edu/home/nir/public_html/Repository/index.htm
- Summary of BN software and links to software sites (Kevin Murphy): HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bnsoft.html


Applications: Summary
- Various BN structures are available to compactly and accurately represent certain types of domain features.
- Bayesian networks have been used for a wide range of AI applications.
- Robust and easy-to-use Bayesian network software is now readily available.

Learning Bayesian Networks
- Linear and discrete models
- Learning network parameters: linear coefficients; learning probability tables
- Learning causal structure:
  - conditional independence learning; statistical equivalence; TETRAD II
  - Bayesian learning of Bayesian networks; Cooper & Herskovits: K2; learning variable order; statistical equivalence learners
  - full causal learners
  - minimum encoding methods: Lam & Bacchus's MDL learner; MML metrics; MML search algorithms; MML sampling
  - empirical results

Linear and Discrete Models
- Linear models: used in biology and the social sciences since Sewall Wright (1921).
- Linear models represent causal relationships as sets of linear functions of independent variables:
  [Figure: X1 → X3 ← X2]
  Equivalently (assuming linear parameters): X3 = a13 X1 + a23 X2 + ε
- Discrete models: Bayesian nets replace the vectors of linear coefficients with CPTs.

Learning Linear Parameters
- Maximum likelihood methods have been available since Wright's path model analysis (1921).
- Equivalent methods:
  - the Simon-Blalock method (Simon, 1954; Blalock, 1964)
  - ordinary least squares multiple regression (OLS)
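A sketch of learning the linear parameters by OLS from simulated data (the coefficients 0.8 and -0.5 and the noise level are invented for illustration); with two regressors the normal equations can be solved directly:

```python
import random

# Simulate X3 = a13*X1 + a23*X2 + error, then recover the coefficients by OLS.
random.seed(1)
a13, a23 = 0.8, -0.5
data = []
for _ in range(5000):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    x3 = a13 * x1 + a23 * x2 + random.gauss(0, 0.1)
    data.append((x1, x2, x3))

# Normal equations for two regressors (no intercept; variables are zero-mean).
s11 = sum(x1 * x1 for x1, _, _ in data)
s22 = sum(x2 * x2 for _, x2, _ in data)
s12 = sum(x1 * x2 for x1, x2, _ in data)
s1y = sum(x1 * x3 for x1, _, x3 in data)
s2y = sum(x2 * x3 for _, x2, x3 in data)
det = s11 * s22 - s12 * s12
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
print(abs(b1 - a13) < 0.05, abs(b2 - a23) < 0.05)  # True True
```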


Learning Conditional Probability Tables
- Spiegelhalter & Lauritzen (1990): assume parameter independence; each CPT cell holds a parameter of a Dirichlet distribution over the K values of the node:
  D[α1, ..., αi, ..., αK]
- The probability of outcome i is αi / Σ_{k=1}^{K} αk.
- On observing outcome i, update D to D[α1, ..., αi + 1, ..., αK].
- Dual log-linear and full CPT models (Neil, Wallace, Korb 1999).
- Others are looking at learning without parameter independence, e.g., decision trees to learn structure within CPTs (Boutilier et al. 1996).
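The update rule is just counting. A sketch for a single CPT cell with three outcomes (the uniform prior and the observation sequence are illustrative):

```python
# Dirichlet parameters for one CPT cell of a 3-valued node: start uniform.
alpha = [1, 1, 1]

def prob(alpha, i):
    # Predictive probability of outcome i: alpha_i / sum_k alpha_k.
    return alpha[i] / sum(alpha)

print(round(prob(alpha, 0), 3))  # 0.333 prior
for obs in [0, 0, 2, 0]:         # observing outcome i adds 1 to alpha[i]
    alpha[obs] += 1
print(round(prob(alpha, 0), 3))  # 0.571 posterior predictive (= 4/7)
```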

Learning Causal Structure
- This is the real problem; parameterizing models is essentially numerical computing.
- There are two basic methods:
  - learning from conditional independencies (CI learning);
  - learning using a scoring metric (metric learning).
- CI learning (Verma and Pearl, 1991): suppose you have an Oracle who can answer yes or no to any question of the type X ⊥ Y | S? Then you can learn the correct causal model, up to statistical equivalence.

Statistical Equivalence
- Verma and Pearl's rules identify the set of causal models which are statistically equivalent.
- Two causal models H1 and H2 are statistically equivalent iff they contain the same variables and joint samples over them provide no statistical grounds for preferring one over the other.
- Examples: all fully connected models are equivalent; the chain A → B → C, the chain A ← B ← C and the common cause A ← B → C are equivalent, while the common effect A → B ← C is in an equivalence class of its own.


Statistical Equivalence
Chickering (1995): Any two causal models over the same variables which have the same skeleton (undirected arcs) and the same directed v-structures are statistically equivalent.
If H1 and H2 are statistically equivalent, then they have the same maximum likelihoods relative to any joint samples:

max_θ1 P(e|H1, θ1) = max_θ2 P(e|H2, θ2)

where θi is a parameterization of Hi.

TETRAD II
Spirtes, Glymour and Scheines (1993). Replace the Oracle with statistical tests:
for linear models, a significance test on partial correlation:

X ⊥ Y | S iff ρXY·S = 0

for discrete models, a χ² test on the difference between CPT counts expected with independence (Ei) and observed (Oi):

X ⊥ Y | S iff Σ_i Oi ln(Oi/Ei) ≈ 0
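A sketch of the discrete-case test on a contingency table. The tables are hypothetical, and the factor of 2 (giving the standard G statistic) plus the degrees-of-freedom/threshold step are additions beyond the slide's formula:

```python
from math import log

# Compare observed cell counts O_i for (X, Y) against the counts E_i
# expected under X _||_ Y, via sum_i O_i ln(O_i / E_i).
def g_statistic(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    g = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / n       # expected count under independence
            if o > 0:
                g += o * log(o / e)
    return 2 * g                            # 2 * sum O ln(O/E): the G statistic

independent = [[25, 25], [25, 25]]          # X and Y unrelated
dependent = [[40, 10], [10, 40]]            # strong association
print(g_statistic(independent))             # 0.0: every O_i equals E_i
print(g_statistic(dependent))               # large positive value
```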


TETRAD II
Asymptotically finds causal structure to within the statistical equivalence class of the true model.
Requires larger sample sizes than MML (Dai, Korb, Wallace & Wu, 1997): statistical tests are not robust given weak causal interactions and/or small samples.
Cheap, and easy to use.

Cooper & Herskovits
Cooper & Herskovits (1991, 1992) compute P(hi|e) by brute force, under the assumptions:
1. All variables are discrete.
2. Samples are i.i.d.
3. No missing values.
4. All values of child variables are uniformly distributed.
5. Priors over hypotheses are uniform.
With these assumptions, Cooper & Herskovits reduce the computation of P_CH(h, e) to a polynomial time counting problem.
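The counting problem reduces to products of factorials of CPT counts. A sketch of the per-node contribution (in log form; the data counts below are hypothetical):

```python
from math import lgamma, log

# Per-node term of the Cooper & Herskovits counting formula:
#   prod_j [ (r - 1)! / (N_j + r - 1)! * prod_k N_jk! ]
# where r = number of child states, j ranges over parent configurations,
# N_jk = count of child value k under parent configuration j, N_j = sum_k N_jk.
def log_fact(n):
    return lgamma(n + 1)

def ch_node_score(counts, r):
    """counts: list over parent configs, each a length-r list of N_jk."""
    score = 0.0
    for njk in counts:
        nj = sum(njk)
        score += log_fact(r - 1) - log_fact(nj + r - 1)
        score += sum(log_fact(c) for c in njk)
    return score

# Hypothetical data: binary child, one binary parent (2 parent configs).
counts = [[8, 2], [1, 9]]
print(ch_node_score(counts, r=2))
```

Summing this term over all nodes (plus the log prior) gives the log of P_CH(h, e), which is what makes greedy search over parent sets feasible.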


Cooper & Herskovits
But the hypothesis space is exponential; they go for dramatic simplification:
6. Assume we know the temporal ordering of the variables.
Now for any pair of variables: either they are connected by an arc or they are not. Further, cycles are impossible.
The new hypothesis space has size only 2^{n(n-1)/2} (still exponential).
Algorithm K2 does a greedy search through this reduced space.

Learning Variable Order
Reliance upon a given variable order is a major drawback to K2, and to many other algorithms (Buntine 1991, Bouckaert 1994, Suzuki 1996, Madigan & Raftery 1994). What's wrong with that?
We want autonomous AI (data mining). If experts can order the variables, they can likely supply models.
Determining variable ordering is half the problem. If we know A comes before B, the only remaining issue is whether there is a link between the two.
The number of orderings consistent with dags is apparently exponential (Brightwell & Winkler 1990). So iterating over all possible orderings will not scale up.
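K2's greedy parent search for a single node can be sketched as below. The `score` function stands in for any metric (e.g., the Cooper & Herskovits measure); the toy score used here is purely hypothetical:

```python
# Sketch of K2's greedy parent selection for one node, given a total
# variable ordering: only predecessors in the order are candidate parents.
def k2_parents(node, predecessors, score, max_parents=2):
    parents = []
    best = score(node, parents)
    while len(parents) < max_parents:
        candidates = [p for p in predecessors if p not in parents]
        scored = [(score(node, parents + [p]), p) for p in candidates]
        if not scored:
            break
        top, p = max(scored)
        if top <= best:          # stop when no addition improves the score
            break
        best, parents = top, parents + [p]
    return parents

# Hypothetical stand-in metric that rewards exactly the parent set {'A'}.
toy_score = lambda node, ps: 1.0 if ps == ['A'] else 0.0
print(k2_parents('C', ['A', 'B'], toy_score))  # ['A']
```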

Statistical Equivalence Learners
Heckerman & Geiger (1995) advocate learning only up to statistical equivalence classes (a la TETRAD II). Since observational data cannot distinguish between equivalent models, there's no point trying to go further.
⇒ Madigan, Andersson, Perlman & Volinsky (1996) follow this advice, using a uniform prior over equivalence classes.
⇒ Geiger and Heckerman (1994) define Bayesian metrics for linear and discrete equivalence classes of models (BGe and BDe).

Statistical Equivalence Learners
Wallace & Korb (1999): This is not right!
These are causal models; they are distinguishable on experimental data. Failure to collect some data is no reason to change prior probabilities. E.g., if your thermometer topped out at 35°, you wouldn't treat 35° and 34° as equally likely.
Not all equivalence classes are created equal:
{A ← B → C, A → B → C, A ← B ← C} vs. {A → B ← C}
Within classes some dags should have greater priors than others, e.g.,
LightsOn → InOffice → LoggedOn vs. LightsOn ← InOffice → LoggedOn


Full Causal Learners
So... a full causal learner is an algorithm that:
1. Learns causal connectedness.
2. Learns v-structures. Hence, learns equivalence classes.
3. Learns full variable order. Hence, learns full causal structure (order + connectedness).
TETRAD II: 1, 2.
Madigan et al.: 1, 2.
Cooper & Herskovits K2: 1.
Lam and Bacchus MDL: 1, 2 (partial), 3 (partial).
Wallace, Neil, Korb MML: 1, 2, 3.

MDL
Minimum Description Length (MDL) inference: invented by Rissanen (1978), based upon Minimum Message Length (MML), invented by Wallace (Wallace and Boulton, 1968).
Plays off the trade-off between model simplicity and model fit to the data, by minimizing the length of a joint description of the model and of the data given the model.


Lam & Bacchus
MDL encoding of causal models:
Network:

Σ_{i=1}^n [ki log(n) + d(si − 1) Π_{j∈π(i)} sj]

ki log(n) for specifying the ki parents of the ith node
d(si − 1) Π_{j∈π(i)} sj for specifying the CPT:
d is the fixed bit-length per probability
si is the number of states for node i

Data given network:

N Σ_{i=1}^n H(Xi) − N Σ_{i=1}^n M(Xi; π(i))

H(Xi) is the entropy of variable Xi
M(Xi; π(i)) is the mutual information between Xi and its parent set

(NB: This code is not efficient. E.g., it treats every node as equally likely to be a parent; it assumes knowledge of all ki.)

Search algorithm:
Initial constraints taken from domain expert: partial variable order, direct connections.
Greedy search: every possible arc addition is tested, the best MDL measure used to add one. (Note: no arcs are deleted.)
Local arcs checked for improved MDL via arc reversal.
Iterate until MDL fails to improve.

⇒ Results similar to K2, but without full variable ordering.
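As a rough sketch, the network-cost term above can be computed directly; the 3-node network and the choice of d here are hypothetical:

```python
from math import log2

# Lam & Bacchus network description length:
#   sum_i [ k_i * log2(n) + d * (s_i - 1) * prod_{j in parents(i)} s_j ]
# k_i = number of parents, s_i = number of states of node i,
# d = fixed bit-length per probability (an arbitrary constant here).
def network_dl(parents, states, d=8):
    n = len(states)
    total = 0.0
    for node, ps in parents.items():
        cpt_cells = states[node] - 1          # free probabilities per row
        for p in ps:
            cpt_cells *= states[p]            # one row per parent config
        total += len(ps) * log2(n) + d * cpt_cells
    return total

# Hypothetical 3-node network A -> B -> C, all binary.
parents = {'A': [], 'B': ['A'], 'C': ['B']}
states = {'A': 2, 'B': 2, 'C': 2}
print(network_dl(parents, states))
```

The data term would add N Σ [H(Xi) − M(Xi; π(i))] estimated from sample frequencies; the greedy search then tests each candidate arc addition against the combined total.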


MML
Minimum Message Length (Wallace & Boulton 1968) uses Shannon's measure of information:

I(m) = −log P(m)

Applied in reverse, we can compute P(h, e) from I(h, e).
Given an efficient joint encoding method for the hypothesis & evidence space (i.e., satisfying Shannon's law), MML searches {hi} for that hypothesis h that minimizes I(h) + I(e|h).
Equivalent to that h that maximizes P(h)P(e|h), i.e., P(h|e).
The other significant difference from MDL: MML takes parameter estimation seriously.

MML Metric for Linear Models
Network:

log n! + n(n − 1)/2 − log E

log n! for variable order
n(n − 1)/2 for connectivity
−log E restores efficiency by subtracting the cost of selecting a linear extension

Parameters given dag h:

Σ_{Xj} [−log f(θj|h) + ½ log F(θj)]

where θj are the parameters for Xj and F(θj) is the Fisher information. f(θj|h) is assumed to be N(0, σj). (Cf. with MDL's fixed length for parameters.)

MML Metric for Linear Models
Sample for Xj given h and θj:

−log P(e|h, θj) = −Σ_{k=1}^K log [ (1/(σj √(2π))) e^(−ν²jk / 2σ²j) ]

where K is the number of sample values and νjk is the difference between the observed value of Xj and its linear prediction.

MML Metric for Discrete Models
We can use P_CH(hi, e) (from Cooper & Herskovits) to define an MML metric for discrete models.
Difference between MML and Bayesian metrics: MML partitions the parameter space and selects optimal parameters.
Equivalent to a penalty of ½ log(πe/6) per parameter (Wallace & Freeman 1987); hence:

I(e, hi) = (Σj sj / 2) log(πe/6) − log P_CH(hi, e)    (1)

Applied in the MML Sampling algorithm.


MML search algorithms
MML metrics need to be combined with search. This has been done three ways:
1. Wallace, Korb, Dai (1996): greedy search (linear). Brute force computation of linear extensions (small models only).
2. Neil and Korb (1999): genetic algorithms (linear). Asymptotic estimator of linear extensions. GA chromosomes = causal models; genetic operators manipulate them; selection pressure is based on MML.
3. Wallace and Korb (1999): MML sampling (linear, discrete). Stochastic sampling through the space of totally ordered causal models; no counting of linear extensions required.

MML Sampling
Search space of totally ordered models (TOMs).
Sampled via a Metropolis algorithm (Metropolis et al. 1953). From the current model M, find the next model M′ by:
Randomly select a variable; attempt to swap order with its predecessor.
Or, randomly select a pair; attempt to add/delete an arc.
Attempts succeed whenever P(M′)/P(M) > U (per the MML metric), where U is uniformly random from [0, 1].
Metropolis: this procedure samples TOMs with a frequency proportional to their posterior probability.
To find the posterior of dag h: keep count of visits to all TOMs consistent with h. Estimated by counting visits to all TOMs with identical max likelihoods to h.
Output: probabilities of top dags, top statistical equivalence classes, top MML equivalence classes.

Empirical Results
A weakness in this area, and in AI generally: paper publications are based upon very small models and loose comparisons. The ALARM net is often used; every method gets it to within 1 or 2 arcs.
Neil and Korb (1999) compared MML and BGe (Heckerman & Geiger's Bayesian metric over equivalence classes), using identical GA search over linear models:
On KL distance and topological distance from the true model, MML and BGe performed nearly the same.
On test prediction accuracy on strict effect nodes (those with no children), MML clearly outperformed BGe.
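The acceptance rule P(M′)/P(M) > U can be sketched on a toy state space. Real MML sampling proposes order swaps and arc changes over TOMs; here the "models" are just two labelled states with hypothetical posterior weights:

```python
import random

# Sketch of the Metropolis acceptance rule: a proposed model M' replaces M
# whenever P(M')/P(M) > U, with U uniform on [0, 1]. Visit counts then
# approximate the posterior over states.
def metropolis(weights, steps=20000, seed=0):
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = {s: 0 for s in states}
    for _ in range(steps):
        proposal = rng.choice(states)                        # symmetric proposal
        if weights[proposal] / weights[current] > rng.random():
            current = proposal                               # accept the move
        visits[current] += 1
    return visits

# Visit frequencies approach the normalized weights (here 1:3).
visits = metropolis({'M1': 1.0, 'M2': 3.0})
print(visits)
```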


Current Research Issues
Size and complexity.
Difficulties with elicitation.
Combinations of discrete and continuous (i.e., mixing node types).
Learning issues: missing data; latent variables; experimental data; learning CPT structure.
Multi-structure models: continuous & discrete CPTs, with & without parameter independence.

(Other) Limitations
Inappropriate problems (deterministic systems, legal rules).

Systems. Wiley. CHAPTERS 1 AND 2 COVER SOME OF THE RELEVANT HISTORY.

Introduction to Bayesian AI
T. Bayes (1764) An Essay Towards Solving a Problem in the Doctrine of Chances. Phil Trans of the Royal Soc of London. Reprinted in Biometrika, 45 (1958), 296-315.
B. Buchanan and E. Shortliffe (eds.) (1984) Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.
B. de Finetti (1964) Foresight: Its Logical Laws, Its Subjective Sources, in Kyburg and Smokler (eds.) Studies in Subjective Probability. NY: Wiley.
D. Heckerman (1986) Probabilistic Interpretations for MYCIN's Certainty Factors, in L.N. Kanal and J.F. Lemmer (eds.) Uncertainty in Artificial Intelligence. North-Holland.
C. Howson and P. Urbach (1993) Scientific Reasoning: The Bayesian Approach. Open Court. A MODERN REVIEW OF BAYESIAN THEORY.
K.B. Korb (1995) Inductive learning and defeasible inference, Jrn for Experimental and Theoretical AI, 7, 291-324.
J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.
F.P. Ramsey (1931) Truth and Probability, in The Foundations of Mathematics and Other Essays. NY: Humanities Press. THE ORIGIN OF MODERN BAYESIANISM. INCLUDES LOTTERY-BASED ELICITATION AND DUTCH-BOOK ARGUMENTS FOR THE USE OF PROBABILITIES.
R. Reiter (1980) A logic for default reasoning, Artificial Intelligence, 13, 81-132.
J. von Neumann and O. Morgenstern (1947) Theory of Games and Economic Behavior, 2nd ed. Princeton Univ. STANDARD REFERENCE ON ELICITING UTILITIES VIA LOTTERIES.

Bayesian Networks
E. Charniak (1991) Bayesian Networks Without Tears, Artificial Intelligence Magazine, pp. 50-63, Vol 12. AN ELEMENTARY INTRODUCTION.


B. D'Ambrosio (1999) Inference in Bayesian Networks. Artificial Intelligence Magazine, Vol 20, No. 2.
P. Haddawy (1999) An Overview of Some Recent Developments in Bayesian Problem-Solving Techniques. Artificial Intelligence Magazine, Vol 20, No. 2.
Howard & Matheson (1981) Influence Diagrams, Principles and Applications of Decision Analysis.
F.V. Jensen (1996) An Introduction to Bayesian Networks, Springer.
R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley. SIMILAR COVERAGE TO THAT OF PEARL; MORE EMPHASIS ON PRACTICAL ALGORITHMS FOR NETWORK UPDATING.
J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann. THIS IS THE CLASSIC TEXT INTRODUCING BAYESIAN NETWORKS TO THE AI COMMUNITY.
Poole, D., Mackworth, A., and Goebel, R. (1998) Computational Intelligence: a logical approach. Oxford University Press.
Russell & Norvig (1995) Artificial Intelligence: A Modern Approach, Prentice Hall.

Applications
D.W. Albrecht, I. Zukerman and A.E. Nicholson (1998) Bayesian Models for Keyhole Plan Recognition in an Adventure Game. User Modeling and User-Adapted Interaction, 8(1-2), 5-47, Kluwer Academic Publishers.
S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A.R. Sørensen, A. Rosenfalck and F. Jensen (1989) MUNIN: An Expert EMG Assistant, Computer-Aided Electromyography and Expert Systems, Chapter 21, J.E. Desmedt (Ed.), Elsevier.
S.A. Andreassen, J.J. Benn, R. Hovorka, K.G. Olesen and R.E. Carson (1991) A Probabilistic Approach to Glucose Prediction and Insulin Dose Adjustment: Description of Metabolic Model and Pilot Evaluation Study.
I. Beinlich, H. Suermondt, R. Chavez and G. Cooper (1992) The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks, Proc. of the 2nd European Conf. on Artificial Intelligence in Medicine, pp. 689-693.
T.L. Dean and M.P. Wellman (1991) Planning and Control, Morgan Kaufmann.
T.L. Dean, J. Allen and J. Aloimonos (1994) Artificial Intelligence: Theory and Practice, Benjamin/Cummings.
... Network Models for Forecasting, Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence, pp. 41-48.
J. Forbes, T. Huang, K. Kanazawa and S. Russell (1995) The BATmobile: Towards a Bayesian Automated Taxi, Proceedings of the 14th Int. Joint Conf. on Artificial Intelligence (IJCAI'95), pp. 1878-1885.
S.L. Lauritzen and D.J. Spiegelhalter (1988) Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems, Journal of the Royal Statistical Society, 50(2), pp. 157-224.
McConachy et al. (1999)
A.E. Nicholson (1999) CSE2309/3309 Artificial Intelligence, Monash University, Lecture Notes, http://www.csse.monash.edu.au/~annn/2-3309.html.
M. Pradham, G. Provan, B. Middleton and M. Henrion (1994) Knowledge engineering for large belief networks, Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence.
D. Pynadath and M.P. Wellman (1995) Accounting for Context in Plan Recognition, with Application to Traffic Monitoring, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 472-481.
... Likelihood-Weighting Simulation on a Large, Multiply Connected Belief Network, Proceedings of the Sixth Workshop on Uncertainty in Artificial Intelligence, pp. 498-508, 1990.
L.C. van der Gaag, S. Renooij, C.L.M. Witteman, B.M.P. Aleman, B.G. Taal (1999) How to Elicit Many Probabilities, Laskey & Prade (eds) UAI99, 647-654.
Zukerman, I., McConachy, R., Korb, K. and Pickett, D. (1999) Exploratory Interaction with a Bayesian Argumentation System, in IJCAI-99 Proceedings, the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1294-1299, Stockholm, Sweden, Morgan Kaufmann.

Learning Bayesian Networks
H. Blalock (1964) Causal Inference in Nonexperimental Research. University of North Carolina.
R. Bouckaert (1994) Probabilistic network construction using the minimum description length principle. Technical Report RUU-CS-94-27, Dept of Computer Science, Utrecht University.
C. Boutilier, N. Friedman, M. Goldszmidt, D. Koller (1996) Context-specific independence in Bayesian networks, in Horvitz & Jensen (eds.) UAI 1996, 115-123.

G. Brightwell and P. Winkler (1990) Counting linear extensions is #P-complete. Technical Report DIMACS 90-49, Dept of Computer Science, Rutgers Univ.
W. Buntine (1991) Theory refinement on Bayesian networks, in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991, 52-69.
W. Buntine (1996) A Guide to the Literature on Learning Probabilistic Networks from Data, IEEE Transactions on Knowledge and Data Engineering, 8, 195-210.
D.M. Chickering (1995) A Transformational Characterization of Equivalent Bayesian Network Structures, in P. Besnard and S. Hanks (eds.) Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 87-98). San Francisco: Morgan Kaufmann. STATISTICAL EQUIVALENCE.
G.F. Cooper and E. Herskovits (1991) A Bayesian Method for Constructing Bayesian Belief Networks from Databases, in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991, 86-94.
G.F. Cooper and E. Herskovits (1992) A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning, 9, 309-347. AN EARLY BAYESIAN CAUSAL DISCOVERY METHOD.
H. Dai, K.B. Korb, C.S. Wallace and X. Wu (1997) A study of causal discovery with weak links and small samples. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1304-1309. Morgan Kaufmann.
N. Friedman (1997) The Bayesian Structural EM Algorithm, in D. Geiger and P.P. Shenoy (eds.) Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (pp. 129-138). San Francisco: Morgan Kaufmann.
Geiger and Heckerman (1994) Learning Gaussian networks, in Lopes de Mantaras and Poole (eds.) UAI 1994, 235-243.
D. Heckerman and D. Geiger (1995) Learning Bayesian networks: A unification for discrete and Gaussian domains, in Besnard and Hanks (eds.) UAI 1995, 274-284.
D. Heckerman, D. Geiger, and D.M. Chickering (1995) Learning Bayesian Networks: The Combination of Knowledge and Statistical Data, Machine Learning, 20, 197-243. BAYESIAN LEARNING OF STATISTICAL EQUIVALENCE CLASSES.
K. Korb (1999) Probabilistic Causal Structure, in H. Sankey (ed.) Causation and Laws of Nature:

Science 14. Kluwer Academic. INTRODUCTION TO THE RELEVANT PHILOSOPHY OF CAUSATION FOR LEARNING BAYESIAN NETWORKS.
P. Krause (1998) Learning Probabilistic Networks. http://www.auai.org/bayesUSKrause.ps.gz. BASIC INTRODUCTION TO BNS, PARAMETERIZATION AND LEARNING CAUSAL STRUCTURE.
W. Lam and F. Bacchus (1993) Learning Bayesian belief networks: An approach based on the MDL principle, Jrn Comp Intelligence, 10, 269-293.
D. Madigan, S.A. Andersson, M.D. Perlman & C.T. Volinsky (1996) Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs, Comm in Statistics: Theory and Methods, 25, 2493-2519.
D. Madigan and A.E. Raftery (1994) Model selection and accounting for model uncertainty in graphical models using Occam's window, Jrn Amer Stat Assoc, 89, 1535-1546.
N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller (1953) Equations of state calculations by fast computing machines, Jrn Chemical Physics, 21, 1087-1091.
J.R. Neil and K.B. Korb (1999) The Evolution of Causal Models: A Comparison of Bayesian Metrics and Structure Priors, in Methodologies for Knowledge Discovery and Data Mining: Third Pacific-Asia Conference (pp. 432-437). Springer Verlag. GENETIC ALGORITHMS FOR CAUSAL DISCOVERY; STRUCTURE PRIORS.
J.R. Neil, C.S. Wallace and K.B. Korb (1999) Learning Bayesian networks with restricted causal interactions, in Laskey and Prade (eds.) UAI 99, 486-493.
J. Rissanen (1978) Modeling by shortest data description, Automatica, 14, 465-471.
H. Simon (1954) Spurious Correlation: A Causal Interpretation, Jrn Amer Stat Assoc, 49, 467-479.
D. Spiegelhalter & S. Lauritzen (1990) Sequential Updating of Conditional Probabilities on Directed Graphical Structures, Networks, 20, 579-605.
P. Spirtes, C. Glymour and R. Scheines (1990) Causality from Probability, in J.E. Tiles, G.T. McKee and G.C. Dean, Evolving Knowledge in Natural Science and Artificial Intelligence. London: Pitman. AN ELEMENTARY INTRODUCTION TO STRUCTURE LEARNING VIA CONDITIONAL INDEPENDENCE.
P. Spirtes, C. Glymour and R. Scheines (1993) Causation, Prediction and Search: Lecture Notes in Statistics 81. Springer Verlag. A THOROUGH PRESENTATION OF THE ORTHODOX STATISTICAL APPROACH TO LEARNING CAUSAL STRUCTURE.

J. Suzuki (1996) Learning Bayesian Belief Networks Based on the Minimum Description Length Principle, in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 462-470). San Francisco: Morgan Kaufmann.
T.S. Verma and J. Pearl (1991) Equivalence and Synthesis of Causal Models, in P. Bonissone, M. Henrion, L. Kanal and J.F. Lemmer (eds) Uncertainty in Artificial Intelligence 6 (pp. 255-268). Elsevier. THE GRAPHICAL CRITERION FOR STATISTICAL EQUIVALENCE.
C.S. Wallace and D. Boulton (1968) An information measure for classification, Computer Jrn, 11, 185-194.
C.S. Wallace and P.R. Freeman (1987) Estimation and inference by compact coding, Jrn Royal Stat Soc (Series B), 49, 240-252.
C.S. Wallace and K.B. Korb (1999) Learning Linear Causal Models by MML Sampling, in A. Gammerman (ed.) Causal Models and Intelligent Data Management. Springer Verlag. SAMPLING APPROACH TO LEARNING CAUSAL MODELS; DISCUSSION OF STRUCTURE PRIORS.

C.S. Wallace, K.B. Korb, and H. Dai (1996) Causal Discovery via MML, in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 516-524). San Francisco: Morgan Kaufmann. INTRODUCES AN MML METRIC FOR CAUSAL MODELS.
S. Wright (1921) Correlation and Causation, Jrn Agricultural Research, 20, 557-585.
S. Wright (1934) The Method of Path Coefficients, Annals of Mathematical Statistics, 5, 161-215.

Current Research

Bayesian Network URLs
