
CS109/Stat121/AC209/E-109

Data Science
Bayesian Methods
Hanspeter Pfister & Joe Blitzstein
pfister@seas.harvard.edu / blitzstein@stat.harvard.edu

Frequentist vs. Bayesian
This Week
HW3 due next Thursday (Oct 17) at 11:59 pm
start now!

Friday lab 10-11:30 am in MD G115


Nate Silver is Hiring:
The Rise of Data Journalism?
http://www.fivethirtyeight.com/2013/09/seeking-lead-writers-in-sports-politics.html

FiveThirtyEight is conducting a search for lead writers in three of our most important content verticals: sports, politics and economics. ...
These are high-profile, full-time positions for people with an
outstanding combination of writing and statistical skills. ...
We are intrigued by candidates who can combine traditional
reporting with critical, empirical analysis...
Programming skills, database skills, and familiarity with
statistical software packages are clear positives....
So is the demonstrated ability to produce high-quality charts,
graphics, and interactive features.
Bayesian Data Analysis
Probabilistic Programming and Bayesian Methods for Hackers

http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Prologue/Prologue.ipynb
Full Probability Modeling
The process of Bayesian data analysis can be idealized by dividing it into the following three steps:
1. Setting up a full probability model: a joint probability distribution for all observable and unobservable quantities in a problem...
2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution, the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.
3. Evaluating the fit of the model and the implications of the resulting posterior distribution...
-- Gelman et al., Bayesian Data Analysis
Conjugate Priors

http://www.johndcook.com/conjugate_prior_diagram.html
Ranking Reddit Comments:
Example from Probabilistic Programming and Bayesian Methods for Hackers

http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter4_TheGreatestTheoremNeverTold/LawOfLargeNumbers.ipynb
Ranking Reddit Comments: A Simple Model

number of upvotes ~ Bin(n, p)

conjugate prior: p ~ Beta(a, b), with pdf proportional to p^(a-1) (1-p)^(b-1)

posterior: p | data ~ Beta(a + #upvotes, b + #downvotes)
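The conjugate update above takes only a few lines of Python. This is a minimal sketch, not the notebook's code; the function names and the uniform Beta(1, 1) prior are choices made here for illustration.

```python
# Beta-Binomial conjugacy: with prior p ~ Beta(a, b), observing
# `up` upvotes and `down` downvotes gives p | data ~ Beta(a + up, b + down).

def beta_posterior(a, b, up, down):
    """Posterior (alpha, beta) parameters for the upvote probability p."""
    return a + up, b + down

def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution: alpha / (alpha + beta)."""
    return alpha / (alpha + beta)

# Uniform Beta(1, 1) prior; a comment with 3 upvotes and 1 downvote:
alpha, beta = beta_posterior(1, 1, up=3, down=1)
print(alpha, beta)                  # 4 2
print(posterior_mean(alpha, beta))  # 0.666...
```

No sampling or numerical integration is needed here: conjugacy is exactly what makes the posterior available in closed form.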


Ranking Reddit Comments

Why not just add pseudocounts and then use the proportion? Why bother with Bayes?

For example, the Agresti-Coull method adds 2 successes and 2 failures.
Posterior Distributions for Reddit Comments

http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter4_TheGreatestTheoremNeverTold/LawOfLargeNumbers.ipynb
Ranking Reddit Comments by Posterior Quantiles

http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter4_TheGreatestTheoremNeverTold/LawOfLargeNumbers.ipynb
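One way to "rank by posterior quantiles" with only the standard library is to Monte Carlo a lower quantile (say the 5th percentile) of each comment's Beta posterior. A sketch, not the notebook's implementation; the vote counts and quantile choice below are made up.

```python
import random

def lower_bound(up, down, q=0.05, draws=20000, a=1, b=1):
    """Approximate the q-quantile of the Beta(a + up, b + down) posterior
    by Monte Carlo, using random.betavariate (stdlib only)."""
    samples = sorted(random.betavariate(a + up, b + down) for _ in range(draws))
    return samples[int(q * draws)]

random.seed(0)
# Hypothetical comments: name -> (upvotes, downvotes)
comments = {"A": (1, 0), "B": (99, 5), "C": (10, 10)}
ranked = sorted(comments, key=lambda c: lower_bound(*comments[c]), reverse=True)
print(ranked)  # B outranks A despite A's "perfect" 1-0 record
```

Ranking by a lower posterior quantile rewards both a high upvote proportion and enough votes to be confident in it, which is exactly why the tiny-sample comment A drops to the bottom.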
Bayesian Bandits
Example from Probabilistic Programming and Bayesian Methods for Hackers

http://research.microsoft.com/en-us/projects/bandits/

N slot machines, each with its own unknown probability of giving a prize. Exploration-exploitation tradeoff.
Bayesian Bandits
Example from Probabilistic Programming and Bayesian Methods for Hackers

A fast, simple Bayesian algorithm:

1. sample from the prior of each bandit
2. select the bandit with the largest sampled value
3. update the prior for that bandit (the posterior becomes the new prior)
4. repeat.
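The four steps above are Thompson sampling with Beta priors. A minimal stdlib-only sketch; the prize probabilities, round count, and seed are made up for illustration.

```python
import random

def thompson(true_probs, rounds=2000, seed=42):
    """Bayesian bandits via Thompson sampling with Beta(1, 1) priors.
    true_probs are the prize probabilities, unknown to the algorithm."""
    rng = random.Random(seed)
    n = len(true_probs)
    wins = [1] * n    # Beta alpha parameters
    losses = [1] * n  # Beta beta parameters
    pulls = [0] * n
    for _ in range(rounds):
        # 1. sample from each bandit's current Beta prior
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(n)]
        # 2. select the bandit with the largest sampled value
        i = samples.index(max(samples))
        # 3. pull it and update; the posterior becomes the new prior
        pulls[i] += 1
        if rng.random() < true_probs[i]:
            wins[i] += 1
        else:
            losses[i] += 1
    return pulls

pulls = thompson([0.2, 0.5, 0.8])
print(pulls)  # the 0.8 bandit accumulates the bulk of the pulls
```

Sampling from the posterior (rather than always picking the current best mean) is what balances exploration and exploitation: uncertain bandits occasionally produce large samples and get tried.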
Bayesian Bandits

http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter6_Priorities/Priors.ipynb
LASSO and Sparsity

Another method is the lasso, introduced by Tibshirani (1996) and studied by many others. In a linear regression model, in place of minimizing the sum of squared residuals SSR(β), the lasso estimate is the value minimizing a modified version:

SSR(β; λ) = Σ_{i=1}^n (y_i − β^T x_i)^2 + λ Σ_{j=1}^p |β_j|

Bayesian interpretation: posterior mode, with independent Laplace priors on the parameters.
[Plot: density f(x) of the Laplace (double-exponential) prior, sharply peaked at 0]
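The penalized criterion above can be minimized by cyclic coordinate descent with soft-thresholding, which is one standard lasso algorithm (not code from the course). A minimal sketch with no intercept and made-up toy data.

```python
import math

def soft_threshold(z, g):
    """Soft-thresholding operator: the one-dimensional lasso solution."""
    return math.copysign(max(abs(z) - g, 0.0), z)

def lasso(X, y, lam, sweeps=200):
    """Minimize sum_i (y_i - beta.x_i)^2 + lam * sum_j |beta_j|
    by cyclic coordinate descent over the coefficients."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # residual leaving feature j out
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            z = sum(X[i][j] * r[i] for i in range(n))
            a = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(z, lam / 2) / a
    return beta

# Toy data: y = 2 * x0 exactly, with an irrelevant second feature.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.2], [4.0, -0.1]]
y = [2.0, 4.0, 6.0, 8.0]
print(lasso(X, y, lam=0.1))   # beta0 near 2, beta1 shrunk to 0
print(lasso(X, y, lam=1e6))   # huge penalty: both coefficients exactly 0
```

The soft-threshold step is where the sparsity comes from: coefficients whose partial correlation with the residual is below the penalty are set exactly to zero, matching the Laplace prior's sharp peak at 0.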
Kidney Cancer Example from Bayesian Data Analysis (Gelman et al)

U.S. counties with the highest 10% of kidney cancer death rates (age-adjusted)

Figure 2.7: The counties of the United States with the highest 10% age-standardized death rates for cancer of kidney/ureter for U.S. white males, 1980-1989.
Kidney Cancer Example from Bayesian Data Analysis (Gelman et al)

The issue is sample size. Consider a county of population 1000. Kidney cancer is a rare disease, and, in any ten-year period, a county of 1000 will probably have zero kidney cancer deaths, so that it will be tied for the lowest rate in the country and will be shaded in Figure 2.8. However, there is a chance the county will have one kidney cancer death...

Figure 2.7: The counties of the United States with the highest 10% age-standardized death rates for cancer of kidney/ureter for U.S. white males, 1980-1989. Why are most of the shaded counties in the middle of the country? See Section 2.8 for discussion.

...each based on different data but with a common prior distribution. In addition to illustrating the role of the prior distribution, this example introduces hierarchical modeling, to which we return in Chapter 5.

A puzzling pattern in a map

Figure 2.7 shows the counties in the United States with the highest kidney cancer death rates...

Wainer, The Most Dangerous Equation (American Scientist, 2007)
teal: lowest 10%; orange: highest 10%
rural lifestyle? no; access to good medical care? no

Figure 2: A cursory glance at the distribution of the U.S. counties with the lowest rates of kidney cancer (teal) might lead one to conclude that something about the rural lifestyle reduces the risk of that cancer. After all, the counties with the lowest 10 percent of risk are mainly Midwestern, Southern and Western counties. When one examines the distribution of counties with the highest rates of kidney cancer (red), however, it becomes clear that some other factor is at play. Knowledge of de Moivre's equation leads to the conclusion that what the counties with the lowest and highest kidney-cancer rates have in common is low population, and therefore high variation in kidney-cancer rates.
Kidney Cancer Example from Bayesian Data Analysis (Gelman et al)

simple model: yj ~ Pois(10 nj θj)

θj ~ Gamma(α, β)

E(θj | yj) = w · yj/(10 nj) + (1 − w) · E(θj)

a weighted combination of the data and the prior mean

...319 were 0s, 141 were 1s, 33 were 2s, and 5...
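The weighted-combination formula falls out of Gamma-Poisson conjugacy: θj | yj ~ Gamma(α + yj, β + 10 nj), so the posterior mean is (α + yj)/(β + 10 nj). A sketch; the hyperparameter values and county figures below are made up for illustration (BDA fits α and β to the data).

```python
def posterior_mean_rate(y, n, alpha, beta):
    """E(theta | y) for y ~ Pois(10*n*theta), theta ~ Gamma(alpha, beta).
    Conjugacy gives theta | y ~ Gamma(alpha + y, beta + 10*n)."""
    return (alpha + y) / (beta + 10 * n)

# Hypothetical hyperparameters, chosen so the prior mean is ~4.65e-5:
alpha, beta = 20.0, 430000.0
prior_mean = alpha / beta

# Small county: population 1000 and zero deaths in a decade (raw rate 0).
small = posterior_mean_rate(y=0, n=1000, alpha=alpha, beta=beta)
# Large county: population 1,000,000 and 100 deaths (raw rate 1e-5).
large = posterior_mean_rate(y=100, n=1_000_000, alpha=alpha, beta=beta)

# The small county is pulled almost all the way to the prior mean;
# the large county's estimate stays close to its raw rate.
print(small, large, prior_mean)
```

This is the shrinkage in action: the weight on the data, w = 10 nj / (β + 10 nj), grows with population, so small counties borrow strength from the prior while large counties are left mostly alone.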
Kidney Cancer Example from Bayesian Data Analysis (Gelman et al)

raw data: death rates yj/(10 nj) vs. population size nj. Small counties account for almost all of the high and low death rates; replotting on the scale of log10 population shows the data more clearly. The patterns come from the discreteness of the data (yj = 0, 1, 2, ...).

Kidney Cancer Example from Bayesian Data Analysis (Gelman et al)

Bayes estimates: automatically account for regression toward the mean.

Figure 2.10: (a) Bayes-estimated posterior mean kidney cancer death rates vs. logarithm of population size, for the 3071 counties in the U.S. (b) Posterior medians and 50% intervals for a sample of 100 counties j. The scales on the...
Kidney Cancer Example from Bayesian Data Analysis (Gelman et al)

Bayesian posterior medians and 50% probability intervals

Figure 2.10: (a) Bayes-estimated posterior mean kidney cancer death rates vs. logarithm of population size nj, for the 3071 counties in the U.S. (b) Posterior medians and 50% intervals for...
Markov Chain Monte Carlo (MCMC): Diaconis-Coram Example
drawn from course work of Stanford students Marc Coram and Phil Beineke

Example 1 (Cryptography). Stanford's Statistics Department has a drop-in consulting service. One day, a psychologist from the state prison system showed up with a collection of coded messages. Figure 1 shows part of a typical example.

Figure 1:
The problem was to decode these messages. Marc guessed that the code was a simple substitution cipher, each symbol standing for a letter, number, punctuation mark or space. Thus, there is an unknown function f:

f : {code space} → {usual alphabet}.
MCMCryptography

Get a transition matrix M(x, y) for English: the probability of going from letter x to letter y. Marc downloaded a standard text (e.g., War and Peace) and recorded the first-order transitions, the proportion of consecutive letter pairs. Define the plausibility of f via

Pl(f) = ∏_i M(f(s_i), f(s_{i+1})),

a product over consecutive symbols in the coded message.


Try to swap two random letters in the decoding, basedFunc
ues of Pl(f ) areongood candidates for decryption.
the ratio of plausibilities. Maxim
y running the following Markov chain Monte Carlo alg
with a preliminary guess, say f .
ute Pl(f ).
ge to f by making a random transposition of the value
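The sampler can be sketched end-to-end on a toy 27-character alphabet. A hypothetical mini version, not Coram's code: the reference text, add-one smoothing, and step count are made up, and a real run would estimate transitions from a long text such as War and Peace.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def bigram_log_probs(text):
    """First-order transition log-probabilities M(x, y) estimated from a
    reference text, with add-one smoothing so unseen pairs stay possible."""
    counts = {x: {y: 1 for y in ALPHABET} for x in ALPHABET}
    for x, y in zip(text, text[1:]):
        counts[x][y] += 1
    logM = {}
    for x in ALPHABET:
        total = sum(counts[x].values())
        logM[x] = {y: math.log(c / total) for y, c in counts[x].items()}
    return logM

def log_plausibility(f, coded, logM):
    """log Pl(f): sum of log M(f(s_i), f(s_{i+1})) over the coded message."""
    decoded = [f[c] for c in coded]
    return sum(logM[x][y] for x, y in zip(decoded, decoded[1:]))

def mcmc_decrypt(coded, logM, steps=2000, seed=1):
    """Metropolis sampler over substitution ciphers via random transpositions."""
    rng = random.Random(seed)
    f = dict(zip(ALPHABET, ALPHABET))       # preliminary guess: identity
    lp = log_plausibility(f, coded, logM)
    best_f, best_lp = dict(f), lp
    for _ in range(steps):
        a, b = rng.sample(ALPHABET, 2)      # random transposition
        f[a], f[b] = f[b], f[a]
        new_lp = log_plausibility(f, coded, logM)
        # accept with probability min(1, Pl(f*) / Pl(f))
        if new_lp > lp or rng.random() < math.exp(new_lp - lp):
            lp = new_lp
            if lp > best_lp:
                best_f, best_lp = dict(f), lp
        else:
            f[a], f[b] = f[b], f[a]         # reject: undo the swap
    return best_f, best_lp
```

Tracking the best f visited turns the sampler into a maximizer of Pl(f), which is the slide's goal; working in log space avoids underflow from multiplying many small transition probabilities.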
Figure 2:

The text was scrambled at random and the Monte Carlo algorithm was run.
Figure 3 shows sample output.

Figure 3:
Figure 4:

I like this example because a) it is real, b) there is no question the algorithm found the correct answer, and c) the procedure works despite the implausible underlying assumptions. In fact, the message is in a mix of English, Spanish and prison jargon, while the plausibility measure is based on first-order transitions only. A preliminary...
