You are on page 1of 29

Conjugate Bayesian analysis of the Gaussian distribution

Kevin P. Murphy

murphyk@cs.ubc.ca
Last updated October 3, 2007
1 Introduction
The Gaussian or normal distribution is one of the most widely used in statistics. Estimating its parameters using
Bayesian inference and conjugate priors is also widely used. The use of conjugate priors allows all the results to be
derived in closed form. Unfortunately, different books use different conventions on how to parameterize the various
distributions (e.g., put the prior on the precision or the variance, use an inverse gamma or inverse chi-squared, etc),
which can be very confusing for the student. In this report, we summarize all of the most commonly used forms. We
provide detailed derivations for some of these results; the rest can be obtained by simple reparameterization. See the
appendix for the denition the distributions that are used.
2 Normal prior
Let us consider Bayesian estimation of the mean of a univariate Gaussian, whose variance is assumed to be known.
(We discuss the unknown variance case later.)
2.1 Likelihood
Let D = (x
1
, . . . , x
n
) be the data. The likelihood is
p(D|,
2
) =
n

i=1
p(x
i
|,
2
) = (2
2
)
n/2
exp
_

1
2
2
n

i=1
(x
i
)
2
_
(1)
Let us dene the empirical mean and variance
x =
1
n
n

i=1
x
i
(2)
s
2
=
1
n
n

i=1
(x
i
x)
2
(3)
(Note that other authors (e.g., [GCSR04]) dene s
2
=
1
n1

n
i=1
(x
i
x)
2
.) We can rewrite the term in the exponent
as follows

i
(x
i
)
2
=

i
[(x
i
x) ( x)]
2
(4)
=

i
(x
i
x)
2
+

i
(x )
2
2

i
(x
i
x)( x) (5)
= ns
2
+n(x )
2
(6)
since

i
(x
i
x)( x) = ( x)
_
(

i
x
i
) nx
_
= ( x)(nx nx) = 0 (7)

Thanks to Hoyt Koepke for proof reading.


1
Hence
p(D|,
2
) =
1
(2)
n/2
1

n
exp
_

1
2
2
_
ns
2
+n(x )
2

_
(8)

_
1

2
_
n/2
exp
_

n
2
2
(x )
2
_
exp
_

ns
2
2
2
_
(9)
If
2
is a constant, we can write this as
p(D|) exp
_

n
2
2
(x )
2
_
N(x|,

2
n
) (10)
since we are free to drop constant factors in the denition of the likelihood. Thus n observations with variance
2
and
mean x is equivalent to 1 observation x
1
= x with variance
2
/n.
2.2 Prior
Since the likelihood has the form
p(D|) exp
_

n
2
2
(x )
2
_
N(x|,

2
n
) (11)
the natural conjugate prior has the form
p() exp
_

1
2
2
0
(
0
)
2
_
N(|
0
,
2
0
) (12)
(Do not confuse
2
0
, which is the variance of the prior, with
2
, which is the variance of the observation noise.) (A
natural conjugate prior is one that has the same form as the likelihood.)
2.3 Posterior
Hence the posterior is given by
p(|D) p(D|, )p(|
0
,
2
0
) (13)
exp
_

1
2
2

i
(x
i
)
2
_
exp
_

1
2
2
0
(
0
)
2
_
(14)
= exp
_
1
2
2

i
(x
2
i
+
2
2x
i
) +
1
2
2
0
(
2
+
2
0
2
0
)
_
(15)
Since the product of two Gaussians is a Gaussian, we will rewrite this in the form
p(|D) exp
_

2
2
_
1

2
0
+
n

2
_
+
_

2
0
+

i
x
i

2
_

_

2
0
2
2
0
+

i
x
2
i
2
2
__
(16)
def
= exp
_

1
2
2
n
(
2
2
n
+
2
n
)
_
= exp
_

1
2
2
n
(
n
)
2
_
(17)
Matching coefcients of
2
, we nd
2
n
is given by

2
2
2
n
=

2
2
_
1

2
0
+
n

2
_
(18)
1

2
n
=
1

2
0
+
n

2
(19)

2
n
=

2

2
0
n
2
0
+
2
=
1
n

2
+
1

2
0
(20)
2
N = 0
N = 1
N = 2
N = 10
1 0 1
0
5
Figure 1: Sequentially updating a Gaussian mean starting with a prior centered on 0 = 0. The true parameters are

= 0.8
(unknown), (
2
)

= 0.1 (known). Notice how the data quickly overwhelms the prior, and how the posterior becomes narrower.
Source: Figure 2.12 [Bis06].
Matching coefcients of we get
2
n
2
2
n
=
_
n
i=1
x
i

2
+

0

2
0
_
(21)

2
n
=

n
i=1
x
i

2
+

0

2
0
(22)
=

2
0
nx +
2

2
0
(23)
Hence

n
=

2
n
2
0
+
2

0
+
n
2
0
n
2
0
+
2
x =
2
n
_

2
0
+
nx

2
_
(24)
This operation of matching rst and second powers of is called completing the square.
Another way to understand these results is if we work with the precision of a Gaussian, which is 1/variance (high
precision means low variance, low precision means high variance). Let
= 1/
2
(25)

0
= 1/
2
0
(26)

n
= 1/
2
n
(27)
Then we can rewrite the posterior as
p(|D, ) = N(|
n
,
n
) (28)

n
=
0
+n (29)

n
=
xn +
0

n
= w
ML
+ (1 w)
0
(30)
3
5 0 5
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
prior sigma 10.000
prior
lik
post
5 0 5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
prior sigma 1.000
prior
lik
post
(a) (b)
Figure 2: Bayesian estimation of the mean of a Gaussian from one sample. (a) Weak prior N(0, 10). (b) Strong prior N(0, 1). In
the latter case, we see the posterior mean is shrunk towards the prior mean, which is 0. Figure produced by gaussBayesDemo.
where nx =

n
i=1
x
i
and w =
n
n
. The precision of the posterior
n
is the precision of the prior
0
plus one
contribution of data precision for each observed data point. Also, we see the mean of the posterior is a convex
combination of the prior and the MLE, with weights proportional to the relative precisions.
To gain further insight into these equations, consider the effect of sequentially updating our estimate of (see
Figure 1). After observing one data point x (so n = 1), we have the following posterior mean

1
=

2

2
+
2
0

0
+

2
0

2
+
2
0
x (31)
=
0
+ (x
0
)

2
0

2
+
2
0
(32)
= x (x
0
)

2

2
+
2
0
(33)
The rst equation is a convex combination of the prior and MLE. The second equation is the prior mean ajusted
towards the data x. The third equation is the data x adjusted towads the prior mean; this is called shrinkage. These
are all equivalent ways of expressing the tradeoff between likelihood and prior. See Figure 2 for an example.
2.4 Posterior predictive
The posterior predictive is given by
p(x|D) =
_
p(x|)p(|D)d (34)
=
_
N(x|,
2
)N(|
n
,
2
n
)d (35)
= N(x|
n
,
2
n
+
2
) (36)
This follows fromgeneral properties of the Gaussian distribution (see Equation 2.115 of [Bis06]). An alternative proof
is to note that
x = (x ) + (37)
x N(0,
2
) (38)
N(
n
,
2
n
) (39)
Since E[X
1
+X
2
] = E[X
1
] +E[X
2
] and Var [X
1
+X
2
] = Var [X
1
] +Var [X
2
] if X
1
, X
2
are independent, we have
X N(
n
,
2
n
+
2
) (40)
4
since we assume that the residual error is conditionally independent of the parameter. Thus the predictive variance is
the uncertainty due to the observation noise
2
plus the uncertainty due to the parameters,
2
n
.
2.5 Marginal likelihood
Writing m =
0
and
2
=
2
0
for the hyper-parameters, we can derive the marginal likelihood as follows:
= p(D|m,
2
,
2
) =
_
[
n

i=1
N(x
i
|,
2
)]N(|m,
2
)d (41)
=

(

2)
n

n
2
+
2
exp
_

i
x
2
i
2
2

m
2
2
2
_
exp
_

2
n
2
x
2

2
+

2
m
2

2
+ 2nxm
2(n
2
+
2
)
_
(42)
The proof is below, based on the on the appendix of [DMP
+
06].
We have
= p(D|m,
2
,
2
) =
_
[
n

i=1
N(x
i
|,
2
)]N(|m,
2
)d (43)
=
1
(

2)
n
(

2)
_
exp
_

1
2
2

i
(x
i
)
2

1
2
2
( m)
2
_
d (44)
Let us dene S
2
= 1/
2
and T
2
= 1/
2
. Then
=
1
(

2/S)
n
(

2/T)
_
exp
_

S
2
2
(

i
x
2
i
+n
2
2

i
x
i
)
T
2
2
(
2
+m
2
2m)
_
d (45)
= c
_
exp
_

1
2
(S
2
n
2
2S
2

i
x
i
+T
2

2
2T
2
m)
_
d (46)
where
c =
exp
_

1
2
(S
2

i
x
2
i
+T
2
m
2
)
_
(

2/S)
n
(

2/T)
(47)
So
= c
_
exp
_

1
2
(S
2
n +T
2
)
_

2
2
S
2

i
x
i
+T
2
m
S
2
n +T
2
__
d (48)
= c exp
_
(S
2
nx +T
2
m)
2
2(S
2
n +T
2
)
__
exp
_

1
2
(S
2
n +T
2
)
_

S
2
nx +T
2
m
S
2
n +T
2
_
2
_
d (49)
= c exp
_
(S
2
nx +T
2
m)
2
2(S
2
n +T
2
)
_
2

S
2
n +T
2
(50)
=
exp
_

1
2
(S
2

i
x
2
i
+T
2
m
2
)
_
(

2/S)
n
(

2/T)
exp
_
(S
2
nx +T
2
m)
2
2(S
2
n +T
2
)
_
2

S
2
n +T
2
(51)
Now
1
_
(2)/T

S
2
n +T
2
=

N
2
+
2
(52)
and
(
nx

2
+
m

2
)
2
2(
n

2
+
1

2
)
=
(nx
2
+m
2
)
2
2
2

2
(n
2
+
2
)
(53)
=
n
2
x
2

2
/
2
+
2
m
2
/
2
+ 2nxm
2(n
2
+
2
)
(54)
5
So
p(D) =

(

2)
n

n
2
+
2
exp
_

i
x
2
i
2
2

m
2
2
2
_
exp
_

2
n
2
x
2

2
+

2
m
2

2
+ 2nxm
2(n
2
+
2
)
_
(55)
To check this, we should ensure that we get
p(x|D) =
p(x, D)
p(D)
= N(x|
n
,
2
n
+
2
) (56)
(To be completed)
2.6 Conditional prior p(|
2
)
Note that the previous prior is not, strictly speaking, conjugate, since it has the form p() whereas the posterior has
the form p(|D, ), i.e., occurs in the posterior but not the prior. We can rewrite the prior in conditional form as
follows
p(|) = N(|
0
,
2
/
0
) (57)
This means that if
2
is large, the variance on the prior of is also large. This is reasonable since
2
denes the
measurement scale of x, so the prior belief about is equivalent to
0
observations of
0
on this scale. (Hence a
noninformative prior is
0
= 0.) Then the posterior is
p(|D) = N(|
n
,
2
/
n
) (58)
where
n
=
0
+ n. In this form, it is clear that
0
plays a role analogous to n. Hence
0
is the equivalent sample
size of the prior.
2.7 Reference analysis
To get an uninformative prior, we just set the prior variance to innity to simulate a uniform prior on .
p() 1 = N(|, ) (59)
p(|D) = N(|x,
2
/n) (60)
3 Normal-Gamma prior
We will now suppose that both the mean m and the precision =
2
are unknown. We will mostly follow the
notation in [DeG70, p169].
3.1 Likelihood
The likelihood can be written in this form
p(D|, ) =
1
(2)
n/2

n/2
exp
_

2
n

i=1
(x
i
)
2
_
(61)
=
1
(2)
n/2

n/2
exp
_

2
_
n( x)
2
+
n

i=1
(x
i
x)
2
__
(62)
3.2 Prior
The conjugate prior is the normal-Gamma:
NG(, |
0
,
0
,
0
,
0
)
def
= N(|
0
, (
0
)
1
)Ga(|
0
, rate =
0
) (63)
=
1
Z
NG
(
0
,
0
,
0
,
0
)

1
2
exp(

2
(
0
)
2
)
01
e
0
(64)
=
1
Z
NG

0
1
2
exp
_

2
_

0
(
0
)
2
+ 2
0

_
(65)
Z
NG
(
0
,
0
,
0
,
0
) =
(
0
)

0
0
_
2

0
_
1
2
(66)
6
2
0
2
0
2
4
0
0.2
0.4

NG(=2.0, a=1.0, b=1.0)

2
0
2
0
2
4
0
0.2
0.4

NG(=2.0, a=3.0, b=1.0)

2
0
2
0
2
4
0
0.2
0.4

NG(=2.0, a=5.0, b=1.0)

2
0
2
0
2
4
0
0.2
0.4

NG(=2.0, a=5.0, b=3.0)

Figure 3: Some Normal-Gamma distributions. Produced by NGplot2.


See Figure 3 for some plots.
We can compute the prior marginal on as follows:
p()
_

0
p(, )d (67)
=
_

0

0+
1
2
1
exp
_
(
0
+

0
(
0
)
2
2
)
_
d (68)
We recognize this as an unnormalized Ga(a =
0
+
1
2
, b =
0
+
0(0)
2
2
) distribution, so we can just write down
p()
(a)
b
a
(69)
b
a
(70)
= (
0
+

0
2
(
0
)
2
)
0
1
2
(71)
= (1 +
1
2
0

0
(
0
)
2

0
)
(20+1)/2
(72)
which we recognize as as a T
20
(|
0
,
0
/(
0

0
)) distribution.
7
3.3 Posterior
The posterior can be derived as follows.
p(, |D) NG(, |
0
,
0
,
0
,
0
)p(D|, ) (73)

1
2
e
(0(0)
2
)/2

01
e
0

n/2
e

2
P
n
i=1
(xi)
2
(74)

1
2

0+n/21
e
0
e
(/2)[0(0)
2
+
P
i
(xi)
2
]
(75)
From Equation 6 we have
n

i=1
(x
i
)
2
= n( x)
2
+
n

i=1
(x
i
x)
2
(76)
Also, it can be shown that

0
(
0
)
2
+n( x)
2
= (
0
+n)(
n
)
2
+

0
n(x
0
)
2

0
+n
(77)
where

n
=

0

0
+nx

0
+n
(78)
Hence

0
(
0
)
2
+

i
(x
i
)
2
=
0
(
0
)
2
+n( x)
2
+

i
(x
i
x)
2
(79)
= (
0
+n)(
n
)
2
+

0
n(x
0
)
2

0
+n
+

i
(x
i
x)
2
(80)
So
p(, |D)
1
2
e
(/2)(0+n)(n)
2
(81)

0+n/21
e
0
e
(/2)
P
i
(xix)
2
e
(/2)

0
n(x
0
)
2

0
+n
(82)
N(|
n
, (( +n))
1
) Ga(|
0
+n/2,
n
) (83)
where

n
=
0
+
1
2
n

i=1
(x
i
x)
2
+

0
n(x
0
)
2
2(
0
+n)
(84)
In summary,
p(, |D) = NG(, |
n
,
n
,
n
,
n
) (85)

n
=

0

0
+nx

0
+n
(86)

n
=
0
+n (87)

n
=
0
+n/2 (88)

n
=
0
+
1
2
n

i=1
(x
i
x)
2
+

0
n(x
0
)
2
2(
0
+n)
(89)
We see that the posterior sum of squares,
n
, combines the prior sum of squares,
0
, the sample sum of squares,

i
(x
i
x)
2
, and a term due to the discrepancy between the prior mean and sample mean. As can be seen from
Figure 3, the range of probable values for and
2
can be quite large even after for moderate n. Keep this picture in
mind whenever someones claims to have t a Gaussian to their data.
8
3.3.1 Posterior marginals
The posterior marginals are (using Equation 72)
p(|D) = Ga(|
n
,
n
) (90)
p(|D) = T
2n
(|
n
,
n
/(
n

n
)) (91)
3.4 Marginal likelihood
To derive the marginal likelihood, we just dererive the posterior, but this time we keep track of all the constant factors.
Let NG

(, |
0
,
0
,
0
,
0
) denote an unnormalized Normal-Gamma distribution, and let Z
0
= Z
NG
(
0
,
0
,
0
,
0
)
be the normalization constant of the prior; similarly let Z
n
be the normalization constant of the posterior. Let
N

(x
i
|, ) denote an unnormalized Gaussian with normalization constant 1/

2. Then
p(, |D) =
1
p(D)
1
Z
0
NG

(, |
0
,
0
,
0
,
0
)
_
1
2
_
n/2

i
N

(x
i
|, ) (92)
The NG

and N

terms combine to make the posterior NG

:
p(, |D) =
1
Z
n
NG

(, |
n
,
n
,
n
,
n
) (93)
Hence
p(D) =
Z
n
Z
0
(2)
n/2
(94)
=
(
n
)
(
0
)

0
0

n
n
(

n
)
1
2
(2)
n/2
(95)
3.5 Posterior predictive
The posterior predictive for m new observations is given by
p(D
new
|D) =
p(D
new
, D)
p(D)
(96)
=
Z
n+m
Z
0
(2)
(n+m)/2
Z
0
Z
n
(2)
n/2
(97)
=
Z
n+m
Z
n
(2)
m/2
(98)
=
(
n+m
)
(
n
)

n
n

n+m
n+m
_

n

n+m
_
1
2
(2)
m/2
(99)
In the special case that m = 1, it can be shown (see below) that this is a T-distribution
p(x|D) = t
2n
(x|
n
,

n
(
n
+ 1)

n
) (100)
To derive the m = 1 result, we proceed as follows. (This proof is by Xiang Xuan, and is based on [GH94, p10].)
When m = 1, the posterior parameters are

n+1
=
n
+ 1/2 (101)

n+1
=
n
+ 1 (102)

n+1
=
n
+
1
2
1

i=1
(x
i
x)
2
+

n
( x
n
)
2
2(
n
+ 1)
(103)
9
Use the fact that when m = 1, we have x
1
= x (since there is only one observation), hence we have
1
2

1
i=1
(x
i
x)
2
=
0. Lets use x denote D
new
, then
n+1
is

n+1
=
n
+

n
(x
n
)
2
2(
n
+ 1)
(104)
Substituting, we have the following,
p(D
new
|D) =
(
n+1
)
(
n
)

n
n

n+1
n+1
_

n

n+1
_1
2
(2)
1/2
(105)
=
(
n
+ 1/2)
(
n
)

n
n
(
n
+
n(xn)
2
2(n+1)
)
n+1/2
_

n

n
+ 1
_1
2
(2)
1/2
(106)
=
((2
n
+ 1)/2)
((2
n
)/2)
_
_

n
+
n(xn)
2
2(n+1)
_
_
n+1/2
1

1
2
n
_

n
2(
n
+ 1)
_1
2
)
1/2
(107)
=
((2
n
+ 1)/2)
((2
n
)/2)
_
_
1
1 +
n(xn)
2
2n(n+1)
_
_
n+1/2
_

n
2
n
(
n
+ 1)
_1
2
()
1/2
(108)
= ()
1/2
((2
n
+ 1)/2)
((2
n
)/2)
_

n

n
2
n

n
(
n
+ 1)
_1
2
_
1 +

n

n
(x
n
)
2
2
n

n
(
n
+ 1)
_
(2n+1)/2
(109)
Let =
nn
n(n+1)
, then we have,
p(D
new
|D) = ()
1/2
((2
n
+ 1)/2)
((2
n
)/2)
_

2
n
_1
2
_
1 +
(x
n
)
2
2
n
_
(2n+1)/2
(110)
We can see this is a T-distribution with center at
n
, precision =
nn
n(n+1)
, and degree of freedom 2
n
.
3.6 Reference analysis
The reference prior for NG is
p(m, )
1
= NG(m, | = , = 0, =
1
2
, = 0) (111)
So the posterior is
p(m, |D) = NG(
n
= x,
n
= n,
n
= (n 1)/2,
n
=
1
2
n

i=1
(x
i
x)
2
) (112)
So the posterior marginal of the mean is
p(m|D) = t
n1
(m|x,

i
(x
i
x)
2
n(n 1)
) (113)
which corresponds to the frequentist sampling distribution of the MLE . Thus in this case, the condence interval
and credible interval coincide.
4 Gamma prior
If is known, and only is unknown (e.g., when implementing Gibbs sampling), we can use the following results,
which can be derived by simplifying the results for the Normal-NG model.
10
4.1 Likelihood
p(D|)
n/2
exp
_

2
n

i=1
(x
i
)
2
_
(114)
4.2 Prior
p() = Ga(|, )
1
e

(115)
4.3 Posterior
p(|D) = Ga(|
n
,
n
) (116)

n
= +n/2 (117)

n
= +
1
2
n

i=1
(x
i
)
2
(118)
4.4 Marginal likelihood
To be completed.
4.5 Posterior predictive
p(x|D) = t
2n
(x|,
2
=
n
/
n
) (119)
4.6 Reference analysis
p()
1
= Ga(|0, 0) (120)
p(|D) = Ga(|n/2,
1
2
m

i=1
(x
i
)
2
) (121)
5 Normal-inverse-chi-squared (NIX) prior
We will see that the natural conjugate prior for
2
is the inverse-chi-squared distribution.
5.1 Likelihood
The likelihood can be written in this form
p(D|,
2
) =
1
(2)
n/2
(
2
)
n/2
exp
_

1
2
2
_
n
n

i=1
(x
i
x)
2
+ n(x )
2
__
(122)
5.2 Prior
The normal-inverse-chi-squared prior is
p(,
2
) = NI
2
(
0
,
0
,
0
,
2
0
) (123)
= N(|
0
,
2
/
0
)
2
(
2
|
0
,
2
0
) (124)
=
1
Z
p
(
0
,
0
,
0
,
2
0
)

1
(
2
)
(0/2+1)
exp
_

1
2
2
[
0

2
0
+
0
(
0
)
2
]
_
(125)
Z
p
(
0
,
0
,
0
,
2
0
) =
_
(2)

0
(
0
/2)
_
2

2
0
_
0/2
(126)
11
1
0.5
0
0.5
1
0
0.5
1
1.5
2
0
0.1
0.2
0.3
0.4

NIX(
0
=0.0,
0
=1.0,
0
=1.0,
2
0
=1.0)
sigma
2 1
0.5
0
0.5
1
0
0.5
1
1.5
2
0
0.2
0.4
0.6
0.8

NIX(
0
=0.0,
0
=5.0,
0
=1.0,
2
0
=1.0)
sigma
2
(a) (b)
1
0.5
0
0.5
1
0
0.5
1
1.5
2
0
0.1
0.2
0.3
0.4

NIX(
0
=0.0,
0
=1.0,
0
=5.0,
2
0
=1.0)
sigma
2 1
0.5
0
0.5
1
0
0.5
1
1.5
2
0
0.5
1
1.5
2
2.5

NIX(
0
=0.5,
0
=5.0,
0
=5.0,
2
0
=0.5)
sigma
2
(c) (d)
Figure 4: The NI
2
(0, 0, 0,
2
0
) distribution. 0 is the prior mean and 0 is how strongly we believe this;
2
0
is the prior
variance and 0 is how strongly we believe this. (a) 0 = 0, 0 = 1, 0 = 1,
2
0
= 1. Notice that the contour plot (underneath the
surface) is shaped like a squashed egg. (b) We increase the strenght of our belief in the mean, so it gets narrower: 0 = 0, 0 =
5, 0 = 1,
2
0
= 1. (c) We increase the strenght of our belief in the variance, so it gets narrower: 0 = 0, 0 = 1, 0 = 5,
2
0
= 1.
(d) We strongly believe the mean and variance are 0.5: 0 = 0.5, 0 = 5, 0 = 5,
2
0
= 0.5. These plots were produced with
NIXdemo2.
See Figure 4 for some plots. The hyperparameters
0
and
2
/
0
can be interpreted as the location and scale of , and
the hyperparameters u
0
and
2
0
as the degrees of freedom and scale of
2
.
For future reference, it is useful to note that the quadratic term in the prior can be written as
Q
0
() = S
0
+
0
(
0
)
2
(127)
=
0

2
2(
0

0
) + (
0

2
0
+S
0
) (128)
where S
0
=
0

2
0
is the prior sum of squares.
12
5.3 Posterior
(The following derivation is based on [Lee04, p67].) The posterior is
p(,
2
|D) N(|
0
,
2
/
0
)
2
(
2
|
0
,
2
0
)p(D|,
2
) (129)

1
(
2
)
(0/2+1)
exp
_

1
2
2
[
0

2
0
+
0
(
0
)
2
]
__
(130)

_
(
2
)
n/2
exp
_

1
2
2
_
ns
2
+n(x )
2

__
(131)

3
(
2
)
(n/2)
exp
_

1
2
2
[
n

2
n
+
n
(
n
)
2
]
_
= NI
2
(
n
,
n
,
n
,
2
n
) (132)
Matching powers of
2
, we nd

n
=
0
+n (133)
To derive the other terms, we will complete the square. Let S
0
=
0

2
0
and S
n
=
n

2
n
for brevity. Grouping the
terms inside the exponential, we have
S
0
+
0
(
0
)
2
+ns
2
+n(x )
2
= (S
0
+
0

2
0
+ns
2
+nx
2
) +
2
(
0
+n) 2(
0

0
+nx)(134)
Comparing to Equation 128, we have

n
=
0
+n (135)

n
=
0

0
+nx (136)
S
n
+
n

2
n
= (S
0
+
0

2
0
+ns
2
+nx
2
) (137)
S
n
= S
0
+ns
2
+
0

2
0
+nx
2

2
n
(138)
One can rearrange this to get
S
n
= S
0
+ns
2
+ (
1
0
+n
1
)
1
(
0
x)
2
(139)
= S
0
+ns
2
+
n
0

0
+n
(
0
x)
2
(140)
We see that the posterior sum of squares, S
n
=
n

2
n
, combines the prior sum of squares, S
0
=
0

2
0
, the sample sum
of squares, ns
2
, and a term due to the uncertainty in the mean.
In summary,

n
=

0

0
+nx

n
(141)

n
=
0
+n (142)

n
=
0
+n (143)

2
n
=
1

n
(
0

2
0
+

i
(x
i
x)
2
+
n
0

0
+n
(
0
x)
2
) (144)
The posterior mean is given by
E[|D] =
n
(145)
E[
2
|D] =

n

n
2

2
n
(146)
The posterior mode is given by (Equation 14 of [BL01]):
mode[|D] =
n
(147)
mode[
2
|D] =

n

2
n

n
1
(148)
13
The modes of the marginal posterior are
mode[|D] =
n
(149)
mode[
2
|D] =

n

2
n

n
+ 2
(150)
5.3.1 Marginal posterior of
2
First we integrate out , which is just a Gaussian integral.
p(
2
|D) =
_
p(
2
, |D)d (151)

1
(
2
)
(n/2+1)
exp
_

1
2
2
[
n

2
n
]
__
exp
_


n
2
2
(
n
)
2
]
_
d (152)

1
(
2
)
(n/2+1)
exp
_

1
2
2
[
n

2
n
]
_

_
(2)

n
(153)
(
2
)
(n/2+1)
exp
_

1
2
2
[
n

2
n
]
_
(154)
=
2
(
2
|
n
,
2
n
) (155)
5.3.2 Marginal posterior of
Let us rewrite the posterior as
p(,
2
|D) = C

1
exp
_

1
2
[
n

2
n
+
n
(
n
)
2
]
_
(156)
where =
2
and = (
n
+ 1)/2. This follows since

1
(
2
)
(n/2+1)
=
1

2
=

n+1
2

1
=
1
(157)
Now make the substitutions
A =
n

2
n
+
n
(
n
)
2
(158)
x =
A
2
(159)
d
dx
=
A
2
x
2
(160)
so
p(|D) =
_
C
(+1)
e
A/2
d (161)
= (A/2)
_
C(
A
2x
)
(+1)
e
x
x
2
dx (162)
A

_
x
1
e
x
dx (163)
A

(164)
= (
n

2
n
+
n
(
n
)
2
)
(n+1)/2
(165)

_
1 +

n

2
n
(
n
)
2
_
(n+1)/2
(166)
t
n
(|
n
,
2
n
/
n
) (167)
14
5.4 Marginal likelihood
Repeating the derivation of the posterior, but keeping track of the normalization constants, gives the following.
p(D) =
_ _
P(D|,
2
)P(,
2
)dd
2
(168)
=
Z
p
(
n
,
n
,
n
,
2
n
)
Z
p
(
0
,
0
,
0
,
2
0
)
1
Z
N
l
(169)
=

n
(
n
/2)
(
0
/2)
_

2
0
2
_
0/2
_
2

2
n
_
n/2
1
(2)
(n/2)
(170)
=
(
n
/2)
(
0
/2)
_

n
(
0

2
0
)
0/2
(
n

2
n
)
n/2
1

n/2
(171)
5.5 Posterior predictive
p(x|D) =
_ _
p(x|,
2
)p(,
2
|D)dd
2
(172)
=
p(x, D)
p(D)
(173)
=
((
n
+ 1)/2)
(
n
/2)
_

n

n
+ 1
(
n

2
n
)
n/2
(
n

2
n
+
n
n+1
(x
n
)
2
))
(n+1)/2
1

1/2
(174)
=
((
n
+ 1)/2)
(
n
/2)
_

n
(
n
+ 1)
n

2
n
_1
2
_
1 +

n
(x
n
)
2
(
n
+ 1)
n

2
n
_
(n+1)/2
(175)
= t
n
(
n
,
(1 +
n
)
2
n

n
) (176)
5.6 Reference analysis
The reference prior is p(,
2
) (
2
)
1
which can be modeled by
0
= 0,
0
= 1,
0
= 0, since then we get
p(,
2
)
1
(
2
)
(
1
2
+1)
e
0
=
1
(
2
)
1/2
=
2
(177)
(See also [DeG70, p197] and [GCSR04, p88].)
With the reference prior, the posterior is

n
= x (178)

n
= n 1 (179)

n
= n (180)

2
n
=

i
(x
i
x)
2
n 1
(181)
p(,
2
|D)
n2
exp
_

1
2
2
[

i
(x
i
x)
2
+n(x )
2
]
_
(182)
The posterior marginals are
p(
2
|D) =
2
(
2
|n 1,

i
(x
i
x)
2
n 1
) (183)
p(|D) = t
n1
(|x,

i
(x
i
x)
2
n(n 1)
) (184)
15
which are very closely related to the sampling distribution of the MLE. The posterior predictive is
p(x|D) = t
n1
_
x,
(1 +n)

i
(x
i
x)
2
n(n 1)
_
(185)
Note that [Min00] argues that Jeffreys principle says the uninformative prior should be of the form
lim
k0
N(|
0
,
2
/k)
2
(
2
|k,
2
0
) (2
2
)

1
2
(
2
)
1

3
(186)
This can be achieved by setting
0
= 0 instead of
0
= 1.
6 Normal-inverse-Gamma (NIG) prior
Another popular parameterization is the following:
p(,
2
) = NIG(m, V, a, b) (187)
= N(|m,
2
V )IG(
2
|a, b) (188)
6.1 Likelihood
The likelihood can be written in this form
p(D|,
2
) =
1
(2)
n/2
(
2
)
n/2
exp
_

1
2
2
_
ns
2
+n(x )
2

_
(189)
6.2 Prior
p(,
2
) = NIG(m
0
, V
0
, a
0
, b
0
) (190)
= N(|m
0
,
2
V
0
)IG(
2
|a
0
, b
0
) (191)
This is equivalent to the NI
2
prior, where we make the following substitutions.
m
0
=
0
(192)
V
0
=
1

0
(193)
a
0
=

0
2
(194)
b
0
=

0

2
0
2
(195)
6.3 Posterior
We can show that the posterior is also NIG:
p(,
2
|D) = NIG(m
n
, V
n
, a
n
, b
n
) (196)
V
1
n
= V
1
0
+n (197)
m
n
V
n
= V
1
0
m
0
+nx (198)
a
n
= a
0
+n/2 (199)
b
n
= b
0
+
1
2
[m
2
0
V
1
0
+

i
x
2
i
m
2
n
V
1
n
] (200)
The NIG posterior follows directly from the NI
2
results using the specied substitutions. (The b
n
term requires
some tedious algebra...)
16
6.3.1 Posterior marginals
To be derived.
6.4 Marginal likelihood
For the marginal likelihood, substituting into Equation 171 we have
p(D) =
(a
n
)
(a
0
)
_
V
n
V
0
(2b
0
)
a0
(2b
n
)
an
1

n/2
(201)
=
|V
n
|
1
2
|V
0
|
1
2
b
a0
0
b
an
n
(a
n
)
(a
0
)
1

n/2
2
a0an
(202)
=
|V
n
|
1
2
|V
0
|
1
2
b
a0
0
b
an
n
(a
n
)
(a
0
)
1

n/2
2
n
(203)
6.5 Posterior predictive
For the predictive density, substituting into Equation 176 we have

n
(1 +
n
)
2
n
=
1
(
1
n
+ 1)
2
n
(204)
=
2a
n
2b
n
(1 +V
n
)
(205)
So
p(y|D) = t
2an
(m
n
,
b
n
(1 +V
n
)
a
n
) (206)
These results follow from [DHMS02, p240] by setting x = 1, = , B
T
B = n, B
T
X = nx, X
T
X =

i
x
2
i
.
Note that we use a difference parameterization of the student-t. Also, our equations for p(D) differ by a 2
n
term.
7 Multivariate Normal prior
If we assume is known, then a conjugate analysis of the mean is very simple, since the conjugate prior for the mean
is Gaussian, the likelihood is Gaussian, and hence the posterior is Gaussian. The results are analogous to the scalar
case. In particular, we use the general result from [Bis06, p92] with the following substitutions:
x = , y = x,
1
=
0
, A = I, b = 0, L
1
= /N (207)
7.1 Prior
p() = N(|
0
,
0
) (208)
7.2 Likelihood
p(D|, ) N(x|,
1
N
) (209)
7.3 Posterior
p(|D, ) = N(|
N
,
N
) (210)

N
=
_

1
0
+N
1
_
1
(211)

N
=
N
_
N
1
x +
1
0

0
_
(212)
17
7.4 Posterior predictive
p(x|D) = N(x|
N
, +
N
) (213)
7.5 Reference analysis
p() 1 = N(|, I) (214)
p(|D) = N(x, /n) (215)
8 Normal-Wishart prior
The multivariate analog of the normal-gamma prior is the normal-Wishart prior. Here we just state the results without
proof; see [DeG70, p178] for details. We assume X is a d-dimensional.
8.1 Likelihood
p(D|, ) = (2)
nd/2
||
n/2
exp
_

1
2
n

i=1
(x
i
)
T
(x
i
)
_
(216)
8.2 Prior
p(, ) = NWi(, |
0
, , , T) = N(|
0
, ()
1
)Wi

(|T) (217)
=
1
Z
||
1
2
exp
_

2
(
0
)
T
(
0
)
_
||
(d1)/2
exp
_

1
2
tr(T
1
)
_
(218)
Z =
_

2
_
d/2
|T|
/2
2
d/2

d
(/2) (219)
Here T is the prior covariance. To see the connection to the scalar case, make the substitutions

0
=

0
2
,
0
=
T
0
2
(220)
8.3 Posterior
p(, |D) = N(|
n
, (
n
)
1
)Wi
n
(|T
n
) (221)

n
=

0
+nx
+n
(222)
T
n
= T +S +
n
+n
(
0
x)(
0
x)
T
(223)
S =
n

i=1
(x
i
x)(x
i
x)
T
(224)

n
= +n (225)

n
= +n (226)
Posterior marginals
p(|D) = Wi
n
(T
n
) (227)
p(|D) = t
nd+1
(|
n
,
T
n

n
(
n
d + 1)
) (228)
18
The MAP estimates are given by
( ,

) = arg max
,
p(D|, )NWi(, ) (229)
=
n

i=1
x
i
+
0

0
N +
0
(230)

n
i=1
(x
i
)(x
i
)
T
+
0
(
0
)(
0
)
T
+T
1
0
N +
0
d
(231)
This reduces to the MLE if
0
= 0,
0
= d and |T
0
| = 0.
8.4 Posterior predictive
p(x|D) = t
nd+1
(
n
,
T
n
(
n
+ 1)

n
(
n
d + 1)
) (232)
If d = 1, this reduces to Equation 100.
8.5 Marginal likelihood
This can be computed as a ratio of normalization constants.
p(D) =
Z
n
Z
0
1
(2)
nd/2
(233)
=
1

nd/2

d
(
n
/2)

d
(
0
/2)
|T
0
|
0/2
|T
n
|
n/2
_

n
_
d/2
(234)
This reduces to Equation 95 if d = 1.
8.6 Reference analysis
We set

0
= 0,
0
= 0,
0
= 1, |T
0
| = 0 (235)
to give
p(, ) ||
(d+1)/2
(236)
Then the posterior parameters become

n
= x, T
n
= S,
n
= n,
n
= n 1 (237)
the posterior marginals become
p(|D) = t
nd
(|x,
S
n(n d)
) (238)
p(|D) = Wi
nd
(|S) (239)
and the posterior predictive becomes
p(x|D) = t
nd
(x,
S(n + 1)
n(n d)
(240)
9 Normal-Inverse-Wishart prior
The multivariate analog of the normal inverse chi-squared (NIX) distribution is the normal inverse Wishart (NIW) (see
also [GCSR04, p85]).
19
9.1 Likelihood
The likelihood is
p(D|, ) ||

n
2
exp
_

1
2
n

i=1
(x
i
)
T

1
(x
i
)
_
(241)
= ||

n
2
exp
_

1
2
tr(
1
S)
_
(242)
(243)
where S is the matrix of sum of squares (scatter matrix)
S =
N

i=1
(x
i
x)(x
i
x)
T
(244)
9.2 Prior
The natural conjugate prior is normal-inverse-wishart
IW
0
(
1
0
) (245)
| N(
0
, /
0
) (246)
p(, )
def
= NIW(
0
,
0
,
0
,
0
) (247)
=
1
Z
||
((0+d)/2+1)
exp
_

1
2
tr(
0

1
)

0
2
(
0
)
T

1
(
0
)
_
(248)
Z =
2
v0d/2

d
(
0
/2)(2/
0
)
d/2
|
0
|
0/2
(249)
9.3 Posterior
The posterior is
p(, |D,
0
,
0
,
0
,
0
) = NIW(, |
n
,
n
,
n
,
n
) (250)

n
=

0
+ 0 +ny

n
=

0

0
+n

0
+
n

0
+n
y (251)

n
=
0
+n (252)

n
=
0
+n (253)

n
=
0
+S +

0
n

0
+n
(x
0
)(x
0
)
T
(254)
The marginals are
|D IW(
1
n
,
n
) (255)
|D = t
nd+1
(
n
,

n

n
(
n
d + 1)
) (256)
To see the connection with the scalar case, note that
n
plays the role of
n

2
n
(posterior sum of squares), so

n
(
n
d + 1)
=

n

n
=

2

n
(257)
20
9.4 Posterior predictive
p(x|D) = t
nd+1
(
n
,

n
(
n
+ 1)

n
(
n
d + 1)
) (258)
To see the connection with the scalar case, note that

n
(
n
+ 1)

n
(
n
d + 1)
=

n
(
n
+ 1)

n
=

2
(
n
+ 1)

n
(259)
9.5 Marginal likelihood
The posterior is given by
p(, |D) =
1
p(D)
1
Z
0
NIW

(, |
0
)
1
(2)
nd/2
N

(D|, ) (260)
=
1
Z
n
NIW

(, |
n
) (261)
where
NIW

(, |
0
) = ||
((0+d)/2+1)
exp
_

1
2
tr(
0

1
)

0
2
(
0
)
T

1
(
0
)
_
(262)
N

(D|, ) = ||

n
2
exp
_

1
2
tr(
1
S)
_
(263)
is the unnormalized prior and likelihood. Hence
p(D) =
Z
n
Z
0
1
(2)
nd/2
=
2
nd/2

d
(
n
/2)(2/
n
)
d/2
|
n
|
n/2
|
0
|
0/2
2
0d/2

d
(
0
/2)(2/
0
)
d/2
1
(2)
nd/2
(264)
=
1
(2)
nd/2
2
nd/2
2
0d/2
(2/
n
)
d/2
(2/
0
)
d/2

d
(
n
/2)

d
(
0
/2)
(265)
=
1

nd/2

d
(
n
/2)

d
(
0
/2)
|
0
|
0/2
|
n
|
n/2
_

n
_
d/2
(266)
This reduces to Equation 171 if d = 1.
9.6 Reference analysis
A noninformative (Jeffreys) prior is p(, ) ||
(d+1)/2
which is the limit of
0
0,
0
1, |
0
|0 [GCSR04,
p88]. Then the posterior becomes

n
= x (267)

n
= n (268)

n
= n 1 (269)

n
= S =

i
(x
i
x)(x
i
x)
T
(270)
p(|D) = IW
n1
(|S) (271)
p(|D) = t
nd
(|x,
S
n(n d)
) (272)
p(x|D) = t
nd
(x|x,
S(n + 1)
n(n d)
) (273)
Note that [Min00] argues that Jeffreys principle says the uninformative prior should be of the form
lim
k0
N(|
0
, /k)IW
k
(|k) |2|

1
2
||
(d+1)/2
||
(
d
2
+1)
(274)
This can be achieved by setting
0
= 0 instead of
0
= 1.
21
0 1 2 3 4 5
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Gamma(shape=a,rate=b)


a=0.5, b=1.0
a=1.0, b=1.0
a=1.5, b=1.0
a=2.0, b=1.0
a=5.0, b=1.0
0 1 2 3 4 5
0
0.5
1
1.5
2
2.5
3
Gamma(shape=a,rate=b)


a=0.5, b=3.0
a=1.0, b=3.0
a=1.5, b=3.0
a=2.0, b=3.0
a=5.0, b=3.0
Figure 5: Some Ga(a, b) distributions. If a < 1, the peak is at 0. As we increase b, we squeeze everything leftwards and upwards.
Figures generated by gammaDistPlot2.
10 Appendix: some standard distributions
10.1 Gamma distribution
The gamma distribution is a exible distribution for positive real valued rvs, x > 0. It is dened in terms of two
parameters. There are two common parameterizations. This is the one used by Bishop [Bis06] (and many other
authors):
Ga(x|shape = a, rate = b) =
b
a
(a)
x
a1
e
xb
, x, a, b > 0 (275)
The second parameterization (and the one used by Matlabs gampdf) is
Ga(x|shape = , scale = ) =
1

()
x
1
e
x/
= Ga
rate
(x|, 1/) (276)
Note that the shape parameter controls the shape; the scale parameter merely denes the measurement scale (the
horizontal axis). The rate parameter is just the inverse of the scale. See Figure 5 for some examples. This distribution
has the following properties (using the rate parameterization):
mean =
a
b
(277)
mode =
a 1
b
for a 1 (278)
var =
a
b
2
(279)
10.2 Inverse Gamma distribution
Let X Ga(shape = a, rate = b) and Y = 1/X. Then it is easy to show that Y IG(shape = a, scale = b), where
the inverse Gamma distribution is given by
IG(x|shape = a, scale = b) =
b
a
(a)
x
(a+1)
e
b/x
, x, a, b > 0 (280)
22
0 0.5 1 1.5 2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
IG(a,b)
a=0.10, b=1.00
a=1.00, b=1.00
a=2.00, b=1.00
a=0.10, b=2.00
a=1.00, b=2.00
a=2.00, b=2.00
Figure 6: Some inverse gamma distributions (a=shape, b=rate). These plots were produced by invchi2plot.
The distribution has these properties
mean =
b
a 1
, a > 1 (281)
mode =
b
a + 1
(282)
var =
b
2
(a 1)
2
(a 2)
, a > 2 (283)
See Figure 6 for some plots. We see that increasing b just stretches the horizontal axis, but increasing a moves the
peak up and closer to the left.
There is also another parameterization, using the rate (inverse scale):
IG(x|shape = , rate = ) =
1

a
(a)x
(+1)
e
1/(x)
, x, , > 0 (284)
10.3 Scaled Inverse-Chi-squared
The scaled inverse-chi-squared distribution is a reparameterization of the inverse Gamma [GCSR04, p575].

2
(x|,
2
) =
1
(/2)
_

2
2
_
/2
x

2
1
exp[

2
2x
], x > 0 (285)
= IG(x|shape=

2
, scale=

2
2
) (286)
where the parameter > 0 is called the degrees of freedom, and
2
> 0 is the scale. See Figure 7 for some plots. We
see that increasing lifts the curve up and moves it slightly to the right. Later, when we consider Bayesian parameter
estimation, we will use this distribution as a conjugate prior for a scale parameter (such as the variance of a Gaussian);
increasing corresponds to increasing the effective strength of the prior.
23
0 0.5 1 1.5 2
0
0.5
1
1.5

2
(,
2
)
=1.00,
2
=0.50
=1.00,
2
=1.00
=1.00,
2
=2.00
=5.00,
2
=0.50
=5.00,
2
=1.00
=5.00,
2
=2.00
Figure 7: Some inverse scaled
2
distributions. These plots were produced by invchi2plot.
The distribution has these properties
mean =

2
2
for > 2 (287)
mode =

2
+ 2
(288)
var =
2
2

4
( 2)
2
( 4)
for > 4 (289)
The inverse chi-squared distribution, written
2

(x), is the special case where


2
= 1 (i.e.,
2
= 1/). This
corresponds to IG(a = /2, b = scale = 1/2).
10.4 Wishart distribution
Let X be a p dimensional symmetric positive denite matrix. The Wishart is the multidimensional generalization of
the Gamma. Since it is a distribution over matrices, it is hard to plot as a density function. However, we can easily
sample from it, and then use the eigenvectors of the resulting matrix to dene an ellipse. See Figure 8.
There are several possible parameterizations. Some authors (e.g., [Bis06, p693], [DeG70, p.57],[GCSR04, p574],
wikipedia) as well as WinBUGS and Matlab (wishrnd), dene the Wishart in terms of degrees of freedom p
and the scale matrix S as follows:
Wi

(X|S) =
1
Z
|X|
(p1)/2
exp[
1
2
tr(S
1
X)] (290)
Z = 2
p/2

p
(/2)|S|
/2
(291)
where
p
(a) is the generalized gamma function

p
() =
p(p1)/4
p

i=1

_
2 + 1 i
2
_
(292)
(So
1
() = ().) The mean and mode are given by (see also [Pre05])
mean = S (293)
mode = ( p 1)S, > p + 1 (294)
24
5 0 5
5
0
5
5 0 5
5
0
5
4 2 0 2 4
2
0
2
10 0 10
5
0
5
5 0 5
4
2
0
2
4
5 0 5
5
0
5
5 0 5
4
2
0
2
4
10 0 10
5
0
5
Wishart(dof=2.0,S=[4 3; 3 4])
10 0 10
5
0
5
10 0 10
10
5
0
5
10
20 0 20
10
0
10
20 0 20
10
0
10
10 0 10
10
5
0
5
10
10 0 10
5
0
5
20 0 20
10
0
10
10 0 10
10
5
0
5
10
10 0 10
10
5
0
5
10
Wishart(dof=10.0,S=[4 3; 3 4])
10 0 10
10
5
0
5
10
Figure 8: Some samples from the Wishart distribution. Left: = 2, right: = 10. We see that if if = 2 (the smallest valid
value in 2 dimensions), we often sample nearly singular matrices. As increases, we put more mass on the S matrix. If S = I2,
the samples would look (on average) like circles. Generated by wishplot.
In 1D, this becomes Ga(shape = /2, rate = S/2).
Note that if X Wi
u
(S), and Y = X
1
, then Y IW

(S
1
) and E[Y ] =
S
d1
.
In [BS94, p.138], and the wishpdf in Tom Minkas lightspeed toolbox, they use the following parameterization
Wi(X|a, B) =
|B|
a

p
(a)
|X|
a(p+1)/2
exp[tr(BX)] (295)
We require that Bis a pp symmetric positive denite matrix, and 2a > p1. If p = 1, so Bis a scalar, this reduces
to the Ga(shape = a, rate= b) density.
To get some intuition for this distribution, recall that tr(AB) is a vector which contains the inner product of the
rows of A and the columns of B. In Matlab notation we have
trace(A B) = [a(1,:)
*
b(:,1), ..., a(n,:)
*
b(:,n)]
If X Wi

(S), then we are performing a kind of template matching between the columns of X and S
1
(recall that
both X and S are symmetric). This is a natural way to dene the distance between two matrices.
10.5 Inverse Wishart
This is the multidimensional generalization of the inverse Gamma. Consider a dd positive denite (covariance) ma-
trix Xand a dof parameter > d1 and psd matrix S. Some authors (eg [GCSR04, p574]) use this parameterization:
IW

(X|S
1
) =
1
Z
|X|
(+d+1)/2
exp
_

1
2
Tr(SX
1
)
_
(296)
Z =
|S|
/2
2
d/2

d
(/2)
(297)
where

d
(/2) =
d(d1)/4
d

i=1
(
+ 1 i
2
) (298)
25
The distribution has mean
E X =
S
d 1
(299)
In Matlab, use iwishrnd. In the 1d case, we have

2
(|
0
,
2
0
) = IW
0
(|(
0

2
0
)
1
) (300)
Other authors (e.g., [Pre05, p117]) use a slightly different formulation (with 2d < )
IW

(X|Q) =
_
_
2
(d1)d/2

d(d1)/4
d

j=1
(( d j)/2)
_
_
1
(301)
|Q|
(d1)/2
|X|
/2
exp
_

1
2
Tr(X
1
Q)
_
(302)
which has mean
E X =
Q
2d 2
(303)
10.6 Student t distribution
The generalized t-distribution is given as
t

(x|,
2
) = c
_
1 +
1

(
x

)
2
_
(
+1
2
)
(304)
c =
(/2 + 1/2)
(/2)
1

(305)
where c is the normalization consant. is the mean, > 0 is the degrees of freedom, and
2
> 0 is the scale. (Note
that the parameter is often written as a subscript.) In Matlab, use tpdf.
The distribution has the following properties:
mean = , > 1 (306)
mode = (307)
var =

2
( 2)
, > 2 (308)
Note: if x t

(,
2
), then
x

(309)
which corresponds to a standard t-distribution with = 0,
2
= 1:
t

(x) =
(( + 1)/2)

(/2)
(1 +x
2
/)
(+1)/2
(310)
In Figure 9, we plot the density for different parameter values. As , the T approaches a Gaussian. T-
distributions are like Gaussian distributions with heavy tails. Hence they are more robust to outliers (see Figure 10).
If = 1, this is called a Cauchy distribution. This is an interesting distribution since if X Cauchy, then E[X]
does not exist, since the corresponding integral diverges. Essentially this is because the tails are so heavy that samples
from the distribution can get very far from the center .
It can be shown that the t-distribution is like an innite sum of Gaussians, where each Gaussian has a different
precision:
p(x|, a, b) =
_
N(x|,
1
)Ga(|a, rate = b)d (311)
= t
2a
(x|, b/a) (312)
(See exercise 2.46 of [Bis06].)
26
6 4 2 0 2 4 6
0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
Student T distributions
t(=0.1)
t(=1.0)
t(=5.0)
N(0,1)
Figure 9: Student t-distributions T(,
2
, ) for = 0. The effect of is just to scale the horizontal axis. As , the
distribution approaches a Gaussian. See studentTplot.
(a)
5 0 5 10
0
0.1
0.2
0.3
0.4
0.5
(b)
5 0 5 10
0
0.1
0.2
0.3
0.4
0.5
Figure 10: Fitting a Gaussian and a Student distribution to some data (left) and to some data with outliers (right). The Student
distribution (red) is much less affected by outliers than the Gaussian (green). Source: [Bis06] Figure 2.16.
27
2
1
0
1
2
2
1
0
1
2
0
0.05
0.1
0.15
0.2
T distribution, dof 2.0
2
1
0
1
2
2
1
0
1
2
0
0.5
1
1.5
2
Gaussian
Figure 11: Left: T distribution in 2d with dof=2 and = 0.1I2. Right: Gaussian density with = 0.1I2 and = (0, 0); we see
it goes to zero faster. Produced by multivarTplot.
10.7 Multivariate t distributions
The multivariate T distribution in d dimensions is given by
t

(x|, ) =
(/2 +d/2)
(/2)
||
1/2
v
d/2

d/2

_
1 +
1

(x )
T

1
(x )
_
(
+d
2
)
(313)
(314)
where is called the scale matrix (since it is not exactly the covariance matrix). This has fatter tails than a Gaussian:
see Figure 11. In Matlab, use mvtpdf.
The distribution has the following properties
E x = if > 1 (315)
mode x = (316)
Cov x =

2
for > 2 (317)
(The following results are from [Koo03, p328].) Suppose Y T(, , ) and we partition the variables into 2
blocks. Then the marginals are
Y
i
T(
i
,
ii
, ) (318)
and the conditionals are
Y
1
|y
2
T(
1|2
,
1|2
, +d
1
) (319)

1|2
=
1
+
12

1
22
(y
2

2
) (320)

1|2
= h
1|2
(
11

12

1
22

T
12
) (321)
h
1|2
=
1
+d
2
_
+ (y
2

2
)
T

1
22
(y
2

2
)

(322)
We can also show linear combinations of Ts are Ts:
Y T(, , ) AY T(A, AA

, ) (323)
We can sample from a y T(, , ) by sampling x T(0, 1, ) and then transforming y = + R
T
x, where
R = chol(), so R
T
R = .
28
References
[Bis06] C. Bishop. Pattern recognition and machine learning. Springer, 2006.
[BL01] P. Baldi and A. Long. A Bayesian framework for the analysis of microarray expression data: regularized
t-test and statistical inferences of gene changes. Bioinformatics, 17(6):509519, 2001.
[BS94] J. Bernardo and A. Smith. Bayesian Theory. John Wiley, 1994.
[DeG70] M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.
[DHMS02] D. Denison, C. Holmes, B. Mallick, and A. Smith. Bayesian methods for nonlinear classication and
regression. Wiley, 2002.
[DMP
+
06] F. Demichelis, P. Magni, P. Piergiorgi, M. Rubin, and R. Bellazzi. A hierarchical Naive Bayes model
for handling sample heterogeneity in classication problems: an application to tissue microarrays. BMC
Bioinformatics, 7:514, 2006.
[GCSR04] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian data analysis. Chapman and Hall, 2004. 2nd
edition.
[GH94] D. Geiger and D. Heckerman. Learning Gaussian networks. Technical Report MSR-TR-94-10, Microsoft
Research, 1994.
[Koo03] Gary Koop. Bayesian econometrics. Wiley, 2003.
[Lee04] Peter Lee. Bayesian statistics: an introduction. Arnold Publishing, 2004. Third edition.
[Min00] T. Minka. Inferring a Gaussian distribution. Technical report, MIT, 2000.
[Pre05] S. J. Press. Applied multivariate analysis, using Bayesian and frequentist methods of inference. Dover,
2005. Second edition.
29