Multivariate Models 97
The primary purpose of this chapter is to introduce some basic ideas from multivari-
ate statistical analysis. Quite often, experiments produce data where measurements
were obtained on more than one variable hence the name: multivariate. In the Swiss
head dimension example (Flury, 1997), in order to determine well-fitting masks, sev-
eral different head-dimension measurements were obtained on the soldiers. In the
next chapter on regression analysis, we will examine models that are defined in terms
of several parameters. In order to properly understand the estimation of these model
parameters, a foundation in multivariate statistics is needed. In particular, we need
to understand concepts such as covariances and correlations between variables and
estimators. An advantage of the multivariate approach is to allow for designs of ex-
periments where the resulting parameter estimators will be uncorrelated, thus making
it easier to interpret results.
A joint pdf f(y1, y2) for two continuous random variables must satisfy two properties:

1. f(y1, y2) ≥ 0.

2. The total volume under the pdf must be 1:

∫∫ f(y1, y2) dy1 dy2 = 1.
Definition. The marginal pdf of Y1, denoted f1(y1), is just the pdf of the random variable Y1 considered alone. To determine the marginal pdf, we integrate out y2 in the joint pdf:

f1(y1) = ∫ f(y1, y2) dy2.
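Although the examples in this text use Matlab, the marginalization formula is easy to check numerically in a short Python sketch. The joint pdf f(y1, y2) = y1 + y2 on the unit square is a made-up illustration (it integrates to 1), not a distribution from the text; its exact marginal is f1(y1) = y1 + 1/2.

```python
# Numerical check of the marginal-pdf formula f1(y1) = ∫ f(y1, y2) dy2,
# using the illustrative joint pdf f(y1, y2) = y1 + y2 on [0,1] x [0,1].

def joint_pdf(y1, y2):
    return y1 + y2  # a valid pdf on the unit square

def marginal_pdf(y1, n=10000):
    # integrate out y2 with a midpoint Riemann sum over [0, 1]
    h = 1.0 / n
    return sum(joint_pdf(y1, (j + 0.5) * h) * h for j in range(n))

approx = marginal_pdf(0.3)
exact = 0.3 + 0.5  # the marginal worked out by hand: f1(y1) = y1 + 1/2
```

The numerical sum agrees with the hand-computed marginal to within floating-point error.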
2 Covariance
Let Y1 and Y2 be two jointly distributed random variables with means μ1 and μ2 respectively and variances σ1² and σ2². A common measure of association between Y1 and Y2 is the covariance, denoted σ12:

Covariance: σ12 = cov(Y1, Y2) = E[(Y1 − μ1)(Y2 − μ2)].

A positive covariance indicates that if Y1 is above its average (Y1 − μ1 > 0), then Y2 tends to be above its average (Y2 − μ2 > 0), so that (Y1 − μ1)(Y2 − μ2) tends to be positive; also, if Y1 is below average, then Y2 tends to be below average as well, whereby (Y1 − μ1)(Y2 − μ2) is a negative times a negative, resulting in a positive value.
Conversely, a negative covariance indicates that if Y1 tends to be small, then Y2 tends
to be large, and vice-versa.
To illustrate, if Y1 is a measure of a person's height and Y2 is a measure of their weight,
then these two variables tend to be associated. In particular, the covariance between
them is usually positive since taller people tend to weigh more and shorter people
tend to weigh less. On the other hand, if Y1 is the hours of training a technician
receives for learning to operate a new machine and Y2 represents the number of
errors the technician makes using the machine, then we would expect to see fewer
errors corresponding with more training and hence Y1 and Y2 would have a negative
covariance.
The covariance plays an important role when considering differences of jointly distributed random variables, Y1 − Y2. For instance, we will later discuss experiments looking at paired differences in situations where we may want to compare
two different experimental conditions. The statistical analysis requires that we know
the variance of the difference: var(Y1 − Y2). There are two extreme cases:

Y1 = Y2 :   var(Y1 − Y2) = var(0) = 0

σ12 = 0 :   var(Y1 − Y2) = var(Y1) + var(Y2)

These two extremes are special cases of the following formula, which holds in all cases:

var(Y1 − Y2) = var(Y1) + var(Y2) − 2σ12.
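The identity var(Y1 − Y2) = var(Y1) + var(Y2) − 2σ12 also holds exactly for the sample versions (with n − 1 denominators). The text works in Matlab; the Python sketch below, on made-up numbers, verifies it:

```python
# Check var(Y1 - Y2) = var(Y1) + var(Y2) - 2*cov(Y1, Y2) on a tiny
# made-up data set, using the sample (n - 1 denominator) versions.

def mean(v):
    return sum(v) / len(v)

def svar(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def scov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((x - mu) * (y - mv) for x, y in zip(u, v)) / (len(u) - 1)

y1 = [1.0, 2.0, 3.0]
y2 = [2.0, 4.0, 6.0]
d = [a - b for a, b in zip(y1, y2)]

lhs = svar(d)                                   # variance of the differences
rhs = svar(y1) + svar(y2) - 2 * scov(y1, y2)    # the formula
```

Both sides come out equal, as the algebra promises.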
3 Correlation
We can transform the covariance to obtain a well-known measure of association known
as the correlation, which is denoted by the Greek letter ρ (rho).

Correlation: ρ = σ12/(σ1 σ2), where σ1 and σ2 are the standard deviations of Y1 and Y2 respectively.

The correlation satisfies two important properties:

1. −1 ≤ ρ ≤ 1.

2. |ρ| = 1 if and only if Y1 and Y2 are exactly linearly related.
Property (1) highlights the fact that the correlation is a unitless quantity. Property
(2) highlights the fact that the correlation is a measure of the strength of the linear
relation between Y1 and Y2 . A perfect linear relation produces a correlation of +1 or −1. A correlation of zero indicates no linear relation between the two random
variables. Figure 1 shows scatterplots of data obtained from bivariate distributions
with different correlations. The distribution for the top-left panel had a correlation of ρ = 0.95. The plot shows a strong positive relation between Y1 and Y2 with the points tightly clustered together in a linear pattern. The correlation for the top-right panel is also positive with ρ = 0.50 and again we see a positive relation between the two variables, but not as strong as in the top-left panel. The bottom-left panel corresponds to a correlation of ρ = 0 and consequently we see no relationship evident between Y1 and Y2 in this plot. Finally, the bottom-right panel shows a negative linear relation with a correlation of ρ = −0.50.
A note of caution is in order: two variables Y1 and Y2 can be strongly related, but the relation may be nonlinear, in which case the correlation may not be a reasonable measure of association.
plot(mfb, bam, '*')
title('Swiss Head Data')
xlabel('Forehead Width')
ylabel('Chin Width')
Note that to access a particular variable (i.e. a column of the data set) called swiss, we write swiss(:,1) for column 1, and so on.
When we have more than two variables, we can compute covariances between each
pair of variables. These covariances are collected together in a p × p matrix called
the covariance matrix. The diagonal elements of a covariance matrix correspond to
the variances of the random variables. The i-jth element of the covariance matrix
is the covariance between Yi and Yj . The covariance matrix is a symmetric matrix
because the covariance between Yi and Yj is the same as the covariance between Yj
and Yi . To illustrate, suppose we have a tri-variate distribution for Y1 , Y2 and Y3 .
Let σ12 = cov(Y1, Y2), σ13 = cov(Y1, Y3), and σ23 = cov(Y2, Y3). Then the covariance matrix, denoted by Σ, is

                          σ1²  σ12  σ13
Covariance Matrix: Σ = (  σ12  σ2²  σ23  ).
                          σ13  σ23  σ3²

In matrix form, the covariance matrix can be written as

Σ = E[(Y − μ)(Y − μ)′].
When we take the expected value of a random vector or a random matrix, we compute
the expected value of each term individually. For example,
E[Y] = (E[Y1], E[Y2], . . . , E[Yp])′.

Therefore, in the bivariate case,

E[(Y − μ)(Y − μ)′] = ( E[(Y1 − μ1)²]            E[(Y1 − μ1)(Y2 − μ2)] )
                     ( E[(Y1 − μ1)(Y2 − μ2)]    E[(Y2 − μ2)²]         ).
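The sample analogue of this matrix is easy to assemble from a data matrix whose rows are observations. The text uses Matlab's cov command for this; the Python sketch below, on made-up numbers, shows the same computation done by hand, and confirms that the result is symmetric with variances on the diagonal.

```python
# Build the sample covariance matrix S from an n x p data matrix
# (rows = observations, columns = variables); numbers are made up.

def mean(col):
    return sum(col) / len(col)

def sample_cov_matrix(data):
    n, p = len(data), len(data[0])
    means = [mean([row[j] for row in data]) for j in range(p)]
    # (i, j) entry: sum of cross-deviations divided by n - 1
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]

data = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 4.0]]
S = sample_cov_matrix(data)  # symmetric, diagonal entries are sample variances
```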
Of course, the population covariances (e.g. σ12) and the population correlations are
typically unknown population parameters which must be estimated from the data.
Generally, multivariate data sets are organized so that each row corresponds to a new
p-dimensional observation and each column corresponds to the measurement on one
of the p variables. In other words, the data usually comes in the form of n rows for the
sample size and p columns for the p measured variables. For a p-dimensional data set,
let yi1 equal the ith observation on the first variable, and yi2 equal the ith observation
on the second variable and so on for i = 1, 2, . . . , n. The sample covariance between
variables 1 and 2, denoted s12, is

s12 = Σ_{i=1}^{n} (yi1 − ȳ1)(yi2 − ȳ2)/(n − 1)    (2)

where ȳ1 and ȳ2 are the sample means of the first and second variables respectively.
We can estimate the covariance matrix by replacing the population variances and
covariances by their respective estimators; this will be called the sample covariance
matrix and is generally denoted by S.
Example. Consider once again the Swiss head dimension data consisting of p = 6
head measurements. Denote these measurements by
To give an indication of what the data looks like, below is a list of the first 20 observations:
To get a better feel for the data, Figure 3 shows scatterplots of each pair of variables.
The sample mean vector for the entire data set is given by

ȳ = (ȳ1, ȳ2, . . . , ȳ6)′ = (114.7245, 115.9140, 123.0550, 57.9885, 122.2340, 138.8335)′.
Matlab can compute these statistics easily: the cov command gives the sample covariance matrix. Note that the covariances between the six head measurements are all positive. It is quite common to see all positive covariances in data of this sort. For example, people with larger than average forehead widths will tend to also have larger than average chin widths, and so on.
Figure 3: Scatterplot matrix of each pair of variables in the Swiss head data. Note
that most pairs of variables are positively correlated.
The sample correlations, typically denoted by r, are the sample counterpart to the
population correlation. For instance,

r12 = s12/(s1 s2).    (3)
We can collect the sample correlations together into a correlation matrix, denoted
by R where the i-jth element of the matrix is rij , the sample correlation between
the ith and the jth variables. Note that the correlation between a random variable
with itself is always 1 (the same goes for sample correlations). Therefore, correlation
matrices always have ones down the diagonal. For the Swiss head dimension data,
the sample correlation matrix is
     1.0000  0.4662  0.1749  0.1338  0.4021  0.4137
     0.4662  1.0000  0.0930  0.0933  0.3482  0.3884
R =  0.1749  0.0930  1.0000  0.4135  0.2590  0.2381
     0.1338  0.0933  0.4135  1.0000  0.1763  0.2095
     0.4021  0.3482  0.2590  0.1763  1.0000  0.6564
     0.4137  0.3884  0.2381  0.2095  0.6564  1.0000
Note that the highest correlation r56 = 0.6564 is between LTN and LTG, the distances
from the top of the nose to the ear and the distance from the bottom of the chin to the
ear. The correlation between chin width (BAM) and facial height (TFH) is relatively small (r23 = 0.0930). Also, the correlation between chin width and the distance from the top of the nose to the ear is relatively small (r24 = 0.0933). Looking
at Figure 3, one can see a weak association between BAM and TFH and a strong
association between LTN and LTG.
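The whole correlation matrix R can be produced from the covariance matrix S by applying formula (3) entry by entry, rij = sij/(si sj). The text does this in Matlab; here is a Python sketch on a made-up 2 × 2 covariance matrix (not the Swiss head data):

```python
# Convert a sample covariance matrix S into the correlation matrix R
# via r_ij = s_ij / (s_i * s_j), where s_i = sqrt(s_ii).
import math

def corr_matrix(S):
    p = len(S)
    sd = [math.sqrt(S[i][i]) for i in range(p)]  # sample standard deviations
    return [[S[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]

S = [[4.0, 3.0],
     [3.0, 9.0]]
R = corr_matrix(S)  # diagonal entries are 1; off-diagonal is 3/(2*3) = 0.5
```

As the text notes, the diagonal of R is always 1 and R inherits symmetry from S.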
Recall that a normal random variable Y with mean μ and variance σ² has a probability density function (pdf) of

f(y) = (1/(√(2π) σ)) exp{ −(1/(2σ²))(y − μ)² },

for −∞ < y < ∞. It is easy to generalize this univariate pdf to a multivariate normal pdf. Let Y = (Y1, Y2, . . . , Yp)′ denote a multivariate normal random vector with mean vector μ = (μ1, μ2, . . . , μp)′ and covariance matrix Σ. To obtain the multivariate
density function, we replace (y − μ)²/σ² in the exponent by

(y − μ)′Σ⁻¹(y − μ),

and we replace the 1/σ scalar by the determinant of Σ raised to the −1/2 power: |Σ|^(−1/2). The p-dimensional normal pdf can be written

f(y1, y2, . . . , yp) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −(1/2)(y − μ)′Σ⁻¹(y − μ) },    (4)

for y ∈ ℝᵖ.
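For p = 2 the ingredients of the pdf (the determinant, the inverse, and the quadratic form) can all be written out with the 2 × 2 formulas reviewed in the appendix. The Python sketch below evaluates the bivariate case of (4) with made-up numbers; note the pdf's height at its center y = μ reduces to (2π)⁻¹|Σ|^(−1/2).

```python
# Evaluate the p = 2 case of the multivariate normal pdf directly,
# using the 2 x 2 determinant and inverse formulas; numbers are made up.
import math

def mvn2_pdf(y, mu, Sigma):
    a, b = Sigma[0]
    c, d = Sigma[1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 inverse
    z = [y[0] - mu[0], y[1] - mu[1]]
    # quadratic form (y - mu)' Sigma^{-1} (y - mu)
    q = (z[0] * (inv[0][0] * z[0] + inv[0][1] * z[1])
         + z[1] * (inv[1][0] * z[0] + inv[1][1] * z[1]))
    return (2 * math.pi) ** -1 * det ** -0.5 * math.exp(-0.5 * q)

mu = [0.0, 0.0]
Sigma = [[2.0, 0.6], [0.6, 1.0]]
peak = mvn2_pdf(mu, mu, Sigma)  # pdf height at the center of the "mountain"
```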
It is informative to note that if we set the expression (y − μ)′Σ⁻¹(y − μ) in the exponent of the multivariate normal density equal to a constant, the resulting equation defines an ellipsoid of constant density centered at μ. For p = 2, the pdf (4) can be written out explicitly as

f(y1, y2) = (1/(2πσ1σ2√(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ (y1 − μ1)²/σ1² − 2ρ(y1 − μ1)(y2 − μ2)/(σ1σ2) + (y2 − μ2)²/σ2² ] },
Figure 4: The bivariate normal pdf (4) for the Swiss head dimension variables LTN
and LTG.
for −∞ < y1 < ∞ and −∞ < y2 < ∞. This looks quite complicated. The expression for p = 3 or more dimensions becomes even more of a mess to write out without matrix notation, but (4) stays the same regardless of the dimension.
To get an idea of what a multivariate normal pdf looks like, Figure 4 shows a bivariate
normal pdf for the LTN and LTG variables from the Swiss head dimension data. The
bivariate normal pdf looks like a mountain centered over the mean of the distribution.
In order to compute probabilities using the pdf, one needs to compute the volume
under the pdf surface corresponding to the region of interest.
5 Confidence Regions
Confidence intervals were introduced for estimating a single parameter such as the
mean of a distribution. In the multivariate setting, we can similarly define confidence
regions for vectors of parameters, such as the mean vector . To illustrate matters, we
shall consider two of the Swiss head dimension variables: LTN and LTG. Since they correspond to the 5th and 6th variables, the mean vector of interest is μ = (μ5, μ6)′.
There are two approaches. One method is to simply compute two univariate confidence intervals separately for μ5 and μ6 and form the Cartesian product of the two intervals to obtain a confidence rectangle. However, if we compute, say, 95% confidence intervals for μ5 and μ6, then the joint confidence region (the rectangle) has a lower
confidence level. To understand why, consider an analogy: suppose there is a 5% chance I'll get a speeding ticket on a given day when I drive to work. Then the probability I get at least one ticket during the year is certainly higher than 5%. Similarly, if there
is a 5% probability that each random interval does not contain its respective mean,
then the probability that at least one of the intervals does not contain its respective mean is higher than 5%. A simple (but not always efficient) fix to this problem is to
use what is known as the Bonferroni adjustment. If you form p confidence intervals for p parameters and want a confidence level of 1 − α for the family of parameters, then each interval can be computed separately using a confidence level of 1 − α/p, which guarantees that the confidence level is at least 1 − α for all p intervals considered jointly.
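A small simulation illustrates the Bonferroni idea. The setup below is made up for illustration: p = 2 means estimated from standard normal samples with known variance, each interval built at level 1 − α/p = 0.975 (normal quantile ≈ 2.2414 for the 0.0125 upper tail). The union bound then guarantees joint coverage of at least about 1 − α = 0.95.

```python
# Bonferroni adjustment, simulated: two 97.5% intervals so the pair
# jointly covers both means at least ~95% of the time. Known-variance
# normal intervals keep the sketch self-contained; numbers illustrative.
import random

random.seed(1)
alpha, p, n, reps = 0.05, 2, 25, 2000
z = 2.2414                 # approx. upper alpha/(2p) = 0.0125 normal quantile
half = z / n ** 0.5        # interval half-width for a mean of n N(0,1) draws

both_cover = 0
for _ in range(reps):
    means = [sum(random.gauss(0, 1) for _ in range(n)) / n for _ in range(p)]
    if all(abs(m) <= half for m in means):  # both intervals contain the true mean 0
        both_cover += 1

joint_coverage = both_cover / reps  # should be at least roughly 0.95
```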
A more efficient approach for estimating a mean vector is to incorporate the correlations between the estimated parameters. For multivariate normal data, the resulting
confidence regions have ellipsoidal shapes. Instead of determining a random interval
that covers a mean with high probability, we want to determine a region that covers
the vector with high probability.
The solution to this problem requires introducing another probability distribution
known as the F -distribution which results when we look at statistics formed by ratios
of variance estimates. The F -distribution is used extensively in analysis of variance
(ANOVA) applications where we want to compare several means. Because an F
random variable is defined in terms of a ratio of variance estimators, and variance estimators depend on degrees of freedom, the F-distribution is specified by a numerator and a denominator degrees of freedom. The F-distribution is skewed to the
right and takes only positive values. Critical values for the F -distribution can be
found beginning on page 202 in the Appendix. Let Fp,n−p(α) denote the critical value of an F-distribution on p numerator degrees of freedom and n − p denominator degrees of freedom.
Returning to the confidence region problem, one can show (e.g., Johnson and Wichern, 1998, page 179) that for a sample of size n from a p-dimensional normal distribution

P( n(Ȳ − μ)′S⁻¹(Ȳ − μ) ≤ ((n − 1)p/(n − p)) Fp,n−p(α) ) = 1 − α.
This statement shows that a (1 − α)100% confidence region for the mean of a p-
Figure 5: A 95% confidence ellipse for the Swiss head dimension data using only the
variables LTN and LTG.
dimensional normal distribution is given by the set of μ ∈ ℝᵖ that satisfy the inequality:

n(ȳ − μ)′S⁻¹(ȳ − μ) ≤ ((n − 1)p/(n − p)) Fp,n−p(α).
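The "plug in and check the inequality" recipe is easy to code for p = 2 using the 2 × 2 inverse formula from the appendix. Everything in the sketch below is made up for illustration, including the value 3.19 standing in for F2,48(0.05); in practice the critical value would come from the F table in the Appendix.

```python
# Check whether a hypothesized mean mu0 lies inside the confidence
# ellipse: evaluate n (ybar - mu0)' S^{-1} (ybar - mu0) and compare it
# with (n-1)p/(n-p) * F_{p,n-p}(alpha). All numbers are illustrative.

def quad_form(n, ybar, mu, S):
    a, b = S[0]
    c, d = S[1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 inverse formula
    z = [ybar[0] - mu[0], ybar[1] - mu[1]]
    q = (z[0] * (inv[0][0] * z[0] + inv[0][1] * z[1])
         + z[1] * (inv[1][0] * z[0] + inv[1][1] * z[1]))
    return n * q

n, p = 50, 2
ybar = [10.0, 20.0]
S = [[4.0, 1.0], [1.0, 3.0]]
mu0 = [9.5, 20.5]

stat = quad_form(n, ybar, mu0, S)
threshold = (n - 1) * p / (n - p) * 3.19  # 3.19: placeholder for F_{2,48}(0.05)
inside = stat <= threshold                # False here: mu0 is outside the ellipse
```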
The inequality defines a p-dimensional ellipsoid centered at ȳ. To determine if a
hypothesized value of lies in this region, simply plug it into the expression and see
if the inequality is satisfied or not. Figure 5 shows a 95% confidence ellipse for the
Swiss head data for variables Y5 = LTN and Y6 = LTG.
Multivariate statistics is a broad field of statistics and we have only introduced some
of the most basic ideas. Additional topics in multivariate analysis (such as principal
component analysis, discriminant analysis, cluster analysis, canonical correlations,
MANOVA) take the correlations between variables into consideration to solve various
problems.
Problems
1. Data on felled black cherry trees was collected (Ryan et al., 1976). The measured
variables were the diameter (in inches measured from 4.5 feet above the ground),
the height (measured in feet) and the volume (measured in cubic feet). The full data set appears in the following table:
   x1        x2       x3
Diameter   Height   Volume
11.0 75 18.2
11.1 80 22.6
11.2 75 19.9
11.3 79 24.2
11.4 76 21.0
11.4 76 21.4
11.7 69 21.3
12.0 75 19.1
12.9 74 22.2
12.9 85 33.8
13.3 86 27.4
13.7 71 25.7
13.8 64 24.9
14.0 78 34.5
14.2 80 31.7
14.5 74 36.3
16.0 72 38.3
16.3 77 42.6
17.3 81 55.4
17.5 82 55.7
17.9 80 58.3
18.0 80 51.5
18.0 80 51.0
20.6 87 77.0
a) Before analyzing the data, do you expect the correlations between these
three variables to be negative, positive or zero? A scatterplot matrix of
the data is plotted in Figure 6.
b) For each observation, determine the signs of the deviations (xi1 − x̄1) and (xi2 − x̄2) and of the product (xi1 − x̄1)(xi2 − x̄2). For instance, for the first observation:

   x1        x2       x3
Diameter   Height   Volume   (xi1 − x̄1)   (xi2 − x̄2)   (xi1 − x̄1)(xi2 − x̄2)
  11.0       75      18.2        −             −                  +

The sample covariance is basically the average of the products (xi1 − x̄1)(xi2 − x̄2). From the list of +'s and −'s, does it appear the covariance will be positive or negative?
c) The sample covariance matrix for the entire data set is given by
        9.85   10.38    49.89
S = (  10.38   40.60    62.66  ).
       49.89   62.66   270.20
Compute the sample correlation matrix (using (3)) from the covariance
matrix.
d) The purpose of this study was to predict the volume of wood of the tree
using the diameter and/or height. If you had to choose one of the variables
(height or diameter) for predicting the volume of the tree, which would you choose from a purely statistical point of view? (Note that for trees that have not been cut, it would be much more difficult to measure the height than the diameter.) What was the basis for your choice?
e) If we convert the diameter measurements from units of inches to feet, then we would need to divide each diameter measurement by 12. Let zi1 = xi1/12 denote the diameter measurements in units of feet. Compute the sample variance of the zi1 measurements. Also, compute the sample correlation between the diameter (in feet) and the height of the cherry trees.
In this appendix we give a brief review of some of the basics of matrix algebra. A
matrix is simply an array of numbers. Let n denote the number of rows and p denote
the number of columns in an array. Matrices are denoted by boldface letters. For
example, let A denote a matrix with n = 3 rows and p = 2 columns. Then we say that A is an n × p matrix; in this case, A is a 3 × 2 matrix. A special case of a matrix is a vector, which is simply a matrix with a single column (a column vector) or a single row (a row vector). By convention, whenever we denote a vector, we shall assume it is a column vector. One can regard an n × 1 column vector as a point in n-dimensional Euclidean space.
To illustrate matters, let x denote a 3 × 1 column vector and A denote a 3 × 2 matrix defined as follows:

      1            2  3
x = ( 2 ),   A = ( 4  5 ).
      3            1  6
We can perform operations on vectors and matrices such as addition, subtraction, and multiplication.
The transpose of a matrix is obtained by interchanging its rows and columns and is denoted by a prime: A′ is the transpose of A. Thus,

x′ = (1, 2, 3)
Figure 6: Scatterplots of the black cherry tree data. Here, Girth = diameter.
and

A′ = ( 2  4  1 ).
     ( 3  5  6 )
To multiply a matrix by a number (i.e. a scalar), one just multiplies each element of the matrix by the scalar. For instance, if c = 2 then

        2  3       4   6
cA = 2( 4  5 ) = ( 8  10 ).
        1  6       2  12
In order to add two matrices together, they must both be of the same dimensions in
which case you just add the corresponding components together (or subtract if you
are subtracting matrices). We cannot add the vector x to the matrix A because they
are not of the same dimension. However, if
      4
y = ( 5 ),
      6

then

         1     4     5
x + y = (2) + (5) = (7).
         3     6     9
Let
      a11  a12             b11  b12
A = ( a21  a22 )  and B = ( b21  b22 ),
      a31  a32             b31  b32

then

         a11 + b11   a12 + b12
A + B = ( a21 + b21   a22 + b22 ).
         a31 + b31   a32 + b32
Note that the ijth entry in the matrix A for the ith row and the jth column is denoted
aij . Thus, the first index specifies the row number and the second index specifies the
column number.
We saw how to compute the product ab. Note that we can also form the product ba, since b is a 3 × 1 column vector and a is a 1 × 3 row vector, i.e. the number of columns of b matches the number of rows of a, and the product will be a matrix of dimension 3 × 3:

      b11                     b11a11  b11a12  b11a13
ba = (b21)(a11, a12, a13) = ( b21a11  b21a12  b21a13 ).
      b31                     b31a11  b31a12  b31a13
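The outer product pattern above is just "every entry of b times every entry of a." A short Python sketch with made-up numbers makes the indexing explicit:

```python
# Outer product ba of a 3 x 1 column vector b and a 1 x 3 row vector a:
# the (i, j) entry of the 3 x 3 result is b_i * a_j. Numbers are made up.

b = [1.0, 2.0, 3.0]   # 3 x 1 column vector
a = [4.0, 5.0, 6.0]   # 1 x 3 row vector

ba = [[bi * aj for aj in a] for bi in b]  # 3 x 3 matrix
```

The first row of ba is 1 times a, the second row is 2 times a, and so on, which is why every row (and column) of an outer product is a multiple of one vector.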
Inverses. For a scalar such as 5, its inverse is simply 1/5 = 5⁻¹ and 5(1/5) = 1. All real numbers have an inverse except zero. Let A denote a square p × p matrix. The inverse of A, if it exists, is denoted A⁻¹ and is the p × p matrix such that

AA⁻¹ = A⁻¹A = I (the identity matrix).
In order for a matrix A to have an inverse, its columns must be linearly independent
which means that no column of A can be expressed as a linear combination of the
other columns of A. Such matrices are called nonsingular. Thus a singular matrix does
not have an inverse. Finding the inverse of a matrix is somewhat tedious for higher
dimensional matrices. However, for a 2 × 2 matrix, there is a simple formula. If

A = ( a11  a12
      a21  a22 ),

then

A⁻¹ = (1/(a11a22 − a12a21)) (  a22  −a12
                              −a21   a11 ).
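The 2 × 2 inverse formula is short enough to verify directly. The Python sketch below implements it and checks that AA⁻¹ gives the identity, using the same matrix A = (3 4 / 2 5) whose determinant of 7 is computed later in this appendix:

```python
# The 2 x 2 inverse formula, with a check that A A^{-1} = I.

def inv2(A):
    a11, a12 = A[0]
    a21, a22 = A[1]
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("singular matrix has no inverse")
    return [[a22 / det, -a12 / det],
            [-a21 / det, a11 / det]]

def matmul2(A, B):
    # 2 x 2 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[3.0, 4.0], [2.0, 5.0]]
I = matmul2(A, inv2(A))  # should be (up to rounding) the identity matrix
```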
Inverse matrices are needed in order to define the multivariate normal density function and to understand multivariate distance. The inverses of diagonal matrices are easy to compute:
( a11   0   . . .    0  )⁻¹     ( 1/a11     0     . . .     0    )
(  0    a22  . . .    0  )      (   0     1/a22   . . .     0    )
(  .     .            .  )   =  (   .        .              .    )
(  0     0   . . .   app )      (   0        0     . . .  1/app  )
In order for the inverse of a diagonal matrix to exist, all the diagonal elements must
be nonzero. For higher-dimensional non-diagonal matrices, computer software such as Matlab can be used to compute inverses of matrices.
Another matrix operation needed for the multivariate normal density is the determinant of a matrix, denoted by |A| (also written det(A)). The computation of the determinant is rather tedious for square matrices of dimension higher than 3 × 3, and again software such as Matlab can be used to compute determinants. In the case
Figure 8: The determinant is the area of the parallelogram formed from the column
vectors that make up the matrix.
of a 2 × 2 matrix A where

A = ( a11  a12
      a21  a22 ),

the formula is quite simple:

|A| = a11a22 − a12a21.

Thus, if

A = ( 3  4
      2  5 ),

then |A| = 3 · 5 − 4 · 2 = 15 − 8 = 7. One way to think of the determinant of a
matrix is to look at the two column vectors of A. If we plot these two vectors, they
form two edges of a parallelogram as seen in Figure 8. The determinant of A is the
(signed) area of the parallelogram. For higher dimensional matrices, the columns of
the matrix are the vertices of a parallelepiped and the determinant is equal to the
(signed) volume of the parallelepiped.
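The 2 × 2 determinant formula takes one line of Python. The sketch below reproduces the worked example from the text (|A| = 7) and also checks a singular example with proportional columns, whose determinant is zero:

```python
# The 2 x 2 determinant formula |A| = a11*a22 - a12*a21.

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

A = [[3.0, 4.0], [2.0, 5.0]]
d = det2(A)  # 3*5 - 4*2 = 7, the area of the parallelogram from the columns
```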
The determinant of a singular matrix is zero. For instance, suppose

A = ( 1  2
      4  8 ).

Then the second column of A is just 2 times the first column of A and therefore A is singular. The determinant of A is |A| = 1 · 8 − 2 · 4 = 8 − 8 = 0. Since the second column of A is 2 times the first column of A, the vertices of the parallelogram formed by these two columns coincide and hence the area of the resulting parallelogram is zero.
Note that in the formula for the multivariate normal distribution, we divide by the
determinant of the covariance matrix. If the determinant is zero, then the distribution
does not have a density. To understand what this means, consider a bivariate normal random vector (Y1, Y2)′. If the covariance matrix has determinant zero, then Y2 is a linear function of Y1 and the two random variables are perfectly correlated (i.e. correlation equal to ±1). In a scatterplot such as Figure 5, if Y1
and Y2 were perfectly correlated, then the points would lie exactly in a line and the
confidence ellipse would shrink to a line. The bivariate density assigns probability by
computing the volume under the density surface (as shown in Figure 4). However, if
the entire distribution is concentrated on a line in the plane, then the volume under
the density, if it existed, would be zero. In other words, the distribution is degenerate.
Problems
1. Let
      3  4              2  1
A = ( 4  7 )  and B = ( 1  2 ).
Find the following:
a) A + B.
b) AB.
c) BA. (Is AB = BA?)
d) A1 . Verify that your answer is correct by confirming that AA1 = I.
e) |A|
2. Let
      1  x1
      1  x2
X = ( 1  x3 ).
      1  x4
      1  x5
Find the following:
a) X′X.
b) (X′X)⁻¹.
References
Flury, B. (1997). A First Course in Multivariate Statistics. Springer, New York.
Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis.
Prentice Hall, New Jersey.
Ryan, T. A., Joiner, B. L., and Ryan, B. F. (1976). The Minitab Student Handbook.
Duxbury Press, California.