
Cluster Analysis

Statistics 407, ISU

Definition
The aim of cluster analysis is to group cases (objects) according to their similarity on the variables. It is also often called unsupervised classification, meaning that classification is the ultimate goal, but the classes (groups) are not known ahead of time. Hence the first task in cluster analysis is to construct the class information. To determine closeness we start with measuring the interpoint distances.

Distance Measures
Let X = (X1, X2, ..., Xp) and Y = (Y1, Y2, ..., Yp) be two points in p-space (two rows of a data matrix).

Euclidean distance:
d(X, Y) = sqrt((X - Y)'(X - Y)) = sqrt((X1 - Y1)^2 + ... + (Xp - Yp)^2)

Statistical distance:
d(X, Y) = sqrt((X - Y)' S^-1 (X - Y))

Both of these can benefit from using standardized variables instead of the raw variables. These are the most commonly used measures of distance.

Both of these distance measures benefit from standardizing the variables first.
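As a quick sketch of these two measures (the data values and the matrix S here are made up for illustration):

```python
import numpy as np

# Two points in p-space (two rows of a data matrix) and a sample
# variance-covariance matrix S -- all values invented for illustration.
X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 0.0, 3.0])
S = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.5]])

def euclidean(x, y):
    # d(X, Y) = sqrt((X - Y)'(X - Y))
    return np.sqrt((x - y) @ (x - y))

def statistical(x, y, S):
    # d(X, Y) = sqrt((X - Y)' S^-1 (X - Y)), the Mahalanobis distance
    return np.sqrt((x - y) @ np.linalg.inv(S) @ (x - y))

print(euclidean(X, Y))        # sqrt(5) ~ 2.24
print(statistical(X, Y, S))
```

The statistical distance down-weights directions in which the data vary a lot, so the two measures generally disagree unless S is the identity.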

Distance Metrics
Kendall tau distance: Rank each variable. For all pairs of elements of the two points, count 1 for each pair for which the ranks are in the same relationship (up, up; down, down) and 0 otherwise. Measures the association between two points, e.g. "the height value is often as highly ranked as the weight value" says that the two variables are positively correlated. Not as affected by outliers as raw data values.

Distance Metrics
Pearson correlation: d = 1 - r, so d = 0 when r = 1.
Pearson squared correlation: d = 1 - r^2, so d = 1 when r = 0, and d = 0 when r = 1 or -1.
Measures the similarity in shape, rather than global closeness. Can also be considered to be an angular distance.
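A small sketch of turning correlations into distances. The vectors are made up, and the (1 - tau)/2 scaling is one common convention for a tau-based distance, not necessarily the one intended in these notes:

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr

# Two variables measured on the same five cases -- invented values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

tau, _ = kendalltau(x, y)   # rank-based association, robust to outliers
r, _ = pearsonr(x, y)

d_pearson = 1 - r           # 0 when r = 1
d_pearson_sq = 1 - r**2     # 0 when r = 1 or -1, 1 when r = 0
d_tau = (1 - tau) / 2       # one way to map tau in [-1, 1] to [0, 1]

print(d_pearson, d_pearson_sq, d_tau)
```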

Hierarchical Clustering
Hierarchical algorithms sequentially fuse (or split) cases to make clusters. The process can be viewed using a dendrogram. The vertical heights of the dendrogram are used to decide the number of clusters.

Linkage
When a cluster is formed, containing two or more cases, there are now multiple ways to define the distance from the cluster to other clusters or cases. For example, we could define the distance from one cluster to another as the minimum interpoint distance, the maximum interpoint distance, or the average interpoint distance. These are called linkage methods. Each method changes the results of the cluster analysis.

Common linkage methods

The intercluster distance is described by:
Single: the distance between the two closest points.
Complete: the distance between the two farthest points.
Average: the average of all the interpoint distances.
Centroid: the distance between the two centroids (cluster means).
Ward's: the smallest increase in the within-cluster sum of squares after fusing two clusters, like ANOVA.
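These linkage choices can be tried with scipy's hierarchical clustering; this sketch uses simulated two-group data (the data and parameters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Simulated data: two well-separated groups of 10 points each.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(3, 0.3, (10, 2))])

# Each linkage method defines the intercluster distance differently,
# but on well-separated clusters they agree on the 2-cluster cut.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(x, method=method)                  # merge history
    cl = fcluster(Z, t=2, criterion="maxclust")    # cut into 2 clusters
    print(method, cl)
```

The interesting comparisons come later, when the clusters are not this clean: the methods then disagree, as the examples below show.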

Single Linkage

Cluster 1

Cluster 2

Closest points define the intercluster distance

Complete Linkage

Cluster 1

Cluster 2

Farthest points define the intercluster distance

Average Linkage

Cluster 1

Cluster 2

Average of all of the distances defines the intercluster distance

Centroid Linkage

Cluster 1

Cluster 2

Distance between the cluster means defines the intercluster distance

Ward Linkage

One Cluster

Cluster 1

Cluster 2

The increase in the within-cluster sum of squared distances from the means, going from the two clusters to one merged cluster, defines the intercluster distance

Example
Data:

i    A  B  C  D  E
X1   1  1  0  2  3
X2   1  0  2  4  5

Euclidean distances:

     A    B    C    D    E
A    0
B    1.0  0
C    1.4  2.2  0
D    3.2  4.1  2.8  0
E    4.5  5.4  4.2  1.4  0

(Scatterplot of the five points, x1 vs x2.)

Step 1.1
Join the two closest points into a cluster: A and B (d = 1).

     A    B    C    D    E
A    0
B    1.0  0
C    1.4  2.2  0
D    3.2  4.1  2.8  0
E    4.5  5.4  4.2  1.4  0

Step 1.2
Reduce the distance matrix, using the linkage method (average linkage used). Draw the dendrogram.

      AB   C    D    E
AB    0
C     1.8  0
D     3.6  2.8  0
E     4.9  4.2  1.4  0

Step 2.1
Join the two closest points into a cluster: D and E (d = 1.4).

      AB   C    D    E
AB    0
C     1.8  0
D     3.6  2.8  0
E     4.9  4.2  1.4  0

Step 2.2
Reduce the distance matrix, using the linkage method (average linkage used). Draw the dendrogram.

      AB   C    DE
AB    0
C     1.8  0
DE    4.3  3.5  0

Step 3.1
Join the two closest clusters: AB and C (d = 1.8).

      AB   C    DE
AB    0
C     1.8  0
DE    4.3  3.5  0

Step 3.2
Reduce the distance matrix, using the linkage method (average linkage used). Draw the dendrogram.

      ABC  DE
ABC   0
DE    4.0  0

Step 4
Join the last two clusters.
(Final dendrogram: A and B join at height 1, D and E at 1.4, C joins AB at 1.8, and ABC joins DE at 4.0.)
Use the dendrogram heights to decide on the number of clusters.
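The hand calculation above can be checked with scipy's average linkage; the merge heights should match the dendrogram (1.0, 1.4, about 1.8 and about 4.0 before rounding):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The five points A-E from the worked example.
x = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]], dtype=float)

# Average linkage on Euclidean distances, as in the hand calculation.
Z = linkage(x, method="average")

# The third column of Z holds the height of each merge:
# A+B at 1.0, D+E at 1.41, AB+C at about 1.83, ABC+DE at about 4.04
# (the slides round these to 1.4, 1.8 and 4.0).
print(np.round(Z[:, 2], 2))
```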

Examples
Simulated data with two clusters.
(Scatterplot of V1 vs V2 coloured by cluster, and the corresponding dendrogram: dist(x), hclust average linkage.)

Examples
(Dendrograms for the simulated data using average, single, complete, Ward and centroid linkage.)
Each of the dendrograms suggests two clusters, but there are a lot of differences. Several suggest some points are outliers: 11, 14, 41.

Examples
(Two-cluster solutions plotted in V1 vs V2 for average, single, complete, Ward and centroid linkage.)
And the two-cluster solution is the same for all methods.

Flea beetles
(Scatterplot matrix of the six variables: tars1, tars2, head, aede1, aede2, aede3.)
We expect clustering to produce three clusters. Data is standardized before calculating distances.
(Dendrograms using average, single, complete, Ward and centroid linkage.)
Big differences between methods. Which would you use?

Flea beetles
(Three-cluster solutions plotted in the projection pursuit coordinates pp1 vs pp2, for average, single, complete, Ward and centroid linkage.)
Which would you use? How can there be so much difference?

Nuisance variables
Variables that do not contribute to the clustering but are included in the distance calculations.
(Scatterplot of V1 vs V2 showing two clusters plus a nuisance variable.)

Nuisance variables
(Dendrograms and two-cluster solutions for the nuisance-variable data, using average, single, complete, Ward and centroid linkage.)
Complete and Ward linkage see two clusters. Is this what you expected of complete and Ward?

Nuisance variables
(Three-cluster solutions for the nuisance-variable data: average, single, complete, Ward and centroid linkage.)
Which method is best, now?

Nuisance points
(Scatterplot of V1 vs V2: two clusters with scattered points between them.)
Points that are between the major clusters of data. This affects some linkage methods, e.g. single, which will tend to chain through the data, grouping everything together.

Nuisance points
(Dendrograms and two-cluster solutions for the nuisance-points data: average, single, complete, Ward and centroid linkage.)
Only single linkage does not see two clusters. All but single linkage handle the nuisance points.

Flea beetles (PP dim)
(Dendrograms and three-cluster solutions in the projection pursuit coordinates pp1 vs pp2: average, single, complete, Ward and centroid linkage.)
When the nuisance variables are removed all linkage methods see the clusters. All methods see three clusters. Some are still a little confused between 3 or 4.

k-Means Clustering
This is an iterative procedure. To use it the number of clusters, k, must be decided first. The stages of the iteration are:
1. Initialize by either (a) partitioning the data into k groups and computing the k group means, or (b) taking an initial set of k points as the first estimate of the cluster means (seed points).
2. Loop over all observations, assigning them to the group with the closest mean.
3. Recompute the group means.
Iterate steps 2 and 3 until convergence.
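A minimal sketch of this iteration, using the five points A-E and the initialization from the worked example that follows (it assumes no cluster ever becomes empty):

```python
import numpy as np

# The five points A-E, with k = 2 and A and C as the seed points,
# mirroring the worked example below.
x = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]], dtype=float)
means = x[[0, 2]].copy()

while True:
    # Step 2: assign every observation to the group with the closest mean.
    d = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Step 3: recompute the group means (assumes no cluster goes empty).
    new_means = np.array([x[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_means, means):   # iterate until convergence
        break
    means = new_means

print(labels)               # A, B, C in one cluster; D, E in the other
print(np.round(means, 2))   # final means (0.67, 1) and (2.5, 4.5)
```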

Step 0

i    A  B  C  D  E
X1   1  1  0  2  3
X2   1  0  2  4  5

Use k = 2. Suppose A and C are randomly selected as the initial means.

Step 1.1
Initial means: X̄1 = (1, 1) (point A), X̄2 = (0, 2) (point C).
Compute distances between each of the cluster means and all other points:

i    A    B    C    D    E
1    0    1.0  1.4  3.2  4.5
2    1.4  2.2  0    2.8  4.2

Step 1.2
Assign each case to the cluster having the closest mean. Recalculate the cluster means.

i        A    B    C    D    E
1        0    1.0  1.4  3.2  4.5
2        1.4  2.2  0    2.8  4.2
Cluster  1    1    2    2    2

Step 1 - Plot
X̄1 = (1, 0.5), X̄2 = (1.7, 3.7)
(Plot of the points, with cases coloured by cluster and the updated cluster means marked.)

Step 2.1
X̄1 = (1, 0.5), X̄2 = (1.7, 3.7)
Compute distances between each of the cluster means and all other points:

i    A    B    C    D    E
1    0.5  0.5  1.8  3.6  4.9
2    2.7  3.7  2.4  0.5  1.9

Step 2.2
Assign each case to the cluster having the closest mean. Recalculate the cluster means.

i        A    B    C    D    E
1        0.5  0.5  1.8  3.6  4.9
2        2.7  3.7  2.4  0.5  1.9
Cluster  1    1    1    2    2

Step 2 - Plot
X̄1 = (0.7, 1), X̄2 = (2.5, 4.5)
(Plot of the points, with cases coloured by cluster and the updated cluster means marked.)

Step 3
X̄1 = (0.7, 1), X̄2 = (2.5, 4.5)
The algorithm has converged: re-calculating distances and reassigning cases to clusters results in no change. This is the final solution.

k-Means - Initialization
The algorithm needs to be initialized by choosing k initial means. Approaches:
1. Randomly choose k points from the data set to act as the initial means.
2. First do hierarchical clustering, decide on k, and use the means of these clusters as the initial k means.
Initialization can affect the final result. If k is not known, re-run for several choices of k.

Examples
Flea beetles: (k-means solution plotted in pp1 vs pp2.) Several cases are confused. Why would k-means have trouble with this data?

Example
(k-means two-cluster solutions for the nuisance-variable and nuisance-points data, plotted in V1 vs V2.)
k-means does not handle nuisance variables well, but surprisingly does well with these data sets.

Example - partitioning
Many clustering tasks involve partitioning data into chunks. There may not be natural clusters.
(Scatterplot of V1 vs V2, and the four-cluster partitions from k-means and Ward linkage.)

Example - partitioning
The method used matters in the way the data gets partitioned.
(Four-cluster partitions from k-means and Ward linkage on a second data set.)

Summarizing results
Need to show how the clusters differ from each other:
Tabulate the means (and standard deviations) for each cluster.
Make separate plots for each cluster, using the same scale.
Plot the cluster means on one plot.
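One way to tabulate per-cluster means and standard deviations, sketched with pandas on a made-up data frame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up data frame: two measurements plus a cluster label per case.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "tars1": rng.normal(150, 10, 30),
    "head": rng.normal(50, 2, 30),
    "cluster": np.repeat([1, 2, 3], 10),
})

# Tabulate the mean and standard deviation of each variable per cluster.
summary = df.groupby("cluster").agg(["mean", "std"])
print(summary.round(2))
```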


Example
Cluster 1 has high values on all variables. Cluster 2 has low values for tars1 and aede2, high values of head and aede3. Cluster 3 has high values of tars1 and aede2, but low values of all other variables.

      cluster  tars1   tars2   head   aede1   aede2  aede3
mean  1        183.10  129.62  51.24  146.19  14.10  104.86
sd    1         12.14    7.16   2.23    5.63   0.89    6.18
mean  2        138.23  125.09  51.59  138.27  10.09  106.59
sd    2          9.34    8.55   2.84    4.14   0.97    5.85
mean  3        201.00  119.32  48.87  124.65  14.29   81.00
sd    3         14.90    6.65   2.35    4.62   1.10    8.93

(Parallel coordinate plot of the scaled cluster means across the six variables.)

Example
(Parallel coordinate plots of all cases, separately for each cluster.)
Plotting all of the data shows the variation in each cluster.

Example
Plotting the clusters in a low-dimensional projection like the first two principal components can also help evaluate the clusters.
(Scatterplot of PC1 vs PC2 coloured by cluster.)

Example
Iowa prairies: which sites are similar? Use the Canberra distance:

d(A,B) = (1/m) * sum_j |Aj - Bj| / (Aj + Bj),   where m = #non-zero entries and 0/0 is defined to be 0.

(Ward linkage dendrogram of the prairie sites.)
Dendrogram suggests five clusters?
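A sketch of the slide's version of the Canberra distance. Note that scipy's built-in `canberra` sums the terms without dividing by the number of non-zero entries, so it is implemented directly here; the abundance profiles are made up:

```python
import numpy as np

def canberra(a, b):
    # Canberra distance as defined on the slide: average of
    # |a_j - b_j| / (a_j + b_j) over entries where a_j + b_j != 0,
    # treating 0/0 as 0.
    a, b = np.asarray(a, float), np.asarray(b, float)
    denom = a + b
    nonzero = denom != 0
    if nonzero.sum() == 0:
        return 0.0
    terms = np.abs(a - b)[nonzero] / denom[nonzero]
    return terms.sum() / nonzero.sum()

# Two made-up species-abundance profiles; the shared absence (0, 0)
# in the last entry is ignored by the 0/0 = 0 rule.
A = np.array([3, 0, 2, 0])
B = np.array([1, 4, 2, 0])
print(canberra(A, B))   # (0.5 + 1 + 0) / 3 = 0.5
```

The per-entry scaling makes this distance well suited to count data, where rare and abundant species would otherwise contribute on very different scales.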

Example
(Plot of abundance by species for each of the five clusters.)
Differences between clusters on:
________________ (2,3 low; 4,5 medium; 1 high)
________________ (5 low; 2 lowish; 1,4 med) ...
Cluster 5 has _________________, absent elsewhere. Cluster 2 has _________________, absent elsewhere.

Example
List clusters:
1: C-Turtlehead Fen, Moeckly, Reichelt Unit, Sheeder, Colo Bogs, Grimes Farm, McFarland Stargrass, C-Airport, C-Puccoon, C-Sandhill E, C-Sandhill W
2: Doolittle, Liska-Stanek, Richards Marsh, Neal 2, Neal 4, Neal 5
3: Judson, Mama, Briggs Woods, C-E - Marsh
4: Marietta Prairie, Morris, Grant Ridge, Neal 1
5: Big Creek, Meetz, Prairie Flower, Neal 3

Example
Really need to take out the ________ variables to digest the differences. Try to write down a few species that have differences between clusters.


Example
(Nonmetric MDS plot, NMDS1 vs NMDS2, with species labels overlaid.)
Circle the clusters. Match the species to the clusters.

Self-organizing maps
A self-organizing map is a constrained k-means clustering algorithm. A 1D or 2D net is stretched through the data. The knots in the net form the cluster means, and points closest to a knot are considered to belong to that cluster. The map provides a low-dimensional view of the clusters, an alternative to PCA or MDS.

The net fits neatly into the data. Some of the extreme points are far from the model, which is not obvious from the map view.

Model-based clustering
Model-based clustering (Fraley and Raftery, 2002) fits a multivariate mixture model to the data. For example, if it is believed that clusters in the data are approximately elliptical in shape, then a multivariate normal distribution might be used. The shape of the clusters is defined by the variance-covariance matrix for each group, which can be parametrized as:

Sigma_k = lambda_k D_k A_k D_k',   k = 1, ..., g (number of clusters)

This leads to several choices from simple to complex models:

Name  Sigma_k                 Distribution  Volume    Shape     Orientation
EII   lambda I                Spherical     equal     equal     NA
VII   lambda_k I              Spherical     variable  equal     NA
EEI   lambda A                Diagonal      equal     equal     NA
VEI   lambda_k A              Diagonal      variable  equal     NA
VVI   lambda_k A_k            Diagonal      variable  variable  NA
EEE   lambda D A D'           Ellipsoidal   equal     equal     equal
EEV   lambda D_k A D_k'       Ellipsoidal   equal     equal     variable
VEV   lambda_k D_k A D_k'     Ellipsoidal   variable  equal     variable
VVV   lambda_k D_k A_k D_k'   Ellipsoidal   variable  variable  variable

Model-based clustering uses the EM algorithm to fit the parameters for the mean and variance-covariance of each population and the mixing proportion. This is appropriate when the observations are independent and identically distributed multivariate normal observations. Typically the Bayes Information Criterion (BIC), based on the log-likelihood, number of variables and number of mixture components, is used to assess the best model. The higher the BIC value the better the model.

Figure 12 illustrates model-based clustering of the simulated data. Models 4 (EEV), 5 (VVV) and 3 (EEE) with 2 clusters are almost equally good according to the BIC value. The top choice is model 4 (EEV): equal volume but different orientation elliptical clusters. Almost as good is model 3 (EEE), two equal elliptical clusters. This is the real model underlying this data. The best 3-cluster solution is model 3, three equal elliptical clusters. Model 5 (VVV), the fully unconstrained parametrization, has a lot of parameters to estimate, which is possible if there are only two clusters, but there is not enough data to estimate the parameters for more than two clusters.

    library(mclust)
    x.mc <- EMclust(x, 2:4, c("EII","VII","EEE","EEV","VVV"))
    plot(x.mc)
    legend(2, -105, col=c(1:5), lty=c(1:5),
           legend=c("1 EII","2 VII","3 EEE","4 EEV","5 VVV"))
Model fitting
The cluster model is fit by estimating the mean and variance-covariance of each population and the mixing proportion, and optimizing these for the sample. The fit is evaluated by examining the sample variation in relation to the parameter estimates, using the Bayes Information Criterion (BIC). The higher the BIC value the better the model.
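mclust is R software; a rough Python analogue of this BIC-based model selection is sklearn's GaussianMixture, whose `covariance_type` loosely maps to a few of the parametrizations above. One caution: sklearn's BIC is "lower is better", the opposite sign convention to mclust. The data here is simulated for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up data: two elliptical clusters with a common covariance.
rng = np.random.default_rng(2)
cov = [[1.0, 0.8], [0.8, 1.0]]
x = np.vstack([rng.multivariate_normal([0, 0], cov, 100),
               rng.multivariate_normal([4, 4], cov, 100)])

# covariance_type is a rough analogue of mclust's parametrizations:
# "spherical" ~ VII, "diag" ~ VVI, "tied" ~ EEE, "full" ~ VVV.
for cov_type in ["spherical", "diag", "tied", "full"]:
    for k in [1, 2, 3]:
        gm = GaussianMixture(n_components=k, covariance_type=cov_type,
                             random_state=0).fit(x)
        # NB: sklearn's BIC is lower-is-better, unlike mclust's.
        print(cov_type, k, round(gm.bic(x), 1))
```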

(BIC plot for models EII, VII, EEE, EEV and VVV over 2-4 clusters, with the fitted EEV-2, EEE-2 and EEE-3 solutions.)
Real model that generated the data is EEE-2. Best fit as measured by BIC is EEV-2. How many parameters need to be estimated?

Example: Music
(BIC plots: all five models over 1-35 clusters, then EEE and EEV over 2-14 clusters.)
Elliptical models are much better than spherical models. Best model: EEE with 14 clusters, but other models worth exploring too.
(Table of the 14 cluster means on LVar, LAve, LMax, LFEner and LFreq, with the names of the tracks in each cluster, and the common variance-covariance matrix S estimated by the EEE model.)
The unusual tracks discovered earlier are in singleton clusters: V6, Saturday Morning, Hey Jude.

Comparing results
One approach is to use a confusion table. For example, to compare the results for hierarchical clustering using average linkage and k-means clustering on the flea beetles data we could summarize the clusters like:

                k-means
hc(ave)    1    2    3
1          0    0   19
2         24    0    0
3          0   30    0
4          0    1    0

Mapping is: 1 -> 3, 2 -> 1, 3 -> 2. Rearrange the confusion table accordingly:

                k-means
hc(ave)    1    2    3
2         24    0    0
3          0   30    0
1          0    0   19
4          0    1    0

The two methods agree almost completely.
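A sketch of building and rearranging such a confusion table with pandas. The labels here are hypothetical, not the actual flea beetle results:

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels for the same ten cases from two methods.
hc = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
km = np.array([3, 3, 3, 1, 1, 1, 2, 2, 2, 1])

# Confusion table: rows are hc(ave) clusters, columns are k-means clusters.
tab = pd.crosstab(hc, km, rownames=["hc"], colnames=["k-means"])
print(tab)

# Match each hc cluster to its most frequent k-means cluster and
# reorder the rows so agreement sits on the diagonal.
mapping = tab.idxmax(axis=1)
tab2 = tab.loc[mapping.sort_values().index]
print(tab2)
```

Counting the off-diagonal cases of the rearranged table gives the disagreement between the two methods.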

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/ licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

