Definition
The aim of cluster analysis is to group cases (objects) according to their similarity on the variables. It is also often called unsupervised classification, meaning that classification is the ultimate goal, but the classes (groups) are not known ahead of time. Hence the first task in cluster analysis is to construct the class information. To determine closeness we start with measuring the interpoint distances.
Distance Measures
Let X = (X_1, X_2, ..., X_p) and Y = (Y_1, Y_2, ..., Y_p) be two points in p-space (two rows of a data matrix).

Euclidean distance:

    d(X, Y) = \sqrt{(X - Y)'(X - Y)} = \sqrt{(X_1 - Y_1)^2 + \ldots + (X_p - Y_p)^2}

Statistical distance:

    d(X, Y) = \sqrt{(X - Y)' S^{-1} (X - Y)}

Both of these can benefit from using standardized variables instead of the raw variables. These are the most commonly used measures of distance.
Both of these distance measures benefit from standardizing the variables first.
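As an illustrative sketch of the two measures (the course software is R; here Python/NumPy is used, and the function names are made up for the example):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance: sqrt((x - y)'(x - y))."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ d))

def statistical(x, y, S):
    """Statistical distance: sqrt((x - y)' S^{-1} (x - y)),
    where S is a variance-covariance matrix."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.solve(S, d)))

x, y = [1, 1], [2, 4]
print(euclidean(x, y))                  # sqrt(1 + 9) = sqrt(10)
S = np.array([[1.0, 0.0], [0.0, 4.0]])  # variable 2 has 4x the variance
print(statistical(x, y, S))             # sqrt(1 + 9/4)
```

The statistical distance down-weights differences in high-variance directions, which is why standardizing (or dividing through by S) matters.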
Distance Metrics
Kendall tau distance: rank each variable. For all pairs of elements of the two points, count 1 for each pair for which the ranks are in the same relationship (low, low; high, high) and 0 otherwise. This measures the association between two points; e.g. if the height value is as highly ranked as the weight value for most pairs, the two variables are positively correlated. Not as affected by outliers as raw data values.
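The pair-counting rule above can be sketched directly (an illustrative Python sketch with made-up height/weight values; Kendall's tau statistic itself is usually rescaled to lie in [-1, 1]):

```python
from itertools import combinations

def concordance(x, y):
    """Fraction of pairs (i, j) where x and y order the cases the same
    way: count 1 when (x_i - x_j) and (y_i - y_j) have the same sign,
    and 0 otherwise."""
    pairs = list(combinations(range(len(x)), 2))
    agree = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    return agree / len(pairs)

# Heights and weights ranked almost identically -> value near 1
height = [150, 160, 170, 180, 190]
weight = [55, 60, 72, 70, 85]
print(concordance(height, weight))   # 9 of 10 pairs agree -> 0.9
```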
Distance Metrics
Pearson correlation: d = 1 - r; d = 0 when r = 1. Pearson squared correlation: d = 1 - r^2; d = 1 when r = 0, d = 0 when r = 1 or -1. Measures the similarity in shape, rather than global closeness. Can also be considered to be an angular distance.
Hierarchical Clustering
Hierarchical algorithms sequentially join (or split) cases to make clusters. The process can be viewed using a dendrogram. The vertical heights of the dendrogram are used to decide on the number of clusters.
Linkage
When a cluster is formed, containing two or more cases, there are now multiple ways to define the distance from the cluster to other clusters or cases. For example, we could define the distance from one cluster to another as the minimum interpoint distance, or the maximum interpoint distance, or the average interpoint distance. These are called linkage methods. Each method changes the results of the cluster analysis.
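The three intercluster distances just described can be sketched as follows (Python/NumPy sketch with two made-up clusters; the function name is illustrative):

```python
import numpy as np

def intercluster(c1, c2, method):
    """Distance between two clusters as the min (single), max (complete)
    or mean (average) of all interpoint Euclidean distances."""
    d = [float(np.linalg.norm(np.array(a, float) - np.array(b, float)))
         for a in c1 for b in c2]
    return {"single": min(d), "complete": max(d),
            "average": sum(d) / len(d)}[method]

c1 = [(0, 0), (0, 1)]
c2 = [(3, 0), (4, 0)]
print(intercluster(c1, c2, "single"))    # 3.0
print(intercluster(c1, c2, "complete"))  # sqrt(17)
print(intercluster(c1, c2, "average"))
```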
Single Linkage

[Figure: Cluster 1 and Cluster 2, joined by the minimum interpoint distance]

Complete Linkage

[Figure: Cluster 1 and Cluster 2, joined by the maximum interpoint distance]

Average Linkage

[Figure: Cluster 1 and Cluster 2, joined by the average interpoint distance]

Centroid Linkage

[Figure: Cluster 1 and Cluster 2, joined by the distance between cluster centroids]
Ward Linkage

[Figure: one combined cluster vs. Cluster 1 and Cluster 2]

The ratio of the sums of squared distances from the means, for the single combined cluster versus the two separate clusters, defines the intercluster distance.
Example

Data:

    i    A  B  C  D  E
    X1   1  1  0  2  3
    X2   1  0  2  4  5

Euclidean distances:

          A    B    C    D    E
    A    0
    B    1.0  0
    C    1.4  2.2  0
    D    3.2  4.1  2.8  0
    E    4.5  5.4  4.2  1.4  0

[Figure: scatterplot of the five points in (x1, x2)]
Step 1.1
Join the two closest points into a cluster.
A and B are the closest pair (d = 1), so they form the first cluster.

          A    B    C    D    E
    A    0
    B    1.0  0
    C    1.4  2.2  0
    D    3.2  4.1  2.8  0
    E    4.5  5.4  4.2  1.4  0

[Figure: scatterplot with A and B circled]
Step 1.2
Reduce the distance matrix, using the linkage method (average linkage here: d(AB, C) = (1.4 + 2.2)/2 = 1.8). Draw the dendrogram.

          AB   C    D    E
    AB   0
    C    1.8  0
    D    3.6  2.8  0
    E    4.9  4.2  1.4  0

[Dendrogram: A and B joined at height 1]
Step 2.1
Join the two closest points into a cluster.
D and E are now the closest pair (d = 1.4), so they form the next cluster.

          AB   C    D    E
    AB   0
    C    1.8  0
    D    3.6  2.8  0
    E    4.9  4.2  1.4  0

[Figure: scatterplot with AB and DE circled]
Step 2.2
Reduce the distance matrix, using the linkage methods. Draw the dendrogram.
          AB   C    DE
    AB   0
    C    1.8  0
    DE   4.3  3.5  0

[Dendrogram: A, B joined at height 1; D, E joined at height 1.4]
Step 3.1
Join the two closest points into a cluster.
AB and C are now the closest pair (d = 1.8).

          AB   C    DE
    AB   0
    C    1.8  0
    DE   4.3  3.5  0

[Figure: scatterplot with ABC and DE circled]
Step 3.2
Reduce the distance matrix, using the linkage methods. Draw the dendrogram.
          ABC  DE
    ABC  0
    DE   4.0  0

[Dendrogram: A, B at height 1; D, E at height 1.4; AB, C at height 1.8]
Step 4

Join the last two clusters (at d = 4.0).

[Figure: final scatterplot and the completed dendrogram]
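The worked example above can be checked with SciPy's hierarchical clustering (the slides use R's hclust; this Python sketch is an illustrative equivalent). The merge heights reproduce the reduced matrices: 1, 1.4, 1.8 and 4.0 after rounding.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The five points A..E from the worked example
X = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]], dtype=float)

# Average linkage on Euclidean distances; each row of Z is
# (cluster i, cluster j, merge height, new cluster size)
Z = linkage(X, method="average")
print(np.round(Z[:, 2], 2))   # merge heights 1.0, 1.41, 1.83, 4.04
```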
Examples

[Figures: cluster dendrograms of simulated data from dist(x) and hclust, using average and other linkage methods]
Each of the dendrograms suggests two clusters, but there are a lot of differences. Several suggest some points are outliers: 11, 14, 41.
Examples

[Figures: two-cluster solutions from average, single, complete, Ward and centroid linkage]
And the two-cluster solution is the same for all methods.
Flea beetles

[Figures: scatterplot matrix of the six measurements (tars1, tars2, head, aede1, aede2, aede3) and dendrograms from average, single, complete, Ward and centroid linkage]
We expect clustering to produce three clusters. The data is standardized before calculating distances.
Flea beetles
[Figures: three-cluster solutions from each linkage method, plotted in the principal components pp1 and pp2]
Nuisance variables
Variables that do not contribute to the clustering but are included in the distance calculations.

[Figure: two clusters with a nuisance variable (V1, V2)]
Nuisance variables

[Figures: dendrograms and two-cluster solutions for the nuisance-variable data under average, single, complete, Ward and centroid linkage]
Nuisance variables

[Figures: three-cluster solutions for the nuisance-variable data under each linkage method]
Nuisance points
Points that are between the major clusters of data. This affects some linkage methods, e.g. single, which will tend to chain through the data, grouping everything together.

[Figure: two clusters with scattered nuisance points between them (V1, V2)]
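The chaining behaviour can be demonstrated with a small sketch (illustrative data, not the slide's): single linkage joins everything through a chain of bridge points at a low height, while complete linkage keeps the two groups separated until a much larger height.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight groups on a line, with nuisance points bridging the gap
X = np.array([0, 0.5, 1, 3, 5, 7, 9, 9.5, 10], dtype=float).reshape(-1, 1)

# Single linkage chains through the bridge: its final merge happens at
# the largest nearest-neighbour gap (2.0). Complete linkage keeps the
# groups apart until the full diameter of the data (10.0).
Z_single = linkage(X, method="single")
Z_complete = linkage(X, method="complete")
print(Z_single[-1, 2], Z_complete[-1, 2])   # 2.0 vs 10.0
```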
Nuisance points

[Figures: dendrograms and cluster solutions for the nuisance-points data under average, single, complete, Ward and centroid linkage]
Flea beetles

[Figures: dendrograms and three-cluster solutions for the flea beetles data (pp1, pp2) under each linkage method]
When the nuisance variables are removed, all linkage methods see the clusters.
All methods see three clusters. Some are still a little confused between 3 and 4.
k-Means Clustering
This is an iterative procedure. To use it the number of clusters, k, must be decided first. The stages of the iteration are:

1. Initialize by either (a) partitioning the data into k groups and computing the k group means, or (b) choosing an initial set of k points as the first estimate of the cluster means (seed points).
2. Loop over all observations, assigning them to the group with the closest mean.
3. Recompute the group means.

Iterate steps 2 and 3 until convergence.
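These stages can be sketched directly (an illustrative Python/NumPy implementation; it assumes no cluster ever becomes empty). Run on the five points A..E with A and C as seed points, it reproduces the hand iteration that follows: clusters {A, B, C} and {D, E}.

```python
import numpy as np

def kmeans(X, means, max_iter=100):
    """Plain k-means: assign each case to the closest mean, recompute
    the group means, stop when no assignment changes."""
    X, means = np.asarray(X, float), np.asarray(means, float)
    labels = None
    for _ in range(max_iter):
        # distance from every case to every current mean
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new = d.argmin(axis=1)
        if labels is not None and np.array_equal(labels, new):
            break                      # converged: no reassignment
        labels = new
        means = np.array([X[labels == k].mean(axis=0)
                          for k in range(len(means))])
    return labels, means

# The five points A..E, with A and C as seed points (k = 2)
X = [[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]]
labels, means = kmeans(X, means=[[1, 1], [0, 2]])
print(labels)   # A, B, C -> cluster 0; D, E -> cluster 1
print(means)    # cluster means (0.67, 1.0) and (2.5, 4.5)
```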
Step 0
    i    A  B  C  D  E
    X1   1  1  0  2  3
    X2   1  0  2  4  5

[Figure: scatterplot of the five points]

Use k = 2. Suppose A and C are randomly selected as the initial means.
Step 1.1

Compute distances between each of the cluster means and all other points. Initial means: cluster 1 = A = (1, 1); cluster 2 = C = (0, 2).

    i           A    B    C    D    E
    d(mean 1)   0    1.0  1.4  3.2  4.5
    d(mean 2)   1.4  2.2  0    2.8  4.2
Step 1.2

Assign each case to the cluster having the closest mean. Recalculate the cluster means.

    i          A  B  C  D  E
    cluster    1  1  2  2  2

New means: cluster 1 = (1.0, 0.5); cluster 2 = (1.7, 3.7).
[Figure: cases coloured by cluster, with the recalculated means]
Step 2.1
Compute distances between each of the cluster means and all other points.

    i           A    B    C    D    E
    d(mean 1)   0.5  0.5  1.8  3.6  4.9
    d(mean 2)   2.7  3.7  2.4  0.5  1.9
Step 2.2

Assign each case to the cluster having the closest mean. Recalculate the cluster means.

    i          A  B  C  D  E
    cluster    1  1  1  2  2

New means: cluster 1 = (0.7, 1.0); cluster 2 = (2.5, 4.5).
[Figure: cases coloured by cluster, with the recalculated means]
Step 3

[Figure: final clusters {A, B, C} and {D, E}]

The algorithm has converged: re-calculating distances and reassigning cases to clusters results in no change. This is the stopping rule.
k-Means - Initialization
The algorithm needs to be initialized by choosing k initial means. Approaches:

1. Randomly choose k points from the data set to act as the initial means.
2. First do hierarchical clustering, decide on k, and use the means of these clusters as the initial k means.

Initialization can affect the final result. If k is not known, re-run for several choices of k.
Examples
[Figure: k-means cluster solution for the flea beetles data (pp1, pp2)]

Flea beetles: several cases are confused. Why would k-means have trouble with this data?
Example
[Figures: k-means two-cluster solutions for the nuisance-variable data sets]

k-means does not handle nuisance variables well, but surprisingly does well with these data sets.
Example - partitioning
Many clustering tasks involve partitioning data into chunks. There may not be natural clusters.

[Figures: four-group partitions of the data from k-means and Ward linkage]
Example - partitioning
2 1
V2
-1
-2
-2
-1
V1
k-means
factor(x.km$cluster) 1 2 3 4
Ward
2 1 factor(cl) 1 2 3 -1 4
V2
-1
V2
-2
-2
-2
-1
-2
-1
V1
V1
50
Summarizing results
Need to show how the clusters differ from each other: tabulate the means and standard deviations for each cluster; make separate plots for each cluster, using the same scale; plot the cluster means on one plot.
Example
Cluster 1 has high values on all variables. Cluster 2 has low values for tars1 and aede2, high values of head and aede3. Cluster 3 has high values of tars1 and aede2, but low values of all other variables.
    cluster    tars1   tars2   head   aede1   aede2   aede3
    mean 1    183.10  129.62  51.24  146.19   14.10  104.86
    sd 1       12.14    7.16   2.23    5.63    0.89    6.18
    mean 2    138.23  125.09  51.59  138.27   10.09  106.59
    sd 2        9.34    8.55   2.84    4.14    0.97    5.85
    mean 3    201.00  119.32  48.87  124.65   14.29   81.00
    sd 3       14.90    6.65   2.35    4.62    1.10    8.93
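A per-cluster summary table of this kind can be computed directly (a sketch with made-up stand-in data, since the flea beetles data is not bundled here):

```python
import numpy as np

# Hypothetical stand-in data: 30 cases, 3 variables, 3 clusters of 10
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
cl = np.repeat([1, 2, 3], 10)

# Tabulate the mean and standard deviation of each variable per cluster
for k in np.unique(cl):
    rows = X[cl == k]
    print("mean", k, np.round(rows.mean(axis=0), 2))
    print("sd  ", k, np.round(rows.std(axis=0, ddof=1), 2))
```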
[Figure: cluster means plotted across the six variables]
Example

[Figure: separate profile plots for each of the three clusters, on a common 0-1 scale]
Example
Plotting the clusters in a low-dimensional projection like the first two principal components can also help evaluate the clusters.

[Figure: clusters plotted against PC1 and PC2]
Example

Iowa prairies: which sites are similar? Use the Canberra distance:

    d_{A,B} = (1 / #non-zero entries) \sum_{j=1}^{p} |A_j - B_j| / (A_j + B_j)

where 0/0 is defined to be 0.

[Figure: cluster dendrogram of the prairie sites]
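The Canberra distance as defined above can be sketched like this (Python/NumPy, with made-up species counts; note that SciPy's built-in `canberra` omits the 1/#non-zero normalization used here):

```python
import numpy as np

def canberra(a, b):
    """Canberra distance as defined on the slide: average of
    |a_j - b_j| / (a_j + b_j) over the non-zero entries, with 0/0 := 0."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    nz = (a + b) != 0               # terms with a_j = b_j = 0 contribute 0
    if not nz.any():
        return 0.0
    terms = np.abs(a[nz] - b[nz]) / (a[nz] + b[nz])
    return float(terms.sum() / nz.sum())

# Species counts at two hypothetical sites; the shared absence is skipped
site1 = [3, 0, 1, 0, 7]
site2 = [1, 2, 0, 0, 5]
print(canberra(site1, site2))   # (0.5 + 1 + 1 + 1/6) / 4
```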
Example

[Figure: species abundance by cluster for the five-cluster average linkage solution]
Example
List clusters:

1. C-Turtlehead Fen, Moeckly, Reichelt Unit, Sheeder, Colo Bogs, Grimes Farm, McFarland Stargrass, C-Airport, C-Puccoon, C-Sandhill E, C-Sandhill W
2. Doolittle, Liska-Stanek, Richards Marsh, Neal 2, Neal 4, Neal 5
3. Judson, Mama, Briggs Woods, C-E - Marsh
4. Marietta Prairie, Morris, Grant Ridge, Neal 1
5. Big Creek, Meetz, Prairie Flower, Neal 3
Example
Really need to take out the nuisance variables to digest the differences. Try to write down a few species that differ between clusters.
Example

[Figure: species plotted in NMDS coordinates (NMDS1, NMDS2)]
Self-organizing maps
A self-organizing map is a constrained clustering algorithm. A 1D or 2D net is stretched through the data. The knots in the net form the cluster means, and the points closest to a knot are considered to belong to that cluster. The map provides a low-dimensional view of the clusters, an alternative to principal components or multidimensional scaling.
The net fits neatly into the data. Some of the extreme points are far from the model, which is not obvious from the map view.

[Figure: the fitted net shown in the data space and in the map view]
Model-based clustering

Model-based clustering (Fraley and Raftery, 2002) fits a multivariate mixture model to the data. For example, if it is believed that clusters in the data are approximately elliptical in shape, then a multivariate normal distribution might be used. The shape of the clusters is defined by the variance-covariance matrix for each group, which can be parametrized as

    \Sigma_k = \lambda_k D_k A_k D_k',   k = 1, ..., g (number of clusters)

where \lambda_k controls the volume, A_k the shape and D_k the orientation of cluster k.
This leads to several choices from simple to complex models:

    Name   Sigma_k              Distribution   Volume     Shape      Orientation
    EII    lambda I             Spherical      equal      equal      NA
    VII    lambda_k I           Spherical      variable   equal      NA
    EEI    lambda A             Diagonal       equal      equal      NA
    VEI    lambda_k A           Diagonal       variable   equal      NA
    VVI    lambda_k A_k         Diagonal       variable   variable   NA
    EEE    lambda DAD'          Ellipsoidal    equal      equal      equal
    EEV    lambda D_k A D_k'    Ellipsoidal    equal      equal      variable
    VEV    lambda_k D_k A D_k'  Ellipsoidal    variable   equal      variable
    VVV    lambda_k D_k A_k D_k'  Ellipsoidal  variable   variable   variable

Model-based clustering uses the EM algorithm to fit the parameters for the mean and variance-covariance of each population and the mixing proportion. This is appropriate when the observations are independent, identically distributed multivariate normal observations. Typically the Bayes Information Criterion (BIC), based on the log-likelihood, number of variables and number of mixture components, is used to assess the best model. The higher the BIC value the better the model.

Figure 12 illustrates model-based clustering of the simulated data. Models 4 (EEV), 5 (VVV) and 3 (EEE) with 2 clusters are almost equally good according to the BIC value. The top choice is model 4 (EEV), that is, equal volume but different orientation elliptical clusters. Almost as good is model 3 (EEE), two equal elliptical clusters. This is the real model underlying this data. The best 3-cluster solution is model 3, three equal elliptical clusters. Model 5 (VVV), the fully unconstrained parametrization, has a lot of parameters to estimate, which is possible if there are only two clusters, but there is not enough data to estimate the parameters for more than two clusters.
[Figure 12: BIC traces for models EII, VII, EEE, EEV and VVV]

    library(mclust)
    x.mc <- EMclust(x, 2:4, c("EII","VII","EEE","EEV","VVV"))
    plot(x.mc)
    legend(2, -105, col=c(1:5), lty=c(1:5),
           legend=c("1 EII","2 VII","3 EEE","4 EEV","5 VVV"))
Model fitting
The cluster model is fit by estimating the mean and variance-covariance of each population and the mixing proportion, and optimizing these for the sample. The fit is evaluated by examining the sample variation in relation to the parameter estimates, using the Bayes Information Criterion (BIC). The higher the BIC value the better the model.
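Written out, the criterion (in the convention used by mclust, where larger is better) is:

```latex
\mathrm{BIC} = 2\log L(\hat{\theta} \mid x) - m \log n
```

where L is the likelihood of the fitted mixture, m is the number of estimated parameters (which grows quickly for the unconstrained VVV model) and n is the number of observations. Note that some software reports the negative of this, so that smaller is better.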
[Figures: BIC versus number of clusters for each model, and the fitted solutions Model EEV,2; Model EEE,2; Model EEE,3]
The real model that generated the data is EEE-2. The best fit as measured by BIC is EEV-2. How many parameters need to be estimated?
Example: Music
Example: Music

[Figure: BIC versus number of clusters; the common variance-covariance (EEE) model with 14 clusters is chosen]

[Table: cluster means of LVar, LAve, LMax, LFEner and LFreq for the 14 clusters, with the names of the tracks in each cluster]

The unusual tracks discovered earlier are in singleton clusters: V6, Saturday Morning.
Comparing results

One approach is to use a confusion table. For example, to compare the results for hierarchical clustering using average linkage and k-means clustering on the flea beetles data, we could summarize the clusters like:

    hc(ave)    k-means 1    2    3
    1                  0    0   19
    2                 24    0    0
    3                  0   30    0
    4                  0    1    0

The mapping is: hc 1 <-> k-means 3, hc 2 <-> k-means 1, hc 3 <-> k-means 2. Rearranging the confusion table accordingly:

    hc(ave)    k-means 1    2    3
    2                 24    0    0
    3                  0   30    0
    1                  0    0   19
    4                  0    1    0
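The cross-tabulation and reordering can be sketched as follows (Python/NumPy, with made-up labelings; the simple argmax matching assumes each row has a distinct best-matching column):

```python
import numpy as np

def confusion(a, b):
    """Cross-tabulate two cluster labelings (labels assumed 1..max)."""
    t = np.zeros((a.max(), b.max()), dtype=int)
    for i, j in zip(a, b):
        t[i - 1, j - 1] += 1
    return t

# Hypothetical labelings of ten cases by two methods
hc = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
km = np.array([3, 3, 3, 1, 1, 1, 2, 2, 2, 1])
t = confusion(hc, km)
print(t)

# Reorder the columns so the large counts fall on the diagonal
order = t.argmax(axis=1)
print(t[:, order])
```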
This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/ licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.