Clustering
Supervised learning
[Figure: examples paired with labels (label1, label3, label4, label5) are used to build a model/predictor]
Unsupervised learning
Unsupervised learning: Clustering
K-Means clustering
An iterative clustering algorithm
Initialize: Pick K random points as cluster centers
Repeat:
1. Assign each data point to its closest cluster center
2. Move each cluster center to the average of its assigned points
Stop when no point assignments change
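The loop above can be sketched in a few lines of plain Python. This is a minimal sketch, not a reference implementation: the function name `kmeans`, the `seed` and `max_iters` parameters, the choice to initialize from data points, and the rule that an empty cluster keeps its old center are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, seed=0, max_iters=100):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialize: pick K distinct data points as the starting centers.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Step 1: assign every point to its closest center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Stop when no point assignment changed.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2: move each center to the average of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments


points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0),
          (101.0, 101.0), (102.0, 102.0), (103.0, 103.0)]
centers, assignments = kmeans(points, k=2)
print(sorted(centers))  # [(2.0, 2.0), (102.0, 102.0)]
```

The two groups are far apart, so any pair of starting points ends with one center at the mean of each group; only the order of the centers depends on the seed.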
K-means: an example
K-means: Initialize centers randomly
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
No changes: Done
K-means
Iterate:
Assign/cluster each example to closest center
Recalculate centers as the mean of the points in a cluster
How do we do this?
K-means
Iterate:
Assign/cluster each example to closest center
iterate over each point:
- get distance to each cluster center
- assign to closest center (hard cluster)
Recalculate centers as the mean of the points in a cluster
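The assignment step can be written almost line for line from the bullets above. A small sketch, assuming Euclidean distance (the slides do not name the distance measure) and a hypothetical helper name `assign_to_closest`:

```python
import math


def assign_to_closest(points, centers):
    """Return, for each point, the index of its nearest center (hard assignment)."""
    assignments = []
    for p in points:                                  # iterate over each point
        dists = [math.dist(p, c) for c in centers]    # distance to each cluster center
        assignments.append(dists.index(min(dists)))   # closest center; ties go to the lower index
    return assignments


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(1.0, 1.0), (9.0, 1.0)]
assignments = assign_to_closest(points, centers)
print(assignments)  # [0, 0, 1, 1]
```

"Hard" here means every point belongs to exactly one cluster, with no partial membership.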
Medicine  X  Y
A         1  1
B         2  1
C         4  3
D         5  4
Iteration-1
Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids, C-1 = (1, 1) and C-2 = (2, 1).

Medicine  X  Y  Dist-C-1  Dist-C-2  Cluster
A         1  1  0         1         C-1
B         2  1  1         0         C-2
C         4  3  3.61      2.83      C-2
D         5  4  5         4.24      C-2
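The distances in the Iteration-1 table can be checked with a short script. This sketch assumes Euclidean distance, which matches the tabulated values, and breaks ties in favor of C-1 (no tie occurs for this data).

```python
import math

medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = {"C-1": medicines["A"], "C-2": medicines["B"]}  # initial centroids: A and B

rows = []
for name, xy in medicines.items():
    d1 = math.dist(xy, centroids["C-1"])
    d2 = math.dist(xy, centroids["C-2"])
    cluster = "C-1" if d1 <= d2 else "C-2"  # assumption: a tie would go to C-1
    rows.append((name, round(d1, 2), round(d2, 2), cluster))
    print(name, xy, f"{d1:.2f}", f"{d2:.2f}", cluster)
```

For example, the distance from C to C-1 is sqrt((4 - 1)^2 + (3 - 1)^2) = sqrt(13), about 3.61.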
Recompute centroids: C-1 = (1, 1); C-2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) ≈ (3.67, 2.67)
Iteration-2
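Reassign points using the recomputed centroids: B is now closer to C-1, so C-1 = {A, B} and C-2 = {C, D}.
Recompute centroids: C-1 = (1.5, 1); C-2 = (4.5, 3.5)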
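The whole medicine example can be replayed in code, starting from A and B as centroids and stopping when no assignment changes. A minimal sketch, assuming Euclidean distance; the variable names (`history`, `clusters`) are chosen for illustration only.

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [points["A"], points["B"]]  # Iteration-1 starts from A and B
clusters = None
history = []

while True:
    # Assign each medicine to the nearer centroid (0 means C-1, 1 means C-2).
    new_clusters = {
        name: min((0, 1), key=lambda k: math.dist(xy, centroids[k]))
        for name, xy in points.items()
    }
    if new_clusters == clusters:
        break  # no assignment changed: done
    clusters = new_clusters
    # Recompute each centroid as the mean of its members.
    for k in (0, 1):
        members = [points[n] for n, c in clusters.items() if c == k]
        centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    history.append((dict(clusters), list(centroids)))

for i, (cl, cen) in enumerate(history, start=1):
    print(f"Iteration-{i}:", cl, [tuple(round(v, 2) for v in c) for c in cen])
```

Two iterations change the clustering: the first gives centroids (1, 1) and about (3.67, 2.67), the second moves B into C-1 and gives (1.5, 1) and (4.5, 3.5). A third assignment pass changes nothing, so the loop stops.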
K-means variations/parameters
Initial (seed) cluster centers
Convergence
A fixed number of iterations
Partitions unchanged
Cluster centers don't change
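The three convergence tests can be combined into one check that runs at the end of each iteration. This is only a sketch: the function name `should_stop` and the defaults `max_iters=100` and `tol=1e-6` are assumptions, and "centers don't change" is interpreted as "no center moved more than `tol`".

```python
import math


def should_stop(iteration, old_assign, new_assign, old_centers, new_centers,
                max_iters=100, tol=1e-6):
    """Return the name of the first stopping rule that fires, or None."""
    if iteration >= max_iters:
        return "fixed number of iterations"
    if old_assign is not None and old_assign == new_assign:
        return "partitions unchanged"
    # Largest distance any center moved in this iteration.
    shift = max(math.dist(a, b) for a, b in zip(old_centers, new_centers))
    if shift <= tol:
        return "cluster centers don't change"
    return None


moved = ([(0.0, 0.0)], [(0.5, 0.0)])
still = ([(0.0, 0.0)], [(0.0, 0.0)])
print(should_stop(3, [0, 0, 1], [0, 1, 1], *moved))    # None
print(should_stop(3, [0, 1, 1], [0, 1, 1], *moved))    # partitions unchanged
print(should_stop(3, [0, 0, 1], [0, 1, 1], *still))    # cluster centers don't change
print(should_stop(100, [0, 0, 1], [0, 1, 1], *moved))  # fixed number of iterations
```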
K-means: Initialize centers randomly
Common choices
Random point in feature space
Random point from dataset
Points least similar to any existing center (furthest centers heuristic)
Try out multiple starting points
Furthest centers heuristic
μ_1 = random point from the dataset
for i = 2 to K:
    μ_i = point that is furthest from all previous centers
K-means: Initialize furthest from centers
μ_1 = random point from the dataset
for k = 2 to K:
    for i = 1 to N:
        s_i = min d(x_i, μ_1, ..., μ_(k-1)) // smallest distance to any center
    μ_k = the point x_i with the largest s_i // furthest from all chosen centers
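The furthest centers heuristic translates directly into Python. A minimal sketch, assuming Euclidean distance, a random data point as the first center, and ties broken by list order; the function name `furthest_first_centers` is hypothetical.

```python
import math
import random


def furthest_first_centers(points, k, seed=0):
    """Pick k centers: one at random, then repeatedly the furthest point."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: a random data point
    while len(centers) < k:
        # s_i = smallest distance from x_i to any center chosen so far
        s = [min(math.dist(x, c) for c in centers) for x in points]
        # next center: the point whose s_i is largest
        centers.append(points[s.index(max(s))])
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 20.0)]
centers = furthest_first_centers(points, k=3)
print(centers)
```

Whatever the first pick is, this data yields one center on the left (x = 0), one on the right (x = 10), and the isolated point (5, 20). That last pick also shows the weakness of the heuristic: it is drawn toward outliers.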
Cons
Need to know K
Problems when clusters have different sizes or densities
Can't handle outliers well
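The outlier problem follows from using the mean as the cluster center: a single distant point shifts the average a long way. A tiny illustration with made-up one-dimensional values:

```python
def mean(values):
    """Arithmetic mean, as used when recalculating a cluster center."""
    return sum(values) / len(values)


cluster = [1.0, 2.0, 3.0]
with_outlier = cluster + [100.0]  # one far-away point joins the cluster

print(mean(cluster))        # 2.0
print(mean(with_outlier))   # 26.5
```

The center moves from 2.0 to 26.5, far from every ordinary member of the cluster.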
Summary
Definition of clustering
Difference between supervised and unsupervised learning.
Finding labels for each datum.
Clustering algorithms
K-means
There are always K clusters.
Find new mean value.
Find new clusters.
Stop when nothing changes in clusters (or changes are less than a very small value).