
K-Means Clustering Example

Recall from the previous lecture that clustering enables unsupervised learning.
That is, the machine/software learns on its own from the data (the learning set)
and assigns each object to a particular class. For example, if our class (decision)
attribute is tumor Type and its values are malignant (dangerous), benign (mild), etc.,
these values are the classes. They will be represented by cluster 1, cluster 2, etc.
However, the class information is never provided to the algorithm. The class
information can be used later on to evaluate how accurately the algorithm
classified the objects.

     Curvature   Texture   Blood Consump   Tumor Type
x1   0.8         1.2       A               Benign
x2   0.75        1.4       B               Benign
x3   0.23        0.4       D               Malignant
x4   0.23        0.5       D               Malignant
.
.
.

(Learning set)
The way we do that is by plotting the objects from the database into space.
Each attribute is one dimension.

[Figure: objects such as x1 plotted in a 3-D space with axes Curvature, Texture, and Blood Consump]

After all the objects are plotted, we calculate the distances between them,
and the ones that are close to each other we group together, i.e. place them
in the same cluster.
[Figure: the same plot with nearby points grouped together; Cluster 1 = benign, Cluster 2 = malignant]

Recall how the K-Means algorithm works:

1. Choose k initial cluster centers (means), e.g. at random.
2. Assign each object to the cluster whose center is nearest.
3. Recompute each cluster center as the mean of the objects assigned to it.
4. Repeat steps 2-3 until the centers no longer change.
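The steps above can be sketched in a few lines of Python. This is my own minimal illustration (not part of the original handout), using the Manhattan distance that this example works with; points and centers are (x, y) tuples.

```python
# Minimal k-means sketch with Manhattan distance (illustrative only).

def manhattan(a, b):
    # |x2 - x1| + |y2 - y1|
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

def kmeans(points, centers, max_iter=100):
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # Step 4: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Calling `kmeans` with the eight points and three initial centers of the example below reproduces the iterations worked out by hand in this handout.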


Example

Problem: Cluster the following eight points (with (x, y) representing locations) into three
clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The distance function
between two points a = (x1, y1) and b = (x2, y2) is defined as:

    ρ(a, b) = |x2 - x1| + |y2 - y1|

Use the k-means algorithm to find the three cluster centers after the second iteration.

Solution:

Iteration 1

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10)
A2 (2, 5)
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)

First, we list all the points in the first column of the table above. The initial cluster
centers (means) are (2, 10), (5, 8) and (1, 2), chosen randomly. Next, we calculate the
distance from the first point (2, 10) to each of the three means, using the distance
function:

point      mean 1
(x1, y1)   (x2, y2)
(2, 10)    (2, 10)

ρ(a, b) = |x2 - x1| + |y2 - y1|

ρ(point, mean1) = |2 - 2| + |10 - 10|
                = 0 + 0
                = 0


point      mean 2
(x1, y1)   (x2, y2)
(2, 10)    (5, 8)

ρ(a, b) = |x2 - x1| + |y2 - y1|

ρ(point, mean2) = |5 - 2| + |8 - 10|
                = 3 + 2
                = 5

point      mean 3
(x1, y1)   (x2, y2)
(2, 10)    (1, 2)

ρ(a, b) = |x2 - x1| + |y2 - y1|

ρ(point, mean3) = |1 - 2| + |2 - 10|
                = 1 + 8
                = 9
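These three hand computations can be double-checked with a few lines of Python (my own sketch, not part of the original notes):

```python
def manhattan(a, b):
    # rho(a, b) = |x2 - x1| + |y2 - y1|
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

point = (2, 10)
means = [(2, 10), (5, 8), (1, 2)]

distances = [manhattan(point, m) for m in means]
print(distances)  # [0, 5, 9] -> A1 goes to cluster 1
```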

So, we fill in these values in the table:

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10) 0 5 9 1
A2 (2, 5)
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)

So, in which cluster should the point (2, 10) be placed? The one where the point has the
shortest distance to the mean, that is, mean 1 (cluster 1), since the distance is 0.


Cluster 1 Cluster 2 Cluster 3


(2, 10)

Next, we go to the second point (2, 5) and calculate its distance to each of the
three means, using the distance function:

point      mean 1
(x1, y1)   (x2, y2)
(2, 5)     (2, 10)

ρ(a, b) = |x2 - x1| + |y2 - y1|

ρ(point, mean1) = |2 - 2| + |10 - 5|
                = 0 + 5
                = 5

point      mean 2
(x1, y1)   (x2, y2)
(2, 5)     (5, 8)

ρ(a, b) = |x2 - x1| + |y2 - y1|

ρ(point, mean2) = |5 - 2| + |8 - 5|
                = 3 + 3
                = 6

point      mean 3
(x1, y1)   (x2, y2)
(2, 5)     (1, 2)

ρ(a, b) = |x2 - x1| + |y2 - y1|

ρ(point, mean3) = |1 - 2| + |2 - 5|
                = 1 + 3
                = 4


So, we fill in these values in the table:

Iteration 1

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10) 0 5 9 1
A2 (2, 5) 5 6 4 3
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)

So, in which cluster should the point (2, 5) be placed? The one where the point has the
shortest distance to the mean, that is, mean 3 (cluster 3), since the distance is 4.

Cluster 1 Cluster 2 Cluster 3


(2, 10) (2, 5)

Analogously, we fill in the rest of the table and place each point in one of the clusters:

Iteration 1

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
             (2, 10)       (5, 8)        (1, 2)
A1 (2, 10) 0 5 9 1
A2 (2, 5) 5 6 4 3
A3 (8, 4) 12 7 9 2
A4 (5, 8) 5 0 10 2
A5 (7, 5) 10 5 9 2
A6 (6, 4) 10 5 7 2
A7 (1, 2) 9 10 0 3
A8 (4, 9) 3 2 10 2
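The entire table can be reproduced programmatically. The sketch below (my own, added for illustration) recomputes every distance and cluster assignment:

```python
def manhattan(a, b):
    # rho(a, b) = |x2 - x1| + |y2 - y1|
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
means = [(2, 10), (5, 8), (1, 2)]

# For each point: distances to the three means and the 1-based cluster number.
assignments = {}
for name, p in points.items():
    dists = [manhattan(p, m) for m in means]
    assignments[name] = dists.index(min(dists)) + 1
    print(name, dists, "-> cluster", assignments[name])
```

Running this prints exactly the rows of the table above, e.g. `A3 [12, 7, 9] -> cluster 2`.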

Cluster 1 Cluster 2 Cluster 3


(2, 10) (8, 4) (2, 5)
(5, 8) (1, 2)
(7, 5)
(6, 4)
(4, 9)


Next, we need to re-compute the new cluster centers (means). We do so by taking the
mean of all points in each cluster.

For Cluster 1, we only have one point, A1(2, 10), which was the old mean, so the cluster
center remains the same.

For Cluster 2, we have ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)

For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)

In the accompanying plot, the initial cluster centers are shown as red dots and the new cluster centers as red x marks.
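The recomputed means can be verified with a short Python check (a sketch added here for illustration, not part of the original notes):

```python
# Recompute each cluster center as the mean of its member points.
cluster2 = [(8, 4), (5, 8), (7, 5), (6, 4), (4, 9)]
cluster3 = [(2, 5), (1, 2)]

def mean(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

print(mean(cluster2))  # (6.0, 6.0)
print(mean(cluster3))  # (1.5, 3.5)
```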


That was Iteration 1 (epoch 1). Next, we go to Iteration 2 (epoch 2), Iteration 3, and so on,
until the means do not change anymore.
In Iteration 2, we basically repeat the process from Iteration 1, this time using the new
means we computed.
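Carrying out Iteration 2 with the new means gives the centers the problem asks for. The sketch below (my own, added as a check) repeats the assignment and update steps:

```python
# Iteration 2: repeat the assignment and update steps with the new means.

def manhattan(a, b):
    # rho(a, b) = |x2 - x1| + |y2 - y1|
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
means = [(2, 10), (6, 6), (1.5, 3.5)]   # centers after Iteration 1

# Assignment step of Iteration 2.
clusters = [[] for _ in means]
for p in points:
    dists = [manhattan(p, m) for m in means]
    clusters[dists.index(min(dists))].append(p)

# Update step: the cluster centers after Iteration 2.
new_centers = [
    (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
    for c in clusters
]
print(new_centers)  # [(3.0, 9.5), (6.5, 5.25), (1.5, 3.5)]
```

So, assuming the arithmetic above, the three cluster centers after the second iteration are (3, 9.5), (6.5, 5.25) and (1.5, 3.5).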

K-Means Clustering Algorithm

Definition:

This nonhierarchical method initially takes the number of components of
the population equal to the final required number of clusters. In this step
itself, the final required number of clusters is chosen such that the points
are mutually farthest apart. Next, it examines each component in the
population and assigns it to one of the clusters depending on the
minimum distance. The centroid's position is recalculated every time a
component is added to the cluster, and this continues until all the
components are grouped into the final required number of clusters.


Example:

Let us apply the k-means clustering algorithm to the following example and
obtain four clusters:
Food item # Protein content, P Fat content, F
Food item #1 1.1 60
Food item #2 8.2 20
Food item #3 4.2 35
Food item #4 1.5 21
Food item #5 7.6 15
Food item #6 2.0 55
Food item #7 3.9 39

Let us plot these points so that we can have a better understanding of the
problem. Also, we can select the four points which are farthest apart.

We see from the graph that the distances between points 1 and 2, 1
and 3, 1 and 4, 1 and 5, 2 and 3, 2 and 4, and 3 and 4 are the largest.

Thus, the four clusters chosen are:


Cluster number Protein content, P Fat content, F


C1 1.1 60
C2 8.2 20
C3 4.2 35
C4 1.5 21

Also, we observe that point 1 is close to point 6, so both can be taken as
one cluster, called the C16 cluster. The value of P for the C16 centroid is
(1.1 + 2.0)/2 = 1.55 and the value of F is (60 + 55)/2 = 57.50.

Upon closer observation, point 2 can be merged with point 5, giving the
C25 cluster. The value of P for the C25 centroid is (8.2 + 7.6)/2 = 7.9 and
the value of F is (20 + 15)/2 = 17.50.

Point 3 is close to point 7, so they can be merged into the C37 cluster. The
value of P for the C37 centroid is (4.2 + 3.9)/2 = 4.05 and the value of F is
(35 + 39)/2 = 37.

Point 4 is not close to any other point, so it is assigned to cluster number
4, i.e. C4, with the value of P for the C4 centroid as 1.5 and the value of F
as 21.

Finally, four clusters with four centroids have been obtained.


Cluster number Protein content, P Fat content, F
C16 1.55 57.50
C25 7.9 17.5
C37 4.05 37
C4 1.5 21
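The centroid arithmetic above can be checked with a small script (illustrative only; the `centroid` helper is my own):

```python
# (P, F) values of the seven food items.
items = {1: (1.1, 60), 2: (8.2, 20), 3: (4.2, 35), 4: (1.5, 21),
         5: (7.6, 15), 6: (2.0, 55), 7: (3.9, 39)}

def centroid(*ids):
    # Average the (P, F) values of the merged points.
    pts = [items[i] for i in ids]
    n = len(pts)
    return (round(sum(p for p, _ in pts) / n, 2),
            round(sum(f for _, f in pts) / n, 2))

print(centroid(1, 6))  # (1.55, 57.5)
print(centroid(2, 5))  # (7.9, 17.5)
print(centroid(3, 7))  # (4.05, 37.0)
```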

In the above example it was quite easy to estimate the distances between the
points. In cases where it is more difficult to estimate the distance, one has to
use the Euclidean metric to measure the distance between two points in order
to assign a point to a cluster.
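As a side note, the Euclidean metric mentioned here can be computed as follows (a small illustrative sketch, applied to two of the food items above):

```python
import math

def euclidean(a, b):
    # d(a, b) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

# Distance between food item #1 (1.1, 60) and food item #6 (2.0, 55).
print(round(euclidean((1.1, 60), (2.0, 55)), 3))
```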

