
Data Mining and Analytics:

Clustering
Anirban Mondal
anirban.mondal@snu.edu.in

What we are going to cover


What is clustering and why do we need to do
clustering?
Important clustering algorithms
Finally, you will be given different application
scenarios and will need to figure out which clustering
algorithm is suitable for a given application scenario
Note: This requires you to understand not only the
steps of each algorithm, but also its limitations,
assumptions, applicability, etc.

What is clustering and why do we need to do clustering?

Clustering
Clustering is about grouping similar objects/items together
Within a cluster, items are more similar to each other than
to items outside the cluster

What does similar mean?


Objects are similar in some way, i.e., based on some
property
If you change the property, the resulting clusters may
change too
Similarity is captured by a notion of distance
Two or more objects belong to the same cluster if they are
close according to a given distance
Dissimilarity/similarity metric: similarity is expressed in
terms of a distance function

Clustering
Clustering is one of the most important
unsupervised learning processes
Clustering finds structures in a collection of
unlabeled data
A separate quality function measures how
good the clustering is

Clustering
Input: a collection of n objects each
represented by a vector
Objective: to divide these n objects into k
clusters so that similar objects are grouped
together
In real-world settings, k is usually unknown
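As a minimal sketch, the three people from the example that
follows could each be encoded as one numeric vector
(Python/NumPy; the variable name and choice of attributes
are illustrative):

import numpy as np

# Each person is represented as a vector of numeric attributes:
# [age, weight in kg, salary in $K]
people = np.array([
    [52, 70, 150],   # John
    [25, 73,  40],   # Jack
    [35, 50,  90],   # Alice
])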

Example

Assume three people: John, Jack, and Alice


City of residence: John (NY), Jack (LA), Alice (NY)
Age: John (52), Jack (25), Alice (35)
Weight: John (70 kg), Jack (73 kg), Alice (50 kg)
Salary: John ($150 K), Jack ($40 K), Alice ($90 K)
Interests: John (music), Jack (karate), Alice (music)

Based on city of residence, what should be the clusters?


{John, Alice} and {Jack}

Example (continued)

Using the same data for John, Jack, and Alice:

Based on age, what should be the clusters?
{John}, {Alice}, {Jack}

Based on weight, what should be the clusters?
{John, Jack} and {Alice}

Based on salary, what should be the clusters?
{John}, {Jack}, {Alice}

Based on interests, what should be the clusters?
{John, Alice} and {Jack}

Learning point: The resulting clusters depend on
the property on which you do the clustering

The property on which you cluster is called the dimension
In the example, the dimensions were interests, salary, age,
city of residence, etc.

More learning points


The results of clustering are generally
application-dependent and depend on how you
intend to use those results
Example: refer once again to the age data
Age: John (52), Jack (25), Alice (35)
Here, all three people could be in different market
segments
OR {Jack, Alice} could form a cluster, since their ages
(25 and 35) are relatively close
It all depends on the purpose of the clustering (in
this case, possibly targeted marketing)

Think scalability
In this example, you could do the clustering
manually because the dataset was very small
What if you had to cluster 1 million people, or even
just 10,000, based on any one dimension such as
age range or interests?
Clustering algorithms are needed to achieve this

The K-Means Clustering algorithm

The K-means clustering algorithm

Inputs: the number K of clusters and the dataset to cluster
Step 1: Randomly select K points as the initial cluster
centres/means
Step 2: Assign each point in the dataset to the closest
cluster (each point is assigned to only one cluster), based
upon the Euclidean distance between the point and each
cluster centre (minimum distance)
Step 3: Recompute each cluster centre as the average of
the points in that cluster
Repeat Steps 2 and 3 until the clusters stabilize/converge
Stabilize usually means that when Steps 2 and 3 are
repeated, no changes occur in the clustering results
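A minimal NumPy sketch of these three steps (the function
name, the empty-cluster guard, and the convergence test are
illustrative choices, not from the slides):

import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    # Step 1: randomly select k points as the initial cluster centres
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Step 2: assign each point to the closest centre (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centre as the mean of its assigned points
        # (an empty cluster keeps its old centre in this sketch)
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Stop when the centres (and hence the assignments) no longer change
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres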

Example with K = 2

X and Y are the two attributes on which we want to do the
clustering. The dataset (coordinates inferred from the
worked computations below):

ID   X   Y
1    1   1
2    2   1
3    4   3
4    5   4

Example with K = 2

Points 1 (red) and 2 (blue) are randomly selected as the
two initial cluster centres

Example

Compute the Euclidean distance of point 3 from both
point 1 and point 2

Example

d(p1, p3) = SQRT((4-1)^2 + (3-1)^2) = SQRT(13) ≈ 3.61
d(p2, p3) = SQRT((4-2)^2 + (3-1)^2) = SQRT(8) ≈ 2.83
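These two distances can be verified in a couple of lines
(coordinates taken from the formulas above):

import numpy as np

p1, p2, p3 = np.array([1, 1]), np.array([2, 1]), np.array([4, 3])
print(np.linalg.norm(p3 - p1))   # sqrt(13) ≈ 3.606
print(np.linalg.norm(p3 - p2))   # sqrt(8)  ≈ 2.828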

Example

Point 3 falls into the blue cluster because its distance to
p2 (SQRT(8)) is less than its distance to p1 (SQRT(13))

Example

Similarly, point 4 falls into the blue cluster because its
distance to p2 is less than its distance to p1

Example

Now recompute each cluster's centre

Example

The red cluster's centre is the same as p1 because the red
cluster has only one point. In the original diagram, the
centres are indicated by a cross symbol

Example

The blue cluster's centre is the average of the three blue
points:
Centre = ((2+4+5)/3, (1+3+4)/3), i.e., (11/3, 8/3) ≈ (3.67, 2.67)
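The same centre computation in NumPy (using the blue points
p2, p3, and p4):

import numpy as np

blue = np.array([[2, 1], [4, 3], [5, 4]])   # points p2, p3, p4
print(blue.mean(axis=0))                    # ≈ [3.667 2.667], i.e., (11/3, 8/3)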

Example

Now repeat Step 2 and Step 3:
Step 2: Assign each point in the dataset to the closest
cluster. Note that p2 is now closer to the red centre (1, 1)
than to the blue centre (11/3, 8/3), so p2 moves to the red
cluster

Example

Step 3: Recompute each cluster centre as the average of the
points in that cluster. The red centre becomes (1.5, 1) and
the blue centre becomes (4.5, 3.5)

Example

Now repeat Steps 2 and 3 again
What do you see?

Example

What do you see?
The clustering does not change, hence the algorithm
terminates (stabilizes)

Final result: {p1, p2} AND {p3, p4}
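Running the earlier k_means sketch on the four example
points reproduces this result (the exact label values
depend on the random initialization):

import numpy as np

points = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])   # p1..p4
labels, centres = k_means(points, k=2)
print(labels)    # e.g., [0 0 1 1] -> clusters {p1, p2} and {p3, p4}
print(centres)   # [[1.5 1. ] [4.5 3.5]]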

Pros and Cons of the K-means clustering algorithm

Pros
Simple
Converges to a local optimum

Cons
The number K of clusters needs to be provided as an input, hence K
needs to be decided in advance
When the dataset is relatively small, the initial clustering assignment
has a significant influence on the final clustering results
The same dataset can produce different clusters, depending upon the
order of input
Each attribute is given the same weightage, hence we cannot figure
out how much each attribute contributes to the clustering process
Observe that the algorithm essentially uses the average (arithmetic mean)
The arithmetic mean does not work well with outliers (the median can
be used if the outlier issue is significant)

Manhattan distance as a similarity measure

In the example for the K-means algorithm, we used the
Euclidean distance as the similarity measure
However, other distance measures, such as the Manhattan
distance, can also be used
The formula for the Manhattan distance between two points
p and q in n dimensions is:
d(p, q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|

Geometric explanation of Manhattan distance

The distance travelled from one point to another if you
follow a grid-like path, i.e., move only along axis-aligned
directions
Contrast this with the Euclidean distance, which measures
the ordinary straight-line distance between two points, as
if measured with a ruler
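A quick comparison of the two measures on points p1 and p3
from the earlier example (a sketch using plain NumPy):

import numpy as np

p1, p3 = np.array([1, 1]), np.array([4, 3])
euclidean = np.linalg.norm(p3 - p1)   # straight-line: sqrt(3^2 + 2^2) ≈ 3.606
manhattan = np.abs(p3 - p1).sum()     # grid path: |3| + |2| = 5
print(euclidean, manhattan)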

How good is the clustering?

One possible measure of how well the cluster centres
represent the members of their clusters is the residual
sum of squares (RSS)
RSS = the squared distance of each vector from its cluster
centre, summed over all vectors

Observe that the overall aim of the K-means algorithm is
to minimize the RSS
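RSS can be computed directly from the assignments and
centres; a sketch reusing the names from the earlier
k_means function:

import numpy as np

def rss(points, labels, centres):
    # Squared distance of each point from its cluster centre,
    # summed over all points
    return sum(
        np.sum((points[labels == j] - centres[j]) ** 2)
        for j in range(len(centres))
    )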

Termination conditions for the K-means algorithm

Stop after a given number of iterations
Pros: shorter execution time
Cons: low quality of clustering if the number of
iterations is inadequate

Stop when there are no changes in the clustering between
iterations
Pros: good quality of clustering (except when the initial
clusters are very bad)
Cons: runtimes may be way too long!

Stop when the RSS falls below some threshold
To avoid unreasonably long runtimes, you may want to put
a cap on the number of iterations

How to determine better initial clusters?

As we have seen, the K-means algorithm is very sensitive
to the initial clustering assignment
One heuristic: spread the K initial cluster centres as far
apart as you can

NOTE: In practice, the algorithm is usually very fast,
hence it is common to execute it multiple times with
different initial clustering assignments, and also with
different values of K, to see which gives the most
desirable clustering result
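One common pattern, then, is to run K-means several times
with different random initializations (and possibly
different values of K) and keep the run with the lowest
RSS; a sketch reusing the earlier k_means and rss helpers:

best = None
for seed in range(10):   # 10 random restarts
    labels, centres = k_means(points, k=2, seed=seed)
    score = rss(points, labels, centres)
    if best is None or score < best[0]:
        best = (score, labels, centres)
print(best[0])   # lowest RSS found across the restarts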
