
Cluster Analysis

SUSHIL KULKARNI
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
What is a Cluster?
 Given a set of data points, each having a
set of attributes, and a similarity measure
among them, find clusters such that
 –Data points in one cluster are more similar to
one another.
 –Data points in separate clusters are less
similar to one another.

 Similarity Measures:
 –Euclidean Distance if attributes are
continuous.
 –Other Problem-specific Measures.

Outliers

 Outliers are objects that do not belong to any cluster, or that form clusters
 of very small cardinality

[Figure: a scatter of points with one dense cluster and a few isolated points labelled as outliers.]

 In some applications we are interested in discovering outliers, not clusters
 (outlier analysis)
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering
Methods
 Partitioning Methods
 Hierarchical Methods

Data Structures

 Data matrix (two modes): n tuples/objects by p attributes/dimensions;
 the "classic" data input

   \begin{bmatrix}
   x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
   \vdots &        & \vdots &        & \vdots \\
   x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
   \vdots &        & \vdots &        & \vdots \\
   x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
   \end{bmatrix}

 Dissimilarity (distance) matrix (one mode): n objects by n objects;
 the desired data input to some clustering algorithms

   \begin{bmatrix}
   0      &        &        &        &   \\
   d(2,1) & 0      &        &        &   \\
   d(3,1) & d(3,2) & 0      &        &   \\
   \vdots & \vdots & \vdots & \ddots &   \\
   d(n,1) & d(n,2) & \cdots & \cdots & 0
   \end{bmatrix}
Measuring Similarity in Clustering

 Dissimilarity/Similarity metric:

 The dissimilarity d(i, j) between two objects i and j is expressed in terms of
 a distance function, which is typically a metric:

  d(i, j) ≥ 0                       (non-negativity)
  d(i, i) = 0                       (isolation)
  d(i, j) = d(j, i)                 (symmetry)
  d(i, j) ≤ d(i, h) + d(h, j)       (triangle inequality)
Type of data in cluster analysis

 Interval-scaled variables
 e.g., salary, height
 Binary variables
 e.g., gender (M/F), has_cancer(T/F)
 Nominal (categorical) variables
 e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

Similarity and Dissimilarity
Between Objects

 Distance metrics are normally used to measure the similarity or dissimilarity
 between two data objects
Similarity and Dissimilarity Between
Objects (Cont.)
 Euclidean distance:
   d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2}
 Properties

 d(i,j) ≥0
 d(i,i) =0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
 Also one can use weighted distance:
   d(i, j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2}
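
As a concrete illustration (not part of the original slides), both formulas can
be written directly in Python; the example points and weights below are made up:

import math

def euclidean(x, y):
    # Euclidean distance between two equal-length numeric vectors
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def weighted_euclidean(x, y, w):
    # weighted Euclidean distance, one weight per attribute
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

print(euclidean((2, 6), (7, 4)))                       # 5.385...
print(weighted_euclidean((2, 6), (7, 4), (1.0, 0.5)))  # weights are arbitrary example values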
Binary Variables
 A binary variable has two states: 0 absent, 1 present
 A contingency table for binary data (objects i and j, p binary attributes):

                       object j
                     1      0     sum
   object i    1     a      b     a+b
               0     c      d     c+d
              sum   a+c    b+d     p

 Jaccard coefficient distance (noninvariant if the binary variable is
 asymmetric):

   d(i, j) = \frac{b + c}{a + b + c}
Binary Variables

 A 1-1 match is a stronger indicator of similarity than a 0-0 match
 It may therefore be reasonable to ignore 0-0 matches
 The simple matching coefficient gives equal weight to 1-1 and 0-0 matches
 The Jaccard coefficient ignores 0-0 matches
Dissimilarity between Binary
Variables
 Example (Jaccard coefficient)

   Name    Fever  Cough  Test-1  Test-2  Test-3  Test-4
   Tina      1      0      1       0       0       0
   Dina      1      0      1       0       1       0
   Meena     1      1      0       0       0       0
Dissimilarity between Binary
Variables
Name Fever Cough Test-1 Test-2 Test-3 Test-4
Tina 1 0 1 0 0 0
Dina 1 0 1 0 1 0

Consider the pair Tina and Dina:

Number of attributes where Tina is 1 and Dina is 1:  a = 2  (Fever, Test-1)
Number of attributes where Tina is 1 and Dina is 0:  b = 0
Number of attributes where Tina is 0 and Dina is 1:  c = 1  (Test-3)
Number of attributes where both are 0:               d = 3  (Cough, Test-2, Test-4)
Dissimilarity between Binary
Variables
Using the Jaccard coefficient distance d(i, j) = \frac{b + c}{a + b + c}
(with a = 2, b = 0, c = 1, d = 3 for Tina and Dina) we get:

   d(Tina, Dina)  = \frac{0 + 1}{2 + 0 + 1} = 0.33
   d(Tina, Meena) = \frac{1 + 1}{1 + 1 + 1} = 0.67
   d(Meena, Dina) = \frac{1 + 2}{1 + 1 + 2} = 0.75

Since d(i, j) measures dissimilarity, the smallest value indicates the most
similar pair: Tina and Dina are the most similar, Meena and Dina are the least
similar, and the remaining pair lies between these extremes.
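
A small sketch (ours, not from the slides) that recomputes the three Jaccard
dissimilarities above from the symptom table:

def jaccard_dissimilarity(x, y):
    # d = (b + c) / (a + b + c) for two 0/1 vectors; 0-0 matches are ignored
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4
tina  = (1, 0, 1, 0, 0, 0)
dina  = (1, 0, 1, 0, 1, 0)
meena = (1, 1, 0, 0, 0, 0)

print(round(jaccard_dissimilarity(tina, dina), 2))   # 0.33
print(round(jaccard_dissimilarity(tina, meena), 2))  # 0.67
print(round(jaccard_dissimilarity(meena, dina), 2))  # 0.75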
Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering
Methods
 Partitioning Methods
 Hierarchical Methods

Major Clustering Approaches

 Partitioning algorithms: Construct random partitions and then iteratively
 refine them by some criterion
 Hierarchical algorithms: Create a hierarchical decomposition of the set of
 data (or objects) using some criterion
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods

Partitioning Algorithms: Basic
Concepts

 Partitioning method: Construct a partition of a database D of n objects into
 a set of k clusters
 Methods: k-means and k-medoids algorithms
  k-means (MacQueen'67): Each cluster is represented by the center of the
  cluster
  k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87):
  Each cluster is represented by one of the objects in the cluster
Centroid or Medoid

[Figure: a cluster of points with its centroid (the mean point) and its medoid (a representative object of the cluster) marked.]
The k-means Clustering Method
 Given k, the k-means algorithm is
implemented in 4 steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of
the clusters of the current partition. The
centroid is the center (mean point) of the
cluster.
 Assign each object to the cluster with the
nearest seed point.
 Go back to step 2; stop when assignments no longer change.
K-Means Example
 Given: {2,4,10,12,3,20,30,11,25}
 Assume that we want two clusters.
 Write the elements in increasing order
{2,3,4,10,11,12,20,25,30}
 Randomly assign means: m1=3,m2=4
 K1={2,3}, K2={4,10,11,12,20,25,30}
 Means are m1=2.5,m2=16
 K1={2,3,4},K2={10,11,12,20,25,30}
 Means are m1=3,m2=18
 K1={2,3,4,10},K2={11,12,20,25,30}
 Means are m1=4.75,m2=19.6
 K1={2,3,4,10,11,12},K2={20,25,30}
 Means are m1=7,m2=25
 Stop, since with these means no further reassignments (i.e. jumps from K2 to
 K1) are possible.
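
A minimal sketch of this 1-D trace in Python, assuming the same data and the
same starting means m1 = 3 and m2 = 4 (the function name is ours, and empty
clusters are not handled):

def kmeans_1d(points, means, max_iter=100):
    # plain k-means on 1-D data: assign to nearest mean, recompute means, repeat
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:          # no assignment can change any more
            return clusters, means
        means = new_means
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, [3, 4])
print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)      # [7.0, 25.0]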
The k-means Clustering Method

 Example

[Figure: four scatter plots (axes 0 to 10) showing successive k-means iterations on a 2-D data set as the cluster means move and points are reassigned.]
Comments on the k-means Method

 Applicable only when a mean is defined; what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
The K-Medoids Clustering Method

 A medoid can be defined as the object of a cluster whose average dissimilarity
 to all the objects in the cluster is minimal, i.e. it is the most centrally
 located point of the cluster.
The K-Medoids Clustering Algorithm

A medoid (PAM-style) clustering algorithm is as follows (a sketch in code
follows below):

 1. The algorithm begins by arbitrarily selecting k of the n data points as
    medoids (n > k)
 2. After the k medoids are selected, associate each data object in the data
    set with its most similar medoid; similarity is defined using a distance
    measure, e.g. Euclidean distance
 3. Randomly select a non-medoid object O′
 4. Compute the total cost S of swapping an initial medoid object with O′
 5. If S < 0, swap the initial medoid with the new one (this gives a new set
    of medoids)
 6. Repeat steps 2 to 5 until there is no change in the medoids
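
A compact sketch of the swap loop above, assuming Manhattan distance as the
cost (as in the worked example that follows); the function names are ours, and
this version always takes the best improving swap rather than a random one:

import itertools, random

def manhattan(p, q):
    # L1 distance, as used in the worked example below
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # sum of each point's distance to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)               # step 1: arbitrary initial medoids
    while True:
        best = total_cost(points, medoids)
        best_swap = None
        # steps 3-4: try swapping each medoid with each non-medoid object O'
        for m, o in itertools.product(list(medoids), points):
            if o in medoids:
                continue
            trial = [o if x == m else x for x in medoids]
            cost = total_cost(points, trial)
            if cost < best:                       # step 5: S = cost - best < 0
                best, best_swap = cost, trial
        if best_swap is None:                     # step 6: no improving swap left
            return medoids, best
        medoids = best_swap

data = [(2,6),(3,4),(3,8),(4,7),(6,2),(6,4),(7,3),(7,4),(8,5),(7,6)]
print(k_medoids(data, 2))   # a local optimum; the slides' choice {(3,4),(7,4)} costs 20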
The K-Medoids Clustering Method
 Cluster the following data set of ten objects into two clusters, i.e. k = 2.
 Consider a data set of ten objects as follows (two numeric attributes):

   X1   2  6
   X2   3  4
   X3   3  8
   X4   4  7
   X5   6  2
   X6   6  4
   X7   7  3
   X8   7  4
   X9   8  5
   X10  7  6
The K-Medoids Clustering Method

[Figure: the ten objects X1 to X10 plotted as points in the plane.]
The K-Medoids Clustering Method

Step 1

 Initialise the k centres
 Let us assume c1 = (3,4) and c2 = (7,4)
 So here c1 and c2 are selected as the medoids
 Calculate the distances so as to associate each data object with its nearest
 medoid. The cost is calculated using the Minkowski distance metric (here with
 p = 1, i.e. the Manhattan distance).
Similarity and Dissimilarity
Between Objects

 The most popular distance measures conform to the Minkowski distance:

   L_p(i, j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}

 where i = (x_{i1}, x_{i2}, ..., x_{in}) and j = (x_{j1}, x_{j2}, ..., x_{jn})
 are two n-dimensional data objects, and p is a positive integer
Similarity and Dissimilarity
Between Objects

 If p = 1, then L_1 is the Manhattan distance:

   L_1(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|
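
A small helper (name ours, not from the slides) that implements the Minkowski
distance for any p; p = 1 gives the Manhattan/L1 cost used in the K-medoids
example, p = 2 gives the Euclidean distance:

def minkowski(x, y, p):
    # L_p distance between two equal-length numeric vectors
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

print(minkowski((3, 4), (2, 6), 1))   # 3.0 -> matches the first cost in the table below
print(minkowski((3, 4), (2, 6), 2))   # 2.236...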
The K-Medoids Clustering Method
   c1     Data object  Cost        c2     Data object  Cost
          (Xi)         (distance)         (Xi)         (distance)

   (3,4)  (2,6)        3           (7,4)  (2,6)        7
   (3,4)  (3,8)        4           (7,4)  (3,8)        8
   (3,4)  (4,7)        4           (7,4)  (4,7)        6
   (3,4)  (6,2)        5           (7,4)  (6,2)        3
   (3,4)  (6,4)        3           (7,4)  (6,4)        1
   (3,4)  (7,3)        5           (7,4)  (7,3)        1
   (3,4)  (8,5)        6           (7,4)  (8,5)        2
   (3,4)  (7,6)        6           (7,4)  (7,6)        2
The K-Medoids Clustering Method
 So the clusters become:

   Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
   Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

 Since the points (2,6), (3,8) and (4,7) are closer to c1, they form one
 cluster, while the remaining points form the other cluster.

 The cost between any data object and a medoid is found using the formula

   cost(x, c) = \sum_{i=1}^{d} |x_i - c_i|

 where x is any data object, c is the medoid, and d is the dimension of the
 objects, which in this case is 2.

 The total cost is the sum of the costs of the data objects from the medoid of
 their cluster, so here:
The K-Medoids Clustering Method

 Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
 Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

 Total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))}
            + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3))
               + cost((7,4),(8,5)) + cost((7,4),(7,6))}
            = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2
            = 20
The K-Medoids Clustering Method

[Figure: the clusters after step 1, around the medoids c1 = (3,4) and c2 = (7,4).]
The K-Medoids Clustering Method
Step 2

 Select a non-medoid object O′ at random
 Let us assume O′ = (7,3)
 So now the candidate medoids are c1 = (3,4) and O′ = (7,3)
 With c1 and O′ as the new medoids, calculate the total cost involved using
 the formula from step 1
The K-Medoids Clustering Method
   c1     Data object  Cost        O′     Data object  Cost
          (Xi)         (distance)         (Xi)         (distance)

   (3,4)  (2,6)        3           (7,3)  (2,6)        8
   (3,4)  (3,8)        4           (7,3)  (3,8)        9
   (3,4)  (4,7)        4           (7,3)  (4,7)        7
   (3,4)  (6,2)        5           (7,3)  (6,2)        2
   (3,4)  (6,4)        3           (7,3)  (6,4)        2
   (3,4)  (7,4)        4           (7,3)  (7,4)        1
   (3,4)  (8,5)        6           (7,3)  (8,5)        3
   (3,4)  (7,6)        6           (7,3)  (7,6)        3
The K-Medoids Clustering Method

[Figure: the clusters obtained with the candidate medoids c1 = (3,4) and O′ = (7,3).]
The K-Medoids Clustering Method

 Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

 So the cost of swapping the medoid from c2 to O′ is

   S = current total cost − past total cost = 22 − 20 = 2 > 0

 Since S > 0, moving to O′ would be a bad idea; the previous choice was good,
 and the algorithm terminates here (i.e. there is no change in the medoids).
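
The swap decision above can be checked directly; a small sketch using the
Manhattan cost, with the data and medoids taken from the slides (helper names
are ours):

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # sum of each non-medoid point's distance to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

data = [(2,6),(3,4),(3,8),(4,7),(6,2),(6,4),(7,3),(7,4),(8,5),(7,6)]

before = total_cost(data, [(3, 4), (7, 4)])   # medoids c1, c2
after  = total_cost(data, [(3, 4), (7, 3)])   # medoids c1, O'
print(before, after, after - before)          # 20 22 2 -> S > 0, keep c2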
Cluster Analysis

 What is a Cluster ?
 Types of Data in Cluster
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods

Hierarchical Clustering
 Use distance matrix as clustering criteria.
This method does not require the number of
clusters k as an input, but needs a
termination condition
[Figure: agglomerative clustering (AGNES) proceeds from step 0 to step 4, merging {a}, {b}, {c}, {d}, {e} into {a,b} and {d,e}, then {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) proceeds in the reverse direction, splitting {a,b,c,d,e} back into singletons.]
AGNES (Agglomerative Nesting)
 Implemented in statistical analysis packages, e.g., Splus
 Use the Single-Link method and the dissimilarity matrix.
 Merge objects that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all objects belong to the same cluster
[Figure: three scatter plots (axes 0 to 10) showing clusters being progressively merged by single-link AGNES.]

 Single-Link: each time, merge the pair of clusters (C1, C2) connected by the
 shortest single link between objects, i.e. by

   \min_{p \in C_1, q \in C_2} dist(p, q)
AGNES (Agglomerative Nesting)

• Uses a dissimilarity/distance matrix as input
• Start with each individual item in its own cluster
• Merge the nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
• The output is a dendrogram, which can be represented as a set of ordered
  triples (d, k, K), where d is the threshold distance, k is the number of
  clusters and K is the set of clusters (see the sketch below)
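
As an illustration of this (d, k, K) representation, a sketch assuming SciPy is
available, run on the A–E distance matrix used in the example a few slides
below (helper names are ours):

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

labels = ['A', 'B', 'C', 'D', 'E']
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

Z = linkage(squareform(D), method='single')     # single-link merge tree

# rebuild the dendrogram as ordered triples (d, k, K)
clusters = {i: {labels[i]} for i in range(len(labels))}   # start: singletons
triples = [(0.0, len(clusters), [sorted(c) for c in clusters.values()])]
for step, (i, j, dist, _) in enumerate(Z):
    new_id = len(labels) + step                 # SciPy's id for the merged cluster
    clusters[new_id] = clusters.pop(int(i)) | clusters.pop(int(j))
    triples.append((dist, len(clusters), [sorted(c) for c in clusters.values()]))

for d, k, K in triples:
    print(d, k, K)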
AGNES (Agglomerative Nesting):
Minimum Distance Method

[Figure: the five objects A–E drawn as a graph with all pairwise distances, and the graphs obtained with thresholds d_min = 1, 2, 3 and 4; as the threshold grows, more edges appear and clusters merge, and the merges are recorded in a single-link dendrogram over A, B, C, D, E.]
A Dendrogram Shows How the Clusters are Merged Hierarchically

 Decompose the data objects into several levels of nested partitioning (a tree
 of clusters), called a dendrogram.

 A clustering of the data objects is obtained by cutting the dendrogram at the
 desired level; each connected component then forms a cluster.

 E.g., for a dendrogram over the objects a, b, c, d, e:
   level 1 gives 4 clusters: {a,b}, {c}, {d}, {e}
   level 2 gives 3 clusters: {a,b}, {c}, {d,e}
   level 3 gives 2 clusters: {a,b}, {c,d,e}, etc.

[Figure: dendrogram over a, b, c, d, e with cut levels 1 to 4 marked.]
Agglomerative Example

 Distance matrix for five objects A–E:

        A  B  C  D  E
    A   0  1  2  2  3
    B   1  0  2  4  3
    C   2  2  0  1  5
    D   2  4  1  0  3
    E   3  3  5  3  0

[Figure: the objects A–E drawn as a graph, and the single-link dendrogram built as the distance threshold grows through 1, 2, 3, 4, 5.]
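
The same example can be run with a library single-link implementation; a sketch
assuming SciPy is available:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)   # distances between A, B, C, D, E

Z = linkage(squareform(D), method='single')    # condensed distances -> merge tree
print(Z)                                        # each row: clusters merged, distance, new size
print(fcluster(Z, t=2, criterion='distance'))  # flat clusters at threshold 2: A-D together, E alone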
MST Example

 Using the same distance matrix for A–E:

        A  B  C  D  E
    A   0  1  2  2  3
    B   1  0  2  4  3
    C   2  2  0  1  5
    D   2  4  1  0  3
    E   3  3  5  3  0

[Figure: a minimum spanning tree of the graph on A–E built from these distances.]
DIANA (Divisive Analysis)

 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

[Figure: three scatter plots (axes 0 to 10) showing one cluster being progressively split until each object stands alone.]
DIANA (Divisive Analysis)

o Inverse order of AGNES
o Initially all items are placed in one cluster
o The clusters are split when some elements are not sufficiently close to the
  other elements
o Eventually each node forms a cluster on its own
o One simple example of a divisive algorithm is based on the MST version of
  the single-link algorithm
o Edges are cut out of the minimum spanning tree from the largest to the
  smallest (see the sketch below)
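
A sketch of this MST-based divisive step, assuming SciPy is available: build
the minimum spanning tree of the A–E graph and cut its largest edge first.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

labels = ['A', 'B', 'C', 'D', 'E']
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

mst = minimum_spanning_tree(D).toarray()        # keep only the MST edges
i, j = np.unravel_index(np.argmax(mst), mst.shape)
print('cut edge', labels[i], '-', labels[j], 'of weight', mst[i, j])   # the heaviest MST edge attaches E

mst[i, j] = 0                                    # remove the heaviest edge
n_comp, comp = connected_components(mst, directed=False)
print(n_comp, dict(zip(labels, comp)))           # two components: {E} and {A, B, C, D}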
DIANA (Divisive Analysis)

[Figure: the full graph on A–E with all pairwise distances, and its minimum spanning tree with edges A–B = 1, B–C = 2, C–D = 1 and E–D = 3.]
DIANA (Divisive Analysis)

 Cut the largest MST edge, E–D (weight 3). The cluster {A,B,C,D,E} is split
 into two clusters: {E} and {A,B,C,D}.

[Figure: the minimum spanning tree before and after removing the edge E–D.]
DIANA (Divisive Analysis)

 The two clusters are now {E} and {A,B,C,D}. Next cut the edge between B and C
 (weight 2). The cluster {A,B,C,D} is split into {A,B} and {C,D}.

[Figure: the remaining tree on A, B, C, D before and after removing the edge B–C.]
DIANA (Divisive Analysis)

 In the next step these clusters are split in turn, finally giving the
 singleton clusters {E}, {A}, {B}, {C}, {D}.

[Figure: every object now forms its own cluster.]
More on Hierarchical Clustering Methods

 Integration of hierarchical clustering with distance-based clustering:
  BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of
  sub-clusters
  CURE (1998): selects well-scattered points from the cluster and then shrinks
  them towards the center of the cluster by a specified fraction
  CHAMELEON (1999): hierarchical clustering using dynamic modeling
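
As a pointer only (not covered further in these slides), BIRCH is available off
the shelf in scikit-learn; a minimal sketch with illustrative, untuned
parameters, assuming scikit-learn is installed:

import numpy as np
from sklearn.cluster import Birch

X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]], dtype=float)

model = Birch(n_clusters=2, threshold=1.5)   # CF-tree threshold; illustrative value
labels = model.fit_predict(X)
print(labels)                                 # cluster label for each of the ten points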
