
Cluster Analysis

SUSHIL KULKARNI
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
What is a Cluster?
 Given a set of data points, each having a
set of attributes, and a similarity measure
among them, find clusters such that
 –Data points in one cluster are more similar to
one another.
 –Data points in separate clusters are less
similar to one another.

 Similarity Measures:
 –Euclidean Distance if attributes are
continuous.
 –Other Problem-specific Measures.

Outliers

 Outliers are objects that do not belong to any cluster, or that form clusters
 of very small cardinality

[Figure: a scatter of points with one dense cluster and a few isolated points labelled as outliers.]

 In some applications we are interested in discovering outliers, not clusters
 (outlier analysis)
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering
Methods
 Partitioning Methods
 Hierarchical Methods

Data Structures

 Data matrix (two modes): n tuples/objects by p attributes/dimensions;
 the "classic" data input

   \begin{bmatrix}
   x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
   \vdots &        & \vdots &        & \vdots \\
   x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
   \vdots &        & \vdots &        & \vdots \\
   x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
   \end{bmatrix}

 Dissimilarity (distance) matrix (one mode): n objects by n objects;
 the desired data input to some clustering algorithms

   \begin{bmatrix}
   0      &        &        &        &   \\
   d(2,1) & 0      &        &        &   \\
   d(3,1) & d(3,2) & 0      &        &   \\
   \vdots & \vdots & \vdots & \ddots &   \\
   d(n,1) & d(n,2) & \cdots & \cdots & 0
   \end{bmatrix}
Measuring Similarity in Clustering

 Dissimilarity/Similarity metric:

 The dissimilarity d(i, j) between two objects i and j is expressed in terms of
 a distance function, which is typically a metric:

  d(i, j) ≥ 0                       (non-negativity)
  d(i, i) = 0                       (isolation)
  d(i, j) = d(j, i)                 (symmetry)
  d(i, j) ≤ d(i, h) + d(h, j)       (triangle inequality)
Type of data in cluster analysis

 Interval-scaled variables
 e.g., salary, height
 Binary variables
 e.g., gender (M/F), has_cancer(T/F)
 Nominal (categorical) variables
 e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

Similarity and Dissimilarity
Between Objects

 Distance metrics are normally used to measure the similarity or dissimilarity
 between two data objects
Similarity and Dissimilarity Between
Objects (Cont.)
 Euclidean distance:
   d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2}
 Properties

 d(i,j) ≥0
 d(i,i) =0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
 Also one can use weighted distance:
   d(i, j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2}
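
As a concrete illustration (not part of the original slides), both formulas can
be written directly in Python; the example points and weights below are made up:

import math

def euclidean(x, y):
    # Euclidean distance between two equal-length numeric vectors
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def weighted_euclidean(x, y, w):
    # weighted Euclidean distance, one weight per attribute
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

print(euclidean((2, 6), (7, 4)))                       # 5.385...
print(weighted_euclidean((2, 6), (7, 4), (1.0, 0.5)))  # weights are arbitrary example values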
Binary Variables
 A binary variable has two states: 0 absent, 1 present
 A contingency table for binary data (objects i and j, p binary attributes):

                       object j
                     1      0     sum
   object i    1     a      b     a+b
               0     c      d     c+d
              sum   a+c    b+d     p

 Jaccard coefficient distance (noninvariant if the binary variable is
 asymmetric):

   d(i, j) = \frac{b + c}{a + b + c}
Binary Variables

 A 1-1 match is a stronger indicator of similarity than a 0-0 match
 It may therefore be reasonable to ignore 0-0 matches
 The simple matching coefficient gives equal weight to 1-1 and 0-0 matches
 The Jaccard coefficient ignores 0-0 matches
Dissimilarity between Binary
Variables
 Example (Jaccard coefficient)

   Name    Fever  Cough  Test-1  Test-2  Test-3  Test-4
   Tina      1      0      1       0       0       0
   Dina      1      0      1       0       1       0
   Meena     1      1      0       0       0       0
Dissimilarity between Binary
Variables
Name Fever Cough Test-1 Test-2 Test-3 Test-4
Tina 1 0 1 0 0 0
Dina 1 0 1 0 1 0

Consider the pair Tina and Dina:

Number of attributes where Tina is 1 and Dina is 1:  a = 2  (Fever, Test-1)
Number of attributes where Tina is 1 and Dina is 0:  b = 0
Number of attributes where Tina is 0 and Dina is 1:  c = 1  (Test-3)
Number of attributes where both are 0:               d = 3  (Cough, Test-2, Test-4)
Dissimilarity between Binary
Variables
Using the Jaccard coefficient distance d(i, j) = \frac{b + c}{a + b + c}
(with a = 2, b = 0, c = 1, d = 3 for Tina and Dina) we get:

   d(Tina, Dina)  = \frac{0 + 1}{2 + 0 + 1} = 0.33
   d(Tina, Meena) = \frac{1 + 1}{1 + 1 + 1} = 0.67
   d(Meena, Dina) = \frac{1 + 2}{1 + 1 + 2} = 0.75

Since d(i, j) measures dissimilarity, the smallest value indicates the most
similar pair: Tina and Dina are the most similar, Meena and Dina are the least
similar, and the remaining pair lies between these extremes.
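
A small sketch (ours, not from the slides) that recomputes the three Jaccard
dissimilarities above from the symptom table:

def jaccard_dissimilarity(x, y):
    # d = (b + c) / (a + b + c) for two 0/1 vectors; 0-0 matches are ignored
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4
tina  = (1, 0, 1, 0, 0, 0)
dina  = (1, 0, 1, 0, 1, 0)
meena = (1, 1, 0, 0, 0, 0)

print(round(jaccard_dissimilarity(tina, dina), 2))   # 0.33
print(round(jaccard_dissimilarity(tina, meena), 2))  # 0.67
print(round(jaccard_dissimilarity(meena, dina), 2))  # 0.75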
Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering
Methods
 Partitioning Methods
 Hierarchical Methods

Major Clustering Approaches

 Partitioning algorithms: Construct random partitions and then iteratively
 refine them by some criterion
 Hierarchical algorithms: Create a hierarchical decomposition of the set of
 data (or objects) using some criterion
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods

Partitioning Algorithms: Basic
Concepts

 Partitioning method: Construct a partition of a database D of n objects into
 a set of k clusters
 Methods: k-means and k-medoids algorithms
  k-means (MacQueen'67): Each cluster is represented by the center of the
  cluster
  k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87):
  Each cluster is represented by one of the objects in the cluster
Centroid or Medoid

[Figure: a cluster of points with its centroid (the mean point) and its medoid (a representative object of the cluster) marked.]
The k-means Clustering Method
 Given k, the k-means algorithm is
implemented in 4 steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of
the clusters of the current partition. The
centroid is the center (mean point) of the
cluster.
 Assign each object to the cluster with the
nearest seed point.
 Go back to step 2; stop when assignments no longer change.
K-Means Example
 Given: {2,4,10,12,3,20,30,11,25}
 Assume that we want two clusters.
 Write the elements in increasing order
{2,3,4,10,11,12,20,25,30}
 Randomly assign means: m1=3,m2=4
 K1={2,3}, K2={4,10,11,12,20,25,30}
 Means are m1=2.5,m2=16
 K1={2,3,4},K2={10,11,12,20,25,30}
 Means are m1=3,m2=18
 K1={2,3,4,10},K2={11,12,20,25,30}
 Means are m1=4.75,m2=19.6
 K1={2,3,4,10,11,12},K2={20,25,30}
 Means are m1=7,m2=25
 Stop, since with these means no further reassignments (i.e. jumps from K2 to
 K1) are possible.
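
A minimal sketch of this 1-D trace in Python, assuming the same data and the
same starting means m1 = 3 and m2 = 4 (the function name is ours, and empty
clusters are not handled):

def kmeans_1d(points, means, max_iter=100):
    # plain k-means on 1-D data: assign to nearest mean, recompute means, repeat
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:          # no assignment can change any more
            return clusters, means
        means = new_means
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, [3, 4])
print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)      # [7.0, 25.0]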
The k-means Clustering Method

 Example

[Figure: four scatter plots (axes 0 to 10) showing successive k-means iterations on a 2-D data set as the cluster means move and points are reassigned.]
Comments on the k-means Method

 Applicable only when a mean is defined; what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
The K-Medoids Clustering Method

 A medoid can be defined as the object of a cluster whose average dissimilarity
 to all the objects in the cluster is minimal, i.e. it is the most centrally
 located point of the cluster.
The K-Medoids Clustering Algorithm

A medoid (PAM-style) clustering algorithm is as follows (a sketch in code
follows below):

 1. The algorithm begins by arbitrarily selecting k of the n data points as
    medoids (n > k)
 2. After the k medoids are selected, associate each data object in the data
    set with its most similar medoid; similarity is defined using a distance
    measure, e.g. Euclidean distance
 3. Randomly select a non-medoid object O′
 4. Compute the total cost S of swapping an initial medoid object with O′
 5. If S < 0, swap the initial medoid with the new one (this gives a new set
    of medoids)
 6. Repeat steps 2 to 5 until there is no change in the medoids
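
A compact sketch of the swap loop above, assuming Manhattan distance as the
cost (as in the worked example that follows); the function names are ours, and
this version always takes the best improving swap rather than a random one:

import itertools, random

def manhattan(p, q):
    # L1 distance, as used in the worked example below
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # sum of each point's distance to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)               # step 1: arbitrary initial medoids
    while True:
        best = total_cost(points, medoids)
        best_swap = None
        # steps 3-4: try swapping each medoid with each non-medoid object O'
        for m, o in itertools.product(list(medoids), points):
            if o in medoids:
                continue
            trial = [o if x == m else x for x in medoids]
            cost = total_cost(points, trial)
            if cost < best:                       # step 5: S = cost - best < 0
                best, best_swap = cost, trial
        if best_swap is None:                     # step 6: no improving swap left
            return medoids, best
        medoids = best_swap

data = [(2,6),(3,4),(3,8),(4,7),(6,2),(6,4),(7,3),(7,4),(8,5),(7,6)]
print(k_medoids(data, 2))   # a local optimum; the slides' choice {(3,4),(7,4)} costs 20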
The K-Medoids Clustering Method
 Cluster the following data set of ten objects into two clusters, i.e. k = 2.
 Consider a data set of ten objects as follows (two numeric attributes):

   X1   2  6
   X2   3  4
   X3   3  8
   X4   4  7
   X5   6  2
   X6   6  4
   X7   7  3
   X8   7  4
   X9   8  5
   X10  7  6
The K-Medoids Clustering Method

[Figure: the ten objects X1 to X10 plotted as points in the plane.]
The K-Medoids Clustering Method

Step 1

 Initialise the k centres
 Let us assume c1 = (3,4) and c2 = (7,4)
 So here c1 and c2 are selected as the medoids
 Calculate the distances so as to associate each data object with its nearest
 medoid. The cost is calculated using the Minkowski distance metric (here with
 p = 1, i.e. the Manhattan distance).
Similarity and Dissimilarity
Between Objects

 The most popular distance measures conform to the Minkowski distance:

   L_p(i, j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}

 where i = (x_{i1}, x_{i2}, ..., x_{in}) and j = (x_{j1}, x_{j2}, ..., x_{jn})
 are two n-dimensional data objects, and p is a positive integer
Similarity and Dissimilarity
Between Objects

 If p = 1, then L_1 is the Manhattan distance:

   L_1(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|
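
A small helper (name ours, not from the slides) that implements the Minkowski
distance for any p; p = 1 gives the Manhattan/L1 cost used in the K-medoids
example, p = 2 gives the Euclidean distance:

def minkowski(x, y, p):
    # L_p distance between two equal-length numeric vectors
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

print(minkowski((3, 4), (2, 6), 1))   # 3.0 -> matches the first cost in the table below
print(minkowski((3, 4), (2, 6), 2))   # 2.236...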
The K-Medoids Clustering Method
   c1     Data object  Cost        c2     Data object  Cost
          (Xi)         (distance)         (Xi)         (distance)

   (3,4)  (2,6)        3           (7,4)  (2,6)        7
   (3,4)  (3,8)        4           (7,4)  (3,8)        8
   (3,4)  (4,7)        4           (7,4)  (4,7)        6
   (3,4)  (6,2)        5           (7,4)  (6,2)        3
   (3,4)  (6,4)        3           (7,4)  (6,4)        1
   (3,4)  (7,3)        5           (7,4)  (7,3)        1
   (3,4)  (8,5)        6           (7,4)  (8,5)        2
   (3,4)  (7,6)        6           (7,4)  (7,6)        2
The K-Medoids Clustering Method
 So the clusters become:

   Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
   Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

 Since the points (2,6), (3,8) and (4,7) are closer to c1, they form one
 cluster, while the remaining points form the other cluster.

 The cost between any data object and a medoid is found using the formula

   cost(x, c) = \sum_{i=1}^{d} |x_i - c_i|

 where x is any data object, c is the medoid, and d is the dimension of the
 objects, which in this case is 2.

 The total cost is the sum of the costs of the data objects from the medoid of
 their cluster, so here:
The K-Medoids Clustering Method

 Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
 Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

 Total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))}
            + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3))
               + cost((7,4),(8,5)) + cost((7,4),(7,6))}
            = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2
            = 20
The K-Medoids Clustering Method

[Figure: the clusters after step 1, around the medoids c1 = (3,4) and c2 = (7,4).]
The K-Medoids Clustering Method
Step 2

 Select a non-medoid object O′ at random
 Let us assume O′ = (7,3)
 So now the candidate medoids are c1 = (3,4) and O′ = (7,3)
 With c1 and O′ as the new medoids, calculate the total cost involved using
 the formula from step 1
The K-Medoids Clustering Method
   c1     Data object  Cost        O′     Data object  Cost
          (Xi)         (distance)         (Xi)         (distance)

   (3,4)  (2,6)        3           (7,3)  (2,6)        8
   (3,4)  (3,8)        4           (7,3)  (3,8)        9
   (3,4)  (4,7)        4           (7,3)  (4,7)        7
   (3,4)  (6,2)        5           (7,3)  (6,2)        2
   (3,4)  (6,4)        3           (7,3)  (6,4)        2
   (3,4)  (7,4)        4           (7,3)  (7,4)        1
   (3,4)  (8,5)        6           (7,3)  (8,5)        3
   (3,4)  (7,6)        6           (7,3)  (7,6)        3
The K-Medoids Clustering Method

[Figure: the clusters obtained with the candidate medoids c1 = (3,4) and O′ = (7,3).]
The K-Medoids Clustering Method

 Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

 So the cost of swapping the medoid from c2 to O′ is

   S = current total cost − past total cost = 22 − 20 = 2 > 0

 Since S > 0, moving to O′ would be a bad idea; the previous choice was good,
 and the algorithm terminates here (i.e. there is no change in the medoids).
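
The swap decision above can be checked directly; a small sketch using the
Manhattan cost, with the data and medoids taken from the slides (helper names
are ours):

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # sum of each non-medoid point's distance to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

data = [(2,6),(3,4),(3,8),(4,7),(6,2),(6,4),(7,3),(7,4),(8,5),(7,6)]

before = total_cost(data, [(3, 4), (7, 4)])   # medoids c1, c2
after  = total_cost(data, [(3, 4), (7, 3)])   # medoids c1, O'
print(before, after, after - before)          # 20 22 2 -> S > 0, keep c2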
Cluster Analysis

 What is a Cluster ?
 Types of Data in Cluster
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods

Hierarchical Clustering
 Use distance matrix as clustering criteria.
This method does not require the number of
clusters k as an input, but needs a
termination condition
[Figure: agglomerative clustering (AGNES) proceeds from step 0 to step 4, merging {a}, {b}, {c}, {d}, {e} into {a,b} and {d,e}, then {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) proceeds in the reverse direction, splitting {a,b,c,d,e} back into singletons.]
AGNES (Agglomerative Nesting)
 Implemented in statistical analysis packages, e.g., Splus
 Use the Single-Link method and the dissimilarity matrix.
 Merge objects that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all objects belong to the same cluster
[Figure: three scatter plots (axes 0 to 10) showing clusters being progressively merged by single-link AGNES.]

 Single-Link: each time, merge the pair of clusters (C1, C2) connected by the
 shortest single link between objects, i.e. by

   \min_{p \in C_1, q \in C_2} dist(p, q)
AGNES (Agglomerative Nesting)

• Uses a dissimilarity/distance matrix as input
• Start with each individual item in its own cluster
• Merge the nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
• The output is a dendrogram, which can be represented as a set of ordered
  triples (d, k, K), where d is the threshold distance, k is the number of
  clusters and K is the set of clusters (see the sketch below)
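
As an illustration of this (d, k, K) representation, a sketch assuming SciPy is
available, run on the A–E distance matrix used in the example a few slides
below (helper names are ours):

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

labels = ['A', 'B', 'C', 'D', 'E']
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

Z = linkage(squareform(D), method='single')     # single-link merge tree

# rebuild the dendrogram as ordered triples (d, k, K)
clusters = {i: {labels[i]} for i in range(len(labels))}   # start: singletons
triples = [(0.0, len(clusters), [sorted(c) for c in clusters.values()])]
for step, (i, j, dist, _) in enumerate(Z):
    new_id = len(labels) + step                 # SciPy's id for the merged cluster
    clusters[new_id] = clusters.pop(int(i)) | clusters.pop(int(j))
    triples.append((dist, len(clusters), [sorted(c) for c in clusters.values()]))

for d, k, K in triples:
    print(d, k, K)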
AGNES (Agglomerative Nesting):
Minimum Distance Method

[Figure: the five objects A–E drawn as a graph with all pairwise distances, and the graphs obtained with thresholds d_min = 1, 2, 3 and 4; as the threshold grows, more edges appear and clusters merge, and the merges are recorded in a single-link dendrogram over A, B, C, D, E.]
A Dendrogram Shows How the Clusters are Merged Hierarchically

 Decompose the data objects into several levels of nested partitioning (a tree
 of clusters), called a dendrogram.

 A clustering of the data objects is obtained by cutting the dendrogram at the
 desired level; each connected component then forms a cluster.

 E.g., for a dendrogram over the objects a, b, c, d, e:
   level 1 gives 4 clusters: {a,b}, {c}, {d}, {e}
   level 2 gives 3 clusters: {a,b}, {c}, {d,e}
   level 3 gives 2 clusters: {a,b}, {c,d,e}, etc.

[Figure: dendrogram over a, b, c, d, e with cut levels 1 to 4 marked.]
Agglomerative Example

 Distance matrix for five objects A–E:

        A  B  C  D  E
    A   0  1  2  2  3
    B   1  0  2  4  3
    C   2  2  0  1  5
    D   2  4  1  0  3
    E   3  3  5  3  0

[Figure: the objects A–E drawn as a graph, and the single-link dendrogram built as the distance threshold grows through 1, 2, 3, 4, 5.]
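
The same example can be run with a library single-link implementation; a sketch
assuming SciPy is available:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)   # distances between A, B, C, D, E

Z = linkage(squareform(D), method='single')    # condensed distances -> merge tree
print(Z)                                        # each row: clusters merged, distance, new size
print(fcluster(Z, t=2, criterion='distance'))  # flat clusters at threshold 2: A-D together, E alone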
MST Example

 Using the same distance matrix for A–E:

        A  B  C  D  E
    A   0  1  2  2  3
    B   1  0  2  4  3
    C   2  2  0  1  5
    D   2  4  1  0  3
    E   3  3  5  3  0

[Figure: a minimum spanning tree of the graph on A–E built from these distances.]
DIANA (Divisive Analysis)

 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

[Figure: three scatter plots (axes 0 to 10) showing one cluster being progressively split until each object stands alone.]
DIANA (Divisive Analysis)

o Inverse order of AGNES
o Initially all items are placed in one cluster
o The clusters are split when some elements are not sufficiently close to the
  other elements
o Eventually each node forms a cluster on its own
o One simple example of a divisive algorithm is based on the MST version of
  the single-link algorithm
o Edges are cut out of the minimum spanning tree from the largest to the
  smallest (see the sketch below)
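
A sketch of this MST-based divisive step, assuming SciPy is available: build
the minimum spanning tree of the A–E graph and cut its largest edge first.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

labels = ['A', 'B', 'C', 'D', 'E']
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

mst = minimum_spanning_tree(D).toarray()        # keep only the MST edges
i, j = np.unravel_index(np.argmax(mst), mst.shape)
print('cut edge', labels[i], '-', labels[j], 'of weight', mst[i, j])   # the heaviest MST edge attaches E

mst[i, j] = 0                                    # remove the heaviest edge
n_comp, comp = connected_components(mst, directed=False)
print(n_comp, dict(zip(labels, comp)))           # two components: {E} and {A, B, C, D}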
DIANA (Divisive Analysis)

[Figure: the full graph on A–E with all pairwise distances, and its minimum spanning tree with edges A–B = 1, B–C = 2, C–D = 1 and E–D = 3.]
DIANA (Divisive Analysis)

 Cut the largest MST edge, E–D (weight 3). The cluster {A,B,C,D,E} is split
 into two clusters: {E} and {A,B,C,D}.

[Figure: the minimum spanning tree before and after removing the edge E–D.]
DIANA (Divisive Analysis)

 The two clusters are now {E} and {A,B,C,D}. Next cut the edge between B and C
 (weight 2). The cluster {A,B,C,D} is split into {A,B} and {C,D}.

[Figure: the remaining tree on A, B, C, D before and after removing the edge B–C.]
DIANA (Divisive Analysis)

 In the next step these clusters are split in turn, finally giving the
 singleton clusters {E}, {A}, {B}, {C}, {D}.

[Figure: every object now forms its own cluster.]
More on Hierarchical Clustering Methods

 Integration of hierarchical clustering with distance-based clustering:
  BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of
  sub-clusters
  CURE (1998): selects well-scattered points from the cluster and then shrinks
  them towards the center of the cluster by a specified fraction
  CHAMELEON (1999): hierarchical clustering using dynamic modeling
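
As a pointer only (not covered further in these slides), BIRCH is available off
the shelf in scikit-learn; a minimal sketch with illustrative, untuned
parameters, assuming scikit-learn is installed:

import numpy as np
from sklearn.cluster import Birch

X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]], dtype=float)

model = Birch(n_clusters=2, threshold=1.5)   # CF-tree threshold; illustrative value
labels = model.fit_predict(X)
print(labels)                                 # cluster label for each of the ten points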
