Cosmin Lazar
12/11/12
Introduction
What is a cluster?
There is no generally accepted definition!
A cluster is …
D2: … a set of entities which are alike, while entities from different clusters are not alike.
D3: … an aggregation of points in the test space such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it.
Introduction
It is hard to give a generally accepted definition of a cluster because objects can be grouped with different purposes in mind.
Overview
! What is cluster analysis?
! How does it work?
! Applications
What is cluster analysis?
Related approaches: association rules, supervised learning, semi-supervised learning, neural networks, other AI techniques, etc.
What is cluster analysis?
Cluster analysis is a multivariate data mining technique whose goal is to group objects based on a set of user-selected characteristics.
Clusters should exhibit high internal homogeneity and high external heterogeneity.
When plotted geometrically, objects within clusters should be very close together, and clusters should be far apart.
Clustering, or unsupervised learning ≈ natural grouping of data.
What is it not? Supervised learning.
[Figure: scatter plots in the Feature 1 / Feature 2 plane contrasting clustering (unsupervised) with supervised learning]
Definitions & notations
! Objects or elementary data – the samples Sample1, Sample2, Sample3, …
! Features or cluster variate – the variables Variable1, Variable2, Variable3, …; the data matrix stores Value_ij, the value of variable i measured on sample j
! Data dimension
! Similarity measure
! Cluster
! Cluster seed
! Cluster centroid
! Cluster solution
! Outlier
Definitions & notations
The data dimension is the number of variables per sample:
1 – univariate data
2 – bivariate data
3 – trivariate data
>3 – multi- & hypervariate data
Remark: variables are quantitative (one can do math on them).
An example
[Figure: example scatter plots of univariate, bivariate and trivariate data]
How does it work?
Living things
Classes
Orders
Families
Genera
Species
How does it work?
A simple example:
Suppose that a biologist wants to determine the subspecies in a population of birds belonging to the same species.
Objects: the samples S1–S8, each described by a set of clustering variables.
How does it work?
A simple example (continued):
Objective: identify structures (classes) in the data by grouping the most similar objects into groups.
[Figure: scatter plot of samples S1–S8 in the V1 / V2 plane]
How does it work?
Q1: how does he measure the similarity between objects?
A1: build a similarity matrix between all pairs of observations
[Table: the 8 × 8 similarity matrix over observations S1–S8]
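The similarity-matrix step can be sketched in plain Python. This is a minimal sketch: the bird measurements below are made-up illustrative values (not the slide's data), and Euclidean distance stands in for the unspecified similarity measure (small distance = high similarity).

```python
import math

# Hypothetical 2-D measurements (V1, V2) for eight birds; illustrative values only.
samples = {"S1": (3.2, 1.1), "S2": (3.3, 1.2), "S3": (3.4, 1.0),
           "S4": (3.6, 1.1), "S5": (3.7, 1.2), "S6": (3.8, 1.0),
           "S7": (3.6, 1.5), "S8": (3.7, 1.6)}

def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance matrix between all pairs of observations.
names = sorted(samples)
dist = {(a, b): euclidean(samples[a], samples[b]) for a in names for b in names}
```

The matrix is symmetric with a zero diagonal; hierarchical methods consume exactly this structure.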
How does it work?
Q2: how does he form the clusters?
A21: group the observations which are most similar into clusters
A22: iteratively merge the clusters which are closest to one another
[Figure: scatter plot of samples S1–S8 with similar samples circled]
How does it work?
Q3: how to determine the number of clusters in the final solution?
A3: measure the homogeneity of a cluster solution by averaging all distances between observations within clusters
[Figure: scatter plot of samples S1–S8 and the corresponding dendrogram]
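A3 can be made concrete: score a candidate solution by the average of all within-cluster pairwise distances (lower = more homogeneous). A minimal sketch with hypothetical coordinates:

```python
import itertools, math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def homogeneity(clusters):
    """Average of all within-cluster pairwise distances; lower = more homogeneous."""
    dists = [euclidean(p, q)
             for cluster in clusters
             for p, q in itertools.combinations(cluster, 2)]
    return sum(dists) / len(dists)

# Two made-up solutions: tight clusters vs. looser ones.
tight = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
loose = [[(0, 0), (0, 5), (5, 0)], [(10, 10), (10, 15), (15, 10)]]
```

Comparing `homogeneity(tight)` with `homogeneity(loose)` shows the tighter solution scoring lower, which is the criterion the slide proposes for choosing the number of clusters.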
Areas of application
Engineering, sociology, …
Most common criticisms
Cluster analysis is "descriptive, atheoretical and noninferential".
Review
! Key elements and notations in cluster analysis
! How does it work?
Cluster Analysis Diagram
Stage 1: Objectives of Cluster Analysis
Stage 2: Research Design Issues
Stage 3: Assumptions in Cluster Analysis
Stage 4: Deriving Clusters and Assessing Overall Fit
Stage 5: Interpreting the Clusters
Stage 6: Validating and Profiling the Clusters
Cluster Analysis – Research design issues
Stage 2: Research Design Issues
Feature selection methods enable users to select the most relevant variables to be used in cluster analysis.
Feature extraction methods enable users to derive new features from the existing ones, which may be more relevant for cluster analysis than the original features.
Cluster Analysis – Research design issues
Stage 2: Research Design Issues
Remark:
1. When the interest is in identifying small groups, a large sample size is needed.
2. When the interest is in identifying large groups, a small sample size may suffice.
Cluster Analysis – Research design issues
Stage 2: Research Design Issues
Outliers come in two kinds:
1. Truly aberrant observations, not representative of the population – they distort the actual structure and result in unrepresentative clusters, so they should be removed.
2. Observations caused by an undersampling of an actual group in the population, leading to poor representation of that group – they represent valid and relevant groups, so they should be included in the clustering solution.
Cluster Analysis – Research design issues
Stage 2: Research Design Issues
Three ways to measure inter-object similarities; correlation measures and distance measures require metric data.
Cluster Analysis – Research design issues
Stage 2: Research Design Issues
Correlation coefficient:
CC(X_i, X_j) = (1/d) Σ_{k=1..d} (x_ik − μ_{X_i})(x_jk − μ_{X_j}) / (σ_{X_i} σ_{X_j})
Spectral angle:
SA(X_i, X_j) = arccos(CC(X_i, X_j))
[Figure: two signals compared sample by sample]
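These two measures can be sketched directly. The sketch below uses the standard Pearson form of the correlation coefficient (equivalent to the slide's CC up to how the normalization is written); the clamp before `acos` is a defensive assumption against floating-point rounding just past ±1.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient CC between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spectral_angle(x, y):
    """SA(Xi, Xj) = arccos(CC(Xi, Xj)); clamp guards against rounding past +/-1."""
    return math.acos(max(-1.0, min(1.0, correlation(x, y))))

xi = [1.0, 2.0, 3.0, 4.0]
xj = [2.0, 4.0, 6.0, 8.0]   # perfectly correlated with xi
```

Perfectly correlated vectors give CC = 1 and a spectral angle of 0; anti-correlated vectors give CC = −1 and an angle of π.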
Cluster Analysis – Research design issues
Stage 2: Research Design Issues
Minkowski metrics:
D_r(X_i, X_j) = ( Σ_{k=1..d} |x_ik − x_jk|^r )^(1/r), where r is the metric exponent
r = 1 – Manhattan distance
r = 2 – Euclidean distance
r ≥ 3 – higher-order metrics
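The Minkowski family above can be written in a few lines; the two test points are illustrative:

```python
def minkowski(x, y, r):
    """Minkowski distance with metric exponent r:
    r = 1 gives Manhattan, r = 2 Euclidean, r >= 3 higher-order metrics."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

p, q = (0.0, 0.0), (3.0, 4.0)
# Manhattan: 3 + 4 = 7.0; Euclidean: sqrt(9 + 16) = 5.0
```

For fixed points, the distance shrinks as r grows, which is one reason lower exponents (and fractional ones) are preferred in high-dimensional spaces.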
Cluster Analysis – Research design issues
Stage 2: Research Design Issues
Mahalanobis distance:
MD(X_i, X_j) = sqrt( (X_i − X_j)^T Σ^{-1} (X_i − X_j) ), where Σ is the covariance matrix of the data
[Figure: 3-D scatter plot of correlated data]
Choosing a metric:
– Low dimensional spaces: Euclidean distance
– High dimensional spaces: Manhattan or fractionary metrics, spectral angle
– Pearson's correlation coefficient: should be used when (dis)similarity is judged from the correlation point of view
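A sketch of the Mahalanobis distance for the 2-D case, inverting the 2×2 covariance matrix by hand; the correlated point cloud is invented for illustration:

```python
import math

def covariance_2d(data):
    """Sample covariance matrix entries (sxx, sxy, syy) of 2-D points."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(p, q, data):
    """MD(p, q) = sqrt((p - q)^T Sigma^{-1} (p - q)) for 2-D points."""
    sxx, sxy, syy = covariance_2d(data)
    det = sxx * syy - sxy * sxy            # invert the 2x2 covariance by hand
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det
    dx, dy = p[0] - q[0], p[1] - q[1]
    return math.sqrt(dx * (ixx * dx + ixy * dy) + dy * (ixy * dx + iyy * dy))

# Made-up correlated cloud: y roughly tracks x.
cloud = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 4.1), (5, 5.0)]
md = mahalanobis_2d((0, 0), (1, 1), cloud)
```

Because the cloud is strongly correlated along the diagonal, a step along that diagonal counts as a much smaller Mahalanobis distance than the same Euclidean step across it.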
Cluster Analysis – Research design issues
Stage 2: Research Design Issues
A5: Clustering variables that are not all of the same scale should be standardized.
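A5 can be sketched as z-score standardization; the two variables below (and their scales) are hypothetical:

```python
import math

def standardize(values):
    """Z-score standardization: zero mean, unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Two variables on very different scales (illustrative values):
wingspan_cm = [18.0, 21.0, 19.5, 22.5]
weight_g = [310.0, 420.0, 350.0, 500.0]
z_wing, z_weight = standardize(wingspan_cm), standardize(weight_g)
```

After standardization both variables contribute comparably to any distance computation, instead of the larger-scaled one dominating.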
Cluster Analysis – Research design issues
Stage 2: Research Design Issues
[Table: the data matrix Sample1, Sample2, Sample3, … after standardization]
Cluster Analysis – Methods
Stage 4: Deriving Clusters and Assessing Overall Fit
Methods:
Hierarchical clustering – agglomerative, divisive
Partitional clustering – K-means
Density based clustering – Denclust, CLUPOT, Mean Shift, SVC, Parzen-Watershed
Cluster Analysis – Methods
Hierarchical clustering
Agglomerative (bottom-up) principle: compute the distance matrix between all objects (initially, one object = one cluster). Find the two clusters with the closest distance and merge them into one. Recompute the distance matrix and repeat.
Cluster Analysis – Methods
Hierarchical clustering
Agglomerative (bottom-up)
Single-link (nearest neighbor method): d(c_i, c_j) = min over x ∈ c_i, y ∈ c_j of d(x, y)
[Figure: samples S1–S9 merged step by step under the single-link rule]
Drawback: can result in long and thin clusters due to the chaining effect
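The single-link agglomerative procedure can be sketched as follows; the points and the target number of clusters are illustrative assumptions:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link_distance(ci, cj):
    """d(ci, cj) = min over x in ci, y in cj of d(x, y)."""
    return min(euclidean(x, y) for x in ci for y in cj)

def agglomerative(points, k):
    """Merge the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]          # initially one object = one cluster
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link_distance(clusters[ij[0]],
                                                       clusters[ij[1]]))
        clusters[i] += clusters.pop(j)        # merge cluster j into cluster i
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
result = agglomerative(pts, 2)
```

With two well-separated groups this recovers them exactly; on elongated data the same min-rule produces the chaining effect the slide warns about.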
Cluster Analysis – Methods
Single-link (nearest neighbor method)
Drawback: can result in long and thin clusters due to the chaining effect
[Figure: successive single-link merges of samples S1–S5 illustrating the chaining effect]
Cluster Analysis – Methods
[Figure: single-link merges on samples S1–S9, producing clusters C1 and C2]
Cluster Analysis – Methods
Partitional clustering
Cluster Analysis – Methods
K-means algorithm
Start
1. Initialization: choose the number of clusters K and the initial centroids
2. Compute the distances from the objects to the centroids
3. Group the objects based on the minimum distance
4. Estimate the new centroids
5. Stability? If NO, go back to step 2; if YES, end
[Figure: successive iterations of K-means on a 2-D point cloud]
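The K-means flowchart maps to a short sketch. Squared Euclidean distance is assumed for the grouping step, and the data and seeds are invented:

```python
def kmeans(points, centroids, iters=100):
    """Plain K-means: assign to nearest centroid, re-estimate, repeat until stable.
    `centroids` are the user-supplied initial seeds (the algorithm is sensitive
    to them, as the drawbacks slide notes)."""
    centroids = list(centroids)
    for _ in range(iters):
        # Grouping based on minimum distance to the current centroids.
        groups = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            groups[i].append(p)
        # Estimate the new centroids as the mean of each group.
        new = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
               for g, c in zip(groups, centroids)]
        if new == centroids:                  # stability reached -> end
            return centroids, groups
        centroids = new
    return centroids, groups

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, groups = kmeans(pts, [(0, 0), (10, 10)])
```

On this toy cloud the loop stabilizes after one re-estimation; with badly chosen seeds the same loop converges to a local optimum instead.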
Cluster Analysis – Methods
Drawbacks:
– Sensitive to the initial seed points
– Converges to a local optimum that may be an unwanted solution
– Need to specify K, the number of clusters, in advance
– Applicable only when the mean is defined; what about categorical data?
[Figure: a K-means run converging to an unwanted local optimum]
Cluster Analysis – Methods
Density based clustering
! Clustering based on density (local cluster criterion), such as
density-connected points or based on an explicitly constructed
density function
! Major features
! Discover clusters of arbitrary shape
! Handle noise (outliers)
DBSCAN - Ester, et al. 1996 - http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf
DENCLUE - Hinneburg & D. Keim 1998 -
http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf
Parzen Watershed - http://www.ecmjournal.org/journal/smi/pdf/smi97-01.pdf
MeanShift - http://courses.csail.mit.edu/6.869/handouts/PAMIMeanshift.pdf
Support Vector Clustering - http://jmlr.csail.mit.edu/papers/volume2/horn01a/rev1/horn01ar1.pdf
Cluster Analysis – Methods
Density is the number of points within a specified space range.
Density estimation – an example on univariate data:
f(x) = Σ_i K(x − x_i) = Σ_i k( ||x − x_i||² / h² )
k(r) – kernel function or Parzen window
[Figure: histogram and kernel density estimates of a univariate sample]
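The estimator f(x) above can be sketched for univariate data. A Gaussian kernel is used here as one common choice of Parzen window k(r), and the bimodal sample is made up:

```python
import math

def parzen_density(x, data, h):
    """f(x) = sum_i k(||x - x_i||^2 / h^2), with a Gaussian kernel
    k(r) = exp(-r / 2). Unnormalized: enough for locating modes."""
    return sum(math.exp(-((x - xi) ** 2) / (2 * h * h)) for xi in data)

# Univariate samples drawn around two modes (made-up values).
data = [0.1, 0.15, 0.2, 0.25, 0.9, 0.95, 1.0, 1.05]
h = 0.1
```

Evaluating `parzen_density` near 0.18 (inside a mode) gives a much larger value than near 0.55 (the valley between modes), which is exactly the structure density-based methods exploit.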
Cluster Analysis – Methods
Why is density estimation computationally expensive in high dimensional spaces?
f(x) = Σ_i K(x − x_i) = Σ_i k( ||x − x_i||² / h² )
k(r) – kernel function or Parzen window
Example: a discrete 10-D space with n = 10 values per axis contains 10^10 = 10,000,000,000 discrete cells at which f must be evaluated.
Parzen Watershed algorithm
It is based on the estimation of the pdf in the feature space.
Algorithm:
Start
1. Dimension reduction
2. PDF estimation
3. Segmentation of the estimated pdf into clusters
[Figure: a 2-D point cloud, its estimated pdf, and the resulting clusters]
The cluster centroid (a mean profile of the cluster on each cluster variable) is
particularly useful in the interpretation stage
Cluster Analysis – Validation
Stage 6: Validating and Profiling the Clusters
“The validation of clustering structures is the most difficult and frustrating part of
cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art
accessible only to those true believers who have experience and great courage.”
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-
random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally
given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external
information.- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine the stability of the solution.
5. Determining the ‘correct’ number of clusters.
Cluster analysis – Validation
Indices for cluster validation
! Cross validation
! External index – used to measure the extent to which cluster labels match
externally supplied class labels
! Labels provided by experts or ground truth
Cluster analysis – Validation
Internal indices – example: silhouette coefficient
sc = 1 − c/s
where c is the average distance from an observation to the other members of its own cluster (cohesion) and s is the average distance to the members of the nearest other cluster (separation).
[Figure: silhouette plot for a clustering of 12 samples]
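The coefficient can be sketched for a single observation, using sc = 1 − c/s with c the average distance to the observation's own cluster (cohesion) and s the average distance to the nearest other cluster (separation); the two small clusters below are invented:

```python
import math

def silhouette(point, own, nearest):
    """sc = 1 - c/s for one observation: c is the average distance to the other
    members of its own cluster, s the average distance to the nearest other
    cluster. Values near 1 mean a well-placed observation."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    c = sum(dist(point, p) for p in own) / len(own)
    s = sum(dist(point, p) for p in nearest) / len(nearest)
    return 1 - c / s

own = [(0.1, 0.0), (0.0, 0.1)]          # tight cluster around the origin
nearest = [(5.0, 5.0), (5.1, 5.2)]      # distant neighboring cluster
sc = silhouette((0.0, 0.0), own, nearest)
```

Here cohesion is tiny and separation large, so sc is close to 1; an observation sitting between two clusters would score near 0.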
Applications
A good way to test the randomness hypothesis in data (hierarchical and density based clustering)
Image analysis – for border detection and object recognition:
– Medical imaging
– Remote sensing imaging
– Microscopy imaging
Character recognition
Computational biology and bioinformatics
Information retrieval
Database segmentation
Cluster analysis – application to image segmentation
Stage 1: Objectives of Cluster Analysis – data features
Stage 2: Research Design Issues
Stage 3: Assumptions in Cluster Analysis
Stage 4: Deriving Clusters and Assessing Overall Fit
Stage 5: Interpreting the Clusters
Stage 6: Validating and Profiling the Clusters
Key questions:
1. What variables are relevant?
2. Is the sample size adequate?
3. Can outliers be detected and, if so, should they be removed?
4. How should object similarity be measured?
5. Should data be standardized?
Cluster analysis – application to image segmentation
Stage 3: Assumptions in Cluster Analysis
2. It is assumed that variables are not correlated; if variables are correlated, remove the correlated variables or use distance measures that compensate for the correlation, such as the Mahalanobis distance.
Microscopy imaging
Objective: tissue identification by pixel clustering
Application – results
Choosing the optimal metric – the DB (Davies–Bouldin) index
For a fixed K, the minimal value of the DB index reveals the most discriminating metric.
[Figure: segmentation maps obtained with the L2, L1 and L0.7 metrics; classes include background, reject class, ferulic acid, lignin and cutin]
Application – results
Dimension reduction has been performed by NMF (non-negative matrix factorization).
[Figure: segmentation results with 4 classes and with 5 classes]
Application – validation of results
[Figure: segmentations obtained by 4-means, 5-means, 6-means, Parzen-Watershed and Mean-Shift]
Open questions
High dimensional data…
Bibliography
Mean shift algorithm
Strengths:
! Application independent tool
! Suitable for real data analysis
! Does not assume any prior shape (e.g. elliptical) on the data clusters
! Can handle arbitrary feature spaces
! Only ONE parameter to choose
! H (the window size) has a physical meaning, unlike K in K-means
Weaknesses:
! The window size (bandwidth selection) is not trivial
! An inappropriate window size can cause modes to be merged, or can generate additional "shallow" modes -> use an adaptive window size
! Computationally expensive for high dimensional data
Example: d = 10, n = 102400, time = 3837.488139 seconds
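A 1-D mean shift sketch makes the role of the window size H concrete. This is a minimal illustration with a Gaussian-weighted mean; the data, starting points and bandwidth are invented:

```python
import math

def mean_shift_1d(start, data, h, iters=50):
    """Move a point uphill on the kernel density estimate until it reaches a
    mode. h is the window size (bandwidth); weights are Gaussian."""
    x = start
    for _ in range(iters):
        weights = [math.exp(-((x - xi) ** 2) / (2 * h * h)) for xi in data]
        x_new = sum(w * xi for w, xi in zip(weights, data)) / sum(weights)
        if abs(x_new - x) < 1e-9:             # converged to a mode
            break
        x = x_new
    return x

# Two made-up modes near 0.0 and 5.0.
data = [-0.2, 0.0, 0.1, 0.2, 4.8, 5.0, 5.1, 5.2]
m1 = mean_shift_1d(0.5, data, h=0.5)
m2 = mean_shift_1d(4.5, data, h=0.5)
```

With h = 0.5 the two starting points converge to two distinct modes; rerunning with a much larger h (say h = 5) merges them into one, which is exactly the bandwidth-selection weakness listed above.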