CLUSTERING AND CLASSIFICATION
Spring 2007, SJSU
Benjamin Lam
Overview
Definition of Clustering
Existing clustering methods
Clustering examples
Classification
Classification examples
Conclusion
Definition
Clustering can be considered the most important
unsupervised learning technique; like every other
problem of this kind, it deals with finding
structure in a collection of unlabeled data.
Simplifications
Pattern detection
Useful in data concept construction
Unsupervised learning process
Where to use clustering?
Data mining
Information retrieval
Text mining
Web analysis
Marketing
Medical diagnostics
Which method should I use?
Type of attributes in the data
Scalability to larger datasets
Ability to work with irregular data
Time cost
Complexity
Data order dependency
Result presentation
Major existing clustering methods
Distance-based
Hierarchical
Partitioning
Probabilistic
Measuring Similarity
Dissimilarity/Similarity metric: Similarity is
expressed in terms of a distance function, which is
typically metric: d(i, j)
There is a separate quality function that
measures the goodness of a cluster.
The definitions of distance functions are usually
very different for interval-scaled, boolean,
categorical, ordinal and ratio variables (see the
sketch below).
Weights should be associated with different
variables based on the application and data
semantics.
It is hard to define "similar enough" or "good enough";
the answer is typically highly subjective.
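To make the point about variable types and weights concrete, here is a minimal sketch of a weighted distance over mixed attribute types. The function name, the two supported types, and the example data and weights are illustrative assumptions, not part of the original slides.

# Sketch: the distance definition changes with the variable type.
# Interval-scaled attributes contribute a numeric difference; categorical
# attributes contribute a simple mismatch (0 or 1). Weights are illustrative.
def mixed_distance(i, j, types, weights):
    """Weighted d(i, j) over 'interval' and 'categorical' attributes."""
    total = 0.0
    for a, kind in enumerate(types):
        if kind == "interval":
            diff = abs(i[a] - j[a])
        else:  # categorical: 0 if the values match, 1 otherwise
            diff = 0.0 if i[a] == j[a] else 1.0
        total += weights[a] * diff
    return total

# Height in metres (interval) and gender (categorical), equally weighted.
print(mixed_distance((1.6, "F"), (2.0, "M"),
                     types=("interval", "categorical"), weights=(1.0, 1.0)))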
Distance-based method
In this case we easily identify the 4 clusters into which the data
can be divided; the similarity criterion is distance: two or more
objects belong to the same cluster if they are close according to
a given distance. This is called distance-based clustering.
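As an illustration of distance-based clustering, here is a minimal k-means-style sketch: each point is assigned to its nearest centroid and the centroids are recomputed. The slide does not prescribe a particular algorithm, so the function names and the sample points are illustrative assumptions.

# Minimal sketch of distance-based clustering (k-means style).
import math
import random

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iterations=20):
    """Assign each point to its nearest centroid, then recompute centroids."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                dims = len(members[0])
                centroids[i] = tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(dims))
    return clusters

points = [(1, 1), (1.2, 0.8), (5, 5), (5.1, 4.9), (9, 1), (0.9, 1.1)]
print(kmeans(points, k=3))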
Hierarchical clustering
Agglomerative (bottom up):
1. Start with one point (a singleton cluster).
2. Recursively merge the two (or more) most appropriate clusters.
3. Stop when k clusters are reached.
Divisive (top down):
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters are reached.
General steps of hierarchical clustering
Given a set of N items to be clustered, and an N*N distance
(or similarity) matrix, the basic process of hierarchical
clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to its own cluster, so that if
you have N items, you now have N clusters, each containing
just one item. Let the distances (similarities) between
the clusters be the same as the distances (similarities)
between the items they contain.
2. Find the closest (most similar) pair of clusters and merge
them into a single cluster, so that you now have one
cluster fewer.
3. Compute the distances (similarities) between the new
cluster and each of the old clusters.
4. Repeat steps 2 and 3 until the items are grouped into K
clusters (see the sketch below).
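The procedure above can be sketched directly in code. This is a minimal single-link (closest-pair) version; the helper names, the choice of single-link distance, and the example data are illustrative assumptions, not part of Johnson's description.

# Minimal sketch of the agglomerative procedure above (single-link).
def hierarchical_clustering(items, distance, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    # Step 1: every item starts as its own singleton cluster.
    clusters = [[item] for item in items]

    def cluster_distance(a, b):
        # Single-link: distance between the closest pair of members.
        return min(distance(x, y) for x in a for y in b)

    # Steps 2-4: repeatedly merge the closest pair and recompute distances.
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Example with 1-D items and absolute difference as the distance.
print(hierarchical_clustering([1, 2, 9, 10, 25], lambda x, y: abs(x - y), k=2))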
Example: a grading decision tree on score x
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 50 then grade = F.
(Figure: the same rules drawn as a tree, with threshold tests at the internal nodes and grades A, B, C, D, F at the leaves; a small coded version follows.)
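Read as code, the grading rules are just a chain of threshold tests. In this sketch the A cut-off (x >= 90) and the F cut-off (x < 60) are assumptions, since the slide only states the B, C, D rules and an F rule at x < 50 explicitly.

# The grading rules above as a chain of threshold tests. The A cut-off
# (x >= 90) and the F cut-off (x < 60) are assumed, not shown on the slide.
def grade(x):
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"

print([grade(s) for s in (95, 85, 75, 65, 40)])  # ['A', 'B', 'C', 'D', 'F']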
Classification Techniques
Approach:
1. Create a specific model by evaluating training data (or using domain experts' knowledge).
2. Apply the model to new data.
Classes must be predefined
Most common techniques use decision trees (DTs),
neural networks (NNs), or distance-based approaches.
Defining Classes
Distance based
Partitioning based
Classification Using Regression
Division: use a regression function to divide the area into regions (see the sketch below).
Prediction: use a regression function to predict a class membership function; the input includes the desired class.
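A minimal sketch of the "division" idea: fit an ordinary least-squares line on a single attribute (height) against a 0/1 class label, then classify by which side of 0.5 the prediction falls. The Tall/not-Tall labels, the training values, and the 0.5 threshold are illustrative assumptions rather than the exact method on the slide.

# Sketch of regression-based classification on one attribute (height).
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

heights = [1.6, 2.0, 1.9, 1.7, 2.2, 1.75]
is_tall = [0, 1, 1, 0, 1, 0]          # 1 = Tall, 0 = not Tall (assumed labels)
a, b = fit_line(heights, is_tall)

def classify(height):
    # The regression prediction divides the height axis at the 0.5 level.
    return "Tall" if a * height + b >= 0.5 else "Not tall"

print(classify(1.95), classify(1.65))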
Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
Division and Prediction (figures)
Classification Using Distance
Place items in class to which they
are closest.
Must determine distance between
an item and a class.
Classes can be represented by
Centroid: a central value (see the sketch below).
Medoid: a representative point.
Individual points
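A minimal centroid-based sketch of distance-based classification: each class is represented by the mean of its training values, and a new item goes to the class whose centroid is closest. The class names and training heights are illustrative assumptions.

# Sketch: centroid-based classification on a single attribute.
def centroid(values):
    return sum(values) / len(values)

training = {                      # class -> training heights (metres), assumed
    "Short":  [1.6, 1.6, 1.7],
    "Medium": [1.8, 1.85, 1.9],
    "Tall":   [2.0, 2.1, 2.2],
}
centroids = {label: centroid(vals) for label, vals in training.items()}

def classify(height):
    """Assign the item to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(height - centroids[label]))

print(classify(1.65), classify(1.95))  # e.g. Short Medium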
Algorithm: KNN
K Nearest Neighbor (KNN):
Training set includes classes.
Examine the K items nearest to the item being classified.
The new item is placed in the class that contains the most of those K neighbors (sketched below).
O(q) for each tuple to be classified. (Here q is the size of the training set.)
KNN
KNN Algorithm
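The details of the original algorithm slide are not reproduced here, so the following is only a minimal sketch matching the KNN description above: scan the training set, keep the K nearest items, and take a majority vote. The training pairs and K=3 are illustrative assumptions.

# Minimal KNN sketch: distance scan over the training set, then a vote.
from collections import Counter

def knn_classify(training, new_height, k=3):
    """training is a list of (height, class) pairs; O(q) distance scans."""
    nearest = sorted(training, key=lambda pair: abs(pair[0] - new_height))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [(1.6, "Short"), (1.7, "Short"), (1.8, "Medium"),
            (1.9, "Medium"), (2.0, "Tall"), (2.2, "Tall")]
print(knn_classify(training, 1.85, k=3))  # Medium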
Classification Using Decision Trees
Partitioning based: Divide
search space into rectangular
regions.
Tuple placed into class based on
the region within which it falls.
DT approaches differ in how the tree is built: DT Induction.
Internal nodes are associated with an attribute, and arcs with values of that attribute.
Decision Tree
Given:
D = {t1, ..., tn} where ti = <ti1, ..., tih>
Database schema contains {A1, A2, ..., Ah}
Classes C = {C1, ..., Cm}
A Decision or Classification Tree is a tree associated with D such that:
Each internal node is labeled with an attribute, Ai
Each arc is labeled with a predicate that can be applied to the attribute at its parent
Each leaf node is labeled with a class, Cj
(A minimal structural sketch follows.)
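A minimal structural sketch of the definition above: internal nodes hold an attribute, each arc holds a predicate on that attribute, and leaves hold a class. The class names, the Height attribute, and the thresholds are illustrative assumptions.

# Sketch of the decision-tree structure defined above.
class Leaf:
    def __init__(self, klass):
        self.klass = klass                # leaf node: labeled with a class

class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute        # internal node: labeled with an attribute
        self.branches = branches          # arcs: list of (predicate, subtree)

def classify(tree, tuple_):
    if isinstance(tree, Leaf):
        return tree.klass
    value = tuple_[tree.attribute]
    for predicate, subtree in tree.branches:
        if predicate(value):
            return classify(subtree, tuple_)

# A hand-built tree for the height data (thresholds are illustrative).
tree = Node("Height", [
    (lambda h: h < 1.7,  Leaf("Short")),
    (lambda h: h < 1.95, Leaf("Medium")),
    (lambda h: True,     Leaf("Tall")),
])
print(classify(tree, {"Height": 1.88}))  # Medium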
DT Induction
Comparing DTs
(Figure: a balanced tree vs. a deep tree.)
ID3
Creates the tree using information-theory concepts and
tries to reduce the expected number of comparisons.
ID3 chooses the split attribute with the highest
information gain, using entropy as the basis for the
calculation (see the sketch below).
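A minimal sketch of the entropy and information-gain computations ID3 relies on when choosing a split attribute. The helper names and the tiny data set are illustrative assumptions; this is not ID3's full tree-building loop.

# Entropy and information gain, as used by ID3 to pick the split attribute.
import math
from collections import Counter

def entropy(labels):
    """H = -sum p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, class_key="class"):
    """Entropy of the whole set minus the weighted entropy after splitting
    on `attribute`; ID3 picks the attribute with the largest gain."""
    labels = [r[class_key] for r in rows]
    before = entropy(labels)
    after = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[class_key] for r in rows if r[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

rows = [
    {"Gender": "F", "class": "Short"},
    {"Gender": "F", "class": "Medium"},
    {"Gender": "M", "class": "Tall"},
    {"Gender": "M", "class": "Medium"},
]
print(information_gain(rows, "Gender"))  # 0.5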
Conclusion
Very useful in data mining
Applicable to both text-based and graphical data
Helps simplify data complexity
Classification detects hidden patterns in data
References
M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
Dr. Sin-Min Lee, San Jose State University
Mu-Yu Lu, SJSU
Silberschatz, Korth, Sudarshan, Database System Concepts