Principles of Knowledge Discovery in Databases
Fall 1999
Course Content
• Introduction to Data Mining
• Data warehousing and OLAP
• Data cleaning
• Data mining operations
• Data summarization
• Association analysis
• Classification and prediction
• Clustering
• Web Mining
• Similarity Search
• Other topics if time permits

Chapter 8 Objectives
• Learn basic techniques for data clustering.
• Understand the issues and the major challenges in clustering large data sets in multi-dimensional spaces.
Dr. Osmar R. Zaïane, 1999 • Principles of Knowledge Discovery in Databases • University of Alberta
What is Clustering in Data Mining?

Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
– Helps users understand the natural grouping or structure in a data set.
• Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group.
• Clustering: unsupervised classification: no predefined classes.

Supervised and Unsupervised

• Supervised classification = classification: we know the class labels and the number of classes.
• Unsupervised classification = clustering: we do not know the class labels and may not know the number of classes.

[Figure: objects 1…n shown with known class labels (gray, red, blue, green, black) in the supervised case, and with unknown labels “?” in the unsupervised case]
Examples of Clustering Applications

• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
• Land use: identification of areas of similar land use in an earth observation database.
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
• City planning: identifying groups of houses according to their house type, value, and geographical location.
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults.

Data Clustering Outline

• What is cluster analysis?
• What do we use clustering for?
• Are there different approaches to data clustering?
• What are the major clustering techniques?
• What are the challenges to data clustering?
– Model-based methods: a model is hypothesized for each cluster, and the idea is to find the best fit of the data to that model.
Partitioning Methods

– Heuristic methods: the k-means and k-medoids algorithms.
– k-means (MacQueen’67): each cluster is represented by the center of the cluster.
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster.
The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in 4 steps:
  1. Partition objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when there are no more new assignments.

[Figure: four scatter plots showing successive iterations of k-means on a 2-D data set]

Comments on the K-Means Method

• Strengths of k-means:
  – Relatively efficient: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations. Normally, k, t << n.
  – Often terminates at a local optimum.
• Weaknesses of k-means:
  – Applicable only when the mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance.
  – Not suitable for discovering clusters with non-convex shapes.
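The four steps above can be sketched in a few lines. This is a minimal illustration, not the slide's reference implementation: the 2-D data, the random seeding, and the squared-Euclidean distance are illustrative assumptions.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: start from k arbitrary seed points (here: random objects).
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Step 2: recompute each centroid as the mean point of its cluster.
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when no assignment (hence no centroid) changes.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated blobs, the loop typically converges in a handful of iterations, illustrating the "often terminates at a local optimum" comment above.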
PAM (Partitioning Around Medoids) (1987)

• PAM (Kaufman and Rousseeuw, 1987), built into S+.
• Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily.
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih. If TC_ih < 0, i is replaced by h.
  3. Then assign each non-selected object to the most similar representative object.
  4. Repeat steps 2–3 until there is no change.
• Complexity per iteration: O(k(n-k)²).

CLARA (Clustering Large Applications) (1990)

• CLARA (Kaufman and Rousseeuw, 1990), built into statistical analysis packages such as S+.
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.
• Complexity: O(kS² + k(n-k)) for a sample of size S.
• Strength of CLARA:
  – Deals with larger data sets than PAM.
• Weaknesses of CLARA:
  – Efficiency depends on the sample size.
  – A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
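PAM's swap test can be made concrete with a short sketch: the cost of a configuration is the total distance of every object to its nearest medoid, and TC_ih is the change in that total when medoid i is replaced by candidate h. The Manhattan distance and the toy data are illustrative assumptions, not part of the original algorithm's specification.

```python
def total_cost(points, medoids):
    # Sum, over all objects, of the distance to the nearest medoid
    # (Manhattan distance, for illustration).
    return sum(min(abs(p[0] - m[0]) + abs(p[1] - m[1]) for m in medoids)
               for p in points)

def swap_cost(points, medoids, i, h):
    # TC_ih: cost after replacing medoid i by non-selected object h,
    # minus the cost before the swap. Negative means the swap improves things.
    swapped = [h if m == i else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)
```

For example, with objects (0,0), (1,0), (10,0), (11,0) and the poor medoid set {(0,0), (1,0)}, swapping (1,0) for (10,0) yields a negative TC_ih, so PAM would accept the swap.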
CLARANS (“Randomized” CLARA) (1994)

• CLARANS (A Clustering Algorithm based on Randomized Search) by Ng and Han in 1994.
• CLARANS draws samples of neighbours dynamically.
• The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids.
• If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum.
• It is more efficient and scalable than both PAM and CLARA.
• Focusing techniques and spatial access structures may further improve its performance (Ester et al.’95).

Two Types of Hierarchical Clustering Algorithms

• Agglomerative (bottom-up): merge clusters iteratively.
  – Start by placing each object in its own cluster.
  – Merge these atomic clusters into larger and larger clusters, until all objects are in a single cluster.
  – Most hierarchical methods belong to this category. They differ only in their definition of between-cluster similarity.
• Divisive (top-down): split a cluster iteratively.
  – It does the reverse by starting with all objects in one cluster and subdividing them into smaller pieces.
  – Divisive methods are not generally available, and have rarely been applied.
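The agglomerative (bottom-up) procedure can be sketched directly: start with singleton clusters and repeatedly merge the closest pair until the desired number of clusters remains. Single-link (closest-pair) distance on 1-D points is one illustrative choice of between-cluster similarity; as the slide notes, methods differ precisely in this definition.

```python
from itertools import combinations

def agglomerative(points, target_k):
    # Start: each object in its own (singleton) cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Merge the pair of clusters with the smallest single-link
        # distance (minimum pairwise distance between members).
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: min(abs(p - q)
                                      for p in clusters[ij[0]]
                                      for q in clusters[ij[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters
```

Running this to completion (target_k = 1) would trace out the full merge hierarchy; stopping earlier cuts the dendrogram at a chosen level.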
[Figure: dendrogram over a small set of objects (d, e, de, cde, …), read left-to-right as agglomerative merging and right-to-left as divisive (DIANA) splitting]
BIRCH (1996)

• BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD’96).
• Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
  – Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
  – Phase 2: apply a clustering algorithm to the leaf nodes of the CF-tree.
• Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
• Weakness: handles only numeric data, and is sensitive to the order of the data records.

Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)
  N: number of data points
  LS: ∑_{i=1}^{N} X_i (linear sum of the points)
  SS: ∑_{i=1}^{N} X_i² (sum of squares of the points)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
  CF = (5, (16,30), (54,190))

See class presentation for algorithm details.
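The CF example above can be verified directly. This short sketch computes CF = (N, LS, SS) componentwise for 2-D points; the function name is illustrative.

```python
def clustering_feature(points):
    # N: number of data points
    n = len(points)
    # LS: componentwise linear sum of the points
    ls = tuple(sum(p[d] for p in points) for d in range(2))
    # SS: componentwise sum of squares of the points
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(2))
    return (n, ls, ss)

cf = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
# cf == (5, (16, 30), (54, 190)), matching the slide's example
```

Because N, LS, and SS are all sums, the CF of a merged pair of subclusters is just the componentwise sum of their CFs; this additivity is what lets BIRCH maintain the CF tree incrementally during a single scan.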
OPTICS: A Cluster-Ordering Method (1999)

• OPTICS: Ordering Points To Identify the Clustering Structure, by Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99).
  – An extension of DBSCAN.
  – Produces a special order of the database with regard to its density-based clustering structure.
  – This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings.
  – Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure.
  – Can be represented graphically or using visualization techniques.

CLIQUE (1998)

• CLIQUE (Clustering In QUEst) by Agrawal, Gehrke, Gunopulos, and Raghavan (SIGMOD’98).
• Automatic subspace clustering of high-dimensional data.
• CLIQUE can be considered both density-based and grid-based.
• Input parameters: the size of the grid and a global density threshold.
• It partitions an m-dimensional data space into non-overlapping rectangular units.
• A unit is dense if the fraction of total data points contained in the unit exceeds the input threshold.
• A cluster is a maximal set of connected dense units.
Clustering Categorical Data: ROCK

• ROCK: RObust Clustering using linKs, by S. Guha, R. Rastogi, and K. Shim (ICDE’99).
  – Uses links to measure similarity/proximity.
  – Not distance-based.
  – Computational complexity: O(n² + n·m_m·m_a + n² log n), where m_m is the maximum and m_a the average number of neighbours.
• Basic ideas:
  – Similarity function and neighbours:
    Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
  – Let T1 = {1,2,3}, T2 = {3,4,5}. Then
    Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
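The similarity function above is the Jaccard coefficient on transactions viewed as sets, and Python's set operators reproduce the worked example directly:

```python
def sim(t1, t2):
    # |T1 ∩ T2| / |T1 ∪ T2| — the slide's set-based similarity function
    return len(t1 & t2) / len(t1 | t2)

print(sim({1, 2, 3}, {3, 4, 5}))  # → 0.2
```

ROCK then calls two transactions neighbours when their similarity exceeds a threshold, and links (counts of common neighbours), rather than this raw similarity, drive the clustering.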