
Basic Concepts of Data Mining, Clustering and Genetic Algorithms


Tsai-Yang Jea
Department of
Computer Science and Engineering
SUNY at Buffalo

Data Mining Motivation
Mechanical production of data creates a need for
mechanical consumption of data

Large databases = vast amounts of information

Difficulty lies in accessing it
KDD and Data Mining
KDD: Extraction of knowledge from data

non-trivial extraction of implicit,
previously unknown & potentially useful
knowledge from data

Data Mining: Discovery stage of the KDD
process
Data Mining Techniques
Query tools
Statistical techniques
Visualization
On-line analytical processing (OLAP)
Clustering
Classification
Decision trees
Association rules
Neural networks
Genetic algorithms
Any technique that helps to extract more out of data is useful
What Is Clustering?
Clustering is a kind of unsupervised learning.
Clustering is a method of grouping data that share
similar trends and patterns.
Clustering of data is a method by which a large set
of data is grouped into clusters of smaller sets of
similar data.
Example:
Thus, we see clustering means grouping of data or dividing a large
data set into smaller data sets of some similarity.
After clustering:
The usage of clustering
Engineering sciences such as pattern recognition and artificial
intelligence use the concepts of cluster analysis. Typical
examples to which clustering has been applied include handwritten
characters, samples of speech, fingerprints, and pictures.
In the life sciences (biology, botany, zoology, entomology, cytology,
microbiology), the objects of analysis are life forms such as plants,
animals, and insects. Cluster analysis may range from
developing complete taxonomies to classifying species into
subspecies, which can in turn be subdivided further.
Cluster analysis is also widely used in the information, policy, and
decision sciences. Applications of cluster analysis to
documents include votes on political issues, surveys of markets, surveys
of products, surveys of sales programs, and R&D.

A Clustering Example
[Figure: four clusters, labeled Cluster 1 to Cluster 4, each described by a customer profile.]
Income: High, Children: 1, Car: Luxury
Income: Low, Children: 0, Car: Compact
Income: Medium, Children: 3, Car: Sedan
Income: Medium, Children: 2, Car: Truck
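Different ways of representing clusters
[Figure: four panels, (a) to (d), showing the same objects a to k.]
(a), (b): diagrams that group the objects a, b, c, d, e, f, g, h, i, j, k
(c): table with one row per object and one column per cluster:
       1     2     3
  a   0.4   0.1   0.5
  b   0.1   0.8   0.1
  c   0.3   0.3   0.4
  ...
(d): tree diagram over the objects g, a, c, i, e, d, k, b, j, f, h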
K Means Clustering
(Iterative distance-based clustering)
K means clustering is an effective algorithm
to extract a given number of clusters of
patterns from a training set. Once done, the
cluster locations can be used to classify
patterns into distinct classes.
K means clustering (Cont.)
Select the k cluster centers randomly.
Classify the entire training set. For each pattern $X_i$ in the training set, find the nearest cluster center $C$ and classify $X_i$ as a member of $C$.
For each cluster, recompute its center by finding the mean of the cluster:
$$M_k = \frac{1}{N_k} \sum_{j=1}^{N_k} X_{jk}$$
where $M_k$ is the new mean, $N_k$ is the number of training patterns in cluster $k$, and $X_{jk}$ is the $j$-th pattern belonging to cluster $k$.
Loop back to the classification step until the change in cluster means is less than the amount specified by the user.
Store the k cluster centers.
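The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the squared Euclidean distance, the `tolerance` and `max_iterations` parameters, and the choice to draw the initial centers from the training patterns are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(patterns, k, tolerance=1e-6, max_iterations=100, seed=0):
    """Iterative distance-based clustering of a list of numeric tuples."""
    rng = random.Random(seed)
    # Select the k cluster centers randomly (assumption: drawn from the data).
    centers = rng.sample(patterns, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Classify every pattern as a member of its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for x in patterns:
            nearest = min(range(k), key=lambda c: squared_distance(x, centers[c]))
            clusters[nearest].append(x)
        # Recompute each center as the mean of the patterns in its cluster.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Stop once no center moved by more than the user-specified amount.
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tolerance:
            break
    return centers, clusters


patterns = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0),
            (11.0, 1.0), (12.0, 1.0), (13.0, 1.0)]
centers, clusters = k_means(patterns, k=2)
print(sorted(centers))
```

The returned centers can then be used to classify new patterns by the same nearest-center rule.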
The drawbacks of K-means clustering
The final clusters represent only a local
optimum, not a global one, and completely
different final clusters can arise from different
randomly chosen initial cluster centers. (fig. 1)
We have to know in advance how many
clusters we want.
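The first drawback can be reproduced on a tiny hand-made data set. The sketch below is illustrative only: the four points, the two sets of initial centers, and the use of the within-cluster sum of squared errors (SSE) as a quality measure are assumptions chosen so the effect is easy to see. Both runs converge, but to different clusterings.

```python
def assign(points, centers):
    """Index of the nearest center for every point (squared Euclidean distance)."""
    return [
        min(range(len(centers)),
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
        for point in points
    ]


def run_k_means(points, initial_centers, max_iterations=100):
    """K-means from explicit initial centers; returns (labels, SSE)."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        labels = assign(points, centers)
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[c])
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(
        sum((p - q) ** 2 for p, q in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )
    return labels, sse


# Corners of a 4 x 1 rectangle: the natural clusters are the left and right pairs.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good_labels, good_sse = run_k_means(points, [(0.0, 0.0), (4.0, 0.0)])
poor_labels, poor_sse = run_k_means(points, [(2.0, 0.0), (2.0, 1.0)])

print(good_labels, good_sse)  # left/right split
print(poor_labels, poor_sse)  # top/bottom split, stable but much worse
```

The second run stops immediately because its initial centers are already the means of the clusters they induce, even though that clustering has a far larger error.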
Drawback of K-means clustering
(Cont.)
Figure 1
Clustering with Genetic
Algorithms
Introduction to Genetic Algorithms
Elements of GAs
Genetic Representation
Genetic operators

Introduction to GAs
Inspired by biological evolution.
Many operators mimic processes of
biological evolution, including:
Natural selection
Crossover
Mutation
Elements of GAs
Individual (chromosome):
feasible solution in an optimization
problem
Population
Set of individuals
Should be maintained in each generation
Elements of GAs (Cont.)
Genetic operators. (crossover, mutation)
Define the fitness function.
The fitness function takes a single
chromosome as input and returns a
measure of the goodness of the solution
represented by the chromosome.
Genetic Representation
The most important starting point to develop a
genetic algorithm
Each gene has its special meaning
Based on this representation, we can define
fitness evaluation function,
crossover operator,
mutation operator.

Genetic Representation (Cont.)
Example 1
Rule: If Outlook is Overcast or Rain and Wind is Strong, then PlayTennis = Yes
Attribute     Values                   Bits
Outlook       Sunny, Overcast, Rain    0 1 1
Wind          Strong, Normal           1 0
PlayTennis    Yes, No                  1 0
A chromosome: 0 1 1 1 0 1 0
Each position in the chromosome is a gene; the 0 or 1 it holds is its allele value.
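To make the encoding concrete, the following sketch decodes the PlayTennis chromosome back into a rule. It is only an illustration: the function names are hypothetical, and the order of values within each attribute (Sunny, Overcast, Rain for Outlook) is an assumption inferred from the rule and the bit string, since a 0 in the first Outlook position is what excludes Sunny.

```python
# Attribute layout of the chromosome. The value order is an assumption
# inferred from the example rule, not something the slide states explicitly.
ATTRIBUTES = [
    ("Outlook", ["Sunny", "Overcast", "Rain"]),
    ("Wind", ["Strong", "Normal"]),
    ("PlayTennis", ["Yes", "No"]),
]


def decode(chromosome):
    """Map each attribute to the values whose gene holds the allele 1."""
    decoded = {}
    position = 0
    for name, values in ATTRIBUTES:
        bits = chromosome[position:position + len(values)]
        decoded[name] = [value for value, bit in zip(values, bits) if bit == 1]
        position += len(values)
    return decoded


def describe(chromosome):
    """Render the chromosome as an if-then rule; the last attribute is the target."""
    decoded = decode(chromosome)
    conditions = [
        f"{name} is {' or '.join(decoded[name])}" for name, _ in ATTRIBUTES[:-1]
    ]
    target = ATTRIBUTES[-1][0]
    return f"If {' and '.join(conditions)}, then {target} = {' or '.join(decoded[target])}"


rule = decode([0, 1, 1, 1, 0, 1, 0])
print(describe([0, 1, 1, 1, 0, 1, 0]))
```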
Genetic Representation (Cont.)
Examples 2 ( In clustering problem)
Each chromosome represents a set of clusters; each
gene represents an object; each allele value represents a
cluster. Genes with the same allele value are in the
same cluster.
1 2 1 4 3 5 5
A B C D E F G
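A short sketch of how such a chromosome can be read back into clusters: genes are matched to objects by position, and objects sharing an allele value are grouped together. The function name `decode_clusters` is hypothetical.

```python
def decode_clusters(objects, chromosome):
    """Group objects by the allele value of their gene (same value, same cluster)."""
    clusters = {}
    for obj, allele in zip(objects, chromosome):
        clusters.setdefault(allele, []).append(obj)
    return clusters


clusters = decode_clusters("ABCDEFG", [1, 2, 1, 4, 3, 5, 5])
for allele in sorted(clusters):
    print(allele, clusters[allele])
```

In this example objects A and C share cluster 1 and objects F and G share cluster 5, while B, D, and E each sit in a cluster of their own.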
Crossover
Exchange features of two individuals to produce
two offspring (children)
Selected mates tend to have good properties that
help them survive into the next generations,
so we can expect that exchanging features may
produce other good individuals
Crossover (cont.)
Single-point Crossover
Parents:   11011001000 and 00001010101
Offspring: 11011010101 and 00001001000

Two-point Crossover
Parents:   11011001000 and 00001010101
Offspring: 11001011000 and 00011000101

Uniform Crossover
Parents:   11011001000 and 00001010101
Crossover template: 10101010011
Offspring: 10001000100 and 01011011001
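All three crossover variants can be expressed through a single template (mask) operation, which the sketch below uses to reproduce the offspring of the parent strings 11011001000 and 00001010101. The function names are hypothetical, and the convention that a 1 in the template means "first child copies from the first parent" is an assumption. The cut positions for the single-point and two-point cases are inferred from the offspring in the example.

```python
import random


def crossover_with_template(parent_a, parent_b, template):
    """Where the template holds 1, child 1 copies parent A and child 2 copies
    parent B; where it holds 0, the roles are swapped."""
    child_1 = "".join(a if t == "1" else b
                      for a, b, t in zip(parent_a, parent_b, template))
    child_2 = "".join(b if t == "1" else a
                      for a, b, t in zip(parent_a, parent_b, template))
    return child_1, child_2


def single_point_template(length, point):
    """1s before the crossover point, 0s from the point onward."""
    return "1" * point + "0" * (length - point)


def two_point_template(length, start, end):
    """0s for positions start..end-1 (the swapped segment), 1s elsewhere."""
    return "1" * start + "0" * (end - start) + "1" * (length - end)


def random_template(length, rng):
    """Uniform crossover: every position is chosen independently at random."""
    return "".join(rng.choice("01") for _ in range(length))


parent_a, parent_b = "11011001000", "00001010101"

single = crossover_with_template(parent_a, parent_b, single_point_template(11, 5))
double = crossover_with_template(parent_a, parent_b, two_point_template(11, 2, 7))
uniform = crossover_with_template(parent_a, parent_b, "10101010011")

print(single)
print(double)
print(uniform)
```

In practice the crossover point or template would be drawn at random for each pair of mates, for example with `random_template(len(parent_a), random.Random())`.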
Mutation
Usually changes a single bit in a bit string
This operator should be applied with very low
probability.

Example: 0 1 0 1 1 becomes 0 1 1 1 1 (the mutation point is chosen at random)
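A minimal sketch of single-bit mutation on a bit string follows. The function names and the `mutation_rate` parameter are assumptions for illustration; the slide only says the probability should be very low, not what value to use.

```python
import random


def flip_bit(chromosome, point):
    """Return a copy of the bit string with the bit at `point` inverted."""
    flipped = "1" if chromosome[point] == "0" else "0"
    return chromosome[:point] + flipped + chromosome[point + 1:]


def mutate(chromosome, mutation_rate, rng):
    """With probability `mutation_rate`, flip one randomly chosen bit."""
    if rng.random() < mutation_rate:
        return flip_bit(chromosome, rng.randrange(len(chromosome)))
    return chromosome


# The example from the slide: the mutation point is the third bit.
print(flip_bit("01011", 2))

rng = random.Random(0)
print(mutate("01011", 0.01, rng))  # almost always returns the input unchanged
```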
Typical Procedures
Crossover mates are probabilistically
selected based on their fitness value.
[Figure: an old generation of 5-bit chromosomes is turned into a new generation in three labeled steps: probabilistically select individuals; crossover at a randomly selected crossover point; mutation at a random mutation point.]
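The procedure (probabilistic selection, crossover at a random point, occasional mutation) can be put together as one generation step. The sketch below is a toy under stated assumptions: the fitness function simply counts 1 bits, which is a placeholder and not the fitness used in the slides; selection is fitness-proportional via `random.choices`; and the starting strings, mutation rate, and number of generations are arbitrary.

```python
import random


def count_ones(chromosome):
    """Placeholder fitness (an assumption): the number of 1 bits."""
    return chromosome.count("1")


def select_mates(population, fitness, rng):
    """Pick two mates with probability proportional to fitness.
    The +1 keeps an all-zero chromosome selectable and the weights positive."""
    weights = [fitness(c) + 1 for c in population]
    return rng.choices(population, weights=weights, k=2)


def single_point_crossover(a, b, rng):
    """Swap the tails of two strings after a randomly selected crossover point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rate, rng):
    """With a low probability, flip one bit at a random mutation point."""
    if rng.random() < rate:
        i = rng.randrange(len(chromosome))
        bit = "1" if chromosome[i] == "0" else "0"
        chromosome = chromosome[:i] + bit + chromosome[i + 1:]
    return chromosome


def next_generation(population, fitness, rng, mutation_rate=0.05):
    """Build a new generation of the same size as the old one."""
    new_population = []
    while len(new_population) < len(population):
        a, b = select_mates(population, fitness, rng)
        for child in single_point_crossover(a, b, rng):
            new_population.append(mutate(child, mutation_rate, rng))
    return new_population[:len(population)]


rng = random.Random(0)
population = ["01001", "11010", "00111", "01011"]  # arbitrary 5-bit strings
for _ in range(10):
    population = next_generation(population, count_ones, rng)
best = max(population, key=count_ones)
print(population, best)
```

Keeping the population size constant from one generation to the next matches the requirement that the population be maintained in each generation.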
How to apply GA on a clustering problem
Preparing the chromosomes
Example: 1 2 3 3 5

Defining genetic operators
Fusion: takes two unique allele values and combines them into a
single allele value, combining two clusters into one.
Example: 1 2 3 3 5 becomes 3 2 3 3 5

Fission: takes a single allele value and gives it a different random
allele value, breaking a cluster apart.
Example: 1 3 3 3 5 becomes 1 3 4 4 5

Defining fitness functions
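The fusion and fission operators can be sketched on cluster-label chromosomes as follows. This is one possible reading, not the author's definition: the function names are hypothetical, fission is assumed to move a random non-empty proper subset of a cluster's genes to a randomly chosen unused label, and labels are assumed to lie in the range 1 to the chromosome length.

```python
import random


def fusion(chromosome, keep, absorb):
    """Combine two clusters: every gene holding `absorb` now holds `keep`."""
    return [keep if allele == absorb else allele for allele in chromosome]


def fission(chromosome, allele, rng):
    """Break the cluster `allele` apart by relabeling some of its genes.

    Assumption: a random non-empty proper subset of the cluster's genes is
    moved to a label not currently in use (labels range over 1..len)."""
    members = [i for i, a in enumerate(chromosome) if a == allele]
    if len(members) < 2:
        return list(chromosome)  # a single-object cluster cannot be split
    unused = sorted(set(range(1, len(chromosome) + 1)) - set(chromosome))
    new_allele = rng.choice(unused)
    moved = rng.sample(members, rng.randrange(1, len(members)))
    child = list(chromosome)
    for i in moved:
        child[i] = new_allele
    return child


# Merging allele 1 into allele 3, as in the chromosome 1 2 3 3 5.
fused = fusion([1, 2, 3, 3, 5], keep=3, absorb=1)
print(fused)

# Splitting the three-gene cluster 3 in the chromosome 1 3 3 3 5.
split = fission([1, 3, 3, 3, 5], 3, random.Random(0))
print(split)
```

Because fission is random, different seeds give different splits; one possible outcome for the second call is 1 3 4 4 5.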
Example: (Cont.)
Select the chromosomes according to the fitness function.
[Figure: an old generation of chromosomes becomes a new generation through crossover, mutation, fusion, and fission. Chromosomes shown: 1 2 3 3 5; 1 2 4 3 5; 2 1 3 3 5; 2 2 4 3 5; 1 1 1 3 5; 2 2 3 2 5; 1 2 5 3 5; 2 4 3 3 4; 2 2 4 3 5; 2 1 2 3 5.]
Finally
