Tjea

Basic concepts of Data Mining, Clustering and Genetic Algorithms
Tsai-Yang Jea
Department of Computer Science and Engineering SUNY at Buffalo
Data Mining Motivation

Mechanical production of data need for mechanical consumption of data Large databases = vast amounts of information Difficulty lies in accessing it
KDD and Data Mining

KDD: Extraction of knowledge from data non-trivial extraction of implicit, previously unknown & potentially useful knowledge from data
Data Mining: Discovery stage of the KDD process
Data Mining Techniques

Any technique that helps to extract more out of data is useful
Query tools Statistical techniques Visualization On-line analytical processing (OLAP) Clustering
Classification Decision trees Association rules Neural networks Genetic algorithms
Whats Clustering
Clustering is a kind of unsupervised learning. Clustering is a method of grouping data that share similar trend and patterns. Clustering of data is a method by which large sets of data is grouped into clusters of smaller sets of similar data.
Example:
After clustering: Thus, we see clustering means grouping of data or dividing a large data set into smaller data sets of some similarity.
The usage of clustering

Some engineering sciences such as pattern recognition, artificial intelligence have been using the concepts of cluster analysis. Typical examples to which clustering has been applied include handwritten characters, samples of speech, fingerprints, and pictures. In the life sciences (biology, botany, zoology, entomology, cytology, microbiology), the objects of analysis are life forms such as plants, animals, and insects. The clustering analysis may range from developing complete taxonomies to classification of the species into subspecies. The subspecies can be further classified into subspecies. Clustering analysis is also widely used in information, policy and decision sciences. The various applications of clustering analysis to documents include votes on political issues, survey of markets, survey of products, survey of sales programs, and R & D.
A Clustering Example
Income: High Children:1 Car:Luxury
Cluster 1
Income: Low Children:0 Car:Compact
Income: Medium Children:2 Car: Sedan and Car:Truck Children:3 Income: Medium Cluster 4
Cluster 3
Cluster 2
Different ways of representing clusters

(a)
a k g (c)
a b c
(b) d j h i
1 0.4 0.1 0.3 2 3 0.1 0.5 0.8 0.1 0.3 0.4
e c f b
d a
e j g
h
i
c f
(d)
a c i e d kb j f h
...
K Means Clustering (Iterative distance-based clustering)

K means clustering is an effective algorithm to extract a given number of clusters of patterns from a training set. Once done, the cluster locations can be used to classify patterns into distinct classes.
K means clustering (Cont.)

Select the k cluster centers randomly.
Classify the entire training set. For each patern X i in the training set, find
Loop until the change in cluster means is less the amount specified by the user.
the nearest cluster center C and classify X i as a member of C .
For each cluster,recomputeits center by finding the mean of the cluster : 1 Nk Mk X jk N k j 1 where M k is the new mean, N k is the number of training patterns in cluster k , and X jk is the j - th pattern belonging to cluster k .
Store the k cluster centers.
The drawbacks of K-means clustering

The final clusters do not represent a global optimization result but only the local one, and complete different final clusters can arise from difference in the initial randomly chosen cluster centers. (fig. 1) We have to know how many clusters we will have at the first.
Drawback of K-means clustering (Cont.)
Figure 1
Clustering with Genetic Algorithm

Introduction of Genetic Algorithm Elements consisting GAs Genetic Representation Genetic operators
Introduction of GAs
Inspired by biological evolution. Many operators mimic the process of the biological evolution including Natural selection Crossover Mutation
Elements consisting GAs

Individual (chromosome): feasible solution in an optimization problem Population Set of individuals Should be maintained in each generation
Elements consisting GAs

Genetic operators. (crossover, mutation) Define the fitness function. The fitness function takes a single chromosome as input and returns a measure of the goodness of the solution represented by the chromosome.
Genetic Representation
The most important starting point to develop a genetic algorithm Each gene has its special meaning Based on this representation, we can define fitness evaluation function, crossover operator, mutation operator.
Genetic Representation (Cont.)

Examples 1 Outlook 011
Sunny Overcast Rain Strong Normal Yes No
Wind 10
PlayTennis 10
If Outlook is Overcast or Rain and Wind is Strong, then PlayTennis = Yes
Gene 0 1 1 1 0 1 0 A chromosome Allele value
Genetic Representation (Cont.)

Examples 2 ( In clustering problem) Each chromosome represents a set of clusters; each gene represents an object; each allele value represents a cluster. Genes with the same allele value are in the same cluster.
1 2 1 4 3 5 5 A B C D E F G
Crossover
Exchange features of two individuals to produce two offspring (children) Selected mates may have good properties to survive in next generations So, we can expect that exchanging features may produce other good individuals
Crossover (cont.)
Single-point Crossover
1 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 0 1 1 0 0 1 0 0 1 1 0 0 0 1 0
Two-point Crossover
1 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 1
Uniform Crossover
1
1 0
Crossover template
0
0 1
0
1 0
0
1 0
1
0 0
1
1 1
0
0 0
1
0 1
0
1 0
1
0 0
1
0 1 1 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1
Mutation
Usually change a single bit in a bit string This operator should happen with very low probability.
0 1 1 0 1 Mutation point (random) 0 1 1 1 1
Typical Procedures
0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 0 1 0 0 1 0 1 1 1 0 0 0 1 1 1 0 1 1 0 1 0 0 1 1
old generation Probabilistically select individuals Crossover mates are probabilistically selected based on their fitness value.
Crossover point randomly selected
Mutation point (random)

0 1 1 1 1
1
0 new generation 0 1
1
1 1 1 1
0
1 1 1 1
0
1 0 1 0
1
1 1 0 0
How to apply GA on a clustering problem

Preparing the chromosomes
1 2 3 3 5
Defining genetic operators

Fusion: takes two unique allele values and combines them into a single allele value, combining two clusters into one.
1 2 3 3 5 3 2 3 3 5
Fission: takes a single allele value and gives it a different random allele value, breaking a cluster apart.
1 3 3 3 5 1 3 4 4 5
Defining fitness functions
Example: (Cont.)
Old generation
1 2 3 3 5 1 2 3 4 5 2 1 3 3 5 2 2 3 4 5 1 1 3 1 5
Crossover Mutation Fusion Fission
Select the chromosomes according to the fitness function.
2 2 2 3 5 1 2 3 5 5 2 4 3 3 4 2 2 3 4 5 2 1 3 2 5
New generation
Finally

Tjea

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Tjea

Uploaded by

Copyright:

Available Formats

Basic concepts of Data Mining, Clustering and Genetic Algorithms

Data Mining Motivation

KDD and Data Mining

Data Mining: Discovery stage of the KDD process

Data Mining Techniques

Classification Decision trees Association rules Neural networks Genetic algorithms

The usage of clustering

Income: High Children:1 Car:Luxury

Different ways of representing clusters

K Means Clustering (Iterative distance-based clustering)

K means clustering (Cont.)

the nearest cluster center C and classify X i as a member of C .

Store the k cluster centers.

The drawbacks of K-means clustering

Drawback of K-means clustering (Cont.)

Clustering with Genetic Algorithm

Elements consisting GAs

Elements consisting GAs

Genetic Representation (Cont.)

If Outlook is Overcast or Rain and Wind is Strong, then PlayTennis = Yes

Gene 0 1 1 1 0 1 0 A chromosome Allele value

Genetic Representation (Cont.)

Crossover point randomly selected

Mutation point (random)

How to apply GA on a clustering problem

Defining genetic operators

Defining fitness functions

Crossover Mutation Fusion Fission

Select the chromosomes according to the fitness function.

You might also like