
Seminar on

Data Mining Using Genetic Algorithm (DMGA)


Presented By
Pramod Vishwakarma, M.Tech.[CSE], IIIrd Sem, CET Moradabad, param.vish@gmail.com

Supervisor
Prof. Rajiv Kumar Nath

Contents
What Is Data Mining?
Architecture of Typical Data Mining System
Biological Terminologies
What is Genetic Algorithm (GA)?
Basic Principles of GA
Why Data Mining using Genetic Algorithm?
Functions of Genetic Algorithm
Pseudo Code of GA
Applications of GA
Advantages and Disadvantages
The Tool MATLAB
Conclusion & Future Work
References

What Is Data Mining?


Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data [1].
Data mining: a misnomer?

Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Is everything data mining?


Not data mining: simple search and query processing, expert systems

Architecture: Typical Data Mining System


[Figure: layered architecture of a typical data mining system, listed top to bottom]
Graphical User Interface
Pattern Evaluation
Data Mining Engine
Database or Data Warehouse Server
Data cleaning, integration, and selection
Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
Knowledge Base

[1]

Biological Terminologies [2]


Gene - Each gene encodes a particular protein. Roughly speaking, each gene encodes a trait, for example eye color.
Chromosome - A chromosome consists of genes, which are blocks of DNA. Chromosomes are strings of DNA and serve as a model for the whole organism.
Alleles - The possible settings for a trait (e.g. blue, brown) are called alleles.
Locus - Each gene has its own position in the chromosome. This position is called the locus.
Genome - The complete set of genetic material (all chromosomes) is called the genome.
Genotype - A particular set of genes in the genome is called the genotype.
Phenotype - The genotype contains the information required to construct an organism; the resulting organism is referred to as the phenotype.

Genetic Algorithm (GA)
GAs were introduced by John Holland in the 1970s. They are based on the genetic processes of biological organisms. Over many generations, natural populations evolve according to the principles of natural selection and survival of the fittest, first clearly stated by Charles Darwin in The Origin of Species. GAs are adaptive methods which may be used to solve search and optimization problems. After a number of generations have been produced by selection, crossover and mutation, the population typically settles on a solution that no longer improves. This solution is taken as the final one.

Basic Principles of GA
Coding
Fitness function
Reproduction
    Selection
    Crossover
    Mutation
Convergence

Coding
Before a GA can be run, a suitable coding (or representation) for the problem must be devised. It is assumed that a potential solution to a problem may be represented as a set of parameters (for example, the dimensions of the beams in a bridge design). For example, if our problem is to maximize a function of three variables, F(x, y, z), we might represent each variable by a 10-bit binary number. Our chromosome would therefore contain three genes, and consist of 30 binary digits.
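The 30-bit coding described above can be sketched as follows. This is a minimal illustration, not code from the seminar: the helper names `encode` and `decode` are hypothetical, and each variable is assumed to be an integer in the range 0 to 1023 so that it fits in 10 bits.

```python
BITS_PER_GENE = 10  # each variable is coded as a 10-bit binary number
NUM_GENES = 3       # one gene each for x, y and z


def encode(values):
    """Pack three integers (assumed 0..1023) into one 30-character bit string."""
    return "".join(format(v, f"0{BITS_PER_GENE}b") for v in values)


def decode(chromosome):
    """Split a chromosome back into its integer genes, 10 bits at a time."""
    return [
        int(chromosome[i:i + BITS_PER_GENE], 2)
        for i in range(0, len(chromosome), BITS_PER_GENE)
    ]


chromosome = encode([5, 1023, 512])
print(len(chromosome))     # 30
print(decode(chromosome))  # [5, 1023, 512]
```

A string of "0" and "1" characters is used here only because it is easy to read; a list of integers or a bit array would serve equally well.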

Fitness Function
A fitness function must be devised for each problem to be solved. Given a particular chromosome, the fitness function returns a single numerical fitness, or figure of merit, which is supposed to be proportional to the utility or ability of the individual that the chromosome represents.
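As a small sketch of what such a function might look like for a 30-bit chromosome holding three 10-bit variables: the objective `F` below is a made-up placeholder (the slides do not specify one), and the function names are hypothetical.

```python
BITS_PER_GENE = 10  # assumed gene width: 10 bits per variable


def decode(chromosome):
    """Split a bit string into integer genes, 10 bits at a time."""
    return [
        int(chromosome[i:i + BITS_PER_GENE], 2)
        for i in range(0, len(chromosome), BITS_PER_GENE)
    ]


def F(x, y, z):
    """Placeholder objective, chosen only for illustration."""
    return x + 2 * y + 3 * z


def fitness(chromosome):
    """Return a single number: higher means a better individual."""
    x, y, z = decode(chromosome)
    return F(x, y, z)


print(fitness("0" * 30))  # 0
print(fitness("1" * 30))  # 6138, i.e. 1023 * (1 + 2 + 3)
```

Whatever the real objective is, the important property is the one stated above: one chromosome in, one number out.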

Reproduction
During the reproductive phase of the GA, individuals are selected from the population and recombined, producing offspring which will comprise the next generation. Parents are selected randomly from the population using a scheme which favours the more fit individuals. Having selected two parents, their chromosomes are recombined, typically using the mechanisms of crossover and mutation.
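One common scheme that favours fitter individuals is fitness-proportionate ("roulette wheel") selection. The sketch below assumes that scheme; the slides do not say which one is intended, and `select_parents` is a hypothetical name. Fitness values are assumed to be non-negative with at least one positive value.

```python
import random


def select_parents(population, fitnesses, rng):
    """Pick two parents, each with probability proportional to its fitness.

    Sampling is with replacement, so the same individual may be chosen twice.
    """
    return rng.choices(population, weights=fitnesses, k=2)


rng = random.Random(42)  # fixed seed so the run is repeatable
population = ["0001", "0111", "1111"]
fitnesses = [1, 3, 4]    # e.g. the number of 1-bits in each chromosome

# Over many draws, fitter chromosomes should be picked more often.
counts = {c: 0 for c in population}
for _ in range(1000):
    for parent in select_parents(population, fitnesses, rng):
        counts[parent] += 1
print(counts)
```

Other schemes, such as tournament or rank-based selection, fit the same description and could be swapped in without changing the rest of the algorithm.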


Example of Crossover & Mutation
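The figure for this slide did not survive extraction. As a stand-in, here is a minimal sketch of single-point crossover and bit-flip mutation on short bit strings. The function names, the cut point and the mutated position are chosen for illustration and are not taken from the original figure.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes at the given cut point."""
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2


def flip_bit(chromosome, position):
    """Mutate one gene by flipping the bit at the given position."""
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]


parent_a = "11111111"
parent_b = "00000000"
child_1, child_2 = single_point_crossover(parent_a, parent_b, point=3)
print(child_1)               # 11100000
print(child_2)               # 00011111
print(flip_bit(child_1, 7))  # 11100001
```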


Convergence
Convergence is the progression towards increasing uniformity. A gene is said to have converged when 95% of the population share the same value. The population is said to have converged when all of the genes have converged. If the GA has been correctly implemented, the population will evolve over successive generations so that the fitness of the best and the average individual in each generation increases towards the global optimum.
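The 95% rule above can be checked directly on a population of bit strings. This is a sketch under the assumption that chromosomes are equal-length strings of "0" and "1"; the function names are hypothetical.

```python
def gene_converged(population, locus, threshold=0.95):
    """True if at least `threshold` of the population share one value at this locus."""
    values = [chromosome[locus] for chromosome in population]
    most_common = max(values.count("0"), values.count("1"))
    return most_common / len(population) >= threshold


def population_converged(population, threshold=0.95):
    """True if every gene (locus) has converged."""
    length = len(population[0])
    return all(gene_converged(population, locus, threshold) for locus in range(length))


uniform = ["1010"] * 10
mixed = ["1010"] * 9 + ["1011"]     # only 90% agree on the last gene
print(population_converged(uniform))  # True
print(gene_converged(mixed, 0))       # True
print(gene_converged(mixed, 3))       # False
print(population_converged(mixed))    # False
```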


Why Data Mining using Genetic Algorithm


There are several reasons to prefer genetic algorithms:
Robustness.
Ability to work on large and noisy datasets.
GAs perform a global search of the solution space, in contrast to most other algorithms, which use a greedy approach.
They cope well with attribute interaction.
Scalability can be achieved through parallel approaches to genetic algorithms; this characteristic is of great importance in data mining.

Moreover, genetic algorithms have a high degree of autonomy that enables the discovery of knowledge previously unknown to the user.


Functions of Genetic Algorithm


The Fitness Function
    The fitness score is returned as a result
Parent Selection
    Mating Pool

Crossover
The probability that crossover is applied to a pair of parents is typically between 0.6 and 1.0.

Mutation
Mutation is applied to each child individually after crossover. It randomly alters each gene with a small probability (typically 0.001).
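The two probabilities can be combined into one reproduction step, sketched below. The defaults (0.7 for crossover, 0.001 per gene for mutation) fall within the typical ranges quoted above; the function names and the choice of single-point crossover are assumptions made for illustration.

```python
import random


def mutate(chromosome, rng, p_mutation):
    """Flip each bit independently with probability p_mutation."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < p_mutation else bit
        for bit in chromosome
    )


def reproduce(parent_a, parent_b, rng, p_crossover=0.7, p_mutation=0.001):
    """Produce two children: crossover with probability p_crossover, then mutation."""
    if rng.random() < p_crossover:
        point = rng.randrange(1, len(parent_a))  # cut strictly inside the string
        children = [parent_a[:point] + parent_b[point:],
                    parent_b[:point] + parent_a[point:]]
    else:
        children = [parent_a, parent_b]          # parents are copied unchanged
    # Mutation is applied to each child individually after crossover.
    return [mutate(child, rng, p_mutation) for child in children]


rng = random.Random(1)
print(reproduce("1111111111", "0000000000", rng))
```

With a mutation rate this small, most children differ from their parents only through crossover; mutation mainly serves to reintroduce gene values that the population has lost.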


Pseudo Code of GA[3]
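The pseudo code on this slide did not survive extraction. The sketch below shows one plausible shape of the overall loop (initialise, evaluate, select, recombine, mutate, repeat); it is not the listing from [3]. The toy problem (maximise the number of 1-bits), the parameter values and all names are assumptions made for illustration.

```python
import random


def fitness(chromosome):
    """Toy objective: count the 1-bits in the chromosome."""
    return chromosome.count("1")


def mutate(chromosome, rng, p_mutation):
    """Flip each bit independently with probability p_mutation."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < p_mutation else bit
        for bit in chromosome
    )


def run_ga(length=20, pop_size=30, generations=50,
           p_crossover=0.7, p_mutation=0.01, seed=0):
    rng = random.Random(seed)
    # 1. Generate a random initial population.
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Evaluate fitness (+1 keeps every selection weight positive).
        weights = [fitness(c) + 1 for c in population]
        next_generation = []
        while len(next_generation) < pop_size:
            # 3. Select two parents, favouring fitter individuals.
            a, b = rng.choices(population, weights=weights, k=2)
            # 4. Recombine with probability p_crossover.
            if rng.random() < p_crossover:
                point = rng.randrange(1, length)
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            # 5. Mutate each child and add it to the next generation.
            next_generation.extend(mutate(c, rng, p_mutation) for c in (a, b))
        population = next_generation[:pop_size]
    # 6. Report the best individual of the final generation.
    return max(population, key=fitness)


best = run_ga()
print(best, fitness(best))
```

The loop stops after a fixed number of generations for simplicity; a convergence test such as the 95% rule could be used as the stopping condition instead.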


Applications of GA
Domain                        Application Types
Control                       gas pipeline, pole balancing, missile evasion, pursuit
Design                        semiconductor layout, aircraft design, keyboard configuration, communication networks
Scheduling                    manufacturing, facility scheduling, resource allocation
Robotics                      trajectory planning
Machine Learning              designing neural networks, improving classification algorithms, classifier systems
Signal Processing             filter design
Game Playing                  poker, checkers, prisoner's dilemma
Combinatorial Optimization    set covering, travelling salesman, routing, bin packing, graph colouring and partitioning

Advantages and Disadvantages


Advantages:
The concept is easy to understand.
Modular, separate from the application.
A GA doesn't have to know any rules of the problem in advance. This is very useful for very complex and loosely defined problems.
With a well-defined fitness function and carefully chosen attributes, a genetic algorithm can perform much faster than other algorithms such as linear methods.


Advantages and Disadvantages (contd.)
Disadvantages:
Defining the fitness function can sometimes be very complicated.
The fitness function can slow the process significantly as its complexity increases, because it is used to compare every element in the sample population with every record in the training data set.
Sometimes an acceptable solution cannot be found even after many iterations if the genetic operators are poorly chosen.
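The cost mentioned above can be made concrete with a toy example: if each individual is a classification rule, scoring one generation takes (population size) x (number of training records) comparisons. The rule format, the records and all names below are invented purely for illustration.

```python
def rule_fitness(rule, records):
    """Fraction of training records that the rule classifies correctly.

    A rule is (attribute, threshold) and predicts True when the record's
    attribute value is at least the threshold.
    """
    attribute, threshold = rule
    correct = sum(
        1 for values, label in records
        if (values[attribute] >= threshold) == label
    )
    return correct / len(records)


# Hypothetical training data: (attribute values, class label).
records = [
    ({"age": 25, "income": 30}, False),
    ({"age": 40, "income": 80}, True),
    ({"age": 35, "income": 60}, True),
    ({"age": 50, "income": 20}, False),
]
population = [("income", 50), ("age", 30), ("income", 90)]

scores = [rule_fitness(rule, records) for rule in population]
comparisons = len(population) * len(records)
print(scores)       # [1.0, 0.75, 0.5]
print(comparisons)  # 12 comparisons for a single generation
```

With realistic sizes (hundreds of individuals, many thousands of records, many generations) this product grows quickly, which is why the cost of the fitness function tends to dominate the run time.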


The Tool MATLAB

[4]

MATLAB (Matrix Laboratory)
MATLAB is a high-performance language for technical computing. It integrates computation, visualization and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.
Simulink
Simulink is an interactive environment for modeling, simulating, and analyzing dynamic, multidomain systems. It lets you build a block diagram, simulate the system's behavior, evaluate its performance, and refine the design.

Typical Uses of MATLAB

Math and computation
Algorithm development
Data acquisition
Modeling, simulation, and prototyping
Data analysis, exploration, and visualization
Scientific and engineering graphics
Application development, including graphical user interface building


Future Work
In future work, the algorithm presented in this seminar will be implemented as a program using MATLAB. In addition, the study will focus on applying genetic algorithms to a database. Finally, the results will be compared with conventional data mining techniques in order to identify the benefit of using genetic algorithms.


Conclusion
This seminar covered the basics of data mining and the architecture of a typical data mining system. It then described genetic algorithms and their operators, and discussed the pros and cons of GAs. Finally, it introduced MATLAB and Simulink and outlined future work.


References
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2006.
2. http://www.obitko.com/tutorials/genetic-algorithms/index.php
3. David Beasley et al. (1993). An Overview of Genetic Algorithms: Part 1, Fundamentals. University Computing, vol. 15 (2), pp. 58-69.
4. Learning MATLAB, Copyright 1984-2004 by The MathWorks, Inc.


Thank You
Any questions or suggestions?
