
DATA MINING USING GENETIC ALGORITHMS

S.Durga Bhavani III/IV B.Tech(CSE), KLCE Bhavani.sudhireddy@gmail.com

Y.Harika III/IV B.Tech(CSE), KLCE sony_dil8@yahoo.co.in

ABSTRACT:
Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention to production control and science exploration. Data mining can be viewed as a result of the natural evolution of information technology. Data warehousing and mining tools such as Cognos, Oracle, SQL Server and SAS play a major role in extracting knowledge from Online Transaction Processing (OLTP) systems and loading it into Decision Support Systems (DSS), which applications then use to present that knowledge and to find patterns, interests and the subjectivity of specific objects or events. Business people ask questions such as "Is it just another hype?" or "What is the sales prediction for this quarter?" This paper presents data mining concepts and the application of data mining using genetic algorithms.

1. What Is Data Mining?

Simply stated, data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named knowledge mining from data, which is unfortunately somewhat long. Knowledge mining, a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer that carries both data and mining became a popular choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

Fig. 1: Knowledge discovery as a process

1. Data cleaning (removing noise and inconsistent data)
2. Data integration (combining multiple data sources)
3. Data selection (retrieving the data relevant to the analysis)
4. Data transformation (performing transformation and aggregation operations on the data)
5. Data mining (applying intelligent methods to extract data patterns)
6. Pattern evaluation (identifying the truly interesting patterns representing knowledge, based on some interestingness measure)
7. Knowledge presentation (using visualization and knowledge representation techniques to present the mined knowledge to the user)

1.1 Data Integration and Transformation:

Data mining often requires data integration, the merging of data from multiple data stores. The data may also need to be transformed into forms appropriate for mining.

Transformation includes normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0, and attribute construction, where new attributes are constructed and added from the given set of attributes to help the mining process. Some of the transformation techniques include min-max normalization, z-score normalization and decimal scaling. Min-max normalization performs a linear transformation on the original data. Suppose min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing

$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

Suppose the maximum and minimum values for the attribute income are given, and we would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of 73,000 for income is then transformed by substituting these values into the formula.

1.2 Data Reduction:

The data set will likely be huge. Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. Strategies for data reduction include data cube aggregation, histograms, attribute subset selection, dimensionality reduction, numerosity reduction, and discretization and concept hierarchy generation. In data cube aggregation, aggregation operations are applied to the data in the construction of a data cube. As an example, suppose you have been given per-quarter sales details for the years 2005 to 2007.
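
The aggregation step for quarterly sales can be sketched as follows. This is only an illustration: the sales figures and the names `quarterly_sales` and `aggregate_by_year` are hypothetical and do not come from the paper. The sketch collapses the quarter dimension so that each year is represented by a single total, which is the kind of reduced representation data cube aggregation aims for.

```python
# Hypothetical quarterly sales figures (illustrative only, not from the paper).
quarterly_sales = {
    2005: [224_000, 408_000, 350_000, 586_000],
    2006: [310_000, 390_000, 415_000, 620_000],
    2007: [298_000, 445_000, 470_000, 701_000],
}


def aggregate_by_year(sales_by_quarter):
    """Collapse the quarter dimension: one total per year."""
    return {year: sum(quarters) for year, quarters in sales_by_quarter.items()}


annual_sales = aggregate_by_year(quarterly_sales)
for year, total in sorted(annual_sales.items()):
    print(year, total)
```

Twelve quarterly values are reduced to three annual values, at the cost of losing the within-year detail.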

[Figure: sales data for the years 2005, 2006 and 2007]

1.3 Concept Hierarchy Generation:

The 3-4-5 rule can be recursively applied to each interval, creating a concept hierarchy for the given numerical attribute. Real-world data often contain extremely large positive and/or negative outlier values, which could distort any top-down discretization method based on minimum and maximum data values. For example, the assets of a few people could be several orders of magnitude higher than those of others in the same data set. Discretization based on the maximal asset values may lead to a highly biased hierarchy. Thus the top-level discretization can be performed based on the range of data values representing the majority (e.g., 5th percentile to 95th percentile) of the given data. The extremely high or low values beyond the top-level discretization will form distinct interval(s) that can be handled separately, but in a similar manner.

Suppose profits at different branches of an organization cover a wide range, from Rs -351,976.00 to Rs 4,700,896.50. Then:

1. MIN = -351,976.00, MAX = 4,700,896.50, LOW = -159,876 (5th percentile), HIGH = 1,838,761 (95th percentile).
2. Taking the most significant digit (msd) at the million position, rounding LOW and HIGH gives LOW = -1,000,000 and HIGH = 2,000,000.
3. The interval ranges over three distinct values at the msd, that is, (2,000,000 - (-1,000,000)) / 1,000,000 = 3, so it is partitioned into three equal-width segments.
4. MIN and MAX are examined to see how they fit into the first-level partitions. The most significant digit of MIN is at the hundred thousand position.
5. Recursively, each interval can be further partitioned according to the 3-4-5 rule to form the next level of the hierarchy.

2. Genetic Algorithms (GA):

Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning starts as follows. An initial population is created consisting of randomly generated rules. Each rule can be represented by a string of bits. As a simple example, suppose that samples in a given training set are described by two Boolean attributes, A1 and A2, and that there are two classes, C1 and C2. The rule IF A1 AND NOT A2 THEN C2 can be encoded as the bit string 100, where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001. If an attribute has k values, where k > 2, then k bits may be used to encode the attribute's values. Classes can be encoded in a similar fashion. Based on the notion of survival of the fittest, a new population is formed to consist of the fittest
rules in the current population, as well as offspring of these rules. Typically, the fitness of a rule is assessed by its classification accuracy on a set of training samples. Offspring are created by applying genetic operators such as crossover and mutation. In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule's string are inverted. The process of generating new populations based on prior populations of rules continues until a population, P, evolves where each rule in P satisfies a prespecified fitness threshold. Genetic algorithms are easily parallelizable and have been used for classification as well as other optimization problems. In data mining, they may be used to evaluate the fitness of other algorithms.

2.1. GA - Working Approach:

Like memory-based reasoning (MBR) and neural networks, genetic algorithms are based on an analogy to biological processes. Over millions of years, evolution and natural selection have produced flexible, specialized species and individuals that are highly suited to their environments. These processes optimize the fitness of individuals over successive generations by passing the genetic material of the fittest individuals of one generation on to the next. Genetic algorithms apply the same idea to problems in which a solution can be expressed as an individual and the goal is to maximize the fitness of individuals. As in evolution, a genetic algorithm proceeds by having the more fit individuals pass their genetic material on to succeeding generations, while less fit individuals and their genetic material do not live on.

In the past few years, genetic algorithms have been applied with promising results in three areas: training neural networks, generating scoring functions for MBR, and serving as optimization engines embedded in scheduling packages. The most common use has been training neural networks, and nowadays most neural network packages include genetic algorithms as a training option. Genetic algorithms are inspired by biology, where genetics has proven capable of adapting life to a multitude of environments. Genetics is a rapidly expanding field in which new results appear almost once every
month. A good example of this is the Human Genome Project, which continues to move forward. The basic operators used in genetic algorithms are selection, crossover and mutation.

Selection: Selection keeps the size of the population constant while increasing the fitness of the next generation. Genomes with a higher fitness (darker shading) reproduce, and genomes with a lower fitness (lighter shading) die off (Berry & Linoff 1997:341). In every evolutionary step, also known as an iteration or generation, the individuals in the current population are decoded and assessed according to a predefined quality criterion, referred to as the fitness or the fitness function. To form the next generation, individuals are selected according to their fitness. Holland's original fitness-proportionate selection is one of the simplest selection criteria: individuals are selected with a probability proportional to their relative fitness. This ensures that the expected number of times an individual is chosen is approximately proportional to its relative performance in the population.

Crossover: Crossover is a way of combining two genomes. A crossover position determines where the genomes break and where they are recombined. New points in the search space are generated by genetically inspired operators, the best known of which are crossover and mutation. Crossover is performed with some probability, known as the crossover probability or crossover rate, between two selected individuals, called parents, which exchange parts of their genomes (encodings) to form two new individuals, called offspring. In its simplest form, substrings are exchanged after a randomly selected crossover point. This method is called single-point crossover.

Mutation: Mutation makes an occasional random change at a random position in a genome. This allows features to appear that may not have been in the original population. The mutation operator is introduced to prevent premature convergence to local optima by randomly sampling new points in the search space. This process is done
by flipping bits at random with some probability. Genetic algorithms are stochastic iterative processes that are not guaranteed to converge. The termination condition may be specified as a fixed maximum number of generations or as the attainment of an acceptable fitness level.

2.2. Use of Genetics in Computers:

Genetic algorithms work by evolving successive generations of genomes that become progressively more fit. In the natural world, fitness means whether an organism survives to reproduce. On a computer it is more flexible: we choose a fitness function that suits the problem to be solved.

2.3. Genetic Algorithm Example:

In this example we try to find the maximum value of a simple function of a single parameter p in the range 0 to 31. The function is 31p - p^2, where p varies between 0 and 31. The genetic material is called a genome. In this case the genome contains a single five-bit gene for the parameter p. The function peaks at p = 15 and p = 16, which are represented as 01111 and 10000 respectively. This example shows that genetic algorithms are applicable even when there are multiple, dissimilar peaks. The objective is to maximize the fitness of the genomes in the population, with 31p - p^2 as the fitness function. The method applies the following steps:

1) The genome and the fitness function are identified and an initial generation of genomes is created.
2) The initial population is modified by applying selection, crossover and mutation.
3) Step 2 is repeated until the fitness of the population no longer improves.
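
Step 1 of the method, identifying the genome and the fitness function, can be sketched as below. The function names `decode` and `fitness` are illustrative choices, and the fitness function is assumed to be 31p - p^2, the form consistent with the peaks at 15 and 16 stated in the text. Because a five-bit genome has only 32 possible values, the sketch also checks the peaks by exhaustive search, which a genetic algorithm would not need for a search space this small.

```python
def decode(genome: str) -> int:
    """Interpret a string of '0'/'1' characters as an unsigned integer."""
    return int(genome, 2)


def fitness(p: int) -> int:
    """Fitness function assumed for the example: 31p - p^2."""
    return 31 * p - p * p


# Exhaustive check over the whole 5-bit search space (only 32 candidates).
best = max(fitness(p) for p in range(32))
peaks = [p for p in range(32) if fitness(p) == best]
peak_genomes = [format(p, "05b") for p in peaks]

print(peaks, peak_genomes, best)
```

The two peak genomes differ in every bit position, which is why the text describes the peaks as dissimilar.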
Setting up the problem is the first step in using genetic algorithms. As already mentioned, the genome consists of a single five-bit gene for the parameter p, and the aim is to maximize the fitness function over successive generations. The fitness function processes a genome and produces a single fitness value. In this example the first generation contains four randomly produced genomes, shown in the table below.

Genome   p    Fitness
10110    22   176
00011    3    87
00010    2    58
11001    25   150

Note that the average fitness of this population is 117.75, which is already fairly good, but the genetic algorithm will improve it. The basic algorithm modifies the initial population using three operators: selection, crossover and then mutation. Selection is similar to natural selection, in which only the fittest individuals in the population survive to pass their genetic material on to the next generation. The chance that a genome survives to the next generation is proportional to its fitness value: the higher its fitness relative to the other genomes, the more copies of it survive. The table below shows the ratio of the fitness of each of the four genomes to the total population fitness.

Genome   Fitness   % of total population fitness   Expected copies
10110    176       37.4%                           1.50
00011    87        18.5%                           0.74
00010    58        12.3%                           0.49
11001    150       31.8%                           1.27
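
The ratios in this table can be recomputed with a few lines of arithmetic. The sketch below takes the fitness values exactly as tabulated for the first generation; the variable names are illustrative. The expected number of copies is the genome's fitness divided by the average fitness, which is the same as its share of the total multiplied by the population size. The results agree with the tabulated figures up to rounding in the last digit.

```python
# Fitness values as listed in the first-generation table.
population_fitness = {"10110": 176, "00011": 87, "00010": 58, "11001": 150}

total = sum(population_fitness.values())    # total population fitness
average = total / len(population_fitness)   # average fitness per genome

for genome, fit in population_fitness.items():
    share = fit / total        # fraction of the total population fitness
    expected = fit / average   # expected copies in the next generation
    print(f"{genome}  {fit:>3}  {share:6.1%}  {expected:.2f}")
```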

This ratio determines the number of copies of each genome expected in the next generation. Although the expected number of copies is a fraction, the number of genomes in the population is always a whole number. Survival is decided by choosing genomes at random with a probability proportional to their fitness. In the spinner approach, a spinner is set up in which each genome has an area proportional to its fitness. The spinner is then spun, landing on a spot that points to a particular genome. In this way the fractional probabilities are converted into whole-number approximations. Applying selection to the original four genomes produces the survivors shown in the table below.

Genome   Fitness
10110    176
11001    150
00010    58
10110    176
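
The spinner approach corresponds to what is commonly called roulette-wheel or fitness-proportionate selection. A minimal sketch follows, assuming the tabulated first-generation fitness values; the function name `spin_selection` and the fixed seed are illustrative choices, not part of the original example. Each draw is one spin, and genomes are drawn with replacement so that a fit genome can survive more than once.

```python
import random


def spin_selection(fitness_by_genome, n_survivors, rng):
    """Fitness-proportionate ("spinner") selection with replacement."""
    genomes = list(fitness_by_genome)
    weights = [fitness_by_genome[g] for g in genomes]
    # choices() draws k items with probability proportional to the weights.
    return rng.choices(genomes, weights=weights, k=n_survivors)


population_fitness = {"10110": 176, "00011": 87, "00010": 58, "11001": 150}
rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatable runs
survivors = spin_selection(population_fitness, n_survivors=4, rng=rng)
print(survivors)
```

Because the draw is random, a particular run need not reproduce the survivors in the table; over many runs the number of copies of each genome approaches its expected value.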

In general this procedure produces more copies of the fitter genomes and fewer of the less fit ones. One of the less fit genomes, 00011, has not survived this round of selection, but there are two copies of 10110, the fittest. The population's average fitness has increased from 117.75 to 140.

The next operator applied to the surviving genomes is crossover. The crossover operator creates two new genomes from two existing ones by gluing together pieces of each: the first part of one genome swaps places with the first part of the other. For example, starting with the two genomes 10110 and 00010 and a crossover position between the second and third bits:

10|110
00|010

The result of the crossover is:

10|010
00|110

The resulting genomes, called children, each inherit a piece from each of their parents. Crossover is applied to the population by selecting pairs of genomes and flipping a coin to determine whether they cross over. The probability that a pair crosses over is called the crossover probability, often denoted by pc. If they do cross over, a position is chosen at random and the children replace the original genomes in the next generation. A crossover probability of 0.5 generally produces good results. After selection and crossover, the average fitness of the population has gone from 117.75 to 178.5, a major improvement after only one generation.

The final operation is mutation. Mutation occurs rarely in nature and results from miscoded genetic material being passed from a parent to a child. The resulting change in the gene may occasionally correspond to a major improvement in fitness compared with the existing population. The mutation rate is generally kept very small in genetic algorithms; no more than one mutation per generation is a reasonable bound. In the example above, a mutation changes a bit from 0 to 1 or from 1 to 0. Assume that only one mutation occurs in this generation, in the second genome at position 3. The table below shows the population after the mutation.

Genome   p    Fitness
10010    17   238
11101    29   58
00110    6    150
10110    22   176
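
The two operators used in this generation can be sketched as simple string operations. The function names and the 1-based position convention are assumptions made for illustration. The sketch crosses 10110 with 00010 between the second and third bits, and flips the third bit of 11001, the two operations described in the example.

```python
def single_point_crossover(parent_a, parent_b, position):
    """Swap the tails of two equal-length bit strings after `position` bits."""
    child_a = parent_a[:position] + parent_b[position:]
    child_b = parent_b[:position] + parent_a[position:]
    return child_a, child_b


def mutate(genome, position):
    """Flip the bit at a 1-based `position`, counted from the left."""
    index = position - 1
    flipped = "1" if genome[index] == "0" else "0"
    return genome[:index] + flipped + genome[index + 1:]


children = single_point_crossover("10110", "00010", position=2)
mutant = mutate("11001", position=3)
print(children, mutant)
```

In a full genetic algorithm the crossover position and the mutated bit would be chosen at random; they are fixed here so that the result can be compared with the genomes listed for the population after crossover and mutation.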

The changes made by mutation may be unhelpful, in which case they will not survive for more than one or two generations. In the example above, even though the mutation appears harmful, the second generation still shows a considerable improvement over the original population.

2.4. Advantages of GA

1. Genetic algorithms produce reasonable results.
2. The results of genetic algorithms are easy to apply.
3. Genetic algorithms handle a broad range of data types.
4. Genetic algorithms are well suited to optimization.
5. Genetic algorithms integrate well with neural networks.

2.5. Disadvantages of GA

1. Many problems are difficult to encode for genetic algorithms.
2. Genetic algorithms are a global optimization technique and can suffer from the problems that such techniques typically experience.
3. Genetic algorithms are computationally expensive.
4. Genetic algorithms are available in only a small number of commercial packages.

3. Conclusion:
Genetic algorithms are a useful technique offered by a number of packages. They are often used together with other methods, particularly neural networks, where they improve the performance of the network. Genetic algorithms produce reasonable results, and those results are easy to apply. Genetic algorithms adapt to their environment, which makes this type of method attractive to the vision community, whose members must often work in changing environments. Genetic algorithms perform well in finding areas of interest even in complex, real-world scenes.

References:
1) Data Mining, Wikipedia, the free encyclopedia. Retrieved 02 November 2007 from http://en.wikipedia.org/wiki/Data_mining
2) XSB, Inc., Glossary. Retrieved 02 November 2007 from http://www.xsb.com/glossary.html
3) Introduction to Genetic Algorithms. Retrieved 20 November 2007 from http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol1/hmw/article1.html#introduction
4) J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
