
CHAPTER 1

INTRODUCTION

Data mining is the process of discovering patterns in large data sets and summarizing them into useful information. Data mining applications can use a variety of techniques to examine the data, including association, sequence or path analysis, classification, clustering, and forecasting (discovering patterns from which one can make reasonable predictions about future activity). Data mining algorithms search for patterns and relationships that may exist in large databases. Because the datasets are so large, many relationships are possible, and machine learning techniques are used to search this space of possibilities. The main learning types are listed below.

Supervised learning - The algorithm generates a function that maps inputs to desired outputs, based on labeled training data.

Unsupervised learning - Learning without training data, e.g. clustering, which groups related documents on the basis of their content without referring to a predefined taxonomy.

Semi-supervised learning - Combines both labeled and unlabeled examples to generate an appropriate function or classifier.

Clustering can be considered the most important unsupervised learning problem. It is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar, in some sense, to each other than to those in other clusters. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The major clustering approaches are discussed in the following section.

1) Partitioning approach - Relocates instances by moving them from one cluster to another, starting from an initial partitioning. Typical methods are k-means [4], k-medoids [8], CLARANS [10].

2) Hierarchical approach - Constructs clusters by recursively partitioning the instances in either a top-down or bottom-up fashion. Typical methods are BIRCH [1], ROCK [11].

3) Density-based approach - Based on connectivity and density functions; identifies the clusters and their distribution parameters. Typical methods are DBSCAN [4], OPTICS [9].

4) Grid-based approach - All clustering operations are performed by partitioning the space into a finite number of cells that form a grid structure. Typical methods are WaveCluster [5], CLIQUE [7].

5) Model-based approach - Attempts to optimize the fit between the given data and some mathematical model. Typical methods are EM [1], SOM [1], COBWEB [1].

6) Clustering high-dimensional data - Cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Typical methods are CLIQUE [7] and PROCLUS [7].

7) User-guided or constraint-based approach - Clustering is done by considering user-specified or application-specific constraints. Typical methods are COD (clustering with obstacles) and constrained clustering.
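To make the partitioning approach concrete, the following is a minimal sketch of k-means on one-dimensional data. The data values and the deterministic initialization (seeding the centroids with the first k points) are illustrative assumptions for this sketch, not details taken from the methods cited above.

```python
def kmeans(points, k, iters=100):
    """Lloyd-style k-means on 1-D points (illustrative sketch of the
    partitioning approach); the first k points seed the centroids."""
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of 1-D values (illustrative data).
data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, clusters = kmeans(data, 2)
```

The relocation behaviour described above is visible in the loop: each iteration moves instances between clusters until no centroid changes.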

These clustering techniques have the disadvantage that they can easily fall into a local optimum. To overcome this drawback, genetic algorithms are used. Ujjwal Maulik et al (2000) designed a genetic algorithm for clustering. A genetic algorithm is a global optimization algorithm based on the principles of biological evolution. It generates solutions to optimization problems using techniques inspired by natural evolution. It is an iterative procedure that represents its candidate solutions as strings of genes, called chromosomes, typically encoded as 0s and 1s.

The efficiency of the genetic algorithm is demonstrated using the 0/1 Knapsack problem, since solutions to this problem can naturally be represented as strings of 0s and 1s. To further improve efficiency, a hybrid genetic algorithm, which combines a multi-clustering genetic algorithm with rough set theory, is used.
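As a sketch of how a genetic algorithm works on the 0/1 Knapsack problem with binary chromosomes, the following uses tournament selection, single-point crossover, and bit-flip mutation. The knapsack instance (item values, weights, capacity) and all GA parameters are hypothetical choices for illustration, not the instance or operators used in the report.

```python
import random

# Hypothetical 0/1 knapsack instance (illustrative values, not from the report).
values = [60, 100, 120, 30]
weights = [10, 20, 30, 15]
capacity = 50

def fitness(chrom):
    """Total value of the selected items, or 0 if the weight limit is exceeded."""
    w = sum(wi for wi, g in zip(weights, chrom) if g)
    v = sum(vi for vi, g in zip(values, chrom) if g)
    return v if w <= capacity else 0

def genetic_knapsack(pop_size=20, generations=50, p_mut=0.1, seed=42):
    rng = random.Random(seed)
    n = len(values)
    # Chromosomes are bit strings: gene i = 1 means item i is packed.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        new_pop = []
        for _ in range(pop_size):
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Single-point crossover.
            cut = rng.randrange(1, n)
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop
        # Track the best chromosome seen across all generations.
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = gen_best
    return best

best = genetic_knapsack()
```

The 0/1 encoding is what makes the knapsack problem a convenient testbed: each gene directly answers "is item i in the knapsack?", so the standard crossover and mutation operators apply without any decoding step.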

Rough set theory is used to deal with uncertain boundary objects and incomplete information. Licai Yang et al (2006) proposed applying rough set theory in clustering. It generates a pattern for each individual cluster, and a reduct is computed for each cluster. A reduct is a minimal subset of attributes that preserves the same discernibility as the full attribute set, so if the dispensable attributes outside the reduct are removed, the remaining attributes still induce the same attribute-value classification of the objects in the cluster.
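The reduct computation can be sketched on a tiny decision table. The table below (symptoms as condition attributes, a decision attribute) and the brute-force search are illustrative assumptions for this sketch, not the report's data or algorithm; real reduct computation on large tables uses more efficient methods.

```python
from itertools import combinations

# Hypothetical decision table: each row is an object with condition
# attributes ("headache", "temp") and a decision attribute ("flu").
table = [
    {"headache": "yes", "temp": "high",   "flu": "yes"},
    {"headache": "yes", "temp": "high",   "flu": "yes"},
    {"headache": "no",  "temp": "high",   "flu": "yes"},
    {"headache": "no",  "temp": "normal", "flu": "no"},
]
conditions = ["headache", "temp"]
decision = "flu"

def partition(attrs):
    """Group object indices that are indiscernible on the given attributes."""
    classes = {}
    for i, row in enumerate(table):
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, []).append(i)
    return sorted(classes.values())

def is_consistent(attrs):
    """True if every indiscernibility class has a single decision value."""
    return all(len({table[i][decision] for i in cls}) == 1
               for cls in partition(attrs))

def reducts():
    """Minimal attribute subsets that classify as well as all conditions."""
    found = []
    for r in range(1, len(conditions) + 1):
        for subset in combinations(conditions, r):
            if is_consistent(subset) and not any(set(f) <= set(subset) for f in found):
                found.append(subset)
    return found
```

Here {"temp"} alone already separates the flu cases, so it is a reduct and "headache" is dispensable: dropping it leaves the same classification of the objects.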

The rest of the report is organized as follows. Chapter 2 reviews the related work. Chapter 3 describes the problem definition. Chapter 4 describes the proposed work. Chapter 5 describes the experimental setup. Chapter 6 presents the implementation and discussion. Chapter 7 concludes the project and outlines future work.
