is the process of grouping data into classes or clusters, so that the objects within a cluster have high similarity in comparison to one another but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning. In this paper, we examine several clustering techniques, organized into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and constraint-based clustering. Clustering can also be used for outlier detection.

Introduction

Data mining refers to "extracting" or "mining" knowledge from large amounts of data. Data mining is also known as knowledge mining from data, knowledge extraction, data or pattern analysis, and data designing. A data warehouse is nothing but a database containing a large amount of information.

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression.

The following are typical requirements of clustering in data mining:
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Ability to deal with noisy data
- Incremental clustering and insensitivity to the order of input records
- High dimensionality
- Constraint-based clustering

1. Partitioning Methods
The most well-known and commonly used partitioning methods are k-means and k-medoids, and their variations.

(i) Centroid-Based Technique: The K-means Method

The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. The square-error criterion is used, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2,$$

where E is the sum of the squared error for all objects in the data set, p is the point in space representing a given object, and $m_i$ is the mean of cluster $C_i$.

The k-means partitioning algorithm:

Algorithm: The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Input:
  k: the number of clusters,
  D: a data set containing n objects.
Output: A set of k clusters.
Method:
[1] Arbitrarily choose k objects from D as the initial cluster centers;
[2] repeat
[3]   (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
[4]   update the cluster means, i.e., calculate the mean value of the objects for each cluster;
[5] until no change;

First, it randomly selects k of the objects, each of which initially represents a cluster mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges.
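
The k-means steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function names (`k_means`, `squared_distance`, `mean_point`), the use of squared Euclidean distance, the seeded random initialization, the `max_iter` cap, and the choice to leave an empty cluster's mean unchanged are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    """Partition `data` into k clusters; returns (means, labels, error)."""
    rng = random.Random(seed)
    # [1] Arbitrarily choose k objects from D as the initial cluster centers.
    means = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # [3] (Re)assign each object to the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in data
        ]
        # [5] Stop as soon as no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # [4] Update each cluster mean (an empty cluster keeps its old mean).
        for i in range(k):
            members = [p for p, lab in zip(data, labels) if lab == i]
            if members:
                means[i] = mean_point(members)
    # Square-error criterion E: squared distance of each object to its mean.
    error = sum(squared_distance(p, means[lab]) for p, lab in zip(data, labels))
    return means, labels, error


if __name__ == "__main__":
    points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
              (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    means, labels, error = k_means(points, k=2)
    print(means, labels, error)
```

On the six sample points, the first three and the last three objects end up in separate clusters. On less well-separated data the result depends on the initial choice of centers, which is why the seed is exposed as a parameter here.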
Figure: Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a "+".)

(ii) Representative Object-Based Technique: The K-medoids Method

The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. This effect is particularly exacerbated due to the use of the square-error function.
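
The outlier sensitivity can be seen on a small one-dimensional example. The sketch below compares the mean of a cluster with its medoid, taken here to be the member object with the smallest total absolute distance to the other members. The data values and the helper names `mean` and `medoid` are made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def medoid(values):
    """Member of `values` with the smallest total absolute distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


values = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = values + [100.0]  # one object with an extremely large value

print(mean(values), medoid(values))              # 3.0 3.0
print(mean(with_outlier), medoid(with_outlier))  # mean jumps to about 19.17, medoid stays 3.0
```

A single extreme object pulls the mean far away from the bulk of the cluster, while the medoid remains one of the central objects.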
Instead of the mean, the k-medoids method uses an actual object as the representative of each cluster. An absolute-error criterion is used, defined as

$$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|,$$

where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster $C_j$, and $o_j$ is the representative object of $C_j$.

To decide whether a nonrepresentative object $o_{random}$ is a good replacement for a current representative object $o_j$, four cases are examined for each nonrepresentative object p:

Case 1: p currently belongs to representative object $o_j$. If $o_j$ is replaced by $o_{random}$ as a representative object and p is closest to one of the other representative objects $o_i$, $i \neq j$, then p is reassigned to $o_i$.
Case 2: p currently belongs to representative object $o_j$. If $o_j$ is replaced by $o_{random}$ as a representative object and p is closest to $o_{random}$, then p is reassigned to $o_{random}$.
Case 3: p currently belongs to representative object $o_i$, $i \neq j$. If $o_j$ is replaced by $o_{random}$ as a representative object and p is still closest to $o_i$, then the assignment does not change.
Case 4: p currently belongs to representative object $o_i$, $i \neq j$. If $o_j$ is replaced by $o_{random}$ as a representative object and p is closest to $o_{random}$, then p is reassigned to $o_{random}$.

PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms. It attempts to determine k partitions for n objects. After an initial random selection of k representative objects, the algorithm repeatedly tries to make a better choice of cluster representatives. The total cost of swapping is the sum of the costs incurred by all nonrepresentative objects. The complexity of each iteration is O(n-
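
The swap evaluation at the heart of PAM can be sketched as follows. This is a simplified sketch under stated assumptions, not the original PAM implementation: Euclidean distance via `math.dist` (Python 3.8+) stands in for |p - o_j|, the initial representatives are passed in rather than drawn at random, and the function names (`nearest`, `swap_cost`, `pam`) are hypothetical. The per-object term in `swap_cost` covers all four reassignment cases at once, because each object simply moves to whichever representative is nearest after the swap.

```python
import math


def nearest(p, medoids):
    """Distance from object p to its closest representative object."""
    return min(math.dist(p, o) for o in medoids)


def absolute_error(data, medoids):
    """Absolute-error criterion E for the given representatives."""
    return sum(nearest(p, medoids) for p in data)


def swap_cost(data, medoids, o_j, o_random):
    """Total change in E if representative o_j is replaced by o_random.

    Summed over every object in the data set: its distance to the nearest
    representative after the swap minus that distance before the swap.
    """
    swapped = [o_random if o == o_j else o for o in medoids]
    return sum(nearest(p, swapped) - nearest(p, medoids) for p in data)


def pam(data, initial_medoids):
    """Apply the best cost-reducing swap repeatedly until none remains."""
    medoids = list(initial_medoids)
    while True:
        candidates = [
            (swap_cost(data, medoids, o_j, o_r), o_j, o_r)
            for o_j in medoids
            for o_r in data
            if o_r not in medoids
        ]
        if not candidates:
            return medoids
        cost, o_j, o_r = min(candidates)
        if cost >= 0:  # no swap lowers E any further
            return medoids
        medoids[medoids.index(o_j)] = o_r


if __name__ == "__main__":
    points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
              (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    result = pam(points, [(1.0, 1.0), (1.5, 2.0)])
    print(result, absolute_error(points, result))
```

Each accepted swap strictly lowers E, so the loop terminates. Evaluating every (representative, nonrepresentative) pair on every pass is what makes each iteration expensive on large data sets.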