
2016 3rd MEC International Conference on Big Data and Smart City

A Comparative Study of Various Clustering Techniques on Big Data Sets using Apache Mahout
Dr. Venkateswara Reddy Eluri
Dept. of Information Technology
Shinas College of Technology
Shinas, Oman
Email: Venkateswara.eluri@shct.edu.om

Ms. Amina Salim Mohd Al-Jabri
Dept. of Information Technology
Shinas College of Technology
Shinas, Oman
Email: Amina.Aljabri@shct.edu.om

Dr. M. Ramesh
Dept. of Information Technology
R.V.R & JC College of Engineering
Chowdavaram, India
Email: mrameshmailbox@gmail.com

Dr. Mare Jane
Dept. of Information Technology
Shinas College of Technology
Shinas, Oman
Email: mery@shct.edu.om

Abstract—Clustering algorithms have emerged as an unconventional tool to precisely examine the immense volume of data produced by present-day applications. In particular, their main objective is to classify data into clusters such that objects are grouped in the same cluster when they are similar according to particular metrics, and are dissimilar to objects of other groups. From the machine learning perspective, clustering can be viewed as unsupervised learning of concepts. Hadoop is a distributed file system and an open-source implementation of MapReduce for dealing with big data. Apache Mahout clustering algorithms are implemented on top of Hadoop using the MapReduce paradigm. In this paper three clustering algorithms implemented using Apache Mahout are described and compared: K-Means, Fuzzy K-Means (FKM) and Canopy clustering. In addition, we highlight the clustering algorithms that perform best for big data.

Keywords—Clustering Algorithm; Apache Mahout; big data; K-Means; Fuzzy K-Means; Canopy clustering.

978-1-4673-9584-7/16/$31.00 2016 IEEE

I. INTRODUCTION
In the current digital era, owing to the immense development of the internet and online technologies, we face a huge volume of information and data day by day. Many different services and resources, such as sensor networks, cloud storage and social networks, produce big volumes of data that need to be managed and reused, along with analytical aspects of that data. A big volume of data therefore has its own deficits as well. One method to overcome these challenging problems is to have big data clustered in a compact format that is still an informative version of the entire data. Such clustering techniques aim to yield a good quality of clusters. To perform this clustering of massive data, we need parallel processing. MapReduce is a programming model designed for processing large volumes of data in parallel; it does so by dividing the work into a set of independent tasks. Mahout is a set of distributed data mining libraries that interface with an underlying distributed system. The framework for the distributed system is Hadoop, which implements MapReduce. Mahout is said to be scalable to large data sets. In this paper, we focus on various clustering techniques using Apache Mahout.

II. CLUSTERING TECHNIQUES
Although many clustering algorithms are available, this section introduces an unsupervised clustering technique and a pre-clustering technique, both supported by Apache Mahout:

K-Means clustering technique

Canopy clustering technique

III. RELATED WORK
1. K-Means Clustering Technique
The K-Means algorithm takes an input k from the user and partitions the n data objects into k clusters so that the resulting intra-cluster similarity is high and the inter-cluster similarity is low. Cluster similarity is calculated based on the mean value of the objects in a cluster. First, the algorithm randomly picks k data objects as the mean or centroid points. Each of the remaining objects is then assigned to the centroid to which it is most similar, based on the distance between the object and the cluster mean. The algorithm then computes the new mean for each cluster, and this process iterates until good clusters are formed. K-Means is well suited to generating globular clusters. The K-Means method is numerical, unsupervised, non-deterministic and iterative.

A. Representing Data as a Vector
In two dimensions, a vector is represented as an ordered list of values, one for each dimension, like (4, 3). We often denote the first dimension as x and the second as y when dealing with two dimensions, but in Mahout a vector can have 2, 3 or 1,000 dimensions. The first is dimension 0, the next is dimension 1, and so on. The maximum number of dimensions possible for a vector is known as its cardinality.
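The K-Means procedure described above (pick k random centroids, assign each object to its nearest centroid, recompute the means, iterate) can be sketched as follows. This is a minimal illustrative sketch in plain Python, not Mahout's MapReduce implementation; the data points, k and the random seed are made-up values for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Cluster 2-D points into k clusters (illustrative, not Mahout's implementation)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly pick k objects as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two obvious globular groups, around (0, 0) and (10, 10).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, k=2)
```

On such globular data the iteration converges quickly and separates the two groups, which matches the observation above that K-Means is well suited to globular clusters.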
Fig.1. Representation of a vector

B. Vectorization
Different sized and colored balls need to be transformed into an appropriate vector. The transformation is based on how the different attributes of the balls translate into decimal values.

Fig.2. Three balls with different colors, sizes and weights.

TABLE 1. Vectorization process

Ball                  | Weight (kg) (0) | Color (1) | Size (2) | Vector
Medium, Oval, Green   | 0.17            | 530       | 2        | [0.17, 530, 2]
Large, Oval, Red      | 0.25            | 630       | 3        | [0.25, 630, 3]
Small, Elongated, Red | 0.19            | 610       | 1        | [0.19, 610, 1]
Large, Round, Yellow  | 0.29            | 590       | 3        | [0.29, 590, 3]
Small, Round, Green   | 0.15            | 520       | 1        | [0.15, 520, 1]

C. Data Representation
Represent the features of an item as the fields of a vector:

Dense Vector
Random Access Sparse Vector
Sequential Access Sparse Vector

Fig.3. The flow of data representation

a) Stop Words: These are words filtered out before, during or after processing of natural language data. The filtered-out words occur frequently and in abundance in all documents, hence they are excluded when processing the data.
Example: a, about, above, after, again, against, all, am, an, and, any, as, at, be, because, etc.

D. TF-IDF & Weight
TF-IDF weighting is a commonly used enhancement over simple term-frequency weighting: the term frequency value is multiplied by the inverse of the term's document frequency.

Weight = TF * Log(N/DF)
IDF = Log(N/DF)

where TF is the term frequency of a word, DF is its document frequency, N is the document count, IDF is the inverse document frequency, and Weight is the weight of the word in the document vector.

E. Normalization
Normalization is the practice of diminishing the magnitude of large vectors (and increasing that of small ones). In Mahout, normalization uses the p-norm.

F. K-Means MapReduce Flow
In Mahout, the MapReduce edition of the K-Means algorithm is invoked through the KMeansDriver class, which has a single entry point, the runJob() method, taking the algorithm's input parameters.

Fig.4. Flow of running K-Means in Mahout

G. RandomSeedGenerator
When the K-Means job is run with RandomSeedGenerator, it seeds the initial centroids for K-Means.

H. Data Sets
To compare the above clustering techniques, two data sets are used in the experiments: the UCI Machine Learning Repository and Reuters-21578. The Reuters-21578 data set consists of text newswire articles provided by David D. Lewis Consulting, Ornarose, Inc. and Reuters Ltd., comprising 22 data files, each file having 1000 articles.
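The TF-IDF weighting of Section D above (Weight = TF * Log(N/DF)) can be sketched as follows. The base-10 logarithm and the toy document-frequency numbers are assumptions for illustration; they are not taken from Mahout's vectorizer.

```python
import math

def tfidf_weight(tf, df, n):
    """Weight = TF * log(N / DF): term frequency damped by how common the term is."""
    return tf * math.log10(n / df)

# Toy corpus of N = 10 documents (made-up numbers).
N = 10
# A stop-word-like term appearing in all 10 documents gets weight 0,
# while an equally frequent term appearing in only 1 document keeps its weight.
common = tfidf_weight(tf=5, df=10, n=N)  # log10(10/10) = 0
rare = tfidf_weight(tf=5, df=1, n=N)     # log10(10/1) = 1
```

This shows why TF-IDF complements stop-word removal: terms occurring in every document are driven toward zero weight automatically.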

I. Results
Figure 5 shows the experimental result of clustering the Reuters-21578 news articles using the K-Means clustering technique, with the following outcome:
No. of clusters: 20
Time taken: 2975 ms (0.049583 minutes)
No. of iterations: 4

Fig.5. An experimental result in Mahout: clustering using K-Means

Fig.6. Visualization of clusters of different news articles (Reuters-21578) using K-Means

J. Limitations of K-Means Clustering
K-Means has problems when clusters are of differing
o sizes
o densities
o non-globular shapes
It also has problems with outliers and with empty clusters.

2. Canopy Clustering Technique
Canopy is a very simple, fast and surprisingly accurate method for grouping objects into clusters. All objects are represented as points in a multidimensional feature space.
It is an unsupervised pre-clustering algorithm performed before k-means clustering or hierarchical clustering.
It is basically performed to speed up clustering on large data sets, where a direct implementation of the main algorithm may be impractical due to the size of the data.
The algorithm uses a fast approximate distance metric and two distance thresholds T1 > T2 for processing.
a) How canopy clustering works: Start with a set/list of data points and two distance thresholds T1 > T2 for processing.

Fig.7. Canopy clustering at work

b) Visualizing canopy centroids: In Mahout, the DisplayCanopy class displays a set of points in a 2-dimensional plane and shows how canopy generation is done using the in-memory canopy clusters.

Fig.8. Canopy cluster with T1=3.0 and T2=1.5
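The canopy procedure outlined in a) above can be sketched as follows: repeatedly pick a remaining point as a canopy center, gather every point within the loose threshold T1 into that canopy, and remove points within the tight threshold T2 from further consideration. This is an illustrative plain-Python sketch with exact Euclidean distance and made-up points and thresholds; a real implementation (and Mahout's) would use a cheap approximate distance metric.

```python
def canopy(points, t1, t2):
    """Group points into (possibly overlapping) canopies using thresholds t1 > t2."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]  # pick any remaining point as a canopy center
        # Loose threshold T1: everything this close joins the canopy.
        members = [p for p in points
                   if (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2 <= t1 ** 2]
        canopies.append((center, members))
        # Tight threshold T2: points this close cannot start a new canopy.
        remaining = [p for p in remaining
                     if (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2 > t2 ** 2]
    return canopies

pts = [(0, 0), (1, 0), (0.5, 0.5), (8, 8), (8.5, 8)]
result = canopy(pts, t1=3.0, t2=1.5)
```

Each pass removes at least the center itself, so the loop always terminates, and the resulting canopy centers can then seed K-Means instead of random initial centroids.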

TABLE 2. Comparison Table

Clustering Algorithm | MapReduce Implementation | In-memory Implementation | Fixed Clusters | Partial Membership
k-means              | K-Means Driver           | K-Means Cluster          | Y              | N
Canopy               | Canopy Driver            | Canopy Cluster           | N              | N
Fuzzy k-means        | Fuzzy K-Means Driver     | Fuzzy K-Means Cluster    | Y              | Y
Dirichlet            | Dirichlet Driver         | Dirichlet Cluster        | N              | Y
LDA                  | LDA Driver               | N/A                      | Y              | Y

IV. CONCLUSION

This paper provided a comprehensive study of the clustering algorithms covered in the related work. The Mahout API works on top of the MapReduce framework and provides a number of built-in methods for clustering big volumes of data. The K-Means clustering algorithm is suitable for globular data sets but not for non-globular ones, and identifying the number of clusters in advance is difficult for a big data set. In the canopy clustering technique, the cluster size is identified automatically.

