with two dimensions, but in Mahout a vector can have 2, 3, or 1,000 dimensions. The first is dimension 0, the next is dimension 1, and so on. The maximum number of dimensions possible for the vector is known as its cardinality.
Example stop words: a, about, above, after, again, against, all, am, an, and, any, as, at, be, because, etc.
D. TF-IDF & Weight
TF-IDF weighting is a commonly used enhancement of simple term-frequency weighting. The term frequency value is multiplied by the inverse document frequency of the term.
W = TF * log(N/DF)
IDF = log(N/DF)
where
IDF - Inverse Document Frequency
DF - Document Frequency
N - Document Count
W - Weight of a word in the document vector
TF - Term Frequency
Fig.1. Representation of Vector
B. Vectorization
Different sized and colored balls need to be transformed into an appropriate vector. This transformation is based on how the different attributes of the balls translate into decimal values.
E. Normalization
Normalization is the practice of diminishing the magnitude of large vectors and increasing the magnitude of small ones. In Mahout, normalization is based on the p-norm.
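The TF-IDF weight W = TF * log(N/DF) and p-norm normalization can be illustrated together with a minimal plain-Python sketch. It does not use Mahout. The function names, the choice of the natural logarithm, the reading of normalization as dividing each component by the vector's p-norm, and the toy counts are all assumptions made for illustration.

```python
import math

def tf_idf_weight(tf, df, n_docs):
    """W = TF * log(N / DF); natural logarithm assumed."""
    return tf * math.log(n_docs / df)

def p_norm(vector, p=2.0):
    """p-norm of a vector: (sum |x_i|^p)^(1/p)."""
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)

def normalize(vector, p=2.0):
    """Divide every component by the p-norm so the result has p-norm 1."""
    norm = p_norm(vector, p)
    if norm == 0:
        return list(vector)  # an all-zero vector is left unchanged
    return [x / norm for x in vector]

# Hypothetical toy numbers: term counts in one document and the number
# of documents (out of n_docs) that contain each term.
term_counts = {"oil": 3, "price": 1, "the": 10}
doc_freq = {"oil": 10, "price": 100, "the": 1000}
n_docs = 1000

weights = {
    term: tf_idf_weight(tf, doc_freq[term], n_docs)
    for term, tf in term_counts.items()
}
unit_vector = normalize(list(weights.values()), p=2.0)
```

In this toy example the term that occurs in every document receives a weight of zero, because log(N/DF) is zero when DF equals N. This is why TF-IDF suppresses very common words.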
F. K-means MapReduce Flow
In Mahout, the MapReduce version of the k-means algorithm is invoked using the KMeansDriver class. It has a single entry point, the runJob() method. The algorithm takes a set of input parameters.
Fig.2. Three balls with different colors, sizes and weights.
TABLE 1. Vectorization process
Ball                     Weight (kg) (0)   Color (1)   Size (2)   Vector
Medium, Oval, Green      0.17              530         2          [0.17, 530, 2]
Large, Oval, Red         0.25              630         3          [0.25, 630, 3]
Small, Elongated, Red    0.19              610         1          [0.19, 610, 1]
Large, Round, Yellow     0.29              590         3          [0.29, 590, 3]
Small, Round, Green      0.15              520         1          [0.15, 520, 1]
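The mapping in Table 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not Mahout code: the Ball class, the to_vector function, and the SIZE_CODES mapping (small = 1, medium = 2, large = 3, as the Size column suggests) are hypothetical names, and the color is assumed to be supplied already as the numeric value shown in the Color column.

```python
from dataclasses import dataclass

# Assumed size encoding, read off the Size column of Table 1.
SIZE_CODES = {"small": 1, "medium": 2, "large": 3}

@dataclass
class Ball:
    weight_kg: float    # dimension 0
    color_value: float  # dimension 1, numeric color value as in Table 1
    size: str           # dimension 2, encoded through SIZE_CODES

def to_vector(ball):
    """Turn a ball's attributes into a 3-dimensional numeric vector."""
    return [
        float(ball.weight_kg),
        float(ball.color_value),
        float(SIZE_CODES[ball.size.lower()]),
    ]

# First row of Table 1: medium, oval, green ball.
vector = to_vector(Ball(weight_kg=0.17, color_value=530, size="Medium"))
```

Each attribute occupies a fixed dimension, so every ball yields a vector of the same cardinality (3 here) and the vectors can be compared by a distance measure.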
Fig.4. Flow of Running K-means in Mahout
C. Data Representation
Represent the features of an item as fields of a vector.
Dense Vector
Random Access Sparse Vector
Sequential Access Sparse Vector
G. RandomSeed Generator
When the k-means job is run using RandomSeedGenerator, the generator seeds the initial centroids for k-means.
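The idea of seeding k-means with randomly chosen initial centroids can be sketched in plain Python. This is a single-machine illustration under stated assumptions, not Mahout's RandomSeedGenerator or KMeansDriver: the function names, the parameters max_iterations and convergence_delta, the use of Euclidean distance, and the sample points are all hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def random_seed(points, k, rng):
    """Pick k distinct input points as the initial centroids."""
    return [list(p) for p in rng.sample(points, k)]

def kmeans(points, k, max_iterations=10, convergence_delta=1e-3, seed=42):
    rng = random.Random(seed)
    centroids = random_seed(points, k, rng)
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # Stop once no centroid has moved more than convergence_delta.
        shift = max(
            squared_distance(old, new) ** 0.5
            for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift < convergence_delta:
            break
    return centroids, assignments

# Two well-separated groups of hypothetical 2-dimensional points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
```

Random seeding is simple, but the final clusters can depend on which points are drawn as seeds, so fixing the seed (as done here) makes runs repeatable.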
H. Data Sets
To compare the above clustering techniques, two data sets are used in the experiments: the UCI Machine Learning Repository and Reuters-21578. The Reuters-21578 data set consists of newswire articles provided by David D. Lewis Consulting, Ornarose, Inc. and Reuters Ltd. It is distributed as 22 data files, the first 21 containing 1000 articles each and the last containing 578.
Fig.3. The flow of Data Representation
a) Stop Words: These are words filtered out before or after the processing of natural language data. The filtered-out words occur frequently and in abundance in all documents, hence they are excluded when processing the data.
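The stop-word filtering described under a) Stop Words can be illustrated with a short Python sketch. It is a simplified illustration, not Mahout's text pipeline: the function name remove_stop_words is hypothetical, tokenization is assumed to be lowercasing plus whitespace splitting, and the stop-word set contains only the words given in the example list in the text, which is not a complete list.

```python
# Stop words taken from the example list in the text (incomplete by design).
STOP_WORDS = {
    "a", "about", "above", "after", "again", "against", "all", "am",
    "an", "and", "any", "as", "at", "be", "because",
}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

# Hypothetical sample sentence.
tokens = remove_stop_words("Oil prices rise again after all")
```

Filtering before vectorization keeps very common words from occupying dimensions in every document vector.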
2016 3rd MEC International Conference on Big Data and Smart City
IV CONCLUSION