
DA1 1

STORAGE
TECHNOLOGIES
ITE2009

DIGITAL ASSIGNMENT 1
SUMIT PATIL
SLOT: D1+TD2
FACULTY BHAVANI S

PAPER 1
A Modified K-Means Algorithm for Big Data Clustering
SK Ahammad Fahad (IBAIS University, Dhaka, Bangladesh) and Md. Mahbub Alam (DUET, Dhaka, Bangladesh)

The amount of data grows every minute, and it comes from everywhere: social media, sensors, search engines, GPS signals, transaction records, satellites, financial markets, e-commerce sites and so on. This large volume of data may be structured, semi-structured or unstructured, so it is essential to extract meaningful information from such huge data sets. Clustering is the process of organizing data so that items are grouped into the same cluster when they are similar according to specific metrics. In this paper, we work on the k-means clustering method for clustering big data. Several techniques have been proposed for improving the performance of the k-means clustering algorithm. We propose a method that makes the algorithm less time-consuming and more effective and efficient, producing better clusters with reduced complexity. According to our observation, the quality of the resulting clusters depends heavily on the selection of the initial centroids and on how data points change clusters in the subsequent iterations. As we know, after a certain number of iterations only a small portion of the data points change their clusters. Our proposed method therefore first finds the initial centroids and then places an interval between those data elements which will not change their cluster and those which may change their cluster in the subsequent iterations. This significantly reduces the workload in the case of large data sets. We evaluate our method on different sets of data and compare it with other methods as well.
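The observation above, that only a small share of points still change clusters after the first few iterations, can be checked by running plain Lloyd's k-means and counting reassignments per iteration. The following Python/NumPy sketch is ours for illustration only, not the authors' algorithm; all names are ours:

```python
import numpy as np

def kmeans_with_change_counts(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means that also records how many points change
    cluster at each iteration, to illustrate the paper's observation
    that this count shrinks quickly."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    changes = []
    for _ in range(iters):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        changes.append(int((new_labels != labels).sum()))
        labels = new_labels
        if changes[-1] == 0:          # converged: nothing moved
            break
        # Update each center to the mean of its members.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers, changes
```

On two well-separated blobs the change count typically drops from "all points" to zero within a handful of iterations; that rapidly shrinking set of moving points is the slack the paper's interval idea exploits.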

MODIFIED ALGORITHM

RESULTS

PAPER 2

A Proposed Modification of K-Means Algorithm


Sharfuddin Mahmood, American International University-Bangladesh, Dhaka 1213, Bangladesh. Email: smahmood@aiub.edu

K-means is one of the most popular algorithms for data clustering. The algorithm tries to cluster data of similar types together from a large data set using a brute-force strategy of repeated distance calculations, so its computational complexity is very high. Several researches have been carried out to minimize this complexity. This paper presents the result of our research, which proposes a modified version of the k-means algorithm with an improved technique to divide the data set into a specific number of clusters with the help of several check-point values. It requires less computation and has better accuracy than the traditional k-means algorithm as well as some modified variants of the traditional k-means.

MODIFIED ALGORITHM
Step 1:

Find the Euclidean distance of each data object from the origin (0, 0, ..., 0).

Here the N data objects are taken as input, and we find the Euclidean distance of each data object with respect to the origin.

Step 2:

Sort the N data objects in ascending order according to the distances found in the previous step.

Step 3:

Divide the data set into K equal clusters. K is determined by the user requirement or by the type of the data set. These will act as the primary clusters.

This step is necessary for setting up the initial clusters. Depending on the number of clusters needed, we divide the whole data set into equal portions. The portions may not always be exactly equal, since the number of objects may not divide evenly; for example, if we have 1000 data objects and have to divide them into 3 clusters, the clusters may have 333, 333 and 334 objects each.

Step 4:

For each cluster, consider the middle point as the primary cluster center. That is, if there are N data members and K clusters, the primary cluster center of each cluster will be its ((N/K)/2)-th object.

Since this ordering is obtained from the distances to the initial origin, the center points are the most significant points in each cluster, from which the data objects have a roughly uniform distance.
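Steps 1-4 can be sketched as follows, assuming a single origin at zero and NumPy arrays; the function name and details are ours:

```python
import numpy as np

def initial_centers(X, k):
    """Steps 1-4: sort points by Euclidean distance from the origin,
    cut the sorted list into k near-equal blocks (the primary clusters),
    and take the middle point of each block as its primary center."""
    d = np.linalg.norm(X, axis=1)           # Step 1: distance from origin
    order = np.argsort(d)                   # Step 2: ascending sort
    blocks = np.array_split(order, k)       # Step 3: k near-equal parts
    return np.array([X[b[len(b) // 2]] for b in blocks])  # Step 4: middles
```

`np.array_split` handles the uneven case mentioned above (e.g. 1000 points into 3 blocks of 334, 333, 333) without any special-casing.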

Step 5:

Find the distances between the cluster centers. For K clusters there are K(K-1)/2 pairwise distances. Divide each distance by 2 and store the value in Dij (i, j = 1, ..., K). Here Dij denotes the middle point of the distance from cluster center i to cluster center j. This Dij will be used as a check-point value.

For example, suppose clusters A and B have cluster centers Ai and Bi. The midpoint of the segment between Ai and Bi is a point whose distance from both cluster centers is equal. As a result, it can be used to determine the new cluster for any data object if needed.

Step 6:

Find the Euclidean distance of each data object di (i = 1, ..., N) from the cluster center it is assigned to.

Step 7:

Compare di with the distance stored in Dij.

If the distance is less than or equal to Dij, then the object stays in its previous cluster.

That is, the distance from the current cluster center is less than the distance to the midpoint between the two cluster centers, so the object must be closer to its current cluster. Hence we do not need to calculate its distance from the other cluster center. This check-point value ensures that we need less computation.

Otherwise, calculate the Euclidean distance of the data object with respect to the center whose check-point value was crossed. That is, if Dij is exceeded and the object was previously in the cluster with center i, then compute the distance with respect to cluster center j.

This means the object may be closer to the other cluster center; to be sure, we have to calculate its distance with respect to that center.

Now compare the distances and assign the data object to the cluster whose center is nearer.

Recalculate the cluster centers by taking the mean of all objects currently present in each cluster. The resulting point may be an imaginary point with no existence in the current data set, or it may coincide with an actual object of the data set; either way it does not affect the outcome of the algorithm.

Go back to Step 4 and repeat until the convergence criterion is met, that is, no data object moves from one cluster to another after the cluster centers are updated. The membership of each cluster then remains the same, so the centers also remain unchanged. At this point we can conclude that we have reached the final clusters: similar objects are grouped together within each cluster, and each cluster differs from the others.
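Steps 5-7, the check-point test, can be sketched as a single assignment pass. The skip rule is sound by the triangle inequality: if a point lies within half the distance between its current center and another center, it cannot be closer to that other center. A minimal NumPy sketch (names ours):

```python
import numpy as np

def checkpoint_assign(X, centers, labels):
    """One assignment pass with the check-point rule (Steps 5-7):
    a point is fully re-examined only if its distance to its current
    center exceeds some half-center-distance Dij.  Returns the new
    labels and how many full comparisons were skipped."""
    k = len(centers)
    # Step 5: Dij = half the distance between centers i and j.
    cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=2) / 2.0
    new_labels = labels.copy()
    skipped = 0
    for idx, x in enumerate(X):
        i = labels[idx]
        di = np.linalg.norm(x - centers[i])          # Step 6
        # Step 7: if di <= Dij for every other center j, the point
        # cannot be closer to any j, so it stays in its cluster.
        if all(di <= cc[i, j] for j in range(k) if j != i):
            skipped += 1
            continue
        # A check point was crossed: compare against all centers
        # and keep the nearest one.
        dists = np.linalg.norm(x - centers, axis=1)
        new_labels[idx] = int(dists.argmin())
    return new_labels, skipped
```

In a full clustering loop this pass would alternate with center recalculation (the paper's Step 4 onwards) until no label changes.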

PAPER 3
An Improvement in K-mean Clustering Algorithm Using
Better Time and Accuracy
Er. Nikhil Chaturvedi and Er. Anand Rajavat

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. In the k-means algorithm, the data is partitioned into K clusters, with points initially assigned to clusters at random. This paper proposes a new k-means clustering algorithm that calculates the initial centroids systematically instead of assigning them randomly, which improves both accuracy and running time.

MODIFIED ALGORITHM
Phase 1: For the initial centroids

Steps:

1. Set p = 1;

2. Measure the distance between each data point and all other data points in the set D;

3. Find the closest pair of data points in the set D and form a data set Ap (1 <= p <= k) which contains these two points; delete these two points from the set D;

4. Find the data point in D that is closest to the data set Ap; add it to Ap and delete it from D;

5. Repeat step 4 until the number of data points in Ap reaches the target size of an initial cluster;

6. If p < k, set p = p + 1 and repeat from step 3; otherwise, for each Ap (1 <= p <= k), find the mean of the data points in Ap. These means will be the initial centroids.
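Phase 1 can be sketched as below. The growth threshold in step 5 is garbled in the text, so this sketch assumes each set Ap grows to roughly n/k points; that reading, and all names, are ours:

```python
import numpy as np

def phase1_centroids(X, k):
    """Sketch of Phase 1: seed each set Ap with the closest remaining
    pair, grow it with the nearest remaining point until it holds about
    n/k points, and use each set's mean as an initial centroid."""
    n = len(X)
    remaining = list(range(n))
    target = n // k
    centroids = []
    for p in range(k):
        if p < k - 1:
            # Step 3: find the closest remaining pair to seed Ap.
            best = None
            for a in range(len(remaining)):
                for b in range(a + 1, len(remaining)):
                    d = np.linalg.norm(X[remaining[a]] - X[remaining[b]])
                    if best is None or d < best[0]:
                        best = (d, remaining[a], remaining[b])
            group = [best[1], best[2]]
            remaining.remove(best[1])
            remaining.remove(best[2])
            # Steps 4-5: grow Ap with the nearest remaining point.
            while len(group) < target and remaining:
                dists = [min(np.linalg.norm(X[r] - X[g]) for g in group)
                         for r in remaining]
                group.append(remaining.pop(int(np.argmin(dists))))
        else:
            group = remaining  # the last set takes whatever is left
        centroids.append(X[group].mean(axis=0))  # Step 6: means as centroids
    return np.array(centroids)
```

The pairwise search makes this O(n^2) per seed, which is the price paid for deterministic, well-spread initial centroids.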

Phase 2: Data to the clusters

Steps:

1. Compute the distance from each data point to all the centroids;

2. For each data point di, find the closest centroid ci and assign di to cluster j.

3. Set ClusterCL[i] = j; // j:CL of the closest cluster

4. Set Shorter_Dist[i] = d (di, cj);

5. For each cluster j (1<=j<=k), recalculate the centroids;

6. Repeat

7. for each data di,

7.1 Compute the distance from the centroids of the closest cluster;

7.2 If the distance is less than or equal to the present closest distance, the data point stays in its cluster; Else

7.2.1 For every centroid, compute the distance; End for;

7.2.2 Assign data point di to the cluster with the closest centroid cj;

7.2.3 Set ClusterCL[i] = j;

7.2.4 Set Shorter_Dist[i] = d (di, cj); End for;

8. For each cluster j (1<=j<=k), recalculate the centroids; until the convergence criterion is met.
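Phase 2 can be sketched as follows. The cached ClusterCL and Shorter_Dist values mean a point is compared against all centroids only when its distance to its current centroid has grown beyond the cached value. A minimal NumPy sketch (names ours, lightly pythonized):

```python
import numpy as np

def phase2(X, centroids, iters=50):
    """Steps 1-8: assign each point once, cache its label (ClusterCL)
    and its distance to that centroid (Shorter_Dist); on later passes a
    point is compared against all centroids only if its distance to its
    current centroid has grown beyond the cached value."""
    n, k = len(X), len(centroids)
    d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    cluster_cl = d.argmin(axis=1)                  # steps 1-3
    shorter_dist = d[np.arange(n), cluster_cl]     # step 4
    for _ in range(iters):
        for j in range(k):                         # step 5: new centroids
            if (cluster_cl == j).any():
                centroids[j] = X[cluster_cl == j].mean(axis=0)
        moved = False
        for i in range(n):                         # step 7
            di = np.linalg.norm(X[i] - centroids[cluster_cl[i]])
            if di <= shorter_dist[i]:              # 7.2: point stays put
                shorter_dist[i] = di
                continue
            dists = np.linalg.norm(X[i] - centroids, axis=1)  # 7.2.1
            j = int(dists.argmin())                # 7.2.2
            moved = moved or (j != cluster_cl[i])
            cluster_cl[i] = j                      # 7.2.3
            shorter_dist[i] = dists[j]             # 7.2.4
        if not moved:                              # step 8 criterion
            break
    return cluster_cl, centroids
```

Note that the cached-distance test is a heuristic: it saves the k-way comparison whenever a point has drifted closer to its own centroid, which is the common case in late iterations.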

RESULTS

PAPER 4

Implementing & Improvisation of K-means Clustering Algorithm

Unnati R. Raval, Chaita Jani

Clustering techniques are a central part of data analysis, and k-means is one of the oldest and most popular of them. The paper discusses the traditional k-means algorithm with its advantages and disadvantages. It also reviews enhanced k-means variants proposed by various authors and techniques to improve the traditional k-means for better accuracy and efficiency. There are two areas of concern for improving k-means: 1) selecting the initial centroids, and 2) assigning data points to the nearest cluster using equations for the mean and for the distance between two data points. The time complexity of the proposed k-means technique is lower than that of the traditional one, with an increase in accuracy and efficiency. The main purpose of the article is to propose techniques for deriving the initial centroids and for assigning the data points to their nearest clusters. The clustering technique proposed in this paper enhances accuracy and time complexity, but it still needs further improvements; in the future it would also be viable to include efficient techniques for selecting the number of initial clusters (k). Experimental results show that the improved method can effectively improve the speed and accuracy of clustering, reducing the computational complexity of k-means.

MODIFIED ALGORITHM
Part1: Determine initial centroids

Step1.1: Input Dataset

Step1.2: Check the Each attributes of the Records

Step1.3: Find the mean value for the given Dataset.

Step1.4: Find the distance of each data point from the mean value using Equation (Equ).

IF

The distance from the mean value is minimal, it is stored,

Then the data set is divided into k cluster points that do not need to move to other clusters.

ELSE

Recalculate the distance of each data point from the mean value using Equation (Equ) until the data set is divided into k clusters.

Part2: Assigning data points to nearest centroids

Step2.1: Calculate the distance from each data point to the centroids, assign each data point to its nearest centroid to form clusters, and store the distance value for each data point.

Step2.2: Calculate new centroids for these clusters.

Step2.3: Calculate the distance from all centroids to each data point, for all data points.

IF

The distance calculated now is equal to or less than the distance stored in Step2.1,

Then those data points do not need to move to other clusters.

ELSE

From the distances calculated, assign each data point to its nearest centroid by comparing its distances from the different centroids.

Step2.5: Calculate the centroids for these new clusters again, until the convergence criterion is met.
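Part 1's wording is terse, so the sketch below is one plausible reading of it: rank the points by their distance from the data set's mean, split the ranking into k groups, and seed each centroid from a group. Treat it as an assumption-laden illustration (all names ours), not the authors' exact procedure:

```python
import numpy as np

def part1_initial_centroids(X, k):
    """One reading of Part 1: rank every point by its distance from the
    data set's mean value, split that ranking into k groups, and use
    each group's mean as an initial centroid, so the centroids are
    spread across the data rather than chosen at random."""
    mean = X.mean(axis=0)                    # Step 1.3: mean of the data set
    d = np.linalg.norm(X - mean, axis=1)     # Step 1.4: distance from mean
    groups = np.array_split(np.argsort(d), k)
    return np.array([X[g].mean(axis=0) for g in groups])
```

Part 2 then proceeds like the cached-distance assignment of Paper 3's Phase 2: store each point's nearest-centroid distance in Step 2.1 and re-examine the point only when that distance is beaten.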

RESULTS

PAPER 5

Modification of K-means Clustering Algorithm


Sumit Patil, Dhanashri Patil, Rushikesh Babar, Abhishek Rathi

K-means is one of the important algorithms when it comes to clustering. Clustering is a term for grouping all similar data together in a cluster. In this project we have tried to modify the algorithm to improve its scalability. As this algorithm is mainly used for analyzing big data, when the data is too big the program can take hours to run and the system can sometimes hang, so we have tried to reduce its execution time and get the output faster than the traditional algorithm.

MODIFIED ALGORITHM

Step 1:

Take a reference point: (0, 0) if the data contains 2 attributes, (0, 0, 0) if it contains 3 attributes. For n attributes we have to consider an n-dimensional point.

Step 2:

Calculate the distance of all the points from the reference point you have taken. The distance can be calculated by the Euclidean distance formula:

d = sqrt((X2 - X1)^2 + (Y2 - Y1)^2)

X1 - x component of reference point

Y1 - y component of reference point

X2 - x component of given point

Y2 - y component of given point

Step 3:

Calculate the mean of the distances which we have calculated in Step 2:

M = (d1 + d2 + ... + dN) / N

M - mean

N - total number of points
DA1 15

Step 4:

If the number of required clusters is N:

E = D / N

D - the largest distance calculated in Step 2

N - number of clusters

Step 5:

Divide the points into N clusters:

1st cluster: points having distance between 0 and E.

2nd cluster: points having distance between E and 2E.

Similarly we go up to the Nth cluster.

Step 6:

Calculate the centroid of the clusters formed in Step 5. We do this by calculating the mean of the coordinates of the points in the respective clusters.

X = (Σ xi) / n

Y = (Σ yi) / n

n - number of points present in that cluster.

Similarly we can extend this to n dimensions.

Step 7:

Now that we have got the centroids, we have to repeat the steps of the traditional method.
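Steps 1-6 can be sketched as follows, assuming D in Step 4 means the largest distance found in Step 2 (the text is ambiguous on this point); all names are ours:

```python
import numpy as np

def band_clusters(X, n_clusters, ref=None):
    """Steps 1-6: measure each point's Euclidean distance from a
    reference point (the origin by default), set E = D / n_clusters
    where D is the largest such distance, place each point in the band
    [i*E, (i+1)*E), and seed centroids as the band means."""
    if ref is None:
        ref = np.zeros(X.shape[1])                # Step 1: reference point
    d = np.linalg.norm(X - ref, axis=1)           # Step 2: distances
    E = d.max() / n_clusters                      # Step 4: band width
    bands = np.minimum((d // E).astype(int), n_clusters - 1)  # Step 5
    centroids = np.array([X[bands == i].mean(axis=0)          # Step 6
                          for i in range(n_clusters)
                          if (bands == i).any()])
    return bands, centroids
```

The band assignment replaces the random initialization of traditional k-means; Step 7 would feed these centroids into the usual assign-and-update loop.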

RESULTS
