
INTRODUCTION

This project presents an efficient clustering method based on partitioning, which groups similar data points together. Given a dataset of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects in different clusters are "far apart" or very different. There are various other criteria for judging the quality of a partition.

To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration of all possible partitions. Instead, most applications adopt one of a few popular heuristic methods, such as the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster.

The project uses different distance measures, namely Euclidean distance, Manhattan distance, and Minkowski distance, and examines which of these three works best for k-means and how each affects the formation of the clusters. It also examines the impact of the different distance measures when they are applied to different datasets, and how the resulting clusters are defined under each measure.

SYNOPSIS
Cluster analysis, or clustering, is the process of grouping objects into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.

Cluster analysis is considered a difficult problem because many factors, such as effective similarity measures, criterion functions, initial conditions, high dimensionality, and different types of attributes, come into play in devising a well-tuned clustering technique for a given problem. A clustering algorithm should also be capable of identifying irregular, intrinsic cluster shapes over a variable-density space containing outliers.

Dataset
The dataset selected for this purpose is the Gene Expression dataset. This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set; it is a random extraction of gene expressions from patients having different types of tumors: BRCA, KIRC, COAD, LUAD, and PRAD.

Measuring the distances

i. Euclidean distance

   d(q, p) = √( (q₁ − p₁)² + (q₂ − p₂)² + … + (qₙ − pₙ)² ) = √( Σᵢ (qᵢ − pᵢ)² )

ii. Manhattan distance

   d(q, p) = |q₁ − p₁| + |q₂ − p₂| + … + |qₙ − pₙ| = Σᵢ |qᵢ − pᵢ|

iii. Minkowski distance

   d(q, p) = ( |q₁ − p₁|^m + |q₂ − p₂|^m + … + |qₙ − pₙ|^m )^(1/m) = ( Σᵢ |qᵢ − pᵢ|^m )^(1/m)

where the sums run over the attributes of the two objects q and p, and m is the order of the Minkowski distance.
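As a quick illustration of the three formulas, the following R sketch computes each distance for two example vectors q and p (the values and the Minkowski order of 3 are chosen arbitrarily for demonstration):

q <- c(1, 2, 3)
p <- c(4, 0, 3)
euclidean <- sqrt(sum((q - p)^2))      # root of the summed squared differences
manhattan <- sum(abs(q - p))           # sum of the absolute differences
minkowski <- sum(abs(q - p)^3)^(1/3)   # Minkowski distance of order 3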

USER REQUIREMENTS
Hardware

 Processor: 64-bit processor with x86-compatible architecture


 RAM: 1 GB required, 2 GB recommended
 Free disk space: At least 250 MB

Software

 Operating System: Windows 7 (SP1), 8.1, or 10


 Tools: R-3.4.1
 Technology: RStudio

DIAGRAMMATIC DISPLAY OF THE VARIOUS PROCESSES OF THE PROJECT

[Process flow: Justification → Empirical study → Pros and cons of the algorithm → Selection of dataset → Using dimensionality reduction → Start coding and execution → Tabulation to show results]

DIMENSIONALITY REDUCTION
The dataset initially contains a large number of random variables. Dimensionality reduction, or dimension reduction, is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. There are different types of dimensionality reduction methods, but the one chosen for this project is Principal Component Analysis (PCA).

Principal Component Analysis


This method was introduced by Karl Pearson. It works on the condition that when data in a higher-dimensional space are mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximized.

It involves the following steps:

 Construct the covariance matrix of the data.


 Compute the eigenvectors of this matrix.
 Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.
Hence, we are left with a smaller number of eigenvectors, and some information may be lost in the process. However, the most important variance should be retained by the remaining eigenvectors.
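A minimal sketch of these steps in R using base R's prcomp(), which performs the centring and decomposition internally; the object name dataset and the choice of three retained components mirror the R code given later in this report:

pc <- prcomp(dataset[, -1])   # principal components of the numeric columns
summary(pc)                   # proportion of variance explained by each component
comp <- pc$x[, 1:3]           # retain the first three principal components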

Advantages of Dimensionality Reduction


 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.

ALGORITHM USED
The k-means algorithm takes the input parameter, k, and partitions a set of n objects
into k clusters so that the resulting intracluster similarity is high but the intercluster similarity
is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster,
which can be viewed as the cluster’s centroid or center of gravity. “How does the k-means
algorithm work?” The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. For each of the remaining objects, an object is assigned to the cluster to which
it is the most similar, based on the distance between the object and the cluster mean. It
then computes the new mean for each cluster. This process iterates until the criterion
function converges.
Typically, the square-error criterion is used, defined as

   E = Σᵢ₌₁ᵏ Σ_{p ∈ Cᵢ} |p − mᵢ|²

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and mᵢ is the mean of cluster Cᵢ (both p and mᵢ are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.
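A small R sketch of this criterion, assuming a numeric matrix x of objects, an integer vector cluster of assignments, and a matrix centers with one row per cluster (these names are illustrative, not part of the project code):

square_error <- function(x, cluster, centers) {
  # squared distance from each object to its assigned cluster center, summed over all objects
  sum((x - centers[cluster, , drop = FALSE])^2)
}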

Suppose that there is a set of objects located in space as depicted in the rectangle
shown in Figure 1.0(a). Let k = 3; that is, the user would like the objects to be partitioned into
three clusters.
According to the algorithm in Figure 1.0, we arbitrarily choose three objects as the three initial cluster centers, where cluster centers are marked by a "+". Each object is distributed to a cluster based on the cluster center to which it is the nearest. Such a distribution forms silhouettes encircled by dotted curves, as shown in Figure 1.0(a).

[Figure 1.0]
Next, the cluster centers are updated. That is, the mean value of each cluster is
recalculated based on the current objects in the cluster. Using the new cluster centers, the
objects are redistributed to the clusters based on which cluster center is the nearest. Such a
redistribution forms new silhouettes encircled by dashed curves, as shown in Figure 1.0(b).
This process iterates, leading to Figure 1.0(c). The process of iteratively reassigning
objects to clusters to improve the partitioning is referred to as iterative relocation.
Eventually, no redistribution of the objects in any cluster occurs, and so the process
terminates. The resulting clusters are returned by the clustering process.
The algorithm attempts to determine k partitions that minimize the square-error
function. It works well when the clusters are compact clouds that are rather well separated
from one another. The method is relatively scalable and efficient in processing large datasets
because the computational complexity of the algorithm is O(nkt), where n is the total number
of objects, k is the number of clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n. The method often terminates at a local optimum.
The k-means method, however, can be applied only when the mean of a cluster is
defined. This may not be the case in some applications, such as when data with categorical
attributes are involved. The necessity for users to specify k, the number of clusters,
in advance can be seen as a disadvantage. The k-means method is not suitable for discovering
clusters with non-convex shapes or clusters of very different size. Moreover, it is sensitive to
noise and outlier data points because a small number of such data can substantially influence
the mean value.

K-MEANS ALGORITHM:
The k-means algorithm for partitioning, where each cluster’s center is represented by
the mean value of the objects in the cluster.
Input:
 k: the number of clusters,
 D: a data set containing n objects.
Output:
 A set of k clusters.
Method:

1) arbitrarily choose k objects from D as the initial cluster centers;
2) repeat
3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
4) update the cluster means, i.e., calculate the mean value of the objects for each
cluster;
5) until no change;
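For illustration only, a minimal R sketch of this relocation loop (Lloyd-style, using Euclidean distance); it is not the implementation used in this project, and X, k and iter.max stand for a numeric matrix, the number of clusters and an iteration cap:

simple_kmeans <- function(X, k, iter.max = 10L) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]         # 1) arbitrary initial centers
  cluster <- integer(nrow(X))
  for (it in seq_len(iter.max)) {
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]   # distance of each object to each center
    new_cluster <- max.col(-d)                             # 3) assign to the nearest center
    if (all(new_cluster == cluster)) break                 # 5) stop when no assignment changes
    cluster <- new_cluster
    for (j in seq_len(k))                                  # 4) recompute each cluster mean
      centers[j, ] <- colMeans(X[cluster == j, , drop = FALSE])
  }
  list(cluster = cluster, centers = centers)               # no empty-cluster handling in this sketch
}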
Distance metric overview:
Euclidean distance is the most commonly used; it calculates the root of the squared differences between the coordinates of two objects:

   d(q, p) = √( Σᵢ (qᵢ − pᵢ)² )

Manhattan distance, or city block distance, represents the distance between two points in a city road grid. It computes the sum of the absolute differences between the coordinates of two objects:

   d(q, p) = Σᵢ |qᵢ − pᵢ|

Minkowski distance is a generalized metric distance of order m:

   d(q, p) = ( Σᵢ |qᵢ − pᵢ|^m )^(1/m)

Note that when m = 2 it reduces to the Euclidean distance, and when m = 1 it reduces to the city block (Manhattan) distance.
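Base R's dist() function supports all three of these metrics directly; a brief sketch on an arbitrary numeric matrix (the matrix and the order 3 are purely illustrative):

mat <- matrix(rnorm(20), nrow = 4)
d_euc <- dist(mat, method = "euclidean")
d_man <- dist(mat, method = "manhattan")
d_min <- dist(mat, method = "minkowski", p = 3)   # dist() calls the order p; p = 2 gives Euclidean, p = 1 Manhattan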

K-MEANS FLOWCHART

[Flowchart: start → specify the number of clusters K → randomly select K objects as the initial centroids → measure the distance between each object and the centroids → group the objects based on minimum distance → if any object has moved to a new group, repeat the distance and grouping steps with the updated centroids; otherwise → stop.]
FUTURE ENHANCEMENT
In the future, additional distance measures and different datasets will be used. Different comparison techniques will also be applied in order to see how they impact the clusters.

CONCLUSION
In conclusion, from the table below we can say that the Minkowski distance has the fastest execution time, and its histogram plot differs from those of the Euclidean and Manhattan distances. However, this is not always the case: some datasets work better with the Euclidean distance, while others work better with the Manhattan distance.

             USER     SYSTEM   ELAPSED
EUCLIDEAN    3.28     0.09     3.38
MANHATTAN    4.12     0.098    4.266
MINKOWSKI    2.766    0.1      2.87
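Timings of this kind can be collected by wrapping each clustering call in R's system.time(), which reports user, system, and elapsed time in seconds; a sketch, assuming the comp object from the source code section:

system.time(kmeans(comp, centers = 5))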

Figure: Euclidean distance

Figure: Manhattan distance

Figure: Minkowski distance

BIBLIOGRAPHY
Han, J. & Kamber, M., Data Mining: Concepts & Techniques, Morgan Kaufmann, 2001.

Kanika & Gargi Narula, Contrasting Different Distance Functions Using K-means Algorithm, Volume 3, Issue 1, Jan-Feb 2015.

Kahkashan Kouser & Sunita, A Comparative Study of K-Means Algorithm by Different Distance Measures, Vol. 1, Issue 9, November 2013.

Kardi Teknomo, K-Means Clustering Tutorial, July 2007.

https://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/amp/ , accessed April 2018.

https://stackoverflow.com/questions/20655013/how-to-use-different-distance-formula-other-than-euclidean-distance-in-k-means , accessed April 2018.

SOURCE CODE

K-Means R script using Euclidean distance:


function (x, centers, iter.max = 10L, nstart = 1L,
    algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
{
.Mimax <- .Machine$integer.max
do_one <- function(nmeth) {
switch(nmeth, {
isteps.Qtran <- as.integer(min(.Mimax, 50 * m))
iTran <- c(isteps.Qtran, integer(max(0, k - 1)))
Z <- .Fortran(C_kmns, x, m, p, centers = centers,
as.integer(k), c1 = integer(m), c2 = integer(m),
nc = integer(k), double(k), double(k), ncp = integer(k),
D = double(m), iTran = iTran, live = integer(k),
                iter = iter.max, wss = double(k), ifault = as.integer(trace))
            switch(Z$ifault,
                stop("empty cluster: try a better set of initial centers",
                    call. = FALSE),
                Z$iter <- max(Z$iter, iter.max + 1L),
                stop("number of cluster centres must lie between 1 and nrow(x)",
                    call. = FALSE),
                warning(gettextf("Quick-TRANSfer stage steps exceeded maximum (= %d)",
                    isteps.Qtran), call. = FALSE))
}, {

Z <- .C(C_kmeans_Lloyd, x, m, p, centers = centers,
k, c1 = integer(m), iter = iter.max, nc = integer(k),
wss = double(k))
}, {
            Z <- .C(C_kmeans_MacQueen, x, m, p, centers = as.double(centers),
                k, c1 = integer(m), iter = iter.max, nc = integer(k),
                wss = double(k))
})
if (m23 <- any(nmeth == c(2L, 3L))) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
if (Z$iter > iter.max) {
            warning(sprintf(ngettext(iter.max, "did not converge in %d iteration",
                "did not converge in %d iterations"), iter.max),
                call. = FALSE, domain = NA)
if (m23)
Z$ifault <- 2L
}
if (nmeth %in% c(2L, 3L)) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
Z
}
x <- as.matrix(x)
m <- as.integer(nrow(x))
if (is.na(m))
stop("invalid nrow(x)")
p <- as.integer(ncol(x))
if (is.na(p))
stop("invalid ncol(x)")
if (missing(centers))
stop("'centers' must be a number or a matrix")
nmeth <- switch(match.arg(algorithm), `Hartigan-Wong` = 1L,
Lloyd = 2L, Forgy = 2L, MacQueen = 3L)
storage.mode(x) <- "double"
if (length(centers) == 1L) {
k <- centers
if (nstart == 1L)

centers <- x[sample.int(m, k), , drop = FALSE]
if (nstart >= 2L || any(duplicated(centers))) {
cn <- unique(x)
mm <- nrow(cn)
if (mm < k)
stop("more cluster centers than distinct data points.")
centers <- cn[sample.int(mm, k), , drop = FALSE]
}
}
else {
centers <- as.matrix(centers)
if (any(duplicated(centers)))
stop("initial centers are not distinct")
cn <- NULL
k <- nrow(centers)
if (m < k)
stop("more cluster centers than data points")
}
k <- as.integer(k)
if (is.na(k))
stop(gettextf("invalid value of %s", "'k'"), domain = NA)
if (k == 1L)
nmeth <- 3L
iter.max <- as.integer(iter.max)
if (is.na(iter.max) || iter.max < 1L)
stop("'iter.max' must be positive")
if (ncol(x) != ncol(centers))
stop("must have same number of columns in 'x' and 'centers'")
storage.mode(centers) <- "double"
Z <- do_one(nmeth)
best <- sum(Z$wss)
if (nstart >= 2L && !is.null(cn))
for (i in 2:nstart) {
centers <- cn[sample.int(mm, k), , drop = FALSE]
ZZ <- do_one(nmeth)
if ((z <- sum(ZZ$wss)) < best) {
Z <- ZZ
best <- z
}
}
centers <- matrix(Z$centers, k)
dimnames(centers) <- list(1L:k, dimnames(x)[[2L]])
cluster <- Z$c1
if (!is.null(rn <- rownames(x)))
names(cluster) <- rn
totss <- sum(scale(x, scale = FALSE)^2)

structure(list(cluster = cluster, centers = centers, totss = totss,
withinss = Z$wss, tot.withinss = best, betweenss = totss -
best, size = Z$nc, iter = Z$iter, ifault = Z$ifault),
class = "kmeans")
}
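A hypothetical usage sketch, assuming the definition above has been assigned to the name kmeans_euclidean (the name is illustrative) and comp holds the first three principal components from the R code section:

fit <- kmeans_euclidean(comp, centers = 5)
fit$size           # number of objects in each cluster
fit$tot.withinss   # total within-cluster sum of squares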

K-Means R script using Manhattan distance:


function (x, centers, iter.max = 10L, nstart = 1L,
    algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
{
.Mimax <- .Machine$integer.max
do_one <- function(nmeth) {
switch(nmeth, {
isteps.Qtran <- as.integer(min(.Mimax, 50 * m))
iTran <- c(isteps.Qtran, integer(max(0, k - 1)))
Z <- .Fortran(C_kmns, x, m, p, centers = centers,
as.integer(k), c1 = integer(m), c2 = integer(m),
nc = integer(k), double(k), double(k), ncp = integer(k),
D = double(m), iTran = iTran, live = integer(k),
                iter = iter.max, wss = double(k), ifault = as.integer(trace))
            switch(Z$ifault,
                stop("empty cluster: try a better set of initial centers",
                    call. = FALSE),
                Z$iter <- max(Z$iter, iter.max + 1L),
                stop("number of cluster centres must lie between 1 and nrow(x)",
                    call. = FALSE),
                warning(gettextf("Quick-TRANSfer stage steps exceeded maximum (= %d)",
                    isteps.Qtran), call. = FALSE))
}, {
Z <- .C(C_kmeans_Lloyd, x, m, p, centers = centers,
k, c1 = integer(m), iter = iter.max, nc = integer(k),
wss = double(k))
}, {
            Z <- .C(C_kmeans_MacQueen, x, m, p, centers = as.double(centers),
                k, c1 = integer(m), iter = iter.max, nc = integer(k),
                wss = double(k))
})
if (m23 <- any(nmeth == c(2L, 3L))) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",

15
call. = FALSE)
}
if (Z$iter > iter.max) {
            warning(sprintf(ngettext(iter.max, "did not converge in %d iteration",
                "did not converge in %d iterations"), iter.max),
                call. = FALSE, domain = NA)
if (m23)
Z$ifault <- 2L
}
if (nmeth %in% c(2L, 3L)) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
Z
}
x <- as.matrix(x)
m <- as.integer(nrow(x))
if (is.na(m))
stop("invalid nrow(x)")
p <- as.integer(ncol(x))
if (is.na(p))
stop("invalid ncol(x)")
if (missing(centers))
stop("'centers' must be a number or a matrix")
nmeth <- switch(match.arg(algorithm), `Hartigan-Wong` = 1L,
Lloyd = 2L, Forgy = 2L, MacQueen = 3L)
storage.mode(x) <- "double"
if (length(centers) == 1L) {
k <- centers
if (nstart == 1L)
centers <- x[sample.int(m, k), , drop = FALSE]
if (nstart >= 2L || any(duplicated(centers))) {
cn <- unique(x)
mm <- nrow(cn)
if (mm < k)
stop("more cluster centers than distinct data points.")
centers <- cn[sample.int(mm, k), , drop = FALSE]
}
}
else {
centers <- as.matrix(centers)
if (any(duplicated(centers)))
stop("initial centers are not distinct")

cn <- NULL
k <- nrow(centers)
if (m < k)
stop("more cluster centers than data points")
}
k <- as.integer(k)
if (is.na(k))
stop(gettextf("invalid value of %s", "'k'"), domain = NA)
if (k == 1L)
nmeth <- 3L
iter.max <- as.integer(iter.max)
if (is.na(iter.max) || iter.max < 1L)
stop("'iter.max' must be positive")
if (ncol(x) != ncol(centers))
stop("must have same number of columns in 'x' and 'centers'")
storage.mode(centers) <- "double"
Z <- do_one(nmeth)
best <- sum(Z$wss)
if (nstart >= 2L && !is.null(cn))
for (i in 2:nstart) {
centers <- cn[sample.int(mm, k), , drop = FALSE]
ZZ <- do_one(nmeth)
if ((z <- sum(ZZ$wss)) < best) {
Z <- ZZ
best <- z
}
}
centers <- matrix(Z$centers, k)
dimnames(centers) <- list(1L:k, dimnames(x)[[2L]])
cluster <- Z$c1
if (!is.null(rn <- rownames(x)))
names(cluster) <- rn
totss <- sum(scale(x, scale = FALSE)^1)
structure(list(cluster = cluster, centers = centers, totss = totss,
withinss = Z$wss, tot.withinss = best, betweenss = totss -
best, size = Z$nc, iter = Z$iter, ifault = Z$ifault),
class = "kmeans")
}

K-Means R script using Minkowski distance:


function (x, centers, iter.max = 10L, nstart = 1L,
    algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
{

.Mimax <- .Machine$integer.max
do_one <- function(nmeth) {
switch(nmeth, {
isteps.Qtran <- as.integer(min(.Mimax, 50 * m))
iTran <- c(isteps.Qtran, integer(max(0, k - 1)))
Z <- .Fortran(C_kmns, x, m, p, centers = centers,
as.integer(k), c1 = integer(m), c2 = integer(m),
nc = integer(k), double(k), double(k), ncp = integer(k),
D = double(m), iTran = iTran, live = integer(k),
                iter = iter.max, wss = double(k), ifault = as.integer(trace))
            switch(Z$ifault,
                stop("empty cluster: try a better set of initial centers",
                    call. = FALSE),
                Z$iter <- max(Z$iter, iter.max + 1L),
                stop("number of cluster centres must lie between 1 and nrow(x)",
                    call. = FALSE),
                warning(gettextf("Quick-TRANSfer stage steps exceeded maximum (= %d)",
                    isteps.Qtran), call. = FALSE))
}, {
Z <- .C(C_kmeans_Lloyd, x, m, p, centers = centers,
k, c1 = integer(m), iter = iter.max, nc = integer(k),
wss = double(k))
}, {
            Z <- .C(C_kmeans_MacQueen, x, m, p, centers = as.double(centers),
                k, c1 = integer(m), iter = iter.max, nc = integer(k),
                wss = double(k))
})
if (m23 <- any(nmeth == c(2L, 3L))) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
if (Z$iter > iter.max) {
            warning(sprintf(ngettext(iter.max, "did not converge in %d iteration",
                "did not converge in %d iterations"), iter.max),
                call. = FALSE, domain = NA)
if (m23)
Z$ifault <- 2L
}
if (nmeth %in% c(2L, 3L)) {
if (any(Z$nc == 0))

warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
Z
}
x <- as.matrix(x)
m <- as.integer(nrow(x))
if (is.na(m))
stop("invalid nrow(x)")
p <- as.integer(ncol(x))
if (is.na(p))
stop("invalid ncol(x)")
if (missing(centers))
stop("'centers' must be a number or a matrix")
nmeth <- switch(match.arg(algorithm), `Hartigan-Wong` = 1L,
Lloyd = 2L, Forgy = 2L, MacQueen = 3L)
storage.mode(x) <- "double"
if (length(centers) == 1L) {
k <- centers
if (nstart == 1L)
centers <- x[sample.int(m, k), , drop = FALSE]
if (nstart >= 2L || any(duplicated(centers))) {
cn <- unique(x)
mm <- nrow(cn)
if (mm < k)
stop("more cluster centers than distinct data points.")
centers <- cn[sample.int(mm, k), , drop = FALSE]
}
}
else {
centers <- as.matrix(centers)
if (any(duplicated(centers)))
stop("initial centers are not distinct")
cn <- NULL
k <- nrow(centers)
if (m < k)
stop("more cluster centers than data points")
}
k <- as.integer(k)
if (is.na(k))
stop(gettextf("invalid value of %s", "'k'"), domain = NA)
if (k == 1L)
nmeth <- 3L
iter.max <- as.integer(iter.max)
if (is.na(iter.max) || iter.max < 1L)

stop("'iter.max' must be positive")
if (ncol(x) != ncol(centers))
stop("must have same number of columns in 'x' and 'centers'")
storage.mode(centers) <- "double"
Z <- do_one(nmeth)
best <- sum(Z$wss)
if (nstart >= 2L && !is.null(cn))
for (i in 2:nstart) {
centers <- cn[sample.int(mm, k), , drop = FALSE]
ZZ <- do_one(nmeth)
if ((z <- sum(ZZ$wss)) < best) {
Z <- ZZ
best <- z
}
}
centers <- matrix(Z$centers, k)
dimnames(centers) <- list(1L:k, dimnames(x)[[2L]])
cluster <- Z$c1
if (!is.null(rn <- rownames(x)))
names(cluster) <- rn
totss <- sum(scale(x, scale = FALSE)^3)
structure(list(cluster = cluster, centers = centers, totss = totss,
withinss = Z$wss, tot.withinss = best, betweenss = totss -
best, size = Z$nc, iter = Z$iter, ifault = Z$ifault),
class = "kmeans")
}

R code:
#import dataset

dataset <- read.csv(file.choose(), header = T)

# Get principal component vectors using prcomp instead of princomp

pc <- prcomp(dataset[,-1])

# Plotting the principal components

plot(pc)

# Extracting the first three principal components

comp <- data.frame(pc$x[,1:3])

#Loading the library

library(rgl)

library(RColorBrewer)

library(scales)

palette(alpha(brewer.pal(9,'Set1'), 0.5))

# Determine number of clusters (Elbow method)

wss <- (nrow(dataset[,-1])-1)*sum(apply(dataset[,-1],2,var))

for (i in 2:15)

wss[i] <- sum(kmeans(dataset[,-1],centers=i)$withinss)

plot(1:15, wss, type="b", xlab="Number of Clusters",

ylab="Within groups sum of squares")

# From scree plot elbow occurs at k = 5

# Apply k-means (using euclidean distance) with k=5

l <- kmeans(comp, 5)

plot(comp, col=l$clust, pch=16)

# 3D plot

plot3d(comp$PC1, comp$PC2, comp$PC3, col=l$clust)

# Apply k-means (using manhattan distance) with k=5

l <- kmeans(comp, 5)

plot(comp, col=l$clust, pch=16)

# 3D plot

plot3d(comp$PC1, comp$PC2, comp$PC3, col=l$clust)

# Apply k-means (using minkowski distance) with k=5

l <- kmeans(comp, 5)

plot(comp, col=l$clust, pch=16)

# 3D plot

plot3d(comp$PC1, comp$PC2, comp$PC3, col=l$clust)

SCREENSHOTS

