
INTRODUCTION

This project presents an efficient clustering method based on partitioning, which groups similar data points together. Given a dataset of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects in different clusters are "far apart" or very different. There are various other criteria for judging the quality of a partition.

To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration of all possible partitions. Instead, most applications adopt one of a few popular heuristic methods, such as the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster.

The project uses different distance measures, namely Euclidean distance, Manhattan distance, and Minkowski distance, and examines which of these three works best for k-means and how each affects the formation of the clusters. It also examines the impact of the different distance measures when they are applied to different datasets, and how the resulting clusters are defined under each measure.

SYNOPSIS
Cluster analysis, or clustering, is the process of grouping objects into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.

Cluster analysis is considered a difficult problem because many factors, such as effective similarity measures, criterion functions, initial conditions, high dimensionality, and different types of attributes, come into play in devising a well-tuned clustering technique for a given problem. A clustering algorithm should also be capable of identifying irregular, intrinsic cluster shapes over a variable-density space containing outliers.

Dataset
The dataset selected for this purpose is the Gene Expression dataset. This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set; it is a random extraction of gene expressions from patients having different types of tumors: BRCA, KIRC, COAD, LUAD, and PRAD.

Measuring the distances

i. Euclidean distance

   d(q, p) = √( (q₁ − p₁)² + (q₂ − p₂)² + … + (qₙ − pₙ)² ) = √( Σᵢ (qᵢ − pᵢ)² )

ii. Manhattan distance

   d(q, p) = |q₁ − p₁| + |q₂ − p₂| + … + |qₙ − pₙ| = Σᵢ |qᵢ − pᵢ|

iii. Minkowski distance

   d(q, p) = ( |q₁ − p₁|^m + |q₂ − p₂|^m + … + |qₙ − pₙ|^m )^(1/m) = ( Σᵢ |qᵢ − pᵢ|^m )^(1/m)

where the sums run over the attributes of the two objects q and p, and m is the order of the Minkowski distance.
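As a quick illustration of the three formulas, the following R sketch computes each distance for two example vectors q and p (the values and the Minkowski order of 3 are chosen arbitrarily for demonstration):

q <- c(1, 2, 3)
p <- c(4, 0, 3)
euclidean <- sqrt(sum((q - p)^2))      # root of the summed squared differences
manhattan <- sum(abs(q - p))           # sum of the absolute differences
minkowski <- sum(abs(q - p)^3)^(1/3)   # Minkowski distance of order 3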

USER REQUIREMENTS
Hardware

 Processor: 64-bit processor with x86-compatible architecture


 RAM: 1 GB required, 2 GB recommended
 Free disk space: At least 250 MB

Software

 Operating System: Windows 7 (SP1), 8.1, or 10


 Tools: R-3.4.1
 Technology: RStudio

DIAGRAMMATIC DISPLAY OF THE VARIOUS PROCESSES OF THE PROJECT

[Process flow: Justification → Empirical study → Pros and cons of the algorithm → Selection of dataset → Using dimensionality reduction → Start coding and execution → Tabulation to show results]

DIMENSIONALITY REDUCTION
The dataset initially contains a large number of random variables. Dimensionality reduction, or dimension reduction, is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. There are different types of dimensionality reduction methods, but the one chosen for this project is Principal Component Analysis (PCA).

Principal Component Analysis


This method was introduced by Karl Pearson. It works on the condition that when data in a higher-dimensional space are mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximized.

It involves the following steps:

 Construct the covariance matrix of the data.


 Compute the eigenvectors of this matrix.
 Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.
Hence, we are left with a smaller number of eigenvectors, and some information may be lost in the process. However, the most important variance should be retained by the remaining eigenvectors.
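A minimal sketch of these steps in R using base R's prcomp(), which performs the centring and decomposition internally; the object name dataset and the choice of three retained components mirror the R code given later in this report:

pc <- prcomp(dataset[, -1])   # principal components of the numeric columns
summary(pc)                   # proportion of variance explained by each component
comp <- pc$x[, 1:3]           # retain the first three principal components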

Advantages of Dimensionality Reduction


 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.

ALGORITHM USED
The k-means algorithm takes the input parameter, k, and partitions a set of n objects
into k clusters so that the resulting intracluster similarity is high but the intercluster similarity
is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster,
which can be viewed as the cluster’s centroid or center of gravity. “How does the k-means
algorithm work?” The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. For each of the remaining objects, an object is assigned to the cluster to which
it is the most similar, based on the distance between the object and the cluster mean. It
then computes the new mean for each cluster. This process iterates until the criterion
function converges.
Typically, the square-error criterion is used, defined as

   E = Σᵢ₌₁ᵏ Σ_{p ∈ Cᵢ} |p − mᵢ|²

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and mᵢ is the mean of cluster Cᵢ (both p and mᵢ are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.
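A small R sketch of this criterion, assuming a numeric matrix x of objects, an integer vector cluster of assignments, and a matrix centers with one row per cluster (these names are illustrative, not part of the project code):

square_error <- function(x, cluster, centers) {
  # squared distance from each object to its assigned cluster center, summed over all objects
  sum((x - centers[cluster, , drop = FALSE])^2)
}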

Suppose that there is a set of objects located in space as depicted in the rectangle
shown in Figure 1.0(a). Let k = 3; that is, the user would like the objects to be partitioned into
three clusters.
According to the algorithm in Figure 1.0, we arbitrarily choose three objects as the three initial cluster centers, where cluster centers are marked by a "+". Each object is distributed to a cluster based on the cluster center to which it is the nearest. Such a distribution forms silhouettes encircled by dotted curves, as shown in Figure 1.0(a).

[Figure 1.0]
Next, the cluster centers are updated. That is, the mean value of each cluster is
recalculated based on the current objects in the cluster. Using the new cluster centers, the
objects are redistributed to the clusters based on which cluster center is the nearest. Such a
redistribution forms new silhouettes encircled by dashed curves, as shown in Figure 1.0(b).
This process iterates, leading to Figure 1.0(c). The process of iteratively reassigning
objects to clusters to improve the partitioning is referred to as iterative relocation.
Eventually, no redistribution of the objects in any cluster occurs, and so the process
terminates. The resulting clusters are returned by the clustering process.
The algorithm attempts to determine k partitions that minimize the square-error
function. It works well when the clusters are compact clouds that are rather well separated
from one another. The method is relatively scalable and efficient in processing large datasets
because the computational complexity of the algorithm is O(nkt), where n is the total number
of objects, k is the number of clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n. The method often terminates at a local optimum.
The k-means method, however, can be applied only when the mean of a cluster is
defined. This may not be the case in some applications, such as when data with categorical
attributes are involved. The necessity for users to specify k, the number of clusters,
in advance can be seen as a disadvantage. The k-means method is not suitable for discovering
clusters with non-convex shapes or clusters of very different size. Moreover, it is sensitive to
noise and outlier data points because a small number of such data can substantially influence
the mean value.

K-MEANS ALGORITHM:
The k-means algorithm for partitioning, where each cluster’s center is represented by
the mean value of the objects in the cluster.
Input:
 k: the number of clusters,
 D: a data set containing n objects.
Output:
 A set of k clusters.
Method:

1) arbitrarily choose k objects from D as the initial cluster centers;
2) repeat
3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
4) update the cluster means, i.e., calculate the mean value of the objects for each
cluster;
5) until no change;
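For illustration only, a minimal R sketch of this relocation loop (Lloyd-style, using Euclidean distance); it is not the implementation used in this project, and X, k and iter.max stand for a numeric matrix, the number of clusters and an iteration cap:

simple_kmeans <- function(X, k, iter.max = 10L) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]         # 1) arbitrary initial centers
  cluster <- integer(nrow(X))
  for (it in seq_len(iter.max)) {
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]   # distance of each object to each center
    new_cluster <- max.col(-d)                             # 3) assign to the nearest center
    if (all(new_cluster == cluster)) break                 # 5) stop when no assignment changes
    cluster <- new_cluster
    for (j in seq_len(k))                                  # 4) recompute each cluster mean
      centers[j, ] <- colMeans(X[cluster == j, , drop = FALSE])
  }
  list(cluster = cluster, centers = centers)               # no empty-cluster handling in this sketch
}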
Distance metric overview:
Euclidean distance is the most commonly used; it calculates the root of the squared differences between the coordinates of two objects:

   d(q, p) = √( Σᵢ (qᵢ − pᵢ)² )

Manhattan distance, or city block distance, represents the distance between two points in a city road grid. It computes the sum of the absolute differences between the coordinates of two objects:

   d(q, p) = Σᵢ |qᵢ − pᵢ|

Minkowski distance is a generalized metric distance of order m:

   d(q, p) = ( Σᵢ |qᵢ − pᵢ|^m )^(1/m)

Note that when m = 2 it reduces to the Euclidean distance, and when m = 1 it reduces to the city block (Manhattan) distance.
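Base R's dist() function supports all three of these metrics directly; a brief sketch on an arbitrary numeric matrix (the matrix and the order 3 are purely illustrative):

mat <- matrix(rnorm(20), nrow = 4)
d_euc <- dist(mat, method = "euclidean")
d_man <- dist(mat, method = "manhattan")
d_min <- dist(mat, method = "minkowski", p = 3)   # dist() calls the order p; p = 2 gives Euclidean, p = 1 Manhattan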

K-MEANS FLOWCHART

[Flowchart: start → specify the number of clusters K → randomly select K objects as the initial centroids → measure the distance between each object and the centroids → group the objects based on minimum distance → if any object has moved to a new group, repeat the distance and grouping steps with the updated centroids; otherwise → stop.]
FUTURE ENHANCEMENT
In the future, additional distance measures and different datasets will be used. Different comparison techniques will also be applied in order to see how they impact the clusters.

CONCLUSION
In conclusion, from the table below we can say that the Minkowski distance has the fastest execution time, and its histogram plot differs from those of the Euclidean and Manhattan distances. However, this is not always the case: some datasets work better with the Euclidean distance, while others work better with the Manhattan distance.

             USER     SYSTEM   ELAPSED
EUCLIDEAN    3.28     0.09     3.38
MANHATTAN    4.12     0.098    4.266
MINKOWSKI    2.766    0.1      2.87
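Timings of this kind can be collected by wrapping each clustering call in R's system.time(), which reports user, system, and elapsed time in seconds; a sketch, assuming the comp object from the source code section:

system.time(kmeans(comp, centers = 5))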

Figure: Euclidean distance

Figure: Manhattan distance

Figure: Minkowski distance

BIBLIOGRAPHY
Han, J. & Kamber, M., Data Mining: Concepts & Techniques, Morgan Kaufmann, 2001.

Kanika & Gargi Narula, Contrasting Different Distance Functions Using K-means Algorithm, Volume 3, Issue 1, Jan-Feb 2015.

Kahkashan Kouser & Sunita, A Comparative Study of K-Means Algorithm by Different Distance Measures, Vol. 1, Issue 9, November 2013.

Kardi Teknomo, K-Means Clustering Tutorial, July 2007.

https://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/amp/ , accessed April 2018.

https://stackoverflow.com/questions/20655013/how-to-use-different-distance-formula-other-than-euclidean-distance-in-k-means , accessed April 2018.

SOURCE CODE

K-Means R script using Euclidean distance:


function (x, centers, iter.max = 10L, nstart = 1L,
    algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
{
.Mimax <- .Machine$integer.max
do_one <- function(nmeth) {
switch(nmeth, {
isteps.Qtran <- as.integer(min(.Mimax, 50 * m))
iTran <- c(isteps.Qtran, integer(max(0, k - 1)))
Z <- .Fortran(C_kmns, x, m, p, centers = centers,
as.integer(k), c1 = integer(m), c2 = integer(m),
nc = integer(k), double(k), double(k), ncp = integer(k),
D = double(m), iTran = iTran, live = integer(k),
                iter = iter.max, wss = double(k), ifault = as.integer(trace))
            switch(Z$ifault,
                stop("empty cluster: try a better set of initial centers",
                    call. = FALSE),
                Z$iter <- max(Z$iter, iter.max + 1L),
                stop("number of cluster centres must lie between 1 and nrow(x)",
                    call. = FALSE),
                warning(gettextf("Quick-TRANSfer stage steps exceeded maximum (= %d)",
                    isteps.Qtran), call. = FALSE))
}, {

Z <- .C(C_kmeans_Lloyd, x, m, p, centers = centers,
k, c1 = integer(m), iter = iter.max, nc = integer(k),
wss = double(k))
}, {
            Z <- .C(C_kmeans_MacQueen, x, m, p, centers = as.double(centers),
                k, c1 = integer(m), iter = iter.max, nc = integer(k),
                wss = double(k))
})
if (m23 <- any(nmeth == c(2L, 3L))) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
if (Z$iter > iter.max) {
            warning(sprintf(ngettext(iter.max, "did not converge in %d iteration",
                "did not converge in %d iterations"), iter.max),
                call. = FALSE, domain = NA)
if (m23)
Z$ifault <- 2L
}
if (nmeth %in% c(2L, 3L)) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
Z
}
x <- as.matrix(x)
m <- as.integer(nrow(x))
if (is.na(m))
stop("invalid nrow(x)")
p <- as.integer(ncol(x))
if (is.na(p))
stop("invalid ncol(x)")
if (missing(centers))
stop("'centers' must be a number or a matrix")
nmeth <- switch(match.arg(algorithm), `Hartigan-Wong` = 1L,
Lloyd = 2L, Forgy = 2L, MacQueen = 3L)
storage.mode(x) <- "double"
if (length(centers) == 1L) {
k <- centers
if (nstart == 1L)

centers <- x[sample.int(m, k), , drop = FALSE]
if (nstart >= 2L || any(duplicated(centers))) {
cn <- unique(x)
mm <- nrow(cn)
if (mm < k)
stop("more cluster centers than distinct data points.")
centers <- cn[sample.int(mm, k), , drop = FALSE]
}
}
else {
centers <- as.matrix(centers)
if (any(duplicated(centers)))
stop("initial centers are not distinct")
cn <- NULL
k <- nrow(centers)
if (m < k)
stop("more cluster centers than data points")
}
k <- as.integer(k)
if (is.na(k))
stop(gettextf("invalid value of %s", "'k'"), domain = NA)
if (k == 1L)
nmeth <- 3L
iter.max <- as.integer(iter.max)
if (is.na(iter.max) || iter.max < 1L)
stop("'iter.max' must be positive")
if (ncol(x) != ncol(centers))
stop("must have same number of columns in 'x' and 'centers'")
storage.mode(centers) <- "double"
Z <- do_one(nmeth)
best <- sum(Z$wss)
if (nstart >= 2L && !is.null(cn))
for (i in 2:nstart) {
centers <- cn[sample.int(mm, k), , drop = FALSE]
ZZ <- do_one(nmeth)
if ((z <- sum(ZZ$wss)) < best) {
Z <- ZZ
best <- z
}
}
centers <- matrix(Z$centers, k)
dimnames(centers) <- list(1L:k, dimnames(x)[[2L]])
cluster <- Z$c1
if (!is.null(rn <- rownames(x)))
names(cluster) <- rn
totss <- sum(scale(x, scale = FALSE)^2)

structure(list(cluster = cluster, centers = centers, totss = totss,
withinss = Z$wss, tot.withinss = best, betweenss = totss -
best, size = Z$nc, iter = Z$iter, ifault = Z$ifault),
class = "kmeans")
}
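A hypothetical usage sketch, assuming the definition above has been assigned to the name kmeans_euclidean (the name is illustrative) and comp holds the first three principal components from the R code section:

fit <- kmeans_euclidean(comp, centers = 5)
fit$size           # number of objects in each cluster
fit$tot.withinss   # total within-cluster sum of squares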

K-Means R script using Manhattan distance:


function (x, centers, iter.max = 10L, nstart = 1L,
    algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
{
.Mimax <- .Machine$integer.max
do_one <- function(nmeth) {
switch(nmeth, {
isteps.Qtran <- as.integer(min(.Mimax, 50 * m))
iTran <- c(isteps.Qtran, integer(max(0, k - 1)))
Z <- .Fortran(C_kmns, x, m, p, centers = centers,
as.integer(k), c1 = integer(m), c2 = integer(m),
nc = integer(k), double(k), double(k), ncp = integer(k),
D = double(m), iTran = iTran, live = integer(k),
                iter = iter.max, wss = double(k), ifault = as.integer(trace))
            switch(Z$ifault,
                stop("empty cluster: try a better set of initial centers",
                    call. = FALSE),
                Z$iter <- max(Z$iter, iter.max + 1L),
                stop("number of cluster centres must lie between 1 and nrow(x)",
                    call. = FALSE),
                warning(gettextf("Quick-TRANSfer stage steps exceeded maximum (= %d)",
                    isteps.Qtran), call. = FALSE))
}, {
Z <- .C(C_kmeans_Lloyd, x, m, p, centers = centers,
k, c1 = integer(m), iter = iter.max, nc = integer(k),
wss = double(k))
}, {
            Z <- .C(C_kmeans_MacQueen, x, m, p, centers = as.double(centers),
                k, c1 = integer(m), iter = iter.max, nc = integer(k),
                wss = double(k))
})
if (m23 <- any(nmeth == c(2L, 3L))) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",

15
call. = FALSE)
}
if (Z$iter > iter.max) {
            warning(sprintf(ngettext(iter.max, "did not converge in %d iteration",
                "did not converge in %d iterations"), iter.max),
                call. = FALSE, domain = NA)
if (m23)
Z$ifault <- 2L
}
if (nmeth %in% c(2L, 3L)) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
Z
}
x <- as.matrix(x)
m <- as.integer(nrow(x))
if (is.na(m))
stop("invalid nrow(x)")
p <- as.integer(ncol(x))
if (is.na(p))
stop("invalid ncol(x)")
if (missing(centers))
stop("'centers' must be a number or a matrix")
nmeth <- switch(match.arg(algorithm), `Hartigan-Wong` = 1L,
Lloyd = 2L, Forgy = 2L, MacQueen = 3L)
storage.mode(x) <- "double"
if (length(centers) == 1L) {
k <- centers
if (nstart == 1L)
centers <- x[sample.int(m, k), , drop = FALSE]
if (nstart >= 2L || any(duplicated(centers))) {
cn <- unique(x)
mm <- nrow(cn)
if (mm < k)
stop("more cluster centers than distinct data points.")
centers <- cn[sample.int(mm, k), , drop = FALSE]
}
}
else {
centers <- as.matrix(centers)
if (any(duplicated(centers)))
stop("initial centers are not distinct")

cn <- NULL
k <- nrow(centers)
if (m < k)
stop("more cluster centers than data points")
}
k <- as.integer(k)
if (is.na(k))
stop(gettextf("invalid value of %s", "'k'"), domain = NA)
if (k == 1L)
nmeth <- 3L
iter.max <- as.integer(iter.max)
if (is.na(iter.max) || iter.max < 1L)
stop("'iter.max' must be positive")
if (ncol(x) != ncol(centers))
stop("must have same number of columns in 'x' and 'centers'")
storage.mode(centers) <- "double"
Z <- do_one(nmeth)
best <- sum(Z$wss)
if (nstart >= 2L && !is.null(cn))
for (i in 2:nstart) {
centers <- cn[sample.int(mm, k), , drop = FALSE]
ZZ <- do_one(nmeth)
if ((z <- sum(ZZ$wss)) < best) {
Z <- ZZ
best <- z
}
}
centers <- matrix(Z$centers, k)
dimnames(centers) <- list(1L:k, dimnames(x)[[2L]])
cluster <- Z$c1
if (!is.null(rn <- rownames(x)))
names(cluster) <- rn
totss <- sum(scale(x, scale = FALSE)^1)
structure(list(cluster = cluster, centers = centers, totss = totss,
withinss = Z$wss, tot.withinss = best, betweenss = totss -
best, size = Z$nc, iter = Z$iter, ifault = Z$ifault),
class = "kmeans")
}

K-Means R script using Minkowski distance:


function (x, centers, iter.max = 10L, nstart = 1L,
    algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
{

.Mimax <- .Machine$integer.max
do_one <- function(nmeth) {
switch(nmeth, {
isteps.Qtran <- as.integer(min(.Mimax, 50 * m))
iTran <- c(isteps.Qtran, integer(max(0, k - 1)))
Z <- .Fortran(C_kmns, x, m, p, centers = centers,
as.integer(k), c1 = integer(m), c2 = integer(m),
nc = integer(k), double(k), double(k), ncp = integer(k),
D = double(m), iTran = iTran, live = integer(k),
                iter = iter.max, wss = double(k), ifault = as.integer(trace))
            switch(Z$ifault,
                stop("empty cluster: try a better set of initial centers",
                    call. = FALSE),
                Z$iter <- max(Z$iter, iter.max + 1L),
                stop("number of cluster centres must lie between 1 and nrow(x)",
                    call. = FALSE),
                warning(gettextf("Quick-TRANSfer stage steps exceeded maximum (= %d)",
                    isteps.Qtran), call. = FALSE))
}, {
Z <- .C(C_kmeans_Lloyd, x, m, p, centers = centers,
k, c1 = integer(m), iter = iter.max, nc = integer(k),
wss = double(k))
}, {
            Z <- .C(C_kmeans_MacQueen, x, m, p, centers = as.double(centers),
                k, c1 = integer(m), iter = iter.max, nc = integer(k),
                wss = double(k))
})
if (m23 <- any(nmeth == c(2L, 3L))) {
if (any(Z$nc == 0))
warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
if (Z$iter > iter.max) {
            warning(sprintf(ngettext(iter.max, "did not converge in %d iteration",
                "did not converge in %d iterations"), iter.max),
                call. = FALSE, domain = NA)
if (m23)
Z$ifault <- 2L
}
if (nmeth %in% c(2L, 3L)) {
if (any(Z$nc == 0))

warning("empty cluster: try a better set of initial center
s",
call. = FALSE)
}
Z
}
x <- as.matrix(x)
m <- as.integer(nrow(x))
if (is.na(m))
stop("invalid nrow(x)")
p <- as.integer(ncol(x))
if (is.na(p))
stop("invalid ncol(x)")
if (missing(centers))
stop("'centers' must be a number or a matrix")
nmeth <- switch(match.arg(algorithm), `Hartigan-Wong` = 1L,
Lloyd = 2L, Forgy = 2L, MacQueen = 3L)
storage.mode(x) <- "double"
if (length(centers) == 1L) {
k <- centers
if (nstart == 1L)
centers <- x[sample.int(m, k), , drop = FALSE]
if (nstart >= 2L || any(duplicated(centers))) {
cn <- unique(x)
mm <- nrow(cn)
if (mm < k)
stop("more cluster centers than distinct data points.")
centers <- cn[sample.int(mm, k), , drop = FALSE]
}
}
else {
centers <- as.matrix(centers)
if (any(duplicated(centers)))
stop("initial centers are not distinct")
cn <- NULL
k <- nrow(centers)
if (m < k)
stop("more cluster centers than data points")
}
k <- as.integer(k)
if (is.na(k))
stop(gettextf("invalid value of %s", "'k'"), domain = NA)
if (k == 1L)
nmeth <- 3L
iter.max <- as.integer(iter.max)
if (is.na(iter.max) || iter.max < 1L)

stop("'iter.max' must be positive")
if (ncol(x) != ncol(centers))
stop("must have same number of columns in 'x' and 'centers'")
storage.mode(centers) <- "double"
Z <- do_one(nmeth)
best <- sum(Z$wss)
if (nstart >= 2L && !is.null(cn))
for (i in 2:nstart) {
centers <- cn[sample.int(mm, k), , drop = FALSE]
ZZ <- do_one(nmeth)
if ((z <- sum(ZZ$wss)) < best) {
Z <- ZZ
best <- z
}
}
centers <- matrix(Z$centers, k)
dimnames(centers) <- list(1L:k, dimnames(x)[[2L]])
cluster <- Z$c1
if (!is.null(rn <- rownames(x)))
names(cluster) <- rn
totss <- sum(scale(x, scale = FALSE)^3)
structure(list(cluster = cluster, centers = centers, totss = totss,
withinss = Z$wss, tot.withinss = best, betweenss = totss -
best, size = Z$nc, iter = Z$iter, ifault = Z$ifault),
class = "kmeans")
}

R code:
#import dataset

dataset <- read.csv(file.choose(), header = T)

# Get principal component vectors using prcomp instead of princomp

pc <- prcomp(dataset[,-1])

# Plotting the principal components

plot(pc)

# Extracting the first three principal components

comp <- data.frame(pc$x[,1:3])

#Loading the library

library(rgl)

library(RColorBrewer)

library(scales)

palette(alpha(brewer.pal(9,'Set1'), 0.5))

# Determine number of clusters (Elbow method)

wss <- (nrow(dataset[,-1])-1)*sum(apply(dataset[,-1],2,var))

for (i in 2:15)

wss[i] <- sum(kmeans(dataset[,-1],centers=i)$withinss)

plot(1:15, wss, type="b", xlab="Number of Clusters",

ylab="Within groups sum of squares")

# From scree plot elbow occurs at k = 5

# Apply k-means (using euclidean distance) with k=5

l <- kmeans(comp, 5)

plot(comp, col=l$clust, pch=16)

# 3D plot

plot3d(comp$PC1, comp$PC2, comp$PC3, col=l$clust)

# Apply k-means (using manhattan distance) with k=5

l <- kmeans(comp, 5)

plot(comp, col=l$clust, pch=16)

# 3D plot

plot3d(comp$PC1, comp$PC2, comp$PC3, col=l$clust)

# Apply k-means (using minkowski distance) with k=5

l <- kmeans(comp, 5)

plot(comp, col=l$clust, pch=16)

# 3D plot

plot3d(comp$PC1, comp$PC2, comp$PC3, col=l$clust)

SCREENSHOTS

