
Master's in Advanced A.I.

Data Mining

Assignment 3.4

Víctor M. Sobrino García


UNED.
Centro Asociado de Guadalajara
vsobrino@alumno.uned.es

May 2018

Abstract
Clustering is one of the most relevant tasks that can be addressed with machine learning (ML) techniques. Clustering techniques are unguided (unsupervised) ML methods that take the provided information, measure the similarity among the data and group them into clusters.

These algorithms are useful for determining the number of possible solutions to a problem, for identifying alternative ways to proceed, and for supporting tasks such as planning, computer vision, etc.

Several methods are available to solve clustering problems. This work deals with the capabilities of, and the specific considerations related to, the K-means algorithm. K-means is based on the use of distance as a similarity measure, which means that it searches for the optimal clustering by reducing the mean squared error (MSE). This can lead to several problems that, combined with poor management of the data at the input of the machine, will make the system misclassify instances.

Keywords: Machine Learning, Data Mining, Clustering, K-Means, Grouping.


Chapter 1
Technique Review
1.1 OVERVIEW
1.1.1 THE CLUSTERING PROBLEM

Clustering is one of the most relevant and common tasks that machine learning (ML) techniques can deal with. It refers to the job of grouping data into an a priori unknown number of groups, each formed according to the similarity of the instances it contains. These groups are commonly called clusters.

This problem can be solved in several ways. This work deals with the K-means algorithm, which will be described later in this paper. Previous works have dealt with Self-Organizing Maps (SOM), which are another kind of clustering technique. Other groups of techniques used to solve the problem are (Kaushik, 2016):
• Connectivity models
• Centroid models
• Distribution models
• Density models

Among them, the most widely used algorithms are the following (Seif, 2018):

1.1.1.1 K-Means Clustering.

K-means groups the instances of the dataset into clusters based on the similarity between them (it will be described later in this work).

1.1.1.2 Mean-shift clustering

This is a sliding-window algorithm. The aim of the algorithm is to find areas with a higher density of instances in the dataset. Once this is done, it clusters each group by setting up a centroid for the detected cluster.

The main problem comes when the data is too dense in a certain area. In this case, as the window slides through it, a large number of nearby centroids will be obtained. Additional passes are then recommended, so that these near duplicates can be removed.

Figure 1. Mean-shift clustering with a fixed-size single sliding window.

The procedure used in this technique is quite simple:

• The sliding window is set up with a given radius and position (these values should be random).
• It takes the instances inside the window and calculates their centroid.
• The window slides to the densest area of the dataset. To do so, it moves the center of the window to the position of the mean of all the instances inside the window.
• This iterative movement of the window ends when all the instances have been included in a window.

As said before, additional passes can be done using the previously obtained clusters and different window sizes. By doing this, near duplicates are removed and the minimum number of clusters is obtained.

The strength of this technique lies in the fact that there is no need to know the number of clusters in the dataset beforehand, as the overall process joins sub-clusters into the minimum set of clusters. This cannot be done with K-means, where the number of centroids to find must be known at the beginning of the computation. A sketch of this procedure is given below.
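As an illustration, the following is a minimal sketch of mean-shift clustering using scikit-learn (an assumed dependency, not part of the original Weka workflow); the synthetic data and the estimated bandwidth are purely illustrative, with the bandwidth playing the role of the sliding-window radius described above.

```python
# Minimal mean-shift sketch; scikit-learn and NumPy are assumed available.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
# Synthetic 2-D data with three dense areas (illustrative only).
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(4, 4), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 4), scale=0.5, size=(50, 2)),
])

bandwidth = estimate_bandwidth(X, quantile=0.2)   # acts as the window radius
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)

print("Number of clusters found:", len(ms.cluster_centers_))
```

Note that the number of clusters is an output of the procedure, not an input, which is exactly the property highlighted above.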

1.1.1.3 Density-based Spatial Clustering of Applications with noise (DBSCAN).

This technique is based on the classification of instances by initially allocating them to one of three different categories (core points, reachable points and outliers), defined as follows:


• Core point. An instance p is a core point if at least minPts instances are within a given distance (ε) of it; ε is the maximum radius of the neighborhood of p.
• A point q is directly reachable from p if p is a core point and q is within distance ε of p. All such points are said to be directly reachable.
• A point q is reachable from p if there is a path from p to q in which each point is directly reachable from the previous one.
• All points that are not reachable from any core point are called outliers (noise).

Each core point, together with the points that are reachable from it, forms a cluster. As all the points on the path from a core point to a reachable point are core points, there can be more than one core point in a cluster. In fact, the reachable points that are neither outliers nor core points form the edge of the cluster.

As in mean-shift clustering, in this procedure every instance can be visited more than once during the process.

Figure 2. Example of DBSCAN with minPts = 3; one outlier (noise point) is present.


Figure 3. Example of clustering when applying DBSCAN

The algorithm runs as follows:

• An arbitrary starting point is selected, together with the values of ε and minPts.
• If the point has at least minPts neighbors within ε, it is selected as the seed of a cluster and the expansion process starts. If not, it is considered an outlier.
• This is done until the cluster is complete. Then, if there are more unvisited points available, the process starts again with a different cluster.

The strength of the algorithm is that there is no need to provide the initial number of clusters; it will find it out by itself. The downside is that it does not perform well when the density of the space is not constant. A sketch of the procedure is shown below.
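The following is a minimal sketch using scikit-learn's DBSCAN implementation (assumed available; the data and parameter values are illustrative only): eps corresponds to ε and min_samples to minPts.

```python
# Minimal DBSCAN sketch; scikit-learn and NumPy are assumed available.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(60, 2)),
    rng.normal(loc=(3, 3), scale=0.3, size=(60, 2)),
    [[10.0, 10.0]],                          # an isolated point (noise)
])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_                          # -1 marks outliers (noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters, "| noise points:", list(labels).count(-1))
```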

1.1.1.4 Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)

A Gaussian mixture model (GMM) is a parametric density function represented as a weighted sum of Gaussian component densities (Kempt, 2011). The EM algorithm, when applied to GMMs, overcomes some of K-means' major weaknesses, such as the fact that, since the clusters use the Euclidean distance to measure similarity, all the resulting clusters have a circular shape. GMMs use Gaussian distributions, so this effect is avoided.

The way this algorithm works is similar to K-means:

• Given the number of clusters, the process starts with a random initialization of the Gaussian parameters of each cluster.
• Once this is done, the next step is to calculate the probability of each instance belonging to a certain cluster.
• The parameters of each Gaussian distribution are then updated so as to maximize the probabilities of the data points within the clusters.
• This process continues until convergence (no changes in the parameters from one iteration to the next).

Figure 4. GMM with EM clustering algorithm example.

As an advantage, this is a more flexible algorithm (since the covariance values can shape the solution) in which an instance can be assigned to different clusters at the very same time, as the membership of an instance in a cluster is a matter of probability.
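A minimal sketch of EM clustering with a Gaussian mixture follows, using scikit-learn's GaussianMixture (assumed available; the data and parameters are illustrative). The soft assignments returned by predict_proba are exactly the probabilistic membership mentioned above.

```python
# Minimal GMM/EM sketch; scikit-learn and NumPy are assumed available.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=100),
    rng.multivariate_normal([5, 0], [[1.0, -0.6], [-0.6, 1.0]], size=100),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most likely cluster per instance
soft_labels = gmm.predict_proba(X)  # membership probabilities (soft assignment)
print(soft_labels[:3].round(3))
```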

1.1.1.5 Agglomerative Hierarchical Clustering.

This algorithm is based on merging sub-clusters and clusters based on similarity. There are two different approaches to this process; the most commonly used is the bottom-up approach.
The bottom-up procedure works as follows:
• Every instance is initially considered as a cluster.
• If two clusters are similar enough, they are merged.
• This process is repeated until the root of the resulting tree is reached.


Figure 5. AHC tree example.

This process does not need the number of classes; it only requires a threshold value for the similarity function (typically the Euclidean distance). Its computational complexity, however, is typically higher than that of the other techniques already mentioned. A sketch is given below.
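A minimal sketch of the bottom-up approach with scikit-learn's AgglomerativeClustering (assumed available): the distance threshold replaces the number of clusters, as described above, and the data is illustrative.

```python
# Minimal agglomerative (bottom-up) clustering sketch; scikit-learn assumed.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.4, size=(40, 2)),
    rng.normal(loc=(5, 5), scale=0.4, size=(40, 2)),
])

# n_clusters=None plus distance_threshold: keep merging until no pair of
# clusters is closer than the threshold (Euclidean distance, Ward linkage).
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=3.0)
labels = agg.fit_predict(X)
print("Clusters found:", agg.n_clusters_)
```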

1.2 K-MEANS

This is one of the simplest and most widely used clustering techniques in the ML field (Introduction to K-means Clustering, 2016). K-means is an unsupervised clustering ML technique. The aim of the process is to divide the input instances into several clusters based on the similarity of their characteristics (features). The only information needed is the number of clusters to form and some other parameters to correct bias and noise in the algorithm.

The technique is simple (Politecnico di Milano. Dipartimento di Elettronica, 2018):

• When the value of the number of clusters (k) is known, k random points (centroids) are designated.
• All data points are then labeled with the corresponding cluster based on a proximity criterion:
$S_i^{(t)} = \{\, x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2,\ \forall j,\ 1 \le j \le k \,\}$    (1.1)

where $x_p$ is the instance to be assigned to a cluster and $m_i^{(t)}$ is the $i$-th mean (centroid) value at iteration $t$.

• Once this is done, a new value for the centroid of each cluster is calculated:

$m_i^{(t+1)} = \dfrac{1}{\lvert S_i^{(t)} \rvert} \displaystyle\sum_{x_j \in S_i^{(t)}} x_j$    (1.2)


• Once the new centroid coordinates are obtained, the labeling process starts again: new clusters are formed and, hence, new centroids are computed.
• This process is repeated until convergence is reached.

Figure 6. Evolution of the centroids (m1 and m2) for a two-group dataset clustering process.

It is the ease of application of the process that makes it useful, and several applications in different fields have taken advantage of this technique. There are, however, two important issues to consider: it is limited to circular-shaped clusters and it depends on the similarity function. As the foundation of the technique is the reduction of the mean squared error (MSE), the only similarity function used is a distance measure between two instances (data points). This is a drawback of the overall technique, since other approaches allow better convergence and accuracy, while this technique is prone to getting stuck in local minima.

More formally, the objective of the algorithm is to minimize the within-cluster sum of squares from Eq. 1.1, which leads to Eq. 1.3:

$\arg\min_{S} \displaystyle\sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 = \arg\min_{S} \displaystyle\sum_{i=1}^{k} \lvert S_i \rvert \, \mathrm{Var}(S_i)$    (1.3)

As the process is performed iteratively until convergence, the complexity depends on the type of problem. The problem is NP-hard for k clusters in the plane and for two or more clusters in a general d-dimensional Euclidean space; for fixed k and d it can be solved exactly in $O(n^{dk+1})$ time. Some heuristics have been shown to reduce the computational time; Lloyd's algorithm is one of the best-known (Hamerly & Drake, 2015).
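For clarity, the following is a minimal from-scratch sketch of the Lloyd-style iteration described by Eqs. 1.1 and 1.2; it is written for readability rather than performance, and empty clusters are not handled.

```python
# Minimal K-means sketch implementing Eqs. (1.1) and (1.2); NumPy assumed.
import numpy as np

def kmeans(X, k, max_iter=100, seed=10):
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct instances as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step (Eq. 1.1): label each instance with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step (Eq. 1.2): each centroid becomes the mean of its cluster.
        # (Empty clusters are not handled in this sketch.)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence reached
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()       # within-cluster sum of squares
    return labels, centroids, sse

# Example with random 2-D data (illustrative only).
X = np.random.default_rng(0).normal(size=(100, 2))
labels, centroids, sse = kmeans(X, k=3)
print(centroids.round(3), "SSE:", round(float(sse), 3))
```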


1.2.1 DISTANCE FUNCTIONS

Even though, as already mentioned, all the similarity functions are based on measuring the distance between data points, there are several distance measures typically used in the K-means algorithm. The most important ones are the following.

1.2.1.1 Euclidean distance

It is the straight-line measure between two points located in a Euclidean space. A Euclidean space together with the Euclidean distance function forms a metric space.

Given two vectors in an n-dimensional Euclidean space, defined by their coordinates as $X = (x_1, x_2, \dots, x_n)$ and $Y = (y_1, y_2, \dots, y_n)$, the Euclidean distance is calculated as

$d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} = \sqrt{\displaystyle\sum_{i=1}^{n} (x_i - y_i)^2}$    (1.4)

Figure 7. Euclidean distance representation.

1.2.1.2 Manhattan distance

As well as the Euclidean distance, Manhattan distance, also called taxicab distance, is useful
when applied in a Euclidean space. This function restricts the distance measure to the sum of
the distance of the movements parallel to the axis of the space (typically horizontal and vertical)
needed to connect desired points in space.


In other words, the computation is equivalent to the sum of the absolute differences of the Cartesian coordinates of the points under measurement:

$d_1(X, Y) = \lVert x - y \rVert_1 = \displaystyle\sum_{i=1}^{n} \lvert x_i - y_i \rvert$    (1.5)

As the measure is performed along the directions given by the axes of the coordinate system, it depends on the rotation of those axes, but not on translation.

Figure 8. Manhattan and Euclidean distance comparison.

1.2.1.3 Chebyshev distance

Also called the maximum metric, it takes the distance between two vectors (instances) as the greatest difference between them along any coordinate dimension. It is also called the chessboard distance. More formally, it is defined as:

$d(X, Y) = \max_i \, \lvert x_i - y_i \rvert$    (1.6)

As said, it is the maximum distance along any coordinate, that is, the largest difference between the coordinates of the two vectors in any single coordinate direction.


Figure 9. Chebyshev distance representation.

1.2.1.4 Minkowski distance

The Minkowski distance is a distance measure used only in normed spaces (spaces in which a norm has been defined). Depending on the chosen order p, it emphasizes to a greater or lesser extent the coordinates with the largest differences. Formally, the definition is

$d(X, Y) = \left( \displaystyle\sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}$    (1.7)

Different values of p establish different measurement paths.

Figure 10. Different representations of the Minkowski distance for different p values.


Figure 11. Comparison between the Euclidean, Manhattan and Minkowski distances.
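The four distance functions above can be summarized in a short sketch (NumPy assumed available); the example vectors are arbitrary.

```python
# Minimal sketch of the distance functions of Eqs. (1.4)-(1.7); NumPy assumed.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))            # Eq. (1.4)

def manhattan(x, y):
    return np.sum(np.abs(x - y))                    # Eq. (1.5)

def chebyshev(x, y):
    return np.max(np.abs(x - y))                    # Eq. (1.6)

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)  # Eq. (1.7)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y), minkowski(x, y, 3))
# Minkowski reduces to Manhattan for p = 1 and to Euclidean for p = 2,
# and approaches the Chebyshev distance as p grows.
```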

1.3 K-MEANS IN WEKA

Most of the techniques mentioned in this paper are among the clustering techniques available in Weka, where the K-means clustering algorithm is referred to as SimpleKMeans. Most of the options for executing the technique deal with representation. Nevertheless, there are some interesting options that can affect the clustering process of the algorithm.

-C. This option enables the use of canopies to reduce the number of distance calculations. In this case, low-density canopies are detected and pruned.
-periodic-pruning <num>. This number sets the pruning cycle for low-density canopies. The default is 10,000, so low-density canopies will be pruned every 10,000 training instances.
-min-density. Sets the minimum density required to keep a canopy. If the number of instances is less than this number, the canopy will be pruned in the pruning period.
-T1 & -T2. Distances to use in the canopy clustering process.
-A. Distance function to use for the computation. As said, distance is the only similarity measure used but, among these functions, there are several options available in this case. These are:
• Euclidean distance
• Manhattan distance
• Filtered distance
• Chebyshev distance
• Minkowski distance


The other most remarkable options are the number of clusters (which is the minimum required information), the maximum number of iterations and the number of execution slots.

Figure 12. Configuration window for the K-means algorithm in Weka.


Chapter 2

Study
2.1 WEKA WORK

This task deals with the capabilities of the K-means algorithm in Weka. The first step is to build the model with the training set in order to test it and compare solutions obtained with different configurations. As the number of clusters is an input needed to build the system, the K-means algorithm cannot determine the number of clusters by itself. The only solution is to compare several executions with different numbers of clusters, trying to minimize the mean squared error (MSE). This can degenerate into a bad clustering, since the minimum error is obtained when each instance is a centroid (and, hence, a cluster) by itself. This must be considered when the system is designed; some other algorithms may help.

2.1.1 BUILDING THE SYSTEM

An Iris flower dataset has been provided to train and test the K-means machine for this task. The dataset has four main attributes, and the class is the fifth. As K-means is an unguided (unsupervised) machine learning technique, there is no possibility of feedback; the method is, hence, static. The class attribute will therefore be removed from the dataset.

The training phase will be carried out using a percentage-split dataset selection. This is usually done when the dataset is big enough to be split in two: one part is used to train the model and the other is used for testing. With respect to overfitting, this way of performing the training and testing phases performs better than cross-validation. The drawback is that the dataset must be big enough to allow splitting it for training and testing purposes.

Cross-validation is valid for any supervised (guided) learning method, as there is a chance to correct the machine's behavior at every learning step. As there is no guidance in clustering, there is no option to perform cross-validation, so the only options available are to use the whole dataset as the training set, to use a certain percentage of the dataset for training (percentage split), or to provide a different dataset for testing. If no data is reserved for the testing phase (new data or a percentage of the training set), there will be no possibility to assess overfitting and the effects of noise in the model. For this reason, overfitting and misbehavior depend only on how representative the dataset is of the overall population: if the dataset is a representative sample of the expected population, this problem will not appear in the result. This is also why splitting the dataset should be done carefully, since if the dataset is not big enough, some noise can be introduced in the model training phase. A sketch of this workflow is given below.
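As a hedged approximation of this workflow (using scikit-learn rather than Weka, so the exact figures will not match the Weka output), the following sketch drops the class, splits the Iris data 66/34, fits K-means with k = 3 and a fixed seed, and reports the within-cluster sum of squared errors on both parts.

```python
# Minimal percentage-split sketch on Iris; scikit-learn and NumPy assumed.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

X = load_iris().data                      # 4 attributes; the class is ignored
X_train, X_test = train_test_split(X, train_size=0.66, random_state=10)

km = KMeans(n_clusters=3, n_init=10, random_state=10).fit(X_train)

def sse(model, data):
    # Sum of squared distances from each instance to its assigned centroid.
    d = data - model.cluster_centers_[model.predict(data)]
    return float((d ** 2).sum())

print("Train SSE:", round(sse(km, X_train), 3), "| Test SSE:", round(sse(km, X_test), 3))
print("Centroids:\n", km.cluster_centers_.round(3))
```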

2.1.1.1 Initial set up and results

The main parameters needed for the initial model set-up are the following:
• Ignore the class attribute. Otherwise, some bias will be added to the system; as this is an unsupervised (unguided) clustering method, there should be no knowledge about the classes in the dataset.
• Clusters: 3 clusters will be requested.
• Seed: 10 (default).
• Percentage split: 66% (default).

When this is done, the obtained results are those depicted in Fig. 13.

Figure 13. Results for the initial set-up


As seen in Fig. 13, the coordinates of the centroids of every cluster change when new instances are added to the system. Let us consider some other options for the percentage split in order to compare the coordinates of the centroids.
PERCENTAGE SPLIT 66/33
Training Data (66%)      Full Data    0         1         2
  sepallength            5.8433       5.8885    5.006     6.8462
  sepalwidth              3.054        2.7377    3.418     3.0821
  petallength             3.7587       4.3967    1.464     5.7026
  petalwidth              1.1987       1.4181    0.244     2.0795
  MSE                     6.998114
Testing Data (33%)       Full Data    0         1         2
  sepallength            5.8313       5.0514    6.725     5.6571
  sepalwidth              3.0568       3.4543    3.0139    2.6214
  petallength             3.6848       1.4771    5.4389    4.1893
  petalwidth              1.1657       0.2571    1.9139    1.3393
  MSE                     4.731454

USE TRAINING SET
Training Data (100%)     Full Data    0         1         2
  sepallength            5.8433       5.8885    5.006     6.8462
  sepalwidth              3.054        2.7377    3.418     3.0821
  petallength             3.7587       4.3967    1.464     5.7026
  petalwidth              1.1987       1.4181    0.244     2.0795
  MSE                     6.998114
Testing Data (0%)
  MSE                     0

PERCENTAGE SPLIT 50/50
Training Data (50%)      Full Data    0         1         2
  sepallength            5.8433       5.8885    5.006     6.8462
  sepalwidth              3.054        2.7377    3.418     3.0821
  petallength             3.7587       4.3967    1.464     5.7026
  petalwidth              1.1987       1.4181    0.244     2.0795
  MSE                     6.998114
Testing Data (50%)       Full Data    0         1         2
  sepallength            5.784        4.8762    5.4143    6.2446
  sepalwidth              3.0573       3.2429    3.8143    2.8617
  petallength             3.5773       1.4381    1.4857    4.8847
  petalwidth              1.152        0.2476    0.3       1.683
  MSE                     5.1540819

PERCENTAGE SPLIT 33/66
Training Data (33%)      Full Data    0         1         2
  sepallength            5.8433       5.8885    5.006     6.8462
  sepalwidth              3.054        2.7377    3.418     3.0821
  petallength             3.7587       4.3967    1.464     5.7026
  petalwidth              1.1987       1.4181    0.244     2.0795
  MSE                     6.998114
Testing Data (66%)       Full Data    0         1         2
  sepallength            5.8551       5.7154    5.0059    6.7105
  sepalwidth              3.0347       2.6462    3.4       2.9737
  petallength             3.649        4.0231    1.4294    5.3789
  petalwidth              1.1694       1.2615    0.3       1.8842
  MSE                     2.435235

Table 1. Results after the initial set-up for the 66/33 percentage split, the full training set, and the 50/50 and 33/66 percentage splits.


The obtained results can be related to the difference in the quadratic error when varying the split percentage. Even when the training percentage is decreased to 33%, the centroids obtained in the training phase are the same, and they are maintained even with the minimum amount of data in the training phase. As the evolution of the algorithm depends on the initialization and, hence, on the selected seed (this point will be revisited later in this text), the local minima should be the same (or very similar) for the same seed.

The fact that the training-phase values stay the same in every situation is caused by the size of the dataset together with the algorithm itself. Convergence is reached easily, so the numbers are the same in every training situation, as K-means gets stuck in a local minimum.

One of the greatest drawbacks of K-means is that, even with the best data, it can get stuck in local minima. Depending on the number of iterations, the partitions are sometimes badly formed, either because the number of clusters is not the optimal one or because the procedure gets stuck (or both). In Fig. 14, in the bottom-right corner, one cluster is split into three parts even though it is a clearly defined cluster.

Figure 14. Example of the K-means clustering process getting stuck in a local minimum.

With these values fixed in the training phase, the differences appear in the testing phase. The results shown in Table 1 are represented in the following charts:


[Charts: centroid coordinates for the Full Data, the training clusters (Trn0-Trn2) and the testing clusters (Tst0-Tst2), for percentage splits of 66%, 50% and 33%.]

Figure 15. Charts representing the centroid positions in the training and testing phases for different percentage splits.

Some lessons can be drawn from Fig. 15; they are related to the distances between centroids. In the case of the 50% percentage split, the differences between the centroids are uniform: there is no grouping, and it seems that mistakes in the clustering process can appear. Nevertheless, this 50/50 percentage split seems to generate the smallest MSE difference between the centroids, which happens because all of them are almost equally distributed between the minimum and the maximum. In the other two cases there is some grouping of clusters.

In the case of the 66% percentage split the grouping is clearer: the majority of the clusters are close to each other. This means that the training process with the 66% split is better, that is, since this is an unguided process, the more training instances are used, the better the clustering approach obtained.

The other interesting question is the following: since there is some grouping between clusters, while other clusters are clearly separated from these groups, is there any number of clusters k better than 3? This option will be considered later on.

Some information about the data distribution is available in Weka.

Figure 16. Class distribution in an instance vs sepal length graph.


Fig. 16 shows the distribution of the data points (instances) of the different classes based on the sepal length. Some interesting information can be inferred by observing the graph. The data is mixed up; there is more than one observable cluster, but the K-means method will join instances of Iris-versicolor with Iris-setosa in the upper-left part of the graph, which will later lead to misclassification. The same happens when comparing Iris-virginica with Iris-versicolor.

The result is, hence, that this data is difficult to classify, and the use of a fixed number of clusters set to 3 constrains the ability of the algorithm. Normalization can help in the clustering process, since it can be used to reduce the effect of outliers and to group the data into clusters with values near the mean. If this process is still unsuccessful, removing outliers before attribute selection can be useful to cluster the data. If no solution is valid, it must mean that the data is irrelevant. Further study can be done once all the data has been observed.


Figure 17. Clustering for sepalwidth, petallength and petalwidth.

Unlike in the case of Fig. 16, some clustering can be observed for sepalwidth, petallength and petalwidth. This is why clustering can be performed (and it actually is). At the same time, the petal-related features show 3 clear clusters, which is why three seems to be a good number of clusters to obtain (further study will be performed in this area). As the number and values of the classes are already known (just for learning purposes), if a cluster-number adjustment is done, 3 should be the optimal number of clusters obtained.

The sepal-related features, on the other hand, do not show any clear clustering; moreover, the distribution is mixed. There are outliers observed far away from the possible cluster centroids, no clear distribution is observed, and the shape of the distribution is the same as the one associated with noise. It seems that both sepal-related features bring noise to the system, which means that the clustering will not be easily performed.

2.1.1.2 Evaluation of the system, final class comparison.

Weka has the ability to perform a clustering analysis when the class of every instance is provided. When this happens, Weka first performs the clustering phase by applying the selected algorithm (K-means in this task) and then, once the clusters are formed, uses the centroid information and the distance function to evaluate the model in a sort of classification process.

When this is done, the obtained values are those depicted in Fig. 18, in which some new important information is shown.


Figure 18. Obtained values when classes to cluster evaluation is performed.

The new information available in this case is the clusters-per-class assignment (which has been used in the previous section to explain some of the observed effects of the process). When the data is analysed and the classes are mapped to clusters, 11.3333% of the instances are misclassified. That means that the clustering process fails for 17 out of 150 instances. The allocation of each misclassified instance is represented in Fig. 19. Fig. 18 shows the confusion matrix: the maximum confusion in the classification process appears in the Iris-virginica case, with 14 instances classified as Iris-versicolor when they should not have been.

As shown before, this happens because the sepal-related features do not provide enough separation of the instances to form independent clusters. This can be seen in Fig. 20, which represents the distribution of the instances according to these attributes. Some mixing can be observed, and this mixing of the data points is the cause of the misclassification. A sketch of this kind of evaluation is given below.
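The following is a minimal sketch of a classes-to-clusters style evaluation (a hedged approximation of what Weka reports, so the figures may differ slightly): each cluster is mapped to its majority class and the remaining instances are counted as misclassified.

```python
# Minimal classes-to-clusters evaluation sketch; scikit-learn and NumPy assumed.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

iris = load_iris()
X, y = iris.data, iris.target

labels = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(X)

# Map each cluster to the class that is most frequent inside it.
mapping = {c: np.bincount(y[labels == c]).argmax() for c in np.unique(labels)}
y_pred = np.array([mapping[c] for c in labels])

errors = int((y_pred != y).sum())
print(f"Misclassified: {errors}/{len(y)} ({100 * errors / len(y):.4f}%)")
print(confusion_matrix(y, y_pred))
```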

Figure 19. Clusters-per-class representation. Misclassified instances are marked with squares.


Figure 20. Data point allocation according to the sepal width and sepal length relationship.

Fig. 21 shows the same data relationship, but this time applied to the petal properties. In this case the clusters are easy to identify and the mix-up effect is not so remarkable.

Figure 21. Data point allocation according to the petal-related features.


This study shows the effect of the noise introduced by features that are irrelevant or almost random (and therefore irrelevant as well). This is the case of the sepal-related attributes: they do not add any relevant information, only noise.

2.1.1.3 Effect of initialization.

As already mentioned, the initial values of the centroids are important in the process. If the initialization (which is, in theory, a random process) is performed using statistically based heuristic approaches, the convergence to the data can be quicker and more effective, avoiding the tendency of the algorithm to get stuck in local minima. Weka allows the user to modify the initial values of the centroids by changing the seed used by the pseudo-random generator that assigns the initial centroid positions. As in the case of genetic algorithms, several different initializations can drive the search along different paths in the solution space, avoiding minima of the cost function (Euclidean distance) that would force the algorithm to get stuck. The values obtained for the centroids are depicted in the following figure.


Figure 22. Five examples with different seed values: 10 (top left), 51 (top right), 88 (bottom left), 91 (bottom right), 139 (centre).

Fig. 22 shows the effects of different initializations. The movement of the clusters in search of a lower mean squared error can also be noticed in the number of iterations, which grows with the distance between the start and end points of the algorithm. One important issue to remark is the effect observed in the case of the seed with value 91: the algorithm clearly gets stuck in a local minimum, and the error is close to 10, much bigger than for any other value of the seed.

In this case, even though the number of iterations needed to reach the final value is higher, the algorithm gets stuck, yielding a sub-optimal result. In the cases in which the mean squared error is near the minimum obtained value, the centroid values for the petal-related features show only minimal variations.

The effect of the seed is, hence, important. As the initialization is not truly random (it should be in theory, but it depends on the seed), several runs of the same algorithm using different seeds are recommended. When this is done, the risk of getting stuck in a local minimum is reduced. This should be a must in clustering techniques, since there is no chance to perform guided training for the K-means algorithm. A sketch of this multi-seed strategy is given below.
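A minimal sketch of this multi-seed strategy (scikit-learn assumed; the seed values are those used above, but the resulting SSE figures will not necessarily match Weka's): run K-means once per seed and keep the run with the lowest within-cluster sum of squares.

```python
# Minimal multi-seed restart sketch; scikit-learn assumed available.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
seeds = [10, 51, 88, 91, 139]

runs = []
for seed in seeds:
    # n_init=1 so that each seed corresponds to a single initialization.
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    runs.append((km.inertia_, seed, km))
    print(f"seed {seed:3d} -> SSE {km.inertia_:.4f}, iterations {km.n_iter_}")

best_sse, best_seed, best_model = min(runs)
print("Best seed:", best_seed, "with SSE", round(best_sse, 4))
```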


Figure 23. Cluster visualization for different seeds: 51 (top left), 88 (top right), 91 (bottom left), 139 (bottom right).

The effect of the seed and the avoidance of local minima are easy to detect in Fig. 23. As the seed changes, the random initialization of the clusters allows local minima to be avoided and the clustering process will, hence, group the instances in a better way.

Different initializations with different seeds lead to different final results. The value of the seed is not related to the capabilities of the algorithm. In fact, in some cases, as the presence of outliers was not addressed in the pre-processing phase, some of the obtained values for the MSE and the incorrectly classified instances are worse, because the algorithm gets stuck in a worse local minimum than in the experiment in which 10 is used as the seed value.

As there is, hence, no relationship between the seed value and the quality of the clustering, performing several tests with different seeds can lead to the overall minimum.


2.1.1.4 Features predictive power.

As already mentioned in this work, some of the features involved in the development of an ML technique are better than others. In this case, the presence of noise in a feature, its irrelevance or its dependency on other features can make the attribute and its values totally useless. Removing the feature is sometimes the best option.

In this example there are several clues pointing towards the low utility of the sepal-related attributes. The fact that no clustering is observed even when these attributes are considered in pairs, and that their distribution is graphically similar to noise, are indicators of the low influence of these features on the final result of the overall algorithm. In all the experiments, the class will always be ignored.

In the first experiment, both sepal-related attributes are disregarded. In this case an MSE of 1.705099 is obtained and, when the classes-to-clusters evaluation is performed, only 4% of the instances (6) are wrongly classified. The results are shown in Fig. 24.

Figure 24. Results when only petal-related features are considered.


Figure 25. Clustering visualization for experiment 1.

Fig. 25 shows the clustering obtained. In these results, the number of misclassified instances decreases and the clustering is performed more easily.

Another remarkable effect of this attribute selection appears when different seeds are selected (91, 55, 88, 139). As the values of the clusters and the MSE remain the same, the likelihood that the algorithm has found the overall minimum (the optimal solution) grows.

In the second experiment, the only features considered are the sepal-related attributes. In this case the MSE grows to 4.1550991 and, when the classes-to-clusters evaluation is performed, 20% of the instances are misclassified (30 instances). This shows that, as suspected, the sepal-related features have low influence on the clustering process. Fig. 26 shows the results.


Figure 26. Results for sepal-related features.

Figure 27. Clustering visualization for sepal-related features.


Figure 27 also shows the result of the clustering process. The number of misclassified instances has grown. In this case, the sepal-related features seem to provide a good differentiation only of Iris-setosa. If the aim is to filter out this single class, the sepal-related features can be used; if not, the effect of noise degrades the clustering abilities of the machine.

The results shown in Figs. 26 and 27 have been obtained with seed 51. Unlike in the previous experiment, the effect of the seed on the overall clustering process is significant: the number of misclassified instances can grow to 34 for other seed values. This is logical, since the distribution of the instances is noisy and erratic; under these conditions the effect of the random initialization is felt in the overall performance of the algorithm.

In the third experiment, only the width-related attributes are selected. In this case, the obtained MSE lies between the values of the two previous experiments. This means that the predictive power of the widths of both the sepal and the petal is significantly better than the one derived from considering only the sepal-related measures. This seems logical, since the different classes of the flower can be differentiated (clustered) by the measures of either the sepal or the petals.

Nevertheless, considering the dimensions of the petals is more promising than considering only the widths of both petal and sepal. The results are as expected, since the predictive power of parts of the previous experiments is mixed in this one: the clustering provided by the petal-related feature (the width in this case) corrects the noise of the sepal-related one (the width in this case).

When the classes-to-clusters evaluation is performed, the number of misclassified instances decreases to 11 (7.333%). Figs. 28 and 29 show the results.


Figure 28. Results for width-related attributes

Figure 29. Clustering visualization for experiment 3

In this case, the effect of the seed is softened by the presence of a petal-related feature.


In the fourth experiment, only the length-related features are considered. The results are shown in Figs. 30 and 31.

Figure 30. Results for experiment 4.

Figure 31. Clustering Visualization for the 4th Experiment.


As expected, the results are, again, not as good as the first ones. This comes from the fact that the less representative feature of each pair has been used. 24 instances (16%) are misclassified when the classes-to-clusters evaluation is done; the system suffers from the noise of both features.

In the fifth experiment, a test with each isolated attribute has been performed. As expected, the sepal-related attributes are useless. Table 2 shows the values obtained when each single attribute is considered for the clustering.

Feature        MSE                    Misclassified instances (classes to clusters)
Sepallength    1.323548789173789      55
Sepalwidth     1.0151916829109804     75
Petallength    0.7083608366129245     10
Petalwidth     0.857260690642333       6
Table 2. Values obtained for single attributes.

In conclusion, obtaining initial knowledge about the features and performing feature selection can lead to much more effective machines. A sketch of these feature-subset experiments is given below.
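The following is a minimal sketch of the feature-subset experiments (again a scikit-learn approximation of the Weka workflow, so the exact error counts may differ from those reported above).

```python
# Minimal feature-subset comparison sketch; scikit-learn and NumPy assumed.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target            # columns: sep.len, sep.wid, pet.len, pet.wid

subsets = {
    "petal only":   [2, 3],
    "sepal only":   [0, 1],
    "widths only":  [1, 3],
    "lengths only": [0, 2],
}

for name, cols in subsets.items():
    labels = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(X[:, cols])
    # Majority-class mapping, as in the classes-to-clusters sketch above.
    mapping = {c: np.bincount(y[labels == c]).argmax() for c in np.unique(labels)}
    errors = int(sum(mapping[c] != t for c, t in zip(labels, y)))
    print(f"{name:13s} -> {errors} misclassified ({100 * errors / len(y):.1f}%)")
```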

2.1.1.5 Initial Clusters

As the number of clusters is initially unknown but the optimal number needs to be determined, the only way to do this is to run the algorithm with different values and compare the results of this multi-value analysis.

In the sixth experiment, only 2 clusters are requested. In this case the MSE grows to 12.143688 and, in the classes-to-clusters test, 33.3333% of the instances (50) are misclassified.

Choosing a bad number of desired clusters leads to unacceptable values. Fig. 32 represents the distribution; in it, the grouping of two classes into a single cluster can be observed.


Figure 32. Clustering visualization for experiment 6.

In the seventh experiment the number of clusters is set to 4. Fig. 33 shows that, when this is done, a cluster is split and, as there are not enough close outliers to justify a new cluster, the number of misclassified instances grows. In the evaluation phase, 44 instances (29.3333%) were misclassified, with an MSE of 5.532831.

Figure 33. Clustering Visualization for experiment 7.

As a conclusion of this part of the work, several experiments must be carried out to obtain the optimal number of clusters. If the number of clusters is too high, the mean squared error will keep decreasing, down to the minimum reached when, in the limit, every instance is associated with its own cluster and, hence, there is no distance between the instance and the centroid. This is why the MSE is not the best option to measure the quality of the algorithm: it can get stuck in local minima, providing misleading information. There are other algorithms that search for the optimal number of clusters; these must be considered when the number of clusters is unknown or cannot be predicted from the data. When this is done (using DBSCAN), the optimal number of clusters obtained is 3. A sketch of the multi-k search is given below.
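A minimal sketch of this multi-k search with scikit-learn (assumed available); the inertia reported by the library is the within-cluster sum of squares discussed above, and the "best" k is usually taken where its decrease flattens out rather than at the global minimum.

```python
# Minimal multi-k (elbow-style) search sketch; scikit-learn assumed available.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=10).fit(X)
    print(f"k = {k}: SSE = {km.inertia_:.4f}")
# The SSE always decreases as k grows (it reaches 0 when every instance is
# its own centroid), which is exactly why the raw MSE/SSE cannot be used
# on its own to pick the number of clusters.
```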

2.2 CONCLUSIONS

Even though the conclusions of this work are mentioned throughout every section, there are some brief concepts to keep in mind when working with a clustering algorithm and, more precisely, with K-means.

The quality of the data is very important. If the data is not analyzed and filtered, there will be noise in the system and, in that case, it will be impossible to perform a perfect clustering process. Some attributes have better predictive power than others; if there is previous knowledge or statistics about this, only the best ones should be considered, otherwise noise will affect the clustering abilities of the machine.

As the algorithm is based on the reduction of the MSE, K-means is prone to getting stuck in local minima. That means that, sometimes, K-means will not deliver the best results, not only because of the effect of noise but also because a local minimum has been reached. As happens with genetic algorithms, restarts are recommended to avoid this situation and to perform an additional search for a better solution.

As the initial centroid positions are assigned randomly, each initialization should be done with different centroid positions. To ensure this, Weka offers the possibility of modifying the seed of the pseudo-random initialization; different seeds can lead to different minima.

The number of clusters selected as an input to the algorithm also has an effect, since there is an optimal number of clusters for the clustering process. If the desired number of clusters is unknown, an in-depth study with several runs involving different features, seeds and numbers of clusters must be performed to search for the optimal configuration. All of these runs should be based on previous knowledge of the system (if any) or, if no such information is available, on the MSE. Other algorithms, such as DBSCAN, have the ability to obtain the optimal number of clusters from the data; they can be used as a first step to determine this information for the K-means algorithm.

The MSE is not always the best way to measure the capabilities of the machine. Even when the amount of data that is misclassified after performing a classes-to-clusters evaluation is high, the MSE can be small. This effect is produced by the variance: if there are no outliers and all the data points are clustered near each other, the error value can be small even when the clustering abilities are bad. The best option is to perform several tests to determine the optimal configuration.

For more detailed analysis, please refer to the main body of this paper.


Bibliography

Hamerly, G., & Drake, J. (2015). Accelerating Lloyd's Algorithm for k-Means Clustering. In M. E. Celebi (Ed.), Partitional Clustering Algorithms (p. 37). Springer International Publishing.
Introduction to K-means Clustering. (6 December 2016). DataScience.com. Retrieved May 2018 from https://www.datascience.com/blog/k-means-clustering
Kaushik, S. (3 November 2016). An Introduction to Clustering and different methods of clustering. Analytics Vidhya. Retrieved May 2018 from https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
Kempt, J. (6 May 2011). GMM with EM. Mr. Joel Kemp: tech articles and poetry. Retrieved May 2018 from http://mrjoelkemp.com/2011/05/gaussian-mixture-models-with-expectation-maximization/
OnMyPhD. (2017). Karush-Kuhn-Tucker (KKT) conditions. Retrieved April 2018 from http://www.onmyphd.com/?p=kkt.karush.kuhn.tucker
Politecnico di Milano, Dipartimento di Elettronica. (2018). K-Means Clustering. In A Tutorial on Clustering Algorithms. Retrieved from https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
Seif, G. (5 February 2018). The 5 Clustering Algorithms Data Scientists Need to Know. Towards Data Science (Medium). Retrieved May 2018 from https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
