… a type of classification imposed on a finite set of objects [3]. The relationship between objects is represented in a proximity matrix in which the rows represent the n e-mails and the columns correspond to the terms given as dimensions. If objects are represented as patterns, or points in a d-dimensional metric space, the proximity measure can be the Euclidean distance between a pair of points. Unless a meaningful measure of distance, or proximity, between a pair of objects is established, no meaningful cluster analysis is possible.

Clustering is useful in many applications, such as decision making, data mining, text mining, machine learning, grouping, pattern classification, and intrusion detection. Clustering also helps in detecting outliers and in examining small clusters [1]. The proximity matrix is used in this context and thus serves as a useful input to the clustering algorithm. A cluster of n patterns can be represented by m points, typically with m < n, leading to data compression; the centroid can be used for this purpose [3]. This helps in prototype selection for efficient classification. The clustering algorithms are applied separately to the training sets belonging to two different classes to obtain their corresponding cluster representatives. Clustering proceeds in different stages [2].

Clustering is a technique to partition a dataset into homogeneous clusters such that the data points in the same cluster are more similar to each other than to points in different clusters, whereas classification labels new, unknown data using a collection of labelled, pre-classified data. Clustering is generally known as unsupervised learning, whereas classification is known as supervised learning [1]. The term unsupervised learning describes an algorithm that examines a set of points without any corresponding class/cluster labels [1, 5, 7]. In many real, practical applications, such as bioinformatics, medicine, and pattern recognition, a far larger amount of unlabelled data is available than labelled data. Generating labelled data is a lengthy and slow process, and labelling all of the data for a supervised method is tedious work [7]. Therefore, one who wishes to use a large dataset without labelling it should employ semi-supervised learning. Semi-supervised learning is a technique for learning from a combination of labelled and unlabelled data, and it can be used for both classification and clustering [4]. Semi-supervised classification uses labelled data along with some unlabelled data to train the classifier, whereas semi-supervised clustering uses some labelled class data, or pairwise constraints, along with the unlabelled data to obtain better clustering [3]. There are several semi-supervised classification algorithms, such as co-training, transductive support vector machines (SVMs), and Expectation Maximization, that use unlabelled data to improve classification accuracy. The advantage of semi-supervised clustering is that the data categories (clusters) can be generated from initial labelled data and then extended and modified to reflect other regularities in the data [1, 7, 9].

2. RELATED WORK

Clustering [2, 11], an unsupervised technique, is the process of organizing objects into groups such that similarity within the same cluster is maximized and similarity among different clusters is minimized. In many real-world problems, clustering with equal weights for each attribute does not provide the desired results, since different attributes have different significance levels [6]. Nevertheless, many clustering algorithms assign the same weight to all attributes, even though attributes rarely have equal importance in real-world problems. Weighted k-means clustering is considered the popular solution for handling such problems [9]. To introduce different weights for different attributes, the parametric Minkowski model [3] is used as the weighting scheme in the weighted k-means clustering algorithm. In the parametric Minkowski model, the distance function is defined by a weighted version of the Minkowski distance measure; the parameters of this model are the weights in the different dimensions [10].
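In this weighted form, with attribute weights w_l and Minkowski parameter p (the standard formulation; the paper's exact notation may differ slightly), the distance between two points x and y is:

$$ d_w(x, y) = \left( \sum_{l=1}^{d} w_l \, \lvert x_l - y_l \rvert^{p} \right)^{1/p} $$

Setting p = 2 and w_l = 1 for all l recovers the ordinary Euclidean distance, so the model strictly generalizes the unweighted k-means setting.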
The PC-Kmeans algorithm is similar to COP-Kmeans; the main difference is that PC-Kmeans may violate the constraints, paying some penalty as a trade-off for doing so [4]. It tries to arrive at a good cluster formation while minimizing the penalty it incurs. A major limitation of this approach is that it assumes a single metric for all clusters, preventing them from having different shapes. Bilenko et al. [10] proposed the metric pairwise constrained k-means (MPCK-Means) algorithm to remove this limitation. MPCK-Means has been one of the most popular semi-supervised clustering algorithms in the recent past; therefore, the proposed approach is compared with MPCK-Means in this paper [7].

The proposed approach, based on the Hyperlink-Induced Topic Search (HITS) algorithm, is introduced to overcome the limitations of this earlier work. Using the proposed approach, the weights for the attributes are generated automatically from the data for weighted k-means under the parametric Minkowski model. The necessary preliminaries, i.e., the parametric Minkowski model and the HITS algorithm, are described in the next section of the paper [9].
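HITS itself is the standard hub/authority power iteration. As a reminder of its mechanics, the sketch below runs it on a generic adjacency matrix; this is the textbook algorithm, not the paper's specific construction over attributes:

```python
import numpy as np

def hits(adjacency: np.ndarray, iterations: int = 100):
    """Standard HITS power iteration: returns (hub, authority) scores."""
    n = adjacency.shape[0]
    hubs = np.ones(n)
    authorities = np.ones(n)
    for _ in range(iterations):
        # Authority score: sum of hub scores of nodes pointing at the node.
        authorities = adjacency.T @ hubs
        authorities /= np.linalg.norm(authorities)
        # Hub score: sum of authority scores of the nodes pointed at.
        hubs = adjacency @ authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, authorities
```

Scores of this kind are what allow attribute weights to be derived automatically from the data rather than supplied by the user.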
The clustering algorithms can be classified into generative and discriminative. In the generative case, a parametric form of data generation is assumed, and the goal in the maximum-likelihood formulation is to find the parameters that maximize the probability (likelihood) of generating the data given the model [9].
Discriminative approaches, by contrast, try to cluster the data so as to maximize within-cluster similarity and minimize between-cluster similarity with respect to a particular similarity metric; it is not necessary to assume an underlying parametric data-generation model. Both families can be implemented using Expectation Maximization and k-means [7]. k-means is a flat, partitioning clustering algorithm: it divides the data points into k partitions, or clusters, by grouping similar features (usually under the Euclidean metric) and assigning each point to the cluster whose mean value on a set of x variables is nearest to it on that set. Furthermore, semi-supervised clustering algorithms fall into similarity-based (distance-based, or partially-labelled-data) methods and search-based (pair-constraint-based) methods [5]. The former classify the unlabelled data into the appropriate clusters using the known clusters, whereas the latter consider must-link constraints, which require that two observations be placed in the same cluster, and cannot-link constraints, which require that two observations not be placed in the same cluster [1]. Another difference between the two is that a distance-based method uses a traditional clustering algorithm, such as k-means, together with a similarity metric, while in search-based approaches the clustering algorithm itself is modified so that user-provided labels or constraints bias the search for an appropriate partitioning [4].
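To make the constraint semantics concrete, the following sketch is a generic COP-Kmeans-style check (an illustration, not taken from the paper): it rejects any assignment that would break a must-link or cannot-link constraint.

```python
def violates_constraints(point, cluster, assignment, must_link, cannot_link):
    """Return True if assigning `point` to `cluster` breaks a constraint.

    assignment: dict mapping already-assigned points to cluster ids;
    must_link / cannot_link: collections of point pairs (tuples).
    """
    for a, b in must_link:
        other = b if a == point else a if b == point else None
        # A must-link partner already placed elsewhere forbids this cluster.
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == point else a if b == point else None
        # A cannot-link partner already in this cluster forbids it too.
        if other is not None and other in assignment and assignment[other] == cluster:
            return True
    return False
```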
3.1 SEMI HARD CLUSTERING (SHC)

In Semi Hard Clustering, t_ij takes its value from {0, 1}.
The objective function for hard clustering can be re-written as follows:
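In the standard hard-assignment form, assuming squared Euclidean distance (the paper's exact expression may differ in its constant terms), the objective is:

$$ Q = \sum_{j=1}^{k} \sum_{i=1}^{n} t_{ij}\, \lVert u_i - c_j \rVert^{2} \; + \; \text{terms independent of } t_{ij} \text{ and } c_j $$

where t_ij ∈ {0, 1} and t_ij = 1 if and only if point u_i is assigned to cluster j.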
Minimizing Q is therefore equivalent to minimizing the first term [9]. The centroid update formula is:
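In the standard k-means form, this update recomputes each centroid as the mean of the points currently assigned to its cluster:

$$ c_j = \frac{\sum_{i=1}^{n} t_{ij}\, u_i}{\sum_{i=1}^{n} t_{ij}} $$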
3.2 CLUSTER WEIGHTING (CW)

The proposed objective function should also take into account how informative the shared feature set is. I use the weight w_j to reflect the importance of the shared feature set in terms of discriminating cluster j from the other clusters [9]. The concept of feature weighting has been used in clustering by Frigui et al. [7]. Let D(d, j) denote the discriminating ability of the shared feature set F_d for cluster j:
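A ratio matching the description below, written here in one plausible standard form (the averaging and normalization are an assumption), is:

$$ D(d, j) = \frac{\dfrac{1}{k-1} \sum_{j' \neq j} d(c_j, c_{j'}, F_d)}{\dfrac{1}{\lvert C_j \rvert} \sum_{u_i \in C_j} d(u_i, c_j, F_d)} $$

where C_j is the set of points assigned to cluster j: the numerator averages the distance from centroid c_j to the other centroids, and the denominator averages the distance from the members of cluster j to c_j.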
where d(u_i, c_j, F_d) is the distance between u_i and centroid c_j based on the feature set F_d. Intuitively, D(d, j) is the ratio of the between-cluster distance to the within-cluster distance [7]. The higher the ratio, the more informative the feature set is in terms of discriminating cluster j from the other clusters.
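One plausible form of the weight, scaling the discriminating ability by the fraction of shared features (an assumed reconstruction, consistent with the definitions of p and d that follow), is:

$$ w_j = \frac{p}{d}\, D(d, j) $$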
where p is the number of shared features and d is the total number of features in U.

SeSProC uses the available instance label information to guide the clustering of the unlabeled instances [4]. Based on this information, the model learning process can be divided into two learning parts: the labelled instances are correctly classified into the known classes {1 ... C} (classification term), and the unlabeled instances can be grouped either into those known classes or into other, unknown components {C+1 ... K} (clustering term) [9]. The closing steps of the algorithm are:

11: Use the new labelled set with an attribute clustering method to provide clusters of attributes F
12: until No new instances added
13: Output: The set of clusters of attributes and the classifier trained in the final iteration
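As a sketch of how the two learning parts interact (an illustrative reconstruction under a squared-Euclidean assumption, not SeSProC's actual EM equations), labelled instances stay pinned to their known classes while unlabelled ones may join any of the K components:

```python
import numpy as np

def assign(points, labels, centroids):
    """One hard assignment step: labelled points keep their known class,
    unlabelled points (label == -1) go to the nearest of all K centroids."""
    assignment = np.empty(len(points), dtype=int)
    for i, (x, y) in enumerate(zip(points, labels)):
        if y >= 0:   # classification term: the known label is trusted
            assignment[i] = y
        else:        # clustering term: known or newly discovered component
            assignment[i] = np.argmin(((centroids - x) ** 2).sum(axis=1))
    return assignment
```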
5. EXPERIMENTAL RESULTS

… belonging to two classes. The dataset contains 351 objects in total, which are all 34-dimensional. Soybean: it is collected from Michalski's famous …

5.1 BALANCE DATASET RESULTS

Balance Dataset
Algorithm   Accuracy   Precision   Recall   F-Measure
SESPROC     70.92      90.67       91.89    85.78
SSAC        80.67      96.67       70.78    88.67
SSE         90.78      78.76       82.54    90.89

The Accuracy of the SSE algorithm is 90.78, which is higher compared to the other four (SHC, CW, SESPROC, SSAC) algorithms. The Precision of the SSAC algorithm is 96.67, which is higher compared to the other four (SHC, CW, SESPROC, SSE) algorithms. The Recall of the CW algorithm is 94.78, which is higher compared to the other four (SSE, SHC, SESPROC, SSAC) algorithms. The F-Measure of the SSE algorithm is 90.89, which is higher compared to the other four (SHC, CW, SESPROC, SSAC) algorithms.
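These scores assume the standard definitions of the reported metrics, with TP, FP, and FN the true-positive, false-positive, and false-negative counts:

$$ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$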
Ionosphere Dataset
Algorithm   Accuracy   Precision   Recall   F-Measure
SHC         79.45      88.91       84.77    88.89
CW          74.91      90.08       90.78    70.56
SESPROC     80.98      76.67       72.89    85.78
SSAC        88.67      70.67       77.78    90.67
SSE         90.56      83.45       88.34    75.89

[Figure: Accuracy, Precision, Recall, and F-Measure of the compared algorithms on the Soybean dataset.]
6. CONCLUSION

… for incremental update of clusters that can be applied in a semi-supervised way. The effectiveness of the proposed approach over the existing clustering algorithms has been illustrated using UCI machine learning repository datasets and compared with popular clustering algorithms such as SSE and SSAC. The proposed clustering approach produces better results even compared with a widely used semi-supervised clustering algorithm like SESPROC [4]. It can be applied to a large range of practical problems, since in most real-world problems the attributes do not have equal weights, and the weights are either unknown or hard to obtain. The approach can play an important role in a wide variety of clustering problems, especially where the attributes of a dataset do not have equal weights [3].

I have proposed a semi-supervised method, called SeSProC, capable of discovering unknown clusters, based on the EM algorithm and including a LFSS. This algorithm includes the available information in the search for subspaces and clusters. Besides, SeSProC has two major advantages over related algorithms. The first one is that my proposal has only one, easily adjustable input parameter [7]. Whereas other algorithms are unable to find a final solution without proper parameter tuning, SeSProC always obtains a clustering solution regardless of the value of the input parameter. The second advantage is related to the known labels. SeSProC is able to find hidden clusters that are not represented by the labelled instances. It uses a novel greedy process to find these clusters, assuming that the instances that fit the known clusters worst are candidates for initializing new clusters [9].
REFERENCES

[6] … Using Nonnegative Matrix Factorization, In Proc. of ICDM, 2009.
[7] Miller D, Chu-Fang L, Kesidis G, and Collins, Semi-supervised mixture modelling with fine-grained component-conditional class labelling and transductive inference, In: IEEE International Workshop on Machine Learning for Signal Processing, pp. 1–6, 2010.
[8] X. Zhu, Semi-supervised learning literature survey, Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
[9] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, Morgan & Claypool Publishers, 2009.
[10] X. Zhu, Z. Ghahramani, and J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, In Proceedings of the International Conference on Machine Learning, pp. 912–919, 2003.
[11] B. Kulis, S. Basu, I. S. Dhillon, and R. J. Mooney, Semi-supervised graph clustering: a kernel approach, In: Proc. of the International Conference on Machine Learning (ICML), pp. 457–464, 2011.
[12] S. Basu, M. Bilenko, and R. J. Mooney, A probabilistic framework for semi-supervised clustering, In KDD '04: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 59–68, ACM Press, 2004.