
1 INTRODUCTION

1.1 Introduction:
Clustering is a fundamental process in engineering and in various fields of scientific research. It tries to group a set of points into clusters such that points in the same cluster are more homogeneous to each other than points in different clusters. Document clustering groups documents based on the similarity among them in an unsupervised manner. It is used in quick topic extraction, filtering and information retrieval. We are facing an ever increasing volume of text documents: vast collections of documents in repositories, digital libraries and digitized personal information such as articles and emails. These have brought challenges for the effective and efficient organization of text documents.
There is no single optimization method known to solve all optimization problems. Many optimization methods have been developed in recent years for solving different types of optimization problems. The modern optimization methods (sometimes called nontraditional optimization methods) are powerful and popular methods for solving complex engineering problems.

These methods are genetic algorithms, neural networks, particle swarm optimization, ant colony optimization, artificial immune systems, and fuzzy optimization.


The Particle Swarm Optimization algorithm (abbreviated as PSO) is a population-based stochastic search algorithm and an alternative solution to complex non-linear optimization problems. The PSO algorithm was first introduced by Kennedy and Eberhart in 1995, and its basic idea was originally inspired by simulation of the social behavior of animals such as bird flocking and fish schooling. It is based on the natural process of group communication to share individual knowledge when a group of birds or insects searches for food or migrates in a search space, even though none of the birds or insects knows where the best position is. From the nature of this social behavior, if any member finds a desirable path, the rest of the members will follow it quickly.
The PSO algorithm learns from this animal behavior to solve optimization problems. In PSO, each member of the population is called a particle and the population is called a swarm. The algorithm starts with a randomly initialized population, with each particle moving in a randomly chosen direction.
This thesis also considers a meta-heuristic called Tabu Search and discusses its features. Tabu search is one of the most efficient heuristics for finding quality solutions in relatively short running time. Its principal characteristic is a mechanism inspired by human memory: the information stored in memory is used to guide and restrict the future search so as to obtain quality solutions and to overcome local optimality. This thesis provides insight into the working of the tabu search algorithm on document clustering problems and its merging with the other optimization technique.
A particle swarm optimization (PSO) method has been proposed for solving the economic dispatch (ED) problem in power systems. Many nonlinear characteristics of the generator, such as ramp rate limits, prohibited operating zones, and nonsmooth cost functions, are considered by that method in practical generator operation. The feasibility of the method is demonstrated for three different systems, and it is compared with the GA method in terms of solution quality and computational efficiency. The experimental results show that the PSO method was indeed capable of obtaining higher quality solutions efficiently in ED problems.

Applications of Tabu Search (TS), a heuristic method originally proposed by Glover in 1986, to various combinatorial problems have appeared in the
operations research literature. In several cases, the methods
described provide solutions very close to optimality and are among
the most effective, if not the best, to tackle the difficult problems at
hand. These successes have made TS extremely popular among
those interested in finding good solutions to the large combinatorial
problems encountered in many practical settings. Several papers,
book chapters, special issues and books have surveyed the rich TS
literature (a list of some of the most important references is
provided in a later section). In spite of this abundant literature,
there still seem to be many researchers who, while they are eager
to apply TS to new problem settings, find it difficult to properly
grasp the fundamental concepts of the method, its strengths and its
limitations, and to come up with effective implementations. The
purpose of this paper is to address this situation by providing an
introduction in the form of a tutorial focusing on the fundamental
concepts of TS. Throughout the paper, two relatively straightforward, yet challenging and relevant, problems will be used


to illustrate these concepts: the Classical Vehicle Routing Problem
(CVRP) and the Capacitated Plant Location Problem (CPLP). These
will be introduced in the following section. The remainder of the
paper is organized as follows. The basic concepts of TS (search
space, neighborhoods, and short-term tabu lists) are described and
illustrated in Section 2. Intermediate, yet critical, concepts, such as
intensification and diversification, are described in Section 3. This is
followed in Section 4 by a brief discussion of advanced topics and
recent trends in TS, and in Section 5 by a short list of key references
on TS and its applications. Section 6 provides practical tips for
newcomers struggling with unforeseen problems as they first try to
apply TS to their favorite problem. Section 7 concludes the paper
with some general advice on the application of TS to combinatorial
problems.

Tabu search (TS) has its antecedents in methods designed to


cross boundaries of feasibility or local optimality treated as barriers
in classical procedures, and to systematically impose and release
constraints to permit exploration of otherwise forbidden regions
(Glover, 1977). The tabu search name and terminology comes from
Glover (1986). A distinguishing feature of the approach is its use of
adaptive memory and special associated problem-solving strategies.
(TS provides the origin of the memory-based and strategy-intensive
focus in the metaheuristic literature, as opposed to methods that
are memory-less or use only a weak inheritance-based memory. It is
also responsible for emphasizing the use of structured designs to
exploit historical patterns of search, as opposed to processes that
rely almost exclusively on randomization.)
The fundamental principles of tabu search were elaborated in a series of papers in the late 1980s and early 1990s, and have been assembled in the book Tabu Search (Glover and Laguna, 1997).
The remarkable successes of tabu search for solving hard optimization problems (especially those arising in real world applications) have caused an explosion of new TS applications in the last several years.
The tabu search philosophy is to derive and exploit a
collection of intelligent problem solving strategies, based on implicit
and explicit learning procedures. The adaptive memory framework
of TS not only involves the exploitation of the history of the problem-solving process, but also entails the creation of structures to make
such exploitation possible. Problem-solving history extends to
experience gained from solving multiple instances of a problem
class by joining TS with an associated learning approach called
Target Analysis (see, e.g., chapter 9 of Glover and Laguna, 1997). TS
is an iterative procedure designed for the solution of optimization
problems. TS starts with a random solution and evaluates the fitness
function for the given solution. Then all possible neighbors of the
given solution are generated and evaluated. A neighbor is a solution
which can be reached from the current solution by a simple, basic
transformation. If the best of these neighbors is not in tabu list then
pick it to be the new current solution. The tabu list keeps track of
previously explored solutions and prohibits TS from revisiting them
again. Thus, if the best neighbor solution is worse than the current design, TS will go uphill. In this way, local minima can be overcome. Any reversal of these solutions or moves is then a forbidden move and is classified as tabu. Some aspiration criteria, which allow overriding of tabu status, can be introduced if a tabu move is still found to lead to a better fitness than that of the current optimum. If no more neighbors are present (all are tabu), or when no improvements are found during a predetermined number of iterations, the algorithm stops. Otherwise, the algorithm continues the TS procedure.
Engineering and technology have been continuously providing
examples of difficult optimization problems. In this talk we shall
present the tabu search technique which with its various ingredients
may be viewed as an engineer designed approach: no clean proof of
convergence is known but the technique has shown a remarkable
efficiency on many problems. The roots of tabu search go back to
the 1970's; it was first presented in its present form by Glover
[Glover, 1986]; the basic ideas have also been sketched by Hansen
[Hansen 1986]. Additional efforts of formalization are reported in
[Glover, 1989], [de Werra & Hertz, 1989], [Glover, 1990]. Many
computational experiments have shown that tabu search has now
become an established optimization technique which can compete
with almost all known techniques and which - by its flexibility - can
beat many classical procedures. Up to now, there is no formal
explanation of this good behavior. Recently, theoretical aspects of
tabu search have been investigated [Faigle & Kern, 1992], [Glover,
1992], [Fox, 1993]. A didactic presentation of tabu search and a
series of applications have been collected in a recent book [Glover,
Taillard, Laguna & de Werra, 1992]. Its interest lies in the fact that success with tabu search often implies that a serious modeling effort be done from the beginning. The applications in [Glover, Taillard, Laguna & de Werra, 1992] provide many such examples together with a collection of references. A huge collection of optimization techniques has been suggested by researchers from different fields; countless refinements have made these techniques work on specific types of applications. All these
procedures are based on some common ideas and are furthermore
characterized by a few additional specific features. Among the
optimization procedures the iterative techniques play an important
role: for most optimization problems no procedure is known in
general to directly obtain an "optimal" solution. The general step of an iterative procedure consists in constructing from a current solution i a next solution j and in checking whether one should stop there or perform another step. Neighbourhood search methods are iterative procedures in which a neighbourhood N(i) is defined for each feasible solution i, and the next solution j is searched among the solutions in N(i).
Non-linear optimization problems are defined by non-linearity
constraints and/or non-linearity objective. These problems are
considered in several domains, including chemical engineering,
energy analysis, environmental planning, biotechnology and thermal
processes, among others. Different techniques and methods are
employed to model and solve these problems. A literature survey
shows that the most used techniques are evolutionary algorithms
[1, 2], swarm optimization [6] and non-linear mathematical programming [15]. Leyffer and Mahajan (2010) present a survey of


non-linearly constrained software and methods, focusing on the
contrasting strategies of local optimization and global optimization
[15]. Some of those approaches such as Genetic algorithms are
reported to require a lot of parameters and to entail considerable
effort to implement. In the thermal engineering field, many complex
optimization problems arise in practice. Recently, non-linear optimization problems have increasingly been subjected to analysis


by non-traditional optimization techniques. Patel and Rao [17, 18]
recommend the use of particle swarm optimization (PSO) based on
case studies showing that PSO is simple in concept, requires few
parameters, is easy to implement and performs well compared to
traditional techniques like genetic algorithms [17, 18]. The PSO
method has produced good outcomes for a variety of optimization
problems, but many authors have pointed out a limitation in its
ability to diversify the population (see [8, 24]). To deal with this
problem, research efforts are underway on several fronts to hybridize the PSO method with other meta-heuristics. The most commonly used methods to create PSO hybrids are genetic algorithms and differential evolution algorithms


[24]. For global optimization, a PSO-TS hybrid algorithm which joins
PSO with tabu search (TS) has been proposed in [11]. More recently,
Shelokar et al. hybridized PSO with an ant colony algorithm for continuous optimization [21]. In this work, we focus on a thermal
optimization problem known as the T-junction problem, which
consists in designing the main channel in electrical machines
responsible for evacuating generated heat. The objective is to
determine the ideal channel features that optimize the temperature
in the system. This problem, identified through a collaborative
industrial project, can be formulated as a constrained non-linear
optimization problem (CNOP). The fitness function used to evaluate
solutions of this problem takes extensive computation time. The use
of meta-heuristics like genetic algorithms in this case has proved to
be very time consuming. We apply the PSO meta-heuristic to solve
the problem due to its simple implementation and the limited
number of parameters to adjust, as well as for the ability to control
its fitness function effectively. To avoid premature convergence of
our method, a tabu search procedure is embedded within the PSO.
High-density DNA microarrays are one of the most powerful tools for
functional genomic studies and the development of microarray
technology allows for measuring expression levels of thousands of
genes simultaneously (Schena et al., 1995). Recent studies have
shown that one of the most important applications of microarrays is
tumor classification (Cho et al., 2003; Li et al., 2004). Gene selection
is an important component for gene expression-based tumor
classification systems. Microarray experiments generate large datasets with expression values for thousands or even tens of thousands of genes but no more than a few tissue samples. Most of
the genes monitored in microarray may be irrelevant to analysis and
the use of all the genes may potentially inhibit the prediction
performance of classification rule by masking the contribution of the
relevant genes (Li, 2006; Li and Yang, 2002; Stephanopoulos et al.,
2002; Nguyen and Rocke, 2002; Biceiato et al., 2003; Tan et al.,

2004). An efficient way to solve this problem is gene selection, and the selection of discriminatory genes is critical to improving the accuracy and decreasing computational complexity and cost. By
selecting relevant genes, conventional classification techniques can
be applied to the microarray data. Gene selection may highlight
those relevant genes and it could enable biologists to gain
significant insight into the genetic nature of the disease and the
mechanisms responsible for it (Guyon et al., 2002; Wang et al.,
2005). Several gene selection techniques have been employed in classification problems, such as the t-test filtering approach, as well as
some artificial intelligence techniques such as genetic algorithms
(GAs), evolution algorithms (EAs) (Golub et al., 1999; Furey et al.,
2000; Xiong et al., 2001; Peng et al., 2003; Li et al., 2005; Tibshirani
et al., 2002; Sima and Dougherty, 2006), simulated annealing, tabu
search and particle swarm optimization. Particle swarm optimization
(PSO) algorithm (Kennedy and Eberhart, 1995; Shi and Eberhart,
1998; Clerc and Kennedy, 2002) was proposed by James Kennedy and R.C. Eberhart in 1995, motivated by the social behavior of organisms such as bird flocking and fish schooling.
Particle swarm optimization comprises a very simple concept, and
can be implemented in a few lines of computer code. It requires only a few parameters to adjust, and is computationally inexpensive in terms of both memory requirements and speed. A modified discrete PSO algorithm has been proposed in our previous study (Shen et al., 2004a,b, in press) to reduce dimensionality and has shown satisfactory performance. Although PSO has proved to be a potent search technique for solving optimization problems, there are still many complex
situations where the PSO tends to converge to local optima and
does not perform particularly well. Tabu search (TS) is a powerful
optimization procedure that has been successfully applied to a
number of combinatorial optimization problems (Glover, 1986). It has the ability to avoid convergence to local minima by employing a flexible memory system. But the convergence speed of TS depends on the initial solution, and the parallelism of the PSO population would help the TS find the promising regions of the search space very quickly. In this paper, we develop a hybrid PSO and TS (HPSOTS)
approach for gene selection for tumor classification. The incorporation of TS as a local improvement procedure enables the
algorithm HPSOTS to leap over local optima and show satisfactory performance. The formulation and corresponding programming flow chart are presented in detail in the paper. To evaluate the
performance of HPSOTS, the proposed approach is applied to three
publicly available microarray datasets. Moreover, we compare the
performance of HPSOTS on these datasets to that of stepwise
selection, the pure TS and PSO algorithm. It has been demonstrated
that the HPSOTS is a useful tool for gene selection and mining high
dimension data.

1.2 Motivation:
PSO performs excellently in global search but not so well in local search; meanwhile, TS performs excellently in local search but not so well in global search. Therefore, in this thesis we combine the two algorithms so that the new hybrid algorithm conducts both global search and local search in every iteration, and the probability of finding the optimal points significantly increases. However, to the best of the author's knowledge, TSPSO has not been used to cluster text documents. In this study a document clustering algorithm based on TSPSO is proposed.

1.3 Thesis Overview:


This thesis involves clustering documents into categories using optimization algorithms. Initially we start with a data matrix obtained from the text documents after preprocessing steps. This data matrix is represented with each row as a document vector and each column as the weight of a significant term. The data matrix is provided as an input to the AMOC algorithm for finding the k value, and the produced k value is given to the PSO, TS and TSPSO algorithms to cluster the documents. The results obtained from the above process are compared using the obtained VRC values and also their time complexities.

1.4 Clustering
A general definition of clustering is stated by Brian Everitt et al. [6]: given a number of objects or individuals, each of which is described by a set of numerical measures, devise a classification scheme for grouping the objects into a number of classes such that objects within classes are similar in some respect and unlike those from other classes. The number of classes and the characteristics of each class are to be determined. The clustering problem can be formalized as an optimization problem, i.e. the minimization or maximization of a function subject to a set of constraints. The goal of clustering can be defined as follows:
Given
I. a dataset X = {x1, x2, ..., xn}
II. the desired number of clusters k
III. a function f that evaluates the quality of clustering
we want to compute a mapping
γ : {1, 2, ..., n} → {1, 2, ..., k}
that minimizes the function f subject to some constraints. The function f that evaluates the clustering quality is often defined in terms of similarity between objects; it is also called the distortion function or divergence. The similarity measure is the key input to a clustering algorithm.
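To make this formalization concrete, the following minimal Python sketch evaluates a candidate mapping under one common choice of f, the sum of squared Euclidean distances to cluster centroids (the function name and this choice of f are illustrative assumptions, not part of the definition above):

import numpy as np

def clustering_objective(X, mapping, k):
    # f: sum of squared Euclidean distances of each object to its cluster centroid
    total = 0.0
    for j in range(k):
        members = X[mapping == j]
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# usage: six points in 2-D and a candidate mapping into k = 2 clusters
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
mapping = np.array([0, 0, 0, 1, 1, 1])
print(clustering_objective(X, mapping, k=2))  # small value: compact clusters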

1.5 Document clustering


Clustering of documents is used to group documents into relevant topics. The major difficulty in document clustering is its high dimensionality, which requires efficient algorithms that can solve high dimensional clustering. Document clustering is a major topic in the information retrieval area; examples include search engines. The basic steps used in the document clustering process are shown in figure 2. The goal of a document clustering scheme is to minimize intra-cluster distances between documents, while maximizing inter-cluster distances (using an appropriate distance measure between documents). A distance measure (or, dually, similarity measure) thus lies at the heart of document clustering. The large variety of documents makes it almost impossible to create a general algorithm which can work best for all kinds of datasets.

Figure 2: Flow diagram of the basic steps in text clustering

Preprocessing
The text document preprocessing basically consists of a process to strip all formatting from the article, including capitalization, punctuation, and extraneous markup (like the dateline and tags). Then the stop words are removed. Stop words (i.e., pronouns, prepositions, conjunctions etc.) are the words that don't carry semantic meaning. Stop words can be eliminated using a list of stop words; this greatly reduces the amount of noise in the text collection, as well as making the computation easier. Removing stop words leaves us with a condensed version of the documents containing content words only. The next process is to stem the words. Stemming is the process of reducing derived words to their root form. For English documents, a popularly known algorithm called the Porter stemmer [7] is used. The performance of text clustering can be improved by using the Porter stemmer.
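A minimal sketch of this preprocessing pipeline in Python is shown below. It assumes the NLTK library (with its stop word corpus downloaded) for the stop word list and the Porter stemmer; any equivalent list or stemmer would serve:

import re
from nltk.corpus import stopwords      # assumes nltk stop word data is downloaded
from nltk.stem import PorterStemmer

def preprocess(text):
    # strip formatting: lower-case the text and keep only alphabetic tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    # eliminate stop words using a stop word list
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]
    # stem each remaining content word to its root form
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The clusters were being clustered quickly!"))
# -> ['cluster', 'cluster', 'quickli']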

Document Representation
Preprocessing is done to represent the data in a form that can be used for clustering. There are many ways of representing documents, like the vector space model, graphical model etc. [11].
Vector Space Model
Vector Space Model (VSM) can be the simplest level of document representation in clusters from [18]. Given a document collection, any word present in the collection is counted as a dimension. If there are in total d separate words, each document is treated as a d-dimensional vector, whose coordinate values are the frequencies of appearance of the words in that document. Consequently, this vector is very high dimensional but extremely sparse, because a collection normally contains so many documents that only a tiny portion of the words actually belongs to an individual document.
This representation model treats words as independent
entities, completely ignoring the structural information inside
documents, such as syntax and meaningful relationship between
words or between sentences. Recently, many efforts have been
made to find a better way of representing text documents. As mentioned, sparsity is a problem of VSM. A document vector has so many unrelated dimensions that may hide its actual meaning. Researchers have tried to make use of the semantic relatedness of words, or to find some sort of concepts, instead of words, to represent documents. Nevertheless, the simplicity of VSM facilitates fast computation, and at the same time provides sufficient numerical and statistical information. Hence, it is the common model used in most of the clustering algorithms nowadays.
The weights assigned to each term can be either the term frequency (tf) or the term frequency-inverse document frequency (tf-idf). In the first case, the frequency of occurrence for a term in a document is included in the vector dtf = (tf1, tf2, ..., tfm), where tfi is the frequency of the ith term in the document. Usually, very common words are removed and the terms are stemmed. A refinement to this weighting scheme is the so-called tf-idf weighting scheme. In this approach, a term that appears in many documents should not be regarded as more important than one that appears in few documents, and for this reason it needs to be deemphasized.

Figure 2.3: Vector space model


Figure 2.3 illustrates the vector space model. After preprocessing the document dataset we have the list of words which are common across the documents. This word list is then used as the dimensions along which the documents are represented as vectors. Documents share many common words in the dataset, so the number of dimensions is high. Figure 2.3 shows three terms (TERM1, TERM2, TERM3) that are common to three documents (DOC1, DOC2, DOC3); the three terms are considered as dimensions of the space, and the documents are then drawn in that space as vectors.
Let N be the total number of documents in the collection; let dfi (document frequency) be the number of documents in which the term ki appears, and freqi,j be the raw frequency of the term ki in the document dj. The inverse document frequency (idfi) for ki is defined as:

idfi = log(N/dfi)    (2.1)

The tf-idf weight of term i in document j is computed by:

wij = freqij × log(N/dfi)    (2.2)

To account for documents of different length, each vector is normalized so that it is of unit length.
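Equations (2.1) and (2.2), together with the unit length normalization, can be sketched in Python as follows (the toy corpus and the function name are our own illustrative assumptions):

import math

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one {term: weight} vector per document.
    N = len(docs)
    df = {}                                            # df_i: documents containing term k_i
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        vec = {}
        for term in doc:
            vec[term] = vec.get(term, 0) + 1           # raw frequency freq_ij
        vec = {t: f * math.log(N / df[t]) for t, f in vec.items()}  # w_ij = freq_ij * log(N/df_i)
        norm = math.sqrt(sum(w * w for w in vec.values()))          # normalize to unit length
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

docs = [["cluster", "document"], ["cluster", "swarm"], ["swarm", "particle"]]
print(tfidf_vectors(docs)[0])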
The main advantages of the Vector Space Model (VSM) are:
The documents are sorted by decreasing similarity with the query q.
The terms are weighted by importance.
It allows for partial matching: the documents need not have exactly the same terms as the query.
One disadvantage of VSM is that the terms are assumed to be independent. Moreover, weighting is intuitive and not very formal.

Dimension reduction techniques


Dimension reduction can be divided into feature selection and feature extraction. Feature selection is the process of selecting smaller subsets (features) from a larger set of inputs, while feature extraction transforms the high dimensional data space into a space of low dimension. The goal of dimension reduction methods is to allow fewer dimensions for broader comparisons of the concepts contained in a text collection.

Similarity Measurement
Accurate clustering requires a precise definition of the closeness between a pair of objects, in terms of either their pairwise similarity or distance. Before clustering, a similarity/distance measure must be determined. The measure reflects the degree of closeness or separation of the target objects and should correspond to the characteristics that are believed to distinguish the clusters embedded in the data. In many cases, these characteristics are dependent on the data or the problem context at hand, and there is no measure that is universally best for all kinds of clustering problems.
Moreover, choosing an appropriate similarity measure is also crucial for cluster analysis, especially for a particular type of clustering algorithm. For example, density-based clustering algorithms, such as DBSCAN, rely heavily on the similarity computation. Density-based clustering finds clusters as dense areas in the data set, and the density of a given point is in turn estimated as the closeness of the corresponding data object to its neighboring objects. Recalling that closeness is quantified as the distance/similarity value, we can see that a large number of distance/similarity computations are required for finding dense areas and estimating the cluster assignment of new data objects. Therefore, understanding the effectiveness of different measures is of great importance in helping to choose the best one.
In general, similarity/distance measures map the distance or similarity between the symbolic descriptions of two objects into a single numeric value, which depends on two factors: the properties of the two objects and the measure itself. Four such measures [23] are discussed below.
Euclidean Distance
Euclidean distance is a popular similarity measure used in data clustering. The distance between two documents di and dj is calculated as

D(di, dj) = (Σt (wt,i − wt,j)^2)^(1/2)    (2.3)

It is used in the traditional k-means algorithm [2]. The objective of k-means is to minimize the Euclidean distance between the objects of a cluster and that cluster's centroid:

min Σj Σdi∈cj ||di − μj||^2    (2.4)

where μj denotes the centroid of cluster cj.
Cosine Similarity
When documents are represented as term vectors, the similarity of two documents corresponds to the correlation between the vectors. This is quantified as the cosine of the angle between vectors, the so-called cosine similarity. Cosine similarity is one of the most popular similarity measures applied to text documents, for example in numerous information retrieval applications [11] and in the clustering toolkit from [13]. An important property of the cosine similarity is its independence of document length. The similarity of two document vectors di and dj, Sim(di, dj), is defined as the cosine of the angle between them. For unit vectors, this equals their inner product:

Sim(di, dj) = cos(di, dj) = (di · dj) / (|di| |dj|)    (2.5)

The cosine measure is used in a variant of k-means called spherical k-means [4]. While k-means aims to minimize Euclidean distance, spherical k-means intends to maximize the cosine similarity between the documents in a cluster and that cluster's centroid:

max Σj Σdi∈cj cos(di, μj)    (2.6)
Jaccard Coefficient
The Jaccard coefficient, which is sometimes referred to as the Tanimoto coefficient, measures similarity as the intersection divided by the union of the objects. For text documents, the Jaccard coefficient compares the summed weight of shared terms to the summed weight of terms that are present in either of the two documents but are not shared. Given non-unit document vectors ui, uj, their Jaccard coefficient is:

Sim(ui, uj) = (ui · uj) / (|ui|^2 + |uj|^2 − ui · uj)    (2.7)
Pearson Correlation Coefficient
Correlation clustering, introduced by Bansal, Blum and Chawla [9], provides a method for clustering a set of objects into the best possible number of clusters without specifying that number in advance. Correlation clustering does not require a bound on the number of clusters that the data is partitioned into. Rather, correlation clustering in the paper [10] divides the data into the optimal number of clusters based on the similarity between the data points. In their paper [9], Bansal et al. discuss two objectives of correlation clustering: minimizing disagreements and maximizing agreements between clusters.
The normalized Pearson correlation is defined as:

Sim(di, dj) = Σt (wt,i − w̄i)(wt,j − w̄j) / ((Σt (wt,i − w̄i)^2)^(1/2) (Σt (wt,j − w̄j)^2)^(1/2))    (2.8)

where w̄i denotes the average feature value of di over all dimensions.
In [20] Strehl et al. compared four measures: Euclidean,
Cosine, Pearson correlation and Extended Jaccard, and concluded
that cosine and extended Jaccard are the best ones for web documents.
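For reference, the four measures above can be sketched in a few lines of Python (NumPy is assumed, and the function names are ours):

import numpy as np

def euclidean(di, dj):                      # equation (2.3)
    return np.sqrt(((di - dj) ** 2).sum())

def cosine(di, dj):                         # equation (2.5)
    return di @ dj / (np.linalg.norm(di) * np.linalg.norm(dj))

def jaccard(di, dj):                        # equation (2.7), Tanimoto form
    dot = di @ dj
    return dot / ((di ** 2).sum() + (dj ** 2).sum() - dot)

def pearson(di, dj):                        # equation (2.8), normalized correlation
    ci, cj = di - di.mean(), dj - dj.mean()
    return (ci @ cj) / (np.linalg.norm(ci) * np.linalg.norm(cj))

u = np.array([1.0, 2.0, 0.0]); v = np.array([2.0, 1.0, 1.0])
print(euclidean(u, v), cosine(u, v), jaccard(u, v), pearson(u, v))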

1.6 Clustering Applications


Clustering is the most common form of unsupervised
learning and is a major tool in a number of applications in many
fields of business and science. Here we summarize the basic directions in which clustering is used.
Finding Similar Documents: This feature is often used when the user has spotted one good document in a search result and wants more like it. The interesting property here is that clustering is able to discover documents that are conceptually alike, in contrast to search-based approaches that are only able to discover whether the documents share many of the same words.
Organizing Large Document Collections: Document retrieval
focuses on finding documents relevant to a particular query, but it
fails to solve the problem of making sense of a large number of
uncategorized documents. The challenge here is to organize these
documents in a taxonomy identical to the one humans would create
given enough time and use it as a browsing interface to the original
collection of documents.
Duplicate Content Detection: In many applications there is a need to find duplicates or near-duplicates in a large number of documents. Clustering is employed for plagiarism detection, for grouping related news stories and for reordering search results rankings (to assure higher diversity among the topmost documents). Note that in such applications the description of clusters is rarely needed.

Recommendation System: In this application the user is recommended articles based on the articles the user has already read. Clustering of the articles makes this possible in real time and improves the quality a lot.
Search Optimization: Clustering helps a lot in improving the quality and efficiency of search engines, as the user query can first be compared to the clusters instead of being compared directly to the documents, and the search results can also be arranged easily.

1.7 Challenges in Document Clustering


Document clustering has been studied for many decades, but it is still far from a trivial or solved problem. The challenges are:
1. Selecting appropriate features of the documents that should be
used for clustering.
2. Selecting an appropriate similarity measure between documents.
3. Selecting an appropriate clustering method utilizing the above
similarity measure.

4. Implementing the clustering algorithm in an efficient way that makes it feasible in terms of required memory and CPU resources.
5. Finding ways of assessing the quality of the performed clustering.
Furthermore, with medium to large document collections (10,000+ documents), the number of term-document relations is fairly high (millions+), and the computational complexity of the algorithm applied is thus a central factor in whether it is feasible for real-life applications. If a dense matrix is constructed to represent term-document relations, this matrix could easily become too large to keep in memory: e.g. 100,000 documents × 100,000 terms = 10^10 entries ≈ 40 GB using 32-bit floating point values. If the vector model is applied, the dimensionality of the resulting vector space will likewise be quite high (10,000+). This means that simple operations, like finding the Euclidean distance between two documents in the vector space, become time consuming tasks.
PARTITIONAL CLUSTERING
Partitional clustering algorithms seek maximum similarity within clusters and minimum similarity between clusters. The most popular partition based clustering algorithm is the k-means algorithm, because of its easy implementation and simplicity. But its main drawback is that it is difficult to predict the K value. To overcome this drawback we use Automatic Merging of Optimal Clusters (AMOC). The aim of AMOC is to automatically generate optimal clusters for the given datasets. AMOC is an extension of k-means with a two phase iterative procedure that merges clusters and validates the result in order to find the optimal clusters automatically.
Let X = {X1, X2, ..., Xm} be a set of m objects, where every individual object Xi is represented as [xi1, xi2, ..., xin] and n is the number of attributes. The algorithm takes kmax as the upper bound on the number of clusters. It iteratively merges the cluster having the lowest probability with its nearest cluster and validates the merging result using the Rand Index.

Steps:
1. Initialize kmax to the square root of the total number of objects.
2. Randomly assign kmax objects as the centroids of the clusters.
3. Find the clusters by using k-means.
4. Calculate the intra-cluster distances.
5. Find the cluster that has minimal probability and combine it with its closest cluster. Recalculate the centroids and decrease the number of clusters by one.
6. Whenever step 5 has been executed for every cluster, go to step 7; otherwise go to step 5.
7. If there is no change in the number of clusters, stop; otherwise go to step 2.
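The following is a deliberately simplified Python sketch of this merge loop. It uses scikit-learn's KMeans for steps 2-3 and a cluster size heuristic for step 5; the full AMOC instead validates every merge with the Rand Index, which is omitted here:

import numpy as np
from sklearn.cluster import KMeans   # assumed available

def amoc_simplified(X):
    k = max(2, int(np.sqrt(len(X))))                  # step 1: kmax = sqrt(m)
    while True:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)   # steps 2-3
        sizes = np.bincount(labels, minlength=k)
        # step 5 (simplified): merge away a cluster of unusually low probability;
        # the full AMOC validates each candidate merge with the Rand Index
        if k == 2 or sizes.min() >= 0.5 * sizes.mean():
            return labels, k                          # step 7: k has stabilized
        k -= 1                                        # merging two clusters reduces k by one

labels, k = amoc_simplified(np.random.rand(100, 4))
print(k)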

Criterion Function
The frequently used partitional clustering criterion function is the Variance Ratio Criterion (VRC). It is formulated as:

VRC = (B / W) × ((n − k) / (k − 1))    (1)

Here B and W denote the between-cluster and within-cluster variations, respectively. They are defined as:

W = Σj=1..k Σi=1..nj (oij − ōj)T (oij − ōj)    (2)

B = Σj=1..k nj (ōj − ō)T (ōj − ō)    (3)

where nj denotes the cardinality of the cluster cj, oij denotes the ith object assigned to the cluster cj, ō denotes the n-dimensional vector of overall sample means (data centroid), and ōj denotes the n-dimensional vector of sample means within the jth cluster (cluster centroid). The between-cluster variation has (k − 1) degrees of freedom and the within-cluster variation has (n − k) degrees of freedom.
As a consequence, compact and separated clusters are assumed to have minimum values of W and maximum values of B. Hence, the better the data partition, the greater the value of VRC. The normalization term (n − k)/(k − 1) prevents the ratio from increasing monotonically with the number of clusters, thus making VRC an optimization (maximization) criterion.
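A direct Python translation of equations (1)-(3) is sketched below (our own code, written for clarity rather than speed):

import numpy as np

def vrc(X, labels, k):
    # Variance Ratio Criterion: (B/W) * ((n-k)/(k-1)); larger is better
    n = len(X)
    o_bar = X.mean(axis=0)                         # data centroid
    W = B = 0.0
    for j in range(k):
        members = X[labels == j]
        oj_bar = members.mean(axis=0)              # cluster centroid
        W += ((members - oj_bar) ** 2).sum()       # within-cluster variation, eq. (2)
        B += len(members) * ((oj_bar - o_bar) ** 2).sum()   # between-cluster variation, eq. (3)
    return (B / W) * ((n - k) / (k - 1))

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
print(vrc(X, np.array([0, 0, 1, 1]), k=2))   # compact, well-separated -> large VRC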

PSO
Particle swarm optimization (PSO) is a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. PSO optimizes a problem by maintaining a population of candidate solutions (particles) and moving these particles around in the search space according to simple mathematical formulae for each particle's position and velocity:
vid = w·vid + c1·rand1·(pid − xid) + c2·rand2·(pgd − xid)    (4)

xid = xid + vid    (5)

where w is the inertia weight factor; pid is the position at which the particle achieved its local best value; pgd is the position that achieved the overall (global) best value; c1 and c2 are constants called acceleration coefficients; d is the dimension of the search domain; and rand1, rand2 are random values uniformly distributed in the interval [0, 1].
Each particle's movement is affected by its local best position and is also guided toward the best known positions in the search space, which are updated as better positions are found by other particles. This moves the swarm toward the best position. A step by step overview of the PSO clustering algorithm is given below:
Step 1. Initialize the population randomly.
Step 2. Perform the following for each particle:
(a) Update the velocity and position of the particle using equations (4) and (5) to generate the next solution.
(b) Compute the fitness value using the fitness function (1).
Step 3. Repeat step 2 until one of the conditions below is fulfilled:
(a) The number of iterations performed reaches the maximum.
(b) The average change in fitness values is negligible.
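A minimal Python sketch of this loop, implementing updates (4) and (5) on a toy fitness function, is shown below (the coefficient values w = 0.72 and c1 = c2 = 1.49 are commonly used settings, assumed here for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_particles, dim = 20, 5
w, c1, c2 = 0.72, 1.49, 1.49           # assumed typical coefficient values

x = rng.random((n_particles, dim))     # particle positions
v = np.zeros((n_particles, dim))       # particle velocities
pbest = x.copy()                       # each particle's best known position p_id

def fitness(p):                        # stands in for the VRC fitness, eq. (1)
    return -((p - 0.5) ** 2).sum()     # toy objective: maximize closeness to 0.5

pbest_val = np.array([fitness(p) for p in x])
gbest = pbest[pbest_val.argmax()]      # swarm's best known position p_gd

for _ in range(100):
    r1, r2 = rng.random((2, n_particles, dim))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # eq. (4)
    x = x + v                                                   # eq. (5)
    vals = np.array([fitness(p) for p in x])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = x[improved], vals[improved]
    gbest = pbest[pbest_val.argmax()]

print(gbest)    # converges near [0.5, 0.5, 0.5, 0.5, 0.5]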

Tabu search
Fred Glover proposed an approach in 1986, called Tabu Search, to allow Local Search (LS) methods to overcome local optima. The main concept of TS is to pursue LS whenever it reaches a local optimum by allowing non-improving moves. What distinguishes tabu search from other meta-heuristic approaches is the notion of the tabu list: a record of previously visited solutions, including disallowed moves. Because short term memory is used, the list stores a few attributes of solutions instead of whole solutions, and revisiting those solutions is not permitted.
Steps:
Step 1. Create an initial solution x.
Step 2. Initialize the Tabu List.
Step 3. While the set X of candidate solutions is not complete:
Step 3.1. Compute a candidate solution x′ from the present solution x.
Step 3.2. Add x′ to X if x′ is not tabu or if at least one aspiration criterion is satisfied.
Step 4. Select the best candidate solution x* in X.
Step 5. If fitness(x) < fitness(x*) then x = x*.
Step 6. Update the Tabu List.
Step 7. If the termination criterion is reached, finish; otherwise go to step 3.
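These steps translate into the following Python sketch; the bit flip neighborhood, the maximization form and the tabu tenure are illustrative assumptions:

import random

def tabu_search(init, fitness, neighbors, max_iter=100, tenure=10):
    x = best = init
    tabu = []                                    # step 2: the tabu list
    for _ in range(max_iter):                    # step 7: termination criterion
        candidates = [n for n in neighbors(x)
                      if n not in tabu or fitness(n) > fitness(best)]  # step 3.2 + aspiration
        if not candidates:
            break
        x = max(candidates, key=fitness)         # step 4: best candidate, even if non-improving
        if fitness(x) > fitness(best):           # step 5 (maximization form)
            best = x
        tabu.append(x)                           # step 6: update the tabu list
        tabu = tabu[-tenure:]                    # short term memory
    return best

# usage: maximize the number of ones in a 10-bit string via single bit flips
flip = lambda s: [s[:i] + ('1' if s[i] == '0' else '0') + s[i+1:] for i in range(len(s))]
start = ''.join(random.choice('01') for _ in range(10))
print(tabu_search(start, fitness=lambda s: s.count('1'), neighbors=flip))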

TSPSO:
In this section we introduce the TSPSO algorithm, which combines the PSO technique with TS. Particle swarm optimization (PSO) is a computational method used to optimize a result by iteratively trying to improve a candidate solution with regard to a given measure of quality. It is a meta-heuristic method: it makes few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. It does not use the gradient of the problem being optimized, which means PSO does not require the optimization problem to be differentiable, as is required by classic optimization methods such as quasi-Newton methods and gradient descent. PSO can also be used on optimization problems that are noisy, asymmetric, changing over time, etc. However, PSO suffers from the following two drawbacks: I) it is easily confined to local minima; II) it costs too much time to converge, especially in a complex high dimensional space.
When such a local optimum is found, all the particles become situated at the same local minimum, and it is then practically impossible for the particles to move and search further because of the velocity update equation. To overcome this problem, we propose a hybrid approach which combines PSO and Tabu Search (TS), considering that TS belongs to the class of local search techniques. We merge PSO with TS to use the exploration abilities of both algorithms and to let each compensate for the flaws of the other.
The flow chart of TSPSO is shown in fig. 2. The TSPSO steps are listed below:

Steps:
Step 1. Initialize the population randomly.
Step 2. Compute the fitness function (1) for each particle.
Step 3. Randomly divide the population into two halves:
a) One half of the population is updated by PSO, i.e. the position and velocity of each particle are updated.
b) The other half of the population is updated by TS, which searches for the local best solution for each particle.
Step 4. Merge the two half populations, and update the pbest and gbest particles and the tabu list (TL).
Step 5. Iterate Steps 2-4 until the termination condition is reached.
Step 6. Output the result.
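A skeleton of this hybrid loop is sketched below, reusing the PSO update of equations (4)-(5) and a TS style neighborhood step; the random split, the neighborhood generator and the tabu tenure are our assumptions about one reasonable reading of the steps:

import numpy as np

rng = np.random.default_rng(1)

def tspso(fitness, n_particles=20, dim=5, iters=50, w=0.72, c1=1.49, c2=1.49):
    x = rng.random((n_particles, dim)); v = np.zeros_like(x)   # step 1
    pbest = x.copy(); pbest_val = np.array([fitness(p) for p in x])  # step 2
    gbest = pbest[pbest_val.argmax()]
    tabu = []                                          # tabu list (TL)
    for _ in range(iters):
        half = rng.permutation(n_particles)            # step 3: random split
        pso_idx, ts_idx = half[:n_particles // 2], half[n_particles // 2:]
        # step 3a: PSO update (equations (4) and (5)) on one half
        r1, r2 = rng.random((2, len(pso_idx), dim))
        v[pso_idx] = (w * v[pso_idx] + c1 * r1 * (pbest[pso_idx] - x[pso_idx])
                      + c2 * r2 * (gbest - x[pso_idx]))
        x[pso_idx] += v[pso_idx]
        # step 3b: TS style local step on the other half: best non-tabu neighbor
        for i in ts_idx:
            nbrs = x[i] + 0.1 * rng.standard_normal((8, dim))
            nbrs = [n for n in nbrs if not any(np.allclose(n, t) for t in tabu)]
            if nbrs:
                x[i] = max(nbrs, key=fitness)
                tabu = (tabu + [x[i].copy()])[-25:]    # bounded short-term memory
        # step 4: merge halves, update pbest and gbest
        vals = np.array([fitness(p) for p in x])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmax()]
    return gbest                                       # step 6

print(tspso(lambda p: -((p - 0.5) ** 2).sum()))        # toy fitness; optimum near 0.5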

2. Literature Reviews
Tabu-KM: A Hybrid Clustering Algorithm Based on
Tabu Search Approach
Abstract
The clustering problem under the criterion of minimum sum of squares is a non-convex and non-linear program which possesses many locally optimal values, with the result that its solution often falls into these traps and therefore cannot converge to the globally optimal solution. In this paper, an efficient hybrid optimization algorithm called Tabu-KM is developed for solving this problem. It combines the optimization property of tabu search and the local search capability of the k-means algorithm. The contribution of the proposed algorithm is to produce a tabu space for escaping the trap of local optima and finding better solutions effectively. The Tabu-KM algorithm is tested on several simulated and standard datasets and its
performance is compared with k-means, simulated
annealing, tabu search, genetic algorithm, and ant colony
optimization algorithms. The experimental results on
simulated and standard test problems demonstrate the robustness and efficiency of the algorithm and confirm
that the proposed method is a suitable choice for solving
data clustering problems.
Introduction
Clustering is an important process in engineering and
other fields of scientific research. It is the process of
grouping patterns into a number of clusters, each of which
contains the patterns that are similar to each other
according to a specified similarity measure. Clustering is a
sequential process, which takes data as a raw material
and produces clusters as a result without any
predetermined goal [16]. To analyze the clusters, the
objects are represented by points in N-dimensional space,
where the coordinates of the vector are the values of the object's attributes, and the objective is to classify
these points into K clusters such that a certain similarity
measure is optimized. Corresponding author. M. Yaghini
Email: yaghini@iust.ac.ir Paper first received April. 07.
2010 ,and in revised form June. 24. 2010. We consider
clustering problem stated as follows: given N objects in ,
allocate each object to one of K clusters such that the sum
of squared Euclidean distances between each object and
the center of belonging cluster is minimized. The
clustering problem can be mathematically described as
follows: 2 1 1 ( , ) N K i j i j Min F W C wij x c = = =
(1) Where 1 1 K ij j w= = , i = 1,, N. If object xi
allocated to cluster Cj , then is equal to 1; otherwise is
equal to 0. In equation 1, N denotes the number of
objects, K denotes the number of clusters, X={x1,x2,
,xN} denotes the set of N objects, C ={c1,,cK}
denotes the set of K Clustering problem, Hybrid algorithm,
Tabu search algorithm, k-Means algorithm. September
2010, Volume 21, Number 2 International Journal of
Industrial Engineering & Production Research
http://IJIEPR.iust.ac.ir/ International Journal of Industrial
Engineering & Production Research (2010) pp. 71-79 ISSN:
2008-4889 72 M. Yaghini & N.Ghazanfari Tabu-KM: A

26

Hybrid Clustering Algorithm Based clusters, and W


denotes the 0-1 matrix. Cluster center cj is calculated as
follows: 1 , 1,..., i j j j j x c c x j k n = = (2) Where nj
denotes the number of objects belonging to cluster cj . It is
known that this clustering problem is a non-convex and
non-linear which possess many locally optimal values,
resulting that its solution often falls into these traps [24].
k-Means algorithm is one of the popular center based
algorithms [18] which proved to fail to convergence to a
local minimum under certain condition. The criterion it
uses minimizes the total mean squared distance from
each point in N to that points closest center in K. However
there are two main problems for k-means method [21] and
[19]. First is that the algorithm depends on the initial
states and the value of K. Second problem is that it is
easily converges to some local optima which is much
worse than the desired global optima solution. In this
paper, a new efficient algorithm is designed and
implemented based on tabu search approach for escaping
from local optima. The key idea of proposed algorithm is to
produce tabu space and select new center of cluster from
the objects not in tabu space. Then the k-means algorithm
is run to local search. This paper is organized as follows:
the tabu search approach for clustering and also related
works are reviewed in section 2. In section 3, we propose
the Tabu-KM algorithm and give detailed descriptions.
Section 4 presents experimental results with simulated
and standard datasets that show our method outperforms
some other methods. Finally, conclusions of the current
work are reported in section 5.
Conclusion
An effective hybrid clustering algorithm based on the tabu search approach, called Tabu-KM, is developed by integrating the tabu space and a move generator for restricting which objects may be selected as cluster centers. The Tabu-KM algorithm is used to escape the trap of local optima and find better solutions for the clustering problem under the criterion of minimum sum of squares. To produce the tabu space, two strategies are investigated: the spherical space around the center of a cluster with fixed or dynamic radius. In addition, three different strategies are discussed to select objects as the center of a new cluster and generate a feasible solution: (1) move to the closest object to the center of the initial k-means cluster, (2) move to the closest object to the center of the current cluster, (3) move to the closest object to the center of the best-so-far cluster. All the above-mentioned strategies were investigated. According to the results, the dynamic
A Survey on K-mean Clustering and Particle Swarm
Optimization
Abstract
In data mining, clustering is an important research topic with a wide range of unsupervised classification applications. Clustering is a technique which divides data into meaningful groups. K-means is one of the popular clustering algorithms; it is widely used to minimize the squared distance between feature values of two points residing in the same cluster. Particle swarm optimization is an evolutionary computation technique which finds optimum solutions in many applications. Using PSO to optimize the clustering components yields more precise clustering efficiency. In this paper, we present a comparison of K-means clustering and particle swarm optimization.
Introduction
Clustering is a technique which divides data objects into
groups based on the information found in data that
describes the objects and relationships among them, their
feature values which can be used in many applications,
such as knowledge discovery, vector quantization, pattern
recognition, data mining, data dredging, etc. [1]. There
are mainly two techniques for clustering: hierarchical
clustering and partitioned clustering. Data are not
partitioned into a particular cluster in a single step, but a
series of partitions takes place in hierarchical clustering,
which may run from a single cluster containing all objects
to n clusters each containing a single object. And each
cluster can have sub clusters, so it can be viewed as a
tree, a node in the tree is a cluster, the root of the tree is the cluster containing all the objects, and each node,


except the leaf nodes, is the union of its children. But in
partitioned clustering, the algorithms typically determine
all clusters at once, it divides the set of data objects into
non-overlapping clusters, and each data object is in
exactly one cluster. Particle swarm optimization (PSO) has
gained much attention, and it has been applied in many
fields [2]. PSO is a useful stochastic optimization algorithm
based on population. The birds in a flock are represented
as particles, and particles are considered as simple agents flying through a problem area. In the multidimensional problem space, a particle's location can represent a solution to the problem. But the PSO may
lack global search ability at the end of a run due to the
utilization of a linearly decreasing inertia weight and PSO
may fail to find the required optima when the problem to
be solved is too complicated and complex. K-means is the
most widely used and studied clustering algorithm. Given
a set of n data points in real d-dimensional space (Rd),
and an integer k, the clustering problem is to determine a
set of k points in Rd, the set of points is called cluster
centres, the set of n data points are divided into k groups
based on the distance between them and cluster centres.
The k-means algorithm is flexible and simple, but it has some limitations: the cluster result mainly depends on the selection of the initial cluster centroids, and it may converge to a local optimum [3]. However, the same initial cluster centre in a data space will always generate the same cluster results, so if a good cluster centre can always be obtained, the k-means will work well.
Conclusion
From the study of k-means clustering and particle swarm optimization we can say that k-means depends on the initial conditions, which may cause the algorithm to converge to suboptimal solutions. On the other side, particle swarm optimization is less sensitive to the initial conditions due to its population based nature, so particle swarm optimization is more likely to find a near optimal solution.
Cluster Analysis by Variance Ratio Criterion and
Firefly Algorithm


Abstract
In order to solve the cluster analysis problem more
efficiently, we presented a new approach based on firefly
algorithm (FA). First, we created the optimization model
using the variance ratio criterion (VRC) as fitness function.
Second, FA was introduced to find the maximal point of
the VRC. The experimental dataset contains 400 data of 4
groups with three different levels of overlapping degrees:
non-overlapping, partial overlapping, and severely
overlapping. We compared the FA with genetic algorithm
(GA) and combinatorial particle swarm optimization
(CPSO). Each algorithm was run 20 times. The results show that FA can find the largest VRC values among all three algorithms, while costing the least time. Therefore, FA is effective and rapid for the cluster analysis problem.
Introduction
Cluster analysis is the assignment of a set of observations into subsets without any a priori knowledge, so that observations in the same cluster are more similar to each other than to those in other clusters [1]. Clustering is a method
of unsupervised learning, and a common technique for
statistical data analysis used in many fields [2], including
machine learning [3], data mining [4], pattern recognition
[5], image analysis [6], medical image classification [7],
and bioinformatics [8]. Cluster analysis can be achieved
by various algorithms that differ significantly. Those
methods can be basically classified into four categories: I.
Hierarchical Methods. They find successive clusters using
previously established clusters. They can be further
divided into the agglomerative methods and the divisive
methods [9]. Agglomerative algorithms start with one-point clusters and recursively merge two or more of the most
appropriate clusters [10]. Divisive algorithms begin with
the whole set and proceed to divide it into successively
smaller clusters [11]. II. Partition Methods. They generate
a single partition of data with a specified or estimated
number of non overlapping clusters, in an attempt to
recover natural groups present in the data [12]. III.
Density-based Methods. They are devised to discover


arbitrary-shaped clusters. In this approach, a cluster is


regarded as a region in which the density of data objects
exceeds a threshold. DBSCAN [13] is the typical algorithm
of this kind. IV. Subspace Methods. They look for clusters
that can only be seen in a particular projection (subspace,
manifold) of the data. These methods thus can ignore
irrelevant attributes [14]. In this study, we focus our
attention on Partition Clustering methods. The K-means
clustering [15] and the fuzzy c-means clustering (FCM)
[16] are two typical algorithms of this type. They are
iterative algorithms and the solution obtained depends on
the selection of the initial partition and may converge to a
local minimum of criterion function value if the initial
partition is not properly chosen [17]. Branch and bound
algorithm was proposed to find the global optimum
clustering. However, it takes too much computation time
[18]. In the last decade, evolutionary algorithms were
proposed to clustering problem since they are not
sensitive to initial values and able to jump out of local
minimal points. For example, Lin et al. [19] pointed out that k-anonymity has been widely adopted as a model for
protecting public released microdata from individual
identification. Their work proposed a novel genetic
algorithm-based clustering approach for k-anonymization.
Their proposed approach adopted various heuristics to
select genes for crossover operations. Experimental
results showed that their approach can further reduce the
information loss caused by traditional clustering-based k-anonymization techniques. Chang et al. [20] proposed a
new clustering algorithm based on genetic algorithm (GA)
with gene rearrangement (GAGR), which in application
may effectively remove the degeneracy for the purpose of
a more efficient search. They used a new crossover
operator that exploited a measure of similarity between
chromosomes in a population. They also employed
adaptive probabilities of crossover and mutation to
prevent the convergence of the GAGR to a local optimum.
Using the real-world data sets, they compared the


performance of GAGR clustering algorithm with K-means


algorithm and other GA methods. Their experiment results
demonstrated that the GAGR clustering algorithm had
high performance, effectiveness and flexibility. Agard et al.
[21] pointed out defining an efficient bill of materials for a
family of complex products was a real challenge for
companies, largely because of the diversity they offered to
consumers. Their solution is to define a set of components
(called modules), each of which contained a set of primary
functions. An individual product was then built by
combining selected modules. The industrial problem led,
in turn, to a complex optimization problem. They solved
the problem via a simulated annealing method based on a
clustering approach. Jarboui et al. [12] presented a new
clustering approach based on the combinatorial particle
swarm optimization (CPSO) algorithm. Each particle was
represented as a string of length n (where n is the number
of data points), and the ith element of the string denoted
the group number assigned to object i. An integer vector
corresponded to a candidate solution to the clustering
problem. A swarm of particles was initiated and flew
through the solution space for targeting the optimal
solution. To verify the efficiency of the proposed CPSO
algorithm, comparisons with a genetic algorithm were
performed. Computational results showed that their
proposed CPSO algorithm was very competitive and
outperformed the genetic algorithm. Niknam et al. [22]
observed that the k-means algorithm depends highly on the
initial state and converges to a local optimum solution.
Therefore, they presented a new hybrid evolutionary
algorithm to solve nonlinear partitional clustering problem.
Their proposed hybrid evolutionary algorithm was the
combination of FAPSO (fuzzy adaptive particle swarm
optimization), ACO (ant colony optimization) and k-means
algorithms, called FAPSO-ACO-K, which can find better
cluster partition. The performance of their proposed
algorithm was evaluated through several benchmark data
sets. Their simulation results showed that the performance
of the proposed algorithm was better than other
algorithms such as PSO, ACO, simulated annealing (SA),
combination of PSO and SA (PSO-SA), combination of ACO
and SA (ACO-SA), combination of PSO and ACO (PSO-ACO),
genetic algorithm (GA), Tabu search (TS), honey bee
mating optimization (HBMO) and k-means for the partitional
clustering problem. However, none of those aforementioned
algorithms yet performed ideally. They sometimes
converged too slowly, or even converged to local minima,
which led to wrong solutions. Recently, the firefly
algorithm (FA) is a hot nature-inspired technique and has
been used for solving nonlinear multimodal optimization
problems in dynamic environment [23]. The algorithm is
based on the behavior of the fireflies. In social insect
colonies, each firefly seems to have its own agenda, and yet
the group as a whole appears to be highly organized.
Scholars have published an immense literature reporting that its
performance, effectiveness, and robustness are superior
to GA, PSO, and other global algorithms in a wide range of
fields [23, 24]. The structure of the rest of the paper was
organized as follows. Section 2 defined the partitional
problem, and gave the encoding strategy and clustering
criterion. Section 3 introduced the firefly algorithm.
Experiments in section 4 contained three types of artificial
data with different overlapping degrees. Section 5 was
devoted to conclusions and future work.
Conclusion
We first investigated the optimization model, including both
the encoding strategy and the criterion function of VRC.
Afterwards, an FA algorithm was introduced for solving the
model. Experiments on three types of artificial data with
different overlapping degrees all demonstrated that the FA is
more robust and costs less time than either GA or CPSO.
Future work contains the following points: 1) Develop a
method that can determine the number of clusters
automatically; 2) Use more benchmark data to test the FA;
3) Apply our FA to practical clustering problems, including
mathematics [30], face estimation [31], image
segmentation [32], image registration [33], image
classification [34], UCAV path planning [35], and
prediction [36].
Document Clustering: The Next Frontier
Introduction
The proliferation of documents, on both the Web and in
private systems, makes knowledge discovery in document
collections arduous. Clustering has been long recognized
as a useful tool for the task. It groups like-items together,
maximizing intra-cluster similarity and inter-cluster
distance. Clustering can provide insight into the make-up
of a document collection and is often used as the initial
step in data analysis. While most document clustering
research to date has focused on moderate length single
topic documents, real-life collections are often made up of
very short or long documents. Short documents do not
contain enough text to accurately compute similarities.
Long documents often span multiple topics that general
document similarity measures do not take into account. In
this paper we will first give an overview of general
purpose document clustering, and then focus on recent
advancements in the next frontier in document clustering:
long and short documents.
Conclusion
This chapter primarily focused on reviewing some
recently developed text clustering methods that are
specifically suited for long and for short document
collections. These types of document collections introduce
new sets of challenges. Long documents are by their nature
multi-topic and as such the underlying document
clustering methods must explicitly focus on modeling
and/or accounting for these topics. On the other hand,
short documents often contain domain-specific
vocabulary, are very noisy, and their proper
modeling/understanding often requires the incorporation
of external information. We strongly believe research in
clustering long and short documents is in its early stages
and many new methods will be developed in the years to
come. Moreover, many real datasets are not only
composed of standard, long, or short documents, but
rather documents of mixed length. Current scholarship
lacks studies on these types of data. Since different
methods are often used for clustering standard, long, or
short documents, new methods or frameworks should be
investigated that address mixed collections. Traditional
document clustering is also faced with new challenges.
Today's very large, high-dimensional document collections


often lead to multiple valid clustering solutions.
Subspace/projective clustering approaches [67], [82] have
been used to cope with high dimensionality when
performing the clustering task. Ensemble clustering [40]
and multiview/alternative clustering approaches [58], [91],
which aim to summarize or detect different clustering
solutions, have been used to manage the availability of
multiple, possibly alternative clusterings for a given
dataset. Relatively little work has been done so far in
document clustering research to take advantage of
lessons learned from these methods. Integrating
subspace/ensemble/multi-view clustering with topic
models or segmentation may lead to developing the next-generation clustering methods specialized for the
document domain. Some topics that we have only briefly
touched on in this article are further detailed in other
chapters of this book. Other topics related to clustering
documents, such as semisupervised clustering, stream
document clustering, parallel clustering algorithms, and
kernel methods for dimensionality reduction or clustering,
were left for further study. Interested readers may consult
document clustering surveys by Aggarwal and Zhai [3],
Andrews and Fox [9], and Steinbach et al.
Discrete PSO with GA Operators for Document
Clustering
Abstract
The paper presents Discrete PSO algorithm for document
clustering problems. This algorithm is hybrid of PSO with
GA operators. The proposed system is based on a
population-based heuristic search technique, which can be
used to solve combinatorial optimization problems,
modeled on the concepts of cultural and social rules
derived from the analysis of the swarm intelligence (PSO)
with GA operators such as crossover and mutation. In
standard PSO the non-oscillatory route can quickly cause a
particle to stagnate, and it may prematurely converge
on suboptimal solutions that are not even guaranteed to be
locally optimal. In this paper a modification strategy
is proposed for the particle swarm optimization (PSO)
algorithm and applied to the document corpus. The
strategy adds reproduction by using crossover and
mutation operators when the stagnation in movement of
the particle is identified. Reproduction has the capability
to achieve faster convergence and better solutions.
Experimental results are examined on the document corpus.
They demonstrate that the proposed DPSO algorithm
statistically outperforms the Simple PSO.
Introduction
Document clustering is an automatic grouping of text
documents into clusters so that documents within a
cluster have high similarity in comparison to one another,
but are dissimilar to documents in other clusters. Unlike
document classification [22], no labeled documents are
provided in clustering; hence, clustering is also known as
unsupervised learning. Document clustering is widely
applicable in areas such as search engines, web mining,
information retrieval and topological analysis. Document
clustering has become an increasingly important task in
analyzing huge numbers of documents distributed among
various sites. The challenging aspect is to analyze this
enormous number of extremely high dimensional
distributed documents and to organize them in such a way
that results in better search and knowledge extraction
without introducing much extra cost and complexity.
Clustering, in data mining, is useful to discover distribution
patterns in the underlying data. The K-means and its
variants [14][15] represent the category of partitioning
clustering algorithms that create a flat, non-hierarchical
clustering that consists of k clusters. The K-means
algorithm iteratively refines a randomly chosen set of k
initial centroids, minimizing the average distance (i.e.,
maximizing the similarity) of documents to their closest
(most similar) centroid. A common document clustering
method [1][19] first calculates the
similarities between all pairs of documents and then
clusters documents together if their similarity values are
above a given threshold. The common clustering
techniques are partitioning and hierarchical [11]. Most of
the document clustering algorithms can be classified into
these two groups. In this study, a document clustering
algorithm based on DPSO is proposed. The remainder of
this paper is organized as follows: Section II provides the
related works in document clustering using PSO. Section III
gives the overview of the PSO. The DPSO with GA
operators clustering algorithm is described in Section IV.
Section V presents the detailed experimental setup and
results for comparing the performance of the proposed
algorithm with the standard PSO (SPSO) and K-means
approaches.
Conclusion
The proposed system uses the vector space model for
document representation. The number of documents in
CISI is 1460, in Cranfield 1400 and in ADI 82. Each
particle in the swarm is represented by 2942 dimensions.
The advantages of the PSO are the very few parameters to
deal with and the large number of processing elements, the
so-called dimensions, which enable it to fly around the
solution space effectively. On the other hand, it converges
to a solution very quickly, which should be carefully dealt
with when using it for combinatorial optimization problems.
In this study, the proposed DPSO with GA operators
algorithm, developed for the much more complex, NP-hard
document clustering problem, is verified on the document
corpus. It is shown that it increases the performance of the
clustering, and the best results are derived from the
proposed technique. Consequently, the proposed
technique markedly increased the success of the
document clustering. The main objective of the paper is to
improve the fitness value of the problem. The fitness value
achieved by the standard PSO is low, since its stagnation
causes premature convergence. However, this can be
handled by the DPSO with the crossover and mutation
operators of the Genetic Algorithm, which tries to avoid the
stagnation behavior of the particles. The proposed system
does not always avoid the stagnation behavior of the
particles, but when it does, escaping stagnation is the
source of the improvement in the particles' positions.
3. System Design
3.1 Hardware and software specifications
H/W System Configuration:

Processor        : Pentium i5
Speed            : 2.3 GHz
RAM              : 4 GB
Hard Disk        : 500 GB
Keyboard         : Standard laptop keyboard
Mouse            : USB mouse

S/W System Configuration:

Operating System : Windows 10
Development tool : NetBeans 7.0.1
Language         : Java
Language Version : JDK 1.7
Technologies     : AWT, Swing
3.2 UML Diagrams


Use case diagram
There is only one actor; he can access the following functionality:
Reading the vector and feature files from the user
Applying AMOC for generating clusters
Applying Tabu Search for testing the TSPSO values
Applying PSO for testing the TSPSO values
Using TSPSO to solve document cluster analysis problems
more efficiently and quickly

[Use case diagram: the User is connected to the use cases Read Vector and Feature File, Apply AMOC, Apply Tabu Search, Apply PSO, Apply TSPSO, and View Results.]

Class Diagram:
Here MiningExecuter is the main class; it utilizes the
methods of OptionSelection. When a user invokes the
action function, the ResultForm is invoked, and the inputs
of the ResultForm are given to the DocClusteringModel.
Finally, the output class is executed with the inputs of
DocClustering; the StartUp class is responsible for creating
the DocClustering instance.

40

Sequence diagram
Here User is the main class. Whenever he wants to view the
datasets, he can view them by requesting them using the
vectors and features file. Similarly, he can optimize the
number of clusters present in the datasets using AMOC.
Whenever he wants to apply PSO for generating one of the
test cases for TSPSO, he can use it; he can apply TSPSO
for generating efficient clusters; and finally he can view all
the results when required.
[Sequence diagram: the User interacts with Read_dataset, Apply AMOC, Apply PSO and Apply TSPSO through the messages: 1 Browse Vector and Feature File; 2 Vector and Feature File Successfully Read; 3 Apply AMOC; 4 Optimized Number of Clusters Arrived; 5 Apply PSO; 6 PSO Applied; 7 Apply TSPSO; 8 TSPSO Applied; 9 View Results; 10 Results Viewed.]

Activity diagram
The behavior of the system in terms of activities is described
below. As the user initiates the process, he browses through
the files for selection of the features and vectors files. Then he
applies AMOC for generating clusters. Then he can apply
PSO for generating test sample 1 for TSPSO, then he can
apply TS for generating test sample 2. Finally these can be
used for the TSPSO test, and TSPSO is applied for generating
efficient clusters.

[Activity diagram: User -> Browse Files -> Apply AMOC -> Apply PSO -> Apply TS -> Apply TSPSO -> View Results.]

State chart diagram

This state chart describes the sequence of user
interactions. As the user initiates the process, he browses
through the files for selection of the features and vectors
files. Then he applies AMOC for generating clusters. Then
he can apply PSO for generating test sample 1 for TSPSO,
then he can apply TS for generating test sample 2. Finally
these can be used for the TSPSO test, and TSPSO is
applied for generating efficient clusters.

[State chart diagram: Browse Files -> Apply AMOC -> Apply PSO -> Apply TS -> Apply TSPSO -> View Results.]

Component Diagram:
The figure shows the various interactions of the user with
different components. The user interacts with the Read
Datasets component to browse the vectors and features files
and read them successfully; he can interact with the AMOC
component to apply it and get optimized clusters; he then
interacts with the PSO component to apply it and generate
samples; and similarly he can interact with the TSPSO
component to generate optimized clusters on success.

ALGORITHMS
1. PSO
Let S be the number of particles in the swarm, each
having a position xi in R^n in the search space and a
velocity vi in R^n. Let pi be the best known position of
particle i and let g be the best known position of the entire
swarm. A basic PSO algorithm is then:

For each particle i = 1, ..., S do:
  Initialize the particle's position with a uniformly
  distributed random vector: xi ~ U(blo, bup), where blo
  and bup are the lower and upper boundaries of the
  search space.
  Initialize the particle's best known position to its initial
  position: pi <- xi
  If f(pi) < f(g), update the swarm's best known position:
  g <- pi
  Initialize the particle's velocity: vi ~ U(-|bup - blo|, |bup - blo|)
Until a termination criterion is met (e.g. number of
iterations performed, or a solution with adequate
objective function value is found), repeat:
  For each particle i = 1, ..., S do:
    Pick random numbers: rp, rg ~ U(0, 1)
    For each dimension d = 1, ..., n do:
      Update the particle's velocity:
      vi,d <- w * vi,d + cp * rp * (pi,d - xi,d) + cg * rg * (gd - xi,d)
    Update the particle's position: xi <- xi + vi
    If f(xi) < f(pi) do:
      Update the particle's best known position: pi <- xi
      If f(pi) < f(g), update the swarm's best known position:
      g <- pi
Now g holds the best found solution.
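As an illustration of the update equations above, the following is a
minimal Java sketch of one velocity-and-position step for a single
particle. The array names x, v, p and g and the class name PsoStep are
assumptions made for illustration; the coefficients w = 0.72 and
cp = cg = 1.42 match the constants used in the sample code in Section
4.2, but this sketch is not the project's implementation.

import java.util.Random;

// A minimal sketch (assumed names) of the PSO velocity/position update
// for one particle, mirroring the equations above.
public class PsoStep {
    static final double OMEGA = 0.72, PHI_P = 1.42, PHI_G = 1.42; // assumed coefficients
    static final Random RND = new Random();

    // x, v: position and velocity of one particle; p: its personal best;
    // g: the swarm's global best. All arrays have the same length n.
    static void update(double[] x, double[] v, double[] p, double[] g) {
        double rp = RND.nextDouble(), rg = RND.nextDouble();
        for (int d = 0; d < x.length; d++) {
            // velocity update: inertia + cognitive pull + social pull
            v[d] = OMEGA * v[d]
                 + PHI_P * rp * (p[d] - x[d])
                 + PHI_G * rg * (g[d] - x[d]);
            x[d] += v[d];  // position update
        }
    }
}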

2. Tabu search:
Steps involved:
Step 1: Create an initial solution x.
Step 2: Initialize the tabu list.
Step 3: While the set X of candidate solutions is not complete:
Step 3.1: Compute a candidate solution x' from the present solution x.
Step 3.2: Add x' to X if x' is not tabu, or if at least one aspiration criterion is satisfied.
Step 4: Select the best candidate solution x* in X.
Step 5: If fitness(x*) < fitness(x) then x = x*.
Step 6: Update the tabu list.
Step 7: If the termination criterion is reached then finish; otherwise go to Step 3.
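The steps above can be sketched in Java roughly as follows. The Solution
interface, its fitness() and neighbors() methods, and the fixed-capacity
tabu list are assumptions made for illustration; this is a generic tabu
search skeleton, not the code used in this project. Note that, unlike
Step 5 above, the classic variant always moves to the best admissible
candidate, even a worse one, in order to escape local optima.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal tabu search skeleton under assumed types; illustrative only.
public class TabuSearchSketch {
    interface Solution {
        double fitness();            // lower is better (minimization)
        List<Solution> neighbors();  // candidate moves from this solution
        // implementations should define equals() so the tabu check works
    }

    static Solution search(Solution initial, int tabuCapacity, int maxIters) {
        Solution current = initial, best = initial;
        Deque<Solution> tabuList = new ArrayDeque<>();
        for (int iter = 0; iter < maxIters; iter++) {
            Solution bestCandidate = null;
            for (Solution cand : current.neighbors()) {
                // skip tabu candidates unless the aspiration criterion
                // (better than the best solution found so far) holds
                boolean aspiration = cand.fitness() < best.fitness();
                if (tabuList.contains(cand) && !aspiration) continue;
                if (bestCandidate == null || cand.fitness() < bestCandidate.fitness())
                    bestCandidate = cand;
            }
            if (bestCandidate == null) break;  // no admissible move
            current = bestCandidate;           // move even if worse, to escape local optima
            if (current.fitness() < best.fitness()) best = current;
            tabuList.addLast(current);         // update the tabu list
            if (tabuList.size() > tabuCapacity) tabuList.removeFirst();
        }
        return best;
    }
}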
Criterion Function
The Variance Ratio Criterion (VRC) is the most widely used
partitional clustering criterion. It is defined as

\mathrm{VRC} = \frac{\sum_{j=1}^{k} n_j \lVert \bar{o}_j - \bar{o} \rVert^2 / (k-1)}{\sum_{j=1}^{k} \sum_{i=1}^{n_j} \lVert o_{ij} - \bar{o}_j \rVert^2 / (N-k)}

where n_j denotes the cardinality of the cluster c_j, o_{ij}
denotes the ith object assigned to the cluster c_j, \bar{o}
denotes the n-dimensional vector of overall sample means
(the data centroid), \bar{o}_j denotes the n-dimensional
vector of sample means within the jth cluster (the cluster
centroid), and N = \sum_j n_j is the total number of objects.
The between-cluster variation has (k-1) degrees of freedom
and the within-cluster variation has (N-k) degrees of freedom.
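As a concrete reading of this formula, a small Java sketch is given
below. It assumes documents are dense double[] vectors and that
assign[i] holds the cluster index of document i; it is illustrative
only and not part of this project's code.

// Illustrative VRC computation (assumed representation: dense vectors,
// assign[i] = cluster index of document i). Higher VRC = better partition.
public class Vrc {
    static double vrc(double[][] docs, int[] assign, int k) {
        int nTotal = docs.length, dim = docs[0].length;
        double[] overall = new double[dim];        // data centroid
        double[][] centroid = new double[k][dim];  // cluster centroids
        int[] size = new int[k];
        for (int i = 0; i < nTotal; i++) {
            size[assign[i]]++;
            for (int d = 0; d < dim; d++) {
                overall[d] += docs[i][d];
                centroid[assign[i]][d] += docs[i][d];
            }
        }
        for (int d = 0; d < dim; d++) overall[d] /= nTotal;
        for (int j = 0; j < k; j++)
            for (int d = 0; d < dim; d++) centroid[j][d] /= Math.max(size[j], 1);
        double between = 0, within = 0;
        for (int j = 0; j < k; j++)
            between += size[j] * sqDist(centroid[j], overall); // n_j * ||c_j - c||^2
        for (int i = 0; i < nTotal; i++)
            within += sqDist(docs[i], centroid[assign[i]]);    // ||o_ij - c_j||^2
        return (between / (k - 1)) / (within / (nTotal - k));
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) { double t = a[d] - b[d]; s += t * t; }
        return s;
    }
}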

3. TSPSO
Steps involved:
Step 1: Initialize the population randomly.
Step 2: Compute the fitness function (1) for each particle.
Step 3: Randomly divide the population into two halves:
a) one half of the population is updated by PSO, i.e., update the position and velocity of each particle;
b) the other half of the population is updated by TS, which searches for the local best solution for each particle.
Step 4: Merge the two halves of the population, and update the pbest and gbest particles and the tabu list (TL).
Step 5: Iterate Steps 2-4 until the termination condition is reached.
Step 6: Output the result.
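A minimal Java-style outline of this hybrid loop is sketched below. The
helper methods updateByPso, updateByTabuSearch and fitness are assumed
stand-ins for the procedures described above; the sketch only
illustrates the split-update-merge structure of TSPSO.

import java.util.Collections;
import java.util.List;

// Skeleton of the TSPSO split/update/merge loop (assumed helper methods).
public abstract class TspsoSketch {
    abstract double fitness(float[][] particle);             // criterion function (1)
    abstract void updateByPso(List<float[][]> half);         // PSO velocity/position step
    abstract void updateByTabuSearch(List<float[][]> half);  // TS local search + tabu list

    void run(List<float[][]> swarm, int maxIters) {
        for (int iter = 0; iter < maxIters; iter++) {
            Collections.shuffle(swarm);                      // Step 3: random split
            int mid = swarm.size() / 2;
            updateByPso(swarm.subList(0, mid));              // Step 3a: half by PSO
            updateByTabuSearch(swarm.subList(mid, swarm.size())); // Step 3b: half by TS
            // Step 4: the two halves are views of the same list, so the swarm
            // is already merged; pbest, gbest and the tabu list (TL) would be
            // refreshed from the full swarm here.
        }
    }
}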

4. Implementation
4.1 Introduction to technologies
The feasibility of the project is analyzed in this
phase and a business proposal is put forth with a very
general plan for the project and some cost estimates.
During system analysis the feasibility study of the
proposed system is to be carried out. This is to ensure that
the proposed system is not a burden to the company. For
feasibility analysis, some understanding of the major
requirements for the system is essential.
Three key considerations involved in the feasibility analysis are:
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

ECONOMICAL FEASIBILITY
This study is carried out to check the economic
impact that the system will have on the organization. The
amount of funds that the company can pour into the
research and development of the system is limited. The
expenditures must be justified. Thus the developed
system is well within the budget, and this was achieved
because most of the technologies used are freely
available. Only the customized products had to be
purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical
feasibility, that is, the technical requirements of the
system. Any system developed must not have a high
demand on the available technical resources; otherwise this
will lead to high demands being placed on the client. The
developed system must have a modest requirement, as
only minimal or null changes are required for
implementing this system.

SOCIAL FEASIBILITY
The aspect of study is to check the level of
acceptance of the system by the user. This includes the
process of training the user to use the system efficiently.
The user must not feel threatened by the system, instead
must accept it as a necessity. The level of acceptance by


the users solely depends on the methods that are
employed to educate the user about the system and to
make him familiar with it. His level of confidence must be
raised so that he is also able to make some constructive
criticism, which is welcomed, as he is the final user of the
system.

4.2 Sample Code


package miner.psoAlgo;

import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;
import miner.*;

public class pso
{
	float tfIdf[][];              // term-frequency/inverse-document-frequency matrix
	float particles[][][];        // particle -> cluster -> centroid vector
	float fitness[];              // best fitness found so far for each particle
	float partiVelocity[][][];    // velocity of each particle
	float pBest[][][];            // personal best position of each particle
	public float gBest[][];       // global best cluster centroids
	float newFitness[];           // fitness computed in the current iteration
	float gBestFitness;           // fitness of the global best
	int clusterSize[];            // number of documents in each cluster
	float distance[];             // document-to-centroid distances (scratch)
	float intraclustDistance[];   // average intra-cluster distance per cluster
	boolean clusterPoints[][];    // document-to-cluster membership
	small little = new small();   // helper holding the smallest distance and its index
	public void extractData() throws IOException
	{
		Scanner s = null;
		try
		{
			s = new Scanner(new BufferedReader(new
				FileReader("c:\\dc\\tfIdfMatrix.txt")));
			String a;
			int col = -1;
			while (s.hasNext())
			{
				a = s.next();
				// each "column" marker is followed by one tf-idf value per document
				if (a.indexOf("column") != -1)
				{
					col++;
					for (int j = 0; j < tfIdf.length; j++)
					{
						a = s.next();
						tfIdf[j][col] = Float.parseFloat(a);
					}
				}
			}
		}
		catch (IOException e)
		{
			JOptionPane.showMessageDialog(null, e.toString(),
				"pso-extractData()", JOptionPane.ERROR_MESSAGE);
		}
		finally
		{
			if (s != null)
				s.close();
		}
	}
	public pso()
	{
	}

	public pso(int Rows, int Columns, int noOfClusters, int noOfParticles)
	{
		System.out.println("parameterised Constructor Executed");
		tfIdf = new float[Rows][Columns];
		System.out.println("The size of the matrix is:"
			+ tfIdf.length + "\t" + tfIdf[0].length);
		particles = new float[noOfParticles][noOfClusters][Columns];
		fitness = new float[particles.length];
		partiVelocity = new float[noOfParticles][noOfClusters][Columns];
		pBest = new float[noOfParticles][noOfClusters][Columns];
		gBest = new float[noOfClusters][Columns];
		newFitness = new float[particles.length];
		clusterSize = new int[particles[0].length];
		distance = new float[particles[0].length];
		intraclustDistance = new float[particles[0].length];
		Arrays.fill(fitness, 0);
		for (int i = 0; i < gBest.length; i++)
			Arrays.fill(gBest[i], 0);
		// zero every particle's velocity and personal best
		for (int i = 0; i < pBest.length; i++)
			for (int j = 0; j < pBest[i].length; j++)
			{
				Arrays.fill(partiVelocity[i][j], 0);
				Arrays.fill(pBest[i][j], 0);
			}
	}
	public void assignParticles() throws IOException
	{
		int Particles[];
		try
		{
			numberGenerator n = new numberGenerator();
			// draw distinct document indices to seed every cluster centroid
			Particles = n.extractNumbers((particles.length) * (particles[0].length));
			System.out.println(Particles.length);
			int l = 0;
			for (int i = 0; i < particles.length; i++)
			{
				for (int j = 0; j < particles[0].length; j++)
				{
					// seed centroid j of particle i with a randomly chosen
					// document (indices are 1-based, hence the -1)
					for (int k = 0; k < particles[0][0].length; k++)
					{
						particles[i][j][k] = tfIdf[Particles[l] - 1][k];
					}
					l++;
					System.out.println(l);
				}
			}
		}
		catch (IOException e)
		{
			JOptionPane.showMessageDialog(null, e.toString(),
				"pso-assignParticles()", JOptionPane.ERROR_MESSAGE);
		}
	}

	// Euclidean distance between two equal-length vectors
	public float eucliDistance(float a[], float b[])
	{
		float distance = 0, temp;
		for (int i = 0; i < a.length; i++)
		{
			temp = a[i] - b[i];
			distance += temp * temp;
		}
		distance = (float) Math.sqrt(distance);
		return distance;
	}

	// Returns the smallest non-zero distance and its index
	public small Small(float distance[])
	{
		small a = new small();
		a.distance = distance[0];
		a.pos = 0;
		for (int i = 1; i < distance.length; i++)
		{
			// if the first entry was zero, restart from the first non-zero entry
			if (a.distance == 0 && i == 1)
			{
				int j = i;
				while (true)
				{
					if (distance[j] != 0)
					{
						a.distance = distance[j];
						a.pos = j;
						break;
					}
					j++;
				}
			}
			if (a.distance > distance[i] && distance[i] != 0)
			{
				a.distance = distance[i];
				a.pos = i;
			}
		}
		return a;
	}

	// Fitness of each particle = average of the mean intra-cluster
	// distances over all clusters (smaller is better)
	public void calFitness()
	{
		for (int i = 0; i < particles.length; i++)
		{
			System.gc();
			newFitness[i] = 0;
			for (int l = 0; l < particles[0].length; l++)
			{
				clusterSize[l] = 0;
				distance[l] = 0;
				intraclustDistance[l] = 0;
			}
			// assign every document to its nearest centroid of particle i
			for (int j = 0; j < tfIdf.length; j++)
			{
				for (int k = 0; k < particles[i].length; k++)
				{
					distance[k] = eucliDistance(tfIdf[j], particles[i][k]);
				}
				little = Small(distance);
				intraclustDistance[little.pos] += little.distance;
				clusterSize[little.pos]++;
			}
			for (int k = 0; k < particles[0].length; k++)
			{
				intraclustDistance[k] = intraclustDistance[k] / clusterSize[k];
				// an empty cluster gives NaN; substitute a fixed penalty value
				if (Float.isNaN(intraclustDistance[k]))
					intraclustDistance[k] = (float) 3.3406782;
				System.out.println("The intracluster distance in cluster:" + k
					+ " is " + intraclustDistance[k]);
			}
			System.out.println();
			for (int k = 0; k < particles[0].length; k++)
				newFitness[i] += intraclustDistance[k];
			newFitness[i] = newFitness[i] / particles[0].length;
			// keep the previous fitness if the new value is undefined
			if (Float.isNaN(newFitness[i]))
				newFitness[i] = fitness[i];
			System.out.println("The Fitness of particle " + i
				+ " is: " + newFitness[i]);
			System.out.println();
		}
	}
	// PSO velocity and position update with inertia weight 0.72 and
	// acceleration coefficients 1.42
	public void changePartiVelocityLocation()
	{
		float rand1, rand2;
		int i, j, k;
		rand1 = (float) Math.random();
		rand2 = (float) Math.random();
		System.gc();
		// redraw until the two random factors differ
		while (rand1 == rand2)
			rand2 = (float) Math.random();
		for (i = 0; i < particles.length; i++)
		{
			for (j = 0; j < particles[i].length; j++)
			{
				for (k = 0; k < particles[i][j].length; k++)
				{
					partiVelocity[i][j][k] = (float)
						(0.72 * partiVelocity[i][j][k]
						+ 1.42 * rand1 * (pBest[i][j][k] - particles[i][j][k])
						+ 1.42 * rand2 * (gBest[j][k] - particles[i][j][k]));
					if (Float.isNaN(partiVelocity[i][j][k]))
						partiVelocity[i][j][k] = 0;
					particles[i][j][k] += partiVelocity[i][j][k];
				}
			}
		}
	}

	// Update each particle's personal best when its new fitness improves
	public void findpBest()
	{
		int i, j;
		for (i = 0; i < fitness.length; i++)
		{
			if (fitness[i] > newFitness[i])
			{
				fitness[i] = newFitness[i];
				for (j = 0; j < particles[0].length; j++)
				{
					System.arraycopy(particles[i][j], 0,
						pBest[i][j], 0, particles[i][j].length);
				}
			}
		}
	}

	// Update the global best from the smallest personal-best fitness
	public void findgBest(int i)
	{
		int j;
		small a;
		int flag = 0;
		a = Small(fitness);
		if (i == 0)
		{
			gBestFitness = a.distance;
			flag = 1;
		}
		else
		{
			if (i == 1 && gBestFitness == 0)
				gBestFitness = a.distance;
			else if (a.distance != 0)
				if (gBestFitness > a.distance)
				{
					gBestFitness = a.distance;
					flag = 1;
				}
		}
		System.out.println("The gBest Fitness is:" + gBestFitness);
		if (flag == 1)
			for (j = 0; j < particles[0].length; j++)
			{
				System.arraycopy(particles[a.pos][j], 0,
					gBest[j], 0, particles[0][0].length);
			}
	}
	// True when all particles have converged to the same fitness value
	public boolean checkFitness()
	{
		float a;
		byte count = 0;
		a = newFitness[0];
		for (int i = 1; i < newFitness.length; i++)
			if (Math.abs(a - newFitness[i]) == 0)
				count++;
		if (count == newFitness.length - 1)
		{
			System.out.println("After checking Fitness:");
			for (int l = 0; l < newFitness.length; l++)
				System.out.println(newFitness[l]);
			return true;
		}
		return false;
	}
	// Main PSO loop: run for at most n iterations or until the swarm converges
	public void psoalg(int n)
	{
		int i, j, k;
		for (i = 0; i < n; i++)
		{
			System.gc();
			System.out.println();
			System.out.println("iteration: " + i);
			System.out.println();
			calFitness();
			if (i == 0)
			{
				// first iteration: current fitness and positions become the bests
				System.arraycopy(newFitness, 0, fitness, 0, newFitness.length);
				System.out.println("Fitness:" + fitness[0]);
				System.out.println();
				for (j = 0; j < fitness.length; j++)
				{
					for (k = 0; k < particles[0].length; k++)
					{
						System.arraycopy(particles[j][k], 0,
							pBest[j][k], 0, particles[j][k].length);
					}
				}
				System.gc();
				findgBest(i);
				changePartiVelocityLocation();
			}
			else
			{
				System.gc();
				findpBest();
				findgBest(i);
				changePartiVelocityLocation();
			}
			if (checkFitness())
			{
				System.out.println("Yes");
				break;
			}
		}
	}

	// Write every particle's cluster centroids to a text file
	public void show() throws IOException
	{
		PrintWriter out = null;
		try
		{
			out = new PrintWriter(new
				FileWriter("c:\\dc\\psoparticles.txt"));
			for (int i = 0; i < particles.length; i++)
			{
				out.println("particle:" + (i + 1));
				for (int j = 0; j < particles[0].length; j++)
				{
					out.println("Cluster:" + (j + 1));
					for (int k = 0; k < particles[0][0].length; k++)
					{
						out.print(particles[i][j][k] + "\t");
					}
					out.println();
				}
			}
		}
		catch (IOException e)
		{
			JOptionPane.showMessageDialog(null, e.toString(),
				"pso-show()", JOptionPane.ERROR_MESSAGE);
		}
		finally
		{
			if (out != null)
				out.close();
		}
	}
	// Average pairwise distance between the gBest cluster centroids
	public float centToCentDistance()
	{
		float result = 0;
		for (int i = 0; i < gBest.length; i++)
		{
			for (int j = i + 1; j < gBest.length; j++)
			{
				float temp = eucliDistance(gBest[i], gBest[j]);
				System.out.println("The distance from centroid" + (i + 1)
					+ " to centroid" + (j + 1) + " is : " + temp);
				result += temp;
			}
		}
		int n = gBest.length;
		n = (n * (n - 1)) / 2;    // number of centroid pairs
		result = result / n;
		System.out.println("The average distance is:" + result);
		return result;
	}

	// Fitness of the gBest centroids: mean intra-cluster distance,
	// also recording which documents fall in which cluster
	public float intDist()
	{
		float distancei[];
		float intraclustDistancei[];
		float fitnessi = 0;
		int clusterSizei[];
		small littlei;
		distancei = new float[gBest.length];
		intraclustDistancei = new float[gBest.length];
		clusterSizei = new int[gBest.length];
		clusterPoints = new boolean[tfIdf.length][gBest.length];

		for (int i = 0; i < gBest.length; i++)
		{
			clusterSizei[i] = 0;
			distancei[i] = 0;
			intraclustDistancei[i] = 0;
		}
		for (int i = 0; i < tfIdf.length; i++)
			Arrays.fill(clusterPoints[i], false);
		// assign each document to its nearest gBest centroid
		for (int j = 0; j < tfIdf.length; j++)
		{
			for (int k = 0; k < gBest.length; k++)
			{
				distancei[k] = eucliDistance(tfIdf[j], gBest[k]);
			}
			littlei = Small(distancei);
			clusterPoints[j][littlei.pos] = true;
			intraclustDistancei[littlei.pos] += littlei.distance;
			clusterSizei[littlei.pos]++;
		}
		for (int k = 0; k < gBest.length; k++)
			intraclustDistancei[k] = intraclustDistancei[k] / clusterSizei[k];
		for (int k = 0; k < gBest.length; k++)
		{
			System.out.println("Cluster" + k + ":" + intraclustDistancei[k]);
			fitnessi += intraclustDistancei[k];
		}
		fitnessi = fitnessi / gBest.length;
		System.out.println("The gBest fitness is:" + fitnessi);
		return fitnessi;
	}
	// Build a printable report of which documents belong to each cluster
	public String finddocclust()
	{
		String clust = "";
		int flag = 0;
		for (int i = 0; i < clusterPoints[0].length; i++)
		{
			clust += "The documents under cluster: " + i + " are:" + "\n";
			flag = 0;
			for (int j = 0; j < clusterPoints.length; j++)
			{
				if (clusterPoints[j][i])
				{
					flag++;
					clust += Integer.toString(j);
					// five document ids per line
					if (flag % 5 == 0)
						clust += "\n";
					else
						clust += "\t";
					if (flag == 5)
						flag = 0;
				}
			}
			clust += "\n" + "**************************************" + "\n";
		}
		System.out.println("The cluster result is:");
		System.out.println(clust);
		return clust;
	}
}
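For orientation, a hypothetical driver for the pso class above might
look as follows. The constructor arguments (document count, vocabulary
size, number of clusters, number of particles) are illustrative
assumptions, and the input file path comes from extractData() above;
this is not code from the project.

import miner.psoAlgo.pso;

// Hypothetical usage sketch for the pso class above (assumed parameters).
public class PsoDriver {
    public static void main(String[] args) throws java.io.IOException {
        // e.g. 2942 documents x 100 terms, 3 clusters, 10 particles (illustrative)
        pso p = new pso(2942, 100, 3, 10);
        p.extractData();       // load the tf-idf matrix from c:\dc\tfIdfMatrix.txt
        p.assignParticles();   // seed each particle's centroids with random documents
        p.psoalg(50);          // run up to 50 PSO iterations
        p.show();              // dump all particles' centroids to a file
        p.intDist();           // evaluate the global-best partition
        System.out.println(p.finddocclust());
    }
}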

5. Testing
5.1 Unit Testing
Tests for Input
Test case:
What happens when we press OK leaving all fields empty?
Expected Output:
When the user clicks on OK without any input for the fields,
it should prompt an error message in a dialog
box saying "Select appropriate fields properly".
Observed Output:
When OK is pressed, the error is prompted in the dialog box.
The error shown is the same as expected.
No errors are displayed when all fields are entered
correctly.
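These dialog checks could be realized with a simple Swing validation
guard such as the sketch below. The field names and message text are
assumptions for illustration, not the project's exact code.

import javax.swing.JOptionPane;
import javax.swing.JTextField;

// Illustrative OK-button validation guard (assumed field names).
public class InputValidation {
    static boolean validate(JTextField vectorField, JTextField featuresField) {
        if (vectorField.getText().trim().isEmpty()
                || featuresField.getText().trim().isEmpty()) {
            // mirrors the behavior described in the test cases above
            JOptionPane.showMessageDialog(null,
                "Select appropriate fields properly",
                "Input Error", JOptionPane.ERROR_MESSAGE);
            return false;
        }
        return true;
    }
}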

Tests for empty features field

Test case:
What happens when we press OK leaving the features field empty?
Expected Output:
When the user clicks on OK without any input for the features field,
it should prompt an error message in a dialog
box saying "Select features field properly".
Observed Output:
When OK is pressed, the error is prompted in the dialog box.
The error shown is the same as expected.
No errors are displayed when the features field is
entered correctly.

Tests for empty vectors field

Test case:
What happens when we press OK leaving the vectors field empty?
Expected Output:
When the user clicks on OK without any input for the vectors field,
it should prompt an error message in a dialog
box saying "Select vectors field properly".
Observed Output:
When OK is pressed, the error is prompted in the dialog box.
The error shown is the same as expected.
No errors are displayed when the vectors field is
entered correctly.

Tests for empty algorithms field

Test case:
What happens when we press OK leaving the algorithms field unselected?
Expected Output:
When the user clicks on OK without any algorithm selected,
it should prompt an error message in a dialog
box saying "Select Algorithms field properly".
Observed Output:
When OK is pressed, the error is prompted in the dialog box.
The error shown is the same as expected.
No errors are displayed when an algorithm is
selected correctly.

5.2 Performance Evaluation


Table 5.1 contains the VRC values of the different
algorithms on the different datasets listed below. We compare the
VRC values of the PSO, TS and TSPSO clustering algorithms. The
graph is plotted for the VRC values of these algorithms.
Table 5.1: VRC values of the three algorithms

Data        PSO      TS       TSPSO
Dataset1    0.489    0.502    0.561
Dataset2    0.458    0.491    0.482
Dataset3    0.38     0.305    0.4

By using the VRC equation given in the Criterion Function
section, we calculate the VRC values of each algorithm in every
iteration, applied to the different document datasets. The above
table gives the VRC values of the PSO, TS and TSPSO clustering
algorithms on the corresponding datasets.

[Figure 5.2: Performance evaluation of the algorithms. The plot shows
F-Score values (0.1 to 0.7 on the Y-axis) over the datasets re0, fbis,
tr11, tr12 and re1 (X-axis) for Bisecting Incremental K-Means,
Incremental K-Means, K-Means and Spherical K-Means.]


Figure 5.2 plots the VRC values of the three algorithms:
the datasets are on the X-axis and the VRC values on the
Y-axis. For each dataset the VRC values of the three clustering
algorithms are plotted. The figure shows the bisecting Incremental
K-means values; the cyan plot represents the incremental K-way
clustering using MVS, the yellow plot represents the K-Means F-Score
values, and the brown plot represents the spherical K-Means
values. From the figure, we conclude that the Bisecting F-Score
value on each dataset is high when compared to the other
algorithms.

6. Results
This is the home page, where the user must click Next in
order to carry out the tasks that need to be performed.


The above figure demonstrates the user-selectable fields
such as:
1. Enter Vector file:
Here the user enters the location of the vector
file in order to provide the ids as input.
2. Enter Features file:
Here the user enters the location of the features
file in order to provide the data related to the ids.
3. AMOC:
Used for generating the clusters from the given
input files.
4. TSPSO:
Used for finding the clusters that are needed.
5. Particle Swarm Optimization, Tabu Search clustering:
Used for testing the TSPSO generated clusters.

The above diagram demonstrates the selection of the vector
file as input.

The above figure demonstrates the input of the features file.

The above figure demonstrates the id that is selected.

The above figure demonstrates the selection of AMOC for
cluster generation.

The above figure demonstrates the success note of the
mining step.
The above figure shows the result generated for the given
input files and clusters.

The above figure shows the fitness function results
generated using the input values.

The above figure shows the final result that is generated
after successful execution.

7. CONCLUSION
In this thesis a new hybrid algorithm that uses
Tabu Search and basic PSO is proposed to solve the
problem of document clustering. PSO has been proven to be
an effective optimization technique for solving combinatorial
optimization problems. Tabu Search, an efficient local
search procedure, helps to explore solutions in different
regions of the solution space. The proposed hybrid
algorithm is a blended technique that combines features
of basic PSO and TS. The quality of solutions obtained by
the hybrid algorithm strongly substantiates the effectiveness
of the algorithm for document clustering in an IR system.
We also compared TSPSO with particle swarm
optimization (PSO) and Tabu search (TS). The results
show that TSPSO has the largest VRC values among all
the algorithms. We conclude that TSPSO is effective for the
document cluster analysis problem. Future work includes
using more standard data sets to test the performance of
TSPSO.
The clustering algorithms were applied to different datasets.
We compared the results of the proposed TSPSO algorithm with
the other existing algorithms. Finally the VRC values of
each algorithm were compared, and we concluded that the
TSPSO algorithm gives more accurate clusters compared to
the remaining algorithms.

References
1. P. Jaganathan, S. Jaiganesh: An improved K-means
algorithm combined with Particle Swarm Optimization
approach for efficient web document clustering.
International Conference on Green Computing,
Communication and Conservation of Energy
(ICGCE), IEEE (2013).
2. M. Yaghini, N. Ghazanfari: Tabu-KM: A Hybrid Clustering
Algorithm Based on Tabu Search Approach. International
Journal of Industrial Engineering & Production Research,
September (2010), Volume 21.
3. Pritesh Vora, Bhavesh Oza: A Survey on K-mean
Clustering and Particle Swarm Optimization.
International Journal of Science and Modern Engineering
(IJISME), ISSN: 2319-6386, Volume 1, Issue 3,
February (2013).
4. Yudong Zhang, Dayong Li: Cluster Analysis by Variance
Ratio Criterion and Firefly Algorithm. International
Journal of Digital Content Technology and its
Applications (JDCTA), Volume 7, Number 3, February 2013.
5. Karypis, G.: CLUTO - a Clustering Toolkit. Technical report,
Dept. of Computer Science, Univ. of Minnesota (2013).
http://glaros.dtc.umn.edu/~gkhome/views/cluto
6. K. Premalatha, A.M. Natarajan: Discrete PSO with GA
Operators for Document Clustering. International Journal
of Recent Trends in Engineering, Vol 1, No. 1, May 2009.

Sites Referred:
http://java.sun.com
http://www.sourcefordgde.com
http://www.networkcomputing.com/
http://www.roseindia.com/
http://www.java2s.com/
http://www.javadb.com/
