1.1 Introduction:
Clustering is a main process in engineering and in various
fields of scientific research. It tries to group a set of points into
clusters such that points in the same cluster are more homogeneous
to each other than points in different clusters. Document clustering
groups documents based on the similarity among them in an
unsupervised manner. It is used in quick topic extraction, filtering
and information retrieval. We are facing an ever-increasing volume
of text documents: vast collections of documents flow into
repositories, digital libraries and digitized personal information such
as articles and emails. These have brought challenges for the
effective and efficient organization of text documents.
There is no known single optimization method available for
solving all optimization problems. A lot of optimization methods
have been developed for solving different types of optimization
problems in recent years. The modern optimization methods
(sometimes called nontraditional optimization methods) are very
powerful and popular methods for solving complex engineering
problems.
These methods include genetic algorithms, neural networks,
particle swarm optimization [6] and ant colony optimization. Among
them, tabu search has achieved relatively remarkable successes in
solving hard non-linear mathematical problems that arise in practice.
1.2 Motivation:
PSO performs excellently in global search but not so well in
local search, while TS performs excellently in local search but not so
well in global search. Therefore, this thesis combines the two
algorithms so that the new hybrid algorithm conducts both a global
search and a local search in every iteration, significantly increasing
the probability of finding the optimal points. However, to the best of
the author's knowledge, TSPSO has not been used to cluster text
documents. In this study a document clustering algorithm based on
TSPSO is proposed.
1.4 Clustering
A general definition of clustering is stated by Brian Everitt et al.
[6]: Given a number of objects or individuals, each of which is
described by a set of numerical measures, devise a classification
scheme for grouping the objects into a number of classes such that
objects within classes are similar in some respect and unlike those
from other classes. The number of classes and the characteristics of
each class are to be determined. The clustering problem can be
formalized as an optimization problem, i.e. the minimization or
maximization of a function subject to a set of constraints. The goal
of clustering can be defined as follows:
Given
I. a dataset X = {x1, x2, …, xn}
II. the desired number of clusters k
III. a function f that evaluates the quality of a clustering
we want to compute a mapping

γ : {1, 2, …, n} → {1, 2, …, k}
that minimizes the function f subject to some constraints. The
function f that evaluates the clustering quality is often defined in
terms of the similarity between objects; it is also called the
distortion function or divergence. The similarity measure is the key
input to a clustering algorithm.
Preprocessing
Text document preprocessing basically consists of a process to
normalize capitalization, strip punctuation and all extraneous
formatting, and remove markup from the article (like the dateline
and tags). Then the stop words are removed. Stop words (i.e.,
pronouns, prepositions, conjunctions, etc.) are words that do not
carry semantic meaning. They can be eliminated using a list of stop
words; doing so greatly reduces the amount of noise in the text
collection and makes the computation easier. Removing stop words
leaves us with a condensed version of the documents containing
content words only. The next process is to stem each word.
Stemming is the process of reducing derived words to their root
form. For English documents, a popularly known algorithm called
the Porter stemmer [7] is used. The performance of text clustering
can be improved by using the Porter stemmer.
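The preprocessing steps above can be sketched as follows. This is a minimal illustration only: the stop-word list here is a tiny sample rather than a full list, and the Porter stemming step is omitted.

```java
import java.util.*;

public class Preprocess {
    // Tiny illustrative stop-word list; a real system would load a full list from a file.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "and", "in", "to"));

    // Lowercase the text, strip punctuation, split on whitespace, and drop stop words.
    public static List<String> tokenize(String text) {
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9\\s]", " ");
        List<String> tokens = new ArrayList<>();
        for (String t : cleaned.split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }
}
```

For example, `tokenize("The quick, brown fox!")` yields the content words `[quick, brown, fox]`.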
Document Representation
Preprocessing is done to represent the data in a form that
can be used for clustering. There are many ways of representing
documents, like the vector space model, the graphical model, etc. [11].
Vector Space Model
The Vector Space Model (VSM) is perhaps the simplest level of
document representation [18]. Given a document collection, every
word present in the collection is counted as a dimension. If there
are d separate words in total, each document is treated as a
d-dimensional vector, whose coordinate values are the frequencies
of appearance of the words in that individual document.
This representation model treats words as independent
entities, completely ignoring the structural information inside
documents, such as syntax and meaningful relationship between
words or between sentences. Recently, many efforts have been
made to find a better way of representing text documents. As
mentioned, sparsity is a problem of VSM: a document vector has so
many unrelated dimensions that they may hide its actual meaning.
Researchers have tried to make use of semantic relatedness of
words, or to find some sort of concepts, instead of words, to
represent documents. Its simplicity facilitates fast computation and
at the same time provides sufficient numerical and statistical
information. Coordinate values are usually weighted by the raw
term frequency (tf) or the term frequency-inverse document
frequency (tf-idf). Figure 2.3 illustrates three terms (TERM1, TERM2,
TERM3) that are common to three documents. Let n be the total
number of documents in the collection and dfi the number of
documents in which term i appears; then the tf-idf weight of term i,
whose raw frequency is tfi, is

tfidfi = tfi × log(n / dfi)   (2.1)
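The tf-idf weighting of equation (2.1) can be sketched as a single method. The method name is illustrative, and the natural logarithm is an assumption (the base only rescales all weights uniformly).

```java
public class TfIdf {
    // tf-idf weight of one term: raw term frequency times inverse document frequency.
    // tf = frequency of the term in the document, n = total documents in the
    // collection, df = number of documents containing the term.
    public static double weight(int tf, int n, int df) {
        if (df == 0) return 0.0;                 // term absent from the collection
        return tf * Math.log((double) n / df);   // tf * log(n / df)
    }
}
```

A term that appears in every document (df = n) gets weight 0, which is exactly the behavior that pushes down uninformative common words.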
Similarity Measurement
Accurate clustering requires a precise definition of the
closeness between a pair of objects, in terms of either the pairwise
similarity or the distance. Before clustering, a similarity/distance
measure must be determined; algorithms such as DBSCAN and the
partitional clustering methods rely on it directly. Recalling that each
document is a term vector, closeness is commonly quantified as the
Euclidean distance:

d(di, dj) = sqrt( Σk (wi,k − wj,k)² )   (2.4)
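A minimal version of the Euclidean distance in (2.4), mirroring the eucliDistance helper that appears later in the implementation chapter:

```java
public class Distance {
    // Euclidean distance between two term-weight vectors of equal length:
    // the square root of the sum of squared coordinate differences.
    public static float eucliDistance(float[] a, float[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return (float) Math.sqrt(sum);
    }
}
```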
Cosine Similarity
When documents are represented as term vectors, the
similarity of two documents corresponds to the correlation between
the vectors. This is quantified as the cosine of the angle between
vectors, that is, the so-called cosine similarity. Cosine similarity is
one of the most popular similarity measures applied to text
documents, used in numerous information retrieval applications [11]
and clustering toolkits [13]. An important property of the cosine
similarity is its independence of document length. The similarity of
two document vectors di and dj, Sim(di, dj), is defined as the cosine
of the angle between them. For unit vectors, this equals their inner
product:

Sim(di, dj) = (di · dj) / (|di| |dj|)   (2.5)

The cosine measure is used in a variant of K-Means called
spherical K-Means [4]. While K-Means aims to minimize the
Euclidean distance, spherical K-Means intends to maximize the
cosine similarity between each document and its cluster centroid.
(2.7)
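The cosine similarity of (2.5) can be sketched as follows; the guard for zero-norm vectors is an added safety check, not part of the definition.

```java
public class Cosine {
    // Cosine similarity (2.5): dot(di, dj) / (|di| * |dj|).
    public static double sim(double[] di, double[] dj) {
        double dot = 0, ni = 0, nj = 0;
        for (int k = 0; k < di.length; k++) {
            dot += di[k] * dj[k];   // inner product
            ni  += di[k] * di[k];   // squared norm of di
            nj  += dj[k] * dj[k];   // squared norm of dj
        }
        if (ni == 0 || nj == 0) return 0.0;  // empty document vector
        return dot / (Math.sqrt(ni) * Math.sqrt(nj));
    }
}
```

Note the length independence: scaling a document vector by any positive constant leaves its cosine similarity to every other document unchanged.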
Pearson Correlation Coefficient
Correlation clustering, introduced by Bansal, Blum and Chawla
[9], provides a method for clustering a set of objects into the best
possible number of clusters without specifying that number in
advance. Correlation clustering does not require a bound on the
number of clusters that the data is partitioned into; rather,
correlation clustering [10] divides the data into the optimal number
of clusters based on the similarity between the data points. In their
paper [9], Bansal et al. discuss two objectives of correlation
clustering: minimizing disagreements and maximizing agreements
between clusters.
The normalized Pearson correlation is defined as:

Sim(di, dj) = Σk (wi,k − w̄i)(wj,k − w̄j) /
              ( sqrt(Σk (wi,k − w̄i)²) · sqrt(Σk (wj,k − w̄j)²) )   (2.8)

where w̄i denotes the mean over the n dimensions of vector di.
In [20] Strehl et al. compared four measures: Euclidean,
Cosine, Pearson correlation and Extended Jaccard, and concluded
that cosine and extended Jaccard are the best ones on the web
documents.
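The Pearson correlation of (2.8) amounts to the cosine of the mean-centered vectors, which the sketch below makes explicit (it assumes both vectors are non-constant, so the denominator is non-zero):

```java
public class Pearson {
    // Pearson correlation: center each vector by its mean, then take the cosine.
    public static double corr(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int k = 0; k < n; k++) { mx += x[k]; my += y[k]; }
        mx /= n; my /= n;                         // per-vector means
        double num = 0, dx = 0, dy = 0;
        for (int k = 0; k < n; k++) {
            num += (x[k] - mx) * (y[k] - my);     // centered dot product
            dx  += (x[k] - mx) * (x[k] - mx);
            dy  += (y[k] - my) * (y[k] - my);
        }
        return num / Math.sqrt(dx * dy);
    }
}
```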
Clustering is also employed for plagiarism detection and to
assure higher diversity among the topmost search results.
Recommendation System
In this application, users with similar interests are grouped into
clusters so that relevant items can be suggested to them.
Criterion Function
The frequently used partitional clustering similarity
strategy is the Variance Ratio Criterion (VRC). It is formulated as:

VRC = (B / W) × ((n − k) / (k − 1))   (1)

Here B and W denote the between-cluster and within-cluster
variations, respectively. They are defined as:

W = Σj Σi (oij − ōj)ᵀ(oij − ōj)   (2)

B = Σj nj (ōj − ō)ᵀ(ōj − ō)   (3)

where oij is the i-th object of cluster j, ōj is the centroid of cluster j,
ō is the overall mean, and nj is the number of objects in cluster j. A
good partition is indicated by a large value of VRC.
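Equations (1)-(3) can be sketched for one-dimensional data as below; it assumes every cluster is non-empty and k ≥ 2 (multi-dimensional data would sum the squared deviations per coordinate).

```java
public class Vrc {
    // Variance Ratio Criterion for 1-D data: (B / W) * ((n - k) / (k - 1)).
    // clusters[j] holds the objects assigned to cluster j.
    public static double vrc(double[][] clusters) {
        int k = clusters.length, n = 0;
        double grand = 0;
        for (double[] c : clusters) {
            for (double o : c) { grand += o; n++; }
        }
        grand /= n;  // overall mean
        double w = 0, b = 0;
        for (double[] c : clusters) {
            double mean = 0;
            for (double o : c) mean += o;
            mean /= c.length;  // cluster centroid
            for (double o : c) w += (o - mean) * (o - mean);    // within-cluster (2)
            b += c.length * (grand - mean) * (grand - mean);    // between-cluster (3)
        }
        return (b / w) * ((double) (n - k) / (k - 1));
    }
}
```

For the well-separated partition {0, 1} vs {10, 11}, the between-cluster variation dominates the within-cluster variation, so the VRC is large, as a good partition should be.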
PSO
Particle swarm optimization (PSO) is a computational
method that optimizes a problem by iteratively trying to enhance a
candidate solution with respect to a given measure of quality. Each
particle updates its velocity and position as:

vi = w·vi + c1·r1·(pbesti − xi) + c2·r2·(gbest − xi)   (4)

xi = xi + vi   (5)

where r1 and r2 are random numbers in [0, 1], w is the inertia
weight and c1, c2 are acceleration coefficients. A step-by-step
overview is given below:
Step 1: Initialize the population randomly.
Step 2: Perform the following for each particle:
(a) update the velocity and position using equations (4) and (5), and
(b) evaluate its fitness, updating pbest and gbest accordingly.
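The velocity and position updates of equations (4) and (5) can be sketched as below; the inertia weight and acceleration coefficients are assumed illustrative constants, not values prescribed by the thesis.

```java
import java.util.Random;

public class PsoUpdate {
    static final double W = 0.7, C1 = 1.5, C2 = 1.5;  // assumed constants
    static final Random RNG = new Random();

    // Equation (4): v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
    // Equation (5): x = x + v   (both applied in place, per dimension)
    public static void update(double[] x, double[] v, double[] pbest, double[] gbest) {
        for (int d = 0; d < x.length; d++) {
            double r1 = RNG.nextDouble(), r2 = RNG.nextDouble();
            v[d] = W * v[d] + C1 * r1 * (pbest[d] - x[d]) + C2 * r2 * (gbest[d] - x[d]);
            x[d] += v[d];
        }
    }
}
```

If a particle sits exactly at both its personal best and the global best with zero velocity, both attraction terms vanish and it stays put, which is the expected fixed point of the update.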
Tabu search
Fred Glover proposed an approach in 1986, called Tabu
Search (TS), that allows Local Search (LS) methods to overcome
local optima. The main concept of TS is to pursue LS whenever it
reaches a local optimum by allowing non-improving moves. What
distinguishes tabu search from other metaheuristic approaches is
the notion of a tabu list: a record of previously visited solutions,
including disallowed moves. Because short-term memory is used,
the list reserves only a few of the attributes of solutions instead of
the complete solutions.
TSPSO:
In this section we introduce the TSPSO algorithm, which
combines the PSO technique with TS. PSO is a metaheuristic: it
makes few or no assumptions about the problem being optimized
and can search very large spaces of candidate solutions.
Steps:
Step 1. Initialize the population randomly.
Step 2. Compute the fitness function (1) for each particle.
Step 3. Randomly divide the population into two halves:
a) one half of the population is updated by PSO, i.e. the position
and velocity of each particle are updated;
b) the other half of the population is updated by TS, which searches
for the local best solution for each particle.
Step 4. Merge the two halves of the population, and update the
pbest and gbest particles and the tabu list (TL).
Step 5. Iterate Steps 2-4 until the termination condition is reached.
Step 6. Output the result.
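Step 3 above, the random split of the swarm into a PSO half and a TS half, can be sketched on particle indices as follows; which update each half receives is left to the caller, since the full particle update machinery is described elsewhere.

```java
import java.util.*;

public class SplitSwarm {
    // Randomly partition particle indices into two halves:
    // one half to be updated by PSO, the other by TS (Step 3).
    public static List<List<Integer>> split(int swarmSize, Random rng) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < swarmSize; i++) idx.add(i);
        Collections.shuffle(idx, rng);  // random division of the population
        int half = swarmSize / 2;
        List<Integer> psoHalf = new ArrayList<>(idx.subList(0, half));
        List<Integer> tsHalf  = new ArrayList<>(idx.subList(half, swarmSize));
        return Arrays.asList(psoHalf, tsHalf);  // merged again in Step 4
    }
}
```

Because the two halves partition the index set, merging them back in Step 4 always restores the full swarm with no particle duplicated or lost.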
2. Literature Reviews
Tabu-KM: A Hybrid Clustering Algorithm Based on
Tabu Search Approach
Abstract
The clustering problem under the criterion of minimum
sum of squares is a non-convex and non-linear program
that possesses many locally optimal values, with the
result that its solution often falls into these traps and
cannot converge to the globally optimal solution. In this
paper, an efficient hybrid optimization algorithm called
Tabu-KM is developed for solving this problem. It gathers
together the optimization property of tabu search and the
local search capability of the k-means algorithm. The
contribution of the proposed algorithm is to produce a
tabu space for escaping from the trap of local optima and
finding better solutions effectively. The Tabu-KM
algorithm is tested on several simulated and standard
datasets and its
Abstract
In order to solve the cluster analysis problem more
efficiently, we present a new approach based on the
firefly algorithm (FA). First, we created the optimization
model using the variance ratio criterion (VRC) as the
fitness function. Second, FA was introduced to find the
maximal point of the VRC. The experimental dataset
contains 400 data points in 4 groups with three different
levels of overlapping: non-overlapping, partially
overlapping, and severely overlapping. We compared FA
with the genetic algorithm (GA) and combinatorial particle
swarm optimization (CPSO). Each algorithm was run 20
times. The results show that FA found the largest VRC
values among all three algorithms while costing the least
time. Therefore, FA is effective and rapid for the cluster
analysis problem.
Introduction
Cluster analysis is the assignment of a set of observations
into subsets without any a priori knowledge, so that
observations in the same cluster are more similar to each
other than to those in other clusters [1]. Clustering is a
method of unsupervised learning and a common
technique for statistical data analysis used in many fields
[2], including machine learning [3], data mining [4],
pattern recognition [5], image analysis [6], medical image
classification [7], and bioinformatics [8]. Cluster analysis
can be achieved by various algorithms that differ
significantly. These methods can be basically classified
into four categories: I. Hierarchical Methods. They find
successive clusters using previously established clusters,
and can be further divided into agglomerative methods
and divisive methods [9]. Agglomerative algorithms start
with one-point clusters and recursively merge the two or
more most appropriate clusters [10]. Divisive algorithms
begin with the whole set and proceed to divide it into
successively smaller clusters [11]. II. Partition Methods.
They generate a single partition of the data with a
specified or estimated number of non-overlapping
clusters, in an attempt to recover natural groups present
in the data [12]. III. Density-based Methods. They are
devised to discover
3. System Design
3.1 Hardware and software specifications
H/W System Configuration:
Processor        : Pentium i5
Speed            : 2.3 GHz
RAM              : 4 GB
Hard Disk        : 500 GB
Keyboard         : Standard laptop keyboard
Mouse            : USB mouse

S/W System Configuration:
Operating System : Windows 10
Development tool : NetBeans 7.0.1
Language         : JAVA
Language Version : JDK 1.7
Technologies     : AWT, Swing
Use case diagram: the user can apply AMOC, apply PSO,
apply TSPSO, and view the results.
Class Diagram:
Here MiningExecuter is the main class, and it utilizes the
methods of OptionSelection. When a user invokes the
action function, the ResultForm is invoked, and the inputs
of the ResultForm are given to the DocClusteringModel.
Finally the output class is executed with the inputs of
DocClustering; the StartUp class creates DocClustering.
Sequence diagram
Here User is the main actor. Whenever he wants to view
the datasets, he can do so by requesting them using the
vectors and features files. Similarly, he can optimize the
number of clusters present in the datasets using AMOC.
Whenever he wants to apply PSO for generating one of
the test cases for TSPSO he can use it, he can apply
TSPSO for generating efficient clusters, and finally he can
view all the results when required. The sequence involves
the Read_dataset, Apply AMOC, Apply PSO and Apply
TSPSO objects, beginning with 1: Browse Vector and
Feature File().
Activity diagram
The behavior of the system in terms of activities is
described below. As the user initiates the process, he
browses through the files to select the features and
vectors files. Then he applies AMOC to generate clusters.
He can then apply PSO to generate test sample 1 for
TSPSO, and apply TS to generate test sample 2. Finally
these can be combined by applying TSPSO, and the
results can be viewed. The activities are: Browse Files,
Apply AMOC, Apply PSO, Apply TS, Apply TSPSO, View
Results.
Component Diagram:
The figure shows the various interactions of the user with
different components. The user interacts with the Read
datasets component to browse the vectors and features
files; on success, he can read them. He can interact with
the AMOC component to apply it and get optimized
clusters, and he then interacts with the PSO component
to apply it and generate samples.
ALGORITHMS
1. PSO
Let S be the number of particles in the swarm, each
having a position xi ∈ Rn in the search space and a
velocity vi ∈ Rn. Let pi be the best known position of
particle i and g the best known position of the entire
swarm. Each particle's position is initialized with a
uniformly distributed random vector.
2. Tabu search:
Steps involved:
Step 1. Create an initial solution x.
Step 2. Initialize the Tabu List.
Step 3. Repeat until the candidate list X is complete:
Step 3.1. Compute a candidate solution x′ from the
present solution x.
Step 3.2. Add x′ to X if x′ is not tabu or the Aspiration
Criterion is satisfied.
Step 4. Select the best candidate solution x* in X.
Step 5. If fitness(x*) is better than fitness(x), then x = x*.
Step 6. Update the Tabu List.
Step 7. If the termination criterion is reached, finish;
otherwise go to Step 3.
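The short-term memory used in Steps 2 and 6 can be sketched as a fixed-size FIFO of recently visited solution attributes. The tenure value passed in is an assumed illustrative parameter; the thesis does not fix one.

```java
import java.util.*;

public class TabuList {
    private final int tenure;  // how many recent attributes stay tabu
    private final Deque<String> list = new ArrayDeque<>();

    public TabuList(int tenure) { this.tenure = tenure; }

    // Record a visited solution attribute; the oldest entry expires when full,
    // which is exactly the short-term-memory behavior described above.
    public void add(String attribute) {
        if (list.size() == tenure) list.removeFirst();
        list.addLast(attribute);
    }

    public boolean isTabu(String attribute) { return list.contains(attribute); }
}
```

Storing only attributes of solutions (here, strings) rather than complete solutions keeps the memory footprint small, matching the short-term-memory design described earlier.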
3. TSPSO
Steps involved:
Step 1. Initialize the population randomly.
Step 2. Compute the fitness function (1) for each particle.
Step 3. Randomly divide the population into two halves:
a) one half of the population is updated by PSO, i.e. the
position and velocity of each particle are updated;
b) the other half of the population is updated by TS,
which searches for the local best solution for each
particle.
Step 4. Merge the two halves of the population, and
update the pbest and gbest particles and the tabu list
(TL).
4. Implementation
4.1 Introduction to technologies
The feasibility of the project is analyzed in this phase, and
a business proposal is put forth with a very general plan
for the project and some cost estimates.
During system analysis the feasibility study of the
proposed system is to be carried out. This is to ensure that
the proposed system is not a burden to the company. For
feasibility analysis, some understanding of the major
requirements for the system is essential.
Three key considerations involved in the feasibility
analysis are:
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic
impact that the system will have on the organization. The
amount of funds that the company can pour into the
research and development of the system is limited; the
developed system is well within budget, since most of the
technologies used are freely available.
TECHNICAL FEASIBILITY
Only minimal or null changes are required for
implementing the system with the currently available
technical resources.
SOCIAL FEASIBILITY
This aspect of the study is to check the level of
acceptance of the system by the user. This includes the
process of training the user to use the system efficiently.
The user must not feel threatened by the system, but
must instead accept it as a necessity.
import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;
import miner.*;

public class pso
{
float distance[];
float intraclustDistance[];
boolean clusterPoints[][];
small little=new small();
public void extractData() throws IOException
{
Scanner s=null;
try
{
s=new Scanner(new BufferedReader(new
FileReader("c:\\dc\\tfIdfMatrix.txt")));
String a,b;
int col=-1;
while(s.hasNext())
{
a=s.next();
if(a.indexOf("column")!=-1)
{
col++;
for(int j=0;j<tfIdf.length;j++)
{
a=s.next();
tfIdf[j][col]=Float.parseFloat(a);
}/*End of for*/
}/*End of If*/
}/*End of while*/
}/*End of try*/
catch(IOException e)
{
JOptionPane.showMessageDialog(null,e.toString(),"pso-extractData()",JOptionPane.ERROR_MESSAGE);
//return count;
}
finally
{
if(s!=null)
s.close();/*End of if*/
/*if(out!=null)
out.close();*/
}/*End of finally*/
}/*End of Extract data*/
public pso()
{
}
public pso(int Rows,int Columns,int noOfClusters,int
noOfParticles)
{
System.out.println("parameterised Constructor
Executed");
tfIdf=new float[Rows][Columns];
System.out.println("The size of the matrix
is:"+tfIdf.length+"\t"+tfIdf[0].length);
particles=new float[noOfParticles][noOfClusters]
[Columns];
fitness=new float[particles.length];
partiVelocity=new
float[noOfParticles]
[noOfClusters][Columns];
pBest=new
float[noOfParticles][noOfClusters]
[Columns];
gBest=new float[noOfClusters][Columns];
newFitness=new float[particles.length];
//clusterPoints=new
boolean[tfIdf.length]
[particles[0].length];
clusterSize=new int[particles[0].length];
distance=new float[particles[0].length];
intraclustDistance=new
float[particles[0].length];
Arrays.fill(fitness,0);
Arrays.fill(fitness,0);
for(int i=0;i<gBest.length;i++)
Arrays.fill(gBest[i],0);
for(int i=0;i<pBest.length;i++)
for(int j=0;j<pBest[i].length;j++)
{
Arrays.fill(partiVelocity[i][j],0);
Arrays.fill(pBest[i][j],0);
}
}
public void assignParticles() throws IOException
{
int Particles[];
try
{
numberGenerator n;
n=new numberGenerator();
Particles=n.extractNumbers((particles.length)*(particles[0].length));
System.out.println(Particles.length);
int l=0;
for(int i=0;i<particles.length;i++)
{
for(int j=0;j<(particles[0].length);j++)
{
for(int k=0;k<(particles[0][0].length);k++)
{
particles[i][j][k]=tfIdf[Particles[l]-1][k];
}
l++;
System.out.println(l);
}
}
}
catch(IOException e)
{
JOptionPane.showMessageDialog(null,e.toString(),"pso-assignParticles()",JOptionPane.ERROR_MESSAGE);
//return count;
}
finally
{
}
}
j++;
}
}
if(a.distance>distance[i] && distance[i]!=0)
{
a.distance=distance[i];
a.pos=i;
}
}
return a;
}
for(int i=0;i<particles.length;i++)
{
System.gc();
newFitness[i]=0;
for(int l=0;l<particles[0].length;l++)
{
clusterSize[l]=0;
distance[l]=(float)(0);
intraclustDistance[l]=(float)(0);
//newFitness[l]=(float)(0);
}
for(int j=0;j<tfIdf.length;j++)
{
for(int k=0;k<particles[i].length;k++)
{
distance[k]=eucliDistance(tfIdf[j],particles[i][k]);
}
little=Small(distance);
//clusterPoints[j][little.pos]=true;
intraclustDistance[little.pos]+=little.distance;
clusterSize[little.pos]++;
}
for(int k=0;k<particles[0].length;k++)
{
intraclustDistance[k]=intraclustDistance[k]/clusterSize[k];
if(Float.isNaN(intraclustDistance[k])==true)
intraclustDistance[k]=(float)(3.3406782);
System.out.println("The intracluster distance in cluster:"+k+" is"+intraclustDistance[k]);
}
System.out.println();
for(int k=0;k<particles[0].length;k++)
newFitness[i]+=intraclustDistance[k];
newFitness[i]=newFitness[i]/particles[0].length;
if(Float.isNaN(newFitness[i])==true)
newFitness[i]=fitness[i];
String l=Float.toString(newFitness[i]);
/*if(l.length()>5)
{
int pos;
pos=l.indexOf('.');
//System.out.println(pos);
String s=l.substring(0,pos+4);
//System.out.println(s);
newFitness[i]=Float.parseFloat(s);
}*/
}
}
}
if(gBestFitness>a.distance)
{
gBestFitness=a.distance;
flag=1;
}
}
System.out.println("The gBest Fitness is:"+gBestFitness);
if(flag==1)
for(j=0;j<particles[0].length;j++)
{
//System.out.println("The gBest Fitness is assigned");
System.arraycopy(particles[a.pos][j],0,gBest[j],0,(particles[0][0]).length);
}
}
public boolean checkFitness()
{
float a;
byte count=0;
//byte pos;
//String l;
//l=Float.toString(fitness[0]);
//pos=l.indexOf(.);
a=newFitness[0];
for(int i=1;i<newFitness.length;i++)
if(Math.abs(a-newFitness[i])==0)
count++;
if(count==newFitness.length-1)
{
System.out.println("After checking Fitness:");
for(int l=0;l<newFitness.length;l++)
System.out.println(newFitness[l]);
return true;
}
return false;
}
public void psoalg(int n)
{
int i,j,k;
for(i=0;i<n;i++)
{
System.gc();
System.out.println();
System.out.println("iteration: "+i);
System.out.println();
calFitness();
if(i==0)
{
System.arraycopy(newFitness,0,fitness,0,newFitness.length);
//System.out.println("newFitness:"+newFitness[0]);
System.out.println("Fitness:"+fitness[0]);
System.out.println();
for(j=0;j<fitness.length;j++)
{
for(k=0;k<particles[0].length;k++)
{
System.arraycopy(particles[j][k],0,pBest[j][k],0,(particles[j][k]).length);
}
}
System.gc();
findgBest(i);
changePartiVelocityLocation();
}
else
{
System.gc();
findpBest();
findgBest(i);
changePartiVelocityLocation();
}
if(checkFitness())
{
System.out.println("Yes");
break;
}
}
out.println();
}
}
}
catch(IOException e)
{
JOptionPane.showMessageDialog(null,e.toString(),"pso-show()",JOptionPane.ERROR_MESSAGE);
//return count;
}
finally
{
if(out!=null)
out.close();
}
}
public float centToCentDistance(/*PrintWriter out1*/)
{
float result=0;
for(int i=0;i<gBest.length;i++)
{
for(int j=i+1;j<gBest.length;j++)
{
float temp;
temp=eucliDistance(gBest[i],gBest[j]);
System.out.println("The distance from centroid"+(i+1)+" to centroid"+(j+1)+" is : "+temp);
//out1.println("The distance from centroid"+(i+1)+" to centroid"+(j+1)+" is : "+temp);
result+=temp;
}
}
int n=gBest.length;
n=(n*(n-1))/2;
result=result/n;
System.out.println("The average distance is:"+result);
return result;
}
clusterPoints=new boolean[tfIdf.length][gBest.length];
for(int i=0;i<gBest.length;i++)
{
clusterSizei[i]=0;
distancei[i]=(float)(0);
intraclustDistancei[i]=(float)(0);
Arrays.fill(clusterPoints[i],false);
//newFitness[i]=(float)(0);
}
for(int j=0;j<tfIdf.length;j++)
{
for(int k=0;k<gBest.length;k++)
{
distancei[k]=eucliDistance(tfIdf[j],gBest[k]);
}
littlei=Small(distancei);
clusterPoints[j][littlei.pos]=true;
intraclustDistancei[littlei.pos]
+=littlei.distance;
clusterSizei[littlei.pos]++;
}
for(int k=0;k<gBest.length;k++)
intraclustDistancei[k]=intraclustDistancei[k]/clusterSizei[k];
for(int k=0;k<gBest.length;k++)
{
System.out.println("Cluster"+k+":"+intraclustDistancei[k]);
fitnessi+=intraclustDistancei[k];
}
fitnessi=fitnessi/gBest.length;
//String l=Float.toString(newFitness[0]);
/*if(l.length()>5)
{
int pos;
pos=l.indexOf('.');
//System.out.println(pos);
String s=l.substring(0,pos+4);
//System.out.println(s);
newFitness[i]=Float.parseFloat(s);
}*/
System.out.println("The gBest fitness is:"+fitnessi);
return fitnessi;
//finddocclust(clusterPoints);
}
public String finddocclust()
{
String clust;
clust="";
int flag=0;
for(int i=0;i<clusterPoints[0].length;i++)
{
clust+="The documents under cluster:
"+i+" are:"+"\n";
flag=0;
for(int j=0;j<clusterPoints.length;j++)
{
if(clusterPoints[j][i]==true)
{
flag++;
String s=Integer.toString(j);
clust+=s;
if(flag%5==0)
clust+="\n";
else
clust+="\t";
if(flag==5)
flag=0;
}
}
clust+="\n"+"**************************************"+"\n";
}
5. Testing
5.1 Unit Testing
Tests for Input
Test case:
What happens when we press OK leaving all fields empty?
Expected Output:
When the user clicks on OK without any input for the
fields, it should prompt an error message in a dialog box
saying "Select appropriate fields properly".
Observed Output:
When OK is pressed, an error is prompted in the dialog
box. The error shown is the same as expected. No errors
are displayed when all fields are entered correctly.
Observed Output:
When OK is pressed, an error is prompted in the dialog
box. The error shown is the same as expected. No errors
are displayed when the features field is entered correctly.
Observed Output:
When OK is pressed, an error is prompted in the dialog
box. The error shown is the same as expected. No errors
are displayed when the vectors field is entered correctly.
Observed Output:
When OK is pressed, an error is prompted in the dialog
box. The error shown is the same as expected. No errors
are displayed when an algorithm is selected correctly.
[Table: comparison of PSO, TS and TSPSO on Dataset1,
Dataset2 and Dataset3; reported values: 0.489, 0.458,
0.38, 0.502, 0.491, 0.305, 0.561, 0.482, 0.4]

[Figure: F-Score values of the algorithms on the re0, fbis,
tr11, tr12 and re1 datasets]
6. Results
This is the home page, where the user must click Next in
order to carry out the tasks that need to be performed.
The above figure shows the result generated for the given
input files and cluster.
7. CONCLUSION
In this thesis a new hybrid algorithm that uses Tabu
search and basic PSO is proposed to solve the problem of
document clustering. PSO has been proven an effective
optimization technique for combinatorial optimization
problems. Tabu search, an efficient local search
procedure, helps to explore solutions in different regions
of the solution space. The quality of the solutions
obtained by the hybrid algorithm strongly substantiates
its effectiveness for document clustering in an IR system.
We compared TSPSO with tabu search (TS) and particle
swarm optimization (PSO); the comparison shows that
TSPSO has the largest VRC values among all the
algorithms, which indicates that TSPSO is effective for the
document cluster analysis problem. Future work includes
using more standard datasets to test the performance of
TSPSO.
The clustering algorithms are applied to different
datasets, and the results of the proposed TSPSO
algorithm are compared with the other existing
algorithms. Finally, the VRC values of each algorithm are
compared, and it is concluded that the TSPSO algorithm
gives more accurate clusters compared to the remaining
algorithms.
References
1. P. Jaganathan, S. Jaiganesh: An improved K-means
algorithm combined with Particle Swarm Optimization
approach for efficient web document clustering.
International Conference on Green Computing,
Communication and Conservation of Energy (ICGCE),
IEEE (2013).
2. M. Yaghini, N. Ghazanfari: Tabu-KM: A Hybrid Clustering
Algorithm Based on Tabu Search Approach. International
Journal of Industrial Engineering & Production Research,
September (2010), Volume 21.
3. Pritesh Vora, Bhavesh Oza: A Survey on K-mean
Clustering and Particle Swarm Optimization. International
Journal of Science and Modern Engineering (IJISME),
ISSN: 2319-6386, Volume-1, Issue-3, February (2013).
Sites Referred:
http://java.sun.com
http://www.sourcefordgde.com
http://www.networkcomputing.com/
http://www.roseindia.com/
http://www.java2s.com/
http://www.javadb.com/