CHAPTER 1

INTRODUCTION

1.1 OVERVIEW:
The objective of this system is to find clusters in a high-dimensional data space, where each
dimension represents an attribute, using the algorithm “Projective clustering based on the k-means
algorithm”. Each attribute holds the value of every data point for that attribute. The proposed
algorithm does not presume any distribution on the individual dimensions of the input data.
Furthermore, no restriction is imposed on the size of the clusters or on the number of relevant
dimensions of each cluster. A projected cluster should have a significant number of selected
(i.e., relevant) dimensions in which a large number of points are close to each other.

The projected clustering problem is to identify a set of clusters and their relevant
dimensions such that intra-cluster similarity is maximized while inter-cluster similarity is
minimized. The system takes as input the multi-dimensional data in Excel format (.xls), the
maximum number of components in an individual dimension, and the number of nearest elements.
It estimates the scale parameter and shape parameter by maximizing the likelihood function
using the EM algorithm. It then initializes the cluster centers, generates the membership matrix,
and tests for convergence. If the data points in a cluster are not relevant (i.e., they are too
sparse), the cluster is split into further groups and checked again. Based on the scale
parameters, the code length is calculated and a matrix is generated. Outlier handling then removes
irrelevant data points from the data space: data points that lie outside or far away from the
clusters are identified and removed from the dataset. After the outliers are removed, projected
clusters are computed by applying a distance function within each individual cluster.

1.2 Objectives:

1.2.1 Computation of Sparseness degree:


In this module, the high-dimensional database table, the number of clusters, and the number of
dimensions are given as input to find the sparseness degree of the data points in each dimension.
The sparseness degree reflects the density of data points in a particular region: a low value
indicates a dense region and a high value indicates a sparse region. The major advantage of the
sparseness degree is that it provides a relative measure by which dense regions are more easily
distinguished from sparse regions.

1.2.2 Split the data into clusters


In this module, the fuzzifier value, the number of clusters, and a threshold ε are given as input
to the system. The splitting into clusters is based on the fuzzy k-means algorithm. First, we
initialize the cluster centers v_i(0) (i = 1, 2, ..., c). We then calculate the relative distance
of each point to each cluster center. If the relativity is greater than the threshold, the result
is considered valid and the process terminates; otherwise the process is repeated with updated
cluster centers. The output of the module is the optimal count of clusters, the cluster centers,
and the membership matrix.
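
The membership matrix mentioned above can be illustrated with the standard fuzzy c-means membership update. This is only a sketch under assumptions: the report does not give its formula, so the usual update u_ij = 1 / Σ_k (d_ij / d_kj)^(2/(m-1)) is assumed, and the class and method names are hypothetical.

```java
/** Sketch of the standard fuzzy c-means membership update (assumed, not taken from the report). */
final class FuzzyMembershipSketch {
    /** Returns u[i][j], the membership of point j in cluster i, for a fuzzifier m > 1. */
    static double[][] memberships(double[][] points, double[][] centers, double m) {
        int c = centers.length;
        int n = points.length;
        double[][] u = new double[c][n];
        double exponent = 2.0 / (m - 1.0);
        for (int j = 0; j < n; j++) {
            double[] dist = new double[c];
            int exact = -1; // index of a center that coincides with the point, if any
            for (int i = 0; i < c; i++) {
                dist[i] = distance(points[j], centers[i]);
                if (dist[i] == 0.0) exact = i;
            }
            for (int i = 0; i < c; i++) {
                if (exact >= 0) { // avoid division by zero: full membership in that cluster
                    u[i][j] = (i == exact) ? 1.0 : 0.0;
                    continue;
                }
                double sum = 0.0;
                for (int k = 0; k < c; k++) sum += Math.pow(dist[i] / dist[k], exponent);
                u[i][j] = 1.0 / sum;
            }
        }
        return u;
    }

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(sum);
    }
}
```

Each column of the returned matrix sums to one, so a point halfway between two centers receives a membership of 0.5 in each.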

1.2.3 Estimating probability density function:


In this module, the probability density function is estimated to identify the characteristics of
each dimension. The outputs of the previous module (the optimal count of clusters, the cluster
centers, and the membership matrix) are given as input. From these we find the scale parameter,
the shape parameter, and the Bayesian information criterion (BIC), and estimate the density based
on the gamma distribution function.

1.3 PROJECT PLAN:
The project Clustering High Dimensional Data is created using Java as the front end.
The Eclipse Java IDE is used for developing the tool. The back-end intermediate files are stored
as .xls spreadsheet files. The Windows 7 operating system was used during development.

The project takes as input an Excel spreadsheet containing different kinds of data values about
people. The output is the optimal number of clusters. Intermediate results contain the
sparseness degree values, the clusters produced by the k-means algorithm, and the Bayesian
information criterion values used to fix the optimal number of clusters.

1.4 ORGANIZATION OF THE THESIS


The rest of the thesis is organized as follows.

Chapter 2 focuses on the literature summary. It reviews various methods proposed to
cluster high-dimensional data.
Chapter 3 focuses on the system architecture. It describes the modules involved in the
architecture diagram.
Chapter 4 identifies the modules to be implemented and discusses the dependencies that
exist between them. It also reports the steps to implement the modules, their inputs, and their
expected outputs.
Chapter 5 discusses the implementation details step by step. Results are shown in
screenshots, and sample test inputs and outputs are presented.
Chapter 6 gives concluding remarks about the project and looks at future extensions.
References for this project are listed at the end.
Appendix A contains a work sheet prepared for each module.
Appendix B lists the features to be supported by the system. It gives a detailed requirement
specification of the system, covering both functional and non-functional application
requirements.

CHAPTER 2

LITERATURE SUMMARY

2.1 Taxonomy Representation

Fig 2.1: Taxonomy representation of literature summary

Rong et al. proposed Density-Based Spatial Clustering of Applications with Noise
(DBSCAN), a density-based clustering algorithm, in 2004. It forms clusters by growing
high-density areas, and it can find clusters of arbitrary shape. DBSCAN requires two
parameters: epsilon (eps) and the minimum number of points (minPts). It starts with an arbitrary
starting point that has not been visited and finds all the neighbor points within distance eps of
it. If the number of neighbors is greater than or equal to minPts, a cluster is formed: the
starting point and its neighbors are added to this cluster and the starting point is marked as
visited. The algorithm then repeats the evaluation for all the neighbors recursively. If the
number of neighbors is less than minPts, the point is marked as noise. Once a cluster is fully
expanded (all points within reach have been visited), the algorithm proceeds to the remaining
unvisited points in the dataset.
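
The expansion procedure described above can be sketched as follows. This is a minimal illustration, not the code of the cited work: it assumes Euclidean distance, counts a point as its own neighbor, replaces the recursion with a queue, and uses hypothetical class and method names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Minimal DBSCAN sketch; labels are -1 for noise and 0..k-1 for cluster indices. */
final class DbscanSketch {
    static final int UNVISITED = -2;
    static final int NOISE = -1;

    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] label = new int[points.length];
        Arrays.fill(label, UNVISITED);
        int clusterId = 0;
        for (int p = 0; p < points.length; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> neighbors = regionQuery(points, p, eps);
            if (neighbors.size() < minPts) {
                label[p] = NOISE; // may later be claimed by a cluster as a border point
                continue;
            }
            label[p] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (label[q] == NOISE) label[q] = clusterId; // border point, not expanded
                if (label[q] != UNVISITED) continue;
                label[q] = clusterId;
                List<Integer> qNeighbors = regionQuery(points, q, eps);
                if (qNeighbors.size() >= minPts) queue.addAll(qNeighbors); // q is a core point
            }
            clusterId++;
        }
        return label;
    }

    /** Indices of all points within eps of points[p], including p itself. */
    static List<Integer> regionQuery(double[][] points, int p, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            if (distance(points[p], points[i]) <= eps) result.add(i);
        }
        return result;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(sum);
    }
}
```

The linear scan in regionQuery keeps the sketch short; a spatial index would normally replace it for large data sets.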

DBSCAN does not require the number of clusters to be known a priori, as opposed to k-means.
It can find arbitrarily shaped clusters, including clusters completely surrounded by (but not
connected to) a different cluster. Due to the minPts parameter, the so-called single-link
effect (different clusters being connected by a thin line of points) is reduced. The algorithm
has a notion of noise and is mostly insensitive to the ordering of the points in the database.
Its main disadvantage is that it does not respond well to data sets with varying densities
(called hierarchical data sets).

Teuvo Kohonen et al. proposed the self-organizing map. The self-organizing feature map
(SOFM) is a neural network approach that uses competitive unsupervised learning.
Learning is based on the concept that the behavior of a node should impact only those nodes and
arcs near it. Weights are initially assigned randomly and adjusted during the learning process to
produce better results. During this learning process, hidden features or patterns in the data are
uncovered and the weights are adjusted accordingly. The self-organizing map is a single-layer
feed-forward network whose output neurons are arranged in a low-dimensional (usually 2D or
3D) grid. Each input is connected to all output neurons. Attached to every neuron is a weight
vector with the same dimensionality as the input vectors. The goal of learning in the
self-organizing map is to make different parts of the SOM lattice respond similarly to
certain input patterns. Initially, the weights and learning rate are set. The input vectors to be
clustered are presented to the network. For each input vector, the winner unit is calculated from
the current weights, either by the Euclidean distance method or by the sum of products
method. The weights of that winner unit are then updated. An epoch is complete once all the
input vectors have been presented to the network. By updating the learning rate, several epochs
of training may be performed.
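
One training step of the map, as described above, can be sketched as follows. The sketch assumes the Euclidean distance method for choosing the winner and the common update rule w ← w + rate · (x − w), which the text does not spell out. Only the winner is updated, as in the description, and the names are hypothetical.

```java
/** Sketch of one SOM training step: pick the winner by Euclidean distance, then update it. */
final class SomStepSketch {
    /** Index of the neuron whose weight vector is closest to the input. */
    static int winner(double[][] weights, double[] input) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int n = 0; n < weights.length; n++) {
            double dist = 0.0;
            for (int d = 0; d < input.length; d++) {
                double diff = input[d] - weights[n][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = n;
            }
        }
        return best;
    }

    /** Moves the winner's weights toward the input (assumed rule: w += rate * (x - w)). */
    static int trainStep(double[][] weights, double[] input, double rate) {
        int w = winner(weights, input);
        for (int d = 0; d < input.length; d++) {
            weights[w][d] += rate * (input[d] - weights[w][d]);
        }
        return w;
    }
}
```

Calling trainStep once per input vector makes up one epoch; the learning rate would typically be reduced between epochs.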

Yip et al. presented a hierarchical subspace clustering approach with automatic relevant
dimension selection, called HARP. HARP is based on the assumption that two objects are likely
to belong to the same cluster if they are very similar to each other along many dimensions.
Clusters are allowed to merge only if they are similar enough in a number of dimensions, where
the minimum similarity and the minimum number of similar dimensions are controlled by two
internal threshold parameters. Due to its hierarchical nature, the algorithm is intrinsically
slow. Also, if the number of relevant dimensions per cluster is extremely low, the accuracy of
HARP may drop, as the basic assumption becomes less valid in the presence of a large number of
noise values in the data set. A dimension receives an index value close to the maximum value
(one) if the local variance is extremely small, which means the projections form an excellent
signature for identifying the cluster members. Alternatively, if the local variance is only as
large as the global variance, the dimension receives an index value of zero.

Eric Ka Ka Ng et al. proposed an efficient projective clustering technique by histogram
construction (EPCH) in 2005. The histograms help to generate signatures, where a signature
corresponds to some region in some subspace, and signatures with a large number of data objects
are identified as the regions for subspace clusters. Hence, projected clusters and their
corresponding subspaces can be uncovered. EPCH (Efficient Projective Clustering by Histograms)
focuses on uncovering projected clusters with varying dimensionality, without requiring the user
to input the average dimensionality of the associated subspaces or the number of clusters that
naturally exist in the data set. EPCH requires very little prior knowledge about the data. A
general user needs to provide only one input: the maximum number of clusters the user is
interested in uncovering. If the number of natural clusters is smaller than this maximum, EPCH
returns all the discovered clusters (there may be fewer than the maximum). Otherwise, it returns
the top-ranked clusters up to the maximum. Therefore, an inaccurate estimate of this parameter
will not affect the accuracy of the clustering output.

Rakesh Agrawal et al. proposed the CLIQUE algorithm in 2005. It is a clustering algorithm
that satisfies each of these requirements: the ability to find clusters embedded in
subspaces of high-dimensional data, scalability, end-user comprehensibility of the results,
non-presumption of any canonical data distribution, and insensitivity to the order of input
records. It identifies dense clusters in subspaces of maximum dimensionality and generates
cluster descriptions in the form of DNF expressions that are minimized for ease of
comprehension. CLIQUE automatically finds subspaces with high-density clusters.

It produces identical results irrespective of the order in which the input records are
presented, and it does not presume any canonical distribution for the input data. Empirical
evaluation shows that CLIQUE scales linearly with the number of input records and has good
scalability as the number of dimensions (attributes) in the data, or the highest dimension in
which clusters are embedded, is increased.

C.M. Procopiuc et al. proposed an algorithm for fast projective clustering in “Monte Carlo
Algorithm for Fast Projective Clustering” in 2002. This Monte Carlo algorithm computes
projective clusters iteratively. During each iteration, it computes an approximation of
an optimal cluster over the current set of points. The termination criterion can be defined in
more than one way, e.g., a certain percentage of the points have been clustered, or a
user-specified number of clusters have been computed. In contrast to partitioning methods, the
user need not specify the number of clusters k unless desired. This allows more flexibility in
tuning the algorithm to the particular application that uses it. One particularly desirable
property of this method is that it is accurate even when the cluster sizes vary significantly (in
terms of number of points). Many partitioning methods rely on random sampling for computing an
initial partition, so a small cluster can be missed by the sample; as a result, its points are
either assigned to other clusters or declared outliers. A greedy method is employed in this
algorithm, which computes each cluster in turn. Its accuracy depends on finding a good
definition for an optimal projective cluster. It proves highly accurate and stable for
various types of data on which partitioning algorithms are not always successful.

The naive k-means algorithm partitions the dataset into k subsets such that every data point
belongs to exactly one subset and each subset has a center. The points in a given subset are
closer to that center than to any other center. The algorithm keeps track of the centroids of
the subsets and proceeds in simple iterations. The initial partitioning is randomly generated,
that is, we randomly initialize the centroids to some points in the region of the space. In each
iteration step, a new set of centroids is generated from the existing set of centroids in two
very simple steps.
(i) Partition the points based on the centroids C(i), that is, find the centroid to which
each of the points in the dataset belongs. The points are partitioned based on the Euclidean
distance from the centroids.
(ii) Set each new centroid C(i+1) to be the mean of all the points in its subset, i.e., the
points that are closest to it.
The algorithm is said to have converged when recomputing the partitions does not result
in a change in the partitioning. For configurations where no point is equidistant to more than
one center, the above convergence condition can always be reached. This convergence property,
along with its simplicity, adds to the attractiveness of the k-means algorithm.

The k-means algorithm needs to perform a large number of "nearest-neighbour" queries for the
points in the dataset. If the data is d-dimensional and there are N points in the dataset, the
cost of a single iteration is O(kdN). Sometimes the convergence of the centroids (i.e., C(i) and
C(i+1) being identical) takes several iterations, and in the last several iterations the
centroids move very little. As running the expensive iterations so many more times might not be
efficient, we need a measure of convergence of the centroids so that we stop the iterations when
the convergence criteria are met. Distortion is the most widely accepted measure.
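
As an illustration of such a convergence measure, the sketch below computes distortion, taken here in its usual sense as the sum of squared Euclidean distances from each point to its nearest centroid. That definition is an assumption, since the text does not state it. The relative tolerance used for stopping is a hypothetical parameter.

```java
/** Sketch: distortion as the sum of squared distances from each point to its nearest centroid. */
final class DistortionSketch {
    static double distortion(double[][] points, double[][] centroids) {
        double total = 0.0;
        for (double[] p : points) {
            double nearest = Double.MAX_VALUE;
            for (double[] c : centroids) {
                double sq = 0.0;
                for (int d = 0; d < p.length; d++) sq += (p[d] - c[d]) * (p[d] - c[d]);
                nearest = Math.min(nearest, sq);
            }
            total += nearest;
        }
        return total;
    }

    /** True when the relative change in distortion is within the (hypothetical) tolerance. */
    static boolean converged(double previous, double current, double tolerance) {
        return Math.abs(previous - current) <= tolerance * Math.max(previous, 1e-12);
    }
}
```

Stopping on a small relative change avoids the last few iterations in which the centroids barely move, at the cost of a slightly less exact partition.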

CHAPTER 3

SYSTEM ARCHITECTURE

2.1 SYSTEM ARCHITECTURE DESIGN

[Figure: Block diagram for phase-1, including the stage "Applying k-means algorithm"]

3.1 Finding nearest neighbours

The program takes the Excel spreadsheet values as input. The following tasks are carried out in
finding nearest neighbors.

• Sort the values in each dimension.
• Perform comparisons to find the required number of nearest values.

3.2 Estimating Sparseness degree

The main processing unit of the system is the estimation of the sparseness degree, which is used
to view the distribution of data points graphically. Specifically, estimating the sparseness
degree includes initializing centers for the sets of nearest neighbors and measuring the distance
from each center to the remaining data point attributes.

3.3 Clustering process

Using the k-means algorithm, the data points are clustered into each number of clusters up to
the maximum number specified by the user. The centers are initialized from the data points, and
the process is repeated to find the optimum centers. A distance function is used for this process.

3.4 Optimum no. of clusters by Bayesian information criterion

For each set of clusters, the Bayesian information criterion (BIC) value is computed. The number
of clusters with the lowest BIC value is taken as the optimum count of clusters. The BIC value is
computed using the gamma function and logarithmic values.

3.5 Data flow diagrams

[Figure: context diagram. Input: high-dimensional data space. Process 1: PDF estimation for all
data points. Output: optimum no. of clusters.]

Figure 3.2 Level 0: Context diagram

[Figure: Level 1 DFD. Input: high-dimensional data. Process 1.1: compute sparseness degree
(normalized sparseness degree values). Process 1.2: split into data sets (clustered sets,
centers). Process 1.3: calculate BIC values. Process 1.4: find minimum value. Output: optimal
no. of clusters.]

Figure 3.3 Level 1 DFD

[Figure: Level 2 DFD. Input: high-dimensional dataset. Processes: 1.1.1 find nearest neighbors
(store all the nearest neighbors), 1.1.2 centers for data sets, 1.1.3 find sparseness degree;
1.2.1 identify centers to dataset (store centers DB), split the data into clusters, 1.2.3
generate clusters (store clustered sets); 1.3.1 apply maximum likelihood function, 1.3.2 compute
BIC value (BIC values). Output: optimal no. of clusters.]

Figure 3.4 Level 2 DFD

CHAPTER 4

Clustering high dimensional data

4.1 INTRODUCTION

The system "Clustering High Dimensional Data" has six modules, three of which are covered in this
phase: estimating the sparseness degree, the clustering process, and the Bayesian information
criterion to find the optimal no. of clusters.

4.2 Estimating Sparseness degree


The main aim of the Java code for estimating the sparseness degree is to find the degree of
sparseness of the attribute values of the data points in each dimension. The features of the Java
code are as follows:

• Sort the values in each dimension.
• Perform comparisons to find the required number of nearest values.
• Initialize centers for the sets of nearest neighbors.
• Measure the distance from each center to all other points in its nearest-neighbor set.

The main processing unit of the system is the estimation of the sparseness degree, which is used
to view the distribution of data points graphically. Specifically, estimating the sparseness
degree includes initializing centers for the sets of nearest neighbors and measuring the distance
from each center to the remaining data point attributes.

Module 1

Fig 4.1: Finding Sparseness degree

4.3 Clustering Process

The main aim of the clustering process is to divide the given data points into the required
number of smaller data sets. The steps involved in this module are:

• Initialize centers for the large data set.
• Find the distance from each data point to each center, and assign the data point to the data
set of its nearest center.
• Find new centers for the data sets by taking the average value in each dimension.
• Repeat the process until the current centers are equal to the previous centers.
• Repeat the process for each number of clusters from 1 to the maximum.

Using the k-means algorithm, the data points are clustered into each number of clusters up to
the maximum number specified by the user. The centers are initialized from the data points, and
the process is repeated to find the optimum centers. A distance function is used for this process.

Module 2:

Fig 4.2: Applying k-means algorithm

4.4 Bayesian Information criterion

The main aim of this module is to compute the optimal no. of clusters using the Bayesian
information criterion. The steps included are as follows:

• Find the logarithmic values of each set of data values.
• Find the shape parameter and the scale parameter.
• Estimate the gamma function and the likelihood function.
• Find the Bayesian information criterion to fix the optimal count of clusters.
• Repeat this process for each number of clusters from 1 to the maximum.
• The number of clusters with the lowest BIC value is taken as the optimal no. of clusters.

For each set of clusters, the Bayesian information criterion (BIC) value is computed. The number
of clusters with the lowest BIC value is taken as the optimum count of clusters. The BIC value is
computed using the gamma function and logarithmic values.

Module 3:

Fig 4.3: Finding optimal no.of clusters

CHAPTER 5

IMPLEMENTATION AND RESULTS

5.1 Implementation Steps

5.1.1 Sparseness degree computation

The system reads the input data set values from the Excel spreadsheet. The main processing unit
of the system is the estimation of the sparseness degree, which is used to view the distribution
of data points graphically. Specifically, estimating the sparseness degree includes sorting the
values in each dimension, performing comparisons to find the required number of nearest values,
initializing centers for the sets of nearest neighbors, and measuring the distance from each
center to the remaining data point attributes.

Computation of Sparseness degree

Input: high-dimensional database table, no. of clusters, dimensions of table

Key Steps

Step 1: Read the number of nearest neighbors.

Step 2: Sort the magnitude values of the data points in each dimension.

Step 3: Find the nearest neighbors of each magnitude value in each dimension.

Step 4: Estimate the center of each nearest-neighbor set.

Step 5: Calculate the sparseness degree values for each dimension.

Step 6: Normalize the sparseness degree in each dimension.

Output: Yij (sparseness degree)

Fig 5.1: Procedure for finding the sparseness degree of data points in each dimension
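
The six steps above can be sketched for a single dimension as follows. The report does not give the formula for Yij, so this sketch assumes that the sparseness degree of a value is the mean squared distance of its k-nearest-neighbor set (the value itself included) from the center of that set, and that normalization is min-max scaling. All names are hypothetical.

```java
import java.util.Arrays;

/** Sketch of a per-dimension sparseness degree; the variance-based formula is an assumption. */
final class SparsenessSketch {
    /** Returns y[i] for one dimension, computed over value i and its k nearest neighbors. */
    static double[] sparseness(double[] column, int k) {
        int n = column.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Step 2: sort the indices by value.
        Arrays.sort(order, (a, b) -> Double.compare(column[a], column[b]));
        double[] y = new double[n];
        for (int pos = 0; pos < n; pos++) {
            // Step 3: grow a window around pos, always taking the closer side next.
            int lo = pos;
            int hi = pos;
            double v = column[order[pos]];
            while (hi - lo < k && (lo > 0 || hi < n - 1)) {
                double left = lo > 0 ? v - column[order[lo - 1]] : Double.MAX_VALUE;
                double right = hi < n - 1 ? column[order[hi + 1]] - v : Double.MAX_VALUE;
                if (left <= right) lo--; else hi++;
            }
            // Step 4: center of the neighbor set.
            double center = 0.0;
            for (int t = lo; t <= hi; t++) center += column[order[t]];
            center /= (hi - lo + 1);
            // Step 5: mean squared distance to the center.
            double sum = 0.0;
            for (int t = lo; t <= hi; t++) {
                double diff = column[order[t]] - center;
                sum += diff * diff;
            }
            y[order[pos]] = sum / (hi - lo + 1);
        }
        return y;
    }

    /** Step 6: min-max normalization to the range [0, 1]. */
    static double[] normalize(double[] y) {
        double min = Arrays.stream(y).min().orElse(0.0);
        double max = Arrays.stream(y).max().orElse(0.0);
        double[] out = new double[y.length];
        for (int i = 0; i < y.length; i++) {
            out[i] = max > min ? (y[i] - min) / (max - min) : 0.0;
        }
        return out;
    }
}
```

Because the values are sorted first, the k nearest neighbors of a value always form a contiguous window in the sorted order, which is why a sliding window is enough here.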

5.1.2 Clustering:
In this approach, the data set points are clustered into each number of clusters up to the
maximum specified by the user, in order to determine the optimal number of clusters. The centers
are initialized from the data points, as in the k-means algorithm, and the process is repeated for
every number of clusters. Each set of clusters is formed by the k-means algorithm by evaluating
distance measures. Finally, the clustered data sets are the result of this module.

Clustering process
Input: maximum no. of clusters, dataset

Key Steps

Step 1: The maximum number of clusters is given as input to the system.

Step 2: Repeat the following process for m = 1 to the maximum value given as input.

Step 3: m is the number of clusters.

Step 4: Initialize the cluster centers at random positions generated by a random function.

Step 5: Find the Euclidean distance from each data point to each cluster center.

Step 6: Allocate each data point to the group whose center is at minimum distance.

Step 7: Find the new centers of the groups by averaging the values.

Step 8: Repeat the process until the current cluster centers are equal to the previous centers.

Output: clustered data sets

Fig 5.2: Clustering procedure
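
Steps 4 to 8 can be sketched as below. This is a minimal k-means illustration rather than the project's code: it assumes the random initial centers are drawn from the data points themselves, stops when the partition no longer changes (which implies the centers no longer change), and adds a hypothetical iteration cap. Names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal k-means sketch following Steps 4-8 of the clustering procedure. */
final class KMeansSketch {
    /** Returns the cluster index of every point; `centers` is updated in place. */
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Steps 5-6: assign each point to the nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (squaredDistance(points[i], centers[c])
                            < squaredDistance(points[i], centers[best])) best = c;
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) break; // Step 8: same partition, hence same centers
            // Step 7: recompute each center as the mean of its points.
            for (int c = 0; c < centers.length; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) continue;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                    count++;
                }
                if (count == 0) continue; // an empty cluster keeps its old center
                for (int d = 0; d < sum.length; d++) centers[c][d] = sum[d] / count;
            }
        }
        return assignment;
    }

    /** Step 4: pick m distinct data points at random as initial centers (requires m <= n). */
    static double[][] randomCenters(double[][] points, int m, Random random) {
        int[] index = new int[points.length];
        for (int i = 0; i < index.length; i++) index[i] = i;
        double[][] centers = new double[m][];
        for (int c = 0; c < m; c++) {
            int pick = c + random.nextInt(index.length - c);
            int tmp = index[c];
            index[c] = index[pick];
            index[pick] = tmp;
            centers[c] = points[index[c]].clone();
        }
        return centers;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return sum;
    }
}
```

Steps 2 and 3 then amount to calling cluster(points, randomCenters(points, m, random), limit) for m = 1 up to the maximum and keeping each result for the BIC module.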

5.1.3 Finding Optimal no. of clusters
The clustered data sets are given as input to this module. They are subjected to the
Expectation-Maximization (EM) algorithm to find parameters such as the scale factor (α) and the
shape factor (β). Based on these parameters, the Bayesian information criterion is evaluated
using the gamma function (Γ) and the likelihood function. The number of clusters with the lowest
Bayesian information criterion value is considered optimal.

Bayesian information criterion

Input: clustered data sets

Key Steps

Step 1: Estimate the scale factor and shape factor for each group of data points.

Step 2: Implement code for the gamma function.

Step 3: Calculate the gamma function for the scale factor.

Step 4: Estimate the maximum likelihood function for all clusters, m = 1 to max no. of clusters.

Step 5: Estimate the Bayesian information criterion for all values m = 1 to max no. of clusters.

Step 6: Take as optimal the no. of clusters for which the BIC value is minimum among all.

Output: scale factor, shape factor, BIC and optimal count of clusters

Fig 5.3: Procedure for finding optimal no. of clusters
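
The procedure above can be sketched as follows, with several assumptions stated plainly. Instead of EM, the sketch uses a well-known closed-form approximation to the gamma maximum-likelihood fit (valid for positive, non-constant values). It computes ln Γ with Stirling's series, uses the conventional shape/scale parameterization of the gamma density, and takes BIC in its usual form −2 ln L + p ln N with the number of free parameters p supplied by the caller. All names are hypothetical.

```java
/** Sketch: gamma fit, log-likelihood and BIC; the closed-form estimator is an assumption, not EM. */
final class GammaBicSketch {
    /** ln Gamma(x) for x > 0: shift x upward with Gamma(x+1) = x Gamma(x), then Stirling's series. */
    static double logGamma(double x) {
        double shift = 0.0;
        while (x < 10.0) {
            shift += Math.log(x);
            x += 1.0;
        }
        double inv = 1.0 / x;
        double inv2 = inv * inv;
        double series = inv * (1.0 / 12 - inv2 * (1.0 / 360 - inv2 / 1260));
        return (x - 0.5) * Math.log(x) - x + 0.5 * Math.log(2 * Math.PI) + series - shift;
    }

    /** Approximate maximum-likelihood fit; returns {shape, scale}. */
    static double[] fit(double[] values) {
        double mean = 0.0;
        double meanLog = 0.0;
        for (double v : values) {
            mean += v;
            meanLog += Math.log(v);
        }
        mean /= values.length;
        meanLog /= values.length;
        double s = Math.log(mean) - meanLog; // > 0 for positive, non-constant data
        double shape = (3 - s + Math.sqrt((s - 3) * (s - 3) + 24 * s)) / (12 * s);
        return new double[] {shape, mean / shape};
    }

    /** Sum of the log gamma density over all values. */
    static double logLikelihood(double[] values, double shape, double scale) {
        double total = 0.0;
        for (double v : values) {
            total += (shape - 1) * Math.log(v) - v / scale
                    - shape * Math.log(scale) - logGamma(shape);
        }
        return total;
    }

    /** BIC = -2 ln L + p ln N; smaller is better. */
    static double bic(double logLikelihood, int parameters, int n) {
        return -2.0 * logLikelihood + parameters * Math.log(n);
    }

    /** Index of the smallest value; if bicValues[0] belongs to m = 1, the optimum is index + 1. */
    static int argMin(double[] bicValues) {
        int best = 0;
        for (int i = 1; i < bicValues.length; i++) {
            if (bicValues[i] < bicValues[best]) best = i;
        }
        return best;
    }
}
```

For a clustering with m groups, one possible use is to fit each group separately, add the group log-likelihoods, and pass p = 2m to bic; whether the report counts parameters this way is not stated.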

5.2 Results

5.2.1 Module 1 - Estimating Sparseness degree

The main processing unit of the system is the estimation of the sparseness degree, which is used
to view the distribution of data points graphically. Specifically, estimating the sparseness
degree includes initializing centers for the sets of nearest neighbors and measuring the distance
from each center to the remaining data point attributes.

Input: high-dimensional database table, no. of clusters, dimensions of table

Output: Yij (sparseness degree)

Fig 5.4: Shows the sparseness degree values in each dimension

Fig 5.5: Shows the sparseness degree values in each dimension

Fig 5.6: Shows the sparseness degree values in each dimension

5.2.2 Module 2 - Clustering into datasets

Using the k-means algorithm, the data points are clustered into each number of clusters up to
the maximum number specified by the user. The centers are initialized from the data points, and
the process is repeated to find the optimum centers. A distance function is used for this process.

Input: maximum no.of clusters, dataset

Output: clustered data sets

Fig 5.7: Shows the centers of clusters in every iteration

Fig 5.8: clustered data sets

Fig 5.9: clustered data sets

5.2.3 Module 3 - Bayesian Information criterion:

For each set of clusters, the Bayesian information criterion (BIC) value is computed. The number
of clusters with the lowest BIC value is taken as the optimum count of clusters. The BIC value is
computed using the gamma function and logarithmic values.

Input: clustered data sets

Output: scale factor, shape factor, BIC and optimal count of clusters

Fig 5.10: scale factor, shape factor & BIC values for clustered set m=1

Fig 5.11: scale factor, shape factor & BIC values for clustered set m=2

Fig 5.12: scale factor, shape factor & BIC values for clustered set m=3

Fig 5.13: scale factor, shape factor & BIC values for clustered set m=4

Fig 5.14: BIC values for each no.of clusters m = 1 to 4

5.3 TEST PLAN
5.3.1 Test case description
The test plan is derived from the functional specifications and the detailed design
specifications. It identifies the details of the test approach and the associated test case areas
within the specific product for this release cycle.

The purpose of the test plan document is to:

• Break the product down into distinct parts and identify the features of the product that
are to be tested.
• Find the expected output of each module.

The testing covers several functions:

• Compute the sparseness degree for all data points in each dimension
• Normalize the sparseness degree
• Split the data set into clusters
• Compute BIC values for each no. of clusters, m = 1 to max no. of clusters

USE CASE ID   DESCRIPTION                                                             TEST CASE
Usecase 1     Finding nearest neighbors                                               Testcase-1
Usecase 2     Compute sparseness degree for all datapoints in each dimension          Testcase-2
Usecase 3     Normalize sparseness degree                                             Testcase-3
Usecase 4     Split the data set into clusters                                        Testcase-4
Usecase 5     Compute BIC values for each no. of clusters m=1 to max no. of clusters  Testcase-5
TABLE 5.1: USE CASE AND TEST CASE

DESCRIPTION AND THE EXPECTED RESULTS OF EACH TEST CASE

TEST CASE ID #1
TEST CASE FIELDS              DETAILS
TEST CASE ID: TEST CASE NAME  1: Finding nearest neighbors
ACTUAL RESULT                 The nearest values for the particular data point attribute were obtained
EXPECTED RESULT               The nearest values for the particular data point attribute are obtained
INFERENCE                     VALID
TABLE 5.2: TEST CASE 1

TEST CASE ID #2
TEST CASE FIELDS              DETAILS
TEST CASE ID: TEST CASE NAME  2: Compute sparseness degree for all datapoints in each dimension
ACTUAL RESULT                 Sparseness degree values for each datapoint in each dimension
EXPECTED RESULT               Sparseness degree values for each datapoint in each dimension
INFERENCE                     VALID
TABLE 5.3: TEST CASE 2

TEST CASE ID #3
TEST CASE FIELDS              DETAILS
TEST CASE ID: TEST CASE NAME  3: Normalize sparseness degree
ACTUAL RESULT                 Normalized values of sparseness degree in the range 0 to 1
EXPECTED RESULT               Normalized values of sparseness degree in the range 0 to 1
INFERENCE                     VALID
TABLE 5.4: TEST CASE 3

TEST CASE ID #4
TEST CASE FIELDS              DETAILS
TEST CASE ID: TEST CASE NAME  4: Split the data set into clusters
ACTUAL RESULT                 Data is split into the corresponding clustered data sets based
EXPECTED RESULT               Data is split into the corresponding clustered data sets based
INFERENCE                     VALID
TABLE 5.5: TEST CASE 4

TEST CASE ID #5
TEST CASE FIELDS              DETAILS
TEST CASE ID: TEST CASE NAME  5: Compute BIC values for each no. of clusters m=1 to max no. of clusters
ACTUAL RESULT                 Obtained BIC values for each no. of clusters
EXPECTED RESULT               Obtained BIC values for each no. of clusters
INFERENCE                     VALID
TABLE 5.6: TEST CASE 5

*BIC: Bayesian information criterion value, used to find the optimum no. of clusters

5.4 Performance analysis
5.4.1 Estimation of Sparseness degree
Graphical representation of sparseness degree of data points in each dimension

Fig 5.15: Sparseness degree of data points in dimension1

Fig 5.16: Sparseness degree of data points in dimension2

Fig 5.17: Sparseness degree of data points in dimension3


Fig 5.18: Sparseness degree of data points in dimension4

Fig 5.19: Sparseness degree of data points in dimension5

Fig 5.20: Sparseness degree of data points in dimension6

5.4.2 Clustering Process
The performance of the clustering process is evaluated as follows:
Clustering accuracy = (Number of obtained clusters) / (Total count of clusters given)
Figure 5.21 shows the graph of clustering accuracy versus test case number.

Fig 5.21: Clustering algorithm Accuracy

For all types of positive numerical data sets, the result is 100% accurate.

5.4.3 Evaluating optimal no.of clusters
The optimal count of clusters is evaluated from the BIC values for each no. of clusters,
m = 1 to max: the count with the lowest BIC value is chosen.
In the diagram, the optimal no. of clusters is 3, which has the lowest BIC value among the 5
clustered datasets.
The BIC value for each clustered set is evaluated from the scale factor and shape factor.

Fig 5.22: BIC value Vs no. of clusters

CHAPTER 6

CONCLUSION AND FUTURE WORK

6.1 Conclusion

The modules for estimation of the sparseness degree, the clustering process, and the Bayesian
information criterion have been coded successfully. The system provides:

• Formation of clusters in high-dimensional space.
• Sparseness degree in each dimension.
• Analysis of data points in each dimension.
• Finding the optimal no. of clusters.
• Shape factor, scale factor, and BIC value using the gamma function.

6.2 Future work

In the next phase, the work to be carried out is finding outliers (data points that are
not relevant to the existing data points in a data cluster) and removing them from the dataset
by measuring the Jaccard distance among the data, i.e., minimizing inter-component sparseness and
maximizing intra-component sparseness to get effective projected clustering results. Finally,
relevant clustered data sets are produced with a high sparseness degree (maximizing the
intra-cluster sparseness degree).

REFERENCES

1. Mohamed Bouguessa and Shengrui Wang, “Mining Projected Clusters in High-Dimensional
Spaces,” IEEE Trans. Knowledge and Data Eng., vol. 21, no. 4, pp. 507-522, Apr. 2009.
2. Haojun Sun, Shengrui Wang, and Qingshan Jiang, “FCM-Based Model Selection Algorithms for
Determining the Number of Clusters,” Pattern Recognition, vol. 37, pp. 2027-2037, 2004.
3. E.K.K. Ng, A.W. Fu, and R.C. Wong, “Projective Clustering by Histograms,” IEEE Trans.
Knowledge and Data Eng., vol. 17, no. 3, pp. 369-383, Mar. 2005.
4. Anne Patrikainen and Marina Meila, “Comparing Subspace Clusterings,” IEEE Trans. Knowledge
and Data Eng., vol. 18, no. 7, pp. 902-916, July 2006.
5. F. Angiulli and C. Pizzuti, “Outlier Mining in Large High-Dimensional Data Sets,” IEEE Trans.
Knowledge and Data Eng., vol. 17, no. 2, pp. 369-383, Feb. 2005.
6. K.Y.L. Yip, D.W. Cheng, and M.K. Ng, “HARP: A Practical Projected Clustering Algorithm,”
IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1387-1397, Nov. 2004.
7. C.C. Aggarwal and P.S. Yu, “Redefining Clustering for High Dimensional Applications,” IEEE
Trans. Knowledge and Data Eng., vol. 14, no. 2, pp. 210-225, Mar./Apr. 2002.
8. C.M. Procopiuc, M. Jones, P.K. Agarwal, and T.M. Murali, “Monte Carlo Algorithm for Fast
Projective Clustering,” Proc. ACM SIGMOD '02, pp. 418-427, 2002.
9. C.C. Aggarwal, C. Procopiuc, J.L. Wolf, P.S. Yu, and J.S. Park, “Fast Algorithm for Projected
Clustering,” Proc. ACM SIGMOD '99, pp. 61-72, 1999.

APPENDIX A

Implementation Work Sheet –Model

Implementation Work Sheet
Project Name : Clustering High Dimensional data

Module Number :1

Module Name : Finding Sparseness degree of all data points in each dimension

Tools Used : Eclipse, Java 1.6

Algorithm : Bubble sort

Status : Finished

Aim:

To estimate the sparseness degree of all data points in each dimension.

Sparseness degree:
The sparseness degree of the data points denotes how close they are to one another. If the
sparseness degree value is high, the data points are assumed to be dispersed over a wide region
(low density). If the sparseness degree value is low, the data points are close together (high density).

Ideas to implement:

The following points are considered for coding:

1. The number of nearest neighbors is given as input.
2. Sort the magnitude values of all data points in each dimension.
3. Use a simple sorting algorithm to sort the individual columns.
4. Find the given number of nearest neighbors for each data point.
   For example, if the number of nearest neighbors is 3, find the three values that are
   closest to a particular data point.
5. Find the center of each nearest-neighbor data set.
6. Estimate the sparseness degree with respect to the centers.
7. Normalize the sparseness degree to the range [0, 1].

Key Steps to be followed:

Step1: Read the number of nearest neighbors.

Step2: Sort the magnitude values of the data points in each dimension.

Step3: Find the given number of nearest neighbors for each magnitude value in each dimension.

Step4: Estimate the center of each nearest-neighbor data set.

Step5: Calculate the sparseness degree values for each dimension.

Step6: Normalize the sparseness degree in each dimension.
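The steps above can be sketched for a single dimension as follows. This is a minimal sketch, not the project's code: the worksheet does not give the formula, so the sparseness degree is assumed here to be the mean squared deviation of a value and its k nearest neighbors from their center, in the spirit of reference 1. The class and method names are hypothetical.

```java
// Illustrative sketch only; names are hypothetical and the variance-style formula is an assumption.
final class SparsenessSketch {
    /** Bubble sort, ascending, in place (the worksheet names bubble sort as the algorithm). */
    static void bubbleSort(double[] v) {
        for (int pass = 0; pass < v.length - 1; pass++) {
            boolean swapped = false;
            for (int i = 0; i < v.length - 1 - pass; i++) {
                if (v[i] > v[i + 1]) {
                    double t = v[i]; v[i] = v[i + 1]; v[i + 1] = t;
                    swapped = true;
                }
            }
            if (!swapped) break; // already sorted
        }
    }

    /**
     * Normalized sparseness degree of every value in one dimension.
     * The result is indexed in sorted order of the column, not input order.
     */
    static double[] sparseness(double[] column, int k) {
        int n = column.length;
        if (k < 1 || k >= n) throw new IllegalArgumentException("need 1 <= k < n");
        double[] v = column.clone();
        bubbleSort(v);
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            // In sorted 1-D data the k nearest neighbors of v[i] form a window around i.
            int lo = i, hi = i;
            while (hi - lo < k) {
                boolean canLeft = lo > 0;
                boolean canRight = hi < n - 1;
                if (canLeft && (!canRight || v[i] - v[lo - 1] <= v[hi + 1] - v[i])) lo--;
                else hi++;
            }
            double sum = 0.0;
            for (int j = lo; j <= hi; j++) sum += v[j];
            double center = sum / (k + 1); // center of the value and its k neighbors
            double ss = 0.0;
            for (int j = lo; j <= hi; j++) ss += (v[j] - center) * (v[j] - center);
            y[i] = ss / (k + 1);
        }
        // Min-max normalization to [0, 1].
        double min = y[0], max = y[0];
        for (int i = 1; i < n; i++) { min = Math.min(min, y[i]); max = Math.max(max, y[i]); }
        for (int i = 0; i < n; i++) y[i] = max > min ? (y[i] - min) / (max - min) : 0.0;
        return y;
    }
}
```

For a column such as {1, 2, 3, 100} with k = 1, the isolated value 100 receives the highest normalized degree, while the tightly packed values receive the lowest.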

Work log

DATE Work Done


02.08.2010 Gathering Input data
04.08.2010 Finding the properties of multi dimensional space
06.08.2010 Refreshing Basic concepts of java
10.08.2010 Gathering tools required, Eclipse, java1.6 etc.,
12.08.2010 Preparing Detailed design of the system
13.08.2010 Changing modifications to detailed design
16.08.2010 Preparing Report for 1st review
18.08.2010 Write code for sorting two dimensional array using bubble sort
20.08.2010 Reading nearest neighbors concepts
23.08.2010 Tracing out the implementation of sparseness degree through nearest neighbors
24.08.2010 Review 1
26.08.2010 Writing code for finding nearest neighbors
30.08.2010 Changing modifications to code
31.08.2010 Writing code for calculating sparseness degree
02.09.2010 Normalizing sparseness degree values
Table A.1: Work done for review 1

Reference

1. Book: “Java How To Program”, P.J. Deitel and H.M. Deitel.

2. Book: “Programming with JAVA”, E. Balagurusamy.

Implementation Work Sheet
Project Name : Clustering High Dimensional Data

Module Number :2

Module Name : Making clusters based on k-means algorithm

Tools Used : Eclipse, Java 1.6

Algorithm : k-means algorithm

Status : Finished

Aim:

To form clusters using the k-means algorithm. As part of the algorithm, the initial cluster
centers are chosen randomly. The clustering process continues until the present cluster centers
are equal to the previous cluster centers.

Implementation Details:

1. The maximum number of clusters is given as input.
2. Using a random function, generate random positions of data points to be considered as
centers.
3. Load the data point values at those random positions into a two dimensional array.
4. Find the distance from every other data point to each center and assign that data point
to the group whose center is at the minimum distance.
5. Find the new cluster centers by taking the average of each clustered group.
6. Repeat the process until the present cluster centers are equal to the previous cluster centers.
7. Repeat all of the above steps for every value of the number of clusters.

Key Steps to follow

Step1: The maximum number of clusters is given as input to the system.

Step2: Repeat the following process from m=1 to the maximum value given as input.

Step3: m is the number of clusters.

Step4: Initialize the cluster centers from random positions generated by a random function.

Step5: Find the Euclidean distance from every other data point to each center.

Step6: Allocate each data point to the group whose center is at the minimum distance.

Step7: Find the new centers of the groups by averaging their values.

Step8: Repeat the process until the present cluster centers are equal to the previous cluster centers.
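The key steps above can be sketched for one value of m as follows. This is a minimal sketch under stated assumptions, not the project's code: the class name, the seed parameter and the iteration cap are hypothetical additions, and squared Euclidean distance is used because it selects the same nearest center as Euclidean distance.

```java
// Illustrative sketch only; names, the seed parameter and the iteration cap are hypothetical.
final class KMeansSketch {
    /** Runs k-means with m clusters and returns the cluster index of each row of data. */
    static int[] cluster(double[][] data, int m, long seed) {
        int n = data.length;
        int d = data[0].length;
        if (m < 1 || m > n) throw new IllegalArgumentException("need 1 <= m <= number of points");
        java.util.Random rnd = new java.util.Random(seed);
        // Step 4: pick m distinct random rows as initial centers (partial shuffle of indices).
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        double[][] centers = new double[m][];
        for (int c = 0; c < m; c++) {
            int j = c + rnd.nextInt(n - c);
            int t = idx[c]; idx[c] = idx[j]; idx[j] = t;
            centers[c] = data[idx[c]].clone();
        }
        int[] label = new int[n];
        boolean changed = true;
        for (int iter = 0; changed && iter < 1000; iter++) {
            // Steps 5 and 6: assign each point to its nearest center.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < m; c++) {
                    double dist = 0.0;
                    for (int k = 0; k < d; k++) {
                        double diff = data[i][k] - centers[c][k];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                label[i] = best;
            }
            // Step 7: recompute each center as the average of its group.
            double[][] sums = new double[m][d];
            int[] counts = new int[m];
            for (int i = 0; i < n; i++) {
                counts[label[i]]++;
                for (int k = 0; k < d; k++) sums[label[i]][k] += data[i][k];
            }
            // Step 8: stop when no center moved.
            changed = false;
            for (int c = 0; c < m; c++) {
                if (counts[c] == 0) continue; // an empty group keeps its previous center
                for (int k = 0; k < d; k++) {
                    double mean = sums[c][k] / counts[c];
                    if (mean != centers[c][k]) { centers[c][k] = mean; changed = true; }
                }
            }
        }
        return label;
    }
}
```

Steps 2 and 3 then amount to calling cluster(data, m, seed) in a loop for m = 1 up to the maximum number of clusters and keeping each result for later comparison.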

Work done:
DATE Work done
06.09.2010 Reading description about module 2
07.09.2010 Writing code for k-means algorithm for a small set of values
09.09.2010 Generalizing algorithm for large set of values using constraints
13.09.2010 Implementing Random function for initialization of clusters
15.09.2010 Changing conditions for repeating loop in algorithm
17.09.2010 Got proper results for k-means algorithm
Table A.2: Work done for review 2

Reference

1. Book: “Java How To Program”, P.J. Deitel and H.M. Deitel.

2. Book: “Programming with JAVA”, E. Balagurusamy.

3. Book: Wiley - Software Measurement and Estimation - A Practical Approach

Implementation Work Sheet
Project Name : Clustering High dimensional data

Module Number :3

Module Name : Bayesian information criterion

Tools Used : Eclipse, Java 1.6

Algorithm : Gamma function for calculating integration value, EM algorithm for
finding scale factor, shape factor and BIC value to find optimal no. of
clusters

Status : Finished

Aim

To estimate the Bayesian Information Criterion (BIC) value for each value of m from 1 to the
maximum number of clusters.

Process:
The optimal number of clusters is the one for which the BIC value is lowest. To calculate the
BIC value, parameters must first be calculated for each group of values: the scale factor (α)
and the shape factor (β).

From the parameters α and β of each group, the maximum likelihood function (L_m) is
calculated for m = 1 to the maximum number of clusters.

Ideas to implement:
1. Estimate the parameters for each group.
2. The parameters are the scale factor (α) and the shape factor (β).
3. Estimate the maximum likelihood function.
4. Calculate the Bayesian Information Criterion.
5. The optimum number of clusters is output: the one for which the BIC value is lowest.
Key Steps to follow

Step1: Estimate the scale factor and shape factor for each group of data points.

Step2: Implement code for the Gamma function.

Step3: Calculate the Gamma function for the scale factor.

Step4: Estimate the maximum likelihood function for all clusters, m=1 to the maximum number of clusters.

Step5: Estimate the Bayesian Information Criterion for all values, m=1 to the maximum number of clusters.

Step6: Select as optimal the number of clusters for which the BIC value is the minimum among all.
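Steps 2 to 6 can be sketched as follows. This is a hedged illustration, not the project's code: the parameters are taken as already estimated (the EM step is not shown), the code uses the neutral names shape and rate for a Gamma density f(y) = rate^shape / Γ(shape) · y^(shape-1) · e^(-rate·y) rather than the worksheet's α and β labels, the log of the Gamma function is computed with the well-known Lanczos approximation, and BIC is assumed to take the usual form -2 ln L + p ln N. All class and method names are hypothetical.

```java
// Illustrative sketch only; names are hypothetical and the BIC form is the standard one, assumed here.
final class BicSketch {
    private static final double[] COF = {
        76.18009172947146, -86.50532032941677, 24.01409824083091,
        -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5
    };

    /** Natural log of Gamma(x) for x > 0 (Lanczos approximation). */
    static double logGamma(double x) {
        double y = x;
        double tmp = x + 5.5;
        tmp -= (x + 0.5) * Math.log(tmp);
        double ser = 1.000000000190015;
        for (int j = 0; j < COF.length; j++) ser += COF[j] / ++y;
        return -tmp + Math.log(2.5066282746310005 * ser / x);
    }

    /** Log-likelihood of positive values under a Gamma density with the given shape and rate. */
    static double gammaLogLikelihood(double[] values, double shape, double rate) {
        double ll = 0.0;
        for (int i = 0; i < values.length; i++) {
            ll += shape * Math.log(rate) - logGamma(shape)
                + (shape - 1.0) * Math.log(values[i]) - rate * values[i];
        }
        return ll;
    }

    /** BIC = -2 ln L + p ln N, where p is the number of free parameters and N the sample size. */
    static double bic(double logLik, int numParams, int n) {
        return -2.0 * logLik + numParams * Math.log(n);
    }

    /** Index of the smallest BIC value; index 0 corresponds to m = 1. */
    static int argMin(double[] bicValues) {
        int best = 0;
        for (int i = 1; i < bicValues.length; i++) {
            if (bicValues[i] < bicValues[best]) best = i;
        }
        return best;
    }
}
```

With one BIC value stored per candidate m, argMin(bicValues) + 1 gives the number of clusters selected in Step 6.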

Work done
20.09.2010 Reading detailed description about module 3
22.09.2010 Trace out the results for finding scale factor and shape factor
24.09.2010 Writing code for shape factor and scale factor
27.09.2010 Analyzing Gamma function
29.09.2010 Writing code for Gamma function
30.09.2010 Testing results of Gamma function
01.10.2010 Writing code for finding Maximum likely hood function
04.10.2010 Writing code to estimate Bayesian Information Criterion(BIC) value
05.10.2010 Integrating all these coding parts and testing results
06.10.2010 Documentation for second Review
08.10.2010 Review 2
Table A.3: workdone for review 2

Reference

1. Book: “Java How To Program”, P.J. Deitel and H.M. Deitel.

2. Book: “Programming with JAVA”, E. Balagurusamy.

Total Work done
DATE Work Done
02.08.2010 Gathering Input data
04.08.2010 Finding the properties of multi dimensional space
06.08.2010 Refreshing Basic concepts of java
10.08.2010 Gathering tools required, Eclipse, java1.6 etc.,
12.08.2010 Preparing Detailed design of the system
13.08.2010 Changing modifications to detailed design
16.08.2010 Preparing Report for 1st review
18.08.2010 Write code for sorting two dimensional array using bubble sort
20.08.2010 Reading nearest neighbors concepts
23.08.2010 Tracing out the implementation of sparseness degree through nearest neighbors
24.08.2010 Review 1
26.08.2010 Writing code for finding nearest neighbors
30.08.2010 Changing modifications to code
31.08.2010 Writing code for calculating sparseness degree
02.09.2010 Normalizing sparseness degree values
03.09.2010 Generating graphs for sparseness degree values for each dimension
06.09.2010 Reading description about module 2
07.09.2010 Writing code for k-means algorithm for a small set of values
09.09.2010 Generalizing algorithm for large set of values using constraints
13.09.2010 Implementing Random function for initialization of clusters
15.09.2010 Changing conditions for repeating loop in algorithm
17.09.2010 Got proper results for k-means algorithm
20.09.2010 Reading detailed description about module 3
22.09.2010 Trace out the results for finding scale factor and shape factor
24.09.2010 Writing code for shape factor and scale factor
27.09.2010 Analyzing Gamma function
29.09.2010 Writing code for Gamma function
30.09.2010 Testing results of Gamma function
01.10.2010 Writing code for finding Maximum likely hood function
04.10.2010 Writing code to estimate Bayesian Information Criterion(BIC) value
05.10.2010 Integrating all these coding parts and testing results
06.10.2010 Documentation for second Review
08.10.2010 Review 2
14.10.2010 Searching for real time data of a hospital
19.10.2010 Mapping out data to existing code
25.10.2010 Documentation
01.11.2010 Rectifying errors
04.11.2010 Doing test cases
07.11.2010 Final report for phase -1
22.11.2010 Modifications to the document
24.11.2010 Final rough draft
Table A.4: total workdone for phase1
APPENDIX - B
SOFTWARE REQUIREMENT SPECIFICATION

1 INTRODUCTION
This is the Software Requirements Specification (SRS) for “Clustering high dimensional
data space”. The purpose of this document is to present a detailed description of the system.
It explains the purpose and features of the system, the interfaces of the system, what the
system will do, the constraints under which it must operate and how the system will react to
external stimuli.

1.1 Purpose
Clustering supports tasks such as pattern recognition and trend analysis. Clustering a
high dimensional data space is a complex task because of the presence of multiple dimensions.

1.2 Scope

The project focuses on generating optimized cluster groups based on relevance
analysis, together with the set of outliers that are irrelevant to the other data points in the
data space. Projected clustering based on the k-means algorithm helps the system group the
relevant data points into components.

1.3 Overview
This document gives the overall description of the system and the specific requirements of
the software, including both the functional and non-functional requirements. The SRS contains
the following sections, in order.
1.3.1 Overall Description of the Product

This section contains the overall description, with sub-sections covering: computation of
the sparseness degree for each dimension; splitting the data into clusters as per the given
count, and rearranging into sub-clusters if data points are irrelevant; PDF estimation;
detecting dense regions; outlier handling, which involves removing irrelevant data points from
the data space; and discovery of projected clusters.

1.3.2 Specific Requirements

This section of the SRS carries the external interface requirements followed by
functional requirements and the requirements by features of the product and the performance
requirements and design constraints.

1.3.3 Software System attributes

This section of the SRS explains the “utilities” factors of the product namely
correctness, efficiency and responsiveness.

2 OVERALL DESCRIPTION

2.1 Product Perspective


The product under development of “Clustering high dimensional data space” is a
dependent and self contained system. The system is externally interfaced to the database
management system.

[Figure: block diagram; labels: High dimensional data space, Sparseness degree, PDF for clusters, Database]
Fig B.1 Brief structure of the system
2.2 Product Functions

• Computation of Sparseness degree
• Split the data into clusters
• PDF estimation
• Detecting dense regions
• Outlier handling
• Discovery of projected clusters

2.3 User Characteristics


The user has the knowledge to identify dimensionalities.

2.4 Constraints

• The product is developed to cluster the data space based on relevance.
• The product is developed using JSP.
• The product requires minimum hardware of 512 MB RAM.

2.5 Assumptions and Dependencies

• The system depends on fuzzy means clustering algorithm to rearrange clusters.
• The system is assumed to have changes only if the data points are not related to the center
in a cluster.

3 SPECIFIC REQUIREMENTS

3.1 External Interface Requirements


3.1.1 User Interfaces
The screen layout displays the data sets; these data sets contain multi-dimensional
data points.

3.1.2 Hardware Interfaces

Minimum requirements
RAM : 512 MB
HDD : 80 GB
Processor : Intel P4

3.1.3 Software Interfaces

The product is built using JSP and runs on the Windows operating system. NetBeans
6.7.1 is used for building the JSP pages. The coverage matrix details are stored in the
Oracle database. The interactions with the JSP are made using JDBC connectivity.

3.2 Software Product Features
3.2.1 Computation of Sparseness degree
Introduction/Purpose of feature
To find the sparseness degree of data points in each dimension of high
dimensional data space.
Stimulus/Response
Stimulus: The user provides high dimensional Database table, no.of clusters, no.of
dimensions.
Response: Yij(sparseness degree) of each dimension in the data space.

3.2.2 Split the data into clusters


Introduction/Purpose of feature
To rearrange the clusters based on relevance of data points in each cluster.
Stimulus/Response
Stimulus: fuzzifier value, no.of clusters, threshold ε.
Response: The no.of clusters and data centers are generated.

3.2.3 PDF estimation


Introduction/Purpose of feature
To find the dense regions based on scale and shape parameters
Stimulus/Response
Stimulus: For finding the dense regions inputs are cluster centers and no.of clusters.
Response: Shape factor and scale factor are produced to analyse the dense areas.

3.3 Functional Requirements

3.3.1 Use Cases

3.3.1.1 Use Case 1


Use-Case name - Computation of Sparseness degree
Actor – Developer
Pre-condition – data points are grouped based on no.of nearest points k, in one
dimensional space.
Post-condition – Sparseness degree of each dimension.

3.3.1.2 Use Case 2
Use-Case name - Split the data into clusters
Actor - Developer.
Pre-condition – Initiate the cluster groups and centers.
Post-condition – The data points in each cluster should be relevant.

3.3.1.3 Use Case 3
Use-Case name - PDF estimation
Actor - Developer.
Pre-condition – The number of clusters and the cluster centers are given.
Post-condition – Scale factor and shape factor are calculated to analyse the dense regions
and sparse regions in the high dimensional data space.

3.3.4 Non Functional Requirements
3.3.4.1 Security Requirements

Specific requirements include


• Keep a specific log of all activities.
• Restrict communications between some areas of the program.

3.3.4.2 Reliability
This system has high precision yielding smaller value for the given input high
dimensional data space.

3.3.4.3 Efficiency
The system provides higher efficiency in terms of finding the projected clusters from
the data points of high dimensional data space.

3.3.4.4 Correctness
The system prioritizes the clusters, evaluates the centers optimally and provides the
optimized projected clusters intended by the user.
