
Introduction

1.1 OVERVIEW:

The objective of this system is to extract clustered components from a high-dimensional data space, where each dimension denotes an attribute, using a projective clustering algorithm based on k-means. Each attribute contains the values of every data point for that attribute. The proposed algorithm does not presume any distribution on the individual dimensions of the input data, and no restriction is imposed on the size of the clusters or on the number of relevant dimensions per cluster. A projected cluster should have a significant number of selected (i.e., relevant) dimensions with high relevance, in which a large number of points lie close to each other.

The projected clustering problem is to identify a set of clusters and their relevant dimensions such that intra-cluster similarity is maximized while inter-cluster similarity is minimized. The system takes as input multi-dimensional data in Excel format (.xls), the maximum number of components in an individual dimension, and the number of nearest elements. It generates the scale parameter and shape parameter by maximizing the likelihood function with the EM algorithm. The cluster centers are initialized, the membership matrix is generated, and convergence is tested. If the data points in a cluster are not relevant to one another (i.e., there is too much sparseness among the data points), the cluster is split into further groups and checked again. Based on these scale parameters, the code length is calculated and a matrix is generated. Outlier handling is then performed to remove irrelevant data points from the data space: the data points that lie outside or far away from the data clusters are identified and removed from the dataset. After removing outliers, the projected clusters are computed by applying a distance function within each individual cluster.


1.2 Objectives:

In this module, a high-dimensional database table, the number of clusters, and the number of dimensions are given as input to find the sparseness degree of the data points in each dimension. The sparseness degree denotes the density of data points in a particular region. The major advantage of using the sparseness degree is that it provides a relative measure by which dense regions are easily distinguished from sparse regions.

In this module, the fuzzifier value, the number of clusters, and a threshold ε are given as input to the system. The splitting into clusters is based on the fuzzy k-means algorithm. First, the cluster centers vi0 (i = 1, 2, ..., c) are initialized. For each point, a relative distance measure is computed from the distance between that point and each cluster center. If this relative measure exceeds the threshold, the partition is considered valid and the process terminates; otherwise the cluster centers are updated and the process is repeated. The output of the system is the optimal count of clusters, the cluster centers, and the membership matrix.
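As an illustration of this update loop, the sketch below shows the standard fuzzy c-means membership and center updates for a given fuzzifier value m; it is a minimal, self-contained example with placeholder names rather than the project's actual code, and it assumes Euclidean distance.

```java
/** Minimal fuzzy k-means (fuzzy c-means) sketch: membership-matrix and center updates. */
public class FuzzyKMeansSketch {

    /** u[i][j] = membership of point j in cluster i, for fuzzifier m > 1. */
    static double[][] updateMembership(double[][] points, double[][] centers, double m) {
        int c = centers.length, n = points.length;
        double[][] u = new double[c][n];
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < c; i++) {
                double dij = Math.max(distance(points[j], centers[i]), 1e-12);
                double sum = 0.0;
                for (int k = 0; k < c; k++) {
                    double dkj = Math.max(distance(points[j], centers[k]), 1e-12);
                    sum += Math.pow(dij / dkj, 2.0 / (m - 1.0));
                }
                u[i][j] = 1.0 / sum;
            }
        }
        return u;
    }

    /** New center i = weighted mean of all points, with weights u[i][j]^m. */
    static double[][] updateCenters(double[][] points, double[][] u, double m) {
        int c = u.length, d = points[0].length;
        double[][] centers = new double[c][d];
        for (int i = 0; i < c; i++) {
            double denom = 0.0;
            for (int j = 0; j < points.length; j++) {
                double w = Math.pow(u[i][j], m);
                denom += w;
                for (int k = 0; k < d; k++) centers[i][k] += w * points[j][k];
            }
            for (int k = 0; k < d; k++) centers[i][k] /= denom;
        }
        return centers;
    }

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }
}
```

In this sketch, the two updates are alternated until the change in the membership matrix (or in the centers) falls below the threshold ε, which corresponds to the convergence test described above.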

In this module, a probability density function is estimated to identify the characteristics of each dimension. The output values of the previous module (the optimal count of clusters, the cluster centers, and the membership matrix) are given as input to the system. Here, the scale parameter, the shape parameter, and the Bayesian information criterion are computed to estimate the density based on the gamma distribution function.
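For reference, a minimal sketch of the quantities involved is given below, using the standard form of the gamma density with shape parameter β and scale parameter α and the usual definition of the Bayesian information criterion; the exact expressions used in the project may differ in detail.

```latex
% Gamma density with shape \beta and scale \alpha
f(y \mid \alpha, \beta) = \frac{y^{\beta - 1} e^{-y/\alpha}}{\alpha^{\beta}\, \Gamma(\beta)}, \qquad y > 0

% Bayesian information criterion for a model with maximized likelihood L_m,
% p_m free parameters, and n data points; the smallest BIC picks the model
\mathrm{BIC}(m) = -2 \ln L_m + p_m \ln n
```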


1.3 PROJECT PLAN:

The project Clustering High Dimensional Data is created using Java as the front end, with the Eclipse Java IDE as the development tool. The backend intermediate files are stored as .xls spreadsheet files. Windows 7 was the operating system used during development of the project.

The project takes as input an Excel spreadsheet containing different kinds of data values about people. The output is the optimal number of clusters. Intermediate results contain the sparseness degree values, the clustering produced by the k-means algorithm, and the Bayesian information criterion values used to fix the optimal number of clusters.

The rest of the thesis is organized as follows.

In Chapter 2, a summary of existing approaches to clustering high-dimensional data is given.

Chapter 3 focuses on the system architecture and describes the various modules involved in the architecture diagram.

In Chapter 4, the modules to be implemented are identified, and the dependencies that exist between them are discussed. The steps to implement each module, along with its inputs and expected outputs, are also reported.

In Chapter 5, the implementation details are discussed with clear steps. Results are shown in screenshots, and sample test inputs and outputs are presented.

In Chapter 6, concluding remarks about the project are given, along with a look at future extensions of the current work.

References for this project are listed at the end.

In Appendix A, a work sheet is prepared for each module.

In Appendix B, the various features to be supported by the system are listed. It gives a detailed requirement specification of the system, conveying information about the application requirements, both functional and non-functional.


2. Literature Summary


Rong et al. proposed the use of Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a clustering algorithm based on density, in 2004. It clusters by growing high-density areas and can find clusters of any shape. DBSCAN requires two parameters: epsilon (eps) and the minimum number of points (minPts). It starts with an arbitrary point that has not been visited and finds all the neighbor points within distance eps of that starting point. If the number of neighbors is greater than or equal to minPts, a cluster is formed: the starting point and its neighbors are added to this cluster, and the starting point is marked as visited. The algorithm then repeats the evaluation process for all of the neighbors recursively. If the number of neighbors is less than minPts, the point is marked as noise. Once a cluster is fully expanded (all points within reach have been visited), the algorithm proceeds to iterate through the remaining unvisited points in the dataset.

Unlike k-means, DBSCAN does not require the number of clusters in the data to be known a priori. It can find arbitrarily shaped clusters, including clusters completely surrounded by (but not connected to) a different cluster. Due to the minPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced. The algorithm has a notion of noise and is mostly insensitive to the ordering of the points in the database. As for disadvantages, DBSCAN does not respond well to datasets with varying densities (so-called hierarchical datasets).
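A compact sketch of the procedure just described is given below, assuming Euclidean distance over double[] points; the class and method names here are illustrative, not taken from a particular library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Minimal DBSCAN sketch: returns labels[i] = cluster id (1..k), with 0 meaning noise/unassigned. */
public class DbscanSketch {

    static int[] dbscan(double[][] pts, double eps, int minPts) {
        int n = pts.length;
        int[] labels = new int[n];              // 0 = noise or not yet assigned
        boolean[] visited = new boolean[n];
        int cluster = 0;
        for (int p = 0; p < n; p++) {
            if (visited[p]) continue;
            visited[p] = true;
            List<Integer> neighbors = regionQuery(pts, p, eps);
            if (neighbors.size() < minPts) continue;        // too few neighbors: leave as noise
            cluster++;                                       // start a new cluster
            labels[p] = cluster;
            Deque<Integer> seeds = new ArrayDeque<>(neighbors);
            while (!seeds.isEmpty()) {                       // expand the cluster recursively
                int q = seeds.poll();
                if (!visited[q]) {
                    visited[q] = true;
                    List<Integer> qNeighbors = regionQuery(pts, q, eps);
                    if (qNeighbors.size() >= minPts) seeds.addAll(qNeighbors);
                }
                if (labels[q] == 0) labels[q] = cluster;     // claim unassigned points for this cluster
            }
        }
        return labels;
    }

    /** Indices of all points within distance eps of point p (plain Euclidean distance). */
    static List<Integer> regionQuery(double[][] pts, int p, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < pts.length; i++) {
            double s = 0;
            for (int d = 0; d < pts[p].length; d++) {
                double diff = pts[p][d] - pts[i][d];
                s += diff * diff;
            }
            if (Math.sqrt(s) <= eps) out.add(i);
        }
        return out;
    }
}
```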

The self-organizing feature map (SOFM) is a neural network approach that uses competitive unsupervised learning. Learning is based on the concept that the behavior of a node should impact only those nodes and arcs near it. Weights are initially assigned randomly and adjusted during the learning process to produce better results; during this learning process, hidden features or patterns in the data are uncovered and the weights are adjusted accordingly. The self-organizing map is a single-layer feed-forward network in which the output units are arranged in a low-dimensional (usually 2D or 3D) grid. Each input is connected to all output neurons, and there is a weight vector attached to every neuron with the same dimensionality as the input vectors. The goal of learning in the self-organizing map is to make different parts of the SOM lattice respond similarly to certain input patterns. Initially, the weights and learning rate are set, and the input vectors to be clustered are presented to the network. For each input vector, the winner unit is calculated, either by the Euclidean distance method or by the sum-of-products method, and the weights are updated for that particular winner unit. An epoch is said to be completed once all the input vectors have been presented to the network. By updating the learning rate, several epochs of training may be performed.
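The winner-selection and weight-update step described above can be sketched as follows; this minimal example assumes Euclidean winner selection and updates only the winning unit (a full SOM would also update the winner's grid neighborhood), and the names are illustrative.

```java
/** Minimal SOM training-step sketch: find the winning output unit and move its weights toward the input. */
public class SomStepSketch {

    /** weights[u] = weight vector of output unit u; returns the index of the winning unit. */
    static int trainStep(double[][] weights, double[] input, double learningRate) {
        int winner = 0;
        double best = Double.MAX_VALUE;
        for (int u = 0; u < weights.length; u++) {          // winner selection by Euclidean distance
            double s = 0;
            for (int d = 0; d < input.length; d++) {
                double diff = input[d] - weights[u][d];
                s += diff * diff;
            }
            if (s < best) { best = s; winner = u; }
        }
        for (int d = 0; d < input.length; d++) {            // move the winner's weights toward the input
            weights[winner][d] += learningRate * (input[d] - weights[winner][d]);
        }
        return winner;
    }
}
```

One epoch corresponds to calling this step once for every input vector; the learning rate is then decreased and further epochs may be run.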

Yip et al. presented HARP, a hierarchical subspace clustering approach with automatic relevant dimension selection. HARP is based on the assumption that two objects are likely to belong to the same cluster if they are very similar to each other along many dimensions. Clusters are allowed to merge only if they are similar enough in a number of dimensions, where the minimum similarity and the minimum number of similar dimensions are controlled by two internal threshold parameters. Due to its hierarchical nature, the algorithm is intrinsically slow. Also, if the number of relevant dimensions per cluster is extremely low, the accuracy of HARP may drop, as the basic assumption becomes less valid in the presence of a large amount of noise values in the data set. A dimension receives an index value close to the maximum value (one) if its local variance is extremely small, which means the projections form an excellent signature for identifying the cluster members. Alternatively, if the local variance is as large as the global variance, the dimension receives an index value of zero.

Ng et al. proposed an efficient projective clustering technique based on histogram construction (EPCH) in 2005. The histograms help to generate signatures, where a signature corresponds to some region in some subspace, and signatures with a large number of data objects are identified as the regions of subspace clusters. Hence, projected clusters and their corresponding subspaces can be uncovered. EPCH (Efficient Projective Clustering by Histograms) focuses on uncovering projected clusters with varying dimensionality, without requiring the users to input the average dimensionality of the associated subspaces or the number of clusters that naturally exist in the data set. EPCH requires very little prior knowledge about the data: a general user needs to provide only one input, the maximum number of clusters the user is interested in uncovering. If the number of natural clusters is smaller than this maximum, all of the discovered clusters are returned (there may be fewer of them than the maximum); otherwise, the top-ranked clusters, up to the maximum, are returned. Therefore, an inaccurate estimate of this parameter does not affect the accuracy of the clustering output.

CLIQUE is a clustering algorithm that satisfies several desirable requirements: the ability to find clusters embedded in subspaces of high-dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. It identifies dense clusters in subspaces of maximum dimensionality and generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. CLIQUE automatically finds subspaces with high-density clusters.

It produces identical results irrespective of the order in which the input records are presented, and it does not presume any canonical distribution for the input data. Empirical evaluation shows that CLIQUE scales linearly with the number of input records and has good scalability as the number of dimensions (attributes) in the data, or the highest dimension in which clusters are embedded, is increased.

C.M. Procopiuc et al. proposed an algorithm for fast projective clustering, "Monte Carlo Algorithm for Fast Projective Clustering," in 2002. This Monte Carlo algorithm computes projective clusters iteratively: during each iteration, an approximation of an optimal cluster over the current set of points is computed. The termination criterion can be defined in more than one way, e.g., a certain percentage of the points have been clustered, or a user-specified number of clusters have been computed. In contrast to partitioning methods, the user need not specify the number of clusters k unless desired, which allows more flexibility in tuning the algorithm to the particular application that uses it. One particularly desirable property of this method is that it remains accurate even when the cluster sizes vary significantly (in terms of the number of points). Many partitioning methods rely on random sampling for computing an initial partition; as a result, points from clusters that are missed by the sample are either assigned to other clusters or declared outliers. A greedy method is employed in this algorithm, which computes each cluster in turn. Its accuracy depends on finding a good definition of an optimal projective cluster, and it proves highly accurate and stable for various types of data on which partitioning algorithms are not always successful.

The naive k-means algorithm partitions the dataset into k subsets such that each subset contains a center and the points in a given subset are closer to that center than to any other center. The algorithm keeps track of the centroids of the subsets and proceeds in simple iterations. The initial partitioning is randomly generated, that is, the centroids are initialized to random points in the region of the space. In each iteration step, a new set of centroids is generated from the existing set of centroids using two very simple steps:

(i) Partition the points based on the centroids C(i), that is, find the centroid to which each point in the dataset belongs. The points are partitioned based on the Euclidean distance from the centroids.

(ii) Set each new centroid to be the mean of all the points assigned to that subset.

The algorithm is said to have converged when recomputing the partitions does not result in a change in the partitioning. For configurations where no point is equidistant to more than one center, this convergence condition can always be reached. This convergence property, along with its simplicity, adds to the attractiveness of the k-means algorithm.
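The two-step iteration can be sketched as follows. This is a minimal, self-contained illustration with random initial centroids and Euclidean assignment; the names are placeholders, not the project's actual code.

```java
import java.util.Random;

/** Minimal naive k-means sketch: assign points to the nearest centroid, then recompute centroids. */
public class KMeansSketch {

    static double[][] kmeans(double[][] pts, int k, int maxIter) {
        Random rnd = new Random();
        int d = pts[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = pts[rnd.nextInt(pts.length)].clone(); // random init
        int[] assign = new int[pts.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int p = 0; p < pts.length; p++) {           // step (i): nearest centroid by Euclidean distance
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double s = 0;
                    for (int j = 0; j < d; j++) {
                        double diff = pts[p][j] - centroids[c][j];
                        s += diff * diff;
                    }
                    if (s < bestDist) { bestDist = s; best = c; }
                }
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            if (!changed && iter > 0) break;                 // converged: the partitioning did not change
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int p = 0; p < pts.length; p++) {           // step (ii): mean of each subset
                counts[assign[p]]++;
                for (int j = 0; j < d; j++) sums[assign[p]][j] += pts[p][j];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
        }
        return centroids;
    }
}
```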

The k-means algorithm needs to perform a large number of nearest-neighbor queries for the points in the dataset. If the data is d-dimensional and there are N points in the dataset, the cost of a single iteration is O(kdN). Sometimes the convergence of the centroids (i.e., C(i) and C(i+1) being identical) takes several iterations, and in the last several iterations the centroids move very little. As running the expensive iterations so many more times might not be efficient, a measure of convergence of the centroids is needed so that the iterations can be stopped when the convergence criteria are met; distortion is the most widely accepted such measure.


CHAPTER 3

SYSTEM ARCHITECTURE

(Figure: system architecture, including a stage for applying the k-means algorithm.)


3.1 Finding nearest neighbours

The program gets the Excel spreadsheet values as input. The following work is carried out in finding nearest neighbors: comparisons are performed to find the required number of nearest values.

The main processing unit of the system is the estimation of the sparseness degree, used to characterize the data points graphically. Specifically, estimating the sparseness degree includes initializing centers for the sets of nearest neighbors and measuring the distance from each center to the remaining data point attributes.

Using the k-means algorithm, the data point values are clustered into up to the maximum number of clusters to be considered. Centers for this process are initialized from the data points, and the process is repeated to find optimum centers; a distance function is used for this purpose.

For each set of clusters, the Bayesian information criterion values are found. The clustering with the smallest Bayesian value is taken as the optimum count of clusters. The Bayesian information value is computed using the gamma function and logarithmic values.


3.5 Data flow diagrams

(Data flow diagrams. Context level: the high-dimensional data space is input to process 1, PDF estimation for all data points, which outputs the optimum number of clusters. Level 1 (Figure 3.3): 1.1 compute and normalize the sparseness degree, 1.2 split the data into clustered sets with centers, then calculate the BIC values and find the minimum to fix the number of clusters. Level 2: 1.1.1 find nearest neighbors, 1.1.2 find centers for the data sets, 1.1.3 find the sparseness degree and store the centers in the database; 1.2.1 identify centers for the dataset, 1.2.2 split the data into clusters, 1.2.3 generate clusters; 1.3.1 apply the maximum likelihood function, 1.3.2 compute the BIC values, yielding the optimal number of clusters.)


CHAPTER 4

4.1 INTRODUCTION

The system 'Clustering High Dimensional Data' has three modules in this phase out of six modules. They are estimating the sparseness degree, the clustering process, and the Bayesian information criterion to find the optimal number of clusters.

The main aim of the Java code for estimating the sparseness degree is to find the degree of sparseness of the data attributes of the data points in each dimension. The features of the Java code are as follows:

Perform comparisons to find the required number of nearest values.

Initialize centers for the sets of nearest neighbors.

Measure the distance from each center to all the other points in its nearest-neighbor set.

The main processing unit of the system is the estimation of the sparseness degree, used to characterize the data points graphically. Specifically, estimating the sparseness degree includes initializing centers for the sets of nearest neighbors and measuring the distance from each center to the remaining data point attributes.

(Module 1 diagram)


4.3 Clustering Process

The main aim of the clustering process is to divide the given data points into the required number of smaller datasets.

Find the distance from each data point to each center, and assign the data point to the corresponding dataset.

Find new centers for the datasets by taking the average value in each dimension.

Repeat the process until the present dataset centers are equal to the previous centers.

Repeat the process for the number of clusters from 1 to the maximum.

Using the k-means algorithm, the data point values are clustered into up to the maximum number of clusters to be considered. Centers for this process are initialized from the data points, and the process is repeated to find optimum centers; a distance function is used for this purpose.

(Module 2 diagram: apply the k-means algorithm)


4.4 Bayesian Information Criterion

The main aim of the Bayesian information criterion module is to compute the optimal number of clusters.

Find the shape parameter and scale parameter.

Estimate the gamma function and the likelihood function.

Find the Bayesian information criterion to fix the optimal count of clusters.

Repeat this process for all numbers of clusters from 1 to the maximum.

The number of clusters for which the BIC value is smallest is taken as the optimal number of clusters.

For each set of clusters, the Bayesian information criterion values are found; the clustering with the smallest Bayesian value is made the optimum count of clusters. The Bayesian information value is computed using the gamma function and logarithmic values.
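A sketch of the selection loop is shown below. It assumes a helper that returns the log-likelihood and parameter count of the clustering with m clusters (the interface here is hypothetical) and uses the usual BIC form -2 ln L + p ln n, which may differ in detail from the project's exact expression.

```java
/** Minimal model-selection sketch: pick the number of clusters with the smallest BIC. */
public class BicSelectionSketch {

    /** Hypothetical per-m fit: log-likelihood under the gamma model and number of free parameters. */
    interface ClusterFit {
        double logLikelihood(int m);   // ln L_m for the clustering with m clusters
        int parameterCount(int m);     // free parameters (e.g., shape and scale per cluster)
    }

    static int optimalClusters(ClusterFit fit, int maxClusters, int nPoints) {
        int best = 1;
        double bestBic = Double.MAX_VALUE;
        for (int m = 1; m <= maxClusters; m++) {
            double bic = -2.0 * fit.logLikelihood(m) + fit.parameterCount(m) * Math.log(nPoints);
            if (bic < bestBic) { bestBic = bic; best = m; }   // the smallest BIC wins
        }
        return best;
    }
}
```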

(Module 3 diagram)


CHAPTER 5

5.1.1 Estimating Sparseness Degree

The system gets the input dataset values from the Excel spreadsheet. The following work is carried out in finding nearest neighbors. The main processing unit of the system is the estimation of the sparseness degree, used to characterize the data points graphically. Specifically, estimating the sparseness degree includes initializing centers for the sets of nearest neighbors and measuring the distance from each center to the remaining data point attributes, along with sorting the values in each dimension and performing comparisons to find the required number of nearest values.

Key Steps

Step 3: Find the required number of nearest neighbors for each magnitude value in each dimension.

Fig 5.1: Procedure for finding the sparseness degree for data points in each dimension


5.1.2 Clustering

In this approach, the dataset points are clustered into up to the maximum number of clusters to be considered, in order to identify the optimal number of clusters. Centers are initialized from the data points, as in the k-means algorithm, and the process is repeated for every number of clusters. Each set of clusters is formed by the k-means algorithm by evaluating distance measures. Finally, the clustered data sets are the result of the system.

Clustering process

Input: maximum number of clusters, dataset

Key Steps

Step 2: Repeat the following process for m = 1 to the maximum value given as input.

Step 4: Initialize the clusters from random positions generated by a random function.

Step 5: Find the Euclidean distance from each other point to each cluster center.

Step 6: Allocate each data point to the group whose center is at minimum distance from it.

Step 8: Repeat the process until the present cluster centers are equal to the previous set of centers.


5.1.3 Finding the Optimal Number of Clusters

The clustered data sets are given as input to this module. The clustered data sets are subjected to the Expectation-Maximization (EM) algorithm to find parameters such as the scale factor (α) and the shape factor (β). Based on these parameters, the Bayesian information criterion is evaluated along with the gamma function (Γ) and the likelihood function. The number of clusters for which the Bayesian information criterion value is smallest is considered optimal.

Input: clustered data sets

Key Steps

Step 1: Estimate the scale factor and shape factor for each group of data points.

Step 4: Estimate the maximum likelihood function for all clusterings m = 1 to the maximum number of clusters.

Step 5: Estimate the Bayesian information criterion for all values m = 1 to the maximum number of clusters.

Step 6: Take as optimal the number of clusters for which the BIC value is minimum among all.

Output: scale factor, shape factor, BIC, and optimal count of clusters
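As an illustration of the parameter-estimation step, the sketch below uses simple method-of-moments estimates for the gamma shape and scale (shape = mean²/variance, scale = variance/mean) as a stand-in for the EM-based maximum-likelihood fit described above; it is not the project's actual estimator, and it assumes a non-degenerate sample of positive values.

```java
/** Method-of-moments sketch for the gamma parameters of one cluster's values. */
public class GammaMomentsSketch {

    /** Returns {shape (beta), scale (alpha)} estimated from positive sample values. */
    static double[] estimate(double[] values) {
        double mean = 0;
        for (double v : values) mean += v;
        mean /= values.length;

        double var = 0;
        for (double v : values) var += (v - mean) * (v - mean);
        var /= values.length;                 // assumes var > 0 (non-degenerate sample)

        double shape = mean * mean / var;     // beta, by matching the first two moments
        double scale = var / mean;            // alpha
        return new double[] { shape, scale };
    }
}
```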


5.2 Results

5.2.1 Module 1 - Estimating Sparseness Degree

The main processing unit of the system is the estimation of the sparseness degree, used to characterize the data points graphically. Specifically, estimating the sparseness degree includes initializing centers for the sets of nearest neighbors and measuring the distance from each center to the remaining data point attributes.


Fig 5.5: Shows the sparseness degree values in each dimension


Fig 5.6: Shows the sparseness degree values in each dimension


5.2.2 Module 2 - Clustering into Datasets

Using the k-means algorithm, the data point values are clustered into up to the maximum number of clusters to be considered. Centers for this process are initialized from the data points, and the process is repeated to find optimum centers; a distance function is used for this purpose.


Fig 5.8: clustered data sets


5.2.3 Module 3 - Bayesian Information Criterion

For each set of clusters, the Bayesian information criterion values are found; the clustering with the smallest Bayesian value is made the optimum count of clusters. The Bayesian information value is computed using the gamma function and logarithmic values.

Output: scale factor, shape factor, BIC, and optimal count of clusters

Fig 5.10: scale factor, shape factor & BIC values for clustered set m=1


Fig 5.11: scale factor, shape factor & BIC values for clustered set m=2

Fig 5.12: scale factor, shape factor & BIC values for clustered set m=3


Fig 5.13: scale factor, shape factor & BIC values for clustered set m=4


5.3 TEST PLAN

5.3.1 Test case description

The Test Plan is derived from the Functional Specifications and detailed Design Specifications. The Test Plan identifies the details of the test approach, identifying the associated test case areas within the specific product for this release cycle.

The product is broken down into distinct parts, and the features of the product that are to be tested are identified, along with the expected output of each module: finding the nearest neighbors, computing the sparseness degree for all data points in each dimension, normalizing the sparseness degree, splitting the data set into clusters, and computing the BIC values for each number of clusters m = 1 to the maximum.

Usecase 1: Find nearest neighbors - Testcase-1

Usecase 2: Compute sparseness degree for all datapoints in each dimension - Testcase-2

Usecase 3: Normalize sparseness degree - Testcase-3

Usecase 4: Split the data set into clusters - Testcase-4

Usecase 5: Compute BIC values for each no.of clusters m=1 to max no.of clusters - Testcase-5

TABLE 5.1: USE CASE AND TEST CASE


DESCRIPTION AND THE EXPECTED RESULTS OF EACH TEST CASE

TEST CASE ID #1

TEST CASE FIELDS: DETAILS

ACTUAL RESULT: The nearest values for the particular data point attribute were obtained

EXPECTED RESULT: The nearest values for the particular data point attribute are obtained

INFERENCE: VALID

TABLE 5.2: TEST CASE 1

TEST CASE ID #2

TEST CASE FIELDS: DETAILS

TEST CASE ID: TEST CASE NAME: 2 - Compute sparseness degree for all datapoints in each dimension

ACTUAL RESULT: Sparseness degree values for each datapoint in each dimension

EXPECTED RESULT: Sparseness degree values for each datapoint in each dimension

INFERENCE: VALID

TABLE 5.3: TEST CASE 2

TEST CASE ID #3

TEST CASE FIELDS: DETAILS

ACTUAL RESULT: Normalized values of sparseness degree in the range 0 to 1

EXPECTED RESULT: Normalized values of sparseness degree in the range 0 to 1

INFERENCE: VALID

TABLE 5.4: TEST CASE 3


TEST CASE ID #4

TEST CASE FIELDS: DETAILS

TEST CASE ID: TEST CASE NAME: 4 - Split the data set into clusters

ACTUAL RESULT: Data is split into the corresponding clustered data sets

EXPECTED RESULT: Data is split into the corresponding clustered data sets

INFERENCE: VALID

TABLE 5.5: TEST CASE 4

TEST CASE ID #5

TEST CASE FIELDS: DETAILS

TEST CASE ID: TEST CASE NAME: 5 - Compute BIC values for each no.of clusters m=1 to max no.of clusters

ACTUAL RESULT: Obtained BIC values for each no.of clusters

EXPECTED RESULT: Obtained BIC values for each no.of clusters

INFERENCE: VALID

TABLE 5.6: TEST CASE 5


5.4 Performance analysis

5.4.1 Estimation of Sparseness degree

Graphical representation of sparseness degree of data points in each dimension


Fig 5.18: Sparseness degree of data points in dimension4


5.4.2 Clustering Process

Performance evaluation of the clustering process is given below.

Clustering accuracy = (number of clusters obtained) / (total count of clusters given)

Figure 5.21 plots the clustering accuracy against the test case number. For all types of positive numerical data sets, the result is 100% accurate.


5.4.3 Evaluating the Optimal Number of Clusters

The optimal count of clusters is evaluated from the BIC values for each number of clusters m = 1 to the maximum: the clustering with the smallest BIC value is chosen.

In the diagram, the optimal number of clusters is 3, which has the lowest BIC value among the 5 clustered datasets.

The BIC value for each clustered set is evaluated from the scale factor and shape factor parameters.


CHAPTER 6

6.1 Conclusion

The modules for estimating the sparseness degree, the clustering process, and the Bayesian information criterion have been coded successfully:

• Sparseness degree in each dimension.

• Analysis of data points in each dimension.

• Finding the optimal number of clusters.

• Shape factor, scale factor, and BIC value using the gamma function.

In the next phase, the work to be carried out is finding outliers (the data points which are not relevant to the existing data points in a data cluster) and removing them from the dataset by measuring the Jaccard distance among the data sets, i.e., minimizing inter-component sparseness and maximizing intra-component sparseness to get effective projected clustering results. Finally, relevant clustered data sets with a high sparseness degree (maximizing the intra-cluster sparseness degree) will be produced.


REFERENCES

1. Mohamed Bouguessa and Shengrui Wang, "Mining Projected Clusters in High-Dimensional Spaces," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 4, pp. 507-522, April 2009.

2. Haojun Sun, Shengrui Wang, and Qingshan Jiang, "FCM-Based Model Selection Algorithms for Determining the Number of Clusters," Pattern Recognition, vol. 37, pp. 2027-2037, 2004.

3. E.K.K. Ng, A.W. Fu, and R.C. Wong, "Projective Clustering by Histograms," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 3, pp. 369-383, Mar. 2005.

4. Anne Patrikainen and Marina Meila, "Comparing Subspace Clusterings," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 7, pp. 902-916, July 2006.

5. F. Angiulli and C. Pizzuti, "Outlier Mining in Large High-Dimensional Data Sets," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 369-383, Feb. 2005.

6. K.Y.L. Yip, D.W. Cheung, and M.K. Ng, "HARP: A Practical Projected Clustering Algorithm," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1387-1397, Nov. 2004.

7. C.C. Aggarwal and P.S. Yu, "Redefining Clustering for High-Dimensional Applications," IEEE Trans. Knowledge and Data Eng., vol. 14, no. 2, pp. 210-225, Mar./Apr. 2002.

8. C.M. Procopiuc, M. Jones, P.K. Agarwal, and T.M. Murali, "A Monte Carlo Algorithm for Fast Projective Clustering," Proc. ACM SIGMOD '02, pp. 418-427, 2002.

9. C.C. Aggarwal, C. Procopiuc, J.L. Wolf, P.S. Yu, and J.S. Park, "Fast Algorithms for Projected Clustering," Proc. ACM SIGMOD '99, pp. 61-72, 1999.


APPENDIX A


Implementation Work Sheet

Project Name : Clustering High Dimensional data

Module Number :1

Module Name : Finding Sparseness degree of all data points in each dimension

Status : Finished

Aim:

Sparseness degree:

The sparseness degree of the data points denotes their degree of closeness. If the sparseness degree value is high, the data points are assumed to be dispersed over a wide region (low density). If the sparseness degree value is low, the data points are said to be closer together (high density).

Ideas to implement:

1. The number of nearest neighbors is given as input.

2. Sort all magnitude values of all data points in each dimension.

3. Use a simple sorting algorithm to sort the individual columns.

4. Find the required number of nearest neighbors for each data point. For example, if the number of nearest neighbors is 3, find the three values that are closest to a particular data point.

5. Find the centers of the nearest-neighbor data sets.

6. Estimate the sparseness degree with respect to the centers.

7. Normalize the sparseness degree to the range [0, 1].
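A rough sketch of these steps for a single dimension is given below. It assumes, for illustration, that the sparseness degree of a data point is the average distance from the center of its k nearest values to those values, normalized to [0, 1] at the end; the exact formula used by the project may differ, and the names are placeholders.

```java
import java.util.Arrays;

/** Sketch of a per-dimension sparseness degree: k nearest values, their center, and the spread around it. */
public class SparsenessSketch {

    /** Returns a normalized sparseness degree in [0, 1] for each value of one dimension (column). */
    static double[] sparsenessDegree(double[] column, int k) {
        double[] sorted = column.clone();
        Arrays.sort(sorted);                                      // steps 2-3: sort the magnitude values
        double[] degree = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            double[] nearest = kNearest(sorted, column[i], k);    // step 4: k closest values
            double center = 0;                                    // step 5: center of the nearest set
            for (double v : nearest) center += v;
            center /= nearest.length;
            double spread = 0;                                    // step 6: distance from the center
            for (double v : nearest) spread += Math.abs(v - center);
            degree[i] = spread / nearest.length;
        }
        double max = 0;                                           // step 7: normalize to [0, 1]
        for (double d : degree) max = Math.max(max, d);
        if (max > 0) for (int i = 0; i < degree.length; i++) degree[i] /= max;
        return degree;
    }

    /** The k values of the (sorted) column closest to x, found by sorting a copy by distance to x. */
    static double[] kNearest(double[] sorted, double x, int k) {
        Double[] boxed = Arrays.stream(sorted).boxed().toArray(Double[]::new);
        Arrays.sort(boxed, (a, b) -> Double.compare(Math.abs(a - x), Math.abs(b - x)));
        double[] out = new double[Math.min(k, boxed.length)];
        for (int i = 0; i < out.length; i++) out[i] = boxed[i];
        return out;
    }
}
```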


Key Steps to be followed:

Step 3: Find the required number of nearest neighbors for each magnitude value in each dimension.

Work log

02.08.2010 Gathering Input data

04.08.2010 Finding the properties of multi dimensional space

06.08.2010 Refreshing Basic concepts of java

10.08.2010 Gathering tools required, Eclipse, java1.6 etc.,

12.08.2010 Preparing Detailed design of the system

13.08.2010 Changing modifications to detailed design

16.08.2010 Preparing Report for 1st review

18.08.2010 Write code for sorting two dimensional array using bubble sort

20.08.2010 Reading nearest neighbors concepts

23.08.2010 Tracing out the implementation of sparseness degree through nearest neighbors

24.08.2010 Review 1

26.08.2010 Writing code for finding nearest neighbors

30.08.2010 Changing modifications to code

31.08.2010 Writing code for calculating sparseness degree

02.09.2010 Normalizing sparseness degree values

Table A.1: workdone for review 1

Reference


Implementation Work Sheet

Project Name : Clustering High Dimensional Data

Module Number : 2

Module Name : Clustering process

Status : Finished

Aim:

To cluster the given data points using the k-means algorithm, with the initial centers of the algorithm taken randomly. The clustering process is continued until the present cluster centers are equal to the previous cluster centers.

Implementation Details:

2. Using a random function, generate random positions of data points to be considered as centers.

3. Load the data point values at those random positions into a two-dimensional array.

4. Find the distance from all other data points to each center, and assign each data point to the group whose center is at minimum distance from it.

5. Find the new cluster centers by taking the average of each clustered group.

6. Repeat the process until the present cluster centers are equal to the previous cluster centers.

7. Repeat all the above steps for every value of the number of clusters.


Key Steps to follow

Step 2: Repeat the following process for m = 1 to the maximum value given as input.

Step 4: Initialize the clusters from random positions generated by a random function.

Step 5: Find the Euclidean distance from each other point to each cluster center.

Step 6: Allocate each data point to the group whose center is at minimum distance from it.

Step 8: Repeat the process until the present cluster centers are equal to the previous set of centers.

Work done:

DATE Work done

06.09.2010 Reading description about module 2

07.09.2010 Writing code for k-means algorithm for very less values

09.09.2010 Generalizing algorithm for large set of values using constraints

13.09.2010 Implementing Random function for initialization of clusters

15.09.2010 Changing conditions for repeating loop in algorithm

17.09.2010 Got proper results for k-means algorithm

Table A.2: workdone for review 2

Reference


Implementation Work Sheet

Project Name : Clustering High dimensional data

Module Number : 3

Module Name : Finding the scale factor, shape factor, and BIC value to find the optimal number of clusters

Status : Finished

Aim

To estimate the Bayesian information criterion (BIC) value for each value of m from 1 to the maximum number of clusters.

Process:

Based on the BIC values, the optimal number of clusters can be found: it is the value for which the BIC is smallest. To calculate the BIC value, the parameters of each group of values must first be calculated, namely the scale factor (α) and the shape factor (β).

Through the parameters α and β of each group, the maximum likelihood function (Lm) is calculated for m = 1 to the maximum number of clusters.

Ideas to implement:

1. Estimate the parameters for each group.

2. The parameters are the scale factor (α) and the shape factor (β).

3. Estimate the maximum likelihood function.

4. Calculate the Bayesian information criterion.

5. The optimum number of clusters, for which the BIC value is smallest, is produced as output.


Key Steps to follow

Step 1: Estimate the scale factor and shape factor for each group of data points.

Step 4: Estimate the maximum likelihood function for all clusterings m = 1 to the maximum number of clusters.

Step 5: Estimate the Bayesian information criterion for all values m = 1 to the maximum number of clusters.

Step 6: Take as optimal the number of clusters for which the BIC value is minimum among all.

Work done

20.09.2010 Reading detailed description about module 3

22.09.2010 Trace out the results for finding scale factor and shape factor

24.09.2010 Writing code for shape factor and scale factor

27.09.2010 Analyzing Gamma function

29.09.2010 Writing code for Gamma function

30.10.2010 Testing results of Gamma function

01.10.2010 Writing code for finding Maximum likely hood function

04.10.2010 Writing code to estimate Bayesian Information Criterion(BIC) value

05.10.2010 Integrating all these coding parts and testing results

06.10.2010 Documentation for second Review

08.10.2010 Review 2

Table A.3: workdone for review 2

Reference


Total Work done

DATE Work Done

02.08.2010 Gathering Input data

04.08.2010 Finding the properties of multi dimensional space

06.08.2010 Refreshing Basic concepts of java

10.08.2010 Gathering tools required, Eclipse, java1.6 etc.,

12.08.2010 Preparing Detailed design of the system

13.08.2010 Changing modifications to detailed design

16.08.2010 Preparing Report for 1st review

18.08.2010 Write code for sorting two dimensional array using bubble sort

20.08.2010 Reading nearest neighbors concepts

23.08.2010 Tracing out the implementation of sparseness degree through nearest neighbors

24.08.2010 Review 1

26.08.2010 Writing code for finding nearest neighbors

30.08.2010 Changing modifications to code

31.08.2010 Writing code for calculating sparseness degree

02.09.2010 Normalizing sparseness degree values

03.09.2010 Generating graphs for sparseness degree values for each dimension

06.09.2010 Reading description about module 2

07.09.2010 Writing code for k-means algorithm for very less values

09.09.2010 Generalizing algorithm for large set of values using constraints

13.09.2010 Implementing Random function for initialization of clusters

15.09.2010 Changing conditions for repeating loop in algorithm

17.09.2010 Got proper results for k-means algorithm

20.09.2010 Reading detailed description about module 3

22.09.2010 Trace out the results for finding scale factor and shape factor

24.09.2010 Writing code for shape factor and scale factor

27.09.2010 Analyzing Gamma function

29.09.2010 Writing code for Gamma function

30.10.2010 Testing results of Gamma function

01.10.2010 Writing code for finding Maximum likely hood function

04.10.2010 Writing code to estimate Bayesian Information Criterion(BIC) value

05.10.2010 Integrating all these coding parts and testing results

06.10.2010 Documentation for second Review

08.10.2010 Review 2

14.10.2010 Searching for real time data of an hospital

19.10.2010 Mapping out data to existing code

25.10.2010 Documentation

01.11.2010 Rectifying errors

04.11.2010 Doing test cases

07.11.2010 Final report for phase -1

22.11.2010 Modifications to the document

24.11.2010 Final rough draft

Table A.4: total workdone for phase1


APPENDIX - B

SOFTWARE REQUIREMENT SPECIFICATION

1 INTRODUCTION

This is the Software Requirements Specification (SRS) for "Clustering high dimensional data space". The purpose of this document is to present a detailed description of the system: it explains the purpose and features of the system, the interfaces of the system, what the system will do, the constraints under which it must operate, and how the system will react to external stimuli.

1.1 Purpose

The purpose of clustering is to support pattern recognition, trend analysis, and similar tasks. Clustering a high-dimensional data space is a complex task due to the presence of multiple dimensions.

1.2 Scope

The project focuses on generating optimized cluster groups based on relevance analysis, together with the set of outliers that are irrelevant to the other data points in the data space. Projected clustering based on the k-means algorithm helps the system group the relevant data points into components.

1.3 Overview

This section gives the overall description of the system and the specific requirements of the software, including both the functional and non-functional requirements. The SRS contains the following, in order.

1.3.1 Overall Description of the Product

This section contains the overall description, with sub-sections covering the computation of the sparseness degree for each dimension; splitting the data into clusters as per the given count and rearranging into sub-clusters if data points are irrelevant; PDF estimation; detection of dense regions; outlier handling, which removes irrelevant data points from the data space; and discovery of the projected clusters.


1.3.2 Specific Requirements

This section of the SRS presents the external interface requirements, followed by the functional requirements, the requirements by product feature, the performance requirements, and the design constraints. It also explains the "utility" factors of the product, namely correctness, efficiency, and responsiveness.

2 OVERALL DESCRIPTION

The product under development, "Clustering high dimensional data space", is a self-contained system. The system is externally interfaced to the database management system.

(Fig B.1: Brief structure of the system, showing the high-dimensional data space, the sparseness degree computation, PDF estimation for the clusters, and the database.)


2.2 Product Functions

PDF estimation

Outlier handling

The user has the knowledge to identify dimensionalities.

2.4 Constraints

The product is developed using JSP.

The product requires a minimum of 512 MB of RAM.

The system is assumed to change only when data points are not related to the center of their cluster.


3 SPECIFIC REQUIREMENTS

3.1.1 User Interfaces

The screen layout contains the data sets; these data sets contain points that are multi-dimensional.

Minimum requirements

RAM: 512 MB

HDD: 80 GB

Processor: Intel P4

The product is built using JSP, which runs on the Windows operating system. NetBeans 6.7.1 is used for building the JSP pages. The coverage matrix details are stored in the Oracle database. Interactions with the JSP pages are made using JDBC connectivity.


3.2 Software Product Features

3.2.1 Computation of Sparseness degree

Introduction/Purpose of feature

To find the sparseness degree of data points in each dimension of high

dimensional data space.

Stimulus/Response

Stimulus: The user provides the high-dimensional database table, the number of clusters, and the number of dimensions.

Response: Yij (sparseness degree) of each dimension in the data space.

Introduction/Purpose of feature

To rearrange the clusters based on relevance of data points in each cluster.

Stimulus/Response

Stimulus: fuzzifier value, no.of clusters, threshold ε.

Response: The no.of clusters and data centers are generated.

Introduction/Purpose of feature

To find the dense regions based on scale and shape parameters

Stimulus/Response

Stimulus: For finding the dense regions inputs are cluster centers and no.of clusters.

Response: Shape factor and scale factor are produced to analyse the dense areas.


3.3 Functional Requirements

3.3.3.1.1 Use-Case 1

Use-Case name - Computation of Sparseness degree

Actor - Developer

Pre-condition - Data points are grouped based on the number of nearest points k, in one-dimensional space.

Post-condition - Sparseness degree of each dimension.

3.3.3.1.2 Use-Case 2

Use-Case name - Split the data into clusters

Actor - Developer.

Pre-condition – Initiate the cluster groups and centers.

Post-condition - The data points in each cluster should be relevant.

3.3.3.1.3 Use-Case 3

Use-Case name - PDF estimation

Actor - Developer.

Pre-condition - The number of clusters and the cluster centers are given.

Post-condition - The scale factor and shape factor are calculated to analyse the dense regions and sparse regions in the high-dimensional data space.


3.3.4 Non Functional Requirements

3.3.4.1 Security Requirements

Keep specific log of all activities.

Restrict communications between some areas of the program.

3.3.4.2 Reliability

The system has high precision for the given input high-dimensional data space.

3.3.4.3 Efficiency

The system provides higher efficiency in terms of finding the projected clusters from

the data points of high dimensional data space.

3.3.4.4 Correctness

The system prioritizes the clusters, evaluates the centers optimally, and provides the optimized projected clusters intended by the user.

