
CLUSTERING HIGH DIMENSIONAL DATA USING

SIGNATURES WITH HASHING TECHNIQUE

A PROJECT REPORT

Submitted by

ANUSHA S 1305066
KIRUTHIKA M 1305084
PRATHEEBA R 1305097
SWETHA M 1305116

in partial fulfillment for the award of the degree


of

BACHELOR OF ENGINEERING
in

COMPUTER SCIENCE AND ENGINEERING

COIMBATORE INSTITUTE OF TECHNOLOGY


(Government Aided Autonomous Institution Affiliated to Anna University)
COIMBATORE-641014

ANNA UNIVERSITY-CHENNAI 600 025

APRIL 2017
COIMBATORE INSTITUTE OF TECHNOLOGY
(A Govt. Aided Autonomous Institution Affiliated to Anna University)
COIMBATORE - 641014

BONAFIDE CERTIFICATE

Certified that this project titled CLUSTERING HIGH DIMENSIONAL DATA USING SIGNATURES
WITH HASHING TECHNIQUE is the bonafide work of ANUSHA S (1305066), KIRUTHIKA M
(1305084), PRATHEEBA R (1305097) and SWETHA M (1305116), carried out under my supervision during the
academic year 2016-2017.

Prof. K.S.PALANISAMY, M.E.                                Mrs. S.PRIYA, M.E.


HEAD OF THE DEPARTMENT, SUPERVISOR,
Department of CSE & IT, Department of CSE & IT,
Coimbatore Institute of Technology, Coimbatore Institute of Technology,
Coimbatore 641014 Coimbatore - 641014

Certified that the candidates were examined by us in the project work viva-voce
examination held on

Internal Examiner External Examiner

Place:

Date:
ACKNOWLEDGEMENT

We express our sincere thanks to our Secretary Dr.R.Prabhakar and our Principal
Dr.V.Selladurai for providing us a great opportunity to carry out our work. Words are
too meagre to express our gratitude to them. This work is the outcome of their inspiration and
the product of their plethora of knowledge and rich experience.

We record our deep sense of gratitude to Prof.K.S.Palanisamy, Head of the Department of
Computer Science and Engineering & Information Technology, for his encouragement during this tenure.

We equally tender our sincere thanks to our project guide Mrs.S.Priya, Department of
Computer Science and Engineering & Information Technology, for her valuable suggestions and guidance
during this course.

During the period of study, the entire staff of the Department of Computer Science and
Engineering & Information Technology offered ungrudging help. It is also a great pleasure to
acknowledge the unfailing help we have received from our friends.

It is a matter of great pleasure to thank our parents and family members for their constant support
and co-operation in the pursuit of this endeavor.


ABSTRACT

The rapid growth of high dimensional datasets has created an indispensable need for
analysing the underlying patterns. Clustering is an approach to find these underlying
patterns of interest. Clustering high dimensional data is a challenging task, as data group
together differently under different subsets of dimensions, called subspaces. Subspace
clustering algorithms try to extract these clusters, but they require excessive database scans
and are burdened by redundant clusters, making them computationally expensive. The proposed
system assigns a unique random number to each data point and incrementally
finds dense units in each dimension. The sum of the random numbers of the data points in
each dense unit is calculated and stored in a hash table. Redundant clusters are implied by the
same sum value in the hash table and are thus eliminated. The DBSCAN algorithm is run over
the non-redundant dense units to find the final clusters.
LIST OF ABBREVIATIONS

DB Database of points
D Set of attributes / Dimensions
S Subspace
N Neighbourhood of a point in a subspace S
C Cluster
CS Core set
U Dense unit
H Hash Signature of a dense unit
L Large Integer
HTable Hash table
PCA Principal Component Analysis
1. INTRODUCTION

1.1 Data Mining

Data mining is the computational process of discovering patterns in large relational databases and
summarizing them into useful information. The overall goal of the data mining process is to extract
information from a data set and transform it into an understandable structure for further use.

Data mining consists of five major elements:

Extract, transform, and load transaction data onto the data warehouse system.

Store and manage the data in a multidimensional database system.

Provide data access to business analysts and information technology professionals.

Analyze the data by application software.

Present the data in a useful format, such as a graph or table.

1.2 Data Mining Techniques

1.2.1 Association

Association enables the discovery of interesting relations between different variables in large
databases. Association rule learning uncovers hidden patterns in the data that can be used to identify
variables within the data and the co-occurrences of different variables that appear with the greatest
frequencies.

1.2.2 Clustering

Clustering is a data mining technique that automatically groups objects with similar
characteristics into meaningful or useful clusters.

1.2.3 Classification

Classification is used to classify each item in a set of data into one of a predefined set of classes
or groups. Classification methods make use of mathematical techniques such as decision trees, linear
programming, neural networks and statistics.
1.2.4 Prediction

Prediction is a data mining technique that discovers the relationships between
independent variables and between dependent and independent variables.

1.2.5 Sequential Patterns

Sequential pattern analysis seeks to discover or identify similar patterns, regular events or trends
in transaction data over a business period.

1.3 Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group are
more similar to each other than to those in other groups. Cluster is a group of objects that belongs to the
same class.

Clustering methods can be classified into the following categories:

Partitioning Method

Hierarchical Method

Density-based Method

Grid-Based Method

Constraint-based Method
1.3.1 Partitioning Method
For a database of n objects, the partitioning method constructs k partitions of the data. Each
partition represents a cluster, and k is less than or equal to n. This means that it classifies the
data into k groups, which satisfy the following requirements:

Each group contains at least one object.

Each object must belong to exactly one group.

1.3.2 Hierarchical Methods


This method creates a hierarchical decomposition of the given set of data objects. There are two
approaches,

Agglomerative Approach

Divisive Approach
1.3.2.1 Agglomerative Approach
This method is started with each object forming a separate group. It keeps on merging the
objects or groups that are close to one another. It keeps on doing so until all of the groups are
merged into one or until the termination condition holds.

1.3.2.2 Divisive Approach


This approach starts with all of the objects in the same cluster. In each successive
iteration, a cluster is split into smaller clusters. This continues until each object is in its own
cluster or the termination condition holds.

1.3.3 Density-based Method


This method is based on the notion of density. The basic idea is to continue growing the
given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number
of points.

1.3.4 Grid-based Method


The objects together form a grid. The object space is quantized into a finite number of cells that
form a grid structure.

1.3.5 Constraint-based Method


In this method, the clustering is performed by the incorporation of user or application-oriented
constraints. A constraint refers to the user expectation or the properties of desired clustering results.
Constraints can be specified by the user or the application requirement.

1.4 Applications of Clustering


Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.

Clustering can also help marketers discover distinct groups in their customer base. And they can
characterize their customer groups based on the purchasing patterns.

Clustering also helps in classifying documents on the web for information discovery.

Clustering is also used in outlier detection applications such as detection of credit card fraud.

1.5 Hashing

A hash function is used to index the original value or key, and is then used each time the
data associated with the value or key is to be retrieved. A hash table uses a hash function to compute an
index into an array of buckets or slots, from which the desired value can be found.

1.5.1 Advantage

The main advantage of hash tables over other table data structures is speed, especially when the
number of entries is large. If the set of key-value pairs is fixed and known ahead of time, the average
lookup cost can be reduced by a careful choice of the hash function, bucket table size, and internal data
structures.
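The lookup behaviour described above can be illustrated with Java's built-in HashMap (the keys and values here are made up for illustration only):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of the hash-table lookup described above: the key is
// hashed to a bucket index, so retrieval avoids scanning every entry.
public class HashTableDemo {
    static Map<String, Integer> buildIndex() {
        Map<String, Integer> index = new HashMap<>();
        index.put("record-17", 17);   // key -> stored value (illustrative data)
        index.put("record-42", 42);
        return index;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = buildIndex();
        // Retrieval recomputes the key's hash and probes one bucket,
        // giving average-case constant-time lookup.
        System.out.println(index.get("record-42")); // prints 42
    }
}
```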
2. SYSTEM SPECIFICATION

The hardware and software for the system are selected by considering factors such as CPU processing
speed, peripheral channel speed, printer speed, seek time, rotational delay of the hard disk,
communication speed, etc. The hardware and software specifications are as follows.

2.1 Hardware Specification

Processor : Intel Pentium III or higher
Speed : 2.10 GHz
RAM : 2 GB
Monitor : 15" TFT
Keyboard : 104-key Windows keyboard
Mouse : Optical mouse

Table 2.1: Hardware Specifications

2.2 Software Specification

Operating system : Windows 7 or more


Language : JAVA
IDE : NETBEANS 8.0

Table 2.2: Software Specifications


3. SYSTEM ANALYSIS

3.1 LITERATURE REVIEW

3.1.1 SPARSE SUBSPACE CLUSTERING: ALGORITHM, THEORY, AND APPLICATIONS.

DESCRIPTION:

In sparse subspace clustering, the number of clusters to be formed is not known a priori. Each
data point in a union of subspaces can always be written as a linear combination of all other points. A
block diagonal matrix is formed from the data. The columns within a segment will be the zero
vector or very close to it, because columns from the same subspace share similarity. Columns that greatly
deviate from the zero vector indicate the boundary of a segment, as the similarity is low.

3.1.2 AUTOMATIC SUBSPACE CLUSTERING OF HIGH DIMENSIONAL DATA FOR DATA


MINING APPLICATIONS

DESCRIPTION:

CLIQUE, a clustering algorithm, identifies dense clusters in subspaces of maximum
dimensionality. The subspaces that contain clusters are identified using a bottom-up algorithm which
finds dense units. The dense subspaces are sorted by coverage; the subspaces with the greatest coverage
are kept and the rest are pruned. For the disjoint sets of connected k-dimensional units in the same
subspace, minimal cluster descriptions are generated. By taking as input a cover for each cluster, a
minimal cover is found.

LIMITATIONS:

CLIQUE algorithm does not perform well when the number of dimensions increases.

3.1.3 HIGH DIMENSIONAL DATA: SUBSPACE CLUSTERING - A REVIEW

DESCRIPTION:

The data set is projected in all dimensions. Histograms are constructed over the dimensions. Sparse
histograms do not contribute to clusters. Dimensions having dense histograms are combined recursively.
DFS is applied to find the maximal region that represents a cluster.
3.2 PROPOSED SYSTEM

The proposed system uses the commonality of data points across dimensions as the key step in
cluster contribution, thereby bypassing the generation of clusters across increasing combinations of
dimensions. Clusters generated in 1-D having the same signatures imply a maximal subspace cluster, where
the subspace is the set of those particular dimensions. However, the same signatures in 1-D do not always
imply a maximal subspace cluster, because a cluster in 1-D can have interleaved
dense units of clusters in other dimensions. So a dense unit in 1-D is split into combinations of dense
units of comparatively smaller size tau+1. These combinations of dense units form the Core Set. Dense
units from the Core Set can then be tested for same signatures to find maximal subspace clusters.
DBSCAN can be run across all the maximal subspace clusters so formed to find the final maximal
clusters.

3.2.1 Commonality of data points

For the dense sets of points in each of the 1-dimensional projections of the attribute set of a
given dataset, the sufficiently common points among these 1-dimensional sets lead to the dense points in
the higher dimensional subspaces.

3.2.2 Assigning Point ID and Labels

Each row of the database is assigned a point ID. A random number is generated for each point
ID and maintained in a table so that it can be used during the assignment of signatures. Every point is added
to a point vector. All the point vectors are added to a data matrix.
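The labelling step above can be sketched as follows; this is a minimal illustration, and the class and method names are our own, not part of any library:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Sketch of the labelling step: each point ID is mapped to a unique large
// random integer, which later serves as the point's contribution to a dense
// unit's signature sum.
public class PointLabels {
    static Map<Integer, Long> assignLabels(int numPoints, long seed) {
        Random rng = new Random(seed);
        Map<Integer, Long> labels = new HashMap<>();
        for (int id = 0; id < numPoints; id++) {
            long k;
            do {
                k = rng.nextLong() >>> 1;          // large non-negative integer
            } while (labels.containsValue(k));     // keep the mapping one-to-one
            labels.put(id, k);
        }
        return labels;
    }

    public static void main(String[] args) {
        // five points, each given a unique random label
        System.out.println(assignLabels(5, 7L));
    }
}
```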

3.2.3 Finding Dense Units

A point is said to be dense if it has at least tau points within its epsilon neighborhood. Epsilon
is a distance measure between two points. These dense units can be connected together to form a cluster.
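A minimal sketch of this dense-unit test in one dimension follows; the names and data are illustrative. Sorting the projection first lets the neighbourhood scan stop as soon as a point falls outside the epsilon range:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: in a sorted 1-D projection, a point is dense if at least tau other
// points lie within epsilon of it.
public class DenseUnits {
    static List<Double> densePoints(double[] projection, double epsilon, int tau) {
        double[] sorted = projection.clone();
        Arrays.sort(sorted);
        List<Double> dense = new ArrayList<>();
        for (int i = 0; i < sorted.length; i++) {
            int neighbours = 0;
            // sortedness lets both scans break out of range early
            for (int j = i - 1; j >= 0 && sorted[i] - sorted[j] <= epsilon; j--) neighbours++;
            for (int j = i + 1; j < sorted.length && sorted[j] - sorted[i] <= epsilon; j++) neighbours++;
            if (neighbours >= tau) dense.add(sorted[i]);
        }
        return dense;
    }

    public static void main(String[] args) {
        double[] proj = {1.0, 1.1, 1.2, 5.0, 9.0};
        System.out.println(densePoints(proj, 0.5, 2)); // prints [1.0, 1.1, 1.2]
    }
}
```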

3.2.4 Finding Core Set

A dense unit in 1-D is split into combinations of dense units of comparatively smaller size
tau+1. These combinations of dense units form the Core Set. Dense units from the Core Set can then be
tested for same signatures to find maximal subspace clusters.
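The core-set step amounts to enumerating all (tau+1)-sized subsets of a dense unit, so that interleaved clusters in other dimensions can still be matched by signature. A sketch with illustrative names:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the core-set step: every (tau+1)-sized combination of a 1-D dense
// unit's point IDs becomes a candidate unit for signature matching.
public class CoreSet {
    static List<List<Integer>> combinations(List<Integer> unit, int size) {
        List<List<Integer>> result = new ArrayList<>();
        build(unit, size, 0, new ArrayList<>(), result);
        return result;
    }

    // standard recursive subset enumeration
    private static void build(List<Integer> unit, int size, int start,
                              List<Integer> current, List<List<Integer>> out) {
        if (current.size() == size) {
            out.add(new ArrayList<>(current));
            return;
        }
        for (int i = start; i < unit.size(); i++) {
            current.add(unit.get(i));
            build(unit, size, i + 1, current, out);
            current.remove(current.size() - 1);
        }
    }

    public static void main(String[] args) {
        // a dense unit of 4 point IDs, split into units of size tau+1 = 3
        System.out.println(combinations(List.of(1, 2, 3, 4), 3)); // C(4,3) = 4 units
    }
}
```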
3.2.5 Hashing

The method of assigning signatures to each of these 1-D dense units avoids comparing
the individual points among all dense units in order to decide whether they contain exactly the same points or
not. We hash the signatures of these 1-D dense units from all k dimensions.
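A compact sketch of the signature idea described above; the labels and values here are illustrative. Because the signature is a sum, it is independent of point order, so the same point set found in two dimensions produces the same value:

```java
import java.util.List;
import java.util.Map;

// Sketch: a dense unit's signature is the sum of its points' random labels.
// Equal sums flag the same point set (with high probability) without
// comparing the points one by one.
public class Signatures {
    static long signature(List<Integer> unit, Map<Integer, Long> labels) {
        long sum = 0;
        for (int id : unit) sum += labels.get(id);
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Long> labels = Map.of(1, 982451653L, 2, 817504243L, 3, 613651369L);
        // the same point set discovered as a dense unit in two different dimensions
        long h1 = signature(List.of(1, 2, 3), labels);   // found in dimension 0
        long h2 = signature(List.of(3, 1, 2), labels);   // found in dimension 1
        // equal signatures collide in the hash table, flagging subspace {0, 1}
        System.out.println(h1 == h2); // prints true
    }
}
```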

3.2.6 Maximal Subspace Clusters and DBSCAN

The hash table is now checked for collisions. The resulting collisions lead to the
maximal subspace dense units. These dense units are given as input to the DBSCAN algorithm to obtain
clusters.

Block Diagram:

3.3 Features

It involves a bottom-up approach, since the number of clusters and the number of subspaces
are not known a priori.
This algorithm gives only non-redundant information, i.e., it assigns signatures to the
data points, thereby generating non-redundant clusters.
4. DESIGN

4.1 Input Module

Input is obtained as a database containing information about patients with diabetes. The
database is normalized using WEKA and stored in .csv format.

4.2 Processing Modules

4.2.1 Find dense units

Each row of the database is assigned a point ID. A random number is generated for each point
ID and maintained in a table. A point is said to be dense in a data matrix if it has at least tau points within
its epsilon neighborhood in a single dimension. Epsilon is a distance measure between two points.
These dense units can be connected together to form a cluster.

4.2.2 Find core set

A dense unit in 1-D is split into combinations of dense units of comparatively smaller size
tau+1. These combinations of dense units form the Core Set. Dense units from the Core Set can then be
tested for same signatures to find maximal subspace clusters.

4.2.3 Signature assignment

The method of assigning signatures to each of these 1-D dense units avoids comparing
the individual points among all dense units in order to decide whether they contain exactly the same points or
not. We hash the signatures of these 1-D dense units from all k dimensions.

4.2.4 Maximal subspace cluster generation

The resulting collisions in the hash table lead to the maximal subspace dense units.
These dense units are given as input to the DBSCAN algorithm to obtain clusters.

4.3 Output Module

The output is a set of maximal subspace clusters from all the dimensions.
5. SOFTWARE DESCRIPTION

5.1 NetBeans IDE

The Smarter and Faster Way to Code

NetBeans is a software development platform written in Java. The NetBeans
Platform allows applications to be developed from a set of modular software
components called modules. Applications based on the NetBeans Platform, including the NetBeans
integrated development environment (IDE), can be extended by third-party developers. The
NetBeans IDE is primarily intended for development in Java, but also supports other languages, in
particular PHP, C/C++ and HTML5. NetBeans is cross-platform and runs on Microsoft
Windows, Mac OS X, Linux, Solaris and other platforms supporting a compatible JVM. The NetBeans
team actively supports the product and seeks feature suggestions from the wider community.
Every release is preceded by a period of community testing and feedback.

Best Support for Latest Java Technologies

NetBeans IDE is the official IDE for Java 8. With its editors, code analyzers, and
converters, you can quickly and smoothly upgrade your applications to use new Java 8 language
constructs, such as lambdas, functional operations, and method references. Batch analyzers and
converters are provided to search through multiple applications at the same time, matching patterns
for conversion to new Java 8 language constructs. With its constantly improving Java editor, many
rich features and an extensive range of tools, templates and samples, NetBeans IDE sets the
standard for developing with cutting-edge technologies out of the box.

5.2 Creating, Editing, and Refactoring

The IDE provides wizards and templates to let you create Java EE, Java SE, and Java ME
applications. A variety of technologies and frameworks are supported out of the box. For example, you
can use wizards and templates to create applications that use the OSGi framework or the NetBeans module
system as the basis of modular applications. The language-aware NetBeans editor detects errors while
you type and assists you with documentation popups and smart code completion, all with the speed and
simplicity of your favorite lightweight text editor.
5.3 Building

Out of the box, the IDE provides support for the Maven and Ant build systems. In the New
Project wizard, when you choose to create a new application, you can choose to create Maven-based or
Ant-based applications. You can open Maven-based applications into the IDE without an import process
because the IDE reads project settings from the Maven POM file. In addition, tools are provided for
importing Ant-based projects that were not created in the IDE. The IDE includes a Maven Repository
Browser, as well as graphs for analyzing Maven dependencies.

5.4 Debugging and Profiling

To identify and solve problems in your applications, such as deadlocks and memory leaks, the
IDE provides a feature rich debugger and profiler.

5.5 Testing and Code Analysis

When you are testing your applications, the IDE provides tools for using JUnit and TestNG, as
well as code analyzers and, in particular, integration with the popular open source FindBugs tool.

Rapid User Interface Development

Design GUIs for Java SE, HTML5, Java EE, PHP, C/C++, and Java ME applications quickly and
smoothly by using editors and drag-and-drop tools in the IDE. For Java SE applications, the NetBeans
GUI Builder automatically takes care of correct spacing and alignment, while supporting in-place editing,
as well. The GUI builder is so easy to use and intuitive that it has been used to prototype GUIs live at
customer presentations.
6. IMPLEMENTATION

6.1 Function

Assign a set of random integers to the data points in the database. Sort the points in the database
so that there is no need to scan the whole database each time; the scan stops as soon as even the first two
values are not within epsilon distance. Find the core set by using epsilon as a distance measure between the
points in each dimension such that there are tau points in the neighborhood.

A set K of n large integers is randomly generated and used as a one-to-one mapping M : DB -> K
to assign a unique label to each point in the database. The signature H of a dense unit U is given by the
sum of the labels of the points in it. These 1-D signatures across different dimensions can be matched
without checking the individual points contained in these dense units. The signature sums are hashed
into a hash table. If all the sums collide, then these dense units are the same (with very high probability) and
exist in the subspace {d1, d2, . . . , dm}. Thus, the final collisions after hashing all dense units in all
dimensions generate dense units in the relevant maximal subspaces. These dense units are combined to
get the final clusters in their respective subspaces.

PROCEDURE:
The maximal subspace cluster is generated using this procedure.

STEP 1:

Consider a set K of very large, unique and random positive integers {K1, K2, . . . , Kn}. We
define M as a one-to-one mapping function, M : DB -> K. Each point Pi in DB is assigned a unique
random integer Ki from the set K.

STEP 2:

In each dimension j, we have projections of the n points P1, P2, . . . , Pn. We create all possible
dense units containing tau+1 points that are within an epsilon distance. Instead of the actual points, a dense unit U
will now contain the mapped keys, i.e. {K1, K2, . . . , K(tau+1)}.

STEP 3:

Then, we create a hash table hTable as follows. In each dimension j, for every dense unit Ua we
calculate the sum of its elements, called the signature Ha, and hash this signature into hTable. If Ha collides
with another signature Hb from dimension k, then the dense unit Ua exists in subspace {j, k} with extremely high
probability. After repeating this process in all single dimensions, each entry of this hash table will contain
a dense unit in the maximal subspace, as we can store the colliding dimensions against each signature Hi in
hTable.

STEP 4:

The dense units in all possible maximal subspaces are processed to create density-reachable sets
and hence maximal clusters. We use DBSCAN in each found subspace for the clustering process, and the
values of epsilon and tau can be adapted to the dimensionality of the subspace to deal with the curse
of dimensionality.
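Steps 1 to 3 above can be sketched end to end on a toy 2-D dataset. This is a minimal illustration, not the full implementation: the data, names and parameters (epsilon = 0.5, tau+1 = 3) are our own, and dense units are formed by brute-force enumeration of triples rather than the sorted scan used in the real system.

```java
import java.util.*;

// End-to-end sketch of Steps 1-3: label the points, form epsilon-close units
// of size tau+1 = 3 in each dimension, hash the label sums, and read off the
// dimensions that collide for each signature.
public class SubspaceSketch {
    public static Map<Long, Set<Integer>> run(double[][] data, double eps) {
        Random rng = new Random(42);
        long[] label = new long[data.length];
        for (int i = 0; i < label.length; i++) label[i] = rng.nextLong() >>> 1; // Step 1

        Map<Long, Set<Integer>> hTable = new HashMap<>();    // signature -> dimensions
        int dims = data[0].length;
        for (int d = 0; d < dims; d++) {                     // Steps 2 and 3
            for (int a = 0; a < data.length; a++)
                for (int b = a + 1; b < data.length; b++)
                    for (int c = b + 1; c < data.length; c++) {
                        double lo = Math.min(data[a][d], Math.min(data[b][d], data[c][d]));
                        double hi = Math.max(data[a][d], Math.max(data[b][d], data[c][d]));
                        if (hi - lo <= eps)                  // a unit of tau+1 close points
                            hTable.computeIfAbsent(label[a] + label[b] + label[c],
                                                   k -> new TreeSet<>()).add(d);
                    }
        }
        return hTable;
    }

    public static void main(String[] args) {
        // points 0-2 are close in both dimensions; points 3 and 4 break out in one each
        double[][] data = { {1.0, 2.0}, {1.1, 2.1}, {1.2, 2.2}, {9.0, 2.05}, {1.05, 9.0} };
        // entries whose dimension set has more than one element are maximal
        // subspace candidates; here the unit {P0, P1, P2} collides in subspace {0, 1}
        run(data, 0.5).forEach((sig, ds) -> { if (ds.size() > 1) System.out.println(ds); });
    }
}
```

The colliding entry would then be handed to DBSCAN in Step 4, restricted to the subspace it names.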

6.1.2 Flow Chart


7. ADVANTAGES

7.1 Advantages

Other clustering algorithms find only the common points in each dimension, but this approach also
finds the interleaved dense units.

Unlike other algorithms, it uses a method of assigning signature so that repeated clusters are not
formed.

We need to scan the database only once, unlike other algorithms which require n database scans
for n dimensions.

It runs efficiently on high dimensional databases.

Since the data points are sorted before finding dense units, there is no need to scan all the data
points in a dimension when even the first two points do not contribute to a cluster.

The computation of signatures makes this approach effective.

This approach finds dense units only in single dimension which can be combined together to form
multiple dimensions.

The number of database scans and the number of clusters to be generated need not be estimated before
runtime.
8. CONCLUSION

The generation of large and high dimensional data in the past few years has overwhelmed the
data mining community. This approach efficiently finds quality subspace clusters without expensive
database scans or generating trivial clusters in between. We have compared the SUBSCALE algorithm
against recent subspace clustering algorithms, and our proposed algorithm has performed far better when
it comes to handling high-dimensional datasets. However, the main cost in our SUBSCALE algorithm is
the computation of the candidate 1-dimensional dense units. In addition to splitting the hash table
computation, SUBSCALE has a high degree of parallelism, as there is no dependency in computing dense
units across multiple dimensions. We plan to implement a parallel version of our algorithm on General
Purpose Graphics Processing Units (GPGPU) in the near future.
9. REFERENCES

1. Elhamifar E, Vidal R (2013) Sparse subspace clustering: algorithm, theory, and
applications. IEEE Trans Pattern Anal Mach Intell 35(11):2765-2781.

2. Tierney S, Gao J, Guo Y (2014) Subspace clustering for sequential data. In:
Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE,
pp 1019-1026.

3. Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data:
a review. ACM SIGKDD Explor Newsl 6(1):90-105.

4. Vidal R (2011) Subspace clustering. IEEE Signal Proc Mag 28(2):52-68.

5. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering
of high dimensional data for data mining applications. In: Proceedings of the 1998
ACM SIGMOD International Conference on Management of Data.
