
Walchand Institute of Technology, Solapur.
Department of Information Technology

Project Synopsis
BE (IT) - I, Year 2014-15, Sem-I

Project Title: A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data
Project Group:

Name                            Roll No     E-mail                        Phone No
Snehal S. Soni (Team Leader)    BE-IT 66    snehalsonicom@gmail.com       8275159853
Hina S. Mujawar                 BE-IT 45    Hinamujawar1@gmail.com        8408807761
Shweta A. Patil                 BE-IT 52    Shweta.patil2621@gmail.com    9370801639
Sayali S. Chougule              BE-IT 15    Sayalichougule7@gmail.com     9404420141
Prajakta S. Thobade             BE-IT 72    thobadeprajakta@gmail.com     9423605068

Project Guide: Prof. Anil S. Naik

Abstract:
Feature selection involves identifying a subset of the most useful features that produces results compatible with those of the original, entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness relates to the quality of the subset of features. Based on these criteria, a Fast clustering-bAsed feature Selection algoriThm (FAST) is proposed in this project. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning-tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST with several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results on 35 publicly available real-world high-dimensional image, microarray, and text data sets demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
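FAST scores both feature-class relevance and feature-feature redundancy with an information-theoretic measure (symmetric uncertainty in the original paper). The following is a minimal Java sketch of that measure for discrete-valued features; the class and method names are our own illustration, not part of the published implementation.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not the published FAST code): symmetric uncertainty
// SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)) for two discrete-valued variables,
// the measure the FAST paper uses for relevance and redundancy scoring.
public class SymmetricUncertainty {

    // Entropy H(X) of a discrete variable given as an array of symbol values.
    static double entropy(int[] x) {
        Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
        for (int v : x) {
            Integer c = counts.get(v);
            counts.put(v, c == null ? 1 : c + 1);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / x.length;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Joint entropy H(X, Y), computed by pairing the two symbol streams.
    static double jointEntropy(int[] x, int[] y) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < x.length; i++) {
            String key = x[i] + "," + y[i];
            Integer c = counts.get(key);
            counts.put(key, c == null ? 1 : c + 1);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / x.length;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // SU(X, Y) lies in [0, 1]; 0 means the two variables are independent.
    static double su(int[] x, int[] y) {
        double hx = entropy(x);
        double hy = entropy(y);
        double ig = hx + hy - jointEntropy(x, y);   // information gain = mutual information
        return (hx + hy) == 0 ? 0.0 : 2.0 * ig / (hx + hy);
    }

    public static void main(String[] args) {
        int[] feature = {0, 0, 1, 1, 2, 2};
        int[] target  = {0, 0, 1, 1, 1, 1};
        System.out.printf("SU(feature, class) = %.3f%n", su(feature, target));
    }
}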


Background Study (Literature Survey):


Statistical Comparisons of Classifiers over Multiple Data Sets:
In this work, a new pre- or post-processing step is introduced, and the implicit hypothesis is that such an enhancement yields improved performance over the existing classification algorithm. Alternatively, various solutions to a problem are proposed and the goal is to tell the successful ones from the failed. A number of test data sets is selected, the algorithms are run, and the quality of the resulting models is evaluated using an appropriate measure, most commonly classification accuracy.
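A common way to carry out such a comparison over multiple data sets is the Friedman test on average ranks. The Java sketch below computes per-dataset ranks and the Friedman chi-square statistic; the accuracy values are made-up illustration data, not experimental results.

import java.util.Arrays;

// Illustrative sketch of the Friedman test for comparing k algorithms over N data sets:
// rank the algorithms on each data set, average the ranks, and compute
// chi2_F = 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4).
public class FriedmanSketch {

    // Ranks one row of scores (higher score = better = rank 1); ties get averaged ranks.
    static double[] rank(double[] scores) {
        int k = scores.length;
        double[] ranks = new double[k];
        for (int i = 0; i < k; i++) {
            double greater = 0, equal = 0;
            for (int j = 0; j < k; j++) {
                if (scores[j] > scores[i]) greater++;
                else if (scores[j] == scores[i]) equal++;
            }
            ranks[i] = greater + (equal + 1) / 2.0;   // average rank in case of ties
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Rows = data sets, columns = algorithms (made-up accuracies for illustration).
        double[][] accuracy = {
            {0.91, 0.89, 0.85},
            {0.78, 0.80, 0.75},
            {0.66, 0.64, 0.60},
            {0.83, 0.83, 0.79}
        };
        int n = accuracy.length, k = accuracy[0].length;
        double[] avgRank = new double[k];
        for (double[] row : accuracy) {
            double[] r = rank(row);
            for (int j = 0; j < k; j++) avgRank[j] += r[j] / n;
        }
        double sumSq = 0;
        for (double rj : avgRank) sumSq += rj * rj;
        double chi2 = 12.0 * n / (k * (k + 1)) * (sumSq - k * (k + 1) * (k + 1) / 4.0);
        System.out.println("Average ranks: " + Arrays.toString(avgRank));
        System.out.println("Friedman chi-square statistic: " + chi2);
    }
}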

A Feature Set Measure Based on Relief:

Six real-world data sets from the UCI repository are used. Three of them are classification problems with discrete features, the next two are classification problems with discrete and continuous features, and the last one is an approximation problem. The learning algorithm used to check the quality of the selected features is a classification and regression tree learner with pruning. The process and algorithms are implemented in the Orange data mining system. Overall, the nonparametric tests, namely the Wilcoxon and Friedman tests, are suitable for such problems; they are appropriate since they assume some, but limited, commensurability.

Feature Clustering and Mutual Information for the Selection of Variables in Spectral Data:

Many problems in spectrometry require predicting a quantitative value from measured spectra. The major issue with spectrometric data is their functional nature; they are functions discretized with a high resolution. This leads to a large number of highly correlated features, many of which are irrelevant for the prediction. The approach is to describe the spectra in a functional basis whose basis functions are local, in the sense that they correspond to well-defined portions of the spectra.
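Mutual information between a spectral variable and the target is the ranking criterion in this line of work. The sketch below estimates it after simple equal-width binning of the continuous values; the bin count, variable names, and data are illustrative assumptions only.

// Illustrative sketch: estimating mutual information I(X; Y) between a continuous
// spectral variable X and a target Y by equal-width binning of both variables.
public class BinnedMutualInformation {

    // Discretizes a continuous variable into `bins` equal-width intervals.
    static int[] discretize(double[] x, int bins) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        double width = (max - min) / bins;
        int[] out = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            int b = width == 0 ? 0 : (int) ((x[i] - min) / width);
            out[i] = Math.min(b, bins - 1);   // clamp the maximum value into the last bin
        }
        return out;
    }

    // I(X; Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) ) over the binned symbols.
    static double mutualInformation(int[] x, int[] y, int bins) {
        int n = x.length;
        double[][] joint = new double[bins][bins];
        double[] px = new double[bins], py = new double[bins];
        for (int i = 0; i < n; i++) {
            joint[x[i]][y[i]] += 1.0 / n;
            px[x[i]] += 1.0 / n;
            py[y[i]] += 1.0 / n;
        }
        double mi = 0.0;
        for (int a = 0; a < bins; a++)
            for (int b = 0; b < bins; b++)
                if (joint[a][b] > 0)
                    mi += joint[a][b] * Math.log(joint[a][b] / (px[a] * py[b])) / Math.log(2);
        return mi;
    }

    public static void main(String[] args) {
        double[] spectrumBand  = {0.10, 0.12, 0.35, 0.40, 0.80, 0.85};   // made-up values
        double[] concentration = {1.0, 1.1, 2.0, 2.2, 4.0, 4.1};
        int bins = 3;
        double mi = mutualInformation(discretize(spectrumBand, bins),
                                      discretize(concentration, bins), bins);
        System.out.printf("Estimated I(band; concentration) = %.3f bits%n", mi);
    }
}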

On Feature Selection through Clustering:


This paper introduces an algorithm for feature selection that clusters attributes using a special metric and then uses hierarchical clustering for feature selection. Hierarchical algorithms generate clusters that are placed in a cluster tree, which is commonly known as a dendrogram. Clusterings are obtained by extracting those clusters that are situated at a given height in this tree. Several data sets from the UCI repository are used; due to space limitations, only the results obtained with the votes and zoo data sets are discussed. The Bayes algorithms of the WEKA package were used for constructing classifiers on data sets obtained by projecting the initial data sets onto the sets of representative attributes.
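The following Java sketch shows the general idea: agglomerative (single-linkage) clustering of attributes from a precomputed attribute-distance matrix, merging until a chosen number of clusters remains. The distance values and the stopping rule are illustrative assumptions, not the paper's exact metric.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: agglomerative clustering of attributes from a precomputed
// attribute-distance matrix, merging the closest pair until k clusters remain.
public class AttributeClustering {

    // Single-linkage distance between two clusters of attribute indices.
    static double linkage(List<Integer> a, List<Integer> b, double[][] dist) {
        double best = Double.MAX_VALUE;
        for (int i : a) for (int j : b) best = Math.min(best, dist[i][j]);
        return best;
    }

    static List<List<Integer>> cluster(double[][] dist, int k) {
        List<List<Integer>> clusters = new ArrayList<List<Integer>>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> c = new ArrayList<Integer>();
            c.add(i);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double best = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++)
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(clusters.get(a), clusters.get(b), dist);
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            clusters.get(bestA).addAll(clusters.remove(bestB));   // merge the closest pair
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Made-up pairwise distances between four attributes (symmetric, zero diagonal).
        double[][] dist = {
            {0.0, 0.1, 0.8, 0.9},
            {0.1, 0.0, 0.7, 0.8},
            {0.8, 0.7, 0.0, 0.2},
            {0.9, 0.8, 0.2, 0.0}
        };
        System.out.println("Attribute clusters: " + cluster(dist, 2));
    }
}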


Existing System
In past approaches, several algorithms illustrate how to store data in a database and how to retrieve it faster, but the problem is that little attention is paid to maintaining the database in an easy and safe manner.
Feature subset selection methods generally fall into four categories. The embedded methods incorporate feature selection as a part of the training process and are usually specific to given learning algorithms, and may therefore be more efficient than the other three categories; traditional machine learning algorithms like decision trees or artificial neural networks are examples of embedded approaches. The wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets; the accuracy of the learning algorithms is usually high, but the generality of the selected features is limited and the computational complexity is large. The filter methods are independent of learning algorithms and have good generality; their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed. The hybrid methods combine filter and wrapper methods by using a filter method to reduce the search space that will be considered by the subsequent wrapper; they aim to achieve the best possible performance with a particular learning algorithm while keeping time complexity similar to that of the filter methods.
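To make the filter idea concrete, a minimal Java sketch follows: each feature receives a learner-independent relevance score and only features above a threshold are kept. Variance is used here purely as a placeholder score, and the names and threshold are illustrative assumptions; real filters use measures such as symmetric uncertainty.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a generic filter method: score each feature independently of
// any learning algorithm and keep the features whose score exceeds a threshold.
public class FilterSelection {

    // Variance of one feature column (placeholder relevance score).
    static double variance(double[][] data, int col) {
        double mean = 0;
        for (double[] row : data) mean += row[col] / data.length;
        double var = 0;
        for (double[] row : data) var += (row[col] - mean) * (row[col] - mean) / data.length;
        return var;
    }

    // Returns the indices of features whose score is above the threshold.
    static List<Integer> select(double[][] data, double threshold) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int f = 0; f < data[0].length; f++)
            if (variance(data, f) > threshold) kept.add(f);
        return kept;
    }

    public static void main(String[] args) {
        double[][] data = {   // rows = instances, columns = features (made-up values)
            {1.0, 5.0, 0.01},
            {2.0, 5.0, 0.02},
            {3.0, 5.1, 0.01},
            {4.0, 5.0, 0.02}
        };
        System.out.println("Kept feature indices: " + select(data, 0.05));
    }
}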

Disadvantages:
1. The generality of the selected features is limited and the computational complexity is large.
2. Their computational complexity is low, but the accuracy of the learning algorithms is not
guaranteed.
3. The hybrid methods combine filter and wrapper methods, using a filter method to reduce the search space considered by the subsequent wrapper.


Proposed System
Feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. This is because irrelevant features do not contribute to the predictive accuracy, and redundant features do not help in obtaining a better predictor since they mostly provide information that is already present in other features. Of the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant features, while some others can eliminate the irrelevant features while also taking care of the redundant features.
Our proposed FAST algorithm falls into the second group. Traditionally, feature subset selection research has focused on searching for relevant features. A well-known example is Relief, which weighs each feature according to its ability to discriminate instances under different targets based on a distance-based criterion. However, Relief is ineffective at removing redundant features, as two predictive but highly correlated features are likely both to be highly weighted. ReliefF extends Relief, enabling it to work with noisy and incomplete data sets and to deal with multiclass problems, but it still cannot identify redundant features.
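A minimal sketch of the Relief weighting idea described above is given below, assuming binary classes, numeric features already scaled to [0, 1], and Euclidean nearest-hit/nearest-miss search; the data and parameter choices are illustrative only.

// Illustrative sketch of the basic Relief weight update: for each instance, find its
// nearest hit (same class) and nearest miss (other class) and update each feature
// weight by W[f] += diff(f, x, miss) - diff(f, x, hit).
public class ReliefSketch {

    static double distance(double[] a, double[] b) {
        double d = 0;
        for (int f = 0; f < a.length; f++) d += (a[f] - b[f]) * (a[f] - b[f]);
        return Math.sqrt(d);
    }

    // Index of the nearest instance to `i` that has the requested class label.
    static int nearest(double[][] x, int[] y, int i, int wantedClass) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < x.length; j++) {
            if (j == i || y[j] != wantedClass) continue;
            double d = distance(x[i], x[j]);
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }

    static double[] reliefWeights(double[][] x, int[] y) {
        int m = x.length, numFeatures = x[0].length;
        double[] w = new double[numFeatures];
        for (int i = 0; i < m; i++) {              // use every instance as a sampled one
            int hit = nearest(x, y, i, y[i]);
            int miss = nearest(x, y, i, 1 - y[i]);
            for (int f = 0; f < numFeatures; f++) {
                w[f] += Math.abs(x[i][f] - x[miss][f]) / m;   // reward separating the classes
                w[f] -= Math.abs(x[i][f] - x[hit][f]) / m;    // penalize varying within a class
            }
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] x = { {0.1, 0.9}, {0.2, 0.8}, {0.8, 0.85}, {0.9, 0.95} };
        int[] y = { 0, 0, 1, 1 };   // feature 0 separates the classes, feature 1 does not
        double[] w = reliefWeights(x, y);
        System.out.printf("Relief weights: [%.2f, %.2f]%n", w[0], w[1]);
    }
}

Note how two redundant copies of feature 0 would both receive high weights here, which is exactly the limitation of Relief/ReliefF pointed out above.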

Advantages:
1. Good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.
2. FAST efficiently and effectively deals with both irrelevant and redundant features and obtains a good feature subset.
3. Generally, all six algorithms achieve a significant reduction of dimensionality by selecting only a small portion of the original features.
4. The null hypothesis of the Friedman test is that all the feature selection algorithms are equivalent in terms of runtime.


System Flow:

1. Data set
2. Irrelevant feature removal
3. Minimum spanning tree construction
4. Tree partitioning and representative feature selection
5. Selected features
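A minimal Java sketch of the MST-construction and tree-partitioning steps is given below, using Prim's algorithm on a complete feature graph. The edge weights and the cut threshold are made-up illustrations; in FAST the weights come from the information-theoretic correlation between features, and edges are removed by the paper's own criterion rather than a fixed threshold.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the MST-based clustering step: build a minimum spanning tree
// over the complete feature graph with Prim's algorithm, then partition the tree by
// dropping edges whose weight exceeds a threshold.
public class MstFeatureClustering {

    // Returns the MST edges as {u, v} pairs for a symmetric dissimilarity matrix.
    static List<int[]> primMst(double[][] dist) {
        int n = dist.length;
        boolean[] inTree = new boolean[n];
        double[] bestWeight = new double[n];
        int[] bestParent = new int[n];
        java.util.Arrays.fill(bestWeight, Double.MAX_VALUE);
        bestWeight[0] = 0;
        bestParent[0] = -1;
        List<int[]> edges = new ArrayList<int[]>();
        for (int step = 0; step < n; step++) {
            int u = -1;
            for (int v = 0; v < n; v++)          // pick the cheapest vertex not yet in the tree
                if (!inTree[v] && (u == -1 || bestWeight[v] < bestWeight[u])) u = v;
            inTree[u] = true;
            if (bestParent[u] >= 0) edges.add(new int[]{bestParent[u], u});
            for (int v = 0; v < n; v++)          // relax the edges leaving u
                if (!inTree[v] && dist[u][v] < bestWeight[v]) {
                    bestWeight[v] = dist[u][v];
                    bestParent[v] = u;
                }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Made-up pairwise feature dissimilarities (e.g. 1 - correlation), symmetric.
        double[][] dist = {
            {0.0, 0.1, 0.9, 0.8},
            {0.1, 0.0, 0.8, 0.9},
            {0.9, 0.8, 0.0, 0.2},
            {0.8, 0.9, 0.2, 0.0}
        };
        double cutThreshold = 0.5;                // illustrative partition rule
        for (int[] e : primMst(dist)) {
            String kept = dist[e[0]][e[1]] > cutThreshold ? "removed (partitions tree)" : "kept";
            System.out.printf("Edge %d-%d weight %.2f -> %s%n", e[0], e[1], dist[e[0]][e[1]], kept);
        }
    }
}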


SOFTWARE REQUIREMENTS:

Operating System      : Windows XP
Front End             : Java JDK 1.7
Scripts               : JavaScript
Tools                 : NetBeans
Database              : SQL Server or MS-Access
Database Connectivity : JDBC
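Since JDBC is listed as the connectivity layer, a minimal sketch of opening a connection and reading rows is shown below; the connection URL, credentials, and table name are placeholders that depend on whether SQL Server or MS-Access is actually used.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative sketch of JDBC connectivity: open a connection, run a query, and
// iterate over the result set. URL, credentials, and table name are placeholders.
public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder SQL Server URL; an MS-Access setup would use a different driver and URL.
        String url = "jdbc:sqlserver://localhost:1433;databaseName=featuredb";
        Connection con = DriverManager.getConnection(url, "dbuser", "dbpassword");
        try {
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT feature_name, su_score FROM selected_features");
            while (rs.next()) {
                System.out.println(rs.getString("feature_name") + " -> " + rs.getDouble("su_score"));
            }
        } finally {
            con.close();   // always release the connection
        }
    }
}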


References:
[1] Qinbao Song, Jingjie Ni, and Guangtao Wang, "A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, 2013.
[2] T. Jaga Priya Vathana and C. Saravanabhavan, "A Survey on Feature Selection Algorithm for High Dimensional Data Using Fuzzy Logic," Department of CSE, Kongunadu College of Engineering and Technology.
[3] K. Revathi and T. Kalai Selvi, "Effective Feature Subset Selection Methods and Algorithms for High Dimensional Data," vol. 2, nos. 1/2, pp. 279-305, 2013.
[4] T. Jaga Priya Vathana, "A Survey on Feature Selection Algorithm for High Dimensional Data Using Fuzzy Logic," Proc. Fifth Int'l Conf. Recent Advances in Soft Computing, pp. 104-109, 2004.
[5] D.A. Bell and H. Wang, "A Formalism for Relevance and Its Application in Feature Subset Selection," Machine Learning, vol. 41, no. 2, pp. 175-195, 2000.
