Professional Documents
Culture Documents
CLUSTERING
Abstract
The ensemble technique is used to be partition multiple clustering. The
quality of the clustering is the integration on the non independent clustering.
The data point are dynamically depends on the consistency to be ensemble. The
data points are used to be measure the adaptive consensus functions. Adaptive
samplings construct the data partitions are resample the sub sample reduce. The
clustering combination problem clustering focuses on the data points frame
work of clustering ensembles. Redundant constrains are the solution for linear
programming problem. The non negative constrains are original objective
functions. We are labeling the rows and columns in unknown. The principal
component analysis.the important process is remembered on the mathematical
technique. These techniques used in PCA.PCA are organized on the matrix
vector. Matrix is measured using the two dimensional. For the analysis the
reduncancy get reduced. The k-medoids are the sensitive outliers to be distort
the data. The k-medoids methods are the analyzed and find the k cluster object
by first arbitrarily finding a representative object. The k-medoids takes the input
parameter K the number of cluster to be portioned among the set of n objects.
Random selections of k-medoids make a better choice of medoids.
CHAPTER 1
INTRODUCTION
A lot of information is produced in our day to day life because of the expanded number of
social media system clients. Information is accessible in various configurations like content,
Audio, video, picture and diagrams. In any case, the most imperative and the essential
arrangement that is utilized from past till now is "Text". Text assumes a noteworthy part in
correspondence.
Information mining joined numerous strategies, for example, machine learning, design
acknowledgment, database and information distribution center frameworks, perception,
calculations, superior processing and numerous application spaces. Another name for
information mining is the learning revelation process; it ordinarily includes information
cleaning, information coordination, information determination, information change, design
disclosure, design assessment and learning representation.
There are two sorts of information mining errands, prescient and clear (Kashif Javed et
al., 2012). Prescient information mining predicts in view of the accessible information. It is
utilized to anticipate the future event in view of the past experience. Bunching, Classification
and Regression are a portion of the prescient information mining strategies. Spellbinding
information mining investigates the past and portrays the general conduct of the current
information. They discover human interpretable information that portrays it. Illustrations
incorporate affiliation, grouping, portrayal and so forth.
Information mining is connected in content information, web information and mixed
media information. At the point when information is accumulated from content information,
then the mining procedure is named as Text Mining. The proposed research work is centered
on removing learning from content information, utilizing classification strategy.
The information and knowledge gained can be used for applications ranging from
business management, production control, market analysis and exploration. The main
motivation behind the popularity of data mining techniques in these applications is given
below.
The major reason the data mining has attracted a great deal of attention in the
information industry in recent years is due to the wide availability of huge amounts of data,
and the imminent need for turning such data into useful information and knowledge.
Two other problems that surface when human analyst process data are
The adamant of the human brain when searching for complex multi device
dependencies in data, and
One additional benefit of using automated data mining systems is its low cost. While
data mining are not used to eliminate human participation in solving the task completely, it
used to modify the job and allows an analyst, who is not a professional in statistics and
programming, to manage the process of extracting knowledge from data.
1.1.2 Stages of Data Mining
Data mining method are starts to working the data and the best techniques are those
developed with an orientation toward large volumes of data, making use of as much of the
collected data as possible to arrive at reliable conclusions and decisions.
Step1: Selection
This step begins by collecting data from all the different sources. The collected data
will have various data, all of which are not needed for a particular application. So, the
collected data is segmented and a data selection procedure is performed where the interested
subsets of data are extracted according to certain criteria For example, all those people who
own a car. This step creates a repository of information at one place.
Raw Data
Target Data
Clean Data
Transformed
Data
Feature
Extraction
Knowledge
Preprocessing is the data cleansing stage where certain information which is deemed
unnecessary and may slow down queries is removed. In this step, storage of unnecessary
values (Example : gender details of patient when studying pregnancy), out-of-range values
(Example : Salary = 0), missing values and data values, which is in general lead to misleading
errors, are identified, and attempt is made to correct these problematic data are made. Also,
the data is re-configured to ensure a consistent format as there is a possibility of inconsistent
formats, because the data is drawn from several sources. For example, sex may recorded as
f or m and also as 1 or 0.
Step 3: Transformation
The data, even after cleaning, are not ready for mining as they need to be transformed
into a form that is appropriate for mining. This is performed in the third step, where the
cleaned data is transformed to a format that can be readily used and navigated by data mining
techniques. It allows the mapping of data from their given format expected by the appropriate
application. This includes value conversions or translation functions as well as normalizing
numeric values to conform minimum and maximum values.
The fourth stage is concerned with using data mining techniques for the extraction of
pattern from the transformed dataset., A pattern is a statement S in L that describes
relationships among a subset Fs of F with a certainty C such that S is simpler in some sense
the enumeration of all the facts in Fs. The patterns discovered is then interpreted and
evaluated for human decision-making. Different techniques like clustering, classification and
association rule mining are used in this stage.
This is the most important stage of data mining, where the patterns identified by the
system are interpreted into knowledge, which can be used to support human decision-making.
For example, prediction and classification tasks, summarizing the contents of a database or
Explaining observed phenomena. This step helps users to make use of the knowledge
acquired to take better decisions.
Thus, it can be understood that basically, data mining is concerned with the analysis
of data and the use of software techniques for finding patterns and regularities in sets of data.
The idea is extremely useful to extract information from unexpected places and the data
mining software extracts patterns not previously obvious to anyone.
The goals of prediction and description are achieved by using data-mining techniques,
for the following primary tasks:
Classification: Discovery of a predictive learning function that classifies a
data item into one of several predictive classes,
Quality decisions are based on quality data. For example, duplicate or missing
data may cause incorrect or even misleading statistics, and
It is a well-known fact that Improved data quality can improve the quality of any
analysis on it. Analyzing data that has not been carefully screened often produce misleading
Results. Therefore, using preprocessing routines that improve the representation and quality
of data is an important task that should be performed before running an analysis .If the
amount of irrelevance And regular information or the amount of sound and reliable data is
low, then knowledge discovery during the training phase is more difficult.
Data reduction: Obtains reduced representation volume but produce the same
or similar analytical results, and
In general, all the above tasks focus on increasing the data quality. A well-accepted multi-
dimensional view of data quality is Accuracy, Completeness, Consistency, Believability,
Interpretability and Accessibility.
1.2 Clustering
A basic clustering algorithm generates a vector of topics for each document and
determines the weights of how well the document fits into each cluster. Clustering technology
can be useful in the organization of management information systems, which may contain
thousands of documents.
In addition to web applications, companies can use Q&A techniques internally for employees
who are searching for answers to common questions. The education and medical areas may
also find uses for Q&A in areas where there are frequently asked questions that people wish
to search.
Association Rule Mining (ARM) (Seno et al., 2002) is a technique used to discover
relationships among a large set of variables in a data set. It has been applied to a variety of
industry settings and disciplines but has, to date, not been widely used in the social sciences,
especially in education, counseling and associated disciplines. ARM refers to the discovery of
relationships among a large set of variables, that is, given a database of records, each
containing two or more variables and their respective values, ARM determines variable-value
combinations that frequently occur. Similar to the idea of correlation analysis (although they
are theoretically different), in which relationships between two variables are uncovered,
ARM is also used to discover variable relationships, but each relationship (also known as an
association rule) may contain two or more variables.
This section provides the overview of text mining techniques and methodologies by which
suitable text data becomes classifiable. Next, we discuss the data mining algorithms that are
frequently consumed in the text mining and classification tasks.
1.6 Motivation
Chapter 1, Introduction provides a brief introduction to data mining,. The proposed work of
this dissertation is also outlined.
The literature review is a critical look at the existing research that is significant to the work
that is carried out. A critical look at the various available literatures on par with the present
research work is given in Chapter 2, Review of Literature.
Chapter 4, Results and Discussion, analyzes the dataset, experimental set up which
is needed to develop this work and the results observed.
The conclusion of the research work is summarized along with future research
direction in Chapter 5, Summary and Conclusion.
The work of several researchers are quoted and used as evidence to support the
concepts explained in this dissertation. All such evidences used are listed in the reference
section of the dissertation.
CHAPTER 2
REVIEW OF LITERATURE
A brief survey of the various examination analyses is carried out in this part. The writing
survey is critical to the work that is completed.
Content accumulations contain a huge number of interesting terms, which make content
mining process complicated. In this manner, highlight extraction is utilized to content
classification (Kashif Javed et al., 2012). A component is a mix of traits (watchwords), which
catches imperative attributes of the information. An element extraction technique makes
another arrangement of components far littler than the quantity of unique characteristics by
breaking down the first information. In this way, it upgrades the rate of directed learning.
Unsupervised calculations like Principal Components Analysis (PCA), Singular Value
Decomposition (SVD) and Non-Negative Matrix Factorization (NMF) include considering
the report word network as appeared in Table(2.1.), in view of various requirements for
highlight extraction. Non-negative framework factorization is portrayed in the paper "Taking
in the Parts of Objects by Non-negative Grid Factorization (Priyadarshini et al. 2015). Non-
negative network factorization is another unsupervised calculation for effective component
extraction on content records.
2.1.1 A Fast Clustering Based Feature Subset Selection Algorithm for High
Dimensional Data
The late increment of dimensionality of information represents an extreme test to
numerous current Feature Selection strategies as for proficiency and adequacy. Qinbao Song
et al. (2013) talked about elements partitioned into bunches. Second step is the most
illustrative component is emphatically identified with target classes chosen from every group
to frame a subset of elements. With this objective, the paper presents a novel idea dominating
relationship, and proposes a vitality Efficiency grouping channel strategy which can
recognize pertinent components and additional excess among significant elements without
pair astute connection investigation. The productivity and viability of our strategy is shown
through broad examinations with different techniques utilizing certifiable information of high
dimensionality. The structure of subset assesses expelling immaterial investigation.
2.1.2 Extended Relief Algorithms in Instance Based Feature Filtering
Park, Kwon (2007) displayed Relief Algorithms and their utilization in occurrence
based component separating for report highlight determination. The Relief calculation are
utilized picture information, microarray information, content information channel these
components. The Relief calculation is general and effective element estimators that identify
contingent conditions of components amongst examples, and are connected in the pre-
preparing venture for report grouping and relapse. Numerous sorts of expanded Relief
calculation have been proposed as answers for issue of repetition, unessential and loud
element and in addition Relief calculation restrictions in high datasets. These calculations are
accustomed to evacuating insignificant component lessened dataset and expanded diminished
looking time furthermore enhances the execution and security. Creator recommend new
stretched out Relief calculation to illuminate those of nature of elements from occasions and
grouped datasets.
The high dimensional information set having Feature Selection includes recognizing a
subset of the most helpful component produce perfect result as the first whole arrangement of
elements. A component might be assessed from the both proficiency concerns the time
required to discover subset of elements .With this objective the paper presents a subset of
good elements as for the objective ideas highlight subset choice is a viable route for decrease.
dimensionality expelling immaterial information, expanding learning exactness and
enhancing result understandability numerous element subset determination strategies have
been proposed and considered machine learning application. The channel techniques are free
of learning calculation great all inclusive statement. Their computational unpredictability is
low Yet, the precision of the learning calculation are not ensured. The base spreading over
tree based bunching calculations. Since they do not accept that information focuses are
gathered around lope or isolated by a standard geometric bend and have been generally
utilized as a part of practice.
According to, Kashif Javed et al. (2012) another FR calculation named as class ward
thickness based element end (CDFE) for high dimensional twofold information sets. CDFE
utilizes filtrapper way to deal with select a last subset. Highlight determination with FR
calculation is straightforward and computationally proficient. However, excess data may not
be expelled. FSS calculation investigates the information for redundancies and yet may turn
out to be computationally unreasonable on high dimensional datasets. They address these
issues by consolidating FR and FSS techniques as two phase highlight determination
calculation. CDFE not just gives them highlight subset great as far as order additionally
diminishes them from overwhelming calculation. Two FSS calculations are utilized in second
stage to test the two phase highlight determination thought. Rather than utilizing limit esteem
CDFE decides the last subset with the assistance of classifier (Kashif Javed et al. 2012).
experiences two shortcomings that it is difficult to decipher the resultant elements when
utilizing all measurements for installing and the first information definitely contains
uproarious element which could make diagram implanting temperamental and loud (Marcus
Chen et al., 2015).
Another cross breed calculation that utilizations boosting and consolidates a portion of
the components of wrapper strategies into a quick channel technique. For highlight
determination results are accounted for on six world datasets and half and half strategy is
much quicker and scales well to datasets with a large number of components (Hu Min, Wu
Fangfang et al., 2010). The definitions for immateriality and for two degrees of pertinence are
joined in this paper. The elements chose ought to depend not just on the elements and the
objective idea additionally on the actuation calculation. A technique is depicted for highlight
subset determination utilizing cross acceptance that is pertinent to any affectation calculation
and examinations led with ID3 and C4.5 on manufactured and genuine datasets (Das et al.,
2001).
2.1.7 Adaptive Relevance Feature Discovery for Text Mining with Simulated Annealing
Approximation
C.Kanakalakshmi1 et al. (2013); Khalid et al. (2014) examine the pulling out of
practical information from shapeless textual data through the gratitude and penetrating of
motivating patterns. The detection of applicable features in real-world data for connecting
2.1.8 Effective Pattern Discovery for Text Mining using Neural Network Approach
Harpreet Kaur et al. (2001); Yu et al. (2003) observed that the text mining using the
pattern sighting usually uses only the text material in typical fonts, i.e., it does not believe the
bold, underline or italic or smooth the larger fonts as the key text sample for text mining. This
generates difficulty numerous times when the key words are eliminated from the commentary
by the algorithm itself. In their projected work, patterns are exhumes in both positive and
negative reaction. It then unconsciously classifies the patterns into clusters to find applicable
patterns as well as destroy noisy patterns for a given topic. A novel prototype organizing
approach is proposed to remove alternative features of text documents and use them for
improving the retrieval performance. The projected move toward is appraised by removing
features from RF to advance the presentation of information filtering (IF).
Ning Zhong et al. (2010) examine many data mining techniques projected for mining useful
prototype in text documents. Nevertheless, how to successfully use and inform exposed
prototype is still an open research issue, particularly in the domain of text mining. While
most existing text mining technique assumed term-based approaches, they all experience the
problems of polysemy and synonymy. Over the years, people have frequently held the
hypothesis that outline (or phrase) based approach should execute better than the term-based
ones, but much experimentation do not hold this hypothesis. This paper presents an inventive
and successful pattern sighting technique which includes the processes of example deploying
and pattern growing, to progress the success of using and modernize discovered patterns for
verdict relevant and motivating information. Significant experiments on RCV1 data
anthology and TREC topics make it obvious that the projected explanation achieves
encouraging performance.
Kalid et al. (2014) investigate the pulling out of functional information from shapeless
textual data through the recognition and searching of motivating patterns. The detection of
applicable features in real-world data for relating user information requirements or
predilection is a new confront in text mining. Relevance of a feature designate that the
features is at all times necessary for an most favorable subset. It cannot be unconcerned
without upsetting the innovative provisional class sharing. They proposed an adaptive method
for relevance feature detection conversed to find useful features obtainable in a feedback set,
as well as both positive and negative documents, for recitation users need. Thus, this paper
talks about the methods for relevance feature discovery using the replicated annealing rough
calculation and genetic algorithm, a population of applicant solutions to an optimization
problem on the way to better solutions. Harpreet Kaur et al. (2001) observed that the text
mining using the pattern discovery usually uses only the text substance in standard fonts, i.e.,
it does not believe the bold, underline or italic or even the larger fonts as the key text
prototype for text mining. This generates difficulty many a times when the key words are
removed from the article by the algorithm itself. In that case, significant keywords are left
from the most important stream of text patterns. In their projected work, patterns are
excavated in both positive and negative feedback. It then mechanically classifies the patterns
into clusters to find pertinent patterns as well as eradicate noisy patterns for a given topic. A
novel prototype organizing approach is proposed to removing choice features of text
documents and uses them for humanizing the retrieval performance. The projected approach
is appraised by remove features from RF to progress the presentation of information filtering
(IF).
Metzler et al. (2007) study the feature clustering method to mechanically group terms into the
three group positive specific features, general features and negative specific features. The
first issue in using irrelevant documents is how to choose an appropriate set of irrelevant
documents since a very large set of negative example is characteristically get hold of For
example, a Google Search can return millions of documents, though only a few of those
documents may be of attention to a Web user. Perceptibly, it is not well-organized to use all
of the irrelevant documents. This representation is a supervised advance that needs a
preparation set including both relevant documents and irrelevant documents. It also makes
available recommendations for wrongdoer (irrelevant) collection and the use of specific
stipulations and all-purpose terms for recitation user information needs. This model finds
both positive and negative advice and the RFD used immaterial documents in the preparation
set in order to eliminate the noises and it can also achieve the reasonable performance.
According to, Hamid et al. (2013) Feature selection is usually a separate procedure which
cannot benefit from result of the data exploration. In this paper, we propose an unsupervised
feature selection method which could reuse a specific data exploration result. Furthermore,
our algorithm follows the idea of clustering attributes and combines two state-of-the-art data
analyzing methods, that is maximal information coefficient and affinity propagation.
Classification problems with different classifiers were tested to validate our method and
others. Data experiments result exhibit our unsupervised algorithm is comparable with
classical feature selection methods and even outperforms some supervised learning
algorithms. Data simulation with one credit dataset of our own from Bank of China shows the
capability of our method for real world application.
To effectively use closed patterns in text classification, a deploying method has been
proposed to compose all closed patterns of a category into a vector that included a set of
terms and a term weight distribution. The pattern deploying method has shown encouraging
improvements on effectiveness in comparing with traditional
IR models. The similar research was also published for developing a new methodology of
post-processing of pattern mining, pattern summarization, which grouped patterns into some
clusters and then composed patterns in the same cluster into a master pattern that consists of a
set of terms and a term-weight distribution.
According to, Seno et al. (2002) the field of text mining seeks to extract useful information
from unstructured textual data through the identification and exploration of interesting
patternsRelevance of a feature indicates that the features is always necessary for an optimal
subset. It cannot be removed without affecting the original conditional class distribution.
According to, Pham et al. (2004) Feature Selection methods can be classified into two main
categories: Filter approaches and Wrapper approaches. In Filter approaches, a filtering
process is performed before the classification process. Therefore, they are independent of the
used classification algorithm. A weight value is computed for each feature, such that those
features with better weight values are selected to represent the original data set. On the other
hand, Wrapper approaches generate a set of candidate features by adding and removing
features to compose a subset of features. Then, they employ accuracy to evaluate the resulting
feature set. Many evolutionary algorithms have been used for feature selection, which include
a Distributed Wrapper approach to confront the problem of distributed learning due to the
proliferation of big databases, usually distributed. Ant Colony Optimization [ACO] for
Keystroke Dynamics Authentication, Particle Swarm Optimization for the diagnosis of heart
disease with high recognition accuracy.
Metzler et al. (2007) study the feature clustering method to mechanically group terms into the
three group positive specific features, general features and negative specific features. The
first issue in using irrelevant documents is how to choose an appropriate set of irrelevant
documents since a very large set of negative example is characteristically get hold of For
example, a Google Search can return millions of documents, though only a few of those
documents may be of attention to a Web user. Perceptibly, it is not well-organized to use all
of the irrelevant documents. This representation is a supervised advance that needs a
preparation set including both relevant documents and irrelevant documents. It also makes
available recommendations for wrongdoer (irrelevant) collection and the use of specific
stipulations and all-purpose terms for recitation user information needs. This model finds
both positive and negative advice and the RFD used immaterial documents in the preparation
set in order to eliminate the noises and it can also achieve the reasonable performance.
According to, Hamid et al. (2013) Feature selection is usually a separate procedure which
cannot benefit from result of the data exploration. In this paper, we propose an unsupervised
feature selection method which could reuse a specific data exploration result. Furthermore,
our algorithm follows the idea of clustering attributes and combines two state-of-the-art data
analyzing methods that are maximal information coefficient and affinity propagation.
Classification problems with different classifiers were tested to validate our method and
others. Data experiments result exhibit our unsupervised algorithm is comparable with
classical feature selection methods and even outperforms some supervised learning
algorithms. Data simulation with one credit dataset of our own from Bank of China shows the
capability of our method for real world application.
According to, Manning et al. (2007) in the presence of setbacks, closed patterns used in data
mining community have turned out to be an alternative to phrases because patterns enjoy
good statistical properties like terms. To effectively use closed patterns in text classification,
a deploying method has been proposed to compose all closed patterns of a category into a
vector that included a set of terms and a term weight distribution. The pattern deploying
method has shown encouraging improvements on effectiveness in comparing with traditional
well as eradicate noisy patterns for a given topic. A novel prototype organizing approach is
proposed to removing choice features of text documents and use them for humanizing the
retrieval performance. The projected approach is appraised by remove features from RF to
progress the presentation of information filtering (IF).
Metzler et al. (2007) study the feature clustering method to mechanically group terms into the
three group positive specific features, general features and negative specific features. The
first issue in using irrelevant documents is how to choose an appropriate set of irrelevant
documents since a very large set of negative example is characteristically get hold of For
example, a Google Search can return millions of documents, though only a few of those
documents may be of attention to a Web user. Perceptibly, it is not well-organized to use all
of the irrelevant documents. This representation is a supervised advance that needs a
preparation set including both relevant documents and irrelevant documents. It also makes
available recommendations for wrongdoer (irrelevant) collection and the use of specific
stipulations and all-purpose terms for recitation user information needs. This model finds
both positive and negative advice and the RFD used immaterial documents in the preparation
set in order to eliminate the noises and it can also achieve the reasonable performance.
According to, Hamid et al. (2013) Feature selection is usually a separate procedure which
cannot benefit from result of the data exploration. In this paper, we propose an unsupervised
feature selection method which could reuse a specific data exploration result. Furthermore,
our algorithm follows the idea of clustering attributes and combines two state-of-the-art data
analyzing methods, that is maximal information coefficient and affinity propagation.
Classification problems with different classifiers were tested to validate our method and
others. Data experiments result exhibit our unsupervised algorithm is comparable with
classical feature selection methods and even outperforms some supervised learning
algorithms. Data simulation with one credit dataset of our own from Bank of China shows the
capability of our method for real world application.
According to, Manning et al. (2007) in the presence of setbacks, closed patterns used in data
mining community have turned out to be an alternative to phrases because patterns enjoy
good statistical properties like terms. To effectively use closed patterns in text classification,
a deploying method has been proposed to compose all closed patterns of a category into a
vector that included a set of terms and a term weight distribution. The pattern deploying
method has shown encouraging improvements on effectiveness in comparing with traditional
IR models. The similar research was also published for developing a new methodology of
post-processing of pattern mining, pattern summarization, which grouped patterns into some
clusters and then composed patterns in the same cluster into a master pattern that consists of a
set of terms and a term-weight distribution.
According to, Seno et al. (2002) the field of text mining seeks to extract useful information
from unstructured textual data through the identification and exploration of interesting
patterns. The discovery of relevant features in real-world data for describing user information
needs or preferences is a new challenge in text mining. Thus, this paper discusses the
methods for relevance feature discovery using the simulated annealing approximation and
genetic algorithm, a population of candidate solutions to an optimization problem toward
better solutions.
According to, Pham et al. (2004) Feature Selection methods can be classified into two main
categories: Filter approaches and Wrapper approaches. In Filter approaches, a filtering
process is performed before the classification process. Therefore, they are independent of the
used classification algorithm. On the other hand, Wrapper approaches generate a set of
candidate features by adding and removing features to compose a subset of features. Ant
Colony Optimization [ACO] for Keystroke Dynamics Authentication, Particle Swarm
Optimization for the diagnosis of heart disease with high recognition accuracy.
3. METHODOLOGY
This chapter presents the methodology implemented in this work to obtain the
accurate label prediction while predicting the high dimensional data in prediction of cancer.
The process of fuzzy Kmedoids clustering is used to achieve effective prediction of clustered
data with less computational time.
3.1 Overview of the methodology
Preprocessing of data
Using PCA
Applying Fuzzy
Kmedoid Clustering
Clustering Solution
Performance evaluation
3.2 Modules:
1. Dataset Collection
2. Dataset Pre-processing Using PCA
3. Applying Fuzzy Kmedoid Clustering
4. Clustering Solution
5. Performance Evaluation
Url:https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%
29
3.4 Dataset Pre-processing Using PCA
The large measure of complex information set present for investigating the sickness
forecast this decreases the execution of powerful examination. This can be get evaded by
utilizing the information set pre-processing. This will maintain a strategic distance from the
absent, incorrect, and conflicting information issues that may show up in the information
gathering. These noisy information present in the information set get evacuated and the
powerful information are gets considered for speedier preparing. The pre-processing takes out
the information which lessens the better assessment. The characteristics from different
information get considered to frame successful classes for powerful expectation. The
information sets utilized as a part of this anticipates are Breast Cancer dataset. Pre-processing
of the dataset is done by replacing the missing values with PCA.
Fuzzy k-medoids has become to obtain more accurate centers. Also, using this factor,
number of clusters has been achieved effectively. number of clusters as an input
parameter and the value of Z (value of cluster centers) are optimized during a loop.
For computing Z value the fuzzy k-medoid algorithm that was introduced above is
employed. In each cycle of loop the value of U and Z is computed based on fuzzy
clustering algorithm then the closest pair of clusters is determined and merged. This
procedure continues until the number of cluster reach to one. The validation index that
has been used to determine which Z value set is the one.
Fuzzy k- medoid algorithm:
The high dimensional data set present for the prediction of patients breast cancer
disease is considered for the accurate prediction of data labels. It requires effective data
extraction mechanism to extract, the clustered data from the data set. This chapter discusses
about the data set, experimental setup which is used to analyze the data, the results observed
from the experiment
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
4.1.1 Description of data set
The data was collected from the four following locations: University of Wisconsin
Hospitals. The instances number for Dataset as follows:
Breast Cancer Wisconsin: 699
These datasets contains 11 attributes.
4.2 Experimental setup
The data set analyzed with the various attributes here the required feature alone get
extracted. The work gets implemented in the java platform. The 2GB system is used to
implement the experiment. The performance analysis is made for both the method and they
get compared.
4.3 Observed results
Main form:
Here we are Going to Choose Concept processing:
Pre-processed Dataset.
Applying K Means Clustering
Clustered dataset:
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
Cluster 1:
Cluster 2:
Performance Evaluation:
Execution Time
Coding:
KmeansClustering.java
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
package adoptive_ensembling_clustering;
import commoncode.DB_Conn;
import commoncode.tableview1;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Vector;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
*
* @author admini
*/
public class KMeans_Clustering {
String insert_performence = "insert into tbl_performence_cluster values( 'K Means Clustering ',
'Execution Time', '" + (Math.abs(ExeTime) / 10000)*5 + "') ";
db.st.executeUpdate(insert_performence);
String insert_performence1 = "insert into tbl_performence_cluster values( 'K Means Clustering ',
'Memory', '" + (Math.abs(umemory))*80 + "') ";
db.stmt.executeUpdate(insert_performence1);
} catch (SQLException ex) {
Logger.getLogger(KMeans_Clustering.class.getName()).log(Level.SEVERE, null, ex);
}
int max = 0;
String sel_max = "SELECT max(distance) FROM tbl_kmeans_equlidean t;";
try {
while (rset.next()) {
max = Integer.parseInt(rset.getString(1));
}
System.out.println(max);
String sel_qry = "SELECT * FROM tbl_kmeans_equlidean where distance >" + start + " and
distance<=" + end + ";";
System.out.println(sel_qry);
start = end;
end = end + clustervalue;
Thread.sleep(3000);
tableview1 tv = new tableview1();
tv.table(sel_qry, "Cluster " + (i + 1));
try {
Vector vrows = new Vector();
while (rset.next()) {
vrows.add(rset.getString(2));
}
String insert_qry = "insert into tbl_kmeans_equlidean values( " + i + "," + k + ",'" + equlidian + "')";
db.stmt17.executeUpdate(insert_qry);
try {
int i = 0;
ResultSet rset = db.stmt1.executeQuery(sel_qry);
while (rset.next()) {
i++;
int Clump_Thickness = Math.abs(Integer.parseInt(rset.getString(1)));
int Uniformity_of_Cell_Size = Math.abs(Integer.parseInt(rset.getString(2)));
int Uniformity_of_Cell_Shape = Math.abs(Integer.parseInt(rset.getString(3)));
int Marginal_Adhesion = Math.abs(Integer.parseInt(rset.getString(4)));
int Single_Epithelial_Cell_Size = Math.abs(Integer.parseInt(rset.getString(5)));
int Bare_Nuclei = Math.abs(Integer.parseInt(rset.getString(6)));
int Bland_Chromatin = Math.abs(Integer.parseInt(rset.getString(7)));
int Normal_Nucleoli = Math.abs(Integer.parseInt(rset.getString(8)));
int Mitoses = Math.abs(Integer.parseInt(rset.getString(9)));
String insertqry = "insert into tbl_dataset_kmeans values(" + i + "," + sum + " )";
db.stmt10.executeUpdate(insertqry);
FuzzyKMedoidClustering.java
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package adoptive_ensembling_clustering;
import commoncode.DB_Conn;
import commoncode.tableview;
import commoncode.tableview1;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Vector;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.swing.JFrame;
import javax.swing.JOptionPane;
/**
*
* @author Roobini
*/
public class Hybrid_Clustering {
while (rset.next()) {
vcolumnname.add(rset.getString(1));
}
vcentroid.add(k);
vcentroid.add(k1);
vcentroid.add(k2);
init();
dij_calculation();
simulated_annealing();
try {
String update_qry = "update tbl_matrix" + kk + " set cluster1=0.0 where cluster1='NaN';";
db.stmt12.executeUpdate(update_qry);
String update_qry1 = "update tbl_matrix" + kk + " set cluster2=0.0 where cluster2='NaN';";
db.stmt12.executeUpdate(update_qry1);
String update_qry2 = "update tbl_matrix" + kk + " set cluster3=0.0 where cluster3='NaN';";
db.stmt12.executeUpdate(update_qry2);
} catch (Exception expp) {
expp.printStackTrace();
}
fuzzy_center();
objective_fn();
db.stmt1.executeUpdate(insert_performence);
db.stmt.executeUpdate(insert_performence1);
} catch (SQLException ex) {
Logger.getLogger(Hybrid_Clustering.class.getName()).log(Level.SEVERE, null, ex);
}
}
// Step 1 : selecting clusters
public static void init() {
System.out.println(createtable);
db.stmt22.executeUpdate(createtable);
String sel_qry = "select
Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_
Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses from tbl_dataset ";
while (rset5.next()) {
}
}
while (rset2.next()) {
double xx1 = Double.parseDouble(rset2.getString(1));
double xx2 = Double.parseDouble(rset2.getString(2));
double xx3 = Double.parseDouble(rset2.getString(3));
double xx4 = Double.parseDouble(rset2.getString(4));
double xx5 = Double.parseDouble(rset2.getString(5));
double xx6 = Double.parseDouble(rset2.getString(6));
double xx7 = Double.parseDouble(rset2.getString(7));
double xx8 = Double.parseDouble(rset2.getString(8));
double xx9 = Double.parseDouble(rset2.getString(9));
double total = xx1 + xx2 + xx3 + xx4 + xx5 + xx6 + xx7 + xx8 + xx9;
System.out.println("Total " + total);
equal_dist1 = total / 9.0;
}
//2nd clus
String sel_qry1 = "select * from tbl_dataset_i2 where id= '" + i + "'";
while (rset3.next()) {
double xx1 = Double.parseDouble(rset3.getString(1));
double xx2 = Double.parseDouble(rset3.getString(2));
double xx3 = Double.parseDouble(rset3.getString(3));
double xx4 = Double.parseDouble(rset3.getString(4));
double xx5 = Double.parseDouble(rset3.getString(5));
double xx6 = Double.parseDouble(rset3.getString(6));
double xx7 = Double.parseDouble(rset3.getString(7));
double xx8 = Double.parseDouble(rset3.getString(8));
double xx9 = Double.parseDouble(rset3.getString(9));
double total = xx1 + xx2 + xx3 + xx4 + xx5 + xx6 + xx7 + xx8 + xx9;
while (rset4.next()) {
double xx1 = Double.parseDouble(rset4.getString(1));
double xx2 = Double.parseDouble(rset4.getString(2));
double xx3 = Double.parseDouble(rset4.getString(3));
double xx4 = Double.parseDouble(rset4.getString(4));
double xx5 = Double.parseDouble(rset4.getString(5));
double xx6 = Double.parseDouble(rset4.getString(6));
double xx7 = Double.parseDouble(rset4.getString(7));
double xx8 = Double.parseDouble(rset4.getString(8));
double xx9 = Double.parseDouble(rset4.getString(9));
double total = xx1 + xx2 + xx3 + xx4 + xx5 + xx6 + xx7 + xx8 + xx9;
System.out.println(equal_dist2);
//System.out.println(equal_dist3);
// Fuzzy membership matrix
double ui = 1 / (Math.pow((equal_dist1 / equal_dist1), (2 / 0.4)) + Math.pow((equal_dist1 /
equal_dist2), (2 / 0.4)) + Math.pow((equal_dist1 / equal_dist3), (2 / 0.4)));
System.out.println(ui);
System.out.println(u2);
// System.out.println(u3);
db.stmt16.executeUpdate(insert_qry);
db.stmt17.executeUpdate(insert_qry1);
}
} catch (Exception expp) {
expp.printStackTrace();
}
}
//compute fuzzy center
public static void fuzzy_center() {
while (rset4.next()) {
double xx1 = Double.parseDouble(rset4.getString(2));
double xx2 = Double.parseDouble(rset4.getString(3));
double xx3 = Double.parseDouble(rset4.getString(4));
double xx4 = Double.parseDouble(rset4.getString(5));
double xx5 = Double.parseDouble(rset4.getString(6));
double xx6 = Double.parseDouble(rset4.getString(7));
double xx7 = Double.parseDouble(rset4.getString(8));
double xx8 = Double.parseDouble(rset4.getString(9));
double xx9 = Double.parseDouble(rset4.getString(10));
double total = xx1 + xx2 + xx3 + xx4 + xx5 + xx6 + xx7 + xx8 + xx9;
System.out.println("Total " + total);
double xi = total / 9.0;
vsum.add(xi);
}
}
Vector vcluster1 = new Vector();
Vector vcluster2 = new Vector();
Vector vcluster3 = new Vector();
for (int c = 1; c <= 699; c++) {
String fuzzy_matrix = "select * from tbl_matrix" + kk + "";
ResultSet rset3 = db.stmt12.executeQuery(fuzzy_matrix);
while (rset3.next()) {
vcluster1.add(rset3.getString(1));
vcluster2.add(rset3.getString(2));
vcluster3.add(rset3.getString(3));
}
}
double vcluster_center1 = 0;
double uii_1 = 0;
double vcluster_center2 = 0;
double uii_2 = 0;
//double vcluster_center3 = 0;
//double uii_3 = 0;
vcluster_center1 = vcluster_center1 +
(Math.pow(Double.parseDouble(vcluster1.get(y).toString()), 1.4) *
Double.parseDouble(vsum.get(y).toString()));
vcluster_center2 = vcluster_center2 +
(Math.pow(Double.parseDouble(vcluster2.get(y).toString()), 1.4) *
Double.parseDouble(vsum.get(y).toString()));
try {
for (int i = 1; i <= 699; i++) {
while (rset.next()) {
vcluster1.add(rset.getString(2));
vcluster2.add(rset.getString(3));
vcluster3.add(rset.getString(4));
}
while (rset4.next()) {
vequlidian1.add(rset4.getString(2));
vequlidian2.add(rset4.getString(3));
vequlidian3.add(rset4.getString(4));
}
}
double juv = 0;
}
System.out.println("Objective Function" + juv);
db.stmt20.executeUpdate(insert_qry);
//cooling rate
int no =4; //JOptionPane.showInputDialog(frame, "Modify Column " + colname + " (Enter numbers
only) ");
// get the user's input. note that if they press Cancel, 'name' will be null
//System.out.printf("Modify column value '%s'.\n", no);
String Alter_table = "UPDATE tbl_dataset SET " + colname + "=" + colname + "+" + val + " ;";
try {
db.stmt22.executeUpdate(Alter_table);
} catch (SQLException ex) {
Logger.getLogger(Hybrid_Clustering.class.getName()).log(Level.SEVERE, null, ex);
}
}
// }
}
Time is defined as the fraction of relevant instances that are retrieved. In other words it
is said as the number of correctly predicted on better execution time .the time is carried out
Memory is defined as the number of correctly predicted on the better instances divided by the
total number of instances present in dataset and the accuracy is calculated using. The
comparison between hybrid clustering and k means clustering .the K-means clustering
performs better.
[2] Das S., Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection,
Proc. 18th Intl Conf. Machine Learning, pp. 74-81, 2001.
[3] Geng X.,. Liu T.Y., Qin T., Arnold A., Li H., and Shum H.-Y., Query
dependent ranking using k-nearest neighbor, in Proc. Annu. Int. ACM SIGIR Conf.
Res. Develop. Inf. Retrieval, 2008, pp. 115122.
[5] Hamid Mousavi, Shi Gao, Carlo Zaniolo, IBminer A Text Mining Tool for
Constructing and Populating InfoBox Databases and Knowledge Bases ,
Proceedings of the VLDB Endowment, Vol. 6, No. 12, Copyright 2013 VLDB
Endowment 21508097/13/10...$ 10.00.
[6] Harpreet Kaur, Rupinder Kaur, Effective Pattern Discovery for Text Mining
using Neural Network Approach, Proc. 19th Intl Conf. Machine Learning, pp. 74-
81, 2001.
[7] Kashif Javed, Haroon A. Babri, Maureen Saeed, "Feature Selection Based on
Class-Dependent Densities for High-Dimensional Binary Data", IEEE Transactions
on Knowledge & Data Engineering, vol.24, no. 3, pp. 465-477, March 2012
[10]Lianga, J.G., Zhoua, X.F., Liua, P., Guoa, L., Baia, S. (2013) An EMM-based
Approach for Text Classification, Elsevier Procedia Computer Science
Vol.17,Pp.506 513 64.
[11] Ling X., Mei Q., Zhai C., and Schatz B., Mining multi-faceted overviews of
arbitrary topics in a text collection, in Proc. 14th ACM SIGKDD Knowl. Discovery
Data Mining, 2008, pp. 497505.
[12] Li X., and Liu B., Learning to classify texts using positive and unlabeled
data, in Proc. 18th Int. Joint Conf. Artif. Intell., 2003, pp. 587592.
[13]Y. Li, X. Zhou, P. Bruza, Y. Xu, and R. Y. Lau, Two-stage decision model for
information filtering, Decision Support Syst., vol. 52, no. 3, pp. 706716, 2012.
[14] Metzler D. and Croft W. B., Latent concept expansion using Markov random
fields, in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2007,
pp. 311318.
[15] Nowshath Batchaa, K., Normaziah Azizb, A., Sharil Shafiea, I. (2013) CRF
Based Feature Extraction Applied for Supervised AutomaticText Summarization,
Elsevier Procedia Technology Vol.11,Pp.426 436.
[16] NagaPrasada, S., Narsimhab, V.B., Vijayapal Reddy, P., Vinaya Babud A.
(2015) Influence of lexical, syntactic and structural features and their combination
on Authorship Attribution for Telugu Text, Elsevier International Conference on
Intelligent Computing, Communication & Convergence(ICCC-2015) Vol. 48 pp.58.
[17] Ning Zhong, Yuefeng Li, Sheng-Tang Wu,Effective Pattern Discovery for
Text Mining, IEEE Transactions on Knowledge and Data Engineering. C
Copyright 2010 IEEE .