CLUSTERING TECHNIQUES
Abstract
Owing to the availability of massive biomedical data on each individual, both
healthcare and the life sciences are becoming data-driven. The input attributes are structured or
unstructured data that pose many challenges, including sparse binary attributes with imbalanced
outcomes, non-unique distributed structure, and high-dimensional data, all of which hamper
clinical decision-making in practice. In recent decades, considerable effort has been made
toward overcoming most of these challenges, but there is still an essential need for significant
improvements in this field, especially after integrating both omics and phenotype data for future
personalized medicine. These challenges motivate us to use state-of-the-art big data
analytics and large-scale machine learning frameworks to confront most of the challenges and
provide proper clinical solutions to assist physicians in clinical practice at the bedside and
This research proposes a new recursive screening incremental ranking machine learning
paradigm to empower the desired classifiers, especially for imbalanced training data, to create
suitable data-driven clusters without prior information, and later to reduce the dimensionality of
large biomedical data sets. The new framework combines many binary attributes based on two
criteria: (i) the minimum power value for each combination and (ii) the classification power of
such a combination. Next, these sets of combined attributes are reviewed by physicians, who
select the set of rules that make clinical sense; the result is then used to
empower the desired healthcare event (binary or multinomial target) at the bedside. After
empowering the target class categories, we select the k significant risk drivers with a suitable
volume of data and high correlation to the desired outcome, and then establish the proper
segmentation using AND-OR associative relationships. Finally, we use the propensity score to
handle the imbalanced data, and then build breakthrough machine learning/data mining models on
high-performance computing platforms, such as scalable MapReduce/HDFS, Spark MLlib, and
Google Sibyl.
Comparative studies with both simulated and real-life biomedical databases are carried out to
identify specific biomedical and healthcare outcomes, such as asthma, breast cancer, gene
mutation selection, and genomic association studies for specific complex diseases. The results
show that the proposed incremental learning scheme empowers the new classifier with reliable
and stable performance. The new classifier outperforms existing predictive models in
both outcome quality and execution time, especially on imbalanced, sparse, and
high-dimensional biomedical big data. We recommend that future work be conducted
using real-life integrated clinical-genomic big data with genome-wide association studies for
Clusters are formed according to the distance between the data points and the center
computed for each cluster. The two algorithms are compared using normally distributed data
points. The investigation covers two unsupervised clustering methods, K-Means and
Entropy Based Mean Clustering, which are analyzed based on the distance
between the input data points. For the implementation, we take the data sets from the UCI Machine
Learning Repository. The implementation was carried out in Advanced Java, together with MS Excel.
Execution time is measured in milliseconds. This chapter deals with a method for
improving the efficiency of the said algorithm and analyzes the elapsed time taken, predicting
Proposed System
The proposed approach, Entropy Based Mean (EBM) Clustering, consists of several
enhancements to basic K-Means clustering. Partitioning algorithms work well for finding
spherical-shaped clusters in different types of data points. The advantage of the K-Means
algorithm is its favorable execution time. Its drawback is that the user has to know in advance
how many clusters are being searched for. The experimental results indicate that the K-Means
algorithm is efficient for smaller data sets, while EBM Clustering is better suited to larger
databases. For both algorithms, the size and shape of each cluster depend on the type of
attribute selected for clustering and on the number of unique values of the attribute considered.
EBM Clustering is preferable because it eliminates empty clusters during
cluster generation itself, so no additional phases are required. The running time of the two
algorithms is measured as CPU elapsed time. As a rule, the measured time
varies from one processor to another, depending on the speed and the type of the
system.
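As a rough illustration of the distance-based assignment and the elapsed-time measurement described above, the following is a minimal sketch of plain K-Means in Python. It is not the thesis implementation (which was written in Java) and it does not implement EBM Clustering. The helper names make_normal_points and timed_ms, the random initialization, and the rule that an empty cluster simply keeps its previous center are all assumptions made for this example, and the data are synthetic normally distributed points rather than the UCI data sets.

```python
import random
import time


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Plain K-Means (Lloyd's algorithm); returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k randomly chosen data points.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its previous center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, labels


def make_normal_points(n_per_cluster=100, seed=1):
    """Two 2-D Gaussian blobs (hypothetical test data, not the UCI sets)."""
    rng = random.Random(seed)
    blob_a = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_per_cluster)]
    blob_b = [(rng.gauss(10.0, 1.0), rng.gauss(10.0, 1.0)) for _ in range(n_per_cluster)]
    return blob_a + blob_b


def timed_ms(func, *args, **kwargs):
    """Run func and return (result, CPU time in milliseconds)."""
    start = time.process_time()
    result = func(*args, **kwargs)
    return result, (time.process_time() - start) * 1000.0


points = make_normal_points()
(centers, labels), elapsed_ms = timed_ms(k_means, points, k=2)
print(f"centers={centers}, elapsed={elapsed_ms:.3f} ms")
```

Because the measured time depends on the processor and the system load, a comparison between two algorithms is only meaningful when both are timed on the same machine, ideally averaged over several runs.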
clustering. From the results produced by the proposed algorithm, EHAC, one can easily
understand the different levels of observation in the data analysis. One of the major advantages of this
approach is the representation of global clusters, which is not possible in non-hierarchical
clustering approaches. The proposed algorithm can be optimized for feature selection, which
iii. The proposed mechanism, called duo-mining, is designed and developed for organizing diabetic
patient data; it combines the features of both data mining and text mining. This
combination has produced better results because, instead of only being able to analyze
the structured data collected from transactions, one can add different patterns from the text
mining side. These new developments in text mining technology, which go beyond simple
searching methods, are the key to information discovery and have a promising outlook for
iv. Verification and validation are key points for any development; the last chapter of this
thesis is meant for checking the correctness of the proposed algorithms. To validate
the approach, we used the following metrics: accuracy, return on investment, complexity, and
ability to handle null values. To evaluate these metrics, we used the Breast
Cancer data set from UCI. Finally, we found that Mean Clustering shows better results with respect to the