
BIO THERAPEUTIC INFORMATION INVESTIGATION USING NARRATIVE CLUSTERING TECHNIQUES

Abstract

Currently, due to the availability of massive biomedical data on each individual, both
healthcare and the life sciences are becoming data-driven. The input attributes are structured
and unstructured data that pose many challenges, including sparse binary attributes with
imbalanced outcomes, non-unique distributed structure, and high-dimensional data, all of which
hamper clinical decision-making in practice. In recent decades, considerable effort has been made
toward overcoming most of these challenges, but there is still an essential need for significant
improvement in this field, especially after integrating both omics and phenotype data for future
personalized medicine. These challenges motivate us to use state-of-the-art big data analytics
and large-scale machine learning frameworks to confront most of them and to provide proper
clinical solutions that assist physicians in clinical practice at the bedside and subsequently
provide high-quality care while reducing its cost.

This research proposes a new recursive-screening, incremental-ranking machine learning
paradigm to empower the desired classifiers, especially for imbalanced training data, to create
suitable data-driven clusters without prior information, and later to reduce the dimensionality
of large biomedical data sets. The new framework combines many binary attributes based on two
criteria: (i) the minimum power value for each combination and (ii) the classification power of
such a combination. Next, physicians investigate these sets of combined attributes to select the
set of rules that makes clinical sense, and the result is used to empower the desired healthcare
event (binary or multinomial target) at the bedside. After empowering the target class
categories, we select the k significant risk drivers with a suitable volume of data and high
correlation to the desired outcome, and then establish the proper segmentation using AND-OR
associative relationships. Finally, we use the propensity score to handle the imbalanced data,
and then build breakthrough machine learning/data mining predictive models based on functional
networks with maximum-likelihood estimation and a Newton-Raphson iterative matrix computation
mechanism, to expedite implementation within high-performance computing platforms such as
scalable MapReduce/HDFS, Spark MLlib, and Google Sibyl.
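
As a rough illustration of the Newton-Raphson maximum-likelihood iteration mentioned above,
the following minimal sketch fits a one-feature logistic model. It is not the functional-network
model of this work, whose details are not given here; the choice of logistic regression, the
function name fit_logistic_newton, and the toy data are all assumptions made for illustration.

```python
import math

def fit_logistic_newton(xs, ys, max_iter=25, tol=1e-10):
    """Fit p(y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson on the log-likelihood.

    Illustrative only: a two-parameter model, so the 2x2 system is solved in closed form.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0                # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0        # entries of X^T W X (negative Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            residual = y - p
            weight = p * (1.0 - p)
            g0 += residual
            g1 += residual * x
            h00 += weight
            h01 += weight * x
            h11 += weight * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break                    # singular system; stop rather than divide by zero
        # Newton step: beta <- beta + (X^T W X)^{-1} * gradient
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (-h01 * g0 + h00 * g1) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical toy data; the classes overlap, so the maximum-likelihood estimate is finite.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(xs, ys)
print("intercept:", b0, "slope:", b1)
```

For larger models the same update is written with matrices, which is the form that distributed
platforms parallelize; the closed-form 2x2 solve is used here only to keep the sketch short.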

Comparative studies with both simulated and real-life biomedical databases are carried out to
identify specific biomedical and healthcare outcomes, such as asthma, breast cancer, gene
mutation selection, and genomic association studies for specific complex diseases. The results
show that the proposed incremental learning scheme empowers the new classifier with reliable
and stable performance. The new classifier outperforms existing predictive models in both
outcome quality and execution time, especially on imbalanced, sparse, high-dimensional big
biomedical data. We recommend that future work be conducted using real-life integrated
clinico-genomic big data with genome-wide association studies for future personalized medicine.


Existing System

Clusters are formed according to the distance between the data points and the cluster center
computed for each cluster. This investigation examines two unsupervised clustering methods,
namely K-Means and Entropy Based Mean Clustering, and compares the two algorithms on normally
distributed data points, based on the distance between the input data points. For the
implementation, we take the datasets from the UCI Machine Learning Repository. The
implementation was carried out using Advanced Java and MS-Excel. The execution time is measured
in milliseconds. This chapter deals with a method for improving the efficiency of the algorithm
by analyzing the elapsed time taken, predicting the best seed points, and removing the empty
clusters.
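
The distance-based assignment and the millisecond timing described above can be sketched as
follows. This is a minimal plain K-Means in Python, not the Advanced Java implementation used
in this work; the function name kmeans, the synthetic normally distributed points, and the
choice of the first point of each group as seed are assumptions made for illustration.

```python
import math
import random
import time

def kmeans(points, seeds, max_iter=100):
    """Plain K-Means: assign each point to its nearest center, then recompute the centers."""
    centers = [tuple(c) for c in seeds]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # Index of the center with the smallest Euclidean distance to p.
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: no center moved
            break
        centers = new_centers
    return centers, clusters

# Synthetic, normally distributed test points (hypothetical data, fixed seed for repeatability).
rng = random.Random(0)
group_a = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50)]
group_b = [(rng.gauss(10.0, 1.0), rng.gauss(10.0, 1.0)) for _ in range(50)]
points = group_a + group_b

start = time.perf_counter()
centers, clusters = kmeans(points, seeds=[group_a[0], group_b[0]])
elapsed_ms = (time.perf_counter() - start) * 1000.0
print("cluster sizes:", [len(c) for c in clusters])
print("elapsed time (ms):", elapsed_ms)
```

Note that this plain version keeps the old center when a cluster receives no points, so empty
clusters can survive to the end of the run.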

Proposed System

i. The proposed approach, Entropy Based Mean (EBM) Clustering, consists of several enhanced
features of basic K-Means clustering. Partitioning algorithms work well for finding
spherical-shaped clusters in different types of data points. The advantage of the K-Means
algorithm is its favorable execution time. Its drawback is that the user has to know in advance
how many clusters are sought. From the experimental results, it is observed that the K-Means
algorithm is efficient for smaller data sets and Entropy Based Mean Clustering is good for
larger databases. For both algorithms, the size and shape of the clusters depend on the type of
attribute selected for clustering and on the number of unique values of the attribute considered
for the clustering. EBM Clustering is preferable because it eliminates the empty clusters during
the generation itself, so no additional phases are required. The time complexity is estimated
from the CPU elapsed time of the two algorithms. As a rule, the elapsed time varies from one
processor to another, depending on the speed and the type of the system.
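
The idea of eliminating empty clusters during generation, rather than in a separate clean-up
phase, can be sketched as below. This is not the EBM Clustering algorithm itself, whose entropy
criterion is not specified here; it only shows a K-Means-style loop that drops any center
receiving no points in the same pass that assigns the points. The function names and the toy
data are assumptions.

```python
import math

def assign_and_prune(points, centers):
    """One pass: assign points to the nearest center and discard centers left empty."""
    clusters = {j: [] for j in range(len(centers))}
    for p in points:
        nearest = min(clusters, key=lambda j: math.dist(p, centers[j]))
        clusters[nearest].append(p)
    kept = [members for members in clusters.values() if members]   # empty clusters vanish here
    new_centers = [tuple(sum(coord) / len(m) for coord in zip(*m)) for m in kept]
    return new_centers, kept

def kmeans_without_empty_clusters(points, seeds, max_iter=100):
    """Repeat the pruning pass until the surviving centers stop moving."""
    centers = [tuple(c) for c in seeds]
    clusters = []
    for _ in range(max_iter):
        new_centers, clusters = assign_and_prune(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical example: three seeds are requested, but the third is far from every point.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_seeds = [(0, 0), (10, 10), (100, 100)]
final_centers, final_clusters = kmeans_without_empty_clusters(toy_points, toy_seeds)
print("clusters kept:", len(final_clusters))
```

The trade-off is that the final number of clusters may be smaller than the number of seeds
requested.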

ii. The proposed approach, called Enhanced Agglomerative Hierarchical Clustering, is designed
and developed to overcome the complexity of traditional single-link hierarchical clustering.
Through the results produced by the proposed algorithm EHAC, one can easily understand
different levels of observation in data analysis. One of the major advantages of this approach
is the representation of global clusters, which is not possible in non-hierarchical clustering
approaches. The proposed algorithm can be optimized with feature selection, which reduces the
size of the dendrogram.
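
For reference, the traditional single-link agglomerative procedure that the proposed approach
builds on can be sketched as follows. This is the naive baseline only, not the enhanced EHAC
algorithm, whose details are not given here; the function name single_link and the sample
points are assumptions.

```python
import math

def single_link(points, num_clusters):
    """Naive single-link agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two clusters whose
    closest members are nearest to each other. Returns the clusters (as lists of point
    indices) and the merge history, which corresponds to the levels of a dendrogram.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link distance: minimum distance over all cross-cluster pairs.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical sample points: two tight pairs and one distant point.
sample = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
three_level, _ = single_link(sample, 3)
two_level, history = single_link(sample, 2)
print("3 clusters:", three_level)
print("2 clusters:", two_level)
```

Stopping the merging at different values of num_clusters gives the different levels of
observation that a hierarchy offers and a flat partition does not.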

iii. The proposed mechanism, called duo-mining, is designed and developed for diabetic patient
data organization; it combines the features of both Data Mining and Text Mining. This
combination of processes has given better results because, instead of only being able to analyze
the structured data collected from transactions, analysts can add different patterns from the
text mining side. These new developments in text mining technology, which go beyond simple
searching methods, are the key to information discovery and have a promising outlook for
application in all areas of work.

iv. Verification and validation are key points for any development; the last chapter of this
thesis is meant for checking the correctness of the proposed algorithms. To validate the
approach, we used the following metrics: Accuracy, Return on Investment, Complexity, and
Ability to handle null values. To evaluate these metrics, we used the Breast Cancer data set
from UCI. Finally, we found that Mean Clustering shows better results with respect to the
defined metrics, such as time complexity, accuracy, and Return on Investment.
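
How accuracy is computed for a clustering is not stated above. One common convention, shown
here only as an assumed example, labels each cluster with its majority class and counts the
points that agree with that label. The function name cluster_accuracy and the sample labels are
hypothetical and are not results from this work.

```python
from collections import Counter

def cluster_accuracy(cluster_ids, true_labels):
    """Fraction of points whose true label matches the majority label of their cluster."""
    counts_by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cid, Counter())[label] += 1
    # For each cluster, the size of its largest class is the number of "correct" points.
    correct = sum(counts.most_common(1)[0][1] for counts in counts_by_cluster.values())
    return correct / len(true_labels)

# Hypothetical assignments for six records with known diagnoses.
ids = [0, 0, 0, 1, 1, 1]
labels = ["benign", "benign", "malignant", "malignant", "malignant", "benign"]
accuracy = cluster_accuracy(ids, labels)
print("accuracy:", accuracy)
```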
