2A & 2B, Madhuban Chowk, Outer Ring Road, Phase –I, Delhi – 110085
TABLE OF CONTENTS
PARTICULARS
1. TITLE PAGE
2. STUDENT DECLARATION
3. CERTIFICATE
4. ACKNOWLEDGEMENT
5. INTRODUCTION
6. PHASES IN THE BDA PROCESS
7. BENEFITS OF USING BIG DATA ANALYTICS IN HEALTHCARE
8. LITERATURE REVIEW
9. RESEARCH METHODOLOGY
10. DATA ANALYSIS AND INTERPRETATION
11. CONCLUSION
12. REFERENCES
13. BIBLIOGRAPHY
STUDENT DECLARATION
I hereby declare that the project report titled “DATA ANALYSIS ON CHALLENGES OF
HEALTHCARE CENTERS” is my original work and has not been published or submitted
for any degree, diploma or other similar title elsewhere. This has been undertaken for the
purpose of partial fulfillment of the GRADUATION DEGREE IN MANAGEMENT at
RUKMINI DEVI INSTITUTE OF ADVANCED STUDIES.
MANAS DUA
BBA 6TH SEM
SECTION A
EVENING
CERTIFICATE
_________________
Signature of Mentor
ACKNOWLEDGEMENT
I would like to express my special thanks of gratitude to my teacher Ms. Bhajneet Kaur, as
well as our Director Prof. Dr. Raman Garg, who gave me the golden opportunity to do this
wonderful project on the topic ‘DATA ANALYSIS ON CHALLENGES OF HEALTHCARE
CENTERS’, which also helped me in doing a lot of research, through which I came to know
about so many new things. I am really thankful to them.
Secondly, I would also like to thank my parents and friends, who helped me a lot in finalizing
this project within the limited time frame.
MANAS DUA
SECTION A
EVENING
CHAPTER 1
INTRODUCTION
Big data refers to data sets that are so voluminous and complex that traditional data-processing
application software is inadequate to deal with them. Big data challenges include capturing
data, data storage, data analysis, search, sharing, transfer, visualization, updating, and
information privacy. There are three dimensions to big data, known as Volume, Variety and
Velocity.
Healthcare is a prime example of how the three Vs of data, velocity (the speed of generation of
data), variety, and volume, are an innate aspect of the data it produces. This data is spread
among multiple healthcare systems, health insurers, researchers, government entities, and so
forth. Furthermore, each of these data repositories is siloed and inherently incapable of
providing a platform for global data transparency. To add to the three Vs, the veracity of
healthcare data is also critical for its meaningful use towards developing translational research.
Big data in healthcare is overwhelming not only because of its volume but also because of the
diversity of data types and the speed at which it must be managed. The totality of data related
to patient healthcare and wellbeing makes up “big data” in the healthcare industry. It includes
clinical data from CPOE and clinical decision support systems (physicians' written notes and
prescriptions, medical imaging, laboratory, pharmacy, insurance, and other administrative
data); patient data in electronic patient records (EPRs); machine-generated/sensor data, such as
vital-sign monitoring; social media posts, including Twitter feeds, blogs, status updates on
Facebook and other platforms, and web pages; and less patient-specific information, including
emergency care data, news feeds, and articles in medical journals.
For the big data scientist there is, amongst this vast amount and array of data, opportunity. By
discovering associations and understanding patterns and trends within the data, big data
analytics has the potential to improve care, save lives, and lower costs. Thus, big data analytics
applications in healthcare take advantage of the explosion in data to extract insights for
making better-informed decisions, and as a research category are referred to as, no surprise
here, big data analytics in healthcare.
First and most significantly, the volume of data is growing exponentially in the biomedical
informatics fields. For example, the ProteomicsDB covers 92% (18,097 of 19,629) of known
human genes that are annotated in the Swiss-Prot database. ProteomicsDB has a data volume
of 5.17 TB. In the clinical realm, the promotion of the HITECH Act has nearly tripled the
adoption rate of electronic health records (EHRs) in hospitals to 44% from 2009 to 2012. Data
from millions of patients have already been collected and stored in an electronic format, and
these accumulated data could potentially enhance health-care services and increase research
opportunities. In addition, medical imaging (eg, MRI, CT scans) produces vast amounts of data
with even more complex features and broader dimensions. One such example is the Visible
Human Project, which has archived 39 GB of female datasets. These and other datasets will
provide future opportunities for large aggregate collection and analysis.
The second feature of big data is the variety of data types and structures. The ecosystem of
biomedical big data comprises many different levels of data sources to create a rich array of
data for researchers. For example, sequencing technologies produce “omics” data
systematically at almost all levels of cellular components, from genomics, proteomics, and
metabolomics to protein interaction and phenomics. Much of the data that are unstructured (eg,
notes from EHRs, clinical trial results, medical images, and medical sensors) provide many
opportunities and a unique challenge to formulate new investigations.
The third characteristic of big data, velocity, refers to producing and processing data. The new
generation of sequencing technologies enables the production of billions of DNA sequence
data each day at a relatively low cost. Because faster speeds are required for gene sequencing,
big data technologies will be tailored to match the speed of producing data, as is required to
process them. Similarly, in the public health field, big data technologies will provide
biomedical researchers with time-saving tools for discovering new patterns among population
groups using social media data.
Despite the enormous expenditure consumed by current healthcare systems, clinical
outcomes remain suboptimal, particularly in the USA, where 96 people per 100,000 die annually
from conditions considered treatable [26]. A key factor attributed to such inefficiencies is the
inability to effectively gather, share, and use information in a more comprehensive manner
within the healthcare systems [27]. This is an opportunity for big data analytics to play a more
significant role in aiding the exploration and discovery process, improving the delivery of care,
helping to design and plan healthcare policy, and providing a means for comprehensively
measuring and evaluating the complicated and convoluted healthcare data. More importantly,
adoption of insights gained from big data analytics has the potential to save lives, improve care
delivery, expand access to healthcare, align payment with performance, and help curb the
vexing growth of health care costs.
PHASES IN THE BDA PROCESS
We can map the steps of the BDA process to the data mining knowledge discovery steps as
follows:
1. Data acquisition:
As already mentioned, data is fed into the system through many external sources: clinical
data from Clinical Decision Support Systems (CDSS), EMRs, EHRs, machine-generated sensor
data, data from wearable devices, national health register data, drug-related data from
pharmaceutical companies, and social media data such as Twitter feeds, Facebook statuses, web
pages, blogs, articles, and many more. This data is stored either in databases or in a data
warehouse. With the advent of cloud computing, it is convenient to store such voluminous data
on the cloud rather than on physical disks, which is a more cost-effective and manageable way
to store data.
2. Data cleaning:
The acquired data should be complete and in a structured format for performing effective
analysis. Healthcare data generally suffers from flaws: many patients do not share their data
completely, such as data about their dietary habits, weight, and lifestyle. In such cases the
empty fields need to be handled appropriately. Another example is a field like the gender of a
person, which can hold at most one of two values, male or female; if any other value, or no
value, is present, such entries need to be marked and handled accordingly. Data from sensors,
prescriptions, medical images, and social media needs to be expressed in a structured form
suitable for analysis.
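The cleaning rules described above (marking empty fields and invalid categorical values such as an out-of-range gender entry) can be sketched as follows; the record and field names are hypothetical:

```python
# A minimal cleaning-step sketch (hypothetical field names): flag empty
# fields and invalid categorical values before analysis.

def clean_record(record, valid_genders=("male", "female")):
    """Return a cleaned copy of a patient record plus a list of issues found."""
    cleaned, issues = dict(record), []
    # Empty fields are marked explicitly rather than left as blanks.
    for field, value in record.items():
        if value in ("", None):
            cleaned[field] = None
            issues.append(f"missing:{field}")
    # The gender field may hold at most one of two values, as in the text.
    gender = (record.get("gender") or "").strip().lower()
    if gender and gender not in valid_genders:
        cleaned["gender"] = None
        issues.append("invalid:gender")
    return cleaned, issues

raw = {"patient_id": "P001", "gender": "unknown", "weight_kg": ""}
cleaned, issues = clean_record(raw)
# Both problems are recorded in `issues` for downstream handling.
```

Entries flagged this way can then be imputed, excluded, or routed for manual correction, depending on the analysis.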
3. Data integration:
The BDA process uses data accumulated across various platforms. This data can vary in
metadata (the number of fields, their types, and formats). The entire data has to be aggregated
correctly and consistently into a dataset that can be used effectively for data analysis. This is
a very challenging task, considering the volume and variety of big data.
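The integration step can be sketched as mapping records from heterogeneous sources onto one shared schema; the sources and field names below are illustrative only:

```python
# An integration-step sketch (hypothetical sources and schema): records from
# different systems carry different fields, so they are mapped onto one
# shared schema before analysis.

SCHEMA = ("patient_id", "age", "diagnosis")

def integrate(sources):
    """Aggregate records from heterogeneous sources into one consistent dataset."""
    dataset = []
    for source in sources:
        for rec in source:
            # Keep only schema fields; fields absent in a source become None.
            dataset.append({f: rec.get(f) for f in SCHEMA})
    return dataset

ehr = [{"patient_id": "P1", "age": 54, "diagnosis": "diabetes", "ward": "B"}]
sensor = [{"patient_id": "P1", "heart_rate": 80}]
merged = integrate([ehr, sensor])
# Every merged record now has exactly the schema fields.
```

A real pipeline would also reconcile units, codes, and duplicate patient identities across sources, which is where most of the difficulty lies.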
4. Data querying and analysis:
Once the data is cleaned and integrated, the next step is to query it. A query can be simple,
for example, “What is the mortality rate in a particular region?”, or complex, such as “How
many patients with diabetes are likely to develop heart-related problems in the next 5 years?”
Depending on the complexity of the query, the data analyst has to choose an appropriate
platform and analysis tools.
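The simple query from the text (“What is the mortality rate in a particular region?”) can be expressed directly over the cleaned, integrated dataset; the records and region names below are toy examples:

```python
# A query sketch over toy records: mortality rate per region as a fraction
# of records in that region.

from collections import defaultdict

def mortality_rate_by_region(records):
    """Return deaths-per-record for each region."""
    totals, deaths = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["region"]] += 1
        deaths[rec["region"]] += rec["deceased"]
    return {region: deaths[region] / totals[region] for region in totals}

records = [
    {"region": "North", "deceased": 0},
    {"region": "North", "deceased": 1},
    {"region": "South", "deceased": 0},
]
rates = mortality_rate_by_region(records)
```

The complex query about five-year risk would instead require a predictive model trained on historical patient outcomes, which is why the choice of platform depends on query complexity.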
A large number of open-source and proprietary platforms and tools are available in the market.
These include platforms such as Hadoop, MapReduce, Storm, and GridGain; big data databases
such as Cassandra, HBase, MongoDB, CouchDB, OrientDB, Terrastore, and Hive; data mining
tools such as RapidMiner, Mahout, Orange, Weka, Rattle, and KEEL; file systems such as
HDFS and Gluster; programming languages such as Pig/Pig Latin, R, and ECL; big data search
tools such as Lucene and Solr; data aggregation and transfer tools such as Sqoop, Flume, and
Chukwa; and other tools such as Oozie, ZooKeeper, Avro, and Terracotta. Some open-source
platforms such as Lumify and IKANOW are also available. The criteria for platform evaluation
can vary across organizations; generally, ease of use, availability, the capability to handle
voluminous data, support for visualization, quality assurance, cost, and security are some of
the variables used to decide on the platform and tool.
The tremendous amount of varied data gives researchers and health informatics professionals
an opportunity to use tools and techniques to unlock hidden answers. BDA tools and
techniques, when applied effectively to this volume of data, can be beneficial in the following
ways:
1. For individuals/patients:
Generally, while deciding on any line of treatment for a patient, historical data (for a set of
similar patients) about symptoms, drugs used, and the outcomes/responses of different patients
is taken into account. With the help of BDA, the move is towards formulating a personalised
line of treatment for a patient based on his or her genomic data, location, weather, lifestyle,
medical history, response to certain medicines, allergies, family history, and so on. When
genome data is fully explored, some kind of relation can be established between DNA and a
particular disease, and a specific line of treatment can then be formulated for that subset of
individuals. Patients will benefit in the following ways:
Continuous health monitoring at the patient's place using wearable wireless devices.
2. For hospitals:
By using effective BDA techniques on the available data, hospitals can reap the following
benefits:
Predict which patients are likely to stay longer or to be readmitted after treatment.
Identify patients who are at risk of hospitalization, so that healthcare providers can develop
new healthcare plans to prevent hospitalization.
Answer various questions by analysing the data with BDA tools and techniques: Will a patient
respond positively to a particular treatment? Is surgery required, and will he/she respond to it?
Is the patient prone to catching a disease after treatment? What is the likelihood of being
affected by the same disease in the near future?
Hospital authorities can also take better-informed administrative decisions. For example, if
patients are not being cured early and the number of readmissions is increasing because
patients become ill again after treatment, the hospital can diagnose the root cause of the
problem, hire more competent and experienced staff, invest in better drugs and instruments
that aid effective treatment, improve the cleanliness of the hospital, make treatment more
timely, engage more staff on the floor, and plan more frequent post-treatment follow-ups.
3. For medical claims:
A large amount of expenditure is incurred by governments in settling medical claims. By
using BDA, we can analyse, identify, predict, and minimize possible frauds related to
medical claims.
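The claims-fraud screening mentioned above can be sketched as a simple statistical outlier check (toy claim amounts; a real system would use far richer features and models):

```python
# A fraud-screening sketch (toy data): flag claims that are statistical
# outliers relative to the rest, for manual review.

from statistics import mean, stdev

def flag_outliers(amounts, k=2):
    """Return indices of claims more than k standard deviations above the mean."""
    m, s = mean(amounts), stdev(amounts)
    return [i for i, amount in enumerate(amounts) if amount > m + k * s]

claims = [120, 140, 110, 135, 125, 130, 5000]  # one suspiciously large claim
suspects = flag_outliers(claims)
# suspects holds the index of the 5000 claim; the ordinary claims pass.
```

This only catches one crude pattern; production fraud analytics would combine many signals (provider history, diagnosis codes, claim frequency) and supervised models.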
4. For R&D:
By using BDA techniques effectively, R&D can help produce, in a shorter time, drugs,
instruments, and tools that are most effective for treating a specific disease.
5. For government:
The government can use demographic data, historical data on disease outbreaks, weather data,
and social media data on disease keywords such as cholera and flu. It can analyse this massive
data to predict epidemics by finding correlations between the weather and the likely
occurrence of disease, so that preventive measures can be taken in advance. BDA can thus
help improve public health surveillance and speed up the response to disease outbreaks.
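The correlation idea described above can be sketched in a few lines: measure how disease-keyword mentions track a weather variable over time (the numbers below are illustrative, not real surveillance data):

```python
# A toy correlation sketch: do weekly cholera mentions track rainfall?

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

weekly_rainfall_mm = [10, 40, 80, 120, 150, 200]
cholera_mentions = [2, 5, 9, 14, 18, 25]
r = pearson(weekly_rainfall_mm, cholera_mentions)
# A strongly positive r suggests outbreak risk rises with rainfall,
# which can trigger preventive measures in advance.
```

Correlation alone does not establish causation, of course; surveillance systems use such signals as early-warning triggers rather than proof.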
CHAPTER 2
LITERATURE REVIEW
METHODS
A literature review was conducted to identify recent articles about the use of big data in health
care. The following search terms were used: “big data in healthcare,” “big data in health care,”
“big data medicine,” and “big data clinical.” The search terms were used with PubMed,
Google Scholar, Science Direct, and Web of Knowledge as well as Google to identify both
peer-reviewed literature and professional journal articles addressing the use and
application of big data in health care settings. The primary search included only articles
published within the last 5 years. Secondary references from older articles were used to allow
full discussion of issues identified in primary, more recent articles.
TABLE 1: DATA STORAGE AND RETRIEVAL
5. The Read Annotation Pipeline®: The Read Annotation Pipeline by the DNA Data Bank of
Japan (DDBJ) is a cloud-based pipeline for high-throughput analysis of next-generation
sequencing data. DDBJ initiated this cloud-computing system to support sequencing analysis.
It offers a user-friendly interface to process sequencing datasets and supports two levels of
analysis: (1) the basic-level tools accept FASTQ-format data and preprocess them to trim
low-quality bases, and (2) during the second analysis, the data are mapped to genome
references or assembled on supercomputers. This pipeline uses the Galaxy interface for
advanced analysis, such as SNP detection, RNA-sequencing (RNA-seq) analysis, and
ChIP-seq analysis. In benchmark testing, DDBJ finished mapping 34.7 million sequencing
reads to a 383-MB reference genome in 6.5 hours.
team created an interactive interface to integrate genome browsers and tools. In a prototyping
analysis, the U87MG and 1102GBM tumor databases were loaded, and the team used this
engine to compare the Berkeley DB and HBase back ends for loading and exporting variant
data. The results show that the Berkeley DB solution is faster when reading fewer than 6M
variants, while the HBase solution is faster when reading more than 6M variants.
2. ART: ART provides simulated data for sequencing analysis for three major sequencing
platforms: 454 Sequencing, Illumina, and SOLiD. ART has built-in profiles of read error and
read length and can simulate three types of sequencing errors: base substitutions, insertions,
and deletions.
2. The ArrayExpress Archive of Functional Genomics Data: The ArrayExpress Archive of
Functional Genomics Data repository is an international collaboration for integrating
high-throughput genomics data. The repository contains 30,000 experiments and more than
one million assays. About 80% of the data were extracted from the GEO data repository, and
the remaining 20% were submitted directly to ArrayExpress by its users. Each day, the
platform is visited by more than 1,000 different users, and more than 50 GB of data are
downloaded. The platform also connects with R and GenomeSpace to support data transition
and analysis.
expression of genes in the form of P-values and q-values. This system was tested on the
Amazon Elastic Compute Cloud (Amazon EC2) using 1.1 billion RNA-seq reads, and the
results show that Myrna can process the data in less than two hours; the cost of the test task
was around $66.
for congestive heart failure patients. The patient
data were extracted from the National Inpatient
Dataset and the Multicare Health System. Several
algorithms (eg, logistic regression, random forest)
were used to build a predictive model to analyze
the possibility of patient readmission. The
investigators performed several tests on more than
three million patient records. The results showed
that the use of big data significantly increased the
performance of building a predictive model: the
models achieved the highest accuracy at 77% and
recall at 61%.
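The accuracy (77%) and recall (61%) quoted above are standard classification metrics. As a minimal illustration (toy labels, not the study's data), they can be computed from a model's predictions like this:

```python
# A metrics sketch: accuracy = fraction of predictions that are correct;
# recall = fraction of actual readmissions the model caught.

def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and recall for binary labels (1 = readmitted)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return correct / len(y_true), true_pos / positives

y_true = [1, 1, 0, 0, 1]   # 1 = patient was actually readmitted
y_pred = [1, 0, 0, 0, 1]   # model predictions
acc, rec = accuracy_and_recall(y_true, y_pred)
```

Recall matters particularly here: a missed readmission (a false negative) is a patient the hospital fails to intervene for, which is why the study reports it alongside accuracy.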
CHAPTER 3
RESEARCH METHODOLOGY
Objective:
Methods:
The paper describes the nascent field of big data analytics in healthcare, discusses the benefits,
outlines an architectural framework and methodology, describes examples reported in the
literature, briefly discusses the challenges, and offers conclusions.
(1) PubMed,
(4) Scopus.
In searching these databases, we used the main keywords “big data,” “health care,” and
“biomedical.” Then, we selected papers based on the following inclusion criteria:
Results:
The paper provides a broad overview of big data analytics for healthcare researchers and
practitioners.
DATA ANALYSED
Secondary data has been used in this research. The research is based on the literature produced
by different researchers and practitioners.
CHALLENGES
The promise of big data is huge for healthcare, but there are quite a number of challenges that
need to be addressed.
1. Unstructured data:
As has been discussed before, the BDA process uses data from varied sources. Most of the
data is unstructured, like medical prescriptions, blogs, tweets, status updates, and comments.
We need to generate the right metadata for this data and transform it into a structured format.
Image and video data should be structured for semantic content and search. The provenance of
data, along with its metadata, should be carried through the data analysis process so that it is
easier to track the processing steps in case of error. [3] Some intelligent processing techniques
should be devised to handle the data input from sensors and wearables in memory. This will
help to filter/derive the meaningful data, which can then be stored on permanent storage,
thereby saving space.
2. Incomplete or incorrect data:
It is seen in practice that patients tend to hide some personal facts or lifestyle choices while
filling up forms or during oral interviews by physicians. When this data is stored in digital
format, many fields remain empty, and sometimes fields carry wrong values. When analysis
is done on the entire data, such empty or wrong fields may or may not get processed; in both
cases they produce wrong results. If we leave out records because they are empty, then our
analysis is not on the cumulative data; if we consider wrong-valued fields, then again the
analysis is incorrect and unreliable. Such issues need to be addressed.
3. Quality of data:
We need processes to ensure that data from each source is valid and of good quality.
Determining the validity and quality of social media data is another big challenge.
4. Technical challenges:
Data aggregation from different database systems is also a challenge in BDA. It can be made
easier by devising standard database design practices meant for a specific domain, such as
healthcare or the financial sector. We need many more technological standards and protocols
for different data systems to integrate seamlessly.
The traditional algorithms for data mining and analysis have to be scaled up to handle the big
volume of data. Another aspect is the parallelisation of algorithms: processor speeds have
reached a point beyond which they are hard to increase, so the trend is moving towards
multi-core processors. In such a scenario we need statistical algorithms that can be
parallelised, or else their computing performance will degrade when they handle complex,
high-volume data. Apart from this, scaling complex query-processing techniques to terabytes
of data while enabling interactive response times is another big problem.
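The point about parallelisable statistical algorithms can be illustrated map-reduce style: a mean decomposes into per-chunk partial sums that can run on separate cores, followed by a cheap combine step. A minimal sketch (toy data, with a thread pool standing in for multiple cores):

```python
# A parallelisable-statistic sketch: each worker reduces its own chunk
# independently (map), then the partial results are combined (reduce).

from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Map step: sum and count of one chunk, computed independently."""
    return sum(chunk), len(chunk)

def parallel_mean(data, workers=4):
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_sum, chunks))
    # Reduce step: combine the partial sums and counts.
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count

values = list(range(1, 101))  # 1..100
m = parallel_mean(values)
```

Statistics that decompose this way (sums, counts, histograms) parallelise well; ones that do not (e.g. an exact median) are exactly where the scaling problem described above bites.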
An analysis is useful only when a non-technical person is able to understand and interpret it.
Considering the volume and variety of data used in BDA, it is very hard to depict it visually
in an understandable and accessible way. Going one step further, a user should be able to
repeat the analysis with different sets of assumptions, datasets, and parameters. This helps
the user better understand the analysis process and verify whether the system works in the
desired way.
With the market flooded with so many open-source and proprietary platforms and tools for
BDA, careful evaluation is needed to choose the best platform and tool for a given problem.
5. Data Security:
As more and more data is digitized, the role of securing data is gaining importance. A large
number of people are not willing to share their personal data for fear of a security breach; if
they are assured of full security, this problem can be tackled. We need strict government
policies and norms, which should be adhered to, regarding what data can and cannot be
shared. In addition to this, strong hardware- and software-level security measures should be
implemented to discourage hackers and malicious actors.
6. Lack of Experts:
There is a dearth of qualified and experienced data scientists in the world. We need to build
expertise in the field of data science, so that in the future we have data scientists who can
help turn the promises of big data into reality.
CHAPTER 4
DATA ANALYSIS
DATA ANALYSIS AND INTERPRETATION
1. EXPENDITURE
2. TYPES OF DISEASES
4. AGE GROUP
5. TECHNOLOGY
6. WAITING LINE
CHAPTER 5
CONCLUSION
We are currently in the era of “big data,” in which big data technology is being rapidly
applied to biomedical and healthcare fields. In this review, we demonstrated various
examples in which big data technology has played an important role in the modern-day
healthcare revolution, as it has completely changed people's view of healthcare activity. The first
three sections of this review revealed that big data applications facilitate three important
clinical activities, while the last section (especially the chronic disease management section)
draws an integrated picture of how separate clinical activities are completed in a pipeline to
manage individual patients from multiple perspectives. We summarized recent progress in
the most relevant areas in each field, including big data storage and retrieval, error
identification, data security, data sharing and data analysis for electronic patient records,
social media data, and integrated health databases.
Furthermore, in this review, we learned that bioinformatics is the primary field in which big
data analytics are currently being applied, largely due to the massive volume and complexity
of bioinformatics data. Big data application in bioinformatics is relatively mature, with
sophisticated platforms and tools already in use to help analyze biological data, such as gene
sequencing mapping tools. However, in other biomedical research fields, such as clinical
informatics, medical imaging informatics, and public health informatics, there is enormous,
untapped potential for big data applications.
This literature review also showed that: (1) integrating different sources of information
enables clinicians to depict a new view of patient care processes that consider a patient's
holistic health status, from genome to behaviour; (2) the availability of novel mobile health
technologies facilitates real-time data gathering with more accuracy; (3) the implementation
of distributed platforms enables data archiving and analysis, which will further be developed
for decision support; and (4) the inclusion of geographical and environmental information
may further increase the ability to interpret gathered data and extract new knowledge.
While big data holds significant promise for improving health care, there are several
common challenges facing all four fields in using big data technology; the most
significant problem is the integration of various databases. For example, the VHA's
database, VISTA, is not a single system; it is a set of 128 interlinked systems. This becomes
even more complicated when databases contain different data types (eg, integrating an
imaging database or a laboratory test results database into existing systems), thereby
limiting a system's ability to make queries against all databases to acquire all patient data.
The lack of standardization for laboratory protocols and values also creates challenges for
data integration. For example, image data can suffer from technological batch effects when
they come from different laboratories under different protocols. Efforts are made to
normalize data when there is a batch effect; this may be easier for image data, but it is
intrinsically more difficult to normalize laboratory test data. Security and privacy concerns
also remain hurdles to big data integration and usage in all four fields, and thus secure
platforms with better communication standards and protocols are greatly needed.
In its latest industry analysis report, McKinsey & Company predicted that big data analytics
for the medical field will potentially save more than $300 billion per year in US health-care
costs. Future development of big data applications in the biomedical fields holds foreseeable
promise, though it depends on the advancement of new data standards, relevant research and
technology, cooperation among research institutions and companies, and strong government
incentives.
REFERENCES
BIBLIOGRAPHY