2A & 2B, Madhuban Chowk, Outer Ring Road, Phase –I, Delhi – 110085
TABLE OF CONTENTS
PARTICULARS
1. TITLE PAGE
2. STUDENT DECLARATION
3. CERTIFICATE
4. ACKNOWLEDGEMENT
5. INTRODUCTION
6. PHASES IN THE BDA PROCESS
7. BENEFITS OF USING BIG DATA ANALYTICS IN HEALTHCARE
8. LITERATURE REVIEW
9. RESEARCH METHODOLOGY
10. DATA ANALYSIS AND INTERPRETATION
11. CONCLUSION
12. REFERENCES
13. BIBLIOGRAPHY
STUDENT DECLARATION
I hereby declare that the project report titled “DATA ANALYSIS ON CHALLENGES OF
HEALTHCARE CENTERS” is my original work and has not been published or submitted
for any degree, diploma or other similar title elsewhere. This has been undertaken for the
purpose of partial fulfillment of the GRADUATION DEGREE IN MANAGEMENT at
RUKMINI DEVI INSTITUTE OF ADVANCED STUDIES.
MANAS DUA
BBA 6TH SEM
SECTION A
EVENING
CERTIFICATE
_________________
Signature of Mentor
ACKNOWLEDGEMENT
I would like to express my special thanks of gratitude to my teacher Ms. Bhajneet Kaur, as
well as our Director Prof. Dr. Raman Garg, who gave me the golden opportunity to do this
wonderful project on the topic ‘DATA ANALYSIS ON CHALLENGES OF HEALTHCARE
CENTERS’, which also helped me in doing a lot of research, through which I came to know
about so many new things. I am really thankful to them.
Secondly, I would also like to thank my parents and friends, who helped me a lot in finalizing
this project within the limited time frame.
MANAS DUA
SECTION A
EVENING
CHAPTER 1
INTRODUCTION
Big data refers to data sets that are so voluminous and complex that traditional data-processing
application software is inadequate to deal with them. Big data challenges include capturing
data, data storage, data analysis, search, sharing, transfer, visualization, updating, and
information privacy. There are three dimensions to big data, known as Volume, Variety and
Velocity.
Healthcare is a prime example of how the three Vs of data, velocity (the speed of generation of
data), variety, and volume, are an innate aspect of the data it produces. This data is spread
among multiple healthcare systems, health insurers, researchers, government entities, and so
forth. Furthermore, each of these data repositories is siloed and inherently incapable of
providing a platform for global data transparency. To add to the three Vs, the veracity of
healthcare data is also critical for its meaningful use towards developing translational research.
Big data in healthcare is overwhelming not only because of its volume but also because of the
diversity of data types and the speed at which it must be managed. The totality of data related
to patient healthcare and wellbeing makes up “big data” in the healthcare industry. It includes
clinical data from CPOE and clinical decision support systems (physicians' written notes and
prescriptions, medical imaging, laboratory, pharmacy, insurance, and other administrative
data); patient data in electronic patient records (EPRs); machine-generated/sensor data, such as
vital-sign monitoring; social media posts, including Twitter feeds, blogs, status updates on
Facebook and other platforms, and web pages; and less patient-specific information, including
emergency care data, news feeds, and articles in medical journals.
For the big data scientist there is, amongst this vast amount and array of data, opportunity. By
discovering associations and understanding patterns and trends within the data, big data
analytics has the potential to improve care, save lives, and lower costs. Thus, big data analytics
applications in healthcare take advantage of the explosion in data to extract insights for
making better-informed decisions, and as a research category are referred to as, no surprise
here, big data analytics in healthcare.
First and most significantly, the volume of data is growing exponentially in the biomedical
informatics fields. For example, the ProteomicsDB covers 92% (18,097 of 19,629) of known
human genes that are annotated in the Swiss-Prot database. ProteomicsDB has a data volume
of 5.17 TB. In the clinical realm, the promotion of the HITECH Act has nearly tripled the
adoption rate of electronic health records (EHRs) in hospitals to 44% from 2009 to 2012. Data
from millions of patients have already been collected and stored in an electronic format, and
these accumulated data could potentially enhance health-care services and increase research
opportunities. In addition, medical imaging (eg, MRI, CT scans) produces vast amounts of data
with even more complex features and broader dimensions. One such example is the Visible
Human Project, which has archived 39 GB of female datasets. These and other datasets will
provide future opportunities for large aggregate collection and analysis.
The second feature of big data is the variety of data types and structures. The ecosystem of
biomedical big data comprises many different levels of data sources to create a rich array of
data for researchers. For example, sequencing technologies produce “omics” data
systematically at almost all levels of cellular components, from genomics, proteomics, and
metabolomics to protein interaction and phenomics. Much of the data that are unstructured (eg,
notes from EHRs, clinical trial results, medical images, and medical sensors) provide many
opportunities and a unique challenge to formulate new investigations.
The third characteristic of big data, velocity, refers to producing and processing data. The new
generation of sequencing technologies enables the production of billions of DNA sequence
data each day at a relatively low cost. Because faster speeds are required for gene sequencing,
big data technologies will be tailored to match the speed of producing data, as is required to
process them. Similarly, in the public health field, big data technologies will provide
biomedical researchers with time-saving tools for discovering new patterns among population
groups using social media data.
Despite the enormous expenditure consumed by current healthcare systems, clinical
outcomes remain suboptimal, particularly in the USA, where 96 people per 100,000 die annually
from conditions considered treatable [26]. A key factor attributed to such inefficiencies is the
inability to effectively gather, share, and use information in a more comprehensive manner
within the healthcare systems [27]. This is an opportunity for big data analytics to play a more
significant role in aiding the exploration and discovery process, improving the delivery of care,
helping to design and plan healthcare policy, and providing a means for comprehensively
measuring and evaluating the complicated and convoluted healthcare data. More importantly,
adoption of insights gained from big data analytics has the potential to save lives, improve care
delivery, expand access to healthcare, align payment with performance, and help curb the
vexing growth of health care costs.
PHASES IN THE BDA PROCESS
We can map the steps of the BDA process to the data mining knowledge discovery steps as
follows:
1. Data acquisition:
As already mentioned, data is fed into the system through many external sources: clinical
data from Clinical Decision Support Systems (CDSS), EMRs, EHRs, machine-generated sensor
data, data from wearable devices, national health register data, drug-related data from
pharmaceutical companies, and social media data such as Twitter feeds, Facebook statuses, web
pages, blogs, articles, and many more. This data is stored either in databases or in a data
warehouse. With the advent of cloud computing, it is convenient to store such voluminous data
on the cloud rather than on physical disks, which is a more cost-effective and manageable way
to store data.
2. Data cleaning:
The acquired data should be complete and in a structured format for performing effective
analysis. Healthcare data generally suffers from flaws: many patients do not share their data
completely, such as data about their dietary habits, weight, and lifestyle. In such cases the
empty fields need to be handled appropriately. Another example is a field like the gender of a
person, which can hold at most one of two values, male or female; if any other value, or no
value, is present, such entries need to be marked and handled accordingly. Data from sensors,
prescriptions, medical images, and social media needs to be expressed in a structured form
suitable for analysis.
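The cleaning rules described above (marking empty fields and invalid categorical values such as an out-of-range gender entry) can be sketched as follows; the record and field names are hypothetical:

```python
# A minimal cleaning-step sketch (hypothetical field names): flag empty
# fields and invalid categorical values before analysis.

def clean_record(record, valid_genders=("male", "female")):
    """Return a cleaned copy of a patient record plus a list of issues found."""
    cleaned, issues = dict(record), []
    # Empty fields are marked explicitly rather than left as blanks.
    for field, value in record.items():
        if value in ("", None):
            cleaned[field] = None
            issues.append(f"missing:{field}")
    # The gender field may hold at most one of two values, as in the text.
    gender = (record.get("gender") or "").strip().lower()
    if gender and gender not in valid_genders:
        cleaned["gender"] = None
        issues.append("invalid:gender")
    return cleaned, issues

raw = {"patient_id": "P001", "gender": "unknown", "weight_kg": ""}
cleaned, issues = clean_record(raw)
# Both problems are recorded in `issues` for downstream handling.
```

Entries flagged this way can then be imputed, excluded, or routed for manual correction, depending on the analysis.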
3. Data integration:
The BDA process uses data accumulated across various platforms. This data can vary in
metadata (the number of fields, their types, and formats). The entire data has to be aggregated
correctly and consistently into a dataset that can be used effectively for data analysis. This is
a very challenging task, considering the volume and variety of big data.
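The integration step can be sketched as mapping records from heterogeneous sources onto one shared schema; the sources and field names below are illustrative only:

```python
# An integration-step sketch (hypothetical sources and schema): records from
# different systems carry different fields, so they are mapped onto one
# shared schema before analysis.

SCHEMA = ("patient_id", "age", "diagnosis")

def integrate(sources):
    """Aggregate records from heterogeneous sources into one consistent dataset."""
    dataset = []
    for source in sources:
        for rec in source:
            # Keep only schema fields; fields absent in a source become None.
            dataset.append({f: rec.get(f) for f in SCHEMA})
    return dataset

ehr = [{"patient_id": "P1", "age": 54, "diagnosis": "diabetes", "ward": "B"}]
sensor = [{"patient_id": "P1", "heart_rate": 80}]
merged = integrate([ehr, sensor])
# Every merged record now has exactly the schema fields.
```

A real pipeline would also reconcile units, codes, and duplicate patient identities across sources, which is where most of the difficulty lies.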
4. Data querying and analysis:
Once the data is cleaned and integrated, the next step is to query it. A query can be simple,
for example, “What is the mortality rate in a particular region?”, or complex, such as “How
many patients with diabetes are likely to develop heart-related problems in the next 5 years?”
Depending on the complexity of the query, the data analyst has to choose an appropriate
platform and analysis tools.
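The simple query from the text (“What is the mortality rate in a particular region?”) can be expressed directly over the cleaned, integrated dataset; the records and region names below are toy examples:

```python
# A query sketch over toy records: mortality rate per region as a fraction
# of records in that region.

from collections import defaultdict

def mortality_rate_by_region(records):
    """Return deaths-per-record for each region."""
    totals, deaths = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["region"]] += 1
        deaths[rec["region"]] += rec["deceased"]
    return {region: deaths[region] / totals[region] for region in totals}

records = [
    {"region": "North", "deceased": 0},
    {"region": "North", "deceased": 1},
    {"region": "South", "deceased": 0},
]
rates = mortality_rate_by_region(records)
```

The complex query about five-year risk would instead require a predictive model trained on historical patient outcomes, which is why the choice of platform depends on query complexity.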
A large number of open-source and proprietary platforms and tools are available in the market.
These include platforms such as Hadoop, MapReduce, Storm, and GridGain; big data databases
such as Cassandra, HBase, MongoDB, CouchDB, OrientDB, Terrastore, and Hive; data mining
tools such as RapidMiner, Mahout, Orange, Weka, Rattle, and KEEL; file systems such as
HDFS and Gluster; programming languages such as Pig/Pig Latin, R, and ECL; big data search
tools such as Lucene and Solr; data aggregation and transfer tools such as Sqoop, Flume, and
Chukwa; and other tools such as Oozie, ZooKeeper, Avro, and Terracotta. Some open-source
platforms such as Lumify and IKANOW are also available. The criteria for platform evaluation
can vary across organizations; generally, ease of use, availability, the capability to handle
voluminous data, support for visualization, quality assurance, cost, and security are some of
the variables used to decide on the platform and tool.
The tremendous amount of varied data gives researchers and health informatics professionals
an opportunity to use tools and techniques to unlock hidden answers. BDA tools and
techniques, when applied effectively to this volume of data, can be beneficial in the following
ways:
1. For individuals/patients:
Generally, while deciding on any line of treatment for a patient, historical data (for a set of
similar patients) about symptoms, drugs used, and the outcomes/responses of different patients
is taken into account. With the help of BDA, the move is towards formulating a personalised
line of treatment for a patient based on his or her genomic data, location, weather, lifestyle,
medical history, response to certain medicines, allergies, family history, and so on. When
genome data is fully explored, some kind of relation can be established between DNA and a
particular disease, and a specific line of treatment can then be formulated for that subset of
individuals. Patients will benefit in the following ways:
Continuous health monitoring at the patient's place using wearable wireless devices.
2. For hospitals:
By using effective BDA techniques on the available data, hospitals can reap the following
benefits:
Predict which patients are likely to stay longer or to be readmitted after treatment.
Identify patients who are at risk of hospitalization, so that healthcare providers can develop
new healthcare plans to prevent hospitalization.
Answer various questions by analysing the data with BDA tools and techniques: Will a patient
respond positively to a particular treatment? Is surgery required, and will he/she respond to it?
Is the patient prone to catching a disease after treatment? What is the likelihood of being
affected by the same disease in the near future?
Hospital authorities can also take better-informed administrative decisions. For example, if
patients are not being cured early and the number of readmissions is increasing because
patients become ill again after treatment, the hospital can diagnose the root cause of the
problem, hire more competent and experienced staff, invest in better drugs and instruments
that aid effective treatment, improve the cleanliness of the hospital, make treatment more
timely, engage more staff on the floor, and plan more frequent post-treatment follow-ups.
3. For medical claims:
A large amount of expenditure is incurred by governments in settling medical claims. By
using BDA, we can analyse, identify, predict, and minimize possible frauds related to
medical claims.
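The claims-fraud screening mentioned above can be sketched as a simple statistical outlier check (toy claim amounts; a real system would use far richer features and models):

```python
# A fraud-screening sketch (toy data): flag claims that are statistical
# outliers relative to the rest, for manual review.

from statistics import mean, stdev

def flag_outliers(amounts, k=2):
    """Return indices of claims more than k standard deviations above the mean."""
    m, s = mean(amounts), stdev(amounts)
    return [i for i, amount in enumerate(amounts) if amount > m + k * s]

claims = [120, 140, 110, 135, 125, 130, 5000]  # one suspiciously large claim
suspects = flag_outliers(claims)
# suspects holds the index of the 5000 claim; the ordinary claims pass.
```

This only catches one crude pattern; production fraud analytics would combine many signals (provider history, diagnosis codes, claim frequency) and supervised models.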
4. For R&D:
By using BDA techniques effectively, R&D can help produce, in a shorter time, drugs,
instruments, and tools that are most effective for treating a specific disease.
5. For government:
The government can use demographic data, historical data on disease outbreaks, weather data,
and social media data on disease keywords such as cholera and flu. It can analyse this massive
data to predict epidemics by finding correlations between the weather and the likely
occurrence of disease, so that preventive measures can be taken in advance. BDA can thus
help improve public health surveillance and speed up the response to disease outbreaks.
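The correlation idea described above can be sketched in a few lines: measure how disease-keyword mentions track a weather variable over time (the numbers below are illustrative, not real surveillance data):

```python
# A toy correlation sketch: do weekly cholera mentions track rainfall?

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

weekly_rainfall_mm = [10, 40, 80, 120, 150, 200]
cholera_mentions = [2, 5, 9, 14, 18, 25]
r = pearson(weekly_rainfall_mm, cholera_mentions)
# A strongly positive r suggests outbreak risk rises with rainfall,
# which can trigger preventive measures in advance.
```

Correlation alone does not establish causation, of course; surveillance systems use such signals as early-warning triggers rather than proof.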
CHAPTER 2
LITERATURE REVIEW
METHODS
A literature review was conducted to identify recent articles about the use of big data in health
care. The following search terms were used: “big data in healthcare,” “big data in health care,”
“big data medicine,” and “big data clinical.” The search terms were used with PubMed,
Google Scholar, Science Direct, and Web of Knowledge as well as Google to identify both
peer-reviewed literature and professional journal articles addressing the use and
application of big data in health care settings. The primary search included only articles
published within the last 5 years. Secondary references from older articles were used to allow
full discussion of issues identified in primary, more recent articles.
TABLE 1: DATA STORAGE AND RETRIEVAL
5. The Read Annotation Pipeline®: The Read Annotation Pipeline by the DNA Data Bank of
Japan (DDBJ) is a cloud-based pipeline for high-throughput analysis of next-generation
sequencing data. DDBJ initiated this cloud-computing system to support sequencing analysis.
It offers a user-friendly interface to process sequencing datasets and supports two levels of
analysis: (1) the basic-level tools accept FASTQ-format data and preprocess them to trim
low-quality bases, and (2) during the second analysis, the data are mapped to genome
references or assembled on supercomputers. This pipeline uses the Galaxy interface for
advanced analysis, such as SNP detection, RNA-sequencing (RNA-seq) analysis, and
ChIP-seq analysis. In benchmark testing, DDBJ finished mapping 34.7 million sequencing
reads to a 383-MB reference genome in 6.5 hours.
team created an interactive interface to integrate genome browsers and tools. In a prototyping
analysis, the U87MG and 1102GBM tumor databases were loaded, and the team used this
engine to compare the Berkeley DB and HBase back ends for loading and exporting variant
data. The results show that the Berkeley DB solution is faster when reading fewer than 6M
variants, while the HBase solution is faster when reading more than 6M variants.
2. ART: ART provides simulated data for sequencing analysis for three major sequencing
platforms: 454 Sequencing, Illumina, and SOLiD. ART has built-in profiles of read error and
read length and can simulate three types of sequencing errors: base substitutions, insertions,
and deletions.
2. The ArrayExpress Archive of Functional Genomics Data: The ArrayExpress Archive of
Functional Genomics Data repository is an international collaboration for integrating
high-throughput genomics data. The repository contains 30,000 experiments and more than
one million assays. About 80% of the data were extracted from the GEO data repository, and
the remaining 20% were submitted directly to ArrayExpress by its users. Each day, the
platform is visited by more than 1,000 different users, and more than 50 GB of data are
downloaded. The platform also connects with R and GenomeSpace to support data transition
and analysis.
expression of genes in the form of P-values and q-values. This system was tested on the
Amazon Elastic Compute Cloud (Amazon EC2) using 1.1 billion RNA-seq reads, and the
results show that Myrna can process the data in less than two hours; the cost of the test task
was around $66.
for congestive heart failure patients. The patient
data were extracted from the National Inpatient
Dataset and the Multicare Health System. Several
algorithms (eg, logistic regression, random forest)
were used to build a predictive model to analyze
the possibility of patient readmission. The
investigators performed several tests on more than
three million patient records. The results showed
that the use of big data significantly increased the
performance of building a predictive model: the
models achieved the highest accuracy at 77% and
recall at 61%.
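The accuracy (77%) and recall (61%) quoted above are standard classification metrics. As a minimal illustration (toy labels, not the study's data), they can be computed from a model's predictions like this:

```python
# A metrics sketch: accuracy = fraction of predictions that are correct;
# recall = fraction of actual readmissions the model caught.

def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and recall for binary labels (1 = readmitted)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return correct / len(y_true), true_pos / positives

y_true = [1, 1, 0, 0, 1]   # 1 = patient was actually readmitted
y_pred = [1, 0, 0, 0, 1]   # model predictions
acc, rec = accuracy_and_recall(y_true, y_pred)
```

Recall matters particularly here: a missed readmission (a false negative) is a patient the hospital fails to intervene for, which is why the study reports it alongside accuracy.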
CHAPTER 3
RESEARCH METHODOLOGY
Objective:
Methods:
The paper describes the nascent field of big data analytics in healthcare, discusses the benefits,
outlines an architectural framework and methodology, describes examples reported in the
literature, briefly discusses the challenges, and offers conclusions.
(1) PubMed,
(4) Scopus.
In searching these databases, we used the main keywords “big data,” “health care,” and
“biomedical.” Then, we selected papers based on the following inclusion criteria:
Results:
The paper provides a broad overview of big data analytics for healthcare researchers and
practitioners.
DATA ANALYSED
Secondary data has been used in this research. The research is based on the literature produced
by different researchers and practitioners.
CHALLENGES
The promise of big data is huge for healthcare, but there are quite a number of challenges that
need to be addressed.
1. Unstructured data:
As has been discussed before, the BDA process uses data from varied sources. Most of the
data is unstructured, like medical prescriptions, blogs, tweets, status updates, and comments.
We need to generate the right metadata for this data and transform it into a structured format.
Image and video data should be structured for semantic content and search. The provenance of
data, along with its metadata, should be carried through the data analysis process so that it is
easier to track the processing steps in case of error. [3] Some intelligent processing techniques
should be devised to handle the data input from sensors and wearables in memory. This will
help to filter/derive the meaningful data, which can then be stored on permanent storage,
thereby saving space.
2. Incomplete or incorrect data:
It is seen in practice that patients tend to hide some personal facts or lifestyle choices while
filling up forms or during oral interviews by physicians. When this data is stored in digital
format, many fields remain empty, and sometimes fields carry wrong values. When analysis
is done on the entire data, such empty or wrong fields may or may not get processed; in both
cases they produce wrong results. If we leave out records because they are empty, then our
analysis is not on the cumulative data; if we consider wrong-valued fields, then again the
analysis is incorrect and unreliable. Such issues need to be addressed.
3. Quality of data:
We need processes to ensure that data from each source is valid and of good quality.
Determining the validity and quality of social media data is another big challenge.
4. Technical challenges:
Data aggregation from different database systems is also a challenge in BDA. It can be made
easier by devising standard database design practices meant for a specific domain, such as
healthcare or the financial sector. We need many more technological standards and protocols
for different data systems to integrate seamlessly.
The traditional algorithms for data mining and analysis have to be scaled up to handle the big
volume of data. Another aspect is the parallelisation of algorithms: processor speeds have
reached a point beyond which they are hard to increase, so the trend is moving towards
multi-core processors. In such a scenario we need statistical algorithms that can be
parallelised, or else their computing performance will degrade when they handle complex,
high-volume data. Apart from this, scaling complex query-processing techniques to terabytes
of data while enabling interactive response times is another big problem.
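The point about parallelisable statistical algorithms can be illustrated map-reduce style: a mean decomposes into per-chunk partial sums that can run on separate cores, followed by a cheap combine step. A minimal sketch (toy data, with a thread pool standing in for multiple cores):

```python
# A parallelisable-statistic sketch: each worker reduces its own chunk
# independently (map), then the partial results are combined (reduce).

from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Map step: sum and count of one chunk, computed independently."""
    return sum(chunk), len(chunk)

def parallel_mean(data, workers=4):
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_sum, chunks))
    # Reduce step: combine the partial sums and counts.
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count

values = list(range(1, 101))  # 1..100
m = parallel_mean(values)
```

Statistics that decompose this way (sums, counts, histograms) parallelise well; ones that do not (e.g. an exact median) are exactly where the scaling problem described above bites.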
An analysis is useful only when a non-technical person is able to understand and interpret it.
Considering the volume and variety of data used in BDA, it is very hard to depict it visually
in an understandable and accessible way. Going one step further, a user should be able to
repeat the analysis with different sets of assumptions, datasets, and parameters. This helps
the user better understand the analysis process and verify whether the system works in the
desired way.
With the market flooded with so many open-source and proprietary platforms and tools for
BDA, careful evaluation is needed to choose the best platform and tool for a given problem.
5. Data Security:
As more and more data is digitized, the role of securing data is gaining importance. A large
number of people are not willing to share their personal data for fear of a security breach; if
they are assured of full security, this problem can be tackled. We need strict government
policies and norms, which should be adhered to, regarding what data can and cannot be
shared. In addition to this, strong hardware- and software-level security measures should be
implemented to discourage hackers and malicious actors.
6. Lack of Experts:
There is a dearth of qualified and experienced data scientists in the world. We need to build
expertise in the field of data science, so that in the future we have data scientists who can
help turn the promises of big data into reality.
CHAPTER 4
DATA ANALYSIS
DATA ANALYSIS AND INTERPRETATION
1. EXPENDITURE
2. TYPES OF DISEASES
4. AGE GROUP
5. TECHNOLOGY
6. WAITING LINE
CHAPTER 5
CONCLUSION
We are currently in the era of “big data,” in which big data technology is being rapidly
applied to biomedical and healthcare fields. In this review, we demonstrated various
examples in which big data technology has played an important role in the modern-day
healthcare revolution, as it has completely changed people's view of healthcare activity. The first
three sections of this review revealed that big data applications facilitate three important
clinical activities, while the last section (especially the chronic disease management section)
draws an integrated picture of how separate clinical activities are completed in a pipeline to
manage individual patients from multiple perspectives. We summarized recent progress in
the most relevant areas in each field, including big data storage and retrieval, error
identification, data security, data sharing and data analysis for electronic patient records,
social media data, and integrated health databases.
Furthermore, in this review, we learned that bioinformatics is the primary field in which big
data analytics are currently being applied, largely due to the massive volume and complexity
of bioinformatics data. Big data application in bioinformatics is relatively mature, with
sophisticated platforms and tools already in use to help analyze biological data, such as gene
sequencing mapping tools. However, in other biomedical research fields, such as clinical
informatics, medical imaging informatics, and public health informatics, there is enormous,
untapped potential for big data applications.
This literature review also showed that: (1) integrating different sources of information
enables clinicians to depict a new view of patient care processes that consider a patient's
holistic health status, from genome to behaviour; (2) the availability of novel mobile health
technologies facilitates real-time data gathering with more accuracy; (3) the implementation
of distributed platforms enables data archiving and analysis, which will further be developed
for decision support; and (4) the inclusion of geographical and environmental information
may further increase the ability to interpret gathered data and extract new knowledge.
While big data holds significant promise for improving health care, there are several
common challenges facing all four fields in using big data technology; the most
significant problem is the integration of various databases. For example, the VHA's
database, VISTA, is not a single system; it is a set of 128 interlinked systems. This becomes
even more complicated when databases contain different data types (eg, integrating an
imaging database or a laboratory test results database into existing systems), thereby
limiting a system's ability to make queries against all databases to acquire all patient data.
The lack of standardization for laboratory protocols and values also creates challenges for
data integration. For example, image data can suffer from technological batch effects when
they come from different laboratories under different protocols. Efforts are made to
normalize data when there is a batch effect; this may be easier for image data, but it is
intrinsically more difficult to normalize laboratory test data. Security and privacy concerns
also remain hurdles to big data integration and usage in all four fields, and thus secure
platforms with better communication standards and protocols are greatly needed.
In its latest industry analysis report, McKinsey & Company predicted that big data analytics
for the medical field will potentially save more than $300 billion per year in US health-care
costs. Future development of big data applications in the biomedical fields holds foreseeable
promise, though it depends on the advancement of new data standards, relevant research and
technology, cooperation among research institutions and companies, and strong government
incentives.
REFERENCES
BIBLIOGRAPHY