
The Impacts of Advanced Categorical Analysis on the Healthcare Industry:

The Potential of GALILEO in an Increasingly Digital World


Sean Jordan
Intern/Mentor I G/T
SY 2017-2018

Overview of Research

The main goal of the Galileo program is to analyze extremely large datasets containing categorical attributes and values and to uncover hidden patterns and relationships among the data values. The program is still in development and needs to be modified and fine-tuned before it can be offered commercially. The goal of this research project is to outline ways to improve upon the current version of the code and make it viable for a wider range of companies.

Background and History

Around five years ago, the term “big data” saw a huge spike in usage, rising from near obscurity to worldwide awareness in just a few months. The trend has not slowed since, because the term applies to more and more of today’s digital activity. Organizations such as insurance providers, government agencies, and pharmaceutical companies collect a large amount of data on each person they serve, and they often serve millions upon millions of customers. The volume of data they must manage is astronomical. Because of the sheer number of data values that a single dataset can hold, this type of information is dubbed “big data” (Dean & Ghemawat, 2004). In biostatistics especially, big data and the methods used to manage and analyze it are expanding rapidly.

In recent years, machine learning in particular has taken center stage in the analysis of health data. New health data is constantly being made available to researchers: newly admitted hospital patients continually add to existing datasets, and this information can be used to develop better machine learning techniques. Machine learning is split into two branches, supervised and unsupervised learning. In unsupervised learning, an algorithm is given an unlabeled set of data and is asked to draw patterns or conclusions from it (such as clusters or summary statistics like the mean and standard deviation). In supervised learning, the data include class labels, and the algorithm learns to predict those labels from the data points. The overall aim of machine learning is to determine what outcome a series of inputs represents; for medical data, it is used to determine whether a patient is likely to have a disease based on certain inputs or symptoms.
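To make the distinction concrete, the following is a minimal sketch contrasting the two modes on a tiny, made-up table of patient measurements (the feature values and labels are illustrative, not drawn from any real dataset). It uses scikit-learn, the library used elsewhere in this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# A tiny, made-up table of patient measurements: [age, resting heart rate].
X = np.array([[34, 62], [41, 65], [58, 88], [63, 91], [29, 60], [71, 95]])

# Unsupervised learning: no class headings are given, so the algorithm can only
# describe structure in the data (summary statistics, clusters, and so on).
print("mean:", X.mean(axis=0), "std:", X.std(axis=0))
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters)

# Supervised learning: each row also carries a class label (here, whether the
# patient was diagnosed with a hypothetical condition), and the model learns
# to predict that label for new patients.
y = np.array([0, 0, 1, 1, 0, 1])
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print("prediction for a new patient:", model.predict([[55, 85]]))
```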

Problem Statement and Rationale

The problem of big data and machine learning is important because as hospital databases grow larger, it becomes harder for humans to analyze them and draw conclusions from the data points (Dinov, 2016). It is much faster and more efficient for a computer to do the pattern matching instead. While computerized management of patient data overtook human record-keeping long ago, machine assistance has only recently spread into the diagnostic realm of medicine. The problem is that as doctors take on more patients, they must spend more time and effort determining whether a patient has a specific disease based on risk factors. A recently proposed solution is to let machine learning algorithms learn a patient's symptoms, compare them against a set of risk factors, and produce a probability describing the chance that the patient has or will contract the disease. However, these machine learning algorithms are in their infancy and still inherently unreliable; many people would not trust a computer to diagnose them over a well-trained physician. The complex algorithms that estimate these patients' risks need to be improved so that they more reliably predict whether a patient has a disease and physicians are not overloaded with unnecessary patient visits in hospitals and clinics.
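The risk-scoring idea described above can be illustrated with a short, generic sketch. This is not Galileo's method; it only shows how a standard scikit-learn classifier can turn a set of risk factors (the feature names below are hypothetical) into a probability that a patient has a disease.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical risk factors per patient: [age, systolic blood pressure, smoker (0/1)].
X_train = np.array([
    [45, 120, 0],
    [62, 150, 1],
    [38, 118, 0],
    [70, 160, 1],
    [55, 140, 1],
    [33, 110, 0],
])
# 1 = diagnosed with the (hypothetical) disease, 0 = not diagnosed.
y_train = np.array([0, 1, 0, 1, 1, 0])

# Fit a simple probabilistic classifier on the historical records.
model = LogisticRegression().fit(X_train, y_train)

# For a new patient, report an estimated probability of disease rather than a
# hard yes/no answer, so a physician can decide whether a visit is warranted.
new_patient = np.array([[58, 145, 1]])
risk = model.predict_proba(new_patient)[0, 1]
print(f"Estimated probability of disease: {risk:.2f}")
```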

Research Methodology

Research Question and Hypothesis


Can unsupervised and supervised machine learning be applied to pre-existing datasets in order to learn relationships between health indicators that might be predictive of medical conditions?

With further development work, the Galileo machine learning program can be used to revolutionize the way data analysis companies interpret large, categorical datasets and provide them with a new set of tools that can help them better understand their customers.

Basis of Hypothesis

Currently, Galileo has been tested only with a few select datasets provided with the program. Its accuracy and efficiency have not been thoroughly tested with other datasets, which limits its applicability to other subject areas. Thus, the main goal of this research project is to run the algorithm on multiple different datasets (mainly categorical data from hospitals) and obtain feedback on how well Galileo performs in clustering and classification where the ground truth is known.

Research Design

The data collection will be primarily causal-comparative in nature. The independent variable is the input dataset, and the dependent variable is the performance of the Galileo program. The data will be collected by analyzing the results of the Galileo program, which outputs a quantitative value indicating the optimal number of clusters to use when modeling the dataset. The cluster assignments can then be compared to the known ground truth, and this feedback can be used to improve the analytic capabilities of the algorithm in order to reach the highest possible true positive and true negative rates. The primary method of collecting research data will be experiments on raw datasets using the Galileo program. Created models will be saved with unique, readable identifiers so as to keep track of all experimental trials. When running the program, the input data and the settings of Galileo will be recorded in a digital log. The new models that Galileo creates to represent the datasets will be original.
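As a rough illustration of how assignments can be scored against a known ground truth, the sketch below uses scikit-learn's confusion matrix; the two label arrays are placeholders standing in for a dataset's true classes and for Galileo's output (mapped to class labels), since the program's exact output format is not reproduced here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder arrays: the true labels from the dataset and the labels assigned
# by the model under evaluation (e.g., cluster assignments after each cluster
# has been mapped to its majority class).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

# For a binary problem the confusion matrix unpacks into the four counts
# needed for the true positive and true negative rates.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # true positive rate (sensitivity)
tnr = tn / (tn + fp)  # true negative rate (specificity)
print(f"true positive rate: {tpr:.2f}, true negative rate: {tnr:.2f}")
```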

Operational Definitions

There are no operational definitions or subjective measures in this experiment. All measures used in the experiment have standard definitions.

Product Overview

A presentation about Galileo will be given in the spring at the annual JHU/APL project conference in order to showcase the progress made with the program over the year. Alternatively, a smaller “brown bag” presentation will be held within the QAS/Large Scale Analytics Group to accomplish the same purpose. Additionally, a “white paper” may be written to update interested computer scientists on the progress of Galileo at the JHU/APL. The intended audience for this research project will be other scientists and researchers at APL who have a special interest in machine learning and/or large-scale data analysis. This audience will already have most of the background knowledge on the topic and will therefore be better able to follow the updates on Galileo during the presentation. Afterward, the audience will be able to ask questions and give feedback on anything about Galileo that concerns them.

Logistical Considerations

The main materials needed for this research project and the final product are a computer with access to Python, a code editor, and poster-creation software. Permission must be granted by JHU/APL in order to present or share this project outside the APL; however, no permission should be needed when sharing within the boundaries of the APL.

References

Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Retrieved from http://web.mit.edu/6.033/www/papers/mapreduce-osdi04.pdf

Dinov, I. D. (2016, March 17). Volume and value of big healthcare data. Retrieved from HHS Public Access website: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4795481/pdf/nihms-766954.pdf

Timeline of Product Development

In order to collect data for this research project, the Galileo program will be used to determine clusters of data values in multiple datasets relating to the healthcare industry. The final product for this project will be a 4-foot-by-4-foot poster, made at the APL, that describes the problems of big data in the healthcare industry and how Galileo can be used to address them. This poster will be presented at the end-of-the-year ASPIRE Showcase held at the Johns Hopkins University Applied Physics Laboratory. The intended audience of this project will be any data scientists and/or healthcare provider personnel interested in mitigating the problem of big data in their field of work. The poster will also, hopefully, help attract contractors interested in purchasing Galileo for use in their businesses. Up to this point, all of the tests on healthcare datasets have unfortunately been inconclusive regarding Galileo's data-crunching power on mixed datasets.

August 25th: Received notification of acceptance into the JHU/APL ASPIRE Internship
Program

September 18th: Attended the ASPIRE Internship Orientation; had the first meeting with Dr. Philip Graff and the other intern from Severna Park HS, Jake Larson

September 25th: Started introduction to Socrates, a major program that the Large Scale Analytics Group is currently working on. Installed/cloned everything onto the intern computer (Python, all the libraries, Anaconda, etc.)

September 29th: Started introduction to prerequisite libraries utilized extensively in data analysis
and Socrates (mainly Pandas and Scikit-learn)

October 5th: Started introduction to the Jupyter Notebook

October 9th: Started introduction to machine learning in general

October 12th: Started focusing my studies on Galileo, a subset algorithm in the Socrates
Program that Dr. Graff is working extensively on

October 16th: Started researching the inner workings of Galileo (entropy, quantum mechanics, and density metrics)

October 17th: Cloned the Galileo Python files onto the intern computer from the QAS GitLab and started looking through the code with Dr. Graff
October 24th: Started annotating the code in Galileo in order to help both myself and
other people better understand it

October 30th: Went over visualizations that helped describe how Galileo acted on various
datasets

November 6th: Started using scikit-learn to practice performing supervised machine learning
techniques on example datasets, such as the Iris dataset
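A minimal example of the kind of exercise described in this entry, assuming only the standard Iris dataset bundled with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the classic Iris dataset (150 flowers, 4 measurements, 3 species).
X, y = load_iris(return_X_y=True)

# Hold out part of the data so the classifier is scored on flowers it never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a simple supervised classifier and measure how often it is correct.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```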

November 10th: Interviewed Dr. Graff about his history, how he came to work in the Large Scale Analytics Group (QAS) at the JHU/APL, and how Galileo and Socrates are intended to change the world

November 13th: Continued annotating the code in the main Python file

November 21st: Started working on December Presentation

November 27th: Finalized drafts of the hypothesis assignments and preliminary research proposal in order to get an idea of the scope of the project

December 4th: Finalized the December Presentation and got it approved by Dr. Graff and Mrs.
Dungey

December 6th: Attended the annual holiday party for the Asymmetric Operations Sector at the JHU/APL, and used the opportunity to network with people working on different projects.

December 11th: Delivered December Presentation at Centennial High School

December 13th: Attended a QAS Group Meeting and met some more people within the Large
Scale Analytics Group

December 19th: Finished annotating the Python code in the main Galileo file, and in the process gained an understanding of how it works (which was the overall goal all along)

January 3rd: Started receiving instruction on how to run the actual Galileo algorithm on datasets. Also met Dr. Graff's former intern, Sam Rosen, and talked with him briefly about what he does and what he likes about the Applied Physics Lab.

January 11th: Finalized the Annotated Source List and sent it to Dr. Graff for approval.

January 18th: Ran the Galileo program on different healthcare datasets, such as heart disease and breast cancer, without prior modifications. The test results were unsatisfactory.

January 25th: Started talking with Dr. Graff about the synthesis paper and the literature review required for it.

February 6th: Ran the Galileo program with slight binning applied to the heart disease dataset and produced a final, normalized confusion matrix. The results were unsatisfactory, particularly because the algorithm selected the lowest possible value as the optimal number of clusters. When this issue was brought up with Dr. Graff, we decided that we had to find a larger dataset (the heart disease dataset was only ~900 rows).
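Galileo's own binning settings are not documented in this timeline; the sketch below shows one generic way such binning can be done with pandas, using a hypothetical continuous "age" column (the resulting model would then be scored with a normalized confusion matrix, as in the evaluation sketch earlier).

```python
import pandas as pd

# Hypothetical continuous column from a heart disease table. Galileo operates
# on categorical attributes, so continuous values are binned into categories
# before the algorithm is run.
ages = pd.Series([29, 41, 46, 53, 57, 60, 64, 71], name="age")

# Three equal-width bins with readable labels; the bin count and labels here
# are illustrative choices, not the settings actually used in the experiment.
age_binned = pd.cut(ages, bins=3, labels=["young", "middle", "older"])
print(age_binned.value_counts())
```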

February 16th: Started the search for a new dataset and found one detailing CASP protein analysis that was optimized for machine learning (it contained ~40,000 data values). Upon running the vanilla Galileo test on the CASP dataset, the memory in the Jupyter Notebook overflowed, causing the web browser to crash. After reducing the size of the data used for testing to ~10,000 data values, the test was successful. However, the resulting confusion matrix was not satisfactory.
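The exact file and loading code used for the CASP data are not shown here; the following is a minimal pandas sketch of subsampling a large table so that it fits comfortably in a notebook kernel's memory ("casp.csv" is a placeholder file name).

```python
import pandas as pd

# "casp.csv" is a placeholder for whatever file holds the ~40,000-row protein dataset.
df = pd.read_csv("casp.csv")

# Draw a random subset of roughly 10,000 rows so the notebook stays within memory.
# A fixed random_state keeps the subsample reproducible across runs.
df_small = df.sample(n=10_000, random_state=0)
print(df_small.shape)
```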

February 20th: Started constructing the structure of the final synthesis paper and went through a quick runthrough with Dr. Graff of what the paper should include.

February 26th: Received official offer of employment from Dr. David Silberberg, another
employee in the Large Scale Analytics Group (QAS).

February 27th: Worked on restructuring the synthesis paper with Dr. Graff in order to move all Galileo-specific material to the Data Collection/Results section (per Mrs. Dungey's suggestion) and to cover the different libraries and algorithms in the literature review.

March 6th: Made the decision to add a section on statistics (Bayesian and Gaussian probability
models) to the literature review, as these models are used extensively in the Galileo algorithm.
Accepted offer of employment from JHU/APL - will start on June 4th, 2018.

March 13th: Started binning the CASP protein analysis dataset so that Galileo would have a better chance of identifying clusters. The results were somewhat satisfactory, with the AIC-selected number of clusters most often coming out as 22 (the true number of clusters should be 21). The confusion matrix, however, was still not what we wanted.
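Galileo's internal criterion is not reproduced here; as a point of comparison, the sketch below shows the standard way an AIC score can be used to pick a cluster count with scikit-learn's Gaussian mixture models, on stand-in data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in data with three well-separated groups; in the real experiment this
# would be the binned CASP features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 4)) for c in (0, 3, 6)])

# Fit mixtures with different numbers of components and record each AIC score;
# the component count with the lowest AIC is taken as the optimal cluster count.
aic_scores = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    aic_scores[k] = gmm.aic(X)

print("AIC-selected number of clusters:", min(aic_scores, key=aic_scores.get))
```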

March 23rd: Started running the supervised machine learning subset of the Galileo algorithm (Galileo Classifier) in the hope that it would produce better results.

March 27th: Talked with Dr. Graff about the different methods of data collection that we will use for Galileo, and how we will showcase the results on the poster board as the final product.
