The main goal of the Galileo program is to analyze extremely large datasets containing
categorical attributes and values and extract the hidden patterns and relationships among
the data values. The program is still in development and needs to be modified and fine-tuned
before it is released commercially. The goal of this
research project is to outline ways to potentially improve upon the current version of the code
About five years ago, use of the term “big data” spiked, going from near obscurity to
worldwide awareness in just a few months. The trend has not slowed since, because the term is
used more and more in today’s digital age. Many organizations, such as insurance providers,
government agencies, and pharmaceutical businesses, collect large amounts of data on each
person they serve, and they often serve millions of customers. The amount of data they must
manage is enormous. Because a single dataset can hold so many values, this type of
information is dubbed “big data” (Dean & Ghemawat, 2004). In biostatistics especially, big
data and the methods used
In recent years, machine learning in particular has taken center stage in the analysis of
health data. New health data is continually becoming available to researchers: new patients
are always being admitted to hospitals, and their records can contribute to the datasets used
to develop better machine learning techniques. Machine learning is commonly split into two
parts, supervised and unsupervised learning. Unsupervised machine learning occurs when an
algorithm is given an unlabeled set of data and is asked to find patterns or structure in it
(such as clusters of similar data points). Supervised machine learning uses data that includes
class labels, and allows predictions to be made about those labels for new data points. In the
supervised case, the goal is to determine what outcome a series of inputs represents. For
medical data, machine learning is used to determine whether a patient is likely to have a
disease based on
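
The distinction between unsupervised and supervised learning described above can be
illustrated with a minimal sketch. It assumes scikit-learn and its bundled Iris dataset as a
small stand-in for medical data. The choice of k-means and logistic regression is an
assumption made for illustration and is not part of Galileo.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris stands in for a labeled medical dataset (assumption for illustration).
X, y = load_iris(return_X_y=True)

# Unsupervised: the labels y are never shown to the algorithm.
# It only groups similar rows into clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Supervised: the model is trained on rows with known class labels,
# then asked to predict labels for rows it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)            # fraction of correct predictions
probabilities = clf.predict_proba(X_test[:1])   # per-class probabilities for one row

print("cluster sizes:", [int((cluster_ids == c).sum()) for c in range(3)])
print("held-out accuracy:", round(accuracy, 3))
print("class probabilities for one sample:", probabilities.round(3))
```

The per-class probabilities in the last line correspond to the kind of output a diagnostic
model would give: a chance that a patient belongs to each class rather than a bare yes or no.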
The problem of big data and machine learning is important because as hospital databases
grow, it becomes harder for humans to analyze them and draw conclusions from the data points
(Dinov, 2016). It is much faster and more efficient for a computer to do the pattern matching
instead. While computers took over the management of patient data from humans some time ago,
machine involvement has only recently spread to the diagnostic side of medicine. As doctors
take on more patients, they must spend more time and effort determining whether a patient has
a specific disease based on risk factors. One recently proposed solution is to let machine
learning algorithms learn about a patient’s symptoms, compare them against a set of risk
factors, and then produce a probability that the patient has or will contract the disease.
However, these algorithms are in their infancy and still unreliable, and many people would
not trust a computer to diagnose them over a well-trained physician. The complex algorithms
that determine these patients’ risks need to be improved to predict more reliably whether a
patient has a disease, so that physicians are not overloaded with
Research Methodology
order to learn relationships between health indicators that might be predictive of medical
conditions?
With further development work, the Galileo machine learning program can be used to
revolutionize the way data analysis companies interpret large, categorical datasets and provide
them with a new set of tools that can help them better understand their customers.
Basis of Hypothesis
Currently, Galileo has been tested only with a few select datasets provided with the
program. Its accuracy and efficiency have not been thoroughly tested with other datasets,
which limits its applicability to other subject areas. Thus, the main goal of this
research project is to run the algorithm on multiple datasets (mainly categorical
data from hospitals) and obtain feedback on how well Galileo performs in clustering and
Research Design
The independent variable is the input dataset, and the dependent variable is the performance
of the Galileo program. The data will be collected by analyzing the output of the Galileo
program, which reports a quantitative value: the optimal number of clusters to use when
modeling the dataset. The cluster assignments can then be compared to the known ground truth.
These results can be used to improve the analytic capabilities of the algorithm in order to
reach the highest possible true positive and true negative rates. The primary method of
collecting research data will be experiments on raw datasets using the Galileo program.
Created models will be saved with unique and readable identifiers so as to keep track of all
experimental trials. When running the program, the input data and the settings of Galileo
will be recorded in a digital log. The new models
Operational Definitions
There are no operational definitions or subjective measures in this experiment. All measures used
Product Overview
At the end of the year, a presentation about Galileo will be given at the annual JHU/APL
project conference in the spring in order to showcase the progress made with Galileo.
Alternatively, a smaller “brown bag” presentation will be held within the QAS/Large Scale
Analytics Group to accomplish the same purpose. Additionally, a “white paper” may be written
to update interested computer scientists on the progress of Galileo at the JHU/APL. The
intended audience for this research project will be other scientists and researchers at APL
who have a special interest in machine learning and/or large-scale data analysis. This
audience will already have most of the background knowledge on the topic, and will be better
able to follow the updates on Galileo during the presentation. After the presentation, the
audience will be able to ask questions and give feedback
Logistical Considerations
The main material that is needed for this research project and the final product is a
computer that has access to Python, a code editor and a poster creator. Permission must be
granted by JHU/APL in order to present or share this project outside the APL. However, no
permission should be needed when sharing within the boundaries of the APL.
References
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large
Dinov, I. D. (2016, March 17). Volume and value of big healthcare data. Retrieved from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4795481/pdf/nihms-766954.pdf
Timeline of Product Development
In order to collect data for this research project, the Galileo program will be used to determine
clusters of data values in multiple datasets relating to the healthcare industry. The final product
for this project will be a 4 foot by 4 foot poster, made at the APL, that describes the problems of
big data in the healthcare industry and how Galileo can be used to address them. This poster
board will be presented at the end-of-the-year ASPIRE Showcase held at the Johns Hopkins
University Applied Physics Laboratory. The intended audience of this project will be any data
scientists and/or healthcare provider personnel who are interested in mitigating the problem of
big data in their field of work. The poster board will also, hopefully, help attract contractors
that will purchase Galileo to use in their business. Up to this point, all tests run on
healthcare datasets have unfortunately been inconclusive about Galileo’s ability to process
mixed datasets.
August 25th: Received notification of acceptance into the JHU/APL ASPIRE Internship
Program
September 18th: Attended the ASPIRE Internship Orientation; had first meeting with Dr. Philip
Graff and the other intern from Severna Park HS, Jake Larson
September 25th: Started introduction to Socrates, a major program that the Large Scale Analytics
Group is currently working on. Installed/cloned everything onto intern computer (Python, all the
libraries, Anaconda, etc.)
September 29th: Started introduction to prerequisite libraries utilized extensively in data analysis
and Socrates (mainly Pandas and Scikit-learn)
October 12th: Started focusing my studies on Galileo, a subset algorithm in the Socrates
Program that Dr. Graff is working extensively on
October 16th: Started researching the inner workings of Galileo (entropy, quantum
mechanics, and density metrics)
October 17th: Cloned the Galileo Python files onto the intern computer from the QAS GitLab and
started looking through the code with Dr. Graff
October 24th: Started annotating the code in Galileo in order to help both myself and
other people better understand it
October 30th: Went over visualizations that helped describe how Galileo acted on various
datasets
November 6th: Started using scikit-learn to practice performing supervised machine learning
techniques on example datasets, such as the Iris dataset
November 10th: Interviewed Dr. Graff about his history and how he came to work in the
Large Scale Analytics Group (QAS) at the JHU/APL, as well as how Galileo and Socrates are
intended to change the world
November 13th: Continued annotating the code in the main Python file
November 27th: Finalized drafts of hypothesis assignments and preliminary research proposal in
order to get an idea for the scope of the project
December 4th: Finalized the December Presentation and got it approved by Dr. Graff and Mrs.
Dungey
December 6th: Attended the annual holiday party for the Asymmetric Operations Sector
for the JHU/APL, and used the opportunity to network with people working on different
projects.
December 13th: Attended a QAS Group Meeting and met some more people within the Large
Scale Analytics Group
December 19th: Finished annotating the Python code in the main Galileo file, and gained
knowledge of how it does its work in the process (which was the overall goal all along)
January 3rd: Started receiving instruction on how to run the actual Galileo algorithm on datasets.
Also met Dr. Graff’s former intern, Sam Rosen, and talked to him briefly about what he does
and what he likes about the Applied Physics Lab.
January 11th: Finalized the Annotated Source List and sent it to Dr. Graff for approval.
January 18th: Ran the Galileo algorithm on different healthcare datasets, such as heart
disease and breast cancer, without prior modifications. The test results were unsatisfactory.
January 25th: Started discussing the synthesis paper and its required literature review
with Dr. Graff.
February 6th: Ran the Galileo algorithm with slight binning applied to the heart disease
dataset and produced a final, normalized confusion matrix. The results were unsatisfactory,
especially because the algorithm chose the lowest possible value as the optimal number of
clusters. When this issue was brought up with Dr. Graff, we decided that we had to find a
larger dataset (the heart disease one had only ~900 rows).
February 16th: Started the search for a new dataset and found one detailing CASP Protein
Analysis that was optimized for machine learning (it contained ~40,000 data values). Upon
running the vanilla Galileo test on the CASP dataset, the Jupyter Notebook ran out of memory,
causing the web browser to crash. After the data used for testing was reduced to ~10,000
data values, the test ran successfully. However, the resulting confusion matrix was not
satisfactory.
February 20th: Started constructing the structure of the final synthesis paper. Did a quick
run-through of what the paper should include with Dr. Graff.
February 26th: Received official offer of employment from Dr. David Silberberg, another
employee in the Large Scale Analytics Group (QAS).
February 27th: Worked on restructuring the synthesis paper with Dr. Graff in order to move all
Galileo material to the Data Collection/Results section (as per Mrs. Dungey’s suggestion) and
to go over the different libraries and algorithms in the literature review.
March 6th: Made the decision to add a section on statistics (Bayesian and Gaussian probability
models) to the literature review, as these models are used extensively in the Galileo algorithm.
Accepted offer of employment from JHU/APL - will start on June 4th, 2018.
March 13th: Started binning the CASP Protein Analysis Dataset so that Galileo might have a
better chance of identifying clusters. Results were somewhat satisfactory, with the number of
clusters selected by AIC coming out most often as 22 (the true number of clusters should be
21). The confusion matrix, however, was still not what we wanted.
March 23rd: Started running the supervised machine learning subset of the Galileo algorithm
(Galileo Classifier) in hopes of obtaining a better result.
March 27th: Talked with Dr. Graff about different methods of data collection that we will use for
Galileo, and how we will showcase the results via the poster board as the final product.
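
The evaluation workflow recorded in the entries above (subsampling a large dataset, choosing
the number of clusters by AIC, and comparing cluster assignments to the ground truth) can be
sketched as follows. Galileo’s code is not shown in this document, so scikit-learn’s
GaussianMixture is swapped in as a stand-in. The synthetic data, the sample sizes, and all
variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.cluster import contingency_matrix
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a labeled dataset (assumption: 3 true groups).
X_full, y_full = make_blobs(
    n_samples=2000,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=1.0,
    random_state=0,
)

# Subsample rows without replacement to keep memory use manageable,
# similar in spirit to cutting ~40,000 values down to ~10,000.
rng = np.random.default_rng(0)
keep = rng.choice(len(X_full), size=500, replace=False)
X, y_true = X_full[keep], y_full[keep]

# Fit one mixture model per candidate cluster count and record its AIC.
# Lower AIC indicates a better trade-off between fit and model complexity.
candidate_ks = range(1, 7)
aic_by_k = {}
for k in candidate_ks:
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    aic_by_k[k] = gmm.aic(X)
best_k = min(aic_by_k, key=aic_by_k.get)

# Refit with the selected cluster count and assign each row to a cluster.
best_model = GaussianMixture(n_components=best_k, n_init=3, random_state=0).fit(X)
cluster_ids = best_model.predict(X)

# Rows are true classes, columns are cluster IDs. Dividing by the row sums
# gives the fraction of each true class that landed in each cluster.
counts = contingency_matrix(y_true, cluster_ids)
row_normalized = counts / counts.sum(axis=1, keepdims=True)

print("AIC by cluster count:", {k: round(v, 1) for k, v in aic_by_k.items()})
print("selected cluster count:", best_k)
print(row_normalized.round(2))
```

Cluster IDs are arbitrary, so a good result shows each row concentrated in a single column
rather than a diagonal matrix. AIC can also favor slightly more clusters than truly exist,
which is consistent with a selected value of 22 against a true value of 21.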