
The Impacts of Advanced Categorical Data Analysis on the Healthcare Industry


The Potential of GALILEO in an Increasingly Digital World

Sean Jordan
Long Reach High School
Dr. Philip Graff
Large Scale Analytics Group (AOS/QAS)
May 15th, 2018
If you are reading this, then Sean just
failed his AP Calculus BC Exam!
Slide Preview

• Abstract in Brief
• Big Data and Medicine
• Machine Learning
• GALILEO
• Data Results
• Data Collection
• Conclusions

15 May 2018 2
Abstract (in Brief... Thankfully)

• The healthcare industry has had a big influx of
data in recent years – too much to handle
• GALILEO can help mitigate this problem
• GALILEO was tested on 3 different datasets
with varying results
• It has potential, but needs future work and
development to be operationally useful
• Can have a lasting impact on the medical field
What in the World is Big Data?

• Big Data – very large data sets that can be
analyzed to reveal patterns and trends
• Appears everywhere you look
• Data sets are getting bigger and bigger
• Big Data = Big Problem

How Does it Relate to Medicine?

• Healthcare institutions need to keep lots of data
• The world population keeps growing
• More people = More patients
• Healthcare datasets are growing bigger
• Harder and harder to maintain

Why is Big Data such a Big Topic in Medicine?

• There is a ton of patient data in hospitals
• It’s just sitting there… doing nothing
• There is enormous potential with big data
analysis when it comes to medical data sets
• Healthcare is our future
• Precision Medicine!

The Answer: Machine Learning!

• Uses statistics to progressively improve
performance on a task without being
explicitly programmed
• Unsupervised ML
• Supervised ML
• Draws relationships within data

Unsupervised vs. Supervised Machine Learning

• Unsupervised: finding hidden patterns and trends in a data
set without any labels or accuracy measures
• VS.
• Supervised: using labels as an accuracy measure to determine
how well an algorithm did at classifying data values

What is the point of all this information?

• Machine learning can be used in hospitals to
better understand gigantic patient data sets
• It can identify potential subpopulations within
patients, which helps identify symptoms
• It can transform the way humans treat
healthcare
• We need a tool to do it…

What is GALILEO, you ask?

Not this…
Or this…

The Real GALILEO

• A cluster-based data analysis program
• Uses entropy-based density metrics to group
values into different clusters
• Entropy = Randomness → New results each time
• Density = Mass/Volume → English analogy!
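
To make the entropy idea concrete, here is a minimal sketch of Shannon entropy for a single categorical attribute. It is the textbook formula, not GALILEO's actual density metric, which these slides do not spell out. The function name `shannon_entropy` and the sample columns are made up for illustration.

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    # Sum p * log2(1/p) over every distinct value in the column
    return sum((c / total) * log2(total / c) for c in counts.values())

# Hypothetical attribute columns (not taken from any real dataset)
uniform_column = ["a", "b", "c", "d"]   # every value equally likely
constant_column = ["a", "a", "a", "a"]  # no randomness at all

print(shannon_entropy(uniform_column))   # 2.0
print(shannon_entropy(constant_column))  # 0.0
```

A column where every value is equally likely has the highest entropy for its number of categories, while a column that never changes has none.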

The Ranking System

• Three different ranking criteria
• Akaike (AIC), Bayesian (BIC), and Density (DIC)
Information Criteria
• Lower AIC/BIC is better
• Higher DIC is better
• BIC penalizes larger numbers of clusters (k)
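
As a rough illustration of how two of these criteria trade fit against complexity, the sketch below uses the standard textbook AIC and BIC formulas. GALILEO's own scoring code and its density criterion (DIC) are not described in these slides, so they are not reproduced here. The log-likelihoods, parameter counts, and sample size are invented numbers, and the assumption that each extra cluster adds five free parameters is purely illustrative.

```python
from math import log

def aic(log_likelihood, num_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_samples):
    """Bayesian Information Criterion: lower is better."""
    return num_params * log(num_samples) - 2 * log_likelihood

# Hypothetical fits: (clusters k, log-likelihood, free parameters)
candidate_fits = [(2, -5200.0, 10), (3, -5100.0, 15), (4, -5090.0, 20)]
num_samples = 1000  # assumed dataset size

best_by_aic = min(candidate_fits, key=lambda f: aic(f[1], f[2]))[0]
best_by_bic = min(candidate_fits,
                  key=lambda f: bic(f[1], f[2], num_samples))[0]

print(best_by_aic)  # 4
print(best_by_bic)  # 3
```

With these made-up numbers, AIC accepts the fourth cluster for a small gain in fit, while BIC's penalty, which grows with the log of the sample size, prefers three clusters.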

The Mushrooms Dataset

• Common benchmark dataset for testing
categorical data analysis programs
• 23 attributes
• 8124 instances of data
• Contains different letters representing different
characteristics of mushrooms (age, edibility, etc.)

Results!

• GALILEO found 23 to be the optimal number of
clusters to represent the dataset
• Very intense and precise clusters
• Little error or spread

But wait... There’s more!

• There were 23 different species of mushrooms
present in the dataset
• No indication of this in the actual dataset
• GALILEO found this on its own
• Showed that GALILEO has amazing potential

Test #1: The Cleveland Heart Disease Data Set

• Contains anonymized patient records covering
several heart disease indicators
(cholesterol, resting ECG, etc.)
• 14 attributes
• 303 instances of data
• Contains quantitative and categorical data

Results 2.0

• GALILEO found 2 to be the optimal number of
clusters to represent the dataset (minimum)
• Intense but badly defined clusters
• Lots of error and spread

What went wrong?

• Not nearly enough instances of data (303
compared to 8,124 in the Mushrooms dataset)
• Too many unique values in an attribute (291.4
and 291.5 are treated as completely different values)
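
One quick way to spot this problem before clustering is to check what fraction of a column's values are distinct. The sketch below is a diagnostic idea only, not part of GALILEO. The helper name `distinct_ratio` is hypothetical, and apart from 291.4 and 291.5 the sample values are invented.

```python
def distinct_ratio(values):
    """Fraction of entries in a column that are distinct values."""
    return len(set(values)) / len(values)

# Hypothetical columns: a continuous measurement and a true category
measurement_column = [291.4, 291.5, 180.0, 180.2, 240.9, 241.0]
category_column = ["x", "y", "x", "x", "y", "y"]

print(distinct_ratio(measurement_column))  # 1.0
print(distinct_ratio(category_column))
```

A ratio near 1.0 means almost every row has its own category, so a categorical method has nothing to group on.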

Test #2: The CASP Protein Data Set

• Contains data on a variety of physicochemical
properties of tertiary protein structure
• 9 attributes
• 46,000 instances of data
• Contains quantitative data only
• Data points were binned
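
The slides do not say which binning scheme was used, so the following is only a sketch of one common option, equal-width binning, which turns continuous numbers into a small set of categories. The function name `equal_width_bins` and the sample measurements are assumptions for illustration.

```python
def equal_width_bins(values, num_bins):
    """Map each number to a bin index from 0 to num_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so everything lands in one bin
        return [0] * len(values)
    # The maximum value would fall just past the last bin, so clamp it
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

# Hypothetical continuous measurements
measurements = [0.0, 2.5, 5.0, 7.5, 10.0]
print(equal_width_bins(measurements, 4))  # [0, 1, 2, 3, 3]
```

After binning, nearby measurements share a category instead of each counting as its own unique value.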

Results 3.0

• GALILEO found 23 to be the optimal number of
clusters to represent the dataset
• Pretty well defined clusters
• Noticeable error and spread

What else, Sherlock?

• The CASP Dataset had 21 different types of
proteins
• This wasn’t in the dataset itself
• GALILEO came very close to accurately
representing the data set

Data Collection Methods

• Notepad++: Edited Python files from GALILEO
• Jupyter Notebook: Ran GALILEO on data sets
• scikit-learn: Made confusion matrices for data
• Matplotlib: Plotted graphs showing AIC/BIC/DIC
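
For readers unfamiliar with confusion matrices, the sketch below builds the same kind of table in plain Python by counting how known labels line up with cluster assignments. It is not the project's actual code. The helper `contingency_table` is hypothetical, and the labels and cluster IDs are invented.

```python
from collections import Counter

def contingency_table(true_labels, cluster_ids):
    """Rows are true labels, columns are clusters, cells are counts."""
    pair_counts = Counter(zip(true_labels, cluster_ids))
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    return [[pair_counts[(lab, c)] for c in clusters] for lab in labels]

# Hypothetical known labels and cluster assignments for five items
true_labels = ["edible", "edible", "poisonous", "poisonous", "edible"]
cluster_ids = [0, 0, 1, 1, 1]

table = contingency_table(true_labels, cluster_ids)
print(table)  # [[2, 1], [0, 2]]
```

Each row shows how one true label was spread across clusters, so a row with a single large cell means that label was captured cleanly by one cluster.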

Conclusions (Finally!)

• Too many unique values or too few data
instances cause underfitting problems
• Needs data sets that have enough samples for
their attribute space
• Provides new ways to view a dataset
• Needs more work and testing!
• Can have a huge impact on the digital world of
data
Acknowledgements

• Mrs. Beth Dungey and the other Long Reach High School
teachers and staff
• Dr. Philip Graff and the other members of the
Large Scale Analytics Group
• All of you for listening to my presentation today!
