Qaslunchpresentation

The Impacts of Advanced Categorical Data Analysis on the Healthcare Industry

The Potential of GALILEO in an Increasingly Digital World
Sean Jordan
Long Reach High School
Dr. Philip Graff
Large Scale Analytics Group (AOS/QAS)
May 15th, 2018
If you are reading this, then Sean just
failed his AP Calculus BC Exam!
Slide Preview
• Abstract in Brief
• Big Data and Medicine
• Machine Learning
• GALILEO
• Data Results
• Data Collection
• Conclusions
Abstract (in Brief... Thankfully)
• The healthcare industry has had a big influx of data in recent years – too much to handle
• GALILEO can help mitigate this problem
• GALILEO was tested on 3 different datasets with varying results
• It has potential, but needs future work and development to be operationally useful
• Can have a lasting impact on the medical field
What in the World is Big Data?
• Big Data – very large data sets that can be analyzed to reveal patterns and trends
• Appears everywhere you look
• Data sets are getting bigger and bigger
• Big Data = Big Problem
How Does it Relate to Medicine?
• Healthcare institutions need to keep lots of data
• The world population keeps growing
• More people = More patients
• Healthcare datasets are growing bigger
• Harder and harder to maintain
Why is Big Data such a Big Topic in Medicine?
• There is a ton of patient data in hospitals
• It’s just sitting there… doing nothing
• There is enormous potential with big data analysis when it comes to medical data sets
• Healthcare is our future
• Precision Medicine!
The Answer: Machine Learning!
• Uses statistics to progressively improve performance on a task without being explicitly programmed
• Unsupervised ML
• Supervised ML
• Draws relationships within data
Unsupervised vs. Supervised Machine Learning
• Unsupervised: finding hidden patterns and trends in a data set without any labels or accuracy measures
• VS.
• Supervised: using labels as an accuracy measure to determine how well an algorithm did at classifying data values
What is the point of all this information?
• Machine learning can be used in hospitals to better understand gigantic patient data sets
• It can identify potential subpopulations within patients, which helps identify symptoms
• It can transform the way humans treat healthcare
• We need a tool to do it…
What is GALILEO, you ask?
Not this…
Or this…
The Real GALILEO
• A cluster-based data analysis program
• Uses entropy-based density metrics to group values into different clusters
• Entropy = Randomness → New results each time
• Density = Mass/Volume → English analogy!
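To make the entropy idea above concrete, here is a minimal sketch that computes the Shannon entropy of a single categorical attribute. This is not GALILEO's actual metric (the slides do not give its formula); the function name `shannon_entropy` and the sample columns are hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    counts = Counter(values)
    total = len(values)
    # Sum p * log2(1 / p) over every distinct value, where p = count / total.
    return sum((c / total) * log2(total / c) for c in counts.values())

# Hypothetical mushroom-style attribute columns (single-letter codes).
uniform_column = ["e", "e", "e", "e"]  # every value identical
mixed_column = ["e", "p", "e", "p"]    # two values, evenly split

print(shannon_entropy(uniform_column))  # 0.0: nothing random about it
print(shannon_entropy(mixed_column))    # 1.0: one bit, like a fair coin flip
```

A column where every row agrees has zero entropy, while an evenly mixed column has the most. Grouping rows so that each group looks like the first column is the rough intuition behind entropy-based clustering.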
The Ranking System
• Three different ranking criteria
• Akaike (AIC), Bayesian (BIC), and Density (DIC) Information Criteria
• Lower AIC/BIC is better
• Higher DIC is better
• BIC penalizes larger values of k
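The sketch below shows the standard textbook forms of AIC and BIC, where k is the number of free parameters (which typically grows with the number of clusters), n is the number of data instances, and ln(L) is the model's log-likelihood. Whether GALILEO computes them exactly this way is an assumption, and the DIC is left out because the slides do not define it. The two candidate models and their log-likelihoods are made-up numbers for illustration only.

```python
from math import log

def aic(log_likelihood, num_params):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_samples):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L). Lower is better."""
    return num_params * log(num_samples) - 2 * log_likelihood

n = 8124  # number of instances in the Mushrooms dataset

# Hypothetical fits: name -> (free parameters k, log-likelihood). Made up.
candidates = {"small model": (10, -5200.0), "large model": (40, -5150.0)}

for name, (k, ll) in candidates.items():
    print(name, "AIC:", round(aic(ll, k), 1), "BIC:", round(bic(ll, k, n), 1))
```

With these made-up numbers AIC prefers the large model while BIC prefers the small one, because BIC charges ln(n) per parameter instead of a flat 2.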
The Mushrooms Dataset
• Common benchmark dataset for testing categorical data analysis programs
• 23 attributes
• 8124 instances of data
• Contains different letters representing different characteristics of mushrooms (age, edibility, etc.)
Results!
• GALILEO found 23 to be the optimal number of clusters to represent the dataset
• Very intense and precise clusters
• Little error or spread
But wait... There’s more!
• There were 23 different species of mushrooms present in the dataset
• No indication of this in the actual dataset
• GALILEO found this on its own
• Showed that GALILEO has amazing potential
Test #1: The Cleveland Heart Disease Data Set
• Contained anonymous patient data on several heart disease symptoms (cholesterol, resting ECG, etc.)
• 14 attributes
• 303 instances of data
• Contains quantitative and categorical data
Results 2.0
• GALILEO found 2 to be the optimal number of clusters to represent the dataset (minimum)
• Intense but badly defined clusters
• Lots of error and spread
What went wrong?
• Not nearly enough instances of data (300 compared to 8000)
• Too many unique values in an attribute (291.4 and 291.5 are treated as completely different values)
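One way to spot the "too many unique values" problem before clustering is to check what fraction of an attribute's entries are distinct. The sketch below is only an illustration: `unique_ratio` and both sample columns are hypothetical and not taken from the Cleveland data.

```python
def unique_ratio(values):
    """Fraction of entries that are distinct (1.0 means no value repeats)."""
    return len(set(values)) / len(values)

# Hypothetical columns: a continuous measurement vs. a categorical code.
cholesterol = [291.4, 291.5, 233.0, 250.2, 204.1, 236.8]
chest_pain_type = ["a", "b", "a", "c", "a", "b"]

print(unique_ratio(cholesterol))      # 1.0: every reading is its own category
print(unique_ratio(chest_pain_type))  # 0.5: only 3 categories across 6 rows
```

An attribute with a ratio near 1.0 gives a categorical method almost nothing to group on, since no two patients share a value.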
Test #2: The CASP Protein Data Set
• Contained data on a variety of physicochemical properties of tertiary protein structure
• 9 attributes
• 46,000 instances of data
• Contains quantitative data only
• Data points were binned
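Binning turns a quantitative attribute into a small set of categories so that nearby numbers share a label. The slides do not say which binning scheme was used on the CASP data, so the equal-width scheme below is an assumption, and `equal_width_bins` and the sample readings are hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Map each number to a bin index from 0 to num_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so they all share one bin.
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

readings = [291.4, 291.5, 120.0, 180.3, 240.9, 299.9]  # hypothetical values
print(equal_width_bins(readings, 3))  # [2, 2, 0, 1, 2, 2]
```

After binning, 291.4 and 291.5 land in the same bin instead of counting as two unrelated values.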
Results 3.0
• GALILEO found 23 to be the optimal number of clusters to represent the dataset
• Pretty well defined clusters
• Noticeable error and spread
What else, Sherlock?
• The CASP Dataset had 21 different types of proteins
• This wasn’t in the dataset itself
• GALILEO came very close to accurately representing the data set
Data Collection Methods
• Notepad++: Edited Python files from GALILEO
• Jupyter Notebook: Ran GALILEO on data sets
• Scikit-learn: Made confusion matrices for data
• Matplotlib: Plotted graphs showing AIC/BIC/DIC
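A confusion matrix for clustering compares known labels against the cluster each row was assigned to. The project used scikit-learn for this; the sketch below builds the same kind of table with only the standard library so that it runs anywhere. `contingency_table` and the six sample rows are hypothetical.

```python
from collections import Counter

def contingency_table(true_labels, cluster_ids):
    """Count how often each (true label, cluster id) pair occurs."""
    return Counter(zip(true_labels, cluster_ids))

# Hypothetical ground truth and cluster assignments for six rows.
true_labels = ["A", "A", "B", "B", "C", "C"]
cluster_ids = [0, 0, 1, 1, 1, 2]

table = contingency_table(true_labels, cluster_ids)
for (label, cluster), count in sorted(table.items()):
    print(f"label {label} -> cluster {cluster}: {count}")
```

Here labels A and B each map cleanly to one cluster, while label C is split across clusters 1 and 2, which is the kind of error and spread a confusion matrix makes visible.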
Conclusions (Finally!)
• Too many unique values or too few data instances cause under-fitting problems
• Needs data sets that have enough samples for their attribute space
• Provides new ways to view a dataset
• Needs more work and testing!
• Can have a huge impact on the digital world of data
Acknowledgements
• Mrs. Beth Dungey and the other Long Reach High School teachers and staff
• Dr. Philip Graff and the other members of the Large Scale Analytics Group
• All of you for listening to my presentation today!