Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: ramanathana@ornl.gov
Class Logistics
Where: Min Kao Engineering 406
When: 5.05 PM to 6.20 PM Tu/Th
Office Hours: 4.00 PM to 5.00 PM Thu
Where: Min Kao 619
Teaching Assistant
Who: Yang Song
Research Interests:
Email: ysong18@utk.edu
Office hours: TBD
Where: TBD
Introductions
Objectives
Design and develop algorithms to analyze
large amounts of data
Evaluate HPC and distributed computing
paradigms for analyzing large datasets
Develop end-to-end solutions that can
select, manipulate, analyze and view large-scale datasets
Collaborate with domain experts on interdisciplinary areas such as business analytics,
social sciences, biomedical and health
Date, No. classes, Classes, Assignments
1/8/2015, 2 classes
HW0 out
1/13/2015, 2 classes
HW1 in/HW 1 out
1/22/2015, 1 class
1/27/2015, 4 classes
2/10/2015, 2 classes
HW 1 in
Classification
2/17/2015, 2 classes
HW 2 out
Clustering
2/24/2015, 3 classes
3/5/2015, 2 classes
Graph Mining
3/31/2015, 2 classes
4/7/2015, 1 class
4/9/2015, 2 classes
Project Presentations
HW 3 in
4/16/2015, 2 classes
Poster Presentations
4/23/2015, 1 class
Except for orange and blue highlights, things are not set in stone
Topics can change based on class participation
Requirements
Components
% Grade
Total
Homework: 45
Project: 50
In-class quizzes/participation: 10
Due-date
% Grade
10
20
Initial Report
10
Project Demonstration
10
25
25
12
Algorithm(s) implemented
Assignments
Assignments take time; start early
You can work in pairs; submit only one
write up per group
Mention who you worked with
Typeset your assignments in LaTeX
[lastnames]HW<num>submit.tgz
[lastnames]HW<num>submit.zip
Ex:RamanathanSongHW0submit.tgz
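The naming convention above can be sketched in a few lines of Python. This is only an illustration: the helper names `submission_name` and `make_tgz` are hypothetical, and the course does not prescribe any particular packaging script.

```python
import tarfile
from pathlib import Path


def submission_name(lastnames, hw_num, ext="tgz"):
    """Build '[lastnames]HW<num>submit.<ext>' from a list of last names."""
    if ext not in ("tgz", "zip"):
        raise ValueError("ext must be 'tgz' or 'zip'")
    return "".join(lastnames) + f"HW{hw_num}submit.{ext}"


def make_tgz(lastnames, hw_num, files, out_dir="."):
    """Pack the given files into a gzip-compressed tar archive (hypothetical helper)."""
    archive = Path(out_dir) / submission_name(lastnames, hw_num, "tgz")
    with tarfile.open(archive, "w:gz") as tar:
        for f in files:
            # Store each file under its base name only, without local directories
            tar.add(f, arcname=Path(f).name)
    return archive
```

For the example on the slide, `submission_name(["Ramanathan", "Song"], 0)` produces `RamanathanSongHW0submit.tgz`.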
Pre-requisites
Basic Database course:
SQL queries, data retrieval, etc.
Algorithms:
Dynamic programming, data structures
Statistics:
Moments, Distributions, Regression
Programming Languages:
Java, Python + any object oriented language
Course Books
http://www.mmds.org/
Readings and Papers will be available as part
of the course website
Useful references:
Machine Learning, Tom Mitchell
Building Machine Learning Systems with Python, Willi Richert and Luis Pedro Coelho
http://scikit-learn.org/stable/
http://guidetodatamining.com/
http://www.cs.cornell.edu/home/kleinber/networks-book/
Computer Accounts
Get Piazza accounts: you should have gotten emails!
Get AWS accounts: information forthcoming
Data Mining
Today
Yesterday
Data Mining
Computational process of discovering
patterns in large data sets
Data sets: from databases, structured and
unstructured text, images.
Patterns: five primary forms
Cluster analysis
Anomaly detection/ change point detection
Dependency modeling
Classification and Regression
Summarization
Diagram: Data Mining draws on Machine Learning & Statistics, Databases, and Theory (algorithms, in particular randomized methods)
Examples:
Word Documents
Databases (relational)
Email messages
Medical records
Data Warehouses
Enterprise systems
Unstructured Data
Data does not have a
specific organization,
but can be accessed
Photos / Images
Audio
Some humor
Let's calculate
Probability that two persons p and q will be
at the same hotel on the same given day
d:
1/100 x 1/100 x 1/10^5 = 10^{-9}
1/100: p at some hotel
1/100: q at some hotel
1/10^5: p and q at same hotel
More calculations
Probability that p and q were at the same
hotel on some two days:
5 x 10^5 x 10^{-18} = 5 x 10^{-13}
Pairs of people:
5 x 10^{17}
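The arithmetic on these two slides can be checked with a short Python sketch. The setup values are assumptions inferred from the numbers shown: a 1/100 chance that a person is at some hotel on a given day, 10^5 hotels, 1000 days (which gives roughly 5 x 10^5 pairs of days) and 10^9 people (which gives roughly 5 x 10^17 pairs of people). The last step, multiplying the pair count by the per-pair probability, is not on the slides and is added here only to show where the calculation appears to lead.

```python
from math import comb


def hotel_coincidences(n_people=10**9, n_days=1000,
                       n_hotels=10**5, p_visit=0.01):
    """Return (p_same_day, p_some_two_days, expected_pairs) under the assumed setup."""
    # p at some hotel, q at some hotel, and both at the same one of n_hotels
    p_same_day = p_visit * p_visit / n_hotels
    # Same hotel on two specific days, treating the days as independent
    p_two_given_days = p_same_day ** 2
    # Sum over all pairs of days (an upper-bound style approximation)
    p_some_two_days = comb(n_days, 2) * p_two_given_days
    # Expected number of pairs of people that coincide purely by chance
    expected_pairs = comb(n_people, 2) * p_some_two_days
    return p_same_day, p_some_two_days, expected_pairs


p_same_day, p_some_two_days, expected_pairs = hotel_coincidences()
print(p_same_day, p_some_two_days, expected_pairs)
```

Under these assumptions the expected number of coincidental pairs comes out near a quarter of a million, even though each individual pair is extremely unlikely to coincide.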
Types of Datasets: High dimensional, Sparse, Graph, Infinite/streaming, Labeled
Compute Models: Map Reduce, Streams/Online, Single machine in-memory
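As a rough illustration of the Map Reduce model named above, here is a single-machine word count sketch in plain Python. No MapReduce framework is involved; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and only mimic the three stages of the model.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big compute"]))
```

In a real cluster the map and reduce calls would run in parallel on different machines; here they run sequentially, which is enough to show the data flow.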
Big Storage
Storage is cheap:
Data farms such as DropBox,
Nitro, Mozy, Netflix, Amazon!
More opportunities to
archive and analyze
Big Compute
Cost of computing is
also going down
Commodity clusters are
cheap
Observing the human body
Protein structures
Material Sciences
Sparse Datasets
Most experimental observations are
sparse:
Structural biology: nuclear
magnetic resonance (NMR), small
angle scattering (X-ray, neutron)
Materials science: Atomic force
microscopy, spectroscopic
techniques
Astronomy:
Geology:
Climate Science:
Graph Datasets
Friendship network on Facebook
The Internet!
2010s
2000s
1990s
1980s
Realizations & Reflections
Small datasets + commonly used algorithms were slow
Nothing better than having lots of data
Growth of Machine Learning as a discipline
Data as fourth paradigm
Scaling machine learning
Acquire (data)
Extract and Clean
Aggregate and Integrate
Represent
Analyze and Model
How to handle data that is not ready for analysis?
How to deal with missing/incomplete/noisy data?
How to bring heterogeneous datasets together?
How to query, mine and model data?
How to ensure models are valid representations of
Summary
Data mining as a discipline:
Inter-disciplinary with statistics, machine
learning and engineering
Next class:
Diving into Map Reduce
Class Logistics
Register for Piazza
https://piazza.com/utk/spring2015/cosc526/home
Complete HW0!
Expectations
Thou shalt:
Participate in class and on Piazza
Collaborate constructively
Respect your data
Have FUN