COSC 526 Big Data Mining Class Syllabus

COSC 526 Class 1
Big Data Mining
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: ramanathana@ornl.gov
Acknowledgement: Content borrowed from William Cohen's (CMU) class 10-605 and Stanford Mining Massive Datasets
Class Logistics
Where: Min Kao Engineering 406
When: 5:05 PM to 6:20 PM Tu/Th
Office Hours: 4:00 PM to 5:00 PM Thu
Where: Min Kao 619
Who: Arvind Ramanathan
Research interests: Computational Biology; Health Informatics; Data Analytics with heterogeneous compute architectures
Email: ramanathana@ornl.gov
Teaching Assistant
Who: Yang Song
Research Interests:
Email: ysong18@utk.edu
Office hours: TBD
Where: TBD
What I know about the class
CE: Computer Eng.; CN: Computer Networks; CS: Computer Science; CT: Communication Theory; MA: Mathematics; BA: Business Admin.; NU: Nuclear Eng.; ES: Energy Sciences; PH: Physics; IE: Industrial Eng.; PS: Power Systems; ME: Mechanical Eng.
DO: Doctoral; MS: Masters; UG: Undergraduate
Introductions
Tell us a bit about yourselves

What motivated you to take the class?
What do you expect from the class?
Where would you take these skills?
Objectives
Design and develop algorithms to analyze large amounts of data
Evaluate HPC and distributed computing paradigms for analyzing large datasets
Develop end-to-end solutions that can select, manipulate, analyze and view large-scale datasets
Collaborate with domain experts on interdisciplinary areas such as business analytics, social sciences, biomedical and health
Class Website and Course Materials
http://web.utk.edu/~ramana01/index.html
I will usually make the lecture notes available prior to the class (not a promise!)
Use the website: all the materials are available there
Piazza for class discussion
https://piazza.com/utk/spring2015/cosc526/home
Blackboard for grades
Overview of Class Schedule
Topic | Date | No. classes | Assignments
Map Reduce / Hadoop and Logistics of handling large data sets (Python: memmap) | 1/8/2015 | 2 | HW0 out
Similarity Search with high dimensional data | 1/13/2015 | 2 | HW1 in / HW 1 out
Recommendation Systems + Collaborative Filtering | 1/22/2015 | 1 |
Dimensionality Reduction Techniques (Part 1) | 1/27/2015 | 4 |
Using UriKa Graph Appliance for Semantic Reasoning (Guest Lecture: Dr. Rangan Sukumar) | 2/10/2015 | 2 | HW 1 in
Classification | 2/17/2015 | 2 | HW 2 out
Clustering | 2/24/2015 | 3 |
Data Stream Processing / Analytics | 3/5/2015 | 2 |
Overview of Class Schedule (2)
Topic | Date | No. classes | Assignments
Graph Mining | 3/31/2015 | 2 |
Digital Pathology + Molecular Biophysics (Guest Lecture: Dr. Chakra Chennubhotla) | 4/7/2015 | 1 |
Advanced Programming Models for data mining | 4/9/2015 | 2 |
Project Presentations | 4/16/2015 | 2 | HW 3 in
Poster Presentations | 4/23/2015 | 1 |
Except for orange and blue highlights, things are not set in stone
Topics can change based on class participation
Dates I am not available:
Feb 10-12: Biophysical society meetings
Mar 18-27: Arvind in India
Requirements
Component | % of Total Grade
Homework | 45
Project | 50
In-class quizzes / participation | 10
3 late-days in total (for the whole semester)

Assignments take time; start early!!
Project:
Significant implementation effort
Poster session at end of semester
Peer-evaluation and judges from UTK and ORNL
Project Deliverables and Deadlines
Deliverable | Due date | % Grade
Initial selection of topics [1, 2] | Jan 27, 2015 | 10
Project Description and Approach | Feb 20, 2015 | 20
Initial Report | Mar 20, 2015 | 10
Project Demonstration | Apr 16-19, 2015 | 10
Final Project Report (10 pages) | Apr 21, 2015 | 25
Poster (12-16 slides) | Apr 23, 2015 | 25
[1] Projects can come with their own data (e.g., from your project) or can be provided
[2] Datasets need to be open! Please don't use datasets that have proprietary limitations
All reports will be in NIPS format: http://nips.cc/Conferences/2013/PaperInformation/StyleFiles
Project Deliverables and Deadlines
1. Initial selection (1 page)
Describe what is the dataset (how large); what problem are you trying to solve (e.g., classification, clustering)
2. Project Description and Approach (1 page)
Current state of the art and background
What approach(es) are you planning to use?
What are the relevant metrics?
3. Initial Report (3 pages)
Algorithm(s) implemented
Should have a flavor of big data technologies needed
Assignments
Assignments take time; start early
You can work in pairs; submit only one write-up per group
Mention who you worked with
LaTeX your assignments
Electronic hand-over of assignments:
[lastnames]HW<num>submit.tgz
[lastnames]HW<num>submit.zip
Ex: RamanathanSongHW0submit.tgz
Post questions via Piazza
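The naming convention for electronic hand-over can be sketched in Python as below. This is only an illustration: the helper names `submission_name` and `pack_submission` are hypothetical and not part of the course materials, and the order of last names simply follows the order in which they are passed in.

```python
import tarfile
from pathlib import Path


def submission_name(lastnames, hw_num, ext="tgz"):
    """Build an archive name of the form [lastnames]HW<num>submit.<ext>."""
    if ext not in ("tgz", "zip"):
        raise ValueError("extension must be 'tgz' or 'zip'")
    return f"{''.join(lastnames)}HW{hw_num}submit.{ext}"


def pack_submission(src_dir, lastnames, hw_num):
    """Write a gzip-compressed tar archive of src_dir into the current directory."""
    archive = Path(submission_name(lastnames, hw_num, "tgz"))
    with tarfile.open(archive, "w:gz") as tar:
        # arcname stores only the folder name, not the full path, in the archive
        tar.add(src_dir, arcname=Path(src_dir).name)
    return archive


print(submission_name(["Ramanathan", "Song"], 0))  # RamanathanSongHW0submit.tgz
```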
Pre-requisites
Basic Database course:
SQL queries, data retrieval, etc.
Algorithms:
Dynamic programming, data structures
Statistics:
Moments, Distributions, Regression
Programming Languages:
Java, Python + any object oriented language
Course Books
http://www.mmds.org/
Readings and Papers will be available as part of the course website
Useful references:
Machine Learning, Tom Mitchell
Building Machine Learning Systems with Python, Willi Richert and Luis Pedro Coelho
http://scikit-learn.org/stable/
http://guidetodatamining.com/
http://www.cs.cornell.edu/home/kleinber/networks-book/
Computer Accounts
Get Piazza accounts: you should have gotten emails!
Get AWS accounts: information forthcoming
Data Mining
Data Explosion is Fueling Innovation in Science, Engineering and Business
Estimated world's data in 2010: ~1.2 zettabytes (1 zettabyte = $10^{21}$ bytes)
Total data: ~35 zettabytes in 2020!!
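As a rough illustration of what these two estimates imply, the sketch below computes the compound annual growth factor and the doubling time between them. It assumes smooth exponential growth between 2010 and 2020, which is a simplification made here and not a claim from the slide.

```python
import math

data_2010_zb = 1.2   # estimated world data in 2010 (zettabytes)
data_2020_zb = 35.0  # projected world data in 2020 (zettabytes)
years = 2020 - 2010

# Growth factor per year under the constant-growth assumption
annual_growth = (data_2020_zb / data_2010_zb) ** (1 / years)

# Years needed for the data volume to double at that rate
doubling_time_years = math.log(2) / math.log(annual_growth)

print(f"annual growth factor: {annual_growth:.2f}")
print(f"doubling time: {doubling_time_years:.2f} years")
```

Under that assumption the data volume grows by roughly 40% a year, which means it doubles about every two years.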
Data needs to be:
Stored
Managed
This class
Analyzed
We are swimming in data and we are in demand
[Figure: two panels labeled "Yesterday" and "Today"]
Data Mining
Computational process of discovering
patterns in large data sets
Data sets: from databases, structured and
unstructured text, images.
Patterns: five primary forms
Cluster analysis
Anomaly detection/ change point detection
Dependency modeling
Classification and Regression
Summarization
Data Mining as a Discipline
Interdisciplinary with diverse interactions
Databases: managing large datasets
Machine Learning & Statistics: data and models
Theory: Algorithms, in particular Randomized methods
[Figure: Data Mining at the intersection of Databases, Machine Learning & Statistics, and Theory]
Our class focuses on:
Scalability: what to do when we have big data?
Algorithms: how to do what we do with big data?
Architecture: what infrastructure is suitable?
How are datasets represented?
Structured Data
Data organized in terms of records with fields corresponding to specific entries
Examples:
Databases (relational)
XML (and other structured layouts)
Data Warehouses
Enterprise systems
Unstructured Data
Data does not have a specific organization, but can be accessed
Examples:
Word Documents
Email messages
Medical records
Photos / Images
Audio
Defining Value from Data Analysis
Discovery of patterns and models that are:
Valid: general across new data with some certainty
Useful: make sure that the domain expert benefits from the insights gained
Understandable: the insights should be interpretable for the end-user
Patterns and models must be statistically sound
Some humor
Data Mining: Extract value from large data sets
Inherent Risk: how to quantify value?
If we start looking for interesting patterns in more places than the amount of data we have access to, we will find crap
Bonferroni Principle: Making sure that the patterns discovered are not crap
Suppose terrorists are meeting in a hotel to
plot an event
A feature to exploit: can we find (unrelated)
people that stayed at least twice at the same
hotel on the same day?
How do we go about doing this?
$10^9$ people
1,000 days
Each person stays 1% of the time in a hotel (10 days in 1,000 days)
Each hotel holds about 100 people
$10^5$ hotels
If everyone behaves randomly (i.e., no collusion), will data mining find any suspicious behavior?
Let's calculate
Probability that two persons p and q will be at the same hotel on the same given day d:
$\frac{1}{100} \times \frac{1}{100} \times \frac{1}{10^5} = 10^{-9}$
(p at some hotel) x (q at some hotel) x (p and q at same hotel)
Probability that p and q will stay at the same hotel on the same given dates $d_1$ and $d_2$:
$10^{-9} \times 10^{-9} = 10^{-18}$
More calculations
Probability that p and q were at the same hotel on some two days (there are about $5 \times 10^5$ pairs of days in 1,000 days):
$5 \times 10^5 \times 10^{-18} = 5 \times 10^{-13}$
Pairs of people:
$\binom{10^9}{2} \approx 5 \times 10^{17}$
Expected number of suspected behaviors:
$5 \times 10^{17} \times 5 \times 10^{-13}$ = 250,000!
Analyst's paradox: if there are 10 real terrorists, then one has to sift through O(250K) records. Almost impossible!!!
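The arithmetic above can be checked with a short Python sketch. The variable names are illustrative, and the "exact" figure uses exact pair counts via `math.comb` where the slide rounds them to $5 \times 10^{17}$ and $5 \times 10^5$.

```python
from math import comb

n_people = 10**9
n_days = 1_000
n_hotels = 10**5
p_in_hotel = 0.01  # each person is in some hotel on 1% of days

# P(p and q are both in a hotel, and it is the same hotel) on one given day
p_same_hotel_one_day = p_in_hotel * p_in_hotel / n_hotels  # about 1e-9

# P(this happens on two given days d1 and d2); days are treated as independent
p_same_hotel_two_days = p_same_hotel_one_day ** 2  # about 1e-18

pairs_of_days = comb(n_days, 2)      # 499,500, rounded to 5e5 on the slide
pairs_of_people = comb(n_people, 2)  # about 5e17

# Expected number of "suspicious" pairs under purely random behavior
expected_exact = pairs_of_people * pairs_of_days * p_same_hotel_two_days
expected_rounded = 5e17 * 5e5 * 1e-18  # the slide's rounded figures

print(f"exact pair counts:  {expected_exact:,.0f}")
print(f"rounded as on slide: {expected_rounded:,.0f}")
```

Both versions land at roughly a quarter of a million coincidental pairs, even though nobody in the model is colluding.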
How not to design a data experiment?
Hypothesis: Some people have extrasensory perception (ESP)
Experimental design:
Have people identify 10 hidden red or blue cards
Observation: 1 in 1,000 people had ESP, meaning they were able to get all 10 correct!
Let's now inform these people that they have ESP and then try to test for ESP (again)
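A small simulation makes the flaw in this design concrete. It is a sketch under the assumption that every participant guesses each card uniformly at random. The function name `count_perfect_scores`, the population size of 100,000 and the random seed are illustrative choices, not part of the original experiment.

```python
import random


def count_perfect_scores(n_people, n_cards=10, rng=None):
    """Count how many random guessers get all n_cards red/blue cards right."""
    rng = rng or random.Random()

    def all_correct():
        # Each guess is right with probability 1/2, independent of the others
        return all(rng.random() < 0.5 for _ in range(n_cards))

    return sum(all_correct() for _ in range(n_people))


# Chance of 10 correct guesses by luck alone: 1/1024, i.e. about 1 in 1,000
p_all_correct = 0.5 ** 10

rng = random.Random(0)
first_round = count_perfect_scores(100_000, rng=rng)       # the "ESP" group
second_round = count_perfect_scores(first_round, rng=rng)  # retest only them

print(f"P(all 10 correct by chance) = {p_all_correct:.6f}")
print(f"perfect scores in round 1: {first_round}")
print(f"perfect scores on retest:  {second_round}")
```

Pure guessing already produces about one perfect score per thousand people, so the observation is exactly what chance predicts. On the retest the selected group is again expected to succeed at a rate of about 1 in 1,000.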
Tackling Big Data
Challenges we face with increasing size of datasets
Big Storage
Big Compute
Attributes: Volume, Variety, Velocity, Variability, Value
Types of Datasets: High dimensional, Sparse, Graph, Infinite/streaming, Labeled
Compute Models: Map Reduce, Streams/Online, Single machine in-memory
Big Storage
Storage is cheap:
Data farms such as DropBox,
Nitro, Mozy, Netflix, Amazon!
More opportunities to
archive and analyze
Big Compute
Cost of computing is
also going down
Commodity clusters are
cheap
Why not scale our algorithms to work with big data?
High Dimensional Datasets
Modeling physics of combustion
Observing the human body
Protein structures
Material Sciences
Sparse Datasets
Most experimental observations are
sparse:
Structural biology: nuclear
magnetic resonance (NMR), small
angle scattering (X-ray, neutron)
Materials science: Atomic force
microscopy, spectroscopic
techniques
Astronomy:
Geology:
Climate Science:

Graph Datasets
Friendship network on Facebook
The Internet!
Genome wide interaction mapping
History of Big Data Mining
Timeline: 1980s, 1990s, 2000s, 2010s
Realizations & Reflections
Small datasets + commonly used algorithms were slow: Cohen, W., et al (IJCAI 1993)
Nothing better than having lots of data
Y2K problem + Human Genome Project
Growth of Machine Learning as a discipline
Data as fourth paradigm
Scaling machine learning
Banko & Brill (ACL 2001)
The Big Data Mining Pipeline
Acquire (data): Where is the data coming from?
Extract and Clean: How to handle data that is not ready for analysis? How to deal with missing/incomplete/noisy data?
Aggregate and Integrate: How to bring together heterogeneous datasets?
Represent: How to facilitate and host datasets for analysis?
Analyze and Model: How to query, mine and model data? How to ensure models are valid representations of ...
Interpret: How to effectively portray information to the end user? How to ensure that end users understand the data?
Other concerns of Big Data Analytics

Heterogeneity
Scalability
Timeliness
Privacy
Human element!
To Dos and Summary
Summary
Data mining as a discipline:
Inter-disciplinary with statistics, machine learning and engineering
Big Data Mining:
Why are we interested in big data?
Process of big data mining
Caveats of analytic techniques (Bonferroni principle, designing good experiments)
Next class:
Diving into Map Reduce
Class Logistics
Register for Piazza
https://piazza.com/utk/spring2015/cosc526/home
Post your questions through Piazza*
Familiarize yourself with Python and Hadoop
We need Java and Python for the class
Complete HW0!
*Piazza posts have to be tagged. Please don't forget to do this.
Expectations
Thou shalt:
Participate in class and on piazza
Collaborate constructively
Respect your data
Have FUN
Thou shalt not:
Copy code, homework, etc.
I expect to see projects that can be potentially submitted to conferences/workshops. Lots of opportunities to work both at application level and engineering level.
