
COSC 526 Class 1

Big Data Mining

Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: ramanathana@ornl.gov

Acknowledgement: Content borrowed from William Cohen's (CMU) class 10-605 and Stanford's Mining Massive Datasets

Class Logistics
Where: Min Kao Engineering 406
When: 5:05 PM to 6:20 PM Tu/Th
Office Hours: 4:00 PM to 5:00 PM Thu
Where: Min Kao 619

Who: Arvind Ramanathan
Research interests: Computational Biology; Health Informatics; Data Analytics with heterogeneous compute architectures
Email: ramanathana@ornl.gov

Teaching Assistant
Who: Yang Song
Research Interests:
Email: ysong18@utk.edu
Office hours: TBD
Where: TBD

What I know about the class

CE: Computer Eng.; CN: Computer Networks; CS: Computer Science; CT: Communication Theory; MA: Mathematics; BA: Business Admin.; NU: Nuclear Eng.; ES: Energy Sciences; PH: Physics; IE: Industrial Eng.; PS: Power Systems; ME: Mechanical Eng.
DO: Doctoral; MS: Masters; UG: Undergraduate

Introductions

Tell us a bit about yourselves


What motivated you to take the class?
What do you expect from the class?
Where would you take these skills?

Objectives
Design and develop algorithms to analyze large amounts of data
Evaluate HPC and distributed computing paradigms for analyzing large datasets
Develop end-to-end solutions that can select, manipulate, analyze and view large-scale datasets
Collaborate with domain experts on interdisciplinary areas such as business analytics, social sciences, biomedical and health

Class Website and Course Materials


http://web.utk.edu/~ramana01/index.html
I will usually make the lecture notes available prior to the class (not a promise!)
Use the website: all the materials are available there
Piazza for class discussion: https://piazza.com/utk/spring2015/cosc526/home
Blackboard for grades

Overview of Class Schedule


Topic | Date | Classes | Assignments
Map Reduce / Hadoop and Logistics of handling large data sets (Python: memmap) | 1/8/2015 | 2 | HW0 out
Similarity Search with high dimensional data | 1/13/2015 | 2 | HW1 in / HW 1 out
Recommendation Systems + Collaborative Filtering | 1/22/2015 | 1 |
Dimensionality Reduction Techniques (Part 1) | 1/27/2015 | 4 |
Using UriKa Graph Appliance for Semantic Reasoning (Guest Lecture: Dr. Rangan Sukumar) | 2/10/2015 | 2 | HW 1 in
Classification | 2/17/2015 | 2 | HW 2 out
Clustering | 2/24/2015 | 3 |
Data Stream Processing / Analytics | 3/5/2015 | 2 |

Overview of Class Schedule (2)


Topic | Date | No. classes | Assignments
Graph Mining | 3/31/2015 | 2 |
Digital Pathology + Molecular Biophysics (Guest Lecture: Dr. Chakra Chennubhotla) | 4/7/2015 | 1 |
Advanced Programming Models for data mining | 4/9/2015 | 2 |
Project Presentations | 4/16/2015 | 2 | HW 3 in
Poster Presentations | 4/23/2015 | 1 |

Except for orange and blue highlights, things are not set in stone
Topics can change based on class participation

Dates I am not available:
Feb 10-12: Biophysical Society meetings
Mar 18-27: Arvind in India

Requirements
Components | % Grade (Total)
Homework | 45
Project | 50
In-class quizzes/participation | 10

3 late-days in total (for the whole semester)


Assignments take time; start early!!
Project:
Significant implementation effort
Poster session at end of semester
Peer-evaluation and judges from UTK and ORNL

Project Deliverables and Deadlines


Deliverable | Due date | % Grade
Initial selection of topics [1, 2] | Jan 27, 2015 | 10
Project Description and Approach | Feb 20, 2015 | 20
Initial Report | Mar 20, 2015 | 10
Project Demonstration | Apr 16-19, 2015 | 10
Final Project Report (10 pages) | Apr 21, 2015 | 25
Poster (12-16 slides) | Apr 23, 2015 | 25

[1] Projects can come with their own data (e.g., from your project), or data can be provided
[2] Datasets need to be open! Please don't use datasets that have proprietary limitations
All reports will be in NIPS format: http://nips.cc/Conferences/2013/PaperInformation/StyleFiles

Project Deliverables and Deadlines


1. Initial selection (1 page)
Describe the dataset (how large it is) and the problem you are trying to solve (e.g., classification, clustering)

2. Project Description and Approach (1 page)
Current state of the art and background
What approach(es) are you planning to use?
What are the relevant metrics?

3. Initial Report (3 pages)
Algorithm(s) implemented
Should have a flavor of big data technologies needed

Assignments
Assignments take time; start early
You can work in pairs; submit only one write-up per group
Mention who you worked with
Typeset your assignments in LaTeX

Electronic hand-over of assignments:
[lastnames]HW<num>submit.tgz
[lastnames]HW<num>submit.zip
Ex: RamanathanSongHW0submit.tgz

Post questions via Piazza
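
As a small illustration of the submission naming convention above, the following Python sketch builds the archive name and packs a homework directory into a .tgz file. The function names and the choice to concatenate last names in the order given are assumptions made for this example, not course requirements.

```python
import tarfile
from pathlib import Path

def submission_name(lastnames, hw_num, ext="tgz"):
    """Build '[lastnames]HW<num>submit.<ext>' from a list of last names."""
    if ext not in ("tgz", "zip"):
        raise ValueError("extension must be 'tgz' or 'zip'")
    # Assumption: last names are concatenated in the order given
    return "".join(lastnames) + f"HW{hw_num}submit.{ext}"

def make_tgz(lastnames, hw_num, source_dir, out_dir="."):
    """Pack source_dir into a gzip-compressed tar archive named per the convention."""
    archive = Path(out_dir) / submission_name(lastnames, hw_num, "tgz")
    with tarfile.open(archive, "w:gz") as tar:
        # Store files under the directory's own name rather than its full path
        tar.add(source_dir, arcname=Path(source_dir).name)
    return archive

if __name__ == "__main__":
    print(submission_name(["Ramanathan", "Song"], 0))
```

Calling make_tgz(["Ramanathan", "Song"], 0, "hw0") would write RamanathanSongHW0submit.tgz to the current directory, assuming a directory named hw0 exists.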



Pre-requisites
Basic Database course:
SQL queries, data retrieval, etc.

Algorithms:
Dynamic programming, data structures

Statistics:
Moments, Distributions, Regression

Programming Languages:
Java, Python + any object-oriented language

Course Books
http://www.mmds.org/
Readings and papers will be available as part of the course website
Useful references:
Machine Learning, Tom Mitchell
Building Machine Learning Systems with Python, Willi Richert and Luis Pedro Coelho
http://scikit-learn.org/stable/
http://guidetodatamining.com/
http://www.cs.cornell.edu/home/kleinber/networks-book/

Computer Accounts
Get Piazza accounts: you should have gotten emails!
Get AWS accounts: information forthcoming

Data Mining


Data Explosion is Fueling Innovation in Science, Engineering and Business
Estimated world's data in 2010: ~1.2 zettabytes (1 zettabyte = $10^{21}$ bytes)
Total data: ~35 zettabytes in 2020!!
Data needs to be:
Stored
Managed
This class
Analyzed


We are swimming in data and we are in demand
[Figure panels: Today; Yesterday]

Data Mining
Computational process of discovering patterns in large data sets
Data sets: from databases, structured and unstructured text, images
Patterns: five primary forms
Cluster analysis
Anomaly detection / change point detection
Dependency modeling
Classification and Regression
Summarization

Data Mining as a Discipline


Interdisciplinary with diverse interactions
Databases: managing large datasets
Machine Learning & Statistics: data and models
Theory: Algorithms, in particular Randomized methods
[Figure: Data Mining at the intersection of Databases, Machine Learning & Statistics, and Theory]

Our class focuses on:
Scalability: what to do when we have big data?
Algorithms: how to do what we do with big data?
Architecture: what infrastructure is suitable?

How are datasets represented?


Structured Data
Data organized in terms of records with fields corresponding to specific entries
Examples:
Databases (relational)
XML (and other structured layouts)
Data Warehouses
Enterprise systems

Unstructured Data
Data does not have a specific organization, but can be accessed
Examples:
Word Documents
Email messages
Medical records
Photos / Images
Audio

Defining Value from Data Analysis


Discovery of patterns and models that are:
Valid: hold on new data with some certainty
Useful: make sure that the domain expert benefits from the insights gained
Understandable: the insights should be interpretable for the end-user

Patterns and models must be statistically sound

Some humor

Data Mining: Extract value from large data sets
Inherent Risk: how to quantify value?
If we start looking for interesting patterns in more places than the amount of data we have access to, we will find crap

Bonferroni Principle: Making sure that the patterns discovered are not crap
Suppose terrorists are meeting in a hotel to plot an event
A feature to exploit: can we find (unrelated) people that stayed at least twice at the same hotel on the same day?

How do we go about doing this?


$10^9$ people
1,000 days
Each person stays 1% of the time in a hotel (10 days in 1,000 days)
Each hotel holds about 100 people
$10^5$ hotels

If everyone behaves randomly (i.e., no collusion), will data mining find any suspicious behavior?

Let's calculate
Probability that two persons p and q will be at the same hotel on the same given day d:
$\frac{1}{100} \times \frac{1}{100} \times \frac{1}{10^5} = 10^{-9}$
(the three factors: p at some hotel, q at some hotel, p and q at the same hotel)

Probability that p and q will stay at the same hotel on the same given dates d1 and d2:
$10^{-9} \times 10^{-9} = 10^{-18}$

More calculations
Probability that p and q were at the same hotel on some two days (there are about $5 \times 10^5$ pairs of days in 1,000 days):
$5 \times 10^5 \times 10^{-18} = 5 \times 10^{-13}$

Pairs of people:
$5 \times 10^{17}$

Expected number of suspected behaviors:
$5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$!
Analyst's paradox: if there are 10 real terrorists, then one has to sift through O(250K) records. Almost impossible!!!
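
The arithmetic above can be checked with a minimal Python sketch. It uses the slide's assumptions (one billion people, 1,000 days, 100,000 hotels, each person in a hotel 1% of the time) and additionally assumes that days are independent. The function and parameter names are illustrative only. Because it uses exact pair counts instead of the rounded figures, the result lands slightly below 250,000.

```python
from math import comb

def expected_suspicious_pairs(people=10**9, days=1_000, hotels=10**5, p_in_hotel=0.01):
    """Expected number of pairs of people who share a hotel on two different days by chance."""
    # Two given people are both in hotels, and in the same one, on one given day
    p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels
    # The same coincidence on two given days, assuming days are independent
    p_two_given_days = p_same_hotel_one_day ** 2
    pairs_of_days = comb(days, 2)
    pairs_of_people = comb(people, 2)
    return pairs_of_people * pairs_of_days * p_two_given_days

if __name__ == "__main__":
    print(f"pairs of days:   {comb(1_000, 2):,}")
    print(f"pairs of people: {comb(10**9, 2):.3e}")
    print(f"expected pairs:  {expected_suspicious_pairs():,.0f}")
```

Changing the defaults (for example, fewer hotels or a longer observation window) shows how quickly chance coincidences grow with the number of places one looks.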


How not to design a data experiment?
Hypothesis: Some people have extrasensory perception (ESP)
Experimental design:
Have people identify 10 hidden red or blue cards

Observation: 1 in 1,000 people had ESP, meaning they were able to get all 10 correct!
Let's now inform these people that they have ESP and then try to test for ESP (again)
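
A rough Python sketch of why this observation is expected from chance alone: with two equally likely colors, guessing 10 cards correctly has probability (1/2)^10 = 1/1024, or about 1 in 1,000. The function names are hypothetical, and the simulation assumes independent, uniformly random guesses.

```python
import random

def chance_all_correct(n_cards=10, n_colors=2):
    """Probability of guessing every card by pure chance."""
    return (1 / n_colors) ** n_cards

def simulate_lucky_guessers(n_people, n_cards=10, seed=0):
    """Count people who guess all cards correctly when every guess is a coin flip."""
    rng = random.Random(seed)
    lucky = 0
    for _ in range(n_people):
        # Each guess is right with probability 1/2 (red or blue)
        if all(rng.random() < 0.5 for _ in range(n_cards)):
            lucky += 1
    return lucky

if __name__ == "__main__":
    p = chance_all_correct()
    print(f"chance of a perfect score: 1 in {round(1 / p)}")
    print("lucky guessers among 100,000 people:", simulate_lucky_guessers(100_000))
    # A "psychic" from round one has the same 1-in-1024 chance in a retest
    print(f"chance of passing two rounds: {p * p:.2e}")
```

Under these assumptions the people selected in the first round are no more likely than anyone else to pass a retest, which is why selecting on the outcome and then retesting makes for a poor experimental design.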


Tackling Big Data


Challenges we face with increasing size of datasets
Big Storage
Big Compute
Attributes: Volume, Variety, Velocity, Variability, Value
Types of Datasets: High dimensional, Sparse, Graph, Infinite/streaming, Labeled
Compute Models: Map Reduce, Streams/Online, Single machine in-memory

Big Storage

Storage is cheap:
Data farms such as DropBox, Nitro, Mozy, Netflix, Amazon!

More opportunities to archive and analyze

Big Compute

Cost of computing is also going down
Commodity clusters are cheap

Why not scale our algorithms to work with big data?

High Dimensional Datasets

Modeling physics of combustion
Observing the human body
Protein structures
Material Sciences

Sparse Datasets
Most experimental observations are sparse:
Structural biology: nuclear magnetic resonance (NMR), small angle scattering (X-ray, neutron)
Materials science: atomic force microscopy, spectroscopic techniques
Astronomy:
Geology:
Climate Science:

Graph Datasets
Friendship network on Facebook
The Internet!
Genome wide interaction mapping

History of Big Data Mining
Timeline: 1980s, 1990s, 2000s, 2010s

Realizations & Reflections
Small datasets + commonly used algorithms were slow (Cohen, W., et al., IJCAI 1993)
Nothing better than having lots of data
Y2K problem + Human Genome Project
Growth of Machine Learning as a discipline
Data as fourth paradigm
Scaling machine learning
Banko & Brill (ACL 2001)

The Big Data Mining Pipeline


Acquire (data): Where is the data coming from?
Extract and Clean: How to handle data that is not ready for analysis? How to deal with missing/incomplete/noisy data?
Aggregate and Integrate: How to bring together heterogeneous datasets?
Represent: How to facilitate and host datasets for analysis?
Analyze and Model: How to query, mine and model data? How to ensure models are valid representations of ...
Interpret: How to effectively portray information to the end user? How to ensure that end users understand the data?

Other concerns of Big Data Analytics


Heterogeneity
Scalability
Timeliness
Privacy
Human element!


To Dos and Summary


Summary
Data mining as a discipline:
Inter-disciplinary with statistics, machine learning and engineering

Big Data Mining:
Why are we interested in big data?
Process of big data mining
Caveats of analytic techniques (Bonferroni principle, designing good experiments)

Next class:
Diving into Map Reduce

Class Logistics
Register for Piazza
https://piazza.com/utk/spring2015/cosc526/home

Post your questions through Piazza*

Familiarize yourself with Python and Hadoop
We need Java and Python for the class

Complete HW0!

*Piazza posts have to be tagged. Please don't forget to do this.

Expectations
Thou shalt:
Participate in class and on Piazza
Collaborate constructively
Respect your data
Have FUN

Thou shalt not:
Copy code, homework, etc.

I expect to see projects that can be potentially submitted to conferences/workshops. Lots of opportunities to work both at application level and engineering level.
