
CS685: Data Mining

Introduction

Arnab Bhattacharya
arnabb@cse.iitk.ac.in

Computer Science and Engineering,
Indian Institute of Technology, Kanpur
http://web.cse.iitk.ac.in/~cs685/

1st semester, 2018-19


Mon, Thu 1030-1145 at RM101

Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Introduction 2018-19 1 / 16


Rules

No prerequisites except general aptitude
Linear algebra, probability and statistics expected
Email arnabb@cse.iitk.ac.in to set up appointment
Put “CS685” in the subject so that automatic filter catches it
Participate
Attend classes
Clear doubts
Answer questions
Do homeworks (i.e., assignments) individually
No extension of deadlines for degradation of health of
Your computer
Your family members and (special) friend(s)
If you are unwell, follow standard IITK procedure
Produce a sick certificate, etc.

Grading policy

Exams: 20-30%
Project: 40-45%
Results: 20%
Presentation and/or Demonstration: 10%
Report: 10%
Assignments and Quizzes: 20-30%
Paper presentation and discussion: 10%
Depends on class strength
Things may be changed by mutual consent after discussion in class



Project details

Form your own idea


Just implementation or survey will not be enough
Back it up with analysis
Deadlines
1 Groups for project: Aug 13
2 Initial write-up: Sep 3
3 Mid-term report: Sep 30
4 Demonstration: Nov 12–15
5 Final report: Nov 16



Course material

Slides
Classwork
Book: no text book
Reference books
Many
References mentioned in slides
Conference proceedings and journal articles
Conferences: KDD, ICDM, SDM, PKDD, PAKDD, etc.
Journals: TKDE, TKDD, DMKD, etc.

Course contents
1 What is data mining?
Connection to machine learning, statistics, databases
What is not data mining?
2 Data pre-processing
Data extraction
Data cleaning
Data transformation
3 Data warehousing and data cube
Multi-dimensional data model
OLAP: on-line analytical processing
4 Itemset mining
Frequent itemsets
Association rule mining
5 Classification
Tree-based classification
Bayesian classification
Rule-based classification
Support vector machines
Artificial neural networks


Course contents (contd.)
6 Prediction
Regression
7 Clustering
Partition-based methods
Hierarchical methods
Model-based methods
8 Anomaly detection
Rule-based methods
Statistical methods
9 Mining special kinds of data (if time and interests permit)
Graph mining
Text mining
Image analysis
Biological data

What is data mining?

Extracting or mining knowledge from large amounts of data
Knowledge discovery from data (KDD)
We are in a data-rich but information-poor scenario
Data mining is supported by three major technologies
1 Massive data collection
2 Data mining algorithms
3 Powerful multiprocessor/distributed computers
It lies at the confluence of
Machine learning
Statistics
Databases
Information retrieval
Visualization techniques



Data analysis challenges

Scalability
High dimensionality
Heterogeneous and complex data
Web
Unstructured text
Graph
Distributed data
Data ownership and privacy
How to access knowledge without violating privacy



Data mining tasks

Classification
Predicting the class of a data object
Clustering
Finding groups in data
Association
Finding co-occurring and related itemsets
Visualization
Facilitating human discovery of patterns
Summarization
Succinctly describing a group
Anomaly detection
Identifying abnormal behavior
Estimation
Predicting values of a data object
Link analysis
Finding relationships among data objects


Extra-sensory perception (ESP)

Rhine, a parapsychologist, set out to show that people experience extra-sensory perception (ESP)
He asked many people to guess a sequence of 10 red or blue cards
About 1 in every 1000 guessed all 10 correctly
Rhine declared that they had ESP
Called them for further investigation
They lost ESP
His conclusion was that one should not inform people that they have ESP
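
The "1 in every 1000" figure is what pure guessing would give. A minimal sketch of that arithmetic, assuming each card is red or blue with equal chance and guesses are independent; the function name and the count of 10,000 volunteers are hypothetical, chosen only for illustration:

```python
from fractions import Fraction


def chance_of_perfect_guess(num_cards: int, num_colors: int = 2) -> Fraction:
    """Probability of guessing every card correctly by luck alone."""
    return Fraction(1, num_colors ** num_cards)


# 10 cards, 2 colors: one chance in 2**10 = 1024, i.e. about 1 in 1000.
p_perfect = chance_of_perfect_guess(10)

# Hypothetical number of volunteers (not given in the slides).
volunteers = 10_000
expected_lucky = volunteers * p_perfect

# A lucky subject repeats the feat on a retest with the same tiny probability,
# so almost all of them "lose" their ESP.
p_lose_on_retest = 1 - p_perfect

print(p_perfect)
print(float(expected_lucky))
print(float(p_lose_on_retest))
```

With enough subjects, a few perfect scores are expected even if nobody has any ability, and the retest failure is exactly what chance predicts.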



Tea taster

A lady claimed that she could sense whether the tea or the milk was added later
Fisher tested her with 8 cups, 4 of which had the tea added later
Only 1 chance of being correct out of $\binom{8}{4} = 70$ possibilities
Lady was wrong
She claimed that she is mostly correct
Multiple tests
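
The count of 70 can be checked directly. A small sketch, assuming the lady knows that exactly 4 of the 8 cups had the tea added later and must pick which 4; the variable names are illustrative:

```python
from math import comb

cups = 8
tea_added_later = 4

# Number of ways to choose which 4 of the 8 cups had the tea added later.
arrangements = comb(cups, tea_added_later)

# Chance of naming all 4 cups correctly by guessing alone.
p_all_correct_by_chance = 1 / arrangements

print(arrangements)
print(round(p_all_correct_by_chance, 4))
```

A single trial with a 1-in-70 chance of success by luck says little either way about a claim of being "mostly correct", which is why the test has to be repeated.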



Terrorism example

Is it sensible to try and detect possible terror links among people?
Setting: assume terrorists meet at least twice in a hotel to plot something sinister
Government method: they will scan hotel logs to identify such occurrences
Data assumptions
Number of people: $10^9$
Tracked over $10^3$ days (about 3 years)
A person stays in a hotel with a probability of 1%
Each hotel hosts $10^2$ people at a time
Total number of hotels is $10^5$
Deductions
A person stays in a hotel for 10 days (1% of $10^3$ days)
Each day, $10^7$ people stay in a hotel (1% of $10^9$ people)



Terrorism example (contd.)

In a day, the probability that persons A and B stay in the same hotel is $10^{-9}$
Probability that A stays in a hotel that day is $10^{-2}$
Probability that B stays in a hotel that day is $10^{-2}$
Probability that B chooses A’s hotel is $10^{-5}$
Probability that A and B meet on two given days is $10^{-18}$
Two independent events: $10^{-9} \times 10^{-9}$
Total pairs of days is (roughly) $5 \times 10^{5}$
Any 2 out of $10^{3}$: $\binom{10^{3}}{2}$
Probability that A and B meet twice in some pair of days is (roughly) $5 \times 10^{-13}$
$10^{-18} \times 5 \times 10^{5}$
Total pairs of people is (roughly) $5 \times 10^{17}$
Any 2 out of $10^{9}$: $\binom{10^{9}}{2}$
Expected number of suspicions, i.e., the expected number of pairs of people who meet twice on some pair of days, is $2.5 \times 10^{5}$
$5 \times 10^{-13} \times 5 \times 10^{17}$
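
The estimate can be reproduced step by step. The sketch below uses only the stated data assumptions (people, days, 1% stay probability, number of hotels) and exact fractions; the variable names are illustrative, and the exact binomial counts differ slightly from the rounded figures on the slide:

```python
from fractions import Fraction
from math import comb

people = 10**9
days = 10**3
p_in_hotel = Fraction(1, 100)   # a person is in some hotel on a given day
hotels = 10**5

# A and B are both in a hotel, and B happens to pick A's hotel.
p_same_hotel_one_day = p_in_hotel * p_in_hotel * Fraction(1, hotels)

# Meeting on two specific days: two independent events.
p_meet_on_two_given_days = p_same_hotel_one_day ** 2

day_pairs = comb(days, 2)        # roughly 5 * 10^5
people_pairs = comb(people, 2)   # roughly 5 * 10^17

# Probability that a fixed pair of people meets twice on some pair of days.
p_pair_suspicious = p_meet_on_two_given_days * day_pairs

# Expected number of innocent pairs flagged as suspicious.
expected_suspicious_pairs = p_pair_suspicious * people_pairs

print(float(p_same_hotel_one_day))
print(day_pairs, people_pairs)
print(float(expected_suspicious_pairs))   # about 2.5 * 10^5
```

Roughly a quarter of a million pairs would be flagged purely by coincidence, so any real plotters would be buried among false positives.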


Ice-cream

A man goes to an ice-cream parlor every night after dinner
He observes that his car stalls only on days he orders vanilla flavor
When any other flavor is ordered, the car does not stall
He observes it over an extended period of time
He tries changing other attributes such as shirt color, boot type, person accompanying him, etc.
No other attribute has any consistent effect
A data mining researcher comes
She finds out that since vanilla is the most popular flavor, ordering vanilla induces a significantly longer waiting time
Car stalls when the man waits longer and not otherwise



Morals

Rhine paradox
ESP story (extra-sensory perception)
Moral: Knowing what data mining is and is not will help you look smarter (than others not taking this course)
Just doing it once may not prove or disprove anything
Tea taster story
Moral: Multiple random tests are needed
Bonferroni’s principle: if you look in more places for interesting patterns than your amount of data supports, you are bound to “find” something “interesting” (most likely spurious)
Terrorism story
Moral: When checking a particular rule or property, if there are many possibilities, then it will happen by chance
Obvious rules may not always make sense
Ice-cream story
Moral: When deducing rules, look at the correct attributes, i.e., those that explain the phenomenon