You are on page 1of 44

# Python for Data Science

## Machine Learning in Python:

Classification
Dr. Ilkay Altintas and Dr. Leo Porter
By the end of this video, you should be able to:
Python for Data Science

## § Describe how binomial classification differs from

multinomial classification
Classification
Target variable
is categorical
Python for Data Science

Sunny
Input
Variables
Model Windy

Rainy

Goal: Cloudy
Given input variables,
predict category
Data for Classification
Input Variables Target Variable
Python for Data Science

## Temperature Humidity Wind Speed Weather

79 48 2.7 Sunny
60 80 3.8 Rainy
68 45 17.9 Windy
57 77 4.2 Cloudy
Classification is Supervised

## Target Label Output

Python for Data Science

## Temperature Humidity Wind Speed Weather

79 48 2.7 Sunny
60 80 3.8 Rainy
68 45 17.9 Windy
57 77 4.2 Cloudy
Binary Types of Classification
Multi-class
Classification Classification
Python for Data Science

## Target has two Target has > 2

values values
Classification Examples
Binary Multi-Class
Python for Data Science

Classification Classification
• Will it rain • What type of
tomorrow or not? product will this
• Is this transaction customer buy?
legitimate or • Is this tweet
fraudulent positive, negative,
or neutral
Classification Main Points
• Predict category from input variables
Python for Data Science

## • Classification is a supervised task

• Target variable is categorical

## Input Model Target

Variables
Machine Learning in Python:
Python for Data Science

## Building and Applying a

Classification Model
Dr. Ilkay Altintas and Dr. Leo Porter
By the end of this video, you should be able to:
Python for Data Science

## § Name some common algorithms for classification

What is a Machine Learning Model?
Python for Data Science

Input
Model Output
Variables
y=f(ax+b) (y)
(x)

## function mapping input parameters

to output
Example of Model
Python for Data Science

y = mx + b
output

input
Adjusting Model slope m = 2
y-intercept b = +1
Parameters x=1 => y=2*1+1= 3
Python for Data Science

slope m = 2
y-intercept b = -1
x=1 => y=2*1-1=1
Building Machine Learning Model
Model parameters are adjusted during model training to
Python for Data Science

Input
Model Output
Variables
y=f(ax+b) (y)
(x)

## function mapping input parameters

to output
Building Classification Model
Decision
Python for Data Science

Boundaries
Building vs. Applying Model
Python for Data Science

• Training Phase
• Use training data

• Testing Phase
• Apply learned model
• Use new data
Building vs. Applying Model
Training
Python for Data Science

Data Build
Model
Model
Learning
Algorithm Training Phase

Test
Data Apply
Results
Model
Model
Testing Phase
Building a Classification Model
Use learning algorithm to model
Python for Data Science

Algorithm

Input
Model Output
Variables
y=f(ax+b) (y)
(x)

## function mapping input parameters

to output
Classification
Python for Data Science

## • Task: Predict category from input variables

• Goal: Match model outputs to targets (desired
outputs)

## Input Model Target

Variables
Learning Algorithm
Python for Data Science

Training
Data Build
Model
Model
Learning
Algorithm Training Phase

## • Learning algorithm used to adjust model’s parameters

Common Classification Algorithms
Python for Data Science

• kNN
• Decision Tree
• Naïve Bayes
kNN Overview
• Classify sample by looking at its
Python for Data Science

closest neighbors
Decision Tree Overview
Warm- • Tree captures
Python for Data Science

Blooded
multiple
Live Non-
classification
Birth Mammal decision paths

Vertebrate Non-
Mammal

Non-
Mammal
Mammal
Naïve Bayes Overview
Python for Data Science

• Probabilistic approach to
classification
Common Classification Algorithms
Python for Data Science

• kNN
• Decision Tree
• Naïve Bayes
Python for Data Science

## Machine Learning in Python:

Decision Trees
Dr. Ilkay Altintas and Dr. Leo Porter
By the end of this video, you should be able to:
Python for Data Science

classification

## § Interpret how a decision tree comes up with a

classification decision
Decision Tree Overview
Python for Data Science

## • Idea: Split data into “pure” regions Decision

Boundaries
Classification Using Decision Tree
Root Node
Python for Data Science

Internal Nodes

Leaf Nodes
Classification Using Decision Tree
Root Node
Python for Data Science

## Internal Nodes Tree Depth = 3

Tree Size = 6

Leaf Nodes
Example Decision Tree
Warm- Live Verte- Target
Blooded Birth brate Label
Warm-
Python for Data Science

Blooded
Yes Yes Yes Mammal

Live Non-
Birth Mammal

Vertebrate Non-
Mammal

Non-
Mammal
Mammal
Constructing
Decision Tree Tree
Induction
Python for Data Science

• Partition samples based on input to create purest
subsets.
• Repeat to partition data into successively purer
subsets.
Greedy Approach
Python for Data Science

## What’s the best way to split the current node?

How to Determine Best Split?
Want subsets to be as homogeneous as possible
Python for Data Science

## Less homogeneous = More pure More homogeneous = More pure

Impurity Measure
• To compare different ways to split data in a node
Python for Data Science

## Higher = Less pure

Gini
Index

Lower = More
pure
What Variable to Split On?
Python for Data Science

## Split on var1, var2, … varN

When to Stop Splitting a Node?
Python for Data Science

## • All (or X% of) samples have same class label

• Number of samples in node reaches minimum
• Change in impurity measure is smaller than threshold
• Max tree depth is reached
• Others…
Tree Induction Example: Split 1
Python for Data Science

Defaults Repays
Tree Induction Example: Split 2
Python for Data Science

Defaults Repays
Tree Induction Example: Split 3
Defaults Repays
Python for Data Science
Tree Induction Example
• Resulting model
Python for Data Science
Decision Boundaries
• Rectilinear = Parallel to axes
Python for Data Science
Decision Tree for Classification
Python for Data Science

## • Resulting tree is often simple and easy to interpret

• Induction is computationally inexpensive
• Greedy approach does not guarantee best solution
• Rectilinear decision boundaries
Python for Data Science

Decision Tree