
1.0 Modeling

Predictive modeling is the process by which a model is created to predict an outcome. If the outcome is
categorical, it is called classification and if the outcome is numerical, it is called regression. Descriptive
modeling or clustering is the assignment of observations into clusters so that observations in the same
cluster are similar. Finally, association rules can find interesting associations amongst observations.

1.1 Classification

Classification is a data mining task of predicting the value of a categorical variable (target or class) by
building a model based on one or more numerical and/or categorical variables (predictors or attributes).

Four main groups of classification algorithms are:

1. Frequency Table
o ZeroR
o OneR
o Naive Bayesian
o Decision Tree
2. Covariance Matrix
o Linear Discriminant Analysis
o Logistic Regression
3. Similarity Functions
o K Nearest Neighbors
4. Others
o Artificial Neural Network
o Support Vector Machine

1.2 Regression

Regression is the data mining task of predicting the value of a numerical target variable by building a model based on one or more numerical and/or categorical predictors.

Four main groups of regression algorithms are:

1. Frequency Table
o Decision Tree
2. Covariance Matrix
o Multiple Linear Regression
3. Similarity Functions
o K Nearest Neighbors
4. Others
o Artificial Neural Network
o Support Vector Machine

1.3 Clustering

A cluster is a subset of data points that are similar to one another. Clustering (also called unsupervised learning) is the process of dividing a dataset into groups such that the members of each group are as similar (close) to one another as possible, and different groups are as dissimilar (far) from one another as possible. Clustering can uncover previously undetected relationships in a dataset. There are many applications for cluster analysis. For example, in business it can be used to discover and characterize customer segments for marketing purposes, and in biology it can be used to classify plants and animals given their features. A minimal k-means sketch follows the list below.

Two main groups of clustering algorithms are:

1. Hierarchical
o Agglomerative
o Divisive
2. Partitive
o K Means
o Self-Organizing Map
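
As a minimal partitive example, here is a sketch of k-means clustering using scikit-learn; the two-dimensional toy points are illustrative assumptions, not from the original text.

import math  # not required here; standard scikit-learn usage below
from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups (illustrative data)
points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the within-cluster means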

Good clustering method requirements are:

o The ability to discover some or all of the hidden clusters.
o Within-cluster similarity and between-cluster dissimilarity.
o The ability to deal with various types of attributes.
o The ability to deal with noise and outliers.
o The ability to handle high dimensionality.
o Scalability, interpretability, and usability.

OneR

OneR, short for "One Rule", is a simple, yet accurate, classification algorithm that generates one rule for
each predictor in the data, then selects the rule with the smallest total error as its "one rule". To create a
rule for a predictor, we construct a frequency table for each predictor against the target. It has been shown
that OneR produces rules only slightly less accurate than state-of-the-art classification algorithms while
producing rules that are simple for humans to interpret.

OneR Algorithm

For each predictor,
     For each value of that predictor, make a rule as follows:
          Count how often each value of the target (class) appears
          Find the most frequent class
          Make the rule assign that class to this value of the predictor
     Calculate the total error of the rules of each predictor
Choose the predictor with the smallest total error.
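
A minimal Python sketch of this frequency-table procedure follows; the helper name one_r and the toy golf-style rows are illustrative assumptions, not from the original text.

from collections import Counter, defaultdict

def one_r(X, y):
    """Return (best predictor, its rules), where the rules map each value
    of the chosen predictor to that value's most frequent class.
    X is a list of dicts (predictor -> value); y is the list of class labels."""
    best_attr, best_rules, best_error = None, None, float("inf")
    for attr in X[0]:
        # Frequency table: predictor value -> class counts
        table = defaultdict(Counter)
        for row, label in zip(X, y):
            table[row[attr]][label] += 1
        # One rule per value: predict the most frequent class
        rules = {v: counts.most_common(1)[0][0] for v, counts in table.items()}
        # Total error = rows misclassified by these rules
        error = sum(label != rules[row[attr]] for row, label in zip(X, y))
        if error < best_error:
            best_attr, best_rules, best_error = attr, rules, error
    return best_attr, best_rules

X = [{"Outlook": "Sunny", "Windy": "False"},
     {"Outlook": "Overcast", "Windy": "True"},
     {"Outlook": "Rainy", "Windy": "True"}]
y = ["No", "Yes", "No"]
print(one_r(X, y))  # Outlook wins: its rules make no errors on these rows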

Naive Bayesian

The Naive Bayesian classifier is based on Bayes' theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods.

Algorithm

Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.

o P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
o P(c) is the prior probability of the class.
o P(x|c) is the likelihood, the probability of the predictor given the class.
o P(x) is the prior probability of the predictor.

The zero-frequency problem

Add 1 to the count for every attribute value-class combination (the Laplace estimator) when an attribute value (e.g., Outlook=Overcast) doesn't occur with every class value (e.g., Play Golf=no).
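
The sketch below puts these pieces together: a categorical Naive Bayes trainer with the Laplace estimator built in. The function name train_nb and the toy rows are illustrative assumptions, not from the original text.

from collections import Counter, defaultdict

def train_nb(X, y, alpha=1):
    """Categorical Naive Bayes with add-one (Laplace) smoothing.
    X is a list of dicts (predictor -> value); y is the list of class labels."""
    class_counts = Counter(y)
    value_counts = defaultdict(lambda: defaultdict(Counter))  # attr -> class -> value counts
    values = defaultdict(set)                                 # attr -> set of seen values
    for row, c in zip(X, y):
        for attr, v in row.items():
            value_counts[attr][c][v] += 1
            values[attr].add(v)

    def predict(row):
        best_c, best_p = None, 0.0
        for c in class_counts:
            p = class_counts[c] / len(y)               # prior P(c)
            for attr, v in row.items():                # likelihood P(x|c) per predictor
                num = value_counts[attr][c][v] + alpha
                den = class_counts[c] + alpha * len(values[attr])
                p *= num / den
            if best_c is None or p > best_p:
                best_c, best_p = c, p
        return best_c  # P(x) is the same for every class, so it can be dropped

    return predict

predict = train_nb([{"Outlook": "Sunny"}, {"Outlook": "Overcast"}, {"Outlook": "Rainy"}],
                   ["No", "Yes", "Yes"])
print(predict({"Outlook": "Overcast"}))  # "Yes"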

Numerical Predictors
Numerical variables need to be transformed into their categorical counterparts (binning) before their frequency tables are constructed. The other option is to use the distribution of the numerical variable to estimate the likelihood directly. For example, one common practice is to assume a normal distribution for numerical variables.

The probability density function for the normal distribution is defined by two parameters, the mean μ and the standard deviation σ:

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
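
A short sketch of this density in Python, used as the likelihood P(x|c) for a numerical predictor; the example numbers are illustrative assumptions.

import math

def gaussian_pdf(x, mu, sigma):
    """Normal density, used as the likelihood P(x|c) for a numerical predictor."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# e.g., likelihood of Temperature=75 for a class whose temperatures
# have mean 73 and standard deviation 6.2 (illustrative numbers)
print(gaussian_pdf(75, 73, 6.2))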

Decision Tree - Regression

A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing a value of the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
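
As a sketch, a regression tree over a golf-style example can be fit with scikit-learn's DecisionTreeRegressor; the one-hot encoding step and the toy Outlook / Hours Played values below are illustrative assumptions, not from the original text.

from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

outlook = [["Sunny"], ["Overcast"], ["Rainy"], ["Sunny"], ["Overcast"]]
hours_played = [25, 46, 30, 35, 52]  # numerical target

# Encode the categorical predictor as numbers, then grow a shallow tree
X = OneHotEncoder().fit_transform(outlook).toarray()
tree = DecisionTreeRegressor(max_depth=2).fit(X, hours_played)
print(tree.predict(X))  # each leaf predicts a numerical value (a mean of Hours Played)
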
Normalize Data

You can use the Normalize Data module to transform a dataset so that the columns in the dataset are
on a common scale.

For example, assume your input dataset contains one column with values ranging from 0 to 1, and
another column with values ranging from 10,000 to 100,000. The great difference in the scale of the
numbers could cause problems when you attempt to combine the values as features during modeling.

Normalization helps you avoid these problems by transforming the values so that they maintain their general distribution and ratios, yet conform to a common scale. For example, you might change all values to a 0-1 scale, or transform the values by representing them as percentile ranks rather than absolute values.
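
For illustration, here is a minimal sketch of min-max (0-1) scaling in Python, one common way to put columns on a common scale; this is a stand-in, not the module's implementation.

def min_max(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]  # assumes hi > lo

small = [0.2, 0.5, 0.9]
large = [10_000, 55_000, 100_000]
print(min_max(small))  # [0.0, 0.4285..., 1.0]
print(min_max(large))  # [0.0, 0.5, 1.0] -- now comparable with the first column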

You can apply normalization to a single column, or to multiple columns in the same dataset. However,
you can apply only one normalization method at a time using this module. Therefore, all columns that
you select will have the same normalization method applied.

If you need to repeat the experiment, or apply the same normalization steps to other data, you can save
the steps as a normalization transform, and apply it to other datasets that have the same schema.
