DATA MINING
ALGORITHMS
Vikram Singh Sankhala
What is Data Mining
■ Data mining is the process of discovering or extracting new patterns from large data
sets involving methods from
– Statistics
– Artificial intelligence.
Major Techniques Used
■ Classification
■ Regression
■ Clustering
■ Association Rules
■ Principal Components Analysis
Supervised Learning
Clustering (K-means, Hierarchical clustering)
Association Rule (Apriori, Eclat)
Support Vector Machine
• “Support Vector Machine” (SVM) is a supervised machine learning algorithm
which can be used for both classification and regression challenges.
• The data mining process includes collecting, exploring and selecting the right data.
• Support Vector Machines are based on the concept of decision planes that define
decision boundaries.
Advantage of SVM
• Performs well on linear problems
Disadvantage of SVM
• A linear SVM does not work on nonlinear problems (you need a kernel SVM), and
it does not directly give the probabilities of the classes.
Support Vector Machine Application
• SVM has been used successfully in many real-world problems
- text (and hypertext) categorization
- image classification
- bioinformatics (Protein classification,
Cancer classification)
- hand-written character recognition
Naïve Bayes Algorithm
• It is a classification technique based on Bayes’ Theorem with an
assumption of independence among predictors.
• A Naive Bayes model is easy to build and particularly useful for very
large data sets (large in the number of records, not in the number of features).
Advantage of Naïve Bayes
• You can get the probabilities of the classes, and it works on
nonlinear problems
• Medical Diagnosis
• Given a list of symptoms, predict whether a patient has disease X
or not
• Weather
• Based on temperature, humidity, etc… predict if it will rain
tomorrow
• Advantage –
• Easy to Understand
• Useful in Data exploration
• Less data cleaning required
• Handle both numerical and categorical variables
• Disadvantage –
• Overfitting
• Not fit for continuous variables
Application of DT and Random Forest
• Astronomy:
• star-galaxy classification, determining galaxy counts.
• Biomedical Engineering:
• Use of decision trees for identifying features to be used in
implantable devices
• Pharmacology:
• Use of tree based classification for drug analysis
• Manufacturing:
• Chemical material evaluation for manufacturing and production
• Medicine:
• Analysis of the Sudden Infant Death Syndrome
Which Model?
• DT - when you want to have clear interpretation of your model results
• Random Forest - when you are just looking for high performance
with less need for interpretation
• SVM - when your business problem is a linear problem (with a linearly
separable dataset)
• Naive Bayes - when you want your business problem to be based on
a probabilistic approach. For example if you want to rank your
customers from the highest probability to buy a certain product, to
the lowest.
Cluster Analysis (Unsupervised Learning)
• Clustering analysis is the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are more similar
(in some sense or another) to each other than to those in other
groups (clusters).
Advantage of Clustering (K-means)
• With a large number of variables, K-means is usually
computationally faster than hierarchical clustering
• K-means tends to produce tighter clusters than hierarchical clustering
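To make the grouping idea concrete, here is a minimal sketch of K-means on one-dimensional data. The data values and the starting centroids are assumptions chosen for illustration; real uses would normally rely on a library implementation and multi-dimensional distances.

```python
from statistics import mean

def k_means_1d(values, centroids, max_iterations=100):
    """Lloyd's algorithm in one dimension: assign, re-average, repeat until stable."""
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to the nearest current centroid.
            nearest = min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty.
        updated = [mean(c) if c else centroids[k] for k, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

# Hypothetical data with two obvious groups; the initial centroids are assumed.
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(values, centroids=[0.0, 5.0])
print(centroids)  # [1.5, 9.5]
```

Objects in the same cluster end up closer to their own centroid than to the other one, which is the "more similar to each other" property described above.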
• Insurance:
• Identifying groups of crop insurance policy holders with a high
average claim rate. Farmers may destroy crops when it is “profitable”.
• Land use:
• Identification of areas of similar land use in a GIS database.
• Seismic studies:
• Identifying probable areas for oil/gas exploration based on
seismic data.
Classification
1. Decision trees
2. CART: Classification and Regression Trees
3. Ruleset classifiers
4. Ensemble Classifiers
5. Support vector machines
6. Naive Bayes
Decision trees
■ The splitting of nodes is decided by criteria such as information gain, chi-square and the Gini
index.
■ ID3, or Iterative Dichotomiser 3, was the first of three decision tree implementations
developed by Ross Quinlan (followed by C4.5 and C5.0)
■ The ID3 algorithm uses a greedy search. It selects a test using the information gain
criterion (Minimizing Shannon Entropy), and then never explores the possibility of
alternate choices.
'Greedy Algorithm'?
■ Makes a locally-optimal choice in the hope that this choice will lead to a globally-
optimal solution.
■ A code is a mapping from a “string” (a finite sequence of letters) to a finite sequence
of binary numbers.
■ The goal of compression algorithms is to encode strings with the smallest sequence
of binary numbers.
■ Shannon entropy gives the optimal compression rate, that can be approached but
not improved.
■ Information gain is inversely related to entropy: the more a split lowers the entropy, the higher its information gain.
■ The Greedy Algorithm is used at each node to arrive at the next node.
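The relationship between entropy and information gain can be sketched in a few lines. The labels and the two candidate splits below are hypothetical; the function names are illustrative and not taken from any library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 4 "yes" and 4 "no" labels, and two candidate splits.
parent = ["yes"] * 4 + ["no"] * 4
pure_split = [["yes"] * 4, ["no"] * 4]
mixed_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(entropy(parent))                        # 1.0
print(information_gain(parent, pure_split))   # 1.0
print(information_gain(parent, mixed_split))  # 0.0
```

A greedy learner such as ID3 would pick the first split here, because it removes all of the entropy at that node, and would not revisit the choice later.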
Information Gain and Shannon Entropy
■ Classification Trees
■ Regression Trees
Classification Trees
■ These are considered as the default kind of decision trees used to separate a
dataset into different classes, based on the response variable. These are generally
used when the response variable is categorical in nature.
Regression Trees
■ The data science libraries in Python used to implement the decision tree machine
learning algorithm are SciPy and scikit-learn.
■ The data science library in R used to implement the decision tree machine
learning algorithm is caret.
Random Forest
■ Let’s assume we have a sample of 100 values (x) and we’d like to get an estimate of
the mean of the sample
■ Create many (e.g. 1000) random sub-samples of our dataset with replacement
(meaning we can select the same value multiple times).
■ Calculate the mean of each sub-sample.
■ Calculate the average of all of our collected means and use that as our estimated
mean for the data.
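The bootstrap steps above can be sketched as follows. The sample itself is synthetic (drawn from an assumed normal distribution with a fixed seed), and `bootstrap_mean` is an illustrative name.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample of 100 values; any numeric data would do.
sample = [random.gauss(50, 10) for _ in range(100)]

def bootstrap_mean(data, n_resamples=1000):
    """Average the means of many sub-samples drawn with replacement."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))  # with replacement
        means.append(mean(resample))
    return mean(means)

estimate = bootstrap_mean(sample)
print(round(estimate, 2), round(mean(sample), 2))
```

The bootstrap estimate should land very close to the plain sample mean; the value of the procedure is that the spread of the collected means also indicates how uncertain that estimate is.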
Bootstrap Aggregation (Bagging)
■ Even with Bagging, the decision trees (CART) can have a lot of structural similarities
and in turn have high correlation in their predictions.
■ To reduce the correlation between the trees, the random forest algorithm changes the
procedure so that, at each split, the learning algorithm is limited to a random sample of
features to search.
■ The number of features that can be searched at each split point (m) must be
specified as a parameter to the algorithm.
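As a rough illustration of the parameter m, the sketch below draws a fresh random subset of feature indices for each split point. The feature count is hypothetical, and the square-root rule is only a common default for classification, not something stated in the slides.

```python
import random

def candidate_features(n_features, m, rng=random):
    """Pick m distinct feature indices to consider at one split point."""
    return sorted(rng.sample(range(n_features), m))

random.seed(1)
n_features = 16
m = int(n_features ** 0.5)  # sqrt of the feature count, a common default
for split in range(3):
    # Each split point sees a different random subset of the features.
    print(candidate_features(n_features, m))
```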
Libraries
■ A Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature.
■ For example, a fruit may be considered to be an apple if it is red, round, and about 3
inches in diameter.
■ Even if these features depend on each other or upon the existence of the other
features, all of these properties independently contribute to the probability that this
fruit is an apple
■ That is why it is known as ‘Naive’.
■ This algorithm is mostly used in text classification and with problems having multiple
classes.
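A minimal sketch of the idea, assuming a tiny made-up fruit data set with two categorical features. Add-one (Laplace) smoothing is an added assumption so that unseen feature values do not force a probability of zero.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (colour, shape) -> fruit.
training = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("green", "round"), "apple"),
    (("yellow", "long"), "banana"),
    (("yellow", "long"), "banana"),
    (("green", "long"), "banana"),
]

class_counts = Counter(label for _, label in training)
# feature_counts[label][position][value] = number of occurrences
feature_counts = defaultdict(lambda: defaultdict(Counter))
values = defaultdict(set)
for features, label in training:
    for position, value in enumerate(features):
        feature_counts[label][position][value] += 1
        values[position].add(value)

def posterior(features):
    """Return P(class | features) under the independence assumption."""
    scores = {}
    for label, count in class_counts.items():
        score = count / len(training)  # prior P(class)
        for position, value in enumerate(features):
            # Each feature contributes its own factor, independently of the others.
            seen = feature_counts[label][position][value]
            score *= (seen + 1) / (count + len(values[position]))
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

probs = posterior(("red", "round"))
print(max(probs, key=probs.get))  # apple
```

The per-feature factors are simply multiplied together, which is exactly the "naive" independence assumption.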
How Naive Bayes algorithm works
■ In a two-class learning task, the aim of SVM is to find the best classification function
to distinguish between members of the two classes in the training data.
■ For a linearly separable dataset, a linear classification function corresponds to a
separating hyperplane f (x ) that passes through the middle of the two classes,
separating the two.
Margin Maximization
■ In case of multiple classes, SVM works by classifying the data into different classes
by finding a line (hyperplane) which separates the training data set into classes.
■ As there are many such linear hyperplanes, SVM algorithm tries to maximize the
distance between the various classes that are involved and this is referred as
margin maximization.
■ If the line that maximizes the distance between the classes is identified, the
probability to generalize well to unseen data is increased.
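To illustrate what "margin" means, the sketch below measures the distance from the closest training point to two candidate separating lines and keeps the line with the larger margin. The points and both candidate lines are assumptions picked by hand; an actual SVM solver searches over all hyperplanes rather than a fixed shortlist.

```python
from math import sqrt

# Hypothetical, linearly separable 2-D training set with labels +1 / -1.
points = [(1.0, 1.0), (2.0, 1.5), (4.0, 4.0), (5.0, 3.5)]
labels = [-1, -1, 1, 1]

def margin(w, b, points, labels):
    """Smallest signed distance from any training point to the line w.x + b = 0.

    A negative value means the line misclassifies at least one point.
    """
    norm = sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Two candidate separating lines (assumed, not learned by an SVM solver).
candidates = {
    "x1 + x2 = 5.5": ((1.0, 1.0), -5.5),
    "x1 = 2.5": ((1.0, 0.0), -2.5),
}
for name, (w, b) in candidates.items():
    print(name, round(margin(w, b, points, labels), 3))

best = max(candidates, key=lambda name: margin(*candidates[name], points, labels))
print(best)  # x1 + x2 = 5.5
```

Both lines separate the classes, but the first leaves more room on either side, which is the property margin maximization looks for.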
SVM can also be used for
■ Linear SVMs: the training data, i.e. the classes, can be separated by a
hyperplane.
■ Non-linear SVMs: it is not possible to separate the training data using a
hyperplane in the original feature space, so a kernel function is used.
Applications
■ Risk assessment
■ Stock Market forecasting
■ Most commonly, SVM is used to compare the performance of a stock with other
stocks in the same sector. This helps companies make decisions about where they
want to invest.
Association Analysis
■ Association rule implies that if an item A occurs, then item B also occurs with a
certain probability.
The Apriori algorithm
■ The approach is to find frequent item sets from a transaction dataset and derive
association rules
■ A ratio is derived like out of the 100 people who purchased an apple, 85 people also
purchased an orange.
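The ratio described above is usually called the confidence of a rule, and the share of transactions containing an item set is its support. A small sketch with made-up baskets follows; it covers only these two measures, not the full Apriori search for frequent item sets.

```python
# Hypothetical transactions; each is the set of items in one basket.
transactions = [
    {"apple", "orange"},
    {"apple", "orange", "milk"},
    {"apple", "bread"},
    {"orange", "milk"},
    {"apple", "orange", "bread"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)

def support(itemset):
    """Fraction of transactions that contain the item set."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"apple", "orange"}))       # 0.6
print(confidence({"apple"}, {"orange"}))  # 0.75
```

Here 3 of the 4 baskets containing an apple also contain an orange, so the rule "apple -> orange" has confidence 0.75.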
Libraries - The Apriori algorithm
1. The EM algorithm
2. The k-means algorithm
3. k-nearest neighbor classification
The Expectation–Maximization algorithm
■ Search engines like Yahoo and Bing (to identify relevant results)
■ Data libraries
■ Google image search
k-nearest neighbor classification
■ Binary Logistic Regression – The most commonly used form of logistic regression, where the
categorical response has 2 possible outcomes, i.e. yes or no. Example –
Predicting whether a student will pass or fail an exam, predicting whether a student
will have low or high blood pressure, predicting whether a tumor is cancerous or not.
■ Multinomial Logistic Regression - Categorical response has 3 or more possible
outcomes with no ordering. Example- Predicting what kind of search engine (Yahoo,
Bing, Google, and MSN) is used by the majority of US citizens.
■ Ordinal Logistic Regression - Categorical response has 3 or more possible outcomes
with natural ordering. Example- How a customer rates the service and quality of food
at a restaurant based on a scale of 1 to 10.
Logistic Regression
■ It measures the relationship between the categorical dependent variable and one or
more independent variables by estimating probabilities using a logistic function,
which is the cumulative logistic distribution.
■ Logistic regression can be used in real-world applications such as:
– Credit Scoring
– Measuring the success rates of marketing campaigns
– Predicting the revenues of a certain product
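The logistic function mentioned above turns a linear combination of the inputs into a probability. A minimal sketch for the binary case, with hypothetical coefficients that were not fitted to any data:

```python
from math import exp

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_probability(features, coefficients, intercept):
    """P(y = 1 | features) for a binary logistic regression model."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return logistic(z)

# Hypothetical model: probability of passing an exam from hours studied.
p = predict_probability([4.0], coefficients=[1.5], intercept=-6.0)
print(p)  # 0.5, since z = -6.0 + 1.5 * 4.0 = 0
```

Inputs that push z above zero give probabilities above 0.5, and inputs that push it below zero give probabilities below 0.5.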
Boosting
■ In 1988, Kearns and Valiant posed an interesting question, i.e., whether a weak
learning algorithm that performs just slightly better than random guess could be
“boosted” into an arbitrarily accurate strong learning algorithm.
■ AdaBoost was born in response to this question. AdaBoost has given rise to
abundant research on theoretical aspects of ensemble methods, which can be
easily found in the machine learning and statistics literature.
■ It is worth mentioning that for their AdaBoost paper, Schapire and Freund won the
Gödel Prize, one of the most prestigious awards in theoretical computer
science, in 2003.
How Adaboost works
■ First, it assigns equal weights to all the training examples $(x_i, y_i)$, $i \in \{1, \ldots, m\}$. Denote
the distribution of the weights at the $t$-th learning round as $D_t$.
■ From the training set and $D_t$ the algorithm generates a weak or base learner $h_t : X \to Y$
by calling the base learning algorithm.
■ Then, it uses the training examples to test $h_t$, and the weights of the incorrectly
classified examples will be increased. Thus, an updated weight distribution $D_{t+1}$ is
obtained.
■ From the training set and $D_{t+1}$ AdaBoost generates another weak learner by calling the
base learning algorithm again.
■ Such a process is repeated for T rounds, and the final model is derived by weighted
majority voting of the T weak learners, where the weights of the learners are determined
during the training process.
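The rounds described above can be sketched with decision stumps (single threshold tests) as the weak learners. The ten-point data set is a hypothetical example on which no single stump is correct everywhere; the stump search and the learner weight formula (half the log-odds of the weighted accuracy) follow the usual textbook form of AdaBoost, which the slides do not spell out.

```python
from math import exp, log

# Hypothetical 1-D training set with labels in {-1, +1}.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def stump_predict(stump, x):
    """A decision stump: one threshold test, the simplest weak learner."""
    threshold, polarity = stump
    return polarity if x < threshold else -polarity

def fit_stump(weights):
    """Base learner: the stump with the lowest weighted training error."""
    best, best_error = None, float("inf")
    for threshold in [x + 0.5 for x in X[:-1]]:
        for polarity in (1, -1):
            stump = (threshold, polarity)
            error = sum(w for x, y, w in zip(X, Y, weights)
                        if stump_predict(stump, x) != y)
            if error < best_error:
                best, best_error = stump, error
    return best, best_error

def adaboost(rounds):
    m = len(X)
    weights = [1.0 / m] * m            # D_1: equal weights
    ensemble = []                      # list of (alpha_t, h_t)
    for _ in range(rounds):
        stump, error = fit_stump(weights)
        error = max(error, 1e-10)      # guard against division by zero
        alpha = 0.5 * log((1.0 - error) / error)
        ensemble.append((alpha, stump))
        # Increase the weights of misclassified examples, decrease the rest.
        weights = [w * exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(X, Y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]   # D_{t+1}
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the weak learners."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

ensemble = adaboost(rounds=3)
accuracy = sum(predict(ensemble, x) == y for x, y in zip(X, Y)) / len(X)
print(accuracy)  # 1.0
```

On this data the best single stump gets 7 of the 10 examples right, while the weighted vote of three stumps classifies all ten correctly.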
Artificial Neural Networks
■ Artificial Neural Networks are named so because they’re based on the structure and
functions of real biological neural networks.
■ Information flows through the network and in response, the neural network changes
based on the input and output.
■ Applications
– Character recognition (understanding human handwriting and converting it to
text)
– Image compression
– Stock market prediction
– Loan applications
Linear Discriminant Analysis
■ Linear discriminant analysis (LDA) and the related Fisher’s linear discriminant are
methods used in statistics, pattern recognition and machine learning to find a linear
combination of features which characterizes or separates two or more classes of
objects or events.
■ The resulting combination may be used as a linear classifier, or, more commonly, for
dimensionality reduction before later classification.
■ Quadratic discriminant analysis (QDA) is a more general discriminant function with quadratic
decision boundaries, which can be used to classify datasets with two or more classes.
Method
■ LDA is based upon the concept of searching for a linear combination of variables
(predictors) that best separates two classes (targets)
■ To capture the notion of separability, Fisher defined the following score function.
■ Given the score function, the problem is to estimate the linear coefficients that
maximize the score function.
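The score function itself did not survive in the slide text. As an assumption about what was shown, a common statement of Fisher's criterion for two classes is given below, where $\mathbf{w}$ is the vector of linear coefficients, $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the class means, and $S_W$ is the within-class scatter matrix (all symbols introduced here for illustration):

```latex
J(\mathbf{w}) =
  \frac{\left(\mathbf{w}^{\top}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\right)^{2}}
       {\mathbf{w}^{\top} S_W \, \mathbf{w}},
\qquad
\mathbf{w}^{*} \propto S_W^{-1}\,(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)
```

The numerator rewards projections that push the class means apart, the denominator penalizes projections along which each class is widely spread, and the maximizing coefficients have the closed form on the right.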
■ One way of assessing the effectiveness of the discrimination is to calculate
the Mahalanobis distance between the two groups. A distance greater than 3 means
that the two group averages differ by more than 3 standard deviations, so the
overlap (probability of misclassification) is quite small.
Predictors Contribution
■ A simple linear correlation between the model scores and predictors can be used to
test which predictors contribute significantly to the discriminant function. Correlation
varies from -1 to 1, with -1 and 1 meaning the highest contribution but in opposite
directions, and 0 meaning no contribution at all.
Applications of LDA
■ The procedure starts off with initial values for the coefficient or coefficients for the
function. These could be 0.0 or a small random value.
■ The cost of the coefficients is evaluated by plugging them into the function and
calculating the cost
■ The derivative of the cost is calculated. The derivative is a concept from calculus
and refers to the slope of the function at a given point. We need to know the slope
so that we know the direction (sign) to move the coefficient values in order to get a
lower cost on the next iteration
■ Now that we know from the derivative which direction is downhill, we can now
update the coefficient values.
Cont.
■ A learning rate parameter (alpha) must be specified that controls how much the
coefficients can change on each update.
■ delta = derivative(cost)
■ coefficient = coefficient - (alpha * delta)
■ This process is repeated until the cost of the coefficients (cost) is 0.0 or close
enough to zero to be good enough.
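The update rule above can be run end to end on a toy problem. The quadratic cost function, the starting value and the learning rate below are all assumptions chosen so that the minimum is known in advance.

```python
def cost(coefficient):
    """Hypothetical cost function with its minimum (cost 0) at coefficient = 3."""
    return (coefficient - 3.0) ** 2

def derivative(coefficient):
    """Slope of the cost at the given coefficient."""
    return 2.0 * (coefficient - 3.0)

coefficient = 0.0   # initial value
alpha = 0.1         # learning rate

for iteration in range(1000):
    if cost(coefficient) < 1e-12:      # close enough to zero
        break
    delta = derivative(coefficient)
    coefficient = coefficient - (alpha * delta)

print(round(coefficient, 4))  # 3.0
```

A larger alpha takes bigger steps and can overshoot the minimum, while a smaller one needs more iterations to get close.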
Applications
■ Traditionally, feature learning methods have largely sought to learn models that
provide good approximations of the true data distribution
■ Sparse Filtering is a form of unsupervised feature learning that learns a sparse
representation of the input data without directly modelling it.
■ It has only one hyperparameter, the number of features to learn.
■ Sparse filtering scales gracefully to handle high-dimensional inputs.
t-SNE to visualize multidimensional datasets
■ t-SNE stands for t-Distributed Stochastic Neighbour Embedding and its main aim is
that of dimensionality reduction.
■ The dimensionality of a set of images is the number of pixels in any image, which
ranges from thousands to millions. We need to reduce the dimensionality of a
dataset from an arbitrary number to two or three.
■ Stochastic neighbour embedding techniques compute an N × N similarity matrix in
both the original data space and the low-dimensional embedding space. These are
called the similarity matrices.
Contd.
■ The distribution over pairs of objects is defined such that pairs of similar objects
have a high probability under the distribution, whilst pairs of dissimilar points have a
low probability.
■ The probabilities are generally given by a normalized Gaussian or Student-t kernel
computed from the data space or from the embedding space.
■ The low-dimensional embedding is learned by minimizing the Kullback-Leibler
divergence between the two probability distributions (computed in the original data
space and the embedding space) with respect to the locations of the points in the
embedding space.
■ This is the topic of manifold learning, also called nonlinear dimensionality reduction,
a branch of machine learning (more specifically, unsupervised learning).
■ It is still an active area of research today and tries to develop algorithms that can
automatically recover a hidden structure in a high-dimensional dataset.
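The Kullback-Leibler divergence that the embedding minimizes can be illustrated on small discrete distributions. The three-entry probability vectors below are made up; in t-SNE the distributions range over all pairs of points, and the embedding coordinates are adjusted by gradient descent rather than compared from a fixed shortlist.

```python
from math import log

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical pair probabilities: P from the data space, Q from two embeddings.
P = [0.70, 0.20, 0.10]
Q_good = [0.65, 0.25, 0.10]   # similar pairs stay similar
Q_poor = [0.10, 0.20, 0.70]   # similar pairs end up far apart

print(kl_divergence(P, P))  # 0.0
print(kl_divergence(P, Q_good) < kl_divergence(P, Q_poor))  # True
```

An embedding whose pair probabilities match those of the original space has a divergence near zero, so lower values indicate that neighbourhoods were preserved better.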
LSTMs for Time Series and Sequences
■ A usual RNN (Recurrent Neural Network) has only a short-term memory. Combined
with LSTM units, it also has a long-term memory
■ An LSTM unit is composed of a cell, an input gate, an output gate and a forget gate.
■ The cell remembers values over arbitrary time intervals and the three gates regulate
the flow of information into and out of the cell.
■ LSTMs enable Recurrent Neural Networks to remember their inputs over a long
period of time.
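A single LSTM unit with scalar input and state can be written out directly to show how the three gates act on the cell. The equations follow the standard LSTM formulation; the weights are hypothetical and untrained, chosen so that the forget gate stays nearly open and the input gate nearly closed.

```python
from math import exp, tanh

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM with scalar input and state."""
    def gate(name, activation):
        w, u, b = params[name]          # input weight, recurrent weight, bias
        return activation(w * x + u * h_prev + b)

    f = gate("forget", sigmoid)         # how much of the old cell value to keep
    i = gate("input", sigmoid)          # how much new information to write
    o = gate("output", sigmoid)         # how much of the cell to expose
    g = gate("candidate", tanh)         # proposed new cell content
    c = f * c_prev + i * g              # updated cell (long-term memory)
    h = o * tanh(c)                     # updated hidden state (output)
    return h, c

# Hypothetical, untrained weights chosen only for illustration.
params = {
    "forget": (0.0, 0.0, 5.0),      # bias pushes the forget gate towards 1 (keep)
    "input": (0.0, 0.0, -5.0),      # bias pushes the input gate towards 0 (ignore)
    "output": (0.0, 0.0, 5.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0                     # start with a value stored in the cell
for x in [0.3, -0.8, 0.5, 0.1, -0.2]:
    h, c = lstm_step(x, h, c, params)
print(round(c, 3))
```

With these gate settings the value stored in the cell is carried almost unchanged across the five inputs, which is the long-term memory behaviour described above; a trained network learns when to open and close each gate.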
MCMC and Metropolis Algorithm
■ The Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method
for obtaining a sequence of random samples from multi-dimensional distributions,
especially when the number of dimensions is high.
■ The algorithm proceeds by generating random numbers from a uniform distribution
and using them in an accept-or-reject criterion.
■ If the criterion is met, a transition is made according to a stochastic transition
matrix.
■ It relies on the ergodicity of the Markov process to ensure that the
probability of reaching any point in the space is greater than zero.
■ A stochastic process is said to be ergodic if its statistical properties can be deduced
from a single, sufficiently long, random sample of the process.
■ The reasoning is that any collection of random samples from a process must
represent the average statistical properties of the entire process.
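A minimal random-walk Metropolis sampler shows the accept-or-reject step in practice. The target (an unnormalised standard normal density), the step size and the seed are assumptions for illustration; because the proposal is symmetric, the Hastings correction term drops out.

```python
import random
from math import exp
from statistics import mean, pvariance

def target_density(x):
    """Unnormalised standard normal density (the hypothetical target)."""
    return exp(-0.5 * x * x)

def metropolis(n_samples, step=1.0, seed=0):
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)    # symmetric proposal
        accept_probability = min(1.0, target_density(proposal) / target_density(x))
        if rng.random() < accept_probability:      # uniform draw decides
            x = proposal                           # accepted: move to the proposal
        samples.append(x)                          # rejected: current x is counted again
    return samples

samples = metropolis(20000)
print(round(mean(samples), 2), round(pvariance(samples), 2))
```

The sample mean and variance should come out close to 0 and 1, the values for the target distribution, even though the sampler only ever evaluates density ratios.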
■ The End