
PRIMER ON MAJOR

DATA MINING
ALGORITHMS
Vikram Singh Sankhala
What is Data Mining

■ Data mining is the process of discovering or extracting new patterns from large data
sets involving methods from
– Statistics
– Artificial intelligence.
Major Techniques Used

■ Classification
■ Regression
■ Clustering
■ Association Rules
■ Principal Components Analysis
Supervised Learning

■ Regression (Multiple Linear Regression, Polynomial Regression, SVR, Decision Tree Regression, Random Forest Regression, etc.)
■ Classification (Logistic Regression, K-NN, SVM, Kernel SVM, Naïve Bayes, Decision Tree, Random Forest, etc.)
Unsupervised Learning

■ Clustering (K-means, Hierarchical Clustering)
■ Association Rules (Apriori, Eclat)
Support Vector Machine
• “Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems.
• More broadly, the data mining process includes collecting, exploring and selecting the right data.
• Support Vector Machines are based on the concept of decision planes that define decision boundaries; a minimal usage sketch follows below.
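• As an illustration (not from the slides), a minimal sketch of a linear SVM classifier using the scikit-learn library; the feature matrix X and labels y are placeholder toy data:

    # Minimal linear SVM sketch using scikit-learn (placeholder toy data).
    from sklearn.svm import SVC

    X = [[0, 0], [1, 1], [2, 2], [3, 3]]   # toy feature vectors
    y = [0, 0, 1, 1]                        # toy class labels

    clf = SVC(kernel="linear")              # linear decision plane
    clf.fit(X, y)                           # learn the separating hyperplane
    print(clf.predict([[1.5, 1.5]]))        # classify a new point
    print(clf.support_vectors_)             # support vectors that define the margin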
Advantage of SVM
• Performance is good on linearly separable problems

Disadvantage of SVM
• It doesn't work on nonlinear problems (you need Kernel SVM) and
you cannot get the probabilities of the classes.
Support Vector Machine Application
• SVM has been used successfully in many real-world problems
- text (and hypertext) categorization
- image classification
- bioinformatics (Protein classification,
Cancer classification)
- hand-written character recognition
Naïve Bayes Algorithm
• It is a classification technique based on Bayes’ Theorem with an
assumption of independence among predictors.
• The Naive Bayes model is easy to build and particularly useful for very large data sets (large in the number of observations, not the number of features).
Advantage of Naïve Bayes
• You can get the probabilities of the classes, and it works on non-linear problems

Disadvantage of Naïve Bayes


• It doesn't work properly on datasets with many features
Naïve Bayes Application
• Spam Classification
• Given an email, predict whether it is spam or not

• Medical Diagnosis
• Given a list of symptoms, predict whether a patient has disease X
or not

• Weather
• Based on temperature, humidity, etc… predict if it will rain
tomorrow

• Text classification/ Spam Filtering/ Sentiment Analysis


Decision Tree and Random Forest
• Decision Tree - Decision tree is a type of supervised learning
algorithm (having a pre-defined target variable) that is mostly used
in classification problems.
• It works for both categorical and continuous input and output
variables.
Random Forest
• Random forest is an ensemble classifier made using many decision trees
• Ensemble Model – combines the results from different models and produces better results.
Advantage and Disadvantage of DT and Random Forest

• Advantage –
• Easy to Understand
• Useful in Data exploration
• Less data cleaning required
• Handle both numerical and categorical variables

• Disadvantage –
• Overfitting
• Less suited to continuous variables (information is lost when they are binned at the splits)
Application of DT and Random Forest
• Astronomy:
• star-galaxy classification, determining galaxy counts.

• Biomedical Engineering:
• Decision trees are used to identify features to be used in implantable devices

• Pharmacology:
• Use of tree based classification for drug analysis

• Manufacturing:
• Chemical material evaluation for manufacturing and production

• Medicine:
• Analysis of the Sudden Infant Death Syndrome
Which Model….?
• DT - when you want a clear interpretation of your model results
• Random Forest - when you are just looking for high performance
with less need for interpretation
• SVM - when your business problem is a linear problem (with a linearly
separable dataset)
• Naive Bayes - when you want your business problem to be based on
a probabilistic approach. For example if you want to rank your
customers from the highest probability to buy a certain product, to
the lowest.
Cluster Analysis (Unsupervised Learning)
• Clustering analysis is the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are more similar
(in some sense or another) to each other than to those in other
groups (clusters).
Advantage of Clustering (K-means)
• If the number of variables is large, K-Means is most of the time computationally faster than hierarchical clustering
• K-Means produces tighter clusters than hierarchical clustering

Disadvantage of Clustering (K-means)


• It is difficult to choose the value of K in advance
Clustering Application
• Marketing:
• Discovering distinct groups in customer databases.

• Insurance:
• Identifying groups of crop insurance policy holders with a high average claim rate, e.g. farmers who deliberately crash crops when it is “profitable”.

• Land use:
• Identification of areas of similar land use in a GIS database.

• Seismic studies:
• Identifying probable areas for oil/gas exploration based on
seismic data.
Classification

1. Decision trees
2. CART: Classification and Regression Trees
3. Ruleset classifiers
4. Ensemble Classifiers
5. Support vector machines
6. Naive Bayes
Decision trees

■ A decision tree builds classification or regression models in the form of a tree structure.
■ Decision nodes and leaf nodes
■ Decision node has two or more branches
■ Leaf node represents a classification or decision
■ The algorithms used in decision trees are ID3, C4.5, CART, C5.0, CHAID, QUEST, CRUISE, etc.

■ The splitting of nodes is decided by criteria such as information gain, chi-square, and the Gini index.
■ ID3, or Iterative Dichotomiser 3, was the first of three decision tree implementations developed by Ross Quinlan.
■ The ID3 algorithm uses a greedy search. It selects a test using the information gain
criterion (Minimizing Shannon Entropy), and then never explores the possibility of
alternate choices.
'Greedy Algorithm'?

■ Makes a locally-optimal choice in the hope that this choice will lead to a globally-
optimal solution.
■ A code is a mapping from a “string” (a finite sequence of letters) to a finite sequence of binary digits.
■ The goal of compression algorithms is to encode strings with the smallest sequence of binary digits.
■ Shannon entropy gives the optimal compression rate, which can be approached but not improved.
■ Information gain is the reduction in entropy produced by a split, so it is inversely related to the entropy that remains after the split.
■ The greedy algorithm is used at each node to choose the next split.
Information Gain and Shannon Entropy

■ Suppose you need to uncover a certain English word of five letters.


■ You manage to obtain one letter, namely an e. This is useful information, but the
letter e is common in English, so this provides little information.
■ If, on the other hand, the letter that you discover is j (the least common in English),
the search has been more narrowed and you have obtained more information.
■ The unit for the information gain is the bit.
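■ To make the entropy and information-gain idea concrete, a small hand-rolled sketch (pure Python, hypothetical class labels) that computes the Shannon entropy of a label set and the information gain of a candidate split:

    # Shannon entropy and information gain for a candidate split (toy labels).
    from math import log2
    from collections import Counter

    def entropy(labels):
        """H = -sum(p * log2(p)) over the class proportions."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, left, right):
        """Reduction in entropy achieved by splitting parent into left/right."""
        n = len(parent)
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(parent) - weighted

    parent = ["yes"] * 5 + ["no"] * 5              # hypothetical class labels
    left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4
    print(entropy(parent))                          # 1.0 bit for a 50/50 split
    print(information_gain(parent, left, right))    # gain in bits from this split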
CART

■ Classification Trees
■ Regression Trees
Classification Trees

■ These are considered the default kind of decision trees, used to separate a dataset into different classes based on the response variable. They are generally used when the response variable is categorical in nature.
Regression Trees

■ When the response or target variable is continuous or numerical, regression trees are used. They are generally used in predictive (numeric estimation) problems rather than classification.
C5.0 model

■ A C5.0 algorithm is used to build either a decision tree or a rule set


■ A C5.0 model works by splitting the sample based on the field that provides the
maximum information gain.
Applications of Decision Tree Machine
Learning Algorithm
■ Decision trees are among the popular machine learning algorithms that find great
use in finance for option pricing.
■ Remote sensing is an application area for pattern recognition based on decision
trees.
■ Decision tree algorithms are used by banks to classify loan applicants by their
probability of defaulting payments.
Libraries

■ The Data Science libraries in Python language to implement Decision Tree Machine
Learning Algorithm are – SciPy and Sci-Kit Learn.
■ The Data Science library in R language to implement the Decision Tree Machine Learning Algorithm is caret.
Random Forest

■ It is a type of ensemble machine learning algorithm based on Bootstrap Aggregation or bagging.
Bootstrap Method

■ Let’s assume we have a sample of 100 values (x) and we’d like to get an estimate of
the mean of the sample
■ Create many (e.g. 1000) random sub-samples of our dataset with replacement
(meaning we can select the same value multiple times).
■ Calculate the mean of each sub-sample.
■ Calculate the average of all of our collected means and use that as our estimated
mean for the data.
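■ A minimal numpy sketch of the bootstrap procedure described above (100 toy values, 1000 resamples; all figures are just the example numbers from the slide):

    # Bootstrap estimate of the mean (toy data, figures taken from the slide).
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=50, scale=10, size=100)      # a sample of 100 values

    boot_means = []
    for _ in range(1000):                            # many random sub-samples
        resample = rng.choice(x, size=len(x), replace=True)  # with replacement
        boot_means.append(resample.mean())           # mean of each sub-sample

    print(np.mean(boot_means))                       # bootstrap estimate of the mean
    print(np.std(boot_means))                        # and its standard error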
Bootstrap Aggregation (Bagging)

■ Bagging of the CART algorithm would work as follows.


– Create many (e.g. 100) random sub-samples of our dataset with replacement.
– Train a CART model on each sample.
– Given a new dataset, calculate the average prediction from each model.
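■ A hedged sketch of bagging CART models by hand with scikit-learn decision trees (scikit-learn also ships a ready-made BaggingClassifier; the toy data here is purely illustrative):

    # Bagging of CART (decision tree) models by hand on toy data.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))                    # toy features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy labels

    models = []
    for _ in range(100):                             # many bootstrap sub-samples
        idx = rng.integers(0, len(X), size=len(X))   # sample rows with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    X_new = rng.normal(size=(5, 4))
    # Average the predictions of all trees and round to the majority class.
    avg = np.mean([m.predict(X_new) for m in models], axis=0)
    print((avg >= 0.5).astype(int))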
Applications of Random Forest
Algorithms
■ Random Forest algorithms are used by banks to predict if a loan applicant is a likely
high risk.
■ They are used in the automobile industry to predict the failure or breakdown of a
mechanical part.
■ These algorithms are used in the healthcare industry to predict if a patient is likely
to develop a chronic disease or not.
■ They can also be used for regression tasks like predicting the average number of
social media shares and performance scores.
■ Recently, the algorithm has also made way into predicting patterns in speech
recognition software and classifying images and texts.
Random Forest and CART

■ Even with Bagging, the decision trees (CART) can have a lot of structural similarities
and in turn have high correlation in their predictions.
■ To reduce the correlation between the trees, the random forest algorithm changes the procedure so that the learning algorithm is limited at each split to a random sample of the features to search over.
■ The number of features that can be searched at each split point (m) must be
specified as a parameter to the algorithm.
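■ In scikit-learn this per-split feature sample size (m) is exposed as the max_features parameter; a short sketch on toy data (the values chosen are placeholders):

    # Random forest with the per-split feature sample size (m) set explicitly.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))                   # toy features
    y = (X[:, 0] - X[:, 3] > 0).astype(int)          # toy labels

    forest = RandomForestClassifier(
        n_estimators=100,    # number of bagged trees
        max_features=3,      # m: features considered at each split point
        random_state=0,
    )
    forest.fit(X, y)
    print(forest.predict(X[:5]))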
Libraries

■ The Data Science library in Python language to implement the Random Forest Machine Learning Algorithm is Sci-Kit Learn.
■ The Data Science library in R language to implement the Random Forest Machine Learning Algorithm is randomForest.
Naïve Bayes

■ A Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature.
■ For example, a fruit may be considered to be an apple if it is red, round, and about 3
inches in diameter.
■ Even if these features depend on each other or upon the existence of the other
features, all of these properties independently contribute to the probability that this
fruit is an apple
■ That is why it is known as ‘Naive’.
■ This algorithm is mostly used in text classification and with problems having multiple
classes.
How Naive Bayes algorithm works

■ Step 1: Convert the data set into a frequency table


■ Step 2: Create Likelihood table by finding the probabilities
■ Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for
each class.
■ Step 4: The class with the highest posterior probability is the outcome of prediction.
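■ A tiny hand-rolled sketch of those four steps for a single categorical feature (hypothetical “Weather → Play” counts), using only the standard library:

    # Naive Bayes by hand for one categorical feature (hypothetical counts).
    from collections import Counter

    # Step 1: the data set as (feature, class) pairs -> frequency table.
    data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
            ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "Yes")]
    freq = Counter(data)
    class_counts = Counter(c for _, c in data)

    # Step 2: likelihood table P(feature | class).
    def likelihood(feature, cls):
        return freq[(feature, cls)] / class_counts[cls]

    # Step 3: posterior (up to a constant) = P(feature | class) * P(class).
    def posterior(feature, cls):
        prior = class_counts[cls] / len(data)
        return likelihood(feature, cls) * prior

    # Step 4: the class with the highest posterior is the prediction.
    print(max(class_counts, key=lambda c: posterior("Sunny", c)))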
Applications of Naive Bayes Algorithms

■ Real time Prediction


■ Multi class Prediction
■ Text classification / Spam Filtering / Sentiment Analysis
■ Recommendation Systems: a Naive Bayes classifier and Collaborative Filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not
■ Disease prediction
■ Document classification
Support Vector Machines

■ In a two-class learning task, the aim of SVM is to find the best classification function
to distinguish between members of the two classes in the training data.
■ For a linearly separable dataset, a linear classification function corresponds to a separating hyperplane f(x) that passes through the middle of the two classes, separating the two.
Margin Maximization

■ In case of multiple classes, SVM works by classifying the data into different classes
by finding a line (hyperplane) which separates the training data set into classes.
■ As there are many such linear hyperplanes, the SVM algorithm tries to maximize the distance between the classes involved; this is referred to as margin maximization.
■ If the line that maximizes the distance between the classes is identified, the
probability to generalize well to unseen data is increased.
SVM can also be used for

■ Regression – by minimizing the error between the actual and predicted values to be within a margin epsilon
■ Ranking
SVMs are classified into two categories:

■ Linear SVMs – in linear SVMs the training data (classes) are separated by a hyperplane.
■ Non-Linear SVMs – in non-linear SVMs it is not possible to separate the training data using a single hyperplane.
Applications

■ Risk assessment
■ Stock Market forecasting
■ Most commonly, SVM is used to compare the performance of a stock with other
stocks in the same sector. This helps companies make decisions about where they
want to invest.
Association Analysis

■ An association rule implies that if an item A occurs, then item B also occurs with a certain probability.
The Apriori algorithm

■ The approach is to find frequent item sets from a transaction dataset and derive
association rules
■ A ratio is derived, for example: out of the 100 people who purchased an apple, 85 people also purchased an orange.
Libraries - The Apriori algorithm

■ Data Science Libraries in Python to implement the Apriori Machine Learning Algorithm – there are Python implementations of Apriori on PyPI; a sketch using one of them follows below.
■ Data Science Libraries in R to implement the Apriori Machine Learning Algorithm – arules
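■ One such PyPI package is mlxtend (an assumption here, not named on the slide); a minimal sketch of mining frequent itemsets and rules from a tiny one-hot basket table:

    # Apriori sketch using the mlxtend package (one of several PyPI implementations).
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoded market baskets: each row is a transaction.
    baskets = pd.DataFrame({
        "apple":  [1, 1, 1, 0, 1],
        "orange": [1, 1, 0, 0, 1],
        "flour":  [0, 1, 1, 1, 0],
    }, dtype=bool)

    frequent = apriori(baskets, min_support=0.4, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
    print(rules[["antecedents", "consequents", "support", "confidence"]])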
Applications of Apriori Algorithm

■ Detecting Adverse Drug Reactions


– Apriori algorithm is used for association analysis on healthcare data like-the drugs taken by
patients, characteristics of each patient, adverse ill-effects patients experience, initial
diagnosis, etc. This analysis produces association rules that help identify the combination of
patient characteristics and medications that lead to adverse side effects of the drugs.
■ Market Basket Analysis
– Many e-commerce giants like Amazon use Apriori to draw data insights on which products are
likely to be purchased together and which are most responsive to promotion. For example, a
retailer might use Apriori to predict that people who buy sugar and flour are likely to buy eggs
to bake a cake.
■ Auto-Complete Applications
– Google auto-complete is another popular application of Apriori wherein - when the user types
a word, the search engine looks for other associated words that people usually type after a
specific word.
clustering

1. The EM algorithm
2. The k-means algorithm
3. k-nearest neighbor classification
The Expectation–Maximization algorithm

■ The EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters.
■ The EM clustering algorithm then computes probabilities of cluster memberships
based on one or more of the mixture of probability distributions.
The k-means algorithm

1. Randomly select k cluster centers.
2. Calculate the distance between each data point and the cluster centers.
3. Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.
4. Recalculate the new cluster centers (the algorithm aims at minimizing an objective function known as the squared error function).
5. Recalculate the distance between each data point and the newly obtained cluster centers.
6. If no data point was reassigned then stop, otherwise repeat from step (3).
7. This learning algorithm requires prior specification of the number of cluster centers, k. A sketch follows below.
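■ A minimal scikit-learn sketch of this loop; the library runs the assign/recompute iterations internally, and k is the n_clusters parameter (the 2-D points below are placeholders):

    # k-means sketch with scikit-learn (toy 2-D points, k chosen up front).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0, 1, (50, 2)),      # one blob around (0, 0)
                        rng.normal(5, 1, (50, 2))])     # another blob around (5, 5)

    km = KMeans(n_clusters=2, n_init=10, random_state=0)  # k must be specified
    labels = km.fit_predict(points)       # assign each point to its nearest center
    print(km.cluster_centers_)            # the final, recalculated centers
    print(km.inertia_)                    # the squared-error objective function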
Applications

■ Search engines like Yahoo and Bing (to identify relevant results)
■ Data libraries
■ Google image search
k-nearest neighbor classification

■ Used for classification and regression


■ The number k will have to be specified
■ The kNN algorithm will search through the training dataset for the k-most similar
instances.
■ This is a process of calculating the distance to all instances and selecting the subset with the smallest distance values.
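■ A brief scikit-learn sketch of k-nearest-neighbour classification; k is the n_neighbors parameter and the data is a toy placeholder:

    # k-nearest neighbour classification sketch (toy data).
    from sklearn.neighbors import KNeighborsClassifier

    X = [[0], [1], [2], [8], [9], [10]]      # toy 1-D feature values
    y = [0, 0, 0, 1, 1, 1]                   # toy class labels

    knn = KNeighborsClassifier(n_neighbors=3)   # k must be specified
    knn.fit(X, y)                               # stores the training instances
    print(knn.predict([[1.5], [8.5]]))          # votes among the 3 nearest neighbours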
Applications

■ Pattern recognition (e.g. predicting how cancer may spread)
■ Statistical estimation (e.g. predicting whether someone may default on a loan)
Linear Regression

■ “Ordinary least squares” strategy
■ Draw a line, and then for each of the data points, measure the vertical distance between the point and the line, square it, and add these squared distances up.
■ The fitted line is the one for which this sum of squared distances is as small as possible, as in the sketch below.
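■ A short numpy sketch of the ordinary-least-squares fit (the x and y values are toy numbers):

    # Ordinary least squares fit of a line y = a*x + b (toy data).
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

    # Stack a column of ones so the intercept b is estimated too.
    A = np.column_stack([x, np.ones_like(x)])
    (a, b), residuals, *_ = np.linalg.lstsq(A, y, rcond=None)

    print(a, b)                                   # fitted slope and intercept
    print(np.sum((y - (a * x + b)) ** 2))         # minimized sum of squared distances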
Logistic Regression

■ Binary Logistic Regression – the most commonly used logistic regression, when the categorical response has 2 possible outcomes, i.e. either yes or no. Example – predicting whether a student will pass or fail an exam, whether a patient will have low or high blood pressure, or whether a tumor is cancerous or not.
■ Multinomial Logistic Regression – the categorical response has 3 or more possible outcomes with no ordering. Example – predicting which search engine (Yahoo, Bing, Google, or MSN) is used by the majority of US citizens.
■ Ordinal Logistic Regression - Categorical response has 3 or more possible outcomes
with natural ordering. Example- How a customer rates the service and quality of food
at a restaurant based on a scale of 1 to 10.
Logistic Regression

■ It measures the relationship between the categorical dependent variable and one or
more independent variables by estimating probabilities using a logistic function,
which is the cumulative logistic distribution.
■ Logistic regression can be used in real-world applications such as:
– Credit Scoring
– Measuring the success rates of marketing campaigns
– Predicting the revenues of a certain product
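■ A minimal scikit-learn sketch of binary logistic regression (toy data; the predicted probabilities come from the fitted logistic function):

    # Binary logistic regression sketch (toy pass/fail data).
    from sklearn.linear_model import LogisticRegression

    X = [[1], [2], [3], [4], [5], [6]]      # toy predictor, e.g. hours studied
    y = [0, 0, 0, 1, 1, 1]                  # toy outcome, e.g. fail/pass

    model = LogisticRegression()
    model.fit(X, y)
    print(model.predict([[3.5]]))           # predicted class
    print(model.predict_proba([[3.5]]))     # probabilities from the logistic function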
Boosting

■ In 1988, Kearns and Valiant posed an interesting question: whether a weak learning algorithm that performs just slightly better than random guessing could be “boosted” into an arbitrarily accurate strong learning algorithm.
■ AdaBoost was born in response to this question. AdaBoost has given rise to abundant research on theoretical aspects of ensemble methods, which can easily be found in the machine learning and statistics literature.
■ It is worth mentioning that for their AdaBoost paper, Schapire and Freund won the Gödel Prize, one of the most prestigious awards in theoretical computer science, in 2003.
How Adaboost works

■ First, it assigns equal weights to all the training examples (xi, yi), i ∈ {1, ..., m}. Denote the distribution of the weights at the t-th learning round as Dt.
■ From the training set and Dt, the algorithm generates a weak or base learner ht : X → Y by calling the base learning algorithm.
■ Then, it uses the training examples to test ht, and the weights of the incorrectly classified examples are increased. Thus, an updated weight distribution Dt+1 is obtained.
■ From the training set and Dt+1, AdaBoost generates another weak learner by calling the base learning algorithm again.
■ Such a process is repeated for T rounds, and the final model is derived by weighted majority voting of the T weak learners, where the weights of the learners are determined during the training process. A sketch follows below.
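■ A scikit-learn sketch of this loop: AdaBoostClassifier reweights the examples and combines T weak learners (decision stumps by default) by weighted majority voting; the data and T = 50 are toy choices:

    # AdaBoost sketch: T reweighted weak learners combined by weighted voting.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))                        # toy features
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # toy nonlinear labels

    # Default base learner is a shallow decision tree (a "stump").
    ada = AdaBoostClassifier(n_estimators=50, random_state=0)   # T rounds
    ada.fit(X, y)
    print(ada.score(X, y))          # training accuracy of the boosted ensemble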
Artificial Neural Networks

■ Artificial Neural Networks are named so because they’re based on the structure and
functions of real biological neural networks.
■ Information flows through the network and in response, the neural network changes
based on the input and output.
■ Applications
– Character recognition (understanding human handwriting and converting it to
text)
– Image compression
– Stock market prediction
– Loan applications
Linear Discriminant Analysis
■ Linear discriminant analysis (LDA) and the related Fisher’s linear discriminant are
methods used in statistics, pattern recognition and machine learning to find a linear
combination of features which characterizes or separates two or more classes of
objects or events.
■ The resulting combination may be used as a linear classifier, or, more commonly, for
dimensionality reduction before later classification.
■ QDA (quadratic discriminant analysis) is a more general discriminant function with quadratic decision boundaries, which can be used to classify datasets with two or more classes.
Method

■ LDA is based upon the concept of searching for a linear combination of variables
(predictors) that best separates two classes (targets)
■ To capture the notion of separability, Fisher defined a score function: the ratio of the between-class separation to the within-class variance of the projected data.
■ Given the score function, the problem is to estimate the linear coefficients that
maximize the score function.
■ One way of assessing the effectiveness of the discrimination is to calculate the Mahalanobis distance between the two groups. A distance greater than 3 means that the two group means differ by more than 3 standard deviations, so the overlap (probability of misclassification) is quite small.
Predictors Contribution
■ A simple linear correlation between the model scores and the predictors can be used to test which predictors contribute significantly to the discriminant function. Correlation varies from -1 to 1, with -1 and 1 meaning the highest contribution (in opposite directions) and 0 meaning no contribution at all.
Applications of LDA

■ Bankruptcy prediction: In bankruptcy prediction based on accounting ratios and other financial variables, linear discriminant analysis was the first statistical method applied to systematically explain which firms entered bankruptcy vs. survived.
■ Marketing: In marketing, discriminant analysis was once often used to determine
the factors which distinguish different types of customers and/or products on the
basis of surveys or other forms of collected data.
■ Biomedical studies: The main application of discriminant analysis in medicine is the
assessment of severity state of a patient and prognosis of disease outcome.
The Gradient Descent algorithm

■ Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost).
■ The goal is to continue to try different values for the coefficients, evaluate their cost
and select new coefficients that have a slightly better (lower) cost.
How it Works

■ The procedure starts off with initial values for the coefficient or coefficients for the
function. These could be 0.0 or a small random value.
■ The cost of the coefficients is evaluated by plugging them into the function and
calculating the cost
■ The derivative of the cost is calculated. The derivative is a concept from calculus
and refers to the slope of the function at a given point. We need to know the slope
so that we know the direction (sign) to move the coefficient values in order to get a
lower cost on the next iteration
■ Now that we know from the derivative which direction is downhill, we can now
update the coefficient values.
Cont.

■ A learning rate parameter (alpha) must be specified that controls how much the
coefficients can change on each update.
■ delta = derivative(cost)
■ coefficient = coefficient - (alpha * delta)
■ This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to be good enough, as in the sketch below.
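■ A pure-Python sketch of that loop for a one-coefficient cost, cost(c) = (c - 3)^2, a hypothetical function chosen only to illustrate the update rule:

    # Gradient descent on a toy cost function cost(c) = (c - 3)**2.
    def cost(c):
        return (c - 3.0) ** 2

    def derivative(c):
        return 2.0 * (c - 3.0)          # slope of the cost at c

    coefficient = 0.0                   # initial value (could also be a small random value)
    alpha = 0.1                         # learning rate

    while cost(coefficient) > 1e-8:     # repeat until the cost is close enough to zero
        delta = derivative(coefficient)
        coefficient = coefficient - (alpha * delta)

    print(coefficient)                  # converges towards 3.0, the minimizer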
Applications

■ Common examples of algorithms with coefficients that can be optimized using gradient descent are:
– Linear Regression and
– Logistic Regression.
State of the Art Algorithms

■ XGBoost for Classification and Regression.


■ Convolutional Neural Networks for Image Classification.
■ DBSCAN for Clustering
■ Collaborative Filtering for Recommender Systems
■ SVD++ for Recommender Systems
■ NMF for Dimensionality Reduction
■ Deep Autoencoders for deep learning systems and to find the best set of features to represent a dataset
■ Sparse Filtering for Representation
■ Hash Kernels for Representation
■ T-SNE to visualize multidimensional datasets
■ LSTMs for Time Series and Sequences. Applications in Sentiment Analysis.
■ MCMC and Metropolis Hastings Algorithm.
XGBoost for Classification and
Regression
■ The XGBoost library implements the gradient boosting decision tree algorithm.
■ Boosting is an ensemble technique where new models are added to correct the
errors made by existing models. Models are added sequentially until no further
improvements can be made.
■ It gives more weight to the misclassified points sequentially for every model.
■ The Final Model is a weighted combination of the weak classifiers
■ You are updating your model using gradient descent and hence the name, gradient
boosting.
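■ A minimal sketch with the xgboost Python package; the data and hyperparameter values are placeholders, not recommendations:

    # Gradient-boosted trees with the xgboost package (toy data, placeholder settings).
    import numpy as np
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))                        # toy features
    y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)    # toy labels

    model = XGBClassifier(
        n_estimators=200,     # trees added sequentially to correct earlier errors
        learning_rate=0.1,    # shrinks each new tree's contribution
        max_depth=3,
    )
    model.fit(X, y)
    print(model.predict(X[:5]))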
Convolutional Neural Networks for
Image Classification
■ CNNs have wide applications in image and video recognition, recommender systems
and natural language processing.
■ CNNs, like other neural networks, are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, passes it through an activation function and responds with an output.
■ Convolutional networks perform optical character recognition (OCR) to digitize text
and make natural-language processing possible on analog and hand-written
documents.
■ Convolutional neural networks ingest and process images as tensors.
Contd.
■ A tensor encompasses the dimensions beyond that 2-D plane e.g. a 2 x 3 x 2 tensor.
■ Tensors are formed by arrays nested within arrays, and that nesting can go on
infinitely, accounting for an arbitrary number of dimensions far greater than what we
can visualize spatially.
■ Convolutional networks pass many filters over a single image, each one picking up a
different signal. Therefore convolutional nets learn images in pieces that we call
feature maps.
DBSCAN for Clustering
■ It stands for Density-Based Spatial Clustering of Applications with Noise.
■ It groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
■ The two parameters we need to specify are:
– the minimum number of data points needed to form a single cluster, and
– how far away one point can be from the next point within the same cluster (epsilon).
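■ A scikit-learn sketch: eps is the epsilon radius, min_samples is the minimum number of points per cluster, and points labelled -1 are the noise/outliers (toy data):

    # DBSCAN sketch: eps and min_samples are the two required parameters.
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    dense = np.vstack([rng.normal(0, 0.3, (50, 2)),     # two dense blobs
                       rng.normal(5, 0.3, (50, 2))])
    noise = rng.uniform(-3, 8, (10, 2))                 # scattered low-density points
    points = np.vstack([dense, noise])

    db = DBSCAN(eps=0.8, min_samples=5).fit(points)
    print(set(db.labels_))      # cluster ids; -1 marks outliers in low-density regions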
Collaborative Filtering for Recommender
Systems
■ Collaborative filtering, also referred to as social filtering, filters information by using
the recommendations of other people.
■ Most collaborative filtering systems apply the so called neighborhood-based
technique.
■ In the neighbourhood-based approach a number of users is selected based on their
similarity to the active user.
■ A prediction for the active user is made by calculating a weighted average of the
ratings of the selected users.
SVD++ for Recommender Systems
■ Matrix factorization algorithms work by decomposing the user-item interaction
matrix into the product of two lower dimensionality rectangular matrices.
■ SVD factorizes the interaction matrix into two lower-dimensional matrices: the first has a row for each user, while the second has a column for each item.
■ The row or column associated with a specific user or item is referred to as its latent factors.
■ Increasing the number of latent factors improves personalization, and therefore recommendation quality, until the number of factors becomes too high, at which point the model starts to overfit and the recommendation quality decreases.
■ SVD++ is a matrix factorization method with implicit feedback.
■ It exploits all available interactions, both explicit (e.g. numerical ratings) and implicit (e.g. likes, purchases, skips, bookmarks).
NMF for Dimensionality Reduction
■ Non-negative matrix factorization is an important method in the analysis of high
dimensional datasets.
■ Principal component analysis (PCA) and singular value decomposition (SVD) are popular techniques for dimensionality reduction based on matrix decomposition; however, they contain both positive and negative values in the decomposed matrices.
■ Since matrices decomposed by NMF only contain non-negative values, the original
data are represented by only additive, not subtractive, combinations of the basis
vectors.
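■ A scikit-learn sketch of NMF on a small non-negative matrix: both factors W and H contain only non-negative values, so each row of the data is an additive combination of the basis vectors (toy data):

    # Non-negative matrix factorization sketch: V ≈ W @ H with non-negative factors.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    V = rng.random((20, 10))                 # toy non-negative data matrix

    nmf = NMF(n_components=3, init="random", random_state=0, max_iter=500)
    W = nmf.fit_transform(V)                 # non-negative coefficients (20 x 3)
    H = nmf.components_                      # non-negative basis vectors (3 x 10)

    print(W.min() >= 0, H.min() >= 0)        # only additive combinations
    print(np.linalg.norm(V - W @ H))         # reconstruction error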
Deep Auto Encoders
■ An Autoencoder is a feedforward neural network having an input layer, one hidden
layer and an output layer.
■ The transition from the input to the hidden layer is called the encoding step and the
transition from the hidden to the output layer is called the decoding step.
■ A Deep Autoencoder has multiple hidden layers.
■ The additional hidden layers enable the Autoencoder to learn mathematically more
complex underlying patterns in the data.
Sparse Filtering

■ Traditionally, feature learning methods have largely sought to learn models that
provide good approximations of the true data distribution
■ Sparse Filtering is a form of unsupervised feature learning that learns a sparse
representation of the input data without directly modelling it.
■ It has only one hyperparameter, the number of features to learn.
■ Sparse filtering scales gracefully to handle high-dimensional inputs.
t-SNE to visualize multidimensional
datasets
■ t-SNE stands for t-Distributed Stochastic Neighbour Embedding and its main aim is
that of dimensionality reduction.
■ The dimensionality of a set of images is the number of pixels in any image, which
ranges from thousands to millions. We need to reduce the dimensionality of a
dataset from an arbitrary number to two or three.
■ Stochastic neighbour embedding techniques compute an N × N similarity matrix in both the original data space and in the low-dimensional embedding space; these are called similarity matrices.
Contd.
■ The distribution over pairs of objects is defined such that pairs of similar objects
have a high probability under the distribution, whilst pairs of dissimilar points have a
low probability.
■ The probabilities are generally given by a normalized Gaussian or Student-t kernel
computed from the data space or from the embedding space.
■ The low-dimensional embedding is learned by minimizing the Kullback-Leibler
divergence between the two probability distributions (computed in the original data
space and the embedding space) with respect to the locations of the points in the
embedding space.
■ This is the topic of manifold learning, also called nonlinear dimensionality reduction,
a branch of machine learning (more specifically, unsupervised learning).
■ It is still an active area of research today and tries to develop algorithms that can
automatically recover a hidden structure in a high-dimensional dataset.
LSTMs for Time Series and Sequences
■ A usual RNN (Recurrent Neural Network) has only a short-term memory. In combination with an LSTM it also has a long-term memory.
■ An LSTM unit is composed of a cell, an input gate, an output gate and a forget gate.
■ The cell remembers values over arbitrary time intervals and the three gates regulate
the flow of information into and out of the cell.
■ LSTM’s enable Recurrent Neural Networks to remember their inputs over a long
period of time.
MCMC and Metropolis Algorithm
■ The Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method
for obtaining a sequence of random samples from multi-dimensional distributions,
especially when the number of dimensions is high.
■ The algorithm proceeds by generating random numbers from a uniform distribution and applying an accept-or-reject criterion.
■ If the proposal is accepted, a transition is made according to a stochastic transition matrix.
■ It uses the ergodicity property of a Markov process to ensure that the probability of reaching any point in the space is greater than zero.
■ A stochastic process is said to be ergodic if its statistical properties can be deduced
from a single, sufficiently long, random sample of the process.
■ The reasoning is that any collection of random samples from a process must
represent the average statistical properties of the entire process.
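■ A short numpy sketch of a random-walk Metropolis sampler targeting a standard normal density; the target and proposal here are illustrative choices, not taken from the slides:

    # Random-walk Metropolis sampler for a standard normal target (illustrative).
    import numpy as np

    def target(x):
        return np.exp(-0.5 * x * x)          # unnormalized standard normal density

    rng = np.random.default_rng(0)
    samples, x = [], 0.0
    for _ in range(10_000):
        proposal = x + rng.normal(0, 1)                       # propose a move
        accept_prob = min(1.0, target(proposal) / target(x))  # accept-or-reject criterion
        if rng.uniform() < accept_prob:                       # uniform random number
            x = proposal                                      # transition to the new state
        samples.append(x)

    print(np.mean(samples), np.std(samples))   # close to 0 and 1 for this target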
■ The End
