
Data Science Cheatsheet

Compiled by Maverick Lin (http://mavericklin.com)


Last Updated August 13, 2018

What is Data Science?
Multi-disciplinary field that brings together concepts from computer science, statistics/machine learning, and data analysis to understand and extract insights from the ever-increasing amounts of data.

Two paradigms of data research:
1. Hypothesis-Driven: Given a problem, what kind of data do we need to help solve it?
2. Data-Driven: Given some data, what interesting problems can be solved with it?

The heart of data science is to always ask questions. Always be curious about the world.
1. What can we learn from this data?
2. What actions can we take once we find whatever it is we are looking for?

Types of Data
Structured: data that has predefined structures. e.g. tables, spreadsheets, or relational databases.
Unstructured: data with no predefined structure that comes in any size or form and cannot be easily stored in tables. e.g. blobs of text, images, audio.
Quantitative: numerical data. e.g. height, weight.
Categorical: data that can be labeled or divided into groups. e.g. race, sex, hair color.
Big Data: massive datasets, or data that contains greater variety arriving in increasing volumes and with ever-higher velocity (the 3 Vs). Cannot fit in the memory of a single machine.

Data Sources/Formats
Most Common Data Formats: CSV, XML, SQL, JSON, Protocol Buffers
Data Sources: companies/proprietary data, APIs, government, academic, web scraping/crawling

Main Types of Problems
Two problems arise repeatedly in data science.
Classification: assigning something to a discrete set of possibilities. e.g. spam or non-spam, Democrat or Republican, blood type (A, B, AB, O)
Regression: predicting a numerical value. e.g. someone's income, next year's GDP, a stock price

Probability Overview
Probability theory provides a framework for reasoning about the likelihood of events.

Terminology
Experiment: procedure that yields one of a possible set of outcomes. e.g. repeatedly tossing a die or coin
Sample Space S: set of possible outcomes of an experiment. e.g. if tossing a die, S = {1, 2, 3, 4, 5, 6}
Event E: set of outcomes of an experiment. e.g. the event that a roll is 5, or the event that the sum of 2 rolls is 7
Probability of an Outcome s, or P(s): number that satisfies two properties:
1. for each outcome s, 0 ≤ P(s) ≤ 1
2. Σ_s P(s) = 1
Probability of Event E: sum of the probabilities of the outcomes of the experiment: P(E) = Σ_{s ∈ E} P(s)
Random Variable V: numerical function on the outcomes of a probability space
Expected Value of Random Variable V: E(V) = Σ_{s ∈ S} P(s) · V(s)

Independence, Conditional, Compound
Independent Events: A and B are independent iff:
P(A ∩ B) = P(A)P(B)
P(A|B) = P(A)
P(B|A) = P(B)
Conditional Probability: P(A|B) = P(A,B)/P(B)
Bayes' Theorem: P(A|B) = P(B|A)P(A)/P(B)
Joint Probability: P(A,B) = P(B|A)P(A)
Marginal Probability: P(A)

Probability Distributions
Probability Density Function (PDF): gives the probability that a random variable takes on the value x: p_X(x) = P(X = x)
Cumulative Density Function (CDF): gives the probability that a random variable is less than or equal to x: F_X(x) = P(X ≤ x)
Note: the PDF and the CDF of a given random variable contain exactly the same information.
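To make the conditional-probability identities above concrete, here is a minimal Python sketch that applies Bayes' Theorem to a made-up diagnostic-test example (all probability values are invented purely for illustration):

# Hypothetical numbers, chosen only to illustrate Bayes' Theorem:
# P(D) = prior probability of disease, P(Pos|D) = test sensitivity,
# P(Pos|not D) = false positive rate.
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Marginal probability of a positive test:
# P(Pos) = P(Pos|D)P(D) + P(Pos|not D)P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' Theorem: P(D|Pos) = P(Pos|D)P(D) / P(Pos)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # ~0.161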
Descriptive Statistics
Provides a way of capturing a given data set or sample. There are two main types of measures: centrality and variability.

Centrality
Arithmetic Mean: useful to characterize symmetric distributions without outliers: μ_X = (1/n) Σ x
Geometric Mean: useful for averaging ratios. Always less than or equal to the arithmetic mean: (a_1 a_2 ... a_n)^(1/n)
Median: exact middle value among a dataset. Useful for skewed distributions or data with outliers.
Mode: most frequent element in a dataset.

Variability
Standard Deviation: measures the squared differences between the individual elements and the mean:
σ = sqrt( Σ_{i=1}^{N} (x_i − x̄)² / (N − 1) )
Variance: V = σ²

Interpreting Variance
Variance is an inherent part of the universe. It is impossible to obtain the same results after repeated observations of the same event due to random noise/error. Some variance can be explained away by attributing it to sampling or measurement error; other times, the variance is due to the random fluctuations of the universe.

Correlation Analysis
The correlation coefficient r(X,Y) is a statistic that measures the degree to which Y is a function of X and vice versa. Correlation values range from -1 to 1, where 1 means fully positively correlated, -1 means fully negatively correlated, and 0 means no correlation.
Pearson Coefficient: measures the degree of the relationship between linearly related variables: r = Cov(X,Y) / (σ(X)σ(Y))
Spearman Rank Coefficient: computed on ranks; depicts monotonic relationships.
Note: Correlation does not imply causation!
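A minimal sketch of the measures above using NumPy/SciPy on a small made-up sample (the array values are illustrative only):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])    # made-up sample with one outlier
y = np.array([2.1, 3.9, 4.2, 6.1, 19.8])    # made-up second variable

print(np.mean(x), np.median(x))              # arithmetic mean vs. median
print(stats.gmean(x))                        # geometric mean
vals, counts = np.unique(x, return_counts=True)
print(vals[np.argmax(counts)])               # mode (most frequent value)
print(np.std(x, ddof=1), np.var(x, ddof=1))  # sample std dev / variance (N-1 denominator)
print(stats.pearsonr(x, y)[0])               # Pearson: linear relationship
print(stats.spearmanr(x, y)[0])              # Spearman: monotonic relationship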
Data Cleaning
Data cleaning is the process of turning raw data into a clean and analyzable data set. "Garbage in, garbage out." Make sure garbage doesn't get put in.

Errors vs. Artifacts
1. Errors: information that is lost during acquisition and can never be recovered. e.g. power outage, crashed servers
2. Artifacts: systematic problems that arise from the data cleaning process. These problems can be corrected, but we must first discover them.

Data Compatibility
Data compatibility problems arise when merging datasets. Make sure you are comparing "apples to apples" and not "apples to oranges". Main types of conversions/unifications:
• units (metric vs. imperial)
• numbers (decimals vs. integers)
• names (John Smith vs. Smith, John)
• time/dates (UNIX vs. UTC vs. GMT)
• currency (currency type, inflation-adjusted, dividends)

Data Imputation
Process of dealing with missing values. The proper methods depend on the type of data we are working with. General methods include:
• Drop all records containing missing data
• Heuristic-Based: make a reasonable guess based on knowledge of the underlying domain
• Mean Value: fill in missing data with the mean
• Random Value
• Nearest Neighbor: fill in missing data using similar data points
• Interpolation: use a method like linear regression to predict the value of the missing data
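A minimal pandas sketch of a few of the imputation strategies listed above (the small DataFrame and column names are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, np.nan, 31, 40, np.nan],
                   "income": [40_000, 52_000, np.nan, 61_000, 58_000]})

dropped = df.dropna()               # drop records containing missing data
mean_filled = df.fillna(df.mean())  # mean-value imputation
interpolated = df.interpolate()     # simple (linear) interpolation
print(mean_filled)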
Outlier Detection
Outliers can interfere with analysis and often arise from mistakes during data collection. It makes sense to run a "sanity check".

Note: When cleaning data, always maintain both the raw data and the cleaned version(s). The raw data should be kept intact and preserved for future use. Any type of data cleaning/analysis should be done on a copy of the raw data.

Feature Engineering
Feature engineering is the process of using domain knowledge to create features or input variables that help machine learning algorithms perform better. Done correctly, it can help increase the predictive power of your models. Feature engineering is more of an art than a science. FE is one of the most important steps in creating a good model. As Andrew Ng puts it:

"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."

Continuous Data
Raw Measures: data that hasn't been transformed yet
Rounding: sometimes precision is noise; round to nearest integer, decimal, etc.
Scaling: log, z-score, min-max scale
Imputation: fill in missing values using mean, median, model output, etc.
Binning: transforming numeric features into categorical (binned) ones. e.g. values between 1-10 belong to A, between 10-20 belong to B, etc.
Interactions: interactions between features. e.g. subtraction, addition, multiplication, statistical test
Statistical: log/power transforms (help turn skewed distributions more normal), Box-Cox
Row Statistics: number of NaNs, 0's, negative values, max, min, etc.
Dimensionality Reduction: using PCA, clustering, factor analysis, etc.

Discrete Data
Encoding: since some ML algorithms cannot work on categorical data, we need to turn categorical data into numerical data or vectors
Ordinal Values: convert each distinct value into a number (e.g. [r, g, b] becomes [1, 2, 3])
One-Hot Encoding: each of the m distinct values becomes a vector of length m containing only one 1 (e.g. [r, g, b] becomes [[1,0,0], [0,1,0], [0,0,1]])
Feature Hashing Scheme: turns arbitrary features into indices in a vector or matrix
Embeddings: if using words, convert words to vectors (word embeddings)

Miscellaneous
Lowercasing, removing non-alphanumeric characters, repairing, unidecode, removing unknown characters
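A minimal sketch of ordinal and one-hot encoding with pandas (the toy color column is invented for illustration):

import pandas as pd

df = pd.DataFrame({"color": ["r", "g", "b", "g"]})

# Ordinal encoding: map each distinct value to an integer code
df["color_ordinal"] = df["color"].astype("category").cat.codes

# One-hot encoding: one indicator column per distinct value
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)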

Statistical Analysis
Process of statistical reasoning: there is an underlying population of possible things we can potentially observe, and only a small subset of them is actually sampled (ideally at random). Probability theory describes what properties our sample should have given the properties of the population, but statistical inference allows us to deduce what the full population is like after analyzing the sample.

Sampling From Distributions
Inverse Transform Sampling: sampling points from a given probability distribution is sometimes necessary to run simulations or to check whether your data fits a particular distribution. The general technique is called inverse transform sampling, or the Smirnov transform. First draw a random number p between [0,1]. Compute the value x such that the CDF equals p: F_X(x) = p. Use x as the random value drawn from the distribution described by F_X(x).

Monte Carlo Sampling: in higher dimensions, correctly sampling from a given distribution becomes more tricky. Here we generally want to use Monte Carlo methods, which typically follow these steps: define a domain of possible inputs, generate random inputs from a probability distribution over the domain, perform a deterministic calculation, and analyze the results.
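A minimal NumPy sketch of both ideas: inverse transform sampling for an exponential distribution (whose inverse CDF has the closed form −ln(1−p)/λ) and a Monte Carlo estimate of π, used here purely as a toy deterministic calculation:

import numpy as np

rng = np.random.default_rng(0)

# Inverse transform sampling: Exponential(lam) has CDF F(x) = 1 - exp(-lam*x),
# so x = F^{-1}(p) = -ln(1 - p) / lam for p drawn uniformly from [0, 1].
lam = 2.0
p = rng.uniform(0, 1, size=100_000)
samples = -np.log(1 - p) / lam
print(samples.mean())            # should be close to 1/lam = 0.5

# Monte Carlo: draw random points in the unit square and count how many
# fall inside the quarter circle; that fraction estimates pi/4.
pts = rng.uniform(0, 1, size=(100_000, 2))
inside = (pts ** 2).sum(axis=1) <= 1.0
print(4 * inside.mean())         # rough estimate of pi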
Classic Statistical Distributions
Binomial Distribution (Discrete)
Assume X is distributed Bin(n,p). X is the number of "successes" that we will achieve in n independent trials, where each trial is either a success or a failure, each success occurs with the same probability p, and each failure occurs with probability q = 1 − p.
PDF: P(X = x) = (n choose x) p^x (1 − p)^(n−x)    EV: μ = np    Variance = npq

Normal/Gaussian Distribution (Continuous)
Assume X is distributed N(μ, σ²). It is a bell-shaped and symmetric distribution. The bulk of the values lie close to the mean and no value is too extreme. Generalization of the binomial distribution as n → ∞.
PDF: P(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))    EV: μ    Variance: σ²
Implications: 68%-95%-99.7% rule. 68% of the probability mass falls within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.

Poisson Distribution (Discrete)
Assume X is distributed Pois(λ). Poisson expresses the probability of a given number of events occurring in a fixed interval of time/space if these events occur independently and with a known constant rate λ.
PDF: P(x) = e^(−λ) λ^x / x!    EV: λ    Variance = λ

Power Law Distributions (Discrete)
Many data distributions have much longer tails than the normal or Poisson distributions. In other words, the change in one quantity varies as a power of another quantity. Power laws help measure the inequality in the world. e.g. wealth, word frequency, and the Pareto Principle (80/20 Rule)
PDF: P(X = x) = c x^(−α), where α is the law's exponent and c is the normalizing constant
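A minimal scipy.stats sketch of the distributions above (the parameter values are arbitrary, chosen only for illustration):

from scipy import stats

n, p, lam, mu, sigma = 10, 0.3, 4.0, 0.0, 1.0   # arbitrary illustrative parameters

print(stats.binom.pmf(3, n, p))                 # P(X = 3) for Bin(n, p)
print(stats.binom.mean(n, p), stats.binom.var(n, p))   # np and npq

print(stats.norm.pdf(0.5, mu, sigma))           # density of N(mu, sigma^2) at 0.5
print(stats.norm.cdf(1) - stats.norm.cdf(-1))   # ~0.68 of mass within 1 sigma

print(stats.poisson.pmf(2, lam))                # P(X = 2) for Pois(lam)
print(stats.poisson.rvs(lam, size=5, random_state=0))  # random draws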
Modeling- Overview
Modeling is the process of incorporating information into a tool which can forecast and make predictions. Usually, we are dealing with statistical modeling, where we want to analyze relationships between variables. Formally, we want to estimate a function f(X) such that:

Y = f(X) + ε

where X = (X_1, X_2, ...X_p) represents the input variables, Y represents the output variable, and ε represents random error.

Statistical learning is a set of approaches for estimating this f(X).

Why Estimate f(X)?
Prediction: once we have a good estimate f̂(X), we can use it to make predictions on new data. We treat f̂ as a black box, since we only care about the accuracy of the predictions, not why or how it works.
Inference: we want to understand the relationship between X and Y. We can no longer treat f̂ as a black box since we want to understand how Y changes with respect to X = (X_1, X_2, ...X_p).

More About ε
The error term ε is composed of the reducible and irreducible error, which will prevent us from ever obtaining a perfect f̂ estimate.
• Reducible: error that can potentially be reduced by using the most appropriate statistical learning technique to estimate f. The goal is to minimize the reducible error.
• Irreducible: error that cannot be reduced no matter how well we estimate f. The irreducible error is unknown and unmeasurable and will always be an upper bound for ε.

Note: There will always be trade-offs between model flexibility (prediction) and model interpretability (inference). This is just another case of the bias-variance trade-off. Typically, as flexibility increases, interpretability decreases. Much of statistical learning/modeling is finding a way to balance the two.

Modeling- Philosophies
Modeling is the process of incorporating information into a tool which can forecast and make predictions. Designing and validating models is important, as is evaluating their performance. Note that the best forecasting model may not be the most accurate one.

Philosophies of Modeling
Occam's Razor: philosophical principle that the simplest explanation is the best explanation. In modeling, if we are given two models that predict equally well, we should choose the simpler one. Choosing the more complex one can often result in overfitting.
Bias Variance Trade-Off: inherent part of predictive modeling, where models with lower bias will have higher variance and vice versa. The goal is to achieve low bias and low variance.
• Bias: error from incorrect assumptions made to make the target function easier to learn (high bias → missing relevant relations, or underfitting)
• Variance: error from sensitivity to fluctuations in the dataset, or how much the target estimate would differ if different training data were used (high variance → modeling noise, or overfitting)
No Free Lunch Theorem: no single machine learning algorithm is better than all the others on all problems. It is common to try multiple models and find one that works best for a particular problem.

Thinking Like Nate Silver
1. Think Probabilistically: probabilistic forecasts are more meaningful than concrete statements and should be reported as probability distributions (including σ along with the mean prediction μ).
2. Incorporate New Information: use live models, which continually update using new information. To update, use Bayesian reasoning to calculate how probabilities change in response to new evidence.
3. Look For Consensus Forecast: use multiple distinct sources of evidence. Some models operate this way, such as boosting and bagging, which use a large number of weak classifiers to produce a strong one.
Modeling- Taxonomy
There are many different types of models. It is important to understand the trade-offs and when to use a certain type of model.

Parametric vs. Nonparametric
• Parametric: models that first make an assumption about the functional form, or shape, of f (e.g. linear), then fit the model. This reduces estimating f to just estimating a set of parameters, but if our assumption was wrong, it will lead to bad results.
• Non-Parametric: models that don't make any assumptions about f, which allows them to fit a wider range of shapes, but may lead to overfitting.

Supervised vs. Unsupervised
• Supervised: models that fit input variables x_i = (x_1, x_2, ...x_n) to a known output variable y_i = (y_1, y_2, ...y_n).
• Unsupervised: models that take in input variables x_i = (x_1, x_2, ...x_n) but do not have an associated output to supervise the training. The goal is to understand relationships between the variables or observations.

Blackbox vs. Descriptive
• Blackbox: models that make decisions, but we do not know what happens "under the hood". e.g. deep learning, neural networks
• Descriptive: models that provide insight into why they make their decisions. e.g. linear regression, decision trees

First-Principle vs. Data-Driven
• First-Principle: models based on a prior belief of how the system under investigation works; incorporate domain knowledge (ad-hoc).
• Data-Driven: models based on observed correlations between input and output variables.

Deterministic vs. Stochastic
• Deterministic: models that produce a single "prediction". e.g. yes or no, true or false
• Stochastic: models that produce probability distributions over possible events.

Flat vs. Hierarchical
• Flat: models that solve problems on a single level, with no notion of subproblems.
• Hierarchical: models that solve several different nested subproblems.

Modeling- Evaluation Metrics
Need to determine how good our model is. The best way to assess models is with out-of-sample predictions (data points your model has never seen).

Classification

                Predicted Yes            Predicted No
Actual Yes      True Positives (TP)      False Negatives (FN)
Actual No       False Positives (FP)     True Negatives (TN)

Accuracy: ratio of correct predictions over total predictions. Misleading when class sizes are substantially different. accuracy = (TP + TN) / (TP + TN + FN + FP)
Precision: how often the classifier is correct when it predicts positive: precision = TP / (TP + FP)
Recall: how often the classifier is correct for all positive instances: recall = TP / (TP + FN)
F-Score: single measurement to describe performance: F = 2 · (precision · recall) / (precision + recall)
ROC Curves: plot true positive rates against false positive rates for various thresholds, where the threshold determines whether the model classifies a data point as positive or negative (e.g. if > 0.8, classify as positive). The best possible area under the ROC curve (AUC) is 1, while random is 0.5, the main diagonal line.

Regression
Errors are defined as the difference between a prediction y₀ and the actual result y.
Absolute Error: Δ = y₀ − y
Squared Error: Δ² = (y₀ − y)²
Mean-Squared Error: MSE = (1/n) Σ_{i=1}^{n} (y₀ᵢ − yᵢ)²
Root Mean-Squared Error: RMSD = √MSE
Absolute Error Distribution: plot the absolute error distribution; it should be symmetric, centered around 0, bell-shaped, and contain only rare extreme outliers.
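A minimal scikit-learn sketch of the classification metrics above, computed on small made-up label vectors:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_true   = [1, 0, 1, 1, 0, 1, 0, 0]                   # made-up ground truth
y_pred   = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard class predictions
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # rows = actual, i.e. [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_scores))    # area under the ROC curve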
Modeling- Evaluation Environment
Evaluation metrics provide us with the tools to estimate errors, but what should be the process to obtain the best estimate? Resampling involves repeatedly drawing samples from a training set and refitting a model to each sample, which provides us with additional information compared to fitting the model once, such as obtaining a better estimate for the test error.

Key Concepts
Training Data: data used to fit your models, or the set used for learning.
Validation Data: data used to tune the parameters of a model.
Test Data: data used to evaluate how good your model is. Ideally your model should never touch this data until final testing/evaluation.

Cross Validation
Class of methods that estimate test error by holding out a subset of training data from the fitting process.
Validation Set: split data into a training set and a validation set. Train the model on the training set and estimate the test error using the validation set. e.g. 80-20 split
Leave-One-Out CV (LOOCV): split data into a training set and a validation set, where the validation set consists of a single observation. Repeat n times, until every observation has been used as the validation set. The test error is the average of these n test error estimates.
k-Fold CV: randomly divide the data into k groups (folds) of approximately equal size. The first fold is used as validation and the rest as training. Repeat k times and average the k estimates.

Bootstrapping
Methods that rely on random sampling with replacement. Bootstrapping helps with quantifying the uncertainty associated with a given estimate or model.

Amplifying Small Data Sets
What can we do if we don't have enough data?
• Create Negative Examples: e.g. when classifying presidential candidates, most people would be unqualified, so label most as unqualified.
• Synthetic Data: create additional data by adding noise to the real data.
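A minimal sketch of k-fold cross-validation with scikit-learn and a simple bootstrap estimate with NumPy (the dataset is a toy synthetic regression problem):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5-fold CV: average held-out MSE across the folds
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())

# Bootstrap: resample y with replacement to estimate the variability
# (standard error) of the sample mean
rng = np.random.default_rng(0)
boot_means = [rng.choice(y, size=len(y), replace=True).mean() for _ in range(1000)]
print(np.std(boot_means))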
Linear Regression
Linear regression is a simple and useful tool for predicting a quantitative response. The relationship between input variables X = (X_1, X_2, ...X_p) and output variable Y takes the form:

Y ≈ β_0 + β_1X_1 + ... + β_pX_p + ε

β_0...β_p are the unknown coefficients (parameters) which we are trying to determine. The best coefficients will lead us to the best "fit", which can be found by minimizing the residual sum of squares (RSS), or the sum of the squared differences between the actual ith value and the predicted ith value: RSS = Σ_{i=1}^{n} e_i², where e_i = y_i − ŷ_i.

How to find the best fit?
Matrix Form: we can solve the closed-form equation for the coefficient vector w: w = (XᵀX)⁻¹XᵀY, where X represents the input data and Y represents the output data. This method is used for smaller matrices, since inverting a matrix is computationally expensive.
Gradient Descent: first-order optimization algorithm. We can find the minimum of a convex function by starting at an arbitrary point and repeatedly taking steps in the downward direction, which can be found by taking the negative direction of the gradient. After several iterations, we will eventually converge to the minimum. In our case, the minimum corresponds to the coefficients with the minimum error, or the best line of fit. The learning rate α determines the size of the steps we take in the downward direction.

Gradient descent algorithm in two dimensions. Repeat until convergence:
1. w_0^(t+1) := w_0^t − α ∂/∂w_0 J(w_0, w_1)
2. w_1^(t+1) := w_1^t − α ∂/∂w_1 J(w_0, w_1)

For non-convex functions, gradient descent no longer guarantees an optimal solution since there may be local minima. Instead, we should run the algorithm from different starting points and use the best local minimum we find as the solution.
Stochastic Gradient Descent: instead of taking a step after processing the entire training set, take a small batch of training data at random to determine the next step. Computationally more efficient and may lead to faster convergence.
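A minimal NumPy sketch contrasting the closed-form solution with batch gradient descent on a toy 1-D problem (the synthetic data and learning rate are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=100)    # true line: y = 3 + 2x + noise

# Closed form: w = (X^T X)^{-1} X^T Y, with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_closed)                                    # ~[3, 2]

# Batch gradient descent on J(w) = (1/n) * sum((Xw - y)^2)
w = np.zeros(2)
alpha = 0.01                                       # learning rate (assumed)
for _ in range(5000):
    grad = 2 / len(y) * X.T @ (X @ w - y)
    w -= alpha * grad
print(w)                                           # approaches the closed-form solution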
Linear Regression II
Improving Linear Regression
Subset/Feature Selection: identify a subset of the p predictors that we believe to be best related to the response, then fit the model using the reduced set of variables.
• Best, Forward, and Backward Subset Selection
Shrinkage/Regularization: all variables are used, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. λ represents the tuning parameter: as λ increases, flexibility decreases → decreased variance but increased bias. The tuning parameter is key in determining the sweet spot between under- and over-fitting. In addition, while Ridge will always produce a model with p variables, Lasso can force coefficients to be exactly zero.
• Lasso (L1): min RSS + λ Σ_{j=1}^{p} |β_j|
• Ridge (L2): min RSS + λ Σ_{j=1}^{p} β_j²
Dimension Reduction: project the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations of the variables. Can use PCA.
Miscellaneous: removing outliers, feature scaling, removing multicollinearity (correlated variables)

Evaluating Model Accuracy
Residual Standard Error (RSE): RSE = sqrt(RSS / (n − 2)). Generally, the smaller the better.
R²: measure of fit that represents the proportion of variance explained, or the variability in Y that can be explained using X. It takes on a value between 0 and 1. Generally, the higher the better. R² = 1 − RSS/TSS, where Total Sum of Squares (TSS) = Σ(y_i − ȳ)²

Evaluating Coefficient Estimates
The Standard Error (SE) of the coefficients can be used to perform hypothesis tests on the coefficients: H_0: no relationship between X and Y; H_a: some relationship exists. A p-value can be obtained and interpreted as follows: a small p-value indicates that a relationship between the predictor (X) and the response (Y) exists. Typical p-value cutoffs are around 5% or 1%.
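A minimal scikit-learn sketch of ridge and lasso regularization on a toy dataset; note how lasso can drive some coefficients exactly to zero (alpha plays the role of λ, and the values are chosen arbitrarily):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy data where only a few of the 10 features are truly informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)     # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=5.0).fit(X, y)      # L1: can set coefficients exactly to zero

print(ols.coef_.round(2))
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))             # expect several exact zeros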
Logistic Regression
Logistic regression is used for classification, where the response variable is categorical rather than numerical.

The model works by predicting the probability that Y belongs to a particular category: it first fits the data to a linear regression model, which is then passed to the logistic function (below). The logistic function will always produce an S-shaped curve, so regardless of X, we can always obtain a sensible answer (between 0 and 1). If the probability is above a certain predetermined threshold (e.g. P(Yes) > 0.5), then the model will predict Yes.

p(X) = e^(β_0 + β_1X_1 + ... + β_pX_p) / (1 + e^(β_0 + β_1X_1 + ... + β_pX_p))

How to find the best coefficients?
Maximum Likelihood: the coefficients β_0...β_p are unknown and must be estimated from the training data. We seek estimates for β_0...β_p such that the predicted probability p̂(x_i) of each observation is a number close to one if it is observed in a certain class and close to zero otherwise. This is done by maximizing the likelihood function:

l(β_0, β_1) = Π_{i: y_i = 1} p(x_i) · Π_{i': y_i' = 0} (1 − p(x_i'))

Potential Issues
Imbalanced Classes: an imbalance between the classes in the training data leads to poor classifiers. It can result in a lot of false positives and also leaves little training data for the smaller class. Solutions include forcing balanced data by removing observations from the larger class, replicating data from the smaller class, or heavily weighting the training examples toward instances of the larger class.
Multi-Class Classification: the more classes you try to predict, the harder it will be for the classifier to be effective. It is possible with logistic regression, but another approach, such as Linear Discriminant Analysis (LDA), may prove better.
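A minimal scikit-learn sketch of fitting a logistic regression classifier and reading off predicted probabilities (toy data from make_classification):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)   # fits beta_0..beta_p by maximum likelihood
probs = clf.predict_proba(X_test)[:, 1]            # P(Y = 1 | X)
preds = (probs > 0.5).astype(int)                  # apply the 0.5 threshold
print(clf.score(X_test, y_test))                   # accuracy on held-out data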
Distance/Network Methods
Interpreting examples as points in space provides a way to find natural groupings or clusters among data, e.g. which stars are the closest to our sun? Networks can also be built from point sets (vertices) by connecting related points.

Measuring Distances/Similarity Measure
There are several ways of measuring distances between points a and b in d dimensions, with closer distances implying similarity.

Minkowski Distance Metric: d_k(a, b) = (Σ_{i=1}^{d} |a_i − b_i|^k)^(1/k)
The parameter k provides a way to trade off between the largest and the total dimensional difference. In other words, larger values of k place more emphasis on large differences between feature values than smaller values. Selecting the right k can significantly impact the meaningfulness of your distance function. The most popular values are 1 and 2.
• Manhattan (k=1): city block distance, or the sum of the absolute differences between two points
• Euclidean (k=2): straight line distance

Weighted Minkowski: d_k(a, b) = (Σ_{i=1}^{d} w_i |a_i − b_i|^k)^(1/k). In some scenarios, not all dimensions are equal; we can convey this idea using the weights w_i. Generally not a good idea: instead, normalize the data by z-scores before computing distances.

Cosine Similarity: cos(a, b) = a · b / (|a||b|), calculates the similarity between two non-zero vectors, where a · b is the dot product (normalized between 0 and 1); higher values imply more similar vectors.

Kullback-Leibler Divergence: KL(A||B) = Σ_{i=1}^{d} a_i log₂(a_i / b_i)
KL divergence measures the distance between probability distributions by measuring the uncertainty gained or lost when replacing distribution A with distribution B. However, it is not a metric, but it forms the basis for the Jensen-Shannon Divergence metric.
Jensen-Shannon: JS(A, B) = ½ KL(A||M) + ½ KL(B||M), where M is the average of A and B. The JS function is the right metric for calculating distances between probability distributions.
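A minimal NumPy/SciPy sketch of the distance and similarity measures above on two toy vectors:

import numpy as np
from scipy.spatial import distance
from scipy.stats import entropy

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 1.0])

print(distance.minkowski(a, b, p=1))   # Manhattan (k=1)
print(distance.minkowski(a, b, p=2))   # Euclidean (k=2)
print(1 - distance.cosine(a, b))       # cosine similarity (SciPy returns the distance)

# KL and Jensen-Shannon on (normalized) probability distributions
p = a / a.sum()
q = b / b.sum()
print(entropy(p, q, base=2))           # KL(p || q)
m = (p + q) / 2
print(0.5 * entropy(p, m, base=2) + 0.5 * entropy(q, m, base=2))  # Jensen-Shannon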
Nearest Neighbor Classification
Distance functions allow us to identify the points closest to a given target, or the nearest neighbors (NN) to a given point. The advantages of NN include simplicity, interpretability, and non-linearity.

k-Nearest Neighbors
Given a positive integer k and a point x₀, the KNN classifier first identifies the k points in the training data most similar to x₀, then estimates the conditional probability of x₀ being in class j as the fraction of the k points whose values belong to j. The optimal value for k can be found using cross validation.

KNN Algorithm
1. Compute the distance D(a, b) from point b to all points
2. Select the k closest points and their labels
3. Output the class with the most frequent label among the k points

Optimizing KNN
Comparing a query point a in d dimensions against n training examples has a runtime of O(nd), which can cause lag as the number of points reaches millions or billions. Popular choices to speed up KNN include:
• Voronoi Diagrams: partition the plane into regions based on distance to points in a specific subset of the plane
• Grid Indexes: carve up space into d-dimensional boxes or grids and calculate the NN in the same cell as the point
• Locality Sensitive Hashing (LSH): abandons the idea of finding the exact nearest neighbors. Instead, batch up nearby points to quickly find the most appropriate bucket B for our query point. LSH is defined by a hash function h(p) that takes a point/vector as input and produces a number/code as output, such that it is likely that h(a) = h(b) if a and b are close to each other, and h(a) != h(b) if they are far apart.
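A minimal scikit-learn sketch of a KNN classifier, with k chosen by cross-validation (iris is used purely as a convenient built-in dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Pick k by cross-validated accuracy
best_k = max(range(1, 16),
             key=lambda k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                           X, y, cv=5).mean())
print(best_k)

clf = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
print(clf.predict(X[:3]))    # predicted classes for the first 3 points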
Clustering
Clustering is the problem of grouping points by similarity using distance metrics, which ideally reflect the similarities you are looking for. Often items come from logical "sources" and clustering is a good way to reveal those origins. Perhaps the first thing to do with any data set. Possible applications include: hypothesis development, modeling over smaller subsets of data, data reduction, outlier detection.

K-Means Clustering
Simple and elegant algorithm to partition a dataset into K distinct, non-overlapping clusters.
1. Choose a K. Randomly assign a number between 1 and K to each observation. These serve as the initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using the distance metric).
Since the results of the algorithm depend on the initial random assignments, it is a good idea to repeat the algorithm from different random initializations to obtain the best overall result. Can use MSE to determine which cluster assignment is better.

Hierarchical Clustering
Alternative clustering algorithm that does not require us to commit to a particular K. Another advantage is that it results in a nice visualization called a dendrogram. Observations that fuse at the bottom are similar, while those that fuse at the top are quite different; we draw conclusions based on the location on the vertical rather than the horizontal axis.
1. Begin with n observations and a measure of all the n(n−1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
2. For i = n, n−1, ..., 2:
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram where the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters.
Linkage: Complete (max dissimilarity), Single (min), Average, Centroid (between the centroids of clusters A and B)
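A minimal scikit-learn sketch of K-means with multiple random initializations (n_init) and, for comparison, an agglomerative (hierarchical) clustering fit on the same toy blobs:

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)    # the 3 centroids
print(km.inertia_)            # within-cluster sum of squares (lower is better)

hc = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)
print(hc.labels_[:10])        # cluster assignments of the first 10 points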
Machine Learning Part I
Comparing ML Algorithms
Power and Expressibility: ML methods differ in terms of complexity. Linear regression fits linear functions, while NNs define piecewise-linear separation boundaries. More complex models can provide more accurate models, but at the risk of overfitting.
Interpretability: some models are more transparent and understandable than others (white box vs. black box models).
Ease of Use: some models feature few parameters/decisions (linear regression/NN), while others require more decision making to optimize (SVMs).
Training Speed: models differ in how fast they fit the necessary parameters.
Prediction Speed: models differ in how fast they make predictions given a query.

Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features.

Problem: suppose we need to classify vector X = x_1...x_n into m classes, C_1...C_m. We need to compute the probability of each possible class given X, so we can assign X the label of the class with the highest probability. We can calculate that probability using Bayes' Theorem:

P(C_i|X) = P(X|C_i)P(C_i) / P(X)

Where:
1. P(C_i): the prior probability of belonging to class i
2. P(X): normalizing constant, or the probability of seeing the given input vector over all possible input vectors
3. P(X|C_i): the conditional probability of seeing input vector X given we know the class is C_i

The prediction model will formally look like:

C(X) = argmax_{i ∈ classes} P(X|C_i)P(C_i) / P(X)

where C(X) is the prediction returned for input X.
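A minimal scikit-learn sketch of a (Gaussian) Naive Bayes classifier on a toy dataset; the class priors P(C_i) and per-feature likelihoods are estimated from the training data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.class_prior_)             # estimated P(C_i)
print(nb.predict(X_test[:5]))      # argmax_i P(X|C_i)P(C_i) for the first 5 points
print(nb.score(X_test, y_test))    # held-out accuracy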
Machine Learning Part II
Decision Trees
Binary branching structure used to classify an arbitrary input vector X. Each node in the tree contains a simple feature comparison against some field (e.g. x_i > 42?). The result of each comparison is either true or false, which determines if we should proceed to the left or right child of the given node. Also known as classification and regression trees (CART).

Advantages: non-linearity, support for categorical variables, easy to interpret, applicable to regression.
Disadvantages: prone to overfitting, unstable (not robust to noise), high variance, low bias.

Note: rarely do models just use one decision tree. Instead, we aggregate many decision trees using methods like ensembling, bagging, and boosting.

Ensembles, Bagging, Random Forests, Boosting
Ensemble learning is the strategy of combining many different classifiers/models into one predictive model. It revolves around the idea of voting: a so-called "wisdom of crowds" approach. The most predicted class will be the final prediction.
Bagging: ensemble method that works by taking B bootstrapped subsamples of the training data and constructing B trees, each tree trained on a distinct subsample; the trees' predictions are then aggregated (averaged or majority-voted).
Random Forests: builds on bagging by decorrelating the trees. We do everything the same as in bagging, but when we build the trees, every time we consider a split, a random sample of the p predictors is chosen as split candidates, not the full set (typically m ≈ √p). When m = p, we are just doing bagging.
Boosting: the main idea is to improve our model where it is not performing well by using information from previously constructed classifiers. A slow learner. Has 3 tuning parameters: number of classifiers B, learning parameter λ, and interaction depth d (controls the interaction order of the model).
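A minimal scikit-learn sketch of a single decision tree versus a bagged/decorrelated ensemble (random forest) on the same toy data; max_features plays the role of m ≈ √p:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200,       # B trees on bootstrapped samples
                                max_features="sqrt",    # m ~ sqrt(p) split candidates
                                random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())     # single tree: higher variance
print(cross_val_score(forest, X, y, cv=5).mean())   # forest: usually more accurate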
Machine Learning Part III
Support Vector Machines
Work by constructing a hyperplane that separates points between two classes. The hyperplane is determined using the maximal margin hyperplane, which is the hyperplane that is the maximum distance from the training observations. This distance is called the margin. Points that fall on one side of the hyperplane are classified as -1 and the others as +1.

Principal Component Analysis (PCA)
Principal components allow us to summarize a set of correlated variables with a smaller set of variables that collectively explain most of the variability in the original set. Essentially, we are "dropping" the least important feature variables.

Principal Component Analysis is the process by which principal components are calculated and then used to analyze and understand the data. PCA is an unsupervised approach and is used for dimensionality reduction, feature extraction, and data visualization. The variables after performing PCA are uncorrelated. Scaling the variables is also important while performing PCA.
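A minimal scikit-learn sketch: scale the features, reduce them with PCA, and fit a linear SVM on the reduced representation (iris is just a convenient toy dataset):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pca = make_pipeline(StandardScaler(), PCA(n_components=2))   # scale, then project to 2 PCs
X2 = pca.fit_transform(X)
print(pca.named_steps["pca"].explained_variance_ratio_)      # variability explained per PC

svm = SVC(kernel="linear").fit(X2, y)   # maximal-margin separator on the 2-D projection
print(svm.score(X2, y))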


Machine Learning Part IV
ML Terminology and Concepts

Features: input data/variables used by the ML model
Feature Engineering: transforming input features to be more useful for the models. e.g. mapping categories to buckets, normalizing between -1 and 1, removing nulls
Train/Eval/Test: training data is used to optimize the model, evaluation data is used to assess the model on new data during training, and test data is used to provide the final result
Classification/Regression: regression is predicting a number (e.g. housing price), classification is predicting from a set of categories (e.g. predicting red/blue/green)
Linear Regression: predicts an output by multiplying and summing input features with weights and biases
Logistic Regression: similar to linear regression but predicts a probability
Overfitting: model performs great on the input data but poorly on the test data (combat with dropout, early stopping, or reducing the number of nodes or layers)
Bias/Variance: how much the output is determined by the features. More variance often means overfitting; more bias can mean a bad model
Regularization: variety of approaches to reduce overfitting, including adding the weights to the loss function and randomly dropping layers (dropout)
Ensemble Learning: training multiple models with different parameters to solve the same problem
A/B testing: statistical way of comparing 2+ techniques to determine which technique performs better and also whether the difference is statistically significant
Baseline Model: simple model/heuristic used as a reference point for comparing how well a model is performing
Bias: prejudice or favoritism towards some things, people, or groups over others that can affect the collection/sampling and interpretation of data, the design of a system, and how users interact with a system
Dynamic Model: model that is trained online in a continuously updating fashion
Static Model: model that is trained offline
Normalization: process of converting an actual range of values into a standard range of values, typically -1 to +1
Independently and Identically Distributed (i.i.d.): data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on previously drawn values; ideal but rarely found in real life
Hyperparameters: the "knobs" that you tweak during successive runs of training a model
Generalization: refers to a model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model
Cross-Entropy: quantifies the difference between two probability distributions

Deep Learning Part I
What is Deep Learning?
Deep learning is a subset of machine learning. One popular DL technique is based on Neural Networks (NN), which loosely mimic the human brain; the code structures are arranged in layers. Each layer's input is the previous layer's output, which yields progressively higher-level features and defines a hierarchy. A Deep Neural Network is just a NN that has more than 1 hidden layer.

Recall that statistical learning is all about approximating f(X). Neural networks are known as universal approximators, meaning that no matter how complex a function is, there exists a NN that can (approximately) do the job. We can increase the approximation (or complexity) by adding more hidden layers and neurons.

Popular Architectures
There are different kinds of NNs that are suitable for certain problems, depending on the NN's architecture.
Linear Classifier: takes input features and combines them with weights and biases to predict an output value
DNN: deep neural net, contains intermediate layers of nodes that represent "hidden features" and activation functions to represent non-linearity
CNN: convolutional NN, has a combination of convolutional, pooling, and dense layers; popular for image classification
Transfer Learning: use existing trained models as starting points and add additional layers for the specific use case. The idea is that highly trained existing models know general features that serve as a good starting point for training a small network on specific examples
RNN: recurrent NN, designed for handling a sequence of inputs that have "memory" of the sequence. LSTMs are a fancy version of RNNs, popular for NLP
GAN: generative adversarial network, where one model creates fake examples and another model is served both fake and real examples and is asked to distinguish between them
Wide and Deep: combines linear classifiers with deep neural net classifiers; the "wide" linear parts represent memorizing specific examples and the "deep" parts represent understanding high-level features

Deep Learning Part II
Tensorflow
Tensorflow is an open source software library for numerical computation using data flow graphs. Everything in TF is a graph, where nodes represent operations on data and edges represent the data. Phase 1 of TF is building up a computation graph and phase 2 is executing it. It is also distributed, meaning it can run on either a cluster of machines or just a single machine.
TF is extremely popular/suitable for working with Neural Networks, since the way TF sets up the computational graph pretty much resembles a NN.

Tensors
In a graph, tensors are the edges: multidimensional data arrays that flow through the graph. A tensor is the central unit of data in TF and consists of a set of primitive values shaped into an array of any number of dimensions.
A tensor is characterized by its rank (number of dimensions in the tensor), shape (number of dimensions and size of each dimension), and data type (data type of each element in the tensor).

Placeholders and Variables
Variables: the best way to represent shared, persistent state manipulated by your program. These are the parameters of the ML model that are altered/trained during the training process. Training variables.
Placeholders: a way to specify inputs into a graph that hold the place for a Tensor that will be fed at runtime. They are assigned once and do not change after. Input nodes.
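The two-phase, placeholder/variable workflow described here follows the TensorFlow 1.x graph-and-session API (exposed as tf.compat.v1 in TensorFlow 2); a minimal sketch under that assumption:

import tensorflow.compat.v1 as tf   # TF 1.x-style API (plain `import tensorflow as tf` on TF 1.x)
tf.disable_eager_execution()

# Phase 1: build the computation graph
x = tf.placeholder(tf.float32, shape=[None, 3])   # input node, fed at runtime
w = tf.Variable(tf.zeros([3, 1]))                 # trainable parameter
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, w) + b                           # an operation node

# Phase 2: execute the graph in a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))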
Deep Learning Part III
Deep Learning Terminology and Concepts

Neuron: node in a NN, typically taking in multiple input values and generating one output value; calculates the output value by applying an activation function (nonlinear transformation) to a weighted sum of input values
Weights: edges in a NN; the goal of training is to determine the optimal weight for each feature; if a weight = 0, the corresponding feature does not contribute
Neural Network: composed of neurons (simple building blocks that actually "learn"); contains activation functions that make it possible to predict non-linear outputs
Activation Functions: mathematical functions that introduce non-linearity to a network. e.g. ReLU, tanh
Sigmoid Function: function that maps very negative numbers to a number very close to 0, huge numbers close to 1, and 0 to 0.5. Useful for predicting probabilities (see the sketch after this list).
Gradient Descent/Backpropagation: fundamental loss optimizer algorithms, on which the other optimizers are usually based. Backpropagation is similar to gradient descent but for neural nets
Optimizer: operation that changes the weights and biases to reduce loss. e.g. Adagrad or Adam
Weights / Biases: weights are values that the input features are multiplied by to predict an output value; biases are the value of the output given a weight of 0
Converge: an algorithm that converges will eventually reach an optimal answer, even if very slowly. An algorithm that doesn't converge may never reach an optimal answer.
Learning Rate: rate at which optimizers change weights and biases. A high learning rate generally trains faster but risks not converging, whereas a lower rate trains slower
Numerical Instability: issues with very large/small values due to the limits of floating point numbers in computers
Embeddings: mapping from discrete objects, such as words, to vectors of real numbers; useful because classifiers/neural networks work well on vectors of real numbers
Convolutional Layer: series of convolutional operations, each acting on a different slice of the input matrix
Dropout: method for regularization in training NNs; works by removing a random selection of some units in a network layer for a single gradient step
Early Stopping: method for regularization that involves ending model training early
Gradient Descent: technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data
Pooling: reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area
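A minimal NumPy sketch of the sigmoid and ReLU activation functions referenced above:

import numpy as np

def sigmoid(z):
    # maps large negative z toward 0, large positive z toward 1, and 0 to 0.5
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # rectified linear unit: passes positive values, zeroes out negatives
    return np.maximum(0, z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))   # ~[0.000, 0.269, 0.5, 0.731, 1.000]
print(relu(z))      # [ 0.  0.  0.  1. 10.]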
Big Data- Hadoop Overview
Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised using a group of computers to process this "big data" (distributed). Such a group is called a cluster, which makes up server farms. All of these servers have to be coordinated in the following ways: partition data, coordinate computing tasks, handle fault tolerance/recovery, and allocate capacity to process.

Hadoop
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is comprised of 3 main components:
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data by partitioning data across many machines
• YARN: framework for job scheduling and cluster resource management (task coordination)
• MapReduce: YARN-based system for parallel processing of large data sets on multiple machines

HDFS
A cluster is comprised of 1 master node, while the rest of the machines are worker/data nodes. The master node manages the overall file system by storing the directory structure and the metadata of the files. The data nodes physically store the data. Large files are broken up and distributed across multiple machines, and the pieces are also replicated across multiple machines to provide fault tolerance.

MapReduce
Parallel programming paradigm which allows for processing of huge amounts of data by running processes on multiple machines. Defining a MapReduce job requires two stages: map and reduce.
• Map: operation to be performed in parallel on small portions of the dataset. The output is a key-value pair <K, V>
• Reduce: operation to combine the results of Map

YARN- Yet Another Resource Negotiator
Coordinates tasks running on the cluster and assigns new nodes in case of failure. Comprised of 2 subcomponents: the resource manager and the node manager. The resource manager runs on a single master node and schedules tasks across nodes. The node manager runs on all other nodes and manages tasks on the individual node.
Big Data- Hadoop Ecosystem
An entire ecosystem of tools has emerged around Hadoop, which are based on interacting with HDFS. Below are some popular ones:

Hive: data warehouse software built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries (HiveQL). Hive abstracts away the underlying MapReduce jobs and returns results in the form of tables (not raw HDFS files).
Pig: high level scripting language (Pig Latin) that enables writing complex data transformations. It pulls unstructured/incomplete data from sources, cleans it, and places it in a database/data warehouse. Pig performs ETL into the data warehouse, while Hive queries from the data warehouse to perform analysis (GCP: DataFlow).
Spark: framework for writing fast, distributed programs for data processing and analysis. Spark solves similar problems as Hadoop MapReduce but with a fast in-memory approach. It is a unified engine that supports SQL queries, streaming data, machine learning, and graph processing. It can operate separately from Hadoop but integrates well with it. Data is processed using Resilient Distributed Datasets (RDDs), which are immutable, lazily evaluated, and track lineage.
Hbase: non-relational, NoSQL, column-oriented database management system that runs on top of HDFS. Well suited for sparse data sets (GCP: BigTable).
Flink/Kafka: stream processing frameworks. Batch streaming is for bounded, finite datasets, with periodic updates and delayed processing. Stream processing is for unbounded datasets, with continuous updates and immediate processing. Stream data and stream processing must be decoupled via a message queue. Streaming data can be grouped into windows using tumbling (non-overlapping time), sliding (overlapping time), or session (session gap) windows.
Beam: programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing. After building the pipeline, it is executed by one of Beam's distributed processing back-ends (Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow). Modeled as a Directed Acyclic Graph (DAG).
Oozie: workflow scheduler system to manage Hadoop jobs.
Sqoop: transferring framework to transfer large amounts of data into HDFS from relational databases (e.g. MySQL).
SQL Part I
Structured Query Language (SQL) is a declarative language used to access & manipulate data in databases. Usually the database is a Relational Database Management System (RDBMS), which stores data arranged in relational database tables. A table is arranged in columns and rows, where columns represent characteristics of stored data and rows represent actual data entries.

Basic Queries
- filter columns: SELECT col1, col3... FROM table1
- filter the rows: WHERE col4 = 1 AND col5 = 2
- aggregate the data: GROUP BY...
- limit aggregated data: HAVING count(*) > 1
- order the results: ORDER BY col2

Useful Keywords for SELECT
DISTINCT - return unique results
BETWEEN a AND b - limit the range; the values can be numbers, text, or dates
LIKE - pattern search within the column text
IN (a, b, c) - check if the value is contained among the given values

Data Modification
- update specific data with the WHERE clause:
UPDATE table1 SET col1 = 1 WHERE col2 = 2
- insert values manually:
INSERT INTO table1 (col1,col3) VALUES (val1,val3);
- or by using the results of a query:
INSERT INTO table1 (col1,col3) SELECT col,col2 FROM table2;

Joins
The JOIN clause is used to combine rows from two or more tables, based on a related column between them.
e.g. SELECT t1.col1, t2.col2 FROM table1 t1 JOIN table2 t2 ON t1.col3 = t2.col3
Python- Data Structures
Data structures are a way of storing and manipulating data, and each data structure has its own strengths and weaknesses. Combined with algorithms, data structures allow us to efficiently solve problems. It is important to know the main types of data structures that you will need to efficiently solve problems.

Lists: or arrays, ordered sequences of objects, mutable

>>> l = [42, 3.14, "hello","world"]

Tuples: like lists, but immutable

>>> t = (42, 3.14, "hello","world")

Dictionaries: hash tables, key-value pairs, unsorted

>>> d = {"life": 42, "pi": 3.14}

Sets: mutable, unordered sequences of unique elements; frozensets are just immutable sets

>>> s = set([42, 3.14, "hello","world"])

Collections Module
deque: double-ended queue, a generalization of stacks and queues; supports append, appendleft, pop, rotate, etc.

>>> from collections import deque
>>> s = deque([42, 3.14, "hello","world"])

Counter: dict subclass, unordered collection where elements are stored as keys and counts are stored as values

>>> from collections import Counter
>>> c = Counter('apple')
>>> print(c)
Counter({'p': 2, 'a': 1, 'l': 1, 'e': 1})

heapq Module
Heap Queue: priority queue; heaps are binary trees for which every parent node has a value less than or equal to any of its children (min-heap); order is important; supports push, pop, pushpop, heapify, and replace functionality

>>> from heapq import heappush
>>> data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
>>> heap = []
>>> for n in data:
...     heappush(heap, n)
>>> heap
[0, 1, 2, 6, 3, 5, 4, 7, 8, 9]

Recommended Resources
• Data Science Design Manual (www.springer.com/us/book/9783319554433)
• Introduction to Statistical Learning (www-bcf.usc.edu/~gareth/ISL/)
• Probability Cheatsheet (www.wzchen.com/probability-cheatsheet/)
• Google's Machine Learning Crash Course (developers.google.com/machine-learning/crash-course/)
