
Machine Learning & Data Mining


1. 1.) True positives  2.) True negatives  3.) False positives  4.) False negatives

True positives (TP): These are cases in which we predicted yes and the category is yes.

True negatives (TN): These are cases in which we predicted no and the category is no.

False positives (FP): These are cases in which we predicted yes and the category is no.

False negatives (FN): These are cases in which we predicted no and the category is yes.
2. 3 V's 1.) Volume: terabytes and up.
2.) Velocity: from streaming data
3.) Variety: numeric, video, sensor, unstructured text...
3. Basic Performance Measure - Classification
Classification Accuracy (1 - Error Rate)
- Proportion of test cases classified correctly
4. Basic Performance Measure - Regression
Root Mean Square Error (error = Prediction - Actual)
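
A minimal sketch of these two basic measures; the function names (accuracy, rmse) and the example values are illustrative, not part of the cards.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Classification: proportion of test cases classified correctly (1 - error rate)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

def rmse(y_true, y_pred):
    # Regression: root mean square error, where error = prediction - actual
    errors = np.asarray(y_pred) - np.asarray(y_true)
    return np.sqrt(np.mean(errors ** 2))

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
print(rmse([3.0, 5.0], [2.5, 5.5]))          # 0.5
```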
5. Bayes' Rule P(b | a) = P(a | b) P(b) / P(a)

- Follows from the Product Rule

- Allows us to reason about causes when we have observed an effect
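
A tiny worked sketch of the rule; the probabilities below are assumed values chosen for illustration, not taken from the cards.

```python
def bayes(p_a_given_b, p_b, p_a):
    # P(b | a) = P(a | b) P(b) / P(a)
    return p_a_given_b * p_b / p_a

# Reasoning from an observed effect (toothache) back to a cause (cavity),
# with made-up likelihood, prior and evidence probabilities.
print(bayes(p_a_given_b=0.9, p_b=0.1, p_a=0.15))  # P(cavity | toothache) = 0.6
```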
6. Big Data Data sets of a scale and complexity such that they are difficult to process using current standard methods
(standard DB tools & data management apps) - a moving target

3 V's - Volume, Velocity, Variety

7. Conditional Probability - Also known as posterior probability

- Probability is conditioned on other evidence

- P(cavity | toothache) = 0.8 "Prob. of cavity is 0.8, given that all you know is that you have a toothache"
8. Confusion Matrix a table that is often used to describe the performance of a classification model (or "classifier") on a set of test
data for which the true values are known
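
A minimal sketch, assuming binary 0/1 labels, of counting the four cell values from card 1 and arranging them as a confusion matrix; confusion_counts and the toy labels are illustrative.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted yes, category yes
    tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted no, category no
    fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted yes, category no
    fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted no, category yes
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts([1, 0, 1, 0, 1], [1, 0, 0, 1, 1])
print(np.array([[tn, fp],
                [fn, tp]]))  # rows: actual class, columns: predicted class
```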
9. Curse of Dimensionality The curse of dimensionality refers to how certain learning algorithms may perform poorly in high-dimensional data.

First, it's very easy to overfit the training data, since we can have a lot of assumptions that describe the target label (in the case of supervised learning). In other words, we can easily express the target using the dimensions that we have.

Second, we may need to increase the number of training examples exponentially to overcome the curse of dimensionality, and that may not be feasible.

Third, in ML algorithms that depend on distance, like k-means for clustering or k-nearest neighbours, everything can become far from everything else and it's difficult to interpret the distance between the data points.
10. Data Mining - Extract interesting knowledge from large unstructured data-sets
* non-obvious, comprehensible, meaningful, useful
11. Eager learning When given training data, construct a model that summarises the data, for future use in prediction

- Analogy: compilation in a programming language

- Slow in model construction, quicker in subsequent use
- Model itself may be useful/informative
12. Entropy measure of uncertainty of a random variable (acquisition of information corresponds to a reduction of
entropy)
13. Explain the Data Mining process CRISP-DM (Cross Industry Standard Process for Data Mining)
- Problem Definition
- Data Exploration
- Data Preparation
- Modelling
- Evaluation
- Deployment
14. Explain what Supervised learning is Refers to the fact that we gave the algorithm a data set in which the "right answers" were given
Regression: Predict a continuous valued output
Classification: Predict a discrete valued output (0 or 1)
15. Gradient Descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent
starts with an initial set of parameter values and iteratively moves toward a set of parameter values that
minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative
direction of the function gradient.
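
A minimal sketch of the idea on a one-variable function f(x) = (x - 3)^2, whose gradient is 2(x - 3); the learning rate and step count are arbitrary choices for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)  # step in the negative gradient direction
    return x

# Minimising f(x) = (x - 3)^2 starting from x = 0
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # converges towards 3.0
```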
16. Inductive Learning of a Decision Tree
Step 1 - For all the attributes that have not yet been used in the tree, calculate their entropy and information gain values for the training samples

Step 2 - Select the attribute that has the highest information gain

Step 3 - Make a tree node containing that attribute

Step 4 - This node partitions the data: apply the algorithm recursively to each partition
17. Information Gain Of an attribute: the reduction in entropy from partitioning the data according to that attribute
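
A sketch of entropy and information gain for a categorical attribute, as used in Steps 1-2 of the decision-tree card above; plain Python, with toy data chosen for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    # Uncertainty of the label distribution, in bits
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(attribute_values, labels):
    # Reduction in entropy from partitioning the data by the attribute
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# The attribute perfectly separates the labels, so the gain is the full 1 bit
print(information_gain(['a', 'a', 'b', 'b'], ['yes', 'yes', 'no', 'no']))  # 1.0
```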
18. In order to solve a given problem of supervised learning, one has to perform the following steps:
- Determine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.
- Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
- Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality, but should contain enough information to accurately predict the output.
- Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support vector machines or decision trees.
- Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation (see the sketch after this list).
- Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
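
A sketch of the last two steps, assuming scikit-learn; the dataset, classifier and parameter grid are arbitrary illustrations, not prescribed by the card.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Adjust a control parameter (max_depth) by cross-validation on the training set
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 3, 4]}, cv=5)
search.fit(X_train, y_train)

# Final, objective evaluation on a test set the learner has never seen
print(search.best_params_, search.score(X_test, y_test))
```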
19. Key insights to - Each sample can be considered to be a point in sample space
kNN - if two samples are close to each other in space, they should be close to each other in their target values
20. Lazy Learning No explicit model constructed
- Calculations deferred until new case to be classified
21. lazy learning See 20. Lazy Learning
22. Logistic predicts probabilities, and is therefore a regression algorithm. However, it is commonly described as a classification
regression method in the machine learning literature, because it can be (and is often) used to make classifiers. There are also
"true" classification algorithms, such as SVM, which only predict an outcome and do not provide a probability.
23. Main symptom of overfitting Much better performance on the training data than on independent test data
24. ML task entail: Classification Assign each example to one of a set of discrete classes (discrete valued output)
25. ML task entail: clustering Group examples into clusters of similar items, without pre-existing labels
26. ML task entail: regression Predict a continuous valued output, such as rainfall amount
27. Noise Imprecise or incorrect attribute values or labels
- Can't always quantify it, but should know from situation if it is present
- E.g. labels may require subjective judgement or values may come from imprecise measurements
28. Pre-processing is the initial manipulation of your data for your learner
29. Q learning A type of Reinforcement Learning that optimises the behaviour of a system through trial and error
- Updates its state-action values (and hence its policy, a state-action mapping) based on a reward
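
A sketch of the core Q-learning update; the learning rate alpha, discount gamma, and the states and actions below are assumptions chosen for illustration.

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Nudge the state-action value towards reward + discounted best future value
    best_next = max(Q[next_state].values()) if Q[next_state] else 0.0
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 0.0, "right": 0.0}}
q_update(Q, "s0", "right", reward=1.0, next_state="s1")
print(Q["s0"]["right"])  # 0.1
```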
30. Receiver Operating Characteristic The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier

- The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings
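
A sketch of computing the (FPR, TPR) points and the area under the curve, assuming scikit-learn; the labels and scores are toy values.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # e.g. predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(fpr, tpr)                       # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, scores))  # area under the ROC curve
```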
31. Regression predict a continuous variable, such as rainfall amount or sunlight intensity. They can also predict probabilities, such
models as the probability that an image contains a cat. A probability-predicting regression model can be used as part of a
classifier by imposing a decision rule - for example, if the probability is 50% or more, decide it's a cat.
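
A sketch of both views (probabilities from a regression-style model, hard classes via a decision rule), assuming scikit-learn's LogisticRegression from card 22; the toy data is illustrative.

```python
from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[2.0]]))  # predicted probability of each class
print(model.predict([[2.0]]))        # hard class label via the default 0.5 rule
```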
32. reinforcement learning The process of learning by interacting with an environment, guided by feedback in the form of rewards
33. Relationship of Data Mining to Machine Learning Data Mining applies Machine Learning algorithms (among other methods) as tools to extract interesting knowledge from large data-sets; ML supplies the learning techniques, Data Mining is their application
34. Sources of - Incomplete knowledge: lack of relevant facts, partial observations, inaccurate measurements, incomplete domain
uncertainty: theory

- Inability to process: too complex to use all possible relevant data in computations, or to consider all possible
exceptions and qualifications
35. Supervised Given examples, return function h (hypothesis) that approximates some 'true' function f that (hypothetically)
Learning: Task generated the labels for the examples
Def
36. Test sets used to evaluate/compare hypotheses
37. Text Classification Where a document is classified into one or more existing classes. Typically words are used as features (attributes).

Issues:
1.) Lots of attributes
2.) Large number of rarely used attributes such as neologisms or antiquated words
3.) Large # of frequently used words that contain no useful info
All of these can be mitigated with pre-processing
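
A sketch of a typical word-feature pipeline (count features plus a simple classifier), assuming scikit-learn; the documents, labels and the stop-word pre-processing choice are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap pills buy now", "meeting agenda attached", "buy cheap meds"]
labels = ["spam", "ham", "spam"]

# Stop-word removal is one pre-processing step that drops frequent, uninformative words
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["cheap offer buy"]))  # ['spam']
```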
38. Training Set Quality - MAR (Missing At Random) When missing data is not missing completely at random, but the missingness can be fully accounted for by variables for which there is complete information
Example - Men not reporting depression
39. Training Set Quality - MCAR (Missing Completely At Random) The presence/absence of data is completely independent of the observable variables
40. Training Set Quality - MNAR (Missing Not At Random) When the missing values are neither MCAR nor MAR. Example - People w/ depression not reporting their depression.
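
A toy pandas illustration of the MAR example above (men not reporting depression): missingness in one column depends on another, fully observed column. The values are fabricated purely for demonstration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sex":        ["M", "M", "M", "F", "F", "F"],
    "depression": [np.nan, np.nan, 1.0, 0.0, 1.0, 0.0],  # made-up values
})

# Missingness rate per group shows the dependence on the observed variable
print(df["depression"].isna().groupby(df["sex"]).mean())
```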
41. Training sets used to construct hypotheses
42. Validation set "Randomly sample a validation set and hide it"
"Finally, unlock the drawer with the validation set; evaluate (objectively) using it; publish these results"
43. What is Machine Learning? "Field of study that gives computers the ability to learn without being explicitly programmed"
- Samuel, 1959

"Learning is changing behavior in a way that makes performance better in the future"
- Witten & Frank, 1999

"Improvement with experience at some task" and "A well-defined ML problem:
- improve at task T
- with respect to performance measure P
- based on experience E"
- Mitchell, 1997
44. Wolpert's "No Free Lunch" theorem There are no hard-and-fast rules for which algorithm will work well for your data
- Different algorithms make different assumptions that are either well suited or poorly suited to the particular dataset
