1. 1.) True positives (TP): These are cases in which we predicted yes and the category is yes.
2.) True negatives (TN): These are cases in which we predicted no and the category is no.
3.) False positives (FP): These are cases in which we predicted yes and the category is no.
4.) False negatives (FN): These are cases in which we predicted no and the category is yes.
- P(cavity | toothache) = 0.8 "Probability of cavity is 0.8, given that all you know is that you have a toothache"
8. Confusion Matrix A table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
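The four cells of such a table are the counts of true positives, true negatives, false positives and false negatives. A minimal sketch of how they could be tallied is shown here; the function name `confusion_counts`, the "yes"/"no" label values and the sample labels are all assumptions made up for illustration.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP and FN for a binary classifier."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            # Predicted yes: correct if the category is yes, otherwise a false positive.
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # Predicted no: correct if the category is no, otherwise a false negative.
            counts["TN" if truth != positive else "FN"] += 1
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}


# Hypothetical test-set labels, invented for this example.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

cells = confusion_counts(actual, predicted)
print(cells)
```

The true values must be known for every test example, which is why the table is built from test data rather than from unlabeled data.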
9. Curse of Dimensionality The curse of dimensionality refers to how certain learning algorithms may perform poorly on high-dimensional data.
First, it is very easy to overfit the training data, since many different assumptions can describe the target label (in the case of supervised learning). In other words, we can easily express the target using the dimensions that we have.
Second, we may need to increase the amount of training data exponentially to overcome the curse of dimensionality, and that may not be feasible.
Third, in learning algorithms that depend on distance, such as k-means for clustering or k-nearest neighbors, all points can become far from each other, and it becomes difficult to interpret the distances between data points.
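The third point can be illustrated with a small experiment. The sketch below assumes points drawn uniformly from the unit hypercube and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The function name `relative_contrast` and all parameter values are illustrative choices, not part of the original notes.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


low = relative_contrast(dim=2)
high = relative_contrast(dim=1000)
print(f"2 dimensions:    {low:.2f}")
print(f"1000 dimensions: {high:.2f}")
```

Under these assumptions the contrast shrinks sharply as the dimension grows, so the "nearest" neighbor is barely nearer than any other point, which is what makes distance-based methods hard to interpret.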
10. Data Mining - Extract interesting knowledge from large unstructured data sets
* interesting: non-obvious, comprehensible, meaningful, useful
11. Eager learning When given training data, construct a model that summarises the data, for future use in prediction
- Inability to process: too complex to use all possible relevant data in computations, or to consider all possible
exceptions and qualifications
35. Supervised Learning: Task Def Given examples, return a function h (hypothesis) that approximates some 'true' function f that (hypothetically) generated the labels for the examples
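As a rough illustration of this task definition, the sketch below takes labeled examples and returns a hypothesis h. It assumes a one-dimensional input and a straight-line hypothesis fitted by least squares; the names `learn_line` and `f`, and the choice f(x) = 2x + 1, are hypothetical. In practice f is unknown and only the examples are available.

```python
def learn_line(examples):
    """Return a hypothesis h(x) = slope * x + intercept fitted by least squares."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in examples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def h(x):
        # The hypothesis only needs the two fitted numbers, not the examples.
        return slope * x + intercept

    return h


def f(x):
    """Hypothetical 'true' function, used here only to generate labels."""
    return 2 * x + 1


examples = [(x, f(x)) for x in range(5)]
h = learn_line(examples)
print(h(10))
```

Because the fitted model summarises the examples and can then be used without them, this sketch is also an instance of eager learning.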
36. Test sets Used to evaluate and compare hypotheses
37. Text Classification Classification where a document is assigned to one or more existing classes. Typically, words are used as features (attributes).
Issues:
1.) Lots of attributes
2.) Large number of rarely used attributes, such as neologisms or antiquated words
3.) Large number of frequently used words that contain no useful information
All of these can be mitigated with pre-processing.
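One possible form of that pre-processing is sketched below: drop very common words using a stop-word list, and drop words that appear in too few documents. The stop-word list, the `min_doc_count` threshold, the function names and the sample documents are all assumptions chosen for illustration. Real systems typically use longer lists and tuned thresholds.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents, min_doc_count=2):
    """Keep words that are not stop words and occur in at least min_doc_count documents."""
    doc_counts = Counter()
    for text in documents:
        doc_counts.update(set(tokenize(text)))  # count each word once per document
    return {
        word
        for word, count in doc_counts.items()
        if count >= min_doc_count and word not in STOP_WORDS
    }


def to_features(text, vocabulary):
    """Represent a document as counts of the vocabulary words it contains."""
    return Counter(token for token in tokenize(text) if token in vocabulary)


documents = [
    "The match ended in a draw",
    "The striker scored in the match",
    "The election is a close match of ideas",
    "Voters wait for the election result",
]
vocabulary = build_vocabulary(documents)
features = to_features(documents[1], vocabulary)
print(sorted(vocabulary), dict(features))
```

Both filters shrink the attribute set, which addresses issues 1 to 3 at the cost of occasionally discarding a rare word that would have been informative.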
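38. Training Set Quality - MAR (missing at random) When the missing data is not completely random, but the missingness can be fully explained by a variable for which there is complete information
Example - Men not reporting depression (whether a value is missing depends on sex, which is recorded for everyone)
39. Training Set Quality - MCAR (missing completely at random) The presence/absence of data is completely independent of both the observed variables and the missing values themselves
40. Training Set Quality - MNAR (missing not at random) When the missing values are neither MCAR nor MAR. Example - People w/ depression not reporting it.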
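The three mechanisms can be compared with a small simulation. The sketch below invents a depression score and hides some values under each mechanism. Every number in it (score distribution, missingness probabilities, the threshold of 12) is a made-up assumption, and the scores are assumed to have the same distribution for men and women.

```python
import random
from statistics import mean


def simulate(mechanism, n=20000, seed=0):
    """Return a list of (is_male, score, is_reported) records for one mechanism."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        is_male = rng.random() < 0.5
        score = rng.gauss(10, 3)  # assumed: same distribution for both sexes
        if mechanism == "MCAR":
            p_missing = 0.3  # independent of everything
        elif mechanism == "MAR":
            p_missing = 0.5 if is_male else 0.1  # depends only on observed sex
        elif mechanism == "MNAR":
            p_missing = 0.6 if score > 12 else 0.1  # depends on the hidden value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        records.append((is_male, score, rng.random() >= p_missing))
    return records


def reporting_gap(records):
    """Mean of the reported scores minus the mean of all scores."""
    all_scores = [score for _, score, _ in records]
    reported = [score for _, score, is_reported in records if is_reported]
    return mean(reported) - mean(all_scores)


gaps = {m: reporting_gap(simulate(m)) for m in ("MCAR", "MAR", "MNAR")}
for mechanism, gap in gaps.items():
    print(f"{mechanism}: {gap:+.2f}")
```

Under these assumptions the reported mean stays close to the true mean for MCAR and MAR, while MNAR pulls it down because high scores are the ones that go missing. If scores did differ by sex, the MAR case would also show a gap in the raw mean, but it could be corrected by conditioning on sex, since sex is fully observed.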
41. Training sets used to construct hypotheses
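These cards distinguish three sets: training sets to construct hypotheses, test sets to compare them, and a validation set that is hidden until the very end. A minimal sketch of such a split follows. Note that many other texts swap the names "test" and "validation"; the code follows the usage in these cards. The function name `three_way_split` and the 60/20/20 proportions are assumptions.

```python
import random


def three_way_split(examples, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into training, test and validation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    n_test = int(len(shuffled) * test_fraction)
    # Sample the validation set first so it can be set aside untouched.
    validation = shuffled[:n_validation]
    test = shuffled[n_validation:n_validation + n_test]
    training = shuffled[n_validation + n_test:]
    return training, test, validation


examples = list(range(100))  # stand-in for real labeled examples
training, test, validation = three_way_split(examples)
print(len(training), len(test), len(validation))
```

The important property is that the three sets do not overlap, so results on the hidden set are not influenced by choices made while constructing or comparing hypotheses.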
42. Validation set "Randomly sample a validation set and hide it"
"Finally unlock the drawer with the validation set; evaluate (objectively) using it: publish these results"
43. What is Machine Learning? "Field of study that gives computers the ability to learn without being explicitly programmed"
- Samuel, 1959
"Learning is changing behavior in a way that makes performance better in the future"
- Witten & Frank 1999