
Machine Learning

Study online at quizlet.com/_2du7gi

1. |Example|a1|a2|Classification|
|1|A|D|+|
|2|A|E|-|
|3|B|D|+|
|4|C|D|+|
|5|B|E|+|
Calculate the entropy E(Sv) for the given table, where Sv is the subset of S for which an attribute X has the value v, X ∈ {a1, a2}. That is, calculate the entropy for each attribute value. It is sufficient to give the results as sums of -log2(y) terms; exact calculation of the logarithms is not necessary.

What is the information gain of a1 and a2 relative to the given training examples?
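A minimal Python sketch (not part of the original card) of how these values can be computed for the table above:

import math

def entropy(labels):
    # entropy = sum over classes of -p * log2(p)
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

examples = [("A", "D", "+"), ("A", "E", "-"), ("B", "D", "+"),
            ("C", "D", "+"), ("B", "E", "+")]
e_s = entropy([c for _, _, c in examples])     # -(4/5)log2(4/5) - (1/5)log2(1/5) ~ 0.722

for attr in (0, 1):                            # attribute a1, then a2
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [ex[2] for ex in examples if ex[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    print("Gain(S, a%d) = %.3f" % (attr + 1, e_s - remainder))
# both attributes leave a remainder of 0.4, so Gain = E(S) - 0.4 ~ 0.322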
2. decision trees: top-down induction: learning a decision tree means finding the best order in which to ask about the attribute values (best means we
want a small tree → information gain)
3. Describe in your own words: How In the EM-Algorithm all known values are considered via their likelihood depending on the
does the EM-algorithm deal with distribution. In the same way hidden (i.e. missing) values are considered as depending on the
the missing value problem? probability distribution and additionally on the known values. So the complete distribution can
be seen as the product of two probability distributions (known and missing values).

The algorithm searches for the parameters that maximize the log-likelihood. As they depend on
the missing values, those are averaged out. In an iterative procedure the estimated parameter is
improved (M-step) followed by averaging over the missing values using the obtained
parameter (E-step). This will lead the estimation of the parameter to converge to a local
maximum, which hopefully is close to the true parameter value. The principle in handling missing
values here is not to try to recover them exactly, but to infer values from a model given by
the probability distribution. In the best case this does not lead to information loss,
although it generally does; it at least makes the existing values technically usable.
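To make the E/M alternation concrete, here is a minimal sketch of EM for a two-component 1D Gaussian mixture (NumPy; the initialization and the toy data are assumptions made for the example, not from the lecture):

import numpy as np

def em_gmm_1d(x, n_iter=50):
    mu = np.array([x.min(), x.max()])          # crude initial guesses
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities = posterior probability of each component per point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters with the missing assignments averaged out
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 100)])
print(em_gmm_1d(x))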
4. Describe possible disadvantages of learns nothing from negative examples,
the Find-S- algorithm. cannot tell whether it learned a concept,
cannot tell whether training data is inconsistent
5. Describe recurrent networks Feed-forward networks can be computed sequentially (updating), since there are no cycles -->
they are 'timeless'
Not possible for recurrent networks (due to cycles) --> recurrent networks are dynamical
systems, time is important.
They can either be described continuously by differential equations or discretely by the use of
synchronous (next time step for all neurons) or asynchronous updating (randomly updating
neurons).
Phase space: space of all possible states of a network (all possible activation vectors)
Evolution of the system is a trajectory through the phase space
Basic scenarios:
--> convergence to point attractor
--> convergence to limit cycle
--> convergence to strange attractor
--> chaos
6. Describe the training of a perceptron and describe two modes of learning:
The perceptron is iteratively trained using labeled data D = {(~x^n, t^n)}, ~x^n element R^d.
Perceptron training rule:
Delta ~w = epsilon (t - y) ~x
Convergence can be shown if the learning rate epsilon is sufficiently small (and the task is solvable by a perceptron).
The learning rule can be derived from minimizing the mean squared error.
Two modes of learning:
- batch mode: gradient descent with respect to the entire training set:
Delta ~w = epsilon Sum_i ((ti - y(~xi)) ~xi)
- incremental mode: gradient descent with respect to one example (~xi, ti):
Delta ~w = epsilon (ti - y(~xi)) ~xi
For small epsilon: incremental mode can be shown to approximate batch mode arbitrarily closely.
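A small sketch of the incremental training rule (NumPy; the AND data and the learning rate are example choices):

import numpy as np

def train_perceptron(X, t, epsilon=0.1, epochs=100):
    X = np.hstack([X, np.ones((len(X), 1))])   # append a constant bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            y = 1.0 if xi @ w > 0 else 0.0     # threshold unit output
            w += epsilon * (ti - y) * xi       # Delta w = epsilon (t - y) x
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])                     # AND, which is linearly separable
print(train_perceptron(X, t))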
7. Explain at least one - Scatterplot matrix: Scatter data into plots for each combination of two attributes.
other data - Glyphs: Map data dimensions onto parameters for geometrical figures, e.g. star glyphs, arrows. Properties
visualization could be lengths, widths, orientations, colors, ...
technique from the - Parallel coordinates: Use feature dimensions as one axis and feature values as another. Plot datapoints as
lecture lines.
- Projections: Several different techniques to project the data: PCA, scaling strategies, ...
8. Explain Hebb's rule? Increase weight wi according to the product of the activation of the neuron and the input via this channel i.
So far, weights can become arbitrarily large. To prevent this, there are several solutions:
Decay term (including constant forgetting):
Delta ~w = epsilon y(~x·~w) ~x - gamma ~w
Dynamic normalization to |~w| = 1:
Delta ~w = epsilon y(~x·~w) ~x - gamma ~w (|~w|^2 - 1)
Explicit normalization to |~w| = 1:
~w^new = (~w^old + Delta ~w) / |~w^old + Delta ~w|
Oja's rule, uses weight decay ~ y^2:
Delta ~w = epsilon y(~x·~w) (~x - y(~x·~w) ~w)
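A sketch of Hebbian learning with Oja's decay term (NumPy; the 2D toy data are made up). The weight vector converges, up to sign, to the first principal component of the data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2)) @ np.array([[1.0, 0.6], [0.6, 1.0]])  # stretched along (1, 1)

w = rng.normal(size=2)
epsilon = 0.01
for x in X:
    y = x @ w                       # neuron output y(x.w)
    w += epsilon * y * (x - y * w)  # Oja's rule: Hebb term plus y^2 weight decay
print(w / np.linalg.norm(w))        # ~ +-(0.71, 0.71), the first principal component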
9. Explain Hypothesis space search by ID3: ID3 searches the space of possible decision trees. Since this space is complete, the target hypothesis
is surely in there. Output is a single hypothesis (not the version space) → no queries to resolve among
competing hypotheses. No backtracking implemented → ends up in local maxima. Statistically-based search
choices → robust to noisy data. Disadvantage: all examples needed for every choice, no incremental learning.
10. Explain in your own Intrinsic dimensionality exists in contrast to the descriptive dimensionality of data, which is defined by the
words the concepts numbers of parameters used to produce or represent the raw data (i.e. the number of pixels in an unprocessed
of descriptive and image).
intrinsic
dimensionality In addition to this representational dimensionality, there is also a (most of the time smaller) number of
independent parameters which is necessary to describe the data, always in regard to a specific problem we
want to use the data on.
For example: a data set might consist of a number of portraits, all with size 1920x1080 pixels, which constitutes
their descriptive dimensionality. To do some facial recognition on these portraits however, we do not need the
complete descriptive dimension space (which would be way too big anyway), but only a few independent
parameters (which we can get by doing PCA and looking at the eigenfaces).
This is possible because the data never fill out the entire high dimensional vector space but instead
concentrate along a manifold of a much lower dimensionality.
11. Explain the idea Idea: For merging, do not only take inter-cluster distance into account but also consider size of the clusters
behind minimum (preferring
variance clustering small ones)
12. Explain the idea Idea: merge the two clusters for which the increase in total variance is minimized
behind Ward's This approach is optimization-based. However, it can be implemented by a distance measure.
minimum variance Properties: prefers spherical clusters and clusters of similar size. Robust against noise but not against outliers.
clustering.
What are
properties?
13. Explain what entropy and information The decision which attribute to choose is based on entropy:
gain is used for
In many machine learning applications it is crucial to determine which criterions are
necessary for a good classification. Decision trees have those criterions close to the
root, imposing an order from significant to less significant criterions. One way to select
the most important criterion is to compare its information gain or its entropy to others.
14. For which data structures is normal PCA linear data structures
appropriate? Problems occur if the data structure is non-linear. A better solution here is local PCA
15. Given the general boundary G = {<Strong,?,?>} and the specific boundary S = {<Strong, sunny, warm>, <Strong, cloudy, cool>}, which we learned using the Candidate Elimination algorithm. Provide the complete version space, including more-general-than relations. Provide a definition of your choice for displaying more-general-than relations.
Answer right???
VS = {<Strong,?,?>, <Strong, sunny, ?>, <Strong, cloudy, ?>, <Strong, ?, warm>, <Strong, ?, cool>, <Strong, sunny, warm>, <Strong, cloudy, cool>}
The most general hypothesis in this example is <Strong,?,?>; having fewer '?'s makes a hypothesis more specific.
16. Give the sigmoid function and its first σ(t)=1/(1+e^(−t))
derivative. ∂σ/∂t=σ(t)(1−σ(t))
Why is it used for Multi-Layer The sigmoid function is commonly used because of its nice analytical properties: Its
Perceptrons? range is (0,1), it is non-linear, strictly monotonic, continuous, differentiable, and the
derivative can be expressed in terms of the original function at the given point. This
allows us to avoid redundant calculations. The sigmoid function is a special case of the
more general Logistic function which can be found in many different fields: Biology,
chemistry, economics, demography and recently most prominently: artificial neural
networks.
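A tiny sketch showing that the derivative can be computed from the already known function value (NumPy):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_prime(t):
    s = sigmoid(t)          # reuse the activation; no redundant computation needed
    return s * (1.0 - s)

t = np.linspace(-5, 5, 11)
print(sigmoid(t))
print(sigmoid_prime(t))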
17. How can we avoid overfitting? Stop growing when data split is no more statistically significant OR grow tree & post-
prune.
18. how can we obtain binary attributes in to include continuous valued attributes, define thresholds to obtain binary attribute.
continuous valued attributes? Thresholds should be chosen to gain information: sort the examples by attribute value, find
boundaries where the classification changes, and set thresholds at those boundaries.
19. How does Backpropagation in MLPs Multilayer perceptrons (MLPs) can be regarded as a simple concatenation (and
work? parallelization) of several perceptrons, each having a specified activation function σ and
a set of weights wij. The idea that this can be done was discovered early after the
invention of the perceptron, but people didn't really use it in practice because nobody
really knew how to figure out the appropriate wij. The solution to this problem was the
discovery of the backpropagation algorithm which consists of two steps: first
propagating the input forward through the layers of the MLP and storing the
intermediate results and then propagating the error backwards and adjusting the
weights of the units accordingly.

An updating rule for the output layer can be derived straightforwardly. The rules for the
intermediate layers can be derived very similarly and only require a slight shift in
perspective - the mathematics for that are however not in the standard toolkit so we are
going to omit the calculations and refer you to the lecture slides.

We take the least-squares approach to derive the updating rule, i.e. we want to minimize
the Loss function
L=1/2*(y−t)^2
where t is the given (true) label from the dataset and y is the (single) output produced
by the MLP. To find the weights that minimize this expression we want to take the
derivative of L w.r.t. wi where we are now going to assume that the wi are the ones
directly before the output layer
∂L/∂wi=(y−t)oi*y(1−y)
20. How does the LDA classifier The LDA classifier assumes two normally distributed classes of data with the same covariance
work? What restrictions have to matrix. It subtracts the means of the two distributions from each other and normalizes by
be fulfilled by the data for this the inverse covariance.
method to work and why? Taking the inner product of the weight vector (the normalized mean difference) with a new data
point yields either a positive or a negative value which can be compared to the threshold to
determine the estimator for the class of the new data point.
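A sketch of this procedure (NumPy; the pooled-covariance estimate, the midpoint threshold, and the toy data are assumptions for the example):

import numpy as np

def lda_fit(X0, X1):
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    C = 0.5 * (np.cov(X0.T) + np.cov(X1.T))   # shared (pooled) covariance matrix
    w = np.linalg.solve(C, mu1 - mu0)         # difference of means, 'normalized' by C^-1
    theta = w @ (mu0 + mu1) / 2.0             # threshold halfway between the projected means
    return w, theta

def lda_predict(x, w, theta):
    return int(w @ x > theta)                 # inner product compared to the threshold

rng = np.random.default_rng(1)
X0 = rng.normal([-1.0, 0.0], 1.0, size=(100, 2))
X1 = rng.normal([2.0, 1.0], 1.0, size=(100, 2))
w, theta = lda_fit(X0, X1)
print(sum(lda_predict(x, w, theta) for x in X0), sum(lda_predict(x, w, theta) for x in X1))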
21. How does the nearest neighbor For a new sample the nearest neighbor classifier searches for the corresponding nearest data
classifier work? When would point in our training data.
you use it and how is it trained? You would use it when memory and runtime are not of great concern, as training is
instant: You just store the training data. However, classification becomes a computationally more
expensive task.
22. How does the z-test work? Detect outliers via z-values:
z_t = |x_t - mu| / sigma
z_t is a measure for the distance of x_t from the mean mu, normalized with the standard deviation
sigma. Normally, data with z > 3 are considered outliers. Improvement: outliers distort mu, so use the
median instead (commonly with threshold 3.5 then).
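A small sketch of this test, including the median-based variant (NumPy; the toy data are made up):

import numpy as np

def z_outliers(x, threshold=3.0, robust=False):
    center = np.median(x) if robust else np.mean(x)
    z = np.abs(x - center) / np.std(x)        # z_t = |x_t - mu| / sigma
    return np.where(z > threshold)[0]

rng = np.random.default_rng(0)
x = np.append(rng.normal(10.0, 0.5, size=50), 30.0)
print(z_outliers(x))                              # classic z-test, threshold 3
print(z_outliers(x, threshold=3.5, robust=True))  # median variant, threshold 3.5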
23. How do you find the Funahashi: an arbitrary bounded continuous mapping can be approximated with arbitrary precision with
appropriate architecture for one hidden layer
your MLP? with sigmoidal activation functions and a layer with linear output functions
What do layer 1, 2, and 3 define? Cybenko: 2 hidden layers are sufficient to approximate any function with arbitrary precision
Are there restrictions regarding (however, layers might
the number of nodes? be huge)
How so?
--> 1st layer defines hyperplanes
--> 2nd layer defines ANDs over the hyperplanes (convex areas)
--> 3rd layer defines ORs over convex areas
No restrictions on number of nodes, BUT:
--> fewer nodes than in the input layer may destroy degrees of freedom (if present in the data)
--> more nodes may facilitate representation but cannot invent additional degrees of freedom
Hidden layer may serve to discover features in the input data (more dimensions than needed etc)
24. How is a RBFN trained? Note: all input data has to be normalized.

Training a RBFN is a two-step process. First the functions in the hidden layer are initialized. This
can be either done by sampling from the input data or by first performing a k-means clustering,
where k is the number of nodes that have to be initialized.

The second step fits a linear model with coefficients wi to the hidden layer's outputs with respect
to some objective function. The objective function depends on the task: it can be the least
squares function, or the weights can be adapted by gradient descent.
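A sketch of this two-step procedure for a toy regression task (NumPy only; the number of centers, the fixed kernel width, and the least-squares objective are example choices):

import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return centers

def rbf_features(X, centers, width):
    d2 = ((X[:, None] - centers) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))     # Gaussian radial basis functions

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
t = np.sin(X[:, 0])                           # toy regression target

centers = kmeans(X, k=10)                     # step 1: place the hidden units
Phi = np.hstack([rbf_features(X, centers, width=0.5), np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # step 2: fit the linear output weights
print(np.abs(Phi @ w - t).mean())             # mean training error of the fit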
25. How is learning in a self- In self-organizing maps the nodes take part in competitive learning. They compete for each input
organizing map achieved? (As and a winner is determined whose weights (and the weights of its neighbors) are the only weights
opposed to techniques used in that are adapted. MLPs use error correction learning where a delta of the output to a target value
MLP?) is computed and used to adapt all weights in the network.
26. How many principal components are there at most There are at most 20 principal components possible for the face images, but
when you apply the PCA with the 20 training face only four principal components for the binary images.
images provided? How many principal components
were there for the 16 binary images if we made a
PCA on all of them?
27. Kohonen Net too complicated --> deliberately skipped while studying ("Lernen auf Lücke")
28. MLP encoder General result: for an arbitrary distribution of vectors, the n neurons of the
hidden layer (aka bottleneck) converge
to span the subspace of the input space which is spanned by the n principal
components with the largest eigenvalues.
29. Name and explain the three types of learning in Unsupervised Learning: no teacher, unlabeled examples. Effect of learning is
artificial neural networks coded in the learning rules.
Supervised Learning: Teacher/labeled examples. Learning is directed at
mapping the input part of the examples to the label as an output.
Reinforcement Learning: Weak teacher: agent tries to reach goal and only
gets feedback if goal has been reached or not. Agent has to find way (create
mapping) itself.
More realistic learning becomes popular: weakly supervised and semi-
supervised learning: pre-structuring of the data with
unsupervised learning, then use of the scarce labeled data.
30. Name and explain two simple approaches of Nearest neighbour algorithm
instance based learning. Training: Memorize all examples
Application: for a new ~x find the best match and give the output of this best
match.
K-nearest neighbour algorithm:
for an unknown ~x find the set S of its nearest neighbours among the stored examples.
Discrete-valued output: vote
Continuous output:
use mean
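A sketch of both variants (NumPy; the Euclidean distance and the toy data are example choices):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, discrete=True):
    # 'training' was just storing X_train / y_train; all work happens at query time
    idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    if discrete:
        return Counter(y_train[idx]).most_common(1)[0][0]   # vote
    return y_train[idx].mean()                               # mean for continuous output

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.3, 0.4])))   # -> 0
print(knn_predict(X_train, y_train, np.array([5.2, 5.1])))   # -> 1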
31. Name and explain two ways to extract more 1. successive application of single Hebb neurons: extract ~v1, project D to the
principal components ~v from D space orthogonal to ~v1 and extract ~v2, and so on
--> single neurons need fewer steps (they only get the 'correct' input for them).
Good for sequential training.
2. Method by Sanger: chain of laterally coupled neurons, each trained by
Hebb's rule. First neuron gets unfiltered input, second gets input minus
direction of the first weight vector and so on. --> each step requires training
of the complete chain, while first neurons are not properly trained, later
neurons receive input where PC's with
larger eigenvalues are not filtered out correctly. Good for parallel training.
32. Name different linkage criterions - single linkage clustering: uses minimum cluster distance. Prefers chaining
- complete linkage clustering: uses the maximum cluster distance. Prefers
compact clusters
- average linkage clustering (UPGMA): uses mean cluster distance.
- centroid clustering: uses centroid distance. Clusters are repr. by their
centroids / real-valued attributes needed for centroid computation / when
joining clusters, the resulting centroid is dominated by the cluster with more
members
33. Name some differences between a SVM and a MLP. When SVM:
would you use which? - linear separatrix (unless using kernel trick)
- binary classifier (extension to cascade possible)
- no online training
- can deal with outliers: slack variables
MLP:
- non-linear separatrix
- multiple classes possible
- online training possible
- outliers affect the whole network
34. PCA: How can we find a suitable number m of eigenvectors? Look for abrupt jumps in the spectrum of eigenvalues indicating a
suitable cut off.
PCA does not generate structure - it only makes existing structure
accessible.
35. Provide a pseudocode algorithm for agglomerative Initialization: assign each of the n data elements to its own cluster Ci, i = 1...n
hierarchical complete linkage clustering while n > 2 do
find the pair of clusters Ci, Cj, i < j that optimizes the linkage
criterion
merge: Ci ← Ci ∪ Cj
if j < n then Cj ← Cn
end if
n--
end while
Optimizing the linkage criterion requires a distance measure. This is
where the algorithm can be modified.
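A runnable sketch of the same procedure (Python; stopping at a fixed number of remaining clusters is an assumption added for the example):

import numpy as np

def complete_linkage(X, n_clusters=2):
    clusters = [[i] for i in range(len(X))]                 # every point starts as its own cluster
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)    # pairwise point distances
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = D[np.ix_(clusters[i], clusters[j])].max()   # complete linkage: maximum distance
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters[i] += clusters[j]                           # merge Cj into Ci
        del clusters[j]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2]])
print(complete_linkage(X))                                   # -> [[0, 1, 2], [3, 4]]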
36. Provide a pseudocode for basic maximization algorithm (note: ~x = vector x)
Initialization: partition data somehow into clusters C1...Cn
while stop criterion not reached do
Choose an example ~x at random, denote its cluster as C(~x)
Randomly select a target cluster Ct
compute the change of the goodness function: Delta E = E('~x element Ct') - E('~x element C(~x)')
if Delta E > 0 then
Put ~x from C(~x) to Ct
else
Put ~x from C(~x) to Ct with probability e^(beta Delta E)
end if
increase beta
end while
37. Provide pseudocode for the ID3 learning algorithm while not perfect classification do
A ← decision attribute with highest Gain
node ← A as decision attribute
for all values of A → new descendant of node
sort training examples according to leaf nodes
iterate over new nodes
end while
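A compact runnable sketch of the same loop written as a recursion (plain Python; the entropy and gain helpers follow the standard definitions, the toy data reuse the table from card 1):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    rem = 0.0
    for v in set(x[attr] for x, _ in examples):
        sub = [lab for x, lab in examples if x[attr] == v]
        rem += len(sub) / len(examples) * entropy(sub)
    return entropy([lab for _, lab in examples]) - rem

def id3(examples, attrs):
    labels = [lab for _, lab in examples]
    if len(set(labels)) == 1 or not attrs:                   # perfect classification or nothing left to test
        return Counter(labels).most_common(1)[0][0]
    a = max(attrs, key=lambda attr: gain(examples, attr))    # attribute with highest gain
    tree = {}
    for v in set(x[a] for x, _ in examples):                 # one descendant per value of a
        sub = [(x, lab) for x, lab in examples if x[a] == v]
        tree[(a, v)] = id3(sub, [b for b in attrs if b != a])
    return tree

data = [({"a1": "A", "a2": "D"}, "+"), ({"a1": "A", "a2": "E"}, "-"),
        ({"a1": "B", "a2": "D"}, "+"), ({"a1": "C", "a2": "D"}, "+"),
        ({"a1": "B", "a2": "E"}, "+")]
print(id3(data, ["a1", "a2"]))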
38. Provide the most specific hypothesis which is learned S0 = {<∅,∅,∅>}
according to the Find-S algorithm when using the examples S1 = {<s, o, p>} (after example 1)
below. Provide the individual learning steps, i.e., provide S2 = {<s, ?, ?>} (after example 2: a2 and a3 both differ)
the most specific hypothesis after each learning step. S3 = S4 = {<s, ?, ?>} (examples 3 and 4 are negative and are ignored)

|Example|a1|a2|a3|Classification|
|1|s|o|p|true|
|2|s|r|t|true|
|3|t|r|p|false|
|4|t|o|p|false|
39. Sorting task: Neural Architectures In the following we will look at N x N connectivity matrices, since they
comprise the complete wiring information
- fully connected
--> highly complex dynamics (chaotic)
- sparsely connected
--> some statements about the dynamics are possible, interesting case
- diagonal matrix
--> boring, each neuron only connected to itself
- tridiagonal matrix
--> chain of N neurons, each neuron has forward & backward coupling
to its neighbours (and itself)
- lower triangular matrix
--> feed-forward network, no recurrence, type of MLP, diagonal = 0 so
no self-coupling
- block diagonal
--> separated local networks
- block diagonal + sparse connections
--> small-world networks
- symmetric matrix
--> Hopfield networks, dynamics converge to point attractors
40. training MLP + Use the same error function as before & minimize by gradient descent.
Backpropagation pseudocode Problem: all derivatives on the path from wik(m,n) to the output layer are
required. The backpropagation algorithm provides a scheme for
efficient computation of all required derivatives and weight adaptations.

initialize ~w randomly
while stop criterion is not met do
propagate ~x forward through the net to obtain ~y
compute the error ti - yi(~x) between actual output and desired output
for each output dimension i
propagate the errors backwards through the net to find the 'responsible' weights
update weights
end while

The weight of neuron j is adapted in proportion to the activation of the connected neuron in the
previous layer and to the weighted errors it causes at the output.
This complex scheme is necessary since target values are only
available for the outputs, not the hidden layer neurons
(so a 'target value' has to be constructed). The BP algorithm performs a
stochastic approximation of gradient descent for
the error function E.
41. Tree 2 only has a 96% Tree 1 is probably overfitted to this specific dataset, i.e. it has not only captured the structure but also the
accuracy on the training noise in the data. It probably won't generalize as well as the second tree.
set. Why might this tree Another advantage of tree 2 is that it is faster at classifying new data since fewer computations have to be
still be preferable over made. This difference is hardly noticeable however.
tree 1?
42. What is the idea of the Hamming distance (# of positions where two binary strings differ) is equal to the Manhattan
Manhattan distance? distance for binary vectors.
(aka city block distance)
43. What are aims of - find local dimensionalities of the data manifold
dimension reduction? - find new coordinate system to represent the data manifold and project the data to it
- get rid of empty parts in the space
- new parameters may be more meaningful
44. What are common - Eigenvalue magnitudes: Find the cut-off depth. This is useful for classification problems, especially for
choices on which to problems to be solved by computers.
decide the number of - Visualization: Choose the number of dimensions which is useful to visualize the data in a meaningful way.
dimensions to project This choice depends a lot on your problem definition. For printing 2D is usually a good choice - but
the data? maybe your data is just very nice for 1D already. Or maybe you are using a glyph plot (see sheet 06) which
can represent high dimensional data.
- Classification results: In the Eigenfaces assignment below we figured out that the number of principal
components (and thus the number of dimensions) can have a crucial impact on classification rates. It is
thus an option to fine tune the number of dimensions for a good classification result on some training set.
45. What are Glyphs? Map each dimension onto a parameter of a geometrical figure. Glyphs are normally perceived as a whole
→ extracting
information from glyphs requires training!
Examples: star glyphs, parallel coordinates, Chernoff faces.
46. What are possible minimum distance: Dmin(X, Y )
distance measures of maximum distance: Dmax(X, Y )
clusters? mean of all distances: Dmean(X, Y )
distance of centers: Dcenter(X, Y)

D = pdist2(X,Y) returns a matrix D containing the Euclidean distances between each pair of observations in
the mx-by-n data matrix X and my-by-n data matrix Y. Rows of X and Y correspond to observations,
columns correspond to variables. D is an mx-by-my matrix, with the (i,j) entry equal to distance between
observation i in X and observation j in Y. The (i,j) entry will be NaN if observation i in X or observation j in
Y contain NaNs.
47. What are principal principal curves: can handle nonlinear distributions via nonlinear basis functions. Data is projected on the
curves? principal
curve. However, an appropriate dimensionality and
flexibility parameter have to be found.
48. What are problems with Minimization here is still computationally expensive, suffers from local optima, and is difficult to
minimization in MLPs? terminate (good fit but
What are ways to avoid local no overfitting).
minima? Ways to avoid local minima:
- repeat training with different initial weights to find different basins of attraction
- annealing: add noise, e.g. every n learning steps: wji(k + 1, k) = wji(k + 1, k) + T Gji(k + 1, k)
G is a Gaussian random variable with mu = 0, T is the temperature.
Annealing improves minimization but requires longer learning.
- Step size adaptation: increase epsilon in flat regions and decrease it in steep terrain
- momentum: Delta wji(k+1,k)(t) = epsilon deltaj(k+1) oi(k) + alpha Delta wji(k+1,k)(t-1)
t is a step counter, so the direction of step t-1 is kept to some degree. This avoids abrupt changes of
direction and
thereby counteracts stopping at minor minima and oscillations.

Weight decay: overly large weights are problematic, they lead to binary decisions of single
neurons. Therefore an
additional quadratic regularization term can be added to the error function to avoid large
weights. This leads to a
decay part in the learning rule.
49. What are properties of EM- EM yields only local optima
algorithm? computationally much more expensive than K-means
precautions against collapse of Gaussian to a single point necessary
K-means can provide useful initialization for ~muk, local PCA for Ck
There are K! equivalent solutions → parameter identification might be difficult
50. What are properties of high random pairs of vectors are likely to have similar distances and be close to
dimensional spaces? perpendicular. Also the volume of the surface layer dramatically increases compared to the
volume of the inner part (overall probability that one dimension has an extreme value increases).
51. What are properties of k-means? - number of clusters is the only parameter (implicitly defines scale and shape). - Greedy
optimization → local optima, depending on initial conditions.
- Fast algorithm.
52. What are properties of local In MLP-like ANNs single weight adaptations might influence the performance of the whole net (all
methods in instance based output channels). They
learning are a very delicate and non-robust structure. 'Death', i.e. removal of single neurons, might have
drastic consequences
for the network.
Local method networks are far more robust, since computation of output only (or mainly) relies
on local units and
can be nicely compensated for by other units. Also adaptation has only local effects.
53. What are properties of the basic - may be caught in local optima
maximization algorithm? - depends on the initial partitioning
- to escape local optima, 'downhill' steps are accepted with probability e^(beta Delta E)
- simulated annealing: an initially small beta allows frequent downhill steps, increasing it makes them
more unlikely
- for the following optimization criteria, only W needs to be computed, due to A = B + W
54. What are pro's and con's of the pro: simple and frequently used measure
Euclidean distance? con: no individual weighting of the components
55. What are pro's and con's of the pro: weights dimensions according to their standard deviation
Pearson distance? con: correlated vector components are overweighted
56. What are radial Radial basis functions are all functions that fulfill the following criterion:
basis functions?
The value of the function for a certain point depends only on the distance of that point to the origin or some
other fixed center point. In mathematical formulation that spells out to: ϕ(x)=ϕ(∥x∥) or ϕ(x,c)=ϕ(∥x−c∥). Notice that
it is not necessary (but most common) to use the norm as the measure of distance.
57. What are the Curse of dimensionality describes the phenomenon that in high dimensional vector spaces, two randomly drawn
curse of vectors will almost always be close to orthogonal to each other. This is a real problem in data mining problems,
dimensionality where for a higher number of features, the number of possible combinations and therefore the volume of the
and its resulting feature space exponentionally increases.
implication for
pattern In such a high dimensional space, data vectors from real data sets lie far away from each other (which means
classification? dense sampling becomes impossible, as there aren't enough samples close to each other). This also leads to the
Explain how this problem that pairs of data vectors have a high probability of having similar distances and to be close to
phenomenon orthogonal to each other. The result is that clustering becomes really difficult, as the vectors more or less stand
could be used to on their own and distance measures cannot be applied easily.
one's advantage
This is actually an advantage if you want to discriminate between a high number of individuals (see Bertillonage,
where using only 11 features results in a feature space big enough to discriminate humans), but if you want to get
usable information out of data, such a 'singling out' of samples is a great disadvantage.
58. What are the The eigenvectors are called principal components, which can be interpreted as features, receptive fields or filter
features of PCA? kernels. Expansion in terms of the m < d eigenvectors with the largest eigenvalues is the optimal linear method to
minimize the mean square reconstruction error.
59. What are the - activity communicated via axon is represented as real number x (aka spiking freq)
properties of - for a neuron with d inputs --> ~x element of R^d is the input vector
formal neurons? - each input xi is weighted by a weight wi
- activation is the sum of the weighted inputs
- output y is then given by feeding the activation into an activation function: y = y(s)
60. What are the - any distance measure can be used
properties of - we only need the distance matrix (not the data)
hierarchical - no parameters
clustering? - efficiency: agglomerative O(n^3) (optimized SLINK O(n^2)) / divisive: O(2^n) (optimized CLINK O(n^2))
- resulting dendrogram offers alternative solutions (but has to be analyzed)
- cut off at different levels of dendrogram may be necessary to get comparable clusters
- outliers are fully incorporated
61. What are the 1. plain nearest neighbour --> hard boundaries (assigns new output to Voronoi tessellation cell)
properties of k-nearest --> allows continuous transitions (new, unseen outputs)
instance based 2. a suitable k depends on the local intrinsic data dimension
learning? 3. Training: very fast / requires memory (no compression) / no info wasted / no parameters or complex procedures
4. Application: may be slow (when many examples are stored) / sensitive to errors & noise
62. What are the 1.Initialize your MLP.
steps of the Use as many input neurons as there are dimensions in the data. Input neurons always expect 1D input. Then create
algorithm of neurons for each hidden and the output layer. Each neuron in the hidden and output layers expects as many inputs
implementing as there are neurons in the layer before them.
an MLP?
2.Initialize the neurons' weights.
For each neuron in layers 1...LH+1 initialize the weights to small random values

3.Implement the activation (feed-forward) step.


A.Decompose the input into its components and pass them to the correct input neuron.
B. Each input neuron passes its unprocessed input to the next layer. That means each neuron in layer 1 receives all
outputs from each input neuron as its own input.
oi(0)=xi
C.Calculate the weighted sums of their inputs and apply their activation function σ for each neuron in the layers
1...LH+1. This is best done iteratively layer by layer, as each layer's input is the output of its preceding layer (Note:
wj0(k,k) denotes the bias for neuron j in layer k):
D.The resulting oi(LH+1) are the outputs yi for each output neuron i.

4.Implement the adaption (backpropagation) step. A.Compute the error between the target and output
components to calculate the error signals δi(LH+1) :
δi(LH+1)=oi(LH+1) (1−oi(LH+1)) (ti−oi(LH+1))
B. Calculate the error signals δi(k) for each hidden layer k, starting with k=LH and going down to k=1.
C. Adapt the weights for each neuron in the hidden and output layers.
Δwji(k+1,k)=ϵδj(k+1)oi(k)
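A compact sketch of steps 1-4 for one hidden layer (NumPy; the layer sizes, learning rate, and the XOR toy data are assumptions made for the example, and a poor random initialization may need a re-run):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)              # XOR

W1 = rng.normal(0, 0.5, size=(4, 3))                         # hidden layer, last column = bias weight
W2 = rng.normal(0, 0.5, size=(1, 5))                         # output layer, last column = bias weight
eps = 0.5

for _ in range(20000):
    for x, t in zip(X, T):
        # step 3: feed-forward, keeping the intermediate outputs
        o0 = np.append(x, 1.0)
        o1 = np.append(sigmoid(W1 @ o0), 1.0)
        o2 = sigmoid(W2 @ o1)
        # step 4: backpropagate the error signals and adapt the weights
        d2 = o2 * (1 - o2) * (t - o2)
        d1 = o1[:-1] * (1 - o1[:-1]) * (W2[:, :-1].T @ d2)
        W2 += eps * np.outer(d2, o1)                         # Delta w = eps * delta * o
        W1 += eps * np.outer(d1, o0)

for x in X:
    y = sigmoid(W2 @ np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0))
    print(x, y.item())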
63. What are the boundaries of the version space:
two boundaries - general boundary G of VS(H;D) is the set of maximally general members
of the version - specific boundary S of VS(H;D) is the set of its maximally specific members
space? every member of the VS lies between (including) these boundaries
64. What are two - agglomerative clustering: start with each data point as a cluster, then merge recursively bottom up
complementary - divisive clustering: start with all data points as a single cluster, split clusters recursively top down
methods in The result is a dendrogram representing all data in a hierarchy of clusters.
hierarchical
clustering?
65. What can cause - measurement and technical errors (a cut-off e.g. may lead to a high concentration of values at the boundaries)
outliers? - unexpected true effect (can be modeled by two or more overlaid distributions:
P(x) = (1 - p) Pa(x) + p Pb(x), p << 1)
- data with inherent high variability (can be modeled by a distribution with broad flanks).
There are different types of outliers which can have different causes. They could arise through measurement or
technical errors when collecting data. This may be connected to having a sharp cut-off in regard to the range of
measurements, which could lead to a high concentration of values at the artificial boundaries of an experiment.
However they may also show us a true underlying effect in our data that we didn't expect or account for. This
might be the case when we are treating the measurements as a single distribution, when in reality there are actually
two underlying distributions. Lastly, our distribution might actually naturally have a high variance, which makes
outliers or extreme values a natural part of the distribution.
66. What do both RBFN:
models (RBFN - non-linear layered feedforward network
and MLP) have - hidden neurons use radial basis functions, output neurons use linear function
in common? - universal approximator
Where do they - learning usually affects only one or some RBF
differ?
MLP
- non-linear layered feedforward network
- input, hidden and output-layer all use the same activation function
- universal approximator
- learning affects many weights throughout the network
67. What does it Sepal length and sepal width are not relevant for the classification. This might be either because they are
mean if some redundant or because they are independent of the class.
features( x1 and
x2) do not
appear in the
decision trees?
68. What do we do Usually outliers are detected & removed. But to do this, we first need to define what is regular. Most often we use
with them (in normal distribution here, for multivariate data, clustering algorithms with normal distribution assumed for each
general)? cluster can be used.
Z-value and
Rosner Test First, we need to detect probable outliers. In order to decide which data points we want to declare as an outlier
we have to find a model for regular, meaning "not outlying", data points. What we do most of the time is to assume
a normal distribution underlying the data (or a multivariate distribution where each cluster is normally distributed).

One option is to calculate the z-value for each data point (a measure of the distance from the mean in terms of
the standard deviation) -- data points with a high z-value would be regarded outliers, a common threshold would
be a z value bigger than 3. This can be improved by using the median instead of the mean and tweaking the
threshold. The Rosner test takes it one step further by iteratively calculating z-values and removing found outliers
until none can be found anymore. This can be done one outlier at a time or k outliers at a time for more efficiency.

A different approach would be to not remove the outliers completely, but to weight them according to the z-
values. And lastly an alternative for complete removal would be to fill up the emerging gaps with values that fit
the distribution better.
69. What is an The inductive bias of a machine learning algorithm is the set of assumptions that must be added to the observed
inductive bias? data to get a logical deduction from them.
That means that it is some preference of the algorithm for a specific set of hypotheses based on a set of training
observations.
70. What is a A perceptron represents a d-1 dimensional decision surface, a hyperplane orthogonal to ~w.
perceptron able Necessary condition: only solvable with a perceptron if the data is linearly separable. This is the case for many basic
to do? What is it logical operations, except for XOR (XOR can be solved by distortion of the feature space or by adding further input
not able to do? channels like x3 = x1 * x2)
71. What is a Matrix of 2D plots of all possible combinations of available dimensions. If displaying all axes is infeasible, PCA may
scatterplot yield suitable directions.
matrix?
72. What is called Habituation
"the anti Hebb ~w has a parallel and an orthogonal component to ~x. For continous training with the same ~x the parallel
rule"? Explain! component
becomes 0, while the orthogonal component does not change.
The anti-Hebb rule leads to habituation: ~w becomes orthogonal to the repeated stimulus --> ~w filters out new
stimuli.
For several habituated stimuli, only the ~w component orthogonal to the space spanned by those stimuli can pass
the
filter.
73. What is meant Idea: given we have a data set consisting of d-dimensional vectors that shall be transmitted over some channel,
by where the
"Compressing bits per data point depend on the dimension d. We can now cluster our data, transmit the cluster centers once and
by Clustering"? then
only transmit for each data point the cluster that represents it best. A small number of clusters has high
compression
but bad quality, a high number of clusters vice versa.
74. What is remove nodes to achieve better generalization on test data. Pruning node n: remove subtree of n and make n a
reduced error leaf node, assigning most common classification of its affiliated training examples. Check all nodes for pruning,
pruning? actually remove the one that results in highest performance increase (greedy). Do while performance on test data
increases. properties: produces smallest version of most accurate tree / removes nodes produced by noise (noise
per definition not present in test data). If few data, use post pruning instead.
75. What is rule build decision tree (allow overfitting), convert the tree to rules (one rule for each path from root to leaf), prune each
post pruning? rule (remove any preconditions whose removal improves accuracy on the validation data), sort the final rules by accuracy on the validation set and
apply them in this order for new classifications. Why prune rules? Only paths are pruned, not entire subtrees →
more sensitive pruning / no hierarchy while pruning, even pruning of the root node possible / better readability
76. What is the bias All clustering algorithms have a bias:
in cluster - the bias prefers a certain cluster model that comprises the scale & shape of the clusters
algorithms? - optimally the bias can be chosen explicitly, but usually it is partly built into the algorithm
- the adjustable parameters are usually processing parameters, not model parameters
- the connection between the parameters and the cluster model has to be inferred from the way the algorithm is
working
- hierarchical clustering solves the problem for the scale parameter by having all different scale solutions present
in an ordered way
77. What is the Candidate Elimination is a learning algorithm that, in each step, tries to generate a description which is consistent
candidate- with all previously observed examples in a training set. That description could hypothetically then be used to
Elimination classify examples outside the training set.
algorithm?
idea: compute the whole VS. Like list-then-eliminate, start with the complete VS but do not enumerate the hypotheses explicitly. Instead
represent VS by its boundaries. Start with most general G0 <?; ?; ?; ? > and most specific S0 <∅,∅,∅,∅ >. Those
delimit the whole VS. Now for each training example specialize G and generalize S until they overlap.
78. What is the combinatorial explosion: n^d (volume of the resulting space) combinations for d variables with
curse of n values each, which makes n^d points necessary to sample the whole space --> sampling for large d becomes
dimensionality? impossible. Any real data set will not be able to fill a high dimensional space.
79. What is the PCA:
difference mean square error function for a data set given by a probability density P(~x):
between PCA E = 1/2 Integral (~x - Sum_i ai(~x) ~pi)^2 P(~x) d~x
and principal Approximate the manifold by m vectors ~pi, find coefficients ai(~x) for each ~x.
curves?
Principal Curves:
Generalize the approximation to a parameterized surface ~X(a1,..., am | ~w) of m dimensions. The vector ~w element of
R^n --> parameters that determine the shape of the surface, minimizing
E = 1/2 Integral (~x - ~X(a1(~x),..., am(~x) | ~w))^2 P(~x) d~x
The m parameters ai(~x) determine the point on the surface that best matches ~x and have to be computed
for each ~x individually (in a 2D example dataset, there is only one parameter, since m = 1 and ~X is a curve).
The number n of parameters in ~w is responsible for the ability of ~X to fit a manifold (small n: underfit, large n:
overfit). For a good fit, additional smoothness constraints should be used. ~w can be fitted using e.g. gradient
descent (normally stochastic approximation is used (downhill step over one single sample), since otherwise each
step would require integration over all data).
80. What is the Single-linkage tends to chain clusters along the data. That is why it combines the points in the center area with
difference those in the bottom right corner.
between
single- and Complete-linkage prefers compact clusters and thus combines each of the point heavy areas individually without
complete- merging them.
linkage
clustering?
81. What is the The connection between two neurons, represented by the weight of the postsynaptic neuron, will increase during
effect of Hebb's training until an equilibrium with the decay term is reached. Then, the weight is proportional to the correlation of
rule on a pair the
of neurons? activity of the neurons.
82. What is the effect of Hebb's rule on ~w converges to the eigenvector of C with the largest eigenvalue.
the weight vector? Hebbian learning finds the largest principal component, similar to PCA.
Compared to a winner-take-all rule, the Hebb-trained neurons are affected by all stimuli:
close input has less effect
than far input. This is why ~w is adjusted according to the data variance.
83. What is the general architecture of an - input layer (Layer 0): one neuron for each dimension of the input vector (input neurons only
MLP? represent input)
- at least one hidden layer (L1...LH), hidden layer can have different numbers of neurons
- output layer(LH+1): one node per dimension of the output
- feed forward architecture: only connections from layer k to i with k < i
- more notation: neurons in layer i: 1(i)...N(i) / output of neuron i in layer k: oi(k) / weight from
neuron k in
layer n to neuron i in layer m: wik(m, n)

Sigmoid activation functions σ are used (a succession of linear transforms is itself a single
linear transform, so nothing to gain there). σ enforces a decision, a soft step is required
(backpropagation requires differentiability). Squashing maps
incoming info to a smaller range.
84. What is the idea and what are pro's idea: scale distances using the covariance matrix C
and con's of the Mahalanobis Pro: scale & translation invariant / if C is the unit matrix → Euclidean distance
distance? Con: scaling might destroy structure within data
The points of equal Mahalanobis distance to a center form an ellipsoid
85. What is the idea behind a projection idea: project onto 2-3 selected directions like PCA, but choose directions that exhibit
pursuit? interesting structure.
What is interesting? what is interesting? variance / non-Gaussian distribution (structure) / clusters
What is the procedure? procedure:
Name four criteria to measure the 1. select 1 to 3 directions (by PCA or simply original dimensions)
deviation of a distribution from the 2. project onto these directions & get density P(~x) of projected data
standardized normal distribution. 3. compute index how 'interesting' (according to some criterion) the data is
What are Problems? 4. maximize index by search for better directions
criteria:
- Friedman-Tukey index
minimized if P is a parabolic function similar to the normal distribution
- Hermite / Hall index
minimized when P is a standardized normal distribution
- Natural Hermite Index
like Hermite Index only higher weighting of the center
- Entropy Index
minimized by standardized normal distribution
Problems:
maximization requires estimation of density P(~x) of the projected data. Methods are Kernel
density estimation (Parzen windows) / orthonormal function expansion
86. What is the idea behind Conceptual clustering: employs the ideas of clustering and decision tree learning for unsupervised classification.
conceptual clustering? Best-known algorithm: COBWEB. Motivated by the drawbacks of ID3 and inspired by
cognitive classification: category formation is strongly connected to forming prototypical
concepts (basic level category and generalizations and specializations of it).
Ideas of COBWEB:
- unsupervised learning
- incremental learning
- probabilistic representation: gradual assignment of objects to categories
- no a priori fixed number of categories
- realized by global utility function which determines: # of categories / # of hierarchical
levels / assignment of objects to categories
87. What is the idea behind idea: finding maximally specific hypotheses
Find-S algorithm? Problems:
What are Problems and - learns nothing from negative examples,
what are advantages? - cannot tell whether it learned a concept,
- cannot tell whether training data is inconsistent
Good: picks a maximally specific h (but depending on H there might be several solutions).
88. What is the idea behind idea: simply store the training examples D = {(~xn, ~tn)}, ~xn element R^d.
instance based learning?
89. What is the idea behind Construct a better local approximation of ~y(~x) by computing a fit function in the region around the
Locally weighted samples.
Regression? Two choices: which fit function (linear, quadratic, ...) & which error function (what will be minimized?
local!!!)
90. What is the idea behind First clustering, then PCA in each cluster; or better, iteratively improve the position of the cluster centers and
local PCA? the local PCs.
For continuous, non-clustered data:
--> no continuous description of the manifold --> different clusterings are possible, leading to entirely
different local projections
91. What is the idea behind idea: find a lower dimensional manifold such that the projection preserves the structure of the data as well
multidimensional scaling? as possible.
We must define structure. Important: distances between data points.
Given R^D and lower dimensional space R^d find a mapping : R^D--> R^d such that the distances
between data points in R^D are well approximated by the distances of the projections of those
data points in R^d.
92. What is the idea behind idea: solve nonlinear separation tasks with nonlinear separatrices & classes covering disjoint parts by
Multilayer Perceptrons? combining several perceptrons, and also generalize to several outputs
93. What is the idea behind So far, all dist. measures relied on the topology of R^n. But there are data with other topologies
nominal scales? (angular attributes...).
What is a possible Solution: embed different topologies into R^n
problem and solution?
nominal scales: map nominal attributes to real values. (stone;wood; metal) → ((1, 0, 0), (0, 1, 0), (0, 0, 1))
problem: for a large number n of attribute values, the dimensionality becomes too high; a solution would be to
choose normalized random vectors instead.
94. What is the idea behind So far: clusters as sets of data points belonging to their respective centers.
soft clustering? → disjoint clusters / hard clustering (each point assigned to ONE cluster). Drawback → no way to
express uncertainty
about an assignment.
Soft clustering idea: describe data by a probability distribution P(~x). A data point is assigned to a
cluster by probabilities (expressing uncertainty). Clusters have no boundaries and are represented by
Gaussians.
95. What is the idea of The idea of a decision tree is to classify data by a sequence of interpretable steps. Each node tests an
decision trees? attribute and each
branch stands for one of the values of this attribute. Each end node then provides a classification.
96. What is the idea of ID3 idea: find the best attribute by the distribution over the examples. Put the best one at the root and the
learning algorithm? possible values as branches. Search for the next best attribute for each of the branches...
97. What is the idea of k- idea: Divide dataset D into clusters C1...CK which are represented by their K centers of gravity (aka
means clustering? means) ~w1... ~wK and then minimize the quadratic error
Iterative K-means:
Start with randomly chosen reference vectors, assign each data point to its best-matching reference vector, update the
reference vectors by
shifting them to the mean of their cluster, stop if the cluster centers haven't moved (by more than some small epsilon).
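A short sketch of this iteration (NumPy; K and the stopping tolerance are example choices):

import numpy as np

def kmeans(X, K, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    w = X[rng.choice(len(X), K, replace=False)]              # random initial reference vectors
    while True:
        assign = np.argmin(((X[:, None] - w) ** 2).sum(-1), axis=1)   # best-match assignment
        new_w = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else w[k]
                          for k in range(K)])                # shift to the cluster means
        if np.linalg.norm(new_w - w) < tol:                  # stop once the centers no longer move
            return new_w, assign
        w = new_w

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])
print(kmeans(X, K=2)[0])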
98. What is the idea of PCA? - find the subspace that captures most of the data variance.
- Unsupervised.
- Given: data set D = {~x1, ~x2, ...}, ~xi element R^d
with zero mean: <~x> = 1/N Sum_i ~xi = 0
--> PCA finds m < d orthonormal vectors ~p1...~pm such that the ~pi are the directions of the largest
variance of D.
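A sketch of PCA via the covariance matrix (NumPy; the toy data are made up):

import numpy as np

def pca(X, m):
    Xc = X - X.mean(axis=0)                   # enforce zero mean
    C = Xc.T @ Xc / len(Xc)                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]     # keep the m largest
    return eigvecs[:, order].T, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.0, 0.3]])
P, lam = pca(X, m=1)
print(P, lam)                                 # first principal direction and its variance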
99. What is the idea of the idea: minimum number of moves a king needs between two positions on a chessboard
Chebyshev distance?
(aka Maximum/chessboard
distance)
100. What is the idea, problems and idea: start with all hypotheses, then for each example eliminate the inconsistent hypotheses
advantages of the List-then- for a large VS one needs many examples or a few quite informative ones; if the VS is ∅ there are
eliminate algorithm? inconsistencies
Good: computes the complete VS (ideally only one hypothesis remains)
Problems: can only be applied to a finite H, requires enumerating all hypotheses
→ impractical for real problems
101. What is the inductive bias in H is the power set of instances of X → unbiased search? NO, short trees are preferred (corresponds
ID3? to Occam's razor), high info gain attributes near the root are preferred. The bias is a preference for some
hypotheses rather than a restriction of hypothesis space.
102. What is the inductive bias of a ...
learner? Provide an example
using the „weather" problem,
which was discussed in the
lecture
103. What is the inductive bias of the assume that most of the cases in a small neighborhood in feature space belong to the same class.
nearest neighbor classifier? Given a case for which the class is unknown, guess that it belongs to the same class as the
majority in its immediate neighborhood. This is the bias used in the k-nearest neighbors algorithm.
The assumption is that cases that are near each other tend to belong to the same class.
104. What is the motivation behind Clusters are basic structures. Additionally, clusters in some feature spaces may indicate closeness
clustering? of the data on a semantic level. Clustered data implies rules. Compression can be achieved by
transmitting only cluster centers. However, it is not always trivial to define clusters, this depends
on the scale and the shape one wants to achieve.
105. What is the p-norm? General norm: Dp(~x, ~y) = (Sum_i |xi - yi|^p)^(1/p)
E.g. p = 2 --> Euclidean distance
p = 1 --> Manhattan distance
p --> inf: maximum (Chebyshev) distance
106. What is the problem domain of learning a discrete target function for instances describable as attribute-value pairs.
decision trees? Disjunctive hypothesis required.
Possible for noisy data.
Decision trees can be written as a set of rules (e.g. as disjunction of conjunction of attribute
values: each path is one conjunction, combine all paths with disjunction).
107. What is the problem with if an attribute has many values, Gain tends to select it (even if nonsense, because it enables
attributes with many values in perfect classification). However, this prevents good generalization.
classification?
What is a solution for this One solution: GainRatio instead of Gain:
problem? Si is the subset of S for which A has value vi, and A has n different values in total. SplitInformation is
the entropy with respect to the attribute values.
'normalizing the Gain'. GainRatio favors attribute with fewer values, if two attributes yield same
gain. GainRatio not defined for attributes with same value for all examples (zero denominator),
but they are useless anyway and have to be excluded.
108. What is the problem with Outliers? Outliers can drastically spoil statistics, especially in small data sets. There are some measures
that are robust against outliers even without explicit detection (e.g. median).
109. What is the Rosner test? Describe its The Rosner test is an iterative procedure to remove outliers from a data set via a z-test:
purpose and provide its formal while new outliers are found do
definition? calculate the mean mu (or median m) and the SD sigma
find the data point xi with the largest z-value: i = argmax_t z_t
if xi is an outlier then remove xi
end if
end while
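A sketch of this iterative procedure (NumPy; the median variant and the fixed threshold are example choices):

import numpy as np

def rosner(x, threshold=3.5):
    x = list(x)
    outliers = []
    while True:
        m, sigma = np.median(x), np.std(x)
        z = np.abs(np.array(x) - m) / sigma
        i = int(np.argmax(z))                 # data point with the largest z-value
        if z[i] <= threshold:
            return outliers                   # no new outlier found -> stop
        outliers.append(x.pop(i))             # remove it and repeat

rng = np.random.default_rng(0)
data = np.append(rng.normal(10.0, 0.5, 50), [30.0, -7.0])
print(rosner(data))                           # -> [30.0, -7.0]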
110. What is the structure of a RBFN? RBFN's are networks that contain only one hidden layer. The input is connected to all the
hidden units. Each of the hidden units has a different radial basis function that is sensitive to
ranges in the input domain. The output is then a linear combination of the outputs of those
functions.
111. What is the version space? The version space VS(H;D) with respect to the hypothesis space H and training examples D is
the subset of hypotheses
from H consistent with all training examples in D (i.e. version space = all consistent
hypotheses).
VS(H;D) = {h element H | Consistent(h;D)}
112. What to do with outliers? - removal: simple, but loss of information
- weight according to z-values
- remove outliers and fill up the gaps with values that fit the distribution better
113. What is the idea behind distance idea: nearer neighbours are more important.
weighted k-nearest neighbours? weight the average by the distance to the input, with the inverse distance as weight
cool thing: now the neighbours can be the entire set of examples
114. What would be an alternative to The nodes/weights can be sampled from the subspace spanned by the largest principal
initializing nodes randomly in self- components. With this method learning should become faster since the nodes already have
organizing maps? a good initial fit to the structure of the data. This method only works if the dataset is not
essentially non-linear.
115. When is hypothesis h consistent with A hypothesis h is consistent with training examples D of target concept c iff h(x) = c(x)
training examples D of target (correct classification) for all training examples (x, c(x)) element D:
concept c? Consistent(h, D) <==> for all (x, c(x)) element D : h(x) = c(x)
116. When would you use a RBFN instead RBFNs are more robust to noise and should therefore be used when the data contains false-
of a Multilayer Perceptron? positives.
117. Which methods can be used if we - if node n tests A, assign the most common value of A among the other examples sorted to node n
have missing attributes in decision - assign most common value of A among other examples with same target value
trees? - assign probability pi to each possible value vi of A (prob estimated from value distribution
of examples assigned to n).
- Assign fraction pi of examples with missing values to each descendant in tree.
118. Which of the learning algorithms The inductive bias of the Candidate Elimination algorithm is the assumption that the target
you heard about in the lecture concept is contained in the given hypothesis space.
(Candidate Elimination and Find-S) The inductive bias of the Find-S Algorithm is that the resulting hypothesis labels all new
has the stronger bias? instances as negative instances unless the opposite is entailed by its prior knowledge from
the training set. This has a really big impact as negative examples are ignored completely.
This means Find-S has the stronger bias.
119. Why an inductive bias? The hypothesis space limits what the algorithm can find. E.g. by choosing conjunctive
hypothesis we have biased the learner. However, when representing all concepts (disjunction
of all positive examples), we would learn nothing.
We would only write the data differently. There would be no generalization possible and
convergence would only be achieved when all possible instances would have been
presented.
How is generalization achieved? The 'inductive leap' came about via the independence of the
attribute underlying the
construction of H.
- bias-free learning systems make no a priori assumptions → they can only collect examples
without generalization
- inductive learning makes sense only with prior assumptions (bias)!
- learning is simply: collect examples, interpolate among the examples according to the
inductive bias, actively acquire examples following the suggestions from the version space
- for every applied learning system the inductive bias should be clear!
120. Why are self-organizing maps It could be argued that self-organizing maps work in a way similar to the brain when it
possibly interesting for cognitive handles different sensory input in different parts of the brain. The areas themselves are
scientists in general? topologically structured in a way that similar inputs activate the same area (just like the SOMs
as well)
121. Why did Chernoff use faces for his Humans are exceptionally good at face recognition. For humans it is much easier to notice
representation? Why not whether one eye is bigger than the other or the eyebrows are closer together in a face-like
something else, like dogs or image than, for example, to spot differences in window sizes or changes in roof skewness
houses? between houses.
122. Why do we need data visualization Sometimes it is necessary to visualize high dimensional data and a projection via PCA or
techniques and what are techniques similar methods might not help enough: We might lose information in a 2D projection.
to visualize high dimensional data?
In those cases it is useful to come up with other representations of data which we could
potentially print on a sheet of paper.

Techniques are usually glyphs, but different kinds of projection might already be enough
(taking information loss into account).
