Inductive Learning Method
Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)
Attribute-Based Representations
Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won't wait for a table:
Decision Tree
One possible representation for hypotheses
E.g., here is the "true" tree for deciding whether to wait:

Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:
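The truth-table-row-to-leaf idea can be made concrete with a tiny hand-built tree. This is an illustrative sketch, not code from the slides: the nested-dict encoding and the choice of XOR as the target function are assumptions.

```python
# A decision tree as a nested dict: internal nodes test one attribute,
# leaves hold the answer. This tree encodes XOR, with one truth-table
# row per root-to-leaf path.
xor_tree = {"attr": "A",
            0: {"attr": "B", 0: False, 1: True},
            1: {"attr": "B", 0: True, 1: False}}

def classify(tree, example):
    """Follow the path dictated by the example's attribute values."""
    while isinstance(tree, dict):
        tree = tree[example[tree["attr"]]]
    return tree

# Every row of the truth table reaches its own leaf:
for a in (0, 1):
    for b in (0, 1):
        print(a, b, classify(xor_tree, {"A": a, "B": b}))
```

Any Boolean function can be written this way, which is why decision trees are fully expressive over the input attributes (though the tree may grow exponentially large).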
Information
Information answers questions. Before asking any question, we can only base our answer on the prior probabilities. The less information we have about the answer initially, the more information is contained in the answer we get.

Information content I can be computed based on the prior probabilities P_1 to P_n:

I(P_1, \ldots, P_n) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

The information content is also referred to as entropy.

Since there are 6 positive and 6 negative examples in our training data, we need exactly one bit of information to answer the question whether we should stay or not:

I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{6}{12}\log_2\frac{6}{12} - \frac{6}{12}\log_2\frac{6}{12} = [0.5 \cdot 1] + [0.5 \cdot 1] = 1
Information Theory
The information still required after testing attribute A is called the remainder R:

R(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \cdot I\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)

Suppose that we test the Patrons attribute. The information still required after testing Patrons is the average over the three alternatives None, Some and Full, e.g. for Full:

I\left(\frac{2}{6}, \frac{4}{6}\right) = -\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6} = [0.33 \cdot 1.59] + [0.66 \cdot 0.59] \approx 0.92

Choose the attribute A with the largest gain!
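The entropy and remainder formulas can be checked numerically. This is a minimal sketch; the per-value counts for Patrons (None: 0+/2-, Some: 4+/0-, Full: 2+/4-) are an assumption taken from the standard restaurant example, since the slides' data table is not reproduced here.

```python
from math import log2

def information(*probs):
    """I(P1, ..., Pn) = -sum_i Pi * log2(Pi); 0 * log2(0) is taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

def remainder(splits, p, n):
    """R(A): expected information still needed after testing attribute A.
    `splits` lists (pi, ni) pairs, one per value of the attribute."""
    return sum((pi + ni) / (p + n) * information(pi / (pi + ni), ni / (pi + ni))
               for pi, ni in splits if pi + ni > 0)

# 6 positive and 6 negative examples: exactly one bit is needed.
print(information(6/12, 6/12))            # -> 1.0

# Patrons splits the 12 examples into None (0+, 2-), Some (4+, 0-),
# Full (2+, 4-); the gain is the prior entropy minus the remainder.
r = remainder([(0, 2), (4, 0), (2, 4)], p=6, n=6)
print(round(1 - r, 3))                    # -> 0.541
```

The None and Some branches are pure (remainder contribution 0), so only the Full branch, with I(2/6, 4/6) ≈ 0.92, costs any further information.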
Neural Networks
Two major types:
Neural Networks
In our brains
Artificial Neural Networks
Attempts to imitate our biological neural networks with software and hardware
Can we imitate the brain?

Why Would We Want to Imitate the Brain Architecture?
Superior performance
Fault tolerant!
Brain cells die all the time!
Still works because of local connections
Can learn by examples (inductive learning)
Good generalization
We manage to do something even with previously unseen information or in totally new situations!
Units (nodes)
The computational unit of an Artificial Neural Network
Connected through links with numeric weights
Computes an output as a function of all inputs from surrounding nodes
The neuron is also called node, or unit.

Terminology:
g: activation function
a_i: activation level of node i (output of the node)
What Functions Can a Perceptron Represent?

a = step\left(\sum_{j=0}^{n} W_j a_j\right)

AND: W_0 = -1.5, W_1 = 1, W_2 = 1
a = step(W_0 + W_1 a_1 + W_2 a_2) = step(-1.5 + a_1 + a_2)

OR: W_0 = -0.5, W_1 = 1, W_2 = 1
a = step(W_0 + W_1 a_1 + W_2 a_2) = step(-0.5 + a_1 + a_2)

NOT: W_0 = 0.5, W_1 = -1
a = step(W_0 + W_1 a_1) = step(0.5 - a_1)

Linear Separability
Consider a perceptron with two inputs and a threshold (bias):
The perceptron fires if w_1 a_1 + w_2 a_2 - t >= 0
Recall the weights for the AND perceptron:
AND: a_1 + a_2 - 1.5 >= 0, i.e. a_2 >= -a_1 + 1.5
This is really the equation for a line!
The activation threshold of a perceptron is actually a hyperplane that linearly separates the space of inputs.
2007-10-29 Erik Billing
Umeå University, Computing Science
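The AND, OR, and NOT weight settings above can be verified directly. A minimal sketch; the `perceptron` helper is an illustrative construction, not code from the lecture:

```python
def step(x):
    """Threshold activation: fires (1) when the weighted sum is >= 0."""
    return 1 if x >= 0 else 0

def perceptron(weights):
    """Build a unit computing step(W0 + W1*a1 + ...); W0 is the bias weight."""
    def unit(*inputs):
        w0, *ws = weights
        return step(w0 + sum(w * a for w, a in zip(ws, inputs)))
    return unit

AND = perceptron([-1.5, 1, 1])   # fires only when a1 + a2 >= 1.5
OR  = perceptron([-0.5, 1, 1])   # fires when a1 + a2 >= 0.5
NOT = perceptron([0.5, -1])      # fires when a1 <= 0.5

print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # -> [0, 0, 0, 1]
print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # -> [0, 1, 1, 1]
print([NOT(a) for a in (0, 1)])                     # -> [1, 0]
```

XOR, by contrast, has no such weight setting: no single line separates its positive from its negative examples, which is exactly the linear-separability limit discussed above.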
Multi-Layer Network

Training Multi-Layer Networks
Training multi-layer networks can be a bit complicated (the weight space is large!)
The perceptron rule works fine for a single unit that maps input features to the final output value
But hidden units do not produce the final output
Output unit(s) take other perceptrons' outputs, not known feature values, as inputs
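Hidden units get no target value of their own; back-propagation (BP, discussed later in these slides) solves this by deriving each hidden unit's error from the output error pushed back through the weights. The sketch below illustrates this under assumptions not in the slides: a 2-2-1 sigmoid network, XOR as the training task, and an arbitrary learning rate and random seed.

```python
import math
import random

random.seed(0)
sig = lambda x: 1 / (1 + math.exp(-x))

# A minimal 2-2-1 sigmoid network trained with back-propagation on XOR.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
wh = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden: bias + 2 inputs
wo = [random.uniform(-1, 1) for _ in range(3)]                      # output: bias + 2 hidden
alpha = 0.5

def forward(x):
    h = [sig(w[0] + w[1] * x[0] + w[2] * x[1]) for w in wh]
    o = sig(wo[0] + wo[1] * h[0] + wo[2] * h[1])
    return h, o

def epoch():
    total = 0.0
    for x, y in data:
        h, o = forward(x)
        err_o = (y - o) * o * (1 - o)            # delta at the output unit
        for j in range(2):                       # blame flows backwards
            err_h = err_o * wo[j + 1] * h[j] * (1 - h[j])
            wh[j][0] += alpha * err_h
            wh[j][1] += alpha * err_h * x[0]
            wh[j][2] += alpha * err_h * x[1]
        wo[0] += alpha * err_o
        wo[1] += alpha * err_o * h[0]
        wo[2] += alpha * err_o * h[1]
        total += (y - o) ** 2
    return total

before = epoch()
for _ in range(5000):
    after = epoch()
print(before > after)   # the squared error shrinks as training proceeds
```

The key line is the computation of `err_h`: the hidden unit's share of the blame is the output error weighted by the connection it contributed through.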
Perceptron Training
Conceptually, the perceptron rule does:
Compare the perceptron's output (s) to what it should have been (f, e.g. true), i.e. compute the error
If the error is large, assign blame to the weight/input combinations that most influenced the wrong call, and raise/lower the weights accordingly
If the error is small, don't change them as much
The key parameter is the learning rate α:
If too small, learning is slow and convergence takes forever
If too large, can make changes that are too drastic

Perceptron Learning
A perceptron learns by adjusting its weights in order to minimize the error on the training set.
Consider updating the value for a single weight on a single example with the perceptron learning rule:

Err = y - a_i
x_j = a_j w_j
Δw_j = α · Err · x_j
w_j ← w_j + Δw_j

y: the desired output
a_i: activation of current node i
Err: the output error
w_j: weight between node j and node i
x_j: weight influence from input j
α: the learning rate
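The update loop can be sketched in code. Note an assumption: this sketch uses the common textbook form of the rule, Δw_j = α · Err · a_j, taking the raw input activation as the influence term; the learning rate, epoch count, and choice of AND as the target are also illustrative.

```python
def step(x):
    return 1 if x >= 0 else 0

# Learn AND from examples. Weights start at 0; each wrong output nudges
# the weights that influenced it toward the correct answer.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.0, 0.0, 0.0]     # bias weight W0 plus one weight per input
alpha = 0.1             # the learning rate

for _ in range(100):    # epochs over the training set
    for (a1, a2), y in examples:
        out = step(w[0] + w[1] * a1 + w[2] * a2)
        err = y - out               # Err = y - a_i
        w[0] += alpha * err         # the bias input is fixed at 1
        w[1] += alpha * err * a1
        w[2] += alpha * err * a2

print([step(w[0] + w[1] * a + w[2] * b) for (a, b), _ in examples])  # -> [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop stops making mistakes after finitely many updates.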
Problems with BP
Because BP is a gradient descent (hill-climbing) search, it suffers from the same problems:
Doesn't necessarily find the globally best weight vector
Convergence is determined by the starting point (randomly initialized weights)
If the learning rate is set too large, can bounce right over the global minimum into a local minimum
To deal with these problems:
Usually initialize weights close to 0
Can repeat training with multiple random restarts

Training and Over Training
[Figure: total error vs. number of epochs (×100); training-set error (solid) keeps decreasing while test-set error (dotted) starts rising]
Over training occurs when the model has fit all characteristics and starts fitting noise in the data.
The problem increases with the number of weights (i.e. the model complexity)
Overfitting in Neural Nets
As usual, we can use a tuning set to avoid overfitting in neural nets
We can train several candidate structures and use the tuning set to find one that's appropriately expressive
More common: given a network structure, use early stopping by evaluating the network on the tuning set after each epoch
Stop when performance begins to dip on the tuning set
Sometimes allow a fixed number of epochs beyond the dip just in case it goes back up

Neural Network Characteristics
Universal approximators: very powerful and can approximate any function, but they do have drawbacks
Many weights to learn: training can take a while
Sensitive to structure, initial weights, and learning rate
Well suited for parallel computing since each node only depends on the connected neighbor nodes
Seem to have good generalization abilities; tolerant to noise in the input
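The early-stopping recipe above can be sketched independently of any particular network. The tuning-set error values below are made-up stand-ins for evaluating a real network after each epoch:

```python
# Early stopping: train epoch by epoch, watch the tuning-set error,
# and remember the best epoch seen. `patience` is the "fixed number
# of epochs beyond the dip" allowed in case performance goes back up.
def early_stop(tuning_errors, patience=3):
    best_epoch, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(tuning_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1          # performance dipped; wait in case it recovers
            if waited >= patience:
                break
        # (a real run would snapshot the weights at best_epoch here)
    return best_epoch, best_err

# Tuning error falls, then starts rising at epoch 4 -> stop, keep epoch 3.
errs = [10.0, 7.0, 5.0, 4.2, 4.5, 4.9, 5.6, 6.0]
print(early_stop(errs))   # -> (3, 4.2)
```

The weights from the best epoch, not the last one, are what get deployed.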
Neural Network Characteristics
Can be applied to a number of totally different applications
Black boxes: hard to describe WHY the network produces a certain output or decision
Prior knowledge: hard to incorporate knowledge about the function into the architecture

Summary
The second best solution to pretty much everything!
Perceptrons are mathematical models of neurons (brain cells)
Learn linearly separable functions
Insufficiently expressive for many problems
Neural Networks are machine learning models that have multiple layers of perceptrons
Trained using back-propagation, a gradient descent search through weight space (NN hypothesis space)
Sufficiently expressive for any classification or regression task, also quite robust to noise
Summary
Many applications:
Speech processing, driving, face/handwriting recognition, backgammon, checkers, etc.
Disadvantages:
Overly expressive: prone to overfitting
Difficult to design appropriate structure
Many parameters to estimate: slow training
Hill-climbing can get stuck in local optima
Poor comprehensibility (black-box problem)