
Artificial Intelligence: Learning (Chapter 18)
Erik Billing, Umeå University, Computing Science, 2007-10-29

Learning

- Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
- Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down.
- Learning modifies the agent's decision mechanisms to improve performance.

Learning

The agent's percepts should be used not only for acting, but also for improving future performance.

Tasks to learn for an agent:
- What state will be the result of an action?
- How will the changing world evolve?
- What is the value of each state?
- Which kinds of states have high (low) value?
- Which percepts are relevant?

The learning task: estimation of a function y = f(x), mapping inputs x to outputs y.

Types of Learning

- Supervised learning: given a value x, f(x) is immediately provided by a supervisor. f is learned from a number of examples: (x1, y1), (x2, y2), ..., (xn, yn).
- Reinforcement learning: a correct answer y is not provided for each x. Rather, a general evaluation is provided after a sequence of actions (occasional rewards).
- Unsupervised learning: the agent learns relationships among its percepts, i.e., it performs clustering.

Inductive Learning

- The agent is fed with examples (x1, y1), (x2, y2), ..., (xn, yn).
- Problem: find a hypothesis h such that h ≈ f, given a training set of examples.
- Which hypothesis is best? All learning methods make assumptions about f; this preference is called a bias.
- h should perform well on the examples AND on unseen examples: h should generalise well.

Inductive Learning Method

- Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples).
- E.g., curve fitting:
  [Figure: a sequence of fits to the same data points, from a straight line through successively higher-degree polynomials; the later curves are all consistent with the examples.]
- Ockham's razor: maximize a combination of consistency and simplicity (sketched in code below).
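The trade-off can be made concrete with a few lines of numpy. This is a minimal sketch of my own (the data points are hypothetical, not from the slides): polynomials of increasing degree agree ever more closely with the training examples, but the simplest consistent-enough hypothesis usually generalises best.

import numpy as np

# Hypothetical training examples (x_i, y_i): roughly a line plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.1, 1.9, 3.2, 3.8, 5.1])

for degree in (1, 3, 5):
    coeffs = np.polyfit(x, y, degree)       # least-squares hypothesis h
    residuals = np.polyval(coeffs, x) - y   # errors on the training set
    print(f"degree {degree}: training SSE = {np.sum(residuals**2):.4f}")

# The degree-5 polynomial is consistent (near-zero training error on the six
# examples) but wiggly; the degree-1 fit is simpler and likely generalises
# better -- Ockham's razor in action.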


Attribute-Based Representations

- Examples described by attribute values (Boolean, discrete, continuous, etc.).
- E.g., situations where I will/won't wait for a table:
  [Table: twelve example situations, each described by attributes such as Patrons?, Hungry?, Type and WaitEstimate, with the target attribute WillWait.]

Decision Tree

- One possible representation for hypotheses.
- E.g., here is the "true" tree for deciding whether to wait:
  [Figure: the decision tree for the restaurant example, testing Patrons? at the root.]

Expressiveness

- Decision trees can express any function of the input attributes. E.g., for Boolean functions: truth table row → path to leaf (see the sketch below).
- Trivially, there is a consistent decision tree for any training set with one path to a leaf for each example (unless f is non-deterministic in x), but it probably won't generalize to new examples.
- Prefer to find more compact decision trees.
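As a toy illustration (my example, not from the slides), a decision tree for a Boolean function is just a sequence of attribute tests, and every truth-table row follows exactly one root-to-leaf path:

from itertools import product

def tree(A: bool, B: bool, C: bool) -> bool:
    """Decision tree for f(A, B, C) = (A and B) or C."""
    if C:             # test attribute C at the root
        return True   # leaf: positive
    if A:             # C is false; test A next
        return B      # leaf depends only on B
    return False      # leaf: negative

# Every row of the truth table reaches one leaf with the right answer.
for A, B, C in product([False, True], repeat=3):
    assert tree(A, B, C) == ((A and B) or C)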

Learning Decision Trees

- The tree can often represent a set of examples in a compact way by identifying patterns in the examples.
- This gives a simpler function with, hopefully, better generalisation.
- Decision trees are a major tool for machine learning in real applications.

Choosing an Attribute

- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
  [Figure: the twelve restaurant examples split by Patrons? versus split by Type.]
- Patrons? is a better choice: it gives information about the classification.

Information

- Information answers questions.
- The less information we have about the answer initially, the more information is contained in the answer we get.
- The information content I can be computed from the prior probabilities P_1 to P_n:

  I(P_1, \dots, P_n) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

- The information content is also referred to as entropy.

Information

- Before asking any question, we can only base our answer on the prior probabilities.
- Since there are 6 positive and 6 negative examples in our training data, we need exactly one bit of information to answer the question whether we should stay or not (verified in the sketch below):

  I(p/(p+n), n/(p+n)) = I(6/12, 6/12) = -(6/12) \log_2 (6/12) - (6/12) \log_2 (6/12) = 0.5 \cdot 1 + 0.5 \cdot 1 = 1
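A direct transcription of the formula (a minimal sketch of my own), checked against the slide's example:

from math import log2

def information(*probs: float) -> float:
    """I(P1, ..., Pn) = -sum_i Pi * log2(Pi), taking 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(information(6/12, 6/12))  # -> 1.0 bit, as on the slide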

Information

- Suppose that we test the Patrons attribute.
- The information still required after testing Patrons is the average over the three alternatives None, Some and Full:

  I(0/2, 2/2) = -(0/2) \log_2 (0/2) - (2/2) \log_2 (2/2) = 0 + 0 = 0   (None; taking 0 \log_2 0 = 0)
  I(4/4, 0/4) = -(4/4) \log_2 (4/4) - (0/4) \log_2 (0/4) = 0 + 0 = 0   (Some)
  I(2/6, 4/6) = -(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) \approx 0.33 \cdot 1.59 + 0.67 \cdot 0.59 \approx 0.92   (Full)

Information Theory

- The information still required after testing attribute A is called the remainder R:

  R(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \, I\!\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)

- Now the information gain G from testing attribute A can be computed as:

  G(A) = I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - R(A)

- Choose the attribute A with the largest gain (computed below)!
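Putting the two formulas together for the Patrons attribute (my sketch, reusing the `information` function from the previous snippet; the per-branch counts None (0, 2), Some (4, 0) and Full (2, 4) are taken from the slides):

def remainder(branches, p, n):
    """R(A) = sum_i (pi+ni)/(p+n) * I(pi/(pi+ni), ni/(pi+ni))."""
    return sum((pi + ni) / (p + n)
               * information(pi / (pi + ni), ni / (pi + ni))
               for pi, ni in branches)

def gain(branches, p, n):
    """G(A) = I(p/(p+n), n/(p+n)) - R(A)."""
    return information(p / (p + n), n / (p + n)) - remainder(branches, p, n)

patrons = [(0, 2), (4, 0), (2, 4)]   # (positive, negative) per branch
print(gain(patrons, p=6, n=6))       # ~0.541 bits: worth splitting on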

Performance of the Learning Algorithm

The critical question: how do we know that h ≈ f? How does the hypothesis perform on unseen examples?

Test methodology (sketched in code below):
1) Collect a large set of examples.
2) Divide them randomly into a training set and a test set.
3) Generate a hypothesis h using the training set.
4) Measure the performance on examples from the test set, i.e., the percentage of examples in the test set that are correctly classified.
5) Optionally: repeat 1-4 many times.

Performance Measurement

- Try h on a new test set of examples (use the same distribution over the example space as for the training set).
- Learning curve: % correct on the test set as a function of training set size.
  [Figure: learning curve for the restaurant data.]
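The methodology as code (a sketch; `learn` is a hypothetical stand-in for any training procedure, e.g. decision-tree induction, and examples are (x, y) pairs):

import random

def evaluate(examples, learn, train_fraction=0.7, seed=0):
    random.seed(seed)
    shuffled = examples[:]
    random.shuffle(shuffled)                    # step 2: random division
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    h = learn(train)                            # step 3: generate hypothesis
    correct = sum(h(x) == y for x, y in test)   # step 4: classify test set
    return correct / len(test)                  # fraction correctly classified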

Noise and Overfitting

- Often there are TOO MANY attributes! It is hard to know in advance which attributes are relevant for the classification task; irrelevant attributes act as noise.
- The problem is an example of OVERFITTING:
  - It makes the tree fit the training data.
  - It gives BAD generalisation.
- We need help to determine when to stop adding nodes!
- (All learning techniques have this general problem with overfitting.)

Summary

- Learning is needed for unknown environments.
- The learning method depends on the available feedback, the type of component to be improved, and its representation.
- For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples.
- Decision tree learning using information gain.
- Learning performance = prediction accuracy measured on a test set.

Neural Networks

Two major types:
- Neural networks in our brains.
- Artificial Neural Networks: attempts to imitate our biological neural networks with software and hardware.
Can we imitate the brain?

Why Would We Want to Imitate the Brain Architecture?

- Superior performance.
- Fault tolerant! Brain cells die all the time, yet the brain still works, thanks to its local connections.
- Can learn by examples (inductive learning).
- Good generalization: we manage to do something even with previously unseen information, or in totally new situations!

Computing Elements

Neurons:
- The computational unit of the brain.
- Connected through synapses, which can be excitatory or inhibitory.
- Each computes an output as a function of all inputs from surrounding neurons.

Units (nodes):
- The computational unit of an Artificial Neural Network.
- Connected through links with numeric weights.
- Each computes an output as a function of all inputs from surrounding nodes.

Neuron

[Figure: node i receives activations a_j over input links with weights W_{j,i}, computes a_i = g(\sum_j W_{j,i} a_j), and passes a_i on through its output links.]

Terminology:
- g: the activation function.
- a_i: the activation level of node i (the output of the node).
- The neuron is also called a node, or unit.

Activation Functions

The output from a node:

  a_i = g\!\left(\sum_{j=1}^{n} W_{j,i} \, a_j\right)

Common activation functions g (implemented below):
- Step function: step_t(x) = 1 if x >= t, otherwise 0.
- Sign function: sign(x) = +1 if x >= 0, otherwise -1.
- Sigmoid: sigmoid(x) = 1 / (1 + e^{-x}).
- tanh: tanh(x).

Perceptron

[Figure: a single-layer perceptron network with an input layer of four inputs I_1..I_4 connected by weights W_{j,i} to an output layer of three output nodes O_i.]
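The four activation functions and the node output rule in plain Python (a minimal sketch):

import math

def step(x, t=0.0):
    return 1.0 if x >= t else 0.0

def sign(x):
    return 1.0 if x >= 0 else -1.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh is available directly as math.tanh.

def node_output(weights, activations, g=sigmoid):
    """a_i = g(sum_j W_ji * a_j)."""
    return g(sum(w * a for w, a in zip(weights, activations)))

print(node_output([0.5, -0.5], [1.0, 1.0]))  # sigmoid(0) = 0.5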

What Functions Can a Perceptron Represent?

  a = step\!\left(\sum_{j=0}^{n} W_j a_j\right)   (with fixed bias input a_0 = 1)

- AND: W_0 = -1.5, W_1 = 1, W_2 = 1: a = step(W_0 + W_1 a_1 + W_2 a_2) = step(-1.5 + a_1 + a_2)
- OR:  W_0 = -0.5, W_1 = 1, W_2 = 1: a = step(W_0 + W_1 a_1 + W_2 a_2) = step(-0.5 + a_1 + a_2)
- NOT: W_0 = 0.5, W_1 = -1:          a = step(W_0 + W_1 a_1) = step(0.5 - a_1)

(These perceptrons are run in the sketch below.)

Linear Separability

- Consider a perceptron with two inputs and a threshold (bias): the perceptron fires if w_1 a_1 + w_2 a_2 - t >= 0.
- Recall the weights for the AND perceptron: a_1 + a_2 - 1.5 >= 0.
- This is really the equation for a line: a_2 >= -a_1 + 1.5.
- The activation threshold for a perceptron is thus a hyperplane in the space of inputs: a perceptron can only separate its inputs linearly.
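The three perceptrons above, verified exhaustively (a small sketch; `perceptron` takes the bias weight W_0 first and assumes a fixed bias input a_0 = 1):

def perceptron(weights, inputs):
    """a = step(W0 * 1 + W1 * a1 + ... + Wn * an)."""
    total = weights[0] + sum(w * a for w, a in zip(weights[1:], inputs))
    return 1 if total >= 0 else 0

AND = [-1.5, 1, 1]
OR  = [-0.5, 1, 1]
NOT = [ 0.5, -1]

for a1 in (0, 1):
    for a2 in (0, 1):
        print(a1, a2, perceptron(AND, [a1, a2]), perceptron(OR, [a1, a2]))
print(perceptron(NOT, [0]), perceptron(NOT, [1]))  # -> 1 0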

Linear Separability

[Figure: the four input points (I_1, I_2) in {0, 1}^2 for AND, OR and XOR. For AND and OR a single line separates the positive from the negative examples; for XOR no such line exists, so XOR is not linearly separable and a single perceptron cannot represent it.]

Multi-Layer Networks

The structure of a multi-layer network is fairly straightforward (an XOR network is sketched below):
- The input layer is the set of features (percepts).
- Next is a hidden layer, which has an arbitrary number of perceptrons, called hidden units, that take the features (input layer) as inputs.
- The perceptron(s) in the output layer then take the outputs of the hidden units as their inputs.
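XOR is not linearly separable, but a two-layer network handles it. This construction is mine (not from the slides) and reuses `perceptron`, `AND` and `OR` from the previous snippet: two hidden units compute OR and NAND, and the output unit ANDs them, since XOR(a, b) = (a OR b) AND NOT(a AND b).

NAND = [1.5, -1, -1]   # fires unless both inputs are 1

def xor_net(a1, a2):
    hidden = [perceptron(OR, [a1, a2]),     # hidden unit 1
              perceptron(NAND, [a1, a2])]   # hidden unit 2
    return perceptron(AND, hidden)          # output unit

for a1 in (0, 1):
    for a2 in (0, 1):
        print(a1, a2, xor_net(a1, a2))      # -> 0, 1, 1, 0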

Multi-Layer Network

[Figure: a feed-forward network with an input layer, one hidden layer, and an output layer.]

Training Multi-Layer Networks

- Training multi-layer networks can be a bit complicated (the weight space is large!).
- The perceptron rule works fine for a single unit that maps input features to the final output value.
- But hidden units do not produce the final output, and the output unit(s) take the outputs of other perceptrons, not known feature values, as inputs.
- The solution is to use the back-propagation algorithm, which is an intuitive extension of the perceptron training algorithm.

Perceptron Training

Conceptually, the perceptron rule does the following:
- Compare the perceptron's output o to what it should have been (the true value of f), i.e., compute the error.
- If the error is large, assign blame to the weight/input combinations that most influenced the wrong call, and raise/lower those weights accordingly.
- If the error is small, don't change them as much.
The key parameter is the learning rate α:
- If α is too small, learning is slow and convergence takes forever.
- If α is too large, the rule can make changes that are too drastic.

Perceptron Learning

A perceptron learns by adjusting its weights in order to minimize the error on the training set. Consider updating a single weight on a single example with the perceptron learning rule (sketched in code below):

  Err = y - a_i                    y: the desired output; a_i: activation of the current node i
  x_j = a_j w_j                    w_j: weight between node j and node i; x_j: weight influence from input j
  \Delta w_j = \alpha \, Err \, x_j
  w_j \leftarrow w_j + \Delta w_j
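A sketch of one training step, reusing `perceptron` from earlier. Note one deliberate difference: where the slide defines the influence term as x_j = a_j w_j, the common textbook form of the perceptron rule uses the raw input, x_j = a_j, and that is what is implemented here. The learning rate value is an assumption.

ALPHA = 0.1  # learning rate (assumed value)

def train_step(weights, inputs, y):
    """Err = y - a_i; dw_j = ALPHA * Err * x_j; w_j <- w_j + dw_j."""
    a_i = perceptron(weights, inputs)   # current output of node i
    err = y - a_i                       # Err = y - a_i
    xs = [1] + list(inputs)             # bias input a_0 = 1, then the a_j
    return [w + ALPHA * err * x for w, x in zip(weights, xs)]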

Perceptron Learning

- Now that we can update a single weight on one example, we want to update all weights so that we minimize the error on the whole training set.
- So this is really an optimization search in weight space, which is the hypothesis space for perceptrons.
- For reasons we won't get into, it's actually useful to minimize the squared error over the whole set:

  E[\mathbf{w}] \equiv \sum_d (true_d - o_d)^2

  where E[w] is the sum of squared errors for the weight vector w, and d ranges over the examples in the training set.

Gradient Descent

- Recall that such minimization problems are called gradient descent tasks.
- If we have a perceptron with 2 weights, we want to find the pair of weights (i.e., the point in 2D weight space) where E[w] is lowest.
- But the weights are continuous values, so how do we know how much to change them?

Gradient Descent

- Solution: use calculus. Compute the magnitude and direction of the gradient of the error surface by using partial derivatives:

  \nabla E[\mathbf{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]

- We want to update each weight w_i by \Delta w_i:

  \Delta w_i = -\alpha \frac{\partial E}{\partial w_i}

- Combining this with our weight update rule gives the complete perceptron training rule:

  w_i \leftarrow w_i - \alpha \frac{\partial E}{\partial w_i}

- This makes sense: if (true - o) is positive, the weight should be increased for positive inputs a_i and decreased for negative ones.

Back-Propagation (BP)

BP generalizes the perceptron rule (a compact implementation is sketched below):
- A gradient-descent search to minimize the error on the training data (again, usually in iterative mode).
- In the forward pass, the features are fed forward to the output layer, where the error is calculated.
- From an earlier slide: the perceptron rule worked fine for a single unit that mapped input features to the final output value, but hidden units do not produce the final output, and the output unit(s) take the outputs of other perceptrons, not known feature values, as inputs.
- In the backward pass, the error is therefore propagated back through the network, assigning blame to the hidden units' weights.
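A compact numpy back-propagation sketch of my own: one hidden layer of sigmoid units trained on XOR by gradient descent on the squared error. The layer sizes, learning rate, epoch count and initialization are all assumed values; how closely it approaches [0, 1, 1, 0] depends on the random initialization.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(0, 0.5, (3, 4))   # (2 inputs + bias) -> 4 hidden units
W2 = rng.normal(0, 0.5, (5, 1))   # (4 hidden + bias) -> 1 output unit
alpha = 0.5

Xb = np.hstack([X, np.ones((4, 1))])           # append bias input a_0 = 1
for epoch in range(20000):
    # Forward pass: feed the features through to the output layer.
    H = sigmoid(Xb @ W1)
    Hb = np.hstack([H, np.ones((4, 1))])       # bias for the output layer
    O = sigmoid(Hb @ W2)
    # Backward pass: propagate the error back and assign blame.
    dO = (O - Y) * O * (1 - O)                 # output deltas (squared error)
    dH = (dO @ W2[:4].T) * H * (1 - H)         # hidden deltas (skip bias row)
    W2 -= alpha * Hb.T @ dO                    # gradient-descent updates
    W1 -= alpha * Xb.T @ dH

print(np.round(O.ravel(), 2))                  # typically close to [0 1 1 0]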

Problems with BP

Because BP is a gradient descent (hill-climbing) search, it suffers from the same problems:
- It doesn't necessarily find the globally best weight vector.
- Convergence is determined by the starting point (the randomly initialized weights).
- If α is set too large, the search can bounce right over the global minimum into a local minimum.
To deal with these problems:
- Usually initialize the weights close to 0.
- Can repeat training with multiple random restarts.

Training and Over Training

[Figure: total error versus number of epochs (×100); the error on the training set (solid) keeps decreasing, while the error on the test set (dotted) eventually starts rising again.]

- Over training occurs when the model has fit all the real characteristics of the data and starts fitting the noise in the data.
- The problem increases with the number of weights (i.e., the model complexity).

Overfitting in Neural Nets

- As usual, we can use a tuning set to avoid overfitting in neural nets.
- We can train several candidate structures and use the tuning set to find one that is appropriately expressive.
- More common: given a network structure, use early stopping by evaluating the network on the tuning set after each epoch (sketched after this slide):
  - Stop when performance begins to dip on the tuning set.
  - Sometimes allow a fixed number of epochs beyond the dip, just in case it goes back up.

Neural Network Characteristics

- Universal approximators: very powerful and can approximate any function, but they do have drawbacks:
  - Many weights to learn: training can take a while.
  - Sensitive to structure, initial weights, and learning rate.
- Well suited for parallel computing, since each node only depends on the connected neighbour nodes.
- Seem to have good generalization abilities; tolerant to noise in the input.
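Early stopping as a loop (a sketch; `net`, `train_one_epoch` and `error_on` are hypothetical helpers standing in for any network, one back-propagation pass, and tuning-set evaluation):

def train_with_early_stopping(net, train_set, tuning_set,
                              patience=10, max_epochs=1000):
    best_err, best_weights, since_best = float("inf"), net.weights(), 0
    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)      # one back-propagation pass
        err = error_on(net, tuning_set)      # evaluate on the tuning set
        if err < best_err:                   # new best point so far
            best_err, best_weights, since_best = err, net.weights(), 0
        else:
            since_best += 1
            if since_best >= patience:       # dip has lasted long enough
                break
    net.set_weights(best_weights)            # roll back to the best epoch
    return net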

Neural Network Characteristics

- Can be applied to a number of totally different applications.
- Black boxes: it is hard to describe WHY the network produces a certain output or decision.
- Prior knowledge: it is hard to incorporate knowledge about the function into the architecture.

Summary

- "The second best solution to pretty much everything!"
- Perceptrons are mathematical models of neurons (brain cells):
  - They learn linearly separable functions.
  - They are insufficiently expressive for many problems.
- Neural networks are machine learning models with multiple layers of perceptrons:
  - Trained using back-propagation, a gradient descent search through weight space (the NN hypothesis space).
  - Sufficiently expressive for any classification or regression task, and quite robust to noise.

Summary

Many applications:
- Speech processing, driving, face/handwriting recognition, backgammon, checkers, etc.

Disadvantages:
- Overly expressive: prone to overfitting.
- Difficult to design an appropriate structure.
- Many parameters to estimate: slow training.
- Hill-climbing can get stuck in local optima.
- Poor comprehensibility (the black-box problem).
