
Machine Learning

What is Learning?
"Changes in the system that are adaptive in the
sense that they enable the system to do the same task
or tasks drawn from the same population more
efficiently the next time."
--Herbert Simon
"Learning is constructing or modifying
representations of what is being experienced."
--Ryszard Michalski
"Learning is making useful changes in our minds."
--Marvin Minsky

What does Machine Learning do?


Understand and improve efficiency of human learning.
Discover new things or structures that are unknown to
humans
Example: Data mining
Fill in incomplete specifications about a domain.
- Large, complex AI systems cannot be completely
derived by hand and require dynamic updating to
incorporate new information.
- Learning new characteristics expands the domain of
expertise and lessens the "brittleness" of the
system

Components of a Learning System


A learning system consists of a Performance Element, a
Learning Element, a Critic, and a Problem Generator,
connected to the environment through Sensors and Effectors.

Learning Element -- makes changes to the system based on
how it is doing.
Performance Element -- the part that chooses the actions to
take.
Critic -- tells the Learning Element how it is doing (e.g., success
or failure) by comparing with a fixed standard of performance.
Problem Generator -- suggests "problems" or actions that will
generate new examples or experiences that will aid in training
the system further.
In designing a learning system, there are four major issues to be
considered:
components -- which parts of the performance
element are to be improved
representation of those components
feedback available to the system
prior information available to the system

Major Paradigms of Machine Learning


Rote Learning
One-to-one mapping from inputs to stored representation.
Learning by memorization: association-based storage and
retrieval.
Analogy
- Determine correspondence between two different
representations.
- A form of inductive learning in which a system transfers
knowledge from one domain into a different domain.
Inductive Learning
- Use specific examples to reach general conclusions.
- Extrapolate from a given set of examples so that we can
make accurate predictions about future examples.

Supervised versus Unsupervised learning


Want to learn an unknown function f(x) = y, where x is an input
example and y is the desired output.
- Supervised learning
A situation in which both the inputs and the outputs of a
component can be observed.
Supervised learning implies that we are given a set of (x, y) pairs
by a supervisor or teacher.
- Unsupervised learning
Learning when there is no information about what the correct
outputs are.
Unsupervised learners can learn to predict future percepts based
upon present ones, but cannot learn which actions to take without a
utility function.
Unsupervised learning means that we are only given the xs. In
either case, the goal is to estimate f.

Clustering
It is unsupervised, inductive learning in which "natural classes"
are found for data instances, as well as ways of classifying them.
Discovery
- Unsupervised learning, specific goal is not given.
- It is both inductive and deductive learning in which system learns
without the help from a teacher.
- It is deductive if it proves theorems and discovers concepts about
those theorems.
- It is inductive when it raises conjectures (formulation of opinion
using incomplete information).
Reinforcement
- Only feedback (positive or negative reward) given at the end of a
sequence of steps.
Requires assigning reward to individual steps by solving the
credit-assignment problem, i.e., deciding which steps should
receive credit or blame for a final result.

Learning from examples (Concept learning)


- Inductive learning in which concepts are learned from sets
of labeled instances.
- Given a set of examples of some concept/class/category,
determine if a given example is an instance of the concept or
not.
- If it is an instance, we call it a positive example.
- If it is not, it is called a negative example.
Supervised Concept Learning by Induction
- Given a training set of positive and negative examples of a
concept, construct a description that will accurately classify
whether future examples are positive or negative.
- That is, learn some good estimation of function f given a
training set {(x1, y1), (x2, y2), ..., (xn, yn)} where each yi is
either + (positive) or - (negative).

Inductive Bias
Inductive learning is an inherently conjectural process
because any knowledge created by generalization from
specific facts cannot be proven true; it can only be proven
false. Hence, inductive inference is falsity preserving, not
truth preserving.
To generalize beyond the specific training examples, we
need constraints or biases on what f is best.
That is, learning can be viewed as searching the
Hypothesis Space H of possible f functions.
A bias allows us to choose one f over another one.
A completely unbiased inductive algorithm could only
memorize the training examples and could not say anything
more about other unseen examples.
Two types of biases are commonly used in machine
learning:

Restricted Hypothesis Space Bias


Allow only certain types of f functions, not arbitrary ones.
Preference Bias
Define a metric for comparing fs so as to determine whether
one is better than another.
Inductive Learning Framework
Raw input data from sensors are preprocessed to obtain a
feature vector X that adequately describes all of the relevant
features for classifying examples. Each X is a list of (attribute,
value) pairs.
For example,
X = {(Person, Sue), (Eye_Color, Brown), (Age, Young),
(Sex, Female)}
The number of attributes (also called features) is fixed. Each
attribute has a fixed, finite number of possible values.

Inductive Learning by Nearest-Neighbor Classification
One simple approach to inductive learning is to save each
training example as a point in Feature Space, and then
classify a new example by giving it the same classification
(+ or -) as its nearest neighbor in Feature Space.
The problem with this approach is that it doesn't
necessarily generalize well if the examples are not
"clustered."
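As a rough sketch (the training points below are made up for illustration), the nearest-neighbor rule can be implemented in a few lines:

```python
# Nearest-neighbor classification sketch: store each training example
# as a point in feature space; classify a new point by the label of
# its closest stored neighbor (squared Euclidean distance).

def nearest_neighbor_classify(train, x):
    """train: list of (point, label) pairs; x: point to classify."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda ex: dist2(ex[0], x))
    return label

# Hypothetical labeled points: two '+' examples near (1, 1), one '-' at (5, 5).
train = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((5.0, 5.0), '-')]
print(nearest_neighbor_classify(train, (0.9, 1.1)))  # -> +
```

Note that a query point far from both groups still gets the label of whichever stored point happens to be closest, which is exactly the generalization problem described above.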
Inductive Concept Learning by Learning Decision Trees
Goal: Build a decision tree for classifying examples as
positive or negative instances of a concept
A decision tree is a simple inductive learning structure.

Given an instance of an object or situation, which is
specified by a set of properties, the tree returns a "yes" or
"no" decision about that instance.
In a decision tree, each non-leaf node has an associated
attribute (feature), each leaf node has an associated
classification (+ or -), and each arc is labeled with one of the
possible values of the attribute at the node the arc leaves.
Decision Tree Construction using a Greedy Algorithm
- The algorithm, called ID3 (later refined into C4.5 and
C5.0), was originally developed by Quinlan (1986)
- Top-down construction of the decision tree by
recursively selecting the "best attribute" to use at the
current node in the tree.
- Once the attribute is selected for the current node,
generate children nodes, one for each possible value
of the selected attribute.

Partition the examples using the possible values of this
attribute, and assign these subsets of the examples to the
appropriate child node.
Repeat for each child node until all examples associated with
a node are either all positive or all negative.
How to Choose the Best Attribute for a Node?
Some possibilities:
Random: Select any attribute at random
Least-Values: Choose the attribute with the smallest
number of possible values
Most-Values: Choose the attribute with the largest
number of possible values
Max-Gain: Choose the attribute that has the largest
expected information gain. In other words, try to select
the attribute that will result in the smallest expected size
of the subtrees rooted at its children.
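The Max-Gain heuristic can be sketched as follows (a minimal illustration; the tiny example set is hypothetical):

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of class labels such as '+'/'-'."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(examples, attr):
    """Expected reduction in entropy from splitting on attr.
    examples: list of (attribute_dict, label) pairs."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for value in set(x[attr] for x, _ in examples):
        subset = [y for x, y in examples if x[attr] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder

# A perfectly predictive attribute yields the maximum possible gain:
examples = [({'a': 0}, '-'), ({'a': 1}, '+'),
            ({'a': 0}, '-'), ({'a': 1}, '+')]
print(information_gain(examples, 'a'))  # -> 1.0
```

Max-Gain picks the attribute maximizing this quantity at each node, which tends to produce small subtrees.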

Case Studies
Many case studies have shown that decision trees are at
least as accurate as human experts.
For example, in one study on diagnosing breast cancer,
humans correctly classified the examples 65% of the
time, while the decision tree classified 72% correctly.
British Petroleum designed a decision tree for gas-oil
separation for offshore oil platforms.
Replaced a rule-based expert system.
Cessna designed an airplane flight controller using
90,000 examples and 20 attributes per example.

Knowledge-based inductive learning


Learning in which both the background knowledge and
the new hypothesis combine to explain the observations.
This inductive learning generates new knowledge.
This is a kind of learning in which our background
knowledge, together with our observations, lead us to make
a hypothesis that explains the examples we see.
The entailment constraint in this case is
Background ^ Hypothesis ^ Descriptions |=
Classifications
Such knowledge-based inductive learning has been
studied mainly in the field of inductive logic programming.

Such systems reduce learning complexity in two ways.


First, by requiring all new hypotheses to be consistent with
existing knowledge, they reduce the search space of
hypotheses.
Second, the more prior knowledge is available, the less new
knowledge is required in the hypothesis to explain the
observations.
Memoization
Accumulating a database of input/output values to avoid
having to reason again about an already-solved problem.
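A minimal memoization sketch in Python (Fibonacci is used here only as a stand-in for "an already-solved problem"):

```python
# Memoization: keep a table of input/output pairs so that a problem
# solved once is looked up instead of being solved again.

def memoize(f):
    cache = {}
    def wrapper(x):
        if x not in cache:
            cache[x] = f(x)
        return cache[x]
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # -> 832040 (linear rather than exponential time)
```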
Incremental learning
A kind of learning in which, as more examples are shown
over time, the system improves its hypotheses.

Deductive learning:
- Deductive learning works on existing facts and knowledge and
deduces new knowledge from the old, so deductive learning or
reasoning can be described as reasoning of the form "if A then B."
- Arguably, deductive learning does not generate "new"
knowledge at all; it simply memorizes the logical consequences
of what is known already.
- Deduction is in some sense the direct application of
knowledge in the production of new knowledge.
- However, this new knowledge does not represent any new
semantic information: the rule already represents the knowledge
completely, since any time the assertions (A) are true, the
conclusion (B) is true as well.
- Purely deductive learning includes methods like
explanation-based learning.

Explanation-based learning
This is a form of deductive learning that converts general
principles into usable rules.
This kind of learning occurs when the system finds an
explanation of an instance it has seen, and generalizes the
explanation.
The general rule follows logically from the background
knowledge possessed by the system.
The basic idea is to construct an explanation of the observed
result, and then generalize the explanation.
Then a new rule is built in which the left-hand side is the leaves
of the proof tree and the right-hand side is the variablized goal,
up to any bindings that must be made in the generalized proof.
Any conditions that are true regardless of the variables are dropped.

PAC learning
Probably Approximately Correct learning. A hypothesis h is PAC
if Pr[error(f, h) > e] < d, where f is the target function, h is the
learned hypothesis, e is the accuracy parameter, and d is the
confidence parameter.
Bayesian learning
Learning that treats the problem of building hypotheses as a
particular case of the problem of making predictions.
The probabilities of various hypotheses are estimated, and
predictions are made using the posterior probabilities of the
hypotheses to weight them.
Adaptive dynamic programming
Adaptive dynamic programming is any kind of reinforcement
learning method that works by solving the utility equations
using a dynamic programming algorithm.

Neural network
A computational model somewhat similar to the human
brain; it has many simple units that work in parallel with no
central control. Connections between units are weighted, and
these weights can be modified by the learning system.
It is a form of Connectionist learning in which the data
structure is a set of nodes connected by weighted links, each
node passing a 0 or 1 to other links depending on whether a
function of its inputs reaches its activation level.
Relevance-based learning
This is a kind of learning in which background knowledge relates
the relevance of a set of features in an instance to the general
goal predicate.

For example, if I see men in Rome speaking Latin, and I
know that seeing someone in a city speaking a language
usually means all people in the city speak that language, I can
conclude that Romans speak Latin.
In general, background knowledge, together with the
observations, allows the agent to form a new, general rule to
explain the observations.
The entailment constraint for Relevant Based Learning is
Hypothesis ^ Descriptions |= Classifications
Background ^ Descriptions ^ Classifications |=
Hypothesis
This is a deductive form of learning, because it cannot
produce hypotheses that go beyond the background
knowledge and observations.
We presume that our knowledge base has a set of functional
dependencies that support the construction of hypotheses.

Genetic Algorithms and Evolutionary Programming
Genetic algorithms are inspired by Darwin's theory of
evolution by natural selection.
In the natural world, organisms that are poorly suited for
an environment die off, while those well-suited for it
prosper.
Genetic algorithms search the space of individuals for good
candidates.
Algorithm begins with a set of initial solutions
(represented by set of chromosomes) called population.
A chromosome is a string of elements called genes.
Solutions from one population are taken and are used to
form a new population by generating offspring.
This is motivated by the hope that the new population will
be better than the old one.

The new population is formed from the old population and the
offspring, based on their fitness values.
The "goodness" of an individual is measured by some
fitness function.
This is repeated until some condition (for example, no
improvement of the best solution, or a fixed number of
iterations) is satisfied.
Search can take place in parallel, with many
individuals in each generation.
The approach is a hill-climbing one, since in each
generation, the offspring of the best candidates are
preserved.
In the standard genetic algorithm approach, each
individual is a bit-string that encodes its characteristics.

Outline of the Basic Genetic Algorithm


1. [Start] Generate random population of n chromosomes
(suitable solutions for the problem)
2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the
population
3. Repeat until terminating condition is satisfied
3.1. [Selection] Select two parent chromosomes from a
population according to their fitness (the better fitness, the
bigger chance to be selected).
3.2.[Crossover] With a crossover probability, cross over the
parents to form new offspring (children). If no crossover was
performed, offspring is the exact copy of parents.
3.3. [Mutation] With a mutation probability, mutate new
offspring at each locus (position in chromosome).
3.4. [Accepting] Generate the new population by placing the
new offspring into it

4. Return the best solution in current population
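The outline above can be sketched in code. This is a toy illustration, not a production GA: the fitness function (counting 1-bits, the "OneMax" problem) and all parameter values are arbitrary choices for the demo:

```python
import random

def genetic_algorithm(n=20, length=16, generations=60,
                      p_cross=0.7, p_mut=0.01, seed=0):
    """Toy GA following the outline above; fitness counts 1-bits."""
    rng = random.Random(seed)
    def fitness(c):
        return sum(c)
    # 1. [Start] random population of n chromosomes
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    for _ in range(generations):            # 3. repeat until done
        new_pop = []
        while len(new_pop) < n:
            # 3.1 [Selection] better fitness -> bigger chance to be picked
            p1, p2 = rng.choices(pop, weights=[fitness(c) + 1 for c in pop], k=2)
            # 3.2 [Crossover] swap tails at a random point
            if rng.random() < p_cross:
                cut = rng.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            # 3.3 [Mutation] flip each locus with probability p_mut
            for child in (c1, c2):
                for i in range(length):
                    if rng.random() < p_mut:
                        child[i] = 1 - child[i]
                new_pop.append(child)       # 3.4 [Accepting]
        pop = new_pop[:n]
    return max(pop, key=fitness)            # 4. best solution found
```

After a few dozen generations this reliably finds chromosomes that are mostly (often entirely) ones.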

The algorithm consists of looping through generations.


In each generation, a subset of the population is selected to
reproduce; usually this is a random selection in which the
probability of choice is proportional to fitness.
Selection is usually done with replacement (so a fit individual
may reproduce many times).
Reproduction occurs by randomly pairing all of the
individuals in the selection pool, and then generating two new
individuals by performing crossover, in which the initial n bits
(where n is random) of the parents are exchanged.
There is a small chance that one of the genes in the resulting
individuals will mutate to a new value.
We may worry that generating populations from only two
parents could cause us to lose the best chromosome from the
last population. Elitism addresses this: at least one of a
generation's best solutions is copied without changes to the new
population, so the best solution can survive to the succeeding
generation.
Genetic algorithms are broadly applicable and have the advantage
that they require little knowledge encoded in the system.
However, as might be expected from a knowledge-poor approach,
they give very poor performance on some problems.
The outline of the basic GA above is very general; there are many
parameters and settings that can be implemented differently for
various problems. The following questions need to be answered:
* How to create chromosomes and what type of encoding to
choose?
* How to perform Crossover and Mutation, the two basic
operators of GA?
* How to select parents for crossover?
(This can be done in many ways, but the main idea is
to select the better parents)

Encoding of a Chromosome
A chromosome should in some way contain information about the
solution it represents.
The commonly used way of encoding is a binary string.
Each chromosome is represented by a binary string and could
look like this:
Chromosome 1
1101100100110110
Chromosome 2
1101111000011110
Each bit in the string can represent some characteristics of the
solution.
There are many other ways of encoding. The encoding depends
mainly on the problem.
Crossover
After we have decided what encoding we will use, we can
proceed to crossover operation.

Crossover operates on selected genes from parent chromosomes
and creates new offspring.
The simplest way is to choose randomly some crossover point
and copy everything before this point from the first parent and then
copy everything after the crossover point from the other parent.
Crossover can be illustrated as follows: ( | is the crossover point):
Chromosome 1
11011 | 00100110110
Chromosome 2
11011 | 11000011110
Offspring 1
11011 | 11000011110
Offspring 2
11011 | 00100110110
There are other ways to make crossover, for example we can
choose more crossover points.
Crossover can be quite complicated and depends mainly on the
encoding of chromosomes.
Specific crossover made for a specific problem can improve
performance of the genetic algorithm.

Mutation
After a crossover is performed, mutation takes place.
Mutation is intended to prevent all solutions in the
population from falling into a local optimum of the problem.
The mutation operation randomly changes the offspring
resulting from crossover. In case of binary encoding, we can
switch a few randomly chosen bits from 1 to 0 or from 0 to 1.
Mutation can be then illustrated as follows:
Original offspring 1 1101111000011110
Original offspring 2 1101100100110110
Mutated offspring 1 1100111000011110
Mutated offspring 2 1101101100110110
The technique of mutation (as well as crossover) depends
mainly on the encoding of chromosomes.

Crossover and Mutation


As already mentioned, crossover and mutation are two
basic operators of GA.
Performance of GA depends on the encoding and also on
the problem.
There are several encoding schemes to perform crossover
and mutation.

Binary Encoding
1. Crossover
Single point crossover - one crossover point is selected,
binary string from the beginning of the chromosome to the
crossover point is copied from the first parent, the rest is
copied from the other parent
11001011+11011111 = 11001111

Two point crossover - two crossover points are selected; the binary
string from the beginning of the chromosome to the first crossover
point is copied from the first parent, the part from the first to the
second crossover point is copied from the other parent, and the rest
is copied from the first parent again
11001011 + 11011111 = 11011111
Arithmetic crossover - an arithmetic operation is performed to make
a new offspring
11001011 + 11011111 = 11001011 (AND)
Uniform crossover - bits are randomly copied from the first or
from the second parent
11001011 + 11011101 = 11011111

2. Mutation
Bit inversion - selected bits are inverted
11001001 => 10001001
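These binary-encoding operators are straightforward to implement; a sketch using the example strings from above:

```python
import random

def single_point(p1, p2, point):
    """Single point crossover: head of p1 joined to tail of p2."""
    return p1[:point] + p2[point:]

def uniform(p1, p2, rng):
    """Uniform crossover: each bit copied at random from either parent."""
    return ''.join(rng.choice(pair) for pair in zip(p1, p2))

def bit_inversion(chrom, positions):
    """Mutation: invert the bits at the given (0-based) positions."""
    bits = list(chrom)
    for i in positions:
        bits[i] = '1' if bits[i] == '0' else '0'
    return ''.join(bits)

print(single_point('11001011', '11011111', 4))  # -> 11001111
print(bit_inversion('11001001', [1]))           # -> 10001001
```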

Permutation Encoding
1. Crossover
Single point crossover - one crossover point is selected, the
permutation is copied from the first parent till the crossover
point, then the other parent is scanned and if the number is
not yet in the offspring, it is added
Note: there are other ways to produce the rest after the
crossover point
(1 2 3 4 5 6 7 8 9) + (4 5 3 6 8 9 7 2 1) =
(1 2 3 4 5 6 8 9 7)
2. Mutation
Order changing - two numbers are selected and exchanged
(1 2 3 4 5 6 8 9 7) => (1 8 3 4 5 6 2 9 7)
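A sketch of these permutation operators, reproducing the examples above:

```python
def permutation_crossover(p1, p2, point):
    """Copy p1 up to the crossover point, then append p2's numbers
    in order, skipping any that are already in the offspring."""
    child = list(p1[:point])
    child += [g for g in p2 if g not in child]
    return child

def order_changing(perm, i, j):
    """Mutation: exchange the numbers at positions i and j (0-based)."""
    perm = list(perm)
    perm[i], perm[j] = perm[j], perm[i]
    return perm

print(permutation_crossover([1, 2, 3, 4, 5, 6, 7, 8, 9],
                            [4, 5, 3, 6, 8, 9, 7, 2, 1], 6))
# -> [1, 2, 3, 4, 5, 6, 8, 9, 7]
print(order_changing([1, 2, 3, 4, 5, 6, 8, 9, 7], 1, 6))
# -> [1, 8, 3, 4, 5, 6, 2, 9, 7]
```

Both operators preserve the permutation property: every number appears exactly once in the offspring.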

Value Encoding
1. Crossover
All crossover methods from binary encoding can be used
2. Mutation
Adding (for real value encoding) - a small number is added to
(or subtracted from) selected values
(1.29 5.68 2.86 4.11 5.55) => (1.29 5.68 2.73 4.22 5.55)

Tree Encoding
1. Crossover
Tree crossover - one crossover point is selected in both parents,
and the parts below crossover points are exchanged to produce
new offspring
2. Mutation
Changing operator, number - selected nodes are changed

Evolutionary programming is usually taken to mean a more
complex form of genetic programming in which the
individuals are more complex structures.
Classifier systems are genetic programs that develop rules
for classifying input instances.
Each rule is weighted, and has some set of feature patterns
as a condition and some pattern as an action.
On each cycle, the rules that match make bids proportional
to their weights.
One or more rules are then selected for application with
probabilities based on their bids.
Weights are adjusted according to the contribution of the
rules to desired behavior; the bucket brigade algorithm
[Holland, 1985] is an effective method of doing this.
