
Classification: Neural Networks (part 1)
Machine Learning

Classification
 Classification is a multivariate technique concerned with
assigning data cases (i.e. observations) to one of a fixed number
of possible classes (represented by nominal output variables).

The goal of classification is to:

 sort observations into two or more labeled classes. The emphasis is on deriving a rule that can be used to optimally assign new objects to the labeled classes.

 In short, the aim of classification is to assign input cases to one of a number of classes.
Simple Pattern Classification
Example
 Let us consider a simple problem of distinguishing handwritten versions of the characters ‘a’ and ‘b’.

 We seek an algorithm which can distinguish as reliably as possible between the two characters.

 Therefore, the goal in this classification problem is to develop an algorithm which will assign any image, represented by a vector x, to one of two classes, which we shall denote by Ck, where k = 1, 2, so that class C1 corresponds to the character ‘a’ and class C2 corresponds to ‘b’.
Example
 A large number of input variables can present severe problems for pattern recognition systems. One technique to alleviate such problems is to combine input variables together to make a smaller number of new variables called features.

 In the present example we could evaluate the ratio of the height of the character to its width (x1), and we might expect that characters from class C2 (corresponding to ‘b’) will typically have larger values of x1 than characters from class C1 (corresponding to ‘a’).
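As an illustration only (not from the slides), a hedged R sketch of how such a height-to-width feature might be computed; the binary image matrix img and the function name are assumed:

# Sketch: compute the height-to-width ratio feature x1 from a binary image.
# 'img' is an assumed 0/1 pixel matrix with 1 marking ink (illustrative only).
extract_x1 <- function(img) {
  rows <- which(rowSums(img) > 0)   # rows containing ink
  cols <- which(colSums(img) > 0)   # columns containing ink
  height <- max(rows) - min(rows) + 1
  width  <- max(cols) - min(cols) + 1
  height / width                    # taller characters ('b') give larger x1
}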
 How can we make the best use of x1 to classify a new image so as to minimize the number of misclassifications?

 One approach would be to build a classifier system which uses a threshold for the value of x1 and which classifies as C2 any image for which x1 exceeds the threshold, and which classifies all other images as C1.

 The number of misclassifications will be minimized if we choose the threshold to be at the point where the two histograms cross.

This classification procedure is based on the evaluation of x1 followed by its comparison with a threshold.
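A minimal R sketch of such a threshold rule (the x1 values and the threshold below are assumed, not derived from real histograms):

# Sketch: classify on a single feature x1 by comparison with a threshold
# chosen where the two class histograms cross (all values assumed).
classify_x1 <- function(x1, threshold) {
  ifelse(x1 > threshold, "C2", "C1")   # larger x1 suggests 'b' (class C2)
}
classify_x1(c(0.8, 1.4, 2.1), threshold = 1.2)   # "C1" "C2" "C2"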

 The problem with this classification procedure is that there is still significant overlap of the histograms, and many of the new characters we test will be misclassified.
 Now consider another feature x2. We try to classify new images on the basis of the values of x1 and x2.

 We see examples of patterns from two classes plotted in the (x1, x2) space. It is possible to draw a line in this space, known as the decision boundary, which gives good separation of the two classes.

 New patterns which lie above the decision boundary are classified as belonging to C1, while patterns falling below the decision boundary are classified as C2.
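A small R sketch of classification with a linear decision boundary in the (x1, x2) space (the coefficients are assumed for illustration, not fitted to data):

# Sketch: a linear decision boundary w1 + w2*x1 + w3*x2 = 0 in feature space;
# patterns on one side are assigned to C1, the other side to C2
# (the coefficients below are assumed).
classify_2d <- function(x1, x2, w = c(-1, 0.5, 1.5)) {
  score <- w[1] + w[2] * x1 + w[3] * x2
  ifelse(score > 0, "C1", "C2")
}
classify_2d(x1 = 1.0, x2 = 0.8)   # "C1"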
 We could continue to consider a larger number of independent features in the hope of improving the performance.
 Instead, we could aim to build a classifier which has the smallest probability of making a mistake.
Classification Theory
 In the terminology of pattern recognition, the given
examples together with their classifications are known
as the training set and future cases form the test set.
 Our primary measure of success is the error or
(misclassification) rate.
 The confusion matrix gives the number of cases with true class i classified as class j.
 If we assign a cost Lij to allocating a case of class i to class j, we are then interested in the average error cost rather than the error rate.
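As a hedged R sketch (the labels and the cost matrix L below are assumed), the confusion matrix, error rate and average error cost can be computed as follows:

# Sketch: confusion matrix, error rate and average error cost
# (true/predicted labels and the cost matrix L are assumed for illustration).
true <- factor(c("a", "a", "b", "b", "b", "c"), levels = c("a", "b", "c"))
pred <- factor(c("a", "b", "b", "b", "c", "c"), levels = c("a", "b", "c"))
conf <- table(true, pred)           # entry (i, j): true class i classified as j
error_rate <- 1 - sum(diag(conf)) / sum(conf)
L <- matrix(1, 3, 3); diag(L) <- 0  # L[i, j]: cost of allocating class i to class j
avg_cost <- sum(conf * L) / sum(conf)
conf; error_rate; avg_cost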
Average Error Cost
 The average error cost is minimized by the Bayes rule, which is to allocate to the class c minimizing ∑i Lic p(i|x),
 where p(i|x) is the posterior distribution of the classes after observing x.
 If the costs of all errors are the same this rule amounts
to choosing the class c with the largest posterior
probability p(c|x).
 Minimum average cost is known as the Bayes risk.
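A minimal R sketch of the Bayes rule under unequal costs for a single case x (the posterior probabilities p(i|x) and the cost matrix L are assumed):

# Sketch: Bayes rule under unequal costs (posteriors and costs assumed).
# L[i, c] is the cost of allocating a case of true class i to class c.
p <- c(a = 0.5, b = 0.3, c = 0.2)     # posterior p(i|x) for one observed x
L <- matrix(c(0, 1, 5,
              1, 0, 5,
              1, 1, 0), nrow = 3, byrow = TRUE,
            dimnames = list(names(p), names(p)))
expected_cost <- colSums(L * p)       # sum over i of L[i, c] * p(i|x) for each c
names(which.min(expected_cost))       # allocate to the class with lowest cost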
Classification and Regression
 We can represent the outcome of the classification in terms of a variable y which takes the value 1 if the image is classified as C1, and the value 0 if it is classified as C2.
 More generally, yk = yk(x; w),
 where w denotes the vector of parameters, often called weights.
 The importance of neural networks in this context is that they
offer a very powerful and very general framework for
representing non-linear mappings from several input variables to
several output variables where the form of the mapping is
governed by a number of adjustable parameters.
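As an illustration (the weights and the simple logistic form are assumed, standing in for a full network), a parametrised mapping y(x; w) used for two-class classification might look like:

# Sketch: a simple parametrised mapping y(x; w) for two classes
# (logistic form with assumed weights; a neural network generalises this
# to flexible non-linear mappings with many adjustable weights w).
y <- function(x, w) 1 / (1 + exp(-(w[1] + sum(w[-1] * x))))
classify <- function(x, w) if (y(x, w) >= 0.5) "C1" else "C2"
classify(x = c(1.2, 0.7), w = c(-0.5, 1.0, 2.0))   # "C1"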
Objective: Simulate the Behavior of a Human Nerve

 Inputs are accumulated by a weighted sum.
 This sum is the input to the output function φ.
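A minimal R sketch of a single neuron (the weights, bias and the logistic choice of φ are assumed):

# Sketch: one artificial neuron -- a weighted sum of the inputs passed
# through an output function phi (here the logistic function; values assumed).
phi <- function(a) 1 / (1 + exp(-a))
neuron <- function(x, w, b) phi(sum(w * x) + b)
neuron(x = c(0.5, 1.5), w = c(0.8, -0.3), b = 0.1)   # about 0.51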
A single neuron is not very flexible

 The input layer contains the value of each variable.
 The hidden layer allows approximations by combining multiple logistic functions.
 The output neuron with the highest probability determines the class.
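A hedged R sketch of the forward pass through such a network with one hidden layer (all weight matrices below are assumed; the class is read off the output unit with the largest value):

# Sketch: forward pass of a one-hidden-layer network (weights assumed).
# Hidden units combine logistic functions of the inputs; the output unit
# with the highest value determines the class.
phi <- function(a) 1 / (1 + exp(-a))
forward <- function(x, W1, bias1, W2, bias2) {
  h <- phi(W1 %*% x + bias1)     # hidden layer activations
  out <- drop(W2 %*% h + bias2)  # one score per class
  names(out)[which.max(out)]     # class with the highest output wins
}
W1 <- matrix(c(0.5, -1.2, 0.8, 0.3), 2, 2)            # 2 hidden units, 2 inputs
W2 <- matrix(c(1.0, -0.7, -0.5, 0.9), 2, 2,
             dimnames = list(c("a", "b"), NULL))      # 2 output classes
forward(x = c(1.5, 0.2), W1 = W1, bias1 = c(0, 0), W2 = W2, bias2 = c(0, 0))   # "a"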
Regression = Learning
 The weights are adjusted iteratively (batch or on-line).
 Initially, they are random and small.
 Weight decay (λ) keeps the weights from becoming too large.
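In the nnet() function used later, these choices correspond to the arguments rang (range of the small initial random weights), maxit (number of iterations) and decay (the weight-decay parameter λ); a short sketch on assumed toy data:

# Sketch (assumed toy data): initial weights are random on [-rang, rang],
# fitting is iterative up to maxit steps, and decay is the weight-decay
# parameter lambda that keeps the weights from growing too large.
library(nnet)
set.seed(1)
x <- matrix(rnorm(40), 20, 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "C1", "C2"))
fit <- nnet(x, class.ind(y), size = 2, rang = 0.1, softmax = TRUE,
            decay = 5e-4, maxit = 200, trace = FALSE)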
Backpropagation
 Adjusts weights “back to front”
 Uses partial derivatives and chain rule

Δwij = −η ∂E/∂wij, where η is the learning rate.
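A minimal numerical R sketch of this update for a single weight (the squared-error form of E, the learning rate η and the toy values are assumed):

# Sketch: one gradient-descent step, delta_w = -eta * dE/dw, for a single
# weight of a logistic unit with squared error E = 0.5 * (y - target)^2
# (learning rate eta and the toy values below are assumed).
x <- 1.5; target <- 1; w <- 0.2; eta <- 0.1
phi <- function(a) 1 / (1 + exp(-a))
y <- phi(w * x)                             # forward pass
dE_dw <- (y - target) * y * (1 - y) * x     # chain rule: dE/dy * dy/da * da/dw
w <- w - eta * dE_dw                        # weight adjusted "back to front"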
Avoiding Local Maxima
 Make weights initially random
 Use multiple runs and take the average
An Example: Cushing’s Syndrome

 Cushing’s syndrome is a hypertensive disorder associated with over-secretion of cortisol by the adrenal gland.
 There are three recognized types of the syndrome:
a: adenoma
b: bilateral hyperplasia
c: carcinoma
u: unknown type
 The observations are urinary excretion rates (mg/24hr) of the steroid metabolites tetrahydrocortisone (T) and pregnanetriol (P), and are considered on a log scale.
Cushing’s Syndrome Data
     Tetrahydrocortisone  Pregnanetriol  Type
a1                   3.1          11.70     a
a2                   3.0           1.30     a
a3                   1.9           0.10     a
a4                   3.8           0.04     a
a5                   4.1           1.10     a
a6                   1.9           0.40     a
b1                   8.3           1.00     b
b2                   3.8           0.20     b
b3                   3.9           0.60     b
b4                   7.8           1.20     b
b5                   9.1           0.60     b
b6                  15.4           3.60     b
b7                   7.7           1.60     b
b8                   6.5           0.40     b
b9                   5.7           0.40     b
b10                 13.6           1.60     b
c1                  10.2           6.40     c
c2                   9.2           7.90     c
c3                   9.6           3.10     c
c4                  53.8           2.50     c
c5                  15.8           7.60     c
u1                   5.1           0.40     u
u2                  12.9           5.00     u
u3                  13.0           0.80     u
u4                   2.6           0.10     u
u5                  30.0           0.10     u
u6                  20.5           0.80     u
R Code
library(MASS); library(class); library(nnet)
cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
tpi <- class.ind(Cushings$Type[1:21, drop = T])
xp <- seq(0.6, 4.0, length = 100); np <- length(xp)
yp <- seq(-3.25, 2.45, length = 100)
cushT <- expand.grid(Tetrahydrocortisone = xp, Pregnanetriol = yp)

pltnn <- function(main, ...) {
  plot(Cushings[, 1], Cushings[, 2], log = "xy", type = "n",
       xlab = "Tetrahydrocortisone", ylab = "Pregnanetriol", main = main, ...)
  for(il in 1:4) {
    set <- Cushings$Type == levels(Cushings$Type)[il]
    text(Cushings[set, 1], Cushings[set, 2],
         as.character(Cushings$Type[set]), col = 2 + il)
  }
}
# pltnn plots T and P against each other by type (a, b, c, u)
> cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
> cush
    Tetrahydrocortisone Pregnanetriol
a1            1.1314021    2.45958884
a2            1.0986123    0.26236426
a3            0.6418539   -2.30258509
a4            1.3350011   -3.21887582
a5            1.4109870    0.09531018
a6            0.6418539   -0.91629073
b1            2.1162555    0.00000000
b2            1.3350011   -1.60943791
b3            1.3609766   -0.51082562
b4            2.0541237    0.18232156
b5            2.2082744   -0.51082562
b6            2.7343675    1.28093385
b7            2.0412203    0.47000363
b8            1.8718022   -0.91629073
b9            1.7404662   -0.91629073
b10           2.6100698    0.47000363
c1            2.3223877    1.85629799
c2            2.2192035    2.06686276
c3            2.2617631    1.13140211
c4            3.9852735    0.91629073
c5            2.7600099    2.02814825

> tpi <- class.ind(Cushings$Type[1:21, drop = T])
> tpi
      a b c
 [1,] 1 0 0
 [2,] 1 0 0
 [3,] 1 0 0
 [4,] 1 0 0
 [5,] 1 0 0
 [6,] 1 0 0
 [7,] 0 1 0
 [8,] 0 1 0
 [9,] 0 1 0
[10,] 0 1 0
[11,] 0 1 0
[12,] 0 1 0
[13,] 0 1 0
[14,] 0 1 0
[15,] 0 1 0
[16,] 0 1 0
[17,] 0 1 0
[18,] 0 1 0
[19,] 0 1 0
[20,] 0 1 0
[21,] 0 1 0
plt.bndry <- function(size = 0, decay = 0, ...) {
  cush.nn <- nnet(cush, tpi, skip = T, softmax = T, size = size,
                  decay = decay, maxit = 1000)
  invisible(b1(predict(cush.nn, cushT), ...))
}

cush – matrix of x values of the examples.
tpi – matrix of target values of the examples.
skip – switch to add skip-layer connections from input to output.
softmax – switch for softmax (log-linear model) and maximum conditional likelihood fitting.
size – number of units in the hidden layer.
decay – parameter for weight decay.
maxit – maximum number of iterations.
invisible – return a (temporarily) invisible copy of an object.
predict – generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the 'class' of the first argument. Here, cush.nn is used to predict over the grid cushT.
b1 <- function(Z, ...) {
  # zp > 0 where the given class has the highest predicted probability;
  # the zero contour of zp is therefore a decision boundary.
  zp <- Z[, 3] - pmax(Z[, 2], Z[, 1])      # class c against the best of a, b
  contour(exp(xp), exp(yp), matrix(zp, np),
          add = T, levels = 0, labcex = 0, ...)   # labcex = 0 suppresses labels
  zp <- Z[, 1] - pmax(Z[, 3], Z[, 2])      # class a against the best of b, c
  contour(exp(xp), exp(yp), matrix(zp, np),
          add = T, levels = 0, labcex = 0, ...)
}
par(mfrow = c(2, 2))

pltnn("Size = 2")
set.seed(1); plt.bndry(size = 2, col = 2)
set.seed(3); plt.bndry(size = 2, col = 3)
plt.bndry(size = 2, col = 4)

pltnn("Size = 2, lambda = 0.001")
set.seed(1); plt.bndry(size = 2, decay = 0.001, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.001, col = 4)

pltnn("Size = 2, lambda = 0.01")
set.seed(1); plt.bndry(size = 2, decay = 0.01, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.01, col = 4)

pltnn("Size = 5, 20, lambda = 0.01")
set.seed(2); plt.bndry(size = 5, decay = 0.01, col = 1)
set.seed(2); plt.bndry(size = 20, decay = 0.01, col = 2)
# functions pltnn and b1 are in the scripts
pltnn("Many local maxima")
Z <- matrix(0, nrow(cushT), ncol(tpi))
for(iter in 1:20) {
  set.seed(iter)
  cush.nn <- nnet(cush, tpi, skip = T, softmax = T, size = 3,
                  decay = 0.01, maxit = 1000, trace = F)
  Z <- Z + predict(cush.nn, cushT)
  cat("final value", format(round(cush.nn$value, 3)), "\n")
  b1(predict(cush.nn, cushT), col = 2, lwd = 0.5)
}
pltnn("Averaged")
b1(Z, lwd = 3)
