
Using the Taylor Expansion of Multilayer Feedforward Neural Networks
AP Engelbrecht, Department of Computer Science, University of Pretoria, Pretoria, South Africa, engel@driesie.cs.up.ac.za

Abstract
The Taylor series expansion of continuous functions has been shown, in many fields, to be an extremely powerful tool to
study the characteristics of such functions. This paper illustrates the power of the Taylor series expansion of multilayer
feedforward neural networks. The paper shows how these expansions can be used to investigate positions of decision
boundaries, to develop active learning strategies and to perform architecture selection.
Keywords: Feedforward Neural Networks, Taylor Series, Sensitivity Analysis, Decision Boundaries, Active Learning, Architecture Selection
Computing Review Categories: I.2.6

1 Introduction
The Taylor series expansion of a continuous function
reveals interesting characteristics of that function under
small changes in the parameters of the function. Let $P$ be a continuous function, for which the first derivative exists and which has a bounded second derivative, representing a performance measure of some system. Let $\theta$ be a parameter of this system (assume, without loss of generality, that $\theta$ is a scalar). Then, from a Taylor expansion of $P$ around $\theta$, the change in performance due to an infinitesimal perturbation $\Delta\theta$ of $\theta$ is expressed as

$$P(\theta + \Delta\theta) = P(\theta) + \frac{\Delta\theta}{1!} P'(\theta) + \frac{\Delta\theta^2}{2!} P''(\theta) + \cdots \qquad (1)$$

Then, the change in performance of the system $P$ with respect to perturbation $\Delta\theta$ is

$$P(\theta + \Delta\theta) - P(\theta) = \frac{\Delta\theta}{1!} P'(\theta) + \frac{\Delta\theta^2}{2!} P''(\theta) + \cdots \qquad (2)$$

The sum on the right-hand side of equation (2) quantifies the change in performance. In terms of robustness and stability, the ideal is that $\frac{\Delta\theta}{1!} P'(\theta) + \frac{\Delta\theta^2}{2!} P''(\theta) + \cdots \to 0$ when $\Delta\theta \to 0$.
In terms of artificial neural networks (NN), the Taylor series expansion of NN performance functions has revealed many new insights into the working of NNs. Here
the performance measure can be expressed as the objective
(error) function to be optimized (usually the sum squared
error), or the NN output function. In both cases the parameters include network weights, input unit activations and
output unit activations. A Taylor series expansion of these
performance measures thus enables a study of the behavior of the objective function or the NN output function with
respect to small perturbations in the weights, input and/or
hidden unit activations.
The main objective of this paper is to summarize current insights gained from such perturbation analysis studies, and to present new insights recently obtained through
studies by the author. Section 2 considers the Taylor expansion of the objective function, while section 3 presents
insights obtained from an expansion of the NN output
function. Section 4 compares the two approaches with specific reference to architecture selection. Section 5 presents
in detail how Taylor series expansions of the NN output
function can be used to develop tools to analyze decision
boundaries, to optimize network architectures and training
data.

2 Objective Function Taylor Expansion


One of the most widely used Taylor expansions in NN theory and applications is that of the objective function with
respect to NN parameters (hereafter referred to as objective
function sensitivity analysis). Usually, the sum squared error (SSE) is used as objective function. If $E$ denotes the objective function, $\vec{\theta} = (\theta_1, \ldots, \theta_I)$ the parameter vector of the NN, $\theta_i$ a single parameter, $\Delta\theta_i$ a small perturbation of that parameter, and $E'$ the partial derivative with respect to $\theta_i$, then from (1),

$$E(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) \approx E(\vec{\theta}) + E'(\vec{\theta})\,\Delta\theta_i + \frac{1}{2} E''(\vec{\theta})\,\Delta\theta_i^2 \qquad (3)$$

is the Taylor expansion of $E$ around $\theta_i$. From equation (3), the change in error due to perturbation $\Delta\theta_i$ is

$$E(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) - E(\vec{\theta}) = E'(\vec{\theta})\,\Delta\theta_i + \frac{1}{2} E''(\vec{\theta})\,\Delta\theta_i^2 + \cdots \qquad (4)$$
The first order term $E'(\vec{\theta})$ is used in gradient descent optimization to drive the NN to a local minimum [26]. In this case $\theta_i$ represents a weight of the NN. The second order term has also been used in optimization to improve convergence [3, 4]. Objective function sensitivity analysis has been used widely in pruning of NN parameters. Optimal Brain Damage (OBD) [19, 24] and Optimal Brain Surgeon (OBS) [21, 22] prune weights with low saliency, while Optimal Cell Damage (OCD) [6] prunes irrelevant input and hidden units. OBD, OBS and OCD use second order derivatives to approximate saliencies.

Pruning using OBD, OBS and OCD is based upon assumptions that reduce the complexity of calculating equation (4) [6, 19, 21, 22, 24]. These assumptions are discussed below:

Extremal approximation assumption: It is assumed that pruning is applied only after convergence is reached. At the local minimum, the derivative of the objective function is approximately zero, which means that the first term, $E'(\vec{\theta})\,\Delta\theta_i$, can be removed from equation (4). The extremal approximation assumption also relies on the assumption that noise follows a Gaussian, zero-mean distribution. The extremal approximation assumption is not valid if many outliers occur in the training set.

Quadratic approximation assumption: The objective function is assumed to be well approximated by a second-order expansion around its minimum point. This is not always the case, especially for flat surfaces with a bathtub-like shape. Gorodkin et al. show for some experiments that the second order approximation does not give an accurate description of the cost function [19]. The approximation is exact only for objective functions that are linear or quadratic.
Diagonal approximation assumption: OBD and OCD assume that the off-diagonal terms of the Hessian matrix are zero. This assumption is only valid if it can be assumed that the principal curvature of the error surface is captured in the diagonal terms. However, there may be regions in the error surface of a problem where small changes in some weights result in very large changes in the error gradient, that is, regions where the principal curvature is not parallel to the weight axes. Becker and Le Cun illustrate this to be true for experiments investigated by them [4]. They show through an eigenvalue decomposition of the Hessian matrix that off-diagonal terms also have high eigenvalues, indicating regions in the error surface that are sensitive to weight perturbations. Hassibi and Stork also found the diagonal assumption to be incorrect, leading to the pruning of the wrong weights [21]. The diagonal assumption loses some information about the characteristics of the objective function and error surface. For this reason, OBS does not assume off-diagonal terms to be zero, but uses the full Hessian matrix, which is extremely expensive to compute, especially for large networks.
Levenberg-Marquardt assumption: It is assumed that the errors between the target and output values, $t_k^{(p)} - o_k^{(p)}$, are approximately zero. All $t_k^{(p)} - o_k^{(p)}$ terms are therefore removed from the Taylor expansions. Assuming Gaussian noise with zero mean for the input space, the correlation between errors and the second-order derivatives of the objective function vanishes for large training sets. However, if many outliers occur in the training set, the Levenberg-Marquardt assumption may lead to inaccuracies.

Using these assumptions, OBD and OCD define the saliency measure $S_i$ of parameter $\theta_i$ from equation (4) as

$$S_i = E(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) - E(\vec{\theta}) \approx \frac{1}{2} H_{ii}\,\Delta\theta_i^2 \qquad (5)$$

and OBS as

$$S_i = E(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) - E(\vec{\theta}) \approx \frac{1}{2} \frac{\Delta\theta_i^2}{[H^{-1}]_{ii}} \qquad (6)$$

where $H = \frac{\partial^2 E}{\partial \vec{\theta}^2}$ is the Hessian matrix containing all the second order derivatives, and $[H^{-1}]_{ii}$ denotes the $i$th diagonal element of the inverted Hessian.
The equations for calculating $\frac{\partial^2 E}{\partial \theta_i^2}$ depend on the objective function (usually the SSE function) and the activation functions (usually sigmoid functions). For NNs that use a different objective function, the OBD, OBS and OCD pruning models change substantially. Changes in activation functions cause only minor changes in the sensitivity equations.
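For illustration, the sketch below approximates OBD saliencies with a finite-difference diagonal Hessian. This is only a stand-in for the analytic second-derivative recipes of [19, 24]: the loss function and parameter vector are assumed, hypothetical inputs, not the paper's code.

import numpy as np

def diagonal_hessian(loss, theta, eps=1e-4):
    # Central finite-difference estimate of H_ii = d^2 E / d theta_i^2.
    # `loss` is assumed to evaluate the SSE over the training set at theta.
    h = np.empty_like(theta)
    e0 = loss(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        h[i] = (loss(theta + d) - 2.0 * e0 + loss(theta - d)) / eps ** 2
    return h

def obd_saliencies(loss, theta):
    # Equation (5) with the OBD pruning perturbation delta_theta_i = theta_i,
    # since removing a weight perturbs it by its own magnitude.
    return 0.5 * diagonal_hessian(loss, theta) * theta ** 2

# Weights with the smallest saliencies are candidates for pruning:
# prune_order = np.argsort(obd_saliencies(sse_loss, weights))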

3 Output Function Taylor Expansion


The Taylor series expansion of the NN output, from now on referred to as NN output sensitivity analysis, has been used for several NN applications [5, 7, 8, 9, 10, 11, 12, 13, 14, 17, 18, 20, 25, 27, 29]. Without loss of generality, assume one output unit. If $F_{NN}$ denotes the output function of the NN, $\vec{\theta} = (\theta_1, \ldots, \theta_I)$ the parameter vector, $\theta_i$ a single parameter and $\Delta\theta_i$ a small perturbation of $\theta_i$, then

$$F_{NN}(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) - F_{NN}(\vec{\theta}) \approx F'_{NN}(\vec{\theta})_i\,\Delta\theta_i \qquad (7)$$

The change in output due to the perturbation is then entirely described by the derivative

$$F'_{NN}(\vec{\theta})_i \approx \lim_{\Delta\theta_i \to 0} \frac{F_{NN}(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) - F_{NN}(\vec{\theta})}{\Delta\theta_i} \qquad (8)$$

Output sensitivity analysis therefore consists of simply calculating $\frac{\partial F_{NN}}{\partial \theta_i}$ for all parameters $\theta_i$.
Output sensitivity analysis has been used to study the
generalization characteristics of NNs [5, 17], for causal inferencing to determine the significance of input parameters [8, 18, 20, 25], to quantify the degree of non-linearity
in the data [23], to detect and visualize decision boundaries [10, 11, 12, 18], to prune oversized NN architectures
[7, 8, 9, 14, 16, 27, 29], for active learning [10, 13], and for
automatically learning first-order derivatives [2]. For pruning purposes, the expansion is used to compute the significance of weights, input and hidden units [8, 9, 14, 29].
Contrary to objective function expansions, expansions of the NN output function for the purposes of pruning are not based on any assumptions to reduce model complexity. The only assumptions are that (1) the activation functions are at least once differentiable, with bounded second derivative (which is the case, since bounded activation functions are used), and (2) the NN is well trained, so that it accurately approximates the true derivatives. This is necessary for pruning to correctly remove irrelevant parameters. This assumption is not as strict as the extremal approximation assumption of OBD, OBS and OCD, since the validity of the output sensitivity analysis model does not depend on the network being in a local minimum.
Pruning using the NN output expansion is independent of the objective function, since the NN output is taken as performance function, and not the objective function as in OBD, OBS and OCD. Whatever error function is used, the equations to compute the derivatives of the NN output function with respect to network parameters remain the same. They do, however, depend on the type of activation function used, due to the need to calculate $\frac{\partial o_k}{\partial \theta}$, where $\theta$ can be an input unit, a hidden unit, or a weight.
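To make the computation concrete, the following sketch evaluates the output-input derivatives of equation (8) analytically for a single-hidden-layer sigmoid network and verifies them against central finite differences. It is an illustration only: the network, the weight arrays V and W, and the function names are assumptions, not the paper's code.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative network: z (I inputs) -> y (J hidden) -> o (K outputs), no biases.
rng = np.random.default_rng(0)
I, J, K = 3, 5, 2
V = rng.normal(size=(J, I))   # input-to-hidden weights v_ji
W = rng.normal(size=(K, J))   # hidden-to-output weights w_kj

def forward(z):
    y = sigmoid(V @ z)        # hidden activations y_j
    o = sigmoid(W @ y)        # output activations o_k
    return y, o

def output_input_sensitivity(z):
    # S[k, i] = do_k/dz_i = f'_{o_k} * sum_j w_kj * f'_{y_j} * v_ji (chain rule).
    y, o = forward(z)
    dy = y * (1.0 - y)        # sigmoid derivative f'_{y_j} = y_j (1 - y_j)
    do = o * (1.0 - o)        # sigmoid derivative f'_{o_k} = o_k (1 - o_k)
    return do[:, None] * (W @ (dy[:, None] * V))

z = rng.normal(size=I)
S = output_input_sensitivity(z)

# Finite-difference check of equation (8): perturb each input by a small dz.
eps = 1e-6
S_fd = np.empty_like(S)
for i in range(I):
    dz = np.zeros(I)
    dz[i] = eps
    S_fd[:, i] = (forward(z + dz)[1] - forward(z - dz)[1]) / (2 * eps)

assert np.allclose(S, S_fd, atol=1e-6)

The same chain-rule expressions reappear, per parameter type, in table 1 below.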

4 Comparison of Objective and Output Function Expansions


The objective of this section is to compare the two approaches to NN Taylor expansions with reference to pruning. After a general comparison of the complexity and
characteristics of the two approaches, the section concludes with a mathematical comparison between OBD and
output sensitivity analysis for NN pruning.
The main difference between objective and NN output
sensitivity analysis is the performance function used. Objective function sensitivity analysis uses the error function
to be optimized as measure of the change in error caused
by small parameter perturbations as expressed in equation
(4). Output sensitivity analysis, on the other hand, uses
the actual NN output function as performance measure to
quantify the change in NN output due to small perturbations, as expressed in equation (7). Conceptually, the two approaches mean the same thing: the error of a single pattern is computed as the difference $(t_k^{(p)} - o_k^{(p)})$ between the target value $t_k^{(p)}$ and the actual output value $o_k^{(p)}$ of the $k$th output unit for pattern $p$. A change in error, due to some perturbation, is determined by the change in the output value caused by that perturbation. This relationship is further illustrated in table 1 and equations (10) and (11), considering the same assumptions as for OBD.
While objective function and NN output sensitivity analysis mean conceptually the same thing, it is more complex to compute objective function sensitivity information. Since the goal of learning is to minimize the objective function, the first order derivative of the objective function, $E'$, is approximately zero at convergence, thus requiring second order derivatives to be computed. Since this needs the calculation of the Hessian matrix, objective function sensitivity analysis is computationally expensive. In contrast, with NN output sensitivity analysis, first order information is sufficient to quantify the influence of parameter perturbations, since we can assume that [29]

$$\lim_{\Delta\theta_i \to 0} \left( \frac{1}{2} F''_{NN}(\vec{\theta})\,\Delta\theta_i^2 + \cdots \right) = 0 \qquad (9)$$

where $F_{NN}$ is the NN output function. It is much less expensive to calculate the Jacobian matrix than the Hessian matrix.
Output sensitivity analysis is also more general than
objective function sensitivity analysis in that the latter depends on the error function (objective function) to be optimized. Usually, the sum squared error function is used,
but for any other error function, the sensitivity equations
as summarized in table 1 need to be redefined. The output
sensitivity analysis equations, on the other hand, remain
the same whatever objective function is used.
Since both sensitivity analysis approaches have been
applied to pruning of NN architectures, this application is
used to find a mathematical relationship between the two
methods. For this purpose OBD is used, considering all
its assumptions as listed in section 2. To derive this relationship, assume a NN with one output ok . Although the
comparison below is for one pattern only, it can quite easily be generalized to include the entire training set through
application of a suitable norm.
From table 1, irrespective of which NN parameter $\theta$ is considered, the following general relationship applies (assuming least squares as objective function):

$$\frac{\partial^2 E_k}{\partial \theta^2} \propto \left( \frac{\partial o_k}{\partial \theta} \right)^2 \qquad (10)$$
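The proportionality can be made explicit in one step. Differentiating the per-pattern SSE $E_k = \frac{1}{2}(t_k - o_k)^2$ twice with respect to $\theta$ (assuming the conventional factor $\frac{1}{2}$; without it only the constant of proportionality changes), and dropping the $(t_k - o_k)$ term under the Levenberg-Marquardt assumption, gives

$$\frac{\partial E_k}{\partial \theta} = -(t_k - o_k)\,\frac{\partial o_k}{\partial \theta}, \qquad
\frac{\partial^2 E_k}{\partial \theta^2} = \left(\frac{\partial o_k}{\partial \theta}\right)^2 - (t_k - o_k)\,\frac{\partial^2 o_k}{\partial \theta^2} \;\approx\; \left(\frac{\partial o_k}{\partial \theta}\right)^2 .$$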

This supports the statement that objective function sensitivity analysis and NN output sensitivity analysis are conceptually the same (under the assumptions listed in section 2). This means that the same parameter significance ordering will occur for the two methods. In the case of pruning, the same parameters will therefore be pruned. In general, for more than one output the following relationship holds:

$$\frac{\partial^2 E}{\partial \theta^2} = \sum_{k=1}^{K} \frac{\partial^2 E_k}{\partial \theta^2} \propto \sum_{k=1}^{K} \left( \frac{\partial o_k}{\partial \theta} \right)^2 \qquad (11)$$

Parameter | Error Sensitivity | Output Sensitivity
$z_i$ | $\frac{\partial^2 E_k}{\partial z_i^2} \propto (f'_{o_k})^2 \left[ \sum_{j=1}^{J} w_{kj} f'_{y_j} v_{ji} \right]^2$ | $\frac{\partial o_k}{\partial z_i} = f'_{o_k} \sum_{j=1}^{J} w_{kj} f'_{y_j} v_{ji}$
$y_j$ | $\frac{\partial^2 E_k}{\partial y_j^2} \propto (f'_{o_k})^2 w_{kj}^2$ | $\frac{\partial o_k}{\partial y_j} = f'_{o_k} w_{kj}$
$w_{kj}$ | $\frac{\partial^2 E_k}{\partial w_{kj}^2} \propto (f'_{o_k})^2 y_j^2$ | $\frac{\partial o_k}{\partial w_{kj}} = f'_{o_k} y_j$
$v_{ji}$ | $\frac{\partial^2 E_k}{\partial v_{ji}^2} \propto (f'_{y_j})^2 (f'_{o_k})^2 w_{kj}^2 z_i^2$ | $\frac{\partial o_k}{\partial v_{ji}} = f'_{y_j} f'_{o_k} w_{kj} z_i$

Table 1: Comparison of Objective Function and Output Sensitivity Analysis

illustrating that the change in model error due to parameter


perturbations is simply an additive function of the changes
in NN output due to these perturbations. Therefore, instead
of using a more complex objective function sensitivity approach, output sensitivity analysis can be used to the same
effect.

5 Insights from Output Function Expansions

In this section a summary is given of how the Taylor expansion of the NN output function can be used to analyze decision boundaries, to develop active learning algorithms that optimize the use of training data, and to prune irrelevant units. The approaches surveyed below have been developed by the author of this paper. Each aspect has been published separately, with elaborate experimental results presented in those publications. This section serves as a collective overview of these approaches. The interested reader is referred to the referenced literature.

5.1 Decision Boundary Analysis [12]


The objective of learning in classification problems is to construct optimal boundaries in the input space to separate the different classes. Visualization of these boundaries reveals interesting properties of the data set, such as the exact input parameter values that cause a change in output. Decision boundary analysis also shows which boundary is implemented by which hidden unit, and whether hidden units perform any function at all. Using sensitivity analysis, two types of decision boundaries are defined (viewed from one dimension), under the following assumptions:

Assumption 1: A NN implements a nonlinear, differentiable mapping $F(D_T, W): \mathbb{R}^I \to \mathbb{R}^K$, where $D_T$ is the training set, $W$ represents the weights, $I$ is the dimension of the input space, and $K$ is the dimension of the output space.

Assumption 2: The activation functions in the output layer are monotonically increasing and bounded, and produce a discrete output $f_{o_k}$ for each output unit $o_k$; for example, 0 or 1 for the sigmoid activation function.
Definition 1: One-Point Decision Boundary: Under assumptions 1 and 2, if there exists an input parameter value $z_i^{(p)}$ and a small perturbation $\Delta z_i$ of $z_i^{(p)}$ such that, for any output $o_k$, $f_{o_k}(z_i^{(p)}) \neq f_{o_k}(z_i^{(p)} + \Delta z_i)$, then a decision boundary is located in the range $[z_i^{(p)}, z_i^{(p)} + \Delta z_i]$ of input parameter $z_i$, where $p$ denotes a single pattern.

Definition 2: Range Decision Boundary: Under assumptions 1 and 2, if there exist two input parameter values $z_{i1}^{(p)}$ and $z_{i2}^{(p)}$ with $z_{i1}^{(p)} < z_{i2}^{(p)}$ such that, for any $z_i^{(p)} \in [z_{i1}^{(p)}, z_{i2}^{(p)}]$, a small perturbation $\Delta z_i$ of $z_i^{(p)}$ and any output $o_k$, $f_{o_k}(z_i^{(p)}) \neq f_{o_k}(z_i^{(p)} + \Delta z_i)$, then a decision boundary spans over the range $[z_{i1}^{(p)}, z_{i2}^{(p)}]$ of input parameter $z_i$, where $p$ denotes a single pattern.

From definition 1, a decision boundary is located at


the point in input space where a small perturbation to a
value of an input unit causes the value of an output unit
to change from one class to another. Similarly, definition
2 defines a range of input parameter values over which a
decision boundary is formed.
These definitions are theoretically justified from a first order Taylor expansion of $f_{o_k}$ around $z_i^{(p)}$ (under the assumption of a small $\Delta z_i$):

$$f_{o_k}(z_i^{(p)} + \Delta z_i) \approx f_{o_k}(z_i^{(p)}) + \Delta z_i \frac{\partial f_{o_k}}{\partial z_i^{(p)}} \qquad (12)$$

Returning to definitions 1 and 2, under assumption 2 the second term in equation (12) determines whether the value of an output unit changes. That is, the higher the value of $\frac{\partial f_{o_k}}{\partial z_i^{(p)}}$, the greater the chance that $f_{o_k}(z_i^{(p)}) \neq f_{o_k}(z_i^{(p)} + \Delta z_i)$. Therefore, patterns with high $\frac{\partial f_{o_k}}{\partial z_i^{(p)}}$ values lie closest to decision boundaries.

This point is further illustrated in figure 1, which plots, for example, the sigmoid activation function $f(z)$ and its derivative $\frac{\partial f}{\partial z}$. The peak of the derivative at $z = 0$ coincides with the inflection point of $f(z)$. For classification problems, the inflection point is used as the threshold to decide between the two discrete output values.

Letting $\Delta o_k = \Delta f_{o_k}$, the factor $\frac{\partial f_{o_k}}{\partial z_i^{(p)}}$ can be written as $\frac{\partial o_k}{\partial z_i^{(p)}}$, which is simply the sensitivity of output $o_k$ to perturbations in input value $z_i^{(p)}$.
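Since the original plot survives here only as the placeholder below, it is easily regenerated; a minimal sketch:

import numpy as np

# Regenerates the content of figure 1: the sigmoid and its derivative.
z = np.linspace(-4.0, 4.0, 161)
f = 1.0 / (1.0 + np.exp(-z))
df = f * (1.0 - f)                 # df/dz = f(z)(1 - f(z)) for the sigmoid

print(z[np.argmax(df)], df.max())  # -> 0.0 0.25: peak at the inflection point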
[Figure 1: Sigmoid activation function $f(z) = 1/(1 + e^{-z})$ and its derivative $df/dz$, plotted over $z \in [-4, 4]$.]

The first order sensitivity analysis above assigns to each pattern a measure of closeness to the boundaries [12], also referred to as pattern informativeness. Highly informative patterns convey the most information about decision boundaries, and have high $\frac{\partial o_k}{\partial z_i^{(p)}}$ values.

The position of boundaries can be visualized by drawing graphs of $\frac{\partial o_k}{\partial z_i^{(p)}}$ versus $z_i^{(p)}$ for each output $o_k$ and input $z_i$. Peaks in these graphs correspond to one-point boundaries, while a set of approximately equal, high sensitivity values corresponds to a range decision boundary. These graphs can be used to investigate regions of input space for which classification is uncertain, and to determine under what conditions the result of the classification changes.
Graphs of $\frac{\partial y_j}{\partial z_i^{(p)}}$ indicate the range of input values for which hidden unit $y_j$ is active, thus showing which boundary is implemented by a hidden unit, and whether that hidden unit implements a boundary at all.
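As an illustration of such a boundary scan, the sketch below (assuming the network and the output_input_sensitivity routine from the illustrative sketch in section 3 are in scope) sweeps one input across its range for a fixed pattern and prints the output sensitivities; peaks in the printed columns locate one-point boundaries, while plateaus of high values locate range boundaries.

import numpy as np

# Sweep input z_0 across [-2, 2] while holding the other inputs fixed.
z = np.array([0.0, 0.5, -0.5])
for v in np.linspace(-2.0, 2.0, 17):
    z[0] = v
    s = np.abs(output_input_sensitivity(z)[:, 0])   # |do_k/dz_0| for all k
    print(f"z_0 = {v:+.2f}  " + "  ".join(f"{x:.3f}" for x in s))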

5.2 Active Learning [10, 13]

The objective of active learning algorithms is to make optimal use of the data provided in the training set. Usually, the
NN learner passively receives training data and has to learn
on all the data. With active learning, the learner selects
training patterns from a candidate set of training patterns
based on the learner's current knowledge of the problem.
In allowing the learner to decide what to learn, the most
informative patterns are selected that convey the most information about the function to be learned.
There are two approaches to active learning: selective
learning and incremental learning. With selective learning the learner selects at each selection interval (eg each
epoch) a new training subset from the candidate set. The
candidate set retains all patterns. With incremental learning the actual training subset grows while the candidate set
shrinks. Patterns selected are removed from the candidate set and added to the training subset.

Using a Taylor series expansion of the NN output, the informativeness of a pattern is defined as follows:

Definition 3: Pattern Informativeness: Define the informativeness of a pattern as the sensitivity of the NN output vector to small perturbations in the input vector. Let $\Phi^{(p)}$ denote the informativeness of pattern $p$. Then,

$$\Phi^{(p)} := \|\vec{S}_o^{(p)}\| \qquad (13)$$

where $\vec{S}_o^{(p)}$ is the output sensitivity vector for pattern $p$ (defined in equation (15)), and $\|\cdot\|$ is any suitable norm.

This study suggests the maximum-norm,

$$\Phi^{(p)} = \|\vec{S}_o^{(p)}\|_{\infty} = \max_{k=1,\ldots,K} \{ |S_{o,k}^{(p)}| \} \qquad (14)$$

where $S_{o,k}^{(p)}$ refers to the sensitivity of a single output unit $o_k$ to small perturbations in the input vector $\vec{z}$, and $K$ is the total number of output units. In equations (13) and (14), the elements of the output sensitivity vector are defined as

$$S_{o,k}^{(p)} = \|S_{oz,k}^{(p)}\| \qquad (15)$$

where $S_{oz}^{(p)}$ is the output-input layer sensitivity matrix for pattern $p$. Each element $S_{oz,ki}^{(p)}$ of the sensitivity matrix is computed as

$$S_{oz,ki}^{(p)} = \frac{\partial o_k}{\partial z_i^{(p)}} \qquad (16)$$

which assumes differentiable activation functions. Suitable norms for calculating each element of the output sensitivity vector are the sum-norm,

$$S_{o,k}^{(p)} = \|S_{oz,k}^{(p)}\|_1 = \sum_{i=1}^{I} |S_{oz,ki}^{(p)}| \qquad (17)$$

or the Euclidean-norm,

$$S_{o,k}^{(p)} = \|S_{oz,k}^{(p)}\|_2 = \sqrt{ \sum_{i=1}^{I} (S_{oz,ki}^{(p)})^2 } \qquad (18)$$

where $I$ is the total number of input units.
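A compact sketch of equations (13)-(18), given precomputed per-pattern sensitivity matrices (the array name S_oz and the function are illustrative assumptions, not the paper's code):

import numpy as np

def informativeness(S_oz, inner="euclidean"):
    # Phi per pattern from S_oz[p, k, i] = do_k/dz_i (a P x K x I array).
    # inner: "sum" for equation (17), "euclidean" for equation (18); the
    # outer norm over output units is the maximum-norm of equation (14).
    if inner == "sum":
        S_o = np.abs(S_oz).sum(axis=2)
    else:
        S_o = np.sqrt((S_oz ** 2).sum(axis=2))
    return S_o.max(axis=1)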


Using definition 3 and equation (14), a pattern is considered informative if one or more of the output units are sensitive to small perturbations in the input vector. The larger the value of $\Phi^{(p)}$, the more informative pattern $p$ is.
The above measure of pattern informativeness has
been used successfully to dynamically select patterns during training for classification problems [10]. Engelbrecht
and Cloete developed a selective learning algorithm which
selects from the candidate set those patterns that have the
highest informative measures. These patterns lie close
to decision boundaries and convey the most information
about the optimal position of boundaries. At each subset
selection interval, all patterns in the candidate training set that satisfy the condition

$$\Phi^{(p)} > (1 - \beta)\,\bar{\Phi} \qquad (19)$$

are selected for training, where $\bar{\Phi}$ is the average pattern informativeness over the candidate set.

The subset selection constant $\beta$ is crucial to the efficacy of the algorithm. This selection constant, which lies in the range $[0, 1]$, is used to control the region around decision boundaries within which patterns are considered informative. The larger the value of $\beta$, the more patterns will be selected. If $\beta$ is too small, only a few patterns will be selected, which may not include enough information to form boundaries, with a consequent reduction in generalization performance. Low values of $\beta$ do, however, mean lower computational cost. A conservative choice of $\beta$ close to 1 improves the chances of selecting patterns that represent enough information about the target concept, ensuring that most of the candidate patterns are included in the initial training subset. A conservative value of $\beta$ does not, however, mean a small reduction in training set size: as training progresses, more and more patterns become uninformative, resulting in larger reductions in training set size [10]. If $\beta = 1$, the selective learning algorithm simply reduces to normal fixed set learning.
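A one-line sketch of selection rule (19); the symbol for the selection constant is written here as beta, and names are illustrative:

import numpy as np

def select_subset(phi, beta=0.9):
    # Indices of candidate patterns with phi > (1 - beta) * mean(phi).
    return np.flatnonzero(phi > (1.0 - beta) * phi.mean())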
Pattern informativeness measures have also been used to develop an incremental learning algorithm [13]. In this case the learning algorithm is summarized as:

1. Initialize weights and learning parameters. Initialize the subset size $P_{S_s}$, i.e. the number of patterns selected from the candidate set $D_C$. Construct the initial training subset $D_{S_0} \subseteq D_C$. Let $D_T \leftarrow D_{S_0}$ be the current training set.
2. Repeat:

(a) Repeat: train the NN on training set $D_T$, until a termination criterion on $D_T$ is triggered (discussed below).

(b) Compute the new training subset $D_{S_s}$:

i. For each $p \in D_C$, compute the sensitivity matrix $S_{oz,ki}^{(p)}$.
ii. Compute the output sensitivity vector $\vec{S}_o^{(p)}$ for each $p \in D_C$ from equation (15).
iii. Compute the informativeness $\Phi^{(p)}$ of each pattern $p \in D_C$ using equation (14).
iv. Find the subset $D_{S_s}$ of the $P_{S_s}$ most informative patterns as
$$D_{S_s} = \{ p \in D_C \mid \Phi^{(p)} = \max_{q = 1, \ldots, P_C} \{ \Phi^{(q)} \}, \; \forall q \in D_C \text{ not yet selected} \}$$
where $P_C$ is the number of patterns remaining in $D_C$. Then, let $D_T \leftarrow D_T \cup D_{S_s}$ and $D_C \leftarrow D_C - D_{S_s}$.

until convergence is reached.


The incremental learning algorithm summarized above starts training on one pattern, and selects only one new pattern from $D_C$ at each subset selection interval. This allows the study of the incremental learning algorithm under conservative conditions.

The objective is to continue training on the current training set $D_T$ to achieve maximum gain from the new patterns before requesting new information. However, the learner should not be allowed to memorize the training subset, and should not spend too much time on $D_T$ without achieving sufficient gain. For this purpose several termination criteria are incorporated, and tested after each sweep through $D_T$ (after each training epoch), to determine whether a new pattern needs to be added (a code sketch of these checks follows the list):
1. The total number of training sweeps through the current training subset is limited, to make sure that the NN does not train indefinitely on the set without achieving much. The current implementation allows only 100 training epochs on the current training set.

2. If the error on $D_T$, or on the validation set $D_V$, decreases sufficiently (i.e. sufficient gain is achieved from learning on $D_T$), a new subset is selected. In the current implementation a new subset is selected when the error has decreased by 80% since training started on the current training subset. This may sound strict, but the other criteria will stop training on the subset if the learner cannot achieve this goal.

3. If the average decrease in error per epoch for the current training interval is too small for the training set $D_T$ or the validation set $D_V$, a new subset is selected. This measure prevents the learner from training on $D_T$ while achieving too little gain. Also, by considering the error on the validation set, overfitting is avoided. The current implementation of the algorithm scales the threshold for the average error decrease per epoch linearly with the order of magnitude of the training error: the algorithm starts with a threshold of 0.001, dividing it by 10 as the order of magnitude of the error decreases. If the average decrease in error for $D_T$ or $D_V$ is less than this threshold, a new subset is selected.

4. If the error $E_V$ on the validation set increases too much, a new subset is selected. Currently, $D_T$ is increased when $E_V > \bar{E}_V + \sigma_{E_V}$, where $\bar{E}_V$ is the average validation error over the previous epochs, and $\sigma_{E_V}$ is the standard deviation of the validation error. The goal of this termination test is to prevent the learner from memorizing the current training subset, by triggering a new subset selection as soon as overfitting of $D_T$ is observed.
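A minimal sketch of these checks (all names are illustrative; criterion 3 is applied to both $D_T$ and $D_V$ in the text but is collapsed to a single argument here, and the error-magnitude scaling of the threshold is left to the caller):

import statistics

def need_new_subset(epochs_on_subset, err_start, err_now,
                    avg_drop_per_epoch, drop_threshold,
                    val_err, val_err_history):
    # True when any of the four termination criteria fires.
    if epochs_on_subset >= 100:              # criterion 1: epoch budget exhausted
        return True
    if err_now <= 0.2 * err_start:           # criterion 2: error reduced by 80%
        return True
    if avg_drop_per_epoch < drop_threshold:  # criterion 3: too little gain per epoch
        return True
    if len(val_err_history) >= 2 and val_err > (
            statistics.mean(val_err_history)
            + statistics.stdev(val_err_history)):
        return True                          # criterion 4: validation error rises
    return False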


5.3 Neural Network Pruning [9, 14]


The optimality of the architecture of a NN is very important to its performance. The computational complexity of learning is directly dependent on the number of weights, which is determined by the number of units in each layer.
Too many weights may cause the network to overfit, or
memorize the training patterns, leading to bad generalization [28, 1]. Too few weights will lead to an underfit of the
data, also causing bad generalization.
Taylor expansions of the NN output can be applied to
identify irrelevant, or redundant, input units, hidden units
and weights [8, 9, 29]. These irrelevant parameters can
then be pruned to obtain a smaller, optimal architecture.
The relevance of parameters is computed from a variance analysis of first order Taylor expansions of the NN
output function over all training patterns. A variance nullity measure is computed for each parameter, based on
ideas borrowed from the non-convergent tests of Finnoff,
Hergert and Zimmermann [15]. The basic idea of the variance nullity measure is to test whether the variance in parameter sensitivity for the different patterns is significantly
different from zero. If the variance in parameter sensitivities is not significantly different from zero, it indicates that
the corresponding parameter has little or no effect on the
output of the NN over all patterns considered. A hypothesis testing step, described below, uses these variance nullity measures to statistically test if a parameter should be
pruned, using the $\chi^2$ distribution.
Definition 4: Parameter Variance Nullity: Define the statistical nullity in the parameter sensitivity variance of a NN parameter $\theta_i$ over patterns $p = 1, \ldots, P$ as

$$\Upsilon_i = \frac{(P - 1)\,\sigma_i^2}{\sigma_0^2} \qquad (20)$$

where $\sigma_i^2$ is the variance of the sensitivity of the network output to perturbations in parameter $\theta_i$, and $\sigma_0^2$ is a value close to zero (the characteristics of this value are explained below).

The variance in parameter sensitivity, $\sigma_i^2$, is computed as

$$\sigma_i^2 = \frac{\sum_{p=1}^{P} (S_i^{(p)} - \bar{S}_i)^2}{P - 1} \qquad (21)$$

where

$$S_i^{(p)} = \frac{\sum_{k=1}^{K} S_{o,ki}^{(p)}}{K} \qquad (22)$$

and $\bar{S}_i$ is the average parameter sensitivity,

$$\bar{S}_i = \frac{\sum_{p=1}^{P} S_i^{(p)}}{P} \qquad (23)$$

In equation (22), $S_i^{(p)}$ is the average sensitivity of the NN output to perturbations in parameter $\theta_i$ for pattern $p$, and $S_{o,ki}^{(p)}$ is the sensitivity of output $o_k$ to perturbations in parameter $\theta_i$ for pattern $p$.
The analysis of variance approach is followed here instead of an analysis of means as is done by Finnoff et al. [15]. In this study, an analysis of means is not appropriate, since large negative and positive sensitivities may cancel each other, or produce a sum close to zero, falsely indicating that the parameter is insignificant. Using the variance nullity measure defined in definition 4, statistical theory prescribes the use of the $\chi^2(P - 1)$ distribution to determine if a parameter can be pruned.
The statistical pruning heuristic is based on proving or disproving the hypothesis that the variance in parameter sensitivity is approximately zero. For this purpose, the null hypothesis

$$H_0: \sigma_i^2 = \sigma_0^2 \qquad (24)$$

is defined for each parameter $\theta_i$. Unfortunately, from equation (20), $\sigma_0^2 \neq 0$, so we cannot hypothesize that the variance in parameter sensitivity over all patterns is exactly zero, i.e. $\sigma_i^2 = 0$. Instead, a small value close to zero is chosen for $\sigma_0^2$, and the alternative hypothesis,

$$H_1: \sigma_i^2 < \sigma_0^2 \qquad (25)$$

is tested. Using the fact that under the null hypothesis the variance nullity measure has a $\chi^2(P - 1)$ distribution in the case of $P$ patterns, the critical value $c$ is obtained from $\chi^2$ distribution tables,

$$c = \chi^2_{\nu, 1 - \alpha} \qquad (26)$$

where $\nu = P - 1$ is the number of degrees of freedom, and $\alpha$ is the level of significance. A significance level $\alpha = 0.01$, for example, means that we are satisfied with incorrectly rejecting the hypothesis once out of 100 times.

Using the critical value defined in equation (26), if $\Upsilon_i \leq c$, the alternative hypothesis $H_1$ is accepted and parameter $\theta_i$ is pruned.
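A compact sketch of the test in equations (20)-(26), assuming SciPy for the $\chi^2$ quantile (the array layout and function name are illustrative assumptions):

import numpy as np
from scipy.stats import chi2

def variance_nullity_prune_mask(S, sigma0_sq=1e-4, alpha=0.01):
    # S[p, k, i]: sensitivity of output o_k to parameter theta_i for pattern p.
    # Returns a boolean mask over parameters: True where H1 is accepted, i.e.
    # the parameter is a pruning candidate.
    P = S.shape[0]
    s = S.mean(axis=1)                     # equation (22): average over the K outputs
    var_i = s.var(axis=0, ddof=1)          # equations (21) and (23)
    nullity = (P - 1) * var_i / sigma0_sq  # equation (20)
    # Equation (26): the paper's chi^2_{nu,1-alpha} follows the upper-tail-area
    # table convention, i.e. the lower alpha-quantile of chi^2(P - 1).
    c = chi2.ppf(alpha, df=P - 1)
    return nullity <= c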
The value of $\sigma_0^2$ is crucial to the success of this pruning heuristic. If $\sigma_0^2$ is too small, no parameters will be pruned. On the other hand, if $\sigma_0^2$ is too large, important parameters will be pruned. The pruning algorithm, summarized below, therefore starts with a small value of $\sigma_0^2$, and increases this value if no parameters can be pruned under the smaller value of $\sigma_0^2$. After each pruning step, the performance of the pruned network is first tested to see whether performance has not degraded too much. If the deterioration in performance is unacceptable, the original network is restored, and pruning stops.

To reduce computation time during the hypothesis testing phase, the variance nullity measures $\Upsilon_i$ are arranged in increasing order. Hypothesis tests start at the smallest $\Upsilon_i$ value and continue until no more parameters can be identified for pruning.
The statistical pruning heuristic based on variance nullity is summarized below:

1. Initialize the NN architecture and learning parameters.

2. Repeat:

(a) Train the NN until overfitting is observed.
(b) Let $\sigma_0^2 = 0.0001$.
(c) For each $\theta_i$:
i. for each $p = 1, \ldots, P$, calculate $S_i^{(p)}$ using equation (22);
ii. calculate the average $\bar{S}_i$ using equation (23);
iii. calculate the variance in parameter sensitivity $\sigma_i^2$ using equation (21);
iv. calculate the test variable $\Upsilon_i$ using equation (20).
(d) Apply the pruning heuristic:
i. arrange the $\Upsilon_i$ in increasing order;
ii. find $c$ using equation (26);
iii. for each $\theta_i$, if $\Upsilon_i \leq c$, then prune $\theta_i$;
iv. if $\Upsilon_i > c$ for all $i$, let $\sigma_0^2 = \sigma_0^2 \times 10$.

until no $\theta_i$ is pruned, or the reduced network is not accepted due to an unacceptable deterioration in generalization performance.

3. Train the final pruned NN architecture.
The variance nullity algorithm prunes the hidden layer first, then the input layer. After pruning of the hidden and input layers, it is proposed that irrelevant weights be pruned. Calculation of the variance nullity measures can be done on the training, validation or test sets. During training, the error on a validation set is monitored to detect when overfitting starts, at which point pruning is initiated. When a network is pruned, the pruning model retrains the reduced model from new initial random weights. If no more parameters can be identified for pruning, or if a reduced model is not accepted, the pruning process terminates. A pruned architecture is accepted if generalization performance has not deteriorated unacceptably.

6 Conclusion
This paper illustrated the power of Taylor expansions of NN performance functions. Analysis of the expansions of the objective function and the NN output function resulted in the development of efficient tools to optimize training through optimization of learning rates, NN architectures, and training data. A comparison of the two NN Taylor expansion types, with reference to NN pruning, showed the two approaches to be conceptually the same, while applications using the expansion of the NN output function are less complex than those of the objective function.

A summary was given of three major application types of NN output function expansions, i.e. analysis of decision boundaries for classification problems, active learning, and pruning. The new algorithms developed from such Taylor expansions have been shown to improve both the understanding of NNs and the performance of NNs (better generalization, better convergence characteristics and fewer learning computations). The interested reader is referred to the referenced material for detailed experimental results which illustrate these observations.

References
[1] S Amari, N Murata, K-R Müller, M Finke, H Yang, Asymptotic Statistical Theory of Overfitting and Cross-Validation, Technical Report METR 95-06, Department of Mathematical Engineering and Information, University of Tokyo, 1995.

[2] E Basson, AP Engelbrecht, Approximation of a Function and Its Derivatives in Feedforward Neural Networks, IEEE International Joint Conference on Neural Networks, Washington DC, USA, 1999, paper 2152.

[3] R Battiti, First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method, Neural Computation, Vol 4, 1992, pp 141-166.

[4] S Becker, Y Le Cun, Improving the Convergence of Back-Propagation Learning with Second Order Methods, DS Touretzky, GE Hinton, TJ Sejnowski (eds), Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann: Los Angeles, 1988.

[5] JY Choi, C-H Choi, Sensitivity Analysis of Multilayer Perceptron with Differentiable Activation Functions, IEEE Transactions on Neural Networks, 3(1), 1992, pp 101-107.

[6] T Cibas, F Fogelman Soulié, P Gallinari, S Raudys, Variable Selection with Neural Networks, Neurocomputing, Vol 12, 1996, pp 223-248.

[7] T Czernichow, Architecture Selection through Statistical Sensitivity Analysis, International Conference on Artificial Neural Networks, 1996, pp 179-184.

[8] AP Engelbrecht, I Cloete, JM Zurada, Determining the Significance of Input Parameters using Sensitivity Analysis, International Workshop on Artificial Neural Networks, Torremolinos, Spain, June 1995, in J Mira, F Sandoval (eds), From Natural to Artificial Neural Computation, Lecture Notes in Computer Science, Vol 930, pp 382-388.

[9] AP Engelbrecht, I Cloete, A Sensitivity Analysis Algorithm for Pruning Feedforward Neural Networks, IEEE International Conference on Neural Networks, Washington, Vol 2, 1996, pp 1274-1277.

[10] AP Engelbrecht, I Cloete, Selective Learning using Sensitivity Analysis, IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks, Anchorage, Alaska, 1998, pp 1150-1155.

[11] AP Engelbrecht, HL Viktor, Rule Improvement through Decision Boundary Detection using Sensitivity Analysis, International Work Conference on Neural Networks (IWANN'99), 2-4 June, Alicante, Spain, Lecture Notes in Computer Science, Vol 1607, Springer-Verlag, Berlin, Germany, pp 78-84.

[12] AP Engelbrecht, Sensitivity Analysis for Decision Boundaries, Neural Processing Letters, Vol 10, 1999, pp 1-14.

[13] AP Engelbrecht, I Cloete, Incremental Learning using Sensitivity Analysis, IEEE International Joint Conference on Neural Networks, Washington DC, USA, 1999, paper 380.

[14] AP Engelbrecht, L Fletcher, I Cloete, Variance Analysis of Sensitivity Information for Pruning Feedforward Neural Networks, IEEE International Joint Conference on Neural Networks, Washington DC, USA, 1999, paper 379.

[15] W Finnoff, F Hergert, HG Zimmermann, Improving Model Selection by Nonconvergent Methods, Neural Networks, Vol 6, 1993, pp 771-783.

[16] L Fletcher, V Katkovnik, FE Steffens, AP Engelbrecht, Optimizing the Number of Hidden Nodes of a Feedforward Artificial Neural Network, IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks, Anchorage, Alaska, 1998, pp 1608-1612.

[17] L Fu, T Chen, Sensitivity Analysis for Input Vector in Multilayer Feedforward Neural Networks, IEEE International Conference on Neural Networks, Vol 1, 1993, pp 215-218.

[18] T-H Goh, Semantic Extraction using Neural Network Modelling and Sensitivity Analysis, Proceedings of the 1993 International Joint Conference on Neural Networks, 1993, pp 1031-1034.

[19] J Gorodkin, LK Hansen, A Krogh, C Svarer, A Quantitative Study of Pruning by Optimal Brain Damage, International Journal of Neural Systems, 4(2), 1993, pp 159-169.

[20] Z Guo, RE Uhrig, Sensitivity Analysis and Applications to Nuclear Power Plant, Proceedings of the IEEE, Vol 2, 1992, pp 453-458.

[21] B Hassibi, DG Stork, Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, C Lee Giles, SJ Hanson, JD Cowan (eds), Advances in Neural Information Processing Systems, Vol 5, 1993, pp 164-171.

[22] B Hassibi, DG Stork, G Wolff, Optimal Brain Surgeon: Extensions and Performance Comparisons, JD Cowan, G Tesauro, J Alspector (eds), Advances in Neural Information Processing Systems, Vol 6, 1994, pp 263-270.

[23] MH Lamers, JN Kok, A Multilevel Nonlinearity Study Design, IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks, Anchorage, Alaska, 1998, pp 730-734.

[24] Y Le Cun, JS Denker, SA Solla, Optimal Brain Damage, D Touretzky (ed), Advances in Neural Information Processing Systems, Vol 2, 1990, pp 598-605.

[25] I Modai, NI Saban, M Stoler, A Valevski, N Saban, Sensitivity Profile of 41 Psychiatric Parameters Determined by Neural Network in Relation to 8-Week Outcome, Computers in Human Behavior, 11(2), 1995, pp 181-190.

[26] DE Rumelhart, GE Hinton, RJ Williams, Learning Internal Representations by Error Propagation, DE Rumelhart, JL McClelland and the PDP Research Group (eds), Parallel Distributed Processing, Vol 1, MIT Press, Cambridge, Mass, 1986.

[27] H Takenaga, S Abe, M Takamoto, M Kayama, T Kitamura, Y Okuyama, Input Layer Optimization of Neural Networks by Sensitivity Analysis and Its Application to Recognition of Numerals, Electrical Engineering in Japan, 111(4), 1991, pp 130-138.

[28] HH Thodberg, Improving Generalization of Neural Networks through Pruning, International Journal of Neural Systems, 1(4), 1991, pp 317-326.

[29] JM Zurada, A Malinowski, S Usui, Perturbation Method for Deleting Redundant Inputs of Perceptron Networks, Neurocomputing, Vol 14, 1997, pp 177-193.