Networks
AP Engelbrecht,
Department of Computer Science, University of Pretoria, Pretoria, South Africa, engel@driesie.cs.up.ac.za
Abstract
The Taylor series expansion of continuous functions has proven in many fields to be an extremely powerful tool for studying the characteristics of such functions. This paper illustrates the power of the Taylor series expansion of multilayer feedforward neural networks. The paper shows how these expansions can be used to investigate the positions of decision boundaries, to develop active learning strategies, and to perform architecture selection.
Keywords: Feedforward Neural Networks, Taylor Series, Sensitivity Analysis, Decision Boundaries, Active Learning, Architecture Selection
Computing Review Categories: I.2.6
1 Introduction
The Taylor series expansion of a continuous function
reveals interesting characteristics of that function under
small changes in the parameters of the function. Let P be a continuous function, for which the first derivative exists and the second derivative is bounded, which represents a performance measure of some system. Let \theta be a parameter of this system (assume, without loss of generality, that \theta is a scalar). Then, from a Taylor expansion of P around \theta, the change in performance due to an infinitesimal perturbation \Delta\theta of \theta is expressed as

P(\theta + \Delta\theta) = P(\theta) + \frac{\Delta\theta}{1!} P'(\theta) + \frac{(\Delta\theta)^2}{2!} P''(\theta) + \cdots    (1)

P(\theta + \Delta\theta) - P(\theta) = \frac{\Delta\theta}{1!} P'(\theta) + \frac{(\Delta\theta)^2}{2!} P''(\theta) + \cdots    (2)
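As a numeric illustration of equations (1) and (2), the sketch below perturbs a made-up smooth performance measure P (any function with a bounded second derivative would serve) and compares the actual change in P against its first- and second-order Taylor estimates; the particular function and values of \theta and \Delta\theta are illustrative assumptions, not from the paper.

```python
import math

# Made-up smooth performance measure P(theta) and its first two derivatives.
def P(theta):
    return math.exp(-theta) + theta ** 2

def dP(theta):          # P'(theta)
    return -math.exp(-theta) + 2.0 * theta

def d2P(theta):         # P''(theta)
    return math.exp(-theta) + 2.0

theta, delta = 0.5, 1e-2

actual = P(theta + delta) - P(theta)
first_order = dP(theta) * delta                              # Delta/1! * P'
second_order = first_order + 0.5 * d2P(theta) * delta ** 2   # + Delta^2/2! * P''

# The second-order estimate tracks the true change far more closely.
print(abs(actual - first_order), abs(actual - second_order))
```

The residual of the second-order estimate shrinks with \Delta\theta^3, which is why truncating the series after the quadratic term is usually acceptable for small perturbations.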
The main objective of this paper is to summarize current insights gained from such perturbation analysis studies, and to present new insights recently obtained through
studies by the author. Section 2 considers the Taylor expansion of the objective function, while Section 3 presents
insights obtained from an expansion of the NN output
function. Section 4 compares the two approaches with specific reference to architecture selection. Section 5 presents
in detail how Taylor series expansions of the NN output
function can be used to develop tools to analyze decision
boundaries, to optimize network architectures and training
data.
E(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) = E(\vec{\theta}) + E'(\vec{\theta}) \Delta\theta_i + \frac{1}{2} E''(\vec{\theta}) \Delta\theta_i^2 + \cdots    (3)

E(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) - E(\vec{\theta}) = E'(\vec{\theta}) \Delta\theta_i + \frac{1}{2} E''(\vec{\theta}) \Delta\theta_i^2 + \cdots    (4)
The first-order term E'(\vec{\theta}) is used in gradient descent optimization to drive the NN to a local minimum [26]. In this case \theta_i represents a weight of the NN. The second-order
term has also been used in optimization to improve convergence [3, 4]. Objective function sensitivity analysis has
been used widely in pruning of NN parameters. Optimal
Brain Damage (OBD) [19, 24] and Optimal Brain Surgeon
(OBS) [21, 22] prune weights with low saliency, while
Optimal Cell Damage (OCD) [6] prunes irrelevant input
and hidden units. OBD, OBS and OCD use second order
derivatives to approximate saliencies.
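The OBD-style use of second-order information can be sketched as follows. This is not the paper's implementation: the 2-2-1 sigmoid network and its weights are made up, and the diagonal Hessian entries H_ii are approximated, Gauss-Newton style, by squared output derivatives estimated with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

V = rng.normal(size=(2, 2))   # input-to-hidden weights v_ji (illustrative)
W = rng.normal(size=(1, 2))   # hidden-to-output weights w_kj (illustrative)

def forward(z):
    return sigmoid(W @ sigmoid(V @ z))   # network output o_k

def squared_output_derivatives(z, M, eps=1e-4):
    # (d o / d theta)^2 for each entry of weight matrix M,
    # via central finite differences on that entry.
    D = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        orig = M[idx]
        M[idx] = orig + eps
        op = forward(z)
        M[idx] = orig - eps
        om = forward(z)
        M[idx] = orig                     # restore the weight
        D[idx] = ((op - om).item() / (2 * eps)) ** 2
    return D

z = np.array([0.3, -0.7])
# OBD-style saliency S_i ~ 0.5 * H_ii * theta_i^2, per weight matrix.
saliency_V = 0.5 * squared_output_derivatives(z, V) * V ** 2
saliency_W = 0.5 * squared_output_derivatives(z, W) * W ** 2
print(saliency_V, saliency_W)
```

Weights with the smallest saliencies would be the first pruning candidates under this scheme.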
Pruning using OBD, OBS and OCD is based upon assumptions to reduce the complexity of calculating equation
(4) [6, 19, 21, 22, 24]. Under these assumptions, OBD defines the saliency of a parameter \theta_i as

S_i = E(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) - E(\vec{\theta}) \approx \frac{1}{2} H_{ii} \theta_i^2    (6)

and OBS as

S_i = E(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) - E(\vec{\theta}) \approx \frac{\theta_i^2}{2 [H^{-1}]_{ii}}    (7)
The change in output due to the perturbation is then entirely described by the derivative
F'_{NN}(\vec{\theta})_i = \lim_{\Delta\theta_i \to 0} \frac{F_{NN}(\theta_1, \ldots, \theta_i + \Delta\theta_i, \ldots, \theta_I) - F_{NN}(\vec{\theta})}{\Delta\theta_i}    (8)
Here t_k^{(p)} denotes the target value of the kth output unit for pattern p. A change in error, due to some perturbation, is determined by a change in the output value caused by that perturbation. This relationship is further illustrated in Table 1 and equations (10) and (11), considering the same assumptions as for OBD.
While objective function and NN output sensitivity
analysis mean conceptually the same thing, it is more complex to compute objective function sensitivity information.
Since the goal of learning is to minimize the objective
function, the first-order derivative of the objective function, E'(\vec{\theta}), is approximately zero at convergence, thus requiring second-order derivatives to be computed. Since this needs
the calculation of the Hessian matrix, objective function
sensitivity analysis is computationally expensive. In contrast, with NN output sensitivity analysis, first order information is sufficient to quantify the influence of parameter
perturbations, since we can assume that [29]
\lim_{\Delta\theta_i \to 0} \left( \frac{1}{2} F''_{NN}(\vec{\theta}) \Delta\theta_i^2 + \cdots \right) = 0    (9)
where FNN is the NN output function. It is much less expensive to calculate the Jacobian matrix than the Hessian
matrix.
Output sensitivity analysis is also more general than
objective function sensitivity analysis in that the latter depends on the error function (objective function) to be optimized. Usually, the sum squared error function is used,
but for any other error function, the sensitivity equations
as summarized in table 1 need to be redefined. The output
sensitivity analysis equations, on the other hand, remain
the same whatever objective function is used.
Since both sensitivity analysis approaches have been
applied to pruning of NN architectures, this application is
used to find a mathematical relationship between the two
methods. For this purpose OBD is used, considering all
its assumptions as listed in section 2. To derive this relationship, assume a NN with one output ok . Although the
comparison below is for one pattern only, it can quite easily be generalized to include the entire training set through
application of a suitable norm.
From Table 1, irrespective of which NN parameter is
considered, the following general relationship applies (assuming least squares as objective function):
\frac{\partial^2 E_k}{\partial \theta^2} = \left( \frac{\partial o_k}{\partial \theta} \right)^2    (10)
This supports the statement that objective function sensitivity analysis and NN output sensitivity analysis are conceptually the same (under the assumptions listed in section 2). This means that the same parameter significance
ordering will occur for the two methods. In the case of
pruning, the same parameters will therefore be pruned. In
general, for more than one output the following relationship holds:
\frac{\partial^2 E}{\partial \theta^2} = \sum_{k=1}^{K} \frac{\partial^2 E_k}{\partial \theta^2} = \sum_{k=1}^{K} \left( \frac{\partial o_k}{\partial \theta} \right)^2    (11)
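Equation (10) can be checked numerically. In the sketch below, o_k(theta) is a made-up smooth output function and the target is set to the output itself to mimic perfect convergence; the second derivative of E_k = 0.5 (t_k - o_k)^2 then matches the squared first derivative of the output.

```python
import math

# Made-up smooth output function o_k(theta).
def o(theta):
    return math.tanh(1.3 * theta)

theta, eps = 0.4, 1e-5
t = o(theta)                     # assume perfect convergence: t_k = o_k

def E(th):
    return 0.5 * (t - o(th)) ** 2

# Central-difference estimates of d^2 E / d theta^2 and d o / d theta.
d2E = (E(theta + eps) - 2.0 * E(theta) + E(theta - eps)) / eps ** 2
do = (o(theta + eps) - o(theta - eps)) / (2.0 * eps)

print(d2E, do ** 2)              # the two values agree closely
```

Away from convergence the term -(t_k - o_k) d^2 o_k / d theta^2 would also contribute, which is exactly what the OBD assumptions discard.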
Table 1: Objective function (error) and NN output sensitivities for the different NN parameters

Parameter | Error sensitivity \partial^2 E_k / \partial \theta^2 | Output sensitivity \partial o_k / \partial \theta
z_i       | (f'_{o_k})^2 [\sum_{j=1}^{J} w_{kj} f'_{y_j} v_{ji}]^2 | \partial o_k / \partial z_i = f'_{o_k} \sum_{j=1}^{J} w_{kj} f'_{y_j} v_{ji}
y_j       | (f'_{o_k})^2 w_{kj}^2                                  | \partial o_k / \partial y_j = f'_{o_k} w_{kj}
w_{kj}    | (f'_{o_k})^2 y_j^2                                     | \partial o_k / \partial w_{kj} = f'_{o_k} y_j
v_{ji}    | (f'_{y_j})^2 (f'_{o_k})^2 w_{kj}^2 z_i^2               | \partial o_k / \partial v_{ji} = f'_{y_j} f'_{o_k} w_{kj} z_i
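Two of the closed-form output sensitivities in Table 1 can be verified against finite differences. The sketch below does this for a single sigmoid output unit o_k = f(\sum_j w_kj y_j); the weights w_kj and hidden activations y_j are made-up values.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.5, 0.3])      # hidden-to-output weights w_kj
y = np.array([0.2, 0.9, 0.4])       # hidden unit activations y_j
net = w @ y
fprime = f(net) * (1.0 - f(net))    # f'_ok for the sigmoid

analytic_dw = fprime * y            # Table 1: do_k/dw_kj = f'_ok * y_j
analytic_dy = fprime * w            # Table 1: do_k/dy_j  = f'_ok * w_kj

# Central finite differences: perturbing w_kj shifts net by eps*y_j,
# perturbing y_j shifts net by eps*w_kj.
eps = 1e-6
numeric_dw = np.array([(f(net + eps * y[j]) - f(net - eps * y[j])) / (2 * eps)
                       for j in range(3)])
numeric_dy = np.array([(f(net + eps * w[j]) - f(net - eps * w[j])) / (2 * eps)
                       for j in range(3)])

print(np.max(np.abs(analytic_dw - numeric_dw)),
      np.max(np.abs(analytic_dy - numeric_dy)))
```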
In this section a summary is given of how the Taylor expansion of the NN output function can be used to analyze decision boundaries, to develop active learning algorithms to optimize the use of training data, and to prune irrelevant units. The approaches surveyed below have been developed by the author of this paper. Each aspect has been published separately, with elaborate experimental results presented in these publications. This section serves as a collective overview of these approaches. The interested reader is referred to the referenced literature.

Definition 1: One Point Decision Boundary: Under assumptions 1 and 2, if there exists an input parameter value z_i^{(p)} and a small perturbation \Delta z_i of z_i^{(p)} such that, for any output o_k, f_{o_k}(z_i^{(p)}) \neq f_{o_k}(z_i^{(p)} + \Delta z_i), then a decision boundary is located in the range [z_i^{(p)}, z_i^{(p)} + \Delta z_i] of input parameter z_i, where p denotes a single pattern.

Definition 2: Range Decision Boundary: Under assumptions 1 and 2, if there exist two input parameter values z_{i1}^{(p)} and z_{i2}^{(p)}, with z_{i1}^{(p)} < z_{i2}^{(p)}, such that for any value z_i^{(p)} \in [z_{i1}^{(p)}, z_{i2}^{(p)}], f_{o_k}(z_i^{(p)}) \neq f_{o_k}(z_i^{(p)} + \Delta z_i), then a decision boundary is located in the range [z_{i1}^{(p)}, z_{i2}^{(p)}] of input parameter z_i.

From a first-order Taylor expansion of the output function around z_i^{(p)},

f_{o_k}(z_i^{(p)} + \Delta z_i) \approx f_{o_k}(z_i^{(p)}) + \Delta z_i \frac{\partial f_{o_k}}{\partial z_i^{(p)}}    (12)

so that the change in output, f_{o_k}(z_i^{(p)} + \Delta z_i) - f_{o_k}(z_i^{(p)}), is proportional to the output sensitivity \partial f_{o_k} / \partial z_i^{(p)}.
[Figure: the sigmoid activation function f(z) = 1/(1 + exp(-z)) and its derivative df/dz]

Definition 3: Pattern Informativeness: Define the informativeness of a pattern as the sensitivity of the NN output vector to small perturbations in the input vector. Let \Phi^{(p)} denote the informativeness of pattern p. Then,

\Phi^{(p)} = ||\vec{S}_o^{(p)}||    (13)
Plots of the output sensitivity \partial o_k / \partial z_i against the input parameter values z_i^{(p)} can be used to locate decision boundaries. Peaks in these graphs correspond to one point boundaries, while a set of approximately equal high sensitivity values corresponds to a range decision boundary. These graphs can be used to investigate regions of input space for which classification is uncertain, and to determine under what conditions the result of the classification changes.
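Locating a one point decision boundary from a sensitivity peak can be sketched as follows. The "trained classifier" is a hand-built 1D model o(z) = f(a(z - b)) whose true boundary sits at z = b; the steepness a and position b are made-up values.

```python
import numpy as np

a, b = 8.0, 0.35

def o(z):
    # Sigmoid classifier with decision boundary at z = b.
    return 1.0 / (1.0 + np.exp(-a * (z - b)))

zs = np.linspace(-1.0, 1.0, 401)
sens = np.abs(np.gradient(o(zs), zs))   # |do/dz| along the input range

boundary = zs[np.argmax(sens)]          # the sensitivity peak marks the boundary
print(boundary)
```

A plateau of comparably high sensitivities, instead of a single peak, would indicate a range decision boundary in the sense of Definition 2.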
The objective of active learning algorithms is to make optimal use of the data provided in the training set. Usually, the
NN learner passively receives training data and has to learn
on all the data. With active learning, the learner selects
training patterns from a candidate set of training patterns
based on the learner's current knowledge of the problem.
In allowing the learner to decide what to learn, the patterns selected are those that convey the most information about the function to be learned.
There are two approaches to active learning: selective
learning and incremental learning. With selective learning the learner selects at each selection interval (e.g. each
epoch) a new training subset from the candidate set. The
candidate set retains all patterns. With incremental learning the actual training subset grows while the candidate set shrinks.
SART / SACJ, No 25, 2000
where \vec{S}_o^{(p)} is computed from the output-input sensitivity matrix with elements

S_{o,ki}^{(p)} = \frac{\partial o_k^{(p)}}{\partial z_i^{(p)}}    (15)

The norm in equation (13) can be taken as the maximum norm,

||\vec{S}_o^{(p)}|| = \max_{1 \le k \le K} \left\{ \sum_{i=1}^{I} |S_{o,ki}^{(p)}| \right\}    (17)

or the Euclidean norm,

||\vec{S}_o^{(p)}|| = \max_{1 \le k \le K} \left\{ \sum_{i=1}^{I} (S_{o,ki}^{(p)})^2 \right\}^{1/2}    (18)
At each selection interval, the most informative patterns are selected from the candidate set D_C as

\{ p \in D_C \mid \Phi^{(p)} = \max_{1 \le q \le P} \{ \Phi^{(q)} \}, \ \forall q \in D_C \text{ not yet selected} \}    (19)
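The selection step above can be sketched as follows: rank candidate patterns by informativeness \Phi^{(p)}, taken here as the maximum norm of the output-input sensitivities, and pick the most informative few. The 2-3-1 sigmoid network, its weights, and the candidate set are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

V = rng.normal(size=(3, 2))     # input-to-hidden weights v_ji (illustrative)
W = rng.normal(size=(1, 3))     # hidden-to-output weights w_kj (illustrative)

def informativeness(z):
    y = sigmoid(V @ z)
    o = sigmoid(W @ y)
    # do_k/dz_i by the chain rule: f'_ok * sum_j w_kj * f'_yj * v_ji
    S = (o * (1 - o))[:, None] * ((W * (y * (1 - y))) @ V)
    return np.max(np.sum(np.abs(S), axis=1))   # maximum norm over outputs

candidates = rng.uniform(-1, 1, size=(20, 2))  # candidate training set D_C
phi = np.array([informativeness(z) for z in candidates])
selected = np.argsort(phi)[::-1][:5]           # 5 most informative patterns
print(selected)
```

Recomputing phi at every selection interval is what makes the scheme adaptive: as the network's knowledge changes, so does the ranking of the candidate patterns.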
For pruning, a variance nullity measure is computed for each NN parameter \theta_i as

\Gamma_i = \frac{(P - 1) \sigma_i^2}{\sigma_0^2}    (20)
where \sigma_i^2 is the variance of the sensitivity of the network output to perturbations in parameter \theta_i, and \sigma_0^2 is a value close to zero (the characteristics of this value are explained below).
The variance \sigma_i^2 is computed as

\sigma_i^2 = \frac{\sum_{p=1}^{P} (S_i^{(p)} - \bar{S}_i)^2}{P - 1}    (21)

where

S_i^{(p)} = \frac{\sum_{k=1}^{K} S_{o,ki}^{(p)}}{K}    (22)

and

\bar{S}_i = \frac{\sum_{p=1}^{P} S_i^{(p)}}{P}    (23)
The null hypothesis

H_0: \sigma_i^2 = \sigma_0^2    (24)

is defined for each parameter \theta_i. Unfortunately, from equation (20), \sigma_0^2 \neq 0, and we cannot hypothesize that the variance in parameter sensitivity over all patterns is exactly zero, i.e. \sigma_i^2 = 0. Instead, a small value close to zero is chosen for \sigma_0^2, and the alternative hypothesis,

H_1: \sigma_i^2 < \sigma_0^2    (25)

is tested. Using the fact that under the null hypothesis the variance nullity measure has a \chi^2(P-1) distribution in the case of P patterns, the critical value \Gamma_c is obtained from \chi^2 distribution tables,

\Gamma_c = \chi^2_{\nu; 1-\alpha}    (26)
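The variance nullity test of equations (20)-(26) can be sketched on simulated data. The values of \sigma_0^2, \alpha and the two simulated sensitivity samples below are illustrative assumptions, not from the paper: one parameter has nearly constant sensitivity across patterns (a pruning candidate), the other varies strongly.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)

P = 50                      # number of patterns
sigma0_sq = 1e-4            # small "near zero" reference variance
alpha = 0.05

s_prunable = rng.normal(scale=1e-4, size=P)   # nearly constant sensitivity
s_relevant = rng.normal(scale=0.3, size=P)    # strongly varying sensitivity

def variance_nullity(s):
    var = np.sum((s - s.mean()) ** 2) / (P - 1)   # equation (21)
    return (P - 1) * var / sigma0_sq              # equation (20)

# H1: sigma_i^2 < sigma_0^2 is a lower-tail test under chi^2(P-1).
crit = chi2.ppf(alpha, df=P - 1)

for name, s in (("prunable", s_prunable), ("relevant", s_relevant)):
    gamma = variance_nullity(s)
    print(name, "prune" if gamma < crit else "keep")
```

Parameters whose \Gamma_i falls below the critical value have sensitivities that are statistically indistinguishable from constant, and are therefore candidates for removal.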
6 Conclusion
This paper illustrated the power of Taylor expansions of NN performance functions. Analysis of the expansions of the objective function and of the NN output function resulted in the development of efficient tools to optimize training through optimization of learning rates, NN architectures, and training data. A comparison of the two NN Taylor expansion types, with reference to NN pruning, showed the two approaches to be conceptually the same, while applications using the expansion of the NN output function are less complex than those using the objective function.
A summary was given of three major application types of NN output function expansions, i.e. analysis of decision boundaries for classification problems, active learning, and pruning. The new algorithms developed from such Taylor expansions have been shown to improve the understanding of NNs and the performance of NNs (better generalization, better convergence characteristics, and fewer learning computations). The interested reader is referred to the referenced material for detailed experimental results which illustrate these observations.
References
[1] S Amari, N Murata, K-R Muller, M Finke, H Yang, Asymptotic Statistical Theory of Overfitting and Cross-Validation, Technical Report METR 95-06, Department of Mathematical Engineering and Information, University of Tokyo, 1995.
[2] E Basson, AP Engelbrecht, Approximation of a Function and Its Derivatives in Feedforward Neural Networks, IEEE International Joint Conference on Neural Networks, Washington DC, USA, 1999, paper
2152.
[3] R Battiti, First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method, Neural Computation, Vol 4, 1992, pp 141-166.
[4] S Becker, Y Le Cun, Improving the Convergence
of Back-Propagation Learning with Second Order
Methods, DS Touretzky, GE Hinton, TJ Sejnowski
(eds), Proceedings of the 1988 Connectionist Summer School, Morgan Kaufmann:Los Angeles, 1988.
[5] JY Choi, C-H Choi, Sensitivity Analysis of Multilayer Perceptron with Differential Activation Functions, IEEE Transactions on Neural Networks, 3(1),
1992, pp 101-107.
[6] T Cibas, F Fogelman Soulie, P Gallinari, S Raudys,
Variable Selection with Neural Networks, Neurocomputing, Vol 12, 1996, pp 223-248.
[7] T Czernichow, Architecture Selection through Statistical Sensitivity Analysis, International Conference on
Artificial Neural Networks, 1996, pp 179-184.
[8] AP Engelbrecht, I Cloete, JM Zurada, Determining
the Significance of Input Parameters using Sensitivity
Analysis, International Workshop on Artificial Neural
Networks, Torremolinos, Spain, June 1995, in J Mira,
F Sandoval (eds), From Natural to Artificial Neural
Computing, in the series Lecture Notes in Computer
Science, Vol 930, pp 382-388.
[9] AP Engelbrecht, I Cloete, A Sensitivity Analysis Algorithm for Pruning Feedforward Neural Networks,
IEEE International Conference on Neural Networks,
Washington, Vol 2, 1996, pp 1274-1277.
[10] AP Engelbrecht, I Cloete, Selective Learning using
Sensitivity Analysis, IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks, Anchorage, Alaska, 1998,
pp 1150-1155.
[13] AP Engelbrecht, I Cloete, Incremental Learning using Sensitivity Analysis, IEEE International Joint
Conference on Neural Networks, Washington DC,
USA, 1999, paper 380.
[14] AP Engelbrecht, L Fletcher, I Cloete, Variance Analysis of Sensitivity Information for Pruning Feedforward Neural Networks, IEEE International Joint Conference on Neural Networks, Washington DC, USA, 1999, paper 379.
[16] L Fletcher, V Katkovnik, FE Steffens, AP Engelbrecht, Optimizing the Number of Hidden Nodes of a Feedforward Artificial Neural Network, IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks, Anchorage, Alaska, 1998, pp 1608-1612.
[17] L Fu, T Chen, Sensitivity Analysis for Input Vector in Multilayer Feedforward Neural Networks, IEEE International Conference on Neural Networks, Vol 1, 1993, pp 215-218.
[25] I Modai, NI Saban, M Stoler, A Valevski, N Saban, Sensitivity Profile of 41 Psychiatric Parameters Determined by Neural Network in Relation to 8-Week Outcome, Computers in Human Behavior, 11(2), 1995, pp 181-190.