In Search of Neural Network Stability
Presented at:
M2007
by:
Jeff Zeanah
President
Z Solutions, Inc.
October 1, 2007

Page 1 Copyright © Z Solutions, Inc. 2007 All rights reserved

Presentation Methodology
• Proceeds as a set of analyses
• Follows the results and thought processes of the
analyses
• Likely not one answer, but issues to understand
• Three datasets explored

What would you do?
• 10 identical Neural Network models are built with the
same data and inputs.
• On a given dataset: 3 are better than the “traditional”
model now used, 7 are not

Option A: Choose one of the better models and forget the others exist
Option B: Run out of time and Punt!
Option C: Proclaim, “I didn’t even know there was an
instability issue!”

Why Neural Network Stability?


• Learning generally starts with randomly generated initial
weights
• It is very important that the final result is not dependent
on the starting point
• High degree of parameterization (connections between inputs and hidden nodes, therefore multiple parameters per input) can lead to overtraining
– i.e., learning of spurious relationships in the training data
• Even without overtraining, we can get unstable results
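The seed dependence described above can be demonstrated outside SAS. A minimal numpy sketch (the XOR data, network size, and learning rate are illustrative choices, not from the presentation): train the same tiny tanh network from several random starting points and compare the final training errors.

```python
import numpy as np

def train_mlp(seed, X, y, hidden=3, lr=0.5, iters=2000):
    """Train a tiny one-hidden-layer tanh network by plain gradient
    descent, starting from weights drawn with the given seed."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
    W2 = rng.normal(0.0, 1.0, (hidden, 1))
    for _ in range(iters):
        H = np.tanh(X @ W1)                  # hidden activations
        p = 1.0 / (1.0 + np.exp(-(H @ W2)))  # output probability
        err = p - y                          # dLoss/dLogit for cross-entropy
        gW2 = (H.T @ err) / len(X)
        gW1 = (X.T @ ((err @ W2.T) * (1.0 - H**2))) / len(X)
        W2 -= lr * gW2
        W1 -= lr * gW1
    p = 1.0 / (1.0 + np.exp(-(np.tanh(X @ W1) @ W2)))
    return float(np.mean((p > 0.5) != y))    # training error rate

# XOR: a small problem with a non-convex loss surface
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

errors = [train_mlp(s, X, y) for s in range(10)]
print(errors)   # different seeds can land at different solutions
```

Identical data, identical architecture, identical training budget; only the initial weights differ, and the final errors can still differ.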

Possible methods to control instability

• Learning Algorithm
• Data Scaling
• Data Preprocessing
• Network Architecture
– Number of hidden nodes
– Initial scaling of weights
– Transformations of activation function

What is the magnitude of Instability?

• Proc Neural, the procedure underlying the Neural Network tool in Enterprise Miner, is used in foundation SAS
• Loop through a process to generate unique random
numbers
• Initialize weights with the random numbers as seeds,
train a model and evaluate
• Compare instability to Logistic Regression
• Training is performed using a training set and a
validation set, using the iteration with the best fit on
validation data.

PVA Data Set
• Predicting respondents from a set of lapsing donors
• Used in the 1998 KDD Cup competition
• Screened variables: Directly correlated with target
– 6 continuous variables
– 1 binary variable
• Equal number of “1’s” and “0’s” in the target: “50/50”
• 3390 observations in both the training and validation
datasets
• Average Error on the validation set is evaluated

Code Snips
• Generate Random Numbers

data random;
   call streaminit(123);
   do i=1 to 50;
      x1=floor(10000*rand('uniform'));
      output;
   end;
run;
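For readers outside SAS, the seed-generation step can be sketched in Python (the `train_and_score` stub is hypothetical; in the presentation each seed feeds the `random =` option of PROC NEURAL):

```python
import random

# Mirror the SAS snippet: one master seed (123) yields a repeatable
# list of 50 integer seeds in [0, 10000)
random.seed(123)
seeds = random.sample(range(10000), 50)  # sample() guarantees unique seeds

def train_and_score(seed):
    """Hypothetical stand-in for one PROC NEURAL run: initialize the
    weights from `seed`, train against a validation set, and return
    the validation error of the best-fitting iteration."""
    raise NotImplementedError

# results = [train_and_score(s) for s in seeds]
print(len(seeds), min(seeds), max(seeds))
```

`random.sample` enforces the uniqueness the experiment requires; the SAS `rand('uniform')` draw could, in principle, repeat a seed.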

Code Snips (cont’d)
• Neural Training Code (Default algorithm)

proc neural data=pva50.TRAIN50 dmdbcat=WORK.pva_DMDB
            validdata=pva50.VALIDATE50 random=&&x&i;
   input %INTINPUTS / level=interval id=intvl;
   input %BININPUTS / level=nominal id=bin;
   target TARGET_B / level=NOMINAL id=TARGET_B;
   arch MLP Hidden=&j; * nbr of hidden varied;
   nloptions FCONV=.0000001;
   train Maxiter=40 maxtime=14400
         Outest=work.Neural_outest estiter=1
         Outfit=work.Neural_OUTFIT;
run;
quit;
Note: EM procedures in code are not supported by Tech. Support

Default Results – Stable?

How does this compare?
• Compared to Logistic Regression
– A reasonable comparison is to do multiple random draws of Training and Validation sets
• Using Proc DMREG
– Selection = Stepwise
– Choose=VERROR
• All 2 factor interactions and quadratic terms are available
for selection
• Default neural training with 9 nodes
• Comparison of Validation errors

Better average fit for neural, however more variability

Does the variability change with different learning algorithms?


Algorithms Tested and Key Settings
• Default Learning
– No changes
• Back Propagation Learning (Historical Neural Network error assignment and reduction)
– Learning rate = 0.1 (default)
• Double Dogleg (Derivation of “Newton Type”
optimizations)
– No changes
• RProp (Variation of Back Propagation)
– Learning rate = 0.1 (default)

5 nodes best; default worst; RProp mean is the best

Intermediate Conclusions
• Nodes and Learning Algorithm seem to have little impact on stability
• R-Prop seems to give the best results

Two frequent concerns about variability

• Scaling of the inputs
– Default scales with “Standard Deviation,” meaning inputs are rescaled to have a mean of 0 and a standard deviation of 1
– Change to “Range,” which scales from 0 to 1
• Weight Initialization
– Default randomly pulls from a Normal distribution with mean = 0 and standard deviation = 1
– Change to pulling from a Uniform distribution with a range of ±0.05 or ±0.10
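The two changes can be sketched in Python (numpy-based; the toy data and layer sizes are illustrative, not from the presentation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(50, 12, size=(200, 3))   # toy interval inputs

# Input scaling: default "Standard Deviation" vs "Range"
X_std = (X - X.mean(axis=0)) / X.std(axis=0)                    # mean 0, sd 1
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) # 0 to 1

# Weight initialization: default Normal(0, 1) vs Uniform(-0.10, 0.10)
n_in, n_hidden = 3, 9
W_default = rng.normal(0.0, 1.0, (n_in, n_hidden))
W_small = rng.uniform(-0.10, 0.10, (n_in, n_hidden))

# Small uniform weights keep each hidden node's initial input near
# zero, i.e. in the near-linear region of tanh
print(np.abs(X_std @ W_default).mean(), np.abs(X_range @ W_small).mean())
```

The design intent (per the Bishop quotes later in the deck): bounded inputs times small weights means no hidden unit starts out saturated.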

Scaling Changes shown in Code

proc neural data=pva50.TRAIN50 dmdbcat=WORK.pva_DMDB
            validdata=pva50.VALIDATE50 random=&&x&i;
   netopts randist=UNIFORM ranloc=0 ranscale=0.10;
   input %INTINPUTS / level=interval id=intvl std=range;
   input %BININPUTS / level=nominal id=bin std=range;
   target TARGET_B / level=NOMINAL id=TARGET_B;
   arch MLP Hidden=&j;
   nloptions FCONV=.0000001;
   train Technique=RPROP
         learn=0.1
         maxiter=1000
         maxtime=900
         Outest=work.Neural_outest estiter=1
         Outfit=work.Neural_OUTFIT;
run;
quit;

Recommendations about Initialization


“The initial weight values are chosen to be small so that
sigmoidal activation functions are not driven into the
saturation regions …”
- C.M. Bishop
Neural Networks for Pattern Recognition

“The weights are usually generated from a simple distribution, such as a spherically symmetric Gaussian, for convenience …”
- C.M. Bishop
Neural Networks for Pattern Recognition

TanH Transformation
tanh(a) = (e^a − e^(−a)) / (e^a + e^(−a)), where a = f(x)

[Figure: plot of tanh(a) for a from −3 to 3; the curve runs between −1 and 1]
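The definition on this slide can be checked numerically; a quick sketch (nothing here is presentation-specific, and `np.tanh` serves as the reference):

```python
import numpy as np

a = np.linspace(-3, 3, 61)
tanh_manual = (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))

# The hand-built ratio matches the library function
assert np.allclose(tanh_manual, np.tanh(a))

# Near 0 the curve is roughly linear (slope 1); toward |a| = 3 it has
# already saturated to about +/-0.995, which is why small initial
# weights keep hidden units out of the flat, slow-learning regions
print(np.tanh(0.0), np.tanh(3.0))
```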

Seems to work: RProp scaled compared to Default
Assessment
• Scaling seems to reduce variability
– However, only tested on the RProp Algorithm
– Should test on other algorithms
• The PVA data is rather simple
– Only 7 variables
– Other algorithms not tested

Data 1 Data Set
• Binary Dependent variable
• Screened variables: Directly correlated with dependent
variable or having some signs of interaction
– 296 Binary Variables
– 1 binary target
• Equal number of “1’s” and “0’s” in the target: “50/50”
• 3988 observations in both the training and validation
datasets
• Average Error on the validation set is evaluated

Data 1 Results

Data 1: Reduced Range and Initialization

Data 1 Results: Observations
• Scaling steps improved all algorithms
• Back Propagation gave the most stable results
• Back Propagation gave the best results (measured on
validation data)

Data 2 Data Set


• Binary Dependent variable
• Screened variables: Directly correlated with dependent
variable or having some signs of interaction
– 38 Interval Variables
– Mostly relatively normally distributed
• 17% of target observations are “1”
• 1094 observations in both the training and validation
datasets
• Average Error on the validation set is evaluated

Data 2: Defaults

Data 2: Reduced Range and Initialization

Data 2 Results: Observations
• Scaling steps improved all algorithms
• Back Propagation gave the most stable results
• Back Propagation and Default gave the best mean
results (measured on validation data)

Can we identify something behind the stability or lack thereof?
• Is there a reason behind the variable results?
• In a Regression model we would look at parameter
stability
• In Neural Network models the parameters are
notoriously inscrutable
– However, that doesn’t hide all meaning

Variable Importance Rankings
• The absolute magnitude of the neural network weights
can be used to determine variable importance
– Averaged across all the hidden nodes
– The higher the absolute weights, the more important the variable
– Can be done with and without factoring in the hidden weights
• This is true only if the variables are on the same scale
• Can be used for variable selection
• Can be used for exploratory data analysis
– To find interesting interactions
– Flexible to all types of responses
• Has been proposed as a possibility for credit scoring
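The ranking idea can be sketched in a few lines (the weight matrix here is random, standing in for the input-to-hidden weights of a trained network on same-scale inputs):

```python
import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_hidden = 7, 9

# Stand-ins for a trained network's weights
W1 = rng.normal(0, 1, (n_inputs, n_hidden))  # input-to-hidden
w2 = rng.normal(0, 1, n_hidden)              # hidden-to-output

# Importance = mean absolute input weight, averaged over hidden nodes
importance = np.abs(W1).mean(axis=1)

# Variant that also factors in the hidden-to-output weights
importance_weighted = (np.abs(W1) * np.abs(w2)).mean(axis=1)

ranks = importance.argsort()[::-1]   # variable indices, most important first
print(ranks)
```

Only valid when inputs share a scale, as the slide notes; otherwise a large weight may just be compensating for a small input range.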

Measurement of Stability of Variable Importance Rankings
• 50 models are developed as before
• Mean Absolute weights are calculated for each variable
• Variables are ranked (in each of the 50 models) by the
Mean Absolute weights
• The Mean Rank is captured, as well as the lowest rank (highest importance) and highest rank (lowest importance)
• Therefore, the stability of which variables are important can be analyzed, not just model fit
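The measurement above can be sketched as follows (the weight stack is random, standing in for the 50 retrained PROC NEURAL models):

```python
import numpy as np

rng = np.random.default_rng(2)
n_models, n_vars, n_hidden = 50, 7, 9

# Hypothetical stack of input-to-hidden weight matrices, one per
# retrained model (in practice, one per random seed)
W = rng.normal(0, 1, (n_models, n_vars, n_hidden))

# Mean absolute weight per variable, within each model
mean_abs = np.abs(W).mean(axis=2)            # shape (models, vars)

# Rank 1 = most important variable within each model
ranks = (-mean_abs).argsort(axis=1).argsort(axis=1) + 1

for v in range(n_vars):
    r = ranks[:, v]
    print(f"var {v}: mean rank {r.mean():.1f}, best {r.min()}, worst {r.max()}")
```

A variable whose best and worst ranks stay close across the 50 models is being found consistently; a wide spread is instability in what the network learns, not merely in how well it fits.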

Data 2 Back Prop Reduced Range

Default Learning Reduced Range

Back Propagation Standard Range

Possible methods to control instability

• Learning Algorithm (addressed)
• Data Scaling (addressed)
• Data Preprocessing (not addressed)
• Network Architecture
– Number of hidden nodes (little impact, if any)
– Initial scaling of weights (addressed)
– Transformations of activation function (not addressed)

Conclusions
• 16 years of personal experience has led to this general
heuristic concerning Neural Networks:

“There are no general heuristics concerning Neural Networks”

• However, these results appear convincing
• Rare target (without over-sampling) may give different
results
• Others are encouraged to download the code used and
try on their own data

Recommendations
• Maximum iterations of 20 should be increased
– 300 for Back-Propagation and R-Prop
– 80 for Default and Double Dogleg
– Regardless, results should always be checked to make sure
learning did not stop early
• Scaling on Inputs to Range and Weights should be
drawn from a uniform distribution ± 0.10
– Hard to envision any harm in these recommendations
– Will give slightly longer training times
• Consideration should be given to looking at Back-Propagation training again
– Algorithm improvements may be available at this time
• More testing is recommended

Download Code Examples From

www.zsolutions.com/nn_code

For comments and questions, please email:


Jeff Zeanah
jeffz@zsolutions.com
