In Search of Neural Network Stability
Presented at:
M2007
by:
Jeff Zeanah
President
Z Solutions, Inc.
October 1, 2007

Page 1 Copyright © Z Solutions, Inc. 2007 All rights reserved

Presentation Methodology
• Proceeds as a set of analyses
• Follows the results and thought processes of the
analyses
• Likely not one answer, but issues to understand
• Three datasets explored

What would you do?
• 10 identical Neural Network models are built with the
same data and inputs.
• On a given dataset: 3 are better than the “traditional”
model now used, 7 are not

Option A: Choose one of the better models and forget the others exist
Option B: Run out of time and Punt!
Option C: Proclaim, “I didn’t even know there was an
instability issue!”

Why Neural Network Stability?


• Learning generally starts with randomly generated initial
weights
• It is very important that the final result is not dependent
on the starting point
• High degree of parameterization (connections between inputs and hidden nodes, therefore multiple parameters per input) can lead to overtraining
– i.e., learning of spurious relationships in the training data
• Even without overtraining, we can get unstable results
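The seed dependence described above can be demonstrated outside SAS. A minimal numpy sketch (the XOR data, network size, and learning rate are illustrative choices, not from the presentation): train the same tiny tanh network from several random starting points and compare the final training errors.

```python
import numpy as np

def train_mlp(seed, X, y, hidden=3, lr=0.5, iters=2000):
    """Train a tiny one-hidden-layer tanh network by plain gradient
    descent, starting from weights drawn with the given seed."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
    W2 = rng.normal(0.0, 1.0, (hidden, 1))
    for _ in range(iters):
        H = np.tanh(X @ W1)                  # hidden activations
        p = 1.0 / (1.0 + np.exp(-(H @ W2)))  # output probability
        err = p - y                          # dLoss/dLogit for cross-entropy
        gW2 = (H.T @ err) / len(X)
        gW1 = (X.T @ ((err @ W2.T) * (1.0 - H**2))) / len(X)
        W2 -= lr * gW2
        W1 -= lr * gW1
    p = 1.0 / (1.0 + np.exp(-(np.tanh(X @ W1) @ W2)))
    return float(np.mean((p > 0.5) != y))    # training error rate

# XOR: a small problem with a non-convex loss surface
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

errors = [train_mlp(s, X, y) for s in range(10)]
print(errors)   # different seeds can land at different solutions
```

Identical data, identical architecture, identical training budget; only the initial weights differ, and the final errors can still differ.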

Possible methods to control instability

• Learning Algorithm
• Data Scaling
• Data Preprocessing
• Network Architecture
– Number of hidden nodes
– Initial scaling of weights
– Transformations of activation function

What is the magnitude of Instability?

• Proc Neural, the procedure underlying the Neural Network tool in Enterprise Miner, is used in foundation SAS
• Loop through a process to generate unique random
numbers
• Initialize weights with the random numbers as seeds,
train a model and evaluate
• Compare instability to Logistic Regression
• Training is performed using a training set and a
validation set, using the iteration with the best fit on
validation data.

PVA Data Set
• Predicting respondents from a set of lapsing donors
• Used in the 1998 KDD Cup competition
• Screened variables: Directly correlated with target
– 6 continuous variables
– 1 binary variable
• Equal number of “1’s” and “0’s” in the target: “50/50”
• 3390 observations in both the training and validation
datasets
• Average Error on the validation set is evaluated

Code Snips
• Generate Random Numbers

data random;
   call streaminit(123);
   do i=1 to 50;
      x1=floor(10000*rand('uniform'));
      output;
   end;
run;
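For readers outside SAS, the seed-generation step can be sketched in Python (the `train_and_score` stub is hypothetical; in the presentation each seed feeds the `random =` option of PROC NEURAL):

```python
import random

# Mirror the SAS snippet: one master seed (123) yields a repeatable
# list of 50 integer seeds in [0, 10000)
random.seed(123)
seeds = random.sample(range(10000), 50)  # sample() guarantees unique seeds

def train_and_score(seed):
    """Hypothetical stand-in for one PROC NEURAL run: initialize the
    weights from `seed`, train against a validation set, and return
    the validation error of the best-fitting iteration."""
    raise NotImplementedError

# results = [train_and_score(s) for s in seeds]
print(len(seeds), min(seeds), max(seeds))
```

`random.sample` enforces the uniqueness the experiment requires; the SAS `rand('uniform')` draw could, in principle, repeat a seed.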

Code Snips (cont’d)
• Neural Training Code (Default algorithm)

proc neural data=pva50.TRAIN50 dmdbcat=WORK.pva_DMDB
            validdata=pva50.VALIDATE50 random=&&x&i;
   input %INTINPUTS / level=interval id=intvl;
   input %BININPUTS / level=nominal id=bin;
   target TARGET_B / level=NOMINAL id=TARGET_B;
   arch MLP Hidden=&j; * nbr of hidden varied;
   nloptions FCONV=.0000001;
   train Maxiter=40 maxtime=14400
         Outest=work.Neural_outest estiter=1
         Outfit=work.Neural_OUTFIT;
run;
quit;
Note: EM procedures in code are not supported by Tech. Support

Default Results – Stable?

How does this compare?
• Compared to Logistic Regression
– A reasonable comparison is to do multiple random draws of Training and Validation sets
• Using Proc DMREG
– Selection = Stepwise
– Choose=VERROR
• All 2 factor interactions and quadratic terms are available
for selection
• Default neural training with 9 nodes
• Comparison of Validation errors

Better average fit for neural, however more variability

Does the variability change with different learning algorithms?


Algorithms Tested and Key Settings
• Default Learning
– No changes
• Back Propagation Learning (Historical Neural Network error assignment and reduction)
– Learning rate = 0.1 (default)
• Double Dogleg (Derivation of “Newton Type”
optimizations)
– No changes
• RProp (Variation of Back Propagation)
– Learning rate = 0.1 (default)

5 nodes best; default worst; RProp mean is the best

Intermediate Conclusions
• Nodes and Learning Algorithm seem to have little impact on stability
• R-Prop seems to give the best results

Two frequent concerns about variability

• Scaling of the inputs
– Default scales with “Standard Deviation,” meaning inputs are rescaled to have a mean of 0 and a standard deviation of 1
– Change to “Range,” which scales from 0 to 1
• Weight Initialization
– Default randomly pulls from a Normal distribution with mean = 0 and standard deviation = 1
– Change to pulling from a Uniform distribution with a range of ±0.05 or ±0.10
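The two changes can be sketched in Python (numpy-based; the toy data and layer sizes are illustrative, not from the presentation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(50, 12, size=(200, 3))   # toy interval inputs

# Input scaling: default "Standard Deviation" vs "Range"
X_std = (X - X.mean(axis=0)) / X.std(axis=0)                    # mean 0, sd 1
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) # 0 to 1

# Weight initialization: default Normal(0, 1) vs Uniform(-0.10, 0.10)
n_in, n_hidden = 3, 9
W_default = rng.normal(0.0, 1.0, (n_in, n_hidden))
W_small = rng.uniform(-0.10, 0.10, (n_in, n_hidden))

# Small uniform weights keep each hidden node's initial input near
# zero, i.e. in the near-linear region of tanh
print(np.abs(X_std @ W_default).mean(), np.abs(X_range @ W_small).mean())
```

The design intent (per the Bishop quotes later in the deck): bounded inputs times small weights means no hidden unit starts out saturated.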

Scaling Changes shown in Code

proc neural data=pva50.TRAIN50 dmdbcat=WORK.pva_DMDB
            validdata=pva50.VALIDATE50 random=&&x&i;
   netopts randist=UNIFORM ranloc=0 ranscale=0.10;
   input %INTINPUTS / level=interval id=intvl std=range;
   input %BININPUTS / level=nominal id=bin std=range;
   target TARGET_B / level=NOMINAL id=TARGET_B;
   arch MLP Hidden=&j;
   nloptions FCONV=.0000001;
   train Technique=RPROP
         learn=0.1
         maxiter=1000
         maxtime=900
         Outest=work.Neural_outest estiter=1
         Outfit=work.Neural_OUTFIT;
run;
quit;

Recommendations about Initialization


“The initial weight values are chosen to be small so that
sigmoidal activation functions are not driven into the
saturation regions …”
- C.M. Bishop
Neural Networks for Pattern Recognition

“The weights are usually generated from a simple distribution, such as a spherically symmetric Gaussian, for convenience …”
- C.M. Bishop
Neural Networks for Pattern Recognition

TanH Transformation
tanh(a) = (e^a − e^(−a)) / (e^a + e^(−a)), where a = f(x)

[Figure: plot of tanh(a) for a from −3 to 3; the curve runs between −1 and 1]
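The definition on this slide can be checked numerically; a quick sketch (nothing here is presentation-specific, and `np.tanh` serves as the reference):

```python
import numpy as np

a = np.linspace(-3, 3, 61)
tanh_manual = (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))

# The hand-built ratio matches the library function
assert np.allclose(tanh_manual, np.tanh(a))

# Near 0 the curve is roughly linear (slope 1); toward |a| = 3 it has
# already saturated to about +/-0.995, which is why small initial
# weights keep hidden units out of the flat, slow-learning regions
print(np.tanh(0.0), np.tanh(3.0))
```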

Seems to work: RProp scaled compared to Default
Assessment
• Scaling seems to reduce variability
– However, only tested on the RProp Algorithm
– Should test on other algorithms
• The PVA data is rather simple
– Only 7 variables
– Other algorithms not tested

Data 1 Data Set
• Binary Dependent variable
• Screened variables: Directly correlated with dependent
variable or having some signs of interaction
– 296 Binary Variables
– 1 binary target
• Equal number of “1’s” and “0’s” in the target: “50/50”
• 3988 observations in both the training and validation
datasets
• Average Error on the validation set is evaluated

Data 1 Results

Data 1: Reduced Range and Initialization

Data 1 Results: Observations
• Scaling steps improved all algorithms
• Back Propagation gave the most stable results
• Back Propagation gave the best results (measured on
validation data)

Data 2 Data Set


• Binary Dependent variable
• Screened variables: Directly correlated with dependent
variable or having some signs of interaction
– 38 Interval Variables
– Mostly relatively normally distributed
• 17% of target observations are “1”
• 1094 observations in both the training and validation
datasets
• Average Error on the validation set is evaluated

Data 2: Defaults

Data 2: Reduced Range and Initialization

Data 2 Results: Observations
• Scaling steps improved all algorithms
• Back Propagation gave the most stable results
• Back Propagation and Default gave the best mean
results (measured on validation data)

Can we identify something behind the stability or lack thereof?
• Is there a reason behind the variable results?
• In a Regression model we would look at parameter
stability
• In Neural Network models the parameters are
notoriously inscrutable
– However, that doesn’t hide all meaning

Variable Importance Rankings
• The absolute magnitude of the neural network weights
can be used to determine variable importance
– Averaged across all the hidden nodes
– The higher the absolute weights, the more important the variable
– Can be done with and without factoring in the hidden weights
• This is true only if the variables are on the same scale
• Can be used for variable selection
• Can be used for exploratory data analysis
– To find interesting interactions
– Flexible to all types of responses
• Has been proposed as a possibility for credit scoring
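The ranking idea can be sketched in a few lines (the weight matrix here is random, standing in for the input-to-hidden weights of a trained network on same-scale inputs):

```python
import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_hidden = 7, 9

# Stand-ins for a trained network's weights
W1 = rng.normal(0, 1, (n_inputs, n_hidden))  # input-to-hidden
w2 = rng.normal(0, 1, n_hidden)              # hidden-to-output

# Importance = mean absolute input weight, averaged over hidden nodes
importance = np.abs(W1).mean(axis=1)

# Variant that also factors in the hidden-to-output weights
importance_weighted = (np.abs(W1) * np.abs(w2)).mean(axis=1)

ranks = importance.argsort()[::-1]   # variable indices, most important first
print(ranks)
```

Only valid when inputs share a scale, as the slide notes; otherwise a large weight may just be compensating for a small input range.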

Measurement of Stability of Variable Importance Rankings
• 50 models are developed as before
• Mean Absolute weights are calculated for each variable
• Variables are ranked (in each of the 50 models) by the
Mean Absolute weights
• The Mean Rank is captured, as well as the lowest rank (highest importance) and highest rank (lowest importance)
• Therefore, the stability of which variables are important can be analyzed, not just model fit
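The measurement above can be sketched as follows (the weight stack is random, standing in for the 50 retrained PROC NEURAL models):

```python
import numpy as np

rng = np.random.default_rng(2)
n_models, n_vars, n_hidden = 50, 7, 9

# Hypothetical stack of input-to-hidden weight matrices, one per
# retrained model (in practice, one per random seed)
W = rng.normal(0, 1, (n_models, n_vars, n_hidden))

# Mean absolute weight per variable, within each model
mean_abs = np.abs(W).mean(axis=2)            # shape (models, vars)

# Rank 1 = most important variable within each model
ranks = (-mean_abs).argsort(axis=1).argsort(axis=1) + 1

for v in range(n_vars):
    r = ranks[:, v]
    print(f"var {v}: mean rank {r.mean():.1f}, best {r.min()}, worst {r.max()}")
```

A variable whose best and worst ranks stay close across the 50 models is being found consistently; a wide spread is instability in what the network learns, not merely in how well it fits.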

Data 2 Back Prop Reduced Range

Default Learning Reduced Range

Back Propagation Standard Range

Possible methods to control instability

• Learning Algorithm (addressed)
• Data Scaling (addressed)
• Data Preprocessing (not addressed)
• Network Architecture
– Number of hidden nodes (little impact, if any)
– Initial scaling of weights (addressed)
– Transformations of activation function (not addressed)

Conclusions
• 16 years of personal experience has led to this general
heuristic concerning Neural Networks:

“There are no general heuristics concerning Neural Networks”

• However, these results appear convincing
• Rare target (without over-sampling) may give different
results
• Others are encouraged to download the code used and
try on their own data

Recommendations
• Maximum iterations of 20 should be increased
– 300 for Back-Propagation and R-Prop
– 80 for Default and Double Dogleg
– Regardless, results should always be checked to make sure
learning did not stop early
• Scaling on Inputs to Range and Weights should be
drawn from a uniform distribution ± 0.10
– Hard to envision any harm in these recommendations
– Will give slightly longer training times
• Consideration should be given to looking at Back-Propagation training again
– Algorithm improvements may be available at this time
• More testing is recommended

Download Code Examples From

www.zsolutions.com/nn_code

For comments and questions, please email:


Jeff Zeanah
jeffz@zsolutions.com
