Stability
Presented at:
M2007
by:
Jeff Zeanah
President
Z Solutions, Inc.
October 1, 2007
Presentation Methodology
• Proceeds as a set of analyses
• Follows the results and thought processes of the
analyses
• Likely not one answer, but issues to understand
• Three datasets explored
What would you do?
• 10 identical Neural Network models are built with the
same data and inputs.
• On a given dataset: 3 are better than the “traditional”
model now used, 7 are not
Possible methods to control instability
• Learning Algorithm
• Data Scaling
• Data Preprocessing
• Network Architecture
– Number of hidden nodes
– Initial scaling of weights
– Transformations of activation function
PVA Data Set
• Predicting respondents from a set of lapsing donors
• Used in the 1998 KDD Cup competition
• Screened variables: Directly correlated with target
– 6 continuous variables
– 1 binary variable
• Equal number of “1’s” and “0’s” in the target: “50/50”
• 3390 observations in both the training and validation
datasets
• Average Error on the validation set is evaluated
Code Snips
• Generate Random Numbers
data random;
   call streaminit(123);                 /* fix the seed for reproducibility */
   do i = 1 to 50;
      x1 = floor(10000*rand('uniform')); /* random integer in 0-9999 */
      output;
   end;
run;
Code Snips (cont’d)
• Neural Training Code (Default algorithm)
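The code image from this slide is not reproduced here; a minimal sketch of default PROC NEURAL training, assuming the dataset names (train, valid), a DMDB catalog (traincat), and placeholder variable names for the 6 continuous inputs, 1 binary input, and the 50/50 target, might look like:

```
proc neural data=train dmdbcat=traincat validata=valid;
   input x1-x6 / level=interval;  /* 6 screened continuous variables */
   input b1 / level=nominal;      /* 1 binary variable */
   target resp / level=nominal;   /* 50/50 target */
   hidden 9;                      /* 9 hidden nodes, as used below */
   train;                         /* default training technique */
run;
```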
How does this compare?
• Compared to Logistic Regression
– A reasonable comparison is to do multiple random draws of
Training and Validation sets
• Using Proc DMREG
– Selection = Stepwise
– Choose=VERROR
• All 2-factor interactions and quadratic terms are available for selection
• Default neural training with 9 nodes
• Comparison of Validation errors
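A sketch of the DMREG comparison above, assuming the same placeholder names (train, valid, traincat, x1-x6, b1, resp); the bar/@2 notation for all 2-factor interactions is a GLM-style assumption:

```
proc dmreg data=train dmdbcat=traincat validata=valid;
   class resp b1;
   /* main effects, all 2-factor interactions, and quadratic terms */
   model resp = x1|x2|x3|x4|x5|x6|b1 @2
                x1*x1 x2*x2 x3*x3 x4*x4 x5*x5 x6*x6
                / selection=stepwise choose=verror;
run;
```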
Does the variability change?
Scaling Changes shown in Code
TanH Transformation

tanh(a) = (e^a − e^(−a)) / (e^a + e^(−a)), where a = f(x)

[Figure: plot of tanh(a) for a from −3 to 3, ranging from −1 to 1]

Copyright © Z Solutions, Inc. 2007 All rights reserved
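The transformation above can be checked in a DATA step; SAS also provides a built-in TANH function, so the two should agree to machine precision:

```
data tanh_demo;
   do a = -3 to 3 by 0.5;
      t_formula = (exp(a) - exp(-a)) / (exp(a) + exp(-a));
      t_builtin = tanh(a);               /* SAS built-in function */
      diff = abs(t_formula - t_builtin); /* should be ~0 */
      output;
   end;
run;
```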
Seems to work
Assessment
• Scaling seems to reduce variability
– However, this was tested only on the RProp algorithm
– Should test on other algorithms
• The PVA data is rather simple
– Only 7 variables
– Other algorithms not tested
Data 1 Results
Data 1 Results: Observations
• Scaling steps improved all algorithms
• Back Propagation gave the most stable results
• Back Propagation gave the best results (measured on
validation data)
Data 2: Defaults
Data 2 Results: Observations
• Scaling steps improved all algorithms
• Back Propagation gave the most stable results
• Back Propagation and Default gave the best mean
results (measured on validation data)
Variable Importance Rankings
• The absolute magnitude of the neural network weights
can be used to determine variable importance
– Averaged across all the hidden nodes
– The higher the absolute weights, the more important the variable
– Can be done with and without factoring in the hidden weights
• This is true only if the variables are on the same scale
• Can be used for variable selection
• Can be used for exploratory data analysis
– To find interesting interactions
– Flexible to all types of responses
• Has been proposed as a possibility for credit scoring
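As a sketch of the weight-magnitude calculation described above, suppose the input-to-hidden weights have been exported to a dataset named weights with one row per (input variable, hidden node) pair; the dataset name and layout are assumptions:

```
data absweights;
   set weights;
   /* magnitude only; valid only if inputs are on the same scale */
   absw = abs(w);
run;

proc means data=absweights noprint nway;
   class input_var;          /* average across all hidden nodes */
   var absw;
   output out=importance mean=mean_abs_w;
run;

proc sort data=importance;
   by descending mean_abs_w; /* highest mean |weight| = most important */
run;
```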
Data 2 Back Prop Reduced Range
Back Propagation Standard Range
Conclusions
• 16 years of personal experience have led to the following general
heuristics concerning Neural Networks:
Recommendations
• The default maximum of 20 iterations should be increased
– 300 for Back-Propagation and R-Prop
– 80 for Default and Double Dogleg
– Regardless, results should always be checked to make sure
learning did not stop early
• Inputs should be scaled to a range, and initial weights should be
drawn from a uniform distribution of ±0.10
– Hard to envision any harm in these recommendations
– Will give slightly longer training times
• Consideration should be given to looking at Back-Propagation
training again
– Algorithm improvements may be available at this time
• More testing is recommended
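Applied to PROC NEURAL, the recommendations might translate into something like the following sketch. TECHNIQUE= and MAXITER= are standard TRAIN statement options, but the weight-initialization option names on NETOPTIONS (RANDIST=, RANSCALE=) are assumptions to verify against the PROC NEURAL documentation:

```
proc neural data=train dmdbcat=traincat validata=valid;
   /* assumed option names: uniform initial weights on +/- 0.10 */
   netoptions randist=uniform ranscale=0.10;
   input x1-x6 / level=interval;
   input b1 / level=nominal;
   target resp / level=nominal;
   hidden 9;
   /* raise the iteration cap: 300 for bprop/rprop,
      80 for default/dbldog, per the recommendations */
   train technique=bprop maxiter=300;
run;
```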
Download Code Examples From
www.zsolutions.com/nn_code