
KDD-98: A Comparison of Leading Data Mining Tools

A Comparison of Leading Data Mining Tools


John F. Elder IV & Dean W. Abbott
Elder Research

Fourth International Conference on Knowledge Discovery & Data Mining
Friday, August 28, 1998
New York, New York


Copyright 1998, John F. Elder IV and Dean W. Abbott

All rights reserved

Manufactured in the United States of America

1998 Elder Research

T8-2

updated October 19, 1998


Contacting Elder Research


http://www.datamininglab.com

Dr. John F. Elder IV
1006 Wildmere Place
Charlottesville, VA 22901
elder@datamininglab.com
804-973-7673
Fax: 804-995-0064

Dean W. Abbott
3443 Villanova Avenue
San Diego, CA 92122-2310
dean@datamininglab.com
619-450-0313


Tutorial Goals
Compare and Summarize Data Mining Tools which:
  Offer multiple modeling and classification algorithms
  Support project stages surrounding model construction
  Stand alone
  Are general-purpose
  Cost a lot
  We could get our hands on

Include some (focused) Desktop Tools

Other Reports: Two Crows, Aberdeen Group, Elder Research (forthcoming), Data Mining Journal

Topics
Products covered
Review of algorithms
Comparative tables of properties
Screen shots exemplifying qualities
Summary of distinctives


Caveats
We don't know every tool well (and are sure to have missed some!)
  Level of exposure noted for each tool

Our background (biasing our perspective)
  Very technical, early adopters
  Emphasize solving real-world applications
  More classification than estimation

Field of tools is quite dynamic
  New versions appear regularly

Data Mining Products


[Figure: logos of the products covered, including PRW and Model 1]


Tools Evaluated
Product            Company                     Version Tested  Our Experience  URL
Clementine         Integral Solutions, Ltd.    4               Moderate        http://www.isl.co.uk/clem.html
Darwin             Thinking Machines, Corp.    3.0.1           Moderate        http://www.think.com/html/products/products.htm
DataCruncher       DataMind                    2.1.1           High            http://www.datamindcorp.com
Enterprise Miner   SAS Institute               Beta            Moderate        http://www.sas.com/software/components/miner.html
GainSmarts         Urban Science               4.0.3           Low             http://www.urbanscience.com/main/gainpage.htm
Intelligent Miner  IBM                         2               Low             http://www.software.ibm.com/data/iminer/
MineSet            Silicon Graphics, Inc.      2.5             Low             http://www.sgi.com/Products/software/MineSet/
Model 1            Group 1/Unica Technologies  3.1             Moderate        http://www.unica-usa.com/model1.htm
ModelQuest         AbTech Corp.                1               Moderate        http://www.abtech.com
PRW                Unica Technologies, Inc.    2.1             High            http://www.unica-usa.com/prodinfo.htm
CART               Salford Systems             3.5             Moderate        http://www.salford-systems.com
NeuroShell         Ward Systems Group, Inc.    3               Moderate        http://www.wardsystems.com/neuroshe.htm
OLPARS             PAR Government Systems      8.1             High            mailto://olpars@partech.com
Scenario           Cognos                      2               Moderate        http://www.cognos.com/busintell/products/index.html
See5               RuleQuest Research          1.07            Moderate        http://www.rulequest.com/see5-info.html
S-Plus             MathSoft                    4               High            http://www.mathsoft.com/splus/
WizWhy             WizSoft                     1.1             Moderate        http://www.wizsoft.com/why.html


Categories for Comparisons


Platforms Supported
Algorithms Included
  Decision Trees
  Neural Networks
  Other
Data Input and Model Output Options
Usability Ratings
Visualization Capabilities
Modeling Automation Methods


Platforms

[Table: platform support by tool; the rating marks are not recoverable]
Columns: Unix Standalone; PC Standalone (95/NT); Unix Server / PC Client; NT Server / PC Client; Database Connectivity
Rows: Clementine, Darwin, DataCruncher, Enterprise Miner, GainSmarts, Intelligent Miner, MineSet, Model 1, ModelQuest, PRW, CART, Scenario, NeuroShell, OLPARS, See5, S-Plus, WizWhy
Key: four levels, from blank (no capability) through some, good, and excellent capability


Tool Groupings
Desktop
  PC (standalone)
  Flat Files
  One or Two Algorithms
  Data Fits into RAM

High End
  Multiple Platforms, Client-Server
  Flat Files or Direct Database Access
  Multiple Algorithm Types
  Large Databases


End User Perspectives


Business
Intuitive Interface
  Clear steps in data mining process
  Non-technical terminology
  Familiar environment

Technical
Algorithm Options
  Knobs to enhance model performance

Model Automation
  Simplify model design cycle
  Documentation of steps used in generating models (repeatability)

Descriptive Reporting
  Domain terminology
  Graphical representations


Data Input & Model Output


[Table: data input and model output options by tool; the rating marks are not recoverable]
Columns: Automatic Header; Save Data Format; Native ODBC Database Drivers; Summary Reports; Output Source Code
Rows: Clementine, Darwin, DataCruncher, Enterprise Miner, GainSmarts, Intelligent Miner, MineSet, Model 1, ModelQuest, PRW, CART, Scenario, NeuroShell, OLPARS, See5, S-Plus, WizWhy


Decision Trees
[Figure: example decision tree with yes/no splits on a > 4, b > 3.5, b > 2, and a > 1, ending in leaf classes 1 and 0]
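A decision tree like the one pictured classifies a case by answering a chain of threshold questions. The sketch below is a minimal Python illustration, not the tutorial's own tree: the split tests (a > 4, b > 3.5, b > 2, a > 1) and the leaf classes 1 and 0 come from the slide, but the way they are wired together is an assumption, since the slide's layout is not recoverable here. The names `Leaf`, `Split`, `classify`, and `TREE` are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: int  # class assigned to any case that reaches this leaf


@dataclass
class Split:
    feature: str      # name of the input being tested
    threshold: float  # the test is: value > threshold
    no: "Node"        # branch taken when the test fails
    yes: "Node"       # branch taken when the test passes


Node = Union[Leaf, Split]


def classify(node, case):
    """Walk from the root to a leaf, answering one threshold test per level."""
    while isinstance(node, Split):
        node = node.yes if case[node.feature] > node.threshold else node.no
    return node.label


# Hypothetical arrangement of the four tests shown on the slide.
TREE = Split(
    "a", 4.0,
    no=Split("b", 2.0,
             no=Leaf(0),
             yes=Split("a", 1.0, no=Leaf(0), yes=Leaf(1))),
    yes=Split("b", 3.5, no=Leaf(1), yes=Leaf(0)),
)

print(classify(TREE, {"a": 5.0, "b": 3.0}))
```

Each root-to-leaf path reads as a rule (for example, "a > 4 and not b > 3.5"), which is one reason trees are usually counted among the more interpretable model types.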

Polynomial Networks
$z_{17} = 3.1 + 0.4a - 0.15b^2 + 0.9bc - 0.62abc + 0.5c^3$

[Figure: polynomial network diagram. Inputs a, k, f, d, h pass through Layer 0 (Normalizers N1, N4, N5, N6, N8, N9); Layers 1 to 3 contain Single, Double, Triple, and MultiLinear nodes (Single 14, MultiLinear 15, Double 16, Triple 17, Double 19, Double 20, Triple 21); Unitizers U2 and U7 produce outputs Y1 and Y2]
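The node equation on this slide can be evaluated directly. The sketch below is an illustration only: it assumes the flattened "b2" and "c3" in the printed equation are the powers b squared and c cubed, and it treats a, b, and c as already-normalized inputs to a single three-input node. The function name `z17` is a hypothetical label taken from the node number.

```python
def z17(a, b, c):
    """Evaluate the example three-input polynomial node from the slide.

    Assumes the printed terms 'b2' and 'c3' mean b**2 and c**3.
    """
    return (3.1
            + 0.4 * a
            - 0.15 * b ** 2
            + 0.9 * b * c
            - 0.62 * a * b * c
            + 0.5 * c ** 3)


# With all inputs at zero only the constant term remains.
print(z17(0.0, 0.0, 0.0))
```

In a polynomial network, the output of a node like this becomes an input to nodes in later layers, so the final model is a composition of small polynomials rather than one large one.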


Consensus Models
Parametrically Summarize Data Points
Regression (orders, terms)
Polynomial Networks (e.g., GMDH, ASPN)
Decision Trees (e.g., CART, CHAID, C5)
Logistic or Sigmoidal Networks (ANNs)
Hinging Hyperplanes, MARS
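Regression is the simplest consensus model: the data are summarized by a handful of fitted parameters, and the training cases are no longer needed at prediction time. The sketch below is a minimal one-input, first-order least-squares fit, offered as an illustration rather than as any tool's implementation; `fit_line` and `predict` are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(params, x):
    """Predict from the two stored parameters alone; no training data needed."""
    b0, b1 = params
    return b0 + b1 * x


params = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(predict(params, 10.0))
```

Raising the order or adding terms, the two knobs noted on the slide, would add more parameters to the tuple, but the principle stays the same.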


Consensus Models (continued)


Histogram (orientation, bin width)
Radial Basis Function (function)
Wavelets (family, order)


Contributory Models
Retain data points; each potentially affects the estimate at a new point

Kernels (shape, spread)
k-Nearest Neighbor (k, distance metric)

Goal, iterations

Delaunay Planes
Projection Pursuit Regression



Spread, index
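Two of the contributory models named on this slide can be sketched in a few lines. In both, the training cases are kept and consulted at prediction time. The sketch below is illustrative only: it assumes Euclidean distance for the nearest-neighbor metric and a Gaussian shape for the kernel, and the names `knn_estimate` and `kernel_estimate` are hypothetical.

```python
import math


def knn_estimate(train, x, k=3):
    """Average the targets of the k stored cases closest to x.

    train is a list of (inputs, target) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda case: math.dist(case[0], x))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def kernel_estimate(train, x, spread=1.0):
    """Weighted average of all targets, with Gaussian weights that fall off
    with distance from x; 'spread' controls how fast they fall off."""
    weights = [math.exp(-0.5 * (math.dist(inputs, x) / spread) ** 2)
               for inputs, _ in train]
    total = sum(weights)
    return sum(w * target for w, (_, target) in zip(weights, train)) / total


train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((10.0,), 10.0)]
print(knn_estimate(train, (1.0,), k=3))
print(kernel_estimate(train, (1.0,), spread=1.0))
```

The tuning parameters listed on the slide appear here as function arguments: `k` and the distance function for nearest neighbor, and the kernel shape and `spread` for the kernel estimator.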

Properties of Algorithms
[Table: each algorithm rated on each property; the individual ratings are not recoverable]
Algorithms: Classical (LR, LDA); Neural Networks; Visualization; Decision Trees; Polynomial Networks; K-Nearest Neighbors; Kernels
Properties: Accurate; Scalable; Interpretable; Useable; Robust; Versatile; Fast; Hot
Key: good, neutral, bad


Algorithms

[Table: algorithms offered by each tool; the rating marks are not recoverable]
Columns: Decision Trees; Rule Induction; Multi-layer Perceptrons; Radial Basis Functions; Kohonen; Polynomial Networks; Nearest Neighbor; K Means; Bayes; Linear/Statistical; Generalized Linear Models; Association Rules; Sequential Discovery; Time Series
Rows: Clementine, Darwin, DataMind, Enterprise Miner, GainSmarts, Intelligent Miner, MineSet, Model 1, ModelQuest, PRW, CART, Cognos, NeuroShell, OLPARS, See5, S-Plus, WizWhy


Multi-Layer Perceptrons

[Table: multi-layer perceptron options by tool; the rating marks are not recoverable]
Columns: Learning Rate; Momentum; Learning Rate Decay; Advanced Learning Alg.; Multiple Activation Functions; Other Cost Functions; Multiple Stop Criteria; Normalize Inputs; Cross-Validation; Automatic Model Selection; Network Visual; Parameter Summary
Rows: Clementine, Darwin, Enterprise Miner, Intelligent Miner, Model 1, PRW, NeuroShell, OLPARS


Decision Trees

[Table: decision tree options by tool; the rating marks are not recoverable]
Columns: "CART"; C5 or C4.5; CHAID; Other; Priors; Classification Costs; Pruning Severity; Missing Data; Visual Trees
Rows: Clementine, Darwin, Enterprise Miner, GainSmarts, Intelligent Miner, MineSet, Model 1, ModelQuest, CART, Scenario, S-Plus, See5


Regression / Stats
[Table: regression and statistics options by tool; the rating marks are not recoverable]
Columns: Linear; Logistic; Complexity Penalty; Cross-Validation; Input Selection; Factor Analysis
Rows: Clementine, Enterprise Miner, GainSmarts, Intelligent Miner, MineSet, Model 1, ModelQuest Enterprise, PRW, S-Plus, Scenario


Usability
[Table: usability ratings by tool; the rating marks are not recoverable]
Columns: Data Loading and Manipulation; Model Building; Model Understanding; Technical Support; Overall
Rows: Clementine, Darwin, DataCruncher, Enterprise Miner, GainSmarts, Intelligent Miner, MineSet, Model 1, ModelQuest Enterprise, PRW, CART, Scenario, NeuroShell, OLPARS, See5, S-Plus, WizWhy


Visualization
[Table: visualization capabilities by tool; the rating marks are not recoverable]
Columns: Histograms; Pie Charts; Scatter/Line Plots; Rotating Scatter; Conditional Plots; Classification Decision Regions; Correlation Plots
Rows: Clementine, Darwin, DataCruncher, Enterprise Miner, GainSmarts, Intelligent Miner, MineSet, Model 1, ModelQuest Enterprise, PRW, CART, Scenario, NeuroShell, OLPARS, See5, S-Plus, WizWhy


Automation
Tool               Method of Automation
Clementine         Visual Programming, Programming Language
Darwin             Programming Language
DataCruncher       (Task manager)
Enterprise Miner   Visual Programming, Programming Language
GainSmarts         Macro Language, Wizards
Intelligent Miner  (Wizards)
MineSet            Data History, Log
Model 1            Model Wizard
ModelQuest         Batch Agenda
PRW                Experiment Manager; Macros
CART               Built-in Basic Scripting
Scenario
NeuroShell
OLPARS
See5
S-Plus             Scripting (S); C/C++
WizWhy

[A second column, Free Text Annotation of Steps, is marked per tool; the marks are not recoverable]


A Recent Breakthrough: Bundling


1) Construct varied models, and 2) Combine their estimates

Generate component models by varying:
  Case Weights
  Data Values
  Guiding Parameters
  Variable Subsets

Combine estimates using:
  Estimator Weights
  Voting
  Advisor Perceptrons
  Partitions of Design Space
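The two bundling steps can be sketched with one choice from each list above: component models are varied by giving each a different variable subset (here, a single variable), and their estimates are combined by voting. This is a minimal illustration under those assumptions, not a method taken from any of the tools reviewed; `fit_stump`, `bundle`, and `vote` are hypothetical names, and the component model is a deliberately simple one-threshold classifier for two classes labeled 0 and 1.

```python
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a one-variable threshold classifier on the given feature index.

    Tries every observed value as a threshold, in both orientations, and
    keeps the one with the fewest training errors.
    """
    best = None
    for t in sorted({r[feature] for r in rows}):
        for hi in (0, 1):  # class predicted when the value exceeds t
            errors = sum((hi if r[feature] > t else 1 - hi) != y
                         for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, t, hi)
    _, t, hi = best
    return lambda r: hi if r[feature] > t else 1 - hi


def bundle(rows, labels, features):
    """Step 1: construct varied models, one per variable."""
    return [fit_stump(rows, labels, f) for f in features]


def vote(models, row):
    """Step 2: combine the component estimates by majority vote."""
    return Counter(m(row) for m in models).most_common(1)[0][0]


rows = [(1, 5, 0), (2, 6, 1), (3, 1, 0), (8, 7, 9), (9, 2, 8), (7, 8, 9)]
labels = [0, 0, 0, 1, 1, 1]
models = bundle(rows, labels, features=[0, 1, 2])
print(vote(models, (8, 1, 9)))
```

Using an odd number of component models avoids tied votes in the two-class case.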

Example Bundling Techniques


Bayes: sum estimates of possible models, weighted by priors
GMDH (Ivakhnenko 68) -- multiple layers of quadratic polynomials, using two inputs each, fit by LR
Stacking (Wolpert 92) -- train a 2nd-level (LR) model using leave-1-out estimates of 1st-level (neural net) models
Bagging (Breiman 96) (bootstrap aggregating) -- bootstrap data (to build trees mostly); take majority vote or average
Bumping (Tibshirani 97) -- bootstrap, select single best
Boosting (Freund & Schapire 96) -- weight error cases by $\beta_t = (1 - e_t)/e_t$, iteratively re-model; weight model $t$ by $\ln(\beta_t)$
Crumpling (Anderson & Elder 98) -- average cross-validations
Born-Again (Breiman 98) -- invent new X data...
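The boosting entry above compresses one round of re-weighting into a single line. The sketch below spells that round out, assuming case weights that sum to 1, a weighted error rate e strictly between 0 and 0.5, and the name beta for the factor (1 - e)/e; `boost_round` is a hypothetical function name, and the surrounding loop that refits a model on the new weights is omitted.

```python
import math


def boost_round(weights, is_error):
    """One re-weighting step of boosting.

    weights:  current case weights, summing to 1
    is_error: True for each case the current model misclassified
    Returns the new normalized case weights and the model's voting weight.
    """
    e = sum(w for w, wrong in zip(weights, is_error) if wrong)  # weighted error
    if not 0.0 < e < 0.5:
        raise ValueError("this sketch assumes 0 < e < 0.5")
    beta = (1.0 - e) / e
    # Error cases are up-weighted by beta; correct cases keep their weight.
    raw = [w * beta if wrong else w for w, wrong in zip(weights, is_error)]
    total = sum(raw)
    return [w / total for w in raw], math.log(beta)


new_weights, model_weight = boost_round([0.25, 0.25, 0.25, 0.25],
                                        [True, False, False, False])
print(new_weights, model_weight)
```

After the update, the misclassified cases carry half of the total weight, so the next model is pushed toward the cases the current one got wrong.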

Distinctives
Clementine
  Strengths: visual interface; algorithm breadth
  Weaknesses: scalability
Darwin
  Strengths: efficient client-server; intuitive interface options
  Weaknesses: no unsupervised; limited visualization
DataCruncher
  Strengths: ease of use
  Weaknesses: single algorithm
Enterprise Miner
  Strengths: depth of algorithms; visual interface
  Weaknesses: harder to use; new product issues
GainSmarts
  Strengths: data transformations, built on SAS; algorithm option depth
  Weaknesses: no unsupervised; limited visualization
Intelligent Miner
  Strengths: algorithm breadth; graphical tree/cluster output
  Weaknesses: few algorithm options; no automation
MineSet
  Strengths: data visualization
  Weaknesses: few algorithms; no model export
Model 1
  Strengths: ease of use; automated model discovery
  Weaknesses: really a vertical tool
ModelQuest
  Strengths: breadth of algorithms
  Weaknesses: some non-intuitive interface options
PRW
  Strengths: extensive algorithms; automated model selection
  Weaknesses: limited visualization
CART
  Strengths: depth of tree options
  Weaknesses: difficult file I/O; limited visualization
Scenario
  Strengths: ease of use
  Weaknesses: narrow analysis path
NeuroShell
  Strengths: multiple neural network architectures
  Weaknesses: unorthodox interface; only neural networks
OLPARS
  Strengths: multiple statistical algorithms; class-based visualization
  Weaknesses: dated interface; difficult file I/O
See5
  Strengths: depth of tree options
  Weaknesses: limited visualization; few data options
S-Plus
  Strengths: depth of algorithms; visualization; programmable/extendable
  Weaknesses: limited inductive methods; steep learning curve
WizWhy
  Strengths: ease of use; ease of model understanding
  Weaknesses: limited visualization


Closing Observations
Data Mining Tools Can:
  Enhance inference process
  Speed up design cycle

Data Mining Tools Cannot:
  Substitute for statistical and domain expertise

Users are advised to:
  Get training on tools
  Be alert for product upgrades

Forthcoming Report
The report provides a detailed comparison of high-end data mining tools, including capabilities, ease of use, and practical tips. Available for $695 from Elder Research (http://www.datamininglab.com), Q4 1998. Purchasers receive a brief free consulting session to explore report findings in more detail, if desired.
Note: The analyses and reviews were performed completely independently, and were made possible by the cooperation of the vendors, for which Elder Research is very grateful. The companies, however, provided no financial support and had no influence on the report's editorial content.