
D2K Tutorial

Supercomputing 2003

Loretta Auvil
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
217.265.8021
lauvil@ncsa.uiuc.edu
Outline

• Overview of D2K Functionality


• Hands-On Exercise: Predictive Modeling
• Classification
– Using Naïve Bayesian
– Using Decision Trees
• Hands-On Exercise: Discovery
• Rule Association
– Using SQL Htree
• Clustering
• Deviation Detection
• Visualization
– Parallel Coordinates
– Small Multiples of scatterplots

alg | Automated Learning Group


Goals

• Understanding the Knowledge Discovery in Databases Process


• Gaining Knowledge of Basic Data Mining Operations and Techniques
• Understanding the Role of the Knowledge Discovery Framework
• Key Issues in Utilization of D2K Framework
• Understanding the Role of Information Visualization in Data Mining



Overview of Knowledge Discovery

What is It?

Knowledge Discovery in Databases is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data

• The understandable patterns are used to:


• Make predictions about or classifications of new data
• Explain existing data
• Summarize the contents of a large database to support decision making
• Create graphical data visualization to aid humans in discovering complex
patterns



Overview of Knowledge Discovery

Knowledge Discovery Process



Overview of Knowledge Discovery

Required Effort for each KDD Step

[Bar chart: Effort (%) on the vertical axis, from 0 to 60, for each KDD step: Objectives Determination, Data Preparation, Data Mining, Interpretation/Evaluation. Arrows indicate the direction we want the effort to go.]



Overview of Knowledge Discovery

Three Primary Paradigms

• Predictive Modeling – supervised learning approach where
classification or prediction of one of the attributes is desired
• Classification is the prediction of predefined classes
– e.g. Naive Bayesian, Decision Trees, and Neural Networks
• Regression is the prediction of continuous data
– e.g. Neural Networks, and Decision (Regression) Trees
• Discovery – unsupervised learning approach for exploratory data
analysis
• e.g. Association Rules, Link Analysis, Clustering, and Self Organizing Maps
• Deviation Detection – identifying outliers in the data
• e.g. Visualization



Importance of Data Mining Framework

• Provides capability to build custom applications


• Provides access to data management tools
• Loading data from database, flat file or DataSpaces
• Contains data mining algorithms for prediction and discovery that
can be applied
• Provides data transformations for standard operations
• Supports an extensible interface for creating one’s own algorithms
• Provides means for building and applying models
• Provides integrated visualization components
• Provides access to distributed computing capabilities



D2K Overview
D2K - Data To Knowledge

D2K is a flexible data mining system that integrates
effective analytical data mining methods for prediction,
discovery, and anomaly detection with data management
and information visualization



D2K Overview
D2K and Its Many Components

• D2K Infrastructure
D2K API, data flow environment,
distributed computing framework
and runtime system
• D2K Modules
Computational units written in Java
that follow the D2K API
• D2K Itineraries
Modules that are connected to form
an application
• D2K Toolkit
User interface for specification of
itineraries and execution that
provides the rapid application
development environment
• D2K-Driven Applications
Applications that use D2K modules,
but do not need to run in the D2K
Toolkit



D2K Overview
D2K Toolkit

Major features that D2K provides
to an application developer
include:

• Visual programming system
employing a data flow
paradigm
• Scalable distributed computing
capabilities
• Flexible and extensible
software development
environment
• Multi-layered learning
strategies
• Integrated environment for
models and visualization
• Capability to access data
transparently from multiple
sources



D2K Overview
D2K Basic 1.0
• New release of D2K 3.0
• New release of the D2K Toolkit
• New release of a set of D2K Modules to perform data mining techniques
• Prediction
– Decision Trees
C4.5 Decision Tree, Continuous Decision Tree, SQL Rain Forest Decision Tree
– Naïve Bayesian Classification and SQL Naïve Bayesian Classification
– Neural Networks
• Discovery
– Rule Association
Apriori, Htree
– Clustering
Hierarchical Agglomerative, Kmeans, Coverage, etc.
• Better documentation for Toolkit and modules
• Includes visualizations for many of the modeling approaches
• Includes a set of data transformations
• Attribute selection, binning, filtering, attribute construction
• Includes optimization strategy for searching parameter space
• Plus more…



D2K Overview
D2K 3.0 Features

• Current Release downloadable off our website


• Extension of existing API
• Provides the capability to programmatically connect modules and set properties
• Allows D2K-driven applications to be developed
• Provides ability to pause and restart an itinerary
• Enhanced Distributed Computing
• Allows modules that are re-entrant to be executed remotely
• Uses Jini services to look up distributed resources
• Includes interface for specifying the runtime layout of a distributed itinerary
• Processor Status Overlay
• Shows utilization of distributed computing resources
• Distributed Checkpointing
• Resource Manager
• Provides a mechanism for treating selected data structures as if they were stored in
global memory
• Provides memory space that is accessible from multiple modules running locally as
well as remotely



D2K Overview
New D2K 4.0 Highlights

• Ability to use the web for deployment


• Ability for modules to run headless (with no gui)
• Changed the way itineraries are saved
• Stored in zip file
• Itinerary is described in an xml format
• Annotation is saved in html format
• Additional data is stored in a serialized HashMap
• Table structure was re-implemented to improve performance and
simplify the API
• Improvements of module selection, with area selection
• Support of copy and paste of selected modules



D2K Overview
D2K ToolKit

1. Workspace
2. Resource Panel
3. Modules
4. Models
5. Itineraries
6. Visualizations
7. Generated Visualizations
8. Generated Models
9. Component Information
10. Toolbar
11. Console



D2K Overview
D2K Modules

Input Module: Loads data from the outside world


• Flat files, database, etc.

Data Prep Module: Performs functions to select, clean, or transform the data
• Binning, Normalizing, Feature Selection, etc.

Compute Module: Performs main algorithmic computations


• Naïve Bayesian, Decision Tree, Apriori, etc.

User Input Module: Requires interaction with the user


• Data Selection, Input and Output selection, etc.

Output Module: Saves data to the outside world


• Flat files, databases, etc.

Visualization Module: Provides visual feedback to the user


• Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot,
3D Surface Plot



D2K Overview
D2K Module Icon Description

Module Progress Bar
Appears during execution to show the
percentage of time that this
module executed over the entire
execution time. It is green when
the module is executing and red
when not

Input Port
Rectangular shapes on the left side of
the module represent the inputs
for the module. They are colored
according to the data type that
they represent

Output Port
Rectangular shapes on the right
side of the module represent the
outputs for the module. They are
colored according to the data
type that they represent

Properties Symbol
If a “P” is shown in the lower left
corner of the module, then the
module has properties that can be
set before execution



D2K Overview
Resource Panel

The area to the left of the Workspace that contains the components
necessary for data analysis
• Modules
• Models
• Itineraries
• Visualizations



D2K Overview
D2K Itineraries

• Itineraries are partial or
complete applications
composed of connected
modules
• D2K Core Itineraries
include:
• Prediction
• Discovery
• Anomaly Detection
• Data Selection
• Transformation
• Visualization



D2K Overview
Workspace

The Workspace is the area where applications are formed


• Modules are placed, connected, and properties set
• Itineraries are saved and executed



D2K Overview
Session Panes

• Component Information
• Shows detailed information about components of D2K
• Shows module information, inputs, outputs, and property descriptions
• Shows itinerary annotation
• Generated Visualization
• Shows visualizations generated during this session
• Provides ability to save these visualizations for later use
• Generated Models
• Shows models generated during this session
• Provides ability to save these models for later use



D2K Overview
D2K Setup

• Preferences
• Written to a file called “d2k.props”
• Set up automatically the first time D2K is installed
• Changed via Edit menu… Preferences…
• Some changes do require restart of D2K
• Check the User Manual for more details (available online)



D2K Overview
Using the Toolkit

Build an itinerary for loading data and
viewing it in a TableViewer
• Drag and Drop Modules from
Modules Pane of Resource Panel to
the Workspace as shown
• Expand directory ncsa/io/file/input
– Drag and Drop Input1Filename to
Workspace
– Drag and Drop
CreateDelimitedParser to
Workspace
– Drag and Drop ParseFileToTable
to Workspace
• Expand directory ncsa/vis
– Drag and Drop TableViewer to
Workspace



D2K Overview
Using the Toolkit (cont’d)

Connect the modules as shown

• Drag from the output port of
one module to the input port
of the next module
• Check the properties of
modules by double clicking
on the module
• Input File Name
– Choose data/UCI/iris.csv
• Create Delimited File Parser
– Defaults work
• Parse File To Table
– Defaults work
• Click Run to execute



D2K Overview
Variation Using a Nested Itinerary

• An itinerary can be used
as a module – nested
itinerary
• Properties can be set by
holding Control and double
clicking on the nested
itinerary
• Then connect the
input and output ports of
the nested itinerary as one
would any other module



PREDICTIVE MODELING

CLASSIFICATION
NAÏVE BAYESIAN



Predictive Modeling: Naïve Bayesian
Naïve Bayesian Classification

• Applied to supervised learning
problem
• Expects training examples with
input and output attributes
• Single output attribute with small
number of possible values for best
performance
• Computes the distribution of an
input associated with each class,
for example, given the variable X
with a value at xi the probability
of it being in Class A is greater
than it being in Class B

Mathematically speaking: if P(X | C) and the
densities P(xi) and P(cj) (prior probabilities)
are known, then the classifier is one which assigns
class cj to datum xi if cj has the highest
posterior probability given the data



Predictive Modeling: Naïve Bayesian
Bayesian Classification: Why?

• Probabilistic learning: Calculates explicit probabilities for
hypotheses; it is among the most practical approaches to certain
types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
• Prior knowledge: Can be combined with observed data
• Standard:
• Provide a standard of optimal decision making against which other methods
can be measured
• In a simpler form, provide a baseline against which other methods can be
measured



Predictive Modeling: Naïve Bayesian
Naïve Bayesian Classification

• Naïve assumption:
• Feature independence
• P(xi|C) is estimated as the relative frequency of examples having
value xi as feature in class C
• Computationally easy!!!
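
The relative-frequency estimate and the independence assumption can be sketched in a few lines of Python. This is a minimal illustration, not the D2K module: the function names and the toy weather data are hypothetical, and no smoothing is applied, so a value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, class)][value] = number of examples
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def posterior(row, class_counts, value_counts):
    """Return P(class | row) for every class, assuming independent features."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for i, value in enumerate(row):
            # P(x_i | c) estimated as a relative frequency within class c
            score *= value_counts[(i, c)][value] / n_c
        scores[c] = score
    norm = sum(scores.values())
    return {c: (s / norm if norm else 0.0) for c, s in scores.items()}

# Hypothetical toy data: (outlook, wind) -> decision
rows = [("sunny", "calm"), ("sunny", "windy"), ("rain", "calm"), ("rain", "windy")]
labels = ["play", "stay", "play", "stay"]

class_counts, value_counts = train_naive_bayes(rows, labels)
result = posterior(("sunny", "calm"), class_counts, value_counts)
print(result)
```

Training is a single counting pass over the data, which is why the method is computationally cheap.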



Predictive Modeling: Naïve Bayesian
Classification Applications Using Naïve Bayesian
• Predict a response to a
marketing campaign
• Predict the most
profitable customers
for a product or
service
• Classify applicants as
high/med/low risk
• Predict which
customers will leave
for a competitor
• Predict whether email
message is SPAM or not



Predictive Modeling: Naïve Bayesian
Opening the Itinerary

• Click on the “Itinerary” Pane in the Resource Panel


• Expand the “Prediction” directory with a single click
• Double click on “NaïveBayes” to load the itinerary into your Workspace



Predictive Modeling: Naïve Bayesian
Executing the Itinerary

• Check modules with
properties
• Double click to open property
editor
• Respond to User Interfaces
that open
• Click Run button
• Respond to GUI’s that pop-up



Predictive Modeling: Naïve Bayesian
PredictionTableReport for iris data

Double click on the
PredictionTableReport to launch the
report that shows the classification
error and confusion matrix for the
data



Predictive Modeling: Naïve Bayesian
Naïve Bayesian Visualization
• Double click on the
NaiveBayesVis to view the
results
• The upper right hand pane
shows the distribution of the
classes
• The left hand pane shows the
attributes and each of their
values. They are listed by
order of significance
• The message box shows details
about each pie chart when
brushed
• Clicking on a pie chart shows
how knowing this information
can change the overall class
prediction
• Clicking on multiple pie charts
calculates conditional
probabilities

Notice Iris-versicolor has a 33%
likelihood



Predictive Modeling: Naïve Bayesian
Naïve Bayesian Visualization

What if scenarios…
• Click on petal-width of
1.3:1.9
• Now the probability of
Iris-versicolor is
66.37%



Predictive Modeling: Naïve Bayesian
Naïve Bayesian Visualization

What if scenarios…
continue with
conditional probability
calculations
• Click on petal-length of
3.95:5.32
• Click on sepal-length of
5.28:6.15
• Now the probability of
Iris-versicolor is 94.99%



Predictive Modeling: Naïve Bayesian
Applying Models
• In Generated Models Session Pane, right click on the model and
choose Save
• The saved model shows up in the Model View of the Resource Panel
• Click and drag the model into the workspace
• Connect the input and output of the model as shown



PREDICTIVE MODELING

CLASSIFICATION
Decision Trees



Predictive Modeling: Decision Trees
Decision Trees Classification

• Supervised learning problem


• Builds a model to classify one
attribute based on other data
attributes
• Builds the tree by deciding how to
split the data so that classification
error is reduced
• Shown is a decision tree predicting
whether one will play tennis based
on some weather conditions
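
The idea of choosing a split that reduces classification error can be sketched as follows. This is a simplified illustration under stated assumptions: it scores candidate thresholds on a single numeric attribute by raw misclassification count, whereas tree learners such as C4.5 use an information-based criterion (gain ratio). The function names and the sample values are hypothetical.

```python
from collections import Counter

def misclassified(labels):
    """Errors made by predicting the majority class for this group."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_split(values, labels):
    """Return (threshold, errors) for the test `value <= threshold` with fewest errors."""
    best = (None, misclassified(labels))  # baseline: no split at all
    # The largest value is skipped because it would leave the right side empty
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        errors = misclassified(left) + misclassified(right)
        if errors < best[1]:
            best = (threshold, errors)
    return best

# Hypothetical petal widths and flower types
widths = [0.2, 0.4, 0.3, 1.3, 1.5, 1.4]
kinds = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

split = best_split(widths, kinds)
print(split)
```

A full tree builder would apply the same search to every attribute, keep the best one, and recurse on each side of the split.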



Predictive Modeling: Decision Trees
Applications Using Decision Trees

• Decision trees can solve both
classification and regression
problems
• Decision Trees work for many
of the same problems as
Naïve Bayesian analysis
• Prediction of who should be
given a loan
• Prediction of high/med/low
risk



Predictive Modeling: Decision Trees
PredictionTableReport for iris data

Double click on the
PredictionTableReport to launch the
report that shows the classification
error and a confusion matrix for the
data
Note: This is a very clean data set



Predictive Modeling: Decision Trees
Decision Tree Visualization
Two main panes
• Navigator Pane shown in the
top left pane illustrates the
full decision tree, the
viewable decision tree is
shown with a black box
outline
• Viewable Tree shows a chart
of the percentages of the
examples in each of the
classes
• Brushing indicates the
percentages in the Brushing
Pane
• Clicking on a small chart opens
a larger view of the chart
showing the complete path
taken to get to this node



Predictive Modeling: Decision Trees
Using the Model

• In Generated Models Session
Pane, right click on the
model and choose Save. The
saved model shows up in the
Model View of the Resource
Panel
• Click and drag the model into
the workspace (shown in the
green circle; disconnect the
items in the red blob)
• Connect the input and output
of the model as shown
• Results can be sent to the
PredictionTableReport and to
the DecisionTreeVis
• New (test) data can be
examined with the model



DISCOVERY
RULE ASSOCIATION
Using fp-growth



Discovery: Rule Association
Market Basket Example

• Where should detergents be placed in the
store to maximize their sales?

• Are window cleaning products purchased
when detergents and orange juice are
bought together?

• Is soda typically purchased with bananas?
Does the brand of soda make a difference?

• How are the demographics of the
neighborhood affecting what customers
are buying?



Discovery: Rule Association
Association Rules

• There has been a considerable amount of research in the area of
Market Basket Analysis. Its appeal comes from the clarity and utility
of its results, which are expressed in the form of association rules

• Given
• Database of transactions
• Each transaction contains a set of items

• Find all rules X->Y that correlate the presence of one set of items X
with another set of items Y
• Example: When a customer buys bread and butter, they buy milk 85% of
the time



Discovery: Rule Association
Overview

• Unsupervised learning problem


• Find all rules that correlate the presence of one set of items X with
another item Y
• Example: When a customer buys bread and butter, they buy milk 85% of the
time
• Support is the percentage of the records that contain both X and Y
• A rule must have some minimum user-specified support to show its impact
• Confidence is the percentage of records that contain X and Y out of
the number of records that contain X
• A rule must have some minimum user-specified confidence to show its value



Discovery: Rule Association
Results: Useful, Trivial, or Inexplicable?

• While association rules are easy to understand, they are not always
useful

Useful
On Fridays convenience store customers often purchase diapers and beer
together

Trivial
Customers who purchase maintenance agreements are very likely to
purchase large appliances

Inexplicable
When a new Super Store opens, one of the most commonly sold items is
light bulbs



Discovery: Rule Association
How Does It Work?
• In the data, two of five
transactions include both
soda and orange juice
• These two transactions
support the rule
• Support for the rule is
two out of five or 40%
• Soda appears in three
transactions, and two of
them also contain orange juice
• So the rule IF soda THEN
orange juice has a
confidence of two out of
three, or about 67%

Grocery Point-of-Sale Transactions

Customer  Items
1         Orange Juice, Soda
2         Milk, Orange Juice, Window Cleaner
3         Orange Juice, Detergent
4         Orange Juice, Detergent, Soda
5         Window Cleaner, Soda

Co-Occurrence of Products

                OJ  Window Cleaner  Milk  Soda  Detergent
OJ               4               1     1     2          2
Window Cleaner   1               2     1     1          0
Milk             1               1     1     0          0
Soda             2               1     0     3          1
Detergent        2               0     0     1          2
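
Support and confidence for a rule can be computed directly from a transaction list. The following Python sketch uses the five grocery transactions from this slide; the function names are illustrative and are not part of D2K.

```python
# The five grocery point-of-sale transactions, one set of items per customer
transactions = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(antecedent | consequent, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

rule_support = support({"soda"}, {"orange juice"}, transactions)
rule_confidence = confidence({"soda"}, {"orange juice"}, transactions)
print(f"IF soda THEN orange juice: support={rule_support:.0%}, confidence={rule_confidence:.0%}")
```

Swapping the antecedent and consequent leaves the support unchanged but generally changes the confidence, because the denominator becomes the count of the other item.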



Discovery: Rule Association
Confidence and Support - How Good Are the Rules

• A rule must have some minimum user-specified confidence


• 1 and 2 -> 3 has a 90% confidence if when a customer bought 1 and 2, in
90% of the cases, the customer also bought 3
• A rule must have some minimum user-specified support
• 1 and 2 -> 3 should hold in some minimum percentage of transactions to
have value



Discovery: Rule Association
Confidence and Support

Transaction ID   Items
1                { 1, 2, 3 }
2                { 1, 3 }
3                { 1, 4 }
4                { 2, 5, 6 }

For minimum support = 50% = 2 transactions
and minimum confidence = 50%

Frequent Item Set   Support
{1}                 75 %
{2}                 50 %
{3}                 50 %
{1,3}               50 %

For the rule 1 => 3:
Support = Support({1,3}) = 50%
Confidence = Support({1,3})/Support({1}) = 50%/75% ≈ 67%
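
The frequent item sets for a given minimum support can be found by enumerating candidates and keeping those that meet the threshold. The sketch below does this by brute force over the four transactions from this slide. Brute force is only practical for tiny examples, since the number of candidates grows exponentially with the number of items, and the variable names are illustrative.

```python
from itertools import combinations

# The four transactions from the slide
transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
min_support = 0.5

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

# Try every non-empty combination of items and keep the frequent ones
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(candidate)
        if s >= min_support:
            frequent[candidate] = s

# Rule 1 => 3: confidence = Support({1,3}) / Support({1})
rule_support = frequent[(1, 3)]
rule_confidence = frequent[(1, 3)] / frequent[(1,)]

for itemset, s in frequent.items():
    print(set(itemset), f"{s:.0%}")
print(f"1 => 3: support={rule_support:.0%}, confidence={rule_confidence:.1%}")
```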



Discovery: Rule Association
Association Examples

• Find all rules that have “Diet Coke” as a consequent (result)


• These rules may help plan what the store should do to boost the sales of
Diet Coke

• Find all rules that have “Yogurt” in the antecedent (condition)


• These rules may help determine what products may be impacted if the
store discontinues selling “Yogurt”

• Find all rules that have “Brats” in the antecedent and “mustard”
in the consequent
• These rules may help in determining the additional items that have to be
sold together to make it highly likely that mustard will also be sold

• Find the best k rules that have “Yogurt” in the result



Discovery: Rule Association
Basic Process

• Choosing the right set of items


• Taxonomies
• Virtual Items
• Anonymous versus Signed
• Generation of rules
• If condition Then result
• Negation/Dissociation
• Improvement
• Overcoming the practical limits imposed by thousands or tens of
thousands of products
• Minimum Support Pruning



Discovery: Rule Association
Strengths and Weaknesses

Strengths
• It produces easy to understand results
• It supports undirected data mining
• It works on variable length data
• Rules are relatively easy to compute

Weaknesses
• It is an exponential growth algorithm
• It is difficult to determine the optimal number of items
• It discounts rare items
• It offers limited support for attributes of the data
• It produces many rules
• For large numbers of attribute-value combinations, considerable
cpu and memory resources are consumed



Discovery: Rule Association Using fp-growth
Opening the Itinerary

• Click on the “Itinerary” Pane in the Resource Panel


• Expand the “Discovery” directory with a single click
• Expand the “RuleAssociation” directory with a single click
• Double click on “fp-growth” to load the itinerary into your Workspace



Discovery: Rule Association Using fp-growth
Executing the Itinerary

• Check modules with
properties
• Double click to open
property editor
• fp-growth
• Compute Confidence
• Respond to User Interfaces
that open
• Click Run button



Discovery: Rule Association Using fp-growth
Rule Association Visualization
• Read rules down the column
• Example - the first rule is
• If petal-width Binned=[…:0.7] then
flower-type=Iris-setosa
• Support = 25%
• Confidence = 100%
• Brush the bars to find out support
and confidence levels
• Different sorting schemes
• Sort by Confidence
• Sort by Support
• Alphabetize button sorts the
attribute-value pairs alphabetically
• Rank button sorts the rows based
on the current Confidence/Support
selection, moving the consequents
and antecedents of the highest
ranking rules to the top of the
attribute-value list



Discovery: Rule Association
Choosing the Right Set of Items

Partial Product Taxonomy (levels run from general to specific)

Level 1 (general): Frozen Foods
Level 2: Frozen Desserts, Frozen Vegetables, Frozen Dinners
Level 3: Frozen Yogurt, Ice Cream, Frozen Fruit Bars, Peas, Carrots, Mixed, Other
Level 4 (specific): Chocolate, Strawberry, Vanilla, Rocky Road, Cherry Garcia, Other



Discovery: Rule Association
Other Association Rule Applications

• Quantitative Association Rules


• Age[35..40] and Married[Yes] -> NumCars[2]

• Association Rules with Constraints


• Find all association rules where the prices of items are > 100 dollars

• Temporal Association Rules


• Diaper -> Beer (1% support, 80% confidence)
• Diaper -> Beer (20% support) 7:00-9:00 PM weekdays

• Optimized Association Rules


• Given a rule (l < A < u) and X -> Y, find values for l and u such that
the support is greater than a certain threshold and the support,
confidence, or gain is maximized
• ChkBal [$30,000 .. $50,000] -> JumboCD = Yes



DISCOVERY

CLUSTERING



Discovery: Clustering
Overview

• Unsupervised learning problem


• Group all examples that are similar
• View results with dendrogram or parallel coordinates
• Provide several different clustering algorithms
• Kmeans
• Buckshot
• Fractionation
• Coverage



Discovery: Clustering
Clustering Algorithms

• KMeans clustering
• A sample set containing Number of Clusters rows is chosen from an input
table of examples and used as initial cluster centers
• These initial clusters undergo a series of assignment/refinement iterations, resulting
in a final cluster model
• Buckshot clustering
• A sample of size Sqrt(Number of Clusters * Number of Examples) is chosen at
random from the table of examples
• This sampling is sent through the hierarchical agglomerative clustering module to
form Number of Clusters clusters. These clusters' centroids are used as the initial
"means" for the cluster assignment module. The assignment module, once it has made
refinements, outputs the final Cluster Model
• Coverage clustering
• Creates a sample set from the input table such that the set formed is approximately
the minimum number of samples needed such that for every example in the input
table there is at least one example in the sample set within distance Distance Threshold
(% of Maximum)
• This sampling is sent through the hierarchical agglomerative clustering module to
form Number of Clusters clusters. These clusters' centroids are used as the initial
"means" for the cluster assignment module. The assignment module, once it has made
refinements, outputs the final Cluster Model
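
The assignment/refinement loop shared by these algorithms can be sketched in Python as follows. This is a minimal illustration rather than the D2K implementation: the function names and data points are hypothetical, Euclidean distance is an assumption, and the Buckshot helper only computes the sample size stated above (rounded down here). `math.dist` requires Python 3.8 or later.

```python
import math
import random

def sample_initial_centers(points, k, seed=None):
    """KMeans-style start: pick k distinct rows at random as initial centers."""
    return random.Random(seed).sample(points, k)

def buckshot_sample_size(num_clusters, num_examples):
    """Sample size described for Buckshot: sqrt(k * n), rounded down in this sketch."""
    return int(math.sqrt(num_clusters * num_examples))

def kmeans(points, centers, max_iterations=100):
    """Alternate assignment and refinement until the centers stop moving."""
    k = len(centers)
    for _ in range(max_iterations):
        # Assignment: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Refinement: move each center to the mean of its cluster
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical two-dimensional examples forming two obvious groups
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

# Fixed starting centers keep this example deterministic;
# sample_initial_centers(points, 2) would choose them at random instead.
centers, clusters = kmeans(points, [points[0], points[3]])
print(centers)
print(buckshot_sample_size(3, 150))
```

Buckshot and Coverage differ from plain KMeans only in how the starting centers are obtained: they cluster a sample hierarchically first and hand the resulting centroids to the same loop.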



Discovery: Clustering
Clustering Algorithms (2)

• Fractionation
• The initial examples (converted to clusters) are sorted by a
key attribute denoted by Sort Attribute
• The set of sorted clusters is then segmented into equal partitions of size
maxPartitionsize
• Each of these partitions is then passed through the agglomerative clusterer
to produce numberOfClusters clusters
• All the clusters are gathered together for all partitions and the entire
process is repeated until only Number of Clusters clusters remain. The
sorting step encourages like clusters to fall into the same partitions



Discovery: Clustering
Opening the Itinerary

• Click on “Itinerary” Pane in the Resource Panel


• Expand the “Discovery” directory
• Expand the “Clustering” directory
• Double click on “BuckshotClusterer”



Discovery: Clustering
Clustering Results

Dendrogram or Parallel Coordinates



DEVIATION DETECTION
VISUALIZATIONS

PARALLEL COORDINATES
SCATTERPLOT



Deviation Detection: Parallel Coordinates
Itinerary

• Visualization to detect outliers and patterns


• Expand the vis directory and load the “ParallelCoordinate” itinerary



Deviation Detection: Parallel Coordinates
Parallel Coordinates - Visualization
• Each vertical line represents an
attribute with the minimum and
maximum values shown at
bottom and top
• Each record has a line that
connects its values at
each attribute
• Lines are colored based on the
output attribute
• Clicking and dragging on the
label boxes allows the attributes
to be rearranged
• Zooming is accomplished by
dragging a box over the desired
area. Clicking returns to the
original view
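
The geometry behind a parallel coordinates plot is a per-attribute rescaling: each attribute gets its own vertical axis running from its minimum to its maximum, and each record becomes one polyline across the axes. A minimal Python sketch of that mapping follows. It is not the D2K visualization code, and the function name and sample rows are hypothetical.

```python
def parallel_coordinates(rows):
    """Scale each attribute to [0, 1] so every record becomes one polyline."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    polylines = []
    for row in rows:
        line = []
        for axis, value in enumerate(row):
            span = highs[axis] - lows[axis]
            # 0.0 is the bottom of the axis (minimum), 1.0 the top (maximum);
            # a constant attribute is drawn at mid-height
            height = (value - lows[axis]) / span if span else 0.5
            line.append((axis, height))
        polylines.append(line)
    return polylines

# Hypothetical records with three numeric attributes
rows = [(5.0, 2.0, 1.0), (7.0, 4.0, 5.0), (6.0, 3.0, 9.0)]
lines = parallel_coordinates(rows)
for line in lines:
    print(line)
```

A record whose polyline runs far from the bundle formed by the others on one or more axes is a candidate outlier, which is how the plot supports deviation detection.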



Deviation Detection: Scatterplots
Scatterplots – Itinerary
• Visualization to detect outliers and patterns
• Load the “scatterplot” itinerary



Deviation Detection: Scatterplots
Scatterplots – Visualization



Deviation Detection: Small Multiples
Small Multiples of Scatterplots - Itinerary



Deviation Detection: Small Multiples
Small Multiples of Scatterplots Vis



Deviation Detection: Small Multiples
Small Multiples of Linear Regressions Vis



D2K SL
D2K Streamline (D2K SL)

• Reduces the learning
curve associated with the
KDD process
• Encompasses discovery,
prediction and deviation
detection techniques
• Saves and applies models
to new data sets easily
• Supports return to earlier
steps in the KDD process
to run with different
parameters
• Uses the D2K
Infrastructure
transparently



D2K SL
New D2K User Interface – D2K SL

• Provides step
by step
interface to
guide user in
data analysis
• Uses same
D2K modules
• Provides way
to capture
different
experiments
(streams)



D2K SL
Another View of the New D2K User Interface – D2K SL

• Help users keep
track of data
• Define templates
that can be
reused in
different
experiments
(streams)



The ALG Team
Staff Students
Loretta Auvil Tyler Alumbaugh
Ruth Aydt Peter Groves
Peter Bajcsy Olubanji Iyun
Colleen Bushell Sang-Chul Lee
Dora Cai Xiaolei Li
David Clutter Brian Navarro
Lisa Gatzke Jeff Ng
Vered Goren Scott Ramon
Chris Navarro Sunayana Saha
Greg Pape Martin Urban
Tom Redman Bei Yu
Duane Searsmith Hwanjo Yu
Andrew Shirk
Anca Suvaiala
David Tcheng
Michael Welge



Licensing D2K

• Faculty, staff and students at US academic institutions will be able
to license and use D2K for free by downloading from
alg.ncsa.uiuc.edu
• Private Sector Partners who have provided funding for projects
related to D2K will be able to license and use D2K for free
• Private Sector Partners who have not provided funding will be able
to license and use D2K for a discounted fee

Contact John McEntire


Office of Technology Management
308 Ceramics Building, MC-243
105 South Goodwin Avenue
Urbana, Illinois 61801-2901
(217) 333-3715
jmcentir@uiuc.edu

