D2K Tutorial
Supercomputing 2003
Loretta Auvil
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
217.265.8021
lauvil@ncsa.uiuc.edu
Outline
• Overview of D2K Functionality

• Hands-On Exercise: Predictive Modeling
• Classification
– Using Naïve Bayesian
– Using Decision Trees
• Hands-On Exercise: Discovery
• Rule Association
– Using SQL Htree
• Clustering
• Deviation Detection
• Visualization
– Parallel Coordinates
– Small Multiples of scatterplots
alg | Automated Learning Group

Goals
• Understanding the Knowledge Discovery in Databases Process

• Gaining Knowledge of Basic Data Mining Operations and Techniques
• Understanding the Role of the Knowledge Discovery Framework
• Key Issues in Utilization of D2K Framework
• Understanding the Role of Information Visualization in Data Mining

Overview of Knowledge Discovery
What is It?
Knowledge Discovery in Databases is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data
• The understandable patterns are used to:
• Make predictions about or classifications of new data
• Explain existing data
• Summarize the contents of a large database to support decision making
• Create graphical data visualizations to aid humans in discovering complex
patterns

Knowledge Discovery Process

Required Effort for each KDD Step
[Figure: bar chart of Effort (%), on a scale from 0 to 60, for the four KDD steps: Objectives Determination, Data Preparation, Data Mining, and Interpretation/Evaluation]
Arrows indicate the direction we want the effort to go

Three Primary Paradigms
• Predictive Modeling – supervised learning approach where
classification or prediction of one of the attributes is desired
• Classification is the prediction of predefined classes
– e.g. Naïve Bayesian, Decision Trees, and Neural Networks
• Regression is the prediction of continuous data
– e.g. Neural Networks, and Decision (Regression) Trees
• Discovery – unsupervised learning approach for exploratory data
analysis
• e.g. Association Rules, Link Analysis, Clustering, and Self Organizing Maps
• Deviation Detection – identifying outliers in the data
• e.g. Visualization

Importance of Data Mining Framework
• Provides capability to build custom applications

• Provides access to data management tools
• Loading data from database, flat file or DataSpaces
• Contains data mining algorithms for prediction and discovery that
can be applied
• Provides data transformations for standard operations
• Supports an extensible interface for creating one’s own algorithms
• Provides means for building and applying models
• Provides integrated visualization components
• Provides access to distributed computing capabilities

D2K Overview
D2K - Data To Knowledge
D2K is a flexible data mining system that integrates
effective analytical data mining methods for prediction,
discovery, and anomaly detection with data management
and information visualization

D2K Overview
D2K and Its Many Components
• D2K Infrastructure
D2K API, data flow environment,
distributed computing framework
and runtime system
• D2K Modules
Computational units written in Java
that follow the D2K API
• D2K Itineraries
Modules that are connected to form
an application
• D2K Toolkit
User interface for specification of
itineraries and execution that
provides the rapid application
development environment
• D2K-Driven Applications
Applications that use D2K modules,
but do not need to run in the D2K
Toolkit

D2K Overview
D2K Toolkit
Major features that D2K provides
to an application developer
include:
• Visual programming system
employing a data flow
paradigm
• Scalable distributed computing
capabilities
• Flexible and extensible
software development
environment
• Multi-layered learning
strategies
• Integrated environment for
models and visualization
• Capability to access data
transparently from multiple
sources

D2K Overview
D2K Basic 1.0
• New release of D2K 3.0
• New release of the D2K Toolkit
• New release of a set of D2K Modules to perform data mining techniques
• Prediction
– Decision Trees
C4.5 Decision Tree, Continuous Decision Tree, SQL Rain Forest Decision Tree
– Naïve Bayesian Classification and SQL Naïve Bayesian Classification
– Neural Networks
• Discovery
– Rule Association
Apriori, Htree
– Clustering
Hierarchical Agglomerative, Kmeans, Coverage, etc.
• Better documentation for Toolkit and modules
• Includes visualizations for many of the modeling approaches
• Includes a set of data transformations
• Attribute selection, binning, filtering, attribute construction
• Includes optimization strategy for searching parameter space
• Plus more…

D2K Overview
D2K 3.0 Features
• Current Release downloadable off our website

• Extension of existing API
• Provides the capability to programmatically connect modules and set properties
• Allows D2K-driven applications to be developed
• Provides ability to pause and restart an itinerary
• Enhanced Distributed Computing
• Allows modules that are re-entrant to be executed remotely
• Uses Jini services to look up distributed resources
• Includes interface for specifying the runtime layout of a distributed itinerary
• Processor Status Overlay
• Shows utilization of distributed computing resources
• Distributed Checkpointing
• Resource Manager
• Provides a mechanism for treating selected data structures as if they were stored in
global memory
• Provides memory space that is accessible from multiple modules running locally as
well as remotely

D2K Overview
New D2K 4.0 Highlights
• Ability to use the web for deployment

• Ability for modules to run headless (with no gui)
• Changed the way itineraries are saved
• Stored in zip file
• Itinerary is described in an xml format
• Annotation is saved in html format
• Additional data is stored in a serialized HashMap
• Table structure was re-implemented to improve performance and
simplify the API
• Improvements of module selection, with area selection
• Support of copy and paste of selected modules

D2K Overview
D2K ToolKit
1. Workspace
2. Resource Panel
3. Modules
4. Models
5. Itineraries
6. Visualizations
7. Generated Visualizations
8. Generated Models
9. Component Information
10. Toolbar
11. Console

D2K Overview
D2K Modules
Input Module: Loads data from the outside world
• Flat files, databases, etc.
Data Prep Module: Performs functions to select, clean, or transform the data
• Binning, Normalizing, Feature Selection, etc.
Compute Module: Performs main algorithmic computations
• Naïve Bayesian, Decision Tree, Apriori, etc.
User Input Module: Requires interaction with the user
• Data Selection, Input and Output selection, etc.
Output Module: Saves data to the outside world
• Flat files, databases, etc.
Visualization Module: Provides visual feedback to the user
• Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot,
3D Surface Plot

D2K Overview
D2K Module Icon Description
Module Progress Bar
Appears during execution to show the percentage of time that this
module executed over the entire execution time. It is green when
the module is executing and red when not
Input Port
Rectangular shapes on the left side of the module represent the
inputs for the module. They are colored according to the data type
that they represent
Output Port
Rectangular shapes on the right side of the module represent the
outputs for the module. They are colored according to the data type
that they represent
Properties Symbol
If a “P” is shown in the lower left corner of the module, then the
module has properties that can be set before execution

D2K Overview
Resource Panel
The area to the left of the Workspace that contains the components
necessary for data analysis
• Modules
• Models
• Itineraries
• Visualizations

D2K Overview
D2K Itineraries
• Itineraries are partial or
complete applications
composed of connected
modules
• D2K Core Itineraries
include:
• Prediction
• Discovery
• Anomaly Detection
• Data Selection
• Transformation
• Visualization

D2K Overview
Workspace
The Workspace is the area where applications are formed

• Modules are placed, connected, and properties set
• Itineraries are saved and executed

D2K Overview
Session Panes
• Component Information
• Shows detailed information about components of D2K
• Shows module information, inputs, outputs, and property descriptions
• Shows itinerary annotation
• Generated Visualization
• Shows visualizations generated during this session
• Provides ability to save these visualizations for later use
• Generated Models
• Shows models generated during this session
• Provides ability to save these models for later use

D2K Overview
D2K Setup
• Preferences
• Written to a file called “d2k.props”
• Set up automatically the first time D2K is installed
• Changed via Edit menu… Preferences…
• Some changes do require restart of D2K
• Check the User Manual for more details (available online)

D2K Overview
Using the Toolkit
Build an itinerary for loading data and

viewing it in a TableViewer
• Drag and Drop Modules from
Modules Pane of Resource Panel to
the Workspace as shown
• Expand directory ncsa/io/file/input
– Drag and Drop Input1Filename to
Workspace
– Drag and Drop
CreateDelimitedParser to
Workspace
– Drag and Drop ParseFileToTable
to Workspace
• Expand directory ncsa/vis
– Drag and Drop TableViewer to
Workspace

D2K Overview
Using the Toolkit (cont’d)
Connect the modules as shown

• Drag from the output port of
one module to the input port
of the next module
• Check the properties of
modules by double clicking
on the module
• Input File Name
– Choose data/UCI/iris.csv
• Create Delimited File Parser
– Defaults work
• Parse File To Table
– Defaults work
• Click Run to execute

D2K Overview
Variation Using a Nested Itinerary
• An itinerary can be used
as a module (a nested
itinerary)
• Properties can be set by
holding Control and double
clicking on the nested
itinerary
• Then connect the
input and output ports of
the nested itinerary as one
would any other module

PREDICTIVE MODELING
CLASSIFICATION
NAÏVE BAYESIAN

Predictive Modeling: Naïve Bayesian
Naïve Bayesian Classification
• Applied to supervised learning
problems
• Expects training examples with
input and output attributes
• Single output attribute with a small
number of possible values for best
performance
• Computes the distribution of an
input associated with each class;
for example, given the variable X
with a value at xi, the probability
of it being in Class A is greater
than it being in Class B
Mathematically speaking: if P(X | C) is known, and the densities
P(xi) and P(cj) (prior probabilities) are known,
then the classifier is one which assigns
class cj to datum xi if cj has the highest
posterior probability given the data

Bayesian Classification: Why?
• Probabilistic learning: Calculates explicit probabilities for
hypotheses; among the most practical approaches to certain
types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
• Prior knowledge: Can be combined with observed data
• Standard:
• Provide a standard of optimal decision making against which other methods
can be measured
• In a simpler form, provide a baseline against which other methods can be
measured

Naïve Bayesian Classification
• Naïve assumption:
• Feature independence
• P(xi|C) is estimated as the relative frequency of examples having
value xi as feature in class C
• Computationally easy!!!
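
The frequency-counting idea above fits in a few lines. What follows is a minimal sketch, assuming categorical attributes and no smoothing; it is not the D2K module, and the function names and the six-row weather data set are made up for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_dict, class_label) with categorical values."""
    class_counts = Counter(label for _, label in examples)
    # value_counts[(feature, class)][value] = number of class examples with that value
    value_counts = defaultdict(Counter)
    for features, label in examples:
        for name, value in features.items():
            value_counts[(name, label)][value] += 1
    return class_counts, value_counts

def posterior(model, features):
    """Return P(class | features) under the feature-independence assumption."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(C)
        for name, value in features.items():
            # P(xi | C) estimated as a relative frequency within the class
            score *= value_counts[(name, label)][value] / count
        scores[label] = score
    norm = sum(scores.values())
    return {label: (s / norm if norm else 0.0) for label, s in scores.items()}

# Made-up training data: two categorical inputs, one output attribute.
examples = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "overcast", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
]
model = train_naive_bayes(examples)
result = posterior(model, {"outlook": "sunny", "windy": "no"})
print(result)  # play is about 0.857, stay about 0.143
```

Because no smoothing is applied, a value never seen with a class drives that class's score to zero; practical implementations usually guard against this.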

Classification Applications Using Naïve Bayesian
• Predict a response to a
marketing campaign
• Predict the most
profitable customers
for a product or
service
• Classify applicants as
high/med/low risk
• Predict which
customers will leave
for a competitor
• Predict whether email
message is SPAM or not

Opening the Itinerary
• Click on the “Itinerary” Pane in the Resource Panel

• Expand the “Prediction” directory with a single click
• Double click on “NaïveBayes” to load the itinerary into your Workspace

Executing the Itinerary
• Check modules with

properties
• Double click to open property
editor
• Respond to User Interfaces
that open
• Click Run button
• Respond to GUIs that pop up

PredictionTableReport for iris data
Double click on the

PredictionTableReport to launch the
report that shows the classification
error and confusion matrix for the
data
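
The two quantities in the report are simple to compute by hand. The sketch below shows one way to derive a confusion matrix and the classification error from actual and predicted labels; the function names and the six label pairs are illustrative assumptions, not output from D2K.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Rows are actual classes, columns are predicted classes."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix

def classification_error(actual, predicted):
    """Fraction of examples whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Made-up labels for illustration only.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

labels, matrix = confusion_matrix(actual, predicted)
print(labels)
for row in matrix:
    print(row)
print(classification_error(actual, predicted))
```

Off-diagonal entries show which classes are confused with which; here the single error is a versicolor example predicted as virginica.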

Naïve Bayesian Visualization
• Double click on the
NaiveBayesVis to view the
results
• The upper right hand pane
shows the distribution of the
classes
• The left hand pane shows the
attributes and each of their
values. They are listed by
order of significance
• The message box shows details
about each pie chart when
brushed
• Clicking on a pie chart shows
how knowing this information
can change the overall class
prediction
• Clicking on multiple pie charts
calculates conditional
probabilities
Notice Iris-versicolor has a 33%
likelihood

What if scenarios…
• Click on petal-width of
1.3:1.9
• Now the probability of
Iris-versicolor is
66.37%

What if scenarios…
continue with
conditional probabilities
calculations
• Click on petal-length of
3.95:5.32
• Click on sepal-length of
5.28:6.15
• Now the probability of
Iris-versicolor is 94.99%

Applying Models
• In Generated Models Session Pane, right click on the model and
choose Save
• The saved model shows up in the Model View of the Resource Panel
• Click and drag the model into the workspace
• Connect the input and output of the model as shown

PREDICTIVE MODELING
CLASSIFICATION
Decision Trees

Predictive Modeling: Decision Trees
Decision Trees Classification
• Supervised learning problem

• Builds a model to classify one
attribute based on other data
attributes
• Builds the tree by deciding how to
split the data so that classification
error is reduced
• Shown is a decision tree predicting
whether one will play tennis based
on some weather conditions
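
To make "split the data so that classification error is reduced" concrete, the sketch below searches for the single best threshold split by counting misclassified examples. It is a simplified illustration under stated assumptions: numeric attributes, one split only, and raw error counts as the score, whereas learners such as C4.5 typically use information-based measures. The function names and the six data rows are made up.

```python
from collections import Counter

def errors_if_leaf(labels):
    """Misclassified count when a node predicts its majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_split(rows, labels):
    """Try every attribute/threshold pair; return (errors, attribute_index, threshold)."""
    best = (errors_if_leaf(labels), None, None)  # baseline: no split at all
    for attr in range(len(rows[0])):
        values = sorted(set(row[attr] for row in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for row, lab in zip(rows, labels) if row[attr] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[attr] > threshold]
            errors = errors_if_leaf(left) + errors_if_leaf(right)
            if errors < best[0]:
                best = (errors, attr, threshold)
    return best

# Made-up (petal-length, petal-width) rows for illustration.
rows = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (5.9, 2.1), (6.0, 2.5)]
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
print(best_split(rows, labels))  # (2, 0, 3.0)
```

A full tree builder would apply the same search recursively to the left and right subsets until the leaves are pure or a stopping rule is met.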

Applications Using Decision Trees
• Decision trees can solve both

classification and regression
problems
• Decision Trees work for many
of the same problems as
Naïve Bayesian analysis
• Prediction of who should be
given a loan
• Prediction of high/med/low
risk

PredictionTableReport for iris data
Double click on the

PredictionTableReport to launch the
report that shows the classification
error and a confusion matrix for the
data
Note: This is a very clean data set

Decision Tree Visualization
Two main panes
• Navigator Pane, shown in the
top left, illustrates the
full decision tree; the
viewable portion of the tree is
shown with a black box
outline
• Viewable Tree shows a chart
of the percentages of the
examples in each of the
classes
• Brushing indicates the
percentages in the Brushing
Pane
• Clicking on a small chart opens
a larger view of the chart
showing the complete path
taken to get to this node

Using the Model
• In Generated Models Session
Pane, right click on the
model and choose Save. The
saved model shows up in the
Model View of the Resource
Panel
• Click and drag the model into
the workspace (shown in the
green circle) and disconnect the
items in the red blob
• Connect the input and output
of the model as shown
• Results can be sent to the
PredictionTableReport and to
the DecisionTreeVis
• New (test) data can be
examined with the model

DISCOVERY
RULE ASSOCIATION
Using fp-growth

Discovery: Rule Association
Market Basket Example
? Where should detergents be placed in the
store to maximize their sales?
? Are window cleaning products purchased
when detergents and orange juice are
bought together?
? Is soda typically purchased with bananas?
Does the brand of soda make a difference?
? How are the demographics of the
neighborhood affecting what customers
are buying?

Association Rules
• There has been a considerable amount of research in the area of
Market Basket Analysis. Its appeal comes from the clarity and utility
of its results, which are expressed in the form of association rules
• Given
• Database of transactions
• Each transaction contains a set of items
• Find all rules X->Y that correlate the presence of one set of items X
with another set of items Y
• Example: When a customer buys bread and butter, they buy milk 85% of
the time

Overview
• Unsupervised learning problem

• Find all rules that correlate the presence of one set of items X with
another item Y
• Example: When a customer buys bread and butter, they buy milk 85% of the
time
• Support is the percentage of the records that contain both X and Y
• A rule must have some minimum user-specified support to show its impact
• Confidence is the percentage of records that contain X and Y out of
the number of records that contain X
• A rule must have some minimum user-specified confidence to show its value

Results: Useful, Trivial, or Inexplicable?
• While association rules are easy to understand, they are not always
useful
Useful
On Fridays convenience store customers often purchase diapers and beer
together
Trivial
Customers who purchase maintenance agreements are very likely to
purchase large appliances
Inexplicable
When a new Super Store opens, one of the most commonly sold items is
light bulbs

How Does It Work?
• In the data, two of five
Grocery Point-of-Sale Transactions
transactions include both
soda and orange juice Customer Items
• These two transactions 1 Orange Juice,
juice, Soda
support the rule
2 Milk, Orange Juice, Window Cleaner
• Support for the rule is
two out of five or 40% 3 Orange Juice, Detergent
• Since both transactions 4 Orange Juice, Detergent, soda

juice, detergent, Soda
that contain soda also 5 Window Cleaner, Soda
cleaner, soda
contain orange juice
• There is a high degree of
Co-Occurrence of Products
confidence in the rule
Window
• In fact every transaction OJ Cleaner Milk Soda Detergent
that contains soda
OJ 4 1 1 2 1
contains orange juice
• So the rule If soda, THEN Window Cleaner 1 2 1 1 0
orange juice has a Milk 1 1 1 0 0
confidence of 100% Soda 2 1 0 3 1
Detergent 1 0 0 1 2
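
The counts in this example can be checked mechanically. The sketch below recomputes support, confidence, and the co-occurrence counts directly from the five listed transactions; the helper names are illustrative only. Counted this way, soda appears in three transactions (customers 1, 4, and 5) and orange juice accompanies it in two of them, and orange juice and detergent appear together twice (customers 3 and 4).

```python
from collections import Counter
from itertools import combinations

# The five point-of-sale transactions as listed in the table.
transactions = [
    {"Orange Juice", "Soda"},
    {"Milk", "Orange Juice", "Window Cleaner"},
    {"Orange Juice", "Detergent"},
    {"Orange Juice", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Co-occurrence counts: transactions containing each item and each pair of items.
cooccurrence = Counter()
for t in transactions:
    for item in t:
        cooccurrence[(item, item)] += 1
    for a, b in combinations(sorted(t), 2):
        cooccurrence[(a, b)] += 1
        cooccurrence[(b, a)] += 1

print(support({"Soda", "Orange Juice"}))       # 0.4
print(confidence({"Soda"}, {"Orange Juice"}))  # about 0.667
print(cooccurrence[("Soda", "Soda")])          # 3
```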

Confidence and Support - How Good Are the Rules
• A rule must have some minimum user-specified confidence
• 1 and 2 -> 3 has 90% confidence if, when a customer bought 1 and 2, the
customer also bought 3 in 90% of the cases
• A rule must have some minimum user-specified support
• 1 and 2 -> 3 should hold in some minimum percentage of transactions to
have value

Confidence and Support
For minimum support = 50% = 2 transactions and minimum confidence = 50%

Transaction ID   Items
1                { 1, 2, 3 }
2                { 1, 3 }
3                { 1, 4 }
4                { 2, 5, 6 }

Frequent Item Set   Support
{ 1 }               75%
{ 2 }               50%
{ 3 }               50%
{ 1, 3 }            50%

For the rule 1 => 3:
Support = Support({1,3}) = 50%
Confidence = Support({1,3}) / Support({1}) = 66.7%
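
The frequent item sets for this four-transaction example can be found with a level-wise search in which larger candidates are built only from smaller sets that already met the minimum support. The sketch below is a simplified, Apriori-style illustration and not the D2K Apriori or fp-growth module; the function names are assumptions. Note that item 4 occurs in only one of the four transactions (25%), so it falls below the 50% threshold, while { 1, 3 } reaches it.

```python
from itertools import combinations

transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
MIN_SUPPORT = 0.5

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets():
    """Level-wise search: size k+1 candidates come only from frequent size k sets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if support({i}) >= MIN_SUPPORT]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        # Join pairs of frequent sets that differ by exactly one item.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= MIN_SUPPORT]
    return frequent

def rule_confidence(antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    return support(antecedent | consequent) / support(antecedent)

frequent = frequent_itemsets()
for itemset, value in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
print(rule_confidence({1}, {3}))  # about 0.667
```

Pruning by minimum support is what keeps the search tractable: once a set is infrequent, none of its supersets needs to be counted.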

Association Examples
• Find all rules that have “Diet Coke” as a consequent (result)

• These rules may help plan what the store should do to boost the sales of
Diet Coke
• Find all rules that have “Yogurt” in the antecedent (condition)

• These rules may help determine what products may be impacted if the
store discontinues selling “Yogurt”
• Find all rules that have “Brats” in the antecedent and “mustard”
in the consequent
• These rules may help in determining the additional items that have to be
sold together to make it highly likely that mustard will also be sold
• Find the best k rules that have “Yogurt” in the result

Basic Process
• Choosing the right set of items

• Taxonomies
• Virtual Items
• Anonymous versus Signed
• Generation of rules
• If condition Then result
• Negation/Dissociation
• Improvement
• Overcoming the practical limits imposed by thousands or tens of
thousands of products
• Minimum Support Pruning

Strengths and Weaknesses
Strengths
• It produces easy to understand results
• It supports undirected data mining
• It works on variable length data
• Rules are relatively easy to compute
Weaknesses
• It is an exponential growth algorithm
• It is difficult to determine the optimal number of items
• It discounts rare items
• It provides limited support for attributes
• It produces many rules
• For large numbers of attribute-value combinations, considerable
cpu and memory resources are consumed

Discovery: Rule Association Using fp-growth
• Click on the “Itinerary” Pane in the Resource Panel

• Expand the “Discovery” directory with a single click
• Expand the “RuleAssociation” directory with a single click
• Double click on “fp-growth” to load the itinerary into your Workspace

Executing the Itinerary
• Check modules with

properties
• Double click to open
property editor
• fp-growth
• Compute Confidence
• Respond to User Interfaces
that open
• Click Run button

Rule Association Visualization
• Read rules down the column
• Example - the first rule is
• If petal-width Binned=[…:0.7] then
flower-type=Iris-setosa
• Support = 25%
• Confidence = 100%
• Brush the bars to find out support
and confidence levels
• Different sorting schemes
• Sort by Confidence
• Sort by Support
• Alphabetize button sorts the
attribute-value pairs alphabetically
• Rank button sorts the rows based
on the current Confidence/Support
selection, moving the consequents
and antecedents of the highest
ranking rules to the top of the
attribute-value list

Choosing the Right Set of Items
Partial Product Taxonomy (from General at the top to Specific at the bottom)
Level 1: Frozen Foods
Level 2: Frozen Desserts, Frozen Vegetables, Frozen Dinners
Level 3: Frozen Yogurt, Ice Cream, Frozen Fruit Bars; Peas, Carrots, Mixed, Other
Level 4: Chocolate, Strawberry, Vanilla, Rocky Road, Cherry Garcia, Other

Other Association Rule Applications
• Quantitative Association Rules

• Age[35..40] and Married[Yes] -> NumCars[2]
• Association Rules with Constraints

• Find all association rules where the prices of items are > 100 dollars
• Temporal Association Rules

• Diaper -> Beer (1% support, 80% confidence)
• Diaper -> Beer (20% support) 7:00-9:00 PM weekdays
• Optimized Association Rules
• Given a rule (l < A < u) and X -> Y, find values for l and u such that the
support is greater than a certain threshold and the support,
confidence, or gain is maximized
• ChkBal [$30,000 .. $50,000] -> JumboCD = Yes

DISCOVERY
CLUSTERING

Discovery: Clustering
Overview
• Unsupervised learning problem

• Group all examples that are similar
• View results with dendrogram or parallel coordinates
• Provide several different clustering algorithms
• Kmeans
• Buckshot
• Fractionation
• Coverage

Clustering Algorithms
• KMeans clustering
• A sample set containing Number of Clusters rows is chosen from the input
table of examples and used as the initial cluster centers
• These initial clusters undergo a series of assignment/refinement iterations, resulting
in a final cluster model
• Buckshot clustering
• A sample of size Sqrt(Number of Clusters * Number of Examples) is chosen at
random from the table of examples
• This sampling is sent through the hierarchical agglomerative clustering module to
form Number of Clusters clusters. These clusters' centroids are used as the initial
"means" for the cluster assignment module. The assignment module, once it has made
refinements, outputs the final Cluster Model
• Coverage clustering
• Creates a sample set from the input table such that the set formed is approximately
the minimum number of samples needed such that for every example in the input
table there is at least one example in the sample set within a distance of
Distance Threshold (% of Maximum)
• This sampling is sent through the hierarchical agglomerative clustering module to
form Number of Clusters clusters. These clusters' centroids are used as the initial
"means" for the cluster assignment module. The assignment module, once it has made
refinements, outputs the final Cluster Model
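
The assignment/refinement loop that KMeans runs, and that Buckshot and Coverage reuse after seeding, can be sketched briefly. The code below is a minimal illustration assuming numeric rows and squared Euclidean distance; it is not the D2K module, and the function names, the optional `initial_centers` argument, and the six data points are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two numeric rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(examples, number_of_clusters, initial_centers=None, iterations=10, seed=0):
    rng = random.Random(seed)
    # Initial centers: supplied by the caller, or a random sample of input rows.
    if initial_centers:
        centers = list(initial_centers)
    else:
        centers = rng.sample(examples, number_of_clusters)
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each example to its nearest center.
        assignment = [
            min(range(number_of_clusters), key=lambda c: squared_distance(row, centers[c]))
            for row in examples
        ]
        # Refinement step: move each center to the mean of its members.
        for c in range(number_of_clusters):
            members = [row for row, a in zip(examples, assignment) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Made-up points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, assignment = kmeans(points, 2, initial_centers=[(1.0, 1.0), (8.0, 8.0)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centers, which is why Buckshot and Coverage spend effort choosing good initial "means" before handing off to the assignment step.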

Clustering Algorithms (2)
• Fractionation
• Sorts the initial examples (converted to clusters) by a
key attribute denoted by Sort Attribute
• The set of sorted clusters is then segmented into equal partitions of size
maxPartitionsize
• Each of these partitions is then passed through the agglomerative clusterer
to produce numberOfClusters clusters
• All the clusters are gathered together for all partitions and the entire
process is repeated until only Number of Clusters clusters remain. The
sorting step encourages like clusters to fall into the same partitions

• Click on “Itinerary” Pane in the Resource Panel

• Expand the “Discovery” directory
• Expand the “Clustering” directory
• Double click on “BuckshotClusterer”

Clustering Results
Dendrogram or Parallel Coordinates

DEVIATION DETECTION
VISUALIZATIONS
PARALLEL COORDINATES
SCATTERPLOT

Deviation Detection: Parallel Coordinates
Itinerary
• Visualization to detect outliers and patterns

• Expand the vis directory and load the “ParallelCoordinate” itinerary

Deviation Detection: Parallel Coordinates
Parallel Coordinates - Visualization
• Each vertical line represents an
attribute with the minimum and
maximum values shown at
bottom and top
• Each record has a line that
connects its values at
each attribute
• Lines are colored based on the
output attribute
• Clicking and dragging on the
label boxes allows the attributes
to be rearranged
• Zooming is accomplished by
dragging a box over the desired
area. Clicking returns to the
original view
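
The geometry behind the plot is a per-attribute rescaling: each value is placed on its own axis between that attribute's minimum (bottom) and maximum (top). The sketch below computes the vertices of each record's line under that assumption; it illustrates the idea only, is not the D2K visualization code, and the function name and sample rows are made up.

```python
def parallel_coordinate_polylines(rows):
    """Map each record to one (axis_index, height) vertex per attribute.

    height is 0.0 at the attribute's minimum (bottom of the axis) and
    1.0 at its maximum (top of the axis).
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    polylines = []
    for row in rows:
        vertices = []
        for axis, value in enumerate(row):
            span = highs[axis] - lows[axis]
            # A constant attribute has no range; place it mid-axis.
            height = (value - lows[axis]) / span if span else 0.5
            vertices.append((axis, height))
        polylines.append(vertices)
    return polylines

# Made-up records with two numeric attributes.
rows = [(5.0, 1.0), (6.0, 3.0), (7.0, 2.0)]
for line in parallel_coordinate_polylines(rows):
    print(line)
```

A record whose line sits far from the bundle formed by the other records on one or more axes is a candidate outlier.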

Deviation Detection: Scatterplots
Scatterplots – Itinerary
• Visualization to detect outliers and patterns
• Load the “scatterplot” itinerary

Deviation Detection: Scatterplots
Scatterplots – Visualization

Deviation Detection: Small Multiples
Small Multiples of Scatterplots - Itinerary

Small Multiples of Scatterplots Vis

Small Multiples of Linear Regressions Vis

D2K SL
D2K Streamline (D2K SL)
• Reduces the learning

curve associated with the
KDD process
• Encompasses discovery,
prediction and deviation
detection techniques
• Saves and applies models
to new data sets easily
• Supports return to earlier
steps in the KDD process
to run with different
parameters
• Uses the D2K
Infrastructure
transparently

D2K SL
New D2K User Interface – D2K SL
• Provides step
by step
interface to
guide user in
data analysis
• Uses same
D2K modules
• Provides way
to capture
different
experiments
(streams)

D2K SL
Another View of the New D2K User Interface – D2K SL
• Helps users keep
track of data
• Defines templates
that can be
reused in
different
experiments
(streams)

The ALG Team
Staff              Students
Loretta Auvil      Tyler Alumbaugh
Ruth Aydt          Peter Groves
Peter Bajcsy       Olubanji Iyun
Colleen Bushell    Sang-Chul Lee
Dora Cai           Xiaolei Li
David Clutter      Brian Navarro
Lisa Gatzke        Jeff Ng
Vered Goren        Scott Ramon
Chris Navarro      Sunayana Saha
Greg Pape          Martin Urban
Tom Redman         Bei Yu
Duane Searsmith    Hwanjo Yu
Andrew Shirk
Anca Suvaiala
David Tcheng
Michael Welge

Licensing D2K
• Faculty, staff and students at US academic institutions will be able
to license and use D2K for free by downloading from
alg.ncsa.uiuc.edu
• Private Sector Partners who have provided funding for projects
related to D2K will be able to license and use D2K for free
• Private Sector Partners who have not provided funding will be able
to license and use D2K for a discounted fee
Contact John McEntire

Office of Technology Management
308 Ceramics Building, MC-243
105 South Goodwin Avenue
Urbana, Illinois 61801-2901
(217) 333-3715
jmcentir@uiuc.edu
