Data Mining
process of extracting previously unknown, valid,
and actionable (understandable) information from
large databases
Data mining is the step in the KDD (knowledge discovery in
databases) process that applies data analysis and discovery algorithms
[Figure: KDD process diagram; recoverable labels: Operational Databases, Patterns, Model, Evaluation, Verification]
Data mining algorithm components
Model representation
descriptions of discovered patterns
overly limited representation -- unable to capture data patterns
too powerful -- potential for overfit
(decision trees, rules, linear/non-linear regression & classification,
nearest neighbor and case-based reasoning methods, graphical
dependency models)
Model evaluation criteria
how well a pattern (model) meets goals (fit function)
e.g., accuracy, novelty, etc.
Search method
parameter search: optimization of parameters for a given model
representation
model search: considers a family of models
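The three components above can be sketched together on a toy problem. This is a minimal illustration, not a method taken from the slides: the model families (constant and linear), the mean-squared-error criterion, the synthetic data, and all function names are assumptions chosen for brevity.

```python
import random

def fit_constant(xs, ys):
    """Parameter search for the model y = c: least squares gives the mean of ys."""
    c = sum(ys) / len(ys)
    return lambda x: c

def fit_linear(xs, ys):
    """Parameter search for the model y = a*x + b (closed-form least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def mse(model, xs, ys):
    """Evaluation criterion (fit function): mean squared error on the given cases."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def model_search(families, train, holdout):
    """Model search: fit every family, keep the one with the lowest held-out error."""
    best_name, best_model, best_err = None, None, float("inf")
    for name, fit in families.items():
        model = fit(*train)
        err = mse(model, *holdout)
        if err < best_err:
            best_name, best_model, best_err = name, model, err
    return best_name, best_model, best_err

# Synthetic data (assumed for illustration): y = 2x + 1 plus small Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]
train = (xs[0::2], ys[0::2])      # even-indexed cases for fitting
holdout = (xs[1::2], ys[1::2])    # odd-indexed cases for evaluation

families = {"constant": fit_constant, "linear": fit_linear}
name, model, err = model_search(families, train, holdout)
print(name, round(err, 4))
```

Here each `fit_*` function is a parameter search within one representation, while `model_search` chooses among representations using the evaluation criterion.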
Different methods suit different problems. Proper
problem formulation is crucial.
Data mining involves fitting models to and determining
patterns from observed data.
Knowledge Discovery Process
Goal
understanding the application domain and the goals of the KDD effort
Data selection, acquisition, integration
Data cleaning
noise, missing data, outliers, etc.
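As a rough sketch of what cleaning can involve, the snippet below fills missing values with the median and flags outliers with Tukey's interquartile-range rule. Both choices, the helper names, and the income figures are assumptions for illustration; the slides do not prescribe a particular cleaning technique.

```python
import statistics

def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# Hypothetical incomes in $K; None marks a missing field.
incomes = [42, 38, None, 51, 47, 45, None, 40, 950]
filled = impute_missing(incomes)
flags = flag_outliers(filled)
print(filled)
print(flags)
```

Whether a flagged value is an error or a genuine but rare case is a domain question, so flagging rather than deleting is usually the safer first step.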
Exploratory data analysis
dimensionality reduction, transformations
selection of appropriate model for analysis, hypotheses to test
Data mining
selecting an appropriate method that matches the set goals (classification,
regression, clustering, etc.)
selecting algorithm
Testing and verification
Interpretation
Consolidation and use
Issues and challenges
large data
number of variables (features), number of cases (examples)
multi gigabyte, terabyte databases
efficient algorithms, parallel processing
high dimensionality
large number of features: exponential increase in search space
potential for spurious patterns
dimensionality reduction
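The potential for spurious patterns can be illustrated with purely random data: as more unrelated features are examined, the strongest correlation found with the target can only stay the same or grow. The sketch below is an assumption-laden toy (Gaussian noise features, Pearson correlation, hypothetical function names), not an analysis from the slides.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def running_max_abs_correlation(n_cases, n_features, seed=0):
    """Strongest |correlation| with a random target among the first k random features.

    Element k-1 of the result covers the first k features. Every feature is
    independent noise, so any correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_cases)]
    best, running = 0.0, []
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_cases)]
        best = max(best, abs(pearson(feature, target)))
        running.append(best)
    return running

running = running_max_abs_correlation(n_cases=30, n_features=200)
print("best of first 5 features:  ", round(running[4], 3))
print("best of all 200 features:  ", round(running[-1], 3))

# Search space: with d features there are 2**d candidate feature subsets.
print("feature subsets for d=30:  ", 2 ** 30)
```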
Overfitting
models noise in training data, rather than just the general patterns
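A small sketch of the effect, under assumed conditions: labels follow a simple threshold pattern but 20% are flipped at random. A 1-nearest-neighbor model memorizes the training cases, noise included, so its training accuracy is perfect, while on fresh data it will typically do worse than the simple threshold rule. The data generator and function names are hypothetical.

```python
import random

def make_cases(n, noise, rng):
    """Cases (x, label): the general pattern is label = x > 0.5, flipped with probability `noise`."""
    cases = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        cases.append((x, label))
    return cases

def one_nn_predict(train, x):
    """Very flexible model: copy the label of the closest training case."""
    nearest = min(train, key=lambda case: abs(case[0] - x))
    return nearest[1]

def threshold_predict(x):
    """Simple model that captures only the general pattern (assumed known here)."""
    return x > 0.5

def accuracy(predict, cases):
    """Fraction of cases whose predicted label equals the recorded label."""
    return sum(predict(x) == label for x, label in cases) / len(cases)

rng = random.Random(1)
train = make_cases(200, 0.2, rng)
test = make_cases(200, 0.2, rng)

train_acc_1nn = accuracy(lambda x: one_nn_predict(train, x), train)
test_acc_1nn = accuracy(lambda x: one_nn_predict(train, x), test)
test_acc_threshold = accuracy(threshold_predict, test)

print("1-NN training accuracy:  ", train_acc_1nn)
print("1-NN test accuracy:      ", test_acc_1nn)
print("threshold test accuracy: ", test_acc_threshold)
```

The gap between training and test accuracy for the 1-NN model is the symptom to look for; a held-out test set is what makes it visible.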
Changing data, missing and noisy data
Use of domain knowledge
utilizing knowledge on complex data relationships, known facts
Understandability of patterns
Data Mining
Prediction Methods
using some variables to predict unknown or future values of
other variables
Descriptive Methods
finding human-interpretable patterns describing the data
Data Mining Tasks
Classification
Clustering
Association Rule Discovery
Sequential Pattern Discovery
Regression
Deviation Detection
Classification
Data defined in terms of attributes, one of which is the class
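To make the data layout concrete, the sketch below stores each case as a record of attributes, with one attribute (`responded`, a hypothetical name) acting as the class. It then builds a one-attribute rule that predicts the most common class for each attribute value. The records and the rule learner are illustrative assumptions, not the classifier used in the slides.

```python
from collections import Counter, defaultdict

def one_rule(records, attribute, class_attr):
    """For each value of `attribute`, predict the most common class among matching cases."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[attribute]][record[class_attr]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def classify(rule, record, attribute):
    """Apply the rule to one record by looking up its attribute value."""
    return rule[record[attribute]]

# Hypothetical cases: two predictor attributes and the class attribute "responded".
records = [
    {"owns_home": "yes", "returned_checks": "no",  "responded": "yes"},
    {"owns_home": "yes", "returned_checks": "no",  "responded": "yes"},
    {"owns_home": "yes", "returned_checks": "yes", "responded": "no"},
    {"owns_home": "no",  "returned_checks": "no",  "responded": "no"},
    {"owns_home": "no",  "returned_checks": "yes", "responded": "no"},
    {"owns_home": "no",  "returned_checks": "no",  "responded": "no"},
]

rule = one_rule(records, "owns_home", "responded")
new_case = {"owns_home": "yes", "returned_checks": "no"}
print(rule)
print(classify(rule, new_case, "owns_home"))
```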
[Figure: DDA customers and HELOC customers (~250K cases)]
Example
Data
DDA history of loan balances over 3, 6, 9, 12, 18 months,
returned checks
demographic data (age, income, length of residence, etc.),
both internal and external
property data sourced externally (home purchase price,
loan-to-value ratio, etc.)
credit worthiness data
response to previous mailings
120 variables selected
fewer than half the DDAs had history records; missing fields
(45K cases remaining for use -- prospects database)
exclude variables like sex, race, age (legal restrictions)
Neural network (radial basis function) model for
value prediction
Example
Training data
randomly sample from prospects database; weighted to
include more responders than present in actual data
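One way such a weighted training sample might be drawn is sketched below: responders are sampled with replacement until they make up a chosen share of the sample, and non-responders fill the rest. The 30% share, the record layout, and the function name are assumptions; the slides do not state the weighting that was used.

```python
import random

def weighted_training_sample(prospects, n, responder_share, rng):
    """Draw n cases in which responders make up roughly `responder_share` of the sample.

    Responders are drawn with replacement because they are rare;
    non-responders are drawn without replacement.
    """
    responders = [p for p in prospects if p["responded"]]
    non_responders = [p for p in prospects if not p["responded"]]
    n_resp = round(n * responder_share)
    sample = rng.choices(responders, k=n_resp)
    sample += rng.sample(non_responders, n - n_resp)
    rng.shuffle(sample)
    return sample

# Hypothetical prospects database: 1,000 cases with a 2% response rate.
prospects = [{"id": i, "responded": i < 20} for i in range(1000)]

rng = random.Random(42)
training = weighted_training_sample(prospects, n=200, responder_share=0.3, rng=rng)

actual_rate = sum(p["responded"] for p in prospects) / len(prospects)
training_rate = sum(p["responded"] for p in training) / len(training)
print("response rate in database:       ", actual_rate)
print("response rate in training sample:", training_rate)
```

Because the training sample no longer reflects the true response rate, scores from a model trained on it are best read as rankings; this is why the test set is left unweighted.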
Validation
rank on likelihood of response
consider top and bottom 10% -- use visualization, decision
tree to understand rationale for obtained classification
Testing
sample from prospects database; unweighted with normal
proportion of responders and non-responders
gains (lift) chart
Example: Lift analysis
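A gains (lift) chart can be computed along these lines: rank the test cases by predicted likelihood of response, then compare the response rate in the top fraction of the ranking with the overall response rate. The sketch below is a minimal version with made-up scores and outcomes; the function name and the 20-case data set are assumptions.

```python
def cumulative_lift(scores, responded, n_bins=10):
    """Cumulative lift per bin: response rate in the top-ranked cases / overall rate.

    Assumes at least one responder and at least n_bins cases.
    """
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    overall_rate = sum(responded) / total
    lifts = []
    for b in range(1, n_bins + 1):
        cutoff = max(1, round(total * b / n_bins))   # number of top-ranked cases
        top = ranked[:cutoff]
        rate = sum(r for _, r in top) / len(top)
        lifts.append(rate / overall_rate)
    return lifts

# Hypothetical test set: 20 cases with scores 20 (most likely) down to 1,
# and responders at ranks 1, 2, 3 and 5 (1 = responded, 0 = did not).
scores = list(range(20, 0, -1))
responded = [1, 1, 1, 0, 1] + [0] * 15

lifts = cumulative_lift(scores, responded)
for decile, lift in enumerate(lifts, start=1):
    print(f"top {decile * 10:3d}%: lift = {lift:.2f}")
```

A lift above 1 in the top deciles means that mailing only the highest-ranked prospects reaches responders at a higher rate than mailing at random; by construction the lift falls to 1 when the whole list is included.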