
PREDICTIVE ANALYTICS AND DATA MINING

PROCESS AND METHODOLOGY


YOUTUBE VIDEOS

DATA Mining Process and CRISP_DM by Conitir


Predictive Analytics for Practitioners - with Dean Abbott - from
(APA book
Data Mining versus OLAP

OLAP provides a very good view of what is happening, but it cannot predict what will happen in the future or explain why it is happening.
Verification Driven (OLAP) vs Discovery Driven (Data Mining)
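The contrast can be sketched on a toy table. The records, field names, and the one-rule search below are hypothetical illustrations, not taken from the slides. The first function answers a question the analyst has already posed (a verification-driven roll-up), while the second searches the attributes for whichever single one best separates the outcome (a very small discovery-driven step).

```python
from collections import defaultdict

# Hypothetical toy records: (region, plan, churned)
RECORDS = [
    ("north", "basic", True),
    ("north", "premium", False),
    ("south", "basic", True),
    ("south", "premium", False),
    ("north", "basic", True),
    ("south", "premium", False),
    ("south", "basic", False),
    ("north", "premium", False),
]
FIELDS = {"region": 0, "plan": 1}


def churn_rate_by(field):
    """OLAP-style roll-up: the analyst chooses the dimension to verify."""
    totals = defaultdict(lambda: [0, 0])  # value -> [churned, count]
    idx = FIELDS[field]
    for row in RECORDS:
        totals[row[idx]][0] += int(row[2])
        totals[row[idx]][1] += 1
    return {value: churned / count for value, (churned, count) in totals.items()}


def best_single_attribute():
    """Discovery-style search: find the attribute whose values best predict churn.

    For each attribute, predict the majority outcome per value and measure
    the share of records that rule gets right (a one-rule classifier).
    """
    scores = {}
    for field, idx in FIELDS.items():
        by_value = defaultdict(list)
        for row in RECORDS:
            by_value[row[idx]].append(row[2])
        correct = sum(max(v.count(True), v.count(False)) for v in by_value.values())
        scores[field] = correct / len(RECORDS)
    return max(scores, key=scores.get), scores
```

On this toy data, `churn_rate_by("region")` only reports the rates for the dimension that was asked about, whereas `best_single_attribute()` surfaces `plan` as the stronger predictor without being told where to look.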
What is Data Mining?

Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases.
Data Mining vs Statistics
Predictive Analytics and Data Mining Process and Methodology
Analytics Lifecycle: Process To Continuously Uncover And Publish New Business Insights
[Figure: five-step cycle around Key Business Processes]
1) Business: Defines mandate and requirements
2) DWH: Acquires and integrates data
3) Data Scientists: Build and refine analytic models
4) BI: Publishes new insights
5) Business: Consumes insights and measures effectiveness

Big Data: How Data Powers Big Business, Bill Schmarzo, Wiley
Data Scientist Lifecycle
[Figure: six-stage cycle; each stage is followed by the question that gates the move to the next stage]
1) Discovery: Do I have enough information to draft an analytic plan?
2) Data Prep: Do I have enough good data to start building the model?
3) Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
4) Model Building: Is the model robust enough? Have we failed enough?
5) Communicate Results
6) Operationalize

Big Data: How Data Powers Big Business, Bill Schmarzo, Wiley
Why Should There be a Standard Process?
Framework for recording experience
Allows projects to be replicated
Aid to project planning and management
Comfort factor for new adopters
Demonstrates maturity of Data Mining
Reduces dependency on stars

The data mining process must be reliable and repeatable by people with little data mining background.
CRISP-DM
Non-proprietary
Application/Industry neutral
Tool neutral
Focus on business issues as well as technical analysis
Framework for guidance
Experience base
Templates for Analysis
CRISP-DM: Overview

CRISP-DM Phases and Tasks
CRISP-DM : A new Blueprint for Data Mining, Shearer, Journal of Data Warehousing

Each phase lists its tasks; the outputs of each task follow the colon.

Business Understanding
- Determine Business Objectives: Background; Business Objectives; Business Success Criteria
- Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
- Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
- Collect Initial Data: Initial Data Collection Report
- Describe Data: Data Description Report
- Explore Data: Data Exploration Report
- Verify Data Quality: Data Quality Report

Data Preparation
- Data Set: Data Set Description
- Select Data: Rationale for Inclusion / Exclusion
- Clean Data: Data Cleaning Report
- Construct Data: Derived Attributes; Generated Records
- Integrate Data: Merged Data
- Format Data: Reformatted Data

Modeling
- Select Modeling Technique: Modeling Technique; Modeling Assumptions
- Generate Test Design: Test Design
- Build Model: Parameter Settings; Models; Model Description
- Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
- Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
- Review Process: Review of Process
- Determine Next Steps: List of Possible Actions; Decision

Deployment
- Plan Deployment: Deployment Plan
- Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
- Produce Final Report: Final Report; Final Presentation
- Review Project: Experience Documentation
CRISP-DM: Phases
Business Understanding
- Understanding project objectives and requirements
- Data mining problem definition
Data Understanding
- Initial data collection and familiarization
- Identify data quality issues
- Initial, obvious results
Data Preparation
- Record and attribute selection
- Data cleansing
Modeling
- Run the data mining tools
Evaluation
- Determine if results meet business objectives
- Identify business issues that should have been addressed earlier
Deployment
- Put the resulting models into practice
- Set up for repeated/continuous mining of the data
Why CRISP-DM?
The data mining process must be reliable and repeatable by people with few data mining skills.

CRISP-DM provides a uniform framework for
- guidelines
- experience documentation
CRISP-DM is flexible to account for differences
- Different business/agency problems
- Different data

SAS SEMMA Process
SAS has its own data mining process, known as SEMMA:
Sample
Explore
Modify
Model
Assess
Many of the steps in the SEMMA process correspond directly to steps in the CRISP-DM methodology.
SAS SEMMA Process
Sample: identify input data sets (identify input data; sample from a larger data set; partition the data set into training, validation, and test data sets).
Explore: explore data sets statistically and graphically (plot the data, obtain descriptive statistics, identify important variables, perform association analysis).
Modify: prepare the data for analysis (create additional variables or transform existing variables for analysis, identify outliers, replace missing values, modify the way in which variables are used for the analysis, perform cluster analysis).
Model: fit a predictive model (model a target variable by using a regression model, a decision tree, or a neural network).
Assess: compare competing predictive models (build charts that plot the percentage of respondents, percentage of respondents captured, lift, and profit).
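Two of the steps above reduce to short calculations. The following is a minimal sketch in plain Python, not SAS code: the function names, the 60/20/20 split, and the scored list are assumptions made for illustration. `partition` covers the Sample step's training/validation/test split, and `lift_at` computes two of the Assess quantities (lift and percentage of respondents captured) for the top-scored fraction of cases.

```python
import random


def partition(rows, train=0.6, valid=0.2, seed=0):
    """Sample step: shuffle and split rows into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


def lift_at(scored, fraction):
    """Assess step: lift and share of respondents captured in the top-scored fraction.

    `scored` is a list of (model_score, responded) pairs.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_responders = sum(1 for _, responded in ranked[:top_n] if responded)
    all_responders = sum(1 for _, responded in ranked if responded)
    if all_responders == 0:
        raise ValueError("lift is undefined when there are no respondents")
    captured = top_responders / all_responders
    top_rate = top_responders / top_n            # response rate among top-scored cases
    base_rate = all_responders / len(ranked)     # response rate overall
    return top_rate / base_rate, captured


# Hypothetical model scores with observed responses
SCORED = [
    (0.90, True), (0.80, True), (0.70, False), (0.60, True), (0.50, False),
    (0.40, False), (0.30, False), (0.20, True), (0.10, False), (0.05, False),
]
```

With the hypothetical `SCORED` list, `lift_at(SCORED, 0.2)` looks at the two highest-scored cases. Both responded, so the top group responds at 2.5 times the overall rate of 40% and captures half of all respondents. Competing models can be compared by computing the same figures at the same fraction.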
Data Mining and Data Visualization
[Figure: pyramid of layers, listed here from top to bottom; potential to support business decisions increases toward the top. Roles shown beside the pyramid are given in parentheses.]
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Visualization in Data Mining Process

Data Visualization plays a critical role in Data Understanding and Data Preparation (CRISP-DM) and in the Sample, Explore, and Modify phases of the SAS SEMMA process.
The final step (Assessment/Evaluation) is highly dependent on Data Visualization techniques
- Visualize and communicate the results
- A bad or inappropriate technique may result in misunderstanding
- Misunderstanding may cause an incorrect decision

It is important to consider that the Data Mining process is useless if the results are not understandable.
Data Mining and Data Visualization
Need to determine techniques that balance simplicity with completeness
If this can be done for non-expert users:
- Simplicity and completeness lead to understanding
- Understanding leads to trust
- Trust leads to more use of KDD/DM
Result will be:
- Better business value
- Higher ROI
9 Laws of Data Mining by Tom Khabaza
1st Law of Data Mining, Business Goals Law: Business objectives are the origin of every data mining solution
2nd Law of Data Mining, Business Knowledge Law: Business knowledge is central to every step of the data mining process
3rd Law of Data Mining, Data Preparation Law: Data preparation is more than half of every data mining process
4th Law of Data Mining, NFL-DM: The right model for a given application can only be discovered by experiment, or There is No Free Lunch for the Data Miner
5th Law of Data Mining, Watkins Law: There are always patterns
6th Law of Data Mining, Insight Law: Data mining amplifies perception in the business domain
7th Law of Data Mining, Prediction Law: Prediction increases information locally by generalisation
8th Law of Data Mining, Value Law: The value of data mining results is not determined by the accuracy or stability of predictive models
9th Law of Data Mining, Law of Change: All patterns are subject to change
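The 4th law (the right model can only be discovered by experiment) can be illustrated with a small holdout comparison. This is a sketch under stated assumptions: the three candidate models, the toy data (label is true only in a middle range of x), and all function names are hypothetical and chosen for illustration. A different data set could just as easily favour a different candidate, which is the point of the law.

```python
def fit_majority(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(train):
    """Predict True at or above the cut that maximises training accuracy."""
    best_cut, best_hits = None, -1
    for cut, _ in train:
        hits = sum((x >= cut) == y for x, y in train)
        if hits > best_hits:
            best_cut, best_hits = cut, hits
    return lambda x: x >= best_cut


def fit_nearest(train):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(model, rows):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def run_experiment(train, holdout):
    """Fit every candidate on train, score on holdout, and return the winner."""
    candidates = {
        "majority": fit_majority,
        "threshold": fit_threshold,
        "nearest": fit_nearest,
    }
    scores = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores


# Hypothetical data: the label is True only for 3 <= x <= 6
TRAIN = [(x, 3 <= x <= 6) for x in range(10)]
HOLDOUT = [(0.5, False), (2.4, False), (3.6, True), (5.5, True),
           (6.4, True), (7.6, False), (8.5, False)]
```

Calling `run_experiment(TRAIN, HOLDOUT)` scores each candidate on data it was not fitted to. On this particular toy data the single threshold cannot represent a pattern that switches on and then off again, so the nearest-neighbour candidate scores highest, but that ranking is a result of the experiment, not something known in advance.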
Keeping Up with Your Quants*: Ask a Lot of Questions
1. What was the source of your data?
2. How well do the sample data represent the population?
3. Does your data distribution include outliers? How did they affect the results?
4. What assumptions are behind your analysis? Might certain conditions render your assumptions and your model invalid?
5. Why did you decide on that particular analytical approach? What alternatives did you consider?
6. How likely is it that the independent variables are actually causing the changes in the dependent variable? Might other analyses establish causality more clearly?
Keep up with Your Quants, Davenport, Harvard Business Review
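Question 3 can be made concrete with a quick check. The sketch below uses Python's standard `statistics` module; the sample values and the 1.5 x IQR cut-off (a common rule of thumb, not something stated in the source) are assumptions for illustration. Comparing the mean with the median shows how much a single extreme value moves a summary, and the second function lists the points responsible.

```python
import statistics


def summarize(values):
    """Return (mean, median); a large gap between them suggests outlier influence."""
    return statistics.mean(values), statistics.median(values)


def iqr_outliers(values, k=1.5):
    """Flag points more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical sample with one extreme value
SAMPLE = [52, 48, 50, 51, 49, 47, 53, 500]
```

For `SAMPLE`, the mean is 106.25 while the median is 50.5, and `iqr_outliers(SAMPLE)` flags the value 500. Dropping that one point brings the mean back to 50, which is the kind of sensitivity the question is probing for.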
DATA VISUALIZATION AND STORY TELLING WITH DATA
YOUTUBE: 200 Countries, 200 years, 4 minutes: The Joy of Stats
YOUTUBE: Persuasion and the Power of the Story
NEXT TOPIC: PREDICTIVE ANALYTICS USING DECISION TREES
