Figure: roles around the Key Business Processes
3) Data Scientists: build and refine analytic models
4) BI: publishes new insights
5) Business: consumes insights and measures effectiveness
Big Data: How Data Powers Big Business, Bill Schmarzo, Wiley
Data Scientist Lifecycle
1 Discovery: Do I have enough information to draft an analytic plan?
2 Data Prep: Do I have enough good data to start building the model?
3 Model Planning
5 Communicate Results
6 Operationalize
Big Data: How Data Powers Big Business, Bill Schmarzo, Wiley
Why Should There be a Standard Process?
Framework for recording experience
Allows projects to be replicated
Aid to project planning and management
Comfort factor for new adopters
Demonstrates maturity of Data Mining
Reduces dependency on stars
CRISP-DM Phases and Tasks
CRISP-DM : A new Blueprint for Data Mining, Shearer, Journal of Data Warehousing
Each task is followed by its outputs.
Business Understanding
Determine Business Objectives: Background; Business Objectives; Business Success Criteria
Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
Collect Initial Data: Initial Data Collection Report
Describe Data: Data Description Report
Explore Data: Data Exploration Report
Verify Data Quality: Data Quality Report
Data Preparation
Overall outputs: Data Set; Data Set Description
Select Data: Rationale for Inclusion / Exclusion
Clean Data: Data Cleaning Report
Construct Data: Derived Attributes; Generated Records
Integrate Data: Merged Data
Format Data: Reformatted Data
Modeling
Select Modeling Technique: Modeling Technique; Modeling Assumptions
Generate Test Design: Test Design
Build Model: Parameter Settings; Models; Model Description
Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
Review Process: Review of Process
Determine Next Steps: List of Possible Actions; Decision
Deployment
Plan Deployment: Deployment Plan
Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
Produce Final Report: Final Report; Final Presentation
Review Project: Experience Documentation
CRISP-DM: Phases
Business Understanding
Understanding project objectives and requirements
Data mining problem definition
Data Understanding
Initial data collection and familiarization
Identify data quality issues
Initial, obvious results
Data Preparation
Record and attribute selection
Data cleansing
Modeling
Run the data mining tools
Evaluation
Determine if results meet business objectives
Identify business issues that should have been addressed earlier
Deployment
Put the resulting models into practice
Set up for repeated/continuous mining of the data
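The "identify data quality issues" step of Data Understanding can be made concrete with a small sketch. The records, the field names, and the choice to treat None and empty strings as missing are assumptions made for illustration; CRISP-DM itself does not prescribe any particular check.

```python
def data_quality_report(records, fields):
    """Count missing and distinct values per field.

    Assumption: a value is 'missing' if it is None or an empty string.
    """
    report = {}
    n = len(records)
    for field in fields:
        values = [r.get(field) for r in records]
        present = [v for v in values if v is not None and v != ""]
        missing = len(values) - len(present)
        report[field] = {
            "missing": missing,
            "missing_pct": 100.0 * missing / n if n else 0.0,
            "distinct": len(set(present)),
        }
    return report


# Hypothetical customer records, invented for this example
records = [
    {"id": 1, "age": 34, "region": "North"},
    {"id": 2, "age": None, "region": "South"},
    {"id": 3, "age": 51, "region": ""},
    {"id": 4, "age": 34, "region": "North"},
]
report = data_quality_report(records, ["id", "age", "region"])
for field, stats in report.items():
    print(field, stats)
```

A report like this is one possible input to the Data Quality Report output; which fields to check and what counts as an issue depend on the project.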
Why CRISP-DM?
The data mining process must be reliable and repeatable by people with little
data mining skill
SAS SEMMA Process
SAS has its own data mining process, known as SEMMA:
Sample
Explore
Modify
Model
Assess
Many of the steps in the SEMMA process correspond directly to steps in the
CRISP-DM methodology
SAS SEMMA Process
Sample: identify input data sets (identify input data; sample from a larger data
set; partition data set into training, validation, and test data sets).
Explore: explore data sets statistically and graphically (plot the data, obtain
descriptive statistics, identify important variables, perform association analysis).
Modify: prepare the data for analysis (create additional variables or transform
existing variables for analysis, identify outliers, replace missing values, modify the
way in which variables are used for the analysis, perform cluster analysis).
Model: fit a predictive model (model a target variable by using a regression
model, a decision tree, a neural network).
Assess: compare competing predictive models (build charts that plot the
percentage of respondents, percentage of respondents captured, lift, and profit).
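Two of the steps above, partitioning in Sample and lift in Assess, can be sketched in a few lines of plain Python. This is not SAS code: the 60/20/20 split, the function names, and the scores and labels are assumptions for illustration. Lift is taken here as the response rate among the top-scored fraction divided by the overall response rate, which is a common definition but may differ in detail from the charts SAS produces.

```python
import random


def partition(rows, train=0.6, valid=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # whatever remains is the test set
    )


def lift_at(scores, labels, fraction):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


train_set, valid_set, test_set = partition(range(100))

# Hypothetical model scores and actual responses (1 = responded)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
top_20_lift = lift_at(scores, labels, 0.2)
print(len(train_set), len(valid_set), len(test_set), round(top_20_lift, 2))
```

A lift above 1 in the top fraction means the model ranks respondents ahead of non-respondents better than random selection would; comparing this figure across competing models is one way to carry out the Assess step.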
Data Mining and Data Visualization
Figure: layers from top to bottom, with the typical role at each layer; potential to support business decisions increases toward the top
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Visualization in Data Mining Process
8th Law of Data Mining, Value Law: The value of data mining results is not
determined by the accuracy or stability of predictive models
9th Law of Data Mining, Law of Change: All patterns are subject to change
Keeping Up with Your Quants*: Ask a Lot of Questions
1. What was the source of your data?
2. How well do the sample data represent the population?
3. Does your data distribution include outliers? How did they affect the
results?
4. What assumptions are behind your analysis? Might certain conditions
render your assumptions and your model invalid?
5. Why did you decide on that particular analytical approach? What
alternatives did you consider?
6. How likely is it that the independent variables are actually causing the
changes in the dependent variable? Might other analyses establish
causality more clearly?
* Keep Up with Your Quants, Davenport, Harvard Business Review
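Question 3 on outliers can be answered with a quick check. The sketch below uses the 1.5 x IQR rule, which is one common convention and not something the article prescribes; the sample values are invented for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (the 1.5 x IQR rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical order values with one extreme observation
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
outliers = iqr_outliers(values)

# Compare the mean with and without the flagged values
mean_all = statistics.mean(values)
mean_trimmed = statistics.mean(v for v in values if v not in outliers)
print(outliers, mean_all, round(mean_trimmed, 2))
```

Reporting a summary statistic both with and without the flagged values is a simple way to show how much the outliers affect the result.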
DATA VISUALIZATION AND STORY TELLING WITH DATA
YOUTUBE: 200 Countries, 200 Years, 4 Minutes: The Joy of Stats
YOUTUBE: Persuasion and the Power of the Story
NEXT TOPIC: PREDICTIVE ANALYTICS USING DECISION TREES