Content

1 An Overview of Data Warehouse
2 Data Warehouse Architecture
3 Data Modeling for Data Warehouse
4 Overview of Data Cleansing
6 Metadata Management
7 OLAP
8 Data Warehouse Testing
An Overview
Understanding What a Data Warehouse Is
Components of Warehouse
Source Tables: real-time, volatile data held in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.
ETL Tools: extract, cleanse, transform (aggregates, joins), and load the data from the sources to the target.
Maintenance and Administration Tools: authorize and monitor access to the data, set up users, and schedule jobs to run in off-peak periods.
Modeling Tools: used to design the data warehouse for high performance with dimensional data modeling techniques, and to map source to target files.
Databases: target databases and data marts, which are part of the data warehouse. These are structured for analysis and reporting purposes.
End-user tools for analysis and reporting: retrieve reports and analyze the data from the target tables. Different kinds of querying, data mining, and OLAP tools are used for this purpose.
The architecture includes a staging area, where data is loaded and tested after cleansing and transformation. From there it is loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
Data Modeling
Effective way of using a Data Warehouse
Data Modeling

The E-R data model is commonly used in OLTP; in OLAP, the dimensional data model is most common.

E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified by its properties and characteristics, such as an employee, a book, or a student.
Relationship: an association relating entities to other entities.
Star Schema
Dimension Table
product
prodId  name  price
p1      bolt  10
p2      nut   5
Dimension Table
store
storeId  city
c1       nyc
c2       sfo
c3       la
Fact Table
sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50
Dimension Table
customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
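The typical use of a star schema is a query that joins the fact table to a dimension and aggregates a measure. A minimal sketch with SQLite, loading the example tables above (the third order's id is assumed to be "o105"):

```python
import sqlite3

# Build the example star schema in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer(custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale    (orderId TEXT, date TEXT, custId INTEGER,
                      prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
""")
cur.executemany("INSERT INTO product VALUES (?,?,?)",
                [("p1", "bolt", 10), ("p2", "nut", 5)])
cur.executemany("INSERT INTO store VALUES (?,?)",
                [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
cur.executemany("INSERT INTO customer VALUES (?,?,?,?)",
                [(53, "joe", "10 main", "sfo"), (81, "fred", "12 main", "sfo"),
                 (111, "sally", "80 willow", "la")])
cur.executemany("INSERT INTO sale VALUES (?,?,?,?,?,?,?)",
                [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                 ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                 ("o105", "3/8/97", 111, "p1", "c3", 5, 50)])

# Fact-to-dimension join: total sales amount by store city.
rows = cur.execute("""
    SELECT s.city, SUM(f.amt)
    FROM sale f JOIN store s ON f.storeId = s.storeId
    GROUP BY s.city
""").fetchall()
print(dict(rows))  # nyc = 12 + 11 = 23, la = 50
```

This single-join shape is exactly what makes star schemas fast for retrieval: the fact table carries the keys, the dimensions carry the descriptions.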
Snowflake Schema
Dimension Table Fact Table
store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy
sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table

city
cityId  pop  regId
sfo     1M   north
la      5M   south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval matters more than efficiency of data manipulation. Accordingly, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.
The Need For Data Quality

Difficulty in decision making
Time delays in operations
Organizational mistrust
Data ownership conflicts
Customer attrition
Costs associated with poor-quality data
Continuous Monitoring
Identify & correct the cause of defects
Refine data capture mechanisms at the source
Educate users on the importance of DQ
2009 Wipro Ltd - Confidential
Limitations
Large numbers of custom programs across different environments are difficult to manage
Minor alterations demand coding effort
Limitations
Not all variables can be discovered
Some discovered rules may not be pertinent
There may be performance problems with large files or with many fields
ETL Architecture
[Diagram: visitors' web browsers reach the site over the Internet; web server logs and e-commerce transaction data land in flat files; scheduled extraction moves them into a staging area (data collection, extraction, transformation), and scheduled loading moves them into the RDBMS (data loading).]
Data transformation

Integrating dissimilar data types
Changing codes
Adding a time attribute
Summarizing data
Calculating derived values
Renormalizing data
Restructuring records or fields
Removing operational-only data
Supplying missing field values
Data integrity checks
Data consistency and range checks

Data loading

Initial and incremental loading
Updating metadata
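Several of the transformation tasks listed above can be sketched in a single record-level function. This is a hedged illustration: the field names, the country-code mapping, and the range rule are invented for the example, not taken from the slides.

```python
# Illustrative code table for the "changing codes" step.
COUNTRY_CODES = {"US": "United States", "IN": "India"}

def transform(record):
    """Transform one source record for loading; return None if rejected."""
    out = dict(record)
    # Changing codes: expand a source code into the warehouse value.
    out["country"] = COUNTRY_CODES.get(record.get("country"), "Unknown")
    # Calculating derived values.
    out["amount"] = record["qty"] * record["unit_price"]
    # Supplying missing field values.
    out.setdefault("channel", "UNSPECIFIED")
    # Range/consistency check: reject records with a non-positive quantity.
    if out["qty"] <= 0:
        return None
    return out

print(transform({"qty": 2, "unit_price": 5, "country": "US"}))
# {'qty': 2, 'unit_price': 5, 'country': 'United States', 'amount': 10, 'channel': 'UNSPECIFIED'}
```

Returning None for a failed check mirrors the reject-record handling that the testing sections later describe.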
Why ETL ?
Companies have valuable data scattered throughout their networks that needs to be moved from one place to another. The data lives in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve this problem, companies use extract, transform, and load (ETL) software.
ETL Tools

Provide a facility to specify a large number of transformation rules with a GUI
Generate programs to transform data
Handle multiple data sources
Handle data redundancy
Generate metadata as output
Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
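The parallelism point can be illustrated in-process: apply a transformation rule to many source rows across a pool of worker threads. Real ETL tools parallelize across servers; this toy sketch (the doubling rule is a stand-in) only shows the idea.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(row):
    """Stand-in for a real transformation rule."""
    return row * 2

rows = range(10)
# pool.map fans the rows out to worker threads and preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    transformed = list(pool.map(transform, rows))

print(transformed)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```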
Metadata Management
29
What Is Metadata?
Metadata is information:

that describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
about the data being captured and loaded into the warehouse
documented in IT tools that improve both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating information: How much time is spent looking for information? How often is the information actually found? What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical users
Warehouse administrator
Application developer

Business users - business metadata
Meanings
Definitions
Business rules

Software tools used in DW life-cycle development
Metadata requirements for each tool must be identified
The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
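What a business-metadata entry might hold for one warehouse column can be sketched as a small record. All of the field names and values here are hypothetical, invented to illustrate the kind of content a repository serves to the consumers listed above.

```python
# Hypothetical business-metadata entry for one warehouse column.
metadata = {
    "table": "sale",
    "column": "amt",
    "business_definition": "Sale amount in USD after discounts",
    "source": "orders.order_line.net_amount",
    "transformation": "SUM over order lines, rounded to cents",
    "owner": "Finance",
    "loaded_by": "nightly ETL batch",
}

def describe(meta):
    """Render a one-line description a business user could read."""
    return (f"{meta['table']}.{meta['column']}: "
            f"{meta['business_definition']} (owner: {meta['owner']})")

print(describe(metadata))
```

The same entry answers both a business question ("what does amt mean?") and a technical one ("where does it come from?").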
Reischmann-Informatik-Toolbus
Features include selective bridging of metadata between tools.
OLAP
Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques
Architectures
Features
Representative Tools
12/20/2012
OLAP System

Source of data: consolidated data; OLAP data comes from the various OLTP databases
Purpose of data: decision support
Views: multi-dimensional views of various kinds of business activities
Refresh: periodic long-running batch jobs refresh the data
MDDB Concepts
A multidimensional database is a software system designed for efficient and convenient storage and retrieval of closely related data that can be stored, viewed, and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions; the individual items within each dimension are called members.
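A minimal sketch of these definitions, using the Sales Volumes dimensions from the slides: the cube's cell count is the product of the dimension sizes, and sparse storage keeps only the cells that actually hold data (the two sample sales figures are illustrative).

```python
from itertools import product

# Three dimensions, three members each, per the Sales Volumes example.
dimensions = {
    "MODEL": ["Mini Van", "Sedan", "Coupe"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Clyde", "Carr", "Gleason"],
}

# The full cube has one cell per member combination: 3 x 3 x 3 = 27.
cells = list(product(*dimensions.values()))
print(len(cells))  # 27

# Sparse storage: keep only populated cells, keyed by member tuple.
sales = {("Sedan", "Blue", "Clyde"): 6, ("Coupe", "Red", "Carr"): 5}
density = len(sales) / len(cells)
print(f"{density:.0%}")  # 7%
```

The low density foreshadows the sparsity problem discussed below: most member combinations never interact, yet a dense cube must reserve a cell for each.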
MDDB
[Figure: a Sales Volumes cube over MODEL (Mini Van, Sedan, ...), COLOR, and DEALERSHIP; 3 x 3 x 3 = 27 cells, growing to 27 x 4 = 108 cells in the extended example.]
Sparsity
- Input data in applications are typically sparse
- Sparsity increases with the number of dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)
Issues with MDDB - Sparsity Example

If dimension members of different dimensions (e.g., Employee Age and Last Name) do not interact, a blank cell is left behind. [Figure: a sparse cube over LAST NAME, EMPLOYEE #, and AGE, shown alongside the dense Sales Volumes cube.]
OLAP Features
Calculations applied across dimensions, through hierarchies, and/or across members
Trend analysis over sequential time periods; what-if scenarios
Slicing/dicing subsets for on-screen viewing
Rotation to new dimensional comparisons in the viewing area
Drill-down/up along the hierarchy
Reach-through/drill-through to underlying detail data
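Three of these features — slicing, dicing, and rotation — can be sketched in pure Python over a small fact list. The data values here are illustrative, not taken from the slides.

```python
facts = [
    {"model": "Sedan", "color": "Blue", "qty": 6},
    {"model": "Sedan", "color": "Red",  "qty": 5},
    {"model": "Coupe", "color": "Blue", "qty": 3},
]

# Slice: fix one dimension member (color = Blue).
blue = [f for f in facts if f["color"] == "Blue"]

# Dice: a sub-cube over chosen members of several dimensions.
sub = [f for f in facts
       if f["model"] == "Sedan" and f["color"] in ("Blue", "Red")]

# Rotate (pivot): re-key the same cells as (color, model) instead of (model, color).
rotated = {(f["color"], f["model"]): f["qty"] for f in facts}

print(len(blue), len(sub), rotated[("Blue", "Coupe")])  # 2 2 3
```

Rotation changes only the viewing arrangement, never the cell values — the same fact is reachable under either key order.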
[Figure: rotating the cube 90° swaps the axes of the on-screen view, e.g., MODEL x COLOR (View #1) becomes COLOR x MODEL (View #2).]
[Figure: successive 90° rotations of the MODEL x COLOR x DEALERSHIP cube yield six two-dimensional views (Views #1-#6).]
[Figure: drill-down in the Sales Volumes cube - COLOR is expanded into Normal Blue and Metal Blue for the Coupe model at the Carr and Clyde dealerships.]
ORGANIZATION DIMENSION

REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Moving up and moving down a hierarchy is referred to as drill-up / roll-up and drill-down.
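Roll-up along this organization hierarchy can be sketched as repeated aggregation through a child-to-parent map. The dealership names follow the figure; the sales figures are illustrative.

```python
# Child -> parent links: dealership -> district, district -> region.
parent = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
    "Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest",
}

dealership_sales = {"Clyde": 10, "Gleason": 7, "Carr": 4,
                    "Levi": 6, "Lucas": 2, "Bolton": 1}

def roll_up(values, parent):
    """Aggregate child-level values to their parents (one drill-up step)."""
    out = {}
    for child, v in values.items():
        out[parent[child]] = out.get(parent[child], 0) + v
    return out

district = roll_up(dealership_sales, parent)  # Chicago 17, St. Louis 10, Gary 3
region = roll_up(district, parent)            # Midwest 30
print(district, region)
```

Drill-down is the inverse navigation: starting from the Midwest total and expanding back to districts, then dealerships.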
[Figure: the time dimension, spanning 1st Qtr through 4th Qtr.]
[Figure: MOLAP architecture - web browsers and OLAP tools access an OLAP cube through the OLAP calculation engine.]
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical)
Aggregation and calculation capabilities
Read/write analytic applications
Specialized data structures for maximum query performance and optimum space utilization
[Figure: ROLAP architecture - web browsers, OLAP tools, and OLAP applications query a relational DW through an OLAP calculation engine that issues SQL.]
[Figure: HOLAP architecture - any client (web browsers, OLAP tools, OLAP applications) accesses the relational DW via SQL through the OLAP calculation engine.]
HOLAP - Features
RDBMS used for detailed data stored in large databases
MDDB used for fast, read/write OLAP analysis and calculations
Scalability of RDBMS and MDDB performance
Calculation engine provides full analysis features
Source of data is transparent to the end user
Architecture Comparison
MOLAP
- Definition: MDDB OLAP = transaction-level data + summary in MDDB
- Data explosion due to sparsity: high (may go beyond control; estimation is very important)
- Data explosion due to summarization: good design keeps it to 3-10 times
- Query execution speed: fast (depends upon the size of the MDDB)
- Where to apply: small transactional data + complex model + frequent summary analysis

ROLAP
- Definition: Relational OLAP = transaction-level data + summary in RDBMS
- Data explosion due to sparsity: no sparsity
- Data explosion due to summarization: to the necessary extent
- Query execution speed: slow
- Where to apply: very large transactional data that needs to be viewed/sorted

HOLAP
- Definition: Hybrid OLAP = ROLAP + summary in MDDB
- Data explosion due to sparsity: sparsity exists only in the MDDB part
- Data explosion due to summarization: to the necessary extent
- Query execution speed: optimum - like ROLAP when the data is fetched from the RDBMS, otherwise like MOLAP
- Cost: high - RDBMS + disk space + MDDB server cost
- Where to apply: large transactional data + frequent summary analysis
Representative Tools

Oracle - Express Products
Hyperion - Essbase
Cognos - PowerPlay
Seagate - Holos
SAS
MicroStrategy - DSS Agent
Informix - MetaCube
Brio Query
Business Objects / Web Intelligence
Sales Analysis
Financial Analysis
Profitability Analysis
Performance Analysis
Risk Management
Profiling & Segmentation
Scorecard Application
NPA Management
Strategic Planning
Customer Relationship Management (CRM)
The methodology required for testing a data warehouse differs from that for testing a typical transaction system.
In a data warehouse, most of the testing is system-triggered. In contrast, most production/source-system testing exercises the processing of individual transactions driven by user input (application forms, servicing requests); very few test cycles cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
Are the requirements complete?
Are the requirements singular?
Are the requirements unambiguous?
Are the requirements developable?
Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box: it should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
Whether the ETLs are accessing and picking up the right data from the right source
All data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
Testing the rejected records that don't fulfil the transformation rules
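A white-box ETL unit test of this kind can be sketched with plain assertions: verify that a transformation rule populates the target correctly and that violating records are rejected. The rule itself (known product code, non-negative amount) is illustrative, not from the slides.

```python
VALID_PRODUCTS = {"p1", "p2"}  # illustrative reference set

def transform(row):
    """Apply an illustrative business rule; return None for rejected records."""
    if row["prodId"] not in VALID_PRODUCTS or row["amt"] < 0:
        return None
    return {**row, "amt_usd": row["amt"]}  # stand-in target mapping

def test_transform():
    source = [
        {"prodId": "p1", "amt": 12},
        {"prodId": "p9", "amt": 5},   # unknown product -> rejected
        {"prodId": "p2", "amt": -1},  # fails range check -> rejected
    ]
    loaded = [t for t in map(transform, source) if t is not None]
    assert len(loaded) == 1                 # only the valid record loads
    assert len(source) - len(loaded) == 2   # two records rejected
    assert loaded[0]["amt_usd"] == 12       # target field populated correctly

test_transform()
print("ok")
```

The same pattern scales up: one test per mapping, each asserting both the happy path and the reject path.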
Unit Testing
Unit Testing the Report data:
Verify report data with the source: data in a data warehouse is stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the warehouse against the available source data.
Field-level data verification: the QA team must understand the lineage of the fields displayed in the report and trace them back for comparison with the source systems.
Derivation formulae/calculation rules should be verified.
Integration Testing
Integration testing will involve following:
Sequence of ETL jobs in the batch
Initial loading of records into the data warehouse
Incremental loading of records at a later date, to verify the newly inserted or updated data
Testing the rejected records that don't fulfil the transformation rules
Error log generation
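The initial-versus-incremental distinction can be sketched as insert-or-update by business key: the first batch populates the target, and a later batch inserts new keys and updates changed ones. Key and field names here are illustrative.

```python
target = {}  # custId -> record, standing in for the warehouse table

def load(batch):
    """Insert-or-update each record by its business key."""
    for rec in batch:
        target[rec["custId"]] = rec

# Initial load.
load([{"custId": 53, "city": "sfo"}, {"custId": 81, "city": "sfo"}])
# Incremental load: 53 is updated, 111 is newly inserted.
load([{"custId": 53, "city": "la"}, {"custId": 111, "city": "la"}])

print(len(target), target[53]["city"])  # 3 la
```

An integration test would assert exactly these outcomes: the row count after each batch, and that changed keys carry the new values.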
Performance Testing
Performance testing should check that ETL processes complete within the agreed time window.
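A minimal sketch of that check: time an ETL step and assert it finishes inside its load window. The one-second window and the summation standing in for the ETL work are both illustrative.

```python
import time

WINDOW_SECONDS = 1.0  # illustrative load window

start = time.perf_counter()
total = sum(range(100_000))  # stand-in for the ETL work
elapsed = time.perf_counter() - start

# The performance test is just an assertion over the measured duration.
assert elapsed < WINDOW_SECONDS, "ETL exceeded its load window"
print(f"completed in {elapsed:.3f}s")
```

In practice the window comes from the batch schedule, and the measurement wraps the whole job rather than a single step.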
Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality, and reporting.
Questions

Thank You
Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing
80
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
81
An Overview
Understanding What is a Data Warehouse
82
83
84
85
Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing
87
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
88
An Overview
Understanding What is a Data Warehouse
89
90
91
92
Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing
94
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
95
An Overview
Understanding What is a Data Warehouse
96
97
98
99
Components of Warehouse
Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
100
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
101
Data Modeling
Effective way of using a Data Warehouse
102
Data Modeling Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.
104
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId c1 c2 c3 city nyc sfo la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
105
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south
Dimension Table
cityId pop sfo 1M la 5M
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
106
107
The Need For Data Quality Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with
108
Continuous Monitoring
Identify & Correct Cause of Defects Refine data capture mechanisms at source Educate users on importance of DQ
2009 Wipro Ltd - Confidential
109
Limitations
Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts
Limitation
110
2009 Wipro Ltd - Confidential
Limitations
Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields.
113
ETL Architecture
Visitors
Web Browsers
The Internet
Staging Area
Web Server Logs & E-comm Transaction Data Flat Files
Scheduled Extraction
RDBMS
Scheduled Loading
Data Collection
Data Extraction
Data Transformation
Data Loading
114
Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data
Data loading
Initial and incremental loading Updation of metadata
115
Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, 2009 Wipro Ltd - Confidential
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
116
2009 Wipro Ltd - Confidential
117
118
ETL Tools Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment
119
2009 Wipro Ltd - Confidential
Metadata Management
120
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
121
Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information?
122
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
124
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
126
OLAP
128
Agenda
OLAP Definition Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques Architectures
Features
Representative Tools
12/20/2012
129
129
130
OLAP System Consolidation data; OLAP data comes from the various OLTP databases Decision support
Purpose of data
Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the 131 data
131
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
132
MDDB
Sales Volumes
M O D E L
Mini Van
Sedan
DEALERSHIP
COLOR
27 x 4 = 108 cells
133
3 x 3 x 3 = 27 cells
134
Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions
Data Explosion
-Due to Sparsity -Due to Summarization
Performance
-Doesnt perform better than RDBMS at high data volumes (>20-30 GB)
12/20/2012
135
2009 Wipro Ltd - Confidential
135
Issues with MDDB - Sparsity Example If dimension members of different dimensions Employee Age do not interact , then blank cell is left behind. LAST NAME EMP# AGE
Smith
M O D E L
01 21 12 Sales Volumes 19 31 63 Miini Van 14 6 5 31 4 54 3 5 27 Coupe 5 03 56 4 3 2 Sedan 41 45 Blue Red White 33 COLOR 41 23 19
21
Regan
19 63 31 27 56 45 41 19
31 41 23 01 14 54 03 12 33
Fox
L A S T N A M E
Weld
Kelly
Link
Kranz
Lucas
Weiss
EMPLOYEE #
12/20/2012
136
2009 Wipro Ltd - Confidential
136
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
12/20/2012
137
2009 Wipro Ltd - Confidential
137
M O D E L
Mini Van
6 3 4
Blue
5 5 3
Red
4 5 2
White
Coupe
C O L O R ( ROTATE 90 )
o
Blue
6 5 4
3 5 5
MODEL
4 3 2
Sedan
Red
Sedan
White
COLOR
View #1
View #2
138
M O D E L
Sedan
C O L O R
Blue
C O L O R
Blue
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
( ROTATE 90 )
DEALERSHIP
DEALERSHIP
MODEL
View #1
D E A L E R S H I P D E A L E R S H I P
View #2
View #3
Mini Van
Clyde
M O D E L
Sedan
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
MODEL
COLOR
COLOR
View #4
View #5
View #6
139
Sales Volumes
M O D E L
Coupe
Carr Clyde
Carr Clyde
Normal Blue
Metal Blue
DEALERSHIP
COLOR
12/20/2012
140
2009 Wipro Ltd - Confidential
140
ORGANIZATION DIMENSION
REGION Midwest
DISTRICT
Chicago
St. Louis
Gary
DEALERSHIP
Clyde
Gleason
Carr
Levi
Lucas
Bolton
Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down
12/20/2012
141
2009 Wipro Ltd - Confidential
141
12/20/2012
142
2009 Wipro Ltd - Confidential
142
1st Qtr
4th Qtr
143
144
12/20/2012
145
2009 Wipro Ltd - Confidential
145
OLAP
Cube
OLAP Calculation Engine
Web Browser
OLAP Tools
146
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
12/20/2012
147
2009 Wipro Ltd - Confidential
147
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
12/20/2012
148
2009 Wipro Ltd - Confidential
148
12/20/2012
149
2009 Wipro Ltd - Confidential
149
Any Client
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
12/20/2012
150
2009 Wipro Ltd - Confidential
150
HOLAP - Features
RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user
12/20/2012
151
2009 Wipro Ltd - Confidential
151
Architecture Comparison
MOLAP
Definition
ROLAP
HOLAP
Hybrid OLAP = ROLAP + summary in MDDB Sparsity exists only in MDDB part To the necessary extent
MDDB OLAP = Relational OLAP = Transaction level data + Transaction level data + summary in MDDB summary in RDBMS Good Design 3 10 times High (May go beyond control. Estimation is very important) Fast - (Depends upon the size of the MDDB) No Sparsity To the necessary extent
Data explosion due to Sparsity Data explosion due to Summarization Query Execution Speed
Slow
Optimum - If the data is fetched from RDBMS then its like ROLAP otherwise like MOLAP. High: RDBMS + disk space + MDDB Server cost Large transactional data + frequent summary analysis
Cost
Where to apply?
Small transactional Very large transactional data + complex model + data & it needs to be frequent summary viewed / sorted analysis
12/20/2012
152
2009 Wipro Ltd - Confidential
152
Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS
Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence
12/20/2012
153
2009 Wipro Ltd - Confidential
153
Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
12/20/2012
154
2009 Wipro Ltd - Confidential
154
155
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
156
157
In data Warehouse, most of the testing is system triggered. Most of the production/Source system testing is the processing of individual transactions, which are driven by some input from the users (Application Form, Servicing Request.). There are very few test cycles, which cover the system-triggered scenarios (Like billing, Valuation.)
158
159
160
161
Requirements testing
The main aim for doing Requirements testing is to check stated requirements for completeness. Requirements can be tested on following factors. Are the requirements Complete? Are the requirements Singular? Are the requirements Ambiguous? Are the requirements Developable? Are the requirements Testable?
162
Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures: Whether ETLs are accessing and picking up right data from right source.
All the data transformations are correct according to the business rules and data warehouse is correctly populated with the transformed data.
Testing the rejected records that dont fulfil transformation rules.
163
Unit Testing
Unit Testing the Report data:
Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems Derivation formulae/calculation rules should be verified
164
Integration Testing
Integration testing will involve following:
Sequence of ETLs jobs in batch. Initial loading of records on data warehouse. Incremental loading of records at a later date to verify the newly inserted or updated data. Testing the rejected records that dont fulfil transformation rules. Error log generation
165
Performance Testing
Performance Testing should check for : ETL processes completing within time window.
166
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.
167
Questions
168
Thank You
169
Components of Warehouse
Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
170
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
171
Data Modeling
Effective way of using a Data Warehouse
172
Data Modeling Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.
174
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId c1 c2 c3 city nyc sfo la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
175
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south
Dimension Table
cityId pop sfo 1M la 5M
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
176
177
The Need For Data Quality Difficulty in decision making Time delays in operation Organizational mistrust Data ownership conflicts Customer attrition Costs associated with
178
Continuous Monitoring
Identify & Correct Cause of Defects Refine data capture mechanisms at source Educate users on importance of DQ
2009 Wipro Ltd - Confidential
179
Limitations
Tons of Custom programs in different environments are difficult to manage Minor alterations demand coding efforts
Limitation
180
2009 Wipro Ltd - Confidential
Limitations
Not all variables can be discovered Some discovered rules might not be pertinent There may be performance problems with large files or with many fields.
183
ETL Architecture
Visitors
Web Browsers
The Internet
Staging Area
Web Server Logs & E-comm Transaction Data Flat Files
Scheduled Extraction
RDBMS
Scheduled Loading
Data Collection
Data Extraction
Data Transformation
Data Loading
184
Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data
Data loading
Initial and incremental loading Updation of metadata
185
Restructuring of records or fields Removal of Operational-only data Supply of missing field values Data Integrity checks Data Consistency and Range checks, 2009 Wipro Ltd - Confidential
Why ETL ?
Companies have valuable data spread throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve this problem, companies use extract, transform and load (ETL) software.
ETL Tools
- Provide a GUI for specifying a large number of transformation rules
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
Metadata Management
What Is Metadata?
Metadata is Information...
- that describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- about the data being captured and loaded into the warehouse
- documented in IT tools that improve both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating information:
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical users
- Warehouse administrator
- Application developer

Business users (business metadata)
- Meanings
- Definitions
- Business rules

Software tools used in DW life-cycle development
- Metadata requirements for each tool must be identified
- Tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Its features include selective bridging of metadata between tools.
OLAP
Agenda
- OLAP definition; distinction between OLTP and OLAP
- MDDB concepts
- Implementation techniques and architectures
- Features
- Representative tools
12/20/2012
OLAP System
- Source of data: consolidated data; OLAP data comes from the various OLTP databases
- Purpose of data: decision support; multi-dimensional views of various kinds of business activities
- Refresh: periodic long-running batch jobs refresh the data
MDDB Concepts
A multidimensional database (MDDB) is a software system designed for efficient and convenient storage and retrieval of data that is closely related and can be stored, viewed and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data: the edges of the cube are called dimensions, and the individual items within each dimension are called members.
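The hypercube idea can be sketched with a dict whose keys take one member from each dimension. Dimension and member names follow the slides' Sales Volumes example; the cell values are made up for illustration.

```python
# Each cell is addressed by one member from each dimension.
dimensions = {
    "MODEL":      ["Mini Van", "Coupe", "Sedan"],
    "COLOR":      ["Blue", "Red", "White"],
    "DEALERSHIP": ["Carr", "Clyde", "Gleason"],
}

cube = {  # (model, color, dealership) -> sales volume
    ("Sedan", "Blue", "Carr"): 6,
    ("Coupe", "Red", "Clyde"): 5,
}

# The edges of the cube are the dimensions; the items along each edge are
# the members. This cube logically has 3 x 3 x 3 = 27 cells.
n_cells = 1
for members in dimensions.values():
    n_cells *= len(members)
print(n_cells)                          # 27
print(cube[("Sedan", "Blue", "Carr")])  # 6
```

Retrieval "from a different perspective" is then just a different pattern over the key tuple, which is what MDDB engines optimize heavily.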
MDDB
[Diagram: a Sales Volumes cube with dimensions MODEL (Mini Van, Sedan, ...), COLOR and DEALERSHIP, giving 3 x 3 x 3 = 27 cells; a fourth dimension with 4 members would give 27 x 4 = 108 cells.]
Sparsity
- Input data in applications is typically sparse
- Sparsity increases with the number of dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)
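The sparsity and data-explosion points can be made concrete with a little arithmetic. The member counts follow the 3 x 3 x 3 cube on the slides; the populated-cell count is made up.

```python
# Only a few member combinations ever hold input data, yet a dense cube
# reserves every cell, and each added dimension multiplies the cell count.
models, colors, dealers = 3, 3, 3          # members per dimension (as on the slide)
populated_cells = 2                        # actual input facts (illustrative)

total_cells = models * colors * dealers    # 27 cells in the dense cube
fill_rate = populated_cells / total_cells
print(total_cells, f"{fill_rate:.0%}")     # 27 7%

# Data explosion: adding a 4-member time dimension multiplies the cube size,
# and pre-summarizing along hierarchies adds still more cells on top of that.
print(total_cells * 4)                     # 108
```

Real MDDB engines therefore store cubes sparsely and estimate the explosion factor before building summaries.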
Issues with MDDB - Sparsity Example

If dimension members of different dimensions do not interact, a blank cell is left behind.

[Diagram: two sparse examples, an Employee cube (LAST NAME x EMP# x AGE, with members such as Smith, Regan, Fox, Weld, Kelly, Link, Kranz, Lucas, Weiss) and the Sales Volumes cube (MODEL x COLOR), in which most cells are empty.]
OLAP Features
- Calculations applied across dimensions, through hierarchies, and/or across members
- Trend analysis over sequential time periods
- What-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data
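A sketch of three of these operations (slicing, roll-up and rotation) on a tiny cube held as a Python dict. Dimension and member names follow the slides; the cell values are made up.

```python
cube = {  # (model, color, dealership) -> sales volume (illustrative values)
    ("Mini Van", "Blue", "Clyde"): 6, ("Mini Van", "Red", "Clyde"): 5,
    ("Sedan",    "Blue", "Carr"):  4, ("Sedan",    "Red", "Carr"):  3,
}

# Slice: fix one dimension member and keep the rest.
blue = {k: v for k, v in cube.items() if k[1] == "Blue"}
print(sum(blue.values()))  # 10

# Drill-up (roll-up): aggregate away the DEALERSHIP dimension.
by_model_color = {}
for (model, color, dealer), qty in cube.items():
    by_model_color[(model, color)] = by_model_color.get((model, color), 0) + qty
print(by_model_color[("Mini Van", "Blue")])  # 6

# Rotation is a change of viewing axes; with tuple keys it is a re-index.
rotated = {(color, model, dealer): v for (model, color, dealer), v in cube.items()}
print(rotated[("Blue", "Sedan", "Carr")])  # 4
```

OLAP engines precompute and index these operations; the point here is only that each "feature" is a simple reshaping or aggregation of the same cells.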
[Diagram: rotation. A 90-degree rotation swaps the on-screen axes, so View #1 (MODEL x COLOR) becomes View #2 (COLOR x MODEL) with the same cell values.]
[Diagram: successive 90-degree rotations of the MODEL x COLOR x DEALERSHIP cube produce six views (View #1 through View #6), each pairing a different two dimensions in the viewing area.]
[Diagram: drill-down on the Sales Volumes cube. The COLOR member Blue is split into Normal Blue and Metal Blue for Coupe sales at the Carr and Clyde dealerships.]
ORGANIZATION DIMENSION

REGION: Midwest
  DISTRICT: Chicago, St. Louis, Gary
    DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Moving up and moving down a hierarchy are referred to as drill-up (roll-up) and drill-down.
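Roll-up along this hierarchy is just repeated aggregation through the level mappings. The dealership-to-district-to-region mapping follows the slide; the sales figures are made up.

```python
# Level mappings taken from the slide's ORGANIZATION dimension.
district_of = {"Clyde": "Chicago", "Gleason": "Chicago", "Carr": "St. Louis",
               "Levi": "St. Louis", "Lucas": "Gary", "Bolton": "Gary"}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

sales = {"Clyde": 10, "Gleason": 7, "Carr": 5, "Levi": 3, "Lucas": 4, "Bolton": 1}

# Roll up dealership-level sales to district level...
by_district = {}
for dealer, amt in sales.items():
    d = district_of[dealer]
    by_district[d] = by_district.get(d, 0) + amt
print(by_district)  # {'Chicago': 17, 'St. Louis': 8, 'Gary': 5}

# ...and again to region level; drill-down is the inverse navigation.
by_region = {}
for district, amt in by_district.items():
    r = region_of[district]
    by_region[r] = by_region.get(r, 0) + amt
print(by_region)  # {'Midwest': 30}
```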
[Diagram: slicing the cube by time periods (1st Qtr through 4th Qtr).]
[Diagram: MOLAP architecture. OLAP tools and web browsers access an OLAP cube through an OLAP calculation engine.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
[Diagram: ROLAP architecture. OLAP tools, OLAP applications and web browsers go through an OLAP calculation engine that issues SQL against a relational data warehouse.]
[Diagram: HOLAP architecture. Any client (OLAP tools, OLAP applications, web browsers) goes through an OLAP calculation engine that issues SQL against a relational data warehouse.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS plus MDDB performance
- Calculation engine provides full analysis features
- Source of the data is transparent to the end user
Architecture Comparison

Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity
- MOLAP: high, 3 to 10 times even with good design (may go beyond control; estimation is very important)
- ROLAP: no sparsity
- HOLAP: sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP, ROLAP, HOLAP: to the necessary extent

Query execution speed
- MOLAP: fast (depends upon the size of the MDDB)
- ROLAP: slow
- HOLAP: optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
- HOLAP: high (RDBMS + disk space + MDDB server cost)

Where to apply
- MOLAP: small transactional data + complex model + frequent summary analysis
- ROLAP: very large transactional data that needs to be viewed/sorted
- HOLAP: large transactional data + frequent summary analysis
Representative tools:
- Oracle Express products
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS
- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / WebIntelligence
OLAP application areas:
- Sales analysis
- Financial analysis
- Profitability analysis
- Performance analysis
- Risk management
- Profiling & segmentation
- Scorecard applications
- NPA management
- Strategic planning
- Customer relationship management (CRM)
Data Warehouse Testing

The methodology required for testing a data warehouse is different from that for testing a typical transaction system.
In a data warehouse, most of the testing is system-triggered. Most production/source-system testing covers the processing of individual transactions, which are driven by some input from the users (e.g., an application form or a servicing request). Very few test cycles cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box: it should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that don't fulfil the transformation rules
Unit Testing
Unit Testing the Report data:
- Verify report data with the source: data in a data warehouse is stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae/calculation rules should be verified.
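The first check above, reconciling warehouse aggregates with the granular source data, can be sketched as a simple QA assertion. The sample rows and the city-level grain are made up for illustration.

```python
# Transaction-grain rows as they appear in the source system (illustrative).
source_rows = [
    ("nyc", 12), ("nyc", 11), ("la", 50),
]
# City-level aggregate as loaded into the data warehouse (illustrative).
warehouse_agg = {"nyc": 23, "la": 50}

# Re-derive the expected aggregate from the source grain.
expected = {}
for city, amt in source_rows:
    expected[city] = expected.get(city, 0) + amt

# The QA comparison: every aggregate must trace back to the source total.
assert warehouse_agg == expected, f"mismatch: {warehouse_agg} != {expected}"
print("report data reconciles with source")
```

In practice the same pattern is run per field and per derivation rule, with mismatches logged rather than asserted.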
Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify the newly inserted or updated data
- Testing the rejected records that don't fulfil the transformation rules
- Error log generation
Performance Testing
Performance testing should check that ETL processes complete within the time window.
Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality and reporting.
Questions
Thank You
OLAP
352
Agenda
OLAP Definition Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques Architectures
Features
Representative Tools
12/20/2012
353
353
354
OLAP System Consolidation data; OLAP data comes from the various OLTP databases Decision support
Purpose of data
Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the 355 data
355
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
356
MDDB
Sales Volumes
M O D E L
Mini Van
Sedan
DEALERSHIP
COLOR
27 x 4 = 108 cells
357
3 x 3 x 3 = 27 cells
358
Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions
Data Explosion
-Due to Sparsity -Due to Summarization
Performance
-Doesnt perform better than RDBMS at high data volumes (>20-30 GB)
12/20/2012
359
2009 Wipro Ltd - Confidential
359
Issues with MDDB - Sparsity Example If dimension members of different dimensions Employee Age do not interact , then blank cell is left behind. LAST NAME EMP# AGE
Smith
M O D E L
01 21 12 Sales Volumes 19 31 63 Miini Van 14 6 5 31 4 54 3 5 27 Coupe 5 03 56 4 3 2 Sedan 41 45 Blue Red White 33 COLOR 41 23 19
21
Regan
19 63 31 27 56 45 41 19
31 41 23 01 14 54 03 12 33
Fox
L A S T N A M E
Weld
Kelly
Link
Kranz
Lucas
Weiss
EMPLOYEE #
12/20/2012
360
2009 Wipro Ltd - Confidential
360
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
12/20/2012
361
2009 Wipro Ltd - Confidential
361
M O D E L
Mini Van
6 3 4
Blue
5 5 3
Red
4 5 2
White
Coupe
C O L O R ( ROTATE 90 )
o
Blue
6 5 4
3 5 5
MODEL
4 3 2
Sedan
Red
Sedan
White
COLOR
View #1
View #2
362
[Figure] With three dimensions (MODEL, COLOR, DEALERSHIP), successive 90-degree rotations give six possible two-dimensional views of the cube (View #1 through View #6).
363
[Figure] Slicing / dicing: a Sales Volumes sub-cube restricted to MODEL = Coupe, DEALERSHIP in (Carr, Clyde) and COLOR in (Normal Blue, Metal Blue).
364
ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up and moving down a hierarchy is referred to as drill-up / roll-up and drill-down.
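Roll-up along such a hierarchy is just aggregation one level at a time. A sketch using the ORGANIZATION hierarchy from the slide (the sales figures are made up):

```python
# Drill-up / roll-up along DEALERSHIP -> DISTRICT -> REGION.
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

sales_by_dealership = {"Clyde": 10, "Gleason": 7, "Carr": 5,
                       "Levi": 8, "Lucas": 4, "Bolton": 6}

# Roll-up: sum the measure one level up the hierarchy.
sales_by_district = {}
for dealership, amount in sales_by_dealership.items():
    district = district_of[dealership]
    sales_by_district[district] = sales_by_district.get(district, 0) + amount

sales_by_region = {}
for district, amount in sales_by_district.items():
    region = region_of[district]
    sales_by_region[region] = sales_by_region.get(region, 0) + amount

print(sales_by_district)  # {'Chicago': 17, 'St. Louis': 13, 'Gary': 10}
print(sales_by_region)    # {'Midwest': 40}
```

Drill-down is the inverse: starting from the regional total and expanding back to district- or dealership-level detail.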
365
366
[Figure] Time dimension: 1st Qtr through 4th Qtr.
367
368
369
[MOLAP architecture diagram] OLAP cube and OLAP calculation engine, accessed through OLAP tools and a web browser.
370
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
371
[ROLAP architecture diagram] Relational DW queried via SQL by the OLAP calculation engine, which serves OLAP tools, OLAP applications and a web browser.
372
373
[HOLAP architecture diagram] Any client (web browser, OLAP tools, OLAP applications) connects to the OLAP calculation engine, which queries the relational DW via SQL.
374
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS plus MDDB performance
- Calculation engine provides full analysis features
- Source of data is transparent to the end user
375
Architecture Comparison

Definition:
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity:
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization:
- MOLAP: With good design, 3-10 times
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed:
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost:
- HOLAP: High (RDBMS + disk space + MDDB server cost)

Where to apply?
- MOLAP: Small transactional data that needs to be viewed / sorted
- ROLAP: Very large transactional data + complex model + frequent summary analysis
- HOLAP: Large transactional data + frequent summary analysis
376
Representative Tools
- Oracle Express
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS
- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / WebIntelligence
377
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Applications
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
378
379
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
380
381
In a data warehouse, most of the testing is system-triggered. Most production/source-system testing covers the processing of individual transactions, which are driven by some input from the users (application form, servicing request, etc.). Very few test cycles cover the system-triggered scenarios (like billing or valuation).
382
383
384
385
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?
386
Unit Testing
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source.
- Whether all the data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data.
- Testing the rejected records that don't fulfil the transformation rules.
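These checks can be automated. A minimal sketch in Python with SQLite: the table names, the business rule (reject non-positive quantities) and the data are all invented for the example.

```python
import sqlite3

# Toy ETL unit check: load source rows, apply one business rule, and
# verify that target rows plus rejected rows reconcile with the source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_sale (order_id TEXT, qty INTEGER, amt REAL)")
conn.executemany("INSERT INTO src_sale VALUES (?, ?, ?)",
                 [("o100", 1, 12.0), ("o102", 2, 11.0), ("o105", -5, 50.0)])
conn.execute("CREATE TABLE tgt_sale (order_id TEXT, qty INTEGER, amt REAL)")
conn.execute("CREATE TABLE rejects  (order_id TEXT, reason TEXT)")

# "ETL": copy valid rows to the target, route the rest to a reject table.
conn.execute("INSERT INTO tgt_sale SELECT * FROM src_sale WHERE qty > 0")
conn.execute("INSERT INTO rejects SELECT order_id, 'qty <= 0' "
             "FROM src_sale WHERE qty <= 0")

# Unit-test style checks: counts must reconcile, rejects must match rule.
n_src = conn.execute("SELECT COUNT(*) FROM src_sale").fetchone()[0]
n_tgt = conn.execute("SELECT COUNT(*) FROM tgt_sale").fetchone()[0]
n_rej = conn.execute("SELECT COUNT(*) FROM rejects").fetchone()[0]
assert n_src == n_tgt + n_rej          # no rows silently dropped
rejected = [r[0] for r in conn.execute("SELECT order_id FROM rejects")]
assert rejected == ["o105"]
print(n_tgt, rejected)  # 2 ['o105']
```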
387
Unit Testing
Unit testing the report data:
- Verify report data with source: data in a data warehouse is stored at an aggregate level compared to the source systems. The QA team should verify the granular data stored in the data warehouse against the available source data.
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, and should trace them back and compare them with the source systems.
- Derivation formulae / calculation rules should be verified.
388
Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch.
- Initial loading of records into the data warehouse.
- Incremental loading of records at a later date, to verify the newly inserted or updated data.
- Testing the rejected records that don't fulfil the transformation rules.
- Error log generation.
389
Performance Testing
Performance testing should check for: ETL processes completing within the agreed time window.
390
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use, in terms of ETL process integrity, business functionality and reporting.
391
Questions
392
Thank You
393
Star Schema

Dimension table: product
prodId | name | price
p1 | bolt | 10
p2 | nut | 5

Dimension table: store
storeId | city
c1 | nyc
c2 | sfo
c3 | la

Fact table: sale
orderId | date | custId | prodId | storeId | qty | amt
o100 | 1/7/97 | 53 | p1 | c1 | 1 | 12
o102 | 2/7/97 | 53 | p2 | c1 | 2 | 11
105 | 3/8/97 | 111 | p1 | c3 | 5 | 50

Dimension table: customer
custId | name | address | city
53 | joe | 10 main | sfo
81 | fred | 12 main | sfo
111 | sally | 80 willow | la
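A typical star query aggregates the fact table, grouped by dimension attributes. A sketch in Python with SQLite (the slides don't prescribe any particular engine):

```python
import sqlite3

# The star schema tables, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT, name TEXT, price INTEGER);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
CREATE TABLE store (storeId TEXT, city TEXT);
INSERT INTO store VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER,
                   prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                        ('o102','2/7/97',53,'p2','c1',2,11),
                        ('105','3/8/97',111,'p1','c3',5,50);
""")

# Star query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT s.city, p.name, SUM(f.amt)
    FROM sale f
    JOIN store s   ON f.storeId = s.storeId
    JOIN product p ON f.prodId  = p.prodId
    GROUP BY s.city, p.name
    ORDER BY s.city, p.name
""").fetchall()
print(rows)  # [('la', 'bolt', 50), ('nyc', 'bolt', 12), ('nyc', 'nut', 11)]
```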
399
Snowflake Schema

Dimension table: store
storeId | cityId | tId | mgr
s5 | sfo | t1 | joe
s7 | sfo | t2 | fred
s9 | la | t1 | nancy

Dimension table: sType
tId | size | location
t1 | small | downtown
t2 | large | suburbs

Dimension table: city
cityId | pop | regId
sfo | 1M | north
la | 5M | south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not highly normalized, and are frequently designed at a level of normalization short of third normal form.
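The cost of the extra normalization in a snowflake is visible in the joins: reaching a store's size or region attributes requires joining through each outrigger table. A sketch with the snowflake tables above, in SQLite:

```python
import sqlite3

# The snowflake store dimension, normalized into sType and city tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store (storeId TEXT, cityId TEXT, tId TEXT, mgr TEXT);
INSERT INTO store VALUES ('s5','sfo','t1','joe'),
                         ('s7','sfo','t2','fred'),
                         ('s9','la','t1','nancy');
CREATE TABLE sType (tId TEXT, size TEXT, location TEXT);
INSERT INTO sType VALUES ('t1','small','downtown'), ('t2','large','suburbs');
CREATE TABLE city (cityId TEXT, pop TEXT, regId TEXT);
INSERT INTO city VALUES ('sfo','1M','north'), ('la','5M','south');
""")

# Flattening the snowflake back into a single store dimension takes one
# join per outrigger table.
rows = conn.execute("""
    SELECT st.storeId, st.mgr, t.size, c.pop, c.regId
    FROM store st
    JOIN sType t ON st.tId = t.tId
    JOIN city  c ON st.cityId = c.cityId
    ORDER BY st.storeId
""").fetchall()
print(rows[0])  # ('s5', 'joe', 'small', '1M', 'north')
```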
400
401
The Need For Data Quality
- Difficulty in decision making
- Time delays in operation
- Organizational mistrust
- Data ownership conflicts
- Customer attrition
- Costs associated with
402
Continuous Monitoring
- Identify & correct causes of defects
- Refine data capture mechanisms at source
- Educate users on the importance of DQ
403
Limitations
- Large numbers of custom programs in different environments are difficult to manage
- Minor alterations demand coding effort
404
Limitations
- Not all variables can be discovered
- Some discovered rules might not be pertinent
- There may be performance problems with large files or with many fields
407
ETL Architecture
[Diagram] Visitors' web browsers → the Internet → web server logs & e-commerce transaction data (flat files) → scheduled extraction into a staging area → scheduled loading into the RDBMS. Stages: data collection, data extraction, data transformation, data loading.
408
Data transformation
- Restructuring of records or fields
- Removal of operational-only data
- Supply of missing field values
- Data integrity checks
- Data consistency and range checks
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data loading
- Initial and incremental loading
- Updating of metadata

409
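A few of these transformation types applied to a single record can be sketched as follows. The field names, code table and default value are invented for the example:

```python
from datetime import date

# Toy transform: changing codes, supplying a missing field value,
# adding a time attribute and calculating a derived value.
GENDER_CODES = {"M": "Male", "F": "Female"}   # changing codes
DEFAULT_REGION = "UNKNOWN"                    # supply of missing values

def transform(rec: dict, load_date: date) -> dict:
    out = dict(rec)
    out["gender"] = GENDER_CODES.get(rec.get("gender", ""), "Unknown")
    out["region"] = rec.get("region") or DEFAULT_REGION
    out["load_date"] = load_date.isoformat()          # time attribute
    out["amount"] = rec["qty"] * rec["unit_price"]    # derived value
    return out

rec = {"custId": 53, "gender": "M", "region": None, "qty": 2, "unit_price": 5}
out = transform(rec, date(1997, 7, 1))
print(out["gender"], out["region"], out["amount"], out["load_date"])
# Male UNKNOWN 10 1997-07-01
```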
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To solve the problem, companies use extract, transform and load (ETL) software.
410
411
412
ETL Tools
- Provide a facility to specify a large number of transformation rules with a GUI
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
413
Metadata Management
414
What Is Metadata?
Metadata is information...
- that describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
- about the data being captured and loaded into the warehouse
- documented in IT tools that improve both business and technical understanding of data and data-related processes
415
Importance Of Metadata
Locating information:
- How much time is spent looking for information?
- How often is information found?
- What poor decisions were made based on incomplete information?
416
Consumers of Metadata
Technical users: warehouse administrator, application developer
Business users: business metadata (meanings, definitions, business rules)
Software tools: used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
418
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
420
OLAP
422
Agenda
- OLAP definition
- Distinction between OLTP and OLAP
- MDDB concepts
- Implementation techniques / architectures
- Features
- Representative tools
423
424
OLAP System
- Source of data: consolidated data; OLAP data comes from the various OLTP databases
- Purpose of data: decision support
- What the data reveals: multi-dimensional views of various kinds of business activities
- Data refresh: periodic, long-running batch jobs refresh the data
425