
What is a Data Warehouse and How Do I Test It?


A primer for Testers on Data Warehouses, the ETL process
and Business Intelligence and how to test them

2011 Real-Time Technology Solutions, Inc.
New York Philadelphia Atlanta
www.rtts.com

RTTS is the leading provider of software quality for critical business applications

Fast Facts
Founded: 1996 - consulting firm
Locations: New York (HQ), Atlanta, Philly, Phoenix
Geographic region: Americas, EMEA, APAC
Customer profile: Fortune 1000
o 350+ customers
o 500+ projects
Strategic Partners: HP, IBM, MSFT, Oracle
RTTS Software: QuerySurge, TOMOS

ALM

The Software Quality

Overview

What is Big Data?
What is a Data Warehouse?
o About the ETL Process
o The Data Warehouse marketplace
What is Business Intelligence?
o The architecture
o The BI marketplace
Testing the DW Architecture
o Entry points
o The Mapping document
o Functional test implementation
o Test Tools
Testing BI
o Functional test implementation
o Performance Testing
Data Warehouse Test Tool demo
Q&A

What is Big Data?


Big data is defined as data with too much volume, velocity and variability to work on normal database architectures.
Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
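The conversions above can be checked with a few lines of Python. This is a minimal sketch that assumes decimal units (each step is a factor of 1,000, as on the slide); the names `UNITS`, `convert` and `BIG_DATA_THRESHOLD_PB` are illustrative only.

```python
# Decimal storage units, smallest to largest; each step is a factor of 1,000.
UNITS = ["megabytes", "gigabytes", "terabytes", "petabytes"]

def convert(value, from_unit, to_unit):
    """Convert between decimal storage units by powers of 1,000."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1000 ** steps

BIG_DATA_THRESHOLD_PB = 5  # "5 petabytes or more", per the slide

print(convert(1, "petabytes", "terabytes"))   # 1000
print(convert(1, "petabytes", "megabytes"))   # 1000000000
print(convert(BIG_DATA_THRESHOLD_PB, "petabytes", "gigabytes"))  # 5000000
```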

The market for big data is $70 billion and growing by 15% a year.
- EMC COO Pat Gelsinger

Big Data Impact


Handles more than 1 million customer transactions every hour.
The data is imported into databases that contain > 2.5 petabytes of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.

Others
Facebook handles 40 billion photos from its user base.
Twitter processes 85 million tweets per day
Google processes 1 Terabyte per hour
eBay processes 80 Terabytes per day

Big Data Solutions


Requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times.
Technologies include:
massively parallel processing (MPP) databases
data warehouses
data-mining grids
distributed file systems
distributed databases
cloud computing platforms
the Internet, and
scalable storage systems

What is a Data Warehouse?


Data Warehouse

Typically a relational database that is designed for query and analysis rather than for transaction processing
A place where historical data is stored for archival, analysis and security purposes.
Contains either raw data or formatted data
Combines data from multiple sources:
Sales
Salaries
Operational data
Human resource data
Inventory data
Web logs
Social networks
Internet text and docs
Other

Diagram: example source databases (Legacy DB, CRM/ERP DB, Finance DB)

Data Warehouse - Business Case
Why build a Data Warehouse?
Data stored in operational systems (OLTP) is not easily accessible
OLTP systems are not designed for end-user analysis
The data in OLTP is constantly changing
May lack historical data
Diverse forms of data are stored on different platforms

Data Warehouse - Business Case
The Data Warehouse Business Solution
Collects data from different sources (other databases, files, web services, etc.)
Integrates data into logical business areas
Provides direct access to data with powerful reporting tools (BI)

Data Warehouse - about the data
The Data Warehouse data
Subject-oriented
Integrated
Non-volatile
Time-variant

Data Warehouse - the ETL process
ETL = Extract, Transform, Load
Why ETL?
Need to load the data warehouse regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.
Extract - data is extracted from one or more OLTP systems and copied into the warehouse
Transform - removing inconsistencies, adding missing fields, summarizing detailed data and deriving new fields to store calculated data.
Load - map the data and load it into the DW
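The three ETL steps can be sketched in a few lines. This is an illustration only, assuming two in-memory SQLite databases stand in for the OLTP source and the DW target; the table names (`orders`, `fact_orders`), the columns and the cleanup rules are hypothetical.

```python
import sqlite3

# Hypothetical OLTP source and DW target, both in-memory for illustration.
oltp = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")

oltp.execute("CREATE TABLE orders (id INTEGER, customer TEXT, qty INTEGER, unit_price REAL)")
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, " alice ", 2, 10.0), (2, "BOB", 1, 5.5), (3, None, 4, 2.0)],
)
dw.execute("CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, qty INTEGER, total REAL)")

def extract(conn):
    """Extract: read the raw rows from the source system."""
    return conn.execute("SELECT id, customer, qty, unit_price FROM orders ORDER BY id").fetchall()

def transform(rows):
    """Transform: remove inconsistencies, fill a missing field, derive a calculated field."""
    out = []
    for order_id, customer, qty, unit_price in rows:
        name = (customer or "UNKNOWN").strip().upper()  # normalize or fill the name
        out.append((order_id, name, qty, qty * unit_price))  # derived total
    return out

def load(conn, rows):
    """Load: write the mapped rows into the warehouse table."""
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()

load(dw, transform(extract(oltp)))
loaded = dw.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall()
print(loaded)
```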

Data Warehouse - the ETL process
Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) -> ETL Process (Extract, Transform, Load) -> Target DW

Data Warehouse - the marketplace
The data warehousing market will see a compound annual growth rate of 11.5% through 2013 to reach a total of $13.2 billion in revenue.
- Consulting Specialist, The 451 Group

Data Warehouse size

Small data warehouses: < 5 TB
Midsize data warehouses: 5 TB - 20 TB
Large data warehouses: > 20 TB
- Analyst firm, Gartner

Leaders in Data Warehouse Data Management Systems
- Analyst firm Gartner's Magic Quadrant for Data Warehouse Database Management Systems

Data Warehouse - the marketplace
Delivery Models
Stand-alone DBMS software
Cloud offerings
Data warehouse appliances
Leading Appliance Makers

Business Intelligence (BI)


B.I. - What is it?
Software applications used in spotting, digging out, and analyzing business data
Provides easy access to data and uses it in day-to-day operations; integrates data into logical business areas
Provides historical, current and predictive views of business operations

made up of several related

Business Intelligence (BI) - Who uses it?
Wal-Mart uses vast amounts of data and category analysis to dominate the industry.
Amazon and Yahoo follow a "test and learn" approach to business changes.
Hardee's, Wendy's, and T.G.I. Friday's use BI to make strategic decisions.

Business Intelligence (BI) & Data Marts
Data Mart
A database that has the same characteristics as a data warehouse, but is usually smaller and is focused on the data for one division or one workgroup within an enterprise.
Typically holds aggregated data and some granular data. It is a subset of the DW and makes Business Intelligence reporting more efficient.
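As a rough illustration of "a subset of the DW holding aggregated data", the sketch below builds a data mart for a single division from a warehouse table. The table and column names (`dw_sales`, `mart_retail_sales`, `division`) are hypothetical, and an in-memory SQLite database stands in for the real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table holding granular sales for every division.
conn.execute("CREATE TABLE dw_sales (division TEXT, product TEXT, sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO dw_sales VALUES (?, ?, ?, ?)", [
    ("Retail", "TV", "2010-09-01", 500.0),
    ("Retail", "TV", "2010-09-02", 450.0),
    ("Retail", "Radio", "2010-09-01", 40.0),
    ("Wholesale", "TV", "2010-09-01", 4000.0),
])

# The data mart: one division only, aggregated per product for reporting.
conn.execute("""
    CREATE TABLE mart_retail_sales AS
    SELECT product, COUNT(*) AS num_sales, SUM(amount) AS total_amount
    FROM dw_sales
    WHERE division = 'Retail'
    GROUP BY product
""")

mart_rows = conn.execute(
    "SELECT product, num_sales, total_amount FROM mart_retail_sales ORDER BY product"
).fetchall()
print(mart_rows)  # [('Radio', 1, 40.0), ('TV', 2, 950.0)]
```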
Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) -> ETL Process -> Target DW -> ETL Process -> Data Mart

Business Intelligence (BI)

Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) -> ETL Process -> Target DW -> ETL Process -> Data Mart

B.I. - the marketplace
Worldwide business intelligence (BI) platform, analytic applications and performance management (PM) software revenue reached $10.5 billion in 2010, a 13.4 percent increase from 2009 revenue of $9.3 billion.
The four large "stack" vendors (SAP, Oracle, IBM and Microsoft) continue to consolidate the market, owning 59 percent of the market share.

Leaders in BI

- Analyst firm Gartner

- Analyst firm Forrester Research's Forrester Wave

Testing a Data Warehouse Architecture

Testing a DW - Resources Involved
Resources involved
Business Analysts create requirements
QA Testers develop and execute test plans and test cases. ***Skill set required: very strong SQL!!!
Architects set up test environments
Developers perform unit tests
DBAs test for performance and stress
Business Users perform functional User Acceptance Tests
For the purposes of this presentation, we will focus on a strategy for Testers.

Testing the Data Warehouse


An effective data warehouse testing strategy focuses on the main structures within the data warehouse architecture:
1) The Sources
2) The ETL layer
3) The data warehouse itself
4) The front-end (BI) data warehouse applications

Testing the Data Warehouse - Entry Points
Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
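One simple way to localize an issue between entry points is to run the same check at each point and find the first leg where the results diverge. The sketch below uses row counts, which only makes sense for legs that are expected to carry rows one-to-one; the four table names are hypothetical stand-ins for entry points, all placed in one in-memory SQLite database for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables standing in for four entry points in the architecture.
stages = ["source_orders", "staging_orders", "dw_orders", "mart_orders"]
for name in stages:
    conn.execute(f"CREATE TABLE {name} (id INTEGER)")
conn.executemany("INSERT INTO source_orders VALUES (?)", [(i,) for i in range(5)])
conn.executemany("INSERT INTO staging_orders VALUES (?)", [(i,) for i in range(5)])
conn.executemany("INSERT INTO dw_orders VALUES (?)", [(i,) for i in range(4)])  # a row is lost on this leg
conn.executemany("INSERT INTO mart_orders VALUES (?)", [(i,) for i in range(4)])

def first_mismatch(conn, stages):
    """Return the first adjacent pair of entry points whose row counts differ, or None."""
    counts = [conn.execute(f"SELECT COUNT(*) FROM {s}").fetchone()[0] for s in stages]
    for i in range(len(stages) - 1):
        if counts[i] != counts[i + 1]:
            return (stages[i], stages[i + 1])
    return None

leg = first_mismatch(conn, stages)
print(leg)  # ('staging_orders', 'dw_orders')
```

Testing only the first and last points would show that a row is missing, but not which ETL leg dropped it.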
Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) -> ETL Process -> Target DW -> ETL Process -> Data Mart -> BI, with test entry points marked along the flow

Testing the Data Warehouse - Entry Points
Diagram (possible architecture): Source Data (Legacy DB, CRM/ERP DB, Finance DB, Files) -> ETL Process -> Staging DB -> ETL Process -> Target DW -> ETL Process -> Data Marts -> BI, with test entry points marked at each stage

Testing the DW - Mapping Document
a.k.a. Source to Target Map
It's the critical element required to efficiently plan the ETL process.
Intention:
capture business rules
data flow mapping and
data movement requirements.
Mapping Doc specifies:
Source input definition
Target/output details
Business & data transformation rules
Absolute data quality requirements
Optional data quality requirements.
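To make the idea concrete, one row of a mapping document can be modeled as a small record from which a source/target query pair is derived. This is a sketch under stated assumptions: the `Mapping` fields, the `query_pair` helper and the `UPPER(lastName)` rule are hypothetical; only the table names are borrowed from the example queries in this deck.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    """One row of a source-to-target map (field names are illustrative)."""
    source_table: str
    source_column: str
    target_table: str
    target_column: str
    rule: str        # transformation rule as a SQL expression over the source column
    required: bool   # True = absolute data quality requirement, False = optional

def query_pair(m):
    """Build the source and target queries that test one mapping row."""
    source_sql = f"SELECT {m.rule} FROM {m.source_table}"
    target_sql = f"SELECT {m.target_column} FROM {m.target_table}"
    return source_sql, target_sql

# Hypothetical mapping: customer last names are upper-cased on the way into the DW.
m = Mapping("Sales.Customer", "lastName", "dw.user_", "lastName", "UPPER(lastName)", True)
src_sql, tgt_sql = query_pair(m)
print(src_sql)  # SELECT UPPER(lastName) FROM Sales.Customer
print(tgt_sql)  # SELECT lastName FROM dw.user_
```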

Testing the DW - Mapping Document
Source
SELECT c.idCustomer "Customer ID",
       c.lastName "Customer Last Name",
       c.firstName "Customer First Name",
       o.idOrder "Order Number",
       p.name "Product Name",
       op.quantity "Quantity Ordered",
       CASE
         WHEN os.idOrderStatus = 5 AND o.refundDate IS NOT NULL THEN 'Returned'
         WHEN (os.idOrderStatus = 3 OR os.idOrderStatus = 4) AND o.shipDate IS NOT NULL THEN 'Delivered'
         ELSE 'Processing'
       END "Order Status"
FROM Sales.Orders o, Sales.OrderStatus os, Sales.OrderProduct op,
     Sales.Product p, Sales.Category cat, Sales.Customer c
WHERE o.order_idOrderStatus = os.idOrderStatus AND
      op.orderProduct_idOrder = o.idOrder AND
      op.orderProduct_idProduct = p.idProduct AND
      p.product_idCategory = cat.idCategory AND
      cat.name = 'Electronics' AND
      o.order_idCustomer = c.idCustomer AND
      o.orderDate BETWEEN '01-SEP-10' AND '07-SEP-10'
ORDER BY c.idCustomer, c.lastName, c.firstName, o.idOrder

Target
SELECT u.idUser "Customer ID",
       u.lastName "Customer Last Name",
       u.firstName "Customer First Name",
       p.idPurchase "Purchase Number",
       i.name "Item Name",
       oi.quantity "Quantity Ordered",
       ps.status "Purchase Status"
FROM dw.Purchase p, dw.PurchaseStatus ps, dw.OrderItem oi,
     dw.Item i, dw.user_ u, dw.category cat
WHERE p.purchase_idPurchaseStatus = ps.idPurchaseStatus AND
      oi.orderItem_idPurchase = p.idPurchase AND
      oi.orderItem_idItem = i.idItem AND
      p.purchase_idUser = u.idUser AND
      i.item_idCategory = cat.idCategory AND
      cat.name = 'Electronics' AND
      SUBSTR(p.purchaseDate, 1, 5) BETWEEN '09-01' AND '09-07' AND
      SUBSTR(p.purchaseDate, -2) = '10'
ORDER BY u.idUser, u.lastName, u.firstName, p.idPurchase
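A query pair like this is tested by running both sides and comparing the result sets row by row. The sketch below shows the idea on a much smaller, hypothetical schema (`src_orders`, `dw_purchase`) in one in-memory SQLite database; it reuses the order-status CASE rule from the source query, and the data is invented so that one row disagrees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (idOrder INTEGER, idOrderStatus INTEGER, shipDate TEXT, refundDate TEXT);
    CREATE TABLE dw_purchase (idPurchase INTEGER, status TEXT);
    INSERT INTO src_orders VALUES (1, 5, '2010-09-02', '2010-09-05'),
                                  (2, 3, '2010-09-03', NULL),
                                  (3, 1, NULL, NULL);
    INSERT INTO dw_purchase VALUES (1, 'Returned'), (2, 'Delivered'), (3, 'Delivered');
""")

# Source side: apply the transformation rule to the source data.
SOURCE_SQL = """
    SELECT idOrder,
           CASE
             WHEN idOrderStatus = 5 AND refundDate IS NOT NULL THEN 'Returned'
             WHEN idOrderStatus IN (3, 4) AND shipDate IS NOT NULL THEN 'Delivered'
             ELSE 'Processing'
           END
    FROM src_orders
    ORDER BY idOrder
"""
# Target side: read what the ETL actually loaded.
TARGET_SQL = "SELECT idPurchase, status FROM dw_purchase ORDER BY idPurchase"

expected = conn.execute(SOURCE_SQL).fetchall()
actual = conn.execute(TARGET_SQL).fetchall()

# zip() pairs rows positionally, so both queries need the same ORDER BY and row count.
mismatches = [(e, a) for e, a in zip(expected, actual) if e != a]
print(mismatches)  # [((3, 'Processing'), (3, 'Delivered'))]
```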

Testing the DW - Implementation
Implementation of Functional Test
What is going on in the marketplace?
1. Manual Execution
2. Automated execution with standard test tools
3. Bulk automation with DW Test Tool

Testing the DW - Manual Testing Flow
Tasks and tools, in timeline order:
1. Review Mapping Docs
2. Write SQL in favorite editor
3. Run tests
4. Dump results to a file
5. Compare results manually or w/compare tool
6. Report defects and issues

Testing the DW - Manual Testing Flow
Manual ETL Testing Flow Comments
Check points across each leg so that each transformation is checked.
If a file compare tool is used, care must be taken to ensure that the result rows for each query are in the same order (the database is under no obligation to return rows in a specified order unless the SQL specifies one).
This process can quickly result in hundreds or thousands of pairs of queries.
Only a very small sampling can be performed.
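When an ORDER BY cannot be added to both queries, one alternative is to compare the result sets as multisets so that row order no longer matters. This is a minimal sketch using Python's `collections.Counter`; the function name `diff_rows` and the sample rows are illustrative.

```python
from collections import Counter

def diff_rows(source_rows, target_rows):
    """Compare two result sets as multisets, so row order does not matter."""
    src, tgt = Counter(source_rows), Counter(target_rows)
    missing_in_target = list((src - tgt).elements())
    unexpected_in_target = list((tgt - src).elements())
    return missing_in_target, unexpected_in_target

source_rows = [(1, "Smith"), (2, "Jones"), (3, "Lee")]
target_rows = [(3, "Lee"), (1, "Smith"), (2, "Jonse")]  # different order, one bad value

missing, unexpected = diff_rows(source_rows, target_rows)
print(missing)     # [(2, 'Jones')]
print(unexpected)  # [(2, 'Jonse')]
```

Unlike a line-by-line file compare, this reports only the genuinely differing rows, and duplicate rows are still counted.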

Testing the DW - Automated Testing Flow
Functional Automation ETL Testing flow
Role: Functional Tester
1. Similar to previous - extract mappings from mapping document
2. Write pairs of queries that test between any two points in the architecture.
3. Issue the queries via a Functional Automation tool
4. Have the functional scripts dump the query result-sets to files
5. Compare the files, either by writing automation code or by using a file compare tool.
This process is dependent on the speed of the automation tool; only a fraction of the data can be covered per ETL per build.
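Steps 3 to 5 can be sketched with the Python standard library alone: run each query, dump the result set to a CSV file, then compare the two files. The tables (`src`, `tgt`), file names and the `dump_query` helper are hypothetical, and a single in-memory SQLite database stands in for the two points in the architecture.

```python
import csv
import filecmp
import os
import sqlite3
import tempfile

def dump_query(conn, sql, path):
    """Run a query and dump its result set to a CSV file."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(conn.execute(sql))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, name TEXT);
    CREATE TABLE tgt (id INTEGER, name TEXT);
    INSERT INTO src VALUES (1, 'a'), (2, 'b');
    INSERT INTO tgt VALUES (2, 'b'), (1, 'a');
""")

workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "source.csv")
tgt_file = os.path.join(workdir, "target.csv")

# ORDER BY on both sides so the files are comparable line by line.
dump_query(conn, "SELECT id, name FROM src ORDER BY id", src_file)
dump_query(conn, "SELECT id, name FROM tgt ORDER BY id", tgt_file)

# shallow=False forces a byte-by-byte comparison of the file contents.
files_match = filecmp.cmp(src_file, tgt_file, shallow=False)
print(files_match)  # True
```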

Testing the DW - DW Test Tool
Diagram: Legacy DB, CRM/ERP DB and Finance DB, tested with paired SQL (source) and SQL (target) queries

Testing the DW - DW Test Tool
Data Warehouse Test Automation tool
Performs bulk verification of up to 100% of all data
Provides a huge increase in coverage and verification of your data
Tremendously decreases your testing time and costs (i.e. huge ROI)
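To illustrate what verifying 100% of the data means, as opposed to sampling, the sketch below compares every row of a source and a target table with SQL `EXCEPT`. It is not a description of how any particular DW test tool works; it assumes both tables (`src`, `tgt`, hypothetical) are reachable from one SQLite connection, whereas real sources and targets usually live in different databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, amount REAL);
    CREATE TABLE tgt (id INTEGER, amount REAL);
""")
rows = [(i, i * 1.5) for i in range(1000)]
conn.executemany("INSERT INTO src VALUES (?, ?)", rows)
conn.executemany("INSERT INTO tgt VALUES (?, ?)", rows)
conn.execute("UPDATE tgt SET amount = -1 WHERE id = 500")  # simulate one bad row in the target

# Rows in the source with no identical row in the target, and vice versa.
# EXCEPT works on sets, so duplicate rows are collapsed before comparison.
missing_in_target = conn.execute("SELECT * FROM src EXCEPT SELECT * FROM tgt").fetchall()
unexpected_in_target = conn.execute("SELECT * FROM tgt EXCEPT SELECT * FROM src").fetchall()

print(missing_in_target)     # [(500, 750.0)]
print(unexpected_in_target)  # [(500, -1.0)]
```

A sample of a few rows would very likely miss the single bad record; the full comparison finds it.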

Testing the DW - Functional Test of BI
Functional Testing of BI
Role: Functional Tester
1. Extract mappings from mapping doc for the data mart
2. Execute reports
3. Verify that data is correct
Verify to the source
Verify field lengths and field level data
Verify logical dependencies of fields
Automation tools can and should be used for regression purposes.
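The field-length and logical-dependency checks can be expressed as small rules applied to each report row. This is a sketch with invented data: the 50-character limit and the report fields are hypothetical, and the dependency rule (a "Delivered" row must have a ship date) mirrors the order-status logic used in the mapping example.

```python
# Hypothetical rows as they might be read back from an executed BI report.
report_rows = [
    {"customer_id": 1, "last_name": "Smith", "status": "Delivered", "ship_date": "2010-09-03"},
    {"customer_id": 2, "last_name": "A" * 60, "status": "Processing", "ship_date": None},
    {"customer_id": 3, "last_name": "Lee", "status": "Delivered", "ship_date": None},
]

MAX_LENGTHS = {"last_name": 50}  # hypothetical limit taken from a mapping document

def check_row(row):
    """Return a list of problems found in one report row."""
    problems = []
    # Field length check.
    for field, limit in MAX_LENGTHS.items():
        if row[field] is not None and len(row[field]) > limit:
            problems.append(f"{field} exceeds {limit} characters")
    # Logical dependency check: a delivered order must have a ship date.
    if row["status"] == "Delivered" and row["ship_date"] is None:
        problems.append("Delivered without ship_date")
    return problems

failures = {r["customer_id"]: check_row(r) for r in report_rows if check_row(r)}
print(failures)
```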

Testing the DW - Performance Test of BI
Role: Performance Tester
For Business Intelligence (BI) applications, performance requirements must be met during batch report execution and normal user activity.
Since most BI applications are customized to meet the specific business requirements and data model of the organization, it is risky to rely on the initial performance testing done by the software vendor prior to their release.
It is therefore a common practice to test the performance of BI applications before their initial deployment and before any major system updates and upgrades.
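At its simplest, a performance requirement for a report is a time limit that its query must meet on repeated runs. The sketch below times a report-style aggregate query and compares the worst run against a threshold; the `sales` table, the 2-second limit and the `time_query` helper are hypothetical, and a real performance test would also simulate concurrent users and batch load.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(f"region{i % 4}", float(i)) for i in range(10000)],
)

def time_query(conn, sql, runs=5):
    """Run the query several times and return the slowest elapsed time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    return max(timings)

MAX_SECONDS = 2.0  # hypothetical performance requirement for this report

worst = time_query(conn, "SELECT region, SUM(amount) FROM sales GROUP BY region")
meets_requirement = worst <= MAX_SECONDS
print(f"worst run: {worst:.4f}s, requirement met: {meets_requirement}")
```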

Automating ETL Testing
QuerySurge
DEMONSTRATION

Please visit www.querysurge.com for more information.
Thank you!
