
Data Warehousing Basics

Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Integrated
The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization:
- Multiple sources
- Diverse sources
- Diverse formats

For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
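
The product-identifier example above can be sketched in a few lines of Python. This is only an illustration: the record layouts, the `PRODUCT_KEY_MAP` cross-reference table, and the `integrate` function are hypothetical names, not part of any particular warehouse product.

```python
# Hypothetical product records from two sources that identify products differently.
source_a = [{"sku": "A-100", "name": "Widget", "qty": 5}]
source_b = [{"item_code": "9931", "description": "Widget", "units": 7}]

# Assumed cross-reference table: (source, source identifier) -> single warehouse key.
PRODUCT_KEY_MAP = {
    ("A", "A-100"): 1,
    ("B", "9931"): 1,
}


def integrate(rows_a, rows_b):
    """Return rows that all identify the product by the one warehouse key."""
    rows = []
    for rec in rows_a:
        key = PRODUCT_KEY_MAP[("A", rec["sku"])]
        rows.append({"product_key": key, "quantity": rec["qty"], "source": "A"})
    for rec in rows_b:
        key = PRODUCT_KEY_MAP[("B", rec["item_code"])]
        rows.append({"product_key": key, "quantity": rec["units"], "source": "B"})
    return rows


integrated = integrate(source_a, source_b)
```

After integration, both rows carry the same `product_key`, so quantities from the two sources can be added together.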

Subject-Oriented
Data is arranged and optimized to provide answers to questions from diverse functional areas.
Data is organized and summarized by topic: Sales, Marketing, Finance, Distribution, etc. For example, "sales" can be a particular subject.

Time-Variant
The data warehouse represents the flow of data through time. Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, or 12 months ago, or even older data, from a data warehouse.

Nonvolatile
- Once data is entered, it is never removed
- Represents the company's entire history
- Near-term history is continually added to it
- Always growing
- Must support terabyte databases and multiprocessors
- Read-only database for data analysis and query processing

OLTP (On-line Transaction Processing)

An OLTP system is an application that modifies data (INSERT, UPDATE, DELETE) and has a large number of concurrent users. Its database is highly normalized, with many tables (3NF). These systems are typically used for order-entry purposes, such as retail sales, credit-card validation, ATM transactions, and so on.

OLAP (On-line Analytical Processing)

An OLAP database holds aggregated, historical data stored in multi-dimensional schemas. It is typically de-normalized, with fewer tables.

OLAP
Need for more intensive decision support
4 main characteristics:
- Multidimensional data analysis
- Advanced database support
- Easy-to-use end-user interfaces

OLTP vs. OLAP

Feature                      OLTP                                  OLAP
Characteristics              Operational processing                Informational processing
Orientation                  Transaction                           Analysis
User                         Clerk, DBA, database professional     Knowledge workers
Function                     Day-to-day operation                  Decision support
Data                         Current                               Historical
View                         Detailed, flat relational             Summarized, multidimensional
DB design                    Application oriented                  Subject oriented
Unit of work                 Short, simple transaction             Complex query
Access                       Read/write                            Mostly read
Focus                        Data in                               Information out
Number of records accessed   Tens                                  Millions
Number of users              Thousands                             Hundreds
DB size                      100 MB to GB                          100 GB to TB
Priority                     High performance, high availability   High flexibility, end-user autonomy
Metric                       Transaction throughput                Query throughput
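
To make the contrast concrete, here is a small sketch using Python's built-in `sqlite3` module. The `orders` table and its columns are made up for illustration, and a real OLTP system and a real warehouse would normally be separate databases; the sketch only shows the difference in the shape of the work (a short write transaction versus a read-only summarizing query).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style unit of work: a short, simple transaction that writes one row.
with conn:
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("East", 120.0))

# A few more rows so the analytical query has something to summarize.
with conn:
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [("East", 80.0), ("West", 50.0), ("West", 25.0)],
    )

# OLAP-style unit of work: a read-only query that scans many rows and summarizes them.
sales_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region").fetchall()
)
```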

Need for Data Warehousing

- Better business intelligence for end users
- Reduction in time to locate, access, and analyze information
- Consolidation of disparate information sources
- Strategic advantage over competitors
- Faster time-to-market for products and services
- Replacement of older, less-responsive decision support systems
- Reduction in demand on IS to generate reports

Why DSS (Data Warehouse)?

- In earlier days, tools and techniques were unavailable for acquiring data from various sources to answer business questions and make decisions
- More effort went into data formatting than into data analysis
- Static and inflexible report generation
- Time lag in accessing the information at a central place

How to answer these Business Queries?

- What is the sales distribution region-wise?
- How did my revenue improve in the past 5 years?
- What are the slow movers in my product line?
- Which channel costs me more and pays less?
- Which of my sales agents are doing better?
- Strategic planning / budgeting
- What is the defaulters' profile?
- Currency risk, interest rate risk, liquidity risk
- Who are my profitable customers?

Why DSS? Why not OLTP?

- DSS queries can adversely impact the On-Line Transaction Processing (OLTP) system
- The constantly changing state of OLTP systems makes it difficult to reproduce a result set
- Data in OLTP systems is rarely quality-assured for DSS analysis
- OLTP systems may not store data for more than 90 days, making temporal comparisons difficult

Benefits of a Data Warehouse
- Flexible information access
- High availability
- Ease of use
- Quality and completeness of data
- Focus on information processing
- Information base for knowledge discovery

How to Build a Data Warehouse?

- Identify key business drivers, sponsorship, risks, and ROI
- Survey information needs, identify desired functionality, and define functional requirements for the initial subject area
- Architect the long-term data warehousing architecture
- Evaluate and finalize DW tools and technology
- Conduct a proof of concept
- Design the target database schema
- Build data mapping, extract, transformation, cleansing, and aggregation/summarization rules
- Build the initial data mart using an exact subset of the enterprise data warehousing architecture, and expand to the enterprise architecture over subsequent phases
- Maintain and administer the data warehouse

Terms and Definitions

Data Dictionary - A collection of metadata. Many kinds of products in the data warehouse arena use a data dictionary, including database management systems, modeling tools, middleware, and query tools.

Data Mart - A subset of a data warehouse that focuses on one or more specific subject areas. The data is usually extracted from the data warehouse and further denormalized and indexed to support intense usage by targeted customers.

Data Mining - Techniques for finding patterns and trends in large data sets.

Data Model - The road map to the data in a database. This includes the source of tables and columns, the meanings of the keys, and the relationships between the tables.

Data Cleansing - The process of cleaning or removing errors, redundancies, and inconsistencies in the data that is being imported into a data mart or data warehouse. It is part of the quality assurance process.

Normalization - The process of eliminating duplicate information in a database by creating a separate table that stores the redundant information.

ODS - An operational data store is a database designed to integrate data from multiple sources for additional operations on the data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data.

Data Transformation - The modification of transaction data extracted from one or more data sources before it is loaded into the data mart or warehouse. The modifications may include data cleansing, translation of data into a common format so that it can be aggregated and compared, summarizing the data, etc.
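
As a rough illustration of data cleansing as defined above (removing errors, redundancies, and inconsistencies), the sketch below rejects rows with unparseable amounts, standardizes an inconsistent country code, and drops duplicates. The field names, the `COUNTRY_CODES` lookup, and the `cleanse` function are assumptions made for the example.

```python
# Hypothetical raw rows as they might arrive from a source system.
raw_rows = [
    {"customer": " Alice ", "country": "us", "amount": "100.50"},
    {"customer": "Alice", "country": "US", "amount": "100.50"},  # duplicate once cleaned
    {"customer": "Bob", "country": "U.S.", "amount": "n/a"},     # error: amount not numeric
]

# Assumed standardization table for inconsistent country spellings.
COUNTRY_CODES = {"us": "US", "u.s.": "US"}


def cleanse(rows):
    """Return (clean rows, rejected rows) after fixing formats and removing duplicates."""
    seen, clean, rejected = set(), [], []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append(row)  # errors are set aside rather than loaded
            continue
        country = COUNTRY_CODES.get(row["country"].lower(), row["country"])
        record = (row["customer"].strip(), country, amount)
        if record in seen:
            continue  # redundancy: an identical record was already kept
        seen.add(record)
        clean.append(record)
    return clean, rejected


clean_rows, rejected_rows = cleanse(raw_rows)
```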

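DW Components

[Figure: DW components. Feed systems (FS1, FS2, ..., FSn) and legacy systems supply data through extraction and transmission over the network to a staging area, where cleansing, transformation, aggregation, and summarization take place. Data then flows to the ODS, the DW, and, through data mart population, the data marts (DM1, DM2, ..., DMn), which serve OLAP analysis and knowledge discovery. A metadata layer spans the components.]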

Operational Process

- Data extraction
- Data cleansing and transformation
- Data load and refresh
- Build derived data and views
- Service queries
- Administer the warehouse

Extraction Process (Data Capturing)

[Figure: business transactions enter the feed system application; a data capturing process, driven by control metadata, produces the incremental data.]

- Extract the incremental data from the feed system
- Store the extracted data in a temporary area
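
One common way to extract only incremental data is to keep a "last extracted" timestamp in the control metadata and pull the rows modified after it. The slides do not say how the capture is implemented, so the sketch below is just one possible approach; the row layout, the `control_metadata` dictionary, and `extract_incremental` are hypothetical.

```python
from datetime import datetime

# Hypothetical feed-system rows, each carrying a last-modified timestamp.
feed_rows = [
    {"id": 1, "modified": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "modified": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "modified": datetime(2024, 1, 3, 9, 0)},
]

# Assumed control metadata: the point up to which data was already extracted.
control_metadata = {"last_extracted": datetime(2024, 1, 1, 23, 59)}


def extract_incremental(rows, control):
    """Return rows changed since the last run and advance the stored timestamp."""
    delta = [r for r in rows if r["modified"] > control["last_extracted"]]
    if delta:
        control["last_extracted"] = max(r["modified"] for r in delta)
    return delta


# The extracted rows are stored in a temporary area before transmission.
temporary_area = extract_incremental(feed_rows, control_metadata)
```

Running the extraction a second time without new changes in the feed system returns nothing, because the stored timestamp has moved forward.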

Extraction Process (Data Transmission)

[Figure: incremental data is sent by FTP from the feed system side, across the network, to the staging area.]

- Transmit the extracted data from the feed system to the staging area
- Periodicity of transmission (daily / weekly) depends upon the feed system

Transformation Process

[Figure: clean operational data passes through the transformation process, which is driven by process metadata (mapping details and transformation rules) and control metadata, into the operational data store.]

- Transform the cleaned operational data into DSS data
- Load the DSS data into the ODS
- The ODS contains the current DSS data at the lowest level of granularity
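
A metadata-driven transformation can be sketched as a mapping from each target column to a source column plus a rule. The `MAPPING` table, the column names, and the `transform` function below are invented for illustration; actual ETL tools keep this mapping detail in their own metadata formats.

```python
# Assumed process metadata: target column -> (source column, transformation rule).
MAPPING = {
    "customer_name": ("cust_nm", str.title),
    "sale_amount": ("amt", float),
    "sale_date": ("dt", lambda s: s.replace("/", "-")),
}


def transform(operational_row, mapping=MAPPING):
    """Apply each mapping rule to turn one cleaned operational row into a DSS row."""
    return {
        target: rule(operational_row[source])
        for target, (source, rule) in mapping.items()
    }


clean_operational_data = [{"cust_nm": "alice smith", "amt": "19.90", "dt": "2024/01/05"}]

# Load step: the transformed rows go into the ODS at the lowest level of granularity.
ods = [transform(row) for row in clean_operational_data]
```

Keeping the rules in a table rather than in code paths means a new column can be added by editing metadata only.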

Summarization Process

[Figure: ODS data passes through the summarization process, driven by control metadata, into weekly, monthly, and yearly summaries in the DW.]

- Summarize and aggregate ODS data and populate the warehouse
- Periodicity of the summarization process depends upon the level of summarization at the warehouse (daily, weekly, monthly)
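
The summarization step can be illustrated by rolling detail rows up to weekly, monthly, and yearly totals. The sample rows and the `summarize` helper are assumptions for the example; weeks are taken to be ISO weeks, which is a choice made here and not something the slides specify.

```python
from collections import defaultdict
from datetime import date

# Hypothetical ODS detail rows: (sale date, amount).
ods_rows = [
    (date(2024, 1, 1), 100.0),
    (date(2024, 1, 3), 50.0),
    (date(2024, 1, 10), 25.0),
    (date(2024, 2, 1), 10.0),
]


def summarize(rows, key_fn):
    """Total the amounts per period, where key_fn maps a date to its period."""
    totals = defaultdict(float)
    for day, amount in rows:
        totals[key_fn(day)] += amount
    return dict(totals)


# Period keys: (ISO year, ISO week), (year, month), and year.
weekly = summarize(ods_rows, lambda d: tuple(d.isocalendar())[:2])
monthly = summarize(ods_rows, lambda d: (d.year, d.month))
yearly = summarize(ods_rows, lambda d: d.year)
```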

DW Options and Architectures

- Virtual data warehouse
- Enterprise data warehouse
- Data marts
- Distributed data marts
- Multi-tiered warehouse

Virtual Data Warehouse

[Figure: data in the operational systems (legacy, client/server, OLTP application, external) is accessed by users through an API.]

Enterprise Data Warehouse

[Figure: enterprise-wide data from the operational systems (legacy, client/server, OLTP, external) goes through data preparation (select, extract, transform, integrate, maintain) into the data warehouse, supported by a metadata repository; users access the warehouse through an API.]

Data Marts

[Figure: data from the operational systems (legacy, client/server, OLTP, external) goes through data preparation (select, extract, transform, integrate, maintain) into a data mart, supported by a metadata repository; users access the data mart through an API.]

Distributed Data Marts

[Figure: data from the operational systems (legacy, client/server, OLTP, external) goes through data preparation (select, extract, transform, integrate, maintain) into several data marts; users access the data marts through an API.]

Multi-tiered Data Warehouse

[Figure: enterprise-wide data from the operational systems (legacy, client/server, OLTP, external) passes through select, extract, transform, integrate, and maintain steps; the architecture combines a data warehouse, several data marts, and a metadata repository; users access the data through an API.]

Multi-tiered Data Warehouse

[Figure: data from the operational systems (legacy, client/server, OLTP, external) goes through data preparation (select, extract, transform, integrate, maintain); the architecture combines a data warehouse, several data marts, and a metadata repository; users access the data through an API.]

Data in a Warehouse

- Highly summarized data (example: monthly sales by region for 1991-94, monthly sales by product for 1991-94)
- Lightly summarized data (example: weekly sales by region for 1991-94, weekly sales by product/sub-product for 1991-94)
- Current detail data (example: sales detail for 1991-94)
- Older detail data (example: sales detail for 1985-90)
- Metadata

Tools and Technology

Tool Category            Products
ETL Tools                ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
OLAP Server              Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
OLAP Tools               Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos Powerplay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
Data Warehouse           Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, Red Brick
Data Mining & Analysis   SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine

OLAP Flavours

[Figure: OLAP branches into ROLAP, MOLAP, DOLAP, and HOLAP.]

MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:
- Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
- Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:
- Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
- Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, adopting MOLAP technology is likely to require additional investment in human and capital resources.
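
The idea of pre-generating all calculations when the cube is built can be shown with a toy in-memory cube. Real MOLAP servers use proprietary storage formats; the dictionary layout, the `facts` rows, and `build_cube` below are simplifications assumed for the example.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("region", "product")

# Hypothetical fact rows with two dimensions and one measure.
facts = [
    {"region": "East", "product": "A", "sales": 10},
    {"region": "East", "product": "B", "sales": 20},
    {"region": "West", "product": "A", "sales": 5},
]


def build_cube(rows, dimensions, measure):
    """Pre-compute the measure total for every combination of dimensions."""
    cube = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                # A cell is addressed by the chosen dimensions and their values.
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)


cube = build_cube(facts, DIMENSIONS, "sales")

# At query time, an answer is a lookup of a pre-computed cell, not a scan of the facts.
east_total = cube[(("region",), ("East",))]
grand_total = cube[((), ())]
```

The number of cells grows quickly with the number of dimensions and distinct values, which is the data-volume limitation listed under the disadvantages.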

ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

Advantages:
- Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
- Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:
- Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
- Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability for users to define their own functions.
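
The statement that each slice or dice adds a "WHERE" clause can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns, and the `rolap_query` helper are hypothetical; a real ROLAP tool generates far more elaborate SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("East", "A", 2023, 10.0),
        ("East", "B", 2023, 20.0),
        ("West", "A", 2023, 5.0),
        ("East", "A", 2024, 7.0),
    ],
)

ALLOWED_COLUMNS = {"region", "product", "year"}


def rolap_query(connection, group_by, slicers):
    """Generate and run SQL in which every slicer becomes one WHERE condition."""
    if group_by not in ALLOWED_COLUMNS or not set(slicers) <= ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    sql = f"SELECT {group_by}, SUM(amount) FROM sales"
    if slicers:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in slicers)
    sql += f" GROUP BY {group_by} ORDER BY {group_by}"
    return connection.execute(sql, list(slicers.values())).fetchall()


all_years = rolap_query(conn, "region", {})
slice_2023 = rolap_query(conn, "region", {"year": 2023})
dice_2023_a = rolap_query(conn, "region", {"year": 2023, "product": "A"})
```

Every report runs against the detail table at query time, which is why response time depends on the size of the underlying data.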

MOLAP vs. ROLAP

[Figure: MOLAP and ROLAP positioned by query performance and complexity of analysis; MOLAP is the choice for faster response and more complex queries.]

HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
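
A minimal sketch of the hybrid idea, assuming a pre-aggregated summary held in memory and detail rows kept in a relational table: summary questions are answered from the cube-like structure, while a request for detail "drills through" to SQL. The table, the `summary_cube` dictionary, and the `query` function are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 10.0), ("East", 20.0), ("West", 5.0)],
)

# Summary level: totals computed once and kept in a cube-like structure.
summary_cube = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)


def query(region, detail=False):
    """Answer from the summary when possible; drill through to SQL for detail."""
    if not detail:
        return summary_cube[region]
    rows = conn.execute(
        "SELECT amount FROM sales WHERE region = ? ORDER BY amount", (region,)
    )
    return [amount for (amount,) in rows]


east_summary = query("East")
east_detail = query("East", detail=True)
```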

DOLAP
Designed for low-end, single, departmental user. Data is stored in cubes on the desktop. It's like having your own spreadsheet. Since the data is local, end users do not have to worry about performance hits against the server.
