
Data warehouse

Dr. Retno Kusumaningrum, S.Si., M.Kom.

DATA WAREHOUSE
DEFINITION

A Data Warehouse is an enterprise structured repository of subject-oriented, time-variant, historical data used for information retrieval and decision support.
The data warehouse stores atomic and summary data.

DATA WAREHOUSE
CHARACTERISTICS

Data Warehouse
Characteristics

Subject Oriented
Data is categorized and stored by business
subject rather than by application

OLTP applications: Insurance, Loans, Savings
Data warehouse: Customer Financial Information

(Contd)
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Integrated
Data on a given subject is defined and
stored once
OLTP applications: Savings, Current accounts, Loans
Data warehouse: Customer

(Contd)
Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
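The hotel price example can be sketched in code. This is only an illustration: the exchange rates, the tax rate, the breakfast surcharge and the names `HotelPrice` and `to_warehouse_price` are all assumptions made for the sketch, not part of the slides.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to one common currency (USD); illustrative values only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}


@dataclass
class HotelPrice:
    amount: float
    currency: str
    tax_included: bool
    breakfast_included: bool


def to_warehouse_price(p, tax_rate=0.10, breakfast_usd=10.0):
    """Convert a source price to one convention: USD, tax and breakfast included."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1 + tax_rate          # assumed flat tax rate
    if not p.breakfast_included:
        usd += breakfast_usd         # assumed flat breakfast surcharge
    return round(usd, 2)


# Two sources quoting the same room under different conventions.
source_a = HotelPrice(100.0, "USD", tax_included=False, breakfast_included=False)
source_b = HotelPrice(96.0, "EUR", tax_included=True, breakfast_included=True)

price_a = to_warehouse_price(source_a)
price_b = to_warehouse_price(source_b)
```

Once both prices follow the same convention they can be compared directly, which is the point of converting data as it is moved into the warehouse.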

Time Variant
Data is stored as a series of
snapshots, each representing a
period of time

Time      Data
Jan-97    January
Feb-97    February
Mar-97    March

(Contd)
The time horizon for the data warehouse is significantly longer than that of operational systems
Operational database: current value data
Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain a time element
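A minimal sketch of the idea that warehouse keys carry a time element, assuming a hypothetical monthly balance snapshot (the customer ids, months and amounts are made up for illustration):

```python
# Warehouse rows are keyed by (customer_id, snapshot_month): time is part of the key.
warehouse = {}


def load_snapshot(month, balances):
    """Add one monthly snapshot; earlier snapshots are left untouched."""
    for customer_id, balance in balances.items():
        warehouse[(customer_id, month)] = balance


load_snapshot("1997-01", {"C1": 500, "C2": 900})
load_snapshot("1997-02", {"C1": 650, "C2": 900})
load_snapshot("1997-03", {"C1": 400, "C2": 950})


def history(customer_id):
    """Return the (month, balance) series for one customer, oldest first."""
    return sorted((m, b) for (c, m), b in warehouse.items() if c == customer_id)


# An operational system, by contrast, would typically hold only the current value.
operational = {"C1": 400, "C2": 950}
```

Here `history("C1")` returns all three monthly values, while the operational dictionary can only answer what the balance is now.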

Non Volatile
Typically data in the data warehouse is
not updated or deleted
Operational: Insert, Update, Delete, Read
Warehouse: Load, Read
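The contrast between the two sets of operations can be sketched as two small classes. The class names and sample rows are assumptions for illustration; a real warehouse enforces this through its load process rather than through a Python interface.

```python
class OperationalTable:
    """OLTP-style table: rows can be inserted, updated, deleted and read."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, row):
        self._rows[key] = dict(row)

    def update(self, key, **changes):
        self._rows[key].update(changes)

    def delete(self, key):
        del self._rows[key]

    def read(self, key):
        return dict(self._rows[key])


class WarehouseTable:
    """Warehouse-style table: only bulk load and read are exposed."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        self._rows.extend(dict(r) for r in rows)

    def read(self):
        return [dict(r) for r in self._rows]


oltp = OperationalTable()
oltp.insert("C1", {"balance": 500})
oltp.update("C1", balance=650)   # the old value 500 is overwritten

dw = WarehouseTable()
dw.load([{"customer": "C1", "month": "1997-01", "balance": 500}])
dw.load([{"customer": "C1", "month": "1997-02", "balance": 650}])
```

After these calls the operational table holds only the latest balance, while the warehouse table still holds both loaded rows.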

THE DIFFERENCE BETWEEN DATA WAREHOUSE AND OPERATIONAL DBMS

Operational DBMS vs. Data Warehouse
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making

In Detail
DISTINCT FEATURES            | OLTP              | OLAP
User and system orientation  | Customer          | Market
Usage                        | Repetitive        | Ad hoc
Data contents                | Current, detailed | Historical, consolidated
Data design                  | ER + application  | Schema (e.g., star) + subject
View                         | Current, local    | Evolutionary

(Contd)
DISTINCT FEATURES   | OLTP                   | OLAP
Access pattern      | Update                 | Read-only but complex queries
# Records accessed  | Tens, twenty, etc.     | Hundreds
DB size             | 100 MB to GB           | 100 GB to TB
Metric              | Transaction throughput | Query throughput, response

Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, read-only access to data records for summarization and aggregation (consolidation)

Different data in the data warehouse and its sources:
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Note: There are more and more systems which perform OLAP analysis directly on relational databases

PROBLEMS OF DATA
WAREHOUSING

Problems of Data Warehousing
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
High demand for resources
Data ownership
Long duration projects
Complexity of integration
9/29/16 rev.


What data are stored in the Warehouse?
In simple words: subject(s) per dimension
Example:
If our subject/measure is quantity sold
and the dimensions are Item Type, Location and Period,
the data warehouse stores the items sold per type, per geographical location during the particular period.
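As a rough sketch of "subject per dimension", the measure (quantity sold) can be totalled for each combination of item type, location and period. The sample rows below are invented for illustration.

```python
from collections import defaultdict

# Raw sales events: (item_type, location, period, quantity). Made-up sample data.
sales = [
    ("TV", "Chicago", "Q1", 3),
    ("TV", "Chicago", "Q1", 2),
    ("TV", "Toronto", "Q1", 4),
    ("Phone", "Chicago", "Q2", 6),
]

# One stored value of the measure per (item type, location, period) combination.
quantity_sold = defaultdict(int)
for item_type, location, period, qty in sales:
    quantity_sold[(item_type, location, period)] += qty
```

Each key of `quantity_sold` is one point in the space spanned by the three dimensions, and the value is the measure at that point.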

How do we represent this data?

DATA MODEL

Multidimensional Data Model
A data cube:
From tables and spreadsheets to data cube
Allows data to be modeled and viewed in multiple dimensions.

Dimensions:
Perspectives or entities with respect to which an organization wants to keep records.

Example:
AllElectronics may create a sales data warehouse in order to keep records of the store's sales with respect to the dimensions time, item, branch, and location.
Thus, we can keep track of things like monthly sales of items and the branches and locations at which the items were sold.

A data cube
We usually think of cubes as 3-D geometric structures
In a data warehouse the data cube is n-dimensional
How to view sales data with a third dimension
Example:
We would like to view the data according to time and item, as well as location (e.g. Chicago, New York, Toronto, Vancouver)


How to view sales data with an additional fourth dimension?
E.g. Supplier


Schemas for Multidimensional Databases
Entity-relationship data model
Commonly used in the design of relational databases
A database schema consists of a set of entities and the relationships between them

A data warehouse
Requires a concise, subject-oriented schema
Facilitates on-line data analysis

Data Model:
Multidimensional model
Model can exist in the form of:
Star Schema
Snowflake Schema
Fact Constellation Schema

The well known schemas are:
Star Schema: single fact table with n dimension tables linked to it.
Snowflake Schema: single fact table with n dimension tables organized as a hierarchy.
Fact Constellation Schema: multiple fact tables sharing dimension tables.
Each schema has a fact table that stores all the facts about the subject/measure.
Each fact is associated with multiple dimension keys that are linked to dimension tables.

Star Schema
There is a central large fact table with no redundancy
Each tuple in the fact table has a foreign key to a dimension table which describes the details of that dimension
What is the problem of the schema and how to overcome it?
What is the advantage of the schema?
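A star schema can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table and column names (`fact_sales`, `dim_time`, `dim_item`, `dim_location`) and the sample rows are invented, and SQLite stands in for a real warehouse database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per member, with descriptive attributes.
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, item_type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER
);

INSERT INTO dim_time     VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO dim_item     VALUES (1, 'TV-A', 'TV'), (2, 'Phone-B', 'phone');
INSERT INTO dim_location VALUES (1, 'Chicago', 'USA'), (2, 'Toronto', 'Canada');
INSERT INTO fact_sales   VALUES (1, 1, 1, 5), (1, 2, 1, 7), (2, 1, 2, 4);
""")

# Units sold per country: a single join from the fact table to one dimension.
star_rows = conn.execute("""
    SELECT l.country, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_location l ON f.location_key = l.location_key
    GROUP BY l.country
    ORDER BY l.country
""").fetchall()
```

Note that `country` is stored on every city row of `dim_location`; the dimension table is deliberately left denormalized, which keeps queries to one join per dimension.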

Snowflake Schema
Figure: Fact Table
What is the problem of the schema?
Some of the dimension tables are normalized, thus splitting data into additional tables
Thus the Snowflake schema is not as popular as the Star schema
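The effect of normalizing a dimension can be sketched with `sqlite3`. The table names and rows below are assumptions for illustration: the location dimension is split so that countries live in their own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is normalized: country moves to its own table.
CREATE TABLE dim_country  (country_key INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,
    city         TEXT,
    country_key  INTEGER REFERENCES dim_country(country_key)
);
CREATE TABLE fact_sales (
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER
);

INSERT INTO dim_country  VALUES (1, 'USA'), (2, 'Canada');
INSERT INTO dim_location VALUES (1, 'Chicago', 1), (2, 'New York', 1), (3, 'Toronto', 2);
INSERT INTO fact_sales   VALUES (1, 5), (2, 3), (3, 4);
""")

# Units sold per country now needs two joins instead of one.
snowflake_rows = conn.execute("""
    SELECT c.country, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_location l ON f.location_key = l.location_key
    JOIN dim_country  c ON l.country_key  = c.country_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
```

The country name is stored once instead of once per city, but the query needs an extra join to reach it. That extra join per normalized level is the usual cost weighed against the saved space.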

Illustration of Constellations
Figure: several fact tables
Two or more fact tables share dimension tables.
In the figure above the Sales fact table and Shipping fact table share the dimension tables

DATA WAREHOUSE
DESIGN

Physical Design vs Logical Design
Logical Design
More conceptual and abstract
You look at the logical relationships among the objects
You do not deal with the physical implementation details yet.
You deal only with defining the types of information that you need

Physical Design
You look at the most effective way of storing and retrieving the objects as well as handling them from a transportation and backup/recovery perspective

Logical Design
The logical design should result in:
1. a set of entities and attributes corresponding to fact tables and dimension tables
2. a model of operational data from your source into subject-oriented information in your target data warehouse schema

Create Logical Design
Using a pen and paper, or
Using a design tool such as
Oracle Warehouse Builder (specifically designed to support modeling the ETL process)
Oracle Designer (a general purpose modeling tool)

Physical Design Process
You convert the data gathered during the logical design phase into a description of the physical database structure
Physical design decisions are mainly driven by query performance and database maintenance aspects

Distinguishing between Logical and Physical Designs
Mapping:
Entities -> Tables
Relationships -> Foreign Key Constraints
Attributes -> Columns
Primary Unique Identifiers -> Primary Key Constraints
Unique Identifiers -> Unique Key Constraints
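The mapping can be sketched as a small function that turns a logical entity description into a `CREATE TABLE` statement. The dictionary layout, the helper `to_ddl` and the `branch`/`account` entities are hypothetical; design tools perform this step with far more detail.

```python
import sqlite3


def to_ddl(entity):
    """Map a logical entity description to a physical CREATE TABLE statement."""
    # Attributes -> columns
    parts = [f"{col} {typ}" for col, typ in entity["attributes"].items()]
    # Primary unique identifier -> primary key constraint
    parts.append(f"PRIMARY KEY ({', '.join(entity['primary_uid'])})")
    # Other unique identifiers -> unique key constraints
    for uid in entity.get("unique_uids", []):
        parts.append(f"UNIQUE ({', '.join(uid)})")
    # Relationships -> foreign key constraints
    for col, (ref_table, ref_col) in entity.get("relationships", {}).items():
        parts.append(f"FOREIGN KEY ({col}) REFERENCES {ref_table}({ref_col})")
    return f"CREATE TABLE {entity['name']} (\n  " + ",\n  ".join(parts) + "\n)"


# Hypothetical logical entities.
branch = {
    "name": "branch",
    "attributes": {"branch_id": "INTEGER", "branch_code": "TEXT"},
    "primary_uid": ["branch_id"],
    "unique_uids": [["branch_code"]],
}
account = {
    "name": "account",
    "attributes": {"account_id": "INTEGER", "branch_id": "INTEGER"},
    "primary_uid": ["account_id"],
    "relationships": {"branch_id": ("branch", "branch_id")},
}

# Check that the generated statements are accepted by a database engine.
conn = sqlite3.connect(":memory:")
for entity in (branch, account):
    conn.execute(to_ddl(entity))

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

Printing `to_ddl(account)` shows each line of the mapping as one clause of the generated statement.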

PHYSICAL DESIGN

Physical Design Structure
Create some or all of the following structures (these either require disk space or exist only in the data dictionary):
Tablespaces
Tables and Partitioned Tables
Views
Integrity Constraints
Dimensions
Structures for Performance Improvement
Indexes and Partitioned Indexes
Materialized Views

Introduction
Physical database design is a fundamental part of data warehouse design.
The performance of a data warehouse is largely affected by the physical design of the underlying databases and the environment where the databases are running.
To do the physical database design, it is important to understand the physical system architecture on which the database will be operating.

Non-Functional (NF) Requirements
Some NF requirements commonly encountered:
The data warehouse must be available 24 hours a day, 7 days a week
Downtime of no more than one hour per month
HIGH AVAILABILITY REQUIREMENTS
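To get a feel for how strict "one hour of downtime per month" is, the requirement can be turned into an availability percentage. The 30-day month is an assumption of this sketch.

```python
HOURS_PER_DAY = 24


def availability(downtime_hours, days_in_month=30):
    """Fraction of the month during which the system is up (30-day month assumed)."""
    total_hours = days_in_month * HOURS_PER_DAY
    return (total_hours - downtime_hours) / total_hours


required = availability(1.0)                  # 719 of 720 hours
required_percent = round(required * 100, 2)   # about 99.86 percent
```

An availability of roughly 99.86 percent leaves little room for a long outage of a single server, which motivates the clustered deployment described next.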

Consequences
The database engine cannot be implemented on a single server; it must run on a failover cluster
Failover cluster: a configuration of installations on several identical servers (nodes)
Database instances run on an active node, but when the active node is not available, the database instances automatically switch over to a secondary node

Reporting Services needs to be deployed on a network load balanced (NLB) cluster, while the Reporting Services database is installed on a failover cluster
NLB: a configuration of servers where the incoming traffic is distributed to several identical servers
Components of SQL Server Reporting Services (SSRS)
a web service: installed on NLB servers
a database: running on a failover cluster

Analysis Services (for storing multidimensional databases / cubes) needs to be installed on a failover cluster, either in the same cluster as the database engine or in a separate cluster
The recommendation is to install Analysis Services in a separate cluster, because we can optimize and tune the memory and CPU usage separately
The disk space allocated on the SAN for Analysis Services should ideally be separated from the database server for the same reason

(SQL Server) Integration Services
SSIS is not a cluster-aware application.
This means we cannot put SSIS in a failover cluster, which means that if the SSIS server is unavailable, there is no secondary server to take over

Network Specification
Between the SSIS server and the database server, and between the database server and the OLAP server, try to use a Gigabit network (capable of 1 Gbps throughput) rather than the normal 100 Mbps Ethernet.
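A back-of-the-envelope comparison of the two link speeds, assuming a hypothetical 50 GB nightly load, decimal units and the full nominal line rate (real throughput is lower because of protocol overhead and contention):

```python
def transfer_seconds(size_gb, link_mbps):
    """Idealized transfer time: data size divided by the nominal link rate."""
    bits = size_gb * 8 * 1000**3            # decimal gigabytes to bits
    return bits / (link_mbps * 1_000_000)   # megabits per second to bits per second


t_gigabit = transfer_seconds(50, 1000)   # 1 Gbps link
t_fast_eth = transfer_seconds(50, 100)   # 100 Mbps link
```

Under these assumptions the same load takes 400 seconds on the Gigabit link and 4000 seconds on 100 Mbps Ethernet, a tenfold difference in the time the ETL window spends moving data.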

Reporting Services web specification
A scale-out deployment does not require a high server specification
Ex: two or three nodes with two CPUs and 2GB or 4GB RAM
Sizing a Reporting Services web farm is similar to sizing the web servers for a web site: it depends on the amount of traffic served by the servers

ETL server specification
The ETL server performs intense calculations
Its specification is determined by the complexity of the transformation logic in the ETL processes
The amount of memory required depends on the size of the online lookup in the transformations

OLAP server cluster specification
The memory requirement depends on:
the number of large cubes that will be running on Analysis Services
the number of partitions (the more partitions we have, the more memory we need)
whether there will be OLAP queries running while the partitions are being processed (if there are, we need more memory)
the requirements of a large file system cache (the larger the cache, the more memory we need)
the requirements of a large process buffer (the larger this is, the more memory we need)
the number of large replicas required (a replica is a subset of a dimension to accommodate dynamic security)
the number of heavy relational database queries and OLAP queries running simultaneously (lots of parallel queries require more memory)

The number of CPUs affects the aggregate calculations and cube-processing times.

Criteria to determine database server specification
The number and complexity of reports, applications, and direct queries hitting the DDS (Dimensional Data Store)
Whether we're taking an ELT or ETL approach in populating the NDS (Normalized Data Store) / ODS (Operational Data Store)
Calculation from the stage to the NDS/ODS and the complexity of firewall rules
The number and size of data stores
How the data stores are physically designed (indexing, partitions, and so on)
The number of other databases hosted in the same server and future growth
