Week 1b - Data Warehouse

Data warehouse
Dr. Retno Kusumaningrum, S.Si., M.Kom.
DATA WAREHOUSE
DEFINITION
A data warehouse is an enterprise structured repository of subject-oriented, time-variant, historical data used for information retrieval and decision support.
The data warehouse stores atomic and summary data.
DATA WAREHOUSE
CHARACTERISTICS
Data Warehouse
Characteristics
Subject Oriented
Data is categorized and stored by business
subject rather than by application
Figure: OLTP applications (Insurance, Loans, Savings) compared with a data warehouse subject (Customer Financial Information)
(Contd)
Focusing on the modeling and analysis
of data for decision makers, not on daily
operations or transaction processing
Provide a simple and concise (brief)
view around particular subject issues by
excluding data that are not useful in
the decision support process
Integrated
Data on a given subject is defined and
stored once
Figure: OLTP applications (Savings, Current accounts, Loans) integrated into a single Customer subject in the data warehouse
(Contd)
Constructed by integrating multiple, heterogeneous data sources
  relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  E.g., hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
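The hotel price example can be sketched as a small conversion step applied while data moves into the warehouse. This is a minimal sketch, not a production ETL routine: the `HotelPrice` record, the exchange rates, the 10% tax rate, and the flat breakfast fee are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to one warehouse currency (USD); not real quotes.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}


@dataclass
class HotelPrice:
    source: str               # which source system supplied the record
    amount: float             # price as quoted by the source
    currency: str
    tax_included: bool
    breakfast_included: bool


def to_warehouse_price(p: HotelPrice, tax_rate: float = 0.10,
                       breakfast_fee_usd: float = 10.0) -> float:
    """Convert a source price to one convention: USD, tax and breakfast included."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1 + tax_rate          # assumed flat tax rate
    if not p.breakfast_included:
        usd += breakfast_fee_usd     # assumed flat breakfast fee
    return round(usd, 2)


site_a = HotelPrice("site_a", 100.0, "USD", True, True)
site_b = HotelPrice("site_b", 100.0, "EUR", False, False)
print(to_warehouse_price(site_a), to_warehouse_price(site_b))
```

After conversion, the two prices follow the same convention and can be compared directly, which is the point of integrating on load rather than at query time.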
Time Variant
Data is stored as a series of
snapshots, each representing a
period of time
Figure: snapshots by time period (Jan-97: January data, Feb-97: February data, Mar-97: March data)
(Contd)
The time horizon for the data warehouse is significantly longer than that of operational systems
  Operational database: current value data
  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
  contains an element of time, explicitly or implicitly
  But the key of operational data may or may not contain a time element
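One way to picture a time-variant store is a mapping whose key always includes the period. The sketch below is illustrative only: the account identifiers, balances, and the `load_snapshot` and `history` helpers are hypothetical names, not part of any particular warehouse product.

```python
from datetime import date

# Warehouse rows are keyed by (period, account_id): time is part of the key.
warehouse = {}


def load_snapshot(period: date, balances: dict) -> None:
    """Add one period's snapshot; earlier snapshots are left untouched."""
    for account_id, balance in balances.items():
        key = (period, account_id)
        if key in warehouse:
            raise ValueError(f"snapshot already loaded for {key}")
        warehouse[key] = balance


def history(account_id: str) -> list:
    """Return (period, balance) pairs for one account, oldest first."""
    return sorted((p, b) for (p, a), b in warehouse.items() if a == account_id)


load_snapshot(date(1997, 1, 1), {"A1": 100, "A2": 50})
load_snapshot(date(1997, 2, 1), {"A1": 120, "A2": 50})
load_snapshot(date(1997, 3, 1), {"A1": 90})

print(history("A1"))
```

An operational system would typically keep only the latest balance per account; here each month adds new keys, so the history stays queryable.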
Non Volatile
Typically data in the data warehouse is
not updated or deleted
Figure: operations performed on each store
  Operational: Insert, Update, Delete, Read
  Warehouse: Load, Read
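The difference in allowed operations can be sketched with two small classes. This is a toy model under the assumption that the warehouse exposes only load and read; the class and method names are hypothetical.

```python
class OperationalTable:
    """Operational store: rows can be inserted, updated, deleted, and read."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, row):
        self.rows[key] = dict(row)

    def update(self, key, **changes):
        self.rows[key].update(changes)

    def delete(self, key):
        del self.rows[key]

    def read(self, key):
        return dict(self.rows[key])


class WarehouseTable:
    """Warehouse store: only bulk load and read are offered."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        self._rows.extend(dict(r) for r in rows)   # append only

    def read(self):
        return [dict(r) for r in self._rows]       # copies, so callers cannot mutate


oltp = OperationalTable()
oltp.insert(1, {"balance": 100})
oltp.update(1, balance=80)          # the old value is overwritten

dw = WarehouseTable()
dw.load([{"account": 1, "period": "Jan-97", "balance": 100}])
dw.load([{"account": 1, "period": "Feb-97", "balance": 80}])
print(len(dw.read()))
```

The operational table ends up with a single current row, while the warehouse keeps both loaded rows.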
THE DIFFERENCE BETWEEN DATA WAREHOUSE AND OPERATIONAL DBMS
Operational DBMS vs. Data Warehouse
OLTP (on-line transaction processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
  Major task of data warehouse system
  Data analysis and decision making
In Detail
DISTINCT FEATURES             OLTP                      OLAP
User and system orientation   Customer                  Market
Usage                         Repetitive                Ad hoc
Data contents                 Current, detailed         Historical, consolidated
Data design                   ER + application          Schema (e.g., star) + subject
View                          Current, local            Evolutionary
Access pattern                Update                    Read-only but complex queries
Number of records accessed    Tens, twenty, etc.        Hundreds
DB size                       100 MB to GB              100 GB to TB
Metric                        Transaction throughput    Query throughput, response
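The two access patterns can be contrasted with Python's built-in `sqlite3` module. This is only a sketch on a toy scale: the `orders` table and its rows are hypothetical, and a real warehouse would run the analytical query on a separate, much larger store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (region, amount, status) VALUES (?, ?, ?)",
    [("East", 10.0, "open"), ("East", 20.0, "open"), ("West", 5.0, "open")],
)

# OLTP-style access: a short transaction that touches one record by key.
with conn:
    conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = ?", (1,))

# OLAP-style access: a read-only query that scans and aggregates many records.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```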
Why Separate Data Warehouse?
High performance for both systems
  DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
  Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, read-only access to data records for summarization and aggregation (consolidation)
Different data in the warehouse and its sources:
  data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
Note: more and more systems perform OLAP analysis directly on relational databases
PROBLEMS OF DATA
WAREHOUSING
Problems of Data Warehousing

Underestimation of resources for data
loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
High demand for resources
Data ownership
Long duration projects
Complexity of integration
9/29/16 rev.
What data are stored in the warehouse?
In simple words: subject(s) per dimension
Example:
If our subject/measure is quantity sold
and the dimensions are Item Type, Location, and Period,
the data warehouse stores the items sold per type, per geographical location, during the particular period.
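As a minimal sketch of "measure per dimension", the quantity sold can be accumulated under a key made of one value from each dimension. The sales records below are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records: (item_type, location, period, quantity).
sales = [
    ("TV", "Chicago", "Q1", 5),
    ("TV", "Chicago", "Q1", 3),
    ("TV", "Toronto", "Q1", 4),
    ("Phone", "Chicago", "Q2", 7),
]

# Measure (quantity sold) stored per combination of dimension values.
quantity_sold = defaultdict(int)
for item_type, location, period, qty in sales:
    quantity_sold[(item_type, location, period)] += qty

print(quantity_sold[("TV", "Chicago", "Q1")])
```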
How do we represent this data?
DATA MODEL
Multidimensional Data Model
A data cube:
  From tables and spreadsheets to data cubes
  Allows data to be modeled and viewed in multiple dimensions.
Dimensions:
  Perspectives or entities with respect to which an organization wants to keep records.
Example:
  AllElectronics may create a sales data warehouse in order to keep records of the store's sales with respect to the dimensions time, item, branch, and location.
  Thus, we can keep track of things like monthly sales of items and the branches and locations at which the items were sold.
A data cube
We usually think of cubes as 3-D
geometric structures
In data warehouse
The data cube is n-dimensional
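To make the n-dimensional idea concrete, the sketch below aggregates a measure over every subset of the dimensions, so n dimensions yield 2^n group-bys (ignoring hierarchies within a dimension). The facts and the `build_cube` helper are hypothetical, and real OLAP engines compute and store these aggregates far more efficiently.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time", "item", "location")

# Hypothetical base facts: one dict per sale, with a "units" measure.
facts = [
    {"time": "Q1", "item": "TV", "location": "Chicago", "units": 5},
    {"time": "Q1", "item": "TV", "location": "Toronto", "units": 4},
    {"time": "Q2", "item": "Phone", "location": "Chicago", "units": 7},
]


def build_cube(rows, dimensions):
    """Aggregate the measure for every subset of the dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row["units"]
            cube[dims] = dict(totals)
    return cube


cube = build_cube(facts, DIMENSIONS)
print(len(cube))          # number of group-bys for 3 dimensions
print(cube[("item",)])    # units per item
print(cube[()])           # grand total over all dimensions
```

Adding a fourth dimension such as supplier only requires extending `DIMENSIONS` and the fact records; the number of group-bys then doubles.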
How to view sales data with a third dimension?
Example:
We would like to view the data according to time and item, as well as location (e.g., Chicago, New York, Toronto, Vancouver)
How to view sales data with an additional fourth dimension?
E.g., Supplier
Schemas for Multidimensional Databases
Entity-relationship data model
  Commonly used in the design of relational databases
  A database schema consists of a set of entities and the relationships between them
A data warehouse
  Requires a concise, subject-oriented schema
  Facilitates on-line data analysis
Data model:
  Multidimensional model
The model can exist in the form of:
  Star Schema
  Snowflake Schema
  Fact Constellation Schema
The well-known schemas are:
Star Schema: a single fact table with n dimension tables linked to it.
Snowflake Schema: a single fact table with n dimension tables organized as a hierarchy.
Fact Constellation Schema: multiple fact tables sharing dimension tables.
Each schema has a fact table that stores all the facts about the subject/measure.
Each fact is associated with multiple dimension keys that are linked to dimension tables.
Star Schema
There is a large central fact table with no redundancy
Each tuple in the fact table has a foreign key to a dimension table, which describes the details of that dimension
What is the problem with this schema, and how can it be overcome?
What is the advantage of this schema?
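A star schema of the kind described above can be sketched with Python's built-in `sqlite3` module. The table names, column names, and rows are hypothetical; only the shape (one fact table with a foreign key per dimension table) comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER
);

INSERT INTO dim_time VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO dim_item VALUES (1, 'TV'), (2, 'Phone');
INSERT INTO dim_location VALUES (1, 'Chicago', 'USA'), (2, 'Toronto', 'Canada');
INSERT INTO fact_sales VALUES (1, 1, 1, 5), (1, 1, 2, 4), (2, 2, 1, 7);
""")

# One join per dimension of interest, then aggregate the measure.
rows = conn.execute("""
    SELECT l.city, SUM(f.units_sold)
    FROM fact_sales f JOIN dim_location l ON f.location_key = l.location_key
    GROUP BY l.city ORDER BY l.city
""").fetchall()
print(rows)
```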
Snowflake Schema
Figure: snowflake schema with a central fact table
What is the problem with this schema?
Some of the dimension tables are normalized, splitting the data into additional tables, so queries need more joins
Thus the Snowflake schema is not as popular as the Star schema
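The effect of normalizing a dimension can be sketched with `sqlite3`. In this hypothetical example the country attribute is moved out of the location dimension into its own table, so a query by country has to follow two joins from the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is normalized: country moves to its own table.
CREATE TABLE dim_country  (country_key INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,
    city         TEXT,
    country_key  INTEGER REFERENCES dim_country(country_key)
);
CREATE TABLE fact_sales (
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER
);

INSERT INTO dim_country VALUES (1, 'USA'), (2, 'Canada');
INSERT INTO dim_location VALUES (1, 'Chicago', 1), (2, 'New York', 1), (3, 'Toronto', 2);
INSERT INTO fact_sales VALUES (1, 5), (2, 3), (3, 4);
""")

# Reaching the country now takes two joins instead of one.
rows = conn.execute("""
    SELECT c.country, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_location l ON f.location_key = l.location_key
    JOIN dim_country  c ON l.country_key  = c.country_key
    GROUP BY c.country ORDER BY c.country
""").fetchall()
print(rows)
```

The country name is stored once per country rather than once per city, at the cost of the extra join.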
Illustration of Constellations
Figure: fact tables sharing dimension tables
Two or more fact tables share dimension tables.
In the figure above, the Sales fact table and the Shipping fact table share the dimension tables.
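A fact constellation can be sketched the same way: two fact tables that reference the same dimension tables. The schema and rows below are hypothetical and kept to two dimensions for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables.
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);

-- Two fact tables that reference the same dimensions.
CREATE TABLE fact_sales (
    time_key   INTEGER REFERENCES dim_time(time_key),
    item_key   INTEGER REFERENCES dim_item(item_key),
    units_sold INTEGER
);
CREATE TABLE fact_shipping (
    time_key      INTEGER REFERENCES dim_time(time_key),
    item_key      INTEGER REFERENCES dim_item(item_key),
    units_shipped INTEGER
);

INSERT INTO dim_time VALUES (1, 'Q1');
INSERT INTO dim_item VALUES (1, 'TV'), (2, 'Phone');
INSERT INTO fact_sales VALUES (1, 1, 9), (1, 2, 7);
INSERT INTO fact_shipping VALUES (1, 1, 6), (1, 2, 7);
""")

# Aggregate each fact table separately, then line the results up by item.
rows = conn.execute("""
    SELECT i.item_name,
           (SELECT SUM(units_sold)    FROM fact_sales    s WHERE s.item_key = i.item_key),
           (SELECT SUM(units_shipped) FROM fact_shipping h WHERE h.item_key = i.item_key)
    FROM dim_item i ORDER BY i.item_name
""").fetchall()
print(rows)
```

Because both fact tables use the same item keys, sold and shipped quantities can be compared per item without duplicating the dimension data.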
DATA WAREHOUSE
DESIGN
Physical Design vs Logical Design
Logical Design
  More conceptual and abstract
  You look at the logical relationships among the objects
  You do not deal with the physical implementation details yet
  You deal only with defining the types of information that you need
Physical Design
  You look at the most effective way of storing and retrieving the objects, as well as handling them from a transportation and backup/recovery perspective
Logical Design
The logical design should result in :
1. a set of entities and attributes
corresponding to fact tables and
dimension tables
2. a model of operational data from
your source into subject-oriented
information in your target data
warehouse schema
Create Logical Design
Using pen and paper, or
Using a design tool such as
  Oracle Warehouse Builder (specifically designed to support modeling the ETL process)
  Oracle Designer (a general-purpose modeling tool)
Physical Design Process

You convert the data gathered during
the logical design phase into a
description of the physical database
structure
Physical design decisions are mainly
driven by query performance and
database maintenance aspects
Distinguishing between Logical and Physical Designs
Mapping:
  Entities -> Tables
  Relationships -> Foreign Key Constraints
  Attributes -> Columns
  Primary Unique Identifiers -> Primary Key Constraints
  Unique Identifiers -> Unique Key Constraints
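The mapping can be illustrated with a small `sqlite3` sketch in which each logical element becomes a physical one. The `customer` and `orders` entities, their attributes, and the `violates` helper are hypothetical; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled
conn.executescript("""
-- Entity "Customer" -> table; attributes -> columns.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,        -- primary unique identifier -> primary key constraint
    email       TEXT UNIQUE,                -- unique identifier -> unique key constraint
    name        TEXT
);
-- Relationship "Customer places Order" -> foreign key constraint.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
INSERT INTO customer VALUES (1, 'a@example.com', 'Ani');
INSERT INTO orders VALUES (10, 1);
""")


def violates(sql: str) -> bool:
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


print(violates("INSERT INTO customer VALUES (2, 'a@example.com', 'Budi')"))  # duplicate email
print(violates("INSERT INTO orders VALUES (11, 99)"))                        # unknown customer
```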
PHYSICAL DESIGN
Physical Design Structure
Create some or all of the following structures:
  Tablespaces
  Tables and Partitioned Tables
  Views
  Integrity Constraints
  Dimensions
Some of these structures require disk space; others exist only in the data dictionary.
Structures for Performance Improvement
  Indexes and Partitioned Indexes
  Materialized Views
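A few of these structures can be tried out in SQLite from Python. This is a loose sketch: SQLite has no tablespaces, partitioning, or materialized views, so a summary table stands in for a materialized view here, and all object names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (item TEXT, city TEXT, units_sold INTEGER);
INSERT INTO fact_sales VALUES ('TV', 'Chicago', 5), ('TV', 'Toronto', 4), ('Phone', 'Chicago', 7);

-- Index: a stored structure that speeds up lookups by city.
CREATE INDEX idx_sales_city ON fact_sales(city);

-- View: only the query definition is stored, not the result rows.
CREATE VIEW v_units_by_item AS
    SELECT item, SUM(units_sold) AS units FROM fact_sales GROUP BY item;

-- Stand-in for a materialized view: the query result is stored as rows.
CREATE TABLE mv_units_by_item AS
    SELECT item, SUM(units_sold) AS units FROM fact_sales GROUP BY item;
""")

# The catalog (SQLite's data dictionary) lists every structure that was created.
catalog = conn.execute("SELECT type, name FROM sqlite_master ORDER BY name").fetchall()
print(catalog)

# A new fact row shows up in the view but not in the stored summary.
conn.execute("INSERT INTO fact_sales VALUES ('TV', 'Chicago', 1)")
view_tv = conn.execute("SELECT units FROM v_units_by_item WHERE item = 'TV'").fetchone()[0]
stored_tv = conn.execute("SELECT units FROM mv_units_by_item WHERE item = 'TV'").fetchone()[0]
print(view_tv, stored_tv)
```

The stored summary answers queries without rescanning the fact table but has to be refreshed after each load, which is the trade-off a materialized view manages.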
Introduction
Physical database design is a fundamental
part of data warehouse design.
The performance of a data warehouse is
largely affected by the physical design of the
underlying databases and the environment
where the databases are running.
To do the physical database design, it is important to understand the physical system architecture on which the database will be operating.
Non-Functional (NF) Requirements
Some NF requirements that are commonly encountered:
  the data warehouse must be available 24 hours a day, 7 days a week
  downtime must not exceed one hour per month
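The downtime limit can be turned into an availability percentage with simple arithmetic. The sketch assumes a 30-day month; under that assumption one hour of downtime corresponds to an availability target of roughly 99.86%.

```python
HOURS_PER_MONTH = 30 * 24          # assumes a 30-day month


def availability(max_downtime_hours: float, period_hours: float = HOURS_PER_MONTH) -> float:
    """Fraction of the period during which the system must be up."""
    return 1 - max_downtime_hours / period_hours


target = availability(1)           # at most one hour of downtime per month
print(f"{target:.4%}")
```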
HIGH AVAILABILITY REQUIREMENTS
Consequences
The database engine cannot be implemented on a single server; it must run on a failover cluster
  a configuration of installations on several identical servers (nodes)
  database instances run on an active node, but when the active node is not available, the database instances automatically switch over to a secondary node
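The switch-over behaviour can be pictured with a toy model. This is not how a real cluster manager works internally; the `FailoverCluster` class, node names, and methods are hypothetical and only mimic the active/secondary idea.

```python
class FailoverCluster:
    """Toy model: one active node serves requests; a standby takes over on failure."""

    def __init__(self, nodes):
        self.available = {node: True for node in nodes}   # insertion order = priority
        self.active = nodes[0]

    def mark_down(self, node):
        self.available[node] = False
        if node == self.active:
            self._fail_over()

    def _fail_over(self):
        for node, up in self.available.items():
            if up:
                self.active = node       # first healthy node becomes active
                return
        raise RuntimeError("no node available")

    def handle(self, query):
        return f"{self.active} ran: {query}"


cluster = FailoverCluster(["node1", "node2"])
before = cluster.handle("SELECT 1")
cluster.mark_down("node1")               # simulated failure of the active node
after = cluster.handle("SELECT 1")
print(before, "|", after)
```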
Reporting Services needs to be deployed on a network load balanced (NLB) cluster, while the Reporting Services database is installed on a failover cluster
  NLB: a configuration of servers where the incoming traffic is distributed to several identical servers
Components of SQL Server Reporting Services (SSRS)
  a web service: installed on NLB servers
  a database: running on a failover cluster
Analysis Services (which stores multidimensional databases / cubes) needs to be installed on a failover cluster
  either in the same cluster as the database engine or in a separate cluster
  The recommendation is to install Analysis Services in a separate cluster, because we can then optimize and tune the memory and CPU usage separately
The disk space allocated on the SAN for Analysis Services should ideally be separated from the database server for the same reason
(SQL Server) Integration Services (SSIS)
  is not a cluster-aware application.
  This means we cannot put SSIS in a failover cluster, which means that if the SSIS server is unavailable, there is no secondary server to take over
Network Specification
Between the SSIS server and the database server, and between the database server and the OLAP server
  try to use a Gigabit network (capable of 1 Gbps throughput) rather than the normal 100 Mbps Ethernet.
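A rough calculation shows why the faster link matters for bulk loads. The sketch assumes ideal line rate with no protocol overhead, and the 100 GB batch size is a hypothetical figure; under those assumptions the transfer takes about 13 minutes at 1 Gbps and over two hours at 100 Mbps.

```python
def transfer_seconds(size_gigabytes: float, link_megabits_per_s: float) -> float:
    """Ideal transfer time at full line rate (no protocol overhead or contention)."""
    size_megabits = size_gigabytes * 1000 * 8      # decimal units: 1 GB = 1000 MB
    return size_megabits / link_megabits_per_s


nightly_load_gb = 100                                    # hypothetical ETL batch size
gigabit = transfer_seconds(nightly_load_gb, 1000)        # 1 Gbps link
fast_ethernet = transfer_seconds(nightly_load_gb, 100)   # 100 Mbps link
print(gigabit / 60, fast_ethernet / 60)                  # minutes
```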
Reporting Services Web Specification
A scale-out deployment
  does not require a high server specification
  Ex: two or three nodes with two CPUs and 2 GB or 4 GB RAM
Sizing a Reporting Services web farm is similar to sizing the web servers for a web site
  it depends on the amount of traffic served by the servers
ETL Server Specification
It performs intense calculations
Its specification is determined by the complexity of the transformation logic in the ETL processes
The amount of memory required depends on the size of the online lookups in the transformations
OLAP Server Cluster Specification
The memory requirement depends on:
  the number of large cubes that will be running on Analysis Services
  the number of partitions (the more partitions we have, the more memory we need)
  whether there will be OLAP queries running while the partitions are being processed (if there are, we need more memory)
  the requirements of a large file system cache (the larger the cache, the more memory we need)
  the requirements of a large process buffer (the larger this is, the more memory we need)
  the number of large replicas required (a replica is a subset of a dimension to accommodate dynamic security)
  the number of heavy relational database queries and OLAP queries running simultaneously (lots of parallel queries require more memory)
The number of CPUs affects the aggregate calculations and cube-processing times.
Criteria to Determine the Database Server Specification
The number and complexity of reports, applications, and direct queries hitting the DDS (Dimensional Data Store)
Whether we're taking an ELT or ETL approach in populating the NDS (Normalized Data Store) / ODS (Operational Data Store)
Calculation from the stage to the NDS/ODS and the complexity of firewall rules
The number and size of data stores
How the data stores are physically designed (indexing, partitions, and so on)
The number of other databases hosted on the same server, and future growth
