
What is a Data Warehouse?

A data warehouse is a relational database that facilitates online analytical processing by allowing data to be viewed in different dimensions to provide business intelligence.

Data Warehousing and Online Analytical Processing

Data warehouses and data marts can be used for sophisticated enterprise intelligence
systems that process the queries required to discover trends and analyze critical factors. These
systems are called online analytical processing (OLAP) systems.

OLAP data is organized into multidimensional cubes. The structure of data in multidimensional
cubes gives better performance for OLAP queries than data organized in relational tables.

Difference between a data warehouse and OLAP

A data warehouse is the place where data is stored for analysis, whereas OLAP is the process
of analyzing the data, managing aggregations, and partitioning information into cubes for in-depth
visualization.

OLTP Systems

Systems that are designed to store the daily business transactions of an organization are known
as online transaction processing (OLTP) systems. OLTP systems are designed and tuned to
process hundreds or thousands of transactions entered at the same time.

Why are OLTP database designs not generally a good idea for a Data Warehouse?

In OLTP systems, tables are normalized, so query response for an end user performing analysis is slow.
OLTP systems also do not contain years of history, so long-term trends cannot be analyzed.

Data Mart

A data mart is a focused subset of a data warehouse that deals with a single area of data (such
as one department) and is organized for quick analysis. Data marts are well suited to
small and medium-sized enterprises as well as to individual departments of a large organization.

ODS

An operational data store (ODS) is a relational database that maintains current
data from OLTP systems and is used by operational users for high-performance integrated
processing. For example, an ODS can be used to publish the results of different exams.

ER model

The ER model is a conceptual data model that views the real world as entities and relationships. A
basic component of the model is the entity-relationship diagram, which is used to visually
represent data objects.

Dimensional modeling

Dimensional modeling is a design concept used by many data warehouse designers to build
their data warehouses. In this design model, all the data is stored in two types of tables: fact
tables and dimension tables. The fact table contains the facts/measurements of the business, and
the dimension tables contain the dimensions on which the facts are calculated.
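As a minimal sketch of this idea, assuming a hypothetical retail sales subject area (all table and column names below are invented for illustration), a star schema could be created from Python with SQLite like this:

    import sqlite3

    # A minimal star-schema sketch: one fact table referencing two dimension tables.
    # All table and column names here are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_date (
        date_key    INTEGER PRIMARY KEY,   -- surrogate key
        full_date   TEXT,
        month       TEXT,
        year        INTEGER
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,   -- surrogate key
        name        TEXT,
        category    TEXT
    );
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,               -- additive measure
        amount      REAL                   -- additive measure
    );
    """)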

Difference between the ER model and the dimensional model

The basic difference is that E-R modeling has both a logical and a physical model, while the
dimensional model has only a physical model.
E-R modeling is used for normalizing the OLTP database design.
Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

Snowflake schema

A snowflake schema is one fact table connected to a number of dimension tables, with those
dimension tables in turn connected to other dimension tables.
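Continuing the hypothetical sales example above, snowflaking the product dimension means normalizing it: the product category moves into its own small table that dim_product references.

    import sqlite3

    # Hypothetical snowflaked product dimension: dim_product no longer stores the
    # category name directly, it references a separate dim_category table.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_category (
        category_key  INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,
        name          TEXT,
        category_key  INTEGER REFERENCES dim_category(category_key)
    );
    """)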

Fact and Dimension

Dimension tables hold descriptive data, usually the dimensions/attributes of the business.

Facts are the measures or factual data of a business. A fact table typically has two
types of columns: those that contain numeric facts (often called measurements), and those
that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts
that have been aggregated. Fact tables that contain aggregated facts are often called
summary tables. A fact table usually contains facts with the same level of aggregation. Though
most facts are additive, they can also be semi-additive or non-additive. Additive facts can be
aggregated by simple arithmetical addition. A common example of this is sales. Non-additive
facts cannot be added at all. An example of this is averages. Semi-additive facts can be
aggregated along some of the dimensions and not along others. An example of this is
inventory levels, where you cannot tell what a level means simply by looking at it.
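A small illustration of the difference, using made-up sales and inventory rows: a sales quantity can be summed across every dimension, while an inventory level can be summed across stores for one day but should not be summed across days.

    # Hypothetical rows: (day, store, value)
    sales = [("2024-01-01", "S1", 10), ("2024-01-01", "S2", 5), ("2024-01-02", "S1", 7)]
    inventory = [("2024-01-01", "S1", 100), ("2024-01-02", "S1", 90)]

    # Additive fact: summing sales across both day and store is meaningful.
    total_sales = sum(qty for _, _, qty in sales)                                  # 22

    # Semi-additive fact: summing inventory levels across days (100 + 90) is meaningless;
    # take the level on the last day instead.
    last_day = max(day for day, _, _ in inventory)
    closing_stock = sum(level for day, _, level in inventory if day == last_day)  # 90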

Conformed dimension

A conformed dimension is a single, coherent view of a piece of data throughout the
organization. The same dimension is used in all the star schemas that are subsequently defined;
in other words, a dimension that carries the same meaning across the entire set of star schemas
is called a conformed dimension.
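For instance, reusing the hypothetical names from the earlier sketch, a sales star and a returns star can share one and the same dim_date table, so month and year mean the same thing in both:

    import sqlite3

    # Two star schemas share one and the same date dimension (names are made up).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_date (              -- the conformed dimension, defined once
        date_key INTEGER PRIMARY KEY,
        month    TEXT,
        year     INTEGER
    );
    CREATE TABLE fact_sales   (date_key INTEGER REFERENCES dim_date(date_key), amount REAL);
    CREATE TABLE fact_returns (date_key INTEGER REFERENCES dim_date(date_key), amount REAL);
    """)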

Junk Dimension

Sometimes when you create fact and dimension tables from an operational database, you will find
attributes that do not fit into any of the tables but should not simply be thrown away.
In such cases there are two options. The first is to discard them, which may cause a loss of
information. The second is to store them in a junk dimension, where a junk dimension table is
created to hold these miscellaneous attributes.
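As a sketch, with made-up flag names, the handful of low-cardinality flags left over from the source system are combined into a single junk dimension, and the fact table then stores one surrogate key instead of several flag columns:

    from itertools import product

    # Hypothetical low-cardinality flags left over from the operational system.
    payment_types = ["CASH", "CARD"]
    gift_wrap_flags = ["Y", "N"]

    # Pre-populate the junk dimension with every flag combination and a surrogate key.
    junk_dimension = {
        (pay, gift): key
        for key, (pay, gift) in enumerate(product(payment_types, gift_wrap_flags), start=1)
    }

    # A fact row then stores a single junk_key instead of the individual flag columns.
    junk_key = junk_dimension[("CARD", "Y")]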

Degenerate dimension

Values in a fact table that are neither dimensions nor measures are called degenerate
dimensions. Examples: invoice id, empno.

Cubes

Cubes are a logical representation of multidimensional data. The edges of the cube contain the
dimension members, and the body of the cube contains the data values.
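A tiny sketch of that idea in plain Python, with made-up members: the edges are the lists of dimension members, and the body is a mapping from a member combination to a data value.

    # Edges of the cube: dimension members (made-up values).
    products = ["pen", "pad"]
    months = ["Jan", "Feb"]
    regions = ["North", "South"]

    # Body of the cube: one data value per (product, month, region) cell.
    cube = {
        ("pen", "Jan", "North"): 120,
        ("pen", "Feb", "South"): 95,
        ("pad", "Jan", "North"): 40,
    }
    value = cube.get(("pen", "Jan", "North"), 0)   # 120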

Data mining
Data mining is the process of extracting hidden trends from a data warehouse. For example, an
insurance data warehouse can be mined to find the highest-risk people to insure in a
certain geographical area.

The differences between the Bill Inmon and Ralph Kimball approaches

For Bill Inmon, there is one big data warehouse for the entire company, with a star
schema for each department or business need.

For Ralph Kimball, all the star schemas together form the data warehouse.

That is the theory; in practice:

Bill Inmon:
You have a corporate environment that is normalized, with no duplication and no aggregation... a pure
environment that contains the historical information. Based on this corporate environment,
you create a star schema each time you have a business need.

Pros:
Very clean.
Building a star schema is very fast once the EDW (corporate environment) is built.
There is no need to keep a complete detailed star schema; you can concentrate your energy on
aggregate star schemas, which perform better.

Cons:
Very costly (at the beginning).
If not done properly, it can take a while before end users see data, and your project might
be killed by users who think they are paying for nothing.

Ralph Kimball:
There is no corporate environment; you have a need, you create a star schema for it. You try to reuse
some dimensions by making them conformed dimensions...

Pros:
Very fast to develop.
End users can see their data very quickly.
Costs less to develop.

Cons:
You have to keep a detailed star schema in case you need to build new aggregate tables.
Each star schema takes its data from the operational source systems.
If not done properly, it is easy to get what I call "chaos"...

So, from my point of view and my experience... from a theoretical point of view, Inmon is the
best... but, in practice, your users want to make decisions quickly... they will put a lot of
pressure on you to get their data... so, in practice, Kimball is used more often for these reasons.

Physical Design

During the logical design phase, you defined a model for your data warehouse consisting of
entities, attributes, and relationships. The entities are linked together using relationships.
Attributes are used to describe the entities. The unique identifier (UID) distinguishes between
one instance of an entity and another.

Figure 3-1, "Logical Design Compared with Physical Design", is not reproduced here.

During the physical design process, you translate the expected schemas into actual database
structures. At this time, you have to map:

• Entities to tables

• Relationships to foreign key constraints

• Attributes to columns

• Primary unique identifiers to primary key constraints

• Unique identifiers to unique key constraints
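A minimal sketch of that mapping, using hypothetical Customer and Order entities: each entity becomes a table, its unique identifier becomes a primary key, its attributes become columns, and the relationship between the two becomes a foreign key constraint.

    import sqlite3

    # Entities -> tables, UIDs -> primary keys, attributes -> columns,
    # the relationship between them -> a foreign key constraint (names are hypothetical).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,          -- primary unique identifier -> primary key
        name        TEXT,                         -- attribute -> column
        email       TEXT UNIQUE                   -- secondary unique identifier -> unique key
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
                    REFERENCES customer(customer_id),  -- relationship -> foreign key constraint
        order_date  TEXT
    );
    """)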

Normalization:-

Normalization is the process of taking data from a problem and reducing it to a set of relations
while ensuring data integrity and eliminating data redundancy.

Data integrity: all the data in the database is consistent and satisfies all integrity constraints.
Data redundancy: if the data in the database can be found in two different locations, or can be
calculated from other data items, then the data is said to contain redundancy.

Integrity constraint:-
This is a rule that restricts the data values that may be present in the database.

Entity integrity: - Each row in a relation must be uniquely identified.

Referential integrity: - This constraint involves foreign keys. Every foreign key must contain
either NULL or an actual value of the key in another relation.
First Normal Form:- A relation is in 1st normal form if and only if it does not contain any
repeating attributes or groups of attributes.
To remove the repeating groups, one of two things can be done:
1. Either flatten the table and extend the primary key, OR
2. Decompose the relation, which leads to 1st normal form.
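For example, a made-up relation ORDER (order_id, customer, {item}) with a repeating group of items can either be flattened with an extended primary key, or decomposed into ORDER (order_id, customer) and ORDER_ITEM (order_id, line_no, item), where the primary key of ORDER_ITEM extends the order key.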

Second Normal Form:- A relation is in 2nd normal form if and only if the relation is in 1NF and
there must be no partial functional dependencies.

To remove partial dependencies in a relation R (A, B, C, D, E) with composite key (A, D):

Separate out all attributes which depend solely on A and put them in a separate relation.
Separate out all attributes which depend solely on D and put them in a separate relation.
Separate out all attributes which depend on both A and D and put them in a separate
relation.
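For illustration, assume B depends only on A, C depends only on D, and E depends on the whole key (A, D). The decomposition is then:

    R1 (A, B)
    R2 (D, C)
    R3 (A, D, E)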

Third Normal Form:- A relation is in 3NF if and only if the relation is in 2NF and there must be
no transitive functional dependencies. A transitive functional dependency can only occur when
one non-key attribute of a relation is dependent on another non-key attribute of that relation.

So we can say that a relation in 2NF with zero or one non-key attribute is automatically in 3NF.

To remove the transitive functional dependency in a relation R (A, B, C) where A -> B and B -> C:

Create two relations, one containing the transitive dependency and another with all the remaining
attributes,

i.e. R1 (A, B) and R2 (B, C)
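A concrete, made-up instance of the same pattern: in EMP (emp_no, dept_no, dept_name), emp_no -> dept_no and dept_no -> dept_name, so dept_name depends transitively on the key through dept_no. The decomposition is EMP (emp_no, dept_no) and DEPT (dept_no, dept_name).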

BCNF:- A relation is in BCNF if and only if it is in 3NF and every determinant is a candidate
key.

For example

R(a,b,c,d)
a,c -> b,d
a,d -> b
Here, the first determinant suggests that the primary key of R could be changed from a,b to
a,c. If this change was done all of the non-key attributes present in R could still be
determined, and therefore this change is legal. However, the second determinant indicates
that a,d determines b, but a,d could not be the key of R as a,d does not determine all of the
non-key attributes of R (it does not determine c). We would say that the first determinant is a
candidate key, but the second determinant is not a candidate key, and thus this relation is not
in BCNF (but it is in 3rd normal form).
