
DATA WAREHOUSING

The goal is to enable users to make informed decisions rapidly so their companies can
respond to make change and remain competitive.

DATA WAREHOUSING ARCHITECTURE

1. Starter
2. Six Steps to Develop the Architecture
3. The Data Warehouse Infrastructure
4. Data Warehouse System Infrastructure
5. Data Layer components
6. Ongoing Maintenance: Warehouse Infrastructure
7. What is Data Warehouse Architecture?
   7.1 Components
8. Different possible wrong architectures
   8.1 “Virtual” Data Warehouse
   8.2 Data Mart in a box Architecture
SUSHIL KULKARNI 2

1. Starter

The architecture of a data warehouse is very complex and involves many elements. This is
because the architecture consists of many different systems, which must be connected as
well as processed. Constructing any corporate data warehouse requires a technical
infrastructure that includes the operating system, hardware platform, database
management system, and network. Selecting the DBMS is a little more complicated than
for a straightforward operational system because of the unusual challenges of the data
warehouse, especially its need to support very complex queries that cannot be predicted
in advance. This will be explored in more detail later in the chapter.

In this chapter we will answer the following questions:

1. How does the data get into the data warehouse?

2. The warehouse requires ongoing processes to feed it; these processes require their
own infrastructure. What is the infrastructure required for data warehouse?

3. IT departments often overlook the above aspect when they plan for the data
warehouse. Storing the data requires different layers, so what are the data layers?

4. How can the data be cleaned, and what are the steps? How will ongoing data loads,
cleansing, and summarizing be accomplished?

5. How will users get information out of the warehouse? The choice of query tool
becomes very important and depends on many factors.

2. Six Steps to Develop the Architecture

The following steps develop the architecture. Perform them in the order in which they
are given:

1. The most important step in developing an effective data warehouse architecture is to
enlist the full support and commitment of a company executive as project sponsor.

2. Next, you must staff an architecture team with strong personnel. It is not
necessarily the technology you choose for your architecture that makes the project
successful; it is the personnel you have designing and developing the architecture.

3. Prototype/benchmark all the technologies you are interested in using. Design and
develop a prototype that can be used to test all of the different technologies that are
being considered.

4. Give the architecture team enough time to build the architecture infrastructure
before development begins. For a large organization, this can be anywhere from six
months to a year or more.

sushiltry@yahoo.co.in

5. Make sure you train the development staff on the use of the architecture before
development begins. Spend time letting the development team get full exposure to
the capabilities and components of the architecture.

6. Provide the architecture team an opportunity to enhance and improve the
architecture as the project moves forward. No matter how much time is spent up
front developing an architecture, it will not be perfect the first time around.

As we examine the architecture of a data warehouse, we will look at it from three views:
the overall data warehouse infrastructure, data layer components, and ongoing
maintenance infrastructure.

3. The Data Warehouse Infrastructure

The data warehouse consists of the following architectural components, which compose
the data warehouse infrastructure:

o System infrastructure: Hardware, software, network, database management
system, and personnel components of the infrastructure.

o Metadata layer: Data about data. This includes, but is not limited to, definitions
and descriptions of data items and business rules.

o Data discovery: The process of understanding the current environment so it can
be integrated into the warehouse.

o Data acquisition: The process of loading data from the various sources.

o Data distribution: The dissemination/replication of data to distributed data marts
for specific segmented groups.

o User analysis: Includes the infrastructure required to support user queries and
analysis.

4. Data Warehouse System Infrastructure

The technical architecture of a data warehouse is an important component. The reason
for this is that the technical architecture is used as the base for building all the other
data warehouse components. This is why the technical architecture is called the
infrastructure.

The infrastructure foundation upon which the data warehouse is built is often called the
platform. It is made up of the following components:

Hardware, including operating system: Should be open, meaning that a variety of
tools are able to run on the platform, and data is able to flow to/from the platform with
a minimal amount of effort. Most of the hardware of a data warehouse will consist of a
number of large machines. A large machine has 6 to 8, or even 12, CPUs with a gigabyte
or more of memory and many gigabytes or even a terabyte of disk space.

Network: Should minimize complexity and maximize bandwidth. Should connect (directly)
all components and locations of the corporate enterprise that need access to the data
warehouse.

Software: Of course, the most important software component of a data warehouse is
the Database Management System (DBMS) (seen in the first chapter). However, there
are other important software components as well: the monitoring, administration, and
network management tools used to maintain the database; the software used to support
user access; the data modeling tools used by the development staff to design,
implement, and maintain the data warehouse; and third-party utilities.

Personnel: This may seem like an odd component of data warehouse architecture, but
it is the most important (and the most expensive!). There are a number of technology
choices on the market today, and each of these technologies has good features and bad
features. Therefore, choosing the right components is not an exact science. The key
factor in whether or not the technology will work is the skill level of the individuals
designing and developing the architecture. Good components and experienced
architects/developers will make the difference in the end.

5. Data Layer components

Figure A illustrates the overall high-level data architectural components required for the
typical data warehouse effort. For building a data warehouse, we typically build at least
two separate databases: an interim "staging area" and the warehouse itself.

When the data is loaded into the warehouse, if it comes over from the legacy system as
is with no transformation, it is considered Level 0.

Alternatively, some scrubbing can take place on the legacy system (if the load of the
system allows it). Some tools enable you to enter scrub rules and the tool will generate
code, which can then be executed in various ways.

If some initial scrubbing occurs on the legacy side, then the data coming over to the
interim staging area is called Level 1, because it has already been processed once.
Each time data is scrubbed, the data level is incremented.

The data is placed in the interim environment so scrubbing can take place. Often,
primary keys must be resolved. Sometimes one row of data in a warehouse table will be
sourced from more than one legacy system. The primary key is pieced together in the
interim environment. This must take place before any other scrubbing can occur;
you must have a proper identifier for each row before proceeding. Often, more than one
iteration of scrubbing occurs in the staging area. Scrubbing takes place at Level 2, the
data from all the legacy systems is aggregated at Level 3, and the final result is
obtained at Level 4.
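
The level bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular tool: the `Batch` type, the record fields, and the two scrub rules are hypothetical names invented for the example. Each scrub pass produces a new batch whose level is one higher, and the primary key is pieced together before any other scrub runs.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, str]


@dataclass(frozen=True)
class Batch:
    """A set of rows plus the data level they have reached (hypothetical type)."""
    level: int
    rows: tuple[Row, ...]


def scrub(batch: Batch, rule: Callable[[Row], Row]) -> Batch:
    """Apply one scrub rule to a copy of every row; each scrub increments the level."""
    return Batch(batch.level + 1, tuple(rule(dict(r)) for r in batch.rows))


def build_primary_key(row: Row) -> Row:
    # Assumed rule: the key is pieced together from the source system and its local id.
    row["pk"] = f'{row["source"]}-{row["local_id"]}'
    return row


def standardize_project_code(row: Row) -> Row:
    # Assumed rule: approved project codes are upper case with no surrounding blanks.
    row["project"] = row["project"].strip().upper()
    return row


# Level 0: rows exactly as they left the legacy systems (made-up sample data).
level0 = Batch(0, (
    {"source": "billing", "local_id": "17", "project": " ab12 "},
    {"source": "payroll", "local_id": "17", "project": "AB12"},
))

level1 = scrub(level0, build_primary_key)         # keys must be resolved first
level2 = scrub(level1, standardize_project_code)  # then other anomalies are fixed
```

Because `scrub` copies each row, the Level 0 extract stays untouched and can be kept or discarded independently of the later levels.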


LEVEL 4 : Final Result
LEVEL 3 : Aggregate                          LEVEL 4 : Aggregate
LEVEL 2 : Scrub                              LEVEL 3 : Scrub
LEVEL 1 : Create Primary key, Integrate      LEVEL 2 : Integrate
LEVEL 0 : Extract without scrub              LEVEL 1 : Extract and Cleaning

FIGURE A                                     FIGURE B

Figure B shows an alternate way with four data levels. Level 0 is the straight extract,
taken as is from the legacy environment. Level 1 takes place in the interim staging
area; its main purpose is to create a primary key, a single field that will serve as the
unique identifier for each row. Then, Level 2 cleans up miscellaneous data anomalies,
such as replacing non-standard project codes with the approved values. Another set of
scrub routines then summarizes the information that will be stored in the data
warehouse. This summarization is Level 3. Some warehouses may have one or more
levels of summarization. Level 4 shows more granular summaries.

Most data warehouses in the real world don't have all of these levels shown. As stated
previously, if scrubbing takes place before the data is shipped to the interim
environment, Level 0 is not even shown. It is never represented in the warehouse or
stored.

6. Ongoing Maintenance: Warehouse Infrastructure

Data warehouses are fed periodically, and repeatable processes must be in place for this
to occur. The following processes are part of this iterative cycle:

o Extract data from source system database

o Export data from source environment to warehouse platform


o Copy data into interim staging area database

o Perform necessary scrubs

o Process errors (such as moving rows with bad data into an error table, flagging
problem rows, etc.)

o Perform summarization/aggregation

o Load data from interim staging area into the warehouse

o Perform backup if required

o Propagate subscribed data to data marts as required

This cycle is repeated every time a load is performed. The following figure shows the
essential architectural components required for ongoing maintenance of the data
warehouse.

[Figure: components labelled Queries/reports, Data mining, Ad hoc user SQL view, OLAP, DW]
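
The repeatable cycle listed above can be sketched as a single function. This is a minimal illustration, assuming rows arrive as dictionaries with a `region` and an `amount` field; the field names, the scrub rule, and the in-memory "warehouse" and "error table" are all hypothetical stand-ins for real databases.

```python
from collections import defaultdict


def run_load_cycle(extracted, warehouse, error_table):
    """One load: stage, scrub, divert bad rows, summarize, load (illustrative names)."""
    # Copy the extracted rows into an interim staging area.
    staging = [dict(row) for row in extracted]

    # Scrub: normalize region names (an assumed rule for this example).
    for row in staging:
        row["region"] = row["region"].strip().upper()

    # Process errors: move rows with a non-numeric amount into the error table.
    clean = []
    for row in staging:
        try:
            row["amount"] = float(row["amount"])
            clean.append(row)
        except (TypeError, ValueError):
            error_table.append(row)

    # Summarize: total amount per region.
    totals = defaultdict(float)
    for row in clean:
        totals[row["region"]] += row["amount"]

    # Load the summaries from the staging area into the warehouse.
    for region, total in totals.items():
        warehouse[region] = warehouse.get(region, 0.0) + total


warehouse_totals = {}
load_errors = []
run_load_cycle(
    [
        {"region": " east", "amount": "10.5"},
        {"region": "East", "amount": "4.5"},
        {"region": "west", "amount": "n/a"},
    ],
    warehouse_totals,
    load_errors,
)
```

Calling `run_load_cycle` again with the next extract adds to the same warehouse totals, which mirrors the periodic feeding described in this section; backup and propagation to data marts are left out of the sketch.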

7. What is Data Warehouse Architecture?

Data Warehouse Architecture is a description of the components and services of the
warehouse, how they fit together and how they will grow. These descriptions should
contain enough information to allow a skilled professional to implement the architecture.

Architecture provides the mechanism to achieve enterprise integration to support
business. It provides an organizing framework that will improve data sharing between
agencies, and in the long run allow for faster development, reuse and consistent data
between warehouse projects. Most importantly, this architecture is an evolutionary
process. The architecture as defined here was initially developed as a place to start. The
first enterprise warehouse projects will be based on this architecture. Increments of
additional agency projects will cause this architecture to evolve. As technology changes
and improves, that too will most likely require us to make adjustments to this
architecture. This incremental development of both the architecture and the warehouse
offers an opportunity to learn and to minimize the impact of mistakes.

7.1 Components

The architecture is made up of a number of interconnected parts, called components or
layers, which are as follows:

o Operational Database / External Database Layer
o Information Access Layer
o Data Access Layer
o Data Directory (Metadata) Layer
o Process Management Layer
o Application Messaging Layer
o Data Warehouse Layer
o Data Staging Layer


[A] Operational Database / External Database Layer

Operational systems process data to support critical operational needs. In order to do
that, operational databases have historically been created to provide an efficient
processing structure for a relatively small number of well-defined business transactions.
However, because of the limited focus of operational systems, the databases designed
to support them make it difficult to access the data for other management or
informational purposes. This difficulty in accessing operational data is amplified by the
fact that many operational systems are 10 to 15 years old. The age of some of
these systems means that the data access technology available to obtain operational
data is itself dated.

Clearly, the goal of data warehousing is to free the information that is locked up in the
operational databases and to mix it with information from other, often external, sources
of data. Increasingly, large organizations are acquiring additional data from outside
databases. This information includes demographic, econometric, competitive and
purchasing trends. The so-called "information superhighway" is providing access to more
data resources every day.

[B] Information Access Layer

The Information Access layer of the Data Warehouse Architecture is the layer that the
end-user deals with directly. In particular, it represents the tools that the end-user
normally uses day to day, e.g., Excel, Lotus 1-2-3, Focus, Access, SAS, etc. This layer
also includes the hardware and software involved in displaying and printing reports,
spreadsheets, graphs and charts for analysis and presentation. Over the past two
decades, the Information Access layer has expanded enormously, especially as end-
users have moved to PCs and PC/LANs.

Today, more and more sophisticated tools exist on the desktop for manipulating,
analyzing and presenting data; however, there are significant problems in making the
raw data contained in operational systems available easily and seamlessly to end-user
tools. One of the keys to this is to find a common data language that can be used
throughout the enterprise.

[C] Data Access Layer

The Data Access Layer of the Data Warehouse Architecture is involved with allowing the
Information Access Layer to talk to the Operational Layer. In the network world today,
the common data language that has emerged is SQL. Originally, SQL was developed by
IBM as a query language, but over the last twenty years it has become the de facto
standard for data interchange.

One of the key breakthroughs of the last few years has been the development of a
series of data access "filters" such as EDA/SQL that make it possible for SQL to access
nearly all DBMSs and data file systems, relational or nonrelational. These filters make it
possible for state-of-the-art Information Access tools to access data stored on database
management systems that are twenty years old.

The Data Access Layer not only spans different DBMSs and file systems on the same
hardware, it spans manufacturers and network protocols as well. One of the keys to a
Data Warehousing strategy is to provide end-users with "universal data access".
Universal data access means that, theoretically at least, end-users, regardless of location
or Information Access tool, should be able to access any or all of the data in the
enterprise that is necessary for them to do their job.

The Data Access Layer then is responsible for interfacing between Information Access
tools and Operational Databases. In some cases, this is all that certain end-users need.
However, in general, organizations are developing a much more sophisticated scheme to
support Data Warehousing.
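
As a rough analogy for such a data access "filter", the sketch below exposes a flat-file export through SQL by using Python's standard `csv` and `sqlite3` modules. It is not how EDA/SQL or any commercial product works; the file contents, the table name `customers`, and the function `open_sql_view` are assumptions made up for the example.

```python
import csv
import io
import sqlite3

# A flat-file export standing in for a non-relational operational source (made-up data).
LEGACY_EXPORT = "cust_id,region,balance\n1,EAST,120.50\n2,WEST,80.00\n3,EAST,19.50\n"


def open_sql_view(flat_file_text: str) -> sqlite3.Connection:
    """Copy a flat file into an in-memory SQLite table so tools can query it with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (cust_id INTEGER, region TEXT, balance REAL)")
    rows = csv.DictReader(io.StringIO(flat_file_text))
    # Named placeholders are filled from each row dictionary produced by DictReader.
    conn.executemany(
        "INSERT INTO customers VALUES (:cust_id, :region, :balance)", rows
    )
    return conn


conn = open_sql_view(LEGACY_EXPORT)
# An information access tool only needs to speak SQL, not the file format.
east_total = conn.execute(
    "SELECT SUM(balance) FROM customers WHERE region = ?", ("EAST",)
).fetchone()[0]
```

The point of the design is that the querying side never sees the original file layout; swapping the source for another format would change only `open_sql_view`.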

[D] Data Directory (Metadata) Layer

In order to provide for universal data access, it is absolutely necessary to maintain some
form of data directory or repository of meta-data information. Meta-data is the data
about data within the enterprise. Record descriptions in a COBOL program are meta-
data. So are DIMENSION statements in a FORTRAN program, or SQL Create statements.
The information in an ERA diagram is also meta-data.

In order to have a fully functional warehouse, it is necessary to have a variety of
meta-data available: data about the end-user views of data and data about the operational
databases. Ideally, end-users should be able to access data from the data warehouse
(or from the operational databases) without having to know where that data resides or
the form in which it is stored.
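
A data directory of this kind can be pictured as a lookup from the business name an end-user knows to the physical location of the data. The sketch below is a toy illustration only; the entry fields and the system, table, and column names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryEntry:
    """Metadata about one business data item (all fields are illustrative)."""
    description: str  # business definition shown to end-users
    system: str       # where the data physically resides
    table: str
    column: str


# A tiny data directory keyed by the business name that end-users know.
DATA_DIRECTORY = {
    "customer balance": DirectoryEntry(
        "Amount currently owed by the customer", "billing_db", "CUSTMAST", "CUR_BAL"
    ),
    "customer region": DirectoryEntry(
        "Sales region the customer belongs to", "warehouse", "dim_customer", "region"
    ),
}


def locate(business_name: str) -> str:
    """Resolve a business name to its physical location via the directory."""
    entry = DATA_DIRECTORY[business_name.strip().lower()]
    return f"{entry.system}.{entry.table}.{entry.column}"
```

For example, `locate("Customer Balance")` resolves to the billing system's column without the caller knowing where or in what form the value is stored.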

[E] Process Management Layer

The Process Management Layer is involved in scheduling the various tasks that must be
accomplished to build and maintain the data warehouse and data directory information.
The Process Management Layer can be thought of as the scheduler or the high-level job
control for the many processes (procedures) that must occur to keep the Data
Warehouse up-to-date.
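
One way to picture this high-level job control is a dependency-ordered task list. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the task names and their dependencies are assumptions chosen for illustration, loosely following the load cycle in Section 6.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# The task names are illustrative only.
DEPENDENCIES = {
    "extract": set(),
    "copy_to_staging": {"extract"},
    "scrub": {"copy_to_staging"},
    "summarize": {"scrub"},
    "load_warehouse": {"summarize"},
    "refresh_directory": {"load_warehouse"},
    "propagate_to_marts": {"load_warehouse"},
}


def run_schedule(dependencies, run_task):
    """Run every task once, never before the tasks it depends on."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_task(name)
    return order


completed = []
schedule = run_schedule(DEPENDENCIES, completed.append)
```

A real scheduler would also handle timing, retries, and failures; the sketch covers only the ordering aspect.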

[F] Application Messaging Layer

The Application Messaging Layer has to do with transporting information around the
enterprise computing network. Application Messaging is also referred to as
"middleware", but it can involve more than just networking protocols. Application
Messaging, for example, can be used to isolate applications, operational or informational,
from the exact data format on either end. Application Messaging can also be used to
collect transactions or messages and deliver them to a certain location at a certain time.
Application Messaging is the transport system underlying the Data Warehouse.
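
The two roles mentioned above, hiding the data format and collecting messages for later delivery, can be illustrated with a toy store-and-forward queue. The `MessageBroker` class is a hypothetical stand-in written for this example and does not represent any real middleware product; JSON is an arbitrary choice of wire format.

```python
import json
from collections import deque


class MessageBroker:
    """Toy store-and-forward queue; a hypothetical stand-in for real middleware."""

    def __init__(self):
        self._pending = deque()

    def publish(self, record: dict) -> None:
        # Senders hand over a dict; the wire format (JSON here) is the broker's concern.
        self._pending.append(json.dumps(record, sort_keys=True))

    def deliver_all(self) -> list[dict]:
        """Deliver every collected message in arrival order and empty the queue."""
        delivered = []
        while self._pending:
            delivered.append(json.loads(self._pending.popleft()))
        return delivered


broker = MessageBroker()
broker.publish({"txn": 1, "amount": 25})
broker.publish({"txn": 2, "amount": 40})
batch = broker.deliver_all()  # e.g. invoked at the scheduled delivery time
```

Neither the sender nor the receiver depends on the serialized form, so the wire format could change without touching either application.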

[G] Data Warehouse (Physical) Layer

The (core) Data Warehouse is where the actual data used primarily for informational
purposes resides. In some cases, one can think of the Data Warehouse simply as a logical
or virtual view of data. In many instances, the data warehouse may not actually involve
storing data.

In a Physical Data Warehouse, copies, in some cases many copies, of operational and/or
external data are actually stored in a form that is easy to access and is highly flexible.
Increasingly, Data Warehouses are stored on client/server platforms, but they are often
stored on mainframes as well.

[H] Data Staging Layer

The final component of the Data Warehouse Architecture is Data Staging. Data Staging
is also called copy management or replication management, but in fact, it includes all of
the processes necessary to select, edit, summarize, combine and load data warehouse
and information access data from operational and/or external databases.

Data Staging often involves complex programming, but increasingly data warehousing
tools are being created that help in this process. Data Staging may also involve data
quality analysis programs and filters that identify patterns and data structures within
existing operational data.

8. Different possible wrong architectures

In the previous section you saw the different components of data warehouse architecture.
When designing an architecture, care should be taken that the architecture is not faulty.
This section describes architectures that are possible but wrong. Many data
warehouse projects fail because the selected architecture is incapable of meeting
business requirements.


A desire to build a Data Warehouse quickly and cheaply often leads to the selection of a
wrong architecture. The following architectures are generally considered to be wrong:

o “Virtual” Data Warehouse
o “Data Mart in a Box”

8.1 Virtual Data Warehouse

In this architecture there is no Data Warehouse database. The business analysts access
operational databases using simple OLAP front-end tools. This architecture is popular
because it requires minimal additional investment in hardware and software. It
requires no extra IT personnel, and there is no extracting, cleaning and loading
burden. The front-end data access and analysis tools simplify access to legacy database
systems on mainframes, and allow multidimensional queries on views and drill-down
operations on operational data. The following figure depicts this architecture:

Following are some of the limitations of a “virtual” data warehouse:

1. As no true data warehouse database is built, there is no:

o Historical data
o Summarized and aggregated data
o Central meta data repository with enterprise-wide definitions of the business data
semantics
o Cleaning and transforming of operational data to suit the decision-making processes

2. A “virtual” data warehouse can be considered only a very short-term, temporary
solution for the problem.
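
The first limitation, the absence of historical data, can be made concrete with a small sketch using Python's standard `sqlite3` module. The table names, the dates, and the `snapshot` helper are assumptions invented for the example. The operational table holds only the current value, while a warehouse-style history table keeps one dated copy per load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operational table: holds only the current state of each account.
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
# Warehouse table: keeps one dated copy per load (absent in a "virtual" warehouse).
conn.execute("CREATE TABLE account_history (load_date TEXT, id INTEGER, balance REAL)")


def snapshot(load_date: str) -> None:
    """Copy the current operational rows into the history table."""
    conn.execute(
        "INSERT INTO account_history SELECT ?, id, balance FROM account", (load_date,)
    )


conn.execute("INSERT INTO account VALUES (1, 100.0)")
snapshot("2024-01-31")
conn.execute("UPDATE account SET balance = 250.0 WHERE id = 1")  # old value overwritten
snapshot("2024-02-29")

current = conn.execute("SELECT balance FROM account WHERE id = 1").fetchall()
history = conn.execute(
    "SELECT load_date, balance FROM account_history WHERE id = 1 ORDER BY load_date"
).fetchall()
```

An analyst querying only `account`, as in the virtual approach, can no longer see the January balance; it survives only in `account_history`.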


8.2 Data Mart in a box Architecture

This is a packaged product that builds a Data Warehouse database from various data
sources and provides access to that database through user-friendly data access and
analysis tools. It also builds a local meta data repository with data definitions in business
terms. The following figure depicts this architecture:

Following are some of the advantages and disadvantages of the above architecture:

o The data mart in a box architecture eliminates the interference of OLAP operations
with OLTP.
o But it retains some of the old problems and introduces some new ones:

* This architecture tends to proliferate in an uncontrolled manner, leading to
multiple, non-integrated, independent, local data marts purchased from
different vendors

* Lack of support for common business rules, semantics, and data definitions
across business areas (although every data mart maintains its own meta
data repository)

* Population of data marts with “dirty” source data

The dirty data problem is as follows:

o Data stored in legacy databases has a high percentage of missing, erroneous, or
inconsistent data values. Examples of “dirty” data are multiple attribute values in
one field, one attribute value spread across two or more fields, different spellings
for the same attribute value, inconsistent names for legal entities, and incorrect
use of codes across records.

o Up to 20% of fields can contain such “dirty” data.
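
Two of the dirty-data patterns named above can be detected with simple checks. The sketch below is a minimal illustration: the sample rows, the field names, and the heuristics (separator characters for multi-valued fields, ignoring case and punctuation for spelling variants) are assumptions, and real data quality tools apply far richer rules.

```python
import re
from collections import defaultdict


def find_multi_valued(rows, field):
    """Flag rows that pack several attribute values into one field."""
    return [r for r in rows if re.search(r"[;,/]", r[field])]


def find_spelling_variants(rows, field):
    """Group values that differ only in case, punctuation, or spacing."""
    groups = defaultdict(set)
    for r in rows:
        key = re.sub(r"[^a-z0-9]", "", r[field].lower())
        groups[key].add(r[field])
    return [sorted(v) for v in groups.values() if len(v) > 1]


# Made-up sample rows for illustration.
ROWS = [
    {"company": "I.B.M.", "phone": "555-0100"},
    {"company": "IBM", "phone": "555-0101; 555-0102"},
    {"company": "Acme Ltd", "phone": "555-0199"},
]

multi_valued = find_multi_valued(ROWS, "phone")
variants = find_spelling_variants(ROWS, "company")
```

Here `multi_valued` holds the row with two phone numbers in one field, and `variants` groups the two spellings of the same company name.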

To sum up, the benefits of having a data warehouse architecture are as follows:

o Provides an organizing framework - the architecture draws the lines on the map
in terms of what the individual components are, how they fit together, who owns
what parts, and priorities.

o Improved flexibility and maintenance - allows you to quickly add new data
sources, interface standards allow plug and play, and the model and meta data allow
impact analysis and single-point changes.

o Faster development and reuse - warehouse developers are able to understand the
data warehouse process, database contents, and business rules more quickly.

o Management and communications tool - define and communicate direction and
scope to set expectations, identify roles and responsibilities, and communicate
requirements to vendors.

o Coordinate parallel efforts - multiple, relatively independent efforts have a
chance to converge successfully. Also, data marts without architecture become the
stovepipes of tomorrow.
