DATA WAREHOUSING
The goal is to enable users to make informed decisions rapidly so their companies can
respond to make change and remain competitive.
1. Starter
The architecture of a data warehouse is very complex and involves many elements, because
it consists of many different systems that must be connected as well as processed.
Building any corporate data warehouse requires a technical infrastructure that includes
an operating system, hardware platform, database management system (DBMS), and network.
Selecting the DBMS is a little more complicated than for a straightforward operational
system because of the unusual challenges of the data warehouse, especially its need to
support very complex queries that cannot be predicted in advance. This will be explored
in more detail later in the chapter.
2. The warehouse requires ongoing processes to feed it, and these processes require their
own infrastructure. What infrastructure is required for the data warehouse?
3. IT departments often overlook this aspect when they plan for the data
warehouse. Data must be stored in different layers, so what are the data layers?
4. How can the data be cleaned, and what are the steps? How will ongoing data loads,
cleansing, and summarizing be accomplished?
5. How will users get information out of the warehouse? The choice of query tool
becomes very important and depends on many factors.
The following steps develop the architecture. Perform them in the order in which they
are given:
2. Next, you must staff an architecture team with strong personnel. It is not
necessarily the technology you choose for your architecture but the personnel
designing and developing it that make the project successful.
3. Prototype/benchmark all the technologies you are interested in using. Design and
develop a prototype that can be used to test all of the different technologies that are
being considered.
4. Give the architecture team enough time to build the architecture infrastructure
before development begins. For a large organization, this can be anywhere from six
months to a year or more.
sushiltry@yahoo.co.in
SUSHIL KULKARNI 3
5. Make sure you train the development staff on the use of the architecture before
development begins. Spend time letting the development team get full exposure to
the capabilities and components of the architecture.
As we examine the architecture of a data warehouse, we will look at it from three views:
the overall data warehouse infrastructure, data layer components, and ongoing
maintenance infrastructure.
The data warehouse consists of the following architectural components, which compose
the data warehouse infrastructure:
o Metadata layer: Data about data. This includes, but is not limited to, definitions
and descriptions of data items and business rules.
o Data acquisition: The process of loading data from the various sources.
o User analysis: Includes the infrastructure required to support user queries and
analysis.
The infrastructure foundation upon which the data warehouse is built is often called the
platform. It is made up of the following components:
a minimal amount of effort required. Most of the hardware of a data warehouse will
consist of a number of large machines. A large machine has 6 to 8 or even 12 CPUs,
a gigabyte or more of memory, and many gigabytes or even a terabyte of disk space.
Personnel: This may seem like an odd component of data warehouse architecture, but
it is the most important (and the most expensive!). There are a number of technology
choices on the market today, and each of these technologies has good features and bad
features. Therefore, choosing the right components is not an exact science. The key
factor in whether or not the technology will work is the skill level of the individuals
designing and developing the architecture. Good components and experienced
architects/developers will make the difference in the end.
Figure A illustrates the overall high-level data architectural components required for the
typical data warehouse effort. For building a data warehouse, we typically build at least
two separate databases: an interim "staging area" and the warehouse itself.
When the data is loaded into the warehouse, if it comes over from the legacy system as
is with no transformation, it is considered Level 0.
Alternatively, some scrubbing can take place on the legacy system (if the load of the
system allows it). Some tools enable you to enter scrub rules and the tool will generate
code, which can then be executed in various ways.
If some initial scrubbing occurs on the legacy side, then the data coming over to the
interim staging area is called Level 1, because it has already been processed once.
Each time data is scrubbed, the data level is incremented.
The data is placed in the interim environment so scrubbing can take place. Often,
primary keys must be resolved. Sometimes one row of data in a warehouse table will be
sourced from more than one legacy system. The primary key is pieced together in the
interim environment. This must take place before any other scrubbing can occur;
you must have a proper identifier for each row before proceeding. Often, more than one
iteration of scrubbing occurs in the staging area. Scrubbing takes place at Level 2,
data from all the legacy systems is aggregated at Level 3, and the final result is
obtained at Level 4.
FIGURE A FIGURE B
Figure B shows an alternative with four data levels. Level 0 is the straight extract,
taken as is from the legacy environment. Level 1 takes place in the interim staging
area; its main purpose is to create a primary key, a single field that will serve as the
unique identifier for each row. Level 2 then cleans up miscellaneous data anomalies,
such as replacing non-standard project codes with the approved values. Another set of
scrub routines then produces the summarized information that will be stored in the data
warehouse. This summarization is Level 3. Some warehouses may have more than one
level of summarization; Level 4 shows more granular summaries being calculated.
Most data warehouses in the real world don't have all of these levels shown. As stated
previously, if scrubbing takes place before the data is shipped to the interim
environment, Level 0 is not even shown. It is never represented in the warehouse or
stored.
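The data levels described above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the tooling used in these notes: the legacy system names, the field names, and the project-code mapping are all hypothetical.

```python
from collections import defaultdict

# Hypothetical Level 0 extract: rows taken as is from two legacy systems.
LEVEL0 = [
    {"system": "billing", "local_id": 17, "project": "prj-01", "hours": 5.0},
    {"system": "billing", "local_id": 18, "project": "P01", "hours": 3.0},
    {"system": "payroll", "local_id": 17, "project": "P02", "hours": 8.0},
]

# Hypothetical mapping from non-standard project codes to approved values.
APPROVED_CODES = {"prj-01": "P01"}


def to_level1(rows):
    """Level 1: build a single-field primary key for every row."""
    return [{**r, "row_key": f"{r['system']}:{r['local_id']}"} for r in rows]


def to_level2(rows):
    """Level 2: replace non-standard project codes with approved values."""
    return [
        {**r, "project": APPROVED_CODES.get(r["project"], r["project"])}
        for r in rows
    ]


def to_level3(rows):
    """Level 3: summarize hours per project."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["project"]] += r["hours"]
    return dict(totals)


level1 = to_level1(LEVEL0)
level2 = to_level2(level1)
level3 = to_level3(level2)
```

Note that the key is created before any other scrubbing, so every later step can refer to a row by a single identifier even when two legacy systems reuse the same local id.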
Data warehouses are fed periodically, and repeatable processes must be in place for this
to occur. The following processes are part of this iterative cycle:
o Process errors (such as moving rows with bad data into an error table, flagging
problem rows, etc.)
o Perform summarization/aggregation
[Figure labels: Queries/reports, Data mining, Ad hoc user, SQL view, OLAP, DW]
This cycle is repeated every time a load is performed. The following figure shows the
essential architectural components required for ongoing maintenance of the data
warehouse.
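The error-processing and summarization steps of this cycle might look like the sketch below. It uses an in-memory SQLite database purely for illustration; the table names (sales, load_errors, sales_by_region) and the validation rules are assumptions, not part of the original notes.

```python
import sqlite3


def run_load(conn, staged_rows):
    """Run one load cycle: divert bad rows, load good rows, rebuild the summary."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS load_errors (raw TEXT, reason TEXT)")
    for region, amount in staged_rows:
        if not region:
            reason = "missing region"
        elif not isinstance(amount, (int, float)) or amount < 0:
            reason = "invalid amount"
        else:
            reason = None
        if reason is None:
            conn.execute("INSERT INTO sales VALUES (?, ?)", (region, float(amount)))
        else:
            # Process errors: move the bad row into an error table.
            conn.execute(
                "INSERT INTO load_errors VALUES (?, ?)",
                (repr((region, amount)), reason),
            )
    # Perform summarization/aggregation over everything loaded so far.
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )
    conn.commit()


def summary(conn):
    """Return the current summary table as a {region: total} dictionary."""
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))
```

Because run_load can be called again with the next batch of staged rows, the same function serves every periodic feed, which is the repeatability the cycle requires.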
7.1 Components
Clearly, the goal of data warehousing is to free the information that is locked up in the
operational databases and to mix it with information from other, often external, sources
of data. Increasingly, large organizations are acquiring additional data from outside
databases. This information includes demographic, econometric, competitive and
purchasing trends. The so-called "information superhighway" is providing access to more
data resources every day.
The Information Access layer of the Data Warehouse Architecture is the layer that the
end-user deals with directly. In particular, it represents the tools that the end-user
normally uses day to day, e.g., Excel, Lotus 1-2-3, Focus, Access, SAS, etc. This layer
also includes the hardware and software involved in displaying and printing reports,
spreadsheets, graphs and charts for analysis and presentation. Over the past two
decades, the Information Access layer has expanded enormously, especially as end-
users have moved to PCs and PC/LANs.
Today, more and more sophisticated tools exist on the desktop for manipulating,
analyzing and presenting data; however, there are significant problems in making the
raw data contained in operational systems available easily and seamlessly to end-user
tools. One of the keys to this is to find a common data language that can be used
throughout the enterprise.
The Data Access Layer of the Data Warehouse Architecture is involved with allowing the
Information Access Layer to talk to the Operational Layer. In the network world today,
the common data language that has emerged is SQL. Originally, SQL was developed by
IBM as a query language, but over the last twenty years it has become the de facto
standard for data interchange.
One of the key breakthroughs of the last few years has been the development of a
series of data access "filters" such as EDA/SQL that make it possible for SQL to access
nearly all DBMSs and data file systems, relational or nonrelational. These filters make it
possible for state-of-the-art Information Access tools to access data stored on database
management systems that are twenty years old.
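The idea behind such a filter can be illustrated with a toy example that exposes a flat file to SQL. This is not how EDA/SQL works internally; it is only a sketch of the concept, and the file contents, table name, and function name are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical non-relational source: a comma-separated flat file.
FLAT_FILE = "part_no,qty\nA100,4\nB200,7\nA100,1\n"


def expose_as_table(conn, table, text):
    """Copy a flat file into a SQL table so that SQL tools can query it.

    The table and column names come from trusted input in this sketch.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", reader)


conn = sqlite3.connect(":memory:")
expose_as_table(conn, "parts", FLAT_FILE)
result = conn.execute(
    "SELECT part_no, SUM(CAST(qty AS INTEGER)) FROM parts "
    "GROUP BY part_no ORDER BY part_no"
).fetchall()
```

Once the file is visible as a table, any tool that speaks SQL can aggregate it without knowing the original file layout.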
The Data Access Layer not only spans different DBMSs and file systems on the same
hardware, it spans manufacturers and network protocols as well. One of the keys to a
Data Warehousing strategy is to provide end-users with "universal data access".
Universal data access means that, theoretically at least, end-users, regardless of location
or Information Access tool, should be able to access any or all of the data in the
enterprise that is necessary for them to do their job.
The Data Access Layer then is responsible for interfacing between Information Access
tools and Operational Databases. In some cases, this is all that certain end-users need.
However, in general, organizations are developing a much more sophisticated scheme to
support Data Warehousing.
In order to provide for universal data access, it is absolutely necessary to maintain some
form of data directory or repository of meta-data information. Meta-data is the data
about data within the enterprise. Record descriptions in a COBOL program are meta-
data. So are DIMENSION statements in a FORTRAN program, or SQL Create statements.
The information in an ERA diagram is also meta-data.
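Since SQL Create statements are themselves meta-data, a database catalog can act as a very small data directory. The sketch below assumes SQLite and a hypothetical customer table; a real repository would also hold business definitions and rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL, region TEXT)"
)


def describe(conn, table):
    """Return (column name, declared type, is primary key) for each column.

    The table name is assumed to come from trusted input in this sketch.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        (name, col_type, bool(pk))
        for _cid, name, col_type, _notnull, _default, pk in rows
    ]


columns = describe(conn, "customer")
```

Here the catalog answers the question "what does this table contain?" without anyone reading the data itself, which is the role the data directory plays for the whole enterprise.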
The Process Management Layer is involved in scheduling the various tasks that must be
accomplished to build and maintain the data warehouse and data directory information.
The Process Management Layer can be thought of as the scheduler or the high-level job
control for the many processes (procedures) that must occur to keep the Data
Warehouse up-to-date.
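One way to picture this scheduling role is as a dependency-ordered job list. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the job names and their dependencies are hypothetical and chosen only to mirror the tasks mentioned in these notes.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs, each mapped to the set of jobs that must finish first.
DEPENDENCIES = {
    "extract": set(),
    "scrub": {"extract"},
    "load": {"scrub"},
    "summarize": {"load"},
    "refresh_directory": {"load"},
}


def plan(dependencies):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run(order, jobs):
    """Execute the job callables in the planned order and log what ran."""
    log = []
    for name in order:
        jobs[name]()
        log.append(name)
    return log


order = plan(DEPENDENCIES)
```

A production scheduler would add timing, retries, and failure handling; the point here is only that the layer decides what runs and in which order.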
The Application Message Layer has to do with transporting information around the
enterprise computing network. Application Messaging is also referred to as
"middleware", but it can involve more than just networking protocols. Application
Messaging, for example, can be used to isolate applications, operational or informational,
from the exact data format on either end. Application Messaging can also be used to
collect transactions or messages and deliver them to a certain location at a certain time.
Application Messaging is the transport system underlying the Data Warehouse.
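The two roles described above, isolating applications from each other's data format and collecting messages for later delivery, can be sketched with a tiny in-process message bus. The MessageBus class, its methods, and the topic name are hypothetical; real middleware products work across machines and offer far more.

```python
import json
import queue


class MessageBus:
    """Collects messages per topic and hands them over on request."""

    def __init__(self):
        self._queues = {}

    def publish(self, topic, payload):
        # Senders pass plain dictionaries; the bus owns the wire format (JSON).
        self._queues.setdefault(topic, queue.Queue()).put(json.dumps(payload))

    def deliver(self, topic):
        """Drain and decode everything collected so far for a topic."""
        pending = self._queues.get(topic)
        messages = []
        while pending is not None and not pending.empty():
            messages.append(json.loads(pending.get()))
        return messages


bus = MessageBus()
bus.publish("orders", {"order_id": 1, "amount": 100.0})
bus.publish("orders", {"order_id": 2, "amount": 50.0})
batch = bus.deliver("orders")
```

Neither the sender nor the receiver needs to know how the other stores its data, and delivery happens when the receiving side asks for the batch.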
The (core) Data Warehouse is where the actual data used primarily for informational
purposes resides. In some cases, one can think of the Data Warehouse simply as a logical or
virtual view of data. In many instances, the data warehouse may not actually involve
storing data.
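A logical or virtual warehouse of this kind can be illustrated with an SQL view, which stores a query rather than a copy of the data. The sketch assumes SQLite, and the orders table and customer_totals view are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'acme', 100.0), (2, 'acme', 50.0), (3, 'zenith', 75.0);
    -- The view holds no rows of its own; it is evaluated against the
    -- operational table each time it is queried.
    CREATE VIEW customer_totals AS
        SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;
    """
)


def totals(conn):
    """Read the virtual view as a {customer: total} dictionary."""
    return dict(conn.execute("SELECT customer, total FROM customer_totals"))


before = totals(conn)
conn.execute("INSERT INTO orders VALUES (4, 'zenith', 25.0)")
after = totals(conn)
```

The view reflects the new order immediately because nothing was copied, which also means every informational query runs against the operational table.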
In a Physical Data Warehouse, copies, in some cases many copies, of operational and/or
external data are actually stored in a form that is easy to access and is highly flexible.
Increasingly, Data Warehouses are stored on client/server platforms, but they are often
stored on mainframes as well.
The final component of the Data Warehouse Architecture is Data Staging. Data Staging
is also called copy management or replication management, but in fact, it includes all of
the processes necessary to select, edit, summarize, combine and load data warehouse
and information access data from operational and/or external databases.
Data Staging often involves complex programming, but increasingly data warehousing
tools are being created that help in this process. Data Staging may also involve data
quality analysis programs and filters that identify patterns and data structures within
existing operational data.
The previous article covered the different components of data warehouse architecture.
When designing an architecture, care must be taken that it is not faulty. This
article describes several architectures that are considered wrong. Many Data
Warehouse projects fail because the selected architecture is incapable of meeting
business requirements.
A desire to build a Data Warehouse quickly and cheaply often leads to selection of the
wrong architecture. The following architectures are generally considered to be wrong:
In this architecture there is no Data Warehouse database. The business analysts access
operational databases using simple OLAP front-end tools. This architecture is popular
because it requires minimal additional investment in hardware and software. It
requires no extra IT personnel, and there is no extracting, cleaning, and loading
burden. The front-end data access and analysis tools simplify access to legacy database
systems on mainframes, and allow multidimensional queries on views and drill-down
operations on operational data. The following figure depicts this architecture:
o Historical data,
o Summarized and aggregated data,
o Central meta data repository with enterprise wide definitions of the business data
semantics
o Cleaning and transforming operational data to suit the decision making processes
This is a packaged product that builds a Data Warehouse database from various data
sources and provides access to it through user-friendly data access and
analysis tools. It also builds a local meta data repository with data definitions in business
terms. The following figure depicts this architecture:
Following are some of the advantages and disadvantages of the above architecture:
o The data mart in a box architecture eliminates the interference of OLAP operations
with OLTP
o But it retains some of the old and introduces some new problems:
* Lack of support for common business rules, semantics, and data definitions
across business areas (although every data mart maintains its own meta
data repository)
To sum up, the benefits of having a data warehouse architecture are as follows:
o Provides an organizing framework - the architecture draws the lines on the map
in terms of what the individual components are, how they fit together, who owns
what parts, and priorities.
o Improved flexibility and maintenance - allows you to quickly add new data
sources, interface standards allow plug and play, and the model and meta data allow
impact analysis and single-point changes.