
Populating a Data Warehouse with SQL Server 2000
Date: Nov 9, 2001. By Baya Pavliashvili. Article is provided courtesy of Sams.
In this article, Baya Pavliashvili provides a tutorial for populating data warehouses.
The article includes a number of tips for choosing the storage model, indexing fact
and dimension tables, and much more.
Building and maintaining data warehouses (DW) is one of the hottest topics in today's
IT industry. Companies that have been collecting transactional data for decades have
also been thinking about using such data effectively. Many businesses have been
utilizing standard reports that have a "set-in-stone" structure. Others have their
programmers run ad-hoc reports when needed. However, such reports no longer can
provide strategic advantage to companies. What they need instead is a cost-effective
set of tools to let them slice and dice their data along appropriate dimensions to get
them the particular information that they need each time. Data warehouses happen to
be such tools.
Even though data warehousing has been around for more than 20 years, the total cost
of owning a DW has been prohibitive for small to midsize companies. The
introduction of data warehousing support with SQL Server versions 7.0 and 2000 has
made data warehouses much more affordable.
Many complex steps are involved in building and maintaining a successful data
warehouse. First, you have to identify all the reporting needs of the organization. Next
you need to figure out where the data will come from and how to integrate data originating
from various sources. Then you need to build a dimensional data model that will
support the reporting needs and populate this data model with data.
This article talks about populating data warehouses. Although most principles
covered in this article apply to any database management system utilized for building a
data warehouse, I will cover specific implementation details only for Microsoft SQL
Server 2000.
50,000-Foot View
All data warehouses consist of measures and dimensions. The measures are individual
facts to be represented on reports, whereas the dimensions are how the facts need to
be broken down. For instance, a data warehouse for a grocery chain might have
dimensions for customers, suppliers, stores, managers, and measures of costs and
revenues.
When the dimensions and measures have been identified and the dimensional model
has been built, it's time to move on to the physical implementation. This phase consists
of several steps:
1. Populating the fact and dimension tables with data
2. Building appropriate indexes
3. Creating Analysis Services cubes
4. Building aggregations
5. Creating programs to automate the process of updating your data warehouse
I've put step 5 at the end for a reason. Populating the tables and building cubes
manually the first time gives you an idea of what it takes to refresh data in the fact and
dimension tables, how long it takes to build aggregations, and which dimensions need
to be continuously updated. Therefore, physical implementation is more likely to be
successful with a bottom-up approach: First perform your steps manually, and then
automate as many of the tasks as possible through your code.
Let's consider each step in detail.
Populating the Fact and Dimension Tables
Depending on the nature of your data warehouse, you might have to populate your fact
table(s) hourly, several times a day, nightly, once a week, once a month, or perhaps
even once a year. It all depends on how volatile your data is. For instance, in the
manufacturing industry, if your managers need to monitor the number of defects in
products coming off the assembly line, your fact table might have to be refreshed
hourly. On the other hand, if your marketing managers are comparing sales during a
particular time period in the store with the same period last year, a monthly refresh of
the fact table will be sufficient. Indeed, it usually wouldn't make much sense to compare
sales on Tuesday with those on Saturday of the same week.
How often you refresh your fact table depends on your business needs, so be sure to
check with your users. If the organization already has some reports, that will give you a
clue to the frequency that managers need to examine their data. Keep in mind, though,
that one of the reasons you're building a data warehouse is because the existing
reports are not sufficient, so don't rely solely on what's already there; ask what would
make the managers' jobs easier and more productive.
Populating the dimension tables is much trickier than populating the fact tables. Some
dimensions are relatively small and static. After you have created these, you almost
never have to worry about them again. For example, consider a department dimension:
Sometimes this dimension is referred to as organizational unit. Granted, a department
name might change every year or so, or a new department might be added. But, in
general, every organization will have Sales, Marketing, Finance, Operation, and a
handful of other departments. If the Sales department is renamed to Marketing, all you have
to do is change one record in the dimension table, and you're done. The exception to
this rule is when managers want to see data under the Sales heading for the duration of
time when the department was called Sales and then see everything else under the
heading of Marketing. If that's the case, you'll have to add a column to the dimension
table that gives you the date range during which the department had a particular name.
In addition, you might want to have a separate key for Marketing and Sales members of
the department dimension.
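
A minimal sketch of what such a dimension table might look like (the table and column names here are illustrative assumptions, not from any particular schema):

-- Department dimension with a validity date range, so that Sales
-- and Marketing can carry separate surrogate keys over time.
CREATE TABLE DimDepartment (
    DepartmentKey  INT IDENTITY(1, 1) PRIMARY KEY,
    DepartmentName VARCHAR(50) NOT NULL,
    ValidFrom      DATETIME NOT NULL,
    ValidTo        DATETIME NULL       -- NULL means the current name
)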
With other dimensions, you don't have such a luxury. For instance, consider the
customer dimension. If you're building a data warehouse for a retail store chain, you
might have thousands or even millions of customers. You'll have to update this
dimension every time you need to rebuild the fact table; in addition, you'll have to
update the dimension table before populating the fact table. If you don't have a
particular customer in a dimension, then your fact table cannot have a key pointing to
that customer. This means that customer Gary Jones will have to be assigned a key of
12345 before you can write a record to the fact table representing Gary's purchase.
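
As a rough sketch of that ordering (all table and column names here are hypothetical), the dimension load runs first so that every fact row can resolve its customer key:

-- Add customers referenced by the incoming fact rows but not yet
-- present in the dimension. In practice you'd match on a stable
-- business key rather than the customer's name.
INSERT INTO DimCustomer (CustomerName)
SELECT DISTINCT s.CustomerName
FROM   StagingSales s
WHERE  NOT EXISTS (SELECT 1 FROM DimCustomer d
                   WHERE  d.CustomerName = s.CustomerName)

-- Only now can the fact rows look up their customer keys.
INSERT INTO FactSales (CustomerKey, SaleDate, Amount)
SELECT d.CustomerKey, s.SaleDate, s.Amount
FROM   StagingSales s
       JOIN DimCustomer d ON d.CustomerName = s.CustomerName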
When you're working with frequently modified dimensions, you have to guarantee that
you can rebuild the dimension before you rebuild the fact table. Therefore, you
might want to put the whole rebuilding routine in a transaction. But I'm jumping a bit
ahead of the game here; we'll talk about rebuilding routines later.
You can make an important conclusion from the previous couple of paragraphs: Each
dimension can change in different ways. You can have additional dimension members,
or some of the dimension member values can change over time. The former change is
relatively easy to handle: Just add the new members to the dimension, and you're
done. The latter change, on the other hand, can be handled in multiple ways. The
concept of changing dimension member values is sometimes referred to as "slowly
changing dimensions." We already discussed one way of dealing with changing
dimension member values: adding a column to the dimension table that tells you the
date when the member value changed and then adding a row to the table to assign a
new key to the new dimension member. This is a tough way to resolve the problem
because each member might change many times; for example, a female customer
might have a maiden name, a married name, a divorced name, a name from the
second marriage, and so on. Each time Ms. Jones changes her name, you need to add
a new row and a pointer to the original record for Ms. Jones so that you know it is the
same person.
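
A sketch of that pattern against the hypothetical DimCustomer table (the date-range and pointer columns are assumptions of this sketch):

-- Close out the current record for Ms. Jones (surrogate key 12345).
UPDATE DimCustomer
SET    ValidTo = GETDATE()
WHERE  CustomerKey = 12345

-- Add a new row, which receives a new surrogate key, with a pointer
-- back to the original record so you know it is the same person.
INSERT INTO DimCustomer (CustomerName, ValidFrom, OriginalKey)
VALUES ('Ms. Walters', GETDATE(), 12345)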
An easier solution is to overwrite the existing value with the new value; if Ms. Jones
decides to be Ms. Walters, just change her name and don't worry about any previous
names used by Ms. Walters. Any purchases that Ms. Walters made while using
different names will still appear on reports as Ms. Walters's. In many environments this
would be acceptable; however, for certain projects (for instance, government-related
work) you'll have to know exactly what the person's name was at the time of the
transaction. Yet another way to handle slowly changing dimension values is to store
aliases. You would record that Ms. Jones is the same person as Ms. Walters and Ms.
Ravichandar, but you wouldn't change the "main" customer name, so the managers will
always know whose purchasing behavior they're examining on each report.
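
For comparison, the overwrite approach boils down to a single statement against the same hypothetical table, while the alias approach adds a row to a separate lookup table (sketched here as CustomerAlias, another assumed name):

-- Overwrite in place: prior names are simply lost.
UPDATE DimCustomer
SET    CustomerName = 'Ms. Walters'
WHERE  CustomerKey = 12345

-- Alias approach: keep the "main" name, record the alternate one.
INSERT INTO CustomerAlias (CustomerKey, AliasName)
VALUES (12345, 'Ms. Walters')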
NOTE
Slowly changing dimensions are one of the most difficult data warehousing topics to
learn and master. This introductory article just barely scratches the surface.
Indexing Database Tables
When you're done importing your data to the fact and dimension tables, it's time to think
about indexing your database tables to improve performance. How long will it take to
build the cubes, and how fast will the queries be executed against the fact and
dimension tables? Ideally, most of the business questions should be answered by
querying the cube using multidimensional expressions (MDX) and ActiveX Data
Objects Multi-Dimensional (ADOMD). However, if any of your users are SQL-savvy,
they'll opt for writing queries directly against the fact and dimension tables. Still, your
main concern should be to optimize the cube-rebuilding tasks.
The fact table should have a clustered index defined on the combination of all
dimension keys. Sometimes such a combination might qualify as a primary key for the
fact table, but that won't always be the case. I'll discuss why and when a primary key
would not be a combination of foreign keys in a following article. For now, just realize
that because you'll be joining the fact table to the dimension tables, the clustered index
should be built on all foreign key columns (keep in mind that you can have only one
clustered index per table). In addition, it helps to define a nonclustered index on each
individual foreign key in the fact table. Individual dimension tables, on the other hand,
are not likely to be queried frequently. Therefore, indexing dimension tables is relatively
straightforward: You can place a clustered index on a primary key or another column
that might be queried frequently. A nonclustered index on a primary key of a dimension
table will do just fine.
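
In T-SQL, that indexing scheme might look like the following (FactSales and its three foreign keys are assumed names, as before):

-- The one clustered index per table goes across all dimension keys.
CREATE CLUSTERED INDEX IX_FactSales_AllKeys
    ON FactSales (CustomerKey, StoreKey, DateKey)

-- A nonclustered index on each individual foreign key helps the joins.
CREATE NONCLUSTERED INDEX IX_FactSales_Customer ON FactSales (CustomerKey)
CREATE NONCLUSTERED INDEX IX_FactSales_Store    ON FactSales (StoreKey)
CREATE NONCLUSTERED INDEX IX_FactSales_Date     ON FactSales (DateKey)

-- For a dimension table, a nonclustered primary key will do just fine.
ALTER TABLE DimCustomer
    ADD CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED (CustomerKey)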
By now, you might be asking how you are supposed to load the data and maintain the
indexes that I recommended. If so, give yourself credit; you know that indexes can
be a mixed blessing when it comes to INSERT, UPDATE, and DELETE statements.
The issue here is that SQL Server has to maintain the index keys as data modifications
occur. Therefore, INSERT, UPDATE, and DELETE statements will generally run more
slowly on tables with indexes. Before you rush off to test this theory, let me warn you:
SQL Server 2000 is an awesome product that manages indexes much more efficiently
than any of its predecessors. Therefore, if you have 10,000 rows in your fact table,
indexes will not slow your performance drastically. But when you get up to several
million rows, tiny index maintenance taxes will add up quickly.
Some experts suggest populating the fact tables through BULK copy operations. SQL
Server offers a command-line utility affectionately referred to as BCP (for the Bulk Copy
Program), as well as a Transact-SQL BULK INSERT command to handle such tasks.
I'll save the discussion of BCP and BULK INSERT for another article. Personally, I do
not recommend writing data out to the file system and then copying it back to SQL
Server format (which is what you'd have to do if your transactional data is already in
SQL Server). In my experience a better approach has been to do the following:
1. Drop the indexes on the fact table.
2. Insert the new records into the fact table.
3. Truncate the transaction log when you're finished populating the fact table.
4. Rebuild indexes on the fact table.
I know that this sounds like a lot of work, but if your performance matters, this
investment of effort will pay substantial dividends.
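
Sketched in T-SQL, with the index and table names carried over from the earlier examples (MyWarehouse is a stand-in database name, and the staging table is assumed to already carry the dimension keys):

-- 1. Drop the fact table indexes.
DROP INDEX FactSales.IX_FactSales_AllKeys
DROP INDEX FactSales.IX_FactSales_Customer
DROP INDEX FactSales.IX_FactSales_Store
DROP INDEX FactSales.IX_FactSales_Date

-- 2. Insert the new records into the fact table.
INSERT INTO FactSales (CustomerKey, StoreKey, DateKey, Amount)
SELECT CustomerKey, StoreKey, DateKey, Amount
FROM   StagingSales

-- 3. Truncate the transaction log (SQL Server 2000 syntax).
BACKUP LOG MyWarehouse WITH TRUNCATE_ONLY

-- 4. Rebuild the indexes on the fact table.
CREATE CLUSTERED INDEX IX_FactSales_AllKeys
    ON FactSales (CustomerKey, StoreKey, DateKey)
CREATE NONCLUSTERED INDEX IX_FactSales_Customer ON FactSales (CustomerKey)
CREATE NONCLUSTERED INDEX IX_FactSales_Store    ON FactSales (StoreKey)
CREATE NONCLUSTERED INDEX IX_FactSales_Date     ON FactSales (DateKey)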
This approach means that you have to drop and re-create indexes on the fact table
each time you populate it with new data. The indexes on the dimension tables are
usually not so troublesome, unless you have thousands of new customers to insert into
the Customer dimension each time you populate your data warehouse. Most of the
time, rebuilding indexes on the dimension tables once for every 5 to 7 populations of
the fact table will provide acceptable performance.
Building the Cube
Now that you've populated your data warehouse, it is time to build the cube. Building a
cube with Microsoft Analysis Services is easy. All you have to do is provide the name of
the fact table and dimension tables and then specify your measures. I'll provide an
example of building a cube (with screenshots) in a later article.
Defining the Storage Model
Next, you need to define the storage model and build aggregations: precalculated
summary values that will appear on the reports. In Analysis Services 8.0 (which is part
of SQL Server 2000), the tasks of defining and building aggregations are referred to as
designing storage and processing the cube. There are multiple ways of storing the data
with Analysis Services: Multidimensional OLAP (MOLAP), Relational OLAP (ROLAP),
and Hybrid OLAP (HOLAP). MOLAP stores data on the file system in multidimensional
format; this storage type is the most efficient as far as the amount of storage and data
retrieval speed is concerned. The downside is that MOLAP does not seem to handle
huge amounts of data too well. If your data warehouse is less than 100GB, there's no
need to worry; choose MOLAP.
ROLAP, on the other hand, stores all aggregations and data in the relational format. It
creates additional tables in SQL Server for each defined aggregation. ROLAP requires
a lot of storage space, much more than the database itself, so be careful. HOLAP is
supposed to combine the best of the two approaches (MOLAP and ROLAP); however,
in my experience, HOLAP is not nearly as efficient as MOLAP.
When you've chosen the storage option, Analysis Services lets you trade off the
performance optimization level against the storage space required. Of course,
you should try to optimize performance as much as you can, but a setting of 80%
performance gain will usually give you an appropriate number of aggregations for the
majority of your queries. You also need to watch the storage space usage, especially if
you're using ROLAP or HOLAP. MOLAP seems to store aggregations in a very
compact manner.
Processing the cube is completely automated: If Analysis Services finds all dimension
members and appropriate keys in the fact table, it will rebuild the cube without any help
from you. If the fact table contains keys not found in the dimension tables, the cube-
processing task will fail. That's why it's important to refresh your dimensions (in
Microsoft Analysis Services) before rebuilding the cube.
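
A simple query catches such orphaned keys before processing begins (same hypothetical table names as earlier):

-- Any rows returned here would make cube processing fail.
SELECT DISTINCT f.CustomerKey
FROM   FactSales f
WHERE  NOT EXISTS (SELECT 1 FROM DimCustomer d
                   WHERE  d.CustomerKey = f.CustomerKey)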
Automating the Tasks
After the cube is processed, you can examine the results of your work (finally!) in the
Analysis Manager. Now the manual labor is over, and it's time to automate the tasks.
You'll need a DTS package that consists of several tasks:
1. Drop indexes on the fact table and dimension tables, if appropriate.
2. Refresh dimensions with the new dimension members (or new member values).
3. Populate the fact table with new data (data that isn't already in the fact table).
4. Rebuild indexes on the fact table (and dimension tables, if appropriate).
5. Process cube dimensions.
6. Process the cube itself.
Not too hard, is it? Well, not at first glance, anyway. But it does take some patience
and diligence to get all of these steps right. In addition, you'll want to put your entire
cube population and processing routine in a transaction because you don't want one
step to fail while the rest keep going (that would mess up your existing cube). I'll
discuss transactions in another article.
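
Still, the shape of the relational portion of that package can be sketched now. The procedure names below are placeholders, and because SQL Server 2000 has no TRY/CATCH, @@ERROR checks do the work; the cube-processing steps remain separate DTS tasks that run only if this batch succeeds:

-- Wrap the dimension refresh and fact load in one transaction so a
-- failure in either leaves the warehouse tables untouched.
BEGIN TRANSACTION
    EXEC RefreshDimensions      -- hypothetical procedure
    IF @@ERROR <> 0 GOTO Failed
    EXEC LoadFactTable          -- hypothetical procedure
    IF @@ERROR <> 0 GOTO Failed
COMMIT TRANSACTION
RETURN

Failed:
ROLLBACK TRANSACTION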
In this article, I introduced you to the tasks of populating the fact and dimension tables,
creating and maintaining appropriate indexes to optimize your cube processing
performance, designing storage for your cubes, and building aggregations. As I
mentioned in the beginning, there are many more complex topics to be discussed with
data warehousing. I'll try to cover these in upcoming articles.
© 2004 Pearson Education, Inc. InformIT. All rights reserved.
800 East 96th Street, Indianapolis, Indiana 46240