
Data Warehousing: A Perspective

by
Hemant Kirpekar 10/18/201
Introduction
    The Need for proper understanding of Data Warehousing
    The Key Issues
    The Definition of a Data Warehouse
    The Lifecycle of a Data Warehouse
    The Goals of a Data Warehouse
Why Data Warehousing is different from OLTP
E/R Modeling Vs Dimension Tables
Two Sample Data Warehouse Designs
    Designing a Product-Oriented Data Warehouse
    Designing a Customer-Oriented Data Warehouse
Mechanics of the Design
    Interviewing End-Users and DBAs
    Assembling the team
    Choosing Hardware/Software platforms
    Handling Aggregates
    Server-Side activities
    Client-Side activities
Conclusions
A Checklist for an Ideal Data Warehouse
Introduction
The need for proper understanding of Data Warehousing
The following is an extract from "Knowledge Asset Management and Corporate Memory", a White Paper to be published on the WWW, possibly via the Hispacom site, in the third week of August 1996......
Data Warehousing may well leverage the rising tide technologies that everyone will want or need; however, the current trend in Data Warehousing marketing leaves a lot to be desired.
In many organizations there still exists an enormous divide that separates Information Technology and a manager's need for Knowledge and Information. It is common currency that there is a whole host of available tools and techniques for locating, scrubbing, sorting, storing, structuring, documenting, processing and presenting information. Unfortunately, tools are tangible and business information and knowledge are not, so they tend to get confused.
So why do we still have this confusion? First consider how certain companies market Data Warehousing. There are companies that sell database technologies, other companies that sell the platforms (ostensibly consisting of an MPP or SMP architecture), some sell technical Consultancy services, others meta-data tools and services, and finally there are the business Consultancy services and the systems integrators - each and every one with their own particular focus on the critical factors in the success of Data Warehousing projects.
In the main, most RDBMS vendors seem to see Data Warehouse projects as a challenge to provide greater performance, greater capacity and greater divergence. With this excuse, most RDBMS products carry functionality that makes them about as truly "open" as a UNIVAC 90/30, i.e. no standards for View Partitioning, Bit Mapped Indexing, Histograms, Object Partitioning, SQL query decomposition or SQL evaluation strategies etc. This however is not really the important issue. The real issue is that some vendors sell Data Warehousing as if it just provided a big dumping ground for massive amounts of data with which users are allowed to do anything they like, whilst at the same time freeing up Operational Systems from the need to support end-user informational requirements.
Some hardware vendors have a similar approach, i.e. a Data Warehouse platform must inherently have a lot of disks, a lot of memory and a lot of CPUs. However, one of the most successful Data Warehouse projects I have worked on used COMPAQ hardware, which provides an excellent cost/benefit ratio.
Some Technical Consultancy Services providers tend to dwell on the performance aspects of Data Warehousing. They see Data Warehousing as a technical challenge, rather than a business opportunity, but the biggest performance payoffs will be brought about when there is a full understanding of how the user wishes to use the information.
The Key Issues
Organizations are swimming in data. However, most will have to create new data with improved quality, to meet strategic business planning requirements.
So:
How should IS plan for the mass of end user information demand?
What vendors and tools will emerge to help IS build and maintain a data warehouse architecture?
What strategies can users deploy to develop a successful data warehouse architecture?
What technology breakthroughs will occur to empower knowledge workers and reduce operational data access requirements?
These are some of the key questions outlined by the Gartner Group in their 1995 report on Data Warehousing.
I will try to answer some of these questions in this report.
The Definition of a Data Warehouse
A Data Warehouse is a:
- subject-oriented
- integrated
- time-variant
- non-volatile
collection of data in support of management decisions.
(W.H. Inmon, in "Building a Data Warehouse", Wiley 1996)
The data warehouse is oriented to the major subject areas of the corporation that have been defined in the data model. Examples of subject areas are: customer, product, activity, policy, claim, account. The major subject areas end up being physically implemented as a series of related tables in the data warehouse.
Personal Note: Could these be objects? No one to my knowledge has explored this possibility as yet.
The second salient characteristic of the data warehouse is that it is integrated. This is the most important aspect of a data warehouse. The different design decisions that the application designers have made over the years show up in a thousand different ways. Generally, there is no application consistency in encoding, naming conventions, physical attributes, measurements of attributes, key structure and physical characteristics of the data. Each application has most likely been designed independently. As data is entered into the data warehouse, inconsistencies of the application level are undone.
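As a minimal sketch of what "undoing" an encoding inconsistency can look like, assume three hypothetical source applications (app_a, app_b, app_c) that each encode the same gender attribute differently. The application names, codes and the integrate_gender helper are illustrative assumptions, not part of any real system described here.

```python
# Hypothetical source encodings for one attribute (gender) in three applications.
GENDER_ENCODINGS = {
    "app_a": {"m": "M", "f": "F"},
    "app_b": {"1": "M", "0": "F"},
    "app_c": {"male": "M", "female": "F"},
}


def integrate_gender(source_app: str, raw_value: str) -> str:
    """Map an application-specific gender code to the single warehouse code."""
    mapping = GENDER_ENCODINGS[source_app]
    return mapping[raw_value.strip().lower()]


# Records arriving from the three applications during a load.
records = [("app_a", "m"), ("app_b", "0"), ("app_c", "Female")]
integrated = [integrate_gender(app, value) for app, value in records]
print(integrated)
```

The same table-driven approach would apply to units of measurement or key formats; the point is only that the mapping is applied once, on the way into the warehouse.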
The third salient characteristic of the data warehouse is that it is time-variant. A 5 to 10 year time horizon of data is normal for the data warehouse. Data Warehouse data is a sophisticated series of snapshots, each taken at one moment in time, and the key structure always contains some time element.
The last important characteristic of the data warehouse is that it is nonvolatile. Unlike operational data, warehouse data is loaded en masse and is then accessed. Update of the data does not occur in the data warehouse environment.
The lifecycle of the Data Warehouse
Data flows into the data warehouse from the operational environment. Usually a significant amount of transformation of data occurs at the passage from the operational level to the data warehouse level.
Once the data ages, it passes from current detail to older detail. As the data is summarized, it passes from current detail to lightly summarized data and then onto summarized data.
At some point in time data is purged from the warehouse. There are several ways in which this can be made to happen:
- Data is added to a rolling summary file where the detail is lost.
- Data is transferred to a bulk medium from a high-performance medium such as DASD.
- Data is transferred from one level of the architecture to another.
- Data is actually purged from the system at the DBA's request.
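The first of these options, the rolling summary, can be sketched as follows. The row layout (year, product_line, amount) and the roll_up helper are assumptions made for illustration only.

```python
from collections import defaultdict


def roll_up(detail_rows, cutoff_year):
    """Fold detail rows older than cutoff_year into yearly totals.

    Returns (remaining_detail, summary). The folded rows themselves are
    dropped, so their detail is lost.
    """
    summary = defaultdict(float)
    remaining = []
    for year, product_line, amount in detail_rows:
        if year < cutoff_year:
            summary[(year, product_line)] += amount
        else:
            remaining.append((year, product_line, amount))
    return remaining, dict(summary)


detail = [(1989, "dairy", 10.0), (1989, "dairy", 5.0), (1991, "dairy", 7.0)]
current_detail, rolling_summary = roll_up(detail, cutoff_year=1990)
print(current_detail, rolling_summary)
```

After the call only the 1991 row survives as detail; the two 1989 rows exist only as a single total.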
The following diagram is from "Building a Data Warehouse", 2nd Ed, by W.H. Inmon, Wiley '96.
[Figure: Structure of a Data Warehouse. Levels shown: operational data, transformation, current detail (sales detail 1990 - 1991), old detail (sales detail '84 - '89), lightly summarized data (data mart; weekly sales by subproduct line '84 - '92), highly summarized data (monthly sales by product line '81 - '92), and metadata.]
The Goals of a Data Warehouse
According to Ralph Kimball (founder of Red Brick Systems - a highly successful Data Warehouse DBMS startup), the goals of a Data Warehouse are:
1. The data warehouse provides access to corporate or organizational data.
Access means several things. Managers and analysts must be able to connect to the data warehouse from their personal computers and this connection must be immediate, on demand, and with high performance. The tiniest queries must run in less than a second. The tools available must be easy to use, i.e. useful reports can be run with a one button click and can be changed and rerun with two button clicks.
2. The data in the warehouse is consistent.
Consistency means that when two people request sales figures for the Southeast Region for January they get the same number. Consistency means that when they ask for the definition of the "sales" data element, they get a useful answer that lets them know what they are fetching. Consistency also means that if yesterday's data has not been completely loaded, the analyst is warned that the data load is not complete and will not be complete till tomorrow.
3. The data in the warehouse can be combined by every possible measure of the business (i.e. slice & dice).
This implies a very different organization from the E/R organization of typical Operational Data. This implies row headers and constraints, i.e. dimensions in a dimensional data model.
4. The data warehouse is not just data, but is also a set of tools to query, analyze, and to present information.
The "back room" components, namely the hardware, the relational database software and the data itself, are only about 60% of what is needed for a successful data warehouse implementation. The remaining 40% is the set of front-end tools that query, analyze and present the data. The "show me what is important" requirement needs all of these components.
5. The data warehouse is where used data is published.
Data is not simply accumulated at a central point and let loose. It is assembled from a variety of information sources in the organization, cleaned up, quality assured, and then released only if it is fit for use. A data quality manager is critical for a data warehouse and plays a role similar to that of a magazine editor or a book publisher. He/she is responsible for the content and quality of the publication and is identified with the deliverable.
6. The quality of the data in the data warehouse is the driver of business reengineering.
The best data in any company is the record of how much money someone else owes the company. Data quality goes downhill from there. The data warehouse cannot fix poor quality data, but the inability of a data warehouse to be effective with poor quality data is the best driver for business reengineering efforts in an organization.
Why Data Warehousing is different from OLTP
On-line transaction processing is profoundly different from data warehousing. The users are different, the data content is different, the data structures are different, the hardware is different, the software is different, the administration is different, the management of the systems is different, and the daily rhythms are different. The design techniques and design instincts appropriate for transaction processing are inappropriate and even destructive for information warehousing.
OLTP Transactional Properties
In OLTP a transaction is defined by its ACID properties.
A Transaction is a user-defined sequence of instructions that maintains consistency across a persistent set of values. It is a sequence of operations that is atomic with respect to recovery.
To remain valid, a transaction must maintain its ACID properties.
Atomicity is a condition that states that for a transaction to be valid the effects of all its instructions must be enforced or none at all.
Consistency is a property of the persistent data and must be preserved by the execution of a complete transaction.
Isolation is a property that states that the effect of running transactions concurrently must be that of serializability, i.e. as if each of the transactions were run in isolation.
Durability is the ability of a transaction to preserve its effects if it has committed, in the presence of media and system failures.
A serious data warehouse will often process only one transaction per day, but this transaction will contain thousands or even millions of records. This kind of transaction has a special name in data warehousing. It is called a production data load.
In a data warehouse, consistency is measured globally. We do not care about an individual transaction, but we care enormously that the current load of new data is a full and consistent set of data. What we care about is the consistent state of the system we started with before the production data load, and the consistent state of the system we ended up with after a successful production data load. The most practical frequency of this production data load is once per day, usually in the early hours of the morning. So, instead of a microscopic perspective, we have a quality assurance manager's judgment of data consistency.
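One way to picture a production data load as a single all-or-nothing transaction is the sketch below. It uses Python's standard sqlite3 module purely for illustration; the sales_fact table layout and the production_data_load helper are assumptions, not a description of any particular warehouse product.

```python
import sqlite3


def production_data_load(conn, rows):
    """Load all rows in one transaction; any bad row rolls back the whole load."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO sales_fact (time_key, product_key, dollars) "
                "VALUES (?, ?, ?)",
                rows,
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact ("
    "time_key INTEGER NOT NULL, product_key INTEGER NOT NULL, dollars REAL NOT NULL)"
)
good_load = [(1, 10, 9.5), (1, 11, 3.0)]
bad_load = [(2, 10, 4.0), (2, None, 1.0)]  # second row violates NOT NULL

loaded_good = production_data_load(conn, good_load)
loaded_bad = production_data_load(conn, bad_load)
row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(loaded_good, loaded_bad, row_count)
```

The failed load leaves no partial rows behind, so the warehouse moves only from one consistent state to the next.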
OLTP systems are driven by performance and reliability concerns. Users of a data warehouse almost never deal with one account at a time, usually requiring hundreds or thousands of records to be searched and compressed into a small answer set. Users of a data warehouse change the kinds of questions they ask constantly. Although the templates of their requests may be similar, the impact of these queries will vary wildly on the database system. Small single table queries, called browses, need to be instantaneous whereas large multitable queries, called join queries, are expected to run for seconds or minutes.
Reporting is the primary activity in a data warehouse. Users consume information in human-sized chunks of one or two pages. Blinking numbers on a page can be clicked on to answer "why" questions. In the report below, the negative values are the blinking numbers.
Example of a Data Warehouse Report
Product        Region    Sales This   Growth in Sales   Sales as % of   Change in Sales as     Change in Sales as % of
                         Month        Vs Last Month     Category        % of Cat Vs Last Mt    Cat YTD Vs Last Yr YTD
Framis         Central   110          12%               31%             3%                     5%
Framis         Eastern   179          (-3%)             29%             (-1%)                  3%
Framis         Western   55           5%                44%             1%                     5%
Total Framis             344          6%                33%             1%                     5%
Widget         Central   66           2%                19%             2%                     10%
Widget         Eastern   102          4%                12%             5%                     13%
Widget         Western   39           (-9%)             9%              (-1%)                  9%
Total Widget             207          1%                13%             4%                     11%
Grand Total              551          4%                20%             2%                     9%
The twinkling nature of OLTP databases (constant updates of new values) is the first kind of temporal inconsistency that we avoid in data warehouses.
The second kind of temporal inconsistency in an OLTP database is the lack of explicit support for correctly representing prior history. Although it is possible to keep history in an OLTP system, it is a major burden on that system to correctly depict old history. We have a long series of transactions that incrementally alter history and it is close to impossible to quickly reconstruct the snapshot of a business at a specified point in time.
We make a data warehouse a specific time series. We move snapshots of the OLTP systems over to the data warehouse as a series of data layers, like geologic layers. By bringing static snapshots to the warehouse only on a regular basis, we solve both of the time representation problems we had on the OLTP system. No updates during the day - so no twinkling. By storing snapshots, we represent prior points in time correctly. This allows us to ask comparative queries easily. The snapshot is called the production data extract, and we migrate this extract to the data warehouse system at regular time intervals. This process gives rise to the two phases of the data warehouse: loading and querying.
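The two phases can be illustrated with a small sketch: snapshots are appended as layers (loading), and a comparative query then reads two layers side by side (querying). The balance_snapshot table, its columns and the sample dates are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balance_snapshot (snapshot_date TEXT, account TEXT, balance REAL)"
)


def load_snapshot(conn, snapshot_date, balances):
    """Loading phase: append one static snapshot; earlier layers are never updated."""
    with conn:
        conn.executemany(
            "INSERT INTO balance_snapshot VALUES (?, ?, ?)",
            [(snapshot_date, account, balance) for account, balance in balances.items()],
        )


load_snapshot(conn, "1996-01-31", {"A": 100.0, "B": 50.0})
load_snapshot(conn, "1996-02-29", {"A": 120.0, "B": 40.0})

# Querying phase: change per account between two points in time.
changes = conn.execute(
    """
    SELECT cur.account, cur.balance - prev.balance
    FROM balance_snapshot AS cur
    JOIN balance_snapshot AS prev ON prev.account = cur.account
    WHERE prev.snapshot_date = '1996-01-31'
      AND cur.snapshot_date = '1996-02-29'
    ORDER BY cur.account
    """
).fetchall()
print(changes)
```

Because each layer is static, the same comparative query returns the same answer no matter when it is run.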
E/R Modeling Vs Dimension Tables
Entity/Relationship modeling seeks to drive all the redundancy out of the data. If there is no redundancy in the data, then a transaction that changes any data only needs to touch the database in one place. This is the secret behind the phenomenal improvement in transaction processing speed since the early 80s. E/R modeling works by dividing the data into many discrete entities, each of which becomes a table in the OLTP database. A simple E/R diagram looks like the map of a large metropolitan area where the entities are the cities and the relationships are the connecting freeways. This diagram is very symmetric. For queries that span many records or many tables, E/R diagrams are too complex for users to understand and too complex for software to navigate.
SO, E/R MODELS CANNOT BE USED AS THE BASIS FOR ENTERPRISE DATA WAREHOUSES.
In data warehousing, 80% of the queries are single-table browses, and 20% are multitable joins. This allows for a tremendously simple data structure. This structure is the dimensional model or the star join schema.
This name is chosen because the E/R diagram looks like a star with one large central table called the fact table and a set of smaller attendant tables called dimensional tables, displayed in a radial pattern around the fact table. This structure is very asymmetric. The fact table in the schema is the only one that participates in multiple joins with the dimension tables. The dimension tables all have a single join to this central fact table.
Sales Fact:         time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Time Dimension:     time_key, day_of_week, month, quarter, year, holiday_flag
Product Dimension:  product_key, description, brand, category
Store Dimension:    store_key, store_name, address, floor_plan_type
A typical dimensional model
The above is an example of a star schema for a typical grocery store chain. The Sales Fact table contains daily item totals of all the products sold. This is called the grain of the fact table. Each record in the fact table represents the total sales of a specific product in a market on a day. Any other combination generates a different record in the fact table. The fact table of a typical grocery retailer with 500 stores, each carrying 50,000 products on the shelves and measuring a daily item movement over 2 years, could approach 1 billion rows. However, using a high-performance server and an industrial-strength dbms we can store and query such a large fact table with good performance.
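The star schema in the figure can be written down as table definitions. The sketch below uses SQLite through Python's sqlite3 module only because it is self-contained; the table names (time_dim, product_dim, store_dim, sales_fact) and the column types are assumptions, while the column names follow the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY, day_of_week TEXT, month INTEGER,
        quarter TEXT, year INTEGER, holiday_flag INTEGER);
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY, description TEXT, brand TEXT, category TEXT);
    CREATE TABLE store_dim (
        store_key INTEGER PRIMARY KEY, store_name TEXT, address TEXT,
        floor_plan_type TEXT);
    -- Grain: one row per product per store per day.
    CREATE TABLE sales_fact (
        time_key INTEGER NOT NULL REFERENCES time_dim (time_key),
        product_key INTEGER NOT NULL REFERENCES product_dim (product_key),
        store_key INTEGER NOT NULL REFERENCES store_dim (store_key),
        dollars_sold REAL, units_sold INTEGER, dollars_cost REAL,
        PRIMARY KEY (time_key, product_key, store_key));
    """
)

# Every foreign key lives in the fact table; no dimension references another.
fk_tables = {
    table: [row[2] for row in conn.execute(f"PRAGMA foreign_key_list({table})")]
    for table in ("sales_fact", "time_dim", "product_dim", "store_dim")
}
print(fk_tables)
```

The asymmetry described above shows up directly: the fact table references all three dimensions, and the dimensions reference nothing.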
The fact table is where the numerical measurements of the business are stored. These measurements are taken at the intersection of all the dimensions. The best and most useful facts are continuously valued and additive. If there is no product activity on a given day, in a market, we leave the record out of the database. Fact tables therefore are always sparse. Fact tables can also contain semiadditive facts, which can be added only on some of the dimensions, and nonadditive facts, which cannot be added at all. The only interesting thing to do with nonadditive facts in a table with billions of records is to get a count.
The dimension tables are where the textual descriptions of the dimensions of the business are stored. Here the best attributes are textual, discrete and used as the source of constraints and row headers in the user's answer set.
Typical attributes for a product would include a short description (10 to 15 characters), a long description (30 to 60 characters), the brand name, the category name, the packaging type, and the size. Occasionally, it may be possible to model an attribute either as a fact or as a dimension. In such a case it is the designer's choice.
A key role for dimension table attributes is to serve as the source of constraints in a query or to serve as row headers in the user's answer set.
e.g.
Brand     Dollar Sales   Unit Sales
Axon      780            263
Framis    10             509
Widget    213
Kapper    95             39
A standard SQL Query example for data warehousing could be:
    select p.brand, sum(f.dollars), sum(f.units)   -- select list
    from salesfact f, product p, time t            -- from clause with aliases f, p, t
    where f.timekey = t.timekey                    -- join constraint
    and f.productkey = p.productkey                -- join constraint
    and t.quarter = '1 Q 1995'                     -- application constraint
    group by p.brand                               -- group by clause
    order by p.brand                               -- order by clause
Virtually every query like this one contains row headers and aggregated facts in the select list. The row headers are not summed, the aggregated facts are.
The from clause lists the tables involved in the join.
The join constraints join on the primary key from the dimension table and the foreign key in the fact table. Referential integrity is extremely important in data warehousing and is enforced by the data base management system.
This fact table key is a composite key consisting of concatenated foreign keys.
In OLTP applications joins are usually among artificially generated numeric keys that have little administrative significance elsewhere in the company. In data warehousing one job function maintains the master product file and oversees the generation of new product keys and another job function makes sure that every sales record contains valid product keys. These joins are therefore called MIS joins.
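The query above can be run end to end on a few rows to see the row headers and aggregated facts come back. The sketch below uses SQLite via Python's sqlite3 module; the table and column names follow the query, while the column types and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time (timekey INTEGER PRIMARY KEY, quarter TEXT);
    CREATE TABLE product (productkey INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE salesfact (
        timekey INTEGER, productkey INTEGER, dollars REAL, units INTEGER);
    INSERT INTO time VALUES (1, '1 Q 1995'), (2, '2 Q 1995');
    INSERT INTO product VALUES (10, 'Axon'), (11, 'Framis');
    INSERT INTO salesfact VALUES
        (1, 10, 5.0, 2), (1, 10, 7.0, 3), (1, 11, 4.0, 1), (2, 11, 9.0, 6);
    """
)

answer_set = conn.execute(
    """
    SELECT p.brand, SUM(f.dollars), SUM(f.units)   -- row header + aggregated facts
    FROM salesfact f, product p, time t
    WHERE f.timekey = t.timekey                    -- join constraint
      AND f.productkey = p.productkey              -- join constraint
      AND t.quarter = '1 Q 1995'                   -- application constraint
    GROUP BY p.brand
    ORDER BY p.brand
    """
).fetchall()
print(answer_set)
```

The second-quarter Framis row is excluded by the application constraint on the time dimension, and the two first-quarter Axon rows collapse into one answer row.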
Application constraints apply to individual dimension tables. Browsing the dimension tables, the user specifies application constraints. It rarely makes sense to apply an application constraint simultaneously across two dimensions, thereby linking the two dimensions. The dimensions are linked only through the fact table. It is possible to directly apply an application constraint to a fact in the fact table. This can be thought of as a filter on the records that would otherwise be retrieved by the rest of the query.
The group by clause summarizes records in the row headers. The order by clause determines the sort order of the answer set when it is presented to the user.
From a performance viewpoint then, the SQL query should be evaluated as follows:
First, the application constraints are evaluated dimension by dimension. Each dimension thus produces a set of candidate keys. The candidate keys are then assembled from each dimension into trial composite keys to be searched for in the fact table. All the "hits" in the fact table are then grouped and summed according to the specifications in the select list and group by clause.
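The evaluation order described above can be sketched in plain Python. The in-memory dictionaries standing in for the dimension and fact tables, and the candidate_keys helper, are assumptions for illustration; a real dbms would use indexes rather than dictionaries.

```python
from itertools import product as cartesian

# Hypothetical tiny dimensions: key -> attributes.
time_dim = {1: {"quarter": "1 Q 1995"}, 2: {"quarter": "2 Q 1995"}}
product_dim = {10: {"brand": "Axon"}, 11: {"brand": "Framis"}}
# Fact table indexed by its composite key (time_key, product_key).
sales_fact = {(1, 10): 12.0, (1, 11): 4.0, (2, 11): 9.0}


def candidate_keys(dimension, predicate):
    """Step 1: evaluate an application constraint on one dimension."""
    return [key for key, attrs in dimension.items() if predicate(attrs)]


time_keys = candidate_keys(time_dim, lambda attrs: attrs["quarter"] == "1 Q 1995")
product_keys = candidate_keys(product_dim, lambda attrs: True)  # unconstrained

# Step 2: assemble trial composite keys and search for them in the fact table.
hits = [key for key in cartesian(time_keys, product_keys) if key in sales_fact]

# Step 3: group the hits by row header (brand) and sum the fact.
totals = {}
for time_key, product_key in hits:
    brand = product_dim[product_key]["brand"]
    totals[brand] = totals.get(brand, 0.0) + sales_fact[(time_key, product_key)]
print(hits, totals)
```

Only the small dimension tables are scanned; the large fact table is touched solely through composite-key lookups.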
Attributes Role in Data Warehousing
Attributes are the drivers of the Data Warehouse. The user begins by placing application constraints on the dimensions through the process of browsing the dimension tables one at a time. The browse queries are always on single-dimension tables and are usually fast acting and lightweight. Browsing is to allow the user to assemble the correct constraints on each dimension. The user launches several queries in this phase. The user also drags row headers from the dimension tables and additive facts from the fact table to the answer staging area (the report). The user then launches a multitable join. Finally, the dbms groups and summarizes millions of low-level records from the fact table into the small answer set and returns the answer to the user.
Two Sample Data Warehouse Designs
Designing a Product-Oriented Data Warehouse
Sales Fact:           time_key, product_key, store_key, promotion_key, dollar_sales, units_sales, dollar_cost, customer_count
Time Dimension:       time_key, day_of_week, Day_no_in_Month, other time dimension attributes
Promotion Dimension:  promotion_key, promotion_name, price_reduction_type, other promotion attributes
Product Dimension:    product_key, SKU_no, SKU_desc, other product attributes
Store Dimension:      store_key, store_name, store_number, store_addr, other store attributes
The Grocery Store Schema
Background
The above schema is for a grocery chain with 500 large grocery stores spread over a three-state area. Each store has a full complement of departments including grocery, frozen foods, dairy, meat, produce, bakery, floral, hard goods, liquor and drugs. Each store has about 60,000 individual products on its shelves. The individual products are called Stock Keeping Units or SKUs. About 40,000 of the SKUs come from outside manufacturers and have bar codes imprinted on the product package. These bar codes, called Universal Product Codes or UPCs, are at the same grain as individual SKUs. The remaining 20,000 SKUs come from departments like meat, produce, bakery or floral departments and do not have nationally recognized UPC codes.
Management is concerned with the logistics of ordering, stocking the shelves and selling the products as well as maximizing the profit at each store. The most significant management decision has to do with pricing and promotions. Promotions include temporary price reductions, ads in newspapers, displays in the grocery store including shelf displays and end aisle displays, and coupons.
Identifying the Processes to Model
The first step in the design is to decide what business processes to model, by combining an understanding of the business with an understanding of what data is available. The second step is to decide on the grain of the fact table in each business process.
A data warehouse always demands data expressed at the lowest possible grain of each dimension, not for the queries to see individual low-level records, but for the queries to be able to cut through the database in very precise ways. The best grain for the grocery store data warehouse is daily item movement or SKU by store by promotion by day.
Dimension Table Modeling
A careful grain statement determines the primary dimensionality of the fact table. It is then possible to add additional dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only a single value under each combination of the primary dimensions. If it is recognized that an additional desired dimension violates the grain by causing additional records to be generated, then the grain statement must be revised to accommodate this additional dimension. The grain of the grocery store table allows the primary dimensions of time, product and store to fall out immediately.
Most data warehouses need an explicit time dimension table even though the primary time key may be an SQL date-valued object. The explicit time dimension table is needed to describe fiscal periods, seasons, holidays, weekends and other calendar calculations that are difficult to get from the SQL date machinery. Time is usually the first dimension in the underlying sort order in the database because when it is the first in the sort order, the successive loading of time intervals of data will load data into virgin territory on the disk.
The product dimension is one of the two or three primary dimensions in nearly every data warehouse. This type of dimension has a great many attributes, and in general can go above 50 attributes.
The other two dimensions are an artifact of the grocery store example.
A note of caution:
Product Dimension:  product_key, SKU_desc, SKU_number, package_size_key, package_type, diet_type, weight, weight_unit_of_measure, storage_type_key, units_per_retail_case, brand_key, etc..
Outrigger tables, each listed with its key first:
    package_size_key, package_size
    brand_key, brand, subcategory_key
    subcategory_key, subcategory, category_key
    category_key, category, department_key
    department_key, department
    storage_type_key, storage_type, shelf_life_type_key
    shelf_life_type_key, shelf_life_type
A snowflaked product dimension
Browsing is the act of navigating around in a dimension, either to gain an intuitive understanding of how the various attributes correlate with each other or to build a constraint on the dimension as a whole. If a large product dimension table is split apart into a snowflake, and robust browsing is attempted among widely separated attributes, possibly lying along various tree structures, it is inevitable that browsing performance will be compromised.
Fact Table Modeling
The sales fact table records only the SKUs actually sold. No record is kept of the SKUs that did not sell. (Some applications require these records as well. The fact tables are then termed "factless" fact tables.)
The customer count, because it is additive across three of the dimensions, but not the fourth, is called semiadditive. Any analysis using the customer count must be restricted to a single product key to be valid.
The application must group line items together and find those groups where the desired products coexist. This can be done with the COUNT DISTINCT operator in SQL.
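A small sketch shows why the customer count cannot simply be added across products, and how COUNT DISTINCT gives the right answer. The line_item table and its ticket_id column are hypothetical; they stand in for whatever line-item detail the application has available and are not part of the grocery schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical line-item table; ticket_id identifies one customer's basket.
    CREATE TABLE line_item (ticket_id INTEGER, product TEXT, category TEXT);
    INSERT INTO line_item VALUES
        (1, 'milk', 'dairy'), (1, 'cheese', 'dairy'),
        (2, 'milk', 'dairy'), (3, 'bread', 'bakery');
    """
)

# Per-product customer counts, as they would sit in the fact table.
per_product = dict(conn.execute(
    "SELECT product, COUNT(DISTINCT ticket_id) FROM line_item "
    "WHERE category = 'dairy' GROUP BY product"
))
naive_sum = sum(per_product.values())  # adding across products counts ticket 1 twice

true_count = conn.execute(
    "SELECT COUNT(DISTINCT ticket_id) FROM line_item WHERE category = 'dairy'"
).fetchone()[0]
print(per_product, naive_sum, true_count)
```

Summing the per-product counts reports three dairy customers, while only two distinct customers bought dairy.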
A different solution is to store brand, subcategory, category, department and all merchandise customer counts in explicitly stored aggregates. This is an important technique in data warehousing that I will not cover in this report.
Finally, drilling down in a data warehouse is nothing more than adding row headers from the dimension tables. Drilling up is subtracting row headers. An explicit hierarchy is not needed to support drilling down.
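Drilling down as "adding a row header" can be sketched in a few lines. The joined rows (department, brand, dollars) and the summarize helper are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical rows already joined from fact and dimension tables:
# (department, brand, dollars).
rows = [("dairy", "Axon", 5.0), ("dairy", "Framis", 7.0), ("bakery", "Axon", 3.0)]


def summarize(rows, row_headers):
    """Sum dollars grouped by the chosen row headers (given as column positions)."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in row_headers)] += row[2]
    return dict(totals)


by_department = summarize(rows, row_headers=[0])   # summary level
drilled_down = summarize(rows, row_headers=[0, 1])  # add brand as a row header
print(by_department)
print(drilled_down)
```

Nothing in summarize knows that brand sits "below" department; any attribute could have been added as the extra row header, which is why no explicit hierarchy is required.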
Database Sizing for the Grocery Chain
The fact table is overwhelmingly large. The dimensional tables are geometrically smaller. So all realistic estimates of the disk space needed for the warehouse can ignore the dimension tables.
The fact table in a dimensional schema should be highly normalized whereas efforts to normalize any of the dimensional tables are a waste of time. If we normalize them by extracting repeating data elements into separate "outrigger" tables, we make browsing and pick list generation difficult or impossible.
Time dimension: 2 years x 365 days = 730 days
Store dimension: 300 stores, reporting sales each day
Product dimension: 30,000 products in each store, of which 3,000 sell each day in a given store
Promotion dimension: a sold item appears in only one promotion condition in a store on a day.
Number of base fact records = 730 x 300 x 3,000 x 1 = 657 million records
Number of key fields = 4; Number of fact fields = 4; Total fields = 8
Base fact table size = 657 million x 8 fields x 4 bytes = 21 GB
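The sizing arithmetic can be checked with a few lines of Python. The figure of 4 bytes per field is an assumption inferred from the stated 21 GB total; the variable names are illustrative.

```python
days = 2 * 365                       # time dimension: 2 years of daily data
stores = 300                         # store dimension
products_sold_per_store_day = 3_000  # of the 30,000 products carried
promotions_per_sold_item = 1         # one promotion condition per item, store, day

fact_records = days * stores * products_sold_per_store_day * promotions_per_sold_item

key_fields = 4
fact_fields = 4
bytes_per_field = 4                  # assumed; consistent with the 21 GB total
table_bytes = fact_records * (key_fields + fact_fields) * bytes_per_field

print(f"{fact_records:,} records, {table_bytes / 1e9:.1f} GB")
```

Changing any one input, for example the number of stores, scales the fact table size linearly, which is why the dimension tables can be ignored in the estimate.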
Two Sample Data Warehouse Designs
Designing a Customer-Oriented Data Warehouse
I will outline an insurance application as an example of a customer-oriented data warehouse.
In this example the insurance company is a $3 billion property and casualty insurer for automobiles, home fire protection, and personal liability. There are two main production data sources: all transactions relating to the formulation of policies, and all transactions involved in processing claims. The insurance company wants to analyze both the written policies and claims. It wants to see which coverages are most profitable and which are the least. It wants to measure profits over time by covered item type (i.e. kinds of cars and kinds of houses), state, county, demographic profile, underwriter, sales broker and sales region, and events. Both revenues and costs need to be identified and tracked. The company wants to understand what happens during the life of a policy, especially when a claim is processed.
The following four schemas outline the star schema for the insurance application:

Claims Transaction Schema

Fact table: transaction_date, effective_date, insured_party_key, employee_key, coverage_key,
covered_item_key, policy_key, claimant_key, claim_key, third_party_key, transaction_key, amount

Dimension tables:
Date: date_key, day_of_week, fiscal_period
Employee: employee_key, name, employee_type, department
Covered item: covered_item_key, covered_item_desc, covered_item_type
Claimant: claimant_key, claimant_name, claimant_address, claimant_type
Third party: third_party_key, third_party_name, third_party_addr, third_party_type
Insured party: insured_party_key, name, address, type, demographic attributes
Coverage: coverage_key, coverage_desc, market_segment, line_of_business, annual_statement_line,
automobile_attributes ...
Policy: policy_key, risk_grade
Claim: claim_key, claim_desc, claim_type, automobile_attributes ...
Transaction: transaction_key, transaction_description, reason, automobile_attributes ...

Policy Transaction Schema

Fact table: transaction_date, effective_date, insured_party_key, employee_key, coverage_key,
covered_item_key, policy_key, transaction_key, amount

Dimension tables:
Date: date_key, day_of_week, fiscal_period
Employee: employee_key, name, employee_type, department
Covered item: covered_item_key, covered_item_description, covered_item_type, automobile_attributes ...
Transaction: transaction_key, transaction_description, reason
Insured party: insured_party_key, name, address, type, demographic_attributes ...
Coverage: coverage_key, coverage_description, market_segment, line_of_business,
annual_statement_line, automobile_attributes ...
Policy: policy_key, risk_grade, ...
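As a rough illustration of how such a star schema is queried, the sketch below builds a cut-down policy transaction schema in an in-memory SQLite database and runs one star join. Only three of the dimensions are included, and the column types and sample rows are assumptions made for the example; they are not part of the design described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (a subset of the policy transaction schema)
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, day_of_week TEXT, fiscal_period TEXT);
    CREATE TABLE coverage (coverage_key INTEGER PRIMARY KEY, coverage_description TEXT,
                           market_segment TEXT, line_of_business TEXT);
    CREATE TABLE policy (policy_key INTEGER PRIMARY KEY, risk_grade TEXT);

    -- Fact table: foreign keys to the dimensions plus one additive fact
    CREATE TABLE policy_transaction_fact (
        transaction_date INTEGER REFERENCES date_dim(date_key),
        coverage_key     INTEGER REFERENCES coverage(coverage_key),
        policy_key       INTEGER REFERENCES policy(policy_key),
        amount           REAL);

    -- Made-up sample rows
    INSERT INTO date_dim VALUES (1, 'Mon', '1996-Q1'), (2, 'Tue', '1996-Q2');
    INSERT INTO coverage VALUES (1, 'Collision', 'Personal', 'Auto'),
                                (2, 'Fire', 'Personal', 'Home');
    INSERT INTO policy VALUES (1, 'A');
    INSERT INTO policy_transaction_fact VALUES
        (1, 1, 1, 100.0), (2, 1, 1, 50.0), (1, 2, 1, 80.0), (1, 1, 1, 25.0);
""")

# Star join: constrain and group on dimension attributes, sum the fact.
star_join = """
    SELECT c.line_of_business, d.fiscal_period, SUM(f.amount)
    FROM policy_transaction_fact f
    JOIN coverage c ON c.coverage_key = f.coverage_key
    JOIN date_dim d ON d.date_key = f.transaction_date
    GROUP BY c.line_of_business, d.fiscal_period
    ORDER BY c.line_of_business, d.fiscal_period
"""
rows = conn.execute(star_join).fetchall()
```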

Policy Snapshot Schema

Fact table: snapshot_date, effective_date, insured_party_key, agent_key, coverage_key,
covered_item_key, policy_key, status_key, written_premium, earned_premium, primary_limit,
primary_deductible, number_transactions, automobile_facts ...

Dimension tables:
Date: date_key, fiscal_period
Agent: agent_key, agent_name, agent_location, agent_type
Covered item: covered_item_key, covered_item_description, covered_item_type, automobile_attributes ...
Status: status_key, status_description
Insured party: insured_party_key, name, address, type, demographic attributes
Coverage: coverage_key, coverage_desc, market_segment, line_of_business, annual_statement_line,
automobile_attributes ...
Policy: policy_key, risk_grade

Claims Snapshot Schema

Fact table: transaction_date, effective_date, insured_party_key, agent_key, employee_key,
coverage_key, covered_item_key, policy_key, claim_key, status_key, reserve_amount,
paid_this_month, received_this_month, number_transactions, automobile_facts ...

Dimension tables:
Date: date_key, day_of_week, fiscal_period
Covered item: covered_item_key, covered_item_desc, covered_item_type, automobile_attributes ...
Agent: agent_key, agent_name, agent_type, agent_location
Claim: claim_key, claim_desc, claim_type, automobile_attributes ...
Insured party: insured_party_key, name, address, type, demographic attributes
Coverage: coverage_key, coverage_desc, market_segment, line_of_business, annual_statement_line,
automobile_attributes ...
Policy: policy_key, risk_grade
Status: status_key, status_description
An appropriate design for a property and casualty insurance data warehouse is a short value chain
consisting of policy creation and claims processing, where these two major processes are represented both
by transaction fact tables and monthly snapshot fact tables.
This data warehouse will need to represent a number of heterogeneous coverage types with appropriate
combinations of core and custom dimension tables and fact tables.
The large insured party and covered item dimensions will need to be decomposed into one or more
minidimensions in order to provide reasonable browsing performance and in order to accurately track
these slowly changing dimensions.
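One way to picture a minidimension is as a small table holding one row per distinct combination of a few banded demographic attributes, so that the large insured party dimension does not have to change when those attributes do. The sketch below is only an illustration under that assumption; the income bands, the attribute choice and the class name are all hypothetical.

```python
def income_band(income):
    """Map a raw income onto a coarse band (band boundaries are made up)."""
    if income < 30000:
        return "under 30K"
    if income < 60000:
        return "30K-60K"
    return "60K and over"

class Minidimension:
    """Issues one integer key per distinct combination of banded attributes."""

    def __init__(self):
        self.keys = {}  # attribute combination -> minidimension key

    def key_for(self, *attributes):
        # Reuse the key if this combination has been seen, else issue the next one.
        return self.keys.setdefault(attributes, len(self.keys) + 1)

demographics = Minidimension()
k1 = demographics.key_for(income_band(25000), "married")
k2 = demographics.key_for(income_band(28000), "married")  # same band, same key
k3 = demographics.key_for(income_band(75000), "single")   # new combination, new key
```

Because the minidimension holds only combinations of bands, it stays small enough to browse quickly, and a change in a customer's demographics is recorded simply by a different minidimension key on later fact rows.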
Database Sizing for the Insurance Application
Policy Transaction Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Number of policy transactions (not claim transactions) per year per policy: 12
Number of years: 3
Other dimensions: 1 for each policy line item transaction
Number of base fact records: 2,000,000 × 10 × 12 × 3 = 720 million records
Number of key fields: 8; Number of fact fields = 1; Total fields = 9
Base fact table size = 720 million × 9 fields × 4 bytes = 26 GB
Claim Transaction Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Yearly percentage of all covered item coverages with a claim: 5%
Number of claim transactions per actual claim: 50
Number of years: 3
Other dimensions: 1 for each policy line item transaction
Number of base fact records: 2,000,000 × 10 × 0.05 × 50 × 3 = 150 million records
Number of key fields: 11; Number of fact fields = 1; Total fields = 12
Base fact table size = 150 million × 12 fields × 4 bytes = 7.2 GB
Policy Snapshot Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Number of years: 3 => 36 months
Other dimensions: 1 for each policy line item transaction
Number of base fact records: 2,000,000 × 10 × 36 = 720 million records
Number of key fields: 8; Number of fact fields = 5; Total fields = 13
Base fact table size = 720 million × 13 fields × 4 bytes = 37 GB
Total custom policy snapshot fact tables assuming an average of 5 custom facts: 52 GB
Claim Snapshot Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Yearly percentage of all covered item coverages with a claim: 5%
Average length of time that a claim is open: 12 months
Number of years: 3
Other dimensions: 1 for each policy line item transaction
Number of base fact records: 2,000,000 × 10 × 0.05 × 3 × 12 = 36 million records
Number of key fields: 11; Number of fact fields = 4; Total fields = 15
Base fact table size = 36 million × 15 fields × 4 bytes = 2.2 GB
Total custom claim snapshot fact tables assuming an average of 5 custom facts: 2.9 GB
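The four insurance estimates can be reproduced together in a short table-driven Python sketch. The constant and dictionary names are made up for the example, and the 4-byte field width is an assumption consistent with the totals quoted above.

```python
POLICIES = 2_000_000
LINE_ITEMS = 10          # covered item coverages per policy
YEARS = 3
CLAIM_PERCENT = 5        # yearly percentage of coverages with a claim
BYTES_PER_FIELD = 4      # assumed fixed field width

coverages = POLICIES * LINE_ITEMS
claims_per_year = coverages * CLAIM_PERCENT // 100

# table name -> (base fact records, total fields)
tables = {
    "policy transaction": (coverages * 12 * YEARS, 9),         # 12 transactions per year
    "claim transaction":  (claims_per_year * 50 * YEARS, 12),  # 50 transactions per claim
    "policy snapshot":    (coverages * 12 * YEARS, 13),        # one row per month, 36 months
    "claim snapshot":     (claims_per_year * YEARS * 12, 15),  # claim open for 12 months
}

sizes_gb = {
    name: records * fields * BYTES_PER_FIELD / 1e9
    for name, (records, fields) in tables.items()
}
```

Changing one assumption, such as the number of years kept online, then updates every estimate consistently.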
Mechanics of the Design
There are nine decision points that need to be resolved for a complete data warehouse design:
1. The processes, and hence the identity of the fact tables
2. The grain of each fact table
3. The dimensions of each fact table
4. The facts, including precalculated facts
5. The dimension attributes with complete descriptions and proper terminology
6. How to track slowly changing dimensions
7. The aggregations, heterogeneous dimensions, minidimensions, query models and other physical storage
decisions
8. The historical duration of the database
9. The urgency with which the data is extracted and loaded into the data warehouse
Interviewing End-Users and DBAs
Interviewing the end users is the most important first step in designing a data warehouse. The interviews
really accomplish two purposes. First, the interviews give the designers insight into the needs and
expectations of the user community. The second purpose is to allow the designers to raise the level of
awareness of the forthcoming data warehouse with the end users, and to adjust and correct some of the
users' expectations.
The DBAs are often the primary experts on the legacy systems that may be used as the sources for the data
warehouse. These interviews serve as a reality check on some of the themes that come up in the end user
interviews.
Assembling the team
The entire data warehouse team should be assembled for two to three days to go through the nine decision
points. The attendees should be all the people who have an ongoing responsibility for the data warehouse,
including DBAs, system administrators, extract programmers, application developers, and support
personnel. End users should not attend the design sessions.
In the design sessions, the fact tables are identified and their grains chosen. Next the dimension tables are
identified by name and their grains chosen. E/R diagrams are not used to identify the fact tables or their
grains. They simply familiarize the staff with the complexities of the data.
Choosing the Hardware/Software platforms
These choices boil down to two primary concerns:
1. Does the proposed system actually work?
2. Is this a vendor relationship that we want to have for a long time?
Question the vendor on whether:
1. The system can query, store, load, index, and alter a billion-row fact table with a dozen dimensions
2. The system can rapidly browse a 100,000-row dimension table
Benchmark the system to simulate fact and dimension table loading.
Conduct a query test for:
1. Average browse query response time
2. Average browse query delay compared with unloaded system
3. Ratio between longest and shortest browse query time
4. Average join query response time
5. Average join query delay compared with unloaded system
6. Ratio between longest and shortest join query time (gives a sense of the stability of the optimizer)
7. Total number of query suites processed per hour
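The first six measurements in the query test reduce to three statistics computed once for browse queries and once for join queries. The sketch below shows one plausible way to compute them from recorded timings; the function name and the sample timings are hypothetical, and "delay compared with unloaded system" is interpreted here as the difference between the two averages.

```python
from statistics import mean

def summarize(loaded_secs, unloaded_secs):
    """Summarize one class of query (browse or join) from timings in seconds.

    loaded_secs:   response times measured while the system is under load
    unloaded_secs: response times for the same queries on an idle system
    """
    return {
        "avg_response": mean(loaded_secs),
        "avg_delay_vs_unloaded": mean(loaded_secs) - mean(unloaded_secs),
        # A large ratio suggests unstable plans from the optimizer.
        "longest_to_shortest": max(loaded_secs) / min(loaded_secs),
    }

# Made-up timings for three browse queries
browse = summarize(loaded_secs=[0.5, 1.0, 1.5], unloaded_secs=[0.25, 0.5, 0.75])
```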
Handling Aggregates
An aggregate is a fact table record representing a summarization of base-level fact table records. An
aggregate fact table record is always associated with one or more aggregate dimension table records. Any
dimension attribute that remains unchanged in the aggregate dimension table can be used more efficiently
in the aggregate schema than in the base-level schema because it is guaranteed to make sense at the
aggregate level.
Several different precomputed aggregates will accelerate summarization queries. The effect on
performance will be huge. There will be a ten- to thousand-fold improvement in runtime by having the
right aggregates available.
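A precomputed aggregate can be sketched in SQLite as a table built once from the base fact table and then queried in its place. The schema and rows below are invented for illustration, and for brevity the aggregate stores the category text directly rather than a key into a separate aggregate dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE sales_fact (date_key INTEGER, product_key INTEGER, dollars REAL);
    INSERT INTO product VALUES (1, 'Snacks'), (2, 'Snacks'), (3, 'Drinks');
    INSERT INTO sales_fact VALUES (1, 1, 4.0), (1, 2, 6.0), (1, 3, 3.0), (2, 1, 5.0), (2, 3, 2.0);

    -- Precomputed aggregate: one row per (date, category) instead of (date, product)
    CREATE TABLE sales_by_category AS
        SELECT f.date_key, p.category, SUM(f.dollars) AS dollars
        FROM sales_fact f JOIN product p ON p.product_key = f.product_key
        GROUP BY f.date_key, p.category;
""")

# The same question answered from the base table and from the aggregate.
base_answer = conn.execute("""
    SELECT p.category, SUM(f.dollars)
    FROM sales_fact f JOIN product p ON p.product_key = f.product_key
    GROUP BY p.category ORDER BY p.category
""").fetchall()
agg_answer = conn.execute("""
    SELECT category, SUM(dollars) FROM sales_by_category
    GROUP BY category ORDER BY category
""").fetchall()

base_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
agg_rows = conn.execute("SELECT COUNT(*) FROM sales_by_category").fetchone()[0]
```

On real data the aggregate is far smaller than the base table, which is where the speedup comes from.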
DBAs should spend time watching what the users are doing and deciding whether to build more
aggregates. The creation of aggregates requires a significant administrative effort. Whereas the operational
production system will provide a framework for administering base-level record keys, the data warehouse
team must create and maintain aggregate keys.
An aggregate navigator is very useful to intercept the end user's SQL query and transform it so as to use
the best available aggregate. It is thus an essential component of the data warehouse because it insulates
end user applications from the changing portfolio of aggregations, and allows the DBA to dynamically
adjust the aggregations without having to roll over the application base.
Finally, aggregations provide a home for planning data. Aggregations built from the base layer upward
coincide with the planning process in place that creates plans and forecasts at these very same levels.
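The table-selection step at the heart of an aggregate navigator can be sketched as follows. This is a simplification under stated assumptions: a real navigator rewrites the SQL text itself, while this example only picks the smallest table that carries every attribute a query needs. The table names, attribute sets and the two smaller row counts are hypothetical.

```python
# (table name, dimension attributes it can answer, approximate row count)
AGGREGATES = [
    ("sales_fact",        {"date", "product", "brand", "category", "store"}, 657_000_000),
    ("sales_by_brand",    {"date", "brand", "category", "store"},             40_000_000),
    ("sales_by_category", {"date", "category"},                                   50_000),
]

def choose_table(needed_attributes):
    """Return the smallest table whose attributes cover the query's needs."""
    candidates = [
        (rows, name)
        for name, attributes, rows in AGGREGATES
        if needed_attributes <= attributes  # subset test
    ]
    if not candidates:
        raise ValueError(f"no table can answer {sorted(needed_attributes)}")
    return min(candidates)[1]

category_table = choose_table({"category"})
brand_table = choose_table({"brand", "store"})
product_table = choose_table({"product"})
```

Because applications never name an aggregate directly, the DBA can add or drop entries in the list without touching any report.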
Server-Side activities
In summary, the "back room" or server functions can be listed as follows:
Build and use the production data extract system.
Perform daily data quality assurance.
Monitor and tune the performance of the data warehouse system.
Perform backup and recovery on the data warehouse.
Communicate with the user community.
The steps in the daily production extract can be outlined as follows:
1. Primary extraction (read the legacy format)
2. Identify the changed records
3. Generalize keys for changing dimensions
4. Transform extract into load record images
5. Migrate from the legacy system to the Data Warehouse system
6. Sort and build aggregates
7. Generalize keys for aggregates
8. Perform loading
9. Process exceptions
10. Quality assurance
11. Publish
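Step 2 of the extract, identifying the changed records, is often done by comparing today's extract with a saved comparison copy of the previous one. The sketch below assumes each extract fits in a dictionary keyed by the production record key; the function name and the sample records are made up.

```python
def changed_records(previous, current):
    """Compare two extracts, each a dict of production key -> record tuple.

    Returns three sorted lists of keys: inserted, updated, deleted.
    """
    inserted = sorted(k for k in current if k not in previous)
    updated = sorted(k for k in current if k in previous and current[k] != previous[k])
    deleted = sorted(k for k in previous if k not in current)
    return inserted, updated, deleted

# Made-up comparison copy (yesterday) and fresh extract (today)
yesterday = {"P1": ("Smith", "CA"), "P2": ("Jones", "NY"), "P3": ("Lee", "TX")}
today     = {"P1": ("Smith", "CA"), "P2": ("Jones", "NJ"), "P4": ("Kim", "WA")}

inserted, updated, deleted = changed_records(yesterday, today)
```

Only the inserted and updated records need to flow through the remaining steps, which is what keeps the daily load small.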
Additional notes:
Data extract tools are expensive. It does not make sense to buy them until the extract and transformation
requirements are well understood.
Maintenance of comparison copies of production files is a significant application burden that is a unique
responsibility of the data warehouse team.
To control slowly changing dimensions, the data warehouse team must create an administrative process
for issuing new dimension keys each time a trackable change occurs. The two alternatives for
administering keys are derived keys and sequentially assigned integer keys.
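The second alternative, sequentially assigned integer keys, can be sketched as a small administrative class that issues a new dimension key whenever a tracked attribute changes and otherwise returns the current one. The class name, the identifiers and the choice of tracked attribute are hypothetical.

```python
import itertools

class DimensionKeyAdministrator:
    """Issues sequential integer keys for a slowly changing dimension."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self.current = {}  # production id -> (dimension key, tracked attributes)
        self.rows = []     # every dimension row ever issued, oldest first

    def key_for(self, production_id, tracked_attributes):
        existing = self.current.get(production_id)
        if existing is not None and existing[1] == tracked_attributes:
            return existing[0]  # nothing trackable changed: keep the key
        key = next(self._next_key)  # new entity or trackable change: new key
        self.current[production_id] = (key, tracked_attributes)
        self.rows.append((key, production_id, tracked_attributes))
        return key

admin = DimensionKeyAdministrator()
first = admin.key_for("CUST-7", ("CA",))
same = admin.key_for("CUST-7", ("CA",))   # unchanged, same key
moved = admin.key_for("CUST-7", ("NV",))  # state changed, new key
other = admin.key_for("CUST-8", ("CA",))
```

Old fact rows keep pointing at the old key, so history is reported against the attributes that were true at the time.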
Metadata - Metadata is a loose term for any form of auxiliary data that is maintained by an application.
Metadata is also kept by the aggregate navigator and by front-end query tools. The data warehouse team
should carefully document all forms of metadata. Ideally, front-end tools should provide tools for
metadata administration.
Most of the extraction steps should be handled on the legacy system. This will allow for the biggest
reduction in data volumes.
A bulk data loader should allow for:
The parallelization of the bulk data load across a number of processors in either SMP or MPP
environments
Selectively turning off and then on the master index before and after bulk loads
Insert and update modes selectable by the DBA
Referential integrity handling options
It is a good idea, as mentioned earlier, to think of the load process as one transaction. If the load is
corrupted, a rollback and load in the next load window should be tried.
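Treating the load as one transaction can be illustrated with Python's sqlite3 module, where using the connection as a context manager commits the whole batch or rolls all of it back. This is a sketch only: the table definition and rows are made up, and a production bulk loader would use the DBMS vendor's own load utility.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        date_key    INTEGER NOT NULL,
        product_key INTEGER NOT NULL,
        dollars     REAL    NOT NULL)
""")

def load_batch(rows):
    """Load every row or none. Returns True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
        return True
    except sqlite3.Error:
        return False

good = load_batch([(1, 1, 9.5), (1, 2, 3.0)])
# The second row violates NOT NULL, so the first row of this batch is rolled back too.
bad = load_batch([(2, 1, 4.0), (2, None, 1.0)])
row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

After the failed batch the table holds only the rows from the successful load, so the corrected batch can simply be retried in the next load window.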
Client-Side activities
The client functions can be summarized as follows:
Build reusable application templates
Design usable graphical user interfaces
Train users on both the applications and the data
Keep the network running efficiently
Additional notes:
Ease of use should be a primary criterion for an end user application tool.
The data warehouse should consist of a library of template applications that run immediately on the user's
desktop. These applications should have a limited set of user-selectable alternatives for setting new
constraints and for picking new measures. These template applications are precanned, parameterized
reports.
The query tools should perform comparisons flexibly and immediately. A single row of an answer set
should show comparisons over multiple time periods of differing grains (month, quarter, year to date,
etc.), comparisons over other dimensions (share of a product to a category), and compound comparisons
across two or more dimensions (share change this year vs. last year). These comparison alternatives
should be available in the form of a pull-down menu. SQL should never be shown.
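A single answer-set row that mixes time grains with a share comparison might be assembled as in the sketch below. The sales figures, the product and category names, and the function name are all invented; a real query tool would generate the equivalent SQL behind a menu choice.

```python
# (month, product, category, dollars): made-up base data
sales = [
    (1, "cola", "drinks", 100.0), (2, "cola", "drinks", 120.0), (3, "cola", "drinks", 80.0),
    (1, "juice", "drinks", 100.0), (2, "juice", "drinks", 80.0), (3, "juice", "drinks", 120.0),
]

def comparison_row(product, month):
    """Build one report row: month, year to date, and share of category."""
    category = next(c for _, p, c, _ in sales if p == product)
    this_month = sum(d for m, p, _, d in sales if p == product and m == month)
    year_to_date = sum(d for m, p, _, d in sales if p == product and m <= month)
    category_month = sum(d for m, _, c, d in sales if c == category and m == month)
    return {
        "product": product,
        "month": this_month,
        "ytd": year_to_date,
        "share_of_category": this_month / category_month,
    }

row = comparison_row("cola", 3)
```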
Presentation should be treated as a separate activity from querying and comparing, and tools should be
chosen that allow answer sets to be transferred easily into multiple presentation environments.
A report-writing query tool should communicate the context of the report instantly, including the identities
of the attributes and the facts as well as any constraints placed by the user. If a user wishes to edit a
column, they should be able to do it directly. Requerying after an edit should at most fetch the data
needed to rebuild the edited column.
All query tools must have an instant STOP command. The tool should not engage the client machine
while waiting on data from the server.
Conclusions
The data warehousing market is moving quickly as all major DBMS and tool vendors try to satisfy IS
needs. The industry needs to be driven by the users, as opposed to by the software/hardware vendors as
has been the case up to now.
Software is the key. Although there have been several advances in hardware, such as parallel processing,
the main impact will still be felt through software.
Here are a few software issues:
Optimization of the execution of star join queries
Indexing of dimension tables for browsing and constraining, especially multi-million-row dimension
tables
Indexing of composite keys of fact tables
Syntax extensions for SQL to handle aggregations and comparisons
Support for low-level data compression
Support for parallel processing
Database design tools for star schemas
Extract, administration and QA tools for star schemas
End user query tools
A Checklist for an Ideal Data Warehouse
The following checklist is from Ralph Kimball's The Data Warehouse Toolkit, Wiley, 1996.
Preliminary complete list of affected user groups prior to interviews
Preliminary complete list of legacy data sources prior to interviews
Data warehouse implementation team identified
Data warehouse manager identified
Interview leader identified
Extract programming manager identified
End user groups to be interviewed identified
Data warehouse kickoff meeting with all affected end user groups
End user interviews
Marketing interviews
Finance interviews
Logistics interviews
Field management interviews
Senior management interviews
Six-inch stack of existing management reports representing all interviewed groups
Legacy system DBA interviews
Copy books obtained for candidate legacy systems
Data dictionary explaining meaning of each candidate table and field
High-level description of which tables and fields are populated with quality data
Interview findings report distributed
Prioritized information needs as expressed by end user community
Data audit performed showing what data is available to support information needs
Data warehousing design meeting
Major processes identified and fact tables laid out
Grain for each fact table chosen
Choice of transaction grain vs. time period accumulating snapshot grain
Dimensions for each fact table identified
Facts for each fact table with legacy source fields identified
Dimension attributes with legacy source fields identified
Core and custom heterogeneous product tables identified
Slowly changing dimension attributes identified
Demographic minidimensions identified
Initial aggregated dimensions identified
Duration of each fact table (need to extract old data upfront) identified
Urgency of each fact table (e.g. need to extract on a daily basis) identified
Implementation staging (first process to be implemented...)
Block diagram for production data extract (as each major process is implemented)
System for reading legacy data
System for identifying changing records
System for handling slowly changing dimensions
System for preparing load record images
Migration system (mainframe to DBMS server machine)
System for creating aggregates
System for loading data, handling exceptions, guaranteeing referential integrity
System for data quality assurance check
System for data snapshot backup and recovery
System for publishing, notifying users of daily data status
DBMS server hardware
Vendor sales and support team qualified
Vendor reference sites contacted and qualified as to relevance
Vendor on-site test (if no qualified, relevant references available)
Vendor demonstrates ability to support system startup, backup, debugging
Open systems and parallel scalability goals met
Contractual terms approved
DBMS software
Vendor sales and support team qualified
Vendor team has implemented a similar data warehouse
Vendor team agrees with dimensional approach
Vendor team demonstrates competence in prototype test
Ability to load, index and quality assure data volume demonstrated
Ability to browse large dimension tables demonstrated
Ability to query family of fact tables from 20 PCs under load demonstrated
Superior performance and optimizer stability demonstrated for star join queries
Superior large dimension table browsing demonstrated
Extended SQL syntax for special data warehouse functions
Ability to immediately and gracefully stop a query from end user PC
Extract tools
Specific need for features of extract tool identified from extract system block diagram
Alternative of writing home-grown extract system rejected
Reference sites supplied by vendor qualified for relevance
Aggregate navigator
Open system approach of navigator verified (serves all SQL network clients)
Metadata table administration understood and compared with other navigators
User query statistics, aggregate recommendations, link to aggregate creation tool
Subsecond browsing performance with the navigator demonstrated for tiny browses
Front end tool for delivering parameterized reports
Saved reports that can be mailed from user to user and run
Saved constraint definitions that can be reused (public and private)
Saved behavioral group definitions that can be reused (public and private)
Dimension table browser with cross attribute subsetting
Existing report can be opened and run with one button click
Multiple answer sets can be automatically assembled in tool with outer join
Direct support for single and multi dimension comparisons
Direct support for multiple comparisons with different aggregations
Direct support for average time period calculations (e.g. average daily balance)
STOP QUERY command
Extensible interface to HELP allowing warehouse data tables to be described to user
Simple drill-down command supporting multiple hierarchies and nonhierarchies
Drill across that allows multiple fact tables to appear in same report
Correctly calculated break rows
Red-Green exception highlighting with interface to drill down
Ability to use network aggregate navigator with every atomic query issued by tool
Sequential operations on the answer set such as numbering top N, and rolling
Ability to extend query syntax for DBMS special functions
Ability to define very large behavioral groups of customers or products
Ability to graph data or hand off data to third-party graphics package
Ability to pivot data or to hand off data to third-party pivot package
Ability to support OLE hot links with other OLE aware applications
Ability to place answer set in clipboard or TXT file in Lotus or Excel formats
Ability to print horizontal and vertical tiled report
Batch operation
Graphical user interface user development facilities
Ability to build a startup screen for the end user
Ability to define pull down menu items
Ability to define buttons for running reports and invoking the browser
Consultants
Consultant team qualified
Consultant team has implemented a similar data warehouse
Consultant team agrees with the dimensional approach
Consultant team demonstrates competence in prototype test
Bibliography
1. Building a Data Warehouse, Second Edition, by W.H. Inmon, Wiley, 1996
2. The Data Warehouse Toolkit, by Dr. Ralph Kimball, Wiley, 1996
3. Strategic Database Technology: Management for the Year 2000, by Alan Simon, Morgan Kaufmann, 1995
4. Applied Decision Support, by Michael W. Davis, Prentice Hall, 1988
5. Data Warehousing: Passing Fancy or Strategic Imperative, white paper by the Gartner Group, 1995
6. Knowledge Asset Management and Corporate Memory, white paper by the Hispacom Group, to be
published in Aug 1996
The End