Lecture 1: Motivation: Why data mining?
Lecture 2: What is data mining?
Lecture 3: Data mining: On what kind of data?
Lecture 4: Data mining functionality
Lecture 5: Classification of data mining systems
Lecture 6: Major issues in data mining

Unit 1: Data warehouse and OLAP
Lecture 7: What is a data warehouse?
Lecture 8: A multidimensional data model
Lecture 9: Data warehouse architecture
Lecture 10 & 11: Data warehouse implementation
Lecture 12: From data warehousing to data mining

Lecture 1
Motivation: Why data mining?

Evolution of Database Technology
  1960s and earlier:
    Data collection and database creation
    Primitive file processing

Evolution of Database Technology
  1970s to early 1980s:
    Database management systems (DBMS)
    Hierarchical and network database systems
    Relational database systems
    Query languages: SQL
    Transactions, concurrency control, and recovery
    Online transaction processing (OLTP)

Evolution of Database Technology
  Mid-1980s to present:
    Advanced data models
      Extended-relational, object-relational
    Advanced application-oriented DBMS
      Spatial, scientific, engineering, temporal, multimedia, active, stream and sensor, knowledge-based

Evolution of Database Technology
  Late 1980s to present:
    Advanced data analysis
      Data warehouse and OLAP
      Data mining and knowledge discovery
      Advanced data mining applications
      Data mining and society
  1990s to present:
    XML-based database systems
    Integration with information retrieval
    Data and information integration

Evolution of Database Technology
  Present to future:
    New generation of integrated data and information systems

Lecture 2
What Is Data Mining?

What Is Data Mining?
  Data mining refers to extracting or "mining" knowledge from large amounts of data.
    Compare: mining of gold from rocks or sand
  Other terms: knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging
  Knowledge Discovery from Data, or KDD

Data Mining: A KDD Process
  Data mining: the core of the knowledge discovery process.
  [Figure: KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Task-relevant Data; Data Mining; Pattern Evaluation]

Steps of a KDD Process
  1. Data cleaning
  2. Data integration
  3. Data selection
  4. Data transformation
  5. Data mining
  6. Pattern evaluation
  7. Knowledge presentation

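The seven steps can be pictured as a chain of small functions. The following is only a minimal sketch under stated assumptions: the record fields (`age`, `item`, `city`), the two toy sources, and the co-occurrence counting used as the "mining" step are all hypothetical and chosen for illustration, not taken from the lecture.

```python
from collections import Counter

def clean(records):
    """Step 1, data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3, data selection: keep only task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4, data transformation: bucket ages into decades."""
    return [{**r, "age_group": f"{r['age'] // 10 * 10}s"} for r in records]

def mine(records):
    """Step 5, data mining: count (age_group, item) co-occurrences."""
    return Counter((r["age_group"], r["item"]) for r in records)

def evaluate(patterns, min_count):
    """Step 6, pattern evaluation: keep patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

# Hypothetical input data from two sources.
store_a = [{"age": 23, "item": "PC", "city": "Pune"},
           {"age": 27, "item": "PC", "city": None}]
store_b = [{"age": 25, "item": "PC", "city": "Delhi"},
           {"age": 41, "item": "TV", "city": "Delhi"}]

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, ["age", "item"]))
patterns = evaluate(mine(data), min_count=2)

# Step 7, knowledge presentation.
for (age_group, item), count in sorted(patterns.items()):
    print(f"{age_group} buy {item}: {count} records")
```

In practice the steps are iterative rather than strictly linear; the linear chain here is a simplification.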
Steps of a KDD Process
  Learning the application domain: relevant prior knowledge and goals of the application
  Creating a target data set: data selection
  Data cleaning and preprocessing
  Data reduction and transformation:
    Find useful features, dimensionality/variable reduction, invariant representation.

Steps of a KDD Process
  Choosing functions of data mining
    Summarization, classification, regression, association, clustering.
  Choosing the mining algorithms
  Data mining: search for patterns of interest
  Pattern evaluation and knowledge presentation
    Visualization, transformation, removing redundant patterns, etc.
  Use of discovered knowledge

Architecture of a Typical Data Mining System
  [Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse. A Knowledge-base sits alongside the upper layers.]

Data Mining and Business Intelligence
  [Figure: layers with increasing potential to support business decisions. Layers recoverable from the slide: Making Decisions (End User); Data Exploration (Statistical Analysis, Querying and Reporting)]

Lecture 3
Data Mining: On What Kind of Data?
  Relational databases
  Data warehouses
  Transactional databases

Data Mining: On What Kind of Data?
  Advanced DB and information repositories
    Object-oriented and object-relational databases
    Spatial databases
    Time-series data and temporal data
    Text databases and multimedia databases
    Heterogeneous and legacy databases
    WWW

Lecture 4
Data Mining Functionalities
  Concept description: characterization and discrimination
    Data can be associated with classes or concepts
      E.g., in the AllElectronics store, classes of items for sale include computers and printers.
    A description of a class or concept is called a class/concept description.
    Data characterization
    Data discrimination

Data Mining Functionalities
  Mining frequent patterns, associations, and correlations
    Frequent patterns: patterns that occur frequently in data
    Itemsets, subsequences, and substructures
      Frequent itemsets
      Sequential patterns
      Structured patterns

Data Mining Functionalities
  Association analysis
    Multidimensional vs. single-dimensional association
      age(X, "20..29") ^ income(X, "20..29K") => buys(X, "PC") [support = 2%, confidence = 60%]
      contains(T, "computer") => contains(T, "software") [support = 1%, confidence = 75%]

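The support and confidence figures attached to a rule such as contains(T, "computer") => contains(T, "software") can be computed directly from a set of transactions. The sketch below assumes a tiny, made-up transaction list; the function names `support` and `confidence` are illustrative and not part of any library.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"support={rule_support:.0%}, confidence={rule_confidence:.0%}")
```

Here two of four transactions contain both items (support 50%), and two of the three transactions containing a computer also contain software (confidence about 67%).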
Data Mining Functionalities
  Classification and prediction
    Finding models (functions) that describe and distinguish data classes or concepts, in order to predict the class of objects whose label is unknown
    E.g., classify countries based on climate, or classify cars based on gas mileage
    Models: decision tree, classification rules (if-then), neural network
    Prediction: predict some unknown or missing numerical values

Data Mining Functionalities
  Cluster analysis
    Unlike classification, which analyzes class-labeled data objects, clustering analyzes data objects without consulting a known class label.
    Clustering is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.

Data Mining Functionalities
  Outlier analysis
    Outlier: a data object that does not comply with the general behavior of the model of the data
    It can be considered as noise or exception, but is quite useful in fraud detection and rare events analysis
  Trend and evolution analysis
    Trend and deviation: regression analysis
    Sequential pattern mining, periodicity analysis
    Similarity-based analysis

Lecture 5
Data Mining: Classification Schemes
Data Mining: Confluence of Multiple Disciplines
  [Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Information Science, Machine Learning, Visualization, and Other Disciplines]

Data Mining: Classification Schemes
  General functionality
    Descriptive data mining
    Predictive data mining
  Classification by various criteria:
    Kinds of databases to be mined
    Kinds of knowledge to be discovered
    Kinds of techniques utilized
    Kinds of applications adapted

Data Mining: Classification Schemes
  Databases to be mined
    Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
  Knowledge to be mined
    Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
    Multiple/integrated functions and mining at multiple levels
Data Mining: Classification Schemes
  Techniques utilized
    Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
  Applications adapted
    Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Lecture 6
Major Issues in Data Mining
  Mining methodology and user interaction issues
    Mining different kinds of knowledge in databases
    Interactive mining of knowledge at multiple levels of abstraction
    Incorporation of background knowledge
    Data mining query languages and ad hoc data mining
    Expression and visualization of data mining results
    Handling noise and incomplete data
    Pattern evaluation: the interestingness problem

Major Issues in Data Mining
  Performance issues
    Efficiency and scalability of data mining algorithms
    Parallel, distributed, and incremental mining methods

Major Issues in Data Mining
  Issues relating to the diversity of data types
    Handling relational and complex types of data
    Mining information from heterogeneous databases and global information systems (WWW)

Lecture 7
What is a Data Warehouse?
  Defined in many different ways
    A decision support database that is maintained separately from the organization's operational database
    Supports information processing by providing a solid platform of consolidated, historical data for analysis.
  "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
  Data warehousing:
    The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
  Organized around major subjects, such as customer, product, sales.
  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
  Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated
  Constructed by integrating multiple, heterogeneous data sources
    Relational databases, flat files, online transaction records
  Data cleaning and data integration techniques are applied.
    Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
      E.g., hotel price: currency, tax, breakfast covered, etc.
    When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant
  The time horizon for the data warehouse is significantly longer than that of operational systems.
    Operational database: current value data.
    Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
  Every key structure in the data warehouse
    Contains an element of time, explicitly or implicitly
    But the key of operational data may or may not contain a time element.

Data Warehouse: Non-Volatile
  A physically separate store of data transformed from the operational environment.
  Operational update of data does not occur in the data warehouse environment.
    Does not require transaction processing, recovery, and concurrency control mechanisms
    Requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Operational DBMS
  Distinct features (OLTP vs. OLAP):
    User and system orientation: customer vs. market
    Data contents: current, detailed vs. historical, consolidated
    Database design: ER + application vs. star + subject
    View: current, local vs. evolutionary, integrated
    Access patterns: update vs. read-only but complex queries

Data Warehouse vs. Operational DBMS
  OLTP (online transaction processing)
    Major task of traditional relational DBMS
    Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
  OLAP (online analytical processing)
    Major task of data warehouse system
    Data analysis and decision making

OLTP vs. OLAP
                      OLTP                             OLAP
users                 clerk, IT professional           knowledge worker
function              day-to-day operations            decision support
DB design             application-oriented             subject-oriented
data                  current, up-to-date,             historical, summarized,
                      detailed, flat relational,       multidimensional,
                      isolated                         integrated, consolidated
usage                 repetitive                       ad hoc
access                read/write,                      lots of scans
                      index/hash on prim. key
unit of work          short, simple transaction        complex query
# records accessed    tens                             millions
# users               thousands                        hundreds
DB size               100MB-GB                         100GB-TB
metric                transaction throughput           query throughput, response

Why Separate Data Warehouse?
  High performance for both systems
    DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
    Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

Why Separate Data Warehouse?
  Different functions and different data:
    Missing data: decision support requires historical data which operational DBs do not typically maintain
    Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
    Data quality: different sources typically use inconsistent data representations, codes, and formats which have to be reconciled

Lecture 8
A Multidimensional Data Model
Cube: A Lattice of Cuboids
  0-D (apex) cuboid: all
  1-D cuboids: (time); (item); (location); (supplier)
  2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
  3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
  4-D (base) cuboid: (time, item, location, supplier)

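The lattice for the four dimensions time, item, location, and supplier can be generated by enumerating every subset of the dimension list. This is a small sketch using the standard `itertools.combinations`; the variable names are illustrative.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# One cuboid per subset of the dimensions, grouped by how many dimensions it keeps.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    names = ["all" if not c else ", ".join(c) for c in cuboids]
    print(f"{k}-D: {names}")

total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

With four dimensions and no concept hierarchies this gives 1 + 4 + 6 + 4 + 1 = 16 cuboids, that is, 2^4.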
Conceptual Modeling of Data Warehouses
  Modeling data warehouses: dimensions & measures
    Star schema: a fact table in the middle connected to a set of dimension tables
    Snowflake schema: a refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
    Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema
  Sales fact table: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
  time dimension: time_key, day, day_of_the_week, month, quarter, year
  item dimension: item_key, item_name, brand, type, supplier_type
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city, province_or_state, country

Example of Snowflake Schema
  Sales fact table: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
  time dimension: time_key, day, day_of_the_week, month, quarter, year
  item dimension: item_key, item_name, brand, type, supplier_key
    supplier: supplier_key, supplier_type
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city_key
    city: city_key, city, province_or_state, country

Example of Fact Constellation
  time dimension: time_key, day, day_of_the_week, month, quarter, year
  item dimension: item_key, item_name, brand, type, supplier_type
  Sales fact table: time_key, item_key, branch_key
  Shipping fact table: time_key, item_key, shipper_key, from_location
Cube Definition (Fact Table)
  define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table)
  define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
  First time as "cube definition"
  define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL
  define cube sales_star [time, item, branch, location]:
      dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
  define dimension time as (time_key, day, day_of_week, month, quarter, year)
  define dimension item as (item_key, item_name, brand, type, supplier_type)
  define dimension branch as (branch_key, branch_name, branch_type)
  define dimension location as (location_key, street, city, province_or_state, country)

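DMQL is a teaching language, so the star schema above cannot be run as written. As a rough relational counterpart, the sketch below builds a reduced version of the schema with Python's standard `sqlite3` module and computes the three measures with a join and GROUP BY. It is an assumption-laden illustration: only the item and location dimensions are included, and all row values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the raw amount.
CREATE TABLE sales (item_key INTEGER REFERENCES item(item_key),
                    location_key INTEGER REFERENCES location(location_key),
                    sales_in_dollars REAL);
INSERT INTO item VALUES (1, 'Laptop', 'Acme', 'computer', 'wholesale'),
                        (2, 'Inkjet', 'Acme', 'printer', 'retail');
INSERT INTO location VALUES (1, '1 Main St', 'Toronto', 'Ontario', 'Canada'),
                            (2, '9 Elm St', 'Chicago', 'Illinois', 'USA');
INSERT INTO sales VALUES (1, 1, 1000.0), (1, 2, 1200.0), (2, 1, 150.0), (2, 1, 250.0);
""")

# Aggregate the fact table along attributes taken from the dimension tables.
rows = conn.execute("""
    SELECT i.type, l.country,
           SUM(s.sales_in_dollars) AS dollars_sold,
           AVG(s.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales s
    JOIN item i     ON i.item_key = s.item_key
    JOIN location l ON l.location_key = s.location_key
    GROUP BY i.type, l.country
    ORDER BY i.type, l.country
""").fetchall()

for row in rows:
    print(row)
```

The fact table stays narrow (keys plus the amount), while descriptive attributes such as `type` and `country` live in the dimension tables and are reached through joins.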
Defining a Snowflake Schema in DMQL
  define cube sales_snowflake [time, item, branch, location]:
      dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
  define dimension time as (time_key, day, day_of_week, month, quarter, year)
  define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

Defining a Snowflake Schema in DMQL
  define dimension branch as (branch_key, branch_name, branch_type)
  define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining a Fact Constellation in DMQL
  define cube sales [time, item, branch, location]:
      dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
  define dimension time as (time_key, day, day_of_week, month, quarter, year)
  define dimension item as (item_key, item_name, brand, type, supplier_type)
  define dimension branch as (branch_key, branch_name, branch_type)
  define dimension location as (location_key, street, city, province_or_state, country)

Defining a Fact Constellation in DMQL
  define cube shipping [time, item, shipper, from_location, to_location]:
      dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
  define dimension time as time in cube sales
  define dimension item as item in cube sales
  define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
  define dimension from_location as location in cube sales
  define dimension to_location as location in cube sales

Measures: Three Categories
  Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
    E.g., count(), sum(), min(), max().
  Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
    E.g., avg(), min_N(), standard_deviation().

Measures: Three Categories
  Holistic: if there is no constant bound on the storage size needed to describe a sub-aggregate.
    E.g., median(), mode(), rank().

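The difference between the three categories shows up when the data is split into partitions and only per-partition aggregates are kept. The following sketch uses made-up numbers and the standard `statistics.median`; it illustrates the definitions rather than any particular system's implementation.

```python
from statistics import median

# Hypothetical data, split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums equals the sum over all the data.
partial_sums = [sum(p) for p in partitions]
assert sum(partial_sums) == sum(all_values)

# Algebraic: avg() follows from two distributive aggregates, sum() and count().
partial_counts = [len(p) for p in partitions]
avg_from_parts = sum(partial_sums) / sum(partial_counts)
assert avg_from_parts == sum(all_values) / len(all_values)

# Holistic: the median of per-partition medians is generally not the true median,
# so no fixed-size summary per partition is enough in general.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)
print(avg_from_parts, median_of_medians, true_median)
```

This is why distributive and algebraic measures are cheap to precompute in a cube, while holistic measures usually need the underlying data or an approximation.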
A Concept Hierarchy: Dimension (location)
  [Figure: concept hierarchy for the location dimension, rooted at "all"]

Multidimensional Data
  Sales volume as a function of product, month, and region
  Dimensions: Product, Location, Time
  Hierarchical summarization paths
  [Figure: hierarchy levels recoverable from the slide include Office (Location) and Day, Month (Time)]

A Sample Data Cube
  [Figure: 3-D data cube. Date axis: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum. Product axis: TV, PC, VCR, sum. Country axis: U.S.A., Canada, Mexico, sum. Highlighted cell: total annual sales of TV in U.S.A.]

Cuboids Corresponding to the Cube
  0-D (apex) cuboid: all
  1-D cuboids: (product); (date); (country)
  2-D cuboids: (product, date); (product, country); (date, country)
  3-D (base) cuboid: (product, date, country)

OLAP Operations
  Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
  Drill down (roll down): reverse of roll-up
    From higher-level summary to lower-level summary or detailed data, or introducing new dimensions
  Slice and dice: project and select

OLAP Operations
  Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
  Other operations
    Drill across: involving (across) more than one fact table
    Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

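The basic operations can be mimicked on a cube stored as a plain dictionary of cells. This is a toy sketch: the cube contents, the city-to-country hierarchy, and all function names are assumptions made for illustration, and a real OLAP server would work on precomputed cuboids rather than scanning cells.

```python
from collections import defaultdict

# Cells of a small cube: (product, city, quarter) -> sales. All values are made up.
cube = {
    ("TV", "Toronto", "Q1"): 10, ("TV", "Toronto", "Q2"): 12,
    ("TV", "Chicago", "Q1"): 7,  ("PC", "Toronto", "Q1"): 20,
    ("PC", "Chicago", "Q1"): 15, ("PC", "Chicago", "Q2"): 18,
}
city_to_country = {"Toronto": "Canada", "Chicago": "USA"}

def roll_up_city_to_country(cells):
    """Roll-up by climbing the location hierarchy: city -> country."""
    out = defaultdict(int)
    for (product, city, quarter), sales in cells.items():
        out[(product, city_to_country[city], quarter)] += sales
    return dict(out)

def roll_up_drop_quarter(cells):
    """Roll-up by dimension reduction: aggregate the quarter dimension away."""
    out = defaultdict(int)
    for (product, city, _quarter), sales in cells.items():
        out[(product, city)] += sales
    return dict(out)

def slice_quarter(cells, quarter):
    """Slice: fix one dimension to a single value."""
    return {(p, c): s for (p, c, q), s in cells.items() if q == quarter}

def dice(cells, products, cities):
    """Dice: select a sub-cube by restricting several dimensions."""
    return {k: s for k, s in cells.items() if k[0] in products and k[1] in cities}

def pivot(table):
    """Pivot a 2-D slice: swap its two axes."""
    return {(c, p): s for (p, c), s in table.items()}

by_country = roll_up_city_to_country(cube)
by_product_city = roll_up_drop_quarter(cube)
q1 = slice_quarter(cube, "Q1")
print(by_country[("TV", "Canada", "Q1")], by_product_city[("PC", "Chicago")])
print(pivot(q1)[("Toronto", "TV")], len(dice(cube, {"TV"}, {"Toronto"})))
```

Drill-down is the reverse direction: going from `by_country` or `by_product_city` back to the finer-grained `cube`, which requires that the detailed cells are still available.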
Lecture 9
Data Warehouse Architecture
Steps for the Design and Construction of a Data Warehouse
  The design of a data warehouse: a business analysis framework
  The process of data warehouse design
  A three-tier data warehouse architecture

Design of a Data Warehouse: A Business Analysis Framework
  Four views regarding the design of a data warehouse
    Top-down view
      Allows selection of the relevant information necessary for the data warehouse

Design of a Data Warehouse: A Business Analysis Framework
    Data warehouse view
      Consists of fact tables and dimension tables
    Data source view
      Exposes the information being captured, stored, and managed by operational systems
    Business query view
      Sees the perspectives of data in the warehouse from the view of the end user

Data Warehouse Design Process
  Top-down, bottom-up approaches or a combination of both
    Top-down: starts with overall design and planning (mature)
    Bottom-up: starts with experiments and prototypes (rapid)
  From software engineering point of view
    Waterfall: structured and systematic analysis at each step before proceeding to the next
    Spiral: rapid generation of increasingly functional systems, with short turnaround time

Data Warehouse Design Process
  Typical data warehouse design process
    Choose a business process to model, e.g., orders, invoices, etc.
    Choose the grain (atomic level of data) of the business process
    Choose the dimensions that will apply to each fact table record
    Choose the measure that will populate each fact table record

Multi-Tiered Architecture
  [Figure: operational DBs and other sources feed Extract, Transform, Load, and Refresh (with Monitor & Integrator and Metadata) into the Data Warehouse and Data Marts; these serve an OLAP Server, which supports Analysis, Query, Reports, and Data mining]

Data Warehouse Back-End Tools and Utilities
  Data extraction:
    Get data from multiple, heterogeneous, and external sources
  Data cleaning:
    Detect errors in the data and rectify them when possible
  Data transformation:
    Convert data from legacy or host format to warehouse format
  Load:
    Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
  Refresh:
    Propagate the updates from the data sources to the warehouse

Three Data Warehouse Models
  Enterprise warehouse
    Collects all of the information about subjects spanning the entire organization
  Data mart
    A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
    Independent vs. dependent (directly from warehouse) data mart
  Virtual warehouse
    A set of views over operational databases
    Only some of the possible summary views may be materialized

Data Warehouse Development: A Recommended Approach
  [Figure: components shown: Enterprise Data Warehouse, Data Marts, Distributed Data Marts, Multi-Tier Data Warehouse]

Types of OLAP Servers
  Hybrid OLAP (HOLAP)
    User flexibility, e.g., low level: relational, high level: array
  Specialized SQL servers
    Specialized support for SQL queries over star/snowflake schemas

Lecture 10 & 11
Data Warehouse Implementation

Efficient Data Cube Computation
  Data cube can be viewed as a lattice of cuboids
    The bottom-most cuboid is the base cuboid
    The top-most cuboid (apex) contains only one cell
    How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
      $$T = \prod_{i=1}^{n} (L_i + 1)$$
  Materialization of data cube
    Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
    Selection of which cuboids to materialize
      Based on size, sharing, access frequency, etc.

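The cuboid-count formula multiplies, for each dimension, its number of hierarchy levels plus one (the extra one stands for the virtual top level "all"). A short sketch with `math.prod`; the level counts passed in are hypothetical examples.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1); the +1 is the virtual top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Four dimensions with a single level each (no hierarchies): 2^4 cuboids.
flat = total_cuboids([1, 1, 1, 1])

# Three dimensions with assumed hierarchies of 4, 3, and 4 levels.
with_hierarchies = total_cuboids([4, 3, 4])

print(flat, with_hierarchies)
```

The count grows multiplicatively with each dimension, which is why full materialization quickly becomes impractical and partial materialization is considered.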
Cube Operation
  Cube definition and computation in DMQL
    define cube sales [item, city, year]: sum(sales_in_dollars)
    compute cube sales
  Transform it into a SQL-like language (with a new operator, cube by, introduced by Gray et al. '96)
    SELECT item, city, year, SUM(amount)
    FROM SALES
    CUBE BY item, city, year
  Need to compute the following group-bys (shown on the slide for generic dimensions date, product, customer):
    (date, product, customer),
    (date, product), (date, customer), (product, customer),
    (date), (product), (customer),
    ()
  [Figure: lattice of group-bys for the query above: (); (city), (item), (year); (city, item), (city, year), (item, year); (city, item, year)]

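What CUBE BY item, city, year asks for can be spelled out as one GROUP BY per subset of the three dimensions. The sketch below computes all of them naively, scanning the rows once per group-by; the `sales` rows and the `cube_by` helper are hypothetical, and real cube algorithms share work between group-bys instead of rescanning.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows of the SALES table: (item, city, year, amount).
sales = [
    ("TV", "Toronto", 2004, 100),
    ("TV", "Chicago", 2004, 150),
    ("PC", "Toronto", 2004, 300),
    ("PC", "Toronto", 2005, 250),
]
dims = ("item", "city", "year")

def cube_by(rows, dims):
    """Return {group_by_dims: {group_key: SUM(amount)}} for every subset of dims."""
    result = {}
    for k in range(len(dims) + 1):
        for positions in combinations(range(len(dims)), k):
            groups = defaultdict(int)
            for row in rows:
                key = tuple(row[p] for p in positions)
                groups[key] += row[-1]
            result[tuple(dims[p] for p in positions)] = dict(groups)
    return result

cube = cube_by(sales, dims)
print(len(cube))                 # number of group-bys
print(cube[()])                  # grand total, the apex cuboid
print(cube[("item",)])
print(cube[("city", "year")])
```

The empty group-by `()` corresponds to the apex cuboid, and `("item", "city", "year")` to the base cuboid.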
Cube Computation: ROLAP-Based Method
  Efficient cube computation methods
    ROLAP-based cubing algorithms (Agarwal et al. '96)
    Array-based cubing algorithm (Zhao et al. '97)
    Bottom-up computation method (Beyer & Ramakrishnan '99)
  ROLAP-based cubing algorithms
    Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
    Grouping is performed on some sub-aggregates as a partial grouping step
    Aggregates may be computed from previously computed aggregates, rather than from the base fact table

Multiway Array Aggregation for Cube Computation
  Partition arrays into chunks (a small subcube which fits in memory).
  Compressed sparse array addressing: (chunk_id, offset)
  Compute aggregates in "multiway" by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost.

Multiway Array Aggregation for Cube Computation
  [Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]

Multiway Array Aggregation for Cube Computation
  Method: the planes should be sorted and computed according to their size in ascending order.
    Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
  Limitation of the method: it computes well only for a small number of dimensions
    If there are a large number of dimensions, bottom-up computation and iceberg cube computation methods can be explored

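Two ingredients of the array-based method can be shown in isolation: the (chunk_id, offset) addressing of sparse cells, and the idea of updating several 2-D planes during a single scan of the base cuboid. The sketch below is a simplification under stated assumptions: the array size (8 cells per dimension), the chunk size, the 0-based chunk numbering, and the sample cells are all invented, and it does not implement the memory-minimizing choice of scan order described above.

```python
from collections import defaultdict

DIM = 8    # cells per dimension (assumed for illustration)
CHUNK = 2  # cells per chunk side, giving 4 x 4 x 4 = 64 chunks
PER_SIDE = DIM // CHUNK

def address(a, b, c):
    """Map cell (a, b, c) to (chunk_id, offset); chunk ids are 0-based, A varies fastest."""
    chunk_id = a // CHUNK + PER_SIDE * (b // CHUNK) + PER_SIDE**2 * (c // CHUNK)
    offset = a % CHUNK + CHUNK * (b % CHUNK) + CHUNK**2 * (c % CHUNK)
    return chunk_id, offset

def cell(chunk_id, offset):
    """Inverse of address(): recover (a, b, c) from (chunk_id, offset)."""
    ca, cb, cc = chunk_id % PER_SIDE, chunk_id // PER_SIDE % PER_SIDE, chunk_id // PER_SIDE**2
    oa, ob, oc = offset % CHUNK, offset // CHUNK % CHUNK, offset // CHUNK**2
    return ca * CHUNK + oa, cb * CHUNK + ob, cc * CHUNK + oc

# Sparse base cuboid: only non-empty cells are stored, keyed by (chunk_id, offset).
cells = [(0, 0, 0), (1, 5, 2), (7, 7, 7), (1, 5, 6)]
base = {address(a, b, c): a + b + c + 1 for a, b, c in cells}

# One pass over the chunks in id order updates all three 2-D planes together.
ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
for key in sorted(base):
    a, b, c = cell(*key)
    value = base[key]
    ab[(a, b)] += value
    ac[(a, c)] += value
    bc[(b, c)] += value

print(dict(ab))
```

Each base cell is read once and contributes to the AB, AC, and BC planes at the same time, instead of rescanning the base cuboid once per plane.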
Indexing OLAP Data: Bitmap Index
  Index on a particular column
  Each value in the column has a bit vector: bit operations are fast
  The length of the bit vector: number of records in the base table
  The i-th bit is set if the i-th row of the base table has the value for the indexed column
  Not suitable for high-cardinality domains

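A bitmap index can be sketched with Python integers serving as bit vectors: bit i of the vector for a value is set when row i holds that value. The column contents and helper names below are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose i-th bit is set when row i has that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def matching_rows(bits):
    """Decode a bit vector back into a sorted list of row ids."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]

# Two hypothetical columns of the same base table (row i in both lists is one record).
region = ["Asia", "Europe", "Asia", "America", "Europe"]
kind = ["Retail", "Dealer", "Dealer", "Retail", "Dealer"]

region_index = build_bitmap_index(region)
kind_index = build_bitmap_index(kind)

# region = 'Asia' AND type = 'Dealer' becomes a single bitwise AND.
asia_dealers = matching_rows(region_index["Asia"] & kind_index["Dealer"])
# region = 'Asia' OR region = 'Europe' becomes a bitwise OR.
asia_or_europe = matching_rows(region_index["Asia"] | region_index["Europe"])
print(asia_dealers, asia_or_europe)
```

One bit vector is kept per distinct value, so a column with many distinct values needs many vectors, which is the reason the technique suits low-cardinality domains.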
Indexing OLAP Data: Join Indices
  Join index: JI(R-id, S-id) where R(R-id, ...) joins S(S-id, ...)
  Traditional indices map the values to a list of record ids
    It materializes the relational join in the JI file and speeds up relational join, a rather costly operation
  In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
    E.g., fact table: Sales and two dimensions city and product
      A join index on city maintains for each distinct city a list of RIDs of the tuples recording the Sales in the city
    Join indices can span multiple dimensions

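The city example can be written out as a mapping from each distinct dimension value to the RIDs of the fact tuples that reference it. This is a minimal in-memory sketch; the fact rows and the helper `build_join_index` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical fact table: RID -> (city, product, dollars_sold).
sales = {
    0: ("Toronto", "TV", 100),
    1: ("Chicago", "PC", 300),
    2: ("Toronto", "PC", 250),
    3: ("Toronto", "TV", 120),
}

def build_join_index(fact, position):
    """For each distinct dimension value, list the RIDs of fact tuples that reference it."""
    index = defaultdict(list)
    for rid, row in fact.items():
        index[row[position]].append(rid)
    return dict(index)

city_index = build_join_index(sales, 0)
product_index = build_join_index(sales, 1)

# A join index spanning two dimensions: (city, product) -> RIDs.
city_product_index = defaultdict(list)
for rid, (city, product, _) in sales.items():
    city_product_index[(city, product)].append(rid)

# Answer "total sales in Toronto" by following RIDs instead of joining at query time.
toronto_total = sum(sales[rid][2] for rid in city_index["Toronto"])
print(city_index["Toronto"], toronto_total, city_product_index[("Toronto", "TV")])
```

The join between the dimension value and the fact rows is computed once when the index is built, so queries only follow the stored RID lists.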
Efficient Processing of OLAP Queries
  Determine which operations should be performed on the available cuboids:
    Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
  Determine to which materialized cuboid(s) the relevant operations should be applied.
  Exploring indexing structures and compressed vs. dense array structures in MOLAP

Lecture 12
From Data Warehousing to Data Mining

Data Warehouse Usage
  Three kinds of data warehouse applications
    Information processing
      Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
    Analytical processing
      Multidimensional analysis of data warehouse data
      Supports basic OLAP operations: slice-dice, drilling, pivoting
    Data mining
      Knowledge discovery from hidden patterns
      Supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
  Differences among the three tasks

From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
  Why online analytical mining?
    High quality of data in data warehouses
      DW contains integrated, consistent, cleaned data
    Available information processing structure surrounding data warehouses
      ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
    OLAP-based exploratory data analysis
      Mining with drilling, dicing, pivoting, etc.
    Online selection of data mining functions
      Integration and swapping of multiple mining functions, algorithms, and tasks.
  Architecture of OLAM

An OLAM Architecture
  [Figure: layered architecture. Layer 4: User Interface (User GUI API), receiving the mining query and returning the mining result. Layer 3: OLAP/OLAM (OLAM Engine, OLAP Engine). Layer 2: MDDB, with Meta Data.]