
The VLDB Journal (2008) 17:621–655

DOI 10.1007/s00778-006-0022-1
REGULAR PAPER
Hierarchical clustering for OLAP: the CUBE File approach
Nikos Karayannidis · Timos Sellis
Received: 6 September 2005 / Accepted: 13 April 2006 / Published online: 7 September 2006
© Springer-Verlag 2006
Abstract  This paper deals with the problem of physical clustering of multidimensional data that are organized in hierarchies on disk in a hierarchy-preserving manner. This is called hierarchical clustering. A typical case, where hierarchical clustering is necessary for reducing I/Os during query evaluation, is the most detailed data of an OLAP cube. The presence of hierarchies in the multidimensional space results in an enormous search space for this problem. We propose a representation of the data space that results in a chunk-tree representation of the cube. The model is adaptive to the cube's extensive sparseness and provides efficient access to subsets of data based on hierarchy value combinations. Based on this representation of the search space we formulate the problem as a chunk-to-bucket allocation problem, which is a packing problem as opposed to the linear ordering approach followed in the literature. We propose a metric to evaluate the quality of hierarchical clustering achieved (i.e., evaluate the solutions to the problem) and formulate the problem as an optimization problem. We prove its NP-Hardness and provide an effective solution based on a linear time greedy algorithm. The solution of this problem leads to the construction of the CUBE File data structure. We analyze in depth all steps of the construction and provide solutions for interesting sub-problems arising, such as the formation of bucket-regions, the storage of large data chunks and the caching of the upper nodes (root directory) in main memory.

Finally, we provide an extensive experimental evaluation of the CUBE File's adaptability to the data space sparseness as well as to an increasing number of data points. The main result is that the CUBE File is highly adaptive to even the most sparse data spaces and for realistic cases of data point cardinalities provides hierarchical clustering of high quality and significant space savings.

Communicated by P.-L. Lions.

N. Karayannidis (✉) · T. Sellis
Institute of Communication and Computer Systems
and School of Electrical and Computer Engineering,
National Technical University of Athens,
Zographou 15773, Athens, Greece
e-mail: nikos@dblab.ece.ntua.gr

T. Sellis
e-mail: timos@dblab.ece.ntua.gr
Keywords  Hierarchical clustering · OLAP · CUBE File · Data cube · Physical data clustering
1 Introduction
Efficient processing of ad hoc OLAP queries is a very difficult task considering, on the one hand, the native complexity of typical OLAP queries, which potentially combine huge amounts of data, and on the other, the fact that no a priori knowledge for queries exists and thus no pre-computation of results or other query-specific tuning can be exploited. The only way to evaluate these queries is to access directly the most detailed data in an efficient way. It is exactly this need to access detailed data based on hierarchy criteria that calls for the hierarchical clustering of data. This paper discusses the physical clustering of OLAP cube data points on disk in a hierarchy-preserving manner, where hierarchies are defined along dimensions (hierarchical clustering).
The problem addressed is set out as follows: we are given a large fact table (FT) containing only grain-level (most detailed) data. We assume that this is part of the star schema in a dimensional data warehouse. Therefore, data points (i.e., tuples in the FT) are organized by a set of N dimensions. We further assume that each dimension is organized in a hierarchy. Typically the data distribution is extremely skewed. In particular, the OLAP cube is extremely sparse and data tend to appear in arbitrary clusters along some dimensions. These clusters correspond to specific combinations of the hierarchy values for which there exist actual data (e.g., sales for a specific product category in a specific geographic region for a specific period of time). The problem is, on the one hand, to store the fact table data in a hierarchy-preserving manner so as to reduce I/Os during the evaluation of ad hoc queries containing restrictions and/or groupings on the dimension hierarchies and, on the other, to enable navigation in the multilevel-multidimensional data space by providing direct access (i.e., indexing) to subsets of data via hierarchical restrictions. The latter implies that index nodes must also be hierarchically clustered if we are aiming at a reduced I/O cost.
Some of the most interesting proposals [20, 21, 36] in the literature for cube data structures deal with the computation and storage of the data cube operator [9]. These methods omit a significant aspect of OLAP, which is that usually dimensions are not flat but are organized in hierarchies of different aggregation levels (e.g., store, city, area, country is such a hierarchy for a Location dimension). The most popular approach for organizing the most detailed data of a cube is the so-called star schema. In this case the cube data are stored in a relational table, called the fact table. Furthermore, various indexing schemes have been developed [3, 15, 25, 26], in order to speed up the evaluation of the join of the central (and usually very large) fact table with the surrounding dimension tables (also known as a star-join). However, even when elaborate indexes are used, due to the arbitrary ordering of the fact table tuples, there might be as many I/Os as there are tuples resulting from the fact table.
We propose the CUBE File data structure as an effective solution to the hierarchical clustering problem set above. The CUBE File multidimensional data structure [18] clusters data into buckets (i.e., disk pages) with respect to the dimension hierarchies, aiming at the hierarchical clustering of the data. Buckets may include both intermediate (index) nodes (directory chunks), as well as leaf (data) nodes (data chunks). The primary goal of a CUBE File is to cluster in the same bucket a family of data (i.e., data corresponding to all hierarchy value combinations for all dimensions) so as to reduce the bucket accesses during query evaluation.

Experimental results in [18] have shown that the CUBE File outperforms the UB-tree/MHC [22], which is another effective method for hierarchically clustering the cube, resulting in 7–9 times less I/Os on average for all workloads tested. This simply means that the CUBE File achieves a higher degree of hierarchical clustering of the data. More interestingly, in [15] it was shown that the UB-tree/MHC technique outperformed the traditional bitmap index based star-join by a factor of 20–40, which simply proves that hierarchical clustering is the most determinant factor for a file organization for OLAP cube data, in order to reduce I/O cost.
To tackle this problem we first model the cube data space as a hierarchy of chunks. This model, called the chunk-tree representation of a cube, copes effectively with the vast data sparseness by truncating empty areas. Moreover, it provides a multiple resolution view of the data space where one can zoom-in or zoom-out to specific areas navigating along the dimension hierarchies. The CUBE File is built by allocating the nodes of the chunk-tree into buckets in a hierarchy-preserving manner. This way we depart from the common approach for solving the hierarchical clustering problem, which is to find a total ordering of the data points (linear clustering), and cope with it as a packing problem, namely a chunk-to-bucket packing problem.
In order to solve the chunk-to-bucket packing problem, we need to be able to evaluate the hierarchical clustering achieved (i.e., evaluate the solutions to this problem). Thus, inspired by the chunk-tree representation of the cube, we define a hierarchical clustering quality metric, called the hierarchical clustering factor. We use this metric to evaluate the quality of the chunk-to-bucket allocation. Moreover, we exploit it in order to formulate the CUBE File construction problem as an optimization problem, which we call the chunk-to-bucket allocation problem. We formally define this problem and prove that it is NP-Hard. Then, we propose a heuristic algorithm as a solution that requires a single pass over the input fact table and linear time in the number of chunks.
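To make the flavor of such a greedy chunk-to-bucket packing concrete, the following is a minimal sketch under our own simplifying assumptions; it is not the paper's actual algorithm, which also handles directory chunks, bucket-regions and oversized chunks. The idea sketched here: if a whole chunk subtree fits in a bucket, store it together (preserving the hierarchy); otherwise store the parent apart and recurse on its children. The names `Chunk` and `allocate` and the capacity value are hypothetical.

```python
# A hedged sketch of greedy chunk-to-bucket allocation over a simplified
# chunk-tree, where each node carries its own storage size.

from dataclasses import dataclass, field

@dataclass
class Chunk:
    size: int                      # storage size of this chunk alone
    children: list = field(default_factory=list)

    def subtree_size(self):
        return self.size + sum(c.subtree_size() for c in self.children)

def allocate(root, bucket_capacity):
    """Greedily pack whole subtrees into buckets, preserving hierarchy."""
    buckets = []
    def place(node):
        if node.subtree_size() <= bucket_capacity:
            # the whole subtree fits: store it in a single bucket
            buckets.append(node)
        else:
            # subtree too large: store the parent apart and recurse
            buckets.append(Chunk(node.size))
            for child in node.children:
                place(child)
    place(root)
    return buckets

tree = Chunk(2, [Chunk(3, [Chunk(4), Chunk(5)]), Chunk(6)])
packed = allocate(tree, bucket_capacity=10)
print(len(packed))  # 5 buckets for this toy tree
```

A single recursive pass over the tree suffices, which mirrors the linear-time, single-pass character claimed for the paper's heuristic.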
In the course of solving this problem several interesting sub-problems arise. We define the sub-problem of chunk-region formation, which deals with the clustering of chunk-trees hanging from the same parent node in order to increase further the overall hierarchical clustering. We propose two algorithms as a solution, one of which is driven by workload patterns. Next, we deal with the sub-problem of storing large data chunks (i.e., chunks that do not fit in a single bucket), as well as with the sub-problem of storing the so-called root directory of the CUBE File (i.e., the upper nodes of the data structure).

Finally, we study the CUBE File's effective adaptation to several cube data spaces by presenting a set of experimental measurements that we have conducted.
All in all, the contributions of this paper are outlined as follows:

– We provide an analytic solution to the problem of hierarchical clustering of an OLAP cube. The solution leads to the construction of the CUBE File data structure.
– We model the multilevel-multidimensional data space of the cube as a chunk-tree. This representation of the data space adapts perfectly to the extensive data sparseness and provides a multi-resolution view of the data with respect to the hierarchies. Moreover, if viewed as an index, it provides direct access to cube data via hierarchical restrictions, which results in significant speedups of typical ad hoc OLAP queries.
– We transform the hierarchical clustering problem from a linear clustering problem into a chunk-to-bucket allocation (i.e., packing) problem, which we formally define and prove that it is NP-Hard.
– We introduce a hierarchical clustering quality metric for evaluating the hierarchical clustering achieved (i.e., evaluating the solution to the problem in question). We provide an efficient solution to this problem as well as to all sub-problems that stem from it, such as the storage of large data chunks or the formation of bucket-regions.
– We provide an experimental evaluation which leads to the following basic results:
  o The CUBE File adapts perfectly to even the most extremely sparse data spaces, yielding significant space savings. Furthermore, the hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.
  o The CUBE File is scalable for any realistic number of input data points. In addition, the hierarchical clustering achieved remains of high quality when the number of input data points increases.
  o The root directory can be cached in main memory, providing a single I/O cost for the evaluation of point queries.
The rest of this paper is organized as follows. Section 2 discusses related work and positions the CUBE File in the space of cube storage structures. Section 3 proposes the chunk-tree representation of the cube as an effective representation of the search space. Section 4 introduces a quality metric for the evaluation of hierarchical clustering. Section 5 formally defines the problem of hierarchical clustering, proves its NP-Hardness and then delves into the nuts and bolts of building the CUBE File. Section 6 presents our extensive experimental evaluation and Sect. 7 recapitulates and emphasizes the main conclusions drawn.
2 Related work
2.1 The linear clustering problem for multidimensional
data
The linear clustering problem for multidimensional data is defined as the problem of finding a linear ordering of records indexed on multiple attributes, to be stored in consecutive disk blocks, such that the I/O cost for the evaluation of queries is minimized. The clustering of multidimensional data has been studied in terms of finding a mapping of the multidimensional space to a one-dimensional space. This approach has been explored mainly in two directions: (a) in order to exploit traditional one-dimensional indexing techniques for a multidimensional index space (a typical example is the UB-tree [2], which exploits a z-ordering of multidimensional data [27], so that these can be stored in a one-dimensional B-tree index [1]) and (b) for ordering buckets containing records that have been indexed on multiple attributes, to minimize the disk access effort. For example, a grid file [23] exploits a multidimensional grid in order to provide a mapping between grid cells and disk blocks. One could find a linear ordering of these cells, and therefore an ordering of the underlying buckets, such that the evaluation of a query entails more sequential bucket reads than random bucket accesses. To this end, space-filling curves (see [33] for a survey) have been used extensively. For example, Jagadish [13] provides a linear clustering method based on the Hilbert curve that outperforms previously proposed mappings. Note, however, that all linear clustering methods are inferior to a simple scan in high dimensional spaces. This is due to the notorious dimensionality curse [41], which states that clustering in such spaces becomes meaningless due to the lack of useful distance metrics.
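As an illustration of such a space-to-line mapping, the z-ordering used by the UB-tree can be sketched as follows: the key of a point is obtained by interleaving the bits of its coordinates (a Morton code). The function name and the bit width are our own illustrative choices.

```python
def z_order(coords, bits=8):
    """Interleave the bits of the coordinates to obtain the Morton
    (z-order) key of a multidimensional point."""
    key = 0
    for bit in range(bits - 1, -1, -1):    # most to least significant bit
        for c in coords:
            key = (key << 1) | ((c >> bit) & 1)
    return key

# points close in the multidimensional space tend to get nearby keys
print(z_order([0, 0], bits=2))  # 0
print(z_order([1, 1], bits=2))  # 3
print(z_order([2, 2], bits=2))  # 12
```

The jump between the keys of (1, 1) and (2, 2) hints at the dispersion problem of the z-curve discussed later in this section.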
In the presence of dimension hierarchies the multidimensional clustering problem becomes combinatorially explosive. Jagadish et al. [14] try to solve the problem of finding an optimal linear clustering of records of a fact table on disk, given a specific workload in the form of a probability distribution over query classes. The authors propose a subclass of clustering methods called lattice paths, which are paths on the lattice defined by the
hierarchy level combinations of the dimensions. The HPP chunk-to-bucket allocation problem (in Sect. 3.2 we provide a formal definition of HPP restrictions and queries) is a different problem for the following reasons:

1. It tries to find an optimal way (in terms of reduced I/O cost during query evaluation) to pack the data into buckets, rather than order the data linearly. The problem of finding an optimal linear ordering of the buckets, for a specific workload, so as to reduce random bucket reads, is an orthogonal problem and therefore the methods proposed in [14] could be used additionally.
2. Apart from the data, it also deals with the intermediate node entries (i.e., directory chunk entries), which provide clustering at a whole-index level and not only at the index-leaf level. In other words, index data are also clustered along with the real data.

As we know that there is no linear clustering of records that will permit all queries over a multidimensional space to be answered efficiently [14], we strongly advocate that linear clustering of buckets (inter-bucket clustering) must be exploited in conjunction with an efficient allocation of records into buckets (intra-bucket clustering).
Furthermore, in [22], a path-based encoding of dimension data, similar to our encoding scheme, is exploited in order to achieve linear clustering of multidimensional data with hierarchies, through a z-ordering [27]. The authors use the UB-tree [2] as an index on top of the linearly clustered records. This technique has the advantage of transforming typical star-join [25] queries into multidimensional range queries, which are computed more efficiently due to the underlying multidimensional index.
However, this technique suffers from the inherent deficiencies of the z space-filling curve, which is not the best space-filling curve according to [7, 13]. On the other hand, it is very easy to compute and thus straightforward to implement the technique even for high dimensionalities. A typical example of such a deficiency is that in the z-curve there is a dispersion of certain data points, which are close in the multidimensional space but not close in the linear order, and the opposite, i.e., distant data points are clustered in the linear space. The latter also results in an inefficient evaluation of multiple disjoint query regions, due to the repetitive retrieval of the same pages for many queries. Finally, the benefits of z-based linear clustering start to disappear quite soon as dimensionality increases, practically even when dimensionality gets over 4–5 dimensions.
2.2 Grid file based multidimensional access methods
The CUBE File organization was initially inspired by the grid file organization [23], which can be viewed as the multidimensional counterpart of extendible hashing [6]. The grid file superimposes a d-dimensional orthogonal grid on the multidimensional space. Given that the grid is not necessarily regular, the resulting cells may be of different shapes and sizes. A grid directory associates one or more of these cells with data buckets, which are stored in one disk page each. Each cell is associated with one bucket, but a bucket may contain several adjacent cells, therefore bucket regions may be formed.

To ensure that data items are always found with no more than two disk accesses for exact match queries, the grid itself is kept in main memory, represented by d one-dimensional arrays called scales. The grid file is intended for dynamic insert/delete operations, therefore it supports operations for splitting and merging directory cells. A well-known problem of the grid file is that it suffers from a superlinear growth of the directory even for data that are uniformly distributed [31]. One basic reason for this is that splitting is not a local operation and thus can lead to superlinear directory growth. Moreover, depending on the implementation of the grid directory, merging may require a complete directory scan [12].
Hinrichs [12] attempts to overcome the shortcomings of the grid file by introducing a 2-level grid directory. In this scheme, the grid directory is now stored on disk and a scaled-down version of it (called the root directory) is kept in main memory to ensure the two-disk access principle still holds. Furthermore, he discusses efficient implementations of the split, merge and neighborhood operations. In a similar manner, Whang and Krishnamurthy [43] extend the idea of a 2-level directory to a multilevel directory, introducing the multilevel grid file, achieving a linear directory growth in the number of records. There exist more grid file based organizations. A comprehensive survey of these, and of multidimensional access methods in general, can be found in [8].
An obvious distinction of the CUBE File organization from the above multidimensional access methods is that it has been designed to fulfill completely different requirements, namely those of an OLAP environment and not of a transaction-oriented one. A CUBE File is designed for an initial bulk loading and then a read-only operation mode, in contrast to the dynamic insert/delete/update workload of a grid file. Moreover, a CUBE File aims at speeding up queries on multidimensional data with hierarchies and exploits hierarchical clustering to this end. Furthermore, as the dimension domain in OLAP is known a priori, the directory does not have to grow dynamically. In addition, changes to the directory are rare, as dimension data do not change very often (compared to the rate of change for the cube data), and deletions are seldom, therefore split and merge operations are not needed so much. Nevertheless, it is more important to adapt well to the native sparseness of a cube data space and to efficiently support incremental updating, so as to minimize the updating window and cube query-down time, which are critical factors in business intelligence applications nowadays.
2.3 Taxonomy of cube primary organizations
The set of reported methods in the literature for primary organizations for the storage of cubes is quite confined. We believe that this is basically due to two reasons: first of all, the generally held view is that a cube is a set of pre-computed aggregated results and thus the main focus has been to devise efficient ways to compute these results [11], as well as to choose which ones to compute for a specific workload (the view selection/maintenance problem [10, 32, 37]). Kotidis and Roussopoulos [19] proposed a storage organization based on packed R-trees for storing these aggregated results. We believe that this is a one-sided view of the problem as it disregards the fact that very often, especially for ad hoc queries, there will be a need for drilling down to the most detailed data in order to compute a result from scratch. Ad hoc queries represent the essence of OLAP, and in contrast to report queries, are not known a priori and thus cannot really benefit from pre-computation. The only way to process them efficiently is to enable fast retrieval of the base data. This calls for an effective primary storage organization for the most detailed data (grain level) of the cube. This argument is of course based on the fact that a full pre-computation of all possible aggregates is prohibitive due to the consequent size explosion, especially for sparse cubes [24].
The second reason that makes people reluctant to work on new primary organizations for cubes is their adherence to relational systems. Although this seems justified, one could pinpoint that a relational table (e.g., a fact table of a star schema [4]) is a logical entity and thus should be separated from the physical method chosen for implementing it. Therefore, one can use, apart from a paged record file, also a B+-tree or even a multidimensional data structure as a primary organization for a fact table. In fact, there are not many commercial RDBMSs ([39] is one that we know of) that exploit a multidimensional data structure as a primary organization for fact tables. All in all, the integration of a new data structure in a full-blown commercial system is a strenuous task with high cost and high risk and thus usually the proposed solutions are reluctant to depart from the existing technology (see also [30] for a detailed description of the issues in this integration).
Figure 1 positions the CUBE File organization in the space of primary organizations proposed for storing a cube (i.e., only the base data and not aggregates). The columns of this table describe the alternative data structures that have been proposed as a primary organization, while the rows classify the proposed methods according to the achieved data clustering. At the top-left cell lies the conventional star schema [4], where a paged record file is used for storing the fact table. This organization guarantees no particular ordering among the stored data and thus additional secondary indexes are built around it in order to support efficient access to the data.
Padmanabhan et al. [28] assume a typical relation
(i.e., a paged record le) as the primary organization
of a cube (i.e., fact table). However, unique combina-
tions of dimension values are used in order to form
blocks of records, which correspond to consecutive disk
pages. These blocks can be considered as chunks. The
database administrator must choose only one hierar-
chy level from each dimension to participate in the
clustering scheme. In this sense, the method provides
multidimensional clustering and not hierarchical (mul-
tidimensional) clustering.
In [35] a chunk-based method for storing large multidimensional arrays is proposed. No hierarchies are assumed on the dimensions and data are clustered according to the most frequent range queries of a particular workload. In [5] the benefits of hierarchical clustering in speeding up queries were observed as a side effect of using a chunk-based file organization over a relation (i.e., a paged file of records) for query caching, with the chunk as the caching unit. Hierarchical clustering was achieved through an appropriate hierarchical encoding of the dimension data.
Markl et al. [22] also impose a hierarchical encoding on the dimension data and assign a path-based surrogate key to each dimension tuple, called the compound surrogate key. They exploit the UB-tree multidimensional index [2] as the primary organization of the cube. Hierarchical clustering is achieved by taking the z-order [27] of the cube data points by interleaving the bits of the corresponding compound surrogates. Deshpande et al. [5], Markl et al. [22] and the CUBE File [18] all exploit hierarchical clustering of the cube data, and the last two use multidimensional structures as the primary organization. This has, among others, the significant benefit of transforming a star-join [25] into a multidimensional range query that is evaluated very efficiently over these data structures.
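The following toy sketch illustrates this benefit under our own assumptions; the grain-level ranges below are invented for illustration and are not taken from the paper's schema. Because path-based surrogates give the members under each hierarchy value contiguous grain-level positions, a restriction on a hierarchy value reduces to a simple range predicate per dimension, which a multidimensional index can evaluate directly.

```python
# Hypothetical mapping from hierarchy-value prefixes to grain-level
# order-code ranges; with path-based surrogate keys such ranges are
# always contiguous.
grain_ranges = {
    "0":   (0, 5),   # all grain members under hierarchy value 0
    "0.0": (0, 2),   # all grain members under value 0.0
    "0.1": (3, 5),
}

def prefix_to_range(prefix, ranges):
    """Map a hierarchical restriction (a surrogate prefix) to its
    contiguous grain-level range."""
    return ranges[prefix]

# a star-join restricted to value '0.0' becomes the range scan [0, 2]
lo, hi = prefix_to_range("0.0", grain_ranges)
rows = [r for r in range(0, 6) if lo <= r <= hi]
print(rows)  # [0, 1, 2]
```

The same idea generalizes to several dimensions: one range per restricted dimension yields a multidimensional range query.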
Fig. 1 The space of proposed primary organizations for cube storage:

  Clustering achieved \ Primary organization | Relation         | MD-Array         | UB-tree            | Grid file based
  No clustering                              | Star schema      |                  |                    |
  Other clustering                           | Chunk-based [28] | Chunk-based [35] |                    |
  Hierarchical clustering                    | Chunk-based [5]  |                  | z-order based [22] | Chunk-based [18]
3 Modeling the data space as a chunk-tree
Clearly our goal is to define a multidimensional file organization that natively supports hierarchies. There is indeed a plethora of data structures for multidimensional data [8], but to the best of our knowledge, none of these explicitly supports hierarchies. Hierarchies complicate things, basically because, in their presence, the data space explodes.¹ Moreover, as we are primarily aiming at speeding up queries including restrictions on the hierarchies, we need a data structure that can efficiently lead us to the corresponding data subset based on these restrictions. A key observation at this point is that all restrictions on the hierarchies intuitively define a subcube or a cube-slice.

To this end, we exploit the intuitive representation of a cube as a multidimensional array and apply a chunking method in order to create subcubes, i.e., the so-called chunks. Our method of chunking is based on the dimension hierarchies' structure and thus we call it hierarchical chunking. In the following sections we present a dimension-data encoding scheme that assigns hierarchy-enabled unique identifiers to each data point in a dimension. Then, we present our hierarchical chunking method. Finally, we propose a tree structure for representing the hierarchy of the resultant chunks and thus modeling the cube data space.
3.1 Dimension encoding and hierarchical chunking

In order to apply hierarchical chunking, we first assign a surrogate key to each dimension hierarchy value. This key uniquely identifies each value within the hierarchy.

¹ Assuming N dimension hierarchies modeled as K-level m-way trees, the number of possible value combinations is K-times exponential in the number of dimensions, i.e., O(m^(KN)).
[Figure: an example LOCATION dimension hierarchy with levels Continent, Country, Region and City (the grain level). Europe (0) contains Greece (0.0) and U.K. (0.1); Greece has regions North (0) and South (1), while U.K. has North (2) and South (3); the cities are Salonica (0), Athens (1), Rhodes (2), Glasgow (3), London (4) and Cardiff (5). For example, the h-surrogate of the South region of Greece is 0.0.1 and that of the city Rhodes is 0.0.1.2.]
Fig. 2 Example of hierarchical surrogate keys assigned to an example hierarchy
More specifically, we order the values in each hierarchy level so that sibling values occupy consecutive positions and perform a mapping to the domain of positive integers. The resulting values are depicted in Fig. 2 for an example of a dimension hierarchy. The simple integers appearing under each value in each level are called order-codes. In order to identify a value in the hierarchy, we form the path of order-codes from the root-value to the value in question. This path is called a hierarchical surrogate key, or simply h-surrogate. For example, the h-surrogate for the value Rhodes is 0.0.1.2. H-surrogates convey hierarchical (i.e., semantic) information for each cube data point, which can be greatly exploited for the efficient processing of star-queries [15, 29, 40].
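The assignment of order-codes and h-surrogates for the Fig. 2 hierarchy can be sketched as follows. The helper function is our own illustrative construction, but the codes it produces match those in the figure: order-codes are assigned level-wide, so that siblings occupy consecutive positions, and each h-surrogate is the path of order-codes from the root.

```python
# The Fig. 2 hierarchy as nested dicts, with city lists at the grain level.
hierarchy = {
    "Europe": {
        "Greece": {"North": ["Salonica", "Athens"], "South": ["Rhodes"]},
        "U.K.":   {"North": ["Glasgow"], "South": ["London", "Cardiff"]},
    }
}

def assign_h_surrogates(tree):
    """Assign level-wide order-codes and build each value's h-surrogate
    as the dot-separated path of order-codes from the root. (Duplicate
    value names, e.g. 'North', keep only their last code in this sketch.)"""
    codes, next_code = {}, {}          # next_code: level -> next order-code
    def walk(node, level, prefix):
        items = node.items() if isinstance(node, dict) else [(v, None) for v in node]
        for name, sub in items:
            oc = next_code.get(level, 0)
            next_code[level] = oc + 1
            h = f"{prefix}.{oc}" if prefix else str(oc)
            codes[name] = h
            if sub is not None:
                walk(sub, level + 1, h)
    walk(tree, 0, "")
    return codes

codes = assign_h_surrogates(hierarchy)
print(codes["Rhodes"], codes["Cardiff"])  # 0.0.1.2 0.1.3.5
```

Note how the prefix of an h-surrogate identifies every ancestor: 0.0.1.2 (Rhodes) has prefix 0.0 (Greece), which is exactly what makes hierarchical restrictions prefix conditions.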
The basic incentive behind hierarchical chunking is to partition the data space by forming a hierarchy of chunks that is based on the dimensions' hierarchies. This has the beneficial effect of pruning all empty areas. Remember that in a cube data space empty areas are typically defined on specific combinations of hierarchy values (e.g., as we did not sell the X product Category in Region Y for T periods of time, an empty region is formed). Moreover, it provides us with a multi-resolution view of the data space where one can zoom-in and zoom-out navigating along the dimension hierarchies.
We model the cube as a large multidimensional array, which consists only of the most detailed data. Initially, we partition the cube into a few chunks corresponding to the most aggregated levels of the dimensions' hierarchies. Then we recursively partition each chunk as we drill-down to the hierarchies of all dimensions in parallel. We define a measure in order to distinguish each recursion step, the chunking depth D. We will illustrate hierarchical chunking with an example. The dimensions of our example cube are depicted in Fig. 3 and correspond to a two-dimensional cube hosting sales data for a fictitious company. The two dimensions are namely LOCATION and PRODUCT. In the figure we can see the members for each level of these dimensions (each appearing with its member-code).
In order to apply our method, we need to have hierar-
chies of equal length. For this reason, we insert pseudo-
levels P into the shorter hierarchies until they reach
the length of the longest one. This padding is done
after the level that is just above the grain level. In our
example, the PRODUCT dimension has only three lev-
els and needs one pseudo-level in order to reach the
length of the LOCATION dimension. This is depicted
next, where we have also noted the order-code range at
each level:
LOCATION:[0-2].[0-4].[0-10].[0-18]
PRODUCT:[0-1].[0-2].P.[0-5]
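This padding step can be sketched as follows; the function name is our own. Pseudo-levels ("P") are inserted just above the grain (last) level until the shorter hierarchy reaches the target length.

```python
def pad_with_pseudo_levels(levels, target_len):
    """Insert pseudo-levels ('P') just above the grain (last) level until
    the hierarchy has target_len levels."""
    missing = target_len - len(levels)
    return levels[:-1] + ["P"] * missing + levels[-1:]

location = ["Continent", "Country", "Region", "City"]
product  = ["Category", "Type", "Item"]
print(pad_with_pseudo_levels(product, len(location)))
# ['Category', 'Type', 'P', 'Item']
```

This reproduces the PRODUCT hierarchy of the example, where a single pseudo-level is inserted between Type and the grain level Item.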
The result of hierarchical chunking on our example cube is depicted in Fig. 4a. Chunking begins at chunking depth D = 0 and proceeds in a top-down fashion. To define a chunk, we define discrete ranges of grain-level (i.e., most-detailed) values on each dimension, denoted in the figure as [a..b], where a and b are grain-level order-codes. Each such range is defined as the set of values with the same parent (value) in the corresponding parent level. These parent levels form the set of pivot levels PVT, which guides the chunking process at each step. Therefore initially, PVT = {LOCATION: Continent, PRODUCT: Category}. For example, if we take value 0 of pivot level Continent of the LOCATION dimension, then the corresponding range at the grain level is Cities [0..5].

The definition of such a range for each dimension defines a chunk. For example, the chunk defined from the 0, 0 values of the pivot levels Continent and Category, respectively, consists of the following grain data: (LOCATION: 0.[0-1].[0-3].[0-5], PRODUCT: 0.[0-1].P.[0-3]). The [] notation denotes a range of members. This chunk appears shaded in Fig. 4a at D = 0. Ultimately at D = 0 we have a chunk for each possible combination between the members of the pivot levels, that is a total of |[0-1]| · |[0-2]| = 2 · 3 = 6 chunks in this example. Thus the total number of chunks created at each depth D equals the product of the cardinalities of the pivot levels.
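As a quick check of this count for the running example, with the pivot-level cardinalities read off the order-code ranges given above (Continent has members 0-2, Category has members 0-1):

```python
from math import prod

# Cardinalities of the pivot levels at D = 0 in the running example
pivot_cardinalities = {"LOCATION.Continent": 3, "PRODUCT.Category": 2}
print(prod(pivot_cardinalities.values()))  # 6
```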
Next we proceed at D = 1, with PVT = {LOCATION:
Country, PRODUCT: Type} and recursively chunk each
chunk of depth D = 0. This time we define ranges
within the previously defined ranges. For example, on
the range corresponding to Continent value 0 that we
created before, we define discrete ranges corresponding
to each country of this continent (i.e., to each value
of the Country level, which has parent 0). In Fig. 4a,
at D = 1, shaded boxes correspond to all the chunks
resulting from the chunking of the chunk mentioned in
the previous paragraph.
Similarly, we proceed with the chunking by descending
all dimension hierarchies in parallel, and at each depth D
we create new chunks within the existing ones. The procedure
ends when the next levels to include as pivot levels
are the grain levels. Then we do not need to perform
any further chunking, because the chunks that would be
produced from such a chunking would be the cells of the
cube themselves. In this case, we have reached the maximum
chunking depth D_MAX. In our example, chunking
stops at D = 2 and the maximum depth is D = 3. Notice
the shaded chunks in Fig. 4a depicting chunks belonging
to the same chunk hierarchy.
The rationale for inserting the pseudo-levels above
the grain level lies in that we wish to apply chunking
as soon as possible and for all possible dimensions. As
the chunking proceeds in a top-to-bottom fashion, this
eager chunking has the advantage of reducing the
chunk size very early and also provides faster access to
the underlying data, because it increases the fan-out
of the intermediate nodes. If at a particular depth one
(or more) pivot level is a pseudo-level, then this level
does not take part in the chunking. (In our example this
occurs at D = 2 for the PRODUCT dimension.) This
means that we do not define any new ranges within the
previously defined range for the specific dimension(s),
but instead we keep the old one with no further chunking.
Therefore, as pseudo-levels restrict chunking in the
dimensions to which they are applied, we must insert them at
the lowest possible level. Consequently, as there is no
chunking below the grain level (a data cell cannot be
further partitioned), the pseudo-level insertion occurs
just above the grain level.
3.2 The chunk-tree representation
We use the intermediate depth chunks as directory
chunks that will guide us to the chunks at depth D_MAX,
which contain the data and are thus called data chunks. This
leads to a chunk-tree representation of the hierarchically
chunked cube and hence the cube data space. It
[Fig. 3 Dimensions of our example cube along with two hierarchy
instantiations: PRODUCT (Category/Type/Item), e.g., Books = 0,
Literature = 0.0, "Murderess, A. Papadiamantis" = 0.0.0; and
LOCATION (Continent/Country/Region/City), e.g., Europe = 0,
Greece = 0.0, Greece-North = 0.0.0, Salonica = 0.0.0.0.]
is depicted in Fig. 4b for our example cube. In Fig. 4b,
we have expanded the chunk-sub-tree corresponding to
the family of chunks that has been shaded in Fig. 4a.
Pseudo-levels are marked with P and the corresponding
directory chunks have reduced dimensionality (i.e.,
one-dimensional in this case). We interleave the h-surrogates
of the pivot level values that define a chunk and
form a chunk-id. This is a unique identifier for a chunk
within a CUBE File. Moreover, this identifier includes
the whole path of a chunk in the chunk hierarchy. In
Fig. 4b, we note the corresponding chunk-id above each
chunk. The root chunk does not have a chunk-id, because
it represents the whole cube and chunk-ids essentially
denote sub-cubes. The part of a chunk-id that is contained
between consecutive dots and corresponds to a
specific depth D is called a D-domain.
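The interleaving that forms a chunk-id can be sketched like this (our own helper, not the paper's implementation; it simply groups the per-depth order-codes of all dimensions into dot-separated D-domains):

```python
# Sketch: a chunk-id interleaves, depth by depth, the pivot-level order-codes
# of each dimension. Dots separate D-domains; '|' separates the dimensions
# inside one D-domain; 'P' stands for a pseudo-level.
def make_chunk_id(paths):
    """paths: one per-depth list of order-codes per dimension, all of equal length."""
    depths = zip(*paths)  # group the codes of all dimensions depth by depth
    return ".".join("|".join(str(code) for code in codes) for codes in depths)

# LOCATION path 0 -> 1 -> 2 and PRODUCT path 0 -> 0 -> pseudo-level:
print(make_chunk_id([[0, 1, 2], [0, 0, "P"]]))  # -> 0|0.1|0.2|P, as in Fig. 4b
```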
The chunk-tree representation can be regarded as a
method to model the multilevel-multidimensional data
space of an OLAP cube. We discuss next the major benefits
of this modeling:
Direct access to cube data through hierarchical restrictions
One of the main advantages of the chunk-tree
representation of a cube is that it explicitly supports hierarchies.
This means that any cube data subset defined
through restrictions on the dimension hierarchies can
be accessed directly. This is achieved by simply accessing
the qualifying cells at each depth and following the intermediate
chunk pointers to the appropriate data. Note
that the vast majority of OLAP queries contain an equality
restriction on a number of hierarchical attributes and,
more commonly, on hierarchical attributes that form a
complete path in the hierarchy. This is reasonable, as
the core of analysis is conducted along the hierarchies.
We call this kind of restriction a hierarchical prefix path
(HPP) restriction and provide the corresponding definition
next:
Definition 1 (Hierarchical Prefix Path Restriction) We
define a hierarchical prefix path restriction (HPP restriction)
on a hierarchy H of a dimension D to be a set of
equality restrictions linked by conjunctions on H's levels
that form a path in H, which always includes the topmost
(most aggregated) level of H.
For example, if we consider the dimension LOCATION
of our example cube and a DATE dimension with
a 3-level hierarchy (Year/Month/Day), then the query
"show me sales for country A (in continent C) in region
B for each month of 1999" contains two whole-path
restrictions, one for the dimension LOCATION and
one for DATE: (a) LOCATION.continent = C AND
[Fig. 4 a The cube from our running example hierarchically chunked,
with pivot levels (Category, Continent) at D = 0, (Type, Country) at
D = 1, (-, Region) at D = 2, and the grain levels (Item, City) at
D = 3 (max depth). b The whole sub-tree up to the data chunks under
chunk 0|0, with chunk-ids such as 0|0.0|0, 0|0.1|1 and 0|0.1|1.2|P
noted above each chunk.]
LOCATION.country = A AND LOCATION.region =
B, and (b) DATE.year = 1999.
Consequently, we can now define the class of HPP
queries:
Definition 2 (Hierarchical Prefix Path Query) We call
a query Q on a cube C a hierarchical prefix path query
(HPP query) if and only if all the restrictions imposed
by Q on the dimensions of C are HPP restrictions, which
are linked together by conjunctions.
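Since chunk-ids encode the whole hierarchy path, an HPP restriction reduces to a simple prefix test on whole D-domains of a chunk-id; a minimal sketch (our own helper, with made-up chunk-ids beyond those of the running example):

```python
# Sketch: an HPP restriction fixes hierarchy values from the top level
# downwards, so the qualifying sub-tree is addressed by a chunk-id prefix
# consisting of whole D-domains.
def matches_hpp(chunk_id, hpp_prefix):
    """True if chunk_id lies in the sub-tree addressed by hpp_prefix.
    Both are dot-separated D-domain strings, e.g. '0|0.1|1'."""
    domains = chunk_id.split(".")
    wanted = hpp_prefix.split(".")
    return domains[:len(wanted)] == wanted

# All data under continent 0 / category 0 (chunk 0|0 in Fig. 4b):
print(matches_hpp("0|0.1|1.2|P", "0|0"))  # -> True
print(matches_hpp("1|0.2|4", "0|0"))      # -> False
```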
Adaptation to the cube's native sparseness The cube data
space is extremely sparse [34]. In other words, the ratio
of the number of real data points to the product of the
dimension grain-level cardinalities is a very small number.
Values for this ratio in the range of 10^-12 to 10^-5
are more than typical (especially for cubes with more
than three dimensions). It is therefore imperative that
a primary organization for the cube adapts well to this
sparseness, allocating space conservatively. Ideally, the
allocated space must be comparable to the size of the
existing data points. The chunk-tree representation
adapts perfectly to the cube data space. The reason
is that the empty regions of a cube are not arbitrarily
formed. On the contrary, specific combinations of
dimension hierarchy values form them. For instance,
in our running example, if no music products are sold
in Greece, then a large empty region is formed. Consequently,
the empty regions in the cube data space
translate naturally to one or more empty chunk sub-trees
in the chunk-tree representation. Therefore, empty
sub-trees can be discarded altogether and the space
allocation corresponds to real data points only.
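For concreteness, the sparseness ratio mentioned above is just the following computation (the cardinalities and fact count below are made up for illustration):

```python
# Sketch: cube sparseness = real data points / product of grain-level cardinalities.
from math import prod

def sparseness(num_points, grain_cardinalities):
    return num_points / prod(grain_cardinalities)

# e.g., 10 million facts in a 4-dimensional cube with 10,000 grain values
# per dimension:
print(sparseness(10_000_000, [10_000] * 4))  # 1e-09, inside the typical 1e-12..1e-5 band
```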
Multi-resolution view of the data space The chunk-tree
represents the whole cube data space (however, with
most of the empty areas pruned). Similarly, each sub-tree
represents a subspace. Moreover, at a specific
chunking depth we view all the data points organized
in hierarchical families (i.e., chunk-trees) according
to the combinations of hierarchy values for the corresponding
hierarchy levels. By descending to a greater-depth
node we view the data of the corresponding
subspace organized in hierarchical families of a more
detailed level, and so on. This multi-resolution feature
will be exploited later in order to achieve better hierarchical
clustering of the data, by promoting the storage of
lower-depth chunk-trees in a bucket over that of greater-depth
ones.
Storage efficiency A chunk is physically represented by
a multidimensional array. This enables offset-based
access, rather than search-based access, which speeds up
the cell access mechanism considerably. Moreover, it
gives us the opportunity to exploit chunk-ids in a very
effective way. A chunk-id essentially consists of interleaved
coordinate values. Therefore, we can use a chunk-id
in order to calculate the appropriate offset of a cell
in a chunk, but we do not have to store the chunk-id
along with each cell. Indeed, a search-based mechanism
(like the one used by conventional B-tree indexes or
the UB-tree [2]) requires that the dimension values (or
the corresponding h-surrogates), which form the search-key,
must also be stored within each cell (i.e., tuple) of
the cube. In the CUBE File only the measure values of
the cube are stored in each cell. Hence notable space
savings are achieved. In addition, further compression
of chunks can be easily achieved without affecting the
offset-based accessing (see [17] for the details).
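The offset computation is ordinary row-major array linearization; a minimal sketch (our own code, with a hypothetical chunk shape):

```python
# Sketch: a chunk is a multidimensional array, so a cell's coordinates
# (taken from one D-domain of the chunk-id) map to a storage offset by
# ordinary row-major linearization -- no search and no stored keys needed.
def cell_offset(coords, shape):
    """Row-major offset of a cell inside a chunk of the given shape."""
    offset = 0
    for c, extent in zip(coords, shape):
        offset = offset * extent + c
    return offset

# A 2x3 chunk (PRODUCT x LOCATION): cell (1, 2) sits at offset 5.
print(cell_offset((1, 2), (2, 3)))  # -> 5
```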
Parallel processing enabling Chunk-trees (at various
depths) can be exploited naturally for the logical fragmentation
of the cube data, in order to enable the parallel
processing of queries, as well as the construction
and maintenance (i.e., bulk loading and batch updating)
of the CUBE File. Chunk-trees are essentially disjoint
fragments of the data that carry all the hierarchy semantics
of the data. This makes the CUBE File data structure
an excellent candidate for the advanced fragmentation
methods [38] used in parallel data warehouse DBMSs.
Efficient maintenance operations Any data structure
aimed to accommodate data warehouse data must be
efficient in typical data warehousing maintenance operations.
The logical data partitioning provided by the
chunk-tree representation enables fast bulk loading (roll-in
of data), data purging (roll-out of data, i.e., bulk
deletions from the cube), as well as the incremental
updating of the cube (i.e., when the input data with
the latest changes arrive from the data sources, only
local reorganizations are required and not a complete
CUBE File rebuild). The key idea is that new data to
be inserted in the CUBE File correspond to a set of
chunk-trees that need to be hung at various depths
of the structure. The insertion of each such chunk-tree
requires only a local reorganization without affecting
the rest of the structure. In addition, as noted previously,
these chunk-tree insertions can be performed in parallel
as long as they correspond to disjoint subspaces of the
cube. Finally, it is very easy to roll out the oldest month's
data and roll in the current month's (we call this data
purging), as these data correspond to separate chunk-trees
and only a minimal reorganization is required.
The interested reader can find more information regarding
other aspects of the CUBE File not covered in this
paper (e.g., the updating and maintenance operations),
as well as information on a prototype implementation
of a CUBE File based DBMS, in [16].
4 Evaluating the quality of hierarchical clustering
Any physical organization of data must determine how
the latter are distributed in disk pages. A CUBE File
physically organizes its data by allocating the chunks of
the chunk-tree into a set of buckets, where the bucket is
the I/O transfer unit counterpart in our case. First, let us
try to understand what the objectives of such an allocation
are. As already stated, the primary goal is to achieve
a high degree of hierarchical clustering. This statement,
although clear, could still be interpreted in several different
ways. What are the elements that can guarantee that
a specific hierarchical clustering scheme is good? We
attempt to list some next:
1. Efficient evaluation of queries containing restrictions
on the dimension hierarchies
2. Minimization of the size of the data
3. High space utilization
The most important goal of hierarchical clustering is
to improve the response time of queries containing hierarchical
restrictions. Therefore, the first element calls
for a minimal I/O cost (i.e., bucket reads) for the evaluation
of such restrictions. The second element deals
with the ability to minimize the size of the data to be
stored (e.g., by adapting to the extensive sparseness of
the cube data space, i.e., not storing null data, as well
as storing only the minimum necessary data; e.g., in an
offset-based access structure we do not need to store the
dimension values along with the facts). Of course, the
storage overhead must also be minimized in terms of the
number of allocated buckets. Naturally, the best way to
keep this number low is to utilize the available space as
much as possible. Therefore, the third element implies
that the allocation must adapt well to the data distribution,
e.g., more buckets must be allocated to more
densely populated areas and fewer buckets to sparser
ones. Also, buckets must be filled almost to capacity
(i.e., imposing a high bucket occupancy threshold).
The last two elements together guarantee an overall
minimum storage cost.
In the following, we propose a metric for evaluating
the hierarchical clustering quality of an allocation of
chunks into buckets. Then, in the next section, we use this
metric to formally define the chunk-to-bucket allocation
problem as an optimization problem.
4.1 The hierarchical clustering factor
We advocate that hierarchical clustering is the most
important goal for a file organization for OLAP cubes.
However, the space of possible combinations of dimension
hierarchy values is huge (doubly exponential; see
Footnote 1). To this end, we exploit the chunk-tree representation
resulting from the hierarchical chunking of
a cube and deal with the problem of hierarchical clustering
as a problem of allocating chunks of the chunk-tree
into disk buckets. Thus, we are not searching for a linear
clustering (i.e., for a total ordering of the chunked-cube
cells), but rather we are interested in the packing of
chunks into buckets according to the criteria of good
hierarchical clustering posed above.
The intuitive explanation for the utilization of the
chunk-tree for achieving hierarchical clustering lies in
the fact that the chunk-tree is built based solely on the
hierarchies' structure and content and not on some storage
criterion (e.g., each node corresponding to a disk page,
etc.); as a result, it embodies all possible combinations of
hierarchical values. For example, the sub-tree hanging
from the root-chunk in Fig. 4b contains at the leaf level
all the sales figures corresponding to the continent
"Europe" (order-code 0) and to the product category
"Books" (order-code 0) and any possible combinations
of the children members of the two. Therefore, each
sub-tree in the chunk-tree corresponds to a hierarchical
family of values and thus reduces the search space
significantly. In the following we will regard the bucket
as the storage unit. In this section, we define a metric
for evaluating the degree of hierarchical clustering of
different storage schemes in a quantitative way.
Clearly, a hierarchical clustering strategy that respects
the quality element of efficient evaluation of queries
with HPP restrictions that we have posed above must
ensure that the access of the sub-trees hanging under
a specific chunk can be done with a minimal number
of bucket reads. Intuitively, one can say that if we
could store whole sub-trees in each bucket (instead of
single chunks), then this would result in better hierarchical
clustering, as all the restrictions on the specific
sub-tree, as well as on any of its descendant sub-trees,
would be evaluated with a single bucket I/O. For example,
if we store the sub-tree hanging from the root-chunk
in Fig. 4b into a single bucket, we can answer all queries
containing hierarchical restrictions on the combination
"Books" and "Europe", and on any children-values of
these two, with just a single disk I/O.
Therefore, each sub-tree in this chunk-tree corresponds
to a hierarchical family of values. Moreover,
the smaller the chunking depth of this sub-tree, the more
value combinations it embodies. Intuitively, we can
say that the hierarchical clustering achieved could be
assessed by the degree to which low-depth whole chunk
sub-trees are stored in each storage unit. Next, we exploit this
intuitive criterion to define the hierarchical clustering
degree of a bucket (HCD_B). We begin with a number of
auxiliary definitions:
Definition 3 (Bucket-Region) Assume a hierarchically
chunked cube represented by a chunk-tree CT of a maximum
chunking depth D_MAX. A group of chunk-trees of
the same depth having a common parent node, which are
stored in the same bucket, comprises a bucket-region.

Definition 4 (Region contribution of a tree stored in a
bucket, c_r) Assume a hierarchically chunked cube represented
by a chunk-tree CT of a maximum chunking
depth D_MAX. We define as the region contribution c_r of
a tree t of depth d that is stored in a bucket B the
total number of trees in the bucket-region that this tree
belongs to, divided by the total number of trees of the
same depth in the whole chunk-tree CT in general. This is
then multiplied by a bucket-region proximity factor r_P,
which expresses the proximity of the trees of a bucket-region
in the multidimensional space:

c_r = (treeNum(d, B) / treeNum(d, CT)) × r_P,

where treeNum(d, B) is the total number of sub-trees in
B of depth d, treeNum(d, CT) is the total number of sub-trees
in CT of depth d, and r_P is the bucket-region proximity
(0 < r_P ≤ 1).
The region contribution of a tree stored in a bucket
essentially denotes the percentage of trees at a specific
depth that a bucket-region covers. Therefore, the greater
this percentage, the greater the hierarchical clustering
achieved by the corresponding bucket, as more combinations
of the hierarchy members will be clustered in
the same bucket. To keep this contribution high we need
large bucket-regions of low-depth trees, because at low
depths the total number of CT sub-trees is small. Notice
also that the region contribution includes a bucket-region
proximity factor r_P, which expresses the spatial proximity
of the trees of a bucket-region in the multidimensional
space. The larger this factor becomes, the closer the trees
of a bucket-region are, and thus the larger their individual
region contributions are. We will see the effects of this
factor and its definition (Definition 10) in more detail in
a following subsection, where we will discuss the
formation of the bucket-regions.
Definition 5 (Depth contribution of a tree stored in a
bucket, c_d) Assume a hierarchically chunked cube represented
by a chunk-tree CT of a maximum chunking
depth D_MAX. We define as the depth contribution c_d of a
tree t of depth d that is stored in a bucket B the ratio
of d to D_MAX:

c_d = d / D_MAX.
The depth contribution of a tree stored in a bucket
expresses the proportion between the depth of the tree
and the maximum chunking depth. The smaller this ratio
becomes (i.e., the lower the depth of the tree), the
greater the hierarchical clustering achieved by the corresponding
bucket becomes. Intuitively, the depth contribution
expresses the percentage of the number of nodes
in the path from the root-chunk to the bucket in question,
and thus the smaller it is, the smaller the I/O cost to access
this bucket. Alternatively, we could substitute for the depth
value in the numerator of the depth contribution
the number of buckets in the path from the root-chunk
to the bucket in question (with the latter included).
Next, we provide the definition of the hierarchical
clustering degree of a bucket:
Definition 6 (Hierarchical clustering degree of a
bucket, HCD_B) Assume a hierarchically chunked cube
represented by a chunk-tree CT of a maximum chunking
depth D_MAX. For a bucket B containing T whole
sub-trees {t_1, t_2, ..., t_T} of chunking depths {d_1, d_2, ..., d_T},
respectively, where none of these sub-trees is a sub-tree of
another, we define as the hierarchical clustering degree
HCD_B of bucket B the ratio of the sum of the
region contributions of the trees t_i (1 ≤ i ≤ T) included
in B to the sum of the depth contributions of the trees
t_i (1 ≤ i ≤ T), multiplied by the bucket occupancy O_B,
where 0 ≤ O_B ≤ 1:

HCD_B = (Σ_{i=1..T} c_r^i / Σ_{i=1..T} c_d^i) × O_B
      = ((T × c_r) / (T × c_d)) × O_B
      = (c_r / c_d) × O_B,  (1)

where c_r^i is the region contribution of tree t_i and c_d^i is the
depth contribution of tree t_i (1 ≤ i ≤ T). (Note that, as
bucket-regions have been defined as consisting of equi-depth
trees, all trees of a bucket have the same region
contribution as well as the same depth contribution.)
In this definition, we have assumed that the chunking
depth d_i of a chunk-tree t_i is equal to the chunking depth
of the root-chunk of this tree. Of course, we assume that
a normalization of the depth values has taken place, so
that the depth of the chunk-tree CT is 1 instead of
0, in order to avoid zero depths in the denominator
of (1). Furthermore, data chunks are considered
chunk-trees with a depth equal to the maximum chunking
depth of the cube. Note that directory chunks stored
in a bucket not as part of a sub-tree but in isolation have
a zero region contribution; therefore, buckets that contain
only such directory chunks have a zero degree of
hierarchical clustering.
From (1), we can see that the more sub-trees, instead
of single chunks, are included in a bucket, the greater the
hierarchical clustering degree of the bucket becomes,
because more HPP restrictions can be evaluated solely
with this bucket. Also, the higher these trees are (i.e.,
the smaller their chunking depth is), the greater the
hierarchical clustering degree of the bucket becomes, as
more combinations of hierarchical attributes are covered
by this bucket. Moreover, the more trees of the same depth
hanging under the same parent node we have stored in
a bucket, the greater the hierarchical clustering degree
of the bucket, as we include more combinations of the
same path in the hierarchy.
All in all, the HCD_B metric favors the following storage
choices for a bucket:
- Whole trees instead of single chunks or other data
partitions
- Smaller-depth trees instead of greater-depth ones
- Tree regions instead of single trees
- Regions with a few low-depth trees instead of ones
with more trees of greater depth
- Regions with trees of the same depth that are close
in the multidimensional space instead of dispersed
trees
- Buckets with a high occupancy
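Definitions 4–6 translate almost literally into code; the toy model below (our own data layout; the treeNum values, r_P and occupancy are supplied by the caller) computes HCD_B per Equation (1):

```python
# Sketch of Definitions 4-6: region contribution, depth contribution, HCD_B.
# Each stored tree is a tuple (d, trees_in_its_region, trees_of_depth_d_in_CT, r_P).
def region_contribution(trees_in_region, trees_in_ct, r_p):
    return (trees_in_region / trees_in_ct) * r_p      # Definition 4

def depth_contribution(d, d_max):
    return d / d_max                                  # Definition 5

def hcd(trees, d_max, occupancy):
    """trees: list of (d, trees_in_region, trees_in_ct, r_p) per whole sub-tree in B."""
    sum_cr = sum(region_contribution(n_b, n_ct, r_p) for _, n_b, n_ct, r_p in trees)
    sum_cd = sum(depth_contribution(d, d_max) for d, _, _, _ in trees)
    return (sum_cr / sum_cd) * occupancy              # Equation (1)

# Ideal case: the whole chunk-tree CT (normalized depth 1) fills one bucket,
# so HCD_B reaches its maximum, D_MAX (here D_MAX = 3):
print(hcd([(1, 1, 1, 1.0)], d_max=3, occupancy=1.0))  # -> 3.0
```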
We prove the following theorem regarding the maximum
value of the hierarchical clustering degree of a
bucket:

Theorem 1 (Theorem of maximum hierarchical clustering
degree of a bucket) Assume a hierarchically chunked
cube represented by a chunk-tree CT of a maximum
chunking depth D_MAX, which has been allocated to a set
of buckets. Then, for any such bucket B it holds that

HCD_B ≤ D_MAX.
Proof From the definition of the region contribution of
a tree appearing in Definition 4, we can easily deduce
that

c_r^i ≤ 1. (I)

This means that the following holds:

Σ_{i=1..T} c_r^i ≤ T. (II)

In (II), T stands for the number of trees stored in B.
Similarly, from the definition of the depth contribution
of a tree appearing in Definition 5, we can easily deduce
that

c_d^i ≥ 1 / D_MAX, (III)

as the smallest possible depth value is 1. This means that
the following holds:

Σ_{i=1..T} c_d^i ≥ T / D_MAX. (IV)

From (II), (IV), (1) and assuming that B is filled to its
capacity (i.e., O_B equals 1), the theorem is proved.
It is easy to see that the maximum degree of hierarchical
clustering of a bucket B is achieved only in the
ideal case, where we store the chunk-tree CT that represents
the whole cube in B and CT fits exactly in B.²
In this case, all our primary goals for a good hierarchical
clustering, posed in the beginning of this chapter, such as
the efficient evaluation of HPP queries, the low storage
cost and the high space utilization, are achieved. This is
because all possible HPP restrictions can be evaluated
with a single bucket read (one I/O operation) and the
achieved space utilization is maximal (full bucket) with
a minimal storage cost (just one bucket). Moreover, it
is now clear that the hierarchical clustering degree of a
bucket signifies to what extent the chunk-tree representing
the cube has been packed into the specific bucket,
and this is measured in terms of the chunking depth of
the tree.
By trying to create buckets with a high HCD_B we
can guarantee that our allocation respects these elements
of good hierarchical clustering. Furthermore, it
is now straightforward to define a metric for evaluating
the overall hierarchical clustering achieved by a chunk-to-bucket
allocation strategy:
Definition 7 (Hierarchical clustering factor of a physical
organization for a cube, f_HC) For a physical organization
that stores the data of a cube into a set of N_B buckets,
we define as the hierarchical clustering factor f_HC the
percentage of hierarchical clustering achieved by this storage
organization, as this results from the hierarchical clustering
degree of each individual bucket divided by the total
number of buckets, and we write:

f_HC = Σ_{B=1..N_B} HCD_B / (N_B × D_MAX). (2)
² Indeed, a bucket with HCD_B = D_MAX would mean that the
depth contribution of each tree in this bucket should be equal to
1/D_MAX (according to inequality (III)); however, this is only
possible for the whole chunk-tree CT, as only this tree has a depth
equal to 1.
Note that N_B is the total number of buckets used in order
to store the cube; however, only the buckets that contain
at least one whole chunk-tree have a non-zero HCD_B
value. Therefore, allocations that spend more buckets
on storing sub-trees have a higher hierarchical clustering
factor than others which favor, e.g., single-directory-chunk
allocations. From (2), it is clear that even if we
have two different allocations of a cube that result in
the same total HCD_B over individual buckets, the one that
occupies the smaller number of buckets will have the
greater f_HC, rewarding in this way the allocations that use
the available space more conservatively.
Another way of viewing f_HC is as the average
HCD_B over all the buckets divided by the maximum
chunking depth. It is now clear that it expresses the
extent to which the chunk-tree representing the whole
cube has been packed into the set of N_B buckets, and
thus 0 ≤ f_HC ≤ 1. It follows directly from Theorem 1
that this factor is maximized (i.e., equals 1) if and only
if we store the whole cube (i.e., the chunk-tree CT) in a
single bucket, which corresponds to perfect hierarchical
clustering for a cube.
In the next section we exploit the hierarchical clustering
factor f_HC in order to define the chunk-to-bucket
allocation problem as an optimization problem. Furthermore,
we exploit the hierarchical clustering degree of a
bucket, HCD_B, in a greedy strategy that we propose for
solving this problem, as an evaluation criterion, in order
to decide how close we are to an optimal solution.
5 Building the CUBE File
In this section we formally define the chunk-to-bucket
allocation problem as an optimization problem. We
prove that it is NP-Hard and provide a heuristic algorithm
as a solution. In the course of solving this problem
several interesting sub-problems arise. We tackle each
one in a separate subsection.
5.1 The HPP chunk-to-bucket allocation problem
The chunk-to-bucket allocation problem is defined as
follows:

Definition 8 (The HPP chunk-to-bucket allocation
problem) For a cube C, represented by a chunk-tree CT
with a maximum chunking depth of D_MAX, find an allocation
of the chunks of CT into a set of fixed-size buckets
that corresponds to a maximum hierarchical clustering
factor f_HC.
We assume the following: the storage cost of any
chunk-tree t equals cost(t), the number of sub-trees per
depth d in CT equals treeNum(d), and the size of a bucket
equals S_B. Finally, we are given a bucket of special size
S_ROOT consisting of λ consecutive simple buckets, called
the root-bucket B_R, where S_ROOT = λ × S_B, with λ ≥ 1.
Essentially, B_R represents the set of buckets that contain
no whole sub-trees and thus have a zero HCD_B.
The solution S for this problem consists of a set of K
buckets, S = {B_1, B_2, ..., B_K}, such that each bucket contains
at least one sub-tree of CT, and a root-bucket B_R
that contains all the rest of CT (the part with no whole sub-trees).
S must result in a maximum value of the f_HC
factor for the given bucket size S_B. As the HCD_B values
of the buckets of the root-bucket B_R equal zero
(recall that they contain no whole sub-trees), following
from (2), f_HC can be expressed as

f_HC = Σ_{B=1..K} HCD_B / ((K + λ) × D_MAX). (3)
From (3), it is clear that the more buckets we allocate
for the root-bucket (i.e., the greater λ becomes), the
lower the degree of hierarchical clustering achieved by
our allocation. Alternatively, if we consider caching the
whole root-bucket in main memory (see the following
discussion), then we could assume that λ does not affect
hierarchical clustering (as it does not introduce more
bucket I/Os on the path from the root-chunk to a simple
bucket) and λ could be zeroed.
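Equation (3) itself is a one-line computation; a sketch (our own function, with made-up HCD values; the per-bucket HCD_B values and λ are inputs):

```python
# Sketch of Equation (3): f_HC from the HCDs of the K sub-tree-holding
# buckets, the lambda_ buckets of the root-bucket (zero HCD), and D_MAX.
def f_hc(bucket_hcds, lambda_, d_max):
    k = len(bucket_hcds)
    return sum(bucket_hcds) / ((k + lambda_) * d_max)

# Four buckets of a D_MAX = 5 cube plus one root-bucket unit:
print(f_hc([2.0, 1.5, 1.0, 0.5], lambda_=1, d_max=5))  # -> 5.0 / 25 = 0.2
```

As the formula shows, every extra root-bucket unit (larger λ) dilutes the factor without adding any HCD mass.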
In Fig. 5, we depict four different chunk-to-bucket
allocations for the same chunk-tree. The maximum
chunking depth is D_MAX = 5, although in the figure we
can see the nodes only up to depth D = 3 (i.e., the triangles
correspond to sub-trees of three levels). The numbers
inside each node represent the storage cost of the corresponding
sub-tree, e.g., the whole chunk-tree has a
cost of 65 units. Assume a bucket size of S_B = 30 units.
Below each figure we depict the calculated f_HC and
beside it we note the percentage with respect to the best
f_HC that can be achieved for this bucket size (i.e.,
f_HC / f_HCmax × 100%). The chunk-to-bucket allocation that
yields the maximum f_HC can be identified easily by
exhaustive search in this simple case. Observe how
f_HC deteriorates gradually as we move from Fig. 5a to d.
In Fig. 5a we have failed to create any bucket-regions
at depth D = 2. Thus each bucket stores a single sub-tree
of depth 3. Note also that the occupancy of most
buckets is quite low. In Fig. 5b the hierarchical clustering
improves, as some bucket-regions have been formed:
buckets B1, B3 and B4 each store two sub-trees of depth 3.
In Fig. 5c the total number of buckets decreases by one,
as a large bucket-region of four sub-trees has been formed
in bucket B3. Finally, in Fig. 5d we have managed to store
in bucket B3 a higher-level (i.e., lower-depth) sub-tree
(i.e., a sub-tree of depth 2). This increases the hierarchical
clustering achieved even more, compared to the previous
case (Fig. 5c), because the root node is included in
the same bucket as the four sub-trees. In addition, the
bucket occupancy of B3 is increased.
It is now clear from this simple example that the hierarchical
clustering factor f_HC rewards allocations that manage
to store lower-depth sub-trees in buckets, that store
regions of sub-trees instead of single sub-trees,
and that create highly occupied buckets. The individual
calculations of this example can be seen in Fig. 6.
All in all, it is obvious that we now have the optimization problem of finding a chunk-to-bucket allocation such that f_HC is maximized. This problem is NP-Hard, as results from the following theorem.
Theorem 2 (Complexity of the HPP chunk-to-bucket allocation problem) The HPP chunk-to-bucket allocation problem is NP-Hard.
Proof Assume a typical bin packing problem [42] where we are given N items with weights w_i, i = 1, ..., N, respectively, and a bin size B such that w_i ≤ B for all i = 1, ..., N. The problem is to find a packing of the items in the fewest possible bins. Assume that we create N chunks of depth d and dimensionality D, so that chunk c_1 has a storage cost of w_1, chunk c_2 has a storage cost of w_2, and so on. Also assume that N - 1 of these chunks are under the same parent chunk (e.g., the Nth chunk). This way we have created a two-level chunk-tree, where the root lies at depth d = 0 and the leaves at depth d = 1. Also assume that a bin and a bucket are equivalent terms. Now we have reduced in polynomial time the bin packing problem to an HPP chunk-to-bucket allocation problem, which is to find an allocation of the chunks into buckets of size B such that the achieved hierarchical clustering factor f_HC is maximized.
As all the chunk-trees (i.e., single chunks in our case) are of the same depth, the depth contribution c_i^d (1 ≤ i ≤ N), defined in (1), is the same for all chunk-trees. Therefore, in order to maximize the degree of hierarchical clustering HCD_B for each individual bucket (and thus also increase the hierarchical clustering factor f_HC), we have to maximize the region contribution c_i^r (1 ≤ i ≤ N) of each chunk-tree (1). This occurs when we pack into each bucket as many trees as possible on the one hand and, due to the region proximity factor r_P, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to the f_HC definition, the number of buckets used must be the smallest possible. If we assume that the chunk dimensions have no inherent ordering, then there is no notion of spatial proximity within the trees of the same region and the region proximity factor equals 1 for all possible regions (see also the related discussion in the following subsection).

Fig. 5 The hierarchical clustering factor f_HC of the same chunk-tree for four different chunk-to-bucket allocations. [Figure: four allocations of a chunk-tree with D_MAX = 5, S_B = 30; node labels give the storage cost of each sub-tree: (a) f_HC = 0.01 (14%), (b) f_HC = 0.03 (42%), (c) f_HC = 0.05 (69%), (d) f_HC = 0.07 (100%)]
In this case the only way to maximize the HCD_B of each bucket, and consequently the overall f_HC, is to minimize empty space within each bucket [i.e., maximize bucket occupancy in (1)] and to use as few buckets as possible by packing the largest number of trees in each bucket. These are exactly the goals of the original bin packing problem, and thus a solution to the bin packing problem is also a solution to the HPP chunk-to-bucket allocation problem and vice versa.
As the bin packing problem can be reduced in polynomial time to the HPP chunk-to-bucket problem, any problem in NP can be reduced in polynomial time to the HPP chunk-to-bucket problem. Furthermore, in the general case (where we have chunk-trees of varying depths and dimensions have inherent orderings) it is not easy to find a polynomial-time verifier for a solution to the HPP chunk-to-bucket problem, as the maximum f_HC that can be achieved is not known (as it is in the bin packing problem, where the minimum number of bins can be computed with a simple division of the total weight of the items by the size of a bin). Thus the problem is NP-Hard.
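To make the reduction concrete, the following sketch (with illustrative names, not taken from the paper) turns a bin packing instance into the two-level chunk-tree described in the proof: the Nth chunk becomes the parent (root, depth d = 0) of the other N - 1 chunks (leaves, depth d = 1).

```python
def bin_packing_to_chunk_tree(weights, B):
    """Build the HPP instance of Theorem 2 from a bin packing
    instance (item weights and bin size B, playing the role of the
    bucket size S_B)."""
    assert all(w <= B for w in weights)
    chunks = [{"cost": w, "depth": 1, "children": []} for w in weights]
    *leaves, root = chunks
    root["depth"] = 0          # the Nth chunk is the root (depth d = 0)
    root["children"] = leaves  # the other N - 1 chunks are its leaves
    return root

tree = bin_packing_to_chunk_tree([3, 5, 2, 4], B=10)
```

Solving the resulting chunk-to-bucket instance optimally then packs the leaves into the fewest buckets, exactly as the bin packing instance demands.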
We proceed next by providing a greedy algorithm based on heuristics for solving the HPP chunk-to-bucket allocation problem in linear time. The algorithm utilizes the hierarchical clustering degree of a bucket as a criterion in order to evaluate at each step how close we are to an optimal solution. In particular, it traverses the chunk-tree in a top-down, depth-first manner, adopting the greedy approach that if at each step we create a bucket with a maximum value of HCD_B, then overall the acquired hierarchical clustering factor will be maximal. Intuitively, by trying to pack the available buckets with low-depth trees (i.e., the tallest trees) first (hence the top-to-bottom traversal), we ensure that we have not missed the chance to create the best HCD_B buckets possible.

Fig. 6 The individual calculations of the example in Fig. 5. For each allocation the table lists, per bucket, the region contribution c_r, the depth contribution c_d, the bucket occupancy O_B and the resulting HCD_B, followed by the total number of buckets K, f_HC and f_HC/f_HCmax (bucket size S_B = 30, maximum chunking depth D_MAX = 5 and number of root-bucket buckets Λ = 1 in all cases):

Allocation   Bucket   c_r    c_d    O_B    HCD_B     K    f_HC   f_HC/f_HCmax (%)
Fig. 5(a)    B1       0.14   0.6    0.33   0.08      7    0.01   14%
             B2       0.14   0.6    0.67   0.16
             B3       0.14   0.6    0.17   0.04
             B4       0.14   0.6    0.17   0.04
             B5       0.14   0.6    0.17   0.04
             B6       0.14   0.6    0.10   0.02
             B7       0.14   0.6    0.07   0.02
Fig. 5(b)    B1       0.29   0.6    1.00   0.48      4    0.03   42%
             B2       0.14   0.6    0.17   0.04
             B3       0.29   0.6    0.33   0.16
             B4       0.29   0.6    0.17   0.08
Fig. 5(c)    B1       0.29   0.6    1.00   0.48      3    0.05   69%
             B2       0.14   0.6    0.17   0.04
             B3       0.57   0.6    0.50   0.48
Fig. 5(d)    B1       0.29   0.6    1.00   0.48      3    0.07   100%
             B2       0.14   0.6    0.17   0.04
             B3       0.50   0.4    0.73   0.92
In Fig. 7, we present the GreedyPutChunksIntoBuckets algorithm, which receives as input the root R of a chunk-tree CT and the fixed size S_B of a bucket. The output of this algorithm is a set of buckets, each containing at least one whole chunk-tree, a directory chunk entry pointing at the root chunk R, and the root-bucket B_R.
In each step the algorithm tries greedily to make an allocation decision that will maximize the HCD_B of the current bucket. For example, in lines 2-7 of Fig. 7, the algorithm tries to store the whole input tree in a single bucket, thus aiming at a maximum degree of hierarchical clustering for the corresponding bucket. If this fails, it allocates the root R to the root-bucket and tries to achieve a maximum HCD_B by allocating the sub-trees at the next depth, i.e., the children of R (lines 9-26). This is essentially achieved by including all direct children sub-trees with size less than (or equal to) the size of a bucket (S_B) in a list of candidate trees for inclusion in bucket regions (buckRegion) (lines 14-16). Then the routine formBucketRegions is called on this list; it tries to include the corresponding trees in a minimum set of buckets, by forming the bucket regions to be stored in each bucket, so that each one achieves the maximum possible HCD_B (lines 19-22). We will come back to this routine and discuss how it solves this problem in the next sub-section. Finally, for the children sub-trees of root R with a size cost greater than the size of a bucket, we recursively try to solve the corresponding HPP chunk-to-bucket allocation sub-problem for each one of them (lines 23-26). This of course corresponds to a depth-first traversal of the input chunk-tree.
Very important is also the fact that no space is allocated for empty sub-trees (lines 11-13); only a special entry is inserted in the parent node to denote a NULL sub-tree. Therefore, the allocation performed by the greedy algorithm adapts perfectly to the data distribution, coping effectively with the native sparseness of the cube.

 0: GreedyPutChunksIntoBuckets(R, S_B)
    // Input:  root R of a chunk-tree CT, bucket size S_B
    // Output: updated R, list of allocated buckets BuckList,
    //         root-bucket B_R, directory entry dirEnt pointing at R
 1: List buckRegion                  // bucket-region candidates list
 2: IF (cost(CT) <= S_B) {
 3:     Allocate new bucket B_n
 4:     Store CT in B_n
 5:     dirEnt = addressOf(R)
 6:     RETURN
 7: }
 8: // R will be stored in the root-bucket B_R
 9: IF (R is a directory chunk) {
10:     FOR EACH child sub-tree CT_C of R {
11:         IF (CT_C is empty) {
12:             Mark corresponding entry of R with an empty tag
13:         }
14:         IF (cost(CT_C) <= S_B) {
15:             // insert CT_C into the list of bucket-region candidates
16:             buckRegion.push(CT_C)
17:         }
18:     }
19:     IF (buckRegion != empty) {
20:         // formulate the bucket-regions
21:         formBucketRegions(buckRegion, BuckList, R)
22:     }
23:     WHILE (there is a child CT_C : cost(CT_C) > S_B) {
24:         GreedyPutChunksIntoBuckets(root(CT_C), S_B)
25:         Update corresponding entry of R for CT_C
26:     }
27:     Store R in the root-bucket B_R
28:     dirEnt = addressOf(R)
29: }
30: ELSE {                           // R is a data chunk and cost(R) > S_B
31:     Artificially chunk R, creating a 2-level chunk-tree CT_A
32:     GreedyPutChunksIntoBuckets(root(CT_A), S_B)
33:     // storage of R will be taken care of by the previous call
34:     dirEnt = addressOf(root(CT_A))
35: }
36: RETURN
37: }

Fig. 7 A greedy algorithm for the HPP chunk-to-bucket allocation problem
The recursive calls might eventually lead us all the way down to a data chunk (at depth D_MAX). Indeed, if GreedyPutChunksIntoBuckets is called upon a root R which is a data chunk, this means that we have come upon a data chunk with size greater than the bucket size. This is called a large data chunk, and a more detailed discussion of how to handle such chunks follows in a later sub-section. For now it suffices to say that, in order to resolve the problem of storing such a chunk, we extend the chunking further (with a technique called artificial chunking) in order to transform the large data chunk into a 2-level chunk-tree. Then we solve the HPP chunk-to-bucket sub-problem for this sub-tree (lines 30-35). The termination of the algorithm is guaranteed by the fact that each recursive call deals with a sub-problem of a chunk-tree smaller in size than that of the parent problem. Thus, the size of the input chunk-tree is continuously reduced.
Fig. 8 A chunk-tree to be allocated to buckets by the greedy algorithm. [Figure: a chunk-tree with D_MAX = 5, levels D = 1 and D = 2 shown; node labels (65, 40, 22, 10, 20, 5, 5, 5, 2, 3) give the storage cost of each sub-tree]
Assuming an input file consisting of the cube's data points along with their corresponding chunk-ids (or, equivalently, the corresponding h-surrogate key per dimension), we need a single pass over this file to create the chunk-tree representation of the cube. Then the above greedy algorithm requires only linear time in the number of input chunks (i.e., the chunks of the chunk-tree) to perform the allocation of chunks to buckets, as each node is visited exactly once and in the worst case all nodes are visited.

Fig. 9 The chunk-to-bucket allocation for S_B = 30. [Figure: the chunk-tree of Fig. 8 with three buckets B_1, B_2 and B_3 drawn as rectangles around the allocated sub-trees]
Assume the chunk-tree of D_MAX = 5 of Fig. 8. The numbers inside each node represent the storage cost of the corresponding sub-tree; e.g., the whole chunk-tree has a cost of 65 units. For a bucket size S_B = 30 units the greedy algorithm yields a hierarchical clustering factor f_HC = 0.72. The corresponding allocation is depicted in Fig. 9.
The solution comprises three buckets B_1, B_2 and B_3, depicted as rectangles in the figure. The bucket with the highest clustering degree (HCD_B) is B_3, because it includes the lowest-depth tree. The chunks not included in a rectangle will be stored in the root-bucket. In this case, the root-bucket consists of only a single bucket (i.e., Λ = 1 and K = 3, see (3)), as this suffices for storing the corresponding two chunks.
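The walk-through above can be reproduced with a minimal executable sketch of the greedy allocation. This is a simplification of the algorithm of Fig. 7: bucket-region formation is reduced to a first-fit pass over sibling trees, and the handling of empty sub-trees and large data chunks is omitted. The chunk-tree below assumes one plausible decomposition of the node costs of Fig. 8 (a root chunk of cost 3 over sub-trees of total costs 40 and 22); the names Chunk and greedy_allocate are illustrative, not part of the paper.

```python
# Simplified model: each chunk node carries only its own storage cost.
class Chunk:
    def __init__(self, cost, children=()):
        self.own_cost = cost            # cost of this chunk alone
        self.children = list(children)  # empty list => data chunk

    def tree_cost(self):                # cost of the whole sub-tree
        return self.own_cost + sum(c.tree_cost() for c in self.children)

def greedy_allocate(root, S_B, buckets, root_bucket):
    """Greedily allocate the chunk-tree rooted at `root` to buckets
    of size S_B; first-fit stands in for formBucketRegions."""
    if root.tree_cost() <= S_B:         # whole tree fits in one bucket
        buckets.append([root])
        return
    root_bucket.append(root)            # root chunk goes to the root-bucket
    regions = []                        # bucket-regions over sibling trees
    for t in (c for c in root.children if c.tree_cost() <= S_B):
        for r in regions:               # first region with enough room
            if sum(x.tree_cost() for x in r) + t.tree_cost() <= S_B:
                r.append(t)
                break
        else:
            regions.append([t])
        # (the paper instead maximizes HCD_B when forming regions)
    buckets.extend(regions)
    for t in (c for c in root.children if c.tree_cost() > S_B):
        greedy_allocate(t, S_B, buckets, root_bucket)  # recurse deeper

# A 65-unit chunk-tree and a bucket size of 30 units:
t40 = Chunk(5, [Chunk(20), Chunk(10), Chunk(5)])   # sub-tree of cost 40
t22 = Chunk(2, [Chunk(10), Chunk(5), Chunk(5)])    # sub-tree of cost 22
root = Chunk(3, [t40, t22])                        # whole tree: cost 65

buckets, root_bucket = [], []
greedy_allocate(root, 30, buckets, root_bucket)
# -> three buckets are allocated and two chunks land in the root-bucket
```

Under this cost decomposition the sketch yields three buckets and two root-bucket chunks, matching the shape of the allocation described above.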
5.2 Bucket-region formation
We have seen that in each step of the greedy algorithm for solving the HPP chunk-to-bucket allocation problem (corresponding to an input chunk-tree with a root node at a specific chunking depth), we try to store all the sibling trees hanging from this root in a set of buckets, forming in this way the groups of trees to be stored in each bucket that we call bucket regions. The formation of bucket regions is essentially a special case of the HPP chunk-to-bucket allocation problem and can be described as follows:
Definition 9 (The bucket region formation problem) We are given a set of N chunk-trees T_1, T_2, ..., T_N of the same chunking depth d. Each tree T_i (1 ≤ i ≤ N) has a size cost(T_i) ≤ S_B, where S_B is the bucket size. The problem is to store these trees in a set of buckets, so that the hierarchical clustering factor f_HC of this allocation is maximized.
As all the trees are of the same depth, the depth contribution c_i^d (1 ≤ i ≤ N), defined in (1), is the same for all trees. Therefore, in order to maximize the degree of hierarchical clustering HCD_B for each individual bucket (and thus also increase the hierarchical clustering factor f_HC), we have to maximize the region contribution c_i^r (1 ≤ i ≤ N) of each tree (1). This occurs when we create bucket regions with as many trees as possible on the one hand and, due to the region proximity factor r_P, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to the f_HC definition, the number of buckets used must be the smallest possible.
Summarizing, in the bucket region formation problem we seek a set of buckets to store the input trees, fulfilling the following three criteria:
1. The bucket regions (i.e., each bucket) contain as many trees as possible.
2. The total number of buckets is minimal.
3. The trees of a region are as close in the multidimensional space as possible.
One could observe that if we focused only on the first two criteria, then the bucket region formation problem would reduce to a typical bin-packing problem, which is a well-known NP-complete problem [42]. So, intuitively, the bucket region formation problem can be viewed as a bin-packing problem where items packed in the same bin must be neighbors in the multidimensional space.
The space proximity of the trees of a region is meaningful only when we have dimension domains with inherent orderings. A typical example is the TIME dimension. For instance, we might have trees corresponding to the months of the same year (which guarantees hierarchical proximity), but we would also like consecutive months to be in the same region (space proximity). This is because these dimensions are the best candidates for expressing range predicates (e.g., months from FEB99 to AUG99). Otherwise, when there is no such inherent ordering (e.g., a chunk might point to trees corresponding to products of the same category along the PRODUCT dimension), space proximity is not important, and therefore all regions with the same number of trees are of equal value. In this case the corresponding predicates are typically set-inclusion predicates (e.g., products IN {Literature, Philosophy, Science}) and not range predicates, so hierarchical proximity alone suffices to ensure a low I/O cost. To measure the space proximity of the trees in a bucket region we use the region proximity r_P, which we define as follows:

Fig. 10 The region proximity for two bucket regions: r_P1 > r_P2. [Figure: a data space of months in year 1999 (order-codes 0-11) by types in category Books (Literature, Philosophy, Computers, Science, Fiction), with two bucket regions R_1 and R_2]

Fig. 11 A row-wise traversal of the input trees
Definition 10 (Region proximity r_P) We define the region proximity r_P of a bucket region R, defined in a multidimensional space S where all dimensions of S have an inherent ordering, as the relative distance of the average Euclidean distance between all trees of the region R from the longest distance in S:

    r_P = |dist_AVG - dist_MAX| / dist_MAX.
In the case where no dimension of the cube has an inherent ordering, we assume that the average distance for any region is zero, and thus the region proximity r_P equals one. For example, in Fig. 10 we depict two different bucket regions R_1 and R_2. The surrounding chunk represents the sub-cube corresponding to the months of a specific year and the types of a specific product category, and defines a Euclidean space S. Each point in this figure corresponds to a root of a chunk-tree. As only the TIME dimension, among the two, includes an inherent ordering of its values, the data space, as far as the region proximity is concerned, is specified by TIME only (a one-dimensional metric space). The largest distance in S equals 11 and is the distance between the leftmost and the rightmost trees. The average distance for region R_1 equals 2, while for region R_2 it equals 5. By a simple substitution of the corresponding values in Definition 10, we find that the region proximity for R_1 equals 0.8, while for R_2 it equals 0.5. This is because the trees of the latter are more dispersed along the time dimension. Therefore region R_1 exhibits better space proximity than R_2.
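Definition 10 can be computed directly. The sketch below (with an illustrative function name, and restricted to the one-dimensional case of the example, where only TIME carries an inherent ordering) averages the pairwise distances of the trees of a region and relates them to the longest distance in the space:

```python
from itertools import combinations

def region_proximity(positions, dist_max):
    """r_P = |dist_AVG - dist_MAX| / dist_MAX (Definition 10).
    positions: coordinates of the region's trees along the ordered
    dimension; dist_max: the longest distance in the space S."""
    pairs = list(combinations(positions, 2))
    dist_avg = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return abs(dist_avg - dist_max) / dist_max

# A region spanning the whole TIME axis (0..11) scores 0, while a
# compact region of two consecutive months scores close to 1:
print(region_proximity([0, 11], 11))   # -> 0.0
print(region_proximity([3, 4], 11))    # -> 0.9090909...
```

As in the example of Fig. 10, a tightly packed region scores higher than a dispersed one.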
In order to tackle the region formation problem we propose an algorithm called FormBuckRegions. This algorithm is a variation of an approximation algorithm called best-fit [42] for solving the bin-packing problem. Best-fit is a greedy algorithm that does not always find the optimal solution; however, it runs in polynomial time (it can be implemented to run in O(N log N) time, N being the number of trees in the input) and provides solutions that are within a certain bound of the optimal: the best-fit solution in the worst case is never more than roughly 1.7 times worse than the optimal solution [42]. Moreover, our algorithm exploits a space-filling curve [33] in order to visit the trees in a space-proximity-preserving way. We describe it next:
FormBuckRegions: Traverse the input set of trees along a space-filling curve SFC on the data space defined by the parent chunk. Each time a tree is processed, insert it in the bucket that will yield the maximum HCD_B value, among the allocated buckets, after the insertion. On a tie, choose one randomly. If no bucket can accommodate the current tree, then allocate a new bucket and insert the tree in it.
Note that there is no linearization of multidimensional data points that preserves space proximity 100% [8, 13]. In the case where no dimension has an inherent ordering, the space-filling curve might be a simple row-wise traversal (Fig. 11). In this figure, we also depict the corresponding bucket regions that are formed.
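As a rough sketch of FormBuckRegions, the snippet below visits the trees in the order given by a linearization key (a stand-in for the space-filling curve; sorting on the row coordinate corresponds to a row-wise traversal) and applies a best-fit rule, placing each tree into the fullest bucket that can still hold it. The real routine instead picks the bucket maximizing HCD_B after the insertion; the function name and the cost/key callables are illustrative assumptions.

```python
def form_bucket_regions(trees, S_B, cost, key):
    """Best-fit over a linearized traversal of sibling chunk-trees.
    trees: the input sub-trees; cost(t): storage cost of a tree;
    key(t): position of t along the linearization (space-filling) curve."""
    buckets = []
    for t in sorted(trees, key=key):
        used = lambda b: sum(cost(x) for x in b)
        fits = [b for b in buckets if used(b) + cost(t) <= S_B]
        if fits:  # fullest feasible bucket ~ highest resulting occupancy
            max(fits, key=used).append(t)
        else:     # no room anywhere: allocate a new bucket
            buckets.append([t])
    return buckets

# Trees as (position, cost) pairs, visited row-wise by position:
trees = [(0, 10), (1, 10), (2, 10), (3, 20), (4, 5)]
regions = form_bucket_regions(trees, 30, cost=lambda t: t[1],
                              key=lambda t: t[0])
```

Because the traversal order preserves proximity, trees grouped into the same bucket tend to be neighbors along the ordered dimension.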
We believe that a formation of bucket-regions that will provide efficient clustering of chunk-trees must be based on some query patterns. In the following we show an example of such a query-pattern-driven formation of bucket-regions.
A hierarchy level of a dimension can basically take part in an OLAP query in two ways: (a) as a means of restriction (e.g., year = 2000) or (b) as a grouping attribute (e.g., show me sales grouped by month). In the former, we ask for values on a hyper-plane of the cube perpendicular to the Time dimension at the restriction point, while in the latter we ask for values on hyper-planes that are parallel to the Time dimension. In other words, if we know for a dimension level that it is going to be used by the queries more often as a restriction attribute, then we should try to create regions perpendicular to this dimension. Similarly, if we know that a level is going to be used more often as a grouping attribute, then we should opt for regions that are parallel to this dimension. Unfortunately, things are not so simple, because if, for example, we have two restriction levels from two different dimensions, then the requirement for regions perpendicular to the corresponding dimensions is contradictory.
In Fig. 12, we depict a bucket-region formation that is driven by the table appearing in the figure. In this table we note, for each dimension-level corresponding to a chunking depth from our example cube in Fig. 3, whether it should be characterized as a restriction level or as a grouping level. For instance, a user might know that 80% of the queries referencing the level "continent" will apply a restriction on it and only 20% will use it as a grouping attribute; thus this level will be characterized as a restriction level. Furthermore, in the column labeled "importance order", we order the different levels of the same depth according to their importance in the expected query load. For example, we might know that the "category" level will appear much more often in queries than the "continent" level, and so on.
In Fig. 12, we also depict a representative chunk for each chunking depth (of course, for the topmost levels there is only one chunk, the root chunk), in order to show the formation of the regions according to the table. The algorithm in Fig. 13 describes how we can produce the bucket-regions for all depths, when we have as input a table similar to the one appearing in Fig. 12.
In Fig. 12, for the chunks corresponding to the levels (country, type) and (city, item), we also depict the column-major traversal method corresponding to the second part of the algorithm. Note also that the term "fully sized region" means a region that has a size greater than the bucket occupancy threshold, i.e., it utilizes the available bucket space well. Furthermore, whenever we are at a depth where a pseudo-level exists for a dimension (e.g., D = 2 in our example), no regions are of course created for the pseudo-level. Also, note that bucket-region formation for chunks at the maximum chunking depth (as is the chunk at depth 3 in Fig. 12) is only required in the case where the chunking is extended beyond the data-chunk level. This is the case of large data chunks, which is the topic of the next sub-section.
5.3 Storing large data chunks
In this sub-section, we discuss the case where the GreedyPutChunksIntoBuckets algorithm (Fig. 7) is called with an input chunk-tree that corresponds to a single data chunk. This, as we have already explained, would be the result of a number of recursive calls to the GreedyPutChunksIntoBuckets algorithm that led us to descend the chunk hierarchy and end up at a leaf node. Typically, this leaf node is too large to fit in a single bucket; otherwise the recursive call upon this node would not have occurred in the first place (Fig. 7).
The main idea for tackling this problem is to continue the chunking process further, although we have fully used the existing dimension hierarchies, by imposing a normal grid. We call this chunking artificial chunking, in contrast to the hierarchical chunking presented in the previous section. This process transforms the initial large data chunk into a 2-level chunk-tree of size less than or equal to that of the original data chunk. Then we solve the HPP chunk-to-bucket allocation sub-problem for this chunk-tree, and therefore we once again call the GreedyPutChunksIntoBuckets routine upon this tree.
In Fig. 14, we depict an example of such a large data chunk. It consists of two dimensions, A and B. We assume that the maximum chunking depth is D_MAX = K; therefore, K will be the depth of this chunk. Parallel to the dimensions, we depict the order-codes of the dimension values of this chunk that correspond to the most detailed level of each dimension. Also, we denote their parent value on each dimension, i.e., the pivot-level values that created this chunk. Notice that the suffix of the chunk-id of this chunk consists of the concatenated order-codes of the two pivot-level values, i.e., 5|14.
In order to extend the chunking further, we need to insert a new level between the most detailed members of each dimension and their parent. However, this level must be inserted locally, only for this specific chunk and not for all the grain-level values of a dimension. We want to avoid inserting another pseudo-level in the whole level hierarchy of the dimension, because this would trigger the enlargement of all dimension hierarchies and would result in a lot of useless chunks. Therefore, it is essential that this new level remains local. To this end, we introduce the notion of the local depth d of a chunk to characterize artificial chunking, similar to the global chunking depth D (introduced in the previous section) characterizing hierarchical chunking.

Fig. 12 Bucket-region formation based on query patterns. [Figure: a table characterizing each LOCATION and PRODUCT level per chunking depth as a restriction or grouping level together with its importance order (D = 0: continent, category; D = 1: country, type; D = 2: region, pseudo; D = 3: city, item), and a representative chunk with its bucket-regions for each depth]

Definition 11 (Local depth d) The local depth d, where d ≥ -1, of a chunk Ch denotes the chunking depth of Ch pertaining to artificial chunking. A local depth d = -1 denotes that no artificial chunking has been imposed on Ch. A value of d = 0 corresponds to the root of a chunk-tree created by artificial chunking and is always a directory chunk. The value of d increases by one for each artificial chunking level.

Note that the global chunking depth D, while descending levels created by artificial chunking, remains constant and equal to the maximum global chunking depth of the cube (in general, to the current global depth value); only the local depth increases.
Let us assume a bucket size S_B that can accommodate a maximum of M_r directory chunk entries, or a maximum of M_e data chunk entries. In order to chunk a large data chunk Ch of N dimensions by artificial chunking, we define a grid on it, consisting of m_i^g (1 ≤ i ≤ N) members per dimension, such that ∏_{i=1}^{N} m_i^g ≤ M_r. This grid will correspond to a new directory chunk, pointing at the new chunks created from the artificial chunking of the original large data chunk Ch; due to the aforementioned constraint, it is guaranteed that it will fit in a bucket. If we assume a normal grid, then for all i : 1 ≤ i ≤ N it holds that m_i^g = ⌊M_r^(1/N)⌋.
In particular, if n_i (1 ≤ i ≤ N) corresponds to the number of members of the original chunk Ch along dimension i, then a new level consisting of m_i^g members will be inserted as a parent level. In other words, a number of c_i children (out of the n_i) will be assigned to each of the m_i^g members, where c_i ≤ ⌈n_i / m_i^g⌉, as long as c_i ≥ 1. If 0 < c_i < 1, then the corresponding new level will act as a pseudo-level, i.e., no chunking will take place along this dimension. If all new levels correspond
QueryDrivenFormBucketRegions
// Input:  query pattern table
// Result: bucket-regions formed at all chunking depths
{
  FOR EACH (global chunking depth value D) {
    Pick the first level in the importance order
    LOOP {
      Try to create as many fully-sized regions as possible that will
      favor this level (i.e., being perpendicular or parallel to the
      level according to its characterization as a restriction or
      grouping attribute, respectively).
      IF (there are more levels in the importance order AND there
          are more ungrouped chunk-trees to visit) {
        pick the next level from the order
      }
      ELSE {
        exit from loop
      }
    }
    IF (there are still ungrouped chunk-trees) {
      Traverse the chunk in a row/column-major style with the first
      level in the importance order being the fastest (slowest)
      running attribute if it is characterized as a grouping
      (restriction) attribute; then the second level in the
      importance order being the second fastest (slowest) running
      attribute if it is characterized as a grouping (restriction)
      attribute, and so on for all levels in the order; and try to
      pack into the same bucket as many trees as possible, until
      there are no more trees to visit.
    }
  }
}

Fig. 13 A bucket-region formation algorithm that is driven by query patterns
Fig. 14 Example of a large data chunk. [Figure: a data chunk at depth D = K (max depth) with chunk-id suffix 5|14; dimension A spans the grain-level order-codes 9-16 and dimension B the order-codes 29-34]
to pseudo-levels, i.e., n_i < m_i^g for all i : 1 ≤ i ≤ N, then we take m_i^g = maximum(n_i).
We will describe the above process with an example. Let us assume a bucket that can accommodate a maximum of M_r = 10 directory chunk entries or a maximum of M_e = 5 data chunk entries. In this case the data chunk of Fig. 14 is a large data chunk, as it cannot be stored in a single bucket. Therefore, we define a grid with m_1^g and m_2^g members along dimensions A and B, respectively. If the grid is normal, then m_1^g = m_2^g = ⌊√10⌋ = 3. Thus, we create a directory chunk which consists of 3 × 3 = 9 cells (i.e., directory chunk entries); this is depicted in Fig. 15.
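The grid computation of this example can be sketched as follows (an illustrative function name; the formula follows this sub-section, including the fallback m_g = max(n_i) when every new level would be a pseudo-level, and it ignores floating-point corner cases for perfect powers):

```python
from math import ceil, floor

def artificial_grid(M_r, n):
    """Normal grid for artificially chunking a large data chunk.
    M_r: max directory entries fitting in a bucket;
    n: number of grain members per dimension.
    Returns the members per dimension m_g and the maximum number of
    children assigned to each new member along each dimension."""
    N = len(n)
    m_g = floor(M_r ** (1.0 / N))    # so that m_g ** N <= M_r
    if all(n_i < m_g for n_i in n):  # every new level would be pseudo
        m_g = max(n)
    children = [ceil(n_i / m_g) for n_i in n]
    return m_g, children

# The example: M_r = 10 entries, 8 x 6 grain members (dims A, B)
m_g, c = artificial_grid(10, [8, 6])   # -> m_g = 3, c = [3, 2]
```

This reproduces the example's 3 × 3 grid, with at most 3 children per new member along dimension A and 2 along dimension B.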
In Fig. 15, we can also see the new values of each dimension and the corresponding parent-child relationships between the original values and the newly inserted ones. In this case, each new value will have at most c_1 ≤ ⌈8/3⌉ = 3 children for dimension A and c_2 ≤ ⌈6/3⌉ = 2 children for dimension B. The created directory chunk will have a global depth D = K and a local depth d = 0. Around it, we depict all the data chunks (partitions of the original data chunk) that correspond to each directory entry. Each such data chunk will have a global depth D = K and a local depth d = 1. The chunk-ids of the new data chunks include one more domain as a suffix, corresponding to the new chunking depth to which they belong. Notice that new empty chunks might arise from the artificial chunking process; for example, see the rightmost chunk at the top of Fig. 15. As no space will be allocated for such empty chunks, it is obvious that artificial chunking might lead to a reduction of the size of the original data chunk, especially for sparse data chunks. This important characteristic is stated in the following theorem, which shows that in the worst case the extra size overhead of the resulting 2-level tree will be equal to the size of a single bucket. However, as cubes are sparse, chunks will also be sparse, and therefore in practice the size of the tree will always be smaller than that of the original chunk.

Fig. 15 The example large data chunk artificially chunked. [Figure: a 3 × 3 grid directory chunk (global depth D = K, local depth d = 0) imposed on the chunk of Fig. 14, surrounded by the resulting data chunks at local depth d = 1, whose chunk-ids carry an additional suffix, e.g., ...5|14.0|0 through ...5|14.2|2; one grid cell is empty]
Theorem 3 (Size upper bound for an artificially chunked large data chunk) For any large data chunk Ch of size S_Ch, the two-level chunk-tree CT resulting from the application of the artificial chunking process on Ch will have a size S_CT such that

    S_CT ≤ S_Ch + S_B,   (4)

where S_B is the bucket size.

Proof Assume a large data chunk Ch which is 100% full. Then the application of artificial chunking will produce no empty chunks. Moreover, from the definition of chunking we know that if we connect these chunks back together we will get Ch. Consequently, the total size of these chunks is equal to S_Ch. Now, the root chunk of the new tree CT will have (by definition) at most M_r entries, so as to fit in a single bucket. Therefore, the extra size overhead caused by the root is at most S_B. From this we infer that S_CT ≤ S_Ch + S_B. Naturally, if this holds for the largest possible Ch, it will certainly hold for all other possible Chs that are not 100% full and thus may result in empty chunks after the artificial chunking. □
As soon as we create the 2-level chunk-tree, we have to solve the corresponding HPP chunk-to-bucket allocation sub-problem for this tree, i.e., we recursively call the GreedyPutChunksIntoBuckets algorithm with the root node of the new tree as input. The algorithm will then try to store the whole chunk-tree in a bucket (which is possible because, as explained above, artificial chunking reduces the size of the original chunk for sparse data chunks) or create the appropriate bucket-regions and store the root node in the root-bucket (see Fig. 7). It will also mark the empty directory entries. In Fig. 15, we can see the formed region assuming that the maximum number of data entries in a bucket is M_e = 5.
Finally, if there still exists a large data chunk that cannot fit by itself in a whole bucket, then we repeat the whole procedure and thus create some new data chunks at local depth d = 2. This procedure may continue until we finally store all parts of the original large data chunk.
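The recursion just described can be sketched as follows. This is an illustrative toy model, not the paper's implementation: a data chunk is a flat list of unit-size cell entries, and BUCKET_CAP and FANOUT are assumed parameters.

```python
# Illustrative sketch of recursive artificial chunking (assumed toy model:
# unit-size cells; BUCKET_CAP and FANOUT are hypothetical parameters).
BUCKET_CAP = 5   # a chunk "fits in a bucket" when it holds <= 5 entries
FANOUT = 4       # sub-chunks created per artificial chunking step

def artificial_chunking(cells, d=0):
    """Split an oversized data chunk into sub-chunks of increasing local
    depth d until every resulting chunk fits in a single bucket."""
    if len(cells) <= BUCKET_CAP:
        return [(d, cells)]                      # this piece is storable
    step = -(-len(cells) // FANOUT)              # ceil(len / FANOUT)
    pieces = [cells[i:i + step] for i in range(0, len(cells), step)]
    out = []
    for piece in pieces:                         # repeat at local depth d+1
        out.extend(artificial_chunking(piece, d + 1))
    return out

chunks = artificial_chunking(list(range(43)))
assert all(len(c) <= BUCKET_CAP for _, c in chunks)
```

Termination follows from the same argument as in the text: each recursive call works on a strictly smaller piece, so eventually every part of the original large data chunk fits in a bucket.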
5.4 Storage of the root-directory
In the previous subsections we formally defined the HPP chunk-to-bucket allocation problem. From this definition we have seen that the root-bucket B_R essentially represented the entire set of buckets that had a zero degree of hierarchical clustering HCD_B and, therefore, had no contribution to the hierarchical clustering achieved by a specific chunk-to-bucket allocation. Moreover, due to the factor λ in (3) (λ was defined as the number of fixed-size buckets in B_R), it is clear that the larger the root-bucket becomes, the worse the hierarchical
644 N. Karayannidis, T. Sellis
Fig. 16 An example of a root directory (root chunk and directory chunks at global depths D = 0 to 4 — the maximum D — and local depths d = −1, 0, 1; X marks a cell pointing to an empty sub-tree)
clustering achieved is. In this subsection, we present a method for improving the hierarchical clustering contribution of the root-bucket by reducing the factor λ, with the use of a main memory cache area, and also by increasing the HCD_B of the buckets in B_R.
In Fig. 16, we depict an example of a set of directory nodes that will be stored in the root-bucket. These are all directory chunks and are rooted all the way up to the root chunk of the whole cube. These chunks are of different global depths D and local depths d and form an unbalanced chunk-tree that we call the root directory.
Definition 12 (The root directory R_D) The root directory R_D of a hierarchically chunked cube C, represented by a chunk-tree CT, is an unbalanced chunk-tree with the following properties:

1. The root of R_D is the root node of CT.
2. For the set S_R of the nodes of R_D it holds that S_R ⊆ S_CT, where S_CT is the set of the nodes of CT.
3. All the nodes of R_D are directory chunks.
4. The leaves of the root directory contain entries that point to chunks stored in a different bucket than their own.
5. R_D is an empty tree iff the root node of CT is stored in the same bucket with its children nodes.
In Fig. 16, the empty cells correspond to sub-trees that have been allocated to some bucket, either on their own or together with other sub-trees (i.e., forming a bucket-region). We have omitted these links from the figure in order to avoid cluttering the picture. Also note the symbol "X" for cells pointing to an empty sub-tree. Beneath the dotted line we can see directory chunks that have resulted from the artificial chunking process described in the previous subsection.
The basic idea of the method that we describe next is based on the simple observation that imposing hierarchical clustering on the root directory, as if it were a chunk-tree in its own right, improves the evaluation of HPP queries, because every HPP query needs at some point to access a node of the root directory. Moreover, as the root directory always contains the root chunk of the whole chunk-tree as well as certain higher-level (i.e., lower-depth) directory chunk nodes, we can assume that these nodes are permanently resident in main memory during a query session on a cube. The latter is of course common practice for all index structures in databases.
The algorithm that we propose for the storage of the root directory is called StoreRootDir. It assumes that directory entries in the root directory pointing to already allocated sub-trees (empty cells in Fig. 16) are treated as pointers to empty trees, in the sense that their storage cost is not taken into account for the storage of the root directory. The algorithm receives as input the root directory R_D, a cache area of size S_M and a root-bucket B_R of size S_ROOT = λ·S_B, where λ ≥ 1 and λ = S_M/S_B (therefore S_ROOT = S_M), and produces a list of allocated buckets for the root directory; the details of the algorithm are shown in Fig. 17.
We begin from the root and visit in a breadth-first manner all nodes of R_D (lines 1–5). Each node we visit is stored in the root-bucket B_R, until we find a node that can no longer be accommodated. Then, for each of the remaining unallocated chunk sub-trees of R_D, we solve the corresponding HPP chunk-to-bucket sub-problem (lines 10–13). For the storage of the new root directories that might result from these sub-problems, we use the StoreRootDir algorithm again, but this time with a zero cache area size (lines 15–18).
From the above description we can see that the proposed algorithm uses the root-bucket only for storing the higher-level nodes that will be loaded in the cache. Therefore, the I/O overhead due to the root-bucket during the evaluation of an HPP query is zeroed. Furthermore, the chunk-to-bucket allocation solution of a cube is now augmented with an extra set of buckets resulting from the solutions to the new sub-problems from within StoreRootDir. The hierarchical clustering degree
Fig. 17 Recursive algorithm for storing the root directory

 0: StoreRootDir(R, S_M, S_B)
    //Input:  root R of the root directory, cache area size S_M,
    //        bucket size S_B, root-bucket B_R
    //Output: list of allocated buckets BuckList for the root directory,
    //        root-bucket B_R
 1: {
 2:   current node = R
 3:   WHILE (B_R can accommodate current node) {
 4:     store current node in root-bucket B_R
 5:     current node = next node in breadth-first order
 6:   }
 7:   IF (we have stored all nodes of the root directory) {
 8:     //whole root directory in cache (i.e., in root-bucket)
 9:     RETURN B_R
10:   }
11:   FOR each unallocated sub-tree ST {
12:     //solve sub-problem, update BuckList
13:     GreedyPutChunksIntoBuckets(root(ST), S_B)
14:     IF (the root directory of ST is not empty) {
15:       //make a recursive call with zero cache
16:       StoreRootDir(root(ST), 0, S_B)
17:     }
18:   }
19:   RETURN (B_R, BuckList)
20: }
HCD_B of these buckets is calculated based on the input chunk-tree of the specific sub-problem and not on the chunk-tree representing the whole cube. In the case where the former is an unbalanced tree, the maximum chunking depth D_MAX is calculated from the longest path from the root to a leaf.
Notice that for each such sub-problem a new root directory might arise. (In fact, the only case of an empty root directory is when the whole chunk sub-tree, upon which GreedyPutChunksIntoBuckets is called, fits in a single bucket.) Therefore, we solve each of these sub-problems by recursively using StoreRootDir, but this time with no available cache area. This makes StoreRootDir recursively invoke the GreedyPutChunksIntoBuckets algorithm, until all chunks of a sub-tree are allocated to a bucket. Recall from the previous sub-sections that the termination of the GreedyPutChunksIntoBuckets algorithm is guaranteed by the fact that each recursive call deals with a sub-problem of a smaller chunk-tree than the parent problem. Thus, the size of the input chunk-tree continuously decreases. Consequently, this also guarantees the termination of StoreRootDir.
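The interplay of the two algorithms can be sketched in Python as follows. This is a simplified model under assumed structures (a toy Chunk class with a size and children, unit-free sizes); bucket-region formation and the handling of the sub-problems' own root directories are collapsed into the stand-in greedy_put.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Chunk:                   # hypothetical node model: own size + children
    size: int
    children: list = field(default_factory=list)

def tree_size(c):
    return c.size + sum(tree_size(ch) for ch in c.children)

def greedy_put(chunk, S_B, buckets):
    """Stand-in for GreedyPutChunksIntoBuckets: a whole sub-tree goes into
    one bucket if it fits; otherwise its root is stored apart and we recurse
    on strictly smaller sub-trees (which guarantees termination)."""
    if tree_size(chunk) <= S_B:
        buckets.append([chunk])
        return
    buckets.append([chunk])     # leftover directory node, stored on its own
    for ch in chunk.children:
        greedy_put(ch, S_B, buckets)

def store_root_dir(root, S_M, S_B):
    """Toy StoreRootDir: fill the cache-resident root-bucket breadth-first,
    then solve a chunk-to-bucket sub-problem per unallocated sub-tree."""
    root_bucket, used = [], 0
    pending = deque([root])
    while pending and used + pending[0].size <= S_M:
        node = pending.popleft()
        root_bucket.append(node)
        used += node.size
        pending.extend(node.children)
    buckets = []
    for sub_tree in pending:
        greedy_put(sub_tree, S_B, buckets)
    return root_bucket, buckets

root = Chunk(2, [Chunk(3, [Chunk(4), Chunk(4)]), Chunk(3)])
cached, buckets = store_root_dir(root, S_M=5, S_B=10)
```

In this toy run the root and its first child fill the cache-resident root-bucket, and the three remaining sub-trees each fit in a bucket of their own.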
Note that the root directory is a very small fragment
of the overall cube data space. Thus, it is realistic to
assume that in most cases we can store the whole root
directory in the root-bucket and load it entirely in the
cache during querying. In this case, we can evaluate any
point HPP query with a single bucket I/O.
In the following we provide an upper bound for the size of the root directory. In order to compute this upper bound, we use the full chunk-tree resulting from the hierarchical chunking of a cube. A guaranteed upper bound for the size of the root directory is the size of all possible directory chunks of this tree. However, the root directory of the CUBE File is a significantly smaller version of the whole directory tree for the following reasons: (a) it does not contain all directory chunk nodes, only the ones that were not stored in a bucket along with their descendants, (b) space is not allocated for empty sub-trees and (c) chunks are stored in a compressed form, not wasting space for empty entries.
Lemma 1 For any cube C consisting of N dimensions, where each dimension has a hierarchy represented by a complete K-level m-way tree, the size of the root directory in terms of the number of directory entries is O(m^(N(K−2))).
Proof As the root directory is always (by its definition) smaller in size than the tree containing all the possible directory chunks (called the directory tree), we can write: the size of the root directory is O(size of directory tree). The size of the directory tree can easily be computed by the following series, which adds up the number of all possible directory entries:

Size of directory tree = 1 + m^N + m^(2N) + ... + m^((K−2)N) = O(m^(N(K−2))). □
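The series in the proof can be checked numerically; m, N and K below are assumed example values:

```python
# Numeric check of Lemma 1's series for assumed parameters m, N, K:
# the directory tree has 1 + m^N + m^(2N) + ... + m^((K-2)N) entries,
# a geometric series dominated by its last term m^(N(K-2)).
m, N, K = 3, 4, 5
series = sum(m**(i * N) for i in range(K - 1))   # terms i = 0 .. K-2
last_term = m**(N * (K - 2))
assert last_term <= series < 2 * last_term       # so series is O(m^(N(K-2)))
```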
Next we provide a theorem that proves an upper bound for the ratio between the size of the root directory and that of the full most-detailed data space of a cube.

Theorem 4 (Upper bound of the size ratio between the root directory and the cube's data space) For any cube C
Fig. 18 Resulting allocation of the running example cube for a bucket size S_B = 30 and a cache area equal to a single bucket (S_M = 30; buckets B_1–B_3, D_MAX = 5)
consisting of N dimensions, where each dimension has a hierarchy represented by a complete K-level m-way tree, the ratio of the root directory size to the full size of C's detailed data space (i.e., the Cartesian product of the cardinalities of the most detailed levels of all dimensions) is O(1/m^N).

Proof From the above lemma we have that the size of the root directory is O(m^(N(K−2))). Similarly we can prove that the size of C's most detailed data space is O(m^(N(K−1))). Therefore, the ratio

root directory size / cube most-detailed data space size = O(m^(N(K−2)) / m^(N(K−1))) = O(1/m^N). □
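A quick numeric illustration of the theorem, with assumed m and K:

```python
# Numeric illustration of Theorem 4 for assumed m and K: the ratio of the
# root-directory bound m^(N(K-2)) to the data-space size m^(N(K-1)) equals
# m^(-N), so every extra dimension divides it by another factor of m.
m, K = 4, 4
ratios = [m**(N * (K - 2)) / m**(N * (K - 1)) for N in range(1, 6)]
assert all(abs(r - m**(-N)) < 1e-15 for N, r in enumerate(ratios, start=1))
assert all(ratios[i + 1] == ratios[i] / m for i in range(len(ratios) - 1))
```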
Theorem 4 proves that as dimensionality increases, the ratio of the root directory size to the full cube size at the most detailed level decreases exponentially. Therefore, as N increases, the root directory size very quickly becomes negligible compared to the cube's data space.
If we go back to the allocated cube in Fig. 9 and assume a cache area of size equal to a single bucket, then the StoreRootDir algorithm will store the whole root directory in the root-bucket. In other words, the root directory can be fully accommodated in the cache area and therefore from (3), for K = 3 and λ = 0 (as the root-bucket will be loaded into memory, the factor λ is zeroed), we get an improved hierarchical clustering degree f_HC = 9.6%. The new allocation is depicted in Fig. 18. Notice that any point query can now be answered with a single bucket I/O.
If for the cube of our running example we assume a bucket size of S_B = 10, then the chunk-to-bucket allocation resulting from GreedyPutChunksIntoBuckets and the subsequent call to StoreRootDir is depicted in Fig. 19. In this case, we have once more assumed a cache area equal to a single bucket. In the figure, we can see the upper nodes allocated to the cache area (i.e., stored in the root-bucket) in a breadth-first way. The buckets B_1 to B_5 have resulted from the initial call to GreedyPutChunksIntoBuckets. Buckets B_6 and B_7 store the rest of the nodes of the root directory that could not be accommodated in the cache area and are a result of the call to the StoreRootDir algorithm. Finally, in Fig. 20, we present the corresponding allocation for a zero cache area.
This concludes the presentation of the data structures
and algorithms used to construct the CUBE File. We
move next to present detailed experimental evaluation
results.
6 Experimental evaluation
We have conducted an extensive set of experiments over our CUBE File implementation. The large set of experiments covers both the structural and the query evaluation aspects of the data structure. In addition, we wanted to compare the CUBE File with the UB-tree/MHC (which to our knowledge is the only multidimensional structure that achieves hierarchical clustering with the use of h-surrogates), both in terms of structural behavior and query evaluation time. The latter comparison yielded 7–9 times fewer I/Os on average, in
Fig. 19 Resulting allocation of the running example cube for a bucket size S_B = 10 and a cache area equal to a single bucket (S_M = 10; buckets B_1–B_7, D_MAX = 5)
Fig. 20 Resulting allocation of the running example cube for a bucket size S_B = 10 and a zero cache area (S_M = 0; buckets B_1–B_8, D_MAX = 5)
favor of the CUBE File for all workloads tested, and the former showed a 2–3 times lower storage cost for almost all data spaces, again in favor of the CUBE File, hence providing evidence that the CUBE File achieves a higher degree of hierarchical clustering of the data. These results appear in [18]. Note that the same comparison between the UB-tree/MHC and a bitmap-index-based star schema has shown a query evaluation speedup of 20–40 times on average (depending on whether the pre-grouping transformation optimization [40] is used; see [15] for more details).
The query performance measurements in [18] were based on HPP queries (see Definition 2) that resulted in a single or multiple disjoint query boxes (i.e., hyper-rectangles) at the grain level of the cube data space. Both hot and cold cache query evaluations were examined. In CUBE File parlance, this translates to a cached or non-cached root-bucket, respectively.
Fig. 21 Dimension hierarchy configuration for the experimental data sets

Dimension                D1    D2    D3    D4   D5    D6  D7    D8    D9
#Levels                  4     5     7     3    9     2   10    8     6
Grain level cardinality  2000  3125  6912  500  8748  36  7776  6561  4096
Our query-load consisted of various query classes with respect to the cube selectivity (i.e., how many data points were returned in the result set). The CUBE File performed multiple times fewer I/Os than the UB-tree, in all query classes, for both hot and cold cache experiments, exhibiting superior hierarchical clustering. For large-selectivity queries (i.e., many data in the result set), where the hierarchical restrictions were posed on higher hierarchy levels, the CUBE File needed 3 times fewer I/Os than the UB-tree. Interestingly, for small-selectivity queries, where the restrictions were posed on more detailed hierarchy levels, the difference in I/Os increased impressively (in favor of the CUBE File), reaching a factor larger than 10 for all relevant query classes and up to 37 in certain query classes.
Note that the most decisive factor for any HPP query to run fast (i.e., with few I/Os) is to achieve hierarchical clustering at all levels of the dimension hierarchies. This is most obvious in small-selectivity queries, where one has to achieve hierarchical clustering even at the most detailed levels of the hierarchy. For queries with small cube selectivities the UB-tree performance was worse and the hierarchical clustering effect was reduced. This is due to the way data are clustered into z-regions (i.e., disk pages) along the z-curve [2]. In contrast, the hierarchical chunking applied in the CUBE File creates groups of data (i.e., chunks) that belong to the same hierarchical family even at the most detailed levels. This, in combination with the chunk-to-bucket allocation, which guarantees that hierarchical families will be physically stored together, results in better hierarchical clustering of the cube even for the most detailed levels of the hierarchies.
In this paper, we want to present further experimental results that show the adaptation of the CUBE File structure to data spaces of varying characteristics, such as cube sparseness and number of total data points (i.e., scalability tests).
We have used synthetic data sets that were produced with an OLAP data generator that we developed. Our aim was to create data sets with a realistic number of dimensions and hierarchy levels. In Fig. 21, we present the hierarchy configuration for each dimension used in the experimental data sets. The shortest hierarchy consists of 2 levels, while the longest consists of 10 levels. We tried to give each data set a good mixture of hierarchy lengths. To evaluate the adaptation to sparse data spaces, we created cubes that were very sparse. Therefore, the number of input tuples was kept from a small to a moderate level. To simulate the cube data distribution, for each cube we created 10 hyper-rectangular regions as data point containers. These regions are defined randomly at the most detailed level of the cube and not by combination of hierarchy values (although this would be more realistic), in order not to favor the CUBE File due to the hierarchical chunking. We then filled each region with data points uniformly spread and tried to maintain the same number of data points in each region.
We have divided our experiments into two sets depending on the characteristic for which we wanted to analyze the CUBE File's behavior: (a) data space sparseness (SPARSE) and (b) input data point scalability (SCALE). Figure 22 shows the data set configuration for each series of experiments.
6.1 Adaptability to data space sparseness
We increase the dimensionality of the cube while maintaining the number of data points constant (100 K tuples); this way we essentially increase the cube sparseness. The cube sparseness is measured as the ratio of the actual cube data points to the product of the cardinalities of the dimension grain levels.
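For instance, the sparseness of the 9-dimensional data set can be computed from the grain-level cardinalities of Fig. 21; the tuple count of exactly 100,000 is our assumption, and the result lands at the order of magnitude (~7 × 10^-26) reported later for this configuration:

```python
# Cube sparseness = (#actual data points) / (product of grain-level
# cardinalities); cardinalities from Fig. 21, tuple count assumed = 100,000.
grain_cardinalities = [2000, 3125, 6912, 500, 8748, 36, 7776, 6561, 4096]

def sparseness(n_tuples, cards):
    space = 1
    for c in cards:
        space *= c                  # full most-detailed data space size
    return n_tuples / space

s = sparseness(100_000, grain_cardinalities)
assert 1e-26 < s < 1e-25            # ~7e-26: an extremely sparse cube
```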
The primary hypotheses that we aimed to prove experimentally were the following:

1. The CUBE File adapts perfectly to the extensive sparseness of the data space, and thus its size does not increase as the cube sparseness increases.
2. Hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.
3. The root-bucket size remains low compared to the CUBE File size, and thus it is feasible to cache it in main memory for realistic cases.

Additionally, we have drawn other interesting conclusions regarding the structure's behavior as sparseness increases.
In Fig. 23, we observe the data space size exploding exponentially as the number of dimensions increases. We can see that the data space size is many orders of magnitude larger than the CUBE File size. In addition, the CUBE File size is smaller than the input file containing the input tuples (i.e., fact values accompanied by their corresponding chunk-id, or equivalently h-surrogates) to be loaded into the CUBE File.
Fig. 22 Data set configuration for the four series of experiments

                            SPARSE                        SCALE
#Dimensions                 Varying                       5
#Tuples                     100,000                       Varying
#Facts                      1                             1
Maximum chunking depth      Depends on longest hierarchy  8
Bucket size (bytes)         8K                            8K
Bucket occupancy threshold  80%                           80%
Fig. 23 CUBE File size (in logarithmic scale) for increasing dimensionality (series: CUBE File size, data space size, input file size)
This is depicted more clearly in Fig. 25. There we can see that the total CUBE File size is smaller than that of the input data file, although the former maintains a whole tree structure of intermediate directory nodes; essentially, this is because the CUBE File does not allocate space for empty sub-trees and does not store the coordinates along with the measure values.
In the graph, we can see that the CUBE File size exceeds the input data file only after dimensionality goes beyond eight dimensions. The real cause in this case is the cube sparseness, which is magnified by the dimensionality increase. In our case, for nine dimensions and with 100,000 input data points, the sparseness has reached a value of 7.08 × 10^-26, which is an extreme case.
This clearly shows that the CUBE File:

1. Adapts to the large sparseness of the cube, allocating space comparable to the actual number of data points and not to all possible cells.
2. Achieves a compression of the input data, as it does not store the data point coordinates (i.e., the h-surrogate keys/chunk-ids) but only the measure values.
The last point is depicted more clearly in Fig. 24, where we present the compression achieved by the CUBE File organization as the cube sparseness increases. This compression is calculated as the ratio of the CUBE File size to the data space size (or the input file size), subtracted from 1. With respect to the data space size, the compression is always
Fig. 24 Compression achieved by the CUBE File as the cube sparseness increases (series: data space compression, input file compression)
100% for all depicted sparseness values. This is reasonable, as the CUBE File size is always many orders of magnitude smaller than the data space size. In addition, with respect to the input file, the compression remains high (above 50%) even for cubes with sparseness values down to 10^-20. This shows that for all practical cases of cubes the compression achieved is significant.
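The compression metric just described is simply one minus a size ratio; a minimal sketch with illustrative (assumed) numbers:

```python
# Compression as defined in the text: 1 - (CUBE File size / reference size),
# where the reference is either the data space size or the input file size.
def compression(cubefile_size, reference_size):
    return 1.0 - cubefile_size / reference_size

# e.g. a 4 MB CUBE File built from a 10 MB input file (illustrative numbers)
assert abs(compression(4, 10) - 0.6) < 1e-12    # 60% w.r.t. the input file
```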
It is worth noting that for the measurements presented in this report the CUBE File implementation does not impose any compression on the intermediate nodes (i.e., the directory chunks). Only the data chunks are compressed, by means of a bitmap representing the cell offsets (called the compression bitmap), which, however, is itself stored uncompressed. This was a deliberate choice, so as to evaluate the compression achieved merely by the pruning ability of our chunk-to-bucket allocation scheme, according to which no space is allocated for empty chunk-trees. Finally, another factor that reduces the achieved compression is that in our current implementation for each chunk we also store its chunk-id. This is due to a defensive design choice made in the early stages of the implementation, but it is not necessary for accessing the chunks, as chunk-ids are not used for associative search when accessing the CUBE File. Therefore, the following could improve the compression ratio even further:

1. Compression of directory chunks
2. Removal of chunk-ids from chunks
3. Compression of bitmaps (e.g., with run-length encoding)
In addition, in Fig. 25 we depict the root-bucket size and the chunk-tree size. The root-bucket grows in a similar way to the CUBE File; however, its size is always one or two orders of magnitude smaller. We will return shortly to the root-bucket. The chunk-tree denotes the chunk-tree representation of the cube, i.e., it is the sum of the sizes of all the chunks comprising the chunk-tree. Interestingly, we observe that as dimensionality increases (i.e., cube sparseness increases) the size of
Fig. 25 Several sizes for increasing cube sparseness (via increase
in dimensionality)
the chunk-tree exceeds that of the CUBE File. This seems rather strange, as one would expect the CUBE File size, which includes the storage overhead of the buckets, to be greater. The explanation lies in the existence of large data chunks. The chunk-tree representation may include large data chunks, which in the chunk-to-bucket allocation process will be artificially chunked. However, in sparse data spaces, these large data chunks are also very sparse and most of their size cost is due to the compression bitmap. When such a sparse chunk is artificially chunked, its size is significantly reduced due to the pruning ability of the allocation algorithm. Therefore, in sparse cube data spaces, artificial chunking provides substantial compression as a side effect. Figure 26 also verifies the existence of many large data chunks in highly sparse data spaces.
In Fig. 26, we depict the chunk distribution as dimensionality increases. Note that the number of chunks depicted is the number of "real" chunks that will eventually be stored in buckets, and not the number of "possible" chunks deriving from the hierarchical chunking process. One interesting result that can be drawn from this graph is that an increase in dimensionality does not necessarily mean an increase in the total number of chunks. In fact, we observe this metric decreasing as dimensionality increases, reaching a minimum point when dimensionality becomes 7. One would expect the opposite, as the number of chunks at each depth generated by hierarchical chunking equals the product of the dimension cardinalities at the corresponding levels. The explanation here lies again in the pruning ability of our method. This shows that although the number of "possible" chunks increases, the number of "real" chunks might decrease for certain data distributions and hierarchy configurations. Again, this provides evidence that the CUBE File adapts well to the sparseness of the data space.
Another interesting result is that very soon (from dimensionality 5 and above) the total number of directory chunks exceeds the total number of data chunks. This leads us to the conclusion that a compression of the directory chunks (which, as we have mentioned above, has not been implemented in our current version) is indeed meaningful and might provide significant compression.
Finally, we observe an increase in the number of large data chunks. This is an implementation effect and not a data structure characteristic. As we have already noted, the current chunk implementation leaves the compression bitmap uncompressed. As the space becomes sparser, these large data chunks are essentially almost empty data chunks with a very large compression bitmap, which is almost filled with 0s. Of course, a more efficient storage of this bitmap (even with a simple run-length encoding scheme) would eliminate this effect and these data chunks would not appear as large.
The existence of many large data chunks in high dimensionalities also explains the fact that the number of root-directory chunks (i.e., the chunks that will be stored mainly in the root-bucket, but also in simple buckets if the latter overflows) exceeds the total number of directory chunks. This is because the total number of directory chunks appearing in the graph does not include the directory chunks arising from the artificial chunking of large data chunks, which were not created initially by the hierarchical chunking process but dynamically during the chunk-to-bucket allocation phase.
In Fig. 27, we depict the relative size of the root-bucket with respect to the total CUBE File size. In particular, we can see the ratio of the root-bucket size to the total CUBE File size for continuously increasing values of the cube sparseness. It is clear that even for extremely sparse cubes with sparseness values down to 10^-18, the total root-bucket size remains less than 20% of the total CUBE File size. For all realistic cases this ratio is below 5%. Once more, the remarks mentioned above regarding compression hold for this case too, i.e., in our experiments no compression has been imposed on the root-bucket chunks, other than the pruning of empty regions.
Finally, we have measured the achieved hierarchical clustering for increasing cube sparseness. In Fig. 28, we depict f_HC values that have been normalized to the range [0,1]. We can observe in this figure that the f_HC values vary from one end of the curve to the other by only about 70%, while the cube sparseness varies by 20 orders of magnitude. Thus, the hierarchical clustering factor is essentially not affected by the cube sparseness increase, and the CUBE File manages to maintain a high quality of hierarchical clustering even for extremely sparse data spaces.
Fig. 26 The distribution of chunks for increasing cube sparseness (via increase in dimensionality) (series: total #chunks, #directory chunks, #data chunks, #large data chunks, #root-directory chunks)
Fig. 27 Relative growth of the size of the root-bucket as the cube sparseness becomes greater (root-bucket size / CUBE File size vs. sparseness)
We recapitulate the main conclusions drawn regarding the CUBE File's behavior in conditions of increasing sparseness:

1. It adapts to the large sparseness of the cube, allocating space comparable to the actual number of data points and not to all possible cells.
2. Moreover, it achieves more than 50% compression of the input data for all realistic cases.
3. In sparse cube data spaces, artificial chunking provides substantial compression as a side effect, due to the existence of many large data chunks.
4. An increase in dimensionality does not necessarily mean an increase in the total number of chunks for the CUBE File. The "possible" chunks indeed increase, but the CUBE File stores only those that are non-empty.
5. Compression of directory chunks in data spaces of large dimensionality is likely to yield significant storage savings.
6. The root-bucket size remains less than 20% of the total CUBE File size even for extremely sparse cubes. For more realistic cases of sparseness the size is below 5%. Thus, caching the root-bucket (or at least a significant part of it) in main memory is indeed feasible.
7. The hierarchical clustering factor is essentially not affected by the cube sparseness increase, and the CUBE File manages to maintain a high quality of hierarchical clustering even for extremely sparse data spaces.
6.2 Scalability
This series of experiments aimed at evaluating the scalability of the CUBE File. To this end we increased the number of input tuples while maintaining a fixed set of five dimensions (D1–D5 in Fig. 21). However, we kept the maximum number of tuples at a moderate level (1 million rows) to maintain the large sparseness of the cube, which is more realistic. The primary hypotheses that we aimed to prove with this set of experiments were the following:

1. The CUBE File is scalable (its size remains lower than that of the input file as the number of input data points increases).
2. Hierarchical clustering achieved remains of high quality as the number of input data points increases.
3. The root-bucket size remains low compared to the CUBE File size, and thus it is feasible to cache it in main memory for realistic cases.
The first and the third hypotheses can be confirmed directly from Fig. 29. In this figure, we can see the CUBE File size remaining smaller than the input file for all data sets. We can also see the difference between the CUBE File size and that of the root-bucket becoming larger. Thus, as tuple cardinality increases, the root-bucket becomes a continually smaller fraction of the CUBE File. Finally, we can see the chunk-tree size being
Fig. 28 The hierarchical clustering factor f_HC as cube sparseness increases (normalized HC factor vs. sparseness)
Fig. 29 CUBE File size as the number of cube data points
increases
very close to the CUBE File size, which demonstrates
the high space utilization achieved by the CUBE File.
More interestingly, in Fig. 30, we depict the compression achieved by the CUBE File as the number of cube data points increases. With respect to the data space size, the compression is constantly 100%. With respect to the input data file, the compression becomes high (around 70%) very soon and maintains this high compression rate for all tuple cardinalities. In fact, the compression seems to reach a maximum value and then remain almost constant; thus both sizes increase at the same rate. This is clear evidence that the CUBE File utilizes space efficiently. It saves a significant portion of storage by discarding the dimension foreign keys of each tuple (i.e., the chunk-ids or h-surrogates) and then retains this size difference by increasing proportionally to the number of input tuples.
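The two compression percentages in Fig. 30 are simply size reductions relative to a reference size. A minimal sketch of how such figures are derived, with made-up sizes rather than the measured values of the experiments:

```python
def compression(reference_size: float, cube_file_size: float) -> float:
    """Fraction of the reference size that the CUBE File saves;
    values near 1.0 mean the CUBE File is negligible in comparison."""
    return 1.0 - cube_file_size / reference_size

# Hypothetical sizes in MB (illustration only, not the measured values).
data_space_size = 4.0e9   # the full, mostly empty multidimensional space
input_file_size = 100.0   # the tuple-based input fact table
cube_file_size = 30.0     # the resulting CUBE File

# Versus the astronomically large data space: effectively 100%.
print(f"{compression(data_space_size, cube_file_size):.0%}")
# Versus the input file: around the 70% reported in the experiments.
print(f"{compression(input_file_size, cube_file_size):.0%}")
```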
Figure 31 depicts the decrease of the ratio of the root-bucket size to the CUBE File size as the number of input tuples increases. It shows that for realistic tuple cardinalities the root-bucket size becomes negligible compared to the CUBE File size. Therefore, for realistic cube sizes (>1,000 Ktuples) the root-bucket size is below 5% of the CUBE File size and it could be cached in main memory. Finally, we observe a super-linear decrease of the ratio
Fig. 30 The compression achieved by the CUBE File as the number of cube data points increases (data space compression and input file compression, for tuple cardinalities from 1,003 up to 1,142,130)

Fig. 31 The ratio of the root-bucket size to the CUBE File size for increasing tuple cardinality
in the number of input tuples, which further confirms our previous statement.
In Fig. 32, we depict the distribution of buckets with different contents as the number of input tuples increases. Observe that as the space becomes gradually denser and more data points fill up the empty regions, more chunk-subtrees are created and thus the number of bucket-region buckets increases rapidly. This is a very welcome result: the more bucket-regions are formed, the better the hierarchical clustering of the chunk-to-bucket allocation becomes.
This last point is further exhibited in Fig. 33. We depict the normalized values of the hierarchical clustering factor for each data set. We can clearly see that
Fig. 32 The distribution of buckets as tuple cardinality increases (number of single-tree, bucket-region, single-chunk, and root-bucket buckets)

Fig. 33 The hierarchical clustering factor f_HC as tuple cardinality increases (normalized HC factor)
the hierarchical clustering quality remains high for all data sets. In particular, the experiments show that the hierarchical clustering factor remains at approximately 0.7 (i.e., 70% of the best value achieved) even when the tuple cardinality was increased by three orders of magnitude. This essentially proves the second hypothesis that we posed in the beginning of this subsection.
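The normalization in Fig. 33 is, as the text implies, division by the best factor achieved across the data sets. A small sketch with hypothetical raw factors (the measured values are not reproduced here):

```python
# Hypothetical raw hierarchical clustering factors, one per data set,
# ordered by increasing tuple cardinality (illustration only).
raw_factors = [0.92, 0.88, 0.81, 0.74, 0.70, 0.66, 0.64]

best = max(raw_factors)
normalized = [f / best for f in raw_factors]

# The best data set maps to 1.0, and even the largest data set stays
# near 0.7, mirroring the roughly 70%-of-best behavior in the text.
print([round(v, 2) for v in normalized])
```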
7 Summary and conclusions
In this paper, we tried to solve the problem of devising physical clustering schemes for multidimensional data that are organized in hierarchies. A typical case of such data is the OLAP cube. The problem of clustering the most detailed data of a cube on disk, so as to reduce the I/Os during the evaluation of hierarchy-selective queries, is difficult due to the enormous search space of possible solutions. Instead of following the typical approach of finding a linear ordering of the data points, we introduced a representation of the search space (i.e., a model) that is based on a hierarchical chunking method and results in a chunk-tree representation of the cube. Then we coped with the problem as a packing problem, in particular, packing of chunks into buckets.
The chunk-tree representation is a very effective model of the cube data space, because it prunes all empty areas (i.e., chunk-trees) and adapts perfectly to the usual extensive sparseness of the cube. Moreover, by traversing the chunk-tree nodes we can very efficiently access subsets of the data space that are based on hierarchy value combinations. This makes the chunk-tree an excellent index for queries with hierarchical restrictions.
In order to be able to evaluate the solutions to the proposed problem we defined a quality metric, namely the hierarchical clustering factor f_HC of a cube. Furthermore, we formally defined the problem as an optimization problem and proved that it is NP-Hard via a reduction from the bin packing problem. We proposed as a solution an effective greedy algorithm that requires a single pass over the input fact table and linear time in the number of chunks. Moreover, we analyzed and provided solutions for a number of sub-problems, such as the formation of bucket-regions, the storage of large data chunks and the storage of the root directory. The whole solution leads to the construction of the CUBE File data structure.
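A rough sketch of the flavor of such a greedy chunk-to-bucket allocation follows. This is an illustration only, not the paper's actual algorithm (which maximizes the hierarchical clustering degree of each bucket): a subtree that fits in a bucket is stored whole, while an oversized subtree is split by recursing on its children, leaving the intermediate chunk to the root directory.

```python
from dataclasses import dataclass, field
from typing import List

BUCKET_SIZE = 8192  # bytes; hypothetical fixed bucket capacity


@dataclass
class ChunkTree:
    size: int                                  # bytes of this chunk alone
    children: List["ChunkTree"] = field(default_factory=list)

    def total_size(self) -> int:
        return self.size + sum(c.total_size() for c in self.children)


def pack(tree: ChunkTree, buckets: List[int]) -> None:
    """Store a whole subtree in one bucket when it fits; otherwise
    recurse on the children (the intermediate chunk itself would go
    to the root directory). An oversized leaf stands in here for the
    paper's separate large-data-chunk storage scheme."""
    if tree.total_size() <= BUCKET_SIZE or not tree.children:
        buckets.append(tree.total_size())
        return
    for child in tree.children:
        pack(child, buckets)


# Hypothetical two-level chunk-tree: a root directory chunk with
# three subtrees, one of which does not fit in a single bucket.
root = ChunkTree(100, [
    ChunkTree(4000, [ChunkTree(3000)]),  # fits whole: 7000 <= 8192
    ChunkTree(9000),                     # oversized leaf chunk
    ChunkTree(2000),                     # fits whole
])
buckets: List[int] = []
pack(root, buckets)
print(buckets)
```

A single recursive pass over the chunk-tree like this touches every chunk once, which is where the linear-time behavior of a greedy allocation comes from.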
We presented an extensive set of experiments analyzing the structural behavior of the CUBE File in terms of increasing sparseness and data point scalability. Our experimental results have confirmed our principal hypotheses that:
1. The CUBE File adapts perfectly to even the most extremely sparse data spaces, yielding significant space savings. Furthermore, the hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.
2. The CUBE File is scalable (its size remained constantly about 70% smaller than that of the input tuple-based file, for all input data point cardinalities tested). In addition, the hierarchical clustering achieved remains of high quality when the number of input data points increases.
3. The root-bucket size remained low (below 5%) compared to the total CUBE File size for all realistic cases of sparseness and data point cardinality, and thus caching it in main memory is a feasible proposal. This results in a single-I/O evaluation of point queries and reduces I/Os dramatically for all types of hierarchy-selective queries [18].
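The single-I/O claim for point queries follows directly from the caching argument: with the root directory resident in memory, the descent to the target data chunk costs nothing, leaving one bucket read. Schematically, under a hypothetical cost model of one read per uncached chunk-tree level:

```python
def point_query_ios(root_directory_cached: bool, tree_depth: int) -> int:
    """I/O cost of a point query under a simple, hypothetical model:
    the in-memory descent through a cached root directory is free and
    only the final data bucket is fetched; without caching, each of
    the tree's levels may cost one bucket read."""
    return 1 if root_directory_cached else tree_depth


print(point_query_ios(True, 4))   # cached root directory
print(point_query_ios(False, 4))  # no caching
```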
All in all, the CUBE File is an effective data structure for physically organizing and indexing the most detailed data of an OLAP cube. One area where such a structure could be used successfully is as an alternative to bitmap-index based processing of star-join queries. To this end, an efficient processing framework has been proposed in [15]. More generally, it can be used as an effective index for any data that are accessed through multidimensional queries with hierarchical restrictions.
An interesting enhancement to the CUBE File would be to incorporate more workload-specific knowledge in its chunk-to-bucket allocation algorithm. For example, the allocation of more frequently accessed sub-trees in the same bucket should be rewarded with a higher HCD_B value, etc. We are also investigating the use of the hierarchical clustering factor for making decisions during the construction of other common storage organizations (e.g., partitioned heap files, B-trees, etc.) in order to achieve hierarchical clustering of the data. The interested reader can find more information regarding other aspects of the CUBE File not covered in this paper (e.g., the updating and maintenance operations), as well as information on a prototype implementation of a CUBE File based DBMS, in [16].
Acknowledgements We would like to thank our colleagues Yannis Kouvaras and Yannis Roussos from the Knowledge and Database Systems Laboratory at the N.T.U. Athens for their fruitful comments and their support in the implementation of the CUBE File and the completion of the experimental evaluation. We would also like to thank Aris Tsois for his detailed reviewing and commenting on the first draft. This work has been partially funded by the European Union's Information Society Technologies Programme (IST) under project EDITH (IST-1999-20722).
References
1. Bayer, R., McCreight, E.: Organization and maintenance of large ordered indexes. Acta Inf. 1, 173–189 (1972)
2. Bayer, R.: The universal B-tree for multi-dimensional indexing: general concepts. In: WWCA 1997
3. Chan, C.Y., Ioannidis, Y.: Bitmap index design and evaluation. In: SIGMOD 1998
4. Chaudhuri, S., Dayal, U.: An overview of data warehousing and OLAP technology. SIGMOD Rec. 26(1), 65–74 (1997)
5. Deshpande, P.M., Ramasamy, K., Shukla, A., Naughton, J.: Caching multidimensional queries using chunks. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 259–270, 1998
6. Fagin, R., Nievergelt, J., Pippenger, N., Strong, H.R.: Extendible hashing: a fast access method for dynamic files. TODS 4(3), 315–344 (1979)
7. Faloutsos, C., Rong, Y.: DOT: a spatial access method using fractals. In: ICDE 1991, pp. 152–159
8. Gaede, V., Günther, O.: Multidimensional access methods. ACM Comput. Surv. 30(2), 170–231 (1998)
9. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: a relational aggregation operator generalizing group-by, cross-tab, and subtotal. In: ICDE 1996
10. Gupta, A., Mumick, I.S.: Maintenance of materialized views: problems, techniques, and applications. Data Eng. Bull. 18(2), 3–18 (1995)
11. Harinarayan, V., Rajaraman, A., Ullman, J.D.: Implementing data cubes efficiently. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 205–227, 1996
12. Hinrichs, K.: Implementation of the grid file: design concepts and experience. BIT 25(4), 569–592 (1985)
13. Jagadish, H.V.: Linear clustering of objects with multiple attributes. In: SIGMOD Conference, pp. 332–342, 1990
14. Jagadish, H.V., Lakshmanan, L.V.S., Srivastava, D.: Snakes and sandwiches: optimal clustering strategies for a data warehouse. In: SIGMOD Conference, pp. 37–48, 1999
15. Karayannidis, N., et al.: Processing star-queries on hierarchically-clustered fact-tables. In: VLDB 2002
16. Karayannidis, N.: Storage structures, query processing and implementation of on-line analytical processing systems. Ph.D. Thesis, National Technical University of Athens, 2003. Available at: http://www.dblab.ece.ntua.gr/nikos/thesis/PhD_thesis_en.pdf
17. Karayannidis, N., Sellis, T.: SISYPHUS: the implementation of a chunk-based storage manager for OLAP data cubes. Data Knowl. Eng. 45(2), 155–188 (2003)
18. Karayannidis, N., Sellis, T., Kouvaras, Y.: CUBE File: a file structure for hierarchically clustered OLAP cubes. In: 9th International Conference on Extending Database Technology (EDBT), Heraklion, Crete, Greece, 14–18 March 2004, pp. 621–638, 2004
19. Kotidis, Y., Roussopoulos, N.: An alternative storage organization for ROLAP aggregate views based on cubetrees. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 249–258, 1998
20. Lakshmanan, L.V.S., Pei, J., Han, J.: Quotient cube: how to summarize the semantics of a data cube. In: VLDB 2002
21. Lakshmanan, L.V.S., Pei, J., Zhao, Y.: QC-Trees: an efficient summary structure for semantic OLAP. In: SIGMOD 2003
22. Markl, V., Ramsak, F., Bayer, R.: Improving OLAP performance by multidimensional hierarchical clustering. In: IDEAS 1999
23. Nievergelt, J., Hinterberger, H., Sevcik, K.C.: The grid file: an adaptable, symmetric multikey file structure. TODS 9(1), 38–71 (1984)
24. OLAP Report: Database explosion. Available at: http://www.olapreport.com/DatabaseExplosion.htm, 1999
25. O'Neil, P.E., Graefe, G.: Multi-table joins through bitmapped join indices. SIGMOD Rec. 24(3), 8–11 (1995)
26. O'Neil, P.E., Quass, D.: Improved query performance with variant indexes. In: SIGMOD 1997
27. Orenstein, J.A., Merrett, T.H.: A class of data structures for associative searching. In: PODS, pp. 181–190, 1984
28. Padmanabhan, S., Bhattacharjee, B., Malkemus, T., Cranston, L., Huras, M.: Multi-dimensional clustering: a new data layout scheme in DB2. In: SIGMOD Conference, pp. 637–641, 2003
29. Pieringer, R., et al.: Combining hierarchy encoding and pre-grouping: intelligent grouping in star join processing. In: ICDE 2003
30. Ramsak, F., Markl, V., Fenk, R., Zirkel, M., Elhardt, K., Bayer, R.: Integrating the UB-tree into a database system kernel. In: VLDB, pp. 263–272, 2000
31. Régnier, M.: Analysis of grid file algorithms. BIT 25(2), 335–357 (1985)
32. Roussopoulos, N.: Materialized views and data warehouses. SIGMOD Rec. 27(1), 21–26 (1998)
33. Sagan, H.: Space-Filling Curves. Springer, Berlin Heidelberg New York (1994)
34. Sarawagi, S.: Indexing OLAP data. Data Eng. Bull. 20(1), 36–43 (1997)
35. Sarawagi, S., Stonebraker, M.: Efficient organization of large multidimensional arrays. In: Proceedings of the 11th International Conference on Data Engineering, pp. 326–336, 1994
36. Sismanis, Y., Deligiannakis, A., Roussopoulos, N., Kotidis, Y.: Dwarf: shrinking the PetaCube. In: SIGMOD 2002
37. Srivastava, D., Dar, S., Jagadish, H.V., Levy, A.Y.: Answering queries with aggregation using views. In: VLDB Conference, pp. 318–329, 1996
38. Stöhr, T., Märtens, H., Rahm, E.: Multi-dimensional database allocation for parallel data warehouses. In: VLDB, pp. 273–284, 2000
39. The TransBase HyperCube relational database system. Available at: http://www.transaction.de, 2005
40. Tsois, A., Sellis, T.: The generalized pre-grouping transformation: aggregate-query optimization in the presence of dependencies. In: VLDB 2003
41. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB, pp. 194–205, 1998
42. Weiss, M.A.: Data Structures and Algorithm Analysis, pp. 351–359. Benjamin/Cummings Publishing, Redwood City (1995)
43. Whang, K.-Y., Krishnamurthy, R.: The multilevel grid file: a dynamic hierarchical multidimensional file structure. In: DASFAA, pp. 449–459, 1991