DOI 10.1007/s00778-006-0022-1
REGULAR PAPER
Hierarchical clustering for OLAP: the CUBE File approach
Nikos Karayannidis · Timos Sellis
Received: 6 September 2005 / Accepted: 13 April 2006 / Published online: 7 September 2006
Springer-Verlag 2006
Abstract This paper deals with the problem of phys-
ical clustering of multidimensional data that are orga-
nized in hierarchies on disk in a hierarchy-preserving
manner. This is called hierarchical clustering. A typi-
cal case, where hierarchical clustering is necessary for
reducing I/Os during query evaluation, is the most
detailed data of an OLAP cube. The presence of hierar-
chies in the multidimensional space results in an enor-
mous search space for this problem. We propose a
representation of the data space that results in a chunk-
tree representation of the cube. The model is adaptive
to the cube's extensive sparseness and provides efficient
access to subsets of data based on hierarchy value combi-
nations. Based on this representation of the search space
we formulate the problem as a chunk-to-bucket alloca-
tion problem, which is a packing problem as opposed to
the linear ordering approach followed in the literature.
We propose a metric to evaluate the quality of hier-
archical clustering achieved (i.e., evaluate the solutions
to the problem) and formulate the problem as an opti-
mization problem. We prove its NP-Hardness and pro-
vide an effective solution based on a linear time greedy
algorithm. The solution of this problem leads to the con-
struction of the CUBE File data structure. We analyze in
depth all steps of the construction and provide solutions
Communicated by P-L. Lions.
N. Karayannidis (✉) · T. Sellis
Institute of Communication and Computer Systems
and School of Electrical and Computer Engineering,
National Technical University of Athens,
Zographou 15773, Athens, Greece
e-mail: nikos@dblab.ece.ntua.gr
T. Sellis
e-mail: timos@dblab.ece.ntua.gr
for interesting sub-problems arising, such as the forma-
tion of bucket-regions, the storage of large data chunks
and the caching of the upper nodes (root directory) in
main memory.
Finally, we provide an extensive experimental evalu-
ation of the CUBE File's adaptability to the data space
sparseness as well as to an increasing number of data
points. The main result is that the CUBE File is highly
adaptive to even the most sparse data spaces and for
realistic cases of data point cardinalities provides hier-
archical clustering of high quality and significant space
savings.
Keywords Hierarchical clustering · OLAP · CUBE
File · Data cube · Physical data clustering
1 Introduction
Efficient processing of ad hoc OLAP queries is a very
difficult task considering, on the one hand, the native
complexity of typical OLAP queries, which potentially
combine huge amounts of data, and, on the other, the fact
that no a priori knowledge for queries exists and thus
no pre-computation of results or other query-specific
tuning can be exploited. The only way to evaluate these
queries is to access directly the most detailed data in an
efficient way. It is exactly this need to access detailed
data based on hierarchy criteria that calls for the hierar-
chical clustering of data. This paper discusses the phys-
ical clustering of OLAP cube data points on disk in
a hierarchy-preserving manner, where hierarchies are
defined along dimensions (hierarchical clustering).
The problem addressed is set out as follows: we are
given a large fact table (FT) containing only grain-level
(most detailed) data. We assume that this is part of the
star schema in a dimensional data warehouse. Therefore,
data points (i.e., tuples in the FT) are organized by a set
of N dimensions. We further assume that each dimen-
sion is organized in a hierarchy. Typically the data dis-
tribution is extremely skewed. In particular, the OLAP
cube is extremely sparse and data tend to appear in
arbitrary clusters along some dimension. These clus-
ters correspond to specific combinations of the hierar-
chy values for which there exist actual data (e.g., sales
for a specific product category in a specific geographic
region for a specific period of time). The problem is
on the one hand to store the fact table data in a hier-
archy-preserving manner so as to reduce I/Os during
the evaluation of ad hoc queries containing restrictions
and/or groupings on the dimension hierarchies and, on
the other, to enable navigation in the multilevel-multi-
dimensional data space by providing direct access (i.e.,
indexing) to subsets of data via hierarchical restrictions.
The latter implies that index nodes must also be hier-
archically clustered if we are aiming at a reduced
I/O cost.
Some of the most interesting proposals [20, 21, 36] in
the literature for cube data structures deal with the com-
putation and storage of the data cube operator [9]. These
methods omit a significant aspect in OLAP, which is
that usually dimensions are not flat but are organized
in hierarchies of different aggregation levels (e.g., store,
city, area, country is such a hierarchy for a Location
dimension). The most popular approach for organizing
the most detailed data of a cube is the so-called star
schema. In this case the cube data are stored in a rela-
tional table, called the fact table. Furthermore, various
indexing schemes have been developed [3, 15, 25, 26], in
order to speed up the evaluation of the join of the central
(and usually very large) fact table with the surrounding
dimension tables (also known as a star-join). However,
even when elaborate indexes are used, due to the arbi-
trary ordering of the fact table tuples, there might be
as many I/Os as there are tuples resulting from the fact
table.
We propose the CUBE File data structure as an effec-
tive solution to the hierarchical clustering problem set
above. The CUBE File multidimensional data structure
[18] clusters data into buckets (i.e., disk pages) with
respect to the dimension hierarchies aiming at the hier-
archical clustering of the data. Buckets may include both
intermediate (index) nodes (directory chunks), as well
as leaf (data) nodes (data chunks). The primary goal of
a CUBE File is to cluster in the same bucket a family
of data (i.e., data corresponding to all hierarchy value
combinations for all dimensions) so as to reduce the
bucket accesses during query evaluation.
Experimental results in [18] have shown that the
CUBE File outperforms the UB-tree/MHC [22], which
is another effective method for hierarchically clustering
the cube, resulting in 7–9 times less I/Os on average for
all workloads tested. This simply means that the CUBE
File achieves a higher degree of hierarchical clustering
of the data. More interestingly, in [15] it was shown that
the UB-tree/MHC technique outperformed the tradi-
tional bitmap-index-based star-join by a factor of 20–
40, which simply proves that hierarchical clustering is
the most determinant factor for a file organization for
OLAP cube data, in order to reduce I/O cost.
To tackle this problem we first model the cube data
space as a hierarchy of chunks. This model, called the
chunk-tree representation of a cube, copes effectively
with the vast data sparseness by truncating empty areas.
Moreover, it provides a multiple resolution view of the
data space where one can zoom in or zoom out to spe-
cific areas, navigating along the dimension hierarchies.
The CUBE File is built by allocating the nodes of the
chunk-tree into buckets in a hierarchy-preserving man-
ner. This way we depart from the common approach for
solving the hierarchical clustering problem, which is to
find a total ordering of the data points (linear cluster-
ing) and cope with it as a packing problem, namely a
chunk-to-bucket packing problem.
In order to solve the chunk-to-bucket packing prob-
lem, we need to be able to evaluate the hierarchical
clustering achieved (i.e., evaluate the solutions to this
problem). Thus, inspired by the chunk-tree represen-
tation of the cube, we define a hierarchical clustering
quality metric, called the hierarchical clustering factor.
We use this metric to evaluate the quality of the chunk-
to-bucket allocation. Moreover, we exploit it in order
to formulate the CUBE File construction problem as
an optimization problem, which we call the chunk-to-
bucket allocation problem. We formally dene this prob-
lem and prove that it is NP-Hard. Then, we propose a
heuristic algorithm as a solution that requires a single
pass over the input fact table and linear time in the num-
ber of chunks.
In the course of solving this problem several inter-
esting sub-problems arise. We define the sub-problem
of chunk-region formation, which deals with the clus-
tering of chunk-trees hanging from the same parent
node in order to increase further the overall hierarchi-
cal clustering. We propose two algorithms as a solution,
one of which is driven by workload patterns. Next, we
deal with the sub-problem of storing large data chunks
(i.e., chunks that do not fit in a single bucket), as well
as with the sub-problem of storing the so-called root
directory of the CUBE File (i.e., the upper nodes of the
data structure).
Finally, we study the CUBE File's effective adapta-
tion to several cube data spaces by presenting a set of
experimental measurements that we have conducted.
All in all, the contributions of this paper are outlined
as follows:
– We provide an analytic solution to the problem of hierarchical clustering of an OLAP cube. The solution leads to the construction of the CUBE File data structure.
– We model the multilevel-multidimensional data space of the cube as a chunk-tree. This representation of the data space adapts perfectly to the extensive data sparseness and provides a multi-resolution view of the data with respect to the hierarchies. Moreover, if viewed as an index, it provides direct access to cube data via hierarchical restrictions, which results in significant speedups of typical ad hoc OLAP queries.
– We transform the hierarchical clustering problem from a linear clustering problem into a chunk-to-bucket allocation (i.e., packing) problem, which we formally define and prove to be NP-Hard.
– We introduce a hierarchical clustering quality metric for evaluating the hierarchical clustering achieved (i.e., evaluating the solution to the problem in question). We provide an efficient solution to this problem as well as to all sub-problems that stem from it, such as the storage of large data chunks or the formation of bucket-regions.
– We provide an experimental evaluation which leads to the following basic results:
  o The CUBE File adapts perfectly to even the most extremely sparse data spaces, yielding significant space savings. Furthermore, the hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.
  o The CUBE File is scalable for any realistic number of input data points. In addition, the hierarchical clustering achieved remains of high quality when the number of input data points increases.
  o The root directory can be cached in main memory, providing a single-I/O cost for the evaluation of point queries.
The rest of this paper is organized as follows. Section 2
discusses related work and positions the CUBE File
in the space of cube storage structures. Section 3 pro-
poses the chunk-tree representation of the cube as an
effective representation of the search space. Section 4
introduces a quality metric for the evaluation of
hierarchical clustering. Section 5 formally defines the
problem of hierarchical clustering, proves its NP-Hard-
ness and then delves into the nuts and bolts of building
the CUBE File. Section 6 presents our extensive experi-
mental evaluation, and Sect. 7 recapitulates and empha-
sizes the main conclusions drawn.
2 Related work
2.1 The linear clustering problem for multidimensional
data
The linear clustering problem for multidimensional data
is defined as the problem of finding a linear order-
ing of records indexed on multiple attributes, to be
stored in consecutive disk blocks, such that the I/O cost
for the evaluation of queries is minimized. The cluster-
ing of multidimensional data has been studied in terms
of finding a mapping of the multidimensional space
to a one-dimensional space. This approach has been
explored mainly in two directions: (a) in order to apply
traditional one-dimensional indexing techniques to a
multidimensional index space (a typical example is the
UB-tree [2], which exploits a z-ordering of multidimen-
sional data [27], so that these can be stored into a one-
dimensional B-tree index [1]), and (b) for ordering
buckets containing records that have been indexed on
multiple attributes, to minimize the disk access effort.
For example, a grid file [23] exploits a multidimensional
grid in order to provide a mapping between grid cells
and disk blocks. One could find a linear ordering of
these cells, and therefore an ordering of the underlying
buckets, such that the evaluation of a query entails more
sequential bucket reads than random bucket accesses.
To this end, space-filling curves (see [33] for a survey)
have been used extensively. For example, Jagadish [13]
provides a linear clustering method based on the Hilbert
curve that outperforms previously proposed mappings.
Note, however, that all linear clustering methods are
inferior to a simple scan in high dimensional spaces. This
is due to the notorious dimensionality curse [41], which
states that clustering in such spaces becomes meaning-
less due to lack of useful distance metrics.
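
To make the z-ordering [27] mentioned above concrete, the following minimal Python sketch computes a z-value (Morton code) by bit interleaving; the function name and the fixed bit width are our own illustrative choices, not the UB-tree's actual address calculation:

def z_value(x: int, y: int, bits: int = 16) -> int:
    # interleave the bits of x and y (x on the even positions,
    # y on the odd ones); nearby (x, y) points often, but not
    # always, receive nearby z-values
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

print(z_value(2, 3))  # x = 10b, y = 11b -> z = 1110b = 14

Sorting records by such z-values and storing them in a B-tree is exactly the kind of one-dimensional mapping discussed above.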
In the presence of dimension hierarchies the multidi-
mensional clustering problem becomes combinatorially
explosive. Jagadish et al. [14] try to solve the problem of
finding an optimal linear clustering of records of a fact
table on disk, given a specific workload in the form of a
probability distribution over query classes. The authors
propose a subclass of clustering methods called lattice
paths, which are paths on the lattice defined by the
hierarchy level combinations of the dimensions. The
HPP chunk-to-bucket allocation problem (in Sect. 3.2
we provide a formal definition of HPP restrictions and
queries) is a different problem for the following reasons:
1. It tries to find an optimal way (in terms of reduced
I/O cost during query evaluation) to pack the data
into buckets, rather than order the data linearly. The
problem of finding an optimal linear ordering of
the buckets, for a specic workload, so as to reduce
random bucket reads, is an orthogonal problem and
therefore, the methods proposed in [14] could be
used additionally.
2. Apart from the data, it also deals with the inter-
mediate node entries (i.e., directory chunk entries),
which provide clustering at a whole-index level and
not only at the index-leaf level. In other words, index
data are also clustered along with the real data.
As we know that there is no linear clustering of
records that will permit all queries over a multidimen-
sional space to be answered efficiently [14], we strongly
advocate that linear clustering of buckets (inter-bucket
clustering) must be exploited in conjunction with an
efficient allocation of records into buckets (intra-bucket
clustering).
Furthermore, in [22], a path-based encoding of dimen-
sion data, similar to our encoding scheme, is exploited
in order to achieve linear clustering of multidimensional
data with hierarchies, through a z-ordering [27]. The
authors use the UB-tree [2] as an index on top of the
linearly clustered records. This technique has the advan-
tage of transforming typical star-join [25] queries to
multidimensional range queries, which are computed
more efficiently due to the underlying multidimensional
index.
However, this technique suffers from the inherent
deficiencies of the z space-filling curve, which is not the
best space-filling curve according to [7, 13]. On the other
hand, it is very easy to compute and thus straightforward
to implement the technique even for high dimensional-
ities. A typical example of such a deficiency is that in the
z-curve there is a dispersion of certain data points, which
are close in the multidimensional space but not close in
the linear order, and the opposite, i.e., distant data points
are clustered in the linear space. The latter also results
in an inefficient evaluation of multiple disjoint query
regions, due to the repetitive retrieval of the same pages
for many queries. Finally, the benefits of z-based linear
clustering start to disappear quite soon as dimensional-
ity increases, practically even when dimensionality gets
over 4–5 dimensions.
2.2 Grid file based multidimensional access methods
The CUBE File organization was initially inspired by
the grid file organization [23], which can be viewed as
the multidimensional counterpart of extendible hashing
[6]. The grid file superimposes a d-dimensional orthog-
onal grid on the multidimensional space. Given that the
grid is not necessarily regular, the resulting cells may be
of different shapes and sizes. A grid directory associates
one or more of these cells with data buckets, which are
stored in one disk page each. Each cell is associated with
one bucket, but a bucket may contain several adjacent
cells, therefore bucket regions may be formed.
To ensure that data items are always found with no
more than two disk accesses for exact match queries,
the grid itself is kept in main memory represented by
d one-dimensional arrays called scales. The grid file is
intended for dynamic insert/delete operations, therefore
it supports operations for splitting and merging direc-
tory cells. A well-known problem of the grid file is that
it suffers from a superlinear growth of the directory even
for data that are uniformly distributed [31]. One basic
reason for this is that splitting is not a local operation and
thus can lead to superlinear directory growth. Moreover,
depending on the implementation of the grid directory,
merging may require a complete directory scan [12].
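
To illustrate the structure just described, here is a toy Python sketch, under assumptions of our own (a two-dimensional grid, a directory kept as a plain dictionary), of how the in-memory scales map a point to a cell and the cell to a bucket, so that an exact-match query needs at most a directory access plus a bucket access:

import bisect

scales = [[10, 20, 30],    # split points along dimension 0
          [100, 200]]      # split points along dimension 1
directory = {}             # maps a cell (index tuple) to a bucket address

def cell_of(point):
    # one binary search per dimension over the in-memory scales
    return tuple(bisect.bisect_right(s, v) for s, v in zip(scales, point))

directory[cell_of((15, 150))] = 'bucket_7'
print(directory[cell_of((15, 150))])   # -> bucket_7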
Hinrichs [12] attempts to overcome the shortcomings
of the grid file by introducing a 2-level grid directory.
In this scheme, the grid directory is now stored on disk
and a scaled-down version of it (called root directory)
is kept in main memory to ensure the two-disk access
principle still holds. Furthermore, he discusses efficient
implementations of the split, merge and neighborhood
operations. In a similar manner, Whang and Krishna-
murthy [43] extend the idea of a 2-level directory to a
multilevel directory, introducing the multilevel grid file,
achieving a linear directory growth in the number of
records. There exist more grid-file-based organizations.
A comprehensive survey of these and multidimensional
access methods in general can be found in [8].
An obvious distinction of the CUBE File organiza-
tion from the above multidimensional access methods
is that it has been designed to fulfill completely differ-
ent requirements, namely those of an OLAP environ-
ment and not of a transaction-oriented one. A CUBE
File is designed for an initial bulk loading and then a
read-only operation mode, in contrast, to the dynamic
insert/delete/update workload of a grid file. Moreover,
a CUBE File aims at speeding up queries on multidi-
mensional data with hierarchies and exploits hierarchi-
cal clustering to this end. Furthermore, as the dimension
domain in OLAP is known a priori the directory does
not have to grow dynamically. In addition, changes to
the directory are rare, as dimension data do not change
very often (compared to the rate of change for the cube
data), and also deletions are seldom, therefore split and
merge operations are not needed so much. Nevertheless,
it is more important to adapt well to the native sparseness
of a cube data space and to efficiently support incremen-
tal updating, so as to minimize the updating window and
cube query-down time, which are critical factors in busi-
ness intelligence applications nowadays.
2.3 Taxonomy of cube primary organizations
The set of reported methods in the literature for primary
organizations for the storage of cubes is quite confined.
We believe that this is basically due to two reasons:
first of all, the generally held view is that a cube is
a set of pre-computed aggregated results and thus the
main focus has been to devise efficient ways to compute
these results [11], as well as to choose which ones to
compute for a specic workload (view selection/main-
tenance problem [10, 32, 37]). Kotidis and Roussopoulos
[19] proposed a storage organization based on packed
R-trees for storing these aggregated results. We believe
that this is a one-sided view of the problem as it dis-
regards the fact that very often, especially for ad hoc
queries, there will be a need for drilling down to the
most detailed data in order to compute a result from
scratch. Ad hoc queries represent the essence of OLAP,
and in contrast to report queries, are not known a pri-
ori and thus cannot really benefit from pre-computa-
tion. The only way to process them efficiently is to
enable fast retrieval of the base data. This calls for
an effective primary storage organization for the most
detailed data (grain level) of the cube. This argument
is of course based on the fact that a full pre-compu-
tation of all possible aggregates is prohibitive due to
the consequent size explosion, especially for sparse
cubes [24].
The second reason that makes people reluctant to
work on new primary organizations for cubes is their
adherence to relational systems. Although this seems
justified, one could pinpoint that a relational table (e.g.,
a fact table of a star schema [4]) is a logical entity and thus
should be separated from the physical method chosen
for implementing it. Therefore, one can use apart from
a paged record file, also a B-tree.

[...]

The hierarchical clustering degree of a bucket B, HCD_B, is then defined as

HCD_B = \frac{\sum_{i=1}^{T} c_r^i}{\sum_{i=1}^{T} c_d^i} \, O_B = \frac{T c_r}{T c_d} \, O_B = \frac{c_r}{c_d} \, O_B ,   (1)

where c_r^i is the region contribution of tree t_i and c_d^i is the depth contribution of tree t_i (1 ≤ i ≤ T). (Note that, as bucket regions have been defined as consisting of equi-depth trees, all trees of a bucket have the same region contribution as well as the same depth contribution.)
In this definition, we have assumed that the chunking depth d_i of a chunk-tree t_i is equal to the chunking depth of the root-chunk of this tree. Of course, we assume that a normalization of the depth values has taken place, so that the depth of the chunk-tree CT is 1 instead of 0, in order to avoid having zero depths in the denominator of (1). Furthermore, data chunks are considered as chunk-trees with a depth equal to the maximum chunking depth of the cube. Note that directory chunks stored in a bucket, not as part of a sub-tree but isolated, have a zero region contribution; therefore, buckets that contain only such directory chunks have a zero degree of hierarchical clustering.
From (1), we can see that the more sub-trees, instead of single chunks, are included in a bucket, the greater the hierarchical clustering degree of the bucket becomes, because more HPP restrictions can be evaluated solely within this bucket. Also, the taller these trees are (i.e., the smaller their chunking depth is), the greater the hierarchical clustering degree of the bucket becomes, as more combinations of hierarchical attributes are covered by this bucket. Moreover, the more trees of the same depth hanging under the same parent node we have stored in a bucket, the greater the hierarchical clustering degree of the bucket, as we include more combinations of the same path in the hierarchy.
All in all, the HCD_B metric favors the following storage choices for a bucket:

– Whole trees instead of single chunks or other data partitions
– Smaller-depth trees instead of greater-depth ones
– Tree regions instead of single trees
– Regions with a few low-depth trees instead of ones with more trees of greater depth
– Regions with trees of the same depth that are close in the multidimensional space instead of dispersed trees
– Buckets with a high occupancy
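
As a small illustration of formula (1), the following Python sketch computes the HCD_B of a bucket holding equi-depth trees; the function name and the example numbers are ours:

def hcd_b(region_contribs, depth_contribs, occupancy):
    # HCD_B = (sum of region contributions) /
    #         (sum of depth contributions) * O_B, per formula (1)
    return sum(region_contribs) / sum(depth_contribs) * occupancy

# a bucket with T = 2 equi-depth trees, each with region
# contribution 0.5 and depth contribution 0.6, filled to 80%
print(hcd_b([0.5, 0.5], [0.6, 0.6], 0.8))   # (1.0 / 1.2) * 0.8 = 0.67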
We prove the following theorem regarding the maximum value of the hierarchical clustering degree of a bucket:

Theorem 1 (Theorem of maximum hierarchical clustering degree of a bucket) Assume a hierarchically chunked cube represented by a chunk-tree CT of maximum chunking depth D_MAX, which has been allocated to a set of buckets. Then, for any such bucket B, it holds that

HCD_B ≤ D_MAX.
Proof From the definition of the region contribution of a tree appearing in Definition 4, we can easily deduce that

c_r^i ≤ 1.   (I)

This means that the following holds:

\sum_{i=1}^{T} c_r^i ≤ T.   (II)

In (II), T stands for the number of trees stored in B. Similarly, from the definition of the depth contribution of a tree appearing in Definition 5, we can easily deduce that

c_d^i ≥ 1/D_MAX,   (III)

as the smallest possible depth value is 1. This means that the following holds:

\sum_{i=1}^{T} c_d^i ≥ T/D_MAX.   (IV)

From (II), (IV) and (1), and assuming that B is filled to its capacity (i.e., O_B equals 1), we get HCD_B ≤ T/(T/D_MAX) = D_MAX, and the theorem is proved. □
It is easy to see that the maximum degree of hierarchical clustering of a bucket B is achieved only in the ideal case where we store the chunk-tree CT that represents the whole cube in B and CT fits exactly in B.² In this case, all our primary goals for a good hierarchical clustering, posed in the beginning of this chapter, such as the efficient evaluation of HPP queries, the low storage cost and the high space utilization, are achieved. This is because all possible HPP restrictions can be evaluated with a single bucket read (one I/O operation) and the achieved space utilization is maximal (full bucket) with a minimal storage cost (just one bucket). Moreover, it is now clear that the hierarchical clustering degree of a bucket signifies to what extent the chunk-tree representing the cube has been packed into the specific bucket, and this is measured in terms of the chunking depth of the tree.
By trying to create buckets with a high HCD_B we can guarantee that our allocation respects these elements of good hierarchical clustering. Furthermore, it is now straightforward to define a metric for evaluating the overall hierarchical clustering achieved by a chunk-to-bucket allocation strategy:
Definition 7 (Hierarchical clustering factor of a physical organization for a cube, f_HC) For a physical organization that stores the data of a cube into a set of N_B buckets, we define as the hierarchical clustering factor f_HC the percentage of hierarchical clustering achieved by this storage organization, as this results from the hierarchical clustering degree of each individual bucket divided by the total number of buckets, and we write:

f_HC = \frac{\sum_{B=1}^{N_B} HCD_B}{N_B \, D_{MAX}} .   (2)
² Indeed, a bucket with HCD_B = D_MAX would mean that the depth contribution of each tree in this bucket should be equal to 1/D_MAX (according to inequality (III)); however, this is only possible for the whole chunk-tree CT, as this alone has a depth equal to 1.
Note that N_B is the total number of buckets used in order to store the cube; however, only the buckets that contain at least one whole chunk-tree have a non-zero HCD_B value. Therefore, allocations that spend more buckets for storing sub-trees have a higher hierarchical clustering factor than others which favor, e.g., single-directory-chunk allocations. From (2), it is clear that even if we have two different allocations of a cube that result in the same total HCD_B over the individual buckets, the one that occupies the smaller number of buckets will have the greater f_HC, rewarding in this way the allocations that use the available space more conservatively.
Another way of viewing the f_HC is as the average HCD_B over all the buckets divided by the maximum chunking depth. It is now clear that it expresses the extent to which the chunk-tree representing the whole cube has been packed into the set of the N_B buckets, and thus 0 ≤ f_HC ≤ 1. It follows directly from Theorem 1 that this factor is maximized (i.e., equals 1) if and only if we store the whole cube (i.e., the chunk-tree CT) in a single bucket, which corresponds to a perfect hierarchical clustering for a cube.
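
A corresponding one-line Python sketch of formula (2), again with illustrative numbers of our own; buckets that hold no whole chunk-tree contribute a zero HCD_B:

def f_hc(hcd_values, d_max):
    # f_HC = sum(HCD_B) / (N_B * D_MAX), per formula (2)
    return sum(hcd_values) / (len(hcd_values) * d_max)

# five buckets, two of which store only isolated directory chunks
print(f_hc([4.0, 2.5, 1.5, 0.0, 0.0], d_max=5))   # 8.0 / 25 = 0.32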
In the next section we exploit the hierarchical clustering factor f_HC in order to define the chunk-to-bucket allocation problem as an optimization problem. Furthermore, we exploit the hierarchical clustering degree of a bucket, HCD_B, as an evaluation criterion in the greedy strategy that we propose for solving this problem, in order to decide how close we are to an optimal solution.
5 Building the CUBE File
In this section we formally define the chunk-to-bucket
allocation problem as an optimization problem. We
prove that it is NP-Hard and provide a heuristic algo-
rithm as a solution. In the course of solving this problem
several interesting sub-problems arise. We tackle each
one in a separate subsection.
5.1 The HPP chunk-to-bucket allocation problem
The chunk-to-bucket allocation problem is defined as follows:

Definition 8 (The HPP chunk-to-bucket allocation problem) For a cube C, represented by a chunk-tree CT with a maximum chunking depth of D_MAX, find an allocation of the chunks of CT into a set of fixed-size buckets that corresponds to a maximum hierarchical clustering factor f_HC.
We assume the following: the storage cost of any chunk-tree t equals cost(t), the number of sub-trees per depth d in CT equals treeNum(d), and the size of a bucket equals S_B. Finally, we are given a bucket of special size S_ROOT, consisting of λ consecutive simple buckets, called the root-bucket B_R, where S_ROOT = λ · S_B, with λ ≥ 1. Essentially, B_R represents the set of buckets that contain no whole sub-trees and thus have a zero HCD_B.
The solution S for this problem consists of a set of K buckets, S = {B_1, B_2, ..., B_K}, so that each bucket contains at least one sub-tree of CT, and a root-bucket B_R that contains all the rest of CT (the part with no whole sub-trees). S must result in a maximum value of the f_HC factor for the given bucket size S_B. As the HCD_B values of the buckets of the root-bucket B_R equal zero (recall that they contain no whole sub-trees), following from (2), f_HC can be expressed as

f_HC = \frac{\sum_{B=1}^{K} HCD_B}{(K + λ) \, D_{MAX}} .   (3)
From (3), it is clear that the more buckets we allocate for the root-bucket (i.e., the greater λ becomes), the lower the degree of hierarchical clustering achieved by our allocation. Alternatively, if we consider caching the whole root-bucket in main memory (see the following discussion), then we could assume that λ does not affect hierarchical clustering (as it does not introduce more bucket I/Os from the root-chunk to a simple bucket) and λ could be zeroed.
In Fig. 5, we depict four different chunk-to-bucket allocations for the same chunk-tree. The maximum chunking depth is D_MAX = 5, although in the figure we can see the nodes only up to depth D = 3 (i.e., the triangles correspond to sub-trees of three levels). The numbers inside each node represent the storage cost of the corresponding sub-tree, e.g., the whole chunk-tree has a cost of 65 units. Assume a bucket size of S_B = 30 units. Below each figure we depict the calculated f_HC, and beside it we note the percentage with respect to the best f_HC that can be achieved for this bucket size (i.e., f_HC/f_HCmax × 100%). The chunk-to-bucket allocation that yields the maximum f_HC can be identified easily by exhaustive search in this simple case. Observe how the f_HC improves gradually as we move from Fig. 5a to d.

In Fig. 5a we have failed to create any bucket-regions at depth D = 2; thus each bucket stores a single sub-tree of depth 3. Note also that the occupancy of most buckets is quite low. In Fig. 5b the hierarchical clustering improves, as some bucket-regions have been formed: buckets B1, B3 and B4 each store two sub-trees of depth 3. In Fig. 5c the total number of buckets decreases by one, as a large bucket-region of four sub-trees has been formed in bucket B3. Finally, in Fig. 5d we have managed to store in bucket B3 a higher-level (i.e., lower-depth) sub-tree (i.e., a sub-tree of depth 2). This increases the hierarchical clustering achieved even more, compared to the previous case (Fig. 5c), because the root node is included in the same bucket as the four sub-trees. In addition, the bucket occupancy of B3 is increased.

It is clear now from this simple example that the hierarchical clustering factor f_HC rewards the allocations that manage to store lower-depth sub-trees in buckets, that store regions of sub-trees instead of single sub-trees, and that create highly occupied buckets. The individual calculations of this example can be seen in Fig. 6.
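
The numbers in Fig. 6 can be reproduced mechanically. The following Python sketch recomputes the Fig. 5d allocation with formulas (1) and (3), copying the (c_r, c_d, O_B) triples from Fig. 6:

def hcd_b(c_r, c_d, o_b):
    return c_r / c_d * o_b        # formula (1) for equi-depth regions

buckets_d = [(0.29, 0.6, 1.00),   # B1 -> HCD_B ~ 0.48
             (0.14, 0.6, 0.17),   # B2 -> HCD_B ~ 0.04
             (0.50, 0.4, 0.73)]   # B3 -> HCD_B ~ 0.92

K, lam, D_MAX = 3, 1, 5           # three buckets plus one root-bucket
f_hc = sum(hcd_b(*b) for b in buckets_d) / ((K + lam) * D_MAX)  # formula (3)
print(round(f_hc, 2))             # -> 0.07, matching Fig. 5d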
All in all, it is obvious that we now have the optimization problem of finding a chunk-to-bucket allocation such that f_HC is maximized. This problem is NP-Hard, which results from the following theorem.

Theorem 2 (Complexity of the HPP chunk-to-bucket allocation problem) The HPP chunk-to-bucket allocation problem is NP-Hard.
Proof Assume a typical bin packing problem [42], where we are given N items with weights w_i, i = 1, ..., N, respectively, and a bin size B such that w_i ≤ B for all i = 1, ..., N. The problem is to find a packing of the items in the fewest possible bins. Assume that we create N chunks of depth d and dimensionality D, so that chunk c_1 has a storage cost of w_1, chunk c_2 has a storage cost of w_2, and so on. Also assume that N − 1 of these chunks are under the same parent chunk (e.g., the Nth chunk). This way we have created a two-level chunk-tree, where the root lies at depth d = 0 and the leaves at depth d = 1. Also assume that a bin and a bucket are equivalent terms. Now we have reduced in polynomial time the bin packing problem to an HPP chunk-to-bucket allocation problem, which is to find an allocation of the chunks into buckets of size B such that the achieved hierarchical clustering factor f_HC is maximized.
[Fig. 5 The hierarchical clustering factor f_HC of the same chunk-tree (D_MAX = 5, node costs shown, bucket size S_B = 30) for four different chunk-to-bucket allocations: (a) f_HC = 0.01 (14%), using buckets B1–B7; (b) f_HC = 0.03 (42%), using B1–B4; (c) f_HC = 0.05 (69%), using B1–B3; (d) f_HC = 0.07 (100%), using B1–B3.]

As all the chunk-trees (i.e., single chunks in our case) are of the same depth, the depth contribution c_d^i (1 ≤ i ≤ N), defined in (1), is the same for all chunk-trees. Therefore, in order to maximize the degree of hierarchical clustering HCD_B for each individual bucket (and thus also increase the hierarchical clustering factor f_HC), we have to maximize the region contribution c_r^i (1 ≤ i ≤ N) of each chunk-tree (1). This occurs when we pack into each bucket as many trees as possible on the one hand, and, due to the region proximity factor r_P, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to the f_HC definition, the number of buckets used must be the smallest possible. If we assume that the chunk dimensions have no inherent ordering, then there is no notion of spatial proximity within the trees of the same region and the region proximity factor equals 1 for all
possible regions (see also the related discussion in the following subsection).

In this case the only factor that can maximize the HCD_B of each bucket, and consequently the overall f_HC, is to minimize empty space within each bucket [i.e., maximize bucket occupancy in (1)] and to use as few buckets as possible by packing the largest number of trees into each bucket. These are exactly the goals of the original bin packing problem, and thus a solution to the bin packing problem is also a solution to the HPP chunk-to-bucket allocation problem and vice versa.

As the bin packing problem can be reduced in polynomial time to the HPP chunk-to-bucket problem, any problem in NP can be reduced in polynomial time to the HPP chunk-to-bucket problem. Furthermore, in the general case (where we have chunk-trees of varying depths and dimensions have inherent orderings) it is not easy to find a polynomial-time verifier for a solution to the HPP chunk-to-bucket problem, as the maximum f_HC that can be achieved is not known (as it is in the bin packing problem, where the minimum number of bins can be computed by a simple division of the total weight of the items by the size of a bin). Thus the problem is NP-Hard. □
We proceed next by providing a greedy algorithm based on heuristics for solving the HPP chunk-to-bucket allocation problem in linear time. The algorithm utilizes the hierarchical clustering degree of a bucket as a criterion in order to evaluate at each step how close we are to an optimal solution. In particular, it traverses the chunk-tree in a top-down, depth-first manner, adopting the greedy approach that if at each step we create a bucket with a maximum value of HCD_B, then overall the acquired hierarchical clustering factor will be maximal. Intuitively, by trying to pack the available buckets with low-depth trees (i.e., the tallest trees) first (thus the top-to-bottom traversal), we can ensure that we have not missed the chance to create the best HCD_B buckets possible.

Fig. 6 The individual calculations of the example in Fig. 5 (S_B = 30, D_MAX = 5, λ = 1 for all allocations):

Allocation | Bucket | c_r  | c_d | O_B  | HCD_B | K | f_HC | f_HC/f_HCmax (%)
Fig. 5d    | B1     | 0.29 | 0.6 | 1.00 | 0.48  | 3 | 0.07 | 100%
           | B2     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B3     | 0.50 | 0.4 | 0.73 | 0.92  |   |      |
Fig. 5c    | B1     | 0.29 | 0.6 | 1.00 | 0.48  | 3 | 0.05 | 69%
           | B2     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B3     | 0.57 | 0.6 | 0.50 | 0.48  |   |      |
Fig. 5b    | B1     | 0.29 | 0.6 | 1.00 | 0.48  | 4 | 0.03 | 42%
           | B2     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B3     | 0.29 | 0.6 | 0.33 | 0.16  |   |      |
           | B4     | 0.29 | 0.6 | 0.17 | 0.08  |   |      |
Fig. 5a    | B1     | 0.14 | 0.6 | 0.33 | 0.08  | 7 | 0.01 | 14%
           | B2     | 0.14 | 0.6 | 0.67 | 0.16  |   |      |
           | B3     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B4     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B5     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B6     | 0.14 | 0.6 | 0.10 | 0.02  |   |      |
           | B7     | 0.14 | 0.6 | 0.07 | 0.02  |   |      |
In Fig. 7, we present the GreedyPutChunksIntoBuckets algorithm, which receives as input the root R of a chunk-tree CT and the fixed size S_B of a bucket. The output of this algorithm is a set of buckets containing at least one whole chunk-tree each, a directory chunk entry pointing at the root chunk R, and the root-bucket B_R.
In each step the algorithm tries greedily to make an allocation decision that will maximize the HCD_B of the current bucket. For example, in lines 2–7 of Fig. 7, the algorithm tries to store the whole input tree in a single bucket, thus aiming at a maximum degree of hierarchical clustering for the corresponding bucket. If this fails, then it allocates the root R to the root-bucket and tries to achieve a maximum HCD_B by allocating the sub-trees at the next depth, i.e., the children of R (lines 9–26). This essentially is achieved by including all direct children sub-trees with size less than (or equal to) the size of a bucket (S_B) in a list of candidate trees for inclusion in bucket regions (buckRegion) (lines 14–16). Then the routine formBucketRegions is called upon this list and tries to include the corresponding trees in a minimum set of buckets, by forming the bucket regions to be stored in each bucket, so that each one achieves the maximum possible HCD_B (lines 19–22). We will come back to this routine and discuss how it solves this problem in the next sub-section. Finally, for the children sub-trees of root R with a storage cost greater than the size of a bucket, we recursively solve the corresponding HPP chunk-to-bucket allocation sub-problem for each one of them (lines 23–26). This of course corresponds to a depth-first traversal of the input chunk-tree.

Very important is also the fact that no space is allocated for empty sub-trees (lines 11–13); only a special entry is inserted in the parent node to denote a NULL sub-tree. Therefore, the allocation performed by the greedy algorithm adapts perfectly to the data
distribution, coping effectively with the native sparseness of the cube.

GreedyPutChunksIntoBuckets(R, S_B)
// Input:  root R of a chunk-tree CT, bucket size S_B
// Output: updated R, list of allocated buckets BuckList,
//         root-bucket B_R, directory entry dirEnt pointing at R
 0: {
 1:   List buckRegion   // bucket-region candidates list
 2:   IF (cost(CT) < S_B) {
 3:     Allocate new bucket B_n
 4:     Store CT in B_n
 5:     dirEnt = addressOf(R)
 6:     RETURN
 7:   }
 8:   // R will be stored in the root-bucket B_R
 9:   IF (R is a directory chunk) {
10:     FOR EACH child sub-tree CT_C of R {
11:       IF (CT_C is empty) {
12:         Mark the corresponding entry of R with an empty tag
13:       }
14:       IF (cost(CT_C) <= S_B) {
15:         // insert CT_C into the list of bucket-region candidates
16:         buckRegion.push(CT_C)
17:       }
18:     }
19:     IF (buckRegion != empty) {
20:       // form the bucket-regions
21:       formBucketRegions(buckRegion, BuckList, R)
22:     }
23:     WHILE (there is a child CT_C : cost(CT_C) > S_B) {
24:       GreedyPutChunksIntoBuckets(root(CT_C), S_B)
25:       Update the corresponding entry of R for CT_C
26:     }
27:     Store R in the root-bucket B_R
28:     dirEnt = addressOf(R)
29:   }
30:   ELSE {   // R is a data chunk and cost(R) > S_B
31:     Artificially chunk R, creating a 2-level chunk-tree CT_A
32:     GreedyPutChunksIntoBuckets(root(CT_A), S_B)
33:     // storage of R is taken care of by the previous call
34:     dirEnt = addressOf(root(CT_A))
35:   }
36:   RETURN
37: }

Fig. 7 A greedy algorithm for the HPP chunk-to-bucket allocation problem
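
For concreteness, a compact Python rendering of the algorithm of Fig. 7 follows. It is a sketch under simplifying assumptions of our own: chunk-trees are plain nested nodes with precomputed costs, empty sub-trees are simply skipped, the large-data-chunk branch (lines 30–35) is omitted, and formBucketRegions is stubbed with first-fit packing (its best-fit version is the subject of Sect. 5.2):

class Node:
    def __init__(self, cost, children=()):
        self.cost, self.children = cost, list(children)

def greedy_put(root, s_b, buckets, root_bucket):
    if root.cost < s_b:              # lines 2-7: the whole tree fits
        buckets.append([root])
        return
    root_bucket.append(root)         # the root chunk goes to the root-bucket
    small = [c for c in root.children if 0 < c.cost <= s_b]
    form_bucket_regions(small, s_b, buckets)
    for c in root.children:          # lines 23-26: recurse on oversized children
        if c.cost > s_b:
            greedy_put(c, s_b, buckets, root_bucket)

def form_bucket_regions(trees, s_b, buckets):
    # placeholder: first-fit packing of the sibling trees into fresh buckets
    regions = []
    for t in trees:
        for r in regions:
            if sum(x.cost for x in r) + t.cost <= s_b:
                r.append(t)
                break
        else:
            regions.append([t])
    buckets.extend(regions)

# a small chunk-tree with illustrative sub-tree costs, S_B = 30
tree = Node(65, [Node(40, [Node(10), Node(20), Node(5)]), Node(22)])
buckets, root_bucket = [], []
greedy_put(tree, 30, buckets, root_bucket)
print(len(buckets), len(root_bucket))   # -> 3 2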
The recursive calls might eventually lead us all the way down to a data chunk (at depth D_MAX). Indeed, if GreedyPutChunksIntoBuckets is called upon a root R which is a data chunk, then this means that we have come upon a data chunk with a size greater than the bucket size. This is called a large data chunk, and a more detailed discussion on how to handle such chunks follows in a later sub-section. For now it is enough to say that, in order to resolve the problem of storing such a chunk, we extend the chunking further (with a technique called artificial chunking) in order to transform the large data chunk into a 2-level chunk-tree. Then, we solve the HPP chunk-to-bucket sub-problem for this sub-tree (lines 30–35). The termination of the algorithm is guaranteed by the fact that each recursive call deals with a sub-problem of a smaller chunk-tree than the parent problem. Thus, the size of the input chunk-tree is continuously reduced.
[Fig. 8 A chunk-tree to be allocated to buckets by the greedy algorithm. The numbers inside the nodes are the storage costs of the corresponding sub-trees (root: 65 units); D_MAX = 5.]
[Fig. 9 The chunk-to-bucket allocation for S_B = 30: three buckets B_1, B_2 and B_3 drawn as rectangles over the chunk-tree of Fig. 8.]

Assuming an input file consisting of the cube's data points along with their corresponding chunk-ids (or, equivalently, the corresponding h-surrogate keys per dimension), we need a single pass over this file to create
the chunk-tree representation of the cube. Then the
above greedy algorithm requires only linear time in the
number of input chunks (i.e., the chunks of the chunk-
tree) to perform the allocation of chunks to buckets, as
each node is visited exactly once and, in the worst case,
all nodes are visited.
Assume the chunk-tree of D_MAX = 5 of Fig. 8. The numbers inside each node represent the storage cost of the corresponding sub-tree, e.g., the whole chunk-tree has a cost of 65 units. For a bucket size S_B = 30 units, the greedy algorithm yields a hierarchical clustering factor f_HC = 0.72. The corresponding allocation is depicted in Fig. 9.

The solution comprises three buckets, B_1, B_2 and B_3, depicted as rectangles in the figure. The bucket with the highest clustering degree (HCD_B) is B_3, because it includes the lowest-depth tree. The chunks not included in a rectangle will be stored in the root-bucket. In this case, the root-bucket consists of only a single bucket (i.e., λ = 1 and K = 3, see (3)), as this suffices for storing the corresponding two chunks.
5.2 Bucket-region formation
We have seen that in each step of the greedy algorithm for solving the HPP chunk-to-bucket allocation problem (corresponding to an input chunk-tree with a root node at a specific chunking depth), we try to store all the sibling trees hanging from this root in a set of buckets, forming in this way the groups of trees, stored one group per bucket, that we call bucket regions. The formation of bucket regions is essentially a special case of the HPP chunk-to-bucket allocation problem and can be described as follows:
Definition 9 (The bucket region formation problem) We are given a set of N chunk-trees T_1, T_2, ..., T_N of the same chunking depth d. Each tree T_i (1 ≤ i ≤ N) has a size cost(T_i) ≤ S_B, where S_B is the bucket size. The problem is to store these trees into a set of buckets so that the hierarchical clustering factor f_HC of this allocation is maximized.
As all the trees are of the same depth, the depth contribution c_d^i (1 ≤ i ≤ N), defined in (1), is the same for all trees. Therefore, in order to maximize the degree of hierarchical clustering HCD_B for each individual bucket (and thus also increase the hierarchical clustering factor f_HC), we have to maximize the region contribution c_r^i (1 ≤ i ≤ N) of each tree (1). This occurs when we create bucket regions with as many trees as possible on the one hand, and, due to the region proximity factor r_P, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to the f_HC definition, the number of buckets used must be the smallest possible.
Summarizing, in the bucket region formation problem we seek a set of buckets to store the input trees that fulfills the following three criteria:

1. The bucket regions (i.e., each bucket) contain as many trees as possible.
2. The total number of buckets is minimal.
3. The trees of a region are as close in the multidimensional space as possible.
One could observe that if we focused only on the first two criteria, then the bucket region formation problem would be transformed into a typical bin-packing problem, which is a well-known NP-complete problem [42]. So, intuitively, the bucket region formation problem can be viewed as a bin-packing problem where items packed in the same bin must be neighbors in the multidimensional space.
[Fig. 10 The region proximity for two bucket regions, r_P1 > r_P2: regions R_1 and R_2 of chunk-tree roots plotted over the months in year 1999 (order-codes 0–11) and the types in category Books (Literature, Philosophy, Computers, Science, Fiction).]

[Fig. 11 A row-wise traversal of the input trees and the bucket regions it forms.]

The space proximity of the trees of a region is meaningful only when we have dimension domains with inherent orderings. A typical example is the TIME dimension. For example, we might have trees corresponding to the months of the same year (which guarantees hierarchical proximity), but we would also like consecutive months to be in the same region (space proximity). This is because these dimensions are the best candidates for expressing range predicates (e.g., months from FEB99 to AUG99). Otherwise, when there is no such inherent ordering (e.g., a chunk might point to trees corresponding to products of the same category along the PRODUCT dimension), space proximity is not important and therefore all regions with the same number of trees are of equal value. In this case the corresponding predicates are typically set inclusion predicates (e.g., products IN
{Literature, Philosophy, Science}) and not range predicates, so hierarchical proximity alone suffices to ensure a low I/O cost. To measure the space proximity of the trees in a bucket region we use the region proximity r_P, which we define as follows:
Definition 10 (Region proximity r_P) We define the region proximity r_P of a bucket region R, defined in a multidimensional space S where all dimensions of S have an inherent ordering, as the relative distance of the average Euclidean distance between all trees of the region R from the longest distance in S:

r_P = \frac{|dist_{AVG} − dist_{MAX}|}{dist_{MAX}} .
In the case where no dimension of the cube has an inherent ordering, we assume that the average distance for any region is zero and thus the region proximity r_P equals one. For example, in Fig. 10 we depict two different bucket regions, R_1 and R_2. The surrounding chunk represents the sub-cube corresponding to the months of a specific year and the types of a specific product category, and defines a Euclidean space S. Each point in this figure corresponds to the root of a chunk-tree. As only the TIME dimension, among the two, includes an inherent ordering of its values, the data space, as far as the region proximity is concerned, is specified by TIME only (a one-dimensional metric space). The largest distance in S equals 11 and is the distance between the leftmost and the rightmost trees. The average distance for region R_1 equals 2, while for region R_2 it equals 5. By a simple substitution of the corresponding values in Definition 10, we find that the region proximity for R_1 equals 0.8, while for R_2 it equals 0.5. This is because the trees of the latter are more dispersed along the time dimension. Therefore region R_1 exhibits a better space proximity than R_2.
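
A quick check of these values with Definition 10, measuring distances along the TIME dimension only (the sole dimension with an inherent ordering); the function name is ours:

def region_proximity(dist_avg: float, dist_max: float) -> float:
    # r_P = |dist_AVG - dist_MAX| / dist_MAX (Definition 10)
    return abs(dist_avg - dist_max) / dist_max

print(round(region_proximity(2, 11), 1))   # region R1 -> 0.8
print(round(region_proximity(5, 11), 1))   # region R2 -> 0.5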
In order to tackle the region formation problem we propose an algorithm called FormBuckRegions. This algorithm is a variation of an approximation algorithm called best-fit [42] for solving the bin-packing problem. Best-fit is a greedy algorithm that does not always find the optimal solution; however, it runs in polynomial time (it can also be implemented to run in N log N time, N being the number of trees in the input), and it provides solutions that are within a certain bound of the optimal solution. Actually, the best-fit solution in the worst case is never more than roughly 1.7 times worse than the optimal solution [42]. Moreover, our algorithm exploits a space-filling curve [33] in order to visit the trees in a space-proximity-preserving way. We describe it next:
FormBuckRegions Traverse the input set of trees along a space-filling curve SFC on the data space defined by the parent chunk. Each time a tree is processed, insert it into the bucket that will yield the maximum HCD_B value, among the allocated buckets, after the insertion. On a tie, choose one randomly. If no bucket can accommodate the current tree, then allocate a new bucket and insert the tree into it.
Note that there is no linearization of multidimensional data points that preserves space proximity 100% [8, 13]. In the case where no dimension has an inherent ordering, the space-filling curve might be a simple row-wise traversal (Fig. 11). In this figure, we also depict the corresponding bucket regions that are formed.
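
A minimal Python sketch of this strategy, assuming the sibling trees arrive already ordered along the space-filling curve and approximating the maximum-HCD_B insertion by choosing the fullest bucket that still fits the tree (our simplification of the criterion):

def form_buck_regions(tree_costs, s_b):
    # tree_costs: sibling-tree sizes, already in space-filling-curve order
    buckets = []
    for cost in tree_costs:
        fitting = [b for b in buckets if sum(b) + cost <= s_b]
        if fitting:                      # best fit: fullest bucket that fits
            max(fitting, key=sum).append(cost)
        else:                            # otherwise allocate a new bucket
            buckets.append([cost])
    return buckets

# six equi-depth sibling trees of cost 10 with bucket size 30
print(form_buck_regions([10] * 6, 30))   # [[10, 10, 10], [10, 10, 10]]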
We believe that a formation of bucket-regions that will provide an efficient clustering of chunk-trees must be based on some query patterns. In the following we show an example of such a query-pattern-driven formation of bucket-regions.
A hierarchy level of a dimension can basically take
part in an OLAP query in two ways: (a) as a means of
restriction (e.g., year = 2000) or (b) as a grouping attri-
bute (e.g., show me sales grouped by month). In the
former, we ask for values on a hyper-plane of the cube
perpendicular to the Time dimension at the restriction
point, while in the latter we ask for values on hyper-
planes that are parallel to the Time dimension. In other
words, if we know for a dimension level that it is going
to be used by the queries more often as a restriction
attribute, then we should try to create regions perpen-
dicular to this dimension. Similarly, if we know that a
level is going to be used more often as a grouping attri-
bute, then we should opt for regions that are parallel to
this dimension. Unfortunately, things are not so simple,
because if, for example, we have two restriction levels
from two different dimensions, then the requirement
for vertical regions to the corresponding dimensions is
contradictory.
In Fig. 12, we depict a bucket-region formation that is driven by the table appearing in the figure. In this table we note, for each dimension level corresponding to a chunking depth from our example cube in Fig. 3, whether it should be characterized as a restriction level or as a grouping level. For instance, a user might know that 80% of the queries referencing level continent will apply a restriction on it and only 20% will use it as a grouping attribute; thus this level will be characterized as a restriction level. Furthermore, in the column labeled importance order, we order the different levels of the same depth according to their importance in the expected query load. For example, we might know that the category level will appear much more often in queries than the continent level, and so on.
In Fig. 12, we also depict a representative chunk for each chunking depth (of course, for the topmost levels there is only one chunk, the root chunk), in order to show the formation of the regions according to the table. The algorithm in Fig. 13 describes how we can produce the bucket-regions for all depths, when we have as input a table similar to the one appearing in Fig. 12.

In Fig. 12, for the chunks corresponding to the levels country, type and city, item, we also depict the column-major traversal method corresponding to the second part of the algorithm. Note also that the term fully sized region means a region that has a size greater than the bucket occupancy threshold, i.e., one that utilizes the available bucket space well. Finally, whenever we are at a depth where a pseudo-level exists for a dimension, e.g., D = 2 in our example, no regions are created for the pseudo-level, of course. Also, note that bucket-region formation for chunks at the maximum chunking depth (as is the chunk at depth 3 in Fig. 12) is required only in the case where the chunking is extended beyond the data-chunk level. This is the case of large data chunks, which is the topic of the next sub-section.
5.3 Storing large data chunks
In this sub-section, we will discuss the case where the GreedyPutChunksIntoBuckets algorithm (Fig. 7) is called with, as input, a chunk-tree that corresponds to a single data chunk. This, as we have already explained, would be the result of a number of recursive calls to the GreedyPutChunksIntoBuckets algorithm that led us to descend the chunk hierarchy and end up at a leaf node. Typically, this leaf node is large enough so as not to fit in a single bucket; otherwise the recursive call upon this node would not have occurred in the first place (Fig. 7).

The main idea for tackling this problem is to continue the chunking process further, although we have fully used the existing dimension hierarchies, by imposing a normal grid. We call this chunking artificial chunking, in contrast to the hierarchical chunking presented in the previous section. This process transforms the initial large data chunk into a 2-level chunk-tree of size less than or equal to that of the original data chunk. Then, we solve the HPP chunk-to-bucket allocation sub-problem for this chunk-tree, and therefore we once again call the GreedyPutChunksIntoBuckets routine upon this tree.
In Fig. 14, we depict an example of such a large data chunk. It consists of two dimensions, A and B. We assume that the maximum chunking depth is D_MAX = K; therefore, K will be the depth of this chunk. Parallel to the dimensions, we depict the order-codes of the dimension values of this chunk that correspond to the most detailed level of each dimension. Also, we denote their parent value on each dimension, i.e., the pivot-level values that created this chunk. Notice that the suffix of the chunk-id of this chunk consists of the concatenated order-codes of the two pivot-level values, i.e., 5|14.

In order to extend the chunking further, we need to insert a new level between the most detailed members of each dimension and their parent. However, this level must be inserted locally, only for this specific chunk, and not for all the grain-level values of a dimension. We want to avoid inserting another pseudo-level in the whole level hierarchy of the dimension, because this would trigger the enlargement of all dimension hierarchies and would result in a lot of useless chunks. Therefore, it is essential that this new level remains local.

[Fig. 12 Bucket-region formation based on query patterns: a table characterizes each level of the example cube (LOCATION: continent, country, region, city; PRODUCT: category, type, pseudo, item) as a restriction or a grouping level and assigns an importance order per chunking depth (D = 0: continent, category; D = 1: country, type; D = 2: region; D = 3: city, item); a representative chunk per depth shows the regions formed.]

To
To this end, we introduce the notion of the local depth d of a chunk to characterize the artificial chunking, similar to the global chunking depth D (introduced in the previous section) characterizing the hierarchical chunking.

Definition 11 (Local depth d) The local depth d, where $d \ge -1$, of a chunk Ch denotes the chunking depth of Ch pertaining to artificial chunking. A local depth $d = -1$ denotes that no artificial chunking has been imposed on Ch. A value of $d = 0$ corresponds to the root of a chunk-tree created by artificial chunking and is always a directory chunk. The value of d increases by one for each artificial chunking level.

Note that the global chunking depth D remains constant while descending levels created by artificial chunking, and stays equal to the maximum global chunking depth of the cube (in general, to the current global depth value); only the local depth increases.
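As a small illustration of this bookkeeping (ours, with hypothetical names), the pair (D, d) can be carried on every chunk, and only d advances under artificial chunking:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Depth:
        D: int        # global (hierarchical) chunking depth
        d: int = -1   # local (artificial) depth; -1 means no artificial chunking

        def artificial_child(self):
            # Descending one artificial chunking level: D stays constant,
            # only the local depth d increases.
            return Depth(D=self.D, d=self.d + 1)

    K = 3                           # assumed maximum global chunking depth
    leaf = Depth(D=K)               # a data chunk before artificial chunking
    root = Depth(D=K, d=0)          # the artificial root directory chunk
    print(root.artificial_child())  # Depth(D=3, d=1) -- a partition of the chunk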
Let us assume a bucket size $S_B$ that can accommodate a maximum of $M_r$ directory chunk entries, or a maximum of $M_e$ data chunk entries. In order to chunk a large data chunk Ch of N dimensions by artificial chunking, we define a grid on it, consisting of $m_i^g$ ($1 \le i \le N$) members per dimension, such that $\prod_{i=1}^{N} m_i^g \le M_r$. This grid will correspond to a new directory chunk, pointing at the new chunks created from the artificial chunking of the original large data chunk Ch, and due to the aforementioned constraint it is guaranteed to fit in a bucket. If we assume a normal grid, then for all $i: 1 \le i \le N$, it holds that $m_i^g = \left\lfloor \sqrt[N]{M_r} \right\rfloor$.
In particular, if $n_i$ ($1 \le i \le N$) corresponds to the number of members of the original chunk Ch along dimension i, then a new level consisting of $m_i^g$ members will be inserted as a parent level. In other words, a number of $c_i$ children (out of the $n_i$) will be assigned to each of the $m_i^g$ members, where $c_i = \left\lceil n_i / m_i^g \right\rceil$, as long as $n_i / m_i^g \ge 1$. If $0 < n_i / m_i^g < 1$, then the corresponding new level will act as a pseudo-level, i.e., no chunking will take place along this dimension.
QueryDrivenFormBucketRegions
//Input: query pattern table
//Result: bucket-regions formed at all chunking depths
{
  FOR EACH (global chunking depth value D) {
    Pick the first level in the importance order
    LOOP {
      Try to create as many fully-sized regions as possible that
      favor this level (i.e., regions perpendicular or parallel to
      the level, according to its characterization as a restriction
      or grouping attribute, respectively).
      IF (there are more levels in the importance order AND there
          are more ungrouped chunk-trees to visit) {
        Pick the next level from the order
      }
      ELSE {
        Exit from the loop
      }
    }
    IF (there are still ungrouped chunk-trees) {
      Traverse the chunk in a row/column-major style, with the first
      level in the importance order as the fastest (slowest) running
      attribute if it is characterized as a grouping (restriction)
      attribute, the second level in the importance order as the
      second fastest (slowest) running attribute, and so on for all
      levels in the order. Try to pack as many trees as possible
      into the same bucket, until there are no more trees to visit.
    }
  }
}
Fig. 13 A bucket-region formation algorithm that is driven by query patterns
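The traversal-and-packing phase of Fig. 13 can be illustrated with the following runnable Python sketch. It is a deliberately simplified rendering of ours, not the paper's code: it merely places all restriction attributes before all grouping attributes in the axis order (so that grouping attributes run fastest), rather than interleaving them strictly by importance order, and it packs unit-sized trees greedily.

    from itertools import product

    def traversal_order(levels):
        # `levels` lists (cardinality, kind) pairs in importance order;
        # kind is "group" (fastest running) or "restrict" (slowest running).
        slow = [c for c, kind in levels if kind == "restrict"]
        fast = [c for c, kind in levels if kind == "group"]
        # itertools.product varies the last axis fastest, so the
        # grouping attributes go last.
        return list(product(*map(range, slow + fast)))

    def pack(order, sizes, bucket_size):
        # Greedily pack consecutive chunk-trees into buckets.
        buckets, current, used = [], [], 0
        for cell in order:
            s = sizes[cell]
            if current and used + s > bucket_size:
                buckets.append(current)      # open a new bucket
                current, used = [], 0
            current.append(cell)
            used += s
        if current:
            buckets.append(current)
        return buckets

    # Toy example: a restriction attribute with 3 members and a
    # grouping attribute with 4 members; each tree occupies 1 unit.
    order = traversal_order([(3, "restrict"), (4, "group")])
    buckets = pack(order, {cell: 1 for cell in order}, bucket_size=4)
    print(buckets[0])  # [(0, 0), (0, 1), (0, 2), (0, 3)]

With this ordering, all grouping-attribute members of a single restriction value land in the same bucket, which is exactly the clustering the traversal aims for.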
Fig. 14 Example of a large data chunk (dimensions A and B; grain-level order-codes 9-16 along dimension A and 29-34 along dimension B; the pivot-level parent values yield the chunk-id suffix 5|14; D = K, the maximum chunking depth)
If all new levels correspond to pseudo-levels, i.e., $n_i < m_i^g$ for all $i: 1 \le i \le N$, then we take $m_i^g = \max_i(n_i)$.
We will describe the above process with an example. Let us assume a bucket that can accommodate a maximum of $M_r = 10$ directory chunk entries or a maximum of $M_e = 5$ data chunk entries. In this case the data chunk of Fig. 14 is a large data chunk, as it cannot be stored in a single bucket. Therefore, we define a grid with $m_1^g$ and $m_2^g$ members along dimensions A and B, respectively. If the grid is normal, then $m_1^g = m_2^g = \left\lfloor \sqrt{10} \right\rfloor = 3$. Thus, we create a directory chunk, which consists of $3 \times 3 = 9$ cells (i.e., directory chunk entries); this is depicted in Fig. 15.
In Fig. 15, we can also see the new values of each dimension and the corresponding parent-child relationships between the original values and the newly inserted ones. In this case, each new value will have at most $c_1 = \lceil 8/3 \rceil = 3$ children for dimension A and $c_2 = \lceil 6/3 \rceil = 2$ children for dimension B. The created directory chunk will have a global depth D = K and a local depth d = 0. Around it, we depict all the data chunks (partitions of the original data chunk) that correspond to each directory entry. Each such data chunk will have a global depth D = K and a local depth d = 1. The chunk-ids of the new data chunks include one more domain as a suffix, corresponding to the new chunking depth to which they belong. Notice that new empty chunks might arise from the artificial chunking process; see, for example, the rightmost chunk at the top of Fig. 15, marked with an X. As no space will be allocated for such empty chunks, it is obvious that artificial chunking might lead to a reduction of the size of the original data chunk, especially for sparse data chunks.
Fig. 15 The example large data chunk artificially chunked (a $3 \times 3$ root directory chunk with d = 0, whose entries point to the nine partitions of the original chunk, each with d = 1 and chunk-ids ... .5|14.0|0 through ... .5|14.2|2; the empty partition is marked with an X)
This important characteristic is stated in the following theorem, which shows that in the worst case the extra size overhead of the resulting 2-level tree equals the size of a single bucket. However, as cubes are sparse, chunks will also be sparse, and therefore in practice the size of the tree will almost always be smaller than that of the original chunk.
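The partitioning step itself can be sketched as follows (our own toy encoding of cells and chunk-ids; the real CUBE File stores chunks on disk, not Python sets). Empty partitions simply never materialize, which is where the size reduction for sparse chunks comes from:

    import math
    from collections import defaultdict

    def partition(cells, n, m_g):
        """cells: set of non-empty (a, b) local coordinates; n: (n_A, n_B)
        chunk extents; m_g: grid members per dimension.
        Returns a mapping chunk-id-suffix -> non-empty cells."""
        c = [math.ceil(n_i / m_g) for n_i in n]   # children per grid member
        parts = defaultdict(set)
        for a, b in cells:
            parts[f"{a // c[0]}|{b // c[1]}"].add((a, b))
        return parts                              # empty partitions never appear

    # An 8 x 6 chunk (as in Fig. 14/15) with only a few non-empty cells.
    cells = {(0, 0), (1, 0), (4, 2), (7, 5)}
    for suffix, part in sorted(partition(cells, (8, 6), 3).items()):
        print(f"... .5|14.{suffix}", sorted(part))
    # Only 3 of the 9 possible partitions are created; the other 6
    # (including the X-marked one of Fig. 15) occupy no space at all.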
Theorem 3 (Size upper bound for an artificially chunked large data chunk) For any large data chunk Ch of size $S_{Ch}$, the two-level chunk-tree CT resulting from the application of the artificial chunking process on Ch will have a size $S_{CT}$ such that

$$S_{CT} \le S_{Ch} + S_B, \qquad (4)$$

where $S_B$ is the bucket size.
Proof Assume a large data chunk Ch which is 100% full. Then the application of artificial chunking produces no empty chunks. Moreover, from the definition of chunking we know that if we connect these chunks back together we will get Ch. Consequently, the total size of these chunks is equal to $S_{Ch}$. Now, the root chunk of the new tree CT will have (by definition) at most $M_r$ entries, so as to fit in a single bucket. Therefore, the extra size overhead caused by the root is at most $S_B$. From this we infer that $S_{CT} \le S_{Ch} + S_B$. Naturally, if this holds for the largest possible Ch, it will certainly hold for all other possible Chs that are not 100% full and thus may result in empty chunks after the artificial chunking.
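A quick numeric check of this bound can be sketched in Python, under simplifying assumptions of ours: unit-size entries, a bucket holding $M_r = S_B$ entries, and a fill factor that scales the surviving data cells uniformly (a toy accounting, not the paper's proof):

    import math

    def sizes(n, Mr, fill):
        """Size of the 2-level tree for an n[0] x n[1] chunk: the root
        directory chunk (at most Mr entries, i.e., at most one bucket)
        plus the non-empty data cells."""
        S_Ch = math.prod(n)                             # original chunk size
        root = math.floor(Mr ** (1 / len(n))) ** len(n) # root directory entries
        S_CT = root + math.ceil(S_Ch * fill)
        return S_Ch, S_CT

    S_B = 10                                    # bucket holds Mr = 10 entries
    S_Ch, S_CT = sizes((8, 6), Mr=S_B, fill=1.0)
    print(S_CT, "<=", S_Ch + S_B)               # 57 <= 58: worst-case bound holds
    S_Ch, S_CT = sizes((8, 6), Mr=S_B, fill=0.25)
    print(S_CT, "<", S_Ch)                      # 21 < 48: sparse tree shrinks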