
CS631 Quick Reference

Author: Ramdas Rao

Data Storage

Overview of Physical Storage Media
In decreasing order of cost and performance / speed:
• Cache - Is volatile; L1 cache operates at processor speed (e.g., if the processor is 3 GHz, then the access time is 1/3 ns)
• Main Memory: Access speed is about 10 to 100 ns; 300 times slower than cache; volatile
• Flash Memory: Read access speed about 100 ns (same as memory); writing is slower, 4 to 10 µs; limited number of erase cycles supported; NOR flash; NAND flash uses page-at-a-time read/write; non-volatile
• Magnetic disk storage: Non-volatile; 10 ms access time; orders of magnitude slower than main memory
• Optical Storage: CD, DVD (Digital Versatile Disk); capacities of 700 MB to 17 GB; Write-once, read-many (WORM); optical disk jukebox
• Tape Storage: Sequential access; mostly used for backup and archival; high capacity (40 GB to 300 GB); tape jukeboxes (libraries) - 100s of TB and PB

Magnetic Disks
• Platter - Tracks - Sectors - Blocks
• A cylinder is the set of tracks one below the other on each platter
• Concept of zones: The number of sectors in outer zones is greater than the number of sectors in the inner zones (e.g., 1000 v/s 500 sectors)
• Disk Controller Interfaces: ATA (PATA), IDE, SATA (Serial ATA), SCSI, Fibre Channel, FireWire, SAN (Storage Area Network) - storage on the network made to appear as one large disk, NAS (Network Attached Storage) - NFS or CIFS
• Performance measures of disks: Access time, capacity, data transfer rate, reliability
• Access time: Time from when a read or write is issued to the time when the data transfer begins
• Access time = Seek Time (arm positioning) + Latency (waiting for sector to rotate under head)
• Average Seek Time = 1/2 of Worst Case Seek Time = 4 to 10 ms
• Average Latency Time = 1/2 of time for full rotation = 4 to 10 ms
• Average Access Time = 8 to 20 ms
• Data Transfer Rate = Rate at which data can be transferred = 25 MB/s to 100 MB/s. The transfer rate on the inner tracks is significantly lower (30 MB/s) than on the outer tracks (since the number of sectors on the inner tracks is smaller than on the outer)
• Mean Time to Failure (MTTF):
  – For a single disk, it is about 57 to 136 years
  – If multiple disks are used, the MTTF reduces significantly - with 1000 new disks, MTTF is 1200 hours = 47 days
  – If 100 disks are in an array and each has an MTTF of 100000 hours, then the MTTF of the array is 100000/100 = 1000 hours
  – If 2 mirrored disks have an MTTF of 100000 hours and an MTTR of 10 hours, then MTT(Data Loss) = 100000^2 / (2 * 10) = 500x10^6 hours
• Mean Time to Data Loss depends on MTTF and MTTR (MTTR is Mean Time to Repair)
• Optimization of Disk Block Access:
  – Scheduling of access of blocks (e.g., elevator algo)
  – File organization (file systems, fragmentation, sequential blocks, etc.)
  – Non-volatile write buffers
    ∗ NVRAM to speed up writes
    ∗ Log disk (since access to the log disk is sequential)
    ∗ Log file (no separate disk - like in journaling file systems)

RAID
Redundant Array of Independent (Inexpensive) Disks; RAID improves reliability via redundancy.
• How RAID improves performance via parallelism: Increases the number of I/O requests handled per second, or the transfer rate, or both
  – Bit-level striping:
    ∗ The bits of each byte are split across several disks
    ∗ For an 8-disk configuration, transfer rate is 8 times that of a single disk, and the number of I/Os is the same as that for a single disk; bit i of each byte goes to disk i
    ∗ For a 4-disk config, bits i and 4+i of each byte go to disk i
  – Block-level striping (most commonly used):
    ∗ Stripes blocks across multiple disks (one block on each disk)
    ∗ Logical block i goes to disk (i mod n) + 1 and uses the floor(i/n)th physical block of that disk (formulae assume disk numbers start from 1 and blocks from 0); see the sketch below
    ∗ For large reads (multiple blocks), the data transfer rate is n times that of a single disk (n is the number of disks)
    ∗ For a single-block read, the transfer rate is the same as that of a single disk, but the other disks are free to process other requests
  – Other forms of striping: Bytes of a sector, sectors of a block
• 2 main goals of parallelism in a disk system are:
  – Load-balance smaller disk requests so that the throughput is increased
  – Parallelize large accesses so that the response time of large accesses is reduced
• RAID Levels:
  – RAID Level 0: No redundancy, block striping; used when backup is easily restorable
  – RAID Level 1: Mirroring w/ block striping (aka level 1 + 0 or 10); mirroring without block striping is called Level 1; 2M disks required; used when the number of writes is high (e.g., log disk)
  – RAID Level 2: Memory-style ECC (w/ parity bits); fewer disks required than level 1; some disks store parity (e.g., 3 parity disks for 4 disks of data); subsumed by level 3
  – RAID Level 3: Bit-interleaved parity; a single parity bit can be used for error detection as well as correction (e.g., 1 parity disk for 3 disks of data)
  – RAID Level 4: Block-interleaved parity; separate disk for parity (at block level); the parity disk is involved in every write; a single write requires 4 disk accesses: 2 to read the 2 old blocks and 2 to write the new blocks (parity and data); subsumed by level 5
  – RAID Level 5: Block-interleaved distributed parity; all disks store parity for the other disks; subsumes level 4
  – RAID Level 6: P+Q redundancy (like RAID Level 5, but stores extra redundant information to guard against multiple disk failures); ECC such as Reed-Solomon is used; 4 bits of parity instead of 2 can tolerate up to 2 disk failures
• Choice of RAID:
  – RAID Level 0: Use where data safety is not critical (and backup easily restorable)
  – RAID Level 1: Offers best write performance; use for high I/O requirements with moderate storage (e.g., log files in a database system)
  – RAID Level 5: Storage-intensive apps such as video data storage; more frequent reads and rare writes
  – RAID Level 6: Use when data safety is very important
• Hardware RAID v/s Software RAID: Hot-swapping, etc.
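A minimal Python sketch of the block-level striping mapping just described (function and variable names are illustrative, not from the notes):

```python
def block_striping_location(i: int, n: int) -> tuple[int, int]:
    """Map logical block i (0-based) to (disk, physical block) under
    block-level striping across n disks, with disks numbered from 1
    and blocks from 0, as in the formula above."""
    disk = (i % n) + 1
    physical_block = i // n
    return disk, physical_block

# Example: with 4 disks, logical blocks 0..7 round-robin across disks 1..4
for i in range(8):
    print(i, block_striping_location(i, 4))
```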
Buffer Manager – Sparse index: An index record appears for only some of the
search-key values (some sequential scan required to locate the
• Buffer Replacement Strategy: LRU (not good for nested loop), MRU
records)
(depends on the join strategy), Toss Immediate
– Main disadvantages of Indexed Sequential File: Performance
• Pinning - pinning a block in memory (block that is not allowed to be
degrades (both for sequential scans as well as for index
written back to disk)
lookups) as file grows; can be remedied by periodic reorg of
• Forced output of blocks: even if space is not required (used when the file, but expensive
xact log records need to go to stable storage)
• Multi-level Index:
File Organization – n-level sparse indices
– If an index occupies 100 blocks, using binary search requires
• Fixed Length Records:
ceiling (log2 (b)) disk accesses
– On deleting, move the records (expensive)
– Closely related to tree structures
– On deleting, move final record to free space (requires addi-
tional disk access) Read update of indices pseudo code from notes / book
– Store header and pointers to link (chain) free space for records • Secondary Indices:
• Variable Length Records: – Cannot be sparse (must be dense)
– Use slotted page structure for each block – Pointers in a secondary index (on search keys that are not can-
– Each block has a header that stores: didate keys) do not point directly to the file (instead, each
∗ Number of record entries in the header points to a bucket that contains pointers to the file)
∗ End of free space in the block – Disadvantages: Sequential scan in secondary-key order is very
∗ An array whose entries contain the location and the size slow; they impose significant overhead on the modification of
of each record in the block the file (note that when a file is modified, every index must be
• Organization of the records within a file: updated)
– Heap file organization
– Seq. file organization (may require overflow blocks or periodic B+-Tree Index Files
reorganization)
– Hashing file organization P1 K1 P2 K2 . . . Kn−1 Pn
– Multi-table clustering file organization: For example, for join • Most widely used index structure
of depositor and customer, after one depositor record, store the • Maintains its efficiency despite insertions and deletions of data
customer records for that depositor • A B+-tree index takes the form of a balanced tree in which every path
from the root of the tree to a leaf of the tree is of the same length
Data Dictionary Storage • Each non-leaf node has between ceil(n/2) and n children, where
• Store like a miniature database n is fixed for a particular tree
• Types of information stored: • Each leaf must have at least ceil((n-1)/2) values and at most n-1
– Names of the relations values
– Names and attributes of each relation • Each non-leaf must have at least ceil(n/2) pointers and at most n
pointers
– Domains and lengths of each attribute
• Imposes performance overhead for insertion and deletion and space
– Names and definitions of views
overhead (as much as half of a node maybe empty), but is still pre-
– Integrity constraints ferred (since periodic file reorg is not needed)
– User, Auth and Accounting info, Passwords
• A B+-tree index is like a multi-level search index
– Statistical data
• Queries on B+-trees:
– Info on indices
– If there are K-search key values in a file, then the path from the
root to the leaf is no larger than ceil(logn/2 (K)). For exam-
Indexing and Hashing ple, if K = 1000000 and n = 100, then ceil(log50 (1000000))=
4. Therefore, at most 4 nodes need to be accessed. For binary
• 2 basic types of indices: Ordered Indices (based on a sorted order-
search, it would require ceil(log2 (1000000)) = 20 nodes.
ing of values) and Hash Indices (based on a uniform distribution of
values across a range of buckets) Algo for B+tree from book
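A small sketch of the B+-tree lookup-cost estimate discussed above; K = 1,000,000 search keys and fanout n = 100 are the example's own numbers:

```python
import math

def btree_max_levels(num_keys: int, fanout: int) -> int:
    """Upper bound on nodes visited in a B+-tree lookup:
    ceil(log_(n/2)(K)), as in the example above."""
    return math.ceil(math.log(num_keys, fanout / 2))

print(btree_max_levels(1_000_000, 100))   # 4 node accesses
print(math.ceil(math.log2(1_000_000)))    # 20 for binary search, as noted above
```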
• Indexing techniques must be evaluated on these factors:
– Access time B+-Tree File Organization
– Access type (range of values or point-based)
– Insertion time • Leaf node stores records rather than pointers
– Deletion time • Need to consider capacity of node while splitting and coalescing
(since records are stored in the leaf nodes)
– Space overhead for storing the index
• Ordered Indices: • SQL CLOBs and large objects are split into sequence of smaller
records and organized in a B+-Tree file organization
– Clustering or Primary Index: If the file containing the records
is sequentially ordered, then a clustering or primary index is an • Forumla here
index whose search key also defines the sequential order of the • Indexing Strings: Strings are variable length (need to consider capac-
file ity); prefix compression can be used to reduce the size
– Non-clustering or Secondary Index: Indices whose search key • Advantages: No repetition of search-key values; Non-leaf nodes con-
specifies an order different from the sequential order of the file tain pointer to data (so additional ptr is required in non-leaf nodes -
• Indexed Sequential File: A file with a clustering index on the search height of the tree may increase compared to B+-tree); not much gain
key is called an Indexed Sequential File. compared to B+-tree since anyway majority of the data is in the leaf
node
– Dense index: An index record appears for every search-key
value in the file • Disadvantages: Deletion and other ops more complex
Other properties Cost of Selections
• Multiple Key Access: ts : Seek time, br : number of blocks in file, tT : Transfer time for one block.
Index structures are called access paths since they provide a path through
1. Use multiple-single key indices: Perform intersection. Perfor- which data can be located and accessed.
mance is poor if there are many records satisfying condition • Linear Search (A1): Can be applied to any file
individually, but few satisfying both conditions – Cost = ts + (br ∗ tT ) (one seek + search all blocks)
2. Use indices on multiple keys: composite search key; queries – For key attributes, we can stop after finding the match. Average
with conjunction predicates with equality on primary key is still Cost is : ts + b2r ∗ tT . Worst-case cost is: ts + br ∗ tT
okay (since we can treat as (P, -inf) to (P, inf); but if inequality • Binary Search (A2):
for first, then inefficient – Cost for key searches: dlog2 (br )e ∗ (ts + tT )
3. Bitmap indices can be used: existence bitmaps and presence of – Cost for non-key searches: dlog2 (br )e ∗ (bn + tT ), where n is
NULLs need to be handled the number of items with duplicate keys
• Primary Index, equality on key (A3): For a B+-tree, if hi is the height
4. R-tree (extension of B+-tree) to handle indexing on multiple of the tree, then
dimensions (e.g., for geographical data) – Cost = (hi + 1) ∗ (tT + ts )
• Non-unique search keys: Use unique record id to prevent buckets and • Primary Index, equality on non-key (A4):
extra page lookups; Search for customer name = ’X’ internally be- – Cost = hi ∗ (tT + ts ) + ts + b ∗ tT , where b is the number of
comes (’X’, -inf) to (’X’, inf) blocks containing the matched duplicate keys
• Covering indices: Store multiple (extra) attributes along with pointer • Secondary Index, equality on key (A5):
to records (e.g., balance can be stored if it is required frequently); – For key: Cost is same as that for A3
saves one disk access – For non-key: Cost = (hi + n) ∗ (tT + ts ), n is the number of
blocks contianing matching keys
• Secondary indices and index relocation:
• Primary Index, comparison (A6):
– Some file organizations (such as B+-tree) change the location
– For A > v, first locate V and then sequential access. Cost is
of records even when the records may not have been updated
similar to A4
– To overcome problems due to this in Secondary indices, we can – For A < v or A ≤ v, no index is used. Similar to A1.
store the values of the search-key indices (instead of pointers)
• Secondary Index, comparison (A7):
in the secondary index and use the primary index to lookup
– Searching index is similar to A6
– Cost of access increases, but no change is required on file reorg – But, retrieving each block may require access a different block
– Therefore, linear search may be better
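A rough Python sketch of the A1 and A3 cost formulas above; the seek/transfer times, block count, and tree height are assumed example values, not from the notes:

```python
def linear_scan_cost(ts: float, tT: float, br: int) -> float:
    """A1: one seek plus a transfer for every block in the file."""
    return ts + br * tT

def primary_index_key_cost(ts: float, tT: float, height: int) -> float:
    """A3: traverse the B+-tree (height levels) plus one more block for the record."""
    return (height + 1) * (ts + tT)

# Assumed numbers: 4 ms seek, 0.1 ms transfer, 10,000 blocks, tree height 3
print(linear_scan_cost(4, 0.1, 10_000))     # 1004.0 ms
print(primary_index_key_cost(4, 0.1, 3))    # 16.4 ms
```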
Hashing • Conjunctive Selection using one index (A8):
– Use one index to retrieve (use A2 through A7)
• Hash function must be chosen so that: distribution is uniform and is – Compare each for satisfying the other condition
random
– To reduce the cost, we choose a θi and one of A1 through
• Hashing can be used for: A7 for which the combination results in the least cost for
– Hash file organization: compute address of block directly σθi (r)
– Hash index organization: organizes index into a hash file struc- – Cost is cost of chosen algo
ture • Conjunctive Selection using composite index (A9):
• Example hash function: s[0] to s[n-1] is a string of n characters long • Conjunctive Selection by using intersection of identifiers (A10):
(s[0] ∗ 31^(n−1) + s[1] ∗ 31^(n−2) + . . . + s[n−1]) mod (number of buckets) (see the sketch below)
• Bucket overflows can still occur due to: insufficient buckets, and – Cost is sum of (cost of individual index scans) + (cost of re-
skew trieval of records in the intersection)
– Sorting can be used so that all pointers in a block come to-
• Overflow can be handled through Overflow chaining, or Open hash-
gether; blocks are read in sorted physical order to minimize
ing (linear or quadratic probing, etc.); open hashing is not good for
disk arm movement
db since deletion in this is troublesome
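A minimal sketch of the polynomial string hash described above (the bucket count is an assumed parameter):

```python
def string_hash(s: str, num_buckets: int) -> int:
    """Polynomial string hash: (s[0]*31^(n-1) + ... + s[n-1]) mod num_buckets."""
    h = 0
    for ch in s:
        h = h * 31 + ord(ch)   # Horner's rule evaluates the same polynomial
    return h % num_buckets

# Example: map customer names to 8 buckets
print(string_hash("Smith", 8))
```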
• Disjunctive Selection by using union of pointers (A11):
• Dynamic hashing: One form of extendable hashing; use of hash pre-
– If access paths are available on all conditions, each index is
fix; increasing and decreasing of bucket address table
scanned to get the pointers, union is taken and records are re-
• Advantages of dynamic hashing: No space reservation required; Per- treived
formance does not degrade as file grows or shrinks – Even if one of the condition does not have an access path, the
• Disadvantages of dynamic hashing: Additional lookup for bucket ad- most efficient method could be a linear scan
dress table required
• Linear hashing avoids extra level of indirection at the possible cost of Cost of Joins
more buckets
• Two reasons why sorting is important: the query may require output
• Ordered Indexing: Can handle range queries better to be sorted, and joins and some other operations can be implemented
• Hashing: Bad for range queries; suitable for single-value compar- efficiently, if the input relations are first sorted. Sorting physically is
isons; good for temporary files during query processing more important than sorting logically (to reduce disk arm movement)
Bitmap Index Structure • Natural Join can be expressed as a θ join followed by elimination of
repeated attributes by a projection
Bitmaps and B+-trees can be combined
• nr : number of tuples in r, ns : number of tuples in s, br : number of
blocks of r, bs : number of blocks of s
Query Processing • Nested loop Join:
– If both relations can be read into memory, cost = (br + bs )
• Steps in Query Processing: Parser and Translator (gives rel. algebra – Else, if only one block of each relation fits into memory,
expression), optimizer (also consults statistics about data and give ex- cost = nr ∗ bs + br , assuming “r” is the outer relation
ecution plan), Evaluation Engine (evaluates plan and outputs results • Block Nested loop Join:
of the query)
– Assumption: M+1 is the total blocks available (1 for o/p); else
Mention about join and sorting techniques here the denom will be (M-2)
– r is the outer relation: Cost = ⌈br /(M − 1)⌉ ∗ bs + br , where M blocks • Set operations of union and intersection are commutative
are allocated to r
E1 ∪ E2 = E 2 ∪ E 1
• Sort Merge Join
– Sorting cost of each relation assuming they are not sorted is (M E1 ∩ E2 = E2 ∩ E1
is the number of pages available for sorting - 1 is for o/p and However, Set Difference is not commutative.
M-1 for input):
• Set union and intersection are associative
∗ for r, br (2⌈log_{M−1}(br /M )⌉ + 1) + 1
∗ for s, bs (2⌈log_{M−1}(bs /M )⌉ + 1) + 1 (E1 ∪ E2 ) ∪ E3 = E1 ∪ (E2 ∪ E3 )
– After sorting, assuming that all the tuples with the same value
(E1 ∩ E2 ) ∩ E3 = E1 ∩ (E2 ∩ E3 )
for the join attributs fir in memory, the cost is: (Sorting Cost) +
br + bs • Selection operation distributes over union, intersection and set differ-
• Hash Join ence operations:
– Assume no overflow occurs σP (E1 − E2 ) = σP (E1 ) − σP (E2 )
– Use smaller relation (say r) as the build relation and larger re-
lation (say s) as the probe relation Also,
– If M > br /M , no need of recursive paritioning and cost is:
Cost = 3(br + bs ) σP (E1 − E2 ) = σP (E1 ) − E2
– Else, if recursive partioning occurs: Cost = 2(br + (this does not hold for intersection???)
bs )dlogM −1 (br )−1e+br +bs (included is the cost for reading • Projection operation distributes over union:
and writing partitions)
πL (E1 ∪ E2 ) = (πL (E1 )) ∪ (πL (E2 ))
• A B-tree organization has ⌊(m − 1)n/m⌋ entries for each node
Join ordering: choose such that the sizes of the temporary results are re-
Query Optimization duced
Enumeration of Equivalent Expressions:
Equivalence Rules • Space requirements can be optimized by pointing to shared sub ex-
(Set Version) pressions
• Cascade of σ for conjunction selections • Time requirements can be reduced by optimization (dynamic pro-
gramming, etc.)
σθ1 ∧θ2 (E) = σθ1 (σθ2 (E))
Estimating Statistics of Expr Results
• Commutativity of selection operations
• Catalog Information:
σθ1 (σθ2 (E)) = σθ2 (σθ1 (E)) – nr = number of tuples in r
– br = number of block containing tuples of r
• Only final projection in a sequence of projections
– lr = size of a tuple of r in bytes
– fr = blocking factor of r (= number of tuples of r that fit in one
πL1 (πL2 (. . . πLn (E) . . .)) = πL1 (E)
block)
• Selections can be combined with Cartesian products and theta joins: – V (A, r) = number of distinct values that appear in r for at-
– tribute A
σθ (E1 × E2 ) = E1 ⋈θ E2 – If A is a key for r, then V (A, r) = nr
– – If tuples of r are physically stored together, br = ⌈nr /fr ⌉
σθ1 (E1 ⋈θ2 E2 ) = E1 ⋈θ1∧θ2 E2 – Histogram for a range of values of attribute can be used for
estimating (histograms can be equi-width or equi-height)
• Theta joins and natural joins are commutative
• Selection Size Estimation:
E1 ⋈ E2 = E2 ⋈ E1 – Equality (σA=a (r))
∗ Assuming equi-probable, N um = nr /V (A, r)
• Associativity of joins ∗ With histogram, num = nrange /V (A, range)
– Natural Joins are associative: – Comparison (σA≤v (r))
∗ If v < min(A, r), num = 0
(E1 ⋈ E2 ) ⋈ E3 = E1 ⋈ (E2 ⋈ E3 )
∗ If v ≥ max(A, r), N um = nr
– Theta joins are associative in the following manner: If θ2 has ∗ Else, num = nr · (v − min(A, r))/(max(A, r) − min(A, r))
only attributes from E2 and E3 , then: (this can be modified to use a histogram, where available - use the number
in the ranges, instead of in the entire relation)
(E1 ⋈θ1 E2 ) ⋈θ2∧θ3 E3 = E1 ⋈θ1∧θ3 (E2 ⋈θ2 E3 )
sume num = nr /2
– Cartesian products are also associative – Complex selections
• Distributivity of selections ∗ Conjunctions (σθ1∧θ2∧...∧θn (r)): num = nr · (s1 ∗ s2 ∗ . . . ∗ sn )/(nr)^n ,
– If θ0 only involves E1 , then where si is the number of tuples that satisfy the selection
σθ1 (r).
σθ0 (E1 ⋈θ E2 ) = (σθ0 (E1 )) ⋈θ E2 si /nr is called the selectivity of the selection σθ1 (r)
∗ Disjunctions (σθ1∨θ2∨...∨θn (r)): num = nr ∗ [1 − (1 − s1 /nr )(1 − s2 /nr ) . . . (1 − sn /nr )]
– If θ1 only involves E1 and θ2 only involves E2 , then
∗ Negations: Compute num as: num = nr − num(σθ (r)).
σθ1∧θ2 (E1 ⋈θ E2 ) = (σθ1 (E1 )) ⋈θ (σθ2 (E2 )) If NULLs are present, compute as: num = nr −
• Distributivity of projections If L1 are only attributes of E1 and L2 num(σθ (r) − num(N U LLs).
only of E2 , then • Join Size Estimation:
– Cartesian Product: N um(rxs) = nr xns
πL1 ∪L2 (E1 o
nθ E2 ) = πL1 (E1 ) o
nθ (πL2 (E2 )) – Natural Joins
∗ R ∩ S = φ: same as Cartesian Product Materialized Views
∗ R ∩ S is a key for R: num ≤ ns . Similarly, when it is a
Normally, only the query definition is stored. In materialized views, we compute
key for S.
the contents of the view and store. View Maintenance is required to keep
∗ R ∩ S is a foreign key of S, referencing R: N um = ns
the materialized view up-to-date. View Maintenance can be relegated to the
∗ R ∩ S is neither a key for R nor S: Choose minimum of
programmer or be taken care by the system (can be immediate or deferred)
the following, where R ∩ S = {A}
· N um = nr ∗ ns /V (A, s) • Incremental View Maintenance: Update can be treated conceptually
· N um = nr ∗ ns /V (A, r) to Delete followed by Insert
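A small Python sketch of the natural-join size estimate above for the case where the common attribute A is a key of neither relation; the statistics are assumed example values:

```python
def natural_join_size(nr: int, ns: int, v_a_r: int, v_a_s: int) -> int:
    """Estimate |r join s| when R ∩ S = {A} is not a key of either relation:
    take the minimum of the two estimates listed above."""
    return min(nr * ns // v_a_s, nr * ns // v_a_r)

# Example: 10,000 and 5,000 tuples, 500 / 1,000 distinct A-values
print(natural_join_size(10_000, 5_000, 500, 1_000))   # 50,000
```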
• Size Estimation for Other Operations: • Join Operation:
– Projection (πA (r)): N um = V (A, r) (since projection elimi- – For inserts, vnew = vold ∪ (ir ⋈ s)
nates duplicates) – For deletes, vnew = vold − (dr ⋈ s)
– Aggregation (A GF (r)): N um = V (A, r) (since one tuple in • Selection and Projection Operation:
output for each distinct value of A)
– For selection inserts, vn ew = vo ld ∪ σθ (ir )
– Set Operations
– For selection deletes, vn ew = vo ld − σθ (dr )
∗ Same relation operations: Rewrite as conjunctions, dis-
– For projection, need to handle duplicates:
junctions or negations and use previous results (e.g.,
σθ1 (r) ∪ σθ2 (r) = σθ1 ∨θ2 (r) ∗ Keep count for each tuple in projection πA (r)
∗ Different relation operations: Inaccurate, but provides up- ∗ Decrement count on delete and delete record from view
per bound when count is 0
· N um(r ∪ s) = nr + ns ∗ Increment count on insert or add to view if not present
· N um(r ∩ s) = min(nr , ns ) • Aggregation Operations:
· N um(r − s) = nr – Count: Similar to projection
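A minimal sketch of the count-based incremental maintenance of a projection view described above (class and method names are illustrative):

```python
from collections import Counter

class ProjectionView:
    """Maintain a materialized projection pi_A(r) with a count per tuple."""
    def __init__(self):
        self.counts = Counter()

    def insert(self, a_value):
        self.counts[a_value] += 1          # add to view if not present

    def delete(self, a_value):
        self.counts[a_value] -= 1          # decrement count on delete
        if self.counts[a_value] == 0:
            del self.counts[a_value]       # drop the tuple when its count reaches 0

    def tuples(self):
        return set(self.counts)

v = ProjectionView()
v.insert("NY"); v.insert("NY"); v.insert("SF")
v.delete("NY")
print(v.tuples())   # {'NY', 'SF'} - NY stays because one duplicate remains
```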
– Outer Joins: Inaccurate, but provides upper bound – Sum: Similar to count (but need to keep sum as well as count)
∗ N um(r left outer join s) = N um(r ⋈ s) + nr . Similarly, – Avg: Keep sum as well as count
for (r right outer s)
– Min, Max: Insertion - easy; deletion - expensive - need to find
∗ N um(r outer join s) = N um(r ⋈ s) + nr + ns
new min, max
• Estimation of number of distinct values:
• Other Operations:
– If selection condition θ forces A to take a single value, N um =
V (A, σθ (r)) = 1 – Set Intersection: (r ∩ s)
– If range of values, then Num = Number of specified values in ∗ On insertion in r, check if it is in s. If so, add to view
the selection condition ∗ On deletion in r, check if it is in s. If so, delete from view
– If selection condition of the form (A op V), then – Outer Joins: (routerjoins)
V (A, σθ (r)) = V (A, r) ∗ s, where s is the selectivity of the ∗ On insertion in r, check if it is in s. If so, add to view. If
selection not is s, still add to view, but padded with NULLs
– In all other cases, num = min(V (A, r), nσθ (r)) ∗ On deletion from s, pad with NULLs if it is in r and no
– For joins: longer in S
∗ On deletion from r, remove from view
∗ If all attrs. in A are from r, then V (A, r ⋈ s) = min(V (A, r), nr⋈s )
∗ If A has A1 from r and A2 from S, then N um = Query Optimization using Materialized Views
min(V (A1 , r) ∗ V (A2 − A1 , s), V (A2 , s) ∗ V (A1 − A2 , r), nr⋈s ) • Optimizer may need to substitute the query (or sub-query) by materi-
alized view, if it exists
– For projections: N um = nr
• Replacing a use of a materialized view by the view definition. For
– For aggregates like sum, count, avg, it is same as nr example, if σA=10 (V ), where V is defined as r ⋈ s and there is an
– For min(A) and max(A), N um = min(V (A, r), V (G, r)), index on A in r, but not in r ⋈ s.
where G denotes the grouping attributes

Transactions
Choice of Evaluation Plans
Transaction: a set of operations that form a single logical unit of work.
• To choose the best overall algo, we must consider even nonoptimal
algos for individual operations • ACID properties of a transaction:
• Cost-based optimization: With n relations, there are (2(n−1))!/(n−1)! different – A - Atomicity: all or none (handled by Transaction Mgmt.
join orders. component)
• Time complexity is O(3n ) – C - Consistency: if db was consistent before xact, then it should
be consistent after the xact (handled by programmer or con-
• Dynamic Programming Algo: Outline algo here
straints)
– I - Isolation: an xact does not see the effects of a concurrent
Heuristics in Optimization running xact (handled by the Concurrency Control component)
1. Perform selection operations as early as possible (may cause prob- – D - Durability: once committed, stays committed (handled by
lems if no index on selection attribute and r relation is small in the Recovery Mgmt. component)
σθ (r o
n s) • Transaction States: Active, Failed, Aborted (perform rollback), Par-
2. Perform projections early (similar problems as in 1 above) tially committed, Committed
3. Left-Deep Join Orders: convenient for pipelining • Shadow-copy technique: Ensures atomicity and durability, and is
4. Avoid Cartesian products used by text editors. Disadvantage: Very expensive to make copies
of entire db; no support for concurrent xacts
5. Cached plan can be reused
• Need for Concurrent Executions: Improved throughput (tps), Im-
proved resource utilization, Reduced waiting time (e.g., smaller xacts
Optimizing Nested Queries queued up behind a large xact), Reduced average response time
“where exists” type of query can be optimized by using “decorrelation”; • Schedules: represent the chronological order in which the instruc-
rewrite as join of temporary table (remember to use select distinct and to tions are executed in the system. For a set of n transactions, there
take care of NULL values to preserve the number of tuples) exist n! different serial schedules
• Consistency of the db under concurrent execution can be ensured by Concurrency Control
making sure that any schedule that is executed has the same effect as
Shared locks (S) and Exclusive locks (X): Compatibility matrix: (S,S) true,
a serial schedule (that is, one w/o concurrent execution)
(S,X) false, (X,S) false, (X,X) false
Starvation can be avoided: by processing the lock requests in the order
Conflict Serializability: in which they were made
• Instructions I1 and I2 conflict if they are operations by different xacts
on the same data item and at least one of them is a write operation 2PL
• If a schedule S can be transformed into a schedule S 0 by a series of • Ensures serializability
swaps of non-conflicting instructions, S and S 0 are said to be conflict • Growing phase, Shrinking phase
equivalent • Does not prevent deadlock
• A schedule S is said to be conflict serializable if it is conflict equiva- • Cascading rollbacks may occur (e.g., if T7 reads a data item that was
lent to some serial schedule written by T5 and then T5 aborts
• This prohibits certain types of schedule even though there would be – To avoid cascading rollbacks, strict 2PL can be used where
no problem (e.g., ops that simply add and subtract). However, these exclusive locks must be held till the xact aborts or commits
cases are harder to analyze. (prevents xacts from reading uncommitted writes)
– rigorous 2PL can be used where ALL locks are held till the
View Serializability: xact aborts or commits; xacts are serialized in their commit or-
der
• Less stringent than Conflict Serializability • Upgrading and Downgrading of locks can be done; upgrading should
• View Equivalence: 2 schedules S and S 0 are view equivalent if ALL be allowed only in the growing phase, while downgrading only in the
3 conditions mentioned below are met: shrinking phase (e.g., series of reads followed by write to a data item
– For each data item Q, if xact Ti read the initial value of Q in S, - in other forms of 2PL above, the xact must obtain an X lock on the
then xact Ti in S 0 must also read the initial value of Q data item to be updated, even if it is much later)
– For each data item Q, if xact Ti executes read(Q) in S and if
that value was produced by xact Tj , then that read(Q) op of
xact Ti in S 0 must also read the value of Q that was produced Implementation of locking: Hash table for data items with linked list (of
by that same write op of xact Tj xacts that have been granted locks for that data item plus those that are wait-
– For each data item Q, the xact (if any) that performs the final ing). Overflow chaining can be used.
write(Q) op in S, must also perform the final write(Q) op in S 0
• A schedule is said to be view serializable if it is view equivalent to Graph-based Protocols
some serial schedule
• Acyclic graph of data item locking order
• Blind Writes: Writing a value w/o reading it first
• A data item can be locked by Ti only if its parent is currently locked
• Blind Writes appear in any view serializable schedule that is not con- by Ti
flict serializable • Locks can be released earlier; so shorter waiting times and increased
concurrency
Other properties • Deadlock free; so no rollbacks are required
• Disadvantages: may need to lock more data items than needed (lock-
• Recoverable Schedule: is one where, for each pair of xacts Ti and
ing overhead and increased waiting time), w/o prior knowledge of
Tj such that Tj reads a data item previously written by Ti , then the
which data items to lock, xacts may have to lock the root of the tree
commit operation of Ti appears before the commit operation of Tj
and that can reduce concurrency greatly
• Cascading rollback is undesirable since it can lead to undoing a sig- • Cascadelessness can be obtained by tracking commit dependencies
nificant amount of work such that a transaction is not allowed to commit until the ones that it
• Cascadeless Schedule: is one where, for each pair of xacts Ti andTj had read values written by have not commited
such that Tj reads a data item previously written by Ti , the commit
operation of Ti occurs before the read operation of Tj . Timestamp-based Protocols
• A cascadeless schedule is also recoverable, but not vice-versa.
• Determines the serializability order by selecting the order in advance
• The goal of concurrency control schemes is to provide a high degree
• Using timestamps: could be the system clock or a logical counter
of concurrency, while ensuring that all schedules that can be gener-
ated are conflict or view serializable, and are cascadeless. • Each xact is given a timestamp when it enters the system
• Each data item has 2 timestamps: W-timestamp (the largest ts of
• Testing for Conflict Serializability: (to show that the generated sched-
any xact that wrote the data item successfully) and R-timestamp (the
ules are serializable)
largest ts of any xact that read the data item successfully)
– Construct precedence graph for a schedule S (vertices are xacts,
• Timestamp-Ordering Protocol is:
edges indicate read/write dependencies)
– If Ti issues read(Q)
– If the graph contains no cycles, then the schedule S is conflict
∗ If T S(Ti ) < W − timestamp(Q), reject the read and
serializable
rollback Ti
– A serializability order of the xacts can be obtained through ∗ If T S(Ti ) ≥ W − timestamp(Q), execute the read
topological sorting of the precedence graph and set the R-timestamp of Q to maximum of T Si and
– Cycle detection algos are O(n2 ) R-timestamp(Q)
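A small sketch of the precedence-graph test described above: build the graph, run a topological sort, and report whether a cycle exists (names are illustrative):

```python
from collections import defaultdict, deque

def is_conflict_serializable(xacts, edges):
    """A schedule is conflict serializable iff its precedence graph
    (an edge Ti -> Tj per pair of conflicting operations) is acyclic.
    Kahn's algorithm: if it consumes every vertex there is no cycle,
    and the order produced is a valid serialization order."""
    indeg = {t: 0 for t in xacts}
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(t for t in xacts if indeg[t] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return len(order) == len(xacts), order

print(is_conflict_serializable(["T1", "T2"], [("T1", "T2")]))                 # (True, ['T1', 'T2'])
print(is_conflict_serializable(["T1", "T2"], [("T1", "T2"), ("T2", "T1")]))   # (False, [])
```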
• Testing for View Serializability: – If Ti issues write(Q)
– NP-complete problem ∗ If T S(Ti ) < R − timestamp(Q), reject the write and
rollback Ti
– Sufficient conditions can be used
∗ If T S(Ti ) < W − timestamp(Q), reject the write and
– If sufficient conditions are satisfied, the schedule is view- rollback Ti
serializable ∗ In all other cases, execute the write and set the W-
– But, there may be view-serializable schedules that do not sat- timestamp of Q to T Si
isfy the sufficient conditions – Rolled-back TS get a new timestamp when they are restarted
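A minimal sketch of the basic timestamp-ordering read/write tests above (without Thomas' write rule); the data structures and names are illustrative:

```python
class TimestampOrdering:
    """Each data item carries an R-timestamp and a W-timestamp."""
    def __init__(self):
        self.r_ts = {}   # largest TS of any xact that read the item
        self.w_ts = {}   # largest TS of any xact that wrote the item

    def read(self, q, ts):
        if ts < self.w_ts.get(q, 0):
            return "rollback"                       # would read a too-new value
        self.r_ts[q] = max(self.r_ts.get(q, 0), ts)
        return "ok"

    def write(self, q, ts):
        if ts < self.r_ts.get(q, 0) or ts < self.w_ts.get(q, 0):
            return "rollback"                       # write arrives too late
        self.w_ts[q] = ts
        return "ok"

proto = TimestampOrdering()
print(proto.read("Q", ts=5), proto.write("Q", ts=3), proto.write("Q", ts=7))
# ok rollback ok
```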
See examples and exercises of schedules from the book – Freedom from deadlocks
– However, xacts could starve (e.g., long duration xact getting – Multiple-granularity protocol:
restarted repeatedly due to conflicts with short duration xacts) ∗ The compat matrix above must be followed for granting
– Recoverability and cascadelessness can be ensured by: locks
∗ Performing all writes together at the end of the xact; no ∗ It must lock the root of the tree first, and it can lock it in
xact is permitted to access any of the data items that have any mode
been written ∗ It can lock a node Q in S or IS mode only if it currently
∗ Using a limited form of locking, whereby uncommitted has the parent of Q locked in either IS or IX mode
reads are postponed until the xact that updated the item ∗ It can lock a node Q in X, SIX, or IX mode only if it cur-
commits rently has the parent of Q locked in either IX or SIX mode
– Recoverability alone can be guaranteed by using commit de- ∗ It can lock a node only if it has not previously unlocked
pendencies, that is, tracking uncommitted writes and allowing a any node (that is, Ti is 2P)
xact Ti to commit only after the commit of all xacts that wrote ∗ It can unlock a node Q only if it currently has none of the
a value that Ti read. children of Q locked
• Thomas’ Write Rule: ∗ Locking is done top-down, whereas unlocking is done
– Allows greater potential concurrency bottom-up
– Ignores writes if T Si < W − timestamp(Q), instead of – This protocol enhances concurrency and reduces lock overhead
rolling it back and is good for apps that include a mix of:
∗ Short xacts that access only a few data items
Validation-based Protocols ∗ Long xacts that produce reports from the entire file or set
of files
• Also called optimistic concurrency control – Deadlock is possible
• Each xact goes through 3 phases (for update xacts and 2 for read-only
xacts):
Multiversion Schemes
– Read phase: The system executes the xact Ti ; it reads all data
items and performs all write operations on temporary local Instead of delaying the reads or aborting an xact, these schemes use old
variables, w/o updates to the actual db copies of the data. Each write produces a new version of a data item, and
– Validation phase: Checks if the updates can be copied over to read is given one of the versions of the data item. The protocol must ensure
the db w/o conflict that the version given ensures serializability and that an xact be able to easily
– Write phase: Done only if the xact succeeds in the validation determine which version to read.
phase. If so, the system applies the updates to the db; otherwise • Multiversion Timestamp Ordering:
the xact is rolled back – Each xact has unique ts as before (for the TS Scheme)
• Validation test for xact Tj requires that for all xacts Ti with – Each version of data item has content of the data item, R-ts and
T S(Ti ) < T S(T j), one of the following conditions must hold: W-ts
– F inish(Ti ) < Start(Tj ) – Whenever an xact writes to Q, a new version of Q is produced
– The set of data item written by Ti does not intersect with the set whose R-ts and W-ts are initialized to T S(Ti ).
of data items read by Tj and Ti completes its write phase before – Whenever an xact reads Q, the R-ts of Q is set to T S(Ti ) only
Tj starts its validation phase. (Start(Tj ) < F inish(Ti ) < if R − ts(Q) < T S(Ti )
V alidation(Tj ). This ensures that the writes of Ti and Tj do
– The protocol is (an xact Ti wants to read or write Q):
not overlap
∗ Find a version Qk whose w-ts is the largest ts ≤ T S(Ti )
∗ If xact Ti issues read(Q), the value returned is the content
Multiple Granularity of Qk
• Hierarchy of granularity: DB, Areas, Files, Records; visualize as a ∗ If xact Ti issues write(Q) and T S(Ti ) < R − ts(Qk ),
tree with the DB at the root of the tree then rollback Ti (some other xact already read the value
• Explicit locking at one level will mean implicit locking at all nodes and so we cannot change it now). On the other hand, if
below it T S(T i) = W − ts(Qk ), overwrite the contents of Qk
• Care must be taken not to grant explicit lock at a level above which (w/o creating a new version); else, create a new version.
another lock has been granted already (e.g., cannot lock a record ex- – Older versions of a data item are removed by: If there are 2 ver-
plicitly, if the file has been locked). The tree must be traversed from sions of a data item with W-ts less than the oldest transaction
the root to the required level to find out. in the system, the older of these 2 versions can be removed
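A small sketch of the multiversion read rule above: a read picks the version whose W-timestamp is the largest one not exceeding the transaction's timestamp, so it never fails:

```python
def mv_read(versions, ts):
    """`versions` is a list of (w_ts, value) pairs for one data item Q;
    return the value of the version with the largest W-ts <= TS(Ti)."""
    eligible = [(w, val) for w, val in versions if w <= ts]
    return max(eligible)[1] if eligible else None

versions_of_q = [(0, "v0"), (10, "v1"), (20, "v2")]
print(mv_read(versions_of_q, ts=15))   # 'v1' - written at 10, the largest W-ts <= 15
```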
• Also, a db cannot be locked, if someone else is holding a lock at a – A read request never fails and is never made to wait
lower level. Instead of searching the entire tree to determine this, – Disadv: Reading requires updating of R-ts (2 disk accesses
intention lock modes are used. than one), and conflicts between xacts are resolved through
– When an xact locks a node, it acquires an intention lock on all rollbacks rather than waits (Multiversion 2PL solves the roll-
the nodes from the root to that node. back problem).
– IS (Intention-Shared) lock: If a node is locked in IS mode, then – Does not ensure recoverability and cascadelessness; can be ex-
explicit shared locking is being at the lower level tended in the same manner as the basic TS-ordering scheme
– IX (Intention-Exclusive) lock: If a node is locked in IX mode, • Multiversion 2 PL: Attempts to combine the adv. of multiversion
then explicit exclusive or shared locking is being at the lower with 2PL; it differentiates between read-only xacts and update xacts.
level TODO
– SIX (Shared and Intention-Exclusive) lock: If a node is locked
in SIX mode, then the subtree rooted at that node is being
Deadlock Handling
locked in explicitly shared mode and explicit exclusive lock-
ing is being at the lower level 2 methods to deal with deadlocks: Deadlock prevention, and Deadlock de-
– Compatibility Matrix: tection and recovery. Deadlock prevention is used if the probability of dead-
IS IX S SIX X locks is relatively high; otherwise detection and recovery are more efficient.
IS true true true true false Detection scheme requires overhead to maintain information while running
IX true true false false false to detect deadlocks as well as losses that can occur due to recovery from
S true false true false false deadlocks.
SIX true false false false false • Deadlock Prevention using partial ordering: Use partial ordering
X false false false false false technique like tree protocol
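A minimal sketch of the compatibility-matrix check above, expressed as a lookup table (names are illustrative):

```python
COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def can_grant(requested: str, held_modes) -> bool:
    """A lock in `requested` mode is granted only if it is compatible with
    every mode currently held on the node by other transactions."""
    return all(COMPAT[requested][h] for h in held_modes)

print(can_grant("IS", ["IX"]))   # True
print(can_grant("X", ["IS"]))    # False
```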
• Deadlock Prevention using total ordering and 2PL: Use total or- – Repeatable Read (xact may not be serializable wrt other xacts;
dering and 2PL; in this case, the xact cannot request locks on items e.g., when an xact is searching for records satisfying some con-
that precede that item in the ordering ditions, the xact may find some records inserted by a committed
• Deadlock Prevention using wait-die: Using xact rollback; older xact, but not others)
xacts are made to wait; younger ones are rolled back if the lock is – Read committed
currently held by an older one; the older the xact gets, the more it – Read uncommitted (lowest level of consistency allowed in
must wait SQL-92)
• Deadlock Prevention using wound-wait: Pre-emptive technique;
younger xact is wounded by older one; younger one is made to wait, Concurrency in Index Structures
if older xact has a lock on the item; there may be fewer rollbacks in
this scheme Since indices are accessed frequently, they would become a point of great
Both wait-die and wound-wait avoid starvation and both may cause lock contention, leading to a low degree of concurrency. It is acceptable to
unnecessary rollbacks have nonserializable concurrent access to an index, as long as the accuracy
of the index is maintained.
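A small sketch of the wait-die and wound-wait decisions described above, using timestamps where a smaller value means an older transaction:

```python
def wait_die(requester_ts: int, holder_ts: int) -> str:
    """Wait-die (non-preemptive): an older requester waits; a younger one is rolled back."""
    return "wait" if requester_ts < holder_ts else "rollback requester"

def wound_wait(requester_ts: int, holder_ts: int) -> str:
    """Wound-wait (preemptive): an older requester wounds (rolls back) the
    younger holder; a younger requester is made to wait."""
    return "rollback holder" if requester_ts < holder_ts else "wait"

print(wait_die(5, 10), "/", wound_wait(5, 10))    # wait / rollback holder
print(wait_die(10, 5), "/", wound_wait(10, 5))    # rollback requester / wait
```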
• Timeout-based schemes: In between deadlock prevention and de-
tection schemes; allow an xact to wait for sometime; if timeout, as-
2 technique:
sume that deadlock may have occurred and rollback xact; Easy to
• Crabbing Protocol:
implement, but difficult to determine the correct duration of time to
wait; suitable for short xacts – When searching, lock root node in shared mode, then the child
node. After acquiring lock on child, release lock on parent.
• Deadlock Detection and Recovery: Must check periodically if a
deadlock has occurred (detection); Can be done to see if cycles exist – When inserting, traverse tree as in search mode. Then, lock
in a wait-for graph; Selection of a victim can be done on the basis the node affected in X-mode. If coalescing, splitting or redis-
of minimum cost (how many xacts will be involved; how many data tribution is required, lock the parent in X-mode; then perform
items have been used; how much longer to complete, etc.); Total roll- the operations on the node(s) and release the locks on the node
back or partial rollback (just enough rollback to the point where the and the siblings; retain lock on parent, if parent needs further
appropriate lock is released that breaks the deadlock); Starvation can splitting, coalescing, or resitribution.
be prevented by including the number of times an xact has been rolled – Progress of locking goes from top to bottom while searching
back in the cost factor while deciding the victim and bottom to up when splitting, coalescing, redistributing
• B-link-tree locking protocol: Achieves more concurrency by avoid-
Insert and Delete Operations ing holding the lock on one node while holding lock on another node,
by using a modified version of B+-trees called B-link trees; these
• Delete operation similar to write (X lock for delete op in 2PL; treated require that every node including the internal nodes and leaf nodes
similar to write op in TS-ordering) maintain a pointer to its right sibling
• Insert operation: X-lock in 2PL; in TS-ordering, assign TS of the xact – Lookup: Each node must be locked in S mode before access-
that is inserting the item to the R-ts and W-ts of the data item ing it; Split may occur concurrently with lookup, so the search
value may have moved to the right node; Leaf nodes are locked
Phantom Phenomenon in 2PL to avoid phantom phenomenon
– Insertion and deletion: Follows the rules to locate the lead node
Consider computation of sum by using a select and an insert statement; this into which the insertion or deletion will take place; Upgrades
can result in a non-serializable schedule if locking is done at the granularity the shared lock to X lock on the affected leaf; Leaf nodes are
of the data item; neither access any tuple in common - so the conflict would locked in 2PL to avoid phantom phenomenon
go undetected.
– Split: Create the new node (split); change the right-sibling
Can be alleviated by:
pointers accordingly; release X-lock on the original node (if
• Associating a virtual data item with every relation and having the it is non-leaf; leaf nodes are locked in 2PL to avoid phantom
xacts lock this (in addition to the tuples), if they are updating or read- phenomenon)
ing info about the relation
– Coalescing: Node into which coalescing will be done should
• Index-locking protocol using 2PL can be used: nodes of the index be locked in X mode; once coalescing has been done, parent
must be locked in shared mode for lookups; writes must lock the ap- node is locked in X mode to remove the deleted node; then,
propriate nodes of the index in exlcusive mode xact releases the locks on the coalesced nodes, if parent is not
• Variants of index-locking can be used to implement the other schemes to be coalesced, lock on parent can be released
(apart from 2PL) – Note: An insertion or deletion may lock a node, ulock it, and
subsequently relock it. Furthermore, a lookup that runs concur-
Weak Levels of Consistency rently with a split or coalescence operation may find that the
desired value has shifted to the right-sibling node by the split
Serializability allows programmers to ignore issues related to concurrency or coalescence operation; this can be accessed by following the
when they code xacts. right-sibling pointer.
• Degree-Two Consistency: Purpose is to avoid cascading aborts w/o – Coalescing of nodes can cause inconsistencies; lookups may
necessarily ensuring serializability; S-locks may be acquired and re- have to restart
leased at any time; X-locks can be acquired at any time, but cannot be – Instead of 2PL on leaf nodes, key-value locking on individ-
released until the xact aborts or commits; results in non-serializable ual key-values can be done. However, must be done carefully;
schedules; therefore, this approach is undesirable for many apps else, phantom phenomenon can occur for range lookups; can be
• Cursor Stability: Form of two-degree consistency for programs taken by locking one more key value than the range (next-key
written in host languages where iteration of tuples is done using a value).
cursor; Instead of locking the entire relation, the tuple that is cur-
rently being processed is locked in S-mode; Any modified tuples are
locked in X-mode until the xact commits; 2PL is not used, Serializ- Recovery System
ability is not guaranteed; Heavily accessed relations gain increased • Fail-stop assumption: Hardware errors and bugs in software bring
concurreny and improved system performance. Programmers must the system to a halt, but do not corrupt the nonvolatile storage con-
take care at the app level so that db consistency is ensured. tents.
• Weak Levels of Consistency in SQL: SQL-92 levels: • Stable Storage Implementation: Keep 2 physical blocks for each
– Serializable (default) logical database block. Write the info onto the first physical block.
When the first write completes successfully, write the same info onto • Dump database procedure: Output all log records to stable, then the
the second physical block. The o/p is completed only after the second buffer blocks, copy the contents of the db to stable storage, then out-
write completes successfully. put a log record ¡dump¿ onto the stable storage. To recover, only
During recovery, the system examines each pair of physical blocks. records after the ¡dump¿ record must be redone. But copying of en-
If contents same, nothing to be done. If error in one, replace with the tire db is impractical and the xact processing must be halted during
other. If contents differ, replace first’s contents with the contents of the dump. Fuzzy dumps can be used to allows xacts to be active while
the second block. the dump is in progress.
Number of blocks to compare can be reduced by keeping list of on-
going writes in NVRAM (so that only these need to be compared). Advanced Recovery Techniques
Using logical logging for undo process for achieving more concurrency
Log-Based Recovery (faster release of locks on certain structures such as B+-tree index pages)
Recovery is used for rolling back transactions as well as for crash recovery.
Update log record has: Xact Id, Data-item id, Old Value, New Value Fuzzy Checkpointing:
• <T Start>, <T commit> or <T abort> records are written at start, • Normal checkpointing may halt the xact processing for a long time if
commit or abort of a transaction the number of pages to be written is large
• Deferred Database Modification: Only new values need to be stored • Allows xacts to modify buffer blocks once the checkpoint record has
(for redo; no need for undo) been written
• Immediate Database Modification: Requires to store both old and • While performing fuzzy checkpointing, the xact processing is halted
new values (for undo and redo) only briefly to make a list of buffers modified. The checkpoint is
record before the buffers are written out.
– Undo is performed before redo
• The locks are released and xacts can modify the buffer blocks; the
• Checkpoints: Helps in reducing scanning the log after a crash to lo- checkpointing process proceeds to output the modified blocks in its
cate the transactions to be undone and redone; also helps in reducing list in parallel. However, the block being written out by the check-
the time to redo (since the changes before the checkpoint would have pointing process still needs to be locked; other blocks need not be.
been applied already).
• Concept of “last-checkpoint” record at a fixed position on disk can
Transactions are not allowed to perform any update actions, such as
be used to guard against failures. This record should be updated only
writing to a buffer block or writing a log record, while a checkpoint
after ALL the buffers in the checkpoint’s list have been written to
is in progress.
stable storage.
– Output all log records to stable storage
– Output all modified buffer blocks to stable storage ARIES
– Write the checkpoint record to stable storage • Features:
For recovery, the log must be scanned backward to find the most re- – Uses LSN (log sequence number)
cent checkpoint record. It needs to further continue searching back- – Physiological Redo
ward until it finds all the transactions that have some record after the – Dirty Page Table
most recent checkpoint record. Only these transactions need to be – Fuzzy Checkpointing (allows dirty pages to be written contin-
redone / undone. No commit record, do undo; else do redo. uously in the background, removing in bottle necks when all
• Recovery with Concurrency Control: pages need to be written at once)
– List of active transactions are stored as part of the checkpoint • LSN:
record – Every log record has a unique LSN that uniquely identifies the
– The log can be used to rollback even failed xacts. log record
– If strict 2PL is used (that is, excl. locks till end of xact), – LSN is most often file number and an offset within that file
the locks held by an xact may be released only after the xact – Each page has an LSN that indicates the LSN of the last record
has been rolled back. So, when an xact is being rolled back, that modified that page.
no other xact may have updated the same data item (the xact – PageLSN is essential to ensure idempotence in the presence of
should have locked the data item since it was to update it in the phsyiological redo operations
first place). Therefore, restoring the old value of a data item – Physiological redo cannot be reapplied to a page since it would
will not erase the effects of any other xact. result in incorrect changes on the page
– Undo must be done by processing the log backward – Each log record contains a field called ”PrevLSN” that points to
the previous log record for this transactioon (helps in locating
– Redo must be done by processing the log forward
transaction log records easily without reading the whole log)
– For recovery: Scan the log backward until it finds the ¡check-
– CLRs (Compenstation Log Records) have an additional field
point L¿ record performing the following steps as it reads each
UndoNextLSN that is used in the case of the operation-abort
record while scanning backward:
log record to point to the log record that is to be undone next
∗ If ¡Ti commit¿ record found, add Ti to the redo list • Dirty Page Table:
∗ If ¡Ti start¿ record found and Ti is not on the redo list, – Stores the list of pages that have been updated in the buffer
add to the undo list
– For each page, the PageLSN and the RecLSN is also stored
∗ Finally, for all Ti in the checkpoint record list, that does
not appear in the redo list, add to the undo list. This is to – RecLSN indicates which log records have already been applied
take care of long running xacts that may not have updated to the disk version of the page
anything since the checkpoint record was written. – Intially, when the page is brought in from the disk, the RecLSN
∗ Undo must be done prior to redo is set to the current end of the log
• Checkpointing:
• WAL (Write-ahead logging): Before a block of data in main memory
can be output to the database (in non-volatile storage), all log records – A checkpoint log record contains the Dirty Page Table and the
pertaining to data in that block must have been output to stable stor- list of active transactions
age. – For each transaction, the checkpoint record also stores the last
Strictly speaking, the WAL rule requires only that the undo info in LSN for that transaction
the log have been output to stable storage, and permits the redo info – A fixed position on the disk notes the LSN of the last complete
to be written later. This is relevant only in systems where undo and checkpoint log record
redo info are stored in separate log records. • Recovery: 3 Phases (in recovery):
– Analysis Pass: Determines which xacts to undo, which pages ∗ Database Writer Process
were dirty at the time of the crash, and the LSN from which the ∗ Process Monitoring Process
redo pass should start. ∗ Checkpoint Process
– Redo Pass: Starts from a position determined during the anal- – The shared memory contains all the shared data:
ysis phase, and performs a redo, repeating history, to bring the ∗ Buffer Pool
database to a state it was in before the crash. ∗ Lock Table
– Undo Pass: Rolls back all xacts that were incomplete at the ∗ Log Buffer
time of the crash. Need to elaborate here about CLRs, etc. ∗ Cached Query Plans
While undoing, if a CLR is found, it uses the UndoNextLSN to – Semaphores or ”Test and Set” atomic operations must be used
locate the next record to be undone; else it undoes the record to ensure concurrent access to the shared memory
whose number is found in the PrevLSN field – Even if the system handles lock requests through shared mem-
• Advantages of ARIES: ory, it still uses the lock manager process for deadlock detection
– Recovery is faster (no need to reapply already redone records; • Data-server systems (aka query-server systems):
pages need not even be fetched if the changes are already ap- – This architecture is used typically when:
plied) ∗ High-speed connection between clients and servers
– Lesser data needs to be stored in the log ∗ Client systems have comparable computational power as
– More concurrency is possible those of servers
∗ Tasks to be executed are computationally intensive
– Recovery Independence (e.g., for pages that are in error, etc.)
– The client needs to have full backend functionality
– Savepoints (e.g., rolling back to a point where deadlock can be
broken)
– Allows fine-grained locking
Parallel Systems
– Recovery optimizations (fetch-ahead of pages, out-of-order • Speedup v/s Scaleup
redo) • Factors affecting Scaleup / Speedup
• Interconnection Networks:
Remote Backup Systems – Bus
√ √
– Mesh (max. distance is 2( n − 1) or n, if wrapping is al-
Several issues must be addressed:
lowed from the ends)
• Detection of failure: Using “heartbeat” messages and multiple links
of communication – Hypercube (max. distance is log n)
• Parallel System Architectures: Shared-memory, Shared-disks,
• Transfer of control: When original comes back up, it must update
Shared nothing, Hierarchical
itself (by receiving the redo logs from the old backup site and replay-
ing them locally). The old backup can then fail itself to allow the – Hierarchical: Share nothing at the top-level(???), but internally
recovered primary to take over. each node has either shared-memory or shared-disk architec-
ture)
• Time to recover: Hot-spare configuration can be used.
• Time to commit:
Distributed Systems
– One-safe: Commit as soon as commit log record is written to
stable storage at primary • Reasons: Sharing data, Autonomy, Availability
– Two-very safe: Commit only when both primary and secondary • Multidatabase or heterogeneous distributed database systems
have written the log records to stable storage (problem is when • Issues in distributed database systems: Software development cost,
secondary is down) Greater potential for bugs, Increased processing overhead
– Two-safe: Same as Two-very safe when both primary and sec- • Local-Area Networks, Storage Area Networks (SAN)
ondary are up; when secondary is down, proceed as One-safe) • Wide-Area Networks: Discontinuous Connection WANs v/s Contin-
uous Connection WANs
Database System Architectures
Distributed DB
Main Types: Client-Server, Parallel, Distributed
• Each site may participate in the execution of transactions that access
Centralized Systems: data at one site, or several sites.
• The difference between centralized and distributed databases is that,
• Coarse-grained parallelism: in the centralized case, the data reside in one location, whereas in the
A single query is not partitioned among multiple processors. distributed case, the data reside in several locations.
Such systems support a higher throughput; that is, they allow a • Homogeneous Distributed DB: All sites have identical dbms soft-
greater number of transactions to run per second, although individ- ware, are aware of one another, and agree to cooperate in processing
ual transactions do not run any faster. users’ requests
• Fine-grained parallelism: Single tasks are parallelized (split) among • Heterogeneous Distributed DB: Different sites may use different
multiple processors schemas and different dbms software, and may provide only limited
facilities for cooperation in transaction processing
Client-Server Systems:
Distributed Data Storage
Clients access functionality through API (JDBC, ODBC, etc.) or transac-
tional remote procedure calls Two approaches to storing a relation in a distributed db:
• Replication: Several identical copies of a relation are stored; each
Server System Architectures: replica at a different site. Full replication: a copy is stored at every
site
2 types: Transaction-server v/s Data-server systems – Advantages: Availability, Increased parallelism (minimizes
• Transaction-server systems (aka query-server systems): movement of data between sites)
– Components of a Transaction-server system include: – Disadvantages: Increased overhead on update
∗ Server Processes – In general, replication increases the performance of and the
∗ Lock Manager Process availability of data for read operations; but update transactions
∗ Log Writer Process incur greater overhead
– Concept or primary copy of a relation ∗ Otherwise, site has a ¡ready T¿. In this case, it must wait
• Fragmentation: The relation is partitioned into several fragments, and for the coord to recover. This is the “blocking problem”.
each fragment is stored at a different site If locking is used, other transactions may be forced to
– Horizontal Fragmentation: Each tuple to one or more sites wait.
ri = σPi (r). r is reconstructed using: r = r1 ∪r2 ∪r3 . . .∪rn – Network Partition:
– Vertical Fragmentation: Decomposition of scheme of relation ∗ If the coord and all participants are in the same partition,
(so that columns are at one or more sites) ri = πRi (r). The then no effect.
original relation can be obtained by taking the natrual join of ∗ Otherwise, the sites that are in the partition other than the
all the fragmented relations. Primary key (e.g., tuple id) needs coord, treat the failure as if the coord failed. Similarly, for
to exist in each fragment. the sites in the same partition as the coord and the coord,
– For privacy reasons, vertical fragmentation can be used for hid- they treat the failure as if the sites in the other partition
ing columns. had failed.
• Fragmentation and Replication can be combined • To allow the recovered site to proceed, the list of items locked can
also be recorded with the ¡ready T¿ message in the log. The recovery
• Transparency: Users should get: Fragmentation transparency, Repli-
proceeds to relock those items, whereas other xact can proceed.
cation transparency, Location transparency
• To prevent name clashes: a name server can be used (single point-
of-failure) or site id prepended to each relation name. Aliases can 3PC
be used to map aliases to real names stored at each site. This helps • Tries to avoid blocking in certain cases by informing at least “k” other
when the administrator decides to move a data item from one site to sites of its decision
another.
• It is assumed that no network partition occurs and not more than “k”
sites fail, where “k” is a predetermined number
Distributed Transactions • If the coord fails, then the sites elect a new coord. The new coord
• Need to worry about failure of a site or failure of communication link tries to find out if any site knows about the old coord’s intentions. If
while participating in a transaction it finds any one site, then it starts the third phase (to commit or abort).
• Transaction Manager (handles ACID for local) and Transaction Co- If it cannot, the new coord aborts the xact.
ordinator (coordinates the execution of both local and global transac- • If a n/w partition occurs, it may appear to be the same as “k” sites
tions initiated at its site) failing and blocking may occur.
• Transaction Coordinator Responsibilities: Start execution of a trans- • 3PC has overheads; so it is not used widely.
action, Break a xact into sub-parts and distribute to various sites, co- • Also, it should be implemented carefully; else the same xact may be
ordinate termination of xact (abort or commit) committed in one partition and may be aborted in another
• Failures: failure of a link, loss of messages, network partition, failure
of a site Alternative methods of xact processing

2PC Using persistent messaging; this requires complicated error handling (e.g.,
by using compensating xacts). Persistent messaging can be used for xacts
• Protocol: When all sites inform the coordinator that the transaction is that cross organizational boundaries.
complete: Implementation of persistent messaging:
– P1: Coord sends all sites ¡prepare T¿, Sites reply with ¡ready • Sending site protocol: Messages must be logged to persistent stor-
T¿ or ¡no T¿ age within the context of the same xact as the originating xact be-
– P2: Coord sends ¡commit T¿ or ¡abort T¿ (based on whether fore sending it out; On receiving an ack from the receiver, this can
all sites were ready to commit or not) be deleted. If no ack is recd., the site tries repeatedly. After pre-
determined number of failures, error is reported to the application
– All such comm. must be logged to stable storage before it sends
(compensating xact must be applied).
the msg. out so that recovery is possible
• Receiving site protocol: On receipt, the receiver must first log into
– In some implementations, each site sends ¡ack T¿ msg to the
persistent storage; Duplicates must be rejected; After the xact for
coord. The coord records ¡complete T¿ after it receives ¡ack T¿
logging the message to the log relation commits, the receiver send
from all the sites
an ack. Ack is also sent for duplicates. Deleting received messages
• Handling of failures: from the receiver must be done carefully, since the ack may not have
– Failure of site: Handling by coordinator: reached the sender and a duplicate may be sent. Each message can be
∗ If site failed before replying ¡ready¿, the coord treats it given a timestamp to deal with this problem. If the ts of a recd. msg.
similar to a reply of ¡abort¿ is older than some predetermined cutoff, then that msg is discarded
∗ If site failed after replying ¡ready¿, the coord ignores the and all other messages recorded that have ts older than the cutoff can
failure and proceeds normally (the site will take care after be deleted.
it comes back up)
Handling by site: When the site comes back up, it checks it log: Concurrency Control in Dist. DB
∗ If no control records in log, execute undo
∗ If commit in log, commit the xact • Each site participates in the execution of a commit protocol to ensure
∗ If abort in log, execute undo global transaction atomicity.
∗ If ¡ready¿ is present in log, it needs to find out from the co-
ord about the status. If coord is down, it can ask the other Locking Protocols:
sites. If this info is not available, then the site can neither
commit nor abort T. It needs to postpone the decision for • Single Lock-Manager Approach:
T until it gets the needed info. – A single lock manager (residing at a single site) for the entire
– Failure of coord: When coord fails, then the participating sites system
must try to determine the outcome (but cannot be done in all – Request for lock is delayed until it can be granted; message is
cases) sent to the site from which the lock request was initiated.
∗ If site has ¡commit T¿, then it needs to commit the xact – The xact can read from any of the site where the replica is avail-
∗ If site has ¡abort T¿, then it needs to undo able; but all sites where a replica of the data item exists must
∗ If site does not have ¡ready T¿, then it can undo be involved in the writing.
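The commit protocol described above can be condensed into a coordinator-side sketch. This is a minimal sketch under simplifying assumptions (synchronous, reliable send; a log call that reaches stable storage before the next message goes out); the function and message names are made up for illustration and are not a real RPC or logging API.

def two_phase_commit(xid, sites, send, log):
    """Coordinator side only. send(site, msg) returns the site's reply;
    log(record) is assumed forced to stable storage before the next send."""
    # Phase 1: <prepare T> to all sites, collect <ready T> / <no T> votes.
    log(("prepare", xid))
    votes = [send(site, ("prepare", xid)) for site in sites]

    # Phase 2: commit only if every site voted ready, else abort.
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    log((decision, xid))                      # decision logged before it is sent
    acks = [send(site, (decision, xid)) for site in sites]

    if all(a == "ack" for a in acks):         # optional <ack T> / <complete T> round
        log(("complete", xid))
    return decision

# Toy run with two always-ready participants:
print(two_phase_commit(
    "T1", ["s1", "s2"],
    send=lambda site, msg: "ready" if msg[0] == "prepare" else "ack",
    log=lambda rec: None))                    # -> commit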
– Advantages: Simple Implementation, Simple Deadlock Han- – Advantages: The cost of read or write locking can be selec-
dling tively reduced by choosing the read and write quorums; by set-
– Disadvantages: Bottleneck, Vulnerability (Single point-of- ting appropriate weights, this protocol can simulate the major-
failure) ity and biased protocols
• Distributed Lock Manager: • Timestamping:
– Lock Manager function is distributed over several sites – Each xact is given a unique timestamp that the system uses in
deciding the serialization order
– Each site maintains a lock manager whose function is to ad-
– 2 methods for generating unique timestamps: (1). Centralized,
minister the lock and unlock requests for those data items that
or (2). Distributed (concatenate the site id at the end of the lo-
are stored at that site
cal unique timestamp - this is done to ensure that the global ts
– this works as for the single case when the data item is not repli- generated in one site are not always greater than those gener-
cated ated in other sites)
– for replicated case, see the methods below – Handling of faster clocks: Use logical counter clock; When-
– Advantages: Simple implementation; reduces the degree to ever a transaction with timestamp ¡x,y¿ visits a site and x is
which the coord is the bottleneck; reasonably low overhead greater than the current value of local clock counter, set local
requiring 2 messages for lock requests and one for unlock re- clock counter to x + 1 (Similar technique can be use for system
quests. clock based timestamps)
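A tiny sketch of the distributed timestamp scheme described here: a local logical counter with the site id appended as a tiebreaker, bumped past any larger counter value carried by a visiting transaction. Class and method names are illustrative only.

class SiteClock:
    """Generates globally unique timestamps (counter, site_id) and keeps a
    slow site from always generating smaller timestamps than fast ones."""
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = 0

    def next_ts(self):
        self.counter += 1
        return (self.counter, self.site_id)   # site id breaks ties

    def observe(self, ts):
        """On seeing a visiting xact's timestamp <x, y> with x greater than
        the local counter, set the local counter to x + 1."""
        x, _ = ts
        if x > self.counter:
            self.counter = x + 1

s1, s2 = SiteClock(1), SiteClock(2)
t = s1.next_ts()          # (1, 1)
s2.observe(t)             # site 2's counter jumps to 2
print(s2.next_ts())       # (3, 2), ordered after (1, 1)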
– Disadvantages: Deadlock handling is more complex since the
lock/unlock requests are not made at a single site. There may Replication with weak degrees of consistency
be intersite deadlocks even when there is no deadlock within
a single site. • Master-slave replication: updates only at a primary site; xacts can
read from anywhere
• Primary Copy:
• Multi-master replication: Update-anywhere replication
– Single primary copy for each replicated data item
• Lazy propagation: instead of updating all replicas as part of the xact
– Lock / unlock requests are always made to the site that has the performing the update.
primary copy • Updates at replicas as translated into updates at a primary site, which
– This is handled similar to the case for unreplicated data are then propagates lazily to all replicas. (This ensures updates to an
– Advantages: Simple Implementation item are ordered serially, although serializability problems can occur,
– Disadvantages: Single point-of-failure (if the site that has the since xacts may read an old value of some other data item and use it
primary copy fails, the data item is inaccessible, although other to perform an update)
sites containing a replica may be accessible) • Updates are performed at any replica and propagated to all other
• Majority Protocol: replicas. Can cause even more problems since the same data item
may be updated concurrently at multiple sites.
– If a data item Q is replicated at n different sites, then a lock re-
quest must be sent to more than one-half of the n sites in which
Q is stored; the transaction proceeds only when more than one- Deadlock Handling
half of the n sites grant a lock on the data item Q; otherwise, it • Deadlock can occur if the union of the local wait-for graphs contains
is delayed a cycle (even though each local wait-for graph is acyclic)
– Writes are performed on all replicas • Centralized deadlock detection: Global wait-for graph is maintained
– Protocol can be extended to deal with site failures (see later at the central site and is updated whenever a new edge is removed
points ¡which one¿) or inserted from one of the local wait-for graphs (or periodically or
– Advantages: Distributed lock manager functionality whenever the coord needs to invoke the cycle detection algo)
– Disadvantages: More complicated to implement; requires at • When a cycle is detected, the coord selects a victim to be rolled back
least 2(n/2 + 1) messages for handling lock requests and at and it must notify all sites about this. The sites, in turn, roll back the
least (n/2 + 1) messages for handling unlock requests; Dead- victim xact.
lock handling is more complicated - deadlocks can occur even • May produce unnecessary roll-backs:
if a single data item is being locked (unless the requests are – False cycles: Message for adding edge arrives before message
made to the sites in the same predetermined order by all the for removing edge
sites) – One of the xact was to be aborted: If an xact was to be aborted
• Biased Protocol: for reasons other than the deadlock, it may be possible that the
– Requests for shared locks are given more favorable treatment deadlock would have been broken and there would not be the
need to select (another) victim
– Shared locks: Request from one site that has a replica of Q
• Deadlock detection can be done in a distributed manner, but is more
– Exclusive locks: Request locks at all sites that have a replica of complicated.
Q
– Advantages: Lesser overhead on read operations than the ma- Availability
jority protocol; savings are significant when reads are more
– Disadvantages: Writes require more overhead; same complex- Multiple-links can be used between sites; however, multiple links may still
ity for deadlock handling as for the Majority Protocol fail. So there are cases where we cannot distinguish between site failure and
network partition.
• Quorum Consensus Protocol:
– Generalization of the majority protocol Must take care to ensure that these situations are avoided: 2 or more
– Each site is assigned a nonnegative weight; read and write oper- central servers are elected in distinct partitions, and more than one partition
ations are assigned 2 integers called read quorum Qr and write updates a replicated data item
quorum Qw • Majority-based Approach:
– Following condition must be satisfied: Qr + Qw > S and – Each data item stores with it a version number to detect when
2 ∗ Qw > S, where S is the total weight of all the sites at it was last written to. This is updated on every write.
which the data item exists – If a data item is replicated at n sites, then the xact will not pro-
– For read locks, enough replicas must be locked so that their ceed until it has obtained locks from majority of those n sites
total weight ≥ Qr – Read operations look at all versions and choose the highest one
– For write locks, enough replicas must be locked so that their (the sites will lower numbered versions can be informed of the
total weight ≥ Qw new version)
– Write ops write to all replicas that have been locked; the LDAP
version number is one more than the highest numbered one
• Can be used for storing bookmarks, browser settings, etc.
amongst them
• Provides a simple mechanism to name objects in a hierarchical fash-
– This works even when a failed site comes back up (it will told
ion
about its stale data). Site reintegration is trivial - nothing needs
to be done. This is since writes would have updated a majority, • RDN=value can be collected to form the full distinguished name
while reads will read a majority of the replicas and find at least • Querying consists of just selections and projections; no joins
one replica that has the highest version. • Distributed Directory Trees: A node in a DIT (directory info tree)
– Same version numbering technique can be used with the quo- may contain a referral to another node in another DIT; this helps in
rum consensus to make it work in the presence of failures. distributed trees.
However, failures may prevent xacts from proceeding if some • Many LDAP implementations support master-slave replication and
sites are given higher weights. multimaster replication even though replication is not part of the cur-
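A sketch of quorum-consensus reads and writes with version numbers, as described above: lock any set of replicas whose weights reach the quorum, read the value with the highest version, and write with a version one larger than the highest seen. The Replica class, the weights, and the quorum values are invented for the example (S = 3 unit-weight sites with Qr = Qw = 2, which satisfies Qr + Qw > S and 2*Qw > S).

from dataclasses import dataclass

@dataclass
class Replica:
    id: int
    version: int = 0
    value: object = None

def pick_quorum(replicas, weights, quorum):
    chosen, total = [], 0
    for r in replicas:
        chosen.append(r); total += weights[r.id]
        if total >= quorum:
            return chosen
    raise RuntimeError("quorum not reachable")   # e.g., too many failed sites

def quorum_read(replicas, weights, Qr):
    chosen = pick_quorum(replicas, weights, Qr)
    newest = max(chosen, key=lambda r: r.version)     # highest version wins
    return newest.value, newest.version

def quorum_write(replicas, weights, Qw, value):
    chosen = pick_quorum(replicas, weights, Qw)
    new_version = max(r.version for r in chosen) + 1  # one more than highest seen
    for r in chosen:
        r.value, r.version = value, new_version

sites = [Replica(1), Replica(2), Replica(3)]
w = {1: 1, 2: 1, 3: 1}
quorum_write(sites, w, 2, "x=42")
print(quorum_read(sites, w, 2))    # ('x=42', 1); the unwritten replica is stale but outvoted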
• Read One, Write All Approach: rent std.
– Unit weights to all sites, read quorum = 1 and write quorum =
n (all sites) Parallel Databases
– No need of version number since even one site failed will not
allow write to the data item to happen Parallelism is used to: speedup (queries are executed faster because more
resources, such as processors and disks, are provided) and scaleup (increas-
– To allow this to work in the case of failures, we could use “read
ing workloads are handled without increased response time, via an increase
one, write all available” - but there are severla complications
in the degree of parallelism)
that can arise in the case of network partitions or temporary site
failures (the site will not know and may have to explicitly cathc
up). Inconsistencies can arise in the case of network partitions. I/O Parallelism
• Site Reintegration: The recovering site must ensure that it gets the • Horizontal Partioning: Tuples of a relation are divided (or declus-
latest values and in addition must continue to receive updates as it is tered) among many disks, so that each tuple resides on one disk
recovering. An easy solution is to halt entire system temporarily, but • Partioning Techniques
this is usually not feasible. Recovery of a link must be informed to
– Round-robin (ith tuple to disk number Dimodn : ensures even
all sites.
distribution of tuples across disks (each disk has approx. the
• Comparison with Remote Backup: In remote backup, concurrency same number of tuples)
control and recovery are performed at a single site (overhead with
– Hash partioning (hashing is on the chosen partioning attributes
2PC are avoided); only data and log records are shipped across.
of the tuples)...if the has function returns i, the tuple is placed
Transaction code is only at one site. Remote backup system offer
on disk Di
a lower-cost approach to availability than replication. On the other
hand, replication can provide greater availability by having multiple – Range partitioning: Contiguous attribute-value ranges to each
replicas and using the majority protocol. disk based on a partitioning attribute.
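The round-robin, hash, and range declustering techniques listed under I/O Parallelism all reduce to a tuple-to-disk mapping. A minimal sketch follows; the partition vector, keys, and disk counts are arbitrary examples.

def round_robin(i, n):
    """The i-th tuple (0-based) goes to disk i mod n."""
    return i % n

def hash_partition(key, n):
    """Tuple goes to disk h(partitioning attribute) mod n."""
    return hash(key) % n

def range_partition(key, vector):
    """Partition vector [v0, v1, ...]: disk j holds keys <= vj (and the
    last disk takes everything larger)."""
    for j, bound in enumerate(vector):
        if key <= bound:
            return j
    return len(vector)

print(round_robin(7, 4))                    # 3
print(range_partition(55, [10, 50, 100]))   # 2: 55 falls between 50 and 100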
• Coordinator Selection: • Comparison of partitioning technique based on the access technique:
scanning entire relation, point queries, range queries
– When there is not enough information avlbl to continue from
the failed coord, the backup coord can abort all (or several) cur- – Round-robin: Good for sequential scan of entire data; bad for
rent xacts and restart them under the control of the new coord range and point queries (since each of the n disks must be
searched)
– bully algorithm: If some site is electing itself the coord, then
the site must wait to hear the election message within a prede- – Hash-partioning: Best for point queries; also suited for sequen-
termined time interval. If it does not hear this message, this site tial scans of the entire relation (since the hash function could
will restart the election algo. ensure that the data are evenly distributed); not good for range
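A compressed sketch of the bully-style election mentioned above: a site that times out on the coordinator challenges all higher-numbered sites and takes over only if none of them answers. Real implementations exchange election/answer/coordinator messages; here is_alive() stands in for that exchange, and the site ids are made up.

def bully_election(my_id, all_ids, is_alive):
    """Return the new coordinator's id as seen from site my_id.
    is_alive(site_id) stands in for sending an election message and
    waiting (with a timeout) for an answer."""
    higher = [s for s in all_ids if s > my_id]
    if not any(is_alive(s) for s in higher):
        return my_id                       # nobody above answered: this site wins
    # Some higher site answered; the highest live site becomes coordinator.
    return max(s for s in higher if is_alive(s))

alive = {1, 2, 4}          # the old coordinator (5) and site 3 are down
print(bully_election(2, [1, 2, 3, 4, 5], lambda s: s in alive))   # -> 4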
(since all disks must be searched)
– Range-partitioning: Suited for range as well as point queries;
Distributed Query Processing point-queries can be answered by looking at the partition vector
Must take into account: cost of data transmission over the n/w as well as the Range parititioning results in higher throughput while main-
hard disk access and the potential gain in performance from having several taining good response time when a query is sent to one disk
sites process parts of the xact in parallel (only a few tuples in the queried range - other disks can be
• Query Transformation: Choose the replica for which the transmis- used for other queries). On the other hand, when many tuples
sion cost is the lowest; can make use of the fact that the selection are to be fetched from a few disks, this could result in an I/O
only fetches tuples from a (fragmented) replica. bottleneck (hostpot) at those disks.
• Simple Join Processing: • Choice of partioning affects other operations such as joins; in general,
– Ship all copies to S1 range or hash partioning are preferred to round-robin.
– Ship to S1 ; compute join; ship result to S3 ; compute join; ship • If a relation consists of m disk blocks and there are n disks, then the
result to S1 (or roles interchanged) relation should be allocated to min(m, n) disks (preferably, try to fit
– Need to worry about the volume of data being shipped; also, relations that fit in a block to a single disk).
indices may have to be re-created at the shipped site • Handling of skew: Attribute-value skew and partition skew
• Semijoin Strategy: Compute r1 o n r2 o n ΠR1 ∩R2 (r1 ) Semijoin: – Attribute-value skew: all tuples with the same value for the
r1 n r2 = ΠR1 (r1 o n r2 ). That is, the semijoin selects those tuples partitioning attribute end up in the same partition; can occur
of relation r1 that contributed to r1 o
n r2 regardless of whether range or hash partioning is used.
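In standard notation, the semijoin and the shipping strategy built on it (with r1 at site S1 and r2 at site S2, as in the usual textbook presentation) are:

\[ r_1 \ltimes r_2 \;=\; \Pi_{R_1}(r_1 \bowtie r_2) \]
\[ r_1 \bowtie r_2 \;=\; r_1 \bowtie \bigl( r_2 \bowtie \Pi_{R_1 \cap R_2}(r_1) \bigr) \]

Only the projection of r1 on the shared attributes travels to r2's site, and only the matching r2 tuples travel back, which is the whole point of the strategy when few tuples join.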
• Join Strategies that exploit parallelism: Pipelined-join technique can – Partition-skew: Load imbalance in the partioning, even when
be used, for example, for r1 o n r2 o
n r3 on r4 there is no attr. skew; range partitioning may result in partition
skew, if the partition vector is not chosen carefully; partition
skew is less likely to occur with hash partitioning, if a good
Hetergeneous Dist. DB hash function is chosen.
• Unified View of Data: Difficult because: endianness, ASCII v/s • Loss of speedup due to skew increases with parallelism
EBCDIC, units of measurement, Strings in different languages • Techniques to overcome skew:
(“Cologne” v/s “Koln”) – Balanced range-partitioning vector: Sort by partitioning attr.
• Query Processing: Wrappers for local schema to global schema map- and distribute equally (1/n); can still result in skew; also cost
ping and back; mediator system do not bother about xact processing for sorting is high
– Use histogram to reduce skew further – Asymmetric Frag-and-rep join: Fragment one of the relations
– Use concept of virtual processors: mapping of real virtual pro- “r” using any partitioning technique. Replicate the other rela-
cessors to real in round-robin tion “s”. Each processor performs the join of ri and s using
any join technique.
– (Symmetric) Frag-and-rep join: Fragment both of the relations
Interquery Parallelism using any partitioning technique; the paritions need not be of
• Different queries or xacts execute in parallel with one another the same size. Each processor performs join of ri and sj .
• Xact throughput can be increased, but the response times of individ- – Asymm. frag-and-rep join is useful when one of the relations
ual xacts are no faster than they would be if the xacts were run in “s” is smaller; it can be replicated to all processors.
isolation • Parallel Join using Partitioned Parallel Hash Join:
• Pimary use of interquery parallelism is to scaleup a xact processing – Hash paritition each relation and send to the respective proces-
system to support a large number of xacts per second sor
– As each processor receives the tuples, it performs a local hash
• Not useful for speeding up long running tasks, since each task is ex-
join (build and probe relation)
ecuted sequentially
– Hybrid hash-join could also be used locally to cache the incom-
• Easiest form to support (esp. in shared-memory parallel system)
ing tuples in memory, and thus avoid the costs of writing them
• Cache-coherency problem: can be solved by locking a page in mem- and of reading them back in.
ory before any read or write access and flushes the page to the shared • Parallel Join using Parallel Nested Join:
disk before it releases the lock
– Asymm. frag-and-rep. can be used along with indexed nest
• Other way is to access the latest value from the buffer pool of some loop join at each processor
other processor – The indexed nested loop join can be overlapped with the dis-
tribution of the tuples in “s” to reduce the costs of writing the
Intraquery Parallelism tuples of “s” to disk and to read them back.
• Other operations:
• Execution of a single query in parallel on multiple processors and
– Selection for a range can proceed in parallel at each proces-
disks
sor whose range partition overlaps with the specified range of
• Useful for speeding up long-running tasks values in the selection.
• Intraoperation Parallelism: Speed up processing of a query by par- – Duplicate elimination can be parallelized using parallel sort-
allelizing the execution of each individual operation (such as sort, ing technique or hash / range partitioning and eliminating the
select, project, and join) duplicates locally at each processor.
• Interoperation Parallelism: Speed up processing by executing in – Projection w/o duplicate elimination can be done as tuples are
parallel the different operations in a query expr read in from the disk in parallel.
• Both can be used simulatenously on a query – Aggregation can be done in parallel by paritioning on the
• Since the number of ops in a typical query is small compared to the grouping attributes and then computing the aggregate locally
number of tuples processed by each operation, intraop parallelism at each processor. (Either range or hash partitioning can be
scales better with increasing parallelism. However, with relatively used). The cost of transferring tuples can be reduced by partly
small number of processors, both forms of parallelism are important computing the aggregate values before partitioning (and then
using partitioning as before).
• Cost of Parallel Evaluation of Operations: Start-up costs, Skew,
Intraoperation Parallelism: Contention for resources, Cost of assembling the final result
• Parallel Sort using Range-Partioning Sort: T otalT ime = Tpart + Tasm + max(T0 , T1 , . . . , Tn−1 , where
T part is the time for partitioning the relations.
– Range partition the data as per the sorting attribute and send
A paritioned parallel evaluation is only as fast as the slowest of
it to the respective processors
the parallel executions.
– Each processor sorts within the range
– The final merge is trivial since the range partitioning in the first Interoperation Parallelism:
phase ensures that all key values in processor Pi are less than
those in Pj , for all i ≤ j. • 2 forms: Pipelined Parallelism and Independent Parallelism
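A toy, single-process version of the range-partitioning sort above: tuples are split by a partition vector, each partition is sorted "locally", and the final merge is plain concatenation because every key in partition i is no larger than every key in partition j for i < j. The partition vector and data are arbitrary; lists stand in for processors.

def range_partition_sort(tuples, key, vector):
    """vector = [v0, v1, ...]; processor i gets keys <= vi not already
    assigned, and the last processor gets everything larger."""
    parts = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        k = key(t)
        for i, bound in enumerate(vector):
            if k <= bound:
                parts[i].append(t)
                break
        else:
            parts[-1].append(t)
    for p in parts:                  # "local" sort at each processor
        p.sort(key=key)
    result = []                      # trivial final merge: concatenate in order
    for p in parts:
        result.extend(p)
    return result

print(range_partition_sort([9, 2, 7, 4, 1], key=lambda x: x, vector=[3, 6]))
# [1, 2, 4, 7, 9]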
• Parallel Sort using Parallel External Sort-Merge: • Pipelined: Major advantage is that the intermediate results are not
written to disk; they are just fed to the other processors in the pipeline
– Each processor locally sorts the data on its disk
• Independent: r1 join r2 can be computed independently of r3 join r4;
– Merging of the sorted runs is done similar to external sort- has lower degree of parallelism
merge. Merging can be done by range-partitioning the sorted
data at each processor and then each processor sending the val-
ues in each partition to the respective processor.
Query Optimization:
Could result in execution skew where each processor will be- • Avoid long pipelines (resources will be hoarded) and it will take time
come a hot-spot when it is its turn to receive the tuples. To for the first input to reach the last processor in the pipeline
avoid this, each processor send the first block of every parti- • Advatnage of parallelism could get negated by the overhead of com-
tion, then the second block of every partition, and so on. As a munication
result, all processors receive data in parallel. • Heuristic 1: Consider only evaluation plans that parallelize every op-
• Parallel Join using Partitioned Join: eration across all processors, and that do not use any pipelining
– Works only for equi-joins or natural joins • Heuristic 2: Exchange-operator model: Exchange operators can be
– Range-parition or hash partition can be used to partition the 2 introduced into an evaluation plan to transform it into a parallel eval-
relations (r and s) to be joined uation plan
– Each processor can perform the join for the ith partition of r A large parallel databse system must also address these availability is-
and s sues: Resilience to failure of some processors or disks; On line reorganiza-
tion of data and schema changes.
– To prevent skew, the range partitioning vector must be such that
Online index construction : should not lock the entire relation in shared
the sum of the sizes of ri and si is roughly equal over all i
mode as it is done usually; instead it should keep track of updates that oc-
• Parallel Join using Fragment-and-replicate Join: cur while it is active and incroporate the changes into the index being con-
– Works for any kind of join structed.
XML • API for XML Processing: DOM (Document Object Model) and SAX
(Simple API for XML)
• Self-documenting (because of presence of tags)
– SAX is useful when the application needs to create its own data
• Format of the document is not rigid (e.g., extra tags)
representation of the data read from the XML document
• XML allows nested structures
• Wide-variety of tools available
• XML Schema Definitions: Storage of XML Data
– DTD (Document Type Definition)
• Non-relational Data Stores: Flat file and special XML database (suf-
– XSD (XML Schema Definition)
fers from no support for atomicity, transactions, concurrency, data
– Relax NG isolation and security)
• DTD: • Relational Databases:
– ELEMENT, ATTLIST, default values supported, #PCDATA,
(+, *, ? for repetitions), empty and any – Store as a String: Database does not know the schema of the
– Attributes can have #REQUIRED or #IMPLIED stored elements; searching is inefficient; additional fields may
be stored (at the cost of redundancy) for indexing; function in-
– ID, IDREF and IDREFS for uniqueness, references and list of
dices can be used A
references (e.g., “owns”)
– Tree Representation: nodes(id, type, label, value) and
– Limitations of DTD:
child(child id, parent id); position can be added, if order must
∗ Text elements and attributes cannot be constrained to be be preserved; many XML queries can be converted to relational
of specific types ones; disadvantage of large number of joins
∗ Difficult to specify unordered sets of subelements
∗ Lack of typing in IDs and IDREF or IDREFS – Map to Relations: All attributes are stored as string-valued at-
∗ (?) No support for user-defined types tributes of the relation; if subelement of simple type, add as an
attribute of the relation; else, add as a separate relation; par-
• XSD:
ent id needs to be added; position can be added for position;
– Specified in XML syntax relations that can occur at most once can be “flattened” into
– Support for type checking (simple as well as user-defined (com- the parent relation by moving all their attributes into the parent
plexType and sequence) relations
– Specification of keys and key references using xs:key and – Publishing and Shredding XML Data: “Publish” means “to
xs:keyref XML from relational”; “Shredding” means “to relational from
– Benefits over DTD: XML”
∗ Text can be constrained to specific types or sequences ∗ Publishing: An XML element for every tuple and every
∗ Allows user-defined types column of the relation as a subelement of the XML ele-
∗ Allows uniqueness and foreign-key constraints ment (more complicated for nesting)
∗ Integrated with namespaces to allow different parts of a ∗ Shredding: Similar to “Map to Relations”
document to conform to different schemas
– Native Storage within a Relational Database: Using CLOB
∗ Allows maximum and minimum value checking
and BLOB; binary representations of the XML can be stored
∗ Allows complex types to be inherited through a form of
directly as a BLOB; some dbs provide xml data type; Xquery
inheritance
can be executed on a XML document within a row and a SQL
query can be used for iterating over the required rows
Querying and Transformation:
– SQL / XML: XML extensions to SQL; xmlelement, xmlat-
• XPath, XQuery (FLWOR expressions), XSLT tributes, xmlforest, xmlagg, xmlconcat
• XPath:
– Nodes are returned in the same order as they appear in the doc-
XML Applications:
ument
– @ is used for attributes • Storing Data with complex structure (such as bookmarks)
– /bank/account[bal > 400]/@account_no • Standardized Data Exchange Formats (e.g., ChemML, RosettaNet)
– count function for counting the nodes matched • Web Services (SOAP) - Web Services provide a RPC call interface
– — operator for union of the results with XML as the mechanism for encoding parameters as well as re-
– // for slipping multiple levels, .. specifies parent sults
– function doc(name) allows to look into the document whose • Data Mediation (collecting data from various web sites / sources and
name is specified presenting a single XML view to the user; e.g., showing user’s bank
(e.g., doc("bank.xml")/bank/account) account details from various banks)
• XQuery:
– Uses XPath and is based on XQL and XML-QL
Additional Research Papers
– Uses FLWOR Expressions: for, let, where, order by, return
– for statement is like “from” in SQL Mention about tree algo here...keeping id and pre-order numbering, etc.
– return statement treats everything as plain text to be output ex-
cept for strings within which are treated as expressions to be
evaluated Advanced Transaction Processing
– return can have nested queries
TP-Monitor Architectures:
– User-defined functions and types are allowed
– some and every can be used for testing existential and universal • Process-per-client model
qualification • Single-server model (multithreaded; a bug in one app can affect all
• XSLT: other apps; not suited for parallel or distributed databases)
– Templates are used • Many-server, single-router model (PostgreSQL, Web apps)
– “match” and “xsl:value-of select” are used • Many-server, many-router model (very high performance web sys-
– xsl:key and xsl:sort tems, Tandem Pathway)
Main Memory DB: Thus, it appears that the enforcement of xact atomicity must either
lead to an increased probability of long-duration waits or create a
Since disk I/O is often the bottleneck for reads/writes, we can make the db
possibility of cascading rollback.
system less disk bound by increasing the size of the database buffer. Since
memory sizes are increasing and costs are decreasing, an increasing number • Concurrency Control:
of apps can be expected to have data fit into main memory. Larger main – Correctness may be achievable without serializability
memories allow faster processing of transactions, since data are memory
– Could split db into sub-dbs on which concurrency can be man-
resident. But there are still disk-related limitations:
aged separately
• Log records must be written to stable storage (logging process will
become a bottleneck). Could use NVRAM or group commit to re- – Could use concurrency techniques that exploit multiple ver-
duce the overhead imposed by logging. sions
• Buffer blocks marked as modified by committed xacts still have to be • Nested and Multilevel xacts:
written so that the amount of log that needs to be replayed at recovery – A long-duration xact may be viewed as a set of sub xacts
time is reduced.
– If a sub xact of T is permitted to release locks on completion,
• After a crash recovery, even after recovery is complete, it takes some
T is called a “multilevel xact”
time before the db is fully loaded in main memory
Opportunities for optimization: – If locks held by a sub xact of T are automatically assigned to T
• Data Structures with pointers can be used across pages (unlike those on completion of the sub xact, it is called “nested xact”
on the disk) • For large data items:
• There is no need to pin pages in memory before they are accessed,
– Difficult to store both old and new values; therefore, we can
since buffer pages will never be replaced
use the concept of logical logging
• Query-processing techniques should be designed to minimize space
overhead (otherwise, main memory limits may be exceeded and – Shadow-copy technique can be used to keep copies of pages
swapping may take place slowing the query processing) that have been modified
• Operations such as locking and latching may become bottlenecks -
these should be improved.
Xact Mgmt in Multidb
• Recovery algos can be optimized, since pages rarely need to be writ-
ten out to make space for other pages. See Practice Exercise 25.5 and 25.8
“Group commit” is to reduce the overhead of logging by delaying writes Strong correctness
to a log until a batch is ready. This results in a slight delay in the commit of
transactions that perform updates.
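A sketch of group commit as just described: log records accumulate in a buffer and are forced to stable storage only when a batch fills or a short delay expires, so one sequential write covers many transactions. The flush callback, batch size, and delay below are placeholder choices, not tuned values.

import time

class GroupCommitLog:
    """Batches log writes; a commit is acknowledged only after the flush
    that contains its records, trading a small delay for fewer I/Os."""
    def __init__(self, flush, batch_size=64, max_delay=0.005):
        self.flush = flush            # writes a list of records to stable storage
        self.batch_size = batch_size
        self.max_delay = max_delay
        self.buffer = []
        self.oldest = None

    def append(self, record):
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(record)
        if (len(self.buffer) >= self.batch_size or
                time.monotonic() - self.oldest >= self.max_delay):
            self._force()

    def _force(self):
        self.flush(self.buffer)       # one sequential write for the whole batch
        self.buffer = []

log = GroupCommitLog(flush=lambda batch: print(f"flushed {len(batch)} records"))
for i in range(130):
    log.append(("commit", i))
# -> two flushes of 64 records; the last 2 records wait for the next batch or timeout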
Two-Level Serializability (2LSR): Ensure serializability at 2 levels:

Real-Time Xact Systems: • Each local db system ensures local serializability among its local
xacts, including those that are part of a global xact
Systems with deadlines are called “real-time systems”.
• The mutidb system ensures serializability among the global xacts
• Hard deadline: Serious problems, such as system crash, may occur if alone - ignoring the orderings induced by the locak xacts.
a task is not completed by its deadline.
• Global-read protocol: allows global xacts to read, but not to update
• Firm deadline: The task has zero value if not completed by its dead-
local data item, while disallowing all access to global data by local
line.
xacts
• Soft deadline: The task has diminishing value if it completed after
the deadline. – Local xacts access only local data items
• Pre-emption of lock or rolling back a xact may be required – Global xacts may access global data items, and may read local
• Variance in xact execution time (disk access v/s in memory, locking, data items (though they must not write local data items)
xact aborts, etc.) can cause difficulty in supporting real-time con- – There are no consistency constraints between local and global
straints. data items
• Local-read protocol: allows local xacts to read global data, but disal-
Long-Duration Xacts: lows all access to local data by global xacts
• Properties: – Local xacts may access local data items, and may read global
– Long duration (human interaction) data items stored at that site (though they must not write global
data items)
– Exposure of uncommited data
– Subtasks: User may want to abort a subtask only without – Global xacts access only global data items
rolling back the entire xact – No xact may have a value dependency (A xact has value de-
– Recoverability: Aborting a long-duration interactive xact be- pendency if the value that it writes to a data item at one site
cause of a system crash is unacceptable depends on a value that it read for a data item on another site).
– Performance: Fast response time is expeted in contrast to • Global-read-write / local-read protocol: most generous; allows global
throughput (number of xacts per second) xacts to read and write local data, and allows local xacts to read global
• Nonserializable Executions: Enforcement of serializability can cause data.
problems for long-duration xacts – Local xacts may access local data items, and may read global
– 2 PL: Longer waiting times (since data items locked are not re- data items stored at that site (though they must not write global
leased until no other data items are needed to be locked). This, data items)
in turn, leads to longer response time and increased chance of – Global xacts may read and write global as well as local data
deadlock. items
– Graph-based protocols: An xact may have to lock more data
– There are no consistency constraints between local and global
than it needs. Long-duration lock waits are likely to occur.
data items
– Timestamp-based protocols: No waiting for locks, but xact
could get aborted. Cost of aborting a long -duration xact may – No xact may have a value dependency (A xact has value de-
be prohibitive pendency if the value that it writes to a data item at one site
depends on a value that it read for a data item on another site).
– Validation protocols: Same as that for timestamp-based proto-
cols Ticket-based systems can also be used.
Data Warehousing Example: Finding the cumulative balance in an account, given
a relation specifying the deposits and withdrawals on an ac-
• A data warehouse is a repository of data gathered from multiple
count
sources and stored under a common, unified database schema.
• 2 types of db apps: Transaction Processing and Decision Support
Data Warehousing:
• Transaction Processing: Record info about transactions
• Decision Support: Aim to get high-level of info from the detailed • When and what data to gather: Source-based / Destination-based
info stored in transaction-processing systems, and to use the high- (push / pull)
level info to make decisions • What schema to use
• DSS aim to get high-level information from the detailed information • Data transformation and cleansing: merge-purge, deduplication,
stored in a transaction-processing system. householding, other types such as units of measurement
• Issues related DSS: • How to propagate updates: Same as view-maintenance problem
– OLAP deals with tools and techniques that can give nearly in- • What data to summarize
stantaneous answers to queries requesting summarized data, • ETL (Extract, Transform, Load)
even though the database may be extremely large • Warehouse Schemas:
– Database query languages are not suited to the performance of – Fact tables: tables containing multi-dimensional data
detailed statistical analyses of data (SAS and S++ do much bet- – Dimensional tables: To minimize storage requirements
ter) (foreign-key looked up into other tables)
– For performance as well as for organization control, data – Star schema, Snowflake schema
sources will not permit other parts to retrieve data. DW gather • Components of a DW: Data Loaders, DBMS, Query and Analysis
data from multiple sources under a unified schema at a single Tools (+ data sources)
site
– Data Mining combined knowledge-discovery with efficient im- Data Mining:
plementations that can be used on extremely large databases
• Measure attributes: Those that can be measured (e.g., price, quantity • Classifiers:
sold) – Decision-Tree Classifiers
• Dimension attributes: Other attributes; these are the dimensions on – Bayesian Classifiers (easier to construct than decision-tree
which the measure attributes, and the summary of measure attributes, classifiers and work better in the case of null or missing attibute
are viewed. values)
• Multidimensional Data: Data that can be modeled as dimension attrs • Other types of data mining: clustering, text mining, data visualization
and measure attrs are called multi-dimensional data • TODO: Details about classifiers here
• Cross-tabulation (aka cross-tab or pivot table): A table where values
for one attribute form the row headers, values for another attributes Advanced App Dev
form the column headers, and where the cell values represent some
aggregate. For example, a table that has the sum of quantity sold for • Benchmarks are standardized sets of tasks that help to characterize
item name as row headers and color as column headers for all sizes. the performance of db systems. They help to get a rough ideas of the
• A change in the data may result in more columns being added to the hardware and software requirements of an app, even before the app
cross-tab (e.g., when a new colored item is added to the data, it will is built.
appear as a new column in the above cross-tab) • Tunable parameters at 3 levels:
• Data Cube: Generalization of a cross-tab to “n” dimensions – Hardware Level: CPU, Memory, Adding disks or using RAID
• For a table with n dimensions, aggregation can be performed with – DB System params: Buffer sizes, checkpointing intervals
grouping on of the 2n subsets of the n dimensions. Grouping on the – Higher level: Schema (indices), transactions
set of all n dimensions is useful only if the table may have duplicates. These must be considered together; a tuning at one level may result
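The 2^n groupings behind a data cube can be enumerated directly. A small sketch that sums a measure attribute over every subset of the dimension attributes; the relation, column names, and data are invented for the example.

from itertools import combinations
from collections import defaultdict

rows = [
    {"item": "shirt", "color": "blue",  "size": "M", "qty": 3},
    {"item": "shirt", "color": "white", "size": "M", "qty": 5},
    {"item": "pant",  "color": "blue",  "size": "L", "qty": 2},
]
dims = ["item", "color", "size"]

cube = {}
for k in range(len(dims) + 1):                 # all 2^n groupings
    for group in combinations(dims, k):
        agg = defaultdict(int)
        for r in rows:
            key = tuple(r[d] for d in group)   # () is the grand total
            agg[key] += r["qty"]               # sum of the measure attribute
        cube[group] = dict(agg)

print(cube[()])             # {(): 10}, the overall total
print(cube[("color",)])     # {('blue',): 5, ('white',): 5}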
• Operations on a data cube: in a bottleneck in another (e.g., tuning at a higher level may result in
– Pivoting: The operation of changing the dimensions used in a a bottleneck at the CPU level)
cross-tab. • Tuning of hardware:
– Slicing / Dicing: Viewing the data cube for a particular value – For today’s disk, average access time is 10 ms and avg. transfer
of a dimension (aka called dicing particularly when the values rate of 25 MB/s
for multiple dimensions are fixed) – A reduction of 1 I/O per second saves: (price per disk drive) /
– Rollup: From finer to coarser granularity (e.g., rollup a table on (access per second per disk)
the size attribute) Drill Down: From coarser to finer granularity – Storing a page in memory costs: (price per MB of memory) /
• Hierarchy on dimensions: Date/Time (and Hour of day), Date, Month (number of pages per MB of memory)
(and Day of Week), Quarter, Year – Break-even point is:
• OLAP Implementation: price per disk drive price per MB of memory
n∗ =
– MOLAP: OLAP cube stored in “multi-dimensional arrays” access per second per disk pages per MB of memory
– ROLAP: Relational OLAP (data stored in relational database)
– 5-minute rule
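Written out, the break-even point behind the 5-minute rule says a page is worth keeping in memory if it is accessed at least once every n* seconds, where

\[
  n^{*} \;=\;
  \frac{\;\text{price per disk drive} \,/\, \text{accesses per second per disk}\;}
       {\;\text{price per MB of memory} \,/\, \text{pages per MB of memory}\;}
\]

With the price and performance numbers quoted in the text this ratio comes out to roughly five minutes for randomly accessed pages, and the same argument applied to sequential access gives the 1-minute rule.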
– HOLAP: Hybrid OLAP (some in memory and some in rela-
– For sequentially accessed, we get 1-minute rule
tional)
– RAID 5 is much slower than RAID 1 on random writes: RAID
• Simple Optimization: compute aggregates from an already computed
5 requires 2 reads and 2 writes to execute a single randown
aggregration, instead of from the original relation; this does not work
write
for non-decomposable aggregate functions such as “median”
– If an app performs r reads and w writes (random), then RAID
• For n dimension attributes, there can be 2n groupings 5 will require r + 4w I/O ops per second; RAID 1 will require r
• SQL-1999 constructs: + w I/O ops per second
– rank, dense rank, stddev, variance group by cube, group by – If we take the current disks performance as 100 I/Os / second,
rollup, percent rank, ntile, cume dist we can find the number of disks required (e.g., (r+w)/100).
– Windowing: rows unbounded preceding, between rows 1 pre- This value is enough to hold 2 copies of all the data. For such
ceding and 1 following range between 10 preceding and cur- apps, if RAID 1 is used, the number of disks required is actu-
rent row ally less than if RAID 5 is used.
– RADI 5 is useful only when the data storage requirements are • TPC-R: Db is permitted to use mat. views and other redundant info
large and the data xfer and I/O rates are small (that is, for very • TPC-H: Ad-hoc (prohibits mat. views and other redundant info)
large and very “cold” data) • TPC-W: Web commerce (performance metrics are in WIPS - Web
• Tuning of schema: Use denormalized relation or materialized views instructions per second)
• Tuning of indices: • App migration: Big-bang approach v/s Chicken-little approach
– Removing of indices may speed up updating
– For range queries, B+-tree indices are preferable to hash in- Spatial Data
dices
– If most number of queries and updates are clustered, clustered • Nearness Queries and Region queries (inside a region, etc.)
indices could be used • Hash joins and sort-merge joins cannot be used on spatial data
• Materialized Views: Using deferred view maintenance reduces the
burden on updates. Indexing of Spatial Data
• Automated tuning of physical design:
• k-d Trees:
– Greedy heuristics: Estimates costs of using materialized differ-
– Partitioning is done along one-dimension at the node ath the
ent indices / views and the cost of maintaining it
top level of the tree, along another dimension in nodes at the
– Choose the one that prvoides max benefit per unit storage space next level (and so on, cycling through all the dimensions)
– Once this has been chosen, recompute the cost of other indices – One-half point in one partition and one-half in the other
/ views
– Partioning stops when a node has less than given max. number
– Continue the process until the space avaiable for storing the of points
mat. indices / views is exhausted or the cost of maintaining
– Each line in the diag. corresponds to a node in the k-d tree
the remaining candidates is more than the benefit to the queries
that could use indices / views. – k-d-B tree extends the k-d tree to allow multiple child nodes
for each internal node (just like B-tree extends a binary tree) to
• Tuning of transactions:
reduce the height of the tree. k-d-B trees are better suited for
– Improve set of orientation (e.g., by using proper group by or by secondary storage than k-d trees.
using stored procedures)
• Quadtrees:
– Reduce lock contention (maybe use weaker levels of consis-
– Each node of a quadtree is associated with a rectangualr region
tency)
of space.
– Minibatch transactions
– Each non-leaf node divides its region into 4 equal-sized quad-
rants
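A minimal k-d tree build in the spirit of the description above: cycle through the dimensions by depth, split so that about half the points fall on each side, and stop when a node holds at most a few points. The function name, leaf threshold, and sample points are arbitrary choices for the sketch.

def build_kd(points, depth=0, leaf_size=2):
    """points: list of (x, y, ...) tuples; returns a nested dict tree."""
    if len(points) <= leaf_size:
        return {"points": points}
    axis = depth % len(points[0])            # cycle through the dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                      # about half the points on each side
    return {
        "axis": axis,
        "split": pts[mid][axis],
        "left": build_kd(pts[:mid], depth + 1, leaf_size),
        "right": build_kd(pts[mid:], depth + 1, leaf_size),
    }

tree = build_kd([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["axis"], tree["split"])   # 0 7: the first split is on x at value 7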
Performance Benchmarks • R-trees:
Use harmonic mean for different xact types: – A rectangular bounding box is associated with each tree node
n – Ranges may overlap (as compared to the B+-trees, k-d trees
1
+ 1
+ ... + 1 and quadtrees.
t1 t2 tn
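Written out, the harmonic-mean throughput over the n transaction types with per-type throughputs t1, ..., tn is:

\[
  \text{throughput} \;=\; \frac{n}{\dfrac{1}{t_1} + \dfrac{1}{t_2} + \cdots + \dfrac{1}{t_n}}
\]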
– A search for objects containing a given point has to follow all
• TPC-A: Single type xact that models cash withdrawal and deposit at child nodes whose bounding boxes contain the point
a bank teller (not used widely currently) – The storage efficiency of R-trees is better than that of k-d trees
• TPC-B: Same as TPC-A but focuses only on back-end db server) (not or quadtrees, since a object is stored only once. However,
used widely currently) searching is not efficient in R-trees since multiple paths may
• TPC-C: More complex system model; order entry, etc. have to be searched. Inspite of this, R-trees are popular (be-
• TPC-D: For decision support (scale factor is used - scale factor of 1 cause of space efficiency and similarity to B-trees)
represents the benchmark on a 1 GB db) Read insertion, deletion and searching from book