Outline
Purpose: develop technology specs
Deliverable: program/data structures, technology purchases, organization redesigns
Implementation Maintenance
Purpose: translate the logical description of data into the technical specifications for storing and retrieving data
Goal: create a design for storing data that will provide adequate performance and ensure database integrity, security, and recoverability
Input
  Number of rows
  Number of distinct values for each attribute, ranges
  Ideally, the set of queries (including insert, update, delete, select) and their frequencies
  Performance expectations, e.g., average response time, number of queries per minute
Output (decisions)
  Attribute data types
  Physical record descriptions
  File organization
  Indexes
  Query optimization
These decisions follow from the inputs: the relations, volume estimates, performance expectations, security needs, and the technology used.
Figure 6-1 Composite usage map (Pine Valley Furniture Company) (cont.)
Data volumes
Usage analysis (starting from purchased parts):
  140 purchased parts accessed per hour
  80 quotations accessed from these 140 purchased part accesses
  70 suppliers accessed from these 80 quotation accesses
Usage analysis (starting from suppliers):
  75 suppliers accessed per hour
  40 quotations accessed from these 75 supplier accesses
  40 purchased parts accessed from these 40 quotation accesses
Designing Fields
Field: the smallest unit of data in a database
Field design
Figure 6-2 Example code look-up table (Pine Valley Furniture Company)
Code saves space, but costs an additional lookup to obtain actual value
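The trade-off can be sketched with two small tables. This is only an illustration: the table names (product, finish_lookup), the column names, and the code values are assumptions made for the example, not the contents of Figure 6-2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: finish_lookup maps a one-character code to its full value.
    create table finish_lookup (code text primary key, value text);
    create table product (pid integer primary key, description text, finish_code text);
    insert into finish_lookup values ('A', 'Birch'), ('B', 'Maple'), ('C', 'Oak');
    insert into product values (1, 'End table', 'C'), (2, 'Coffee table', 'A');
""")

# The product row stores only the short code, which saves space ...
stored = conn.execute(
    "select finish_code from product where pid = 1"
).fetchone()[0]

# ... so obtaining the actual value costs one additional lookup (here, a join).
actual = conn.execute(
    "select f.value from product p "
    "join finish_lookup f on f.code = p.finish_code "
    "where p.pid = 1"
).fetchone()[0]

print(stored, actual)
```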
Motivation
Suppose the following query is running really slowly. What can you do to make it faster without buying new hardware?
Return the product ID and quantity of each product in the order with oid = 2:
  Select pid, quantity From order_line Where oid=2
If order_line has 10000 rows and no index on oid, the DBMS has to scan all of them to find the rows with oid=2.
We can create an index on order_line.oid
The index will be used to find the order_line rows with oid=2
Only those rows will be accessed
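One way to see this effect is to ask the DBMS for its query plan before and after creating the index. The sketch below uses SQLite through Python's sqlite3 module as a stand-in DBMS. The 10000 rows are made up and the index name idx_order_line_oid is an assumption. Other systems have their own EXPLAIN syntax and output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table order_line (oid integer, pid integer, quantity integer)")
# 10000 made-up rows: 2500 orders with 4 order lines each.
conn.executemany(
    "insert into order_line values (?, ?, ?)",
    [(i // 4 + 1, i % 4 + 1, 1 + i % 3) for i in range(10000)],
)

query = "select pid, quantity from order_line where oid = 2"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable plan step.
    return [row[-1] for row in conn.execute("explain query plan " + sql)]

plan_before = plan(query)   # a full scan of order_line
conn.execute("create index idx_order_line_oid on order_line (oid)")
plan_after = plan(query)    # a search that uses idx_order_line_oid

rows = conn.execute(query).fetchall()
print(plan_before, plan_after, rows, sep="\n")
```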
Basic Concepts
An index is a data structure that speeds up access to a table (so each index is tied to a table)
Can you create an index on two tables?
Components of an index
Types of indexes
Benefits:
  Speed up queries
Cost:
  Whenever data gets updated, the index also needs to be updated
  The index needs disk space as well
Search key: the attribute or set of attributes used to look up records in a file. For the query
  Select pid, quantity From order_line Where oid=2
an index on order_line.oid has oid as its search key.
A pointer points either to a data record or to another index entry (we will talk about this later)
Index files are typically much smaller than the original file because the other columns are not stored
Primary Index
Each table may have one primary index: its index entries are stored sorted on the search key value, and so are the rows in the table.
E.g., create an index on orders.oid. If this index and the orders table are both sorted on oid, this index is the primary index of orders.
  Table (oid, ...):    Index:
  (1, ...)             (1, ...)
  (2, ...)             (2, ...)
  (..., ...)           (..., ...)
How many primary indexes can a table have? Only one, since a table can be sorted in only one way.
Secondary Index
Secondary index: an index whose entries are not sorted in the same order as the rows in the table.
E.g., suppose we create an index on order_line.pid:
  Table (oid, pid, ...):    Index (pid, pointer):
  (1, 3, ...)               (1, ...)
  (2, 5, ...)               (2, ...)
  (3, 2, ...)               (3, ...)
  (..., ..., ...)           (..., ...)
  (1000, 1, ...)
How many secondary indexes can a table have? As many as you want. But a secondary index is less efficient than a primary index, since the rows are not stored in the order of the index (so the DBMS needs to move back and forth in the table if many rows are retrieved).
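The difference between the two kinds of index can be sketched with plain Python lists. This is a toy model, not how a DBMS stores indexes: the six rows are made up, a "pointer" is just a position in the list, and the helper name lookup is an assumption.

```python
from bisect import bisect_left, bisect_right

# Table rows (oid, pid), stored sorted on oid. The rows are made up.
table = [(1, 3), (2, 5), (3, 2), (4, 3), (5, 1), (6, 3)]

# Primary index on oid: entries are in the same order as the rows.
primary = [(oid, pos) for pos, (oid, _pid) in enumerate(table)]

# Secondary index on pid: entries are sorted on pid, not in row order.
secondary = sorted((pid, pos) for pos, (_oid, pid) in enumerate(table))

def lookup(index, lo, hi):
    """Return row positions for search-key values in [lo, hi] via binary search."""
    keys = [key for key, _pos in index]
    return [pos for _key, pos in index[bisect_left(keys, lo):bisect_right(keys, hi)]]

primary_hits = lookup(primary, 2, 4)      # neighbouring rows
secondary_hits = lookup(secondary, 3, 3)  # rows scattered over the table
print(primary_hits, secondary_hits)
```

With the primary index the matching rows sit next to each other (positions 1, 2, 3), while the secondary index sends the reader to positions 0, 3, and 5.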
B+-tree index
The most commonly used type of index
Q1: Select * from order_line where oid >= 1234 and oid <= 2000;
Q2: Select pid, quantity From order_line Where oid=2
We can create a B+-tree index on order_line.oid to evaluate the range condition in Q1 and the equality condition in Q2
A B+-tree index can answer both equality conditions and range conditions very efficiently
It has a tree structure, where the leaves point to the real data and are sorted on the indexed columns
Intermediate levels help locate the leaves satisfying range or equality conditions
All leaves of the tree are at the same level, which gives consistent access time
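The idea can be sketched with a heavily simplified, read-only structure: one intermediate level of separator keys above a row of sorted leaves. This is not a full B+-tree (no inserts, no splits, only two levels). It assumes unique keys and at least one key, and the names LEAF_SIZE, build, find_leaf, search_equal, and search_range are made up for the illustration.

```python
from bisect import bisect_left, bisect_right

LEAF_SIZE = 4  # illustrative number of keys per leaf

def build(keys):
    """Split the sorted keys into leaves; separators are the first key of each later leaf."""
    keys = sorted(keys)
    leaves = [keys[i:i + LEAF_SIZE] for i in range(0, len(keys), LEAF_SIZE)]
    separators = [leaf[0] for leaf in leaves[1:]]
    return separators, leaves

def find_leaf(separators, key):
    # The intermediate level tells us which leaf could hold the key.
    return bisect_right(separators, key)

def search_equal(tree, key):
    separators, leaves = tree
    leaf = leaves[find_leaf(separators, key)]
    i = bisect_left(leaf, key)
    return i < len(leaf) and leaf[i] == key

def search_range(tree, lo, hi):
    separators, leaves = tree
    result = []
    # Start at the leaf that could hold lo, then walk the leaves in order.
    for leaf in leaves[find_leaf(separators, lo):]:
        for key in leaf:
            if key > hi:
                return result
            if key >= lo:
                result.append(key)
    return result

tree = build([2, 5, 1234, 1500, 1999, 2000, 2500, 7, 900, 3000])
print(search_equal(tree, 2), search_range(tree, 1234, 2000))
```

Because the leaves are sorted, the range search only needs one descent to find the first leaf and then reads neighbouring leaves until it passes the upper bound.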
To create an index
  create index <index-name> on <relation-name> (<attribute-list>)
  Use commas to separate the attributes if there is more than one
To drop an index
  drop index <index-name>
There is no need to create an index on the primary key because it is created automatically when the primary key is defined
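A minimal sketch of these statements, run against SQLite through Python's sqlite3 module. The index name idx_pid_quantity is an assumption, and the way the automatically created primary key index is listed (pragma index_list) is specific to SQLite. Other DBMSs expose this through their own catalogs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table order_line (oid integer, pid integer, quantity integer, "
    "primary key (oid, pid))"
)

def index_names():
    # SQLite-specific: one row per index on the table; column 1 is the index name.
    return [row[1] for row in conn.execute("pragma index_list('order_line')")]

auto_created = index_names()   # index created automatically for the primary key

# Multi-column index: attributes separated by commas.
conn.execute("create index idx_pid_quantity on order_line (pid, quantity)")
after_create = index_names()

conn.execute("drop index idx_pid_quantity")
after_drop = index_names()

print(auto_created, after_create, after_drop, sep="\n")
```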
Indexes are not needed for toy data sets (no more than 100 rows)
Indexes are always needed for any serious application
Whether an index is used in a SQL query is still decided by the DBMS
An index incurs update overhead and needs extra space
Create a multi-column index if there are multiple conditions on the same table and at least one of them is an equality condition
The column order in a multi-column index starts with the columns that have equality conditions
E.g. select pid from order_line where pid = 2 and quantity > 1;
Create an index on order_line(pid, quantity)
An index on order_line(quantity, pid) is far less useful here, because the equality column no longer comes first
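As a rough check, the plan reported by SQLite (used here through Python's sqlite3 module as a stand-in DBMS) shows both conditions being evaluated on the (pid, quantity) index. The index name idx_pid_quantity is an assumption, and the plan text is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table order_line (oid integer, pid integer, quantity integer)")
# Equality column first, then the range column.
conn.execute("create index idx_pid_quantity on order_line (pid, quantity)")

steps = [
    row[-1]  # last column of EXPLAIN QUERY PLAN is the readable plan step
    for row in conn.execute(
        "explain query plan "
        "select pid from order_line where pid = 2 and quantity > 1"
    )
]
print(steps)
```

The reported step names the index and lists both the equality on pid and the range on quantity as index constraints.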
Create an index on columns that appear in a join condition (or on foreign key columns)
E.g. select * from product p, order_line L where p.pid = L.pid
Create an index on order_line(pid)
The index will be used to evaluate the join efficiently
Do you need to create an index on product.pid?
For a query that groups order_line rows by pid, create an index on order_line(pid)
The index will be used to evaluate the group by more efficiently
Outline
Choose storage formats for attributes
Indexes and their appropriate use
CPU
Cache
Memory
RAID
Motivation
Suppose your application needs a very high read/write rate (e.g., 100 customers visit your website per minute).
Buy a faster disk? Are there better solutions?
Your application also needs to run 24/7, but what if your disk fails? Is backup the solution?
RAID
A set of disk drives that appears to the user to be a single disk drive (you only see one disk in your OS)
Data blocks (pages) are arranged in stripes such that data are spread across multiple disks
There are different ways of striping, represented as different RAID levels
We assume data blocks (sectors) are numbered in order: 1, 2, 3, 4, ...
RAID-0
All disks are used for parallel read and write
Best I/O performance
No redundancy, so data may be lost if a disk fails
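Striping can be sketched as a round-robin assignment of block numbers to disks. The function name stripe and the choice of 8 blocks on 4 disks are assumptions for the illustration. Real controllers work on fixed-size stripe units, not on Python lists.

```python
def stripe(num_blocks, num_disks):
    """Assign blocks 1..num_blocks to disks round-robin (RAID-0 style)."""
    disks = [[] for _ in range(num_disks)]
    for block in range(1, num_blocks + 1):
        disks[(block - 1) % num_disks].append(block)
    return disks

layout = stripe(8, 4)
print(layout)  # consecutive blocks land on different disks, so they can be read in parallel
```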
RAID Levels
RAID Level 1: Mirrored disks
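Keep one identical copy of each disk, which improves reliability
If there is a 1% chance that a disk fails in a given month, what is the chance that both disks fail in the same month? (If the failures are independent: 0.01 x 0.01 = 0.0001, i.e. 0.01%.)
  Disk 1: 1 2 3 4
  Disk 2: 1 2 3 4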
RAID Level 0+1: when the number of disks >= 4
Two copies of the data, half of the disks for each copy (mirroring)
Within each copy, spread the data over all disks of that copy (striping)
Benefits of both RAID 0 (parallel read/write) and RAID 1 (reliability)
But the cost may be high (minimum of 4 disks, and half of the space goes to the redundant copy)
  Copy 1:  Disk 1: 1 3    Disk 2: 2 4
  Copy 2:  Disk 3: 1 3    Disk 4: 2 4
For a block of data, count the number of bits with value 1
If the number of 1 bits is odd, parity code = 1; otherwise, parity code = 0
  E.g., Data 1101: parity = 1
        Data 1100: parity = 0
So if one bit gets corrupted, the parity and the data do not match
  E.g., Correct data 1101: parity = 1
        Corrupted data 1100: parity is still 1, but the number of 1 bits is even
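The parity rule above fits in a few lines. This sketch represents a data block as a string of '0' and '1' characters, and the function names parity and matches are made up for the illustration.

```python
def parity(bits):
    """Return 1 if the number of 1 bits is odd, otherwise 0."""
    return bits.count("1") % 2

def matches(bits, stored_parity):
    """Check whether a data block is consistent with its stored parity code."""
    return parity(bits) == stored_parity

stored = parity("1101")              # parity code saved with the correct data
print(stored, parity("1100"))
print(matches("1101", stored))       # correct data agrees with its parity
print(matches("1100", stored))       # one corrupted bit is detected
```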
RAID Level 5: each parity bit is computed over one bit of data from each of the other disks
  [Figure: data blocks and their parity block]
Suppose only one disk goes wrong: we can recover the data on that disk using the data on the other disks and the parity. In the slide's example, the missing bit must be 0.
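Recovery can be sketched with XOR, which produces exactly the parity code defined earlier (1 when the number of 1 bits is odd). The three data blocks below are made up and the helper name xor_blocks is an assumption. This shows the principle only, not a controller implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bitwise XOR of equal-length blocks: each result bit is the parity of that bit position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Made-up blocks stored on three data disks.
data = [bytes([0b1101]), bytes([0b1010]), bytes([0b0110])]
parity_block = xor_blocks(data)      # stored on the fourth disk

# The second disk fails: rebuild its block from the surviving blocks and the parity.
recovered = xor_blocks([data[0], data[2], parity_block])
print(recovered == data[1])
```

This works because XOR-ing the surviving blocks with the parity cancels every block except the missing one.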
E.g., 4 disks, 12 data blocks, 4 parity blocks
Tip: first put the parity blocks on different disks (in a different row for each disk), then put the data blocks into the remaining slots
  Disk 1   Disk 2   Disk 3   Disk 4
  P1       1        2        3
  4        P2       5        6
  7        8        P3       9
  ?        ?        ?        ?
So if only one disk goes wrong, all data stored on that disk can be recovered
What is the size of the copy (parity) given n disks?
  Disk 1   Disk 2   Disk 3   Disk 4
  P1       1        2        3
  4        P2       5        6
  7        8        P3       9
  10       11       12       P4
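The placement tip can be turned into a short sketch: in row r the parity block goes on disk r, and the data blocks fill the remaining slots in order. The function name raid5_layout is made up, and real RAID 5 implementations may rotate the parity in a different direction.

```python
def raid5_layout(num_disks):
    """Return the rows (stripes) of the layout; row r keeps its parity block on disk r."""
    rows, next_block = [], 1
    for r in range(num_disks):
        row = []
        for d in range(num_disks):
            if d == r:
                row.append(f"P{r + 1}")       # parity block for this row
            else:
                row.append(str(next_block))   # next data block
                next_block += 1
        rows.append(row)
    return rows

layout = raid5_layout(4)
for row in layout:
    print("  ".join(f"{slot:<3}" for slot in row))
```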
Consider Levels 0, 0+1, and 5 (Level 1 can be seen as Level 0+1 for 2 disks)
Read performance: about the same for Levels 0, 0+1, and 5 (all disks can read at the same time)
Write performance: Levels 0 and 0+1 are the best; Level 5 is worse because the parity block also needs to be read and updated whenever data get updated
Fault tolerance: Level 0 is not fault tolerant; all the others are
Space used for fault tolerance:
  Level 0: no space, but NOT fault tolerant
  Level 0+1: half of the disks (bad)
  Level 5: 1 disk, no matter the total number of disks (good)
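The space comparison can be written out as a small calculation. This is only arithmetic on the figures above: the function name usable_disks is made up, the levels are passed as strings, and Level 0+1 is assumed to use an even number of disks.

```python
def usable_disks(level, n):
    """Number of disks' worth of space left for data with n equal-sized disks."""
    if level == "0":
        return n           # nothing is set aside, but there is no fault tolerance
    if level == "0+1":
        return n // 2      # half of the disks hold the mirror copy (n assumed even)
    if level == "5":
        return n - 1       # one disk's worth of parity, whatever n is
    raise ValueError(f"unknown RAID level: {level}")

for n in (4, 8):
    print(n, [usable_disks(level, n) for level in ("0", "0+1", "5")])
```

As n grows, the share of space Level 5 spends on parity shrinks (1/n), while Level 0+1 always gives up half.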
Denormalization
Transforming normalized relations into unnormalized physical record specifications
Benefits:
Can improve performance (speed) by reducing number of table lookups (i.e. reduce number of necessary join queries)
Figure 6-3 A possible denormalization situation: two entities with one-to-one relationship
Figure 6-5 A possible denormalization situation: reference data
Data duplication