
Chapter 6: Physical Database Design and Performance

Outline

The physical database design process
  Input and output
Indexes and their appropriate use
  Know what columns to index
Know different RAID levels and how to choose RAID levels

The Physical Design Stage of SDLC (Figures 2-4, 2-5 revisited)


SDLC phases: Project Identification and Selection, Project Initiation and Planning, Analysis, Logical Design, Physical Design, Implementation, Maintenance

Physical Design phase
  Purpose: develop technology specs
  Deliverable: program/data structures, technology purchases, organization redesigns
  Database activity: physical database design

Physical Database Design

Purpose: translate the logical description of data into the technical specifications for storing and retrieving data
Goal: create a design for storing data that will provide adequate performance and ensure database integrity, security, and recoverability

Input
  Logical design (normalized relations)
  Statistics about data
    Number of rows
    Number of distinct values for each attribute, ranges
  Information about usage (called workload)
    Ideally, the set of queries (including insert, update, delete, select) and their frequencies
  Requirement for performance
    E.g., average response time, number of queries per minute

Output
  Attribute data types
  Indexes
  Storage (RAID level)
  Denormalization (merge some tables, such as look-up tables)
  Others (not covered)
    Partitioning database into smaller pieces

Physical Design Process

Inputs
  Normalized relations
  Volume estimates
  Attribute definitions
  Response time expectations
  Data security needs
  Backup/recovery needs
  Integrity expectations
  DBMS technology used

Leads to

Decisions
  Attribute data types
  Physical record descriptions (doesn't always match logical design)
  File organizations
  Indexes and database architectures
  Query optimization

Figure 6-1 Composite usage map (Pine Valley Furniture Company)
[Figure: usage map annotated with data volumes and access frequencies (per hour)]

Usage analysis:
  140 purchased parts accessed per hour
  80 quotations accessed from these 140 purchased part accesses
  70 suppliers accessed from these 80 quotation accesses

Usage analysis:
  75 suppliers accessed per hour
  40 quotations accessed from these 75 supplier accesses
  40 purchased parts accessed from these 40 quotation accesses

Designing Fields

Field: smallest unit of data in database
Field design
  Choosing data type
  Coding, compression, encryption
  Controlling data integrity

Figure 6-2 Example code look-up table (Pine Valley Furniture Company)

Code saves space, but costs an additional lookup to obtain actual value
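
The trade-off behind a code look-up table can be sketched in a few lines of Python. This is only an illustration: the codes, the values, and the names `FINISH_LOOKUP` and `finish_of` are invented here, since the contents of Figure 6-2 are not reproduced in the text.

```python
# Hypothetical look-up table: short code -> full value (values invented for illustration).
FINISH_LOOKUP = {"A": "Birch", "B": "Maple", "C": "Oak"}

# Product rows store only the one-character code instead of the full text.
products = [(1, "A"), (2, "C"), (3, "A")]

def finish_of(product_row):
    """Resolve the stored code to its actual value: this is the extra lookup."""
    _, code = product_row
    return FINISH_LOOKUP[code]

# Characters stored with codes versus characters needed for the full values.
stored_chars = sum(len(code) for _, code in products)
full_chars = sum(len(finish_of(row)) for row in products)
print(stored_chars, full_chars)
```

Each row is smaller, but every read of the actual value goes through `finish_of`, which is the additional lookup the slide mentions.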



Motivation

Suppose the following query is running very slowly. What can you do to make it faster without buying new hardware?

Return product ID and quantity of product in the order with oid = 2:
Select pid, quantity
From order_line
Where oid = 2

If there are 10000 rows in order_line, you have to go through all of them to check whether oid = 2.

We can create an index on order_line.oid
The index will be used to find the order_line rows with oid = 2
Only those order_line rows will be accessed
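
The effect can be observed with a small experiment. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in for the course DBMS, which is an assumption, and the 10000 rows are made up. It asks the database how it plans to run the query before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_line (oid INTEGER, pid INTEGER, quantity INTEGER)")
# 10000 made-up rows: 2000 orders with 5 lines each.
conn.executemany(
    "INSERT INTO order_line VALUES (?, ?, ?)",
    [(i // 5 + 1, i % 5 + 1, 1) for i in range(10000)],
)

QUERY = "SELECT pid, quantity FROM order_line WHERE oid = 2"

def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

plan_before = plan(QUERY)   # no index yet, so the whole table is scanned
conn.execute("CREATE INDEX idx_orderline_oid ON order_line(oid)")
plan_after = plan(QUERY)    # the plan now names idx_orderline_oid

print(plan_before)
print(plan_after)
result = conn.execute(QUERY).fetchall()
```

The exact wording of the plan text varies between SQLite versions, so only the presence of a scan before and of the index name after should be relied on.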

Basic Concepts

Index is a data structure that speeds up access to a table (so each index is tied to a table)
  Can you create an index on two tables?
Components of an index
  Search key
  Index entries
  Index file
Types of indexes
  Primary & secondary

Benefits and Cost of Indexes

Benefits:
  Speed up queries

Cost:
  Whenever data gets updated, the index also needs to be updated
  The index needs disk space as well

Search key: attribute or set of attributes used to look up records in a file.

Select pid, quantity
From order_line
Where oid = 2

We can create an index on order_line.oid
  Search key of this index?

An index file consists of index entries of the form (search-key, pointer)

The pointer points either to data records or to another index entry (we will talk about this later)

Index files are typically much smaller than the original file because the other columns are not stored
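
As a rough illustration of the (search-key, pointer) idea, the sketch below models the data file as a Python list and uses list positions as pointers. The rows and the names `order_line`, `index`, and `lookup` are invented for the example; a real DBMS stores pointers to disk blocks or row IDs.

```python
import bisect

# Data file: a list of rows (oid, pid, quantity); a row's position is its "address".
order_line = [(7, 3, 1), (2, 5, 4), (9, 1, 2), (2, 8, 1), (5, 2, 6)]

# Index file: (search-key, pointer) entries sorted on the search key oid.
index = sorted((row[0], pos) for pos, row in enumerate(order_line))
keys = [key for key, _ in index]

def lookup(oid):
    """Binary-search the index, then follow pointers to the matching rows."""
    lo = bisect.bisect_left(keys, oid)
    hi = bisect.bisect_right(keys, oid)
    return [order_line[pointer] for _, pointer in index[lo:hi]]

print(lookup(2))
```

Note that the index holds only the key and a pointer for each row, which is why it is smaller than the table it describes.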


Primary Index
Each table may have one primary index: its index entries are stored sorted on the search key value, and so are the rows in the table.
E.g., create an index on orders.oid. If this index and the orders table are both sorted on oid, this index is the primary index of orders.

Table (oid, ...):    Index (oid, pointer):
(1, ...)             (1, pointer)
(2, ...)             (2, pointer)
(...)                (...)

How many primary indexes can a table have? Only one, since a table can be sorted in only one way.

Secondary Index

Secondary index: an index whose index entries are not sorted in the same order as the rows in the table
E.g., suppose we create an index on order_line.pid

Table (oid, pid, ...):    Index (pid, pointer):
(1, 3, ...)               (1, pointer)
(2, 5, ...)               (2, pointer)
(3, 2, ...)               (3, pointer)
(...)                     (...)
(1000, 1, ...)

How many secondary indexes can a table have? As many as you want. But a secondary index is less efficient than a primary index because the data are not sorted in the same order as the index (so the disk has to move back and forth if many rows are retrieved).

B+-tree index

The most commonly used type of index
Q1: Select * from order_line where oid >= 1234 and oid <= 2000;
Q2: Select pid, quantity From order_line Where oid = 2
We can create a B+-tree index on order_line.oid to evaluate the range condition in Q1 and the equality condition in Q2

A B+-tree index can answer both equality conditions and range conditions very efficiently
It has a tree structure, where the leaves point to the real data and are sorted on the indexed columns
Intermediate levels help locate the leaves satisfying range or equality conditions

Figure 6-7b B-tree index

Leaves of the tree are all at the same level, giving consistent access time
Uses a tree search
Average time to find the desired record is proportional to the depth of the tree
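
To get a feel for how shallow such a tree is, here is a back-of-the-envelope calculation. It assumes every node is full and holds the same number of entries; the fanout of 100 and the function name `btree_depth` are illustrative assumptions, not figures from the slides.

```python
def btree_depth(num_keys, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) covers num_keys."""
    depth, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        depth += 1
    return depth

print(btree_depth(10_000, 100))     # 2 levels
print(btree_depth(1_000_000, 100))  # 3 levels
```

Under these assumptions, growing the table by a factor of 100 adds only one level, which is why lookup time grows so slowly with table size.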

Index Definition in SQL

Create an index
  create index <index-name> on <relation-name> (<attribute-list>)
  Use commas to separate attributes if there is more than one

To drop an index
  drop index <index-name>

Oracle (and DB2, SQL Server) use B+-tree indexes

E.g., Create index idx_orderline_pid on order_line(pid);

There is no need to create an index on the primary key because one is created automatically when the primary key is defined

Check whether an index exists (Oracle)
  select * from user_indexes;
  Many indexes with odd names (starting with SYS) are system-created indexes for primary keys

Check details of an index
  select * from USER_IND_COLUMNS;

Drop an index: drop index <index-name>;
  Drop index idx_orderline_pid;
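
The same create, inspect, and drop cycle can be tried without an Oracle installation. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in, which is an assumption: SQLite keeps its catalog in `sqlite_master` rather than `user_indexes`, and the composite primary key on `order_line` is chosen here only so that a system-created index appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_line (oid INTEGER, pid INTEGER, quantity INTEGER, "
    "PRIMARY KEY (oid, pid))"
)

def index_names():
    """List the indexes on order_line from SQLite's catalog."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'order_line' ORDER BY name"
    )
    return [name for (name,) in rows]

names_initial = index_names()   # only the system-created primary key index
conn.execute("CREATE INDEX idx_orderline_pid ON order_line(pid)")
names_created = index_names()
conn.execute("DROP INDEX idx_orderline_pid")
names_dropped = index_names()

print(names_initial)
print(names_created)
```

The automatically created entry has a system-generated name, much like the SYS-prefixed names seen in Oracle.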


Rules of Thumb of Index Creation

Indexes are not needed for toy data sets (no more than 100 rows)
Indexes are needed for virtually any serious application
The DBMS still decides whether an index is used for a given SQL query
An index incurs update overhead and needs extra space

Create an index on columns that appear in the where clause
  The index will be used to evaluate the where condition efficiently

Create a multi-column index if there are multiple conditions on the same table and at least one of them is an equality condition
  The column order in a multi-column index starts with the columns that have equality conditions

E.g., select pid from order_line where pid = 2 and quantity > 1;
  Create an index on order_line(pid, quantity)
  An index on order_line(quantity, pid) is much less useful, because the entries with pid = 2 are not stored next to each other in it
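
Why the column order matters can be seen by sorting a handful of made-up (pid, quantity) pairs both ways and checking where the entries for pid = 2 and quantity > 1 end up. The data and the helper name `matching_positions` are invented for this sketch.

```python
# Made-up (pid, quantity) values from order_line rows.
rows = [(1, 5), (2, 1), (2, 3), (3, 2), (2, 7), (1, 1), (3, 9)]

def matching_positions(entries, pid_pos, qty_pos):
    """Positions in a sorted index whose entry has pid = 2 and quantity > 1."""
    return [i for i, e in enumerate(entries) if e[pid_pos] == 2 and e[qty_pos] > 1]

by_pid_qty = sorted(rows)                      # index on (pid, quantity)
by_qty_pid = sorted((q, p) for p, q in rows)   # index on (quantity, pid)

hits_a = matching_positions(by_pid_qty, 0, 1)
hits_b = matching_positions(by_qty_pid, 1, 0)
print(hits_a, hits_b)
```

In the (pid, quantity) ordering the matches sit in adjacent positions, so one short range scan finds them. In the (quantity, pid) ordering they are separated by entries for other products.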

Create an index on columns that appear in a join condition (or foreign key columns)

E.g., select * from product p, order_line L where p.pid = L.pid
  Create an index on order_line(pid)
  The index will be used to evaluate the join efficiently
  Do you need to create an index on product.pid?

Create an index on group by and order by columns

E.g., select sum(quantity) from order_line group by pid;
  Create an index on order_line(pid)
  The index will be used to evaluate the group by more efficiently

Create an index on all the columns that appear in the SQL statement if the table has many columns that do not appear in it

E.g., select pid from order_line;
  Create an index on order_line(pid)
  The index will be used to access only the columns that appear in the SQL statement
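
This index-only access can be observed in SQLite, used here through Python's `sqlite3` module as a stand-in for the course DBMS (an assumption). The extra `note` column and the row values are made up to give the table columns that the query does not need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_line (oid INTEGER, pid INTEGER, quantity INTEGER, note TEXT)"
)
conn.executemany(
    "INSERT INTO order_line VALUES (?, ?, ?, ?)",
    [(i, i % 50, 1, "filler text") for i in range(1000)],
)
conn.execute("CREATE INDEX idx_orderline_pid ON order_line(pid)")

# Ask SQLite how it would answer a query that needs only the indexed column.
rows = conn.execute("EXPLAIN QUERY PLAN SELECT pid FROM order_line").fetchall()
plan_text = " | ".join(row[3] for row in rows)
print(plan_text)
```

SQLite reports a covering index in the plan, meaning the query is answered from the index alone and the wider table rows are never read.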


Outline

The physical database design process
  Input and output
  Choose storage formats for attributes
Indexes and their appropriate use
  Know what columns to index
Know different RAID levels and how to choose RAID levels

Types of Storage Media

Storage hierarchy, from top to bottom: CPU cache, memory, hard disks
  Going from top to bottom, price goes down, capacity goes up, and speed goes down
  Only the hard disk is non-volatile (meaning its contents survive power-off)
  The database is stored on hard disks

RAID

Motivation

Suppose your application needs a very high read/write rate (e.g., 100 customers visit your website per minute)
  Buy a faster disk? Are there better solutions?
Your application also needs to run 24/7, but what if your disk fails?
  Is backup the solution?

Redundant Arrays of Independent Disks

RAID solution: buy more disks rather than faster disks
  Speed up reads and writes by reading/writing multiple disks at the same time
  Improve reliability by keeping copies of the data at run time (so if one disk fails, the program can still run)
  Cost factor: multiple cheap disks are often cheaper than a single super-fast disk

RAID

A set of disk drives that appears to the user to be a single disk drive (you only see one disk in your OS)
Data blocks (pages) are arranged in stripes such that data are spread onto multiple disks
There are different ways of striping, represented as different RAID levels
We assume data blocks (sectors) are in the order 1, 2, 3, 4, ...

RAID-0

All disks are used for parallel read and write
Best I/O performance
No redundant copy, so data are lost if a disk fails

RAID Levels
RAID Level 1: Mirrored disks
  Keep one identical copy of each disk, which improves reliability
  If there is a 1% chance that a disk fails each month, what is the chance that both disks fail in the same month?

  Disk 1: 1 2 3 4
  Disk 2 (mirror): 1 2 3 4
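
The question about both disks failing can be worked out directly, under the added assumption that the two disks fail independently of each other (the slide does not state this).

```python
p_fail = 0.01  # chance that one disk fails in a given month (figure from the text)

# Assuming the two disks fail independently, both fail in the same month with p * p.
p_both_fail = p_fail * p_fail

print(f"{p_both_fail:.4%}")
```

That is 0.01 x 0.01 = 0.0001, or one chance in 10,000 per month, compared with one in 100 for a single unmirrored disk.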

RAID Level 0+1: when number of disks >= 4
  Two copies of the data, half of the disks for each copy (mirroring)
  Within each copy, spread the data over all of that copy's disks (striping)
  Benefits of both RAID 0 (parallel read/write) and RAID 1 (reliability)
  But the cost may be high (minimum of 4 disks, and half of the space goes to the redundant copy)

  Copy 1: Disk 1 holds blocks 1, 3; Disk 2 holds blocks 2, 4
  Copy 2: Disk 3 holds blocks 1, 3; Disk 4 holds blocks 2, 4

Parity Code for Error Detection

For a block of data, count the number of bits with value 1
  If the number of 1 bits is odd, parity code = 1
  Otherwise, parity code = 0

E.g., Data 1101: parity = 1
      Data 1100: parity = 0
So if one bit gets corrupted, the parity and the data no longer match
E.g., Correct data 1101: parity = 1
      Corrupted data 1100: parity = 1, but the number of 1 bits is even
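
The parity rule is small enough to write out as code. This is a minimal sketch that represents a data block as a string of 0 and 1 characters; the function names `parity` and `is_consistent` are chosen here for illustration.

```python
def parity(bits):
    """Parity code of a bit string: 1 if the number of 1 bits is odd, else 0."""
    return bits.count("1") % 2

def is_consistent(bits, stored_parity):
    """True when the data still matches the parity code stored with it."""
    return parity(bits) == stored_parity

print(parity("1101"), parity("1100"))
print(is_consistent("1100", 1))
```

A single parity bit detects any one flipped bit, but two flipped bits cancel out and go unnoticed.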

RAID Levels (Cont.)

RAID Level 5: Compute one parity bit from the corresponding data bits on the other disks
[Figure: three data disks and one parity disk]

RAID Level 5: Each disk can detect whether it is functioning correctly
[Figure: three data disks and one parity disk]

What should the data on disk 3 be?

Suppose only one disk goes wrong: we can recover the data on that disk using the data on the other disks and the parity. The data must be 0 in this case.
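
Recovery from a single failed disk follows from the parity rule: the missing bit is whatever value makes the parity come out right, which is the XOR of everything that survives. The bit values below are made up, since the figure's values are not reproduced in the text, and `parity_of` is a name chosen for this sketch.

```python
from functools import reduce
from operator import xor

def parity_of(values):
    """XOR of the values; for single bits this is 1 when the count of 1s is odd."""
    return reduce(xor, values, 0)

# Made-up bits on three data disks, plus the parity disk computed from them.
data = [1, 0, 1]
parity = parity_of(data)

# Disk 2 (index 1) fails: rebuild its bit from the surviving disks and the parity.
survivors = [bit for i, bit in enumerate(data) if i != 1]
recovered = parity_of(survivors + [parity])
print(recovered)
```

This only works when the system knows which disk failed and no more than one disk is lost at a time.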

RAID Level 5: striping data and parity blocks

E.g., 4 disks, 12 data blocks, 4 parity blocks
Tip: first put the parity blocks on different disks (in a different row for each disk), then put the data blocks into the remaining slots

Disk 1   Disk 2   Disk 3   Disk 4
P1       1        2        3
4        P2       5        6
7        8        P3       9
?        ?        ?        ?

So if only one disk goes wrong, all the data stored on that disk can be recovered
How much space does the parity take, given n disks?

Disk 1   Disk 2   Disk 3   Disk 4
P1       1        2        3
4        P2       5        6
7        8        P3       9
10       11       12       P4
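
The placement tip can be turned into a short routine. This is a sketch of the layout shown on the slide (parity for stripe i on disk i), not of how any particular RAID controller rotates parity; the function name `raid5_layout` is made up.

```python
def raid5_layout(num_disks):
    """Return stripes (rows) for num_disks disks; stripe i keeps parity Pi on disk i."""
    stripes, next_block = [], 1
    for stripe in range(num_disks):
        row = []
        for disk in range(num_disks):
            if disk == stripe:
                row.append(f"P{stripe + 1}")   # parity slot for this stripe
            else:
                row.append(str(next_block))    # next data block in order
                next_block += 1
        stripes.append(row)
    return stripes

layout = raid5_layout(4)
for row in layout:
    print(" ".join(f"{cell:>3}" for cell in row))

# Share of all slots that hold parity rather than data.
cells = [cell for row in layout for cell in row]
parity_fraction = sum(cell.startswith("P") for cell in cells) / len(cells)
print(parity_fraction)
```

With 4 disks one slot in four holds parity, and in general the share is 1/n, which amounts to one disk's worth of space.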

Choice of RAID Level

Consider levels 0, 0+1, and 5 (level 1 can be seen as level 0+1 for 2 disks)
Read performance: about the same for levels 0, 0+1, and 5 (all disks can read at the same time)
Write performance: levels 0 and 0+1 are the best; level 5 is worse because the parity block also needs to be read and updated whenever data are updated
Fault tolerance: level 0 is not fault tolerant; all the others are
Space used for fault tolerance:
  Level 0: no space, but NOT fault tolerant
  Level 0+1: half of the disks (bad)
  Level 5: 1 disk, no matter the total number of disks (good)
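
The space comparison can be tabulated for any number of disks. The sketch below simply encodes the three rules stated above, assuming all disks have the same size; the function name `usable_disks` is made up.

```python
def usable_disks(level, num_disks):
    """Number of disks' worth of space left for data under each RAID level."""
    if level == "0":
        return num_disks          # no redundancy
    if level == "0+1":
        return num_disks // 2     # half the disks hold the mirror copy
    if level == "5":
        return num_disks - 1      # one disk's worth of parity
    raise ValueError(f"unsupported RAID level: {level}")

for n in (4, 8):
    print(n, [usable_disks(level, n) for level in ("0", "0+1", "5")])
```

The gap between level 0+1 and level 5 widens as disks are added, which is the space argument for level 5 when write performance is less critical.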

Denormalization
Transforming normalized relations into unnormalized physical record specifications

Benefits:
  Can improve performance (speed) by reducing the number of table lookups (i.e., reducing the number of joins needed)

Costs (due to data duplication):
  Wasted storage space
  Data integrity/consistency threats

Common denormalization opportunities
  One-to-one relationship (Fig. 6-3)
  Many-to-many relationship with attributes (Fig. 6-4)
  Reference data (1:N relationship where the 1-side has data not used in any other relationship) (Fig. 6-5)
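
The benefit and the cost can both be seen in a toy example, loosely in the spirit of the reference-data case. It uses SQLite through Python's `sqlite3` module as a stand-in DBMS; the `description` column, the sample rows, and the table name `order_line_denorm` are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (pid INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE order_line (oid INTEGER, pid INTEGER, quantity INTEGER);
    INSERT INTO product VALUES (1, 'Desk'), (2, 'Chair');
    INSERT INTO order_line VALUES (10, 1, 2), (10, 2, 4), (11, 2, 1);

    -- Denormalized copy: description is repeated on every order line.
    CREATE TABLE order_line_denorm AS
        SELECT L.oid, L.pid, L.quantity, p.description
        FROM order_line L JOIN product p ON p.pid = L.pid;
""")

# Normalized design: the description requires a join.
joined = conn.execute(
    "SELECT L.oid, p.description FROM order_line L "
    "JOIN product p ON p.pid = L.pid ORDER BY L.oid, L.pid"
).fetchall()

# Denormalized design: one table access, no join.
merged = conn.execute(
    "SELECT oid, description FROM order_line_denorm ORDER BY oid, pid"
).fetchall()

# The cost: the same description is now stored more than once.
chair_copies = conn.execute(
    "SELECT COUNT(*) FROM order_line_denorm WHERE description = 'Chair'"
).fetchone()[0]
print(joined == merged, chair_copies)
```

Both queries return the same rows, but the denormalized table stores the description once per order line, so a change to it has to be applied to every copy.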

Figure 6-3 A possible denormalization situation: two entities with one-to-one relationship

Figure 6-4 A possible denormalization situation: a many-to-many relationship with nonkey attributes

Extra table access required

Null description possible

Figure 6-5 A possible denormalization situation: reference data

Extra table access required

Data duplication
