
Chapter 6: I/O

Who cares and what to consider
Device characteristics and types
I/O system architecture
• buses, I/O processors
High-performance disk architectures
I/O Performance

Why I/O?

Amdahl's law
• speed up only the CPU, and I/O becomes the bottleneck
• e.g.,
• suppose I/O takes 10% of the time
• speed up the CPU 10 times
• the system only speeds up ~5 times
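
A quick check of the Amdahl's-law arithmetic above, as a minimal sketch (the 10% / 10x figures are from the slide):

```python
# Amdahl's law: overall speedup when only part of the work is sped up.
def amdahl_speedup(accelerated_fraction, factor):
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# CPU work is 90% of the time, I/O the other 10%; speed up the CPU 10x.
print(amdahl_speedup(0.9, 10))  # ~5.26: the system speeds up only ~5x
```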

Throughput vs latency

"There is an old network saying: bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed - you can't bribe God." - David Clark

throughput
• bandwidth
• I/Os per second

latency
• response time

Throughput vs latency

who cares about latency
• why don't you just context switch?
• fallacy
• requires more memory
• requires more processes (jobs)
• human productivity increases super-linearly as response time decreases
I/O Overlap

I/O overlaps with computation in complicated ways

[Figure: USER/OS/I/O timeline - job 1 issues an I/O request and the OS switches to job 2, then job 3; an I/O interrupt marks job 1's I/O as done and job 1 resumes]

I/O Performance

Time_job = Time_CPU + Time_I/O - Time_overlap
• e.g., 10 = 10 + 4 - 4

speed up the CPU by 2x; what is Time_job?
• Time_job = 5 + 4 - 4 = 5 (best)
• Time_job = 5 + 4 - 0 = 9 (worst)
• Time_job = 5 + 4 - 2 = 7 (average?)
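
A minimal sketch of the bounds above (numbers taken from the slide):

```python
# Job time with partial CPU/I-O overlap: t_job = t_cpu + t_io - t_overlap.
def job_time(t_cpu, t_io, t_overlap):
    return t_cpu + t_io - t_overlap

t_cpu, t_io = 10 / 2, 4                 # CPU sped up 2x; I/O unchanged
print(job_time(t_cpu, t_io, t_io))      # 5 (best: I/O fully overlapped)
print(job_time(t_cpu, t_io, 0))         # 9 (worst: no overlap)
print(job_time(t_cpu, t_io, 2))         # 7 (average?)
# Even in the best case the job speeds up 2x; in the worst, only ~1.1x.
```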

I/O Characteristics

supercomputers
• data transfer rate important
• many MBs per second for large files
Transaction processing
• I/O rate important
• "random" accesses
• disk I/Os per second

I/O Characteristics

Time-sharing filesystems
• small files
• sequential accesses
• many creates/deletes
Device Characteristics

behavior
• input - read once
• output - write once
• storage - read many times; usually written as well
partner
• human
• machine
data rate
• peak transfer rate

Device Characteristics

Device             I or O?   Partner   Data Rate (KB/s)
mouse              I         human     0.01
graphics display   O         human     60,000
modem              I/O       machine   2-8
LAN                I/O       machine   500-6000
tape               storage   machine   2000
disk               storage   machine   2000-10,000

Magnetic Disks

[Figure: disk assembly - heads on an arm, platters on a spindle; a cylinder spans the platters (the same track on every surface); each track is divided into sectors separated by inter-sector gaps]

Disk Parameters

spindles: 1-4 (most 1)
platters per spindle: 1-20
rpm: 3000-6000 RPM (most 3600)
platter diameter: 1.3"-8"
• trend towards smaller disks
• higher RPM
• mass production
tracks per surface: 500-2500
Disk Parameters

sectors per track: 32 typical
• —sector #—gap—data+ECC—
• fixed-length sectors (except IBM)
• typically fixed sectors per track
• recently constant bit density

Disk Operations

seek: move head to track
• avg seek time = (Σ_{i=1}^{n} seek(i)) / n
• n is the # of tracks; seek(i) is the time to seek to the ith track
rotational latency: wait for sector
• avg rotational latency = 0.5 rev / (3600 rev/min) = 8.3 ms
transfer rate
• typically 1-4 MB per second

Disk Operations

overhead
• controller delay
• queuing delay

Disk Performance

avg disk access = avg seek time + avg rot. delay + transfer + ovhd
e.g.,
• 3600 rpm; 2 MB/s transfer
• avg seek time: 9 ms
• controller overhead: 1 ms
• read a 512-byte sector
• 9 ms + 0.5/3600 min + 0.5 KB / 2 MB/s + 1 ms
• = 18.6 ms
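
The same average-access arithmetic as a small sketch (numbers from the example above):

```python
# avg disk access = avg seek + avg rotational delay + transfer + overhead
def disk_access_ms(seek_ms, rpm, sector_kb, xfer_mb_s, ovhd_ms):
    rot_ms  = 0.5 / rpm * 60 * 1000                  # half a revolution, in ms
    xfer_ms = sector_kb / (xfer_mb_s * 1000) * 1000  # KB over MB/s, in ms
    return seek_ms + rot_ms + xfer_ms + ovhd_ms

print(disk_access_ms(9, 3600, 0.5, 2, 1))            # ~18.6 ms
```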
Alternatives to Disks

DRAMs
• SSD - solid state disk
• standard disk interface
• DRAM and battery backup
• ES - expanded storage
• software-controlled cache
• large (4K) blocks
+ no seek time
+ fast transfer rate
– cost

Alternatives to Disks

FLASH memory
+ no seek time
+ fast transfer
+ non-volatile
– bulk erase before write
– slow writes
– "wears" out over time

Optical Disks

read-only
• CD-ROM
• cheap and reliable
• slow
write-once
• not-so-cheap
• slow
write-many
• expensive, slow

Graphics Display - CRT

[Figure: electron gun firing through X + Y deflectors onto a phosphor-coated screen]

screen has many scan lines, each of which has many pixels
phosphor acts as a capacitor - refresh 30-60 times/second
Graphics Displays - Frame Buffer

[Figure: CPU and memory feed the frame buffer at 0.2 MB/s; the frame buffer drives the CRT at 30 MB/s]

Graphics Displays - Frame Buffer

frame buffer stores the bit map
• one entry per pixel
• black - 1 bit per pixel
• gray-scale - 4-8 bits per pixel
• color (RGB) - 8 bits per color
• typical size 1560 x 1280 pixels
• black and white: 250 KB
• color (RGB): 5.7 MB
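
Checking the frame-buffer sizes above, as a minimal sketch of the arithmetic:

```python
# Frame-buffer storage for a 1560 x 1280 display.
pixels    = 1560 * 1280
bw_bytes  = pixels // 8          # black and white: 1 bit per pixel
rgb_bytes = pixels * 3           # color: 8 bits per color x 3 = 24 bits/pixel

print(bw_bytes / 1000, "KB")     # ~250 KB
print(rgb_bytes / 2**20, "MB")   # ~5.7 MB
```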

Reducing cost of Frame Buffer

key idea: only a small number of colors are used in one image
color map: frame buffer stores a color map index
• color map translates the index to a full 24-bit color

[Figure: the frame buffer holds an 8-bit index per pixel (e.g., pixel (X0, Y0) holds 17); the 256 x 24 color map translates index 17 to the 24-bit RGB value 120 014 074 on the way to the CRT]

• 1560 x 1280 with a 256-entry color map - factor of 3 reduction

Frame Buffer Operations

logically output only
• but read as well
BIT BLTs: bit block transfers
• read-modify-write operations
• e.g., read-xor-write
• used for cursors etc.
open question
• OS only?
• or direct user access? protection?
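
A minimal sketch of the color-map translation (the palette entry is the one shown in the figure; the tiny image is made up for illustration):

```python
# Indexed color: the frame buffer stores an 8-bit index per pixel; the
# 256-entry color map expands it to full 24-bit RGB on the way to the CRT.
color_map = [(0, 0, 0)] * 256            # 256 x 24-bit (R, G, B) entries
color_map[17] = (120, 14, 74)            # entry from the figure

frame_buffer = [[17, 17], [0, 17]]       # tiny 2x2 image of 8-bit indices
rgb = [[color_map[i] for i in row] for row in frame_buffer]
# 8 bits/pixel stored instead of 24 -> the factor-of-3 reduction
```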
Frame Buffer Implementation

1560 x 1280 RGB display
• bandwidth required = 1560 x 1280 x 24 x 30 = 171 MB/s
how can we implement this?
• video DRAMs
• dual-ported DRAM
• regular random-access port
• serial video port
• use 24 in parallel for RGB
what about bandwidth? interleave video DRAMs

Other Issues in Displays

double buffering
• duplicate frame buffer
• to prevent displaying incomplete updates
• may be necessary for animation
z-buffer
• for displaying 3-D images
• assign a z-dimension to each pixel
• store the z-dimension in the frame buffer
• BIT BLTs compare z-dimensions

Networks

Terminal networks
• machine-terminal
• star - point-to-point
• 0.3-19 Kbits/s, RS232 protocol
LANs
• machine-machine
• bus, ring, star
• 0.1-100 Mbits/s, < 10 km
• Ethernet

Networks

Long-haul networks
• machine-machine
• irregular structure - point-to-point
• 50-2000 Kbits/s, > 10 km
• Internet
LAN

E.g., Ethernet
• one-write bus with collisions and exponential backoff
• within a building
• 10 Mb
Now Ethernet is
• point-to-point to clients (switched network)
• with hubs
• client s/w unchanged
• 100 Mb

LAN

ATM - Asynchronous Transfer Mode
Phone companies use it for long-haul networks (packet-switched)
not a viable LAN yet

WAN

E.g., ARPANET, Internet
arranged as a DAG
backbones now 1 Gb/s; 100 Gb/s in the future
TCP/IP - protocol stack
• Transmission Control Protocol, Internet Protocol
Key issues:
• top-to-bottom systems issues
• getting the net into homes
• cable modem, ISDN, ??

I/O System Architecture

hierarchical data paths
• bandwidth divides going down the hierarchy
• often buses at each level
I/O processing
• program controlled
• DMA
• dedicated I/O processors
I/O System Architecture

[Figure: CPU with cache on a CPU-Memory bus shared with memory, an IOP, and a frame buffer driving a CRT; below the IOP, an I/O bus connects disk controllers and a network interface]

Buses

Option                          High performance   Low cost
address/data lines separate?    yes                no
data lines                      wider              narrower
transfer size                   multiple words     single word
bus masters                     multiple           one
split transactions              yes                no
clocking                        synchronous        asynchronous

Buses

CPU-Memory buses
• want speed
• usually custom designs (fast - several GB/s)
• e.g., SGI Challenge, Sun SD, HP Summit
I/O buses
• compatibility is important
• usually standard designs - PCI (Express), SCSI (slower - <= GB/s)

CPU interface

physical connection
• direct to cache
+ no coherence problems
– pollutes cache
– CPU and I/O arbitrate for the cache
• CPU-memory bus
+ DMA
– may not be standard
CPU interface

• I/O bus
+ industry standard
– slower than the memory bus
– indirection through the I/O processor

CPU Interface

[Figure: the three attachment points - I/O direct to the cache; I/O on the CPU-Memory bus alongside memory; or I/O on a separate I/O bus behind an IOP]

Bus Arbitration

centralized star connection
• high cost
• high performance
daisy chain
• cheap
• low performance
distributed arbitration
• medium price/performance
arbitration for the next bus mastership overlaps with the current transfer

Distributed Arbitration

set of wire-OR priority lines
set of wire-OR timing and control lines
each requesting device indicates its priority
a device removes its less-significant bits if a higher priority is present
eventually only the highest priority remains
special care to ensure fairness
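
A minimal sketch of the elimination round described above (a pure-Python model of the wired-OR lines, not real bus signaling):

```python
# Distributed arbitration: each requester drives its priority onto wire-OR
# lines; a device drops out once it sees a 1 on a more significant line
# where its own priority has a 0.
def arbitrate(priorities, width=4):
    contenders = set(priorities)
    for bit in reversed(range(width)):                   # MSB first
        line = any(p & (1 << bit) for p in contenders)   # wired-OR of the line
        if line:
            contenders = {p for p in contenders if p & (1 << bit)}
    return max(contenders)             # only the highest priority remains

print(arbitrate([3, 9, 6]))            # 9 wins the next bus mastership
```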
Bus Switching Methods

circuit-switched buses
• bus is held until the request is complete
• simple protocol
• latency of the device affects bus utilization
split transaction or packet-switched (or pipelined)
• bus is released after a request is initiated
• others use the bus until the reply comes back
• complex bus control
• better utilization of the bus

Standard I/O Buses

                   S bus       MicroChannel   PCI-Express   SCSI
data width         32 bits     32             32-64         8-16
clock              16-25 MHz   asynch         256           10/asynch
# masters          multiple    multiple       multiple      multiple
b/w, 32-bit read   33 MB/s     20             150+          20 or 6
b/w, peak          89          75             800+          20 or 6

Memory buses

             HP Summit   SGI Challenge   Sun XDbus
data width   128 bits    256             144
clock        60 MHz      48              66
# masters    multiple    multiple        multiple
b/w, peak    960 MB/s    1200            1056

These are older buses
Currently: 128 bits, 250 MHz+, DDR, several 10s of GB/s

I/O Processing

program controlled
• CPU explicitly manages all transfers
• high I/O overhead => big minus!
DMA - direct memory access
• DMA controller manages single block transfers
I/O processors (IOPs)
• processors dedicated to I/O operations
• capable of executing I/O programs
• may be special-purpose or general-purpose
Communicating with I/O processors

I/O control
• memory mapped
• ld/st to "special" addresses => operations occur
• protected by virtual memory
• I/O instructions
• special instructions initiate I/O operations
• protected by privileged instructions

Communicating with I/O processors

I/O completion
• polling
• wait for a status bit to change
• periodic checking
• interrupt
• I/O completion interrupts the CPU
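
A sketch of memory-mapped control with polling (the register offsets and the dict standing in for device registers are made up for illustration):

```python
import time

STATUS_REG, DATA_REG = 0x0, 0x4             # hypothetical register offsets
mmio = {STATUS_REG: 0x0, DATA_REG: 0xBEEF}  # stand-in for device registers

def read32(addr):
    # In a real driver this is a load from a "special" address,
    # protected by virtual memory; here it just reads the stand-in dict.
    return mmio[addr]

def poll_until_done():
    # Polling: periodically check until the status bit changes.
    # The alternative is an interrupt: the device notifies the CPU instead.
    while read32(STATUS_REG) & 0x1 == 0:
        time.sleep(0.001)                   # periodic check, not a tight spin
    return read32(DATA_REG)

mmio[STATUS_REG] = 0x1                      # device signals completion
print(hex(poll_until_done()))               # 0xbeef
```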

IBM 3990 I/O processing

channel == IOP

1 user program sets up a table in memory with the I/O request (a pointer to the channel program), then executes a syscall
2 OS checks for protection, then executes the "start subchannel" instruction
3 the pointer to the channel program is passed to the IOP; the IOP executes the channel program
4 the IOP interacts with the storage director to execute individual channel commands; the IOP is free to do other work between channel commands
5 on completion, the IOP places status in memory and interrupts the CPU

High-Performance Disk Architectures

extensions to conventional disks
disk arrays
redundant arrays of inexpensive disks (RAIDs)
Extensions to conventional disks

fixed-head disk
• head per track; the head does not seek
• seek time eliminated
• rotational latency unchanged
• low track density
• not economical
solid state disks

Extensions to conventional disks

parallel transfer disk
• read from multiple surfaces at the same time
• difficulty in locking onto different tracks on multiple surfaces
• lower-cost alternatives possible (disk arrays)
increasing disk density
• an on-going process
• requires increasingly sophisticated lock-on control
• increases cost

Extensions to conventional disks

disk caches
• RAM to buffer data between device and host
• fast writes - the buffer acts as a write buffer
• better utilization of the host-to-device path
• high miss rate increases request latency
disk scheduling
• schedule simultaneous I/O requests to reduce latency
• e.g., schedule the request with the shortest seek time
• works best for unlikely cases (long queues)

Disk Arrays

collection of individual disks
• each disk has its own arm/head
data distributions

[Figure: three data distributions across four disks - independent (each disk holds a different file: A0, A1, ... on one disk, B0, B1, ... on another), fine-grain striping (each block spread across all four disks: A0 on all, then A1, ...), and coarse-grain striping (consecutive blocks across the disks: A0 A1 A2 A3, then A4 A5 A6 A7, then B0 B1 B2 B3)]
Disk Arrays

independent addressing
• software (the user) distributes the data
• load balancing is an issue
fine-grain striping
• stripe unit of one bit, one byte, or one sector
• #disks x stripe unit evenly divides the smallest accessible data
• perfect load balance; only one request served at a time
• effective transfer rate approx. N times better than a single disk
• access time can go up, unless the disks are synchronized

Disk Arrays

coarse-grain striping
• data transfer parallelism for large requests
• concurrency for small requests
• load balanced by statistical randomization
must consider the workload to determine the stripe size

Redundancy Mechanisms

disk failures are a significant fraction of hardware failures
• striping increases #corrupted files per failure
data replication
• disk mirroring
• allows multiple reads
• writes must be synchronized
parity protection
• use a parity disk

Redundant Array of Inexpensive Disks - RAID

arrays of small cheap disks to provide high performance/reliability
D = # data disks, C = # check disks
level 1: mirrored disks (D=1, C=1)
• overhead too high
level 2: bit-interleaved array for soft errors (e.g., D=10, C=4)
• layout like ECC for DRAMs
• read all bits across groups
• merge updated bits with bits not updated; recompute parity
• rewrite the full group including checks
Redundant Array of Inexpensive Disks - RAID

level 3: hard error detection and parity (e.g., D=4, C=1)
• key: a failed disk is easily identified by the controller
• no need for a special code to identify the failed disk
• striped data - N data disks and 1 parity disk
• because the failed disk is known, parity is enough for recovery
level 4: intra-group parallelism
• coarse-grain striping
• like level 3 + the ability to do more than one small I/O at a time
• a write must update the data disk and the parity disk

Redundant Array of Inexpensive Disks - RAID

level 5: rotated parity to parallelize writes
• parity spread out across the disks in a group
• different parity updates go to different disks
level 6: two-dimensional array
• the data is arranged as a two-dimensional array
• with row and column parities
• tolerates more than 1 failure
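
A minimal sketch of the parity recovery that levels 3-5 rely on (XOR parity; the block contents are made-up byte strings):

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-sized blocks: parity compute and recovery."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data   = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x08"]  # D = 4 disks
parity = xor_blocks(data)                                      # C = 1 disk

# Disk 2 fails. Because the controller knows WHICH disk failed,
# XOR of the survivors plus the parity reconstructs its contents.
survivors = data[:2] + data[3:] + [parity]
assert xor_blocks(survivors) == data[2]
```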

I/O Performance - Method 1

Like the Iron Law, we can do simple calculations for I/O performance

Better option: I/O is a shared resource and sees requests from many jobs, so if the jobs are independent enough, I/O requests will be random enough that we can use queuing theory (ECE 600, 547)

Think of I/O as a queuing system
• requests enter the queue at a certain rate
• wait for service
• service takes a certain time
• requests leave the system at a certain rate
• we can calculate the response time for each request

I/O Performance

assume steady state => arrival rate == departure rate
Little's Law:
• rate = avg. # in system / avg. response time
• applies to any queue in equilibrium

[Figure: arrivals enter a queue that feeds a server]
I/O Performance

Total time in system = time in queue + time in service
the total time is the response time - that's what matters
service rate = 1 / time to serve
length of system = length of queue + avg. # of jobs in service

I/O Performance

utilization = arrival rate / service rate
note that Little's law can be applied to individual components
• server: # in server = arrival rate x time in service
• queue: queue length = arrival rate x time in queue
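
A small numeric sanity check of Little's law and the component identities above (the rates and times are made-up example values):

```python
# Little's law: avg # in system = arrival rate x avg time in system.
arrival_rate  = 40.0       # requests/s (example value)
service_time  = 0.015      # s per request
time_in_queue = 0.025      # s (taken as given here)

utilization = arrival_rate * service_time     # arrival rate / service rate
in_server   = arrival_rate * service_time     # Little's law on the server
queue_len   = arrival_rate * time_in_queue    # Little's law on the queue
in_system   = queue_len + in_server           # length of system
response    = time_in_queue + service_time    # time in queue + in service
assert abs(in_system - arrival_rate * response) < 1e-9
```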

I/O Performance

for FIFO queues
time in system = queue length x time in service + residual service time
residual service time
• depends on the probability distribution of the service time
• exponential => memoryless property
avg residual service time = 1/2 x mean x (1 + C)
• C - the squared coefficient of variation
• C = variance / mean²
• variance = E(X²) - (E(X))²

I/O Performance

C - the squared coefficient of variation
• = 1: exponential
• > 1: hyperexponential
• < 1: hypoexponential
I/O Performance

time in queue = queue length x service time + util x average residual time
time in queue = (service time x (1 + C) x util) / (2 x (1 - util))
if C = 1
• time in queue = service time x (util / (1 - util))
• which is why util should not get too high

I/O Performance

avoid bottlenecks in the I/O system
designing an I/O system
• list the I/O devices
• list cost
• record CPU demand
• list the memory or bus demand of each device
• determine the performance of each option
• simulation or queuing theory
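
A sketch of the queuing-delay formula above, showing how the wait blows up as utilization approaches 1 (the service time is an example value):

```python
# time_in_q = service x (1 + C) x util / (2 x (1 - util))
def time_in_queue(service_time, util, C=1.0):
    return service_time * (1 + C) * util / (2 * (1 - util))

service = 10.0  # ms (example value)
for util in (0.2, 0.5, 0.8, 0.95):
    # With C = 1 this reduces to service x util / (1 - util).
    print(util, "->", round(time_in_queue(service, util), 1), "ms")
# 0.2 -> 2.5 ms, 0.5 -> 10.0 ms, 0.8 -> 40.0 ms, 0.95 -> 190.0 ms
```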

I/O Performance Bottleneck Analysis

Choice of large or small disk drives - find the I/Os per second
• 500 MIPS CPU
• 16-byte, 100 ns memory
• 200 MB/s I/O bus - up to 20 SCSI buses and controllers
• 10000 instrs per I/O
• 16 KB per I/O
Need to find the slowest component (the "weakest link")

I/O Performance

SCSI-2 strings - 20 MB/s with 15 disks per bus
SCSI-2 - 1 ms overhead per I/O
large 8-GB disks or small 2-GB disks
both 7200 RPM, 8-ms avg seek, 6 MB/s transfer
total storage = 200 GB
I/O Performance

CPU limit = 500 MIPS / 10000 = 50000 IOPS
memory limit = (1 / 100 ns) x 16 B / 16 KB = 10000 IOPS
I/O bus limit = 200 MB/s / 16 KB = 12500 IOPS
memory limits performance to 10000 IOPS

I/O Performance

SCSI-2 transfer = 16 KB / 20 MB/s = 0.8 ms
SCSI-2 limit = 1 / (1 ms + 0.8 ms) = 556 IOPS
disk performance
• I/O time = 8 ms + 0.5/7200 min + 16 KB / 6 MB/s = 14.9 ms
• disk limit = 1 / 14.9 ms = 67 IOPS
25 8-GB disks => 25 x 67 = 1675 IOPS
100 2-GB disks => 100 x 67 = 6700 IOPS

I/O Performance

minimum SCSI-2 buses for 25 8-GB disks = 25/15 = 2
minimum SCSI-2 buses for 100 2-GB disks = 100/15 = 7
max IOPS for 2 SCSI-2 = 2 x 556 = 1112
max IOPS for 7 SCSI-2 = 7 x 556 = 3892

I/O Performance

SCSI strings have slightly less performance than the disks
number of disks per SCSI bus at full b/w = 556/67 = 8
number of SCSI buses for 8-GB disks = 25/8 = 4
number of SCSI buses for 2-GB disks = 100/8 = 13
so we have
• 25 8-GB disks with 2 or 4 SCSI strings
• 100 2-GB disks with 7 or 13 SCSI strings
use cost to pick the best
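
The whole bottleneck analysis, collected into one sketch (all inputs come from the slides; decimal KB/MB are used to match the slides' arithmetic):

```python
import math

KB, MB = 1_000, 1_000_000      # decimal units, as in the slides' arithmetic
io_size = 16 * KB              # bytes per I/O

# Per-component IOPS limits; the smallest is the bottleneck.
cpu_limit  = 500e6 / 10_000                     # 50000 IOPS
mem_limit  = 1 / (io_size / 16 * 100e-9)        # 10000 IOPS (16 B per 100 ns)
bus_limit  = 200 * MB / io_size                 # 12500 IOPS
scsi_limit = 1 / (1e-3 + io_size / (20 * MB))   # ~556 IOPS per string
disk_ms    = 8 + 0.5 / 7200 * 60e3 + io_size / (6 * MB) * 1e3   # ~14.9 ms
disk_limit = 1000 / disk_ms                     # ~67 IOPS per disk

disks_per_string = int(scsi_limit // disk_limit)     # 8 disks at full b/w
for n_disks in (25, 100):
    print(n_disks, "disks:",
          math.ceil(n_disks / 15), "to",             # capacity minimum
          math.ceil(n_disks / disks_per_string),     # full-bandwidth count
          "SCSI strings")
# 25 disks: 2 to 4 SCSI strings; 100 disks: 7 to 13 SCSI strings
```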
I/O Performance

We assumed 100% utilization for some of the components
but queuing delay worsens severely at high utilization
so we need to limit utilization - rules of thumb
• I/O bus < 75%
• disk string < 40%
• disk arm < 60%
• disk < 80%
• recalculate performance based on these limits

Unix File System Performance

cache files in memory
• memory is much faster than disks
the file cache is key to I/O performance
• OS parameters - cache size, write policy
• asynchronous writes => the processor continues
• coherence in client/server systems
