
Chapter 6: I/O

Who cares and what to consider
Device characteristics and types
I/O system architecture
• buses, I/O processors
High-performance disk architectures
I/O Performance

Why I/O?

Amdahl's law
• speed up only the CPU, and I/O becomes the bottleneck
• e.g.,
• suppose I/O takes 10% of the time
• speed up the CPU 10 times
• the system only speeds up ~5 times
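
A quick check of the Amdahl's-law arithmetic above, as a minimal sketch (the 10% / 10x figures are from the slide):

```python
# Amdahl's law: overall speedup when only part of the work is sped up.
def amdahl_speedup(accelerated_fraction, factor):
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# CPU work is 90% of the time, I/O the other 10%; speed up the CPU 10x.
print(amdahl_speedup(0.9, 10))  # ~5.26: the system speeds up only ~5x
```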

Throughput vs latency

"There is an old network saying: bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed - you can't bribe God." - David Clark

throughput
• bandwidth
• I/Os per second

latency
• response time

Throughput vs latency

who cares about latency
• why don't you just context switch?
• fallacy
• requires more memory
• requires more processes (jobs)
• human productivity increases super-linearly as response time decreases
I/O Overlap

I/O overlaps with computation in complicated ways

[Figure: USER/OS/I/O timeline - job 1 issues an I/O request and the OS switches to job 2, then job 3; an I/O interrupt marks job 1's I/O as done and job 1 resumes]

I/O Performance

Time_job = Time_CPU + Time_I/O - Time_overlap
• e.g., 10 = 10 + 4 - 4

speed up the CPU by 2x; what is Time_job?
• Time_job = 5 + 4 - 4 = 5 (best)
• Time_job = 5 + 4 - 0 = 9 (worst)
• Time_job = 5 + 4 - 2 = 7 (average?)
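
A minimal sketch of the bounds above (numbers taken from the slide):

```python
# Job time with partial CPU/I-O overlap: t_job = t_cpu + t_io - t_overlap.
def job_time(t_cpu, t_io, t_overlap):
    return t_cpu + t_io - t_overlap

t_cpu, t_io = 10 / 2, 4                 # CPU sped up 2x; I/O unchanged
print(job_time(t_cpu, t_io, t_io))      # 5 (best: I/O fully overlapped)
print(job_time(t_cpu, t_io, 0))         # 9 (worst: no overlap)
print(job_time(t_cpu, t_io, 2))         # 7 (average?)
# Even in the best case the job speeds up 2x; in the worst, only ~1.1x.
```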

I/O Characteristics

supercomputers
• data transfer rate important
• many MBs per second for large files
Transaction processing
• I/O rate important
• "random" accesses
• disk I/Os per second

I/O Characteristics

Time-sharing filesystems
• small files
• sequential accesses
• many creates/deletes
Device Characteristics

behavior
• input - read once
• output - write once
• storage - read many times; usually written as well
partner
• human
• machine
data rate
• peak transfer rate

Device Characteristics

Device             I or O?   Partner   Data Rate (KB/s)
mouse              I         human     0.01
graphics display   O         human     60,000
modem              I/O       machine   2-8
LAN                I/O       machine   500-6000
tape               storage   machine   2000
disk               storage   machine   2000-10,000

Magnetic Disks

[Figure: disk assembly - heads on an arm, platters on a spindle; a cylinder spans the platters (the same track on every surface); each track is divided into sectors separated by inter-sector gaps]

Disk Parameters

spindles: 1-4 (most 1)
platters per spindle: 1-20
rpm: 3000-6000 RPM (most 3600)
platter diameter: 1.3"-8"
• trend towards smaller disks
• higher RPM
• mass production
tracks per surface: 500-2500
Disk Parameters

sectors per track: 32 typical
• —sector #—gap—data+ECC—
• fixed-length sectors (except IBM)
• typically fixed sectors per track
• recently constant bit density

Disk Operations

seek: move head to track
• avg seek time = (Σ_{i=1}^{n} seek(i)) / n
• n is the # of tracks; seek(i) is the time to seek to the ith track
rotational latency: wait for sector
• avg rotational latency = 0.5 rev / (3600 rev/min) = 8.3 ms
transfer rate
• typically 1-4 MB per second

Disk Operations

overhead
• controller delay
• queuing delay

Disk Performance

avg disk access = avg seek time + avg rot. delay + transfer + ovhd
e.g.,
• 3600 rpm; 2 MB/s transfer
• avg seek time: 9 ms
• controller overhead: 1 ms
• read a 512-byte sector
• 9 ms + 0.5/3600 min + 0.5 KB / 2 MB/s + 1 ms
• = 18.6 ms
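
The same average-access arithmetic as a small sketch (numbers from the example above):

```python
# avg disk access = avg seek + avg rotational delay + transfer + overhead
def disk_access_ms(seek_ms, rpm, sector_kb, xfer_mb_s, ovhd_ms):
    rot_ms  = 0.5 / rpm * 60 * 1000                  # half a revolution, in ms
    xfer_ms = sector_kb / (xfer_mb_s * 1000) * 1000  # KB over MB/s, in ms
    return seek_ms + rot_ms + xfer_ms + ovhd_ms

print(disk_access_ms(9, 3600, 0.5, 2, 1))            # ~18.6 ms
```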
Alternatives to Disks

DRAMs
• SSD - solid state disk
• standard disk interface
• DRAM and battery backup
• ES - expanded storage
• software-controlled cache
• large (4K) blocks
+ no seek time
+ fast transfer rate
– cost

Alternatives to Disks

FLASH memory
+ no seek time
+ fast transfer
+ non-volatile
– bulk erase before write
– slow writes
– "wears" out over time

Optical Disks

read-only
• CD-ROM
• cheap and reliable
• slow
write-once
• not-so-cheap
• slow
write-many
• expensive, slow

Graphics Display - CRT

[Figure: electron gun firing through X + Y deflectors onto a phosphor-coated screen]

screen has many scan lines, each of which has many pixels
phosphor acts as a capacitor - refresh 30-60 times/second
Graphics Displays - Frame Buffer

[Figure: CPU and memory feed the frame buffer at 0.2 MB/s; the frame buffer drives the CRT at 30 MB/s]

Graphics Displays - Frame Buffer

frame buffer stores the bit map
• one entry per pixel
• black - 1 bit per pixel
• gray-scale - 4-8 bits per pixel
• color (RGB) - 8 bits per color
• typical size 1560 x 1280 pixels
• black and white: 250 KB
• color (RGB): 5.7 MB
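
Checking the frame-buffer sizes above, as a minimal sketch of the arithmetic:

```python
# Frame-buffer storage for a 1560 x 1280 display.
pixels    = 1560 * 1280
bw_bytes  = pixels // 8          # black and white: 1 bit per pixel
rgb_bytes = pixels * 3           # color: 8 bits per color x 3 = 24 bits/pixel

print(bw_bytes / 1000, "KB")     # ~250 KB
print(rgb_bytes / 2**20, "MB")   # ~5.7 MB
```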

Reducing cost of Frame Buffer

key idea: only a small number of colors are used in one image
color map: frame buffer stores a color map index
• color map translates the index to a full 24-bit color

[Figure: the frame buffer holds an 8-bit index per pixel (e.g., pixel (X0, Y0) holds 17); the 256 x 24 color map translates index 17 to the 24-bit RGB value 120 014 074 on the way to the CRT]

• 1560 x 1280 with a 256-entry color map - factor of 3 reduction

Frame Buffer Operations

logically output only
• but read as well
BIT BLTs: bit block transfers
• read-modify-write operations
• e.g., read-xor-write
• used for cursors etc.
open question
• OS only?
• or direct user access? protection?
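
A minimal sketch of the color-map translation (the palette entry is the one shown in the figure; the tiny image is made up for illustration):

```python
# Indexed color: the frame buffer stores an 8-bit index per pixel; the
# 256-entry color map expands it to full 24-bit RGB on the way to the CRT.
color_map = [(0, 0, 0)] * 256            # 256 x 24-bit (R, G, B) entries
color_map[17] = (120, 14, 74)            # entry from the figure

frame_buffer = [[17, 17], [0, 17]]       # tiny 2x2 image of 8-bit indices
rgb = [[color_map[i] for i in row] for row in frame_buffer]
# 8 bits/pixel stored instead of 24 -> the factor-of-3 reduction
```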
Frame Buffer Implementation

1560 x 1280 RGB display
• bandwidth required = 1560 x 1280 x 24 x 30 = 171 MB/s
how can we implement this?
• video DRAMs
• dual-ported DRAM
• regular random-access port
• serial video port
• use 24 in parallel for RGB
what about bandwidth? interleave video DRAMs

Other Issues in Displays

double buffering
• duplicate frame buffer
• to prevent displaying incomplete updates
• may be necessary for animation
z-buffer
• for displaying 3-D images
• assign a z-dimension to each pixel
• store the z-dimension in the frame buffer
• BIT BLTs compare z-dimensions

Networks

Terminal networks
• machine-terminal
• star - point-to-point
• 0.3-19 Kbits/s, RS232 protocol
LANs
• machine-machine
• bus, ring, star
• 0.1-100 Mbits/s, < 10 km
• Ethernet

Networks

Long-haul networks
• machine-machine
• irregular structure - point-to-point
• 50-2000 Kbits/s, > 10 km
• Internet
LAN

E.g., Ethernet
• one-write bus with collisions and exponential backoff
• within a building
• 10 Mb
Now Ethernet is
• point-to-point to clients (switched network)
• with hubs
• client s/w unchanged
• 100 Mb

LAN

ATM - Asynchronous Transfer Mode
Phone companies use it for long-haul networks (packet-switched)
not a viable LAN yet

WAN

E.g., ARPANET, Internet
arranged as a DAG
backbones now 1 Gb/s; 100 Gb/s in the future
TCP/IP - protocol stack
• Transmission Control Protocol, Internet Protocol
Key issues:
• top-to-bottom systems issues
• getting the net into homes
• cable modem, ISDN, ??

I/O System Architecture

hierarchical data paths
• bandwidth divides going down the hierarchy
• often buses at each level
I/O processing
• program controlled
• DMA
• dedicated I/O processors
I/O System Architecture

[Figure: CPU with cache on a CPU-Memory bus shared with memory, an IOP, and a frame buffer driving a CRT; below the IOP, an I/O bus connects disk controllers and a network interface]

Buses

Option                          High performance   Low cost
address/data lines separate?    yes                no
data lines                      wider              narrower
transfer size                   multiple words     single word
bus masters                     multiple           one
split transactions              yes                no
clocking                        synchronous        asynchronous

Buses

CPU-Memory buses
• want speed
• usually custom designs (fast - several GB/s)
• e.g., SGI Challenge, Sun SD, HP Summit
I/O buses
• compatibility is important
• usually standard designs - PCI (Express), SCSI (slower - <= GB/s)

CPU interface

physical connection
• direct to cache
+ no coherence problems
– pollutes cache
– CPU and I/O arbitrate for the cache
• CPU-memory bus
+ DMA
– may not be standard
CPU interface

• I/O bus
+ industry standard
– slower than the memory bus
– indirection through the I/O processor

CPU Interface

[Figure: the three attachment points - I/O direct to the cache; I/O on the CPU-Memory bus alongside memory; or I/O on a separate I/O bus behind an IOP]

Bus Arbitration

centralized star connection
• high cost
• high performance
daisy chain
• cheap
• low performance
distributed arbitration
• medium price/performance
arbitration for the next bus mastership overlaps with the current transfer

Distributed Arbitration

set of wire-OR priority lines
set of wire-OR timing and control lines
each requesting device indicates its priority
a device removes its less-significant bits if a higher priority is present
eventually only the highest priority remains
special care to ensure fairness
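
A minimal sketch of the elimination round described above (a pure-Python model of the wired-OR lines, not real bus signaling):

```python
# Distributed arbitration: each requester drives its priority onto wire-OR
# lines; a device drops out once it sees a 1 on a more significant line
# where its own priority has a 0.
def arbitrate(priorities, width=4):
    contenders = set(priorities)
    for bit in reversed(range(width)):                   # MSB first
        line = any(p & (1 << bit) for p in contenders)   # wired-OR of the line
        if line:
            contenders = {p for p in contenders if p & (1 << bit)}
    return max(contenders)             # only the highest priority remains

print(arbitrate([3, 9, 6]))            # 9 wins the next bus mastership
```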
Bus Switching Methods

circuit-switched buses
• bus is held until the request is complete
• simple protocol
• latency of the device affects bus utilization
split transaction or packet-switched (or pipelined)
• bus is released after a request is initiated
• others use the bus until the reply comes back
• complex bus control
• better utilization of the bus

Standard I/O Buses

                   S bus       MicroChannel   PCI-Express   SCSI
data width         32 bits     32             32-64         8-16
clock              16-25 MHz   asynch         256           10/asynch
# masters          multiple    multiple       multiple      multiple
b/w, 32-bit read   33 MB/s     20             150+          20 or 6
b/w, peak          89          75             800+          20 or 6

Memory buses

             HP Summit   SGI Challenge   Sun XDbus
data width   128 bits    256             144
clock        60 MHz      48              66
# masters    multiple    multiple        multiple
b/w, peak    960 MB/s    1200            1056

These are older buses
Currently: 128 bits, 250 MHz+, DDR, several 10s of GB/s

I/O Processing

program controlled
• CPU explicitly manages all transfers
• high I/O overhead => big minus!
DMA - direct memory access
• DMA controller manages single block transfers
I/O processors (IOPs)
• processors dedicated to I/O operations
• capable of executing I/O programs
• may be special-purpose or general-purpose
Communicating with I/O processors

I/O control
• memory mapped
• ld/st to "special" addresses => operations occur
• protected by virtual memory
• I/O instructions
• special instructions initiate I/O operations
• protected by privileged instructions

Communicating with I/O processors

I/O completion
• polling
• wait for a status bit to change
• periodic checking
• interrupt
• I/O completion interrupts the CPU
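
A sketch of memory-mapped control with polling (the register offsets and the dict standing in for device registers are made up for illustration):

```python
import time

STATUS_REG, DATA_REG = 0x0, 0x4             # hypothetical register offsets
mmio = {STATUS_REG: 0x0, DATA_REG: 0xBEEF}  # stand-in for device registers

def read32(addr):
    # In a real driver this is a load from a "special" address,
    # protected by virtual memory; here it just reads the stand-in dict.
    return mmio[addr]

def poll_until_done():
    # Polling: periodically check until the status bit changes.
    # The alternative is an interrupt: the device notifies the CPU instead.
    while read32(STATUS_REG) & 0x1 == 0:
        time.sleep(0.001)                   # periodic check, not a tight spin
    return read32(DATA_REG)

mmio[STATUS_REG] = 0x1                      # device signals completion
print(hex(poll_until_done()))               # 0xbeef
```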

IBM 3990 I/O processing

channel == IOP

1 user program sets up a table in memory with the I/O request (a pointer to the channel program), then executes a syscall
2 OS checks for protection, then executes the "start subchannel" instruction
3 the pointer to the channel program is passed to the IOP; the IOP executes the channel program
4 the IOP interacts with the storage director to execute individual channel commands; the IOP is free to do other work between channel commands
5 on completion, the IOP places status in memory and interrupts the CPU

High-Performance Disk Architectures

extensions to conventional disks
disk arrays
redundant arrays of inexpensive disks (RAIDs)
Extensions to conventional disks

fixed-head disk
• head per track; the head does not seek
• seek time eliminated
• rotational latency unchanged
• low track density
• not economical
solid state disks

Extensions to conventional disks

parallel transfer disk
• read from multiple surfaces at the same time
• difficulty in locking onto different tracks on multiple surfaces
• lower-cost alternatives possible (disk arrays)
increasing disk density
• an on-going process
• requires increasingly sophisticated lock-on control
• increases cost

Extensions to conventional disks

disk caches
• RAM to buffer data between device and host
• fast writes - the buffer acts as a write buffer
• better utilization of the host-to-device path
• high miss rate increases request latency
disk scheduling
• schedule simultaneous I/O requests to reduce latency
• e.g., schedule the request with the shortest seek time
• works best for unlikely cases (long queues)

Disk Arrays

collection of individual disks
• each disk has its own arm/head
data distributions

[Figure: three data distributions across four disks - independent (each disk holds a different file: A0, A1, ... on one disk, B0, B1, ... on another), fine-grain striping (each block spread across all four disks: A0 on all, then A1, ...), and coarse-grain striping (consecutive blocks across the disks: A0 A1 A2 A3, then A4 A5 A6 A7, then B0 B1 B2 B3)]
Disk Arrays

independent addressing
• software (the user) distributes the data
• load balancing is an issue
fine-grain striping
• stripe unit of one bit, one byte, or one sector
• #disks x stripe unit evenly divides the smallest accessible data
• perfect load balance; only one request served at a time
• effective transfer rate approx. N times better than a single disk
• access time can go up, unless the disks are synchronized

Disk Arrays

coarse-grain striping
• data transfer parallelism for large requests
• concurrency for small requests
• load balanced by statistical randomization
must consider the workload to determine the stripe size

Redundancy Mechanisms

disk failures are a significant fraction of hardware failures
• striping increases #corrupted files per failure
data replication
• disk mirroring
• allows multiple reads
• writes must be synchronized
parity protection
• use a parity disk

Redundant Array of Inexpensive Disks - RAID

arrays of small cheap disks to provide high performance/reliability
D = # data disks, C = # check disks
level 1: mirrored disks (D=1, C=1)
• overhead too high
level 2: bit-interleaved array for soft errors (e.g., D=10, C=4)
• layout like ECC for DRAMs
• read all bits across groups
• merge updated bits with bits not updated; recompute parity
• rewrite the full group including checks
Redundant Array of Inexpensive Disks - RAID

level 3: hard error detection and parity (e.g., D=4, C=1)
• key: a failed disk is easily identified by the controller
• no need for a special code to identify the failed disk
• striped data - N data disks and 1 parity disk
• because the failed disk is known, parity is enough for recovery
level 4: intra-group parallelism
• coarse-grain striping
• like level 3 + the ability to do more than one small I/O at a time
• a write must update the data disk and the parity disk

Redundant Array of Inexpensive Disks - RAID

level 5: rotated parity to parallelize writes
• parity spread out across the disks in a group
• different parity updates go to different disks
level 6: two-dimensional array
• the data is arranged as a two-dimensional array
• with row and column parities
• tolerates more than 1 failure
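
A minimal sketch of the parity recovery that levels 3-5 rely on (XOR parity; the block contents are made-up byte strings):

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-sized blocks: parity compute and recovery."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data   = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x08"]  # D = 4 disks
parity = xor_blocks(data)                                      # C = 1 disk

# Disk 2 fails. Because the controller knows WHICH disk failed,
# XOR of the survivors plus the parity reconstructs its contents.
survivors = data[:2] + data[3:] + [parity]
assert xor_blocks(survivors) == data[2]
```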

I/O Performance - Method 1

Like the Iron Law, we can do simple calculations for I/O performance

Better option: I/O is a shared resource and sees requests from many jobs, so if the jobs are independent enough, I/O requests will be random enough that we can use queuing theory (ECE 600, 547)

Think of I/O as a queuing system
• requests enter the queue at a certain rate
• wait for service
• service takes a certain time
• requests leave the system at a certain rate
• we can calculate the response time for each request

I/O Performance

assume steady state => arrival rate == departure rate
Little's Law:
• rate = avg. # in system / avg. response time
• applies to any queue in equilibrium

[Figure: arrivals enter a queue that feeds a server]
I/O Performance

Total time in system = time in queue + time in service
the total time is the response time - that's what matters
service rate = 1 / time to serve
length of system = length of queue + avg. # of jobs in service

I/O Performance

utilization = arrival rate / service rate
note that Little's law can be applied to individual components
• server: # in server = arrival rate x time in service
• queue: queue length = arrival rate x time in queue
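
A small numeric sanity check of Little's law and the component identities above (the rates and times are made-up example values):

```python
# Little's law: avg # in system = arrival rate x avg time in system.
arrival_rate  = 40.0       # requests/s (example value)
service_time  = 0.015      # s per request
time_in_queue = 0.025      # s (taken as given here)

utilization = arrival_rate * service_time     # arrival rate / service rate
in_server   = arrival_rate * service_time     # Little's law on the server
queue_len   = arrival_rate * time_in_queue    # Little's law on the queue
in_system   = queue_len + in_server           # length of system
response    = time_in_queue + service_time    # time in queue + in service
assert abs(in_system - arrival_rate * response) < 1e-9
```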

I/O Performance

for FIFO queues
time in system = queue length x time in service + residual service time
residual service time
• depends on the probability distribution of the service time
• exponential => memoryless property
avg residual service time = 1/2 x mean x (1 + C)
• C - the squared coefficient of variation
• C = variance / mean²
• variance = E(X²) - (E(X))²

I/O Performance

C - the squared coefficient of variation
• = 1: exponential
• > 1: hyperexponential
• < 1: hypoexponential
I/O Performance

time in queue = queue length x service time + util x average residual time
time in queue = (service time x (1 + C) x util) / (2 x (1 - util))
if C = 1
• time in queue = service time x (util / (1 - util))
• which is why util should not get too high

I/O Performance

avoid bottlenecks in the I/O system
designing an I/O system
• list the I/O devices
• list cost
• record CPU demand
• list the memory or bus demand of each device
• determine the performance of each option
• simulation or queuing theory
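
A sketch of the queuing-delay formula above, showing how the wait blows up as utilization approaches 1 (the service time is an example value):

```python
# time_in_q = service x (1 + C) x util / (2 x (1 - util))
def time_in_queue(service_time, util, C=1.0):
    return service_time * (1 + C) * util / (2 * (1 - util))

service = 10.0  # ms (example value)
for util in (0.2, 0.5, 0.8, 0.95):
    # With C = 1 this reduces to service x util / (1 - util).
    print(util, "->", round(time_in_queue(service, util), 1), "ms")
# 0.2 -> 2.5 ms, 0.5 -> 10.0 ms, 0.8 -> 40.0 ms, 0.95 -> 190.0 ms
```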

I/O Performance Bottleneck Analysis

Choice of large or small disk drives - find the I/Os per second
• 500 MIPS CPU
• 16-byte, 100 ns memory
• 200 MB/s I/O bus - up to 20 SCSI buses and controllers
• 10000 instrs per I/O
• 16 KB per I/O
Need to find the slowest component (the "weakest link")

I/O Performance

SCSI-2 strings - 20 MB/s with 15 disks per bus
SCSI-2 - 1 ms overhead per I/O
large 8-GB disks or small 2-GB disks
both 7200 RPM, 8-ms avg seek, 6 MB/s transfer
total storage = 200 GB
I/O Performance

CPU limit = 500 MIPS / 10000 = 50000 IOPS
memory limit = (1 / 100 ns) x 16 B / 16 KB = 10000 IOPS
I/O bus limit = 200 MB/s / 16 KB = 12500 IOPS
memory limits performance to 10000 IOPS

I/O Performance

SCSI-2 transfer = 16 KB / 20 MB/s = 0.8 ms
SCSI-2 limit = 1 / (1 ms + 0.8 ms) = 556 IOPS
disk performance
• I/O time = 8 ms + 0.5/7200 min + 16 KB / 6 MB/s = 14.9 ms
• disk limit = 1 / 14.9 ms = 67 IOPS
25 8-GB disks => 25 x 67 = 1675 IOPS
100 2-GB disks => 100 x 67 = 6700 IOPS

I/O Performance

minimum SCSI-2 buses for 25 8-GB disks = 25/15 = 2
minimum SCSI-2 buses for 100 2-GB disks = 100/15 = 7
max IOPS for 2 SCSI-2 = 2 x 556 = 1112
max IOPS for 7 SCSI-2 = 7 x 556 = 3892

I/O Performance

SCSI strings have slightly less performance than the disks
number of disks per SCSI bus at full b/w = 556/67 = 8
number of SCSI buses for 8-GB disks = 25/8 = 4
number of SCSI buses for 2-GB disks = 100/8 = 13
so we have
• 25 8-GB disks with 2 or 4 SCSI strings
• 100 2-GB disks with 7 or 13 SCSI strings
use cost to pick the best
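
The whole bottleneck analysis, collected into one sketch (all inputs come from the slides; decimal KB/MB are used to match the slides' arithmetic):

```python
import math

KB, MB = 1_000, 1_000_000      # decimal units, as in the slides' arithmetic
io_size = 16 * KB              # bytes per I/O

# Per-component IOPS limits; the smallest is the bottleneck.
cpu_limit  = 500e6 / 10_000                     # 50000 IOPS
mem_limit  = 1 / (io_size / 16 * 100e-9)        # 10000 IOPS (16 B per 100 ns)
bus_limit  = 200 * MB / io_size                 # 12500 IOPS
scsi_limit = 1 / (1e-3 + io_size / (20 * MB))   # ~556 IOPS per string
disk_ms    = 8 + 0.5 / 7200 * 60e3 + io_size / (6 * MB) * 1e3   # ~14.9 ms
disk_limit = 1000 / disk_ms                     # ~67 IOPS per disk

disks_per_string = int(scsi_limit // disk_limit)     # 8 disks at full b/w
for n_disks in (25, 100):
    print(n_disks, "disks:",
          math.ceil(n_disks / 15), "to",             # capacity minimum
          math.ceil(n_disks / disks_per_string),     # full-bandwidth count
          "SCSI strings")
# 25 disks: 2 to 4 SCSI strings; 100 disks: 7 to 13 SCSI strings
```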
I/O Performance

We assumed 100% utilization for some of the components
but queuing delay worsens severely at high utilization
so we need to limit utilization - rules of thumb
• I/O bus < 75%
• disk string < 40%
• disk arm < 60%
• disk < 80%
• recalculate performance based on these limits

Unix File System Performance

cache files in memory
• memory is much faster than disks
the file cache is key to I/O performance
• OS parameters - cache size, write policy
• asynchronous writes => the processor continues
• coherence in client/server systems
