
Version 3.10 LISA 2006 1
1994-2006 Hal Stern, Marc Staveley
System & Network
Performance Tuning
Hal Stern
Sun Microsystems
Marc Staveley
SOMA Networks
This tutorial is copyright 1994-1999 by Hal L. Stern and 1998-2006 by Marc
Staveley. It may not be used in whole or part for commercial purposes without
the express written permission of Hal L. Stern and Marc Staveley.
Hal Stern is a Distinguished Systems Engineer at Sun Microsystems. He was the
System Administration columnist for SunWorld from February 1992 until April
1997, and previous columns and commentary are archived at:
http://www.sun.com/sunworldonline.
Hal can be reached at hal.stern@sun.com.
Marc Staveley is the Director of IT for SOMA Networks Inc. He is a frequent
speaker on the topics of standards-based development, multi-threaded
programming, system administration and performance tuning.
Marc can be reached at marc@staveley.com.
Some of the material in the Notes sections has been derived from columns and
articles first appearing in SunWorld, Advanced Systems and SunWorld Online.
Hal thanks IDG and Michael McCarthy for their flexibility in allowing him to
retain the copyrights to these pieces.
Rough agenda:
9:00 - 10:30 AM Section 1
11:00 - 12:30 PM Section 2
1:30 - 3:00 PM Section 3
3:30 - 5:00 PM Section 4
Version 3.10 LISA 2006 2
1994-2006 Hal Stern, Marc Staveley
Syllabus
Tuning Strategies & Expectations
Server Tuning
NFS Performance
Network Design, Capacity Planning &
Performance
Application Architecture
Some excellent books on the topic:
Raj Jain, Computer System Performance (Wiley)
Mike Loukides, System Performance Tuning (O'Reilly)
Adrian Cockcroft and Richard Pettit, Sun Performance and Tuning, Java and the
Internet (SMP/PH)
Craig Hunt, TCP/IP Network Administration (O'Reilly)
Brian Wong, Configuration and Capacity Planning for Solaris Servers
(SunSoft/PH)
Richard McDougall et al., Sun Blueprints: Resource Management (SMP/PH)
Some Web resources:
Solaris Tunable Parameters Reference Manual
(http://docs.sun.com/app/docs/doc/806-4015?q=tunable+parameters/)
Solaris 2 - Tuning Your TCP/IP Stack and More (http://www.sean.de/Solaris)
Version 3.10 LISA 2006 3
1994-2006 Hal Stern, Marc Staveley
Tutorial Structure
Background and internals
Necessary to understand user-visible symptoms
How to play doctor
Diagnosing problems
Rules of thumb, upper and lower bounds
Kernel tunable parameters
Formulae for deriving where appropriate
If you take only two things from the whole tutorial, they should be:
- Disk configuration matters
- Memory matters
Version 3.10 LISA 2006 4
1994-2006 Hal Stern, Marc Staveley
Tuning Strategies &
Expectations
Section 1
Version 3.10 LISA 2006 5
1994-2006 Hal Stern, Marc Staveley
Topics
Practical goals
Terms & conditions
Workload characterization
Statistics and ratios
Monitoring intervals
Understanding diagnostic output
Version 3.10 LISA 2006 6
1994-2006 Hal Stern, Marc Staveley
Practical Goals
Section 1.1
Version 3.10 LISA 2006 7
1994-2006 Hal Stern, Marc Staveley
Why Is This Hard?
[Diagram: layers from the Business Transaction, through the Database Transaction (transaction monitor, DBMS organization, SQL optimizer), down to System resources (user CPU, system CPU, network latency, disk I/O), annotated with increasing loss of correlation and decreasing ease of measurement across the layers.]
The problem with un-correlated inputs and measurements is akin to that of
driving a car while blindfolded: the passenger steers while elbowing you to
work the gas and brakes. When your reflexes are quick, you can manage, but if
you misinterpret a signal, you end up rebooting your car.
Correlating user work with system resources is what Sun's Dtrace and
FreeBSD's ktrace attempt to do.
Version 3.10 LISA 2006 8
1994-2006 Hal Stern, Marc Staveley
Social Contract Of Administration
Why bother tuning?
Resource utilization, purchase plans, user outrage
Users want 10x what they have today
sound and video today, HDTV tomorrow
Simulation and decision support capabilities
Application developers should share
responsibility
Who owns educational process?
Performance and complexity trade-off
Load, throughput and cost evaluations
System administrators today are playing a difficult game of perception
management. Hardware prices have declined to the point where most
managers believe you can get Tandem-like fault tolerance at PC prices with no
additional software, processes or disciplines. Much of this tutorial is about
acquiring, enforcing and insisting on discipline.
Version 3.10 LISA 2006 9
1994-2006 Hal Stern, Marc Staveley
Tuning Potential
Application architecture: 1,000x
SQL, query optimizer, caching, system calls
Server configuration: 100x
Disk striping, eliminate paging
Application fine-tuning: 2-10x
Threads, asynchronous I/O
Kernel tuning: less than 2x on tuned system
If kernel bottleneck is present, then 10-100x
Kernel can be a binary performance gate
Here are some "laws" of the computing realm compared:
Moore's law predicts a doubling of CPU horsepower every 18 months, so that
gives us about a 16x improvement in 6 years.
If you look at reported transaction throughput for Unix database systems,
though, you'll see a 100x improvement in the past 6 years -- there's more than
just compute horsepower at work. What we've measured is the result of
operating systems, disks, parallelism, bus throughput and improved
applications.
An excellent discussion of "rules of thumb" as a consequence of Moore's Law is
found in Gray and Shenoy's Rules of thumb in data engineering, Microsoft
Research technical report MS-TR-99-100, Feb. 2000.
Version 3.10 LISA 2006 10
1994-2006 Hal Stern, Marc Staveley
Practical Tuning Rules
There is no "ideal" state in a fluid world
Law of diminishing returns
Early gains are biggest/best
More work may not be cost-effective
Negativism prevails
Easy to say "This won't work"
Hard to prove configuration can deliver on goals
Headroom for well-tuned applications?
Good tuning job introduces new demands
Kaizen
Version 3.10 LISA 2006 11
1994-2006 Hal Stern, Marc Staveley
Terminology: Bit Rates
Bandwidth
Peak of the medium, bus: what's available
Easy to quote, hard to reach
Throughput
What you really get: useful data
Protocol dependent
Utilization
How much you used
Not just throughput/bandwidth
100% utilized with useless data: collisions
Bandwidth => Utilization => Throughput
Each measurement shows a slight (or sometimes great) loss over the previously
ordered metric.
Formal definitions:
Bandwidth: the maximum achievable throughput under ideal workload
conditions (nominal capacity)
Throughput: rate at which the requests can be serviced by the system.
Utilization: the fraction of time the resource is busy servicing requests.
Version 3.10 LISA 2006 12
1994-2006 Hal Stern, Marc Staveley
Terminology: Time
Latency
How long you wait for something
Response time
What user sees: system as a black box
Standard measures
TPC-C: transactions per minute
TPC-D: queries per hour
Bad Things
Knee, wall, non-linear
[Graph: throughput vs. load, showing the knee capacity and the usable capacity before the non-linear region.]
Version 3.10 LISA 2006 13
1994-2006 Hal Stern, Marc Staveley
Example
Bandwidth to NYC
10 lanes x 5 cars/s x 4 people/car = 200 pps
Throughput
1 person/car (bad protocol), 1-2 cars/s (congestion)
Parking delays (latency)
How to fix it
Increase number of lanes (bandwidth)
More people per vehicle (efficient protocol)
Eliminate toll (congestion)
Better parking lots (reduce latency)
Tolls add to latency (since you have to stop and pay them) and also to
congestion when traffic merges back into a few lanes. Congestion from traffic
merges is another form of increased latency.
Now consider this: You wire your office with 100baseT to the desktops, feeding
into 1000baseT switched Ethernet hubs. If you run 16 desktops into each
switch, you're merging 16 * 100 = 1600 Mbits/sec into a 1000 Mbits/sec
"tunnel".
Version 3.10 LISA 2006 14
1994-2006 Hal Stern, Marc Staveley
Unit Of Work Paradox
Unit of work is the typical "chunk size" for
Network traffic
Disk I/O
Small units optimized for response time
Network transfer latency, remote processing
Large units optimized for protocol efficiency
Compare ftp (~4% waste) & telnet (~90% waste)
Ideal for large transfers like audio, video
Where does ATM fit?
ATM uses fixed-size cells, making it ideal for audio and video that need to be
optimized for response time. Unfortunately, the cells are very small (48 bytes of
payload) so ATM incurs a large processing overhead for transfers involving
large files, like audio or video clips.
Version 3.10 LISA 2006 15
1994-2006 Hal Stern, Marc Staveley
Workload Characterization
What are the users (processes) doing?
Estimating current & future performance
Understanding resource utilization
Fixed workloads
Easy to characterize & project
Random workloads
Take measurements, look at facilities over time
Tools & measurements intervals
Version 3.10 LISA 2006 16
1994-2006 Hal Stern, Marc Staveley
Completeness Counts
Random or sequential access?
Koan of this tutorial
Don't say: 1,000 NFS requests/second
Read/write and attribute browsing mix?
Average file size and lifetime?
Working set of files?
Don't say: 400 transactions/second
Lookup, insert, update mix?
Indexes used?
Checkpoints, logs, 2-phase commit?
Version 3.10 LISA 2006 17
1994-2006 Hal Stern, Marc Staveley
Statistics & Ratios
Section 1.2
Version 3.10 LISA 2006 18
1994-2006 Hal Stern, Marc Staveley
Useful Metrics
Latency over utilization
Loaded resources may be sufficient
What does the user see?
Peak versus average load
How system reacts under crunch
What are new failure modes at peaks?
Time to:
Recover, repair, rebuild from faults
Accommodate new workflow
Managing applications
Version 3.10 LISA 2006 19
1994-2006 Hal Stern, Marc Staveley
Recording Intervals
Instantaneous data rarely useful
Computer and business transactions long-lived
Smooth out spikes in small neighbourhoods
Long-term averages aren't useful either
Peak demand periods disappear
Can't tie resources to user functions
Combine intervals
5-10 seconds for fine-grain work (OLTP)
10-30 seconds for peak measurement
10-30 minutes for coarse-grain activity (NFS)
Version 3.10 LISA 2006 20
1994-2006 Hal Stern, Marc Staveley
Nyquist Frequency
Same total load between B and D
Peaks are different at C
Sampling frequency determines accuracy
Nyquist frequency is >2x "peak cycle"
Peaks every 5 min, sample every 2.5 min
[Graph: two load curves over time, with sample points A through E marked.]
The total area under the two curves is about the same from "B" to "D". If you
simply measure at these endpoints and take an average, you'll think the two
loads are the same, and miss the peaks. If you measure at twice the frequency of
the peaks -- "B", "C" and "D", you'll see that peak demand is greater than the
average on the green-lined system.
The Nyquist theorem: to reconstruct a sampled input signal accurately,
sampling rate must be greater than twice the highest frequency in the input
signal.
The Nyquist frequency: the sampling rate / 2
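As a rough sketch of putting the Nyquist rule to work (the 150-second interval and the vmstat columns are illustrative assumptions, not part of the slides), a loop like this samples at half the 5-minute peak period and timestamps each line so spikes can be lined up with user activity:
while :
do
    echo "`date '+%H:%M:%S'` `vmstat 5 2 | tail -1`"
    sleep 150
done
The "vmstat 5 2 | tail -1" trick discards the since-boot summary line and keeps only the 5-second interval sample.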
Version 3.10 LISA 2006 21
1994-2006 Hal Stern, Marc Staveley
Normal Values
Maintain baselines
"But it was faster on Tuesday!"
Distinguish normal and threshold-crossing states
Correlate to type of work being done (user model)
Scalar proclamations aren't valuable
CPU load without application knowledge
Disk I/O traffic without access patterns
Memory usage without cache hit data
Version 3.10 LISA 2006 22
1994-2006 Hal Stern, Marc Staveley
Effective Ratios
Find relationships between work and resources
Units of work: NFS operations, DB requests
Units of management: disks, memory, network
Use correlated variables
Or ratios are just randomly scaled samples
Measure something to be improved
Bad example: Bugs/lines of code
Good example: collisions/packet size
Confidence intervals
Sensitivity of ratio & error bars (accuracy)
Be sure you can control granularity of the denominator. That is, you shouldn't
be able to cheat by increasing the denominator and lowering a cost-oriented
ratio, showing false improvement. Bugs per line of code is a miserable metric
because the code size can be inflated. Quality is the same but the metric says
you've made progress.
The accuracy of a ratio is multiplied by its sensitivity - a small understatement in
a ratio that grows superlinearly with its denominator turns into a large error.
When you multiply two inexact numbers, you also multiply their errors
together. Looking at 50 I/O operations per second, plus or minus 5 Iops is
reasonable, but 50 Iops plus or minus 45 Iops is the same as taking a guess.
The Arms index, named for Richard Arms, is sometimes called the TRIN
(Trading Index). It's a measure of correlation between the price and volume
movements of the NYSE. Instead of looking at up stocks/down stocks or up
volume/down volume, the Arms index computes
(up stocks/down stocks) / (up vol/down vol)
When the index is at 1.0, the up and down volumes reflect the number of issues
moving in each direction. An index of 0.5 means advancing issues have twice
the share volume of decliners (strong); an index over 2.0 means the decliners are
outpacing the gainers on a volume basis.
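Back in the systems world, a hedged example of such a work-vs-resource ratio built from standard tools: the one-liner below divides collisions by output packets per interface. The field numbers assume the classic SunOS/Solaris netstat -i layout (Opkts in column 7, Collis in column 9); adjust them for your platform.
% netstat -i | awk 'NR > 1 && $7 > 0 { printf "%-8s %.4f collisions/output packet\n", $1, $9 / $7 }'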
Version 3.10 LISA 2006 23
1994-2006 Hal Stern, Marc Staveley
Understanding
Diagnostic Output
Section 1.3
Version 3.10 LISA 2006 24
1994-2006 Hal Stern, Marc Staveley
General Guidelines
Use whatever works for you
Make sure you understand output format & scaling
Know inconsistencies by platform & tool
Ignore the first line of output
Average since system was booted
Interval data is more important
Accounting system
Source of accurate fine-grain data
Need to turn on on most systems
Process accounting gives you detailed break-downs of resource utilization,
including the number of system calls, the amount of CPU used, and so on. This
adds at most a few percent to system overhead. While accounting can be about
5% in worst case, auditing (used for security and fine-grain access control) adds
between 10-20% overhead. Auditing tracks every operation from a user process
into the kernel.
If your system stays up for a long (100 days or more) period of time, you may
find some of the counters wrap around their 31-bit signed values, producing
negative reported values.
Version 3.10 LISA 2006 25
1994-2006 Hal Stern, Marc Staveley
Standard UNIX System Tools
vmstat, sar
Memory, CPU and system (trap) activity
sar has more detail, histories
vmstat uses KB, sar uses pages
iostat
Disk I/O service time and operation workhorse
nfsstat
Client and server side data
netstat
TCP/IP stack internals
pflags, pcred, pmap, pldd, psig, pstack, pfiles, pwdx, pstop, prun, pwait,
ptree, ptime: (Solaris) display various pieces of information about process(es)
in the system.
mpstat: (Solaris, Linux): per-processor statistics, e.g. faults, inter-processor
cross-calls, interrupts, context switches, thread migrations etc.
top (all), prstat (Solaris): show an updated view of the processes in the system.
memtool (Solaris <=8): everything you ever wanted to know about the
memory usage in a Solaris box [http://playground.sun.com/pub/memtool]
mdb::memstat (Solaris >=9): same info as memtool
Lockstat, Dtrace (Solaris >=10): what are the processes and kernel really
doing?
setoolkit (Solaris, and soon others): virtual performance experts
[http://www.setoolkit.com]
kstat (Solaris): display kernel statistics
RRDB/ORCA/Cricket/MRTG/NRG/Smokeping/HotSaNIC/OpenNMS:
performance graphing tools
HP Perfview: part of OpenVIEW
Version 3.10 LISA 2006 26
1994-2006 Hal Stern, Marc Staveley
Accounting
7 processes running on a loaded system
top or ps show "cycling" of processes on CPUs
Which one is the pig in terms of user CPU, system
calls, disk I/O initiated?
Accounting data shows per-process info
Memory
CPU
System calls
Turn on Accounting - Mike Shapiro (Distinguished Engineer at Sun, and all-round kernel
guru) claims that the overhead of accounting is low. The kernel always collects the data;
you just pay the I/O overhead to write it to disk.
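A minimal Solaris sketch of acting on that advice (paths and the 10-second threshold are illustrative; other systems use different accounting commands):
% /usr/lib/acct/turnacct on          # as root: start writing pacct records
% acctcom -a -C 10                   # later: commands that used more than 10 CPU seconds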
Version 3.10 LISA 2006 27
1994-2006 Hal Stern, Marc Staveley
Output Interpretation: vmstat
% vmstat 5
procs memory page disk faults cpu
r b w free re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
1 0 0 1788 0 1 36 0 0 16 0 16 0 0 0 42 105 297 45 14 41
3 0 0 2000 0 1 60 0 0 0 0 20 0 0 0 83 197 226 38 45 18
procs - running, blocked, swapped
fre - free memory (not process, kernel, cache)
re - reclaims, page freed but referenced
at - attaches, page already in use (ie, shared library)
pi/po - page in/out rates
fr - page free rate
sr - paging scan rate
Always, always drop the first line of output from system tools like vmstat. It
reflects totals/averages since the system was booted, and isn't really meaningful
data (certainly not for debugging).
You'll see the fre column start high - close to the total memory in the system -
and then sink to about 5% of the total memory over time, in systems like Solaris
(<= 2.6), Irix and other V.4 variants. This is due to file and process page caching,
and is perfectly normal.
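A small watchdog sketch in the same spirit: print only the intervals in which the scanner is running. The field number assumes the column layout shown above (sr in the 11th column); newer Solaris releases add a swap column and shift it one to the right.
% vmstat 5 | awk 'NR > 3 && $11 > 0 { print "scanning:", $0 }'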
Version 3.10 LISA 2006 28
1994-2006 Hal Stern, Marc Staveley
Interpretation, Part 2
% vmstat 5
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
1 0 0 1788 0 1 36 0 0 16 0 16 0 0 0 42 105 297 45 14 41
3 0 0 2000 0 1 60 0 0 0 0 20 0 0 0 83 197 226 38 45 18
disk - disk operations/sec, use iostat -D
in - interrupts/sec, use vmstat -i
sy - system calls/sec
cs - context switches/sec
us - % CPU in user mode
sy - % CPU in system mode
id - % CPU idle time
swap (Solaris) - amount of available swap space
mf (Solaris) - minor fault, did not require page in (zero fill
on demand, copy on write, segmentation or bus errors)
Zero fill on demand (ZFOD) pages are paged in from /dev/zero, and produce
(as you would expect) a page filled with zeros, quite useful for the initialized
data segment of a process
Version 3.10 LISA 2006 29
1994-2006 Hal Stern, Marc Staveley
Example #1
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
2 0 0 1788 0 1 36 0 0 0 0 6 0 0 0 42 45 297 97 2 1
3 0 0 2000 0 1 60 0 0 0 0 2 0 0 0 83 97 226 94 4 2
High user time, little/no idle time
Some page-in activity due to filesystem reads
Application is CPU bound
Version 3.10 LISA 2006 30
1994-2006 Hal Stern, Marc Staveley
Example #2
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
3 11 0 1788 0 0 34 0 0 0 0 24 10 0 0 34 272 310 25 58 17
3 10 0 2000 0 0 30 0 0 0 0 14 12 0 0 35 312 340 26 55 19
Heavy disk activity resulting from system calls
Heavy system CPU utilization, but still some idle
time
Database or web server with badly tuned disks
Lower system call rate implies NFS server, same
problems
System calls can "cause" interrupts (when I/O operations complete), network
activity, and disk activity. A high volume of network inputs (such as NFS traffic
or http requests) can cause the same effects, so it's important to dig down
another level to find the source of the load.
Version 3.10 LISA 2006 31
1994-2006 Hal Stern, Marc Staveley
Example #3
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
3 0 0 1788 0 0 4 0 0 0 0 1 0 0 0 534 10 25 15 80 5
2 0 0 2000 0 0 3 0 0 0 0 1 0 0 0 515 12 30 15 83 2
High interrupt rate without disk or system call
activity
Implies network, serial port or PIO device
generating load
Host acting as router, unsecured tty port or a
nasty token ring card
Version 3.10 LISA 2006 32
1994-2006 Hal Stern, Marc Staveley
Example #4
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
3 3 0 1788 0 12 54 30 60 0 100 53 0 0 0 64 110 105 15 10 75
2 4 0 2000 0 10 43 28 58 0 110 41 0 0 0 60 112 130 12 10 78
Page-in/page-out and free rates indicate VM
system is busy
High idle time from waiting on disk
Paging/swapping to root disk (primary swap
area)
Machine is memory starved
Version 3.10 LISA 2006 33
1994-2006 Hal Stern, Marc Staveley
Server Tuning
A single machine
(works for desktops too)
Section 2
Version 3.10 LISA 2006 34
1994-2006 Hal Stern, Marc Staveley
Topics
CPU utilization
Memory consumption & paging space
Disk I/O
Filesystem optimizations
Backups & redundancy
Version 3.10 LISA 2006 35
1994-2006 Hal Stern, Marc Staveley
Tuning Roadmap
Eliminate or identify CPU shortfall
Reduce paging and fix memory problems
Balance disk load
Volume management
Filesystem tuning
Backups & integrity planning
Do steps in this order.
Version 3.10 LISA 2006 36
1994-2006 Hal Stern, Marc Staveley
CPU Utilization
Section 2.1
Version 3.10 LISA 2006 37
1994-2006 Hal Stern, Marc Staveley
Where Do The Cycles Go?
> 90% user time
Tune application code, parallelize
> 30% system time
User-level processes: system programming
Kernel-level work consumes system time
NFS, DBMS calls, httpd calls, IP routing/filtering
NIS, DNS (named), httpd are user-level
High system-level CPU without corresponding user-
level CPU is unusual in these configurations
Perhaps the best tool for quickly identifying CPU consumers is top.
top is a screen-oriented version of ps that runs on every Unix variant known.
A high system CPU % on an NIS or DNS server could indicate that the machine
is also acting as a router, or handling other network traffic.
Version 3.10 LISA 2006 38
1994-2006 Hal Stern, Marc Staveley
Idle Time
> 10% idle
I/O bound, tune disks
Input bound, tune network
%wait, %wio are for disk I/O only
Network I/O shows up as idle time
RPC, NIS, NFS are not I/O waits
One possibility for high idle time is that the system is really doing nothing. This
is fine if you aren't running any jobs, but if you are expecting input and aren't
getting it, it's time to look away from the client/server and at the network. The
client trying to send on the network will show a variety of network contention &
latency problems, but the server will appear to be idle.
Version 3.10 LISA 2006 39
1994-2006 Hal Stern, Marc Staveley
Multiprocessor Systems
vmstat, sar show averages
Example: 25% user time on 4-way host
4 CPUs at 25% each
2 CPUs at 50% each, 2 idle
1 CPU at 100%, 3 idle
Apply rules on per-CPU basis
System-specific tools for breakdown
mpstat, psrinfo (Solaris 2.x)
Version 3.10 LISA 2006 40
1994-2006 Hal Stern, Marc Staveley
A Puzzle
Server system with framebuffer behaves well
(mostly)
Periodically experiences major slowdown
File service slows to crawl
User and system CPU total near 100%
Can never find problem on console; problem
disappears when monitoring begins
Version 3.10 LISA 2006 41
1994-2006 Hal Stern, Marc Staveley
Controlling CPU Utilization
Process "pinning"
Maintain CPU cache warmth
Cut down on MP bus/backplane traffic
Unclear effects for multi-threaded processes
Resource segregation
Scheduler tables
Process serialization
Memory allocation
E10K domains
OS may do a better job than you do!
"pinning" in Solaris may be done with the "psr" commands: psrset, psrinfo.
Version 3.10 LISA 2006 42
1994-2006 Hal Stern, Marc Staveley
Process Serialization
Multiple user processes: good MP fit
Memory, disk must be sufficient
Resource problems
# jobs > # CPUs
sum(memory) > available memory
Cache thrashing (VM or CPU)
Resource management to the rescue
The key win of using a batch scheduler is that it controls usage of memory and
disk resources as well. Even if you're not CPU bound, a job scheduler can
eliminate contention for memory (discussed later) by controlling the total
memory footprint of jobs that are runnable at any one time. When you're short
on memory, 2 x 1/2 isn't 1; it's more like 0.5
Version 3.10 LISA 2006 43
1994-2006 Hal Stern, Marc Staveley
Resource Management
Job Scheduler: serialization
Batch queue system
Line up jobs for CPUs like bank tellers
Manage total memory footprint
Batch Scheduler: prioritization
Modifies scheduler to only let some jobs run when
system is idle
Fair Share Scheduler: parallelization
Gives groups of processes "shares" of memory and
CPU
Your goal with the job scheduler is to reduce the average wait time for a job. If
the typical time to complete is 5 minutes for a job when, say, 5 jobs run in
parallel, then you should try getting the average completion time down into the
1 1/2 to 3 minute range by freeing up resources for each job to run as fast as
possible. Even though the jobs run serially, the expected time to completion is
lower when each job finishes more quickly.
A batch scheduler for Solaris is available from Sun PS's Customer Engineering
group
An example of one produced using the System V dispatch table is described in
SunWorld, July 1993
Version 3.10 LISA 2006 44
1994-2006 Hal Stern, Marc Staveley
Context Switches
What is a context switch? (cs or ctx)
New runnable thread (kernel or user) gets CPU
Rates vary on MP systems
Causes
Single running process yields to scheduler
Interrupt makes another process runnable
Process waits on I/O or event (signal)
A symptom, not a problem
With high interrupt rates: I/O activity
With high system call rates: bad coding
Version 3.10 LISA 2006 45
1994-2006 Hal Stern, Marc Staveley
Traps and System Calls
What is a trap?
User process requests operating system help
Causes
System call, page fault (common)
Floating Point Exception
Unimplemented instructions
Real memory errors
Less common traps are cause for alarm
Wrong version of an executable
Hardware troubles
Version mismatches:
SPARC V7 has no integer multiply/divide
SPARC V8 has imul/idiv, and optimized code uses it. When run on a SPARC
V7 machine, each imul generates an unimplemented instruction trap, which the
kernel handles through simulation, using the same user-level code the compiler
would have inserted for a V7 chip.
Symptoms of this problem: very high trap rate (on the order of thousands per
second, or about one per arithmetic operation) but no system calls. Normally, a
high trap rate is coupled with a high system call rate -- the system calls generate
traps to get the kernel's attention.
Version 3.10 LISA 2006 46
1994-2006 Hal Stern, Marc Staveley
Memory Consumption
& Paging (Swap) Space
Section 2.2
Version 3.10 LISA 2006 47
1994-2006 Hal Stern, Marc Staveley
Page Lifecycle
Page creation: at boot time
Page fills
From file: executable, mmap()
From process: exec()
Zero Fill On Demand (zfod): /dev/zero
Page uses
Kernel and its data structures
Process text, data, heap, stack
File cache
Pages backed by filesystem or paging (swap)
space
/dev/zero is the "backing store" for zero-filled pages. It produces an endless
stream of zeros -- you can map it, read it, or cat it, and you get zeros.
/dev/null is a bottomless sink: you write to it and the data disappears.
Reading from /dev/null produces an immediate end of file, not pages of
zeros.
Version 3.10 LISA 2006 48
1994-2006 Hal Stern, Marc Staveley
Filesystem Cache
System V.4 uses main memory
Systems run with little free memory
Available frames used for files
Side effects
Some page freeing normal
All filesystem I/O is page in/out
Solaris (>= 8)
Uses the cyclic page cache for filesystem pages
filesystem cache lives on the free list
Version 3.10 LISA 2006 49
1994-2006 Hal Stern, Marc Staveley
Paging (Swap) Space & VM
BSD world
Total VM = swap plus shared text segments
Must have swap at least as large as memory
Can run out of swap before memory
Solaris world
Total VM = swap + physical memory - N pages
Can run swap-less
Swap = physical memory "overage"
Running out of swap
EAGAIN, no more memory, core dumps
If you run swapless, you cannot catch a core dump after a reboot (since there is
no space for the core dump to be written).
Version 3.10 LISA 2006 50
1994-2006 Hal Stern, Marc Staveley
Estimating VM Requirements
Look at output of ps command
RSS: resident set size, how much is in memory
Total RSS field for lower bound
SZ: stack and data, backed by swap space
Total SZ field for upper bound, good first cut
Memory leaks
Processes grow
SZ increases
Examine use of malloc()/free()
Will exhaust paging space
may hang system
Under Solaris, you can use the memtool package to estimate VM requirements
(http://playground.sun.com/pub/memtool).
Memory leaks are covered in more detail in Section 5, as an application problem.
Your first indication that you have an issue is when you notice VM problems,
which should point back to an application problem, so we mention it here first.
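A rough way to total those bounds from ps output (a sketch; it assumes a ps that accepts -o rss,vsz and reports both in kilobytes, with vsz standing in for the SZ column):
% ps -eo rss,vsz | awk 'NR > 1 { r += $1; v += $2 } END { printf "RSS total %d KB (lower bound), VSZ total %d KB (upper bound)\n", r, v }'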
Version 3.10 LISA 2006 51
1994-2006 Hal Stern, Marc Staveley
Paging Principles
Reclaim pages when memory runs low
Start running pagedaemon (pid 2)
Crisis avoidance
Guiding principle of VM system
Page small groups on demand
Keep small pool free
Swap to make large pools available
Compare 200M swapped out in one step to 64M
paged out in 16,000 steps
The hands of the "clock" sweep through the page structures at the same rate,
at a fixed distance apart (handspreadpages).
If the fronthand encounters a page whose reference bit is on, it turns the bit
off. When the backhand looks at the page later, it checks the bit. If the bit is
still off, nothing referenced this page since the fronthand looked at it. The
page may move onto the page freelist (or be written to swap).
The rate at which the hands sweep through the page structures varies
linearly with the amount of free memory. If the amount of free memory is
lotsfree, the hands move at a minimum scan rate, slowscan. As the
amount of free memory approaches 0, the scan rate approaches fastscan.
Handspreadpages determines the amount of time an application has to
touch a page before it will be stolen for the free list.
Version 3.10 LISA 2006 52
1994-2006 Hal Stern, Marc Staveley
VM Pseudo-LRU Analysis
pagedaemon runs every 1/4 second
Runs for 10 msec to "sweep a bit"
Clock algorithm
Pages arranged in logical circle
[Diagram: the clock algorithm's fronthand and backhand, separated by handspreadpages, sweeping around the pages arranged in a logical circle.]
Version 3.10 LISA 2006 53
1994-2006 Hal Stern, Marc Staveley
VM Thresholds (Solaris > 2.6)
lotsfree: defaults to 1/64 of memory
Point at which paging starts
Up to 30% of memory (not enforced)
desfree: panic button for swapper
Defaults to lotsfree / 2
minfree: unconditional swapping
Low water mark for free memory
Defaults to desfree / 2
Version 3.10 LISA 2006 54
1994-2006 Hal Stern, Marc Staveley
VM Thresholds (cont.)
cachefree
Solaris 2.6 (patch 105181-10) and Solaris 7
Not Solaris >= 8
lotsfree * 2
page scanner looks for unused pages that are not
claimed by executables (file system buffer cache
pages)
cachefree > lotsfree > desfree > minfree
Strict ordering
lotsfree-desfree gap should be big enough for a
typical process creation or malloc(3) request.
If priority_paging=1 is set in /etc/system then cachefree is set to twice
lotsfree (otherwise cachefree == lotsfree), and slowscan moves to
cachefree (see next slide).
10% to 300% Desktop performance increase
Not clear if it is any good for servers, depends on the type. Typically not good
for file servers.
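For reference, turning it on is a one-line /etc/system change on those releases (and should be left out entirely on Solaris 8 and later, where the cyclic page cache replaces it):
set priority_paging=1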
Version 3.10 LISA 2006 55
1994-2006 Hal Stern, Marc Staveley
VM Thresholds in action
[Graph: page scan rate vs. free memory, with thresholds marked at minfree (4MB), desfree (8MB), lotsfree (16MB) and cachefree (32MB): the scanner starts at slowscan (100 pages/sec) once free memory drops to cachefree/lotsfree and ramps up linearly toward fastscan (8192 pages/sec) as free memory approaches zero.]
minfree is needed to allow "emergency" allocation of kernel data structures such
as socket descriptors, stacks for new threads, or new memory/VM system
structures. If you dip below minfree, you may find you can't open up new
sockets (and you'll see EAGAIN errors at user level).
The speed at which you crash through lotsfree toward minfree is driven by the
demand for memory. The faster you consume memory, the more headroom you
need above minfree to allow the system to absorb the new demand.
Solaris >= 2.6
fastscan = min( mem, 64 MB)
slowscan = min( 1/20 mem, 100 pages)
handspreadpages = fastscan
Therefore all of memory is scanned in 2 (20) seconds at fastscan (slowscan) and
an application has 1 (10) seconds to reference a page before it will be put on the
free list [for a 128MB machine, like they still exist...]
Version 3.10 LISA 2006 56
1994-2006 Hal Stern, Marc Staveley
Sweep Times
Time required to scan all of memory
physmem/fastscan lower bound
physmem/slowscan upper bound
Shortest window for pages to be touched
handspreadpages/fastscan
Application-dependent tuning
Increase handspread, especially on large memory
machines
Match LRU window (coarsely) to transaction
duration
As an example of an upper bound on the scanning time: consider slowscan at
100 pages/second, and a 640M machine with a 4K pagesize. That's 160K pages,
meaning a full memory scan will take 1600 seconds. Crank up the value of
fastscan to reduce the round-trip scanning time if required
The output of vmstat -S shows you how many "revolutions" the clock hands
have made. If you find the system spinning the clock hands you may be
working too hard to free too few pages of memory.
Some tuning may help for large, scientific applications that have peculiar or
very well-understood memory traffic patterns. Sequential access, for example,
benefits from faster "free behind"
Servers (systems doing lots of filesystem I/O) should set fastscan large (131072
[8KB] pages = 1GB/second)
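A hedged /etc/system fragment along the lines of that last note (values are illustrative and should be matched to memory size and workload):
set fastscan=131072          # scan up to 1 GB/second of 8 KB pages
set handspreadpages=131072   # widen the LRU window to match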
Version 3.10 LISA 2006 57
1994-2006 Hal Stern, Marc Staveley
Activity Symptoms
Scan rate (sr), free rate (fr)
Progress made by pagedaemon
Pageouts (po)
Page kicked out of memory pool, file write
Pagein (pi)
Page fault, filled from text/swap, file read
Reclaim (re)
Waiting to go to disk, brought back
Attach (at)
Found page already in cache (shared libraries)
If you see the scan rate (sr) and the free rate (fr) about equal, this means the
virtual memory system is releasing pages as fast as it's scanning them. Most
probably, the least-recently used algorithm has degenerated into "last scanned",
meaning that tuning the handspread or the scan rates may improve the page
selection process.
Version 3.10 LISA 2006 58
1994-2006 Hal Stern, Marc Staveley
VM Problems
Basic indicator: scanning and freeing
Page in/out could be filesystem activity
Swapping
Large memory processes thrashing?
Attaches/reclaims
open/read/close loops on same file
Kernel memory exhaustion
sar -k 1 to observe
lotsfree too close to minfree
Will drop packets or cause malloc() failures
Chris Drake and Kimberley Brown's "Panic!" is a great reference, including a
host of kernel monitoring and sampling scripts.
Version 3.10 LISA 2006 59
1994-2006 Hal Stern, Marc Staveley
Other Tunables
maxpgio
# swap disks * 40 (Solaris <= 2.6)
# swap disks * 60 (Solaris == 9)
# swap disks * 90 (Solaris >= 10)
maxslp
Solaris < 2.6
Deadwood timer: 20 seconds
Set to 0xff to disable pre-emptive swapping
Solaris >= 2.6
swap out processes sleeping for more than maxslp
seconds (20) if avefree < desfree
Tuning these values produces the best returns for your effort.
maxpgio (assumes one operation per revolution * 2/3)
# swap disks * 40 for 3,600 RPM disks
# swap disks * 80 for 7,200 RPM disks
# swap disks * 110 for 10,000 RPM disks
# swap disks * 167 for 15,000 RPM disks
[ 2/3 of the revolutions/second]
maxslp added meaning between Solaris 2.5.1 and 2.6, it is also used as the
amount of time that a process must be swapped out before being considered a
candidate to be swapped back in, in low memory conditions.
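As a sketch, for a machine with two 7,200 RPM swap disks the rule of thumb above gives maxpgio = 2 * 80, set in /etc/system:
set maxpgio=160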
Version 3.10 LISA 2006 60
1994-2006 Hal Stern, Marc Staveley
VM Diagnostics
Add memory for fileservers
Improve file cache hit rate
Calculate expected/physical I/O rates
Add memory for DBMS servers
Configure DBMS to use it in cache
Watch DBMS statistics for use/thrashing
100-300M is typical high water mark
Add memory to eliminate scanning
Version 3.10 LISA 2006 61
1994-2006 Hal Stern, Marc Staveley
Memory Mapped Files
mmap() maps open file into address space
Replaces open(), malloc(), read() cycles
Improves memory profile for read-only data
Used for text segments and shared data segments
Mapped files pages to underlying filesystem
Text segments paged from NFS server?
Data files paged over network from server?
When network performance matters...
Use shared memory segments, paged locally
NFS-mounted executables produce sometimes unwanted effects due to the way
mmap() works over the network. When you start a Unix process (in SunOS 4.x,
or any SystemV.4/Solaris system), the executable is mapped into memory using
mmap() -- not copied into memory as in earlier BSD days. Once the executable
pages are loaded, you won't notice much difference, but if you free the pages
containing the text segment (due to paging/swapping), you're going to re-read
the data over the wire, not from the local swap device.
Version 3.10 LISA 2006 62
1994-2006 Hal Stern, Marc Staveley
New VM System (Solaris >= 8)
Page scanner is a bottleneck for the future
new hardware supports > 512GB
64M+ 8KB pages to scan!
File system pressure on the VM
high filesystem load depletes free memory list
resulting high scan rates makes applications suffer
from excessive page steals
A server with heavy I/O pages against itself!
Priority paging (new scanner) is not enough
Cyclic Page Cache is the current answer
separate pool for regular file pages
fs flush daemon becomes fs cache daemon
Version 3.10 LISA 2006 63
1994-2006 Hal Stern, Marc Staveley
Disk I/O
Section 2.3
Version 3.10 LISA 2006 64
1994-2006 Hal Stern, Marc Staveley
Disk Activity
Paging and swapping
Memory shortfalls
Database requests
Lookups, log writes, index operations
Fileserver activity
Read, read-ahead, write requests
Version 3.10 LISA 2006 65
1994-2006 Hal Stern, Marc Staveley
Disk Problems
Unbalanced activity
"Hot spot" contention
Unnecessary activity
Hit disk instead of memory
Disks and networks are sources of greatest
gains in tuning
Version 3.10 LISA 2006 66
1994-2006 Hal Stern, Marc Staveley
Diagnosing Disk Problems
iostat -D: disk ops/second
% iostat -D 5
sd0            sd1
rps wps util   rps wps util
  5   0 22.0    40   0 90.0
Look for excessive number of ops/disk
Unbalanced across disks?
iostat -x: service time (svc_t)
Long service times (>100 msec) imply queues
Similar to disk imbalance
Could be disk overload (20-40 msec)
The typical seek/rotational delays on a disk are 8-15 msec. A typical transfer
takes about 20 msec. If the disk service times are consistently around 20 msec,
the disk is almost always busy. When the service times go over 20 msec, it
means that requests are piling up on the spindle: an average service time of 50
msec means that the queue is about 2.5 requests (50/20) long.
Note that for low I/O volumes, the service times are likely to be inaccurate and
on the high side. Use the service times as a thermometer for disk activity when
you're seeing a steady 10 I/O operations (iops) per second or more.
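A hedged one-liner for spotting those queues: print only devices doing real work (10 or more ops/sec) whose service time exceeds 100 msec. The field numbers assume the Solaris iostat -x layout (r/s and w/s in columns 2-3, svc_t in column 8).
% iostat -x 5 | awk '$2 + $3 >= 10 && $8 > 100 { print }'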
Version 3.10 LISA 2006 67
1994-2006 Hal Stern, Marc Staveley
Disk Basics
Physical things
Disk performance
sequential transfer rate
5 - 40 MBytes/s
Theoretical max: nsect * 512 * rpm / 60
50-100 operations/s random access
6-12 msec seek, 3-6 msec rotational delay
Track-to-track versus long seeks
Seek/rotational delays
Access inefficiencies
While nsect * 512 * rpm tells you how fast the spinning disk platter can deliver
data, it's not completely accurate for the zone-bit recorded (ZBR) disks that are
common today. ZBR SCSI disks only fudge the nsect value in the disk
description, providing an average number of sectors per cylinder. In reality, the
first 70% of the disk is quite fast and the last 30% has a lower transfer rate.
Version 3.10 LISA 2006 68
1994-2006 Hal Stern, Marc Staveley
SCSI Bus Basics
SCSI 1 (5MHz clock rate)
8, 16-bit (wide), or 32-bit (fat)
Synchronous operation yields 5 Mbyte/sec
SCSI 2 - Fast (10MHz clock rate)
10 Mbytes/s with 8-bit bus
20 Mbytes/s with 16-bit (wide) bus
Ultra (20MHz clock rate)
Ultra/wide = 40MB/sec
Ultra 2 (40MHz clock rate)
Ultra 3 (80MHz clock rate)
If devices from different standards exist on the same SCSI bus then the clock rate
of all devices is the clock rate of the slowest device.
Ultra 3 is sometimes called Ultra 160.
Version 3.10 LISA 2006 69
1994-2006 Hal Stern, Marc Staveley
SCSI Cabling Basics
Single Ended
6m for SCSI 1
3m for SCSI 2
3m for Ultra up to 4 devices
1.5m for Ultra > 4 devices
Differential
25m cabling
Low Voltage Differential (LVD)
12m cabling
used by Ultra 2 and 3
Differential signaling is used to suppress noise over long distances. If you ask a
friend to signal you with a lantern, it's easy to distinguish high (1) from low (0).
If the friend is now standing on a boat, which introduces noise (waves), it's
much harder to differentiate high and low. Instead, give your friend two
lanterns, and define "high" as "lanterns apart" and "low" as "lanterns together".
The noise affects both lanterns, but measuring the difference between them edits
the noise from the resulting signal.
If Single Ended and LVD exist on the same bus then the cabling lengths are the
minimum of the two.
Version 3.10 LISA 2006 70
1994-2006 Hal Stern, Marc Staveley
Fibre Channel and iSCSI
Industry standard at the frame level
FC-AL: fiber channel arbitrated loop
100 Mbytes/sec typical
Use switches and daisy chains to build storage
networks
Vendors layer SCSI protocol on top
SCSI disk physics still apply
But you can pack a lot of disks on the fiber
Ditto iSCSI over GigE
Version 3.10 LISA 2006 71
1994-2006 Hal Stern, Marc Staveley
The I/O Bottleneck
When can't a 72GB disk hold a 500MB DB?
When you need more than 100 I/Os per second
How do you get > 40MByte/s file access?
Gang disks together to "add" transfer rates
Key info nugget #1: Access pattern
Sequential or random, read-only or read-write
Key info nugget #2: Access size
2K-8K DBMS, but varies widely
8K NFS v2, 32K NFS v3
4K-64K filesystem
Realize that when you're bound by random I/O rates, you're not moving that
much data -- the bottleneck is the physical disk arm moving across the disk
surface.
At 100 I/O operations/sec, and 8 KBytes/operations, a SCSI disk moves only
800 KBytes/sec at maximum random I/O load.
The same disk will source 40 MBytes/sec in sequential access mode, where the
disk speed and interface are the limiting factors.
Version 3.10 LISA 2006 72
1994-2006 Hal Stern, Marc Staveley
Disk Striping
Combine multiple disks into single logical disk
with new properties
Better transfer rate
Better average seek time
Large capacity
Terminology
Block size: chunk of data on each disk in stripe
Interleave: number of disks in stripe
Stripe size: block size * interleave
Version 3.10 LISA 2006 73
1994-2006 Hal Stern, Marc Staveley
Volume Management
Striping done at physical (raw) level
Run raw access processes on stripe (DBMS)
Can build filesystem on volume, after striping
Host (SW) or disk array (HW) solutions
Some DBMSs do striping internally
Bottleneck: multiple writes
Stripe over multiple controllers, SCSI busses
Version 3.10 LISA 2006 74
1994-2006 Hal Stern, Marc Staveley
Striping For Sequential I/O
Each request hits all disks in parallel
Add transfer rates to "lock heads"
Block size = access size/interleave
Examples:
64K filesystem access, 4 disks, 16K/disk
8K filesystem access, 8 disks, 1K/disk
Can get 3-3.5x single disk
On a 4-6 way stripe
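A hedged Solaris Volume Manager example of the first case above (a 4-disk stripe with a 16 KB per-disk interleave for 64 KB accesses; the metadevice and slice names are illustrative):
% metainit d10 1 4 c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0 -i 16k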
Version 3.10 LISA 2006 75
1994-2006 Hal Stern, Marc Staveley
Striping For Random I/O
Each request should hit a different disk
Random requests use all disks
Force scattering of I/O
Reduce average seek time with "independent
heads"
Block size = access size
Examples:
8K NFS access on 6 disks, 48K stripe size
2K DBMS access on 4 disks, 8K stripe size
Version 3.10 LISA 2006 76
1994-2006 Hal Stern, Marc Staveley
Transaction Modeling
Types: read, write, modify, insert
Meta data structure impact
Filesystem structures: inodes, cylinder groups,
indirect blocks
Logs and indexes for DBMS
Insert operation is R-M-W on index, W on data, W
on log
Insert/update on DBMS touches data, index, log
Version 3.10 LISA 2006 77
1994-2006 Hal Stern, Marc Staveley
Cache Effects
Not every logical write I/O hits disk
DB write clustering
NFS, UFS dirty page clustering
Hardware arrays may cache operations
Reads can be cached
DB page/block cache (Oracle SGA, e.g.)
File/data caching in memory
Locality of reference
Cache can help or hurt performance
Version 3.10 LISA 2006 78
1994-2006 Hal Stern, Marc Staveley
Simple DBMS Example
Medium sized database on a busy day
200 users, 8 Gbyte database, 1 request/10 sec
50% updates, 20% inserts, 30% lookups, 4 tables, 1
index on each
Disk I/O rate calculation
.5 * 4/U + .2 * 3/I + .3 * 2/L = 3.2 I/O per table
12.8 I/O per transaction, ~10 with caching?
Arrival rate
200 users * 1 / 10 secs = 20/sec
Demand: 200 I/Os/sec, peak to 220 or more
The sample disk I/O rates are derived as follows:
Updates have to do a read, an update to an index and an update to a data block,
as well as a log write (4 transactions)
Inserts do an index and data block write, and a log write (3 transactions)
Lookups read from the index and data blocks (2 transactions)
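The same arithmetic written out (a sketch using awk; the ~200 I/Os/sec on the slide assumes caching trims the 12.8 raw I/Os per transaction down to roughly 10):
% awk 'BEGIN {
    per_table = .5*4 + .2*3 + .3*2      # update/insert/lookup mix
    per_txn   = per_table * 4           # four tables
    txn_rate  = 200 / 10                # 200 users, 1 request per 10 seconds
    printf "%.1f I/O per txn, %.0f txn/s, %.0f I/O/s before caching\n", per_txn, txn_rate, per_txn * txn_rate
  }'
12.8 I/O per txn, 20 txn/s, 256 I/O/s before caching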
Version 3.10 LISA 2006 79
1994-2006 Hal Stern, Marc Staveley
Haste Needs Waste
Using a single disk is a disaster
Disk can only do 50-60 op/s, response time 10/s
4 disks barely do the job
Provides 200-240 I/Os/sec
DBMS uses 90% of I/O rate capacity
6 disks would be better
Waste most of the available space
Version 3.10 LISA 2006 80
1994-2006 Hal Stern, Marc Staveley
Filesystem Optimization
Section 2.4
Version 3.10 LISA 2006 81
1994-2006 Hal Stern, Marc Staveley
UNIX Filesystem
Filesystem construction
Each file identified by inode
Inode holds permissions, modification/access times
Points to 12 direct (data) blocks and indirect blocks
Indirect block contains block pointers to data
blocks
Double indirect blocks contain pointers to
blocks that contain pointers to data blocks
Version 3.10 LISA 2006 82
1994-2006 Hal Stern, Marc Staveley
UFS Inode
[Diagram: a UFS inode holds the mode, times, owners, etc., plus 12 direct data-block pointers, one indirect pointer to a block of 2048 slots (each pointing to a data block), and one double-indirect pointer to a block of 2048 slots, each of which points to another 2048-slot block of data-block pointers.]
- Direct blocks up to 100 KBytes
- Indirect blocks up to 100 MBytes
- Double indirect blocks up to 1 TByte
Version 3.10 LISA 2006 83
1994-2006 Hal Stern, Marc Staveley
Filesystem Mechanics
Inodes are small and of fixed size
Fast access, easy to find
File writes flushed every 30 seconds
Sync or update daemon
UNIX writes are asynchronous to process
Watch for large bursts of disk activity
Filesystem metadata
Create redundancy for repair after crash
Cylinder groups, free lists, inode pointers
fsck: scan every block for "rollback"
The fact that write() doesn't complete synchronously can cause bizarre failures.
Most code doesn't check the value of errno after a close(), but it should. Any
outstanding writes are completed synchronously when close() is called.
If any errors occurred during those writes, the error is reported back through
close(). This can cause a variety of problems when quotas are exceeded or disks
fill up (over NFS, where the server notices the disk full condition).
More details: SunWorld Online, System Administration, October 1995
http://www.sun.com/sunworldonline
Version 3.10 LISA 2006 84
1994-2006 Hal Stern, Marc Staveley
The BSD Fast Filesystem
Original UNIX filesystem
All inodes at the beginning of the disk
open() followed by read() always seeks
BSD FFS improvements
Cylinder groups keep inodes and data together
Block placement strategy minimizes rotational delays
Inode/cylinder group ratio governs file density
Minfree: default 10%, safe to use 1% on 1+G
disks
McKusick, Leffler, Quaterman and Karels, "Design and Implementation of the
4.3 BSD Operating System"
mkfs and newfs always look at the # bytes per inode parameter (fixed). To
change the inode density, you need to change the number of cylinders in a
group by adjusting the number of sectors/track.
Filesystems for large files (like CAD parts files) usually have more bytes per
inode; filesystems for newsgroups should have fewer bytes per inode (with the
exception of the filesystem for alt.binaries.*)
Version 3.10 LISA 2006 85
1994-2006 Hal Stern, Marc Staveley
Fragmentation & Seeks
Fragments occur in last block of file
Frequently less than 1% internal fragmentation
10% free space reduces external fragmentation
Block placement strategy breaks down
Avoid filling disk to > 90-95% of capacity
Introduces rotational delays
File ordering affects performance
Seeking across large disk for related files
Version 3.10 LISA 2006 86
1994-2006 Hal Stern, Marc Staveley
Large Files
Reading
Read inode, indirect block, double indirect block,
data block
Sequential access should do read-ahead
Writing
Update inode, (double) indirect, data blocks
Can be up to 4 read-modify-write operations
Large buffer sizes are more efficient
Single access for "window" of metadata
Version 3.10 LISA 2006 87
1994-2006 Hal Stern, Marc Staveley
Turbocharging Tricks
Striping
Journaling (logging)
Write meta data updates to log, like DBMS
Eliminate fsck, simply replay log
Ideal for synchronous writes, large files
logging option (Solaris 7)
Extents
Cluster blocks together and do read-ahead
Eliminate more rotational delays
Can add 2-3x performance improvement
McVoy and Kleiman, "Extent-like Performance From The UNIX Filesystem",
Winter USENIX Proceedings, 1991.
Linux also has the ext2/ext3 filesystems, which have their own block placement
policies.
Journaling and logging are often used interchangeably. Logging filesystems and
log-based filesystems, however, are not the same thing. A logging filesystem
bolts a log device onto the UNIX filesystem to accelerate writes and recovery. A
log-based filesystem is a new (non-BSD FFS) structure, based on a log of write
records. There is a long and exacting description of the differences in Margo
Seltzer's PhD thesis from UC-Berkeley.
Version 3.10 LISA 2006 88
1994-2006 Hal Stern, Marc Staveley
Access Patterns
Watch actual processes at work
What are they doing?
nfswatch: NFS operations on the wire
truss (Solaris, SysV.4), strace (Linux, HPUX),
ktrace (*BSD)
dtrace (Solaris >= 10)
Application write size should match filesystem
block size.
Use a Filesystem benchmark
Are the disks well balanced, is the filesystem well
tuned?
filebench, bonnie
More details on using these tools: SunWorld Online, System Administration,
September 1995
http://www.sun.com/sunworldonline
Don't use process tracing for performance-sensitive issues, because turning on
system call trapping (used by the strace/truss facility) slows the process down
to a snail's pace.
Solaris Dtrace (Solaris >= 10) is more light weight.
Bonnie (http://www.textuality.com/bonnie) is a good all-round Unix
filesystem benchmark tool
Filebench extensible system to simulate many different types of workloads
http://sourceforge.net/projects/filebench/
http://www.opensolaris.org/os/community/performance/filebench/
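For example, to see the actual I/O sizes an already-running process issues (the PID is illustrative, and as noted above the traced process will slow down noticeably while truss is attached):
% truss -t read,write -p 12345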
Version 3.10 LISA 2006 89
1994-2006 Hal Stern, Marc Staveley
Resource Optimization
Optimize disk volumes by type of work
Sequential versus random access filesystems
Read-only versus read-write data
Eliminate synchronous writes
File locking or semaphores more efficient
Journaling filesystem faster
Watch use of symbolic links
Often causes disk read to get link target
Don't update the file access time for read-only
volumes
Don't update the file access time (for news and mail spools, etc.)
-o noatime
Delay updating file access time (Solaris >= 9)
-o dfratime
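A hedged /etc/vfstab fragment for a read-mostly news spool, combining noatime with the logging option from the previous slide (device names are illustrative):
/dev/dsk/c1t2d0s6  /dev/rdsk/c1t2d0s6  /var/spool/news  ufs  2  yes  noatime,logging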
Version 3.10 LISA 2006 90
1994-2006 Hal Stern, Marc Staveley
Non-Volatile Memory
Battery backed memory
RAM in disk array controller (array cache)
disk cache
Synchronous writes at memory write speed
Version 3.10 LISA 2006 91
1994-2006 Hal Stern, Marc Staveley
Inode Cache
Inode cache for metadata only
Data blocks cached in VM or buffer pool
Buffer pool for inode transit
vmstat -b
sar -b 5 10
Watch %rcache (read cache) hit rate
Lower rate means more disk I/O for inodes
Set high water mark
set bufhwm=8000 Solaris /etc/system
Version 3.10 LISA 2006 92
1994-2006 Hal Stern, Marc Staveley
Directory Name Lookup Cache
Name to inode mapping cache
Must be large for file/user server
Low hit rate causes disk I/O to read directories
vmstat -S to observe
Aim for > 90% hit rate
Causes of low hit rates:
File creation automatically misses
Names > {14,32} characters not inserted
Long names not efficient
Solaris >= 2.6
- uses the ncsize parameter to set the DNLC size.
- handles long filenames in the DNLC
Solaris >= 8
- can use the kstat -n dnlcstats command to determine how well the
DNLC is doing
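Two quick ways to check the hit rate (Solaris; the kstat form needs Solaris 8 or later, as noted above):
% vmstat -s | grep 'name lookups'          # e.g. "123456 total name lookups (cache hits 96%)"
% kstat -n dnlcstats | egrep 'hits|misses'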
Version 3.10 LISA 2006 93
1994-2006 Hal Stern, Marc Staveley
Filesystem Replication
Replicate popular read-only data
Automounter or "workgroups" to segregate access
Define update and distribution policies
200 coders chasing 4 class libraries
Replicate libraries to increase bandwidth
Hard to synchronize writeable data
Similar to DBMS 2-phase commit problem
Andrew filesystem (AFS)
Stratus/ISIS rNFS, Uniq UPFS from Veritas
The ISIS Reliable NFS product is now owned by Stratus Computer,
Marlborough MA
Uniq Consulting Corp has a similar product that does N-way mirroring of NFS
volumes. Contact Kevin Sheehan at kevin@uniq.com.au, or your local Veritas
sales rep, since Veritas is now reselling (and supporting) this product
Version 3.10 LISA 2006 94
1994-2006 Hal Stern, Marc Staveley
Depth vs. Breadth
Avoid large files if possible
Break large files into smaller chunks
Don't backup a 200M file for a 3-byte change
Files > 100M require multiple I/Os per operation
Directory search is linear
Avoid broad directories
Name lookup is per-component
Avoid deep directories
Use hash tables if required
Version 3.10 LISA 2006 95
1994-2006 Hal Stern, Marc Staveley
Tuning Flush Rates
Dirty buffers flushed every 30 seconds
Causes large disk I/O burst
May overload single disk
Balance load if requests < 30s apart
Generic update daemon
while :
do sync; sync; sleep 15; done
Solaris tunables
autoup: time to cover all memory
tune_t_fsflushr: rate to flush
autoup is the oldest a dirty buffer can get before it is flushed. tune_t_fsflushr is
the rate at which the sync daemon is run; it defaults to 30 seconds.
All of memory will be covered in autoup seconds.
flushrate/autoup is the fraction covered by each pass of the update daemon.
Increase autoup, or cut the flush rate, to space out the bursts
Extremely large disk service times (in excess of 100 msec) can be caused by large
bursts from the flush daemon causing a long disk queue. If the filesystem flush
sends 20 requests to a single disk, it's likely there will be some seeking between
writes, so the 20 requests will average 20 msec each to complete. Since all disk
requests are scheduled in a single pass by fsflush, the service time for the last
one will be nearly 400 msec, while the first few will finish in around 20 msec,
yielding an average service time of 200 msec!
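A hedged /etc/system sketch in that direction (illustrative values: cover all of memory every 240 seconds, waking fsflush every 10):
set autoup=240
set tune_t_fsflushr=10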
Version 3.10 LISA 2006 96
1994-2006 Hal Stern, Marc Staveley
Backups & Redundancy
Section 2.5
Version 3.10 LISA 2006 97
1994-2006 Hal Stern, Marc Staveley
Questions of Integrity
Backups are total loss insurance
Lose a disk
Lose a brain: egregious rm *
Disk integrity is inter-backup insurance
Preserve data from high-update environment
Time to restore backup is unacceptable
Doesn't help with intra-day deletes
Disaster recovery is a separate field
Version 3.10 LISA 2006 98
1994-2006 Hal Stern, Marc Staveley
Disk Redundancy
Snapshots
Copy data to another disk or machine
tar, dump, rdist, rsync
Prone to failure, network load problems
Disk mirroring (RAID 1)
Highest level of reliability and cost
Some small performance gains
RAID arrays (RAID 5 and others)
Cost/performance issues
VLDB byte capacity
RAID = Redundant Array of Inexpensive Disks.
When the RAID levels were created (at UC-Berkeley), the popular disk format
was SMD (as in Storage Modular Device, not Surface Mounted Device).
10" platters weighed nearly 100 pounds and held 500 MB, while SCSI disks
topped out at 70 MB but cost significantly less (and were easier to lift and install)
Version 3.10 LISA 2006 99
1994-2006 Hal Stern, Marc Staveley
RAID 1: Mirrored Disks
100% data redundancy
Safest, most reliable
Historically rejected due to disk count, cost
Best performance (of all RAID types)
Round-robin or geometric reads: like striping
Writes at 5-10% hit
Combine mirroring and striping
Stripe mirrors (1+0) to survive interleave failures
Mirror stripes (0+1) for safety with minimal overhead
RAID 0 = striping
Few systems can do 1+0
1+0 allows multi-disk failures as long as at least one mirror disk per stripe
survives.
Version 3.10 LISA 2006 101
1994-2006 Hal Stern, Marc Staveley
RAID 5: Parity Disks
Stripe parity and data over disks
No single "parity hot spot"
Performance degrades with more writes
R-M-W on parity disk cuts 60%
Similar to single-disk for reads
Ideal for large DSS/DW databases
If size >> performance, RAID 5 wins
Best path to large, safe disk farm
20-40% cost savings
Version 3.10 LISA 2006 102
1994-2006 Hal Stern, Marc Staveley
RAID 5 Tuning
Tunables
Array width (interleave) - sometimes
Block size - required
Count parity operations in I/O demand
Read = 1 I/O
Write = 4 I/O
Ensure parity data is not a bottleneck
Averaging parity disk reads and writes limited by
(total) 50-60 IOP/second limit
RAID 5 write:
- read original block
- read parity block
- xor original block with parity block
- xor new block with parity block
- write new block
- write parity block
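A back-of-envelope sketch of that accounting, assuming an illustrative 100 reads and 50 writes per second and roughly 55 random IOPS per spindle:
% awk 'BEGIN {
    reads = 100; writes = 50             # logical ops/sec (illustrative)
    phys  = reads * 1 + writes * 4       # RAID 5: 4 physical I/Os per logical write
    printf "%d physical IOPS -> at least %d disks\n", phys, int((phys + 54) / 55)
  }'
300 physical IOPS -> at least 6 disks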
Version 3.10 LISA 2006 103
1994-2006 Hal Stern, Marc Staveley
Backup Performance
Derive rough transfer rate needs
100 GB/hour = 30 MB/second
5MB/s for DLT, 10MB/s for SDLT
15MB/s for LTO, 35MB/s for LTO-2
60MB/s for LTO-3
6MB/s for AIT, 24MB/s for AIT-4
80MB/sec over quiet Ethernet (GigE)
Multiple devices increase transfer rate
Stackers grow volume
Drives increase bulk transfer rate
Careful of shoe shining
When designing the backup system, also take into consideration the length of
time you must keep the data around. Some industries, such as financial
services, require at least a 7 year history of data for SEC or arbitration hearings.
Drug companies and health-care firms must keep patient data near-line until the
patient dies; if a drug pre-dated a patient by 5 years then you're looking at the
better part of a century.
Media types in vogue today decay. Magnetic media loses its bits; CD-ROMs
may decay after a long storage period. How will you read your backups in the
future? If you've struggled with 1600bpi tapes lately you know the feeling of
having data in your hand that's not convertible into on-line form.
Final warning: dump isn't portable! If you change vendors, make sure you can
dump and reload your data.
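The arithmetic behind the first bullet on the slide above:
100 GB/hour = ~100,000 MB / 3,600 sec = ~28 MB/sec
of sustained throughput, rounded to ~30 MB/sec, i.e. several LTO-class drives or a dedicated GigE path.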
Version 3.10 LISA 2006 104
1994-2006 Hal Stern, Marc Staveley
Backup to Disk
Rdiff-backup
incremental backups to disk with easy restore
BackupPC
incremental and full backups to disk with a web front
end for scheduling and restores
good for backing up MS Windows clients to a Unix
server
Snapshots
Offsite replicas
Is this familiar:
- Secure the individual systems
- Run aggressive password checkers
- Restrict NFS, or use NFS with Kerberos or DCE/DFS to encrypt file access
- Prevent network logins in the clear (use ssh)
- BUT: do backups over the network! Exposing the data over the network
during the backup un-does much of the effort in the other precautions.
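A minimal rdiff-backup sketch (host and path names here are made up for illustration):
% rdiff-backup /home backuphost::/backup/home
% rdiff-backup --restore-as-of 3D backuphost::/backup/home/alice /tmp/alice-restore
The first command pushes an incremental backup over ssh; the second restores one directory as it
stood three days ago.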
Version 3.10 LISA 2006 105
1994-2006 Hal Stern, Marc Staveley
NFS Performance Tuning
Section 3
Version 3.10 LISA 2006 106
1994-2006 Hal Stern, Marc Staveley
Topics
NFS internals
Diagnosing server problems
Client improvements
Client-side caching & tuning
NFS over WANs
Version 3.10 LISA 2006 107
1994-2006 Hal Stern, Marc Staveley
NFS Internals
Section 3.1
Version 3.10 LISA 2006 108
1994-2006 Hal Stern, Marc Staveley
NFS Request Execution
[Diagram: a client stat() call (e.g. from an ls -l) becomes nfs_getattr() and a kernel RPC to the
server's nfsd (port 2049 hardcoded); on the server, the getattr() request is serviced by the local
filesystem's stat() routine (UFS or HSFS).]
Version 3.10 LISA 2006 109
1994-2006 Hal Stern, Marc Staveley
NFS Characterization
Small and large operations
Small: getattr, lookup, readdir
Large: read/write, create, readdir
Response time matters
Clamped at 50 msec for "reasonable" server
Users notice 20 msec to 50 msec dropoff
Scalability is still a concern
Usually network limited, hard to reach capacity
Flat response time is best measure
Client-side demand management
Version 3.10 LISA 2006 110
1994-2006 Hal Stern, Marc Staveley
NFS over TCP
NFS/TCP is a win for:
Wide-area networks, with higher bit error rates
Routed networks
Data-transfer oriented environments
Large MTU networks, like GigE with jumbo frames
Advantages
Better error recovery, without complete retransmit
Fewer retransmissions and duplicate requests
Disadvantage
Connection setup at mount time
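Forcing TCP (and version 3) explicitly on a Solaris client looks like this (server and paths are
illustrative):
# mount -F nfs -o proto=tcp,vers=3 server:/export /mnt
Most clients of this era negotiate TCP and v3 automatically when the server offers them.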
Version 3.10 LISA 2006 111
1994-2006 Hal Stern, Marc Staveley
NFS Version 3
Improved cache consistency
Attributes returned with most calls
"access" RPC mimics permission checking of local
system open() call
Improved correctness with NFS/TCP
Performance enhancements
Asynchronous write operations, with logging
Larger buffer sizes, up to 32KBytes
NFS v3 uses a 64-byte (not bit) file handle, with the actual size used per mount
negotiated between the client and server.
Version 3.10 LISA 2006 112
1994-2006 Hal Stern, Marc Staveley
Diagnosing Server Problems
Section 3.2
Version 3.10 LISA 2006 113
1994-2006 Hal Stern, Marc Staveley
Indicators
Usual server tuning applies
Don't worry about CPU utilization
Client response time is early warning system
Some NFS specific details
Server isn't always the limiting factor
Typical Ethernet supports 300-350 LADDIS ops
To get 2,000 LADDIS: 7-8 Ethernets
LADDIS stands for Legato, Auspex, Digital, Data General, Interphase and Sun,
the 6 companies that helped produce the SPEC standard for NFS benchmarks.
LADDIS is now formally known as SPEC NFS and is reported as a number of
ops/sec, at 50 msec response time or less.
More info: spec-ncga@cup.portal.com
Keith, Bruce. LADDIS: The Next Generation in NFS Server Benchmarking. spec
newsletter. March 1993. Volume 5, Issue 1.
Watson, Andy, et al. LADDIS: A Multi-Vendor and Vendor-Neutral SPEC
NFS Benchmark. Proceedings of the LISA VI Conference, October 1992. pp. 17-
32.
Wittle, Mark, and Bruce Keith. LADDIS: The Next Generation in NFS File
Server Benchmarking. Proceedings of the Summer 1993 USENIX Conference.
July 1993. pp. 111-128.
Version 3.10 LISA 2006 114
1994-2006 Hal Stern, Marc Staveley
Request Queue Depth
nfsd daemons/threads
One request per nfsd daemon
Lack of nfsds makes server drop requests
May show up as UDP socket overflows (netstat -s)
Guidelines
Daemons: 24-32 per server, more for many disks
Kernel threads (Solaris): 500-2000
no penalty for being long
Add more for browsing environment
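For example, to check for drops and raise the count on Solaris (file locations vary by release;
the values are illustrative):
% netstat -s | grep udpInOverflows      (non-zero suggests dropped NFS/UDP requests)
Solaris 10: set NFSD_SERVERS=512 in /etc/default/nfs
Older releases: edit the nfsd invocation in /etc/init.d/nfs.server, e.g. /usr/lib/nfs/nfsd -a 512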
Version 3.10 LISA 2006 115
1994-2006 Hal Stern, Marc Staveley
Attribute Hammering
Use nfsstat -s to view server statistics
getattr > 40%
Increase client attribute cache lifetime
Consolidate read-only filesystems
readlink > 5%
Replace links with mount points
writes > 5%
NVRAM situation
% nfsstat -s
null getattr setattr root lookup readlink read
32 0% 527178 33% 9288 0% 0 0% 449726 28% 189466 12% 188665 15%
wrcache write create remove rename link symlink
0 0% 134797 8% 13799 0% 15826 1% 2725 0% 4388 0% 74 0%
mkdir rmdir readdir fsstat
1575 0% 1532 0% 23898 1% 242 0%
On an NFS V3 client, you'll see entries for cached writes, access calls, and other
extended RPC types.
Version 3.10 LISA 2006 116
1994-2006 Hal Stern, Marc Staveley
Transfer Oriented Environments
Ensure adequate CPU
1 CPU per 3-4 100BaseT Ethernets
1 CPU per 1.5 ATM networks at 155 Mb/s
1 CPU per 1000BaseT Ethernet (GigE)
Disk balancing is critical
Optimize for random I/O workload
Large memory may not help
What is working set/file lifecycle?
Version 3.10 LISA 2006 117
1994-2006 Hal Stern, Marc Staveley
Client Improvements
Section 3.3
Version 3.10 LISA 2006 118
1994-2006 Hal Stern, Marc Staveley
Client Tuning Overview
Eliminate end to end problems
Request timeouts are call to action
700 msec timeout versus 50 msec "pain" level
Reduce demand with improved caching
Adjust for line speed
Slower-than-Ethernet links
Uncontrollable congestion
Routers or multiple hops
Application tuning rules apply
Version 3.10 LISA 2006 119
1994-2006 Hal Stern, Marc Staveley
Client Retransmission (UDP only)
Unanswered RPC request is retransmitted
Repeated forever for hard mounts
Up to 5 times (retrans) for soft mounts
What can go wrong?
Severe congestion (storms)
Server dropping requests/packets
Network losing requests or sections of them
One lost packet kills entire request
Version 3.10 LISA 2006 120
1994-2006 Hal Stern, Marc Staveley
Measuring Client Performance
Client-side performance is what user sees
nfsstat -m OK for NFS over UDP
Shows average service time for lookup, read and
write requests
iostat -n with extended service times
NFS over TCP harder to measure
Stream-oriented, difficult to match requests and
replies
tcpdump or snoop to match XIDs in NFS header
Wireshark (formerly Ethereal) does this.
Version 3.10 LISA 2006 121
1994-2006 Hal Stern, Marc Staveley
Client Impatience (NFS over UDP only)
Use nfsstat -rc
timers > 0
Server slower than
expected
nfsstat -m: expected
response time
[Timeline: calls++ when the request is sent; after ~120 msec timers++; after ~700 msec retrans++;
after ~1400 msec retrans++ again.]
% nfsstat -rc
Client rpc:
calls badcalls retrans badxid timeout wait newcred timers
224978 487 64 263 549 0 0 696
% nfsstat -m
/home/thud from thud:/home/thud (Addr 192.151.245.13)
Lookups: srtt = 7 (17 ms), dev=4 (20ms), cur=2 (40ms)
Reads: srtt=14 (35 ms), dev=3 (15ms), cur=3 (60ms)
Note that the NFS backoff and retransmit scheme is not used for NFS over TCP,
since TCP's congestion control and restart algorithms properly fit the connection
oriented model of TCP traffic. The NFS mechanism is used for UDP mounts,
and the timers used for adjusting the buffer sizes and transmit intervals are
shown with nfsstat -m. On an NFS/TCP client, the timers will be zero.
badcalls > 0
Soft NFS mount failures
Operation interrupted (application failure)
Data loss or application failures
Should never see these
Version 3.10 LISA 2006 122
1994-2006 Hal Stern, Marc Staveley
Client's Network View (NFS over UDP only)
retrans > 5%
Requests not reaching server or not serviced
badxid close to 0
Network is dropping requests
Reduce rsize, wsize on mount
badxid > 0
Duplicate request cache isn't consolidating
retransmissions
Tune server, partition network
Using NFS/TCP or NFS Version 3, you'll be hard-pressed to see badxid counts
above zero. Using TCP, the NFS client doesn't have to retransmit the whole
request, only the part that was lost to the server. As a result, there should rarely
be completely retransmitted requests. NFS V3 implementations also tend to be
more "correct" than V2 implementations, since fewer requests that are not
actually idempotent (like rmdir or remove) are retransmitted.
Version 3.10 LISA 2006 123
1994-2006 Hal Stern, Marc Staveley
Client Caches
Caching critical for performance
If data exists, don't go over the wire
Dealing with stale data
Cached items
Data pages in memory: default
Data pages on disk: eNFS, CacheFS
File attributes: in memory
Directory attributes: in memory
DNLC: local name lookup cache
Version 3.10 LISA 2006 124
1994-2006 Hal Stern, Marc Staveley
Attribute Caching
getattr requests can be > 40% of total
May hit server disk
Read-only filesystems
Increase actimeo to 600 or more
"Slow start" when data really changes
Rapidly changing filesystem (mail)
Try noac for no caching
File locking disables attribute and data caching
When a file is locked on the client system, that client begins to read and write
the file without any buffering. If your application calls
read(fd, buf, 128);
you'll read exactly 128 bytes over the wire from the NFS server, bypassing the
attribute cache and the local memory cache to be sure you fetch the latest copy
of the data from the server.
If file locking and strict ordering of writes are an issue, consider using a
database.
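As an example of lengthening the attribute cache on a read-mostly filesystem (server and paths
are illustrative):
# mount -F nfs -o ro,actimeo=600 server:/export/tools /tools
The actimeo option sets all four attribute-cache timeouts (acregmin/max, acdirmin/max) at once.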
Version 3.10 LISA 2006 125
1994-2006 Hal Stern, Marc Staveley
CacheFS Tips
Read-mostly, fixed size working set
Re-use files after loading into cache
Write-once files are worst case
Growing or sparse working set causes thrashing
Watch size of cache using df
Multi-homed hosts
CacheFS creates one cache per host name
Make client bindings persistent, not random
Penalty for cold cache less than that for no server
Using CacheFS solves the page-over-the-network problem where a process' text
segment is paged from the NFS server, not from a local disk. When using large
executables (some CAD applications, FORTRAN with many common blocks),
CacheFS may improve paging performance by keeping traffic on the local host.
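A minimal Solaris CacheFS setup sketch (cache directory and server names are illustrative):
# cfsadmin -c /var/cache/fs1
# mount -F cachefs -o backfstype=nfs,cachedir=/var/cache/fs1 server:/export/apps /apps
# df -k /var/cache/fs1
The df at the end is the "watch size of cache" check from the slide.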
Version 3.10 LISA 2006 126
1994-2006 Hal Stern, Marc Staveley
Buffer Sizing
Default of 8KB good for Ethernet speeds
At 56 Kbit/s an 8 KB transfer takes > 1 second to transmit
Remarkably anti-social behavior
Even worse for NFSv3 (32KB packets)
Reduce read/write sizes on slow links
In vfstab, automounter
rsize=1024,wsize=2048
Match to line speed and other uses
256 bytes is lower limit
readdir breaks with smaller buffer
Line Speed       rsize        Per-Packet Latency   Time to Read 1 KByte File
56 kbaud         128 bytes    20 msec              430 msec
56 kbaud         256 bytes    40 msec              310 msec
224 kbaud        256 bytes    10 msec              150 msec
T1 (1.5 Mbit)    1024 bytes   1 msec               42 msec
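A vfstab entry for a client on a slow link might look like this (server, path and values are
illustrative):
server:/export/home  -  /home/remote  nfs  -  yes  rw,rsize=1024,wsize=1024
The same options can go in an automounter map entry instead.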
Version 3.10 LISA 2006 127
1994-2006 Hal Stern, Marc Staveley
Network Design and Capacity
Planning
Section 4
Version 3.10 LISA 2006 128
1994-2006 Hal Stern, Marc Staveley
Topics
Network protocol operation
Naming services
Network topology
Network latency
Routing architecture
Network reliability colors end to end performance. If your network is delaying
traffic or losing packets, or if you suffer multiple network hops each with long
latency, you will impact what the user sees. The worst possible example is the
Internet: you get variable response time depending upon how many people are
downloading images, what current events have users flocking to the major sites,
and the time of day/day of the week.
Version 3.10 LISA 2006 129
1994-2006 Hal Stern, Marc Staveley
Network Protocol
Operation
Section 4.1
Version 3.10 LISA 2006 130
1994-2006 Hal Stern, Marc Staveley
Up & Down Protocol Stacks
[Diagram: a write() on a socket travels down the stack: TCP slow start and segmentation; IP
locates the route/interface, handles MTU fragmentation and finds the MAC address (consulting the
ARP cache, or issuing an ARP request and copying the reply into the kernel); Ethernet sends the
packet, backing off and re-transmitting on collision. Inbound, Ethernet accepts the frame, IP
re-assembles fragments and checks for a matching local IP, TCP/UDP validates the port, and the
data is returned by a read() on the socket. ICMP messages and RIP updates feed the kernel route
tables along the way.]
Solaris exposes nearly every tunable parameter in the TCP, UDP, IP, ICMP and
ARP protocols using the ndd tool.
Find a description of the tunable parameters and their upper/lower bounds on
Richard Steven's web page containing the Appendix to his latest TCP/IP books:
http://www.kohala.com/start/tcpipiv1.appe.update1.ps
Also on docs.sun.com at
http://docs.sun.com/app/docs/doc/806-4015?q=tunable+parameters
Solaris 2 - Tuning Your TCP/IP Stack and More
http://www.sean.de/Solaris
Version 3.10 LISA 2006 131
1994-2006 Hal Stern, Marc Staveley
Naming Services
Section 4.2
Version 3.10 LISA 2006 132
1994-2006 Hal Stern, Marc Staveley
Round-Robin DNS
Use several servers, in parallel, that have unique
IP addresses
DNS will return all of the IP addresses in response to
queries for www.blahblah.com
Clients resolving the name get the IP addresses
in round-robin fashion
When DNS cache entry times out, new one is
requested
Clients will wait up to DNS entry lifetime for a retry
Be sure to set the DNS server entry's Time To Live (TTL) to zero or a few
seconds, such that successive requests for the IP address of the named host get
new DNS entries
Name Servers that do Round-Robin:
- BIND 8
- djbdns
- lbnamed (true load balancer written in perl)
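A zone-file sketch of the idea, with a deliberately short TTL (addresses are from the
documentation range, purely illustrative):
www    60    IN    A    192.0.2.1
www    60    IN    A    192.0.2.2
www    60    IN    A    192.0.2.3
BIND rotates the order of the A records in successive answers.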
Version 3.10 LISA 2006 133
1994-2006 Hal Stern, Marc Staveley
Round-Robin DNS, cont'd
The Good
No real failure management, it "just works"
Scales very well; just add hardware and mix
Only 1/N clients affected, on average, for N server
farm (for a server failure)
The Bad
Clients can see minutes of "downtime" as DNS
entries expire, if TTL is too long
Can cheat with multiple A records per host, but not
all clients sort them correctly
The Ugly
None, if done correctly
Version 3.10 LISA 2006 134
1994-2006 Hal Stern, Marc Staveley
IP Redirection
[Diagram: clients resolve a public web-site name to the IP director at 192.9.230.1; the director
spreads the requests across web servers on internal networks 192.9.231.0 and 192.9.232.0
(hosts x.x.x.1, x.x.x.2, x.x.x.3).]
Version 3.10 LISA 2006 135
1994-2006 Hal Stern, Marc Staveley
IP Redirection Mechanics
Front-end boxes handle IP address resolution
Public IP address shows up in DNS maps
Internal (private) IP networks used to distribute load
Can have multiple networks, with multiple servers
Improvement over DNS load balancing
All round-robin choices made at redirector, so client's
DNS configuration (or caches) don't matter
Redirector can be made redundant
Hosts could be redundant, too
Cisco NetDirector, Hydra HydraWeb
Version 3.10 LISA 2006 136
1994-2006 Hal Stern, Marc Staveley
Network Topology
Section 4.3
Version 3.10 LISA 2006 137
1994-2006 Hal Stern, Marc Staveley
Switch Trunking (802.3ad)
Multiple connections to host from single switch
Improves input and output bandwidth
Eliminates impedance mismatch between switch and
network connection
Spread out input load on server side
Warnings:
Trunk can be a SPOF
Assumes switch can handle mux/demux of traffic at
peak loads
Solaris requires the SUNWtrku package (Sun Trunking Software)
Version 3.10 LISA 2006 138
1994-2006 Hal Stern, Marc Staveley
Latency and Collisions
Collisions
CSMA/CD works "late", node backs off, tries again
Fills Ethernet with malformed frames
Defers
CSMA/CD works "early"
Not counted, but adds to latency
Collisions "become" defers as more nodes share load
Use netstat -k (Solaris >= 2.4) or kstat (Solaris >= 7) to see
defers and other errors
802.11
Version 3.10 LISA 2006 139
1994-2006 Hal Stern, Marc Staveley
Dealing With Collisions
Rate = collisions/output packets
Collisions counted on transmit only
Monitor on several hosts, especially busy ones
Use netstat -i or LANalyzer to observe
Collision rate can exceed 100% per host
Thresholds
Should decrease with number of nodes on net
>5% is clear warning sign
Usually 1% is a problem
Correlate to network bandwidth utilization
Most Ethernet chip drivers understate the collision rate. In addition to only
counting collisions in which the station was an active participant, the chip may
report 0, 1 or "more than 1" collision. Most driver implementations take "more than 1" to
mean 2, when in fact it could be up to 16 consecutive collisions.
Version 3.10 LISA 2006 140
1994-2006 Hal Stern, Marc Staveley
Collisions and Switches
Switched Ethernet cannot have collisions (*)
Each device talks to switch independently
No shared media worries
Still get contention at switch under load
Ability of switch to forward packets to right interface
for output
Ability to handle input under high loads
Look for dropped/lost packets on switch
Results in NFS retransmission, RPC failure, NIS
timeouts, dismal TCP throughput
Version 3.10 LISA 2006 141
1994-2006 Hal Stern, Marc Staveley
Collisions, Really Now
Full versus Half Duplex
Full Duplex: each node has a home run and no
contention for either path to/from switch
Half Duplex: you can still see collisions, in rare cases
What makes switch-host collide?
Many small packets, in steady streams
Large segments probably are OK
Version 3.10 LISA 2006 142
1994-2006 Hal Stern, Marc Staveley
Switches and Routers
Bridges, Switches
Very low latency, single IP network or VLAN
One input pipe per server
Routers
Higher latency, load dependent
Multiple pipes per server
Version 3.10 LISA 2006 143
1994-2006 Hal Stern, Marc Staveley
Switched Architectures
Switches offer "home run" wiring
Each station has dedicated, bidirectional port
Reduce contention for media (collisions = 0)
Construct virtual LANs on switch, if needed
"Smooth out" variations in load
Only broadcast & multicast normally cross between
network segments
Watch for impedance mismatch at switch
80 clients @ 100 Mb/s swamps a 2 Gb/s backplane
Version 3.10 LISA 2006 144
1994-2006 Hal Stern, Marc Staveley
Network Partitioning
Switches & bridges for physical partitioning
Corral traffic on each side of bridge
Primary goal: reduce contention
Routing for protocol isolation
Non-IP traffic (NetWare)
Broadcast isolation (NetBIOS, vocal applications)
Non-trusted traffic (use a firewall, too)
VLAN capability on switches for creating
geographically difficult wiring schemes
Version 3.10 LISA 2006 145
1994-2006 Hal Stern, Marc Staveley
Network Latency
Section 4.4
Version 3.10 LISA 2006 146
1994-2006 Hal Stern, Marc Staveley
Trickle Of Data?
Serious fragmentation at router or host
TCP retransmission interval too short
Real-live network loading problem
Handshakes not completing quickly
Nagle algorithm / TCP slow start
PCs often get this wrong
Set tcp_slow_start_initial=2 to send two segments,
not just one: dramatically improves web server
performance from PC's view
tcp_slow_start_after_idle=2 as well
inhibit the sending of new TCP segments when new outgoing data arrives from the user if
any previously transmitted data on the connection remains unacknowledged.
- John Nagle (RFC 896)
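On Solaris releases that support them, the slow-start settings mentioned above are ndd
parameters, e.g.:
ndd -set /dev/tcp tcp_slow_start_initial 2
ndd -set /dev/tcp tcp_slow_start_after_idle 2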
Version 3.10 LISA 2006 147
1994-2006 Hal Stern, Marc Staveley
Longer & Fatter Pipes
Fat networks (ATM, GigE, 10GigE)
Benefit versus cost trade-offs
Backbone or desktop connections?
Longer networks (WAN)
Guaranteed capacity, grade of service?
End to end security and integrity?
Latency versus throughput
Still 20 msec coast to coast
GigE jumbo frames >> Ethernet in latency, loses for
small packets
Version 3.10 LISA 2006 148
1994-2006 Hal Stern, Marc Staveley
Long Fat Networks
[Diagram: a T1 link with 40 msec one-way latency. The sender pushes 4 KB of data onto the wire
in 3 msec, then waits 70+ msec for acknowledgements before it can send more, producing gaps in
the transmit stream. The first bits arrive after 40 msec and the last bit after 43 msec; the
receiver sees the gaps in the data and acks as fast as it can.]
Bad TCP/IP implementations retransmit too much: they read the high latency as a sign that the
packet never arrived, because their retransmit timer is too small.
Version 3.10 LISA 2006 149
1994-2006 Hal Stern, Marc Staveley
Tuning For LFNs
Set the sender and receiver buffer size high
water marks
Usually an ndd or kernel option, but resist temptation
to make "global fix"
Set using setsockopt() in application to avoid running
out of kernel memory
Buffer depth = 2 * bandwidth * delay product
or bandwidth * RTT (ping)
1.54 Mbit/sec network (T1) with 25 msec delay = 10
KB buffer
155 Mbit/sec network (OC3) with 25 msec delay = 1
MB buffer
Solaris:
# increase max tcp window (maximum socket buffer size)
# max_buf = 2 x cwnd_max (congestion window)
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152
# increase default SO_SNDBUF/SO_RCVBUF size.
ndd -set /dev/tcp tcp_xmit_hiwat 65536
ndd -set /dev/tcp tcp_recv_hiwat 65536
Linux (>= 2.4):
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem
See http://www-didc.lbl.gov/tcp-wan.html and
http://www.psc.edu/networking/perf_tune.html for a longer explanation.
A list of tools to help determine the bandwidth of a link can be found at
http://www.caida.org/tools/taxonomy/.
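As a check on the T1 figure from the slide:
2 x 1.54 Mbit/sec x 0.025 sec = ~77 kbit = ~9.6 KB
which rounds to the ~10 KB buffer quoted above; the same arithmetic for OC3 gives roughly 1 MB.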
Version 3.10 LISA 2006 150
1994-2006 Hal Stern, Marc Staveley
Routing Architecture
Section 4.5
Version 3.10 LISA 2006 151
1994-2006 Hal Stern, Marc Staveley
IP Routing
IP is a "gas station" protocol
Knows how to find next hop
Makes best effort to deliver packets
Kernel maintains routing tables
route command adds entries
So does routed
Dynamic updates: ICMP redirects
Version 3.10 LISA 2006 152
1994-2006 Hal Stern, Marc Staveley
What Goes Wrong?
Unstable route tables (lies)
Machines have wrong netmask or broadcast
addresses
Servers route by accident (multiple interfaces)
Incorrect or missing routes
Lost packets
nfs_server: bad sendreply
Asymmetrical routes
Performance skews for in/outbound traffic
Version 3.10 LISA 2006 153
1994-2006 Hal Stern, Marc Staveley
RIP Updates
Routers send RIP packets every 30 seconds
Each router increases costs metric (cap of 15)
Active/passive gateway notations
/etc/gateways to seed behavior
Default routes
Chosen when no host or network route matches
May produce ICMP redirects
/etc/defaultrouter has initial value
Version 3.10 LISA 2006 154
1994-2006 Hal Stern, Marc Staveley
Routing Architecture
Default router or dynamic discovery
One router or several?
Dynamic recovery
RDISC (RFC 1256)
Multiple default routers
Recovery time
Function of network radix
Version 3.10 LISA 2006 155
1994-2006 Hal Stern, Marc Staveley
Tips & Tricks
Watch for IP routing on servers
netstat -s shows IP statistics
Consumes server CPU, network input bandwidth
Name service dependencies
Broken routing affects name service
If netstat -r hangs, try netstat -rn
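To check whether a multi-homed server is forwarding packets by accident, and to turn it off
(Solaris):
ndd -get /dev/ip ip_forwarding
ndd -set /dev/ip ip_forwarding 0
(Creating /etc/notrouter before boot has the same effect on most Solaris releases.)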
Version 3.10 LISA 2006 156
1994-2006 Hal Stern, Marc Staveley
ICMP Redirects
Packet forwarded over interface on which it
arrived
ICMP redirect sent to transmitting host
Sender should update routing tables
Impact on default routes
Implies a better choice is available
Ignore or "fade" on host if incorrect
ndd -set /dev/ip ip_ignore_redirect 1
ndd -set /dev/ip ip_ire_redirect_interval 15000
Turn off to appease non-listeners
ndd -set /dev/ip ip_send_redirects 0
Version 3.10 LISA 2006 157
1994-2006 Hal Stern, Marc Staveley
MTU Discovery
Sending frames larger than the path MTU gives routers fragmentation work
Increases latency
Do work on send side if you know MTU
RFC 1191 - MTU discovery
Send packet with "don't fragment" bit set
Router returns ICMP error if too big
Repeat with smaller frame size
Disable with:
ndd -set /dev/ip ip_path_mtu_discovery 0
This RFC, like all others, may be found in one of the RFC repositories:
www.rfc-editor.org, www.ietf.org, www.faqs.org/rfcs
Version 3.10 LISA 2006 158
1994-2006 Hal Stern, Marc Staveley
ARP Cache Management
ARP cache maintains IP:MAC mappings
May want to discard quickly
Mobile IP addresses with multiple hardware
addresses, or DHCP with rapid re-attachment
Network equipment reboots
HA failover when MAC address doesn't move
Combined route/ARP entries at IP level
ndd -set /dev/ip ip_ire_cleanup_interval 30000
ndd -set /dev/ip ip_ire_flush_interval 90000
Local net ARP entries explicitly aged
ndd -set /dev/arp arp_cleanup_interval 60000
See SunWorld Online, February and April 1997
Version 3.10 LISA 2006 159
1994-2006 Hal Stern, Marc Staveley
Application Architecture
Appendix A
Version 3.10 LISA 2006 160
1994-2006 Hal Stern, Marc Staveley
Topics
System programming
Network programming & tuning
Memory management
Real-time design & data management
Reliable computing
Version 3.10 LISA 2006 161
1994-2006 Hal Stern, Marc Staveley
System Programming
Section A.1
Version 3.10 LISA 2006 162
1994-2006 Hal Stern, Marc Staveley
What Can Go Wrong?
Poor use of system calls
Polling I/O
Locking/semaphore operations
Inefficient memory allocation or leaks
Version 3.10 LISA 2006 163
1994-2006 Hal Stern, Marc Staveley
System Call Costs
System calls are traps: serviced like page faults
Easily abused with small buffer sizes
Example
read() and write() on pipe
Buffer size      sy     cs    us   sy   id
10 KBytes        271     41     4   12   84
1 KBytes         595    319     5   33   62
1 Byte          3733   2178    11   89    0
Version 3.10 LISA 2006 164
1994-2006 Hal Stern, Marc Staveley
Using strace/truss
Shows system calls and return values
Locate calls that make process hang
Debug permission problems
Determine dynamic system call usage
% strace ls /
lstat ("/", 0xf77ffbb8) = 0
open("/", 0, 0) = 3
brk(0xf210) = 0
fcntl(3, 02, 0x1) = 0
getdents(3, 0x9268, 8192) = 716
Using strace or truss greatly slows a process down. You're effectively putting a
kernel trap on every system call.
Collating Results
tracestat:
#!/bin/sh
awk '{
if ( $1 == "-" )
print $2
else
print $1
}' | sort | uniq -c
% strace xprocess | tracestat
13 close
87 getitimer
2957 ioctl
13 open
228 read
117 setitimer
582 sigblock
Version 3.10 LISA 2006 165
1994-2006 Hal Stern, Marc Staveley
Synchronous Writes
write() system call waits until disk is done
Often 20 msec or more disk latency
Reduces buffering/increases disk traffic
Caused by
Explicit flag in open()
sync/update operation, or NFS write
Closing file with outstanding writes (news spool)
Typical usage
Ensuring data delivery to disk, for strict ordering
Side-effects
close(2) is synchronous
waits for pending write(2)'s to complete; fails if:
quota exceeded (EDQUOT)
filesystem full (ENOSPC)
Check the return value!
Version 3.10 LISA 2006 166
1994-2006 Hal Stern, Marc Staveley
Eliminating Sync Writes
NFS v3 or async mode in NFS v2
Use file locking or semaphores
Application ensures order of operations, not
filesystem
Better solution for multiple writers of same file
Avoid open()-write()-close() loops
Use syslog-like process for logging events
Use database for preferences, history, environment
Version 3.10 LISA 2006 167
1994-2006 Hal Stern, Marc Staveley
Network Programming
Section A.2
Version 3.10 LISA 2006 168
1994-2006 Hal Stern, Marc Staveley
TCP/IP Buffering
Segment sizes negotiated at connection
Receiver advertises buffer up to 64K (48K)
Sender can buffer more/less data
Determine ideal buffer size on per-application
basis
Global changes are harmful, can consume resources
setsockopt(..SO_RCVBUF..)
setsockopt(..SO_SNDBUF..)
The global parameters on Solaris systems are set via ndd(1M):
tcp_xmit_hiwat, udp_xmit_hiwat for sending buffers
tcp_recv_hiwat, udp_recv_hiwat for receiving buffers
Global parameters in /sys/netinet/in_proto.c for BSD systems are:
tcp_sendspace, udp_sendspace
tcp_recvspace, udp_recvspace
Version 3.10 LISA 2006 169
1994-2006 Hal Stern, Marc Staveley
TCP Transmit Optimization
Small packets buffered on output
Nagle algorithm buffers 2nd write until 1st is
acknowledged
Will delay up to 50 msec
setsockopt (..TCP_NODELAY..)
Retransmit timer for PPP/dial-up nets
tcp_rexmit_interval_min
Default of 100 up to 1500
tcp_rexmit_interval_initial
Default of 200 up to 2500
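Applied with ndd on Solaris, using the upper values from the bullets above (tune these for your
link; they are illustrative):
ndd -set /dev/tcp tcp_rexmit_interval_min 1500
ndd -set /dev/tcp tcp_rexmit_interval_initial 2500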
Version 3.10 LISA 2006 170
1994-2006 Hal Stern, Marc Staveley
Connection Management
High request volume floods connection queue
BSD had implied limit of 5 connections
Now tunable in most implementations
Connection requires 3 host-host trips
Client sends request to server
Server sends packet to client
Client completes connection
Longer network latencies (Internet) require
deeper queue
Version 3.10 LISA 2006 171
1994-2006 Hal Stern, Marc Staveley
Connections, Part II
Change listen(5) to listen(20) or more
20-32000+ ideal for popular services like httpd
ndd -set /dev/tcp tcp_conn_req_max 100
Socket addresses live on for 2 * MSL
Database process crashes and restarts
Can't bind to pre-defined address
setsockopt(..SO_REUSEADDR..) doesn't help
Decrease management timers
tcp_keepalive_interval (msec)
tcp_close_wait_interval(msec) [Solaris <2.6]
tcp_time_wait_interval (msec) [Solaris 2.6]
Determine the average backlog using a simple queuing theory formula: average
wait in a queue = service time * arrival rate
With an arrival rate of 150 requests a second, and a round trip handshake time
of 100 msec, you'll need a queue 15 requests deep. Note that 100 msec is just
about the latency of a packet from New York to California and back again.
Once you've fixed the connection and timeout problems, make sure you aren't
running out of file descriptors for key processes like inetd.
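One way to check descriptor usage on Solaris (pgrep and these proc tools appear around
Solaris 7/8 and later):
% plimit `pgrep -x inetd`              (current and maximum fd limits)
% pfiles `pgrep -x inetd` | wc -l      (rough count of descriptors in use)
pfiles prints several lines per descriptor, so treat the wc -l number as an approximation.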
Version 3.10 LISA 2006 172
1994-2006 Hal Stern, Marc Staveley
Memory Management
Section A.3
Version 3.10 LISA 2006 173
1994-2006 Hal Stern, Marc Staveley
Address Space Layout
Static areas: text, data
Initialized data, including globals
Uninitialized data (BSS)
Growth
Stack: grows down from top
mmap: grows down from below stack limit
Heap: grows up from top of BSS
Version 3.10 LISA 2006 174
1994-2006 Hal Stern, Marc Staveley
Stack Management
Local variables, parameters go on stack
Don't put large data structures on stack
Use malloc()
Can damage window overlaps
Version 3.10 LISA 2006 175
1994-2006 Hal Stern, Marc Staveley
Dynamic Memory Management
malloc() and friends, free()
Library calls on top of brk()
Don't mix brk() and malloc()
free() never shrinks heap, SZ is high-water mark
Cell management
malloc() puts cell size at beginning of block
Allocates more than size requested
Time- or space-optimized variants
Try standard cell sizes
Version 3.10 LISA 2006 176
1994-2006 Hal Stern, Marc Staveley
Typical Problems
Memory leaks: SZ grows monotonically
Address space fragmentation: MMU thrashing
Data stride
Access size matches cache size
Repeatedly use 1 cache line
Fix: move globals, resize arrays
Use mmap() for sparsely accessed files
More efficient than reading entire file into memory
Version 3.10 LISA 2006 177
1994-2006 Hal Stern, Marc Staveley
mmap() or Shared Memory?
Memory mapped files:
Process coordination through file name
Backed by filesystem, including NFS
No swap space usage
Writes may cause large page flush, better for reading
Shared memory
More set-up and coordination work with keys
Backed by swap, not filesystem
Need to explicitly write to disk
Version 3.10 LISA 2006 178
1994-2006 Hal Stern, Marc Staveley
Real-Time Design
Section A.4
Version 3.10 LISA 2006 179
1994-2006 Hal Stern, Marc Staveley
Why Worry About Real-Time?
New computing problems
Customer service with live transfer
Real-time expectations of customers
Web-based access
If a user's in front of it, it's real time
Predictable response times
High volume transaction environment
Threaded/concurrent programming models
Things to learn from device drivers
Version 3.10 LISA 2006 180
1994-2006 Hal Stern, Marc Staveley
System V Real-Time Features
Real-time scheduling class
Kernel pre-emption, including system calls
Process promotion to avoid blocking chains
No real-time network or filesystem code
Resource allocation and "nail down"
mlock(), plock() to lock memory/processes
Move process into real-time class with priocntl
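For example, launching a process directly into the RT class (root only; the program name is
illustrative):
# priocntl -e -c RT -p 10 ./collector
and calling mlockall(MCL_CURRENT|MCL_FUTURE) inside the process keeps its pages from being
paged out.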
Version 3.10 LISA 2006 181
1994-2006 Hal Stern, Marc Staveley
Real-Time Methodology
Processes run for short periods
Same model used by Windows
Must allow scheduler to run: sleep or wait
CPU-bound jobs will lock system
Time quanta inversely proportional to priority
Minimize latency to schedule key jobs
Ship low-priority work to low-priority thread
No filesystem/network dependencies
Version 3.10 LISA 2006 182
1994-2006 Hal Stern, Marc Staveley
Summary
Version 3.10 LISA 2006 183
1994-2006 Hal Stern, Marc Staveley
Parting Shots, Part 1
Be greedy
Solve for biggest gains first
Don't over-tune or over-analyze
Don't trust too much
3rd party code, libraries, blanket statements
Verify information given to you by users
Bottlenecks are converted
Add network pipes, reduce latency, hurt server
Fixing one problem creates 3 new ones
Some speedups are superlinear
Version 3.10 LISA 2006 184
1994-2006 Hal Stern, Marc Staveley
Parting Shots, Part 2
Today's hot technology is tomorrow's capacity
headache
Web browser caches, PCN, streaming video
But taxing use leads to insurrection
Rules change with each new release
New features, new algorithms
RFC compliance is creative art
Nobody thanks you for being pro-active
But you should be!