
HP-UX 11i Knowledge-on-Demand:

performance optimization best-practices from our labs to you



Developer series

Kernel tracing on HP-UX - Webcast topic transcript



Hello everyone. My name is Carol Muser, and I'm here to talk to you today about kernel tracing on HP-UX.
[NEXT SLIDE]
I'm a member of the HP-UX Kernel Performance Team, and I've been working at HP for ten years. My expertise lies in
kernel tracing and kernel profiling. I've worked on kernel instrumentation, benchmark tracking, and troubleshooting
performance problems as well.
[NEXT SLIDE]
As I talk to you about kernel tracing, I'd like to start with an overview, and then I'll describe the interface to use it, and
show you a sample output, describe the internals of kernel tracing, go through a case study, list additional features,
and then conclude.
[NEXT SLIDE]
A good way to introduce kernel tracing is to talk about what it is: it is used to trace the kernel's procedure calls. Whenever a selected function or routine gets called, it captures the parameters and the caller, and it can show you the timing of function calls made within the kernel across all CPUs. Another purpose of running kernel tracing is to observe, analyze, and debug kernel behavior. The advantage of being able to run ktracer on HP-UX is that the functionality is built into every kernel; you don't need to do anything special, you just turn it on and it works for whatever flavor of kernel you have, on whichever architecture.
And it's fast! You can trace millions of function calls per second.
It's powerful! There are roughly 40,000 different trace points, there are dozens of data fields captured, and you can choose which user-specified variables to capture at each trace point.
It's flexible! You can adjust the function list or the data retention policy or the way it gets deactivated on a certain
trigger.
It's extensible! As new source code is developed and linked in, ktracer will find it.
And it's insightful! You can discover OS behavior that you had no idea about before.
[NEXT SLIDE]

Ktracer is newly available as ktrace in Caliper. The web link is hp.com/go/caliper. Ktracer works on all HP-UX releases and architectures, both PA and IPF. There is no performance cost when it's off. There is a cost when it's on; the cost is proportional to how many function calls you are tracing. Ktracer works at run time, from boot-up, on a live system, and on a crash dump. It was originally developed by a kernel developer for kernel developers, and it is not intended for use in a production environment. It's useful for debugging, for testing, for panic analysis, and for bring-up. It is to be used at your own risk. We're expanding the audience because our customers are asking us for more tools, and this one is so powerful and can be such a timesaver. Well-educated customers can benefit from a better customer experience by having such a powerful tool.
[NEXT SLIDE]
Ktracer can trace the CPU, process, and thread; it can trace function callers and callees, capture the arguments and a timestamp, and capture globals or other user-specified variables. It can handle tens of thousands of functions. It can trace any subsystem of the kernel; it could be firmware or networking, it could be IO or the file system. Functions can be selected by name, by kernel module, or by kernel library. You can trace all processes or just one of your choosing.
[NEXT SLIDE]
The ideal audience for ktracer is someone who wants to run a tool like Solaris DTrace or Linux Fast Kernel Tracing on HP-UX. This user would know UNIX internals and have access to HP-UX source code. I find it very useful to run a cscope browser in one window and look at ktracer output in another window. The ideal
audience develops software, particularly kernel software, DLKMs or drivers for HP-UX. And this audience would
be interested in learning more about the behavior within the HP-UX operating system.
[NEXT SLIDE]
Here's an overview of how ktracer works. There's a ktracer command that the user runs in order to select which
kernel functions they're interested in tracing. There might be 30,000 or 40,000 functions; there could be more or fewer depending on the kernel, which modules are linked in, and which architecture, but thousands of functions,
and you can select all of them or a handful or anywhere in between. So ktracer places a trace point at the
beginning of each function. There's a 1-to-1 correspondence between function entry points and trace points, and
each and every time a function is called that's been selected, the kernel goes to the trace point, branches to
tracing code, and collects one trace record. This is not sampling based; this is instrumentation where every
function call that's made will be traced. There are volumes of data collected regarding the function address and
arguments and caller, and it also collects the stack pointer, the spinlock depth, the system mask, PPR, EIEM and
variables, as I mentioned, that are user-specified. Because so much data is collected, not all of it is kept. You
can make the trace buffers larger to keep more, but if the buffers overflow, they'll just wrap around and overwrite
the oldest data, keeping only the most recent.
[NEXT SLIDE]
Here is sample output from ktracedump. It shows what trace records look like. There is one trace record per row
with a header across the top. The same format is used for all trace records. All the traced activity here occurred
on CPU 1, and you can see which process ID and thread ID are active. The first trace record shows that the kernel function bubbleup called external interrupt, and supplies a timestamp. The absolute seconds (AbsSec) column is the timestamp, and the ElUSec column shows the elapsed time between the previous trace and the current one. Only 8 columns are shown here; there are 29 columns available. This particular slide shows a ten-millisecond snapshot of IO activity using LVM on a SCSI disk on IA-64. The timestamps provide nanosecond granularity.
[NEXT SLIDE]
Here's more on the purpose of kernel tracing. It can be used to educate yourself on how the OS behaves. It answers questions such as: which syscalls are made, when, and who makes them? How long is my time slice,
and what happens during IO forwarding? What happens when an interrupt comes in? Does my code get
called? Are these parameters valid? So in this manner, ktracer can be used to study and better understand what
the OS does.
[NEXT SLIDE]
A second purpose of kernel tracing is to troubleshoot and debug problems. It's useful both for functional defects and for performance issues. You can look at questions like: are clock interrupts arriving every ten milliseconds? How long does a kernel thread hold the CPU? And which spinlock acquisitions are nested, and how deeply?
[NEXT SLIDE]
Now that we've covered the overview, I'll go into detail on the user interface for kernel tracing.
[NEXT SLIDE]
So the ktracer command is used to control when tracing's turned on and off and which functions are traced and
how large the buffers are within the kernel, and just to get started, for the simplest options, ktracer -g will gather a
trace for a specific interval of time. You pass it a number of seconds and it just does it all from start to finish:
traces the kernel, and produces an output report. Ktracer -w is another all-in-one command line, but instead of
gathering for a number of seconds it gathers traces while a workload runs. It executes the workload and then
runs ktracedump to produce the output. When I say "do it all options" or "all-in-one," there are six steps listed on
the slide that need to be done to get ktracer up and running and then stopped and reporting. Step one is to
allocate, two is to select the function list, three, install and begin, four is to run the workload or sleep for a
number of seconds, and then five would be to stop. Part of stopping is to remove the trace points, which restores the system's original, faster performance. Then, finally, invoke ktracedump.
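For instance, the two do-it-all options might be invoked roughly like this; the interval, the workload name, and the exact argument spellings are placeholders of mine rather than from the slide:

    ktracer -g 10              # gather a trace for 10 seconds, then produce the report
    ktracer -w ./my_workload   # trace while the workload runs, then run ktracedump for the output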
[NEXT SLIDE]
This slide describes two options that get you through those steps of allocate, select functions, install, and begin automatically. They differ only in the function list. Ktracer -L starts with a lightweight set of key functions, which
have low performance overhead to trace, and which are particularly interesting because they relate to syscalls
and to kernel entry points such as trap or pre-handler, and they relate to interrupt handling and process
switching. There are about a dozen key functions; their count and names vary based on architecture.
Ktracer can trace tens of thousands of functions. There are some functions that ktracer cannot trace due to
limitations. It cannot trace leaf functions on PA, where leaf functions are non-interior functions, meaning they do
not call other functions. ktracer cannot trace function calls that are inlined by the compiler or the optimizer,
because the optimizer removes the function call site so the calls aren't actually made. Ktracer only traces calls
that are made. Ktracer with 11.31 LR doesn't trace functions within DLKMs. I've developed a patch that adds that capability, and I'm looking for beta testers; I'm wondering whether there is anyone in the audience who would be interested in trying out a patch. Please contact me at Carol.Muser@hp.com if you are interested. Ktracer has some other functions it can't trace, because they are on an exclude list, because their instructions can't be replaced, because unwind information is unavailable or incorrect, or because the function names contain $cold or other compiler-generated special characters. If you run ktracer -a and give it a specific function name that you're interested in tracing, it will either say it has added it or give an error message describing why that particular function cannot be traced.
[NEXT SLIDE]
So here's a description of how to select the trace function list. Ktracer -m will select by module, -l by library.
Ktracer -a can add a function by name or by address, or explain why not. Ktracer
-r will remove a function. A new option with -r is to pass it a number, and it will remove that number of the most frequently traced functions. Ktracer -z zeros the function list.
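Putting those selection options together, a function-list setup might look like the following sketch; the module and function names are only illustrative:

    ktracer -z                          # zero the function list
    ktracer -m vxfs                     # add every function in the vxfs module
    ktracer -a vx_recsmp_rangelock      # add one function by name (or by address)
    ktracer -r vx_recsmp_rangeunlock    # remove one function by name
    ktracer -r 3                        # remove the 3 most frequently traced functions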


[NEXT SLIDE]
This slide shows you how to begin and halt tracing. There are man pages for both ktracer and ktracedump that
describe the various options and operations as well. I'll describe the ktracer -S option. Capital S will stop when
a certain function is reached, that is, turn off trace record collection. Both Ktracer Light and Ktracer Reasonable,
which are the -L and -R options, incorporate stop functions at the panic function and the assertion fail function.
What this stop functionality provides is a way to trace the last 2,000 or n functions on each CPU leading up to a
panic or an assertion failure. After you've run ktracer and encountered a panic, you can run ktracedump in a dump directory and list all the function calls that led up to the panic. Debug kernels start Ktracer Light during boot and leave it on. Perf kernels don't; they start with ktracer off and they leave it off, but you can enable it through
super-user control when desired.
[NEXT SLIDE]
A simple usage scenario is covered here. You can run ktracer -L to turn on tracing of a few key functions, and
then run a benchmark or test case, and then ktracer -h will turn tracing off. Then run ktracedump -D and redirect the output to a file; that produces a formatted report that displays the trace records that have been collected.
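In command form, that scenario looks roughly like this; the benchmark and output file names are placeholders:

    ktracer -L                  # turn on tracing of a few key (light) functions
    ./benchmark                 # run the benchmark or test case
    ktracer -h                  # halt tracing
    ktracedump -D > ktrace.out  # write the formatted report of the collected trace records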
[NEXT SLIDE]
Here's an example with some more detail. This example shows a Ktracer Light begin, but then removes the external interrupt function and adds the BT-LAN ISR function, plus all the functions in the TCP module, and then stops if the dma_cleanup function is encountered. Now begin. The -B option leads me to a point: the options are processed in the order specified, from left to right, and while modifying the trace function list, ktracer needs to be off; it turns itself off automatically if needed. If ktracer was on when you asked to modify the function list, it turned off automatically, so after you finish selecting functions, you'll need an explicit begin option in order to start tracing.
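On a single command line, that example might look something like the sketch below; the spellings of the external interrupt and BT-LAN ISR function names are my guesses rather than names taken from the slide:

    # Options are processed left to right: start from the light list, remove the external
    # interrupt handler, add the BT-LAN ISR and the whole tcp module, set a stop function,
    # then begin tracing.
    ktracer -L -r external_interrupt -a btlan_isr -m tcp -S dma_cleanup -B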
[NEXT SLIDE]
So the ktracedump command produces the report of the trace records, and it will report the trace function list as well. Each field in the trace record can be shown as a column of output. Man ktracedump to see the details. Some basic options are shown here to run ktracedump on a live system or a dump. Ktracedump -FN will report the function list, so if you want to see which light functions apply on your system, then ktracer -L; ktracedump -FN will show you the list that pertains. If you run ktracer -R; ktracedump -FN, then you'll see exactly how many functions ktracer can trace, whether it is 30,000 or 40,000 or some other count, and what each of their names are.
[NEXT SLIDE]
Ktracedump -H shows trace fields, and each field can be displayed as a column of output. In some circumstances
you can only fit 80 characters on a screen or 132 characters on a printout, so you need to limit which columns
are displayed. Ktracedump has several options to take advantage of screen real estate, to show only narrow
columns or the most important columns, or to adjust the width of the columns. The trace fields that can be
displayed are listed here: sequence and CPU number, process ID and thread ID, the ones I've been discussing
earlier, caller, function, absolute time, and elapsed time.
This brings us to the stack pointer, which I'd like to describe a little more. The stack pointer is a pointer to an area in memory that contains procedure calling context, saved registers, and local variables. On IPF, by convention, addresses that start with 0x9f are on the kernel stack. Addresses on the ICS (Interrupt Control Stack) start with 0xe. You can tell from the stack pointer whether a trace was taken while on the kernel stack or in interrupt mode (on the ICS). 64-bit user space addresses would start with 0x4, but you won't see user stack addresses because this is a kernel tracing facility. Ktracer is not able to trace lightweight system calls because lightweight syscalls don't actually create a kernel stack; they stay on the user stack.
[NEXT SLIDE]
This slide shows more trace fields that Ktracer will collect in a trace record. The PSR, TPR, the first four
arguments, and then a symbolic interpretation of arg0 through arg3, and then the globals 0 through 3, which
are variables that the user selects. For the symbolic arguments, e.g. SymArg0, ktracedump takes the argument value, e.g. arg0, and interprets it based on function context and kernel knowledge. If the trace record is for the spinlock or spinunlock function, then ktracedump translates arg0 into a lock name to arrive at SymArg0. If the traced function is syscall, then arg0 is the syscall number, and ktracedump puts the corresponding syscall name into SymArg0. If the traced function is trap, then arg0 represents a trap number, and ktracedump will show the trap name in the SymArg0 column. If arg0 is the address of a kernel function or variable, SymArg0 will be the name+offset of that kernel function or variable. If you want to see arg0 in decimal rather than hex, you can run ktracedump -J arg0 followed by a percent sign and then a printf format string, such as -J arg0%16ld to print arg0 as a 16-character-wide decimal. If you change ld to lx, arg0 will be printed in hexadecimal. The C format strings are listed in the man page for printf.

[NEXT SLIDE]
I've referred to ktracedump -J for selecting which columns of trace data to show in the ktracedump output. Capital J turns columns on; little j turns columns off. If you don't want to see the PSR and SpnD columns, the example shows you how to turn both of those off.
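For example, something along these lines; the exact way column names and format strings are passed is an assumption on my part, and man ktracedump has the authoritative syntax:

    ktracedump -j PSR -j SpnD > ktrace.out    # hide the PSR and SpnD columns
    ktracedump -J 'arg0%16ld' > ktrace.out    # print arg0 as a 16-character-wide decimal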
[NEXT SLIDE]
Ktracedump -A displays all 29 columns that are available for each trace record.
The default trace record order is from lowest CPU number to highest CPU number. Trace records are listed from
oldest to newest on a specific CPU, based on time of capture.
There is a new -S option for ktracedump to sort all trace records by timestamp, independent of CPU number. I'd really like to recommend the vim (Vi IMproved) editor, available at vim.org. It works where vi doesn't on large files. The volume of ktracedump output is frequently in terms of megabytes or gigabytes, depending on
what size you request for the trace buffers. The :set nowrap option helps view lines that would wrap around or
are very long. Vim has syntax highlighting, multiple undo and other features that I find quite useful, so I
recommend it.
[NEXT SLIDE]
Recently, the Caliper team and my team have worked together in order to integrate Ktracer into Caliper. Now
Caliper can be invoked with the ktrace measurement feature. An example is shown on this slide where you list
the Ktracer args in quotes and the Ktracedump args in quotes, and then specify the output file. The ktracer and
ktracedump args are identical to the arguments I've described earlier for the Ktracer and Ktracedump commands.
Caliper is a multi-purpose performance tool. It runs on HP-UX and on Linux. It can do user-level and kernel
profiling, and data cache and ITLB miss measurement and much more. Tracing is something that fits well under
the umbrella of performance measurement tools. That's at hp.com/go/caliper.
[NEXT SLIDE]
Here's a simple usage scenario; we'll be covering its output, ktrace.out, in the next section. It is based on running this command: ktracer -L -w workload. -L turns on Ktracer Light, -w runs the workload. This particular workload I'm running is very small; it's just a test command. The ktracedump command will extract the trace records from the kernel. Ktracedump is a /dev/kmem reader, and it just reads the traces without grabbing any locks or modifying the kernel tracing state, so it doesn't interfere with or change ktracer operation. Ktracedump will interpret and format that trace data, and then write the ktrace.out file, which we'll be looking at next.
[NEXT SLIDE]
So here, in the sample output, we'll cover the file header, trace header, columns, and trace records.
[NEXT SLIDE]
The output from ktracedump starts with the file header. It has timestamps for when the workload began and
ended. It shows the options passed to ktracedump, the version number, some dates, the kernel whatstring and
linkstamp. If you ever need to report a problem, please include the file header in order to provide parameters of
the system being used.
The next 14 slides show ktracedump output. Nine of these slides have the same content with different highlighting, so we'll be able to move through them quickly.
[NEXT SLIDE]
The file header describes some of the internals of ktracer, such as how time is collected and adjusted. Time is collected on each CPU from the interval timer, and then it's adjusted by the offset versus the monarch. Trace buffer management can be circular, which I've described already, where the oldest data gets overwritten if the trace buffers fill. Each trace buffer by default holds 2,048 traces. I will cover linear trace buffers on a future slide.
This file header shows more data about the run, such as which CPUs were traced, the sort order, and that just 12 traces were gathered, which means 12 calls were made to functions that had been selected. So each processor had buffer space for 2,048 trace records, but only 12 of the trace records were actually used. Iticks refer to the hardware clock frequency, and equate to machine cycles on some machines.
[NEXT SLIDE]
Here's the trace header. This looks similar to a ktracedump -H output I showed earlier showing which columns
can be displayed. Here we have nine columns displayed, and I'll go through each one on the next nine slides.
[NEXT SLIDE]
The first column is awk-parse info. It is a single character in column one; it is a T if the line contains a trace
record, it's an H if the line contains a header, or other letters as you can see. And it is particularly useful if
you're writing a Perl or a Python script or using grep or awk commands to filter the output, and you want to look
at just the traces or data of a particular nature.
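For example, a filter like the following keeps only the trace-record lines; this is my own sketch and assumes the report was saved as ktrace.out:

    # The awk-parse marker is field 1; T means the line is a trace record.
    awk '$1 == "T"' ktrace.out > traces_only.txt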
[NEXT SLIDE]
The second column is a sequence number. It's zero-based. This slide shows sequences from 0 to 11. Sometimes I'll call these trace numbers; for trace number 6, if you go across, you'll see resume_cleanup as the function. If you sort the output by the zero-based sequence number, then you'll see a sequence of events across all CPUs from the farthest back in time to the most recent. The sequence number is calculated based on the absolute timestamp.
[NEXT SLIDE]
The third column is a CPU column. In this example, all the traces were collected on CPU 1.

[NEXT SLIDE]
The fourth column is the process ID. If a trace record is captured while in process context, the process ID is
captured and displayed here. If a trace record is captured while executing in idle or on the interrupt stack,
ktracer will show -1 here to indicate that it was not in process context. If you want to trace only the kernel activity that occurs outside of process context, you can run ktracer -p -1. The thread ID (TID) is collected along with the process ID. The TID column is not displayed here.
[NEXT SLIDE]
The fifth column is the spinlock depth. It shows how many kernel spinlocks were held at the time the trace was
captured. This is interesting when troubleshooting spinlock related issues.
[NEXT SLIDE]
The sixth column is the traced function. This corresponds to the function names that we've put on our traced
function list. Since the output reflects ktracer light, the functions on the list include syscall, resume_cleanup, and
swtch. We also had other functions on the list, such as panic and assertion fail, but those were not called during
this workload. Another column named "Caller" is not shown here but is often of interest, as well.
[NEXT SLIDE]
The seventh column is the elapsed microseconds since the previous trace. Now, there's another column, the absolute seconds, which shows the time in seconds since boot, and that timestamp for each trace is used in order to calculate the elapsed time, which is shown here. The elapsed time is the time since the previous trace, taken by subtracting the previous trace's timestamp from the current trace's timestamp. I'd like to make a point of this because the timestamps are taken at the function entry point, so the time between a resume_cleanup trace, for instance, and a swtch trace is the elapsed time from when resume_cleanup was entered to when swtch was entered. Ktracer isn't specifically showing how much time is spent within a function; if you want to measure time from function entry to exit, you would need to have a function exit point be one of the trace points that ktracer is aware of.
[NEXT SLIDE]
The eighth column is the PSR, that's the Processor Status Register, column. On PA, it's the Processor Status Word.
The entire PSR or PSW is captured but only five characters are printed based on screen-width constraint. The
letters pCPDI reflect bits in the PSR, where p is PMU enable (PSR.pp), C is interrupt collection enable (PSR.ic), P is protection key enable (PSR.pk), D is data translation (PSR.dt), and I is interrupt enable (PSR.i). I personally find the I bit to be most commonly of interest. The capital letter I will show if it's on, and an underscore will show if it's off.
So if you look at trace six, for resume_cleanup there's an underscore at the end of the PSR column; that means
interrupts were off when resume_cleanup was entered, and then for the next trace, interrupts were turned back on
by the time the swtch function was entered.
[NEXT SLIDE]
The final column in this output is column nine. It's the symbolic argument 0. Arg 0 is captured in numeric format,
as I mentioned. You can show just the arg0 column to get the numeric version, but symbolic argument 0 shows a
translation of syscall number into syscall name. Where this slide shows syscall as the traced function, SymArg0 is
siginhibit, sigenable, or wait, for instance. Where the traced function is resume_cleanup, SymArg0 is numeric
because ktracedump found no interpretation for hexadecimal constant 0x10.
[NEXT SLIDE]
That completes the explanation of all the columns that are shown in the trace header by default. Now I'll show you just the trace records that we've been looking at, and do some further analysis on them. PID 28231 made six syscalls, and then it switched out, and then switched back in and made three more syscalls. If we look at the elapsed time, the elapsed time is 436 microseconds for that switch, and that is more time than all the syscalls put together. I would also comment that both of the processes, 28231 and 28234, made other kernel function calls while they were running, but only syscall and resume_cleanup and the handful of functions on the light function list were traced, so we don't see the other function calls.
The function histogram section on the slide shows you how to list which functions were called how often, by running an egrep command to show just the lines with trace records (they start with T), printing the function name (it's in column 6), sorting by function name, and counting how many trace records were found for each function. We can see that there were two calls to resume_cleanup, one to swtch, and nine syscalls.
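A pipeline along these lines produces that histogram; this is a sketch of the slide's command, assuming the report file is named ktrace.out:

    # Trace-record lines start with T; the traced function name is field 6.
    egrep '^T' ktrace.out | awk '{print $6}' | sort | uniq -c | sort -rn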
[NEXT SLIDE]
So as I mentioned in the last slide, the elapsed time is very large for process switching. HP-UX 11i v3 includes
Project Mercury, which provides application developers with a way to reduce inopportune process switching,
and to promote faster performance of critical regions of userspace code. Please view the Project Mercury
presentation at http://hp.com/go/knowledgeondemand for details.
[NEXT SLIDE]
So that completes my description of ktracedump sample output.
Now I'll go into the internals of ktracer: buffers, trace points, and its performance.
[NEXT SLIDE]
So how does kernel tracing work? Well, it has a set of trace buffers. Each CPU has its own. They're allocated
from kernel memory. Each trace buffer can hold 2,048 traces by default, and after that many procedure calls
are traced, then the buffer becomes full. If you want trace buffers to be larger, ktracer -A followed by the new number of traces can be used to increase the number of traces that each CPU can store. There is a system-wide safeguard of 6 million traces. That's not per CPU, that's 6 million system-wide; since each trace takes up about 160 bytes, the safeguard caps kernel trace buffers at roughly one gigabyte of space. You can use adb to change max_ntraces to eliminate the safeguard and take up as much kernel memory as requested and as available.
Use at your own risk. There's only one trace buffer per CPU, and if you choose to clear the buffers or free ktracer
or overwrite the traces by further activity, then ktracedump can no longer extract the old trace records. It is
important to understand that when you increase the number of traces with ktracer -A, the impact is not just greater
kernel memory usage, but also greater disk storage requirements for the ktracedump output, and longer
processing time to write the files and to read them and to search through the output.
Ktracer -A does not cause ktracer to take any longer to collect each trace.
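A quick sketch of enlarging the buffers; the count here is just an example value:

    # Give each CPU room for 100,000 trace records instead of the default 2,048.
    # At roughly 160 bytes per trace, that is about 16 MB of kernel memory per CPU.
    ktracer -A 100000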
[NEXT SLIDE]
Here's a diagram to represent trace buffers. On the left is the mpinfo struct, one per CPU. Each mpinfo has a field called trace header that points to the top of the trace buffer. Each trace buffer has the 2,048 trace records by default. On the right is the operating system, which has the kernel tracing code that writes a trace to the next slot. The trace buffers use cell-local memory because it's tied more closely to the CPU, and that makes the
performance faster for ktracer.
[NEXT SLIDE]
The trace buffers can be managed as circular or linear. Circular we've discussed: when a trace buffer becomes
full, ktracer wraps around and replaces the oldest trace in sequence and the oldest traces are lost. The other
method for managing traces is linear: when a trace buffer becomes full no more traces are collected for that CPU.
This means the newest traces are lost. We call it trace buffer overflow when traces are lost. Each CPU fills its trace buffer at the rate that traced procedure calls are made, and this rate varies; one CPU may be making more traced function calls than another. Because the buffers are per-CPU, if you sort across CPUs or look at the sequence numbers, you will notice that with circular buffers the CPUs that were executing traced function calls more rapidly have only the newer timestamps.
[NEXT SLIDE]
Here I cover trace timing. The timestamp in each trace record is taken from the I-timer -- that's the interval timer on the CPU -- and then it's adjusted by the time-of-day offset versus CPU 0. I've mentioned before that ktracedump -S will sort the traces by timestamp -- by absolute seconds or absolute time, or even by sequence number -- across all CPUs. I will warn you: if there's excessive clock drift on the machine, it is going to throw off the sort order. Sorting by timestamp across CPUs will also make it apparent, as you look at the output, that trace record loss occurs on a per-CPU basis. To avoid trace buffer overflow, which is important to do, you can increase the number of trace records, or you can stop tracing the most frequently called procedures, for example with -r and then the number of procedures you want to eliminate. You can trace for a shorter timeframe, or you can trace only a single process. Using ktracer -r to remove the most frequently called procedures is particularly useful. I've seen the function intr_strobe_clear_idle get called a million times a second, and if you trace it, you can fill your entire trace buffers just with intr_strobe_clear_idle. If you remove that, then you can eliminate the noise and get a better signal on other functions that are being called.
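For example, those overflow-avoidance steps could be combined like this; the counts and the PID are arbitrary examples:

    ktracer -A 50000   # give each CPU room for more trace records
    ktracer -r 3       # stop tracing the 3 most frequently traced functions
    ktracer -p 1234    # or restrict tracing to a single process of interest
    ktracer -B         # begin tracing again after the changes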
[NEXT SLIDE]
Here's a diagram of trace points in kernel text. On the left is the kernel text; on the right is the trace function list. The kernel text contains the kernel tracing code that collects the trace, then branches to the function's initial instructions. To understand that a little better: in the kernel text there are tens of thousands of kernel functions, and two examples are swtch and resume. When swtch and resume are added to the trace function list, the first instruction of swtch gets overwritten by a branch to the trace code. On PA, it's just the first instruction. On IPF, it's actually the first bundle of three instructions. The initial instructions need to be copied and saved so that they can be executed after the trace code is complete, and so those initial instructions are saved in the trace function list. When swtch gets executed, it branches to the trace code, collects the trace, then branches to the trace function list, executes the initial instructions, and then resumes the normal path of execution by a branch to swtch plus an offset. Because ktracer actually modifies the code of traced functions, we do caution you that this is not to be used on a production system; it's a lab-contributed tool. It's useful for kernel developers doing debugging, testing, and troubleshooting.
[NEXT SLIDE]
The resiliency of kernel code supporting ktracer has improved dramatically as HP-UX releases have progressed.
It's just not guaranteed.
With this slide I'll tell you the details of ktracer performance overhead. There is no cost when it is turned off.
There is no cost when there are no selected functions. The performance cost to run ktracer depends directly on
the frequency of calls to functions that are traced. So function calls that are not traced add no performance cost.
Each time a function call gets traced, that's where the cost exists. Optimally, the time it takes to collect a trace on
PA is 110 machine cycles, and on IPF it is 90 cycles. This IPF measurement is about ten times faster than it was
in 11.23, 11i v2. Because the cost is per function call made, and some function calls are made much more
frequently than others, it is faster to trace hundreds of functions that are infrequent than to trace one single
function that is intensively called. Let me illustrate that point. I'll use intr_strobe_clear_idle as an example of a frequent function, and clock_int as an example of an infrequent one. Clock_int is called 100 times per second on PA, and I'll take a one-gigahertz box as an example. So tracing clock_int for a second costs 100 traces multiplied by 110 cycles per trace, divided by 1 gigacycle per second, so it takes 11 microseconds to capture those 100 clock_int traces. 11 microseconds is just a fraction of a percent of the one-second interval.
But on the other hand, we have the frequently called intr_strobe_clear_idle function, which is called 500 times per millisecond (rather than per second) on a 1.6-gigahertz Itanium box. So tracing it for one millisecond costs 500 traces multiplied by 90 cycles, divided by 1.6 gigacycles per second, which is about 28 microseconds. That's 2.8% of the 1-millisecond interval, and that is more costly for one millisecond of tracing intr_strobe_clear_idle than for 1,000 times as long of tracing clock_int. So with 0.0011% overhead, clock_int can go on the light function list, but with 2.8% overhead, intr_strobe_clear_idle should not be one of those functions. It's far from light and should be removed from the function list unless specifically needed. As you use ktracer, you will find that strategically selecting the right functions and eliminating the wrong ones is a key strategy for effective tracing.
[NEXT SLIDE]
So now that we've covered the trace points and trace buffers and trace records, I'd like to describe a case study
of how ktracer was used for a real life example.
[NEXT SLIDE]
This section of my presentation goes into a case study on a performance slowdown. Before I describe the
slowdown, I'll explain that the HP-UX development community has invested a great deal of effort in order to make 11.31 OS performance much faster than 11.23: 30% faster on average, and many times faster for some
multithreaded workloads. This case study is one example where effort was needed and was applied in order to
improve performance. The HP team who runs the SAS benchmark reported that after they upgraded the OS, the
benchmark performance regressed by 18%. This regression occurred on the same physical machine, with the
same benchmark, the same physical disks, and the same database. The only thing that differed was the root
disk: one had HP-UX 11.23, the other had 11.31. This kind of regression is unacceptable. My team was called
to work with the SAS team to figure out why the performance was worse, so we could do something to fix it.
The 11.31 OS root disk provided new commands, libraries, and a new kernel. We needed to narrow down which of these components was taking more time than before. The time command indicated the benchmark took 28 more seconds in user mode and 3 more seconds in system mode. We ran some user-level tools like Caliper and sar to see where the difference was, and sar showed that the disk writes took much longer, with a much longer disk average wait and average queue. We also ran spinwatcher, a tool that does profiling, spinlock monitoring, and system-wide measurements. It identified more user time in SAS and more kernel time in idle, and it showed that most of the time in both the new and old runs was spent in the processes SAS and vxfsd, which is the file system daemon. Most of the kernel CPU time went to vx_inode_free_list.
So, what changed in the OS relating to disk writes? Well, HP-UX rolled from VXFS 3.5 to 4.1, and we deployed a unified file cache, changed the disk path naming, and made many more improvements and changes. Since there's a regression, we have to ask: well, who should own this problem? Where exactly is the slowdown?
[NEXT SLIDE]
So ktracer to the rescue! It's a great tool for troubleshooting kernel performance slowdowns. The slowdown during SAS is apparently related to disk writes and vxfsd. To get greater insight, we decided to use ktracer to
trace the kernel activity of the vxfsd process. ps -ef shows us it has PID 59. Then we chose to trace all the functions in the VXFS module using the -m vxfs option, then ran ktracer -B as SAS began, then ran ktracer -h as it
ended. In this way we traced what the kernel does while vxfsd is running. When we dumped the traces and
analyzed them, we discovered that most of the vxfsd kernel activity involved calls to the functions named
vx_recsmp_rangelock and vx_recsmp_rangeunlock. So, locking appeared to be at issue.
We decided to get more detail by modifying ktracer setup to trace just these range lock and unlock function calls,
and to trace all the processes that might grab this range lock. To do so, we ran ktracer -z to zero, that is, empty the traced function list, so that we were no longer tracing the entire VXFS module. Then we used the -a option twice to add the two range lock and unlock functions of interest. Option -p 0 reverted to tracing all processes instead of just PID 59.
[NEXT SLIDE]

Now that we've changed the parameters for ktracer to trace just two functions and trace all processes instead of one, we need to run ktracer again. We started ktracer right as the last phase of SAS began and shut it off before it ended. We had learned that the SAS benchmark has two phases: the first phase ran for roughly 42 minutes, and in both cases phase one took about the same length, while the last phase runs for just three minutes yet contributes 90% to the overall result of the benchmark, and it was that second phase which had the slowdown with the new OS versus the old one. So we focused on the last phase of the SAS benchmark. When starting ktracer again to trace all processes and just those two lock functions for the last phase, we wanted to run ktracer -Z to zero out all the existing traces in the trace buffers and start over. So -Z did that, -B to begin, and we let the benchmark's final phase run. We ran ktracer -h right before the benchmark ended, or just a couple of minutes into the phase, to make sure we were not tracing things that are not part of the benchmark; we halted before it ended. When I invoked ktracedump, I now knew specific columns that I was interested in that weren't part of the default dump output. I wanted to see the timestamps rather than just the elapsed time, so I turned on absolute seconds (AbsSec), and I turned on the Caller column to see which functions were calling the lock and unlock functions. I included -F to list the functions that were being traced; I find it helpful to see in the ktracedump output that this was the run where just the recsmp range lock and unlock functions were traced.
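For reference, the trace setup just described corresponds roughly to this command sequence; the ktracedump column arguments in particular are my assumption about the syntax:

    ktracer -z                        # empty the traced function list
    ktracer -a vx_recsmp_rangelock    # add the range lock function
    ktracer -a vx_recsmp_rangeunlock  # add the range unlock function
    ktracer -p 0                      # trace all processes again, not just PID 59
    ktracer -Z -B                     # zero out old traces and begin as the last phase starts
    # ... let the final SAS phase run for a couple of minutes ...
    ktracer -h                        # halt before the benchmark ends
    ktracedump -J AbsSec -J Caller -F > ktrace.out   # add AbsSec and Caller columns; -F lists the traced functions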
[NEXT SLIDE]
After we got the ktracedump output, my colleague and I both wrote scripts that parsed that output and calculated
the lock hold duration. We used the absolute second difference in order to get that lock hold duration. The
elapsed time would show the lock hold if it was a lock followed immediately by an unlock, and the elapsed time
was useful, but if there was a nested lock, then we couldn't just use the relative elapsed time. We needed the
absolute time. And we keyed on arg0 for the lock -- the recsmp range lock function's arg0 is a vx_inode address -- and we wanted to use arg0 to uniquely identify which lock was being acquired. As we analyzed the output with the scripts, we found that some of the range unlocks in the middle of the run had no matching lock. Now, if there were unlocks at the beginning of the run, at the beginning of the ktracer output, I would understand that perhaps the lock had already been acquired before ktracer was turned on, but when we found range unlocks well into the run that had no matching lock, that didn't make sense. So we went and looked more at the kernel and the list of functions to find out whether there are other functions that might have grabbed this lock that was getting unlocked without a matching lock. And we found there's a vx_recsmp_rangetrylock function that can acquire the lock as well, so we added that to the list of functions to trace, and then we needed to repeat the ktracer run. As the beginning of the slide says, "Run ktracer iteratively to gain insight." That's typically how it's done. Ktracer gives you some insight into a problem, and you can figure out which direction to head. Then you rerun it with different arguments, different time lengths, different function lists, a single process or all processes, and get closer and closer until you find the problem's source.
So how are we going to isolate the file system performance regression? We are going to trace all processes, we
will trace all the matching recsmp lock and unlock functions, including trylock. After adding the range trylock
function, we capture trace data once again during the last phase of SAS. When we did that and ran
ktracedump, we were able to accurately identify the long hold times for the lock. We wanted to figure out which
object was being locked for a long time, and so we mapped the lock's arg0 into an inode and then ran a P4 debugging function to get its pathname. We found that pathname lookup only works for names that are still in
the directory name lookup cache. Ktracer when it runs just captures the ARG 0 value, but it doesn't capture the
pathname that that inode points to, so if it's still cached, you can find it after the benchmark ends by running the
commands, but you do have to be aware that if it's no longer in the cache, that data will no longer be available.
There are ways to program ktracer to capture more data than just numeric ARG 0, and that could have been
useful in this case. But we did have enough data to go on using the methods above to identify the long hold
times with the range lock, and we sent the ktracer data and a problem report to the owner of VXFS. They were
able to look at the trace data, figure out where the problem was, and issue a support bulletin. The web link for
the support bulletin is listed here: http://support.veritas.com/docs/290636.htm. A one-liner to describe it is that the max_diskq tunable was only 1 megabyte and needed to be much larger in order to allow full utilization of
the disk. vxtunefs is a command which can be used to change tunables on a per file system basis, and the
recommendation from Veritas is to increase the max_diskq tunable to 2048 megabytes. Without ktracer, we
would not have known nearly so quickly, if at all, what the cause of that performance slowdown was.
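As an illustration of the kind of post-processing we did, here is a sketch of my own rather than the actual script; the field positions FUNC, ABSSEC, and ARG0 are assumptions about this particular dump format:

    # Pair each range lock (or trylock) with the next unlock on the same arg0 (the vx_inode
    # address identifying the lock) and print the hold time computed from AbsSec timestamps.
    egrep '^T' ktrace.out | awk -v FUNC=6 -v ABSSEC=7 -v ARG0=9 '
        $FUNC ~ /vx_recsmp_range(try)?lock$/ { locktime[$ARG0] = $ABSSEC }
        $FUNC == "vx_recsmp_rangeunlock" {
            if ($ARG0 in locktime) { print $ARG0, $ABSSEC - locktime[$ARG0]; delete locktime[$ARG0] }
            else                   { print $ARG0, "unlock-with-no-matching-lock" }
        }' | sort -k2 -rn | head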

[NEXT SLIDE]
Additional features: inserting trace points anywhere. This is a feature, if you own kernel source code, that you can use to create your own trace points. If you want to trace something that's in the middle of a function, or if you want to trace parameters that aren't captured by the first four arguments of an existing function call, you can add a function call, and we have a generic one, kt_dbg, in the kernel. It exists on 11.23 and 11.31. kt_dbg basically just captures its arguments and returns immediately without doing any other work, so you can add a call to kt_dbg and pass it the parameters that you want traced. You might want to see a line number, or an error code, or a buffer, or a time. So, as the example shows, after you've added the kt_dbg calls and built your kernel, you can run ktracer -a and supply the kt_dbg function name, and then run your workload. When you run ktracedump, pass it -a to show all the argument values, ARG 0 through 3, and look in the ARG 0-3 columns of the trace output to see the values that you wanted captured. You can use this function in place of a printf of 0 to 4 scalar variables. You have these benefits over printf: it has lower runtime performance cost; you don't flood the console with the same message over and over -- you can turn the message on and off with ktracer; it's searchable; you can align the column output; and you can adjust it with that percent format to be whatever printf string you want, on the fly and after the fact, without building a new kernel. This kt_dbg insertion can also be used in the case where you can't trace leaf functions but want to: if you add a call to kt_dbg, then it's not a leaf function anymore.
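As a rough sketch of that workflow (the kernel-source change is shown only as a comment, and the argument names and workload are hypothetical):

    # In your kernel source (hypothetical): insert a call such as
    #     kt_dbg(__LINE__, error_code, (uint64_t)bufp, elapsed_usec);
    # at the point of interest, then rebuild and boot that kernel.

    ktracer -a kt_dbg             # add kt_dbg to the trace function list
    ktracer -B                    # begin tracing
    ./my_workload                 # exercise the code path containing the kt_dbg call
    ktracer -h                    # halt tracing
    ktracedump -a > ktrace.out    # show the argument values; check the ARG 0-3 columns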
[NEXT SLIDE]
Embedded calls. There's a ktracer API that allows programmatic control over when ktracer's activated, so you
can use that to start and stop data collection from within a program, or start and stop it when a user-defined
condition is achieved. This can reduce overhead during the rest of the run, and it can prevent losing interesting traces by keeping only a small number of traces in the buffers.
[NEXT SLIDE]
Conclusion: In summary, kernel tracing on HP-UX is fast; it's built-in; it helps engineers to understand OS
behavior on HP-UX. It helps to troubleshoot problems, whether functional or performance in nature, by capturing
data about the kernel's procedure calls, and it captures kernel variables. There are quick-start and do-it-all options, so it's easy to invoke. UNIX internals knowledge and source access to HP-UX are most helpful in order to make sense of the output ktracedump produces. Selecting which functions to trace is strategic to solving problems, to reducing trace buffer overflow, and to keeping the performance overhead of ktracer low.
[NEXT SLIDE]
If you're interested in kernel tracing, you may also like kernel profiling, spinwatcher, and an incredibly fast and
defect-free operating system.
[NEXT SLIDE]
For more information, you can contact me through e-mail. You can go to the Caliper website to download the
ktracer and ktracedump facilities. For general information, go to Google.com. I hope you found this
presentation interesting and informative. Thank you for your attention.

For more information:


www.hp.com/go/knowledgeondemand

2007 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without
notice. The only warranties for HP products and services are set forth in the express warranty statements
accompanying such products and services. Nothing herein should be construed as constituting an additional
warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
