
HP-UX Performance and Tuning
H4262S C.00
HP Training
Student Guide

Copyright 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. This is an HP copyrighted work that may not be reproduced without the written permission of HP. You may not use these materials to deliver training to any person outside of your organization without the written permission of HP.

UNIX is a registered trademark of the Open Group.

Printed in the USA

HP-UX Performance and Tuning Student Guide
May 2004

Contents

Overview

Module 1 Introduction
1-1. SLIDE: Welcome to HP-UX Performance and Tuning
1-2. SLIDE: Course Outline
1-3. SLIDE: System Performance
1-4. SLIDE: Areas of Performance Problems
1-5. SLIDE: Performance Bottlenecks
1-6. SLIDE: Baseline
1-7. SLIDE: Queuing Theory of Performance
1-8. SLIDE: How Long Is the Line?
1-9. SLIDE: Example of Queuing Theory
1-10. SLIDE: Summary
1-11. LAB: Establishing a Baseline
1-12. LAB: Verifying the Performance Queuing Theory

Module 2 Performance Tools
2-1. SLIDE: HP-UX Performance Tools
2-2. SLIDE: HP-UX Performance Tools (Continued)
2-3. SLIDE: Sources of Tools
2-4. SLIDE: Types of Tools
2-5. SLIDE: Criteria for Comparing the Tools
2-6. SLIDE: Data Sources
2-7. SLIDE: Performance Monitoring Tools (Standard UNIX)
2-8. TEXT PAGE: iostat
2-9. TEXT PAGE: ps
2-10. TEXT PAGE: sar
2-11. TEXT PAGE: time, timex
2-12. TEXT PAGE: top
2-13. TEXT PAGE: uptime, w
2-14. TEXT PAGE: vmstat
2-15. SLIDE: Performance Monitoring Tools (HP-Specific)
2-16. TEXT PAGE: glance
2-17. TEXT PAGE: gpm
2-18. TEXT PAGE: xload
2-19. SLIDE: Data Collection Performance Tools (Standard UNIX)
2-20. TEXT PAGE: acct Programs
2-21. TEXT PAGE: sar
2-22. SLIDE: Data Collection Performance Tools (HP-Specific)
2-23. TEXT PAGE: MeasureWare/OVPA and DSI Software
2-24. TEXT PAGE: PerfView/OVPM
2-25. SLIDE: Network Performance Tools (Standard UNIX)
2-26. TEXT PAGE: netstat
2-27. TEXT PAGE: nfsstat
2-28. TEXT PAGE: ping
2-29. SLIDE: Network Performance Tools (HP-Specific)
2-30. TEXT PAGE: lanadmin


2-31. TEXT PAGE: lanscan
2-32. TEXT PAGE: nettune (HP-UX 10.x Only)
2-33. TEXT PAGE: ndd (HP-UX 11.x Only)
2-34. TEXT PAGE: NetMetrix (HP-UX 10.20 and 11.0 Only)
2-35. SLIDE: Performance Administrative Tools (Standard UNIX)
2-36. TEXT PAGE: ipcs, ipcrm
2-37. TEXT PAGE: nice, renice
2-38. SLIDE: Performance Administrative Tools (HP-Specific)
2-39. TEXT PAGE: getprivgrp, setprivgrp
2-40. TEXT PAGE: rtprio
2-41. TEXT PAGE: rtsched
2-42. TEXT PAGE: scsictl
2-43. TEXT PAGE: serialize
2-44. TEXT PAGE: fsadm
2-45. TEXT PAGE: getext, setext
2-46. TEXT PAGE: newfs, tunefs, vxtunefs
2-47. TEXT PAGE: Process Resource Manager (PRM)
2-48. TEXT PAGE: Work Load Manager (WLM)
2-49. TEXT PAGE: Web Quality of Service (WebQoS)
2-50. SLIDE: System Configuration and Utilization Information (Standard UNIX)
2-51. TEXT PAGE: bdf, df
2-52. TEXT PAGE: mount
2-53. SLIDE: System Configuration and Utilization Information (HP-Specific)
2-54. TEXT PAGE: diskinfo
2-55. TEXT PAGE: dmesg
2-56. TEXT PAGE: ioscan
2-57. TEXT PAGE: vgdisplay, pvdisplay, lvdisplay
2-58. TEXT PAGE: swapinfo
2-59. TEXT PAGE: sysdef
2-60. TEXT PAGE: kmtune, kcweb
2-61. SLIDE: Application Profiling and Monitoring Tools (Standard UNIX)
2-62. TEXT PAGE: prof, gprof
2-63. TEXT PAGE: Application Response Measurement (ARM) Library Routines
2-64. SLIDE: Application Profiling and Monitoring Tools (HP-Specific)
2-65. TEXT PAGE: Transaction Tracker
2-66. TEXT PAGE: caliper HP Performance Analyzer
2-67. SLIDE: Summary
2-68. LAB: Performance Tools Lab

Module 3 GlancePlus
3-1. SLIDE: This Is GlancePlus
3-2. SLIDE: GlancePlus Pak Overview
3-3. SLIDE: gpm and glance
3-4. SLIDE: glance - The Character Mode Interface
3-5. SLIDE: Looking at a glance Screen
3-6. SLIDE: gpm - The Graphical User Interface
3-7. SLIDE: Process Information
3-8. SLIDE: Adviser Components
3-9. SLIDE: adviser Bottleneck Syntax Example
3-10. SLIDE: The parm File


3-11. SLIDE: GlancePlus Data Flow
3-12. SLIDE: Key GlancePlus Usage Tips
3-13. SLIDE: Global, Application, and Process Data
3-14. SLIDE: Can't Solve What's Not a Problem
3-15. SLIDE: Metrics: "No Answers without Data"
3-16. SLIDE: Summary
3-17. SLIDE: HP GlancePlus Guided Tour
3-18. LAB: gpm and glance Walk-Through

Module 4 Process Management
4-1. SLIDE: The HP-UX Operating System
4-2. SLIDE: Virtual Address Process Space (PA-RISC)
4-3. SLIDE: Virtual Address Process Space (IA-64)
4-4. SLIDE: Physical Process Components
4-5. SLIDE: The Life Cycle of a Process
4-6. SLIDE: Process States
4-7. SLIDE: CPU Scheduler
4-8. SLIDE: Context Switching
4-9. SLIDE: Priority Queues
4-10. SLIDE: Nice Values
4-11. SLIDE: Parent-Child Process Relationship
4-12. SLIDE: glance Process List
4-13. SLIDE: glance Individual Process
4-14. SLIDE: glance Process Memory Regions
4-15. SLIDE: glance Process Wait States
4-16. LAB: Process Management

Module 5 CPU Management
5-1. SLIDE: Processor Module
5-2. SLIDE: Symmetric Multiprocessing
5-3. SLIDE: Cell Module
5-4. SLIDE: Multi-Cell Processing
5-5. SLIDE: CPU Processor
5-6. SLIDE: CPU Cache
5-7. SLIDE: TLB Cache
5-8. SLIDE: TLB, Cache, and Memory
5-9. SLIDE: HP-UX Performance Optimized Page Sizes
5-10. SLIDE: CPU Metrics to Monitor - Systemwide
5-11. SLIDE: CPU Metrics to Monitor - per Process
5-12. SLIDE: Activities that Utilize the CPU
5-13. SLIDE: glance CPU Report
5-14. SLIDE: glance CPU by Processor
5-15. SLIDE: glance Individual Process
5-16. SLIDE: glance Global System Calls
5-17. SLIDE: glance System Calls by Process
5-18. SLIDE: sar Command
5-19. SLIDE: timex Command
5-20. SLIDE: Tuning a CPU-Bound System - Hardware Solutions
5-21. SLIDE: Tuning a CPU-Bound System - Software Solutions
5-22. SLIDE: CPU Utilization and MP Systems


5-23. SLIDE: Processor Affinity
5-24. LAB: CPU Utilization, System Calls, and Context Switches
5-25. LAB: Identifying CPU Bottlenecks

Module 6 Memory Management
6-1. SLIDE: Memory Management
6-2. SLIDE: Memory Management - Paging
6-3. SLIDE: Paging and Process Deactivation
6-4. SLIDE: The Buffer Cache
6-5. SLIDE: The syncer Daemon
6-6. SLIDE: IPC Memory Allocation
6-7. SLIDE: Memory Metrics to Monitor - Systemwide
6-8. SLIDE: Memory Metrics to Monitor - per Process
6-9. SLIDE: Memory Monitoring - vmstat Output
6-10. SLIDE: Memory Monitoring - glance Memory Report
6-11. SLIDE: Memory Monitoring - glance Process List
6-12. SLIDE: Memory Monitoring - glance Individual Process
6-13. SLIDE: Memory Monitoring - glance System Tables
6-14. SLIDE: Tuning a Memory-Bound System - Hardware Solutions
6-15. SLIDE: Tuning a Memory-Bound System - Software Solutions
6-16. SLIDE: PA-RISC Access Control
6-17. SLIDE: The serialize Command
6-18. LAB: Memory Leaks

Module 7 Swap Space Performance
7-1. SLIDE: Swap Space Management - Simple View
7-2. SLIDE: Swap Space - After a New Process Executes
7-3. SLIDE: The swapinfo Command
7-4. SLIDE: Swap Space Management - Realistic View
7-5. SLIDE: Swap Space - After a New Process Executes
7-6. SLIDE: Swap Space - When Memory Equals Data Swapped
7-7. SLIDE: Swap Space - When Swap Space Fills Up
7-8. SLIDE: Pseudo Swap
7-9. SLIDE: Total Swap Space Calculation with Pseudo Swap
7-10. SLIDE: Example Situation Using Pseudo Swap
7-11. SLIDE: Swap Priorities
7-12. SLIDE: Swap Chunks
7-13. SLIDE: Swap Space Parameters
7-14. SLIDE: Summary
7-15. LAB: Monitoring Swap Space

Module 8 Disk Performance Issues
8-1. SLIDE: Disk Overview
8-2. SLIDE: Disk I/O - Read Data Flow
8-3. SLIDE: Disk I/O - Write Data Flow (Synchronous)
8-4. SLIDE: Disk Metrics to Monitor - Systemwide
8-5. SLIDE: Disk Metrics to Monitor - per Process
8-6. SLIDE: Activities that Create a Large Amount of Disk I/O
8-7. SLIDE: Disk I/O Monitoring - sar -d Output
8-8. SLIDE: Disk I/O Monitoring - sar -b Output
8-9. SLIDE: Disk I/O Monitoring - glance Disk Report


8-10. SLIDE: Disk I/O Monitoring - glance Disk Device I/O
8-11. SLIDE: Disk I/O Monitoring - glance Logical Volume I/O
8-12. SLIDE: Disk I/O Monitoring - glance System Calls per Process
8-13. SLIDE: Tuning a Disk I/O-Bound System - Hardware Solutions
8-14. SLIDE: Tuning a Disk I/O-Bound System - Perform Asynchronous Meta-data I/O
8-15. SLIDE: Tuning a Disk I/O-Bound System - Load Balance across Disk Controllers
8-16. SLIDE: Tuning a Disk I/O-Bound System - Load Balance across Disk Drives
8-17. SLIDE: Tuning a Disk I/O-Bound System - Tune Buffer Cache
8-18. LAB: Disk Performance Issues

Module 9 HFS File System Performance
9-1. SLIDE: HFS File System Overview
9-2. SLIDE: Inode Structure
9-3. SLIDE: Inode Data Block Pointers
9-4. SLIDE: How Many Logical I/Os Does It Take to Access /etc/passwd?
9-5. SLIDE: File System Blocks and Fragments
9-6. SLIDE: Creating a New File on a Full File System
9-7. SLIDE: HFS Metrics to Monitor - Systemwide
9-8. SLIDE: Activities that Create a Large Amount of File System I/O
9-9. SLIDE: HFS I/O Monitoring - bdf Output
9-10. SLIDE: HFS I/O Monitoring - glance File System I/O
9-11. SLIDE: HFS I/O Monitoring - glance File Opens per Process
9-12. SLIDE: Tuning a HFS I/O-Bound System - Tune Configuration for Workload
9-13. SLIDE: Tuning a HFS I/O-Bound System - Use Fast Links
9-14. LAB: HFS Performance Issues

Module 10 VxFS Performance Issues
10-1. SLIDE: Objectives
10-2. SLIDE: JFS History and Version Review
10-3. SLIDE: JFS Extents
10-4. SLIDE: Extent Allocation Policies
10-5. SLIDE: JFS Intent Log
10-6. SLIDE: Intent Log Data Flow
10-7. SLIDE: Understand Your I/O Workload
10-8. SLIDE: Performance Parameters
10-9. SLIDE: Choosing a Block Size
10-10. SLIDE: Choosing an Intent Log Size
10-11. SLIDE: Intent Log Mount Options
10-12. SLIDE: Other JFS Mount Options
10-13. SLIDE: JFS Mount Option: mincache=direct
10-14. SLIDE: JFS Mount Option: mincache=tmpcache
10-15. SLIDE: Kernel Tunables
10-16. SLIDE: Fragmentation
10-17. TEXT PAGE: Monitoring and Repairing File Fragmentation
10-18. SLIDE: Using setext
10-19. SLIDE: I/O Tunable Parameters


10-20. SLIDE: vxtunefs Command for Tuning VxFS
10-21. SLIDE: /etc/vx/tunefstab Configuration
10-22. SLIDE: Taking Snapshots and Performance
10-23. LAB: JFS File System Tuning

Module 11 Network Performance
11-1. SLIDE: The OSI Model
11-2. SLIDE: NFS Read/Write Data Flow
11-3. SLIDE: NFS on HP-UX with UDP
11-4. SLIDE: NFS on HP-UX with TCP
11-5. SLIDE: biod on Client
11-6. SLIDE: TELNET
11-7. SLIDE: FTP
11-8. SLIDE: Metrics to Monitor - NFS
11-9. SLIDE: Metrics to Monitor - Network
11-10. SLIDE: Determining the NFS Workload
11-11. SLIDE: NFS Monitoring - nfsstat Output
11-12. SLIDE: Network Monitoring - lanadmin Output
11-13. SLIDE: Network Monitoring - netstat -i Output
11-14. SLIDE: glance NFS Report
11-15. SLIDE: glance NFS System Report
11-16. SLIDE: glance Network by Interface Report
11-17. SLIDE: Tuning NFS
11-18. SLIDE: Tuning the Network
11-19. SLIDE: Tuning the Network (Continued)
11-20. LAB: Network Performance

Module 12 Tunable Kernel Parameters
12-1. SLIDE: Kernel Parameter Classes
12-2. SLIDE: Tuning the Kernel
12-3. SLIDE: Kernel Parameter Categories
12-4. SLIDE: File System Kernel Parameters
12-5. SLIDE: Message Queue Kernel Parameters
12-6. SLIDE: Semaphore Kernel Parameters
12-7. SLIDE: Shared Memory Kernel Parameters
12-8. SLIDE: Process-Related Kernel Parameters
12-9. SLIDE: Memory-Related Kernel Parameters
12-10. SLIDE: LVM-Related Kernel Parameters
12-11. SLIDE: Networking-Related Kernel Parameters
12-12. SLIDE: Miscellaneous Kernel Parameters

Module 13 Putting It All Together
13-1. SLIDE: Review of Bottleneck Characteristics
13-2. SLIDE: Performance Monitoring Flowchart
13-3. SLIDE: Review - Memory Bottlenecks
13-4. SLIDE: Correcting Memory Bottlenecks
13-5. SLIDE: Review - Disk Bottlenecks
13-6. SLIDE: Correcting Disk Bottlenecks
13-7. SLIDE: Review - CPU Bottlenecks
13-8. SLIDE: Correcting CPU Bottlenecks
13-9. SLIDE: Final Review - Major Symptoms


Appendix A Applying GlancePlus Data
A-1. TEXT PAGE: Case Studies Using GlancePlus

Solutions



Overview
Course Description
This course introduces students to the various aspects of monitoring and tuning their systems. Students are taught which tools to use for monitoring, which symptoms to look for, and what remedial actions to take. The course also covers HP GlancePlus/Gpm and HP PerfRx. The course is designed to:

- Introduce the subject of performance and tuning.
- Describe how the system works.
- Identify what tools we can use to look at performance.
- Identify the symptoms we may encounter and what they indicate.

Course Goals
- To educate students on HP-UX performance monitoring
- To enable them to identify bottlenecks and potential problems
- To learn the appropriate remedial actions to take

Student Performance Objectives


Module 1 Introduction

- List characteristics of a system yielding good user response time.
- List characteristics of a system yielding high data throughput.
- List three generic areas most often analyzed for performance.
- List the four most common bottlenecks on a system.

Module 2 Performance Tools

- Identify various performance tools available on HP-UX.
- Categorize each tool as either real time or data collection.
- List the major features of the performance tools.
- Compare and contrast the differences between the tools.

Module 3 GlancePlus

- Compare GlancePlus with other performance monitoring/management tools.
- Start up the GlancePlus terminal interface (glance) and graphical user interface (gpm).


Module 4 Process Management

- Describe the components of a process.
- Describe how a process executes, and identify its process states.
- Describe the CPU scheduler.
- Describe a context switch and the circumstances under which context switching occurs.
- Describe, in general, the HP-UX priority queues.

Module 5 CPU Management

- Describe the components of the processor module.
- Describe how the TLB and CPU cache are used.
- List four CPU-related metrics.
- Identify how to monitor CPU activity.
- Discuss how best to use the performance tools to diagnose CPU problems.
- Specify appropriate corrections for CPU bottlenecks.

Module 6 Memory Management

- Describe how the HP-UX operating system performs memory management.
- Describe the main performance issues that involve memory management.
- Describe the UNIX buffer cache.
- Describe the sync process.
- Identify the symptoms of a memory bottleneck.
- Identify global and process memory metrics.
- Use performance tools to diagnose memory problems.
- Specify appropriate corrections for memory bottlenecks.
- Describe the function of the serialize command.


Module 7 Swap Space Performance

- Describe the difference between swap usage and swap reservation.
- Interpret the output of the swapinfo command.
- Define and configure pseudo swap.
- Define and configure swap space priorities.
- Define and configure swchunk and maxswapchunks.

Module 8 Disk Performance Issues

- List three ways disk space can be used.
- List disk device files.
- Identify disk bottlenecks.
- Identify kernel system parameters.

Module 9 File System Performance

- List three ways file systems are used.
- List basic file system data structures.
- Identify file system bottlenecks.
- Identify kernel system parameters.

Module 10 VxFS Performance

- Understand JFS structure and version differences
- Explain how to enhance JFS performance
- Set block sizes to improve performance
- Set Intent-Log size and rules to improve performance
- Understand and manipulate synchronous and asynchronous IO
- Identify JFS tuning parameters
- Understand and control fragmentation issues
- Evaluate the overhead of online backup snapshots


Module 11 NFS Performance

- List factors directly related to network performance.
- Describe how to determine network workloads (server and client).
- Evaluate UDP and TCP transport options.
- Identify a network bottleneck.
- List possible solutions for a network performance problem.

Module 12 Tunable Kernel Parameters

- Identify which tunable parameters belong to which category
- Identify tunable kernel parameters that could impact performance
- Tune both static and dynamic tunable parameters

Module 13 Putting It All Together

- Identify and characterize some network performance problems.
- List some useful tools for measuring network performance problems and state how they might be applied.
- Identify bottlenecks on other common system devices not associated directly with the CPU, disk, or memory.



Student Profile and Prerequisites


The student should be well versed in UNIX and able to perform the usual duties of a system administrator. Students should have completed HP-UX System and Network Administration I and HP-UX System and Network Administration II prior to attending this course, or have equivalent experience on another manufacturer's equipment.

NOTE: The course Inside HP-UX (H5081S) is not a formal prerequisite for attending HP-UX Performance and Tuning, but it should be considered a co-requisite training course for the serious HP-UX performance specialist. (The suggested order is Inside HP-UX, then HP-UX Performance and Tuning, but as the two courses have a synergistic relationship, the order is not absolute.)

Curriculum Path
Fundamentals of UNIX (H51434S)
        |
HP-UX System and Network Administration I (H3064S)
        |
HP-UX System and Network Administration II (H3065S)

    OR: HP-UX Administration for the Experienced UNIX Administrator (H5875S)
        |
Inside HP-UX (H5081S)  (recommended)
        |
HP-UX Performance and Tuning (H4262S)



Agenda
The following agenda is only a guideline; the instructor may vary it if desired. The course runs until the afternoon of the last day. The final hour or so can be used to demonstrate more fully the performance offerings, such as HP PRM and HP PerfView.

Day 1
  1. Introduction
  2. Performance Tools
Day 2
  3. GlancePlus
  4. Process Management
  5. CPU Management
Day 3
  6. Memory Management
  7. Swap Space Performance
  8. Disk Performance Issues
Day 4
  9. File System Performance
  10. VxFS Performance
  11. NFS Performance
Day 5
  12. Tunable Kernel Parameters
  13. Putting It All Together


Module 1 Introduction
Objectives
Upon completion of this module, you will be able to do the following:

- List characteristics of a system yielding good user response time.
- List characteristics of a system yielding high data throughput.
- List three generic areas most often analyzed for performance.
- List the four most common bottlenecks on a system.


1-1. SLIDE: Welcome to HP-UX Performance and Tuning

Welcome to HP-UX Performance and Tuning

Student Notes
Welcome to the HP-UX Performance and Tuning course. This course is designed to provide a high-level understanding of common performance problems and common bottlenecks found on an HP-UX system. The course uses HP performance tools to view activity currently on the system. While many tools can be used to analyze this activity, this course primarily utilizes the glance tool, which is specifically tailored for HP-UX systems.


1-2. SLIDE: Course Outline

Course Outline
- Introduction to Performance
- Performance Tools Overview
- GlancePlus
- Process Management
- CPU Management
- Memory Management
- Swap Space Performance Issues
- Disk and File System Performance Issues
- HFS Performance Issues
- VxFS Performance Issues
- Network Performance Issues
- Tuning the Kernel
- Putting It All Together: Performance Recap

Student Notes
Topics covered in this course include:

System Internals
These modules include information related to how the system components (CPU, memory, file systems, and network) function and interact with each other. Just as a mechanic cannot tune a car's engine without understanding how it works, a system administrator cannot tune system resources properly without a good understanding of how the resources work.

Performance Tools
Many performance tools are available with HP-UX. Some come as standard equipment; other tools are additional add-on products. Some provide runtime monitoring; other tools perform data collection. We will review all of the tools.

Specialty Areas
These modules cover areas of special interest to customers in particular types of environments. Three specialty areas are covered at a high level: NFS and networking, databases, and application profiling.


1-3. SLIDE: System Performance

System Performance

(Slide graphic: users judge the computer system by response time; management judges it by system throughput.)

Student Notes
Different computer systems have different requirements. Some systems may need to provide quick response time; other systems may need to provide a high level of data throughput.

Response Time - The User's Perspective

Response time is the time between the instant the user presses the return key or the mouse button and the receipt of a response from the computer. Users often use response time as a criterion of system performance.

A system that yields fast response time is typically not 100% utilized. Often there are free CPU cycles, along with low utilization of the disk drives, and no swapping or paging. Because the system resources are not constantly utilized, the resources are often available immediately when a user executes a task, yielding quick response time to the user. Users want low utilization of resources in order to experience optimal response time.


Throughput - The IT or MIS Management Perspective

Throughput is the number of transactions accomplished in a fixed time period. Management is often interested in how many compilations or how many reports they can generate in a specific amount of time. Many systems use benchmarks (like SPECmarks or TPC), which measure, in general, how many operations or transactions a system can perform per minute.

A system that yields high throughput is typically 100% utilized: there are no free CPU cycles, there are always jobs in the CPU run queue, the disk drives are constantly being utilized, and there is often pressure on memory. Because the system resources are constantly in use, the amount of work produced typically yields good system throughput. Management wants high utilization of resources to maximize system performance.
Question

Is it possible to get both good response time and high system throughput?


1-4. SLIDE: Areas of Performance Problems

Areas of Performance Problems

(Slide graphic: the application layer sits on top of the operating system, which sits on top of the hardware.)

Student Notes
This slide shows a hierarchical view of a computer system. The base of a computer is its hardware. Built on top of the hardware is the operating system (i.e. the operating system is dependent on the hardware in order to run). The application programs are built on top of the operating system (OS). All three of these areas can have performance problems.

Hardware
The hardware moves data within the computer system. If the hardware is slow, then, no matter how finely tuned the OS and applications are, the system will still be slow. Ultimately, the system is only as fast as the hardware can move the data. Items affecting the speed of the hardware include CPU clock speed, amount and speed of memory, type of disk controller (Fast/Wide SCSI or Single-Ended SCSI), and type of network card (FDDI or Ethernet).
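When hardware speed is in question, the first step is simply confirming what the machine contains. A minimal sketch using standard HP-UX commands (the disk device file named here is only an example; substitute one from your own ioscan output):

   # ioscan -fnC disk            # list disk devices and their hardware paths
   # ioscan -fnC lan             # list network interface cards
   # diskinfo /dev/rdsk/c0t6d0   # size and type of one disk (example device)
   # dmesg | grep -i physical    # physical memory reported at boot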


Operating System
The operating system runs on top of the hardware. It controls how the hardware is utilized. The operating system decides which process runs on the CPU, how much memory to allocate for the buffer cache, whether I/O to the disks is performed synchronously or asynchronously, and so on. If the operating system is not configured properly, then the performance of the system will be poor. Items affecting how the operating system performs include process priorities and their nice values, the tunable OS parameters, the mount options used for file systems, and the configurations of network and swap devices.
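A few standard commands expose most of the OS-level settings mentioned above; a minimal sketch (HP-UX 11.x command names, with availability varying by release):

   # kmtune                      # list current kernel tunable values
   # mount -p                    # mounted file systems with their mount options
   # ps -el                      # PRI and NI columns show priorities and nice values
   # swapinfo -t                 # configured swap devices and current usage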

Applications
The applications run on top of the operating system. The application programs include software, such as database management systems, electronic design applications programs, and accounting-based applications. The performance of the application program is dependent on the operating system and hardware, but it is also dependent on how the application is coded, and how the application itself is configured. Items affecting the performance of the application include how the application data is laid out on the disk, how many users are trying to use the application currently, and how efficiently the application uses the system's resources.
Question

In which of these three areas are most performance problems located?


1-5. SLIDE: Performance Bottlenecks

Performance Bottlenecks

(Slide graphic: processes flow through the CPU run queue, memory, the disk I/O queue, and the network. System bottleneck areas: CPU, memory, disk, network.)

Student Notes
Poor performance often results because a given resource cannot handle the demand being placed upon it. When the demand for a resource exceeds the availability of the resource, a bottleneck exists for that resource. Common resource bottlenecks are:

CPU       A CPU bottleneck occurs when the number of processes wanting to execute is constantly more than the CPU can handle. Basic symptoms of a CPU bottleneck are high CPU utilization and, consistently, multiple jobs in the CPU run queue.

Memory    A memory bottleneck occurs when the total number of processes on the system will not all fit into memory at one time (i.e., there are more processes than memory can hold). When this happens, pages of memory need to be copied out to the swap partition on disk to free space in memory. Basic symptoms of a memory bottleneck are high memory utilization and consistent I/O activity to the swap partition on disk.

Disk      A disk bottleneck occurs when the amount of I/O to a specific disk is more than the disk can handle. Basic symptoms of a disk bottleneck include high utilization of a disk drive and multiple I/O requests consistently in the disk I/O queue.

Network   A network bottleneck occurs when the amount of time needed to perform network-based transactions is consistently greater than expected. Basic symptoms of a network bottleneck include network collisions, network request timeouts, and packet retransmissions.
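Each of these bottlenecks has a quick first check using standard tools covered in the next module. A minimal sketch (the interval and count values are arbitrary):

   # sar -u 5 5        # CPU: %usr, %sys, %idle
   # sar -q 5 5        # CPU: run-queue length (runq-sz, %runocc)
   # vmstat 5 5        # memory: watch the po (page-out) column
   # sar -d 5 5        # disk: %busy, avque, avwait, avserv per device
   # netstat -i        # network: per-interface packets, errors, collisions
   # netstat -s        # network: protocol statistics, including retransmissions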


1-6. SLIDE: Baseline

Baseline

(Slide graphic: response time grows from the best possible response time as five, ten, and fifteen concurrent users are added.)

Student Notes
In order to quantify good versus poor performance, a customer needs to know what the best possible response time for a given workload can be. The procedure for calculating the best possible response time for a given workload is known as baselining. To calculate the baseline (i.e. the best possible response time) for a particular workload, the workload needs to be performed when no other activity is on the system. The intent is that when all resources are free, the workload will be able to execute as quickly as possible, thereby yielding the best possible response time. Once the baseline value is known, a relative measure is now available for determining how poorly the workload is performing. For example, assume a baseline value of 5 seconds for the workload shown on the slide. When five users are on the system, the response time for the workload increases to 7 seconds. The relative comparison shows response time taking 40% (or 2 seconds) more time to perform this workload when five users are on the system. We have just quantified the relative effect of having five users on the system relative to this particular workload.


The slide illustrates the typical behavior for a given workload. As more users concurrently utilize the system, the response time for a given workload gets worse. NOTE: In this class we will run baseline metrics using simplified "workload" simulation programs. Results will vary greatly with your applications.
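As a sketch of the procedure (the directory and program come from this module's lab), a baseline is simply a timed run on an otherwise idle system that later, loaded runs are compared against; the 5.0s/7.0s figures below echo the slide's example and are illustrative only:

   # cd /home/h4262/baseline
   # timex ./short     # idle system: note the "real" time, e.g. 5.0 seconds

   (later, with five users active on the system)

   # timex ./short     # loaded system: e.g. 7.0 seconds
                       # (7.0 - 5.0) / 5.0 = 40% degradation vs. the baseline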


1-7. SLIDE: Queuing Theory of Performance

Queuing Theory of Performance

(Slide graph: average response time versus percent utilization; response time is X (the baseline) at low utilization, roughly 2X at 75%, and approaches 4X as utilization nears 100%.)

Student Notes
The queuing theory of performance states that the average response time of a given resource is directly linked to the average utilization of that resource. The slide shows a baseline value of X seconds for a given resource. According to the queuing theory, the users will experience this response time when the resource has an average utilization of 0 to 25%. When the average utilization of the resource reaches 75%, the average response time will double. As the average utilization approaches 100%, the average response time quadruples. The bottom line is, as the average utilization of the resource increases, the average response time gets worse and worse. Why does the average response time become poor as the average utilization of a resource increases?


1-8. SLIDE: How Long Is the Line?

How Long Is the Line?

(Slide graphic: a line of jobs waiting to reach a system resource; the line starts behind the job currently being serviced.)

Student Notes
The reason the average response time degrades so badly as average resource utilization increases is that the line waiting to get to the resource gets longer. As resource utilization increases, the number of jobs waiting on the resource also increases. When poor performance is experienced, it is most often because the queue has become long: jobs spend most of their time waiting in line for the resource (CPU, memory, network, or disk), as opposed to being serviced by the resource. The slide shows four people waiting in line to reach a resource (think of a line in a bank with one teller). If it takes 5 minutes to service one customer, then the fourth person in line will wait 15 minutes before reaching the resource. Adding another 5 minutes to service the request brings the total response time to 20 minutes for the last person in line, as opposed to 5 minutes if the line had been empty. There is also some overhead from switching between customers, but it is minimal in this example because the customers are handled in a serial fashion.
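The bank-teller arithmetic generalizes: with a service time S and N customers already in line, the newcomer's response time is roughly (N + 1) x S. Classical queueing theory expresses the same growth in terms of utilization; for a single server with random arrivals (the M/M/1 model, an assumption of ours that the course materials do not name), average response time is R = S / (1 - U). A quick check with awk, using the 20-millisecond disk service time from the next slide:

   # echo "0.25 0.50 0.75 0.90" | tr ' ' '\n' | \
     awk '{ S = 20; printf "U=%.2f  R=%.1f ms\n", $1, S / (1 - $1) }'
   U=0.25  R=26.7 ms
   U=0.50  R=40.0 ms
   U=0.75  R=80.0 ms
   U=0.90  R=200.0 ms

This model rises somewhat faster than the stylized curve on the slide, but the shape is the same: response time explodes as utilization approaches 100%.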


1-9. SLIDE: Example of Queuing Theory

Example of Queuing Theory

# sar -d 5 5

15:31:55   device   %busy   avque   r+w/s   blks/s   avwait   avserv
15:32:00   c0t6d0      81     3.4      31      248    59.31    21.20
           c0t5d0       5      .5       1       32     0.65    23.58
15:32:05   c0t6d0      84     3.5      34      245    71.64    24.04
           c0t5d0       3      .5       2        8     0.25    17.93
15:32:10   c0t6d0      68     2.9      31      248    51.36    18.55
           c0t5d0       1      .5       0        6     0.48    19.18
15:32:15   c0t6d0      71     2.7      30       30    62.88    24.16
           c0t5d0       0      .5       1        3     0.65    29.25
15:32:20   c0t6d0      69     2.7      29       29    61.70    24.14
           c0t5d0       0      .5       1        3     0.65    29.25

Student Notes
The above slide provides an example of queuing theory for disk drives, as reported by the sar tool. The four fields to focus on are:

%busy    The percentage of utilization of each disk
avque    The average number of I/O requests in the queue for that disk
avwait   The average amount of time a request spends waiting in that disk's queue
avserv   The average amount of time to service an I/O request (not including the wait time)

Analyzing the data shows a baseline of around 20 milliseconds to service an I/O request (the approximate average of the avserv column). The first line item shows a disk that is 81% utilized. Its total response time is the average wait plus the average service time, approximately 80 milliseconds: four times longer than the baseline of 20 milliseconds. In fact, each snapshot shows requests to the busy disk waiting in the queue longer than it takes to service them. To see why the wait time is so high, look at the avque size. Notice the queue size is highest when the device is most busy. This is the basic concept of the performance queuing theory.
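The same wait-plus-service arithmetic can be automated against live sar output. A rough sketch, assuming the output layout shown above (timestamped rows for the first device, continuation rows without a timestamp for the rest):

   # sar -d 5 5 | awk '
         $2 == "device" || $1 == "Average" { next }    # skip header/summary rows
         NF == 8 { printf "%s resp %.1f ms\n", $2, $7 + $8 }
         NF == 7 { printf "%s resp %.1f ms\n", $1, $6 + $7 }'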


1-10. SLIDE: Summary

Summary
Objectives for the system:
- Provide fast response time to users, or
- Maximize throughput of the system

Three performance problem areas:
- Hardware
- Operating System
- Application

Performance bottlenecks:
- CPU
- Disk
- Memory
- Network

Need for baselines

Performance queuing theory

Student Notes
To summarize this module, systems are tuned for response time or for throughput. This class focuses on tuning for best possible response time. Areas that affect response time are speed of the hardware, configuration of the operating system, and configuration of the application. This class focuses on the configuration of the operating system. Common bottlenecks with computer systems include CPU, memory, disk, and network. This class discusses all four bottlenecks. Baselines are an important measurement tool for quantifying performance. In the lab for this module, the student will establish CPU and disk I/O baselines. Finally, the queuing theory of performance states that the average response time increases as the average utilization of a resource increases. This is an important concept, which will be revisited throughout this course.


111. LAB: Establishing a Baseline


Directions
The following lab exercise establishes baselines for three CPU-bound applications and one disk-bound application. The objective is to time how long these applications take when there is no activity on the system. These same applications will be executed later in the course when other bottleneck activity is present. The impact of these bottlenecks on user response time will be measured through these applications.

1. Change directory to /home/h4262/baseline.

   # cd /home/h4262/baseline

2. Compile the three C programs long, med, and short by running the BUILD script.

   # ./BUILD

3. Time the execution of the long program. Make sure there is no activity on the system.

   # timex ./long

   Record execution time: real: _____ user: _____ sys: _____

4. Time the execution of the med program. Make sure there is no activity on the system.

   # timex ./med

   Record execution time: real: _____ user: _____ sys: _____

5. Time the execution of the short program. Make sure there is no activity on the system.

   # timex ./short

   Record execution time: real: _____ user: _____ sys: _____

6. Time the execution of the diskread program.

   # timex ./diskread

   Record execution time: real: _____ user: _____ sys: _____
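If timex is unfamiliar, its output has the following shape; the numbers here are illustrative only, so record the values your own system reports:

# timex ./long

real       42.15
user       41.87
sys         0.11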


7. In the case of the long, med, and short programs, the real time is approximately the sum of the user and sys times. This is not the case with diskread. Explain why.


112. LAB: Verifying the Performance Queuing Theory


Directions
The performance queuing theory states that as the number of jobs in a queue increases, so does the response time of the jobs waiting to use that resource. This lab uses the short program compiled from /home/h4262/baseline/prime_short.c.

1. In terminal window 1, monitor the CPU queue with the sar command.

   # sar -q 5 200

2. In a second terminal window, time how long it takes for the short program to execute.

   # timex ./short &

   How long did the program take to execute? _________________
   How does this compare to the baseline measurement from earlier? _________________
   What is the CPU queue size? _________________

3. Time how long it takes for three short programs to execute.

   # timex ./short & timex ./short & timex ./short &

   How long did the slowest program take to execute? _________________
   How did the CPU queue size change from step 2? _________________

4. Time how long it takes for five short programs to execute.

   # timex ./short & timex ./short & timex ./short & \
     timex ./short & timex ./short &

   How long did the slowest program take to execute? _________________
   How did the CPU queue size change from step 3? _________________

5. Is the relationship between elapsed execution (real) time and the number of running programs linear?

6. Comment on the overhead of switching from one process to another.
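For reference while watching window 1, sar -q output has the following shape (values here are illustrative only): runq-sz is the average length of the run queue, and %runocc is the percentage of time the queue was occupied.

# sar -q 5 200

12:10:05  runq-sz  %runocc  swpq-sz  %swpocc
12:10:10      1.0       20      0.0        0
12:10:15      3.0       95      0.0        0
12:10:20      5.1      100      0.0        0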


Module 2 Performance Tools


Objectives
Upon completion of this module, you will be able to do the following:
- Identify various performance tools available on HP-UX.
- Categorize each tool as either real time or data collection.
- List the major features of the performance tools.
- Compare and contrast the differences between the tools.


21. SLIDE: HP-UX Performance Tools

HP-UX Performance Tools

Student Notes
Many performance tools are available, for many different purposes. In the HP-UX operating system, there are over 50 different performance-related tools. Some tools provide real-time performance information, such as how busy the CPU is right now. Other tools collect data in the background and maintain a history of performance information. This module addresses all the tools and the different functions they perform.


22. SLIDE: HP-UX Performance Tools (Continued)

HP-UX Performance Tools


Objective:
- Identify the various performance tools available on HP-UX
- Demonstrate their mechanics
- Discuss their features
- Compare and contrast the differences between the tools

Student Notes
The objective of this module is to highlight all the performance tools available with HP-UX, to categorize them by function, and to describe how each tool is used. The module is intended to be a quick reference of performance tools, which the student can refer to when needing to select a tool for a specific task. NOTE: This module does not discuss how to interpret the output of the tools. Interpretation of the metrics is provided in later modules.


23. SLIDE: Sources of Tools

Sources of Tools
Standard tools
- Tools found on UNIX systems, including HP-UX
- Tools frequently found on other UNIX systems

HP-UX-specific tools
- Tools found only on HP-UX

Optional tools
- Tools licensed and sold separately (generally available only on HP-UX)

Student Notes
Three types of tools are presented in this module:

Standard Tools
Standard tools are those frequently found on many UNIX systems, including HP-UX. The advantage of the standard tools is that their results can be compared with those being collected on other UNIX platforms. This provides an "apples for apples" comparison, which is desirable when comparing systems. The output from these standard tools (and some of the options) may vary slightly among UNIX systems. In addition, differences between the various UNIX implementations can affect the reliability of the metrics being output by the tools. Therefore, be careful to check the results with other tools or seek help before basing important tuning decisions on the value of one metric.

HP-Specific Tools
HP-specific tools are those which are found only on HP-UX operating systems. These tools are often tailored specifically to understand HP-UX implementations. These tools are generally not found on other UNIX implementations, as other implementations are different from those of HP. Some of the HP-specific tools come with the base OS; others are purchased as optional tools.

Optional Tools
Optional tools are tools that are added to the operating system in addition to the standard tools. Some of the optional tools, such as the HP-PAK (Programmers Analysis Kit), may be included with add-on software, such as compilers for HP-UX. Other optional tools, like GlancePlus, PerfView, MeasureWare, NetMetrix, PRM (Process Resource Manager), and WLM (Work Load Manager), are purchased individually or in small bundles (GlancePlus Pak also includes a MeasureWare agent). Optional tools are typically licensed from HP. They offer many advantages over the standard tools, including:
- ease of use
- accuracy
- granularity
- low overhead
- additional metrics


24. SLIDE: Types of Tools

Types of Tools

- Data Collection Performance
- Run-Time Monitoring
- System Configuration and Utilization
- Performance Administration
- Network Monitoring
- Application Profiling and Monitoring

Student Notes
The tools covered in this section fall into six main categories:

Run-Time Monitoring Tools
These tools provide information on the performance of the system now. The information is current and provides a real-time perspective on the state of the system at the current moment.

Data Collection Performance Tools
These tools collect performance data in the background, summarize or average the data into a summary record, and log the summary record to a file or files on disk. They do not typically provide real-time data.

Network Monitoring Tools
These tools monitor performance, status, and packet errors on the network. They include both monitoring and configuration tools related to network management.

Performance Administrative Tools
A system administrator can use these tools to manage the performance of a system. They typically do not report any data, but allow the current configuration of the system (and its components) to be changed to help improve performance.

System Configuration and Utilization Information Tools
These tools report current system configurations (such as LVM and file systems). They also report utilization of resource statistics, like disk and file system space and number of processes.

Application Profiling and Monitoring Tools
These tools provide in-depth analysis of the behavior of a program. These tools monitor and trace the execution of a process, and report the resources and calls made during its execution.


25. SLIDE: Criteria for Comparing the Tools

Criteria for Comparing the Tools


- Source of data
- Scope
- Additional cost versus no cost
- Intrusiveness
- Accuracy
- Ease of use
- Portability
- Metrics available
- Data collection and storage
- Permissions required

Student Notes
Each tool has strengths and weaknesses, advantages and disadvantages, and unique features. Some items to consider when selecting a tool are:

Source of Data
The collected data can come from a variety of sources, including the kernel, an application, or a specific daemon (like the midaemon).

Scope
The scope determines the level of detail provided by the tool. Most of the standard tools do not show process-level metrics. For example, they display global disk I/O rates, but do not show which process is generating the I/O or the disk on which the I/O is concentrated.

Cost
The cost sometimes determines whether the tool is an option. Many of the HP-specific tools have additional costs associated with them. (Many of these tools have evaluation copies available for a trial period.)

Intrusiveness
The intrusiveness relates to the overhead associated with running the tool. Some tools have significant overhead. A large user community running top, for example, may generate large amounts of "monitoring" overhead on the system. Another example is the ps command. It has little impact on most systems due to the low frequency at which it is executed. However, the ps command places fairly high overhead on the system during its execution.

Accuracy
The accuracy of the tool relates to the reliability of the data being reported. Many standard UNIX tools, like vmstat and sar, have been ported from other UNIX systems. The registers that they monitor may not always correspond to the registers that the kernel updates.

Others
There are other factors that can have significant impact on the tool you decide to use. These factors include familiarity, metrics available, permissions required, and portability.

As the tools are presented in the upcoming pages, many of these items will be addressed.


26. SLIDE: Data Sources

Data Sources

[Slide diagram, reconstructed in brief: kernel memory counters (read via /dev/kmem or the pstat() system call) feed the standard tools sar, ps, iostat, and vmstat; kernel instrumentation trace buffers feed the midaemon, whose shared memory segment is exposed through the Measurement Interface Library to glance and to the scopeux collector; scopeux writes logfiles processed by utility and extract, and pv attaches over a socket.]

Student Notes
The standard tools read information from the UNIX counters and registers maintained in kernel memory (accessible via the /dev/kmem device file and the pstat() system call). These counters and registers are updated 10 times a second as a standard part of most UNIX system implementations. The data in the counters and registers are generally adequate for most performance jobs, but do not provide enough detail when in-depth tuning is needed. The optional tools for HP-UX use an additional source called kernel instrumentation (KI). The KI interface provides additional information beyond the UNIX kernel counters and registers. The KI interface gathers performance information on a system call basis, with every system call generated by every process being traced. The KI interface uses a proprietary measurement interface library to derive the additional metrics. These tools are frequently revised and updated to provide the highest levels of accuracy with the lowest possible overhead. The optional tools, such as Glance and MeasureWare, are KI-based tools when running on HP-UX systems, although they are available for other vendor systems as well. Additional information about KI-based tools (also known as resource and performance management (RPM) tools) can be obtained from the RPM web site at: www.hp.com/go/rpm


27. SLIDE: Performance Monitoring Tools (Standard UNIX)

Performance Monitoring Tools (Standard UNIX)


            Global Metrics   Process Details   Alarming Capability
iostat      Yes              No                No
ps          No               Yes               No
sar         Yes              No                No
time        No               Some              No
timex       Some             Some              No
top         Yes              Some              No
uptime, w   Some             Some              No
vmstat      Yes              No                No

Student Notes
The slide shows run-time performance monitoring tools included with HP-UX. These tools provide current information about the performance of the system. These tools are standard UNIX performance tools, which are found on most other UNIX implementations. The Global Metrics column indicates whether the tool will show aggregate resource utilization without differentiating between specific resources. The Process Detail column indicates whether the tool will show resources being used by a single PID. The Alarming Capability column indicates whether the tool is capable of sending an alarm when one of the metrics exceeds a user-defined threshold.


28. TEXT PAGE: iostat


The iostat command reports I/O statistics for each active disk on the system.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page
Interval:         >= 1 second
Data Source:      Kernel registers/counters
Type of Data:     Global
Metrics:          Physical disk I/O
Logging:          Standard output device
Overhead:         Varies, depending on the output interval
Unique Features:  Terminal I/O
Full Pathname:    /usr/bin/iostat
Pros and Cons:    + statistics by physical disk drive
                  - limited statistics
                  - poorly documented and cryptic headings

Syntax

iostat [-t] [interval [count]]

-t          Report terminal statistics as well as disk statistics
interval    Display successive summaries at this frequency
count       Repeat the summaries this number of times

Key Metrics
The iostat metrics include:

bps     Blocks (kilobytes) transferred per second
sps     Number of seeks per second
msps    Average milliseconds per seek

With the advent of new disk technologies, such as data striping, where a single data transfer is spread across several disks, the average milliseconds per seek becomes impossible to compute accurately. At best it is only an approximation, varying greatly, based on several dynamic system conditions. For this reason and to maintain backward compatibility, the milliseconds per seek (msps) field is set to the value 1.0.
Examples

# iostat 5 2
  device      bps     sps    msps
  c0t6d0        0     0.0     1.0
  c0t6d0     1100    34.6     1.0


# iostat -t 5 1
       tty                       cpu
  tin  tout               us  ni  sy  id
    0     0                2   0   1  98
  device      bps     sps    msps
  c0t6d0        0     0.0     1.0


29. TEXT PAGE: ps


The ps command displays information about selected processes running on the system. The command has many options for reducing the amount of output.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page
Interval:         on demand
Data Source:      in-core process table
Type of Data:     per process
Metrics:          state, priority, nice values, PIDs, times, ...
Logging:          Standard output device
Overhead:         Varies, depending on the number of processes
Unique Features:  Wait channel and run queue of processes
Full Pathname:    /usr/bin/ps
Pros and Cons:    + familiarity
                  + options for altering output
                  - minimal information
                  - no averaging or summarization (i.e., no global metrics)

Syntax

ps [-aAcdefHjlP] [-C cmdlist] [-g grplist] [-G gidlist] [-n namelist]
   [-o format] [-R prmgrplist] [-s sidlist] [-t termlist] [-u uidlist] [-U uidlist]

Key Metrics
The ps metrics include:

ADDR     The memory address of the process, if resident; otherwise, the disk address.
C        Recent processor utilization, used for CPU scheduling (0-255).
F        Flags associated with the process (octal, additive):
         0    Process is on the swap device
         1    Process is in core memory
         2    Process is a system process
         4    Process is locked in memory
         (and many more)
NI       The nice value for the process; used in priority computation.
PPID     The process ID number of the parent process.
PID      The process ID number of this process.
PRI      The priority of the process.
S        The state of the process:
         I    Process is being created (very rarely seen)
         S    Process is Sleeping
         R    Process is currently Runnable
         T    Process is Stopped (rare)
         Z    Process is terminated (aka Zombie process)
STIME    Starting time of the process.
SZ       The size in 4-KB memory pages.
TIME     The cumulative execution time of the process.
TTY      The controlling terminal for the process.
WCHAN    The address of a structure representing the event or resource for which the process is waiting or sleeping.

Example
# ps -fu daemon
UID     PID   PPID  C  STIME     TTY  TIME  COMMAND
daemon  1171  1170  0  13:03:42  ?    3:10  /usr/bin/X11/X :0
daemon  1565  1171  0  17:47:47  ?    0:00  pexd /tmp/to_pexd_1171.2 /dev/ttyp2

# ps -lu daemon
F  S  UID  PID   PPID  C  PRI  NI  ADDR     SZ   WCHAN   TTY  TIME  COMD
1  S  1    1171  1170  1  154  20  dbea00   697  3ace9c  ?    3:10  X
1  S  1    1565  1171  0  154  20  10e6900  115  3ace9c  ?    0:00  pexd
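The -o format option shown in the syntax above selects the output columns. On HP-UX, this XPG4 behavior is enabled by setting UNIX95 in the command's environment; the following is a minimal sketch (the column list is an arbitrary illustration):

# UNIX95= ps -eo pid,ppid,nice,vsz,etime,comm

Placing UNIX95= directly before the command sets the variable for that command only, leaving the rest of the shell session unaffected.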


210. TEXT PAGE: sar


The sar command collects and reports on many different system activities (and system areas), including CPU, buffer cache, disk, and others. Related commands include sadc, sa1, and sa2. These commands are related to the data collection functionality of sar and will be addressed with the data collection commands.

Tool Source:      Standard UNIX (System V)
Documentation:    man page and kernel source
Interval:         >= 1 second
Data Source:      /dev/kmem registers/counters
Type of Data:     Global
Metrics:          CPU, Disk, and Kernel resources
Logging:          Standard output device, or file on disk
Overhead:         Varies, depending on the output interval
Unique Features:  Disk I/O wait time, kernel table overflows, buffer cache hit ratios
Full Pathname:    /usr/sbin/sar
Pros and Cons:    + familiarity
                  + performs both real-time and data collection functions
                  - no per-process information
                  - no paging information; only designed for swapping (no longer done on HP-UX)

Syntax

sar [-ubdycwaqvmpAMSP] [-o file] t [n]

Metric-related options:

-u      CPU utilization
-q      Run queue and swap queue lengths and utilization
-b      Buffer cache stats
-d      Disk utilization
-y      TTY utilization
-c      System call rates
-w      Swap activity
-v      Kernel table utilization
-m      Semaphore and message queue utilization
-a      File access system routine utilization
-A      Everything!
-M      Per-processor breakdown (used with u and/or q)
-P/p    Per processor set breakdown (used with Mu and/or Mq)


Key Metrics
The sar command has many metrics. Included below are some sample metrics based on the disk and CPU reports:
CPU Report (-u)

The CPU report displays utilization of the CPU and the percentage of time spent within the different modes.

%usr     Percentage of time system spent in user mode
%sys     Percentage of time system spent in system mode
%wio     Percentage of time processes were waiting for (disk) I/O
%idle    Percentage of time system was idle

Disk Report (-d)

The disk report displays activity on each block device (i.e., disk drive).

device    Logical name of the device (device file name)
%busy     Percentage of time the device was busy servicing a request
avque     Average number of I/O requests pending for the device
r+w/s     Number of I/O requests per second (includes reads and writes)
blks/s    Number of 512-byte blocks transferred (to and from) per second
avwait    The average amount of time the I/O requests wait in the queue before being serviced
avserv    The average amount of time spent servicing an I/O request (includes seek, rotational latency, and data transfer times)

Examples
# sar -u 5 4
HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24    %usr    %sys    %wio   %idle
08:32:29      64      36       0       0
08:32:34      61      39       0       0
08:32:39      61      39       0       0
08:32:44      61      39       0       0
Average       61      39       0       0

# sar -d 5 4
HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24   device   %busy   avque   r+w/s   blks/s   avwait   avserv
08:32:29   c0t6d0   19.36    0.55      20     1341     6.37    14.27
08:32:34   c0t6d0   26.40    0.58      27     1687     7.10    15.00
08:32:39   c0t6d0   21.00    0.54      23     1528     5.48    14.09
08:32:44   c0t6d0   21.00    0.54      23     1528     5.48    14.09
Average    c0t6d0   22.44    0.56      23     1552     6.34    14.45


211. TEXT PAGE: time, timex


Description
The time and timex commands report the elapsed (wall clock) time, the time spent in system mode, and the time spent in user mode, for a specific invocation of a program. The timex command is an enhanced version of time, and can report additional statistics related to resources used during the execution of the command.

Tool Source:      Standard UNIX (System V)
Documentation:    man page and kernel source
Interval:         Process completion
Data Source:      Kernel registers/counters
Type of Data:     Process
Metrics:          CPU (user, system, elapsed)
Logging:          Standard output device
Overhead:         Minimal
Unique Feature:   Timing how long a process executes
Full Pathname:    /usr/bin/timex
Pros and Cons:    + minimal overhead
                  - cannot be used on already running processes

Syntax

time command
timex [-o] [-p[fhkmrt]] [-s] command

-o    List amount of I/O performed by command (requires pacct file to be present)
-s    List activity (sar data) present during execution of command (requires sar file to be present)

Example
timex find / 2>&1 >/dev/null | tee -a perf.data

real       39.49
user        1.47
sys        11.24


212. TEXT PAGE: top


Description
The top command displays a real-time list of the CPU consumers (processes) on the system, sorted, with the greatest users at the top of the list.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page
Interval:         >= 1 second
Data Source:      Kernel registers/counters
Type of Data:     Global, Process
Metrics:          CPU, Memory
Logging:          Standard output device
Overhead:         Varies, depending on presentation interval
Unique Feature:   Real-time list of top CPU consumers
Full Pathname:    /usr/bin/top
Pros and Cons:    + quick look at global and process CPU data
                  - limited statistics
                  - uses curses for terminal output

Syntax

top [-s time] [-d count] [-n number] [-q]

-s time      Set the delay between screen updates
-d count     Set the number of screen updates to "count", then exit
-n number    Set the number of processes to be displayed
-q           Run quick (runs top with a nice value of zero)

Key Metrics
The top metrics include:

SIZE     Total size of the process in KB. This includes text, data, and stack.
RES      Resident size of the process in KB. This includes text, data, and stack.
%WCPU    Average (weighted) CPU usage since top started.
%CPU     Current CPU usage over the current interval.

Example
* Start top with a 10-second update interval
# top -s 10

* Start top and display only 5 screen updates, then exit
# top -d 5

* Start top and display only the top 15 processes
# top -n 15


* Start top and let it run continuously
# top

System: r3w14                               Fri Oct 17 10:24:23 1997
Load averages: 0.55, 0.37, 0.25
115 processes: 113 sleeping, 2 running
Cpu states:
LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
0.55   9.9%   0.0%   2.0%  88.1%   0.0%   0.0%   0.0%   0.0%

Memory: 24204K (15084K) real, 46308K (33432K) virtual, 2264K free  Page# 1/9

TTY  PID   USERNAME  PRI  NI  SIZE    RES    STATE  TIME   %WCPU  %CPU   COMMAND
?    680   root      154  20  1328K   468K   sleep  33:23  12.36  12.34  snmpdm
?    728   root      154  20  340K    136K   sleep  18:20   5.82   5.81  mib2agt
?    1141  root      154  20  12784K  3708K  sleep  84:06   4.47   4.47  netmon
?    1071  root      80   20  1264K   568K   run     0:19   3.00   2.99  pmd
?    3892  root      179  20  308K    296K   run     0:00   2.59   0.34  top

* To go to the next/previous page, type "j" and "k" respectively
* To go to the first page, type "t"

NOTE: The two values preceding real and virtual memory are the memory allocated for all processes, and in parentheses, memory allocated for processes that are currently runnable or that have executed within the last 20 seconds.

NOTE: swait and block are relevant for SMP systems and will be 0.0% on single-processor systems. swait is the time a processor spends spinning while waiting for a spinlock. block is the time a processor spends blocked while waiting for a kernel-level semaphore.


213. TEXT PAGE: uptime, w


The uptime command shows how long a system has been up and, with -w, who is logged in and what they are doing. The w command is linked to uptime and prints the same output as uptime -w, displaying a summary of the current activity on the system.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page
Interval:         on demand
Data Source:      Kernel registers/counters and /etc/utmp
Type of Data:     Global
Metrics:          Load averages, number of logged-on users
Logging:          Standard output device
Overhead:         Varies, depending on number of users logged in
Unique Feature:   Easiest way to see time since last reboot and load averages
Full Pathname:    /usr/bin/uptime
Pros and Cons:    + quick look at load average and how long the system has been up
                  - limited statistics

Syntax

uptime [-hlsuw] [user]
w [-hlsuw] [user]

-h    Suppress the first line and the header line
-l    Print long listing
-s    Print short listing
-u    Print only the utilization lines; do not show user information
-w    Print what each user is doing; same as the w command

Example
# uptime
11:23am  up 3 days, 22:22,  7 users,  load average: 0.62, 0.37, 0.30

# uptime -w
11:23am  up 3 days, 22:22,  7 users,  load average: 0.57, 0.37, 0.30
User     tty      login@    idle   JCPU  PCPU  what
root     console  9:26am    94:20              /usr/sbin/getty console
root     pts/0    9:26am    5                  /sbin/sh
root     pts/3    9:26am    1:57               /sbin/sh
root     pts/4    10:16am   2      2           vi tools_notes
root     pts/5    9:43am                       script


214. TEXT PAGE: vmstat


The vmstat command reports virtual memory statistics about processes, virtual memory, and CPU activity.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page, include files
Interval:         >= 1 second
Data Source:      Kernel registers/counters
Type of Data:     Global
Metrics:          CPU, Memory
Logging:          Standard output device
Overhead:         Varies, depending on presentation interval
Unique Feature:   Cumulative VM statistics since last reboot
Full Pathname:    /usr/bin/vmstat
Pros and Cons:    + minimal overhead
                  - poorly documented
                  - cryptic headings
                  - lines wrap on 80-column character displays
                  - statistics can bleed together

Syntax

vmstat [-dnS] [interval [count]]
vmstat -f | -s | -z

-d    Include disk I/O information
-n    Print in a format more easily viewed on an 80-column display
-S    Include swapping information
-f    Print number of processes forked since boot, number of pages used by all forked processes, and the average pages/forked process
-s    Print virtual memory summary information
-z    Zero the summary registers

Key Metrics
The vmstat metrics include:

Process metrics:
r       In run queue
b       Blocked for resource (I/O, paging, and so on)
w       Runnable or short sleeper (< 20 sec.) but swapped

VM metrics:
avm     Active virtual pages
free    Number of pages on the free list
re      Page reclaims
at      Address translation faults
pi      Pages paged in
po      Pages paged out
fr      Pages freed by vhand, per second
sr      Pages surveyed (dereferenced) by vhand, per second

Fault metrics:
in      Device interrupts per second
sy      System calls per second
cs      CPU context switch rate (switches/second)

CPU metrics:
us      User mode utilization
sy      System mode utilization
id      Idle time

Examples
# vmstat -n 5 2
VM
memory            page                                          faults
   avm    free    re   at   pi   po   fr   de   sr     in     sy    cs
  7589     728     0    0    0    0    0    0    0    140    490    30
  7670     692     0    0    0    0    0    0    0    235   4959   170
CPU
cpu             procs
 us   sy   id      r    b    w
  2    1   97      0   74    0
 47   11   42      0   75    0

# vmstat -nS 5 2
VM
memory            page                                          faults
   avm    free    si   so   pi   po   fr   de   sr     in     sy    cs
  7984     584     0    0    0    0    0    0    0    140    490    30
  7972     549     0    0    0    0    0    0    0    203    462    53
CPU
cpu             procs
 us   sy   id      r    b    w
  2    1   97      0   75    0
  1    1   98      0   76    0

# vmstat -f
3949 forks, 497929 pages, average= 126.09
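The average printed by vmstat -f is simply total pages divided by total forks: 497929 / 3949 is approximately 126.09 pages per forked process.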



# vmstat -s
        0 swap ins
        0 swap outs
        0 pages swapped in
        0 pages swapped out
  1116471 total address trans. faults taken
   346175 page ins
     7976 page outs
   200675 pages paged in
    16824 pages paged out
   213104 reclaims from free list
   216129 total page reclaims
      110 intransit blocking page faults
   587961 zero fill pages created
   303212 zero fill page faults
   248573 executable fill pages created
    67077 executable fill page faults
        0 swap text pages found in free list
    80233 inode text pages found in free list
      166 revolutions of the clock hand
   106769 pages scanned for page out
    13236 pages freed by the clock daemon
 75633551 cpu context switches
1612387244 device interrupts
  1137948 traps
247228805 system calls


215. SLIDE: Performance Monitoring Tools (HP Specific)

Performance Monitoring Tools (HP Specific)


          Global Metrics   Process Details   Alarming Capability
glance    Yes              Yes               Yes
gpm       Yes              Yes               Yes
xload     Yes              No                No

Student Notes
This slide shows the HP-specific, run-time performance monitoring tools included with HP-UX. Currently, glance and gpm are available for HP-UX. Both glance and gpm are optional, and can be purchased separately. If you are running 11i (any version), both glance and gpm are included with the Enterprise and Mission Critical Operating Environments. The glance and gpm tools provide real-time monitoring capabilities specific to the HP-UX operating system. Both tools provide access to performance data not available with standard UNIX tools, and both use the midaemon (i.e., the KI interface) to collect performance data, yielding much more accurate performance results. xload is an X Windows application that graphically shows the recent length of the CPU's run queue. It consists of a window displaying vertical lines that represent the average number of processes in the run queue over the previous intervals. The default interval size is 8 seconds.


216. TEXT: glance


The glance tool is available for HP-UX. This is the recommended (and preferred) performance monitoring tool for HP-UX systems (character-based display). This tool shows information that cannot be seen with any of the standard UNIX monitoring tools. The accuracy of the data is considered more reliable, as the source is the midaemon, as opposed to the kernel counters and registers.

NOTE: Free evaluation copies of glance and gpm can be obtained for trial periods. The phone number to obtain an evaluation copy is (800)237-3990.

Tool Source:      HP
Documentation:    man page and on-line help
Interval:         >= 2 seconds
Data Source:      midaemon
Type of Data:     Global, Process, and Application
Metrics:          CPU, Memory, Disk, Network, and Kernel resources
Logging:          Standard output device, screen shots to a file
Overhead:         Varies, depending on presentation interval and number of processes
Unique Features:  Per-process (and global) system call rates
                  Extensive on-line help for the metrics
                  Sort by CPU usage, memory usage, or disk I/O usage
                  Files opened per process
Full Pathname:    /opt/perf/bin/glance
Pros and Cons:    + extensive per-process information
                  + extensive global information
                  + more accurate than standard UNIX tools
                  - uses the curses display library
                  - relatively slow startup
                  - not bundled with the OS (prior to 11i)

Syntax

glance [-j interval] [-p [dest]] [-f dest] [-maxpages numpages] [-command] [-nice nicevalue] [-nosort] [-lock] [-adviser_off] [-adviser_only] [-bootup] [-iterations count] [-syntax filename] [-all_trans] [-all_instances] [-disks <n>] [-kernel <path>] [-nfs <n>] [-pids <n>] [-no_fkeys]
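A few sample invocations built only from the flags documented above (the file names are arbitrary examples):

# glance -j 5
# glance -j 60 -iterations 10 -f /tmp/glance.rpt
# glance -adviser_only -syntax /tmp/my.syntax

The first runs interactively with a 5-second update interval; the second runs ten 60-second intervals while copying output to a file; the third runs only the adviser, using a custom syntax file.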


Key Metrics
The glance tool includes reports for the following areas:
Hot Key   Glance Plus Report       Function
a         CPU by Processor         All CPUs Performance Stats
c         CPU Report               CPU Utilization Stats
d         Disk Report              Disk I/O Stats
g         Process List             Global Process Stats
h                                  Help
i         I/O by Filesystem        I/O by Filesystem
l         Network by LAN           Lan Stats
m         Memory Report            Memory Stats
n         NFS Report               NFS Stats
s         Process selection        Single process information
t         System Table Report      OS Table Utilization
u         Disk Report              Disk Queue Length
v         I/O by Logical Volume    Logical Volume Mgr Stats
w         Swap Detail              Swap Stats
z                                  Zero all Stats
A         Application List
B         Global Waits
D         DCE Activity
F         Process Open Files
G         Process Threads
H         Alarm History
I         Thread Resource
J         Thread Wait
K         DCE Process List
L         Process System Calls
M         Process Memory Regions
N         NFS Global Activity
P         PRM Group List
R         Process Resources
T         Transaction Tracker
W         Process Wait States
Y         Global System Calls
Z         Global Threads
?         Help with options
<CR>      Update screen with new data

See Module 3 for a more complete discussion of glance and gpm.


217. TEXT PAGE: gpm


The gpm tool is a graphical version of glance. All the benefits of using glance apply to gpm (GlancePlus Monitor).

NOTE: Free evaluation copies of glance and gpm can be obtained for a 90-day trial period. The phone number to obtain the evaluation copy is (800)237-3990.

Tool Source:      HP
Documentation:    man page and on-line help
Interval:         >= 1 second
Data Source:      midaemon
Type of Data:     Global, Process, Application
Metrics:          CPU, Memory, Disk, Network, Kernel resources
Logging:          Standard output device and screen shots to a file
Overhead:         Varies, depending on presentation interval and number of processes
Unique Features:  Alarming capabilities
                  Performance advisor
Full Pathname:    /opt/perf/bin/gpm
Pros and Cons:    + extensive per-process information
                  + extensive global information
                  + more accurate than standard UNIX tools
                  - no selection for printing graphs
                  - not bundled with the OS, prior to 11i

Syntax

gpm [-nosave] [-rpt [rptname]] [-sharedclr] [-nice nicevalue] [-lock] [-disks <n>] [-kernel <path>] [-lfs <n>] [-nfs <n>] [-pids <n>] [Xoptions]

Glance and GPM Advantages


Both Glance and GPM:
- Use the same metrics
- Use the midaemon and kernel registers/counters as data sources
- Have adjustable presentation intervals
- Have the ability to renice processes
- Provide alarming capability (via /var/opt/perf/advisor.syntax)
- Provide per-CPU metrics
- Can be configured to monitor application performance (that is, groups of processes)

Glance Advantages
Advantages of using Glance include:
- It is independent of X-Windows.
- It uses less overhead.

GPM Advantages
Advantages of using gpm include:
- It has customizable advisor syntax, which generates color-coded alarms.
- It has the ability to kill processes.
- Reports are customizable.
- More comprehensive online documentation is available.

See Module 3 for a more complete discussion of glance and gpm.


218. TEXT PAGE: xload


xload is a graphical tool that displays the average length of the run queue over recent 10-second intervals. Since it is displayed in its own window on a graphics terminal, the window can be resized to accommodate good detail and many intervals at once.

Tool Source:      HP
Documentation:    man page
Interval:         10 seconds (default)
Data Source:      Kernel registers
Type of Data:     Global
Metrics:          Run queue length
Logging:          none
Overhead:         Very little
Unique Feature:   Visual representation of run queue lengths
Full Pathname:    /usr/contrib/bin/X11/xload
Pros and Cons:    + visual representation of run queue lengths in real time
                  + expandable window for greater time and detail
                  + self-scaling
                  - no scale labels
                  - no per-processor information

Syntax

xload [-toolkitoption] [-scale integer] [-update seconds] [-hl|-highlight color] [-jumpscroll pixels] [-label string] [-nolabel] [-lights]

Example

xload -update 30


219. SLIDE: Data Collection Performance Tools (Standard UNIX)

Data Collection Performance Tools (Standard UNIX)


           Global Metrics   Process Details   Alarming Capability
acctcom    Some             Some              No
sar        Yes              No                No

Student Notes
This slide shows the standard UNIX data collection tools included with HP-UX. Data collection tools gather performance data and other system-activity information, and store this data to a file on the system. Few standard UNIX tools perform data collection by default. The two most common are the acct (system accounting) suite of tools and sar, the system activity reporter (via the sadc and sa1 programs).


220. TEXT PAGE: acct Programs


The system accounting programs are primarily a financial tool and are designed to charge for time and resources used on the system. Information such as connect time, pages printed, disk space used for file storage, and commands executed (and the resources used by those commands) is collected and stored by the acct commands. Generally not considered a performance tool, the accounting commands can provide useful data for certain situations.

Description
Tool Source:      Standard UNIX (System V)
Documentation:    man pages
Interval:         on demand
Data Source:      Kernel registers and other kernel routines
Type of Data:     System resources used, on a per-user basis
Metrics:          Connect time, disk space used, others
Logging:          Binary file /var/adm/acct/pacct
Overhead:         Medium to large (up to 33%), depending on number of users and amount of activity
Unique Features:  Shows the amount of system resources being consumed by each user on the system.
                  Logs every command executed by every user on the system.
Full Pathname:    /usr/sbin/acct/[acct_command]
Pros and Cons:    + provides information to charge users for system use
                  + extensive system utilization information kept
                  - extremely large overhead, especially on an active system
                  - poor documentation

Syntax

/usr/sbin/acct/acctdisk
/usr/sbin/acct/acctdusg [-u file] [-p file]
/usr/sbin/acct/accton [file]
/usr/sbin/acct/acctwtmp reason
/usr/sbin/acct/closewtmp
/usr/sbin/acct/utmp2wtmp
and many more


System Accounting Notes


System Accounting can be started:

Manually                     Run the /usr/sbin/acct/startup command.
Automatically at Boot Time   Edit the /etc/rc.config.d/acct file and set the START_ACCT parameter equal to one (for example, START_ACCT=1).

Only terminated processes are reported.

Accounting reports include:
- CPU time accounting
- Disk accounting
- Memory accounting
- Connect time accounting
- User command history
- Several more
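As a quick sketch of the manual startup path described above, paired with acctcom (the reporting command from the slide, which reads the pacct file):

# /usr/sbin/acct/startup
# /usr/sbin/acct/acctcom | tail -5

The first command turns accounting on; the second prints the last few records for terminated processes.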


221. TEXT PAGE: sar


The sar tool comes with additional programs, which assist in performance data collection and storage. The performance data is kept for one month before being overwritten with new data. Since collected data is overwritten each month, monitoring the file sizes is unnecessary.

The sadc program is a data collector which runs in the background, usually started by sar or sa1.

The sa1 program is a convenient shell script for collecting and storing sar data to a log file under /var/adm/sa. This script is typically run from root's cron file and collects (by default) three system snapshots per hour.

The sa2 program is also a convenient shell script, for converting collected sar data (binary format) into readable ASCII report files. The report files are typically stored in /var/adm/sa. The sa2 script is also normally run from root's cron file.

Tool Source:      Standard UNIX (System V)
Documentation:    man page
Interval:         >= 1 second
Data Source:      Kernel registers
Type of Data:     Global
Metrics:          CPU, Disk, Kernel resources
Logging:          Binary file under /var/adm/sa
Overhead:         Varies, depending on snapshot interval
Unique Feature:   Only standard UNIX performance data collector
Full Pathname:    /usr/sbin/sar
Pros and Cons:    + familiarity
                  + relatively low overhead
                  - no per-process information
                  - accuracy not as good as MeasureWare/OVPA

Syntax

sar [-ubdycwaqvmAMS] [-o file] t [n]
sar [-ubdycwaqvmAMS] [-s time] [-e time] [-i sec] [-f file]

Some data collection related options:


-s    The start time of the desired data
-e    The end time of the desired data
-i    The size of the reporting interval in seconds
-o    The file to write the data to
-f    The file to read the data from


Configure Data Collection through cron Jobs


To set up sar data collection, add the following to root's cron file:

0 * * * 0,6 /usr/lbin/sa/sa1
0 8-17 * * 1-5 /usr/lbin/sa/sa1 1200 3
0 18-7 * * 1-5 /usr/lbin/sa/sa1
5 18 * * 1-5 /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 3600 -u
5 18 * * 1-5 /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 3600 -b
5 18 * * 1-5 /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 3600 -q

Create the /var/adm/sa directory:

mkdir /var/adm/sa

Some systems recommend adding the above entries to adm's cron file instead of root's. On these systems, be sure to give write access to all users on the /var/adm/sa directory:

chmod a+w /var/adm/sa
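Once sa1 has been collecting for a while, the stored data can be played back with the -f option described earlier. sa1 writes one binary file per day of the month under /var/adm/sa, so the backquoted date below simply selects today's file (naming convention assumed from the standard sa1 behavior):

# sar -u -s 9:00 -e 17:00 -f /var/adm/sa/sa`date +%d`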


222. SLIDE: Data Collection Performance Tools (HP-Specific)

Data Collection Performance Tools (HP-Specific)


                          Global Metrics   Process Details   Alarming Capability
MeasureWare/OVPA          Yes              Yes               Yes
PerfView/OVPM             Yes              Yes               Yes
Data Source Integration   User Definable   User Definable    User Definable

Student Notes
This slide shows the HP-specific data collection performance tools, which can be added to an HP-UX system. The MeasureWare/OVPA (OpenView Performance Agent) and PerfView/OVPM (OpenView Performance Manager) tools are available for HP-UX systems. These tools are optional products (separately purchasable). These tools significantly enhance a customer's ability to track performance trends and review historical performance data about a system. The standard UNIX tools collect little to no per-process information, and have no alarming capabilities. With the MeasureWare/OVPA and PerfView/OVPM tools, global and per-process information is collected. In addition, alarms can be set to notify a user when a collected metric exceeds a defined threshold. Recently, PerfView was renamed OpenView Performance Manager and MeasureWare was renamed OpenView Performance Agent. There were no other significant changes made to the products.


223. TEXT PAGE: MeasureWare/OVPA and DSI Software


MeasureWare/OVPA is the recommended and preferred tool for collecting performance data on an HP-UX system. MeasureWare/OVPA collects all the global and process statistics, consolidates the data into a 5-minute summary, and writes the record to a circular log file. Processes can be grouped into applications, and various thresholds are available for determining which processes are included in the summary. OVPA version 3.x is identical to MeasureWare. OVPA version 4.x serves the same purpose, but has a new user interface.

Included with MeasureWare/OVPA is a product/tool called Data Source Integration (DSI). DSI allows custom, application-specific metrics to be defined and collected via the MeasureWare/OVPA product. This custom information can include database statistics, networking statistics collected with NetMetrix, or MIB information from a networking device (router or gateway) collected with SNMP.

Tool Source:      HP
Documentation:    man pages, manual, on-line help
Interval:         1-minute and 5-minute summaries
Data Source:      midaemon
Type of Data:     Global, Process, Application
Metrics:          CPU, Memory, Disk, Network, Other
Logging:          Circular binary files under /var/opt/perf/datafiles
Overhead:         Depends on the number of processes and the number of application definitions
Unique Features:  Parameter file to define the extent of data collection
                  Circular, compact log file format
Full Pathname:    /opt/perf/bin/mwa
Pros and Cons:    + extensive global information
                  + extensive per-process information
                  + customize data collection with DSI
                  - requires another tool (PerfView/OVPM) for graphical analysis
                  - not included with the base OS

Syntax

mwa [action] [subsystem] [parms]

in which action is:

start      Start all or part of MeasureWare/OpenView Performance Agent (default).
stop       Stop all or part of MeasureWare/OpenView Performance Agent.
restart    Reinitialize all or part of MeasureWare/OpenView Performance Agent. This option causes some processes to be stopped and restarted.
status     List the status of all or part of the MeasureWare/OpenView Performance Agent processes.
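For example, using only the actions listed above:

# mwa status
# mwa restart

The first lists the state of the collection processes; the second stops and restarts them so that configuration changes take effect.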

MeasureWare/OVPA and Data Source Integration Notes


The MeasureWare/OVPA agent for HP-UX is part of the RPM (Resource and Performance Management) set of performance tools. To find the complete list of available RPM products, visit the RPM Web site at: www.hp.com/go/rpm

MeasureWare/OVPA is designed for use with the PerfView/OVPM Analyzer tool and features extensive alarming syntax. The utility and extract programs for MeasureWare/OVPA provide many features for the analysis and management of the MeasureWare/OVPA log files.

The MeasureWare/OVPA agent is fully integrated with the OpenView product line and is capable of sending alarm messages to the PerfView/OVPM Monitor, Network Node Manager, and IT Operations.

The MeasureWare/OVPA agent is available for a large number of UNIX platforms including: AIX, Solaris, NCR System VR4, Microsoft Windows NT, and more.

Data Source Integration (DSI) is one of the most powerful features of MeasureWare/OVPA. DSI provides the ability to log data from any data source, as long as it writes its output to stdout. HP sells additional agents, which make use of this data source integration to allow for the monitoring of databases, network operating systems (for example, Windows NT and NetWare), and the Network Response Monitoring metrics (a facility of NetMetrix). Data can be imported from such operating environments as SAP/R3 and Baan.

See the course B5136S Performance Management with HP OpenView for a more complete discussion of MeasureWare/OVPA.


224. TEXT PAGE: PerfView/OVPM


The PerfView/OVPM tool allows collected MeasureWare/OVPA information to be viewed in a feature-rich GUI interface. Graphs, charts, alarms, and other details are easily viewed with the PerfView/OVPM tool. As with the MeasureWare product, OVPM version 3 is identical to PerfView, whereas OVPM version 4 has the same functionality but a new user interface.

Tool Source:      HP
Documentation:    man pages, manual, online help
Interval:         On demand
Data Source:      MeasureWare/OVPA log files
Type of Data:     Global, Process, and Application
Metrics:          CPU, Memory, Disk, Network, others
Logging:          To central monitoring workstation
Overhead:         Varies with the number of systems being analyzed and the number of systems sending alarms
Unique Features:  Many predefined graph templates
                  Access to any system currently running the MeasureWare/OVPA agent
Full Pathname:    /opt/perf/bin/pv
Pros and Cons:    + centralized and automated performance monitoring
                  + can view data from DSI sources
                  + graphs can be saved in a worksheet format
                  - does not come standard with the OS

Syntax

pv [options]

PerfView/OVPM Notes
There are three components that make up the PerfView/OVPM product:
PerfView/OVPM Analyzer

The PerfView/OVPM Analyzer allows the performance administrator to easily access data from any MeasureWare/OVPA agent. By default, the last 8 days of data are pulled in to be analyzed, but any amount of data that has been collected can be retrieved. The PerfView/OVPM Analyzer also allows you to compare multiple systems against a specific metric, for load balancing. The graphs produced by the PerfView/OVPM Analyzer can be stored, or printed to any PostScript or PCL printer. As with all of the RPM products, the PerfView/OVPM Analyzer is fully integrated with Network Node Manager and IT Operations.

PerfView/OVPM Monitor

The PerfView/OVPM Monitor receives alarms sent by MeasureWare/OVPA agents. It allows you to filter alarms by severity and type. The PerfView/OVPM Monitor is an optional module and may not be required if you are also running Network Node Manager or IT Operations.

PerfView/OVPM Planner

The PerfView/OVPM Planner allows you to use collected MeasureWare/OVPA data to see performance trends. The more data provided to the PerfView/OVPM Planner, and the shorter the period you project forward, the more accurate the reports will be. The PerfView/OVPM Planner is not a true capacity-planning tool, in that it does not provide modeling or simulation capability.

See the course B5136S Performance Management with HP OpenView for a more complete discussion of PerfView/OVPM.


225. SLIDE: Network Performance Tools (Standard UNIX)

Network Performance Tools (Standard UNIX)


           Resource                                            Super User Access Required
netstat    Various LAN statistics                              No
nfsstat    Network file sharing statistics                     No
ping       Test network connectivity and packet round-trip     No
           response time

Student Notes
This slide shows the standard UNIX networking performance tools included with HP-UX. Networking performance tools monitor performance and errors on the network. The standard UNIX networking tools primarily allow for monitoring of performance. The HP-specific tools will introduce the ability to tune some networking parameters to better meet the needs of a system's networking environment. NOTE: Super user (or root) access is not needed to monitor networking status by default.


226. TEXT PAGE: netstat


The netstat command displays general networking statistics. Information displayed includes:
- active sockets per protocol
- network data structures (like route tables)
- LAN card configuration and traffic

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man pages and manual
Interval:         on demand
Data Source:      Kernel registers and LAN card
Type of Data:     Global
Metrics:          Network, LAN I/O, Sockets
Logging:          Standard output device
Overhead:         Varies, depending on network activity
Unique Features:  Shows established and listening sockets.
                  Shows traffic going through the LAN interface card.
                  Shows amount of memory allocated to networking.
Full Pathname:    /usr/bin/netstat
Pros and Cons:    + provides lots of information on networking configuration
                  - provides lots of metrics; not all metrics are documented well

Syntax

netstat [-aAn] [-f address-family] [system [core]]
netstat [-f address-family] [-p protocol] [system [core]]
netstat [-gin] [-I interface] [interval] [system [core]]

Examples
Display network connections
# netstat -n
Active Internet connections
Proto  Recv-Q  Send-Q  Local Address          Foreign Address        (state)
tcp    0       0       156.153.192.171.1128   156.153.192.171.1129   ESTABLISHED
tcp    0       0       156.153.192.171.1129   156.153.192.171.1128   ESTABLISHED
tcp    0       0       156.153.192.171.947    156.153.192.171.1105   ESTABLISHED
Active UNIX domain sockets
Address  Type    Recv-Q  Send-Q  Inode   Conn    Refs  Nextref  Addr
c6f300   dgram   0       0       844afc  0       0     0        /var/tmp/psb_front_socket
c87e00   dgram   0       0       844c4c  0       0     0        /var/tmp/psb_back_socket
de4f00   stream  0       0       0       f75240  0     0        /var/spool/sockets/X11/0
f71200   stream  0       0       0       f75280  0     0

Display network interface information:


# netstat -in
Name  Mtu   Network        Address          Ipkts  Ierrs  Opkts  Oerrs  Coll
ni0*  0     none           none             0      0      0      0      0
ni1*  0     none           none             0      0      0      0      0
lo0   4608  127            127.0.0.1        6745   0      6745   0      0
lan0  1500  156.153.192.0  156.153.192.171  156    0      0      0      0

Display network interface traffic:


# netstat -I lan0 5
            (lan0)->                     (Total)->
 input packets  output packets   input packets  output packets
           188             172            6973            6785
             2               1               2               1
           . . .

Display protocol status:


# netstat -s
tcp:
        2244 packets sent
                1191 data packets (217208 bytes)
                4 data packets (5840 bytes) retransmitted
                692 ack-only packets (276 delayed)
                318 control packets
        2277 packets received
                1288 acks (for 195140 bytes)
                144 duplicate acks
                1360 packets (236775 bytes) received in-sequence
                0 completely duplicate packets (0 bytes)
                83 out-of-order packets (0 bytes)
                0 discarded for bad header offset fields
                0 discarded because packet too short
        134 connection requests
        120 connection accepts
        243 connections established (including accepts)
udp:
        0 bad checksums
        164 socket overflows
        0 data discards
ip:
        460730 total packets received
        0 bad header checksums
        0 with ip version unsupported
        2253 fragments received
        2670 packets not forwardable
        0 redirects sent
icmp:
        1989 calls to generate an ICMP error message
        Output histogram:
                echo reply: 727
                destination unreachable: 1989
        727 responses sent
arp:
        0 Bad packet lengths
        0 Bad headers
probe:
        0 Packets with missing sequence number
        0 Memory allocations failed
igmp:
        0 messages received with bad checksum
        10939700 membership queries received
        10969833 membership queries received with incorrect field(s)
        0 membership reports received


227. TEXT PAGE: nfsstat


The nfsstat command displays network file system (NFS) statistics. Categories of NFS information include:
- server statistics
- client statistics
- RPC statistics
- performance detail statistics

Tool Source:      Sun Microsystems
Documentation:    man pages
Interval:         on demand
Data Source:      Kernel registers
Type of Data:     Global
Metrics:          NFS, RPC
Logging:          Standard output device
Overhead:         Varies, depending on NFS activity
Unique Feature:   Shows RPC calls, retransmissions, and timeouts
Full Pathname:    /usr/bin/nfsstat
Pros and Cons:    + reports both client and server activity
                  - limited documentation

Syntax

nfsstat [ -cmnrsz ]

Examples
To reset all nfsstat counters to zero:

# nfsstat -z

To display server/client RPC and NFS statistics:

# nfsstat        (this defaults to nfsstat -cnrs)

Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Connectionless oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0

Server nfs:
calls      badcalls
0          0
Version 2: (0 calls)
null       getattr    setattr    root       lookup     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
wrcache    write      create     remove     rename     link       symlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
commit
0 0%

Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds   badverfs   timers
20         0          0          0          0          0          17
cantconn   nomem      interrupts
0          0          0
Connectionless oriented:
calls      badcalls   retrans    badxids    timeouts   waits      newcreds
20         0          0          0          0          0          0
badverfs   timers     toobig     nomem      cantsend   bufulocks
0          17         0          0          0          0

Client nfs:
calls      badcalls   clgets     cltoomany
20         0          20         0
Version 2: (20 calls)
null       getattr    setattr    root       lookup     readlink   read
0 0%       18 90%     0 0%       0 0%       0 0%       0 0%       0 0%
wrcache    write      create     remove     rename     link       symlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
mkdir      rmdir      readdir    statfs
0 0%       0 0%       1 5%       1 5%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
commit
0 0%

228. TEXT PAGE: ping


The ping command sends an ICMP echo packet to a host, and times how long it takes for the echo packet to return. This command is often used to test connectivity to another system. Specific details of the implementation include:

    An ICMP echo packet is sent once a second.
    Upon receipt of the echo packet, the round-trip time is displayed.
    The ability to display (via the -o option) the IP route taken.

Tool Source:     Public Domain
Documentation:   man pages
Interval:        on demand
Data Source:     NIC and ICMP packets
Type of Data:    Network
Metrics:         Packet transmission
Logging:         Standard output device
Overhead:        minimal; one packet transmission per second
Unique Feature:  Shows round-trip times between systems.
                 Shows route taken to and from the second system.
Full Pathname:   /usr/sbin/ping
Pros and Cons:   + familiarity
                 + understood by all UNIX-based (and TCP/IP-based) systems
                 - limited functionality

Syntax

ping [-oprv] [-i address] [-t ttl] host [-n count]

Examples
Send two ICMP echo packets to host star1:

# ping star1 -n 2
PING star1: 64 byte packets
64 bytes from 156.153.193.1: icmp_seq=0. time=1. ms
64 bytes from 156.153.193.1: icmp_seq=1. time=0. ms
----star1 PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms) min/avg/max = 0/0/1


Send one ICMP packet and display the IP path taken:

# ping -o 156.152.16.10 -n 1
PING 156.152.16.10: 64 byte packets
64 bytes from 156.152.16.10: icmp_seq=0. time=337. ms
----156.152.16.10 PING Statistics----
1 packets transmitted, 1 packets received, 0% packet loss
round-trip (ms) min/avg/max = 337/337/337
1 packets sent via:
    15.63.200.2    - [ name lookup failed ]
    15.68.88.4     - [ name lookup failed ]
    156.152.16.1   - [ name lookup failed ]
    156.152.16.10  - [ name lookup failed ]
    15.68.88.43    - [ name lookup failed ]
    15.63.200.1    - [ name lookup failed ]


229. SLIDE: Network Performance Tools (HP-Specific)

Network Performance Tools (HP-Specific)


Resource                                                              Super User Access Required
lanadmin         Layer 2 Networking Statistics and NIC Reset          Yes
lanscan          LAN Hardware and Software Status                     No
nettune (10.x)   Change Kernel Networking Parameters                  Yes
ndd (11.x)       Change Kernel Networking Parameters                  Yes
NetMetrix        Collects network performance data using              Yes
                 RMON LAN probes

Student Notes
This slide shows the HP-specific networking performance tools included with HP-UX. The first three tools listed (lanadmin, lanscan, and ndd/nettune) come standard with the base OS. The NetMetrix product is an additional product. The HP-specific networking tools display additional networking information and allow tuning of various networking parameters.


230. TEXT PAGE: lanadmin


The lanadmin command tests, displays statistics for, and allows modifications to LAN cards on the HP-UX system. Specific capabilities include:

    Resetting the LAN card and executing the LAN card self-tests
    Displaying and clearing LAN card statistics
    Changing the LAN card speed, the MTU size, and the link level address

Tool Source:     HP
Documentation:   man pages
Interval:        on demand
Data Source:     Kernel registers and Network Interface Card
Type of Data:    Network
Metrics:         Packet transmission status and errors
Logging:         Standard output device
Overhead:        minimal
Unique Feature:  Allows LAN interface card to be reset.
Full Pathname:   /usr/sbin/lanadmin
Pros and Cons:   + provides extensive transmission statistics
                 + allows for tuning of parameters normally requiring source code to change
                 - many statistics have little to no documentation

Syntax

/usr/sbin/lanadmin [-e] [-t]
/usr/sbin/lanadmin [-a] [-A station_addr] [-m] [-M mtu_size] [-R] [-s] [-S speed] NetMgmtID

-e    Echo the input commands on the output device.
-t    Suppress the display of the command menu before each command prompt.

Example
# lanadmin
Test Selection mode.
        lan      = LAN Interface Administration
        menu     = Display this menu
        quit     = Terminate the Administration
        verbose  = Display command menu

Enter command: lan


LAN Interface test mode. LAN Interface Net Mgmt ID = 4
        clear    = Clear statistics registers
        display  = Display LAN Interface status and statistics registers
        end      = End LAN Interface Administration, return to Test Selection
        menu     = Display the menu
        ppa      = PPA Number of the LAN Interface
        quit     = Terminate the Administration, return to shell
        nmid     = Network Management ID of the LAN Interface
        reset    = Reset LAN Interface to execute its selftest
        specific = Go to Driver specific menu

Enter command: display

Network Management ID           = 4
Description                     = lan0 Hewlett-Packard LAN Interface Hw Rev 0
Type (value)                    = ethernet-csmacd(6)
MTU Size                        = 1500
Speed                           = 10000000
Station Address                 = 0x8000935c9bd
Administration Status (value)   = up(1)
Operation Status (value)        = up(1)
Last Change                     = 14465
Inbound Octets                  = 3606105787
Inbound Unicast Packets         = 2767086
Inbound Non-Unicast Packets     = 88379016
Inbound Discards                = 0
Inbound Errors                  = 464396
Inbound Unknown Protocols       = 7114206
Outbound Octets                 = 458391388
Outbound Unicast Packets        = 2842387
Outbound Non-Unicast Packets    = 2874
Outbound Discards               = 0
Outbound Errors                 = 0
Outbound Queue Length           = 0
Specific                        = 655367

Ethernet-like Statistics Group

Index                           = 4
Alignment Errors                = 0
FCS Errors                      = 0
Single Collision Frames         = 21353
Multiple Collision Frames       = 42774
Deferred Transmissions          = 281589
Late Collisions                 = 0
Excessive Collisions            = 0
Internal MAC Transmit Errors    = 0
Carrier Sense Errors            = 0
Frames Too Long                 = 0
Internal MAC Receive Errors     = 0


231. TEXT PAGE: lanscan


The lanscan command displays the LAN card configuration and status. Items displayed include:

    Hardware address of LAN card slot
    Link level address of card
    Hardware status and interface status
    Other status and configuration information

Tool Source:     HP
Documentation:   man pages
Interval:        on demand
Data Source:     Network Interface Card
Type of Data:    Network
Metrics:         Interface status, Link Level Address
Logging:         Standard output device
Overhead:        minimal
Unique Feature:  Shows Link Level Address of system.
Full Pathname:   /usr/sbin/lanscan
Pros and Cons:   + provides additional status information about network interface cards
                 - no performance information

Syntax
lanscan [-ainv] [system [core]]

-a    Display station addresses only. No headings.
-i    Display interface names only. No headings.
-n    Display Network Management IDs only. No headings.
-v    Verbose output. Two lines per interface. Includes displaying of extended station address and supported encapsulation methods.

Examples
Output from a 10.x system:
# lanscan
Hardware Station        Crd  Hardware  Net-Interface   NM  MAC    HP DLPI  Mjr
Path     Address        In#  State     NameUnit State  ID  Type   Support  Num
2/0/2    0x080009D2C2DE 0    UP        lan0     UP     4   ETHER  Yes      52

Output from an 11.x system:


# lanscan
Hardware Station        Crd  Hdw    Net-Interface  NM  MAC    HP-DLPI  DLPI
Path     Address        In#  State  NamePPA        ID  Type   Support  Mjr#
2/0/2    0x08000978BDB0 0    UP     lan0 snap0     1   ETHER  Yes      119


232. TEXT PAGE: nettune (HP-UX 10.x Only)


The nettune command allows modifications to be made to network parameters, which in previous releases were not modifiable. This command was not included with any HP-UX 11.x release. Parameters that can be modified with nettune include:

    arp configuration
    socket buffer sizes
    enable or disable IP forwarding

CAUTION:         Use caution when making modifications with the tool. It is possible to hurt network performance severely or disable the LAN card when using this tool.

Tool Source:     HP
Documentation:   man pages, nettune help options (-?, -l, -h)
Interval:        on demand
Data Source:     Kernel registers and NIC
Type of Data:    Global
Metrics:         LAN tunable parameters
Logging:         Standard output device
Overhead:        minimal
Unique Feature:  Change values of network parameters, which cannot otherwise be changed
                 Change TCP send and receive buffer sizes without need for source code
Full Pathname:   /usr/contrib/bin/nettune
Pros and Cons:   + provides ability to modify networking behavior without needing source code
                 + provides access to tunable parameters normally not available
                 - can have a negative impact on performance if used the wrong way
                 - minimal documentation

Syntax

nettune [-w] object [parm...]
nettune -h [-w] [object]
nettune -l [-w] [-b size] [object [parm...]]
nettune -s [-w] object [parm...] value...

-h    (help)  Print all information related to the object. This information provides helpful hints about changing the value of an object.
-l    (list)  Print information regarding changing the value of object.
-s    (set)   Set object to value. An object may require more than one value.
-w            Display warning messages (for example, 'value truncated'). These are normally discarded when the command is successful.

Examples
To get help information on all defined objects:

nettune -h
arp_killcomplete:
    The number of seconds that an arp entry can be in the completed state
    between references. When a completed arp entry is unreferenced for this
    period of time, it is removed from the arp cache.
. . .

To get help information on all TCP-related objects:

nettune -h tcp
tcp_receive:
    The default socket buffer size in bytes for inbound data.
tcp_send:
    The default socket buffer size in bytes for outbound data.
. . .

To set the value of the ip_forwarding object to 1:

nettune -s ip_forwarding 1

To get the value of the tcp_send object (socket send buffer size):

nettune tcp_send


233. TEXT PAGE: ndd (HP-UX 11.x Only)


The ndd command allows the examination and modification of several tunable parameters that affect networking operation and behavior. It accepts arguments on the command line or may be run interactively. The -h option displays all the supported and unsupported tunable parameters that ndd provides. CAUTION: ndd was ported to HP-UX and contains references to some parameters that have not been implemented on the HP-UX O/S at this time. Reference the man page when in doubt. (Just because you can display a symbol's value and set it doesn't necessarily mean that the HP-UX kernel references the symbol!)

The ndd utility accesses kernel parameters through the use of "pseudo device files". These pseudo device files are referred to as a network device on the ndd command line and selected from the following list:

    /dev/arp     For ARP cache-related values
    /dev/ip      For IP routing and forwarding parameters
    /dev/rawip   Default IP time-to-live header value
    /dev/tcp     Transmission Control Protocol (connection-based) parameters
    /dev/udp     User Datagram Protocol (connectionless) parameters

Tool Source:     HP
Documentation:   man pages, ndd -h (for help options)
Interval:        on demand
Data Source:     network device pseudo device files (referenced above)
Type of Data:    Global
Metrics:         LAN tunable parameters
Logging:         Standard output device
Overhead:        minimal
Unique Feature:  Change values of network parameters, which cannot otherwise be changed
Full Pathname:   /usr/bin/ndd
Pros and Cons:   + provides ability to modify networking behavior without needing source code
                 + provides access to tunable parameters normally not available
                 - can have a negative impact on performance if used the wrong way
                 - minimal documentation

Syntax

ndd -get network_device parameter
ndd -set network_device parameter value
ndd -h sup[ported]
ndd -h unsup[ported]
ndd -h [parameter]
ndd -c

At boot:    The file /etc/rc.config.d/nddconf contains tunable parameters that will be set automatically each time the system boots.
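For reference, entries in /etc/rc.config.d/nddconf are indexed shell variables that name the transport, the parameter, and the value. A minimal sketch (the parameters shown are taken from the examples below; the index numbers are arbitrary):

# /etc/rc.config.d/nddconf -- read at boot, or on demand with ndd -c
TRANSPORT_NAME[0]=ip
NDD_NAME[0]=ip_forwarding
NDD_VALUE[0]=1

TRANSPORT_NAME[1]=udp
NDD_NAME[1]=udp_def_ttl
NDD_VALUE[1]=128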

Examples
To list the contents of the "arp cache": ndd -get /dev/arp arp_cache_report

To get help information on all supported tunable parameters: ndd -h supported

To get a detailed description of the tunable parameter, ip_forwarding: ndd -h ip_forwarding

To get the current value of the tunable parameter, ip_forwarding: ndd -get /dev/ip ip_forwarding

To set the value of the default TTL parameter for UDP to 128: ndd -set /dev/udp udp_def_ttl 128

To re-read the configuration file, /etc/rc.config.d/nddconf without rebooting the system: ndd -c


234. TEXT PAGE: NetMetrix (HP-UX 10.20 and 11.0 Only)


The NetMetrix product makes use of LAN probes to collect network traffic information. The LAN probes attach to the physical network and collect detailed information regarding the packets that pass through the probe. Tools available with NetMetrix include:

    packet decoders
    network alarming capabilities
    reports including top packet generating systems
    data collection for trending

Tool Source:     HP
Documentation:   man pages, NRF (Network Response Facility) manual
Interval:        on demand
Data Source:     LAN probes
Type of Data:    LAN traffic
Metrics:         number of packets through cross-section of network
Logging:         NetMetrix binary file
Overhead:        Varies, depending on the number of LAN probes
Unique Feature:  Provides statistics regarding traffic on the entire network
Pros and Cons:   + Statistics regarding total packet traffic
                 - Additional cost
                 - Requires LAN probes

NetMetrix Notes
NetMetrix makes use of highly sophisticated devices (LAN probes) capable of collecting large amounts of detailed network information. NetMetrix is a truly distributed network management product that makes use of "midlevel managers" for data storage and alarming. There are a number of modules available with NetMetrix. NetMetrix's Internet Response Manager (IRM) and Internet Response Agent (IRA) fully integrate with HP OpenView products to provide a complete system and network management solution.


235. SLIDE: Performance Administrative Tools (Standard UNIX)

Performance Administrative Tools (Standard UNIX)


Resource                                                              Super User Access Required
ipcs      List Semaphores, Message Queues, and Shared Memory          No
          Segments
ipcrm     Destroy Semaphores, Message Queues, and Shared Memory       Yes
          Segments
nice      Setting Process Priorities                                  Yes
renice    Modifying Process Priorities                                Yes

Student Notes
This slide shows the standard UNIX administrative performance tools included with HP-UX. These tools are used to tune or modify system resources to better improve the performance of a system. These tools are typically used to change or tune a system's component, as opposed to viewing or displaying characteristics about the component. Only the root user is allowed to use these commands, as making these modifications affects the performance for all users on the system. NOTE: The ipcs program is really a performance-monitoring command; however, because it is usually run in conjunction with ipcrm, it is covered here to emphasize the relationship between the two commands.


236. TEXT PAGE: ipcs, ipcrm


The ipcs command displays information about active interprocess communication facilities. With no options, ipcs displays information in short format about message queues, shared memory segments, and semaphore sets that are currently active in the system. The ipcrm command removes one or more specified message-queue, semaphore-set, or shared-memory identifiers.

Tool Source:     Standard UNIX (System V)
Documentation:   man pages
Interval:        on demand
Data Source:     Kernel registers
Type of Data:    Global, limited process
Metrics:         semaphore sets, message queues, shared memory
Logging:         Standard output device
Overhead:        varies, depending on the IPC resource in use
Unique Feature:  Shows the size, owner, and last user of message queues and shared memory segments.
Full Pathname:   /usr/bin/ipcs and /usr/bin/ipcrm
Pros and Cons:   + shows orphan IPC entries
                 + shows size of message queues and shared memory segments
                 - process information limited to owner and last user

Syntax

ipcrm [-m shmid] [-q msqid] [-s semid]

ipcs [-mqs] [-abcopt] [-C corefile] [-N namelist]

-m    Display information about active shared memory segments.
-q    Display information about active message queues.
-s    Display information about active semaphore sets.
-b    Display largest-allowable-size information
-c    Display creator's login name and group name information
-o    Display information on outstanding usage
-p    Display process number information
-t    Display time information


Examples
# ipcs -s
IPC status from /dev/kmem as of Fri Oct 17 12:56:36 1997
T    ID    KEY         MODE        OWNER    GROUP
Semaphores:
s    0     0x2f180002  --ra-ra-ra  root     sys
s    3     0x412000a9  --ra-ra-ra  root     root
s    4     0x00446f6e  --ra-r--r-  root     root
s    6     0x01090522  --ra-r--r-  root     root
s    7     0x013d8483  --ra-r--r-  root     root
s    200   0x4c1c2f79  --ra-r--r-  daemon   daemon

# ipcrm -s 7

# ipcs -s
IPC status from /dev/kmem as of Fri Oct 17 12:57:42 1997
T    ID    KEY         MODE        OWNER    GROUP
Semaphores:
s    0     0x2f180002  --ra-ra-ra  root     sys
s    3     0x412000a9  --ra-ra-ra  root     root
s    4     0x00446f6e  --ra-r--r-  root     root
s    6     0x01090522  --ra-r--r-  root     root
s    200   0x4c1c2f79  --ra-r--r-  daemon   daemon


237. TEXT PAGE: nice, renice


The nice command executes a command at a nondefault CPU scheduling priority. (The name is derived from being "nice" to other system users by running large programs at a weaker priority.) The renice command alters the nice value of an existing process.

Tool Source:     Standard UNIX (System V)
Documentation:   man pages
Interval:        on demand
Data Source:     process table
Type of Data:    processes
Metrics:         priority
Logging:         standard output device
Overhead:        minimal
Full Pathname:   /usr/bin/nice and /usr/bin/renice
Pros and Cons:   + allows less important processes to run in the background
                 + allows more important processes to run in the foreground
                 - not an intuitive interface or syntax

Syntax

nice [-n newoffset_from_default_20] command [command_args]
renice [-n newoffset_from_current_value] [-g|-p|-u] id ...

An unsigned newoffset increases the system nice value for the command or process, causing it to run at a weaker priority. A negative value requires superuser privileges, and assigns a lower system nice value (stronger priority) to the process.

Examples
# ps -l
F S   UID   PID  PPID  C PRI NI ADDR     SZ WCHAN   TTY    TIME COMD
1 S     0  6044  6042  1 158 20 ff6680   85 87cec0  ttyp2  0:00 sh
1 R     0  8286  6044  6 179 20 1003d80  22 -       ttyp2  0:00 ps
# nice sh
# ps -l
F S   UID   PID  PPID  C PRI NI ADDR     SZ WCHAN   TTY    TIME COMD
1 S     0  6044  6042 11 158 20 ff6680   85 87cec0  ttyp2  0:00 sh
1 S     0  8290  8287  0 158 30 ff1680   85 100d3e0 ttyp2  0:00 sh
1 R     0  8293  8290  4 199 30 feae80   22 -       ttyp2  0:00 ps
# exit

# nice -10 sh
# ps -l
F S   UID   PID  PPID  C PRI NI ADDR     SZ WCHAN   TTY    TIME COMD
1 S     0  6044  6042  0 158 20 ff6680   85 87cec0  ttyp2  0:00 sh
1 R     0  8297  8294  7 199 30 ff1280   22 -       ttyp2  0:00 ps
1 S     0  8294  6044 10 158 30 fea380  121 87e0c0  ttyp2  0:00 sh

# nice -5 ps -l
F S   UID   PID  PPID  C PRI NI ADDR     SZ WCHAN   TTY    TIME COMD
1 S     0  6044  6042  0 158 20 ff6680   85 87cec0  ttyp2  0:00 sh
1 R     0  8304  8294 10 210 35 1003e80  22 -       ttyp2  0:00 ps
1 S     0  8294  6044 10 158 30 fea380  121 87e0c0  ttyp2  0:00 sh

# nice -n 30 sh
# ps -l
F S   UID   PID  PPID  C PRI NI ADDR     SZ WCHAN   TTY    TIME COMD
1 S     0  6044  6042  0 158 20 ff6680   85 87cec0  ttyp2  0:00 sh
1 S     0  8305  8294 19 158 39 fb3300  121 87d6c0  ttyp2  0:00 sh
1 S     0  8294  6044  6 158 30 fea380  121 87e0c0  ttyp2  0:00 sh
1 R     0  8308  8305  4 220 39 feae80   22 -       ttyp2  0:00 ps
# exit

# nice -n -30 sh
# ps -l
F S   UID   PID  PPID  C PRI NI ADDR     SZ WCHAN   TTY    TIME COMD
1 S     0  6044  6042  0 158 20 ff6680   85 87cec0  ttyp2  0:00 sh
1 S     0  8306  8294  1 158 30 f86200  121 87dc40  ttyp2  0:00 sh
1 S     0  8309  8306  7 158  0 fea380  121 87e0c0  ttyp2  0:00 sh
1 R     0  8312  8309  6 139  0 1003980  22 -       ttyp2  0:00 ps


238. SLIDE: Performance Administrative Tools (HP-Specific)

Performance Administrative Tools (HP-Specific)


Resource                                                              Super User Access Required
getprivgrp        List system privileged groups                       No
setprivgrp        Allocate special system privileges                  Yes
rtprio            Set real time process priority (HP)                 Privileged Access
rtsched           Set POSIX real time process priority                Privileged Access
scsictl           Set parameters on SCSI devices                      Yes
serialize         Mark a program to run serially                      Privileged Access
fsadm             Online JFS management tool                          Yes
getext            Display JFS extent attributes                       No
setext            Sets/changes JFS extent attributes                  Yes
newfs             Create a file system                                Yes
tunefs/vxtunefs   Change a file system's attributes                   Yes
PRM/WLM           Process Resource Mgr/Work Load Mgr                  Yes
WebQoS            Web Quality of Service                              Yes

Student Notes
This slide shows the HP-specific administrative performance tools available on HP-UX systems. Many of the tools shown on the slide come standard with the base OS. The only tools that are add-on products are PRM, WLM, WebQoS, and Advanced JFS (getext, setext, and fsadm). These HP-specific tools were developed to allow modifications and performance enhancements to the functionality unique to the HP-UX operating system.


239. TEXT PAGE: getprivgrp, setprivgrp


The getprivgrp command lists the access privileges of privileged groups. The setprivgrp command sets the access privileges of privileged groups. If a group_name is supplied, access privileges are listed for that group only. The superuser is a member of all groups. Access privileges include RTPRIO, RTSCHED, MLOCK, CHOWN, LOCKRDONLY, SETRUGID, MPCTL, SPUCTL, and SERIALIZE.

Tool Source:     HP
Documentation:   man pages
Interval:        on demand
Data Source:     /etc/group and kernel data structures
Type of Data:    users and groups
Metrics:         privilege access
Logging:         Standard output device
Overhead:        minimal
Unique Feature:  Gives non-root users access to privileges normally requiring root access.
Full Pathname:   /usr/bin/getprivgrp and /usr/sbin/setprivgrp
Pros and Cons:   + ability to assign additional privileges to groups
                 - requires additional system management
                 - cannot give privilege to a single user; must assign privileges to groups

Syntax

getprivgrp [-g|group_name]
setprivgrp [-g|groupname] [privileges]

-g    Specify global privileges that apply to all groups.

Examples
# getprivgrp
global privileges: CHOWN

# setprivgrp class CHOWN SERIALIZE RTPRIO

# getprivgrp
global privileges: CHOWN
class: RTPRIO CHOWN SERIALIZE


Notes
Group privileges which can be modified are:

RTPRIO        Can use rtprio() call to set real-time priorities.
RTSCHED       Can use sched_setparam() call and sched_setscheduler() call to set POSIX.4 real-time priorities.
MLOCK         Can use plock() to lock process text and data into memory, and the shmctl() SHM_LOCK function to lock shared memory segments.
CHOWN         Can use chown() to change file ownership.
LOCKRDONLY    Can use lockf() to set locks on files that are open for reading only.
SETRUGID      Can use setuid() and setgid() to change, respectively, the real user ID and real group ID of a process.
SERIALIZE     Can use serialize() to force the target process to run serially with other processes that are also marked by this system call.
MPCTL         Can use mpctl() to lock a process or a thread to a specific processor on SMP systems. If processor sets are available, can be used to lock a process or a thread to a specific processor set.
SPUCTL        Can use spuctl() to enable and disable specific processors on SMP systems. (V-class, T-class, N-class, L-class, and Superdome only)
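Privileges set with setprivgrp do not survive a reboot. One way to make them persistent (a sketch; the group name class and the privilege list come from the example above) is to record them in /etc/privgroup, which is applied at boot time and can also be applied by hand:

# cat /etc/privgroup
class RTPRIO CHOWN SERIALIZE

# setprivgrp -f /etc/privgroup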


240. TEXT PAGE: rtprio


The rtprio command executes a specified command with a real-time priority, or changes the real-time priority of a currently executing process with a specific PID. Real-time priorities range from zero (strongest) to 127 (weakest). Real-time processes are not subject to priority degradation and are considered of greater importance than all non-real-time processes.

CAUTION:         Special care should be taken when using this command. It is possible to lock out other processes (including system processes) when using this command.

Tool Source:     HP
Documentation:   man pages
Interval:        on demand
Data Source:     process table
Type of Data:    process
Metrics:         process priority
Logging:         none
Overhead:        varies, depending on the activity of the process
Unique Feature:  assign real time priority to a process
Full Pathname:   /usr/bin/rtprio
Pros and Cons:   + Can significantly improve the performance of a program
                 - Can severely impact the performance of the system (if used incorrectly)

Syntax

rtprio priority command [arguments]
rtprio priority -pid
rtprio -t command [arguments]
rtprio -t -pid

-t    Execute command with a timeshare (non-real-time) priority, or change the currently executing process pid from a possibly real-time priority to a timeshare priority.

Examples
Execute file a.out at a real-time priority of 100:

rtprio 100 a.out

Set the currently running process PID 24217 to a real-time priority of 40:

rtprio 40 -24217


241. TEXT PAGE: rtsched


The rtsched command executes commands with POSIX or HP-UX real-time priority, or changes the real-time priority of a currently executing process PID. All POSIX real-time priority processes are of greater scheduling importance than processes with HP-UX real-time or HP-UX timeshare priority. Neither POSIX nor HP-UX real-time processes are subject to degradation. POSIX real-time processes can be scheduled with one of three different POSIX scheduling policies specified: SCHED_FIFO, SCHED_RR, or SCHED_RR2. The number of POSIX real-time priority queues is tunable between the values of 32 and 512, and these priorities show up as a negative number between -1 and -512 when viewed with the ps -ef or ps -el commands.

CAUTION:         Special care should be taken when using this command. It is possible to lock out other processes (including system processes) when using this command.

Tool Source:     HP
Documentation:   man pages (also see rtsched(2))
Interval:        on demand
Data Source:     process table
Type of Data:    process
Metrics:         process priority
Logging:         none
Overhead:        varies, depending on the activity of the process
Unique Feature:  assign real time priority to a process
Full Pathname:   /usr/bin/rtsched
Pros and Cons:   + Can significantly improve the performance of a program
                 - Can severely impact the performance of the system (if used incorrectly)

Syntax

rtsched -s scheduler -p priority command [arguments]
rtsched [-s scheduler] -p priority -P pid

-s    Specifies which scheduler to use, SCHED_FIFO (POSIX real-time), SCHED_RR (POSIX real-time), SCHED_RR2 (POSIX real-time), SCHED_RTPRIO (HP-UX real-time), or SCHED_HPUX (HP-UX timeshare)


Examples
Execute file a.out at a POSIX real-time priority of 4:

rtsched -s SCHED_FIFO -p 4 a.out

Set the currently running process pid 24217 to a real-time priority of 20:

rtsched -s SCHED_RR -p 20 -P 24217


242. TEXT PAGE: scsictl


The scsictl command provides a mechanism for controlling a SCSI device. It can be used to query mode parameters, set configurable mode parameters, and perform SCSI commands. The operations are performed in the same order as they appear on the command line.

Tool Source:     HP
Documentation:   man pages
Interval:        on demand
Data Source:     SCSI disks
Type of Data:    disks
Metrics:         immediate reporting, I/O queue
Logging:         standard output device
Overhead:        minimal
Unique Feature:  Provides control over the behavior of an individual SCSI disk
Full Pathname:   /usr/sbin/scsictl
Pros and Cons:   + can improve performance by modifying the drive behavior
                 - not all SCSI devices support the command
                 - could misconfigure a disk, causing data to be lost in the event of a system crash

Syntax
scsictl [-akq] [-c command]... [-m mode[=value]]... device

-a             Display the status of all mode parameters available.
-m mode        Display the status of the specified mode parameter.
               ir           For devices that support immediate reporting, this displays the immediate reporting status.
               queue_depth  For devices that support a queue depth greater than the system default, this mode controls how many I/Os the driver will attempt to queue to the device at any one time.
-m mode=value  Set the mode parameter mode to value. The available mode parameters and values are listed above.


Examples
To display a list of all of the mode parameters, turn immediate_report on, and redisplay the value of immediate_report:

scsictl -a -m ir=1 -m ir /dev/rdsk/c0t6d0

will produce the following output:

immediate_report = 0; queue_depth = 8; immediate_report = 1


243. TEXT PAGE: serialize


The serialize command is used to force the target process to run serially with other processes also marked by this command. Once a process has been marked by serialize, the process stays marked until process completion, unless serialize is reissued.

Tool Source:     HP
Documentation:   man pages
Interval:        on demand
Data Source:     process table
Type of Data:    process
Metrics:         priority
Logging:         standard output device
Overhead:        minimal
Unique Feature:  decreases CPU and memory contention problems using standard functionality.
Full Pathname:   /usr/bin/serialize
Pros and Cons:   + allows system to behave more efficiently when CPU and memory resources are scarce
                 - minimal documentation
                 - only helps when CPU and memory resources are scarce

Syntax

serialize command [command_args]
serialize [-t] [-p pid]

-t    Indicates the process specified by pid should be returned to timeshare scheduling.

Examples
Use serialize to force a database application to run serially with other processes marked for serialization. Type:

serialize database_app

Force a currently running process with a PID value of 215 to run serially with other processes marked for serialization. Type:

serialize -p 215

Return a process previously marked for serialization to normal timeshare scheduling. The PID of the target process for this example is 174. Type:

serialize -t -p 174


244. TEXT PAGE: fsadm


The fsadm command is designed to perform selected administration tasks on HFS (10.20 or later) and JFS file systems. These tasks may differ between file system types. For HFS file systems, fsadm allows conversions between largefiles and nolargefiles file systems. For VxFS file systems, fsadm allows file system resizing, extent (and directory) reorganization, and largefiles/nolargefiles conversions.

Tool Source:     Veritas and HP
Documentation:   man pages
Interval:        on demand
Data Source:     File System superblock and header structures
Type of Data:    file system header and data
Metrics:         fragmentation
Logging:         standard output device
Overhead:        Medium to large (up to 33%), depending on number of users and amount of activity
Unique Features: Can defragment a file system, improving performance. (JFS)
                 Can increase the size of a file system while it's mounted. (JFS)
Full Pathname:   /usr/sbin/fsadm
Pros and Cons:   + provides greater manageability of file systems
                 - many features (including defragmentation) are only available for JFS
                 - requires purchasing the AdvancedJFS or OnlineJFS product

Syntax
/usr/sbin/fsadm [-F vxfs|hfs] [-V] [-o largefiles|nolargefiles] mount_point|special
/usr/sbin/fsadm [-F vxfs] [-V] [-b newsize] [-r rawdev] mount_point
/usr/sbin/fsadm [-F vxfs] [-V] [-d] [-D] [-s] [-v] [-a days] [-t time] [-p passes] [-r rawdev] mount_point

Examples
HFS Example

Convert a nolargefiles HFS file system to a largefiles HFS file system:

fsadm -F hfs -o largefiles /dev/vg02/lvol1

Display relevant HFS file system statistics:

fsadm -F hfs /dev/vg02/lvol1


JFS Example

Increase the size of the var file system to 100 MB while it is mounted and online:

lvextend -L 100 /dev/vg00/lvol7
fsadm -F vxfs -b 102400 /var

Display fragmentation statistics for the /home file system:

fsadm -D -E /home


245. TEXT PAGE: getext, setext


The getext command displays extent attribute information of associated files on a JFS file system. The setext command allows attributes related to JFS file systems and files within the JFS file system to be modified and tuned.

Tool Source:     Veritas
Documentation:   man pages
Interval:        on demand
Data Source:     JFS file system
Type of Data:    File system metadata structures
Metrics:         File system space allocation
Logging:         standard output device
Overhead:        minimal
Unique Feature:  Allows attributes of JFS files to be set
Full Pathname:   /usr/sbin/getext and /usr/sbin/setext
Pros and Cons:   + can improve file system performance by modifying file attributes
                 - require purchase of the AdvancedJFS or OnlineJFS product

Syntax

/usr/sbin/getext [-V] [-f] [-s] file...


/usr/sbin/setext [-V] [-e extent_size] [-r reservation] [-f flag] file

Example
Display file attributes for the file, file1:

getext file1
file1: Bsize 1024 Reserve 36 Extent Size 3 align noextend

The above output indicates a file with 36 blocks of reservation, a fixed extent size of 3 blocks, all extents aligned to 3-block boundaries, and the file cannot be extended once the current reservation is exhausted.
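The matching setext invocation is sketched below (hypothetical usage assembled from the syntax above; -f is shown twice on the assumption that allocation flags may be repeated):

# reserve 36 blocks, fix the extent size at 3 blocks,
# align extents, and prevent growth beyond the reservation
setext -r 36 -e 3 -f align -f noextend file1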


246. TEXT PAGE: newfs, tunefs, vxtunefs


The newfs command is a "friendly" front end to the mkfs command. The newfs command calculates the appropriate parameters and then builds the file system by invoking the mkfs command. The tunefs command displays detailed configuration information for an HFS file system and allows some of the file system parameters to be modified.

Tool Source:     BSD 4.x, modified by HP; Veritas
Documentation:   man pages
Interval:        not applicable, on demand
Data Source:     file system header and superblock
Type of Data:    file system metadata structures
Metrics:         Block size, Fragment size, Minimum space
Logging:         standard output
Overhead:        minimal
Unique Feature:  Allows file system parameters to be displayed and set.
Full Pathname:   /usr/sbin/newfs, /usr/sbin/tunefs, /usr/sbin/vxtunefs
Pros and Cons:   + File system parameters can be viewed and tuned for optimal performance
                 - To tune many parameters, a re-initialization of the file system is required
Syntax

/usr/sbin/newfs [-F FStype] [-o specific_options] [-V] special
/usr/sbin/tunefs [-A] [-v] [-a maxcontig] [-d rotdelay] [-e maxbpg] [-m minfree] special-device
/usr/sbin/vxtunefs
Notes

The initial file system parameters are set when the file system is first created with newfs. A small set of these parameters can be changed after the file system is created with tunefs. vxtunefs changes the attributes of the JFS file system when the file system is mounted. NOTE: The tunefs command works only for HFS file systems. The JFS file systems use other commands (getext, setext, vxtunefs).


Examples
Create a file system on vg01 called lvol1:

newfs -F hfs -b 16384 -f 2048 /dev/vg01/rlvol1

mkfs (hfs): Warning - 2 sector(s) in the last cylinder are not allocated.
mkfs (hfs): /dev/vg01/rlvol1 - 20480 sectors in 133 cylinders of 7 tracks, 22 sectors
        21.0Mb in 9 cyl groups (16 c/g, 2.52Mb/g, 384 i/g)
Super block backups (for fsck -b) at:
 16, 2512, 5008, 7504, 10000, 12496, 14992, 17488, 19728

View the file system's configuration parameters:


tunefs -v /dev/vg01/rlvol1

super block last mounted on:
magic     95014   clean     FS_CLEAN  time     Fri Nov 28 07:02:58 1997
sblkno    8       cblkno    16        iblkno   24          dblkno  48
sbsize    2048    cgsize    2048      cgoffset 16          cgmask  0xfffffff8
ncg       9       size      10240     blocks   9858
bsize     16384   bshift    14        bmask    0xffffc000
fsize     2048    fshift    11        fmask    0xfffff800
frag      8       fragshift 3         fsbtodb  1
minfree   10%     maxbpg    38        maxcontig 1
rotdelay  0ms     rps       60
csaddr    48      cssize    28672     csshift  10          csmask  0xfffffc00
ntrak     7       nsect     22        spc      154         ncyl    133
cpg       16      bpg       154       fpg      1232        ipg     384
nindir    4096    inopb     128       nspf     2
nbfree    1230    ndir      2         nifree   3452        nffree  9
cgrotor   0       fmod      0         ronly    0           fname   fpack
cylinders in last group 5
blocks in last group 48

For VxFS file systems use:

# fsdb -F vxfs /dev/vgNN/rlvolN
> 8192 B
> p S


247. TEXT PAGE: Process Resource Manager (PRM)


Process Resource Manager (PRM) allows the administrator to guarantee that important processes will receive the amount of memory, disk, and CPU time required to meet your performance objectives. PRM works in conjunction with the standard HP-UX scheduler to improve response times for critical applications. PRM provides state-of-the-art resource allocation that has long been missing in the UNIX environment.

Tool Source:     HP
Documentation:   PRM man pages (prmconfig)
Interval:        on demand
Data Source:     kernel registers and counters
Type of Data:    process groups as defined by the PRM configuration file
Metrics:         CPU time, memory, and disk I/O bandwidth allocated to groups of processes
Logging:         standard output, glance, gpm, perfview/OVPM, measureware/OVPA
Overhead:        PRM only applies to time-shared processes. Real-time processes are not affected.
Unique Features: allows the system administrator to control which groups of processes receive a certain percentage of the CPU's time, memory paging, and/or disk I/O request preference:
                 CPU (per PRM group) entitlement and capping
                 DISK (per PRM group per VG) entitlement
                 Memory (per PRM group) entitlement, capping and selection method
                 Application (per PRM group)
Full Pathname:   /usr/sbin/prmconfig
Pros and Cons:   + Greater control of resource distributions
                 - Optional product. Does not come standard with the OS. If you are running 11i in the Enterprise or Mission Critical Operating Environments, PRM is included.
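As an illustration only (the group names, PRMIDs, and share values below are hypothetical, and the record layout is a sketch of the PRM configuration-file style), a minimal /etc/prmconf could split CPU shares among three groups:

# PRM group records -- group name : PRMID : CPU shares
OTHERS:1:20::
database:2:50::
batch:3:30::

The configuration would then be loaded and enabled with something like:

# prmconfig -i      (a sketch; loads the PRM configuration and enables PRM)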

See the course U5447S HP-UX Resource Management with PRM & WLM for a more complete discussion of PRM.


248. TEXT PAGE: Work Load Manager (WLM)


The Work Load Manager sits on top of PRM and tunes it as necessary to meet the desired performance goals. The goals are defined in a configuration file in the form of Service Level Objectives (SLOs). The administrator defines these goals in the file and then lets WLM tweak PRM until the goals are reached or are approached as closely as possible.

Tool Source:     HP
Documentation:   WLM man pages (wlmd)
Interval:        on demand
Data Source:     kernel registers and counters
Type of Data:    process groups as defined by the WLM configuration file
Metrics:         As defined in the WLM configuration file
Logging:         Data can be sent to an EMS (Event Monitoring System)
Overhead:        Data collection of defined metrics and adjusting of PRM configuration
Unique Features: allows the system administrator to define what Service Level Objectives are desired on the system and lets WLM tune the system (via PRM) to obtain performance as close to those objectives as possible:
                 CPU (per WLM group) entitlement
                 DISK (per WLM group per VG) entitlement
                 Memory (per WLM group) entitlement
                 Application (per WLM group)
Full Pathname:   /opt/wlm/bin/wlmd
Pros and Cons:   + Greater control of CPU distribution
                 - Optional product. Does not come standard with the OS.
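For illustration (the group name, SLO name, metric name, and goal value are hypothetical; the structure is a sketch of the WLM configuration-file style), an SLO that ties a response-time goal to a PRM group might look like:

prm {
    groups = sales : 2, batch : 3;
}

slo sales_response {
    pri = 1;
    entity = PRM group sales;
    mincpu = 20;
    maxcpu = 80;
    goal = metric sales_resp_time < 2.0;
}

The configuration would then be activated with the daemon, for example: wlmd -a config.wlm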

See the course U5447S HP-UX Resource Management with PRM & WLM for a more complete discussion of WLM.


249. TEXT PAGE: Web Quality of Service (WebQoS)


WebQoS manager is an example of the growing number of system performance management/enhancement products focused on specific server applications and environments. The modern paradigm for application server management requires looking past simple performance metrics and forces us to start thinking a little out of the box. Do all requests received by a Web server warrant the same level of service? WebQoS allows the administrator to make decisions on the service level, based on several different criteria:

    admission control
    user differentiation
    activity differentiation
    application differentiation

A discussion of the specifics of this product is beyond the scope of this class.

Tool Source:     HP
Typical Metrics: Number of concurrent users and response times.
Purpose:         Maximize successful customer interactions and peak throughput.
Pros and Cons:   + Greater control of Web server resources tuned to specific client requests.
                 - Optional product. Does not come standard with the OS.


250. SLIDE: System Configuration and Utilization Information (Standard UNIX)

System Configuration and Utilization Information (Standard UNIX)


Resource                                                      Portability
bdf       Local and remote mounted file system space          Some
df        Mounted file system space                           Yes
mount     Local and remote file system mounts                 Yes

Student Notes
This slide shows the standard UNIX tools for displaying system configuration and utilization information on an HP-UX system. System configuration and utilization tools are those which display configurations of LVM disks, file systems, and kernel resources.


251. TEXT PAGE: bdf, df


The bdf command displays the amount of free disk space available. If no file system is specified, the free space on all of the normally mounted file systems is printed. Free inode information can be displayed by using the -i option. The df command displays the number of free 512-byte blocks and free inodes available for file systems by examining the counts kept in the superblock or superblocks. Blocks can be displayed in 1KB sizes by using the -k option.

Tool Source:     df, Standard UNIX (System V); bdf, Standard UNIX (Berkeley 4.x)
Documentation:   man pages
Interval:        on demand
Data Source:     File system superblocks
Type of Data:    Disk space resources
Metrics:         Disk space utilization
Logging:         Standard output
Overhead:        Minimal
Unique Feature:  Shows how much disk space is being utilized.
Full Pathname:   /usr/bin/bdf, /usr/bin/df
Pros and Cons:   + Easy to use
                 - minimal tuning statistics

Syntax

/usr/bin/bdf [-b] [-i] [-l] [-t type | [filesystem|file] ... ]
/usr/bin/df [-befgiklnv] [-t|-P] [-o specific_options] [-V] [special|directory]...

Examples
bdf Command

# bdf /usr
Filesystem          kbytes    used   avail %used Mounted on
/dev/vg00/lvol7     307200  279059   -9635  103% /usr

# bdf -i /
Filesystem          kbytes   used  avail %used  iused  ifree %iuse Mounted on
/dev/vg00/lvol3      40960  25093  14869   63%   3284   3960   45% /

# bdf -ib /home
Filesystem          kbytes   used  avail %used  iused  ifree %iuse Mounted on
/dev/vg00/lvol4      53248   3586  46546    7%    513  12407    4% /home
Swapping             53248      0  40546    0%                     /home/paging

# ll /home/paging
total 0


Examples
df Command

# df
/home   (/dev/vg00/lvol4 ):    93062 blocks    12403 i-nodes
/opt    (/dev/vg00/lvol5 ):   177124 blocks    23598 i-nodes
/tmp    (/dev/vg00/lvol6 ):    90010 blocks    11982 i-nodes
/usr    (/dev/vg00/lvol7 ):    52732 blocks     7011 i-nodes
/var    (/dev/vg00/lvol8 ):   100122 blocks    13320 i-nodes
/stand  (/dev/vg00/lvol1 ):    23596 blocks     5358 i-nodes


252. TEXT PAGE: mount


The mount command is used to mount file systems on the system. Other users can use mount to list mounted file systems. If mount is invoked without any arguments, it lists all of the mounted file systems from the file system mount table, /etc/mnttab.

Tool Source:     standard UNIX (System V)
Documentation:   man pages
Interval:        on demand
Data Source:     kernel mount table and /etc/mnttab file
Type of Data:    file system
Metrics:         file system type and mount options
Logging:         the file /etc/mnttab and standard output
Overhead:        minimal
Unique Feature:  used to mount HFS, JFS, and NFS file systems.
Full Pathname:   /sbin/mount
Pros and Cons:   + displays valuable data regarding how file systems are mounted
                 - different options depending on the type of file system being mounted

Syntax

/usr/sbin/mount [-l] [-p|-v]

Examples
# mount -p
/dev/root        /       vxfs  log       0 0
/dev/vg00/lvol1  /stand  hfs   defaults  0 0
/dev/vg00/lvol6  /usr    vxfs  delaylog  0 0
/dev/vg00/lvol5  /tmp    vxfs  delaylog  0 0
/dev/vg00/lvol4  /opt    vxfs  delaylog  0 0
/dev/dsk/c0t4d0  /disk   hfs   defaults  0 0
/dev/vg00/lvol7  /var    vxfs  delaylog  0 0

# mount -v
/dev/root on / type vxfs log on Thu Sep 11 12:15:08 1997
/dev/vg00/lvol1 on /stand type hfs defaults on Thu Sep 11 12:15:11 1997
/dev/vg00/lvol6 on /usr type vxfs delaylog on Thu Sep 11 12:17:06 1997
/dev/vg00/lvol5 on /tmp type vxfs delaylog on Thu Sep 11 12:17:07 1997
/dev/vg00/lvol4 on /opt type vxfs delaylog on Thu Sep 11 12:17:07 1997
/dev/dsk/c0t4d0 on /disk type hfs defaults on Thu Sep 11 12:17:08 1997
/dev/vg00/lvol7 on /var type vxfs delaylog on Thu Sep 11 12:17:23 1997
#


253. SLIDE: System Configuration and Utilization Information (HP-Specific)

System Configuration and Utilization Information (HP-Specific)


Resource                                                        Portability
diskinfo    Size and model of local disk drives                 No
dmesg       I/O tree and memory details                         Some
ioscan      I/O tree and addressing                             No
vgdisplay   Local volume group contents/attributes              No
pvdisplay   Local physical volume contents/attributes           No
lvdisplay   Local logical volume contents/attributes            No
swapinfo    Swap space utilization                              No
sysdef      Sizes and values of kernel tables and parms         Some
kmtune      Query, set, or reset system parameters              Some
kcweb       Query, set, or reset system configuration           Some

Student Notes
This slide shows the HP-specific commands for displaying system configuration and utilization information. All the commands on the slide come standard with the base OS; none are add-on products. These commands display the configuration and utilization of HP-specific subsystems. Many of these commands have corresponding commands on other UNIX systems that perform similar functions.


254. TEXT PAGE: diskinfo


The diskinfo command determines whether the character special file named by character_devicefile is associated with a SCSI, CS/80, or Subset/80 disk drive. If so, diskinfo summarizes the disk's characteristics. Both the size of disk and bytes per sector represent formatted media.

Tool Source:     HP
Documentation:   man pages
Interval:        on demand
Data Source:     controller on disk
Type of Data:    disk specific
Metrics:         disk capacity, sector size
Logging:         standard output
Overhead:        minimal
Unique Feature:  shows model number and manufacturer of disk
Full Pathname:   /usr/sbin/diskinfo
Pros and Cons:   + can determine size and manufacturer of disk without having to open system
                 - minimal tuning information

Syntax

/usr/sbin/diskinfo [-b|-v] character_devicefile

The diskinfo command displays information about the following characteristics of disk drives:

    vendor name, manufacturer of the drive (SCSI only)
    product identification number or ASCII name
    type, CS/80 or SCSI classification for the device
    size of disk specified in bytes
    sector size, specified as bytes per sector

Example
# diskinfo /dev/rdsk/c0t6d0
SCSI describe of /dev/rdsk/c0t6d0:
             vendor: QUANTUM
         product id: PD425S
               type: direct access
               size: 416575 Kbytes
   bytes per sector: 512


255. TEXT PAGE: dmesg


The dmesg command looks in a system buffer for recently printed diagnostic messages and prints them on the standard output. The messages are those printed by the system when unusual events occur (such as when system tables overflow or the file systems get full).

Tool Source:     HP
Documentation:   man pages
Interval:        on demand
Data Source:     kernel diagnostic buffer
Type of Data:    system diagnostic messages
Metrics:         kernel startup information
Logging:         standard output device
Overhead:        minimal
Unique Feature:  displays kernel diagnostic messages
Full Pathname:   /sbin/dmesg
Pros and Cons:   + Allows kernel diagnostic messages to be recalled
                 - Diagnostic messages can be lost since kernel buffer is a fixed size

Syntax

/usr/sbin/dmesg [-]

If the - argument is specified, dmesg computes (incrementally) the new messages since the last time it was run and places these on the standard output. This is typically used with cron (see cron(1)) to produce the error log /var/adm/messages by running the command

/usr/sbin/dmesg - >> /var/adm/messages

every 10 minutes.
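A sketch of the crontab entry this describes (the 10-minute schedule comes from the text above; run it from root's crontab):

# append any new kernel diagnostic messages every 10 minutes
0,10,20,30,40,50 * * * * /usr/sbin/dmesg - >> /var/adm/messages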

Example
# dmesg
Oct 17 12:39
vuseg=1815000
inet_clts:ok  inet_cots:ok
    1  graph3
    2  bus_adapter
2/0/1      c720
2/0/1.0    tgt
2/0/1.0.0  stape
2/0/1.2    tgt
2/0/1.2.0  sdisk
2/0/1.3    tgt
2/0/1.3.0  stape
2/0/1.4    tgt
2/0/1.4.0  sdisk
2/0/1.7    tgt
2/0/1.7.0  sctl
2/0/2      lan2
2/0/3      hil
2/0/4      asio0
2/0/5      asio0
2/0/6      CentIf
2/0/7      c720
2/0/7.5    tgt
2/0/7.5.0  sdisk
2/0/7.6    tgt
2/0/7.6.0  sdisk
2/0/7.7    tgt
2/0/7.7.0  sctl
2/0/8      audio
    4  eisa
4/0/4      lan2
    8  processor
    9  memory
System Console is on the ITE
Networking memory for fragment reassembly is restricted to 36265984 bytes
Logical volume 64, 0x3 configured as ROOT
Logical volume 64, 0x2 configured as SWAP
Logical volume 64, 0x2 configured as DUMP
Swap device table:  (start & size given in 512-byte blocks)
    entry 0 - major is 64, minor is 0x2; start = 0, size = 819200
Dump device table:  (start & size given in 1-Kbyte blocks)
    entry 0 - major is 31, minor is 0x26000; start = 68447, size = 393217
Starting the STREAMS daemons.
B2352B HP-UX (B.10.20) #1: Sun Jun 9 08:03:38 PDT 1996
Memory Information:
    physical page size = 4096 bytes, logical page size = 4096 bytes
    Physical: 393216 Kbytes, lockable: 302512 Kbytes, available: 349504 Kbytes
Using 1932 buffers containing 15360 Kbytes of memory.
SCSI: Request Timeout -- lbolt: 7543017, dev: cd000001
lbp->state: 0  lbp->offset: ffffffff  lbp->uPhysScript: 2a24000
From most recent interrupt: ISTAT: 06, SIST0: 04, SIST1: 00, DSTAT: 80, DSPS: 00000006
lsp: 1febc00  bp->b_dev: cd000001
scb->io_id: 57b13  scb->cdb: 08 00 00 08 00 00
lbolt_at_timeout: 7544517, lbolt_at_start: 7543017
lsp->state: 30d  lsp->uPhysScript: 196e000  lsp->upScript: 196a000
lsp->upActivePtr: 196a000  lsp->uActiveAdjust: 0
lsp->upSavedPtr: 196a000  lsp->uSavedAdjust: 0
lsp->upPeakPtr: 196a000  lsp->uPeakAdjust: 0
lbp->owner: 1febc00  scratch_lsp: 0
Pre-DSP script dump [1b20020]:
    78051800 00000000 78030000 00000000
    0e000002 02a24700 80000000 00000000
Script dump [1b20040]:
    9f0b0000 00000006 98080000 00000005
    98080000 00000001 58000008 00000000


256. TEXT PAGE: ioscan


The ioscan command scans system hardware, usable I/O system devices, or kernel I/O system data structures as appropriate, and lists the results. For each hardware module on the system, ioscan displays the hardware path to the hardware module, the class of the hardware module, and a brief description. By default, ioscan scans the system and lists all reportable hardware found. The types of hardware reported include processors, memory, interface cards and I/O devices. Scanning the hardware may cause drivers to be unbound and others bound in their place in order to match actual system hardware. Entities that cannot be scanned are not listed. On very large systems, ioscan will operate much faster with the -k option. This forces ioscan to read kernel structures built at boot time, rather than sending fresh inquiries to each hardware module.

Tool Source:     HP
Documentation:   man pages
Interval:        on demand
Data Source:     SCSI devices
Type of Data:    status and Hardware address
Metrics:         hardware status
Logging:         standard output
Overhead:        minimal
Unique Feature:  polls SCSI bus to retrieve status of SCSI devices
Full Pathname:   /usr/sbin/ioscan
Pros and Cons:   + Displays hardware addresses and corresponding device filenames.
                 - Minimal performance data

Syntax
/usr/sbin/ioscan [-k|-u] [-d driver|-C class] [-I instance] [-H hw_path] \ [-f[-n]|-F[-n]] [devfile]

Examples
# ioscan -f
Class      I  H/W Path   Driver       S/W State  H/W Type   Description
===========================================================================
bc         0             root         CLAIMED    BUS_NEXUS
graphics   0  0          graph3       CLAIMED    INTERFACE  Graphics
ba         0  2          bus_adapter  CLAIMED    BUS_NEXUS  Core I/O Adapter
ext_bus    0  2/0/1      c720         CLAIMED    INTERFACE  Built-in SCSI
target     0  2/0/1.0    tgt          CLAIMED    DEVICE
disk       0  2/0/1.0.0  sflop        CLAIMED    DEVICE     TEAC FC-1 HF 07
target     1  2/0/1.1    tgt          CLAIMED    DEVICE
tape       0  2/0/1.1.0  stape        CLAIMED    DEVICE     HP HP35470A
target     2  2/0/1.2    tgt          CLAIMED    DEVICE
disk       1  2/0/1.2.0  sdisk        CLAIMED    DEVICE     TOSHIBA CD-ROM XM-3301TA
target     5  2/0/1.5    tgt          CLAIMED    DEVICE
disk       4  2/0/1.5.0  sdisk        CLAIMED    DEVICE     QUANTUM FIREBALL1050S
target     6  2/0/1.6    tgt          CLAIMED    DEVICE
disk       5  2/0/1.6.0  sdisk        CLAIMED    DEVICE     QUANTUM PD425S
target     7  2/0/1.7    tgt          CLAIMED    DEVICE
ctl        0  2/0/1.7.0  sctl         CLAIMED    DEVICE     Initiator
lan        0  2/0/2      lan2         CLAIMED    INTERFACE  Built-in LAN
hil        0  2/0/3      hil          CLAIMED    INTERFACE  Built-in HIL
tty        0  2/0/4      asio0        CLAIMED    INTERFACE  Built-in RS-232C
tty        1  2/0/5      asio0        CLAIMED    INTERFACE  Built-in RS-232C
ext_bus    1  2/0/6      CentIf       CLAIMED    INTERFACE  Built-in Parallel Interface
audio      0  2/0/8      audio        CLAIMED    INTERFACE  Built-in Audio
processor  0  8          processor    CLAIMED    PROCESSOR  Processor
memory     0  9          memory       CLAIMED    MEMORY     Memory

# ioscan -fC disk
Class  I  H/W Path   Driver  S/W State  H/W Type  Description
=========================================================================
disk   5  2/0/1.6.0  sdisk   CLAIMED    DEVICE    QUANTUM PD425S

# ioscan -fnC disk
Class  I  H/W Path   Driver  S/W State  H/W Type  Description
=========================================================================
disk   5  2/0/1.6.0  sdisk   CLAIMED    DEVICE    QUANTUM PD425S
                     /dev/dsk/c0t6d0  /dev/rdsk/c0t6d0


2-57. TEXT PAGE: vgdisplay, pvdisplay, lvdisplay


The vgdisplay command displays information about volume groups. If a specific vg_name is specified, information for just that volume group is displayed.

The pvdisplay command displays information about specific physical volumes (or disks) within an LVM volume group.

The lvdisplay command displays information about specific logical volumes within an LVM volume group.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      LVM header structures and /etc/lvmtab
Type of Data:     LVM configuration
Metrics:          mirroring, striping, other I/O policies
Logging:          standard output device
Overhead:         minimal
Unique Feature:   shows LVM configuration information
Full Pathname:    /usr/sbin/vgdisplay, /usr/sbin/pvdisplay, /usr/sbin/lvdisplay
Pros and Cons:    + Only commands for viewing LVM configurations
                  - Minimal tuning capabilities

Syntax

/sbin/vgdisplay [-v] [vg_name ...]
/sbin/lvdisplay [-k] [-v] lv_path ...
/sbin/pvdisplay [-v] [-b BlockList] pv_path ...
Examples

# vgdisplay
--- Volume groups ---
VG Name                     /dev/vg00
VG Write Access             read/write
VG Status                   available
Max LV                      255
Cur LV                      9
Max PV                      16
Cur PV                      2
Max PE per PV               1016
VGDA                        4
PE Size (Mbytes)            4
Total PE                    726
Alloc PE                    279
Free PE                     447
Total PVG                   0

# pvdisplay /dev/dsk/c0t5d0
--- Physical volumes ---
PV Name                     /dev/dsk/c0t5d0
VG Name                     /dev/vg00
PV Status                   available
Allocatable                 yes
VGDA                        2
Cur LV                      7
PE Size (Mbytes)            4
Total PE                    249
Free PE                     0
Allocated PE                249
Stale PE                    0
IO Timeout                  default

# lvdisplay /dev/vg00/lvol1
--- Logical volumes ---
LV Name                     /dev/vg00/lvol1
VG Name                     /dev/vg00
LV Permission               read/write
LV Status                   available/syncd
Mirror copies               0
Consistency Recovery        MWC
Schedule                    parallel
LV Size (Mbytes)            48
Current LE                  12
Allocated PE                12
Stripes                     0
Stripe Size (Kbytes)        0
Bad block                   off
Allocation                  strict/contiguous


2-58. TEXT PAGE: swapinfo


The swapinfo command prints information about device and file-system paging space. This information includes reserved swap space as well as used swap space.

NOTE: The term swap refers to an obsolete implementation of virtual memory; HP-UX actually implements virtual memory by way of paging rather than swapping. This command and others retain names derived from "swap" for historical reasons.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      kernel swap tables
Type of Data:     swap space
Metrics:          swap used, swap reserved, swap space configurations
Logging:          standard output device
Overhead:         minimal
Unique Feature:   Command can total all configured swap space into a one-line summary.
                  Displays pseudo-swap information (if configured).
Full Pathname:    /usr/sbin/swapinfo
Pros and Cons:    + provides valuable swap space configuration information
                  - minimal documentation on pseudo-swap

Syntax

/usr/sbin/swapinfo [-mtadfnrMqw]

Examples
# swapinfo -t
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      159744   19868  139876   12%       0       -   1  /dev/vg00/lvol2
reserve       -   51220  -51220
memory    42112   15300   26812   36%
total    201856   86388  115468   43%       -       0   -


2-59. TEXT PAGE: sysdef


The sysdef command analyzes the currently running system and reports on its tunable configuration parameters.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      /stand/vmunix and the currently running kernel
Type of Data:     Tunable kernel parameters
Metrics:          Current configuration of kernel parameters
Logging:          Standard output device
Overhead:         Minimal
Unique Feature:   Shows current value and possible range of values
Full Pathname:    /usr/sbin/sysdef
Pros and Cons:    + Shows current setting of kernel parameters
                  - reboot required to change most parameters

Syntax

/usr/sbin/sysdef [kernel [master]]

Example
# /usr/sbin/sysdef
NAME                   VALUE   BOOT  MIN-MAX        UNITS  FLAGS
acctresume                 4         -100-100              -
acctsuspend                2         -100-100              -
allocate_fs_swapmap        0         0-1                   -
bufpages                2841         0-             Pages  -
create_fastlinks           0         0-1                   -
dbc_max_pct               50                               -
dbc_min_pct                5                               -
default_disk_ir            1                               -
dskless_node               0                               -
eisa_io_estimate         768                               -
eqmemsize                 15                               -
file_pad                  10                               -
fs_async                   0                               -
hpux_aes_override          0                               -
maxdsiz                16384         256-655360     Pages  -
maxfiles                  60         30-2048               -
maxfiles_lim            1024         30-2048               -
maxssiz                 2048         256-655360     Pages  -
maxswapchunks            256         1-16384               -
maxtsiz                16384         256-655360     Pages  -
maxuprc                   75         3-                    -
maxvgs                    10                               -
msgmap               2555904                               -
nbuf                    4788                               -
ncallout                 292                               -



ncdnode                  150
ndilbuffers               30
netisr_priority           -1         -1-127
netmemmax            5378048
nfile                    800
nflocks                  200
ninode                   476
no_lvm_disks               0
nproc                    276
npty                      60
nstrpty                   60
nswapdev                  10         1-25
nswapfs                   10         1-25
public_shlibs              1         0-1
remote_nfs_swap            0         0-1
rtsched_numpri            32
sema                       0
semmap               4128768
shmem                      0
shmmni                   200         3-1024
streampipes                0
swapmem_on                 1
swchunk                 2048         2048-16384     kBytes
timeslice                 10         -1-2147483648  Ticks
unlockable_mem           801         0-             Pages

The columns in the sysdef report are:

Name     The name of the parameter
Value    The current value of the parameter
Boot     The value of the parameter at boot time
Min      The minimum allowed value of the parameter
Max      The maximum allowed value of the parameter
Units    The units by which the parameter is measured
Flags    Further describe the parameter:
         M    Parameter may be modified without rebooting

A comparable command, introduced at HP-UX 11.00, is kmtune(1m).


2-60. TEXT PAGE: kmtune, kcweb


The kmtune command is used to query, set, or reset system parameters. kmtune displays the value of all system parameters when used without any options or with the -S or -l option. kmtune reads the master files and the system description files of the kernel and kernel modules. On 11i v2, kmtune is a front end to kctune and will eventually be replaced entirely by it. kctune is part of a new, larger utility called kcweb.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      /stand/vmunix and the currently running kernel
Type of Data:     Tunable kernel parameters
Metrics:          Current configuration of kernel parameters
Logging:          Standard output device
Overhead:         Minimal
Unique Feature:   Works with dynamic and static kernel modules
Full Pathname:    /usr/sbin/kmtune

Syntax

/usr/sbin/kmtune [-l] [[-q name] ...] [-S system_file]
/usr/sbin/kmtune [[-s name{+|=}value] ...] [[-r name] ...] [-S system_file]

Examples
# /usr/sbin/kmtune
Parameter                   Value
===================================================================
NSTRBLKSCHED                2
NSTREVENT                   50
NSTRPUSH                    16
NSTRSCHED                   0
. . .

# /usr/sbin/kmtune -l -q maxdsiz
Parameter:    maxdsiz
Value:        0x04000000
Default:      0x04000000
Minimum:
Module:       (11i only)
Version:      (11i only)
Dynamic:


2-61. SLIDE: Application Profiling and Monitoring Tools (Standard UNIX)

Application Profiling and Monitoring Tools (Standard UNIX)


Resource   Description                                           Super User Access Required
prof       Application Profiler                                  No
gprof      Enhanced Application Profiler                         No
arm        Define and measure response time of transactions
           for an application                                    No

Student Notes
This slide shows the standard UNIX application profiling performance tools included with HP-UX. Application profiling tools provide in-depth details regarding the execution of a program, including the number of times each subroutine is called and the amount of time spent in each subroutine.


2-62. TEXT PAGE: prof, gprof


The prof and gprof tools are used to ascertain the library routines being called during the execution of a program.

The prof utility profiles the execution of an application by displaying the names of the routines being called, the number of times the different routines were called, and how much time was spent in each routine.

The gprof utility is an enhanced version of prof. It shows all the information available with prof, plus it displays a call graph tree, which details the call hierarchy of the routines. The call graph tree lets you see which parent routines called which child routines.

Tool Source:      standard UNIX (System V)
Documentation:    man pages
Interval:         on demand
Data Source:      kernel routines called by the application
Type of Data:     function call flow
Metrics:          time spent in each function, number of times function was called
Logging:          binary file mon.out
Overhead:         significant delays in the execution of the application
Unique Feature:   shows the flow of the function calls
Full Pathname:    /usr/bin/prof
Pros and Cons:    + shows where an application is spending its time
                  - requires access to source code
                  - requires application to be recompiled

Syntax

prof [-tcan] [-ox] [-g] [-z] [-h] [-s] [-m mdata] [prog]
gprof [options] [a.out [gmon.out...]]

Examples
cc -p prog.c -o program
./program
prof program

cc -G prog.c -o program
./program
gprof program
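To make the workflow above concrete, here is a small, self-contained C program (our own toy example, not from the course materials) that could stand in for prog.c. After compiling with -p (or -G for gprof) and running it, the profiler attributes the time to the two worker routines:

/* prog.c -- toy workload for prof/gprof experiments */
#include <stdio.h>

/* burn CPU in a tight loop so the profiler has something to measure */
double spin(long n)
{
    double sum = 0.0;
    long   i;

    for (i = 0; i < n; i++)
        sum += (double)i * 0.5;
    return sum;
}

/* a second routine: called many times, doing little work per call */
double spin_small(void)
{
    return spin(10000);
}

int main(void)
{
    double total = 0.0;
    int    i;

    total += spin(50000000L);        /* one long-running call */
    for (i = 0; i < 1000; i++)
        total += spin_small();       /* many short calls */

    printf("total = %f\n", total);   /* keep the result observable */
    return 0;
}

In the prof report, spin should dominate the time column; in the gprof call graph, spin appears as a child of both main and spin_small, which is exactly the parent/child detail that prof alone cannot show.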


2-63. TEXT PAGE: Application Response Measurement (ARM) Library Routines


Description
The ARM library routines allow you to define and measure response time of transactions in any application that uses a programming language that can call a 'C' function. The ARM library is named "libarm" and is provided in two versions, an archive version and a shared library version. It is strongly recommended that you use the shared (sometimes referred to as dynamic) library version. In-depth discussion of this product is beyond the scope of this class.

NOTE: arm is a cross-platform tool and functionally replaces the ttd discussed in the next section. glance and gpm work equally well with either arm or ttd.

Documentation:        man 3 arm
Interval:             configurable
Pros and Cons:        + Integrates with PerfView/MWA and other distributed management/monitoring tools.
                      - requires source code modification
Platforms supported:  HP-UX, IBM AIX, Sun Solaris, NCR

Syntax:

The six function calls used by arm are:

arm_init     Return a unique ID based on application and user.
arm_getid    Return a unique ID based on a transaction name.
arm_start    Mark the beginning of a specific transaction.
arm_update   Provide information or show progress of a specific transaction.
arm_stop     Mark the end of a specific transaction.
arm_end      Mark the end of an application.
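As a sketch of how these calls fit together, the fragment below instruments a single transaction. The application and transaction names are our own inventions, and the argument lists follow the general ARM 2.0 convention of trailing flags/data/data_size parameters (passed here as zeros); verify the exact prototypes and status constants against <arm.h> and man 3 arm on your system before relying on them.

/* arm_sketch.c -- hedged sketch of ARM instrumentation, not a
 * definitive implementation; check <arm.h> for exact prototypes.
 */
#include <arm.h>

int main(void)
{
    int appl_id, tran_id, handle;   /* ARM IDs are 32-bit integers */

    /* register the application and name one transaction class */
    appl_id = arm_init("demo_app", "*", 0, 0, 0);
    tran_id = arm_getid(appl_id, "demo_tran", "", 0, 0, 0);

    /* bracket the work to be timed */
    handle = arm_start(tran_id, 0, 0, 0);

    /* ... the transaction being measured goes here ... */

    arm_stop(handle, 0, 0, 0, 0);   /* status 0 = completed successfully */

    arm_end(appl_id, 0, 0, 0);      /* application shutdown */
    return 0;
}

On HP-UX the header and the shared libarm mentioned above typically live under /opt/perf, so a plausible (assumed) compile line is: cc -I/opt/perf/include -o arm_sketch arm_sketch.c -L/opt/perf/lib -larm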


2-64. SLIDE: Application Profiling and Monitoring Tools (HP-Specific)

Application Profiling and Monitoring Tools (HP-Specific)


           Description                                              Super User Access Required
ttd        Tracks how much time is spent between specific lines
           of code in a program                                     No
caliper    A runtime performance analyzer for programs compiled
           with C, C++, and Fortran 90 compilers on Itanium
           systems                                                  No

Student Notes
This slide shows some HP-specific application profiling tools included with HP-UX. Currently, the Transaction Tracker (ttd) and caliper are available for monitoring application behavior and performance.

In 10.20, there was a tool called puma, which came with all standard programming language compilers (like C, Pascal, and Fortran). The puma tool allowed profiling data to be collected without having to modify the application source code or, in many cases, recompile the application. puma has been excluded from the more recent releases of HP-UX.

The Transaction Tracker allows a programmer to time how long a program is spending within a certain area of code. It requires that the source code be modified to include the starting point and the stopping point. The Transaction Tracker is included as part of the MeasureWare/OVPA product and is HP-UX-specific. arm (discussed earlier) is the generic version of the Transaction Tracker.

caliper is thread-aware, MP-aware, and features an easy command-line interface.


2-65. TEXT PAGE: Transaction Tracker


Description
The Transaction Tracker is a set of function calls that allow a programmer to time the execution of a particular body of code (referred to as a transaction). The function calls are inserted into the source code to mark where a particular transaction begins and ends. Glance and gpm can then be used to monitor how many times the transaction is called, and how long it takes for the transaction to complete.

Tool Source:      HP MeasureWare
Documentation:    Users manual
Interval:         Every time a Transaction Tracker function call is invoked within the program
Data Source:      The ttd process
Type of Data:     Application execution times
Metrics:          Times to one hundredth of a second
Logging:          Binary file /var/opt/perf/datafiles/logtrans
Overhead:         Medium to large, depending on the number of transactions being timed
Unique Feature:   Shows amount of time spent in a particular body of code
Full Pathname:    Function calls defined in /opt/perf/include/tt.h
Pros and Cons:    + Integrated with glance and gpm; makes it easy to monitor how long transactions take.
                  - Cannot be used within shell programs; C programs only (or programs which can call C routines).
Syntax

The four function calls used by Transaction Tracker are:

tt_getid   Names the transaction and returns a unique identifier.
tt_start   Signals the start of a unique transaction.
tt_end     Signals the end of the transaction.
tt_abort   Ends the transaction without recording times for the transaction.


2-66. TEXT PAGE: caliper HP Performance Analyzer


Description
HP Caliper is a general-purpose performance analysis tool for applications on Itanium-based HP-UX systems. HP Caliper allows you to understand the performance of your application and to identify ways to improve its run-time performance. HP Caliper works with any Itanium-based binary and does not require your applications to have any special preparation to enable performance measurement.

The two primary ways to use HP Caliper are:

    As a performance analysis tool.
    As a profile-based optimization (PBO) tool invoked by HP compilers.

The latest version of HP Caliper is available on the HP Caliper home page. You can find it at the http://www.hp.com/go/hpcaliper/ site.

Overview
HP Caliper helps you dynamically measure and improve the performance of your native Itanium-based applications in three ways:

    Commands to measure the overall performance of your program.
    Commands to drill down to identify performance parameters of specific functions in your program.
    A simple way to optimize the performance of your program based on its specific execution profile.

HP Caliper does not require special compilation of the program being analyzed and does not require any special link options or libraries. HP Caliper selectively measures the processes, threads, and load modules of your application. An application's load modules are the main executable and all shared libraries it uses. HP Caliper uses a combination of dynamic instrumentation of code and the performance monitoring unit (PMU) in the Itanium processor. HP Caliper uses the least-intrusive method available to gather performance data.

Supported Target Programs


HP Caliper includes support for:

    Programs compiled for Itanium- and Itanium 2-based systems. HP Caliper does not measure programs compiled for PA-RISC processors.
    Code generated by native and cross HP aC++, C++, and Fortran compilers, including inlined functions and C++ exceptions.
    Programs compiled with optimization or debug information, or both. This includes support for both the +objdebug and +noobjdebug options.


    Both ILP32 (+DD32) and LP64 (+DD64) programs, in both 32-bit and 64-bit ELF formats.
    Archive-, minshared-, or shared-bound executables.
    Both single- and multi-threaded applications, including MxN threads.
    Applications that fork() or vfork() or exec() themselves or other executables.
    Shell scripts and the programs they spawn.

Features
HP Caliper is simple to run because it uses a single command for all measurements. You specify the type of measurement and the target program as command-line arguments. For example, to measure the total number of CPU cycles used by a program named myprog, just type:

caliper total_cpu myprog

HP Caliper features include:

    Multiple performance measurements, each of which can be customized through configuration files.
    All reports are available in text format and comma-delimited (CSV) format, and most reports are also available in HTML format for easier browsing.
    Performance data can be correlated to your source program by line number.
    Easy inclusion and exclusion of specific load modules, such as libc, when measuring performance.
    Both per-thread and aggregated thread reports for most measurements.
    Performance data reported by function, sorted to show hot spots.
    Support for multi-process selection capabilities.
    The ability to save performance data in files that you can use to aggregate data across multiple runs and to generate reports without having to re-run HP Caliper.
    The ability to attach and detach to running processes for certain measurements.
    The ability to restrict PMU measurements to specific regions of your programs.
    Limited support for dynamically generated code.


2-67. SLIDE: Summary

Summary
Different categories of performance tools
Standard UNIX tools versus HP-specific tools
Separately purchasable tools
Kernel register-based tools versus midaemon-based tools

Student Notes
To summarize this module, there are many performance tools for many different purposes. The objective of this module was to highlight the performance tools available with HP-UX, to categorize them by function, and to describe how each tool works.

In general, you should become most familiar with these tools:

    sar
    vmstat
    top
    glance/gpm (if available)

These will tend to be your most commonly used tools. Other tools will tend to be useful in more specialized situations.

Remember, never try to rely on just one tool to do everything. No tool will tell you everything, and every tool will mislead you somewhere down the line. No tool is perfect. That's why you need to be familiar with multiple tools.


2-68. LAB: Performance Tools Lab

Lab

Before we continue with a more focused discussion of glance and gpm, let's spend some time exploring the generic UNIX and HP-UX-specific tools discussed so far. As you answer the following questions, try to categorize each tool as to its type and scope.

Student Notes
The goal of this lab is to gain familiarity with the performance tools. A secondary goal is to become familiar with the metrics reported by the tools, although these will be explored in depth over the next few days.

Directions
Set up: Change directories to:

# cd /home/h4262/tools

Execute the setup script:

# ./RUN

Use glance (or gpm if you have a bit-mapped display), sar, top, vmstat, and any other available tools to answer the following questions. List as many as possible, and include the appropriate OPTION or SCREEN which will give the requested information.

Specific numbers are not the important goal of this lab. The goal is to gain familiarity with a variety of performance tools. Always investigate what the basic UNIX tools can tell you before running glance or gpm. You may want to run through this lab with the solution from the back of this book for more guidance and discussion.


1. How many processes are running on the system? Which tools can you use to determine this?

2. Are there any real-time priority processes running? If so, list the name and priority. What tools can you use to determine this?

3. Are there any nice'd processes on the system? If so, list the name and priority for each. What tools can you use to determine this?

4. Are there any zombie processes on the system? If so, how many are there? What tools can you use to determine this?

5. What is the length of the run queue? What are the load averages? What tools can you use to determine this?


6. How many system processes are running? What tools can you use to determine this?

NOTE: A system process is defined as a process whose data space is the kernel's data space (such as swapper, vhand, statdaemon, unhashdaemon, and supsched). ps reports their size as zero.

There are three ways this can be determined. If you get stuck on this question, move on. Don't spend more than a few minutes trying to answer this question.

7. What percentage of time is the CPU spending in different states? What tools can you use to determine this?

8. What is the size of memory? What is the size of free memory? What tools can you use to determine this?

9. What is the size of the swap area(s)? What is the percentage of swap utilization? What tools can you use to determine this?


10. What is the size of the kernel's in-core inode table? How much of the inode table is utilized? What tools can you use to determine this?

11. Are there any CPU-bound processes running (processes using a lot of CPU)? If so, what is the name of the process? What steps did you take to determine this?

12. Are there any processes running which are using a lot of memory? (A "lot" is relative, i.e. a large RSS size compared to other processes.) If so, what is the name of the process? What steps did you take to determine this? Is memory utilization changing?

13. Are there any processes running which are doing any disk I/O? If so, what is the name of the process? What steps did you take to determine this? What are the I/O rates of the disk bound processes? What files are open by this (these) process(es)? NOTE: No processes are really doing a lot of physical disk I/O. However, lab_proc3 is doing a LOT of logical I/O.


14. What is the current rate of semaphore or message queue usage? What tools can you use to determine this?

15. Is there any paging or swapping occurring? What tools can you use to determine this?

16. What is the system call rate? What tools can you use to determine this?

17. What is the buffer cache hit ratio? What tools can you use to determine this?

18. What is the tty I/O rate? What tools can you use to determine this?

19. Are there any traps (interrupts) occurring? What tools can you use to determine this?


20. What information can you collect about network traffic? What tools can you use to determine this?

21. What information can be gathered on CPUs in an SMP environment? What tools can you use to determine this?

22. What information can be gathered on Logical Volumes? What tools can you use to determine this?

23. What information can be gathered on Disk I/O? What tools can you use to determine this?

24. Shut down the simulation by entering: # ./KILLIT


Module 3 GlancePlus
Objectives
Upon completion of this module, you will be able to do the following:

    Compare GlancePlus with other performance monitoring/management tools.
    Start up the GlancePlus terminal interface (glance) and graphical user interface (gpm).


3-1. SLIDE: This Is GlancePlus

This is GlancePlus
Features
Motif-based interface that offers exceptional ease of learning and ease of use
State-of-the-art, award-winning online Help system
Rules-based diagnostics that use customizable system performance rules to identify system performance problems and bottlenecks
Alarms that are triggered when customizable system performance thresholds are exceeded
Tailoring of information gathering and display to suit your needs
Integration into OpenView environments

Capabilities
Get detailed views of CPU, disk, and memory resource activity
View disk I/O rates and queue lengths by disk device to determine if your disk loads are well balanced
Monitor virtual memory I/O and paging
Measure NFS activity
And much more ...

Student Notes
GlancePlus is a performance monitoring diagnostic tool. GlancePlus software visually gives you the useful, accurate information you need to pinpoint potential or existing problems involving your system's CPU, memory, disk, or network utilization.

To help you monitor and interpret your system's performance data, GlancePlus software includes a rules-based adviser. Whenever threshold levels for measurements such as CPU utilization or disk I/O rates are exceeded, the adviser notifies you with on-screen alarms. The adviser also applies rules to key performance measurements and symptoms, and then gives you information to help you uncover bottlenecks or other performance problems.


GlancePlus offers a viewpoint into many of the critical resources that need to be measured in the open system environment.

Benefits
Save time and effort managing your system resources
Better understand your computing environment
Satisfy your end users' system performance needs quickly
Leverage a standard interface across vendor platforms

The features in the product yield a performance monitoring diagnostic solution that offers many benefits to the user. GlancePlus offers a tool that will make your analysis activities easier and quicker to perform. This will save you time. The display of various types of information will also allow you to get a better understanding of your own environment. The same GUI on the Motif version is used on all the supported platforms, which provides a leverage point for a standard user interface across several UNIX platforms. Many times, just by cursory use of the product, people will discover certain things about their systems. You do not have to have a performance problem to use GlancePlus. This simple cursory use of the product has let many people gain a better understanding of their systems. This helps out when a problem does exist. Knowing what is normal can help identify what has become abnormal in your environment.


3-2. SLIDE: GlancePlus Pak Overview

GlancePlus Pak Overview


[Slide diagram: On the central management system, PerfView (PerfView Planner, PerfView Monitor, PerfView Analyzer) provides forecasting and capacity planning, central alarm monitoring and event management, and performance analysis and correlation. On each managed node, GlancePlus provides online performance monitoring and diagnosis, and MeasureWare provides performance data collection and alarming across networks, systems, Internet, applications, and databases.]

Student Notes
The view here is from the heights. For our purposes, we will focus our discussion on the capabilities of glance and gpm and the information and reports they can produce from a running HP-UX system. Also understand that GlancePlus may be used in conjunction with MeasureWare/OVPA to enhance and extend its capabilities. Many of you may have purchased glance in the GlancePlus Pak, which includes a license to run glance, gpm and to configure and run the MeasureWare/OVPA Agent (mwa) on your system. The GlancePlus and MeasureWare/OVPA Agent products can be purchased separately or combined in the GlancePlus Pak. The Pak also includes (as of C.03.58.00 June 2002 application release) some event monitoring and graphical configuration components.


The components share a common measurement infrastructure; thus metrics, as well as applications, have similar alarming mechanisms.

GlancePlus Pak
GlancePlus
Interfaces include: /opt/perf/bin/gpm /opt/perf/bin/glance

MeasureWare/OVPA
Interfaces include: /opt/perf/bin/extract /opt/perf/bin/utility

PerfView/OVPM
Interfaces include: /opt/perf/bin/pv

Complete information on the configuration and use of MWA/OVPA and PerfView/OVPM is fully covered in the Hewlett-Packard Education Services' course: PerfView MeasureWare (catalog number B5136).


3-3. SLIDE: gpm and glance

gpm and glance

Student Notes
GlancePlus provides dual user interfaces:

The gpm GUI
    See a history of system activity with multiple-window capability
    Monitor your system while doing other work
    Use alarms, symptoms, and color to assist with monitoring

The glance Character Mode
    Monitor performance remotely over a slow datacom line
    Use when no high-resolution monitor is available
    Creates less load on the system being monitored


Notes on starting the user interfaces: gpm and glance

Starting the GUI:

# gpm [options]

    -nosave       Do not save the current configuration at the next exit
    -rpt          Specify one or more additional report windows
    -sharedclr    Share color scheme with other applications
    -nice         Set the gpm nice value
    Xoptions      Use X-Toolkit options such as -display

Starting the character-based interface:

# glance [options]

    -j interval   Preset the number of seconds between screen refreshes
    -p dest       Specify the continuous print option destination
    -lock         Allows glance to lock itself into memory
    -nice         Set the glance nice value


3-4. SLIDE: glance The Character Mode Interface

glance The Character Mode Interface

Student Notes
With glance you can run on almost any terminal or workstation, over a serial interface and relatively slow data communication links, and with lower resource requirements. The default Process List screen is shown in the above screen capture, and provides general data on system resources and active processes. In addition, the user may drill down to more specific levels of detail in areas of CPU, memory, disk I/O, network, NFS system calls, swap, and system table screens. Specific details on a per-process level are also available through the individual process screens. For your convenience, the next two pages contain a hot key quick reference guide for the glance character mode interface.


Glance Hot Key Quick Reference

Top Level Screen Hot Keys

Hot Key   Screen Displayed/Description
a         CPU By Processor
c         CPU Report
d         Disk Report
g         Process List
i         I/O By File System
l         Network By Interface
m         Memory Report
n         NFS By System
t         System Tables Report
u         I/O By Disk
v         I/O By Logical Volume
w         Swap Space
A         Application List
B         Global Waits
D         DCE Global Activity
G         Process Threads
H         Alarm History
I         Thread Resources
J         Thread Wait
K         DCE Process List
N         NFS Global Activity
P         PRM Group List
T         Transaction Tracker
Y         Global System Calls
Z         Global Threads
?         Commands Menu


Secondary Level Screen Hot Keys


Hot Key   Screen Displayed/Description
S         Select a NFS system/Disk/Application/Trans/Thread
s         Select a single process
F         Process Open Files
L         Process System Calls
M         Process Memory Regions
R         Process Resources
W         Process Wait States

Miscellaneous Screen Hot Keys


Hot Key   Screen Displayed/Description
b         Scroll page backward
f         Scroll page forward
h         Online HELP
j         Adjust refresh interval
o         Adjust process threshold
p         Print toggle (start|stop auto-printing)
e/q       Quit GlancePlus
r         Refresh the current screen
y         Renice a process
z         Reset statistics to zero
>         Display next logical screen
<         Display previous logical screen
!         Invoke a shell


3-5. SLIDE: Looking at a glance Screen

Looking at a glance Screen

Student Notes
Above is an example of an easy and common performance problem: a runaway looping process. Why is the global CPU utilization < 100%, although the sum of the individual process CPU utilizations is > 100%? Hint: Is this a UP or MP system?

Also note that slashes (/) are used in glance reports to separate current metric values from cumulative averages.

NOTE: For the record, there were two CPUs on this system.


On a three-way multiprocessor system with two processes in the same application looping, each process can use nearly 100% of each of 2 CPUs. Over a 10-second interval, each uses nearly 10 seconds of CPU time, so the application used nearly 20 seconds of CPU time in 10 seconds of elapsed time. Process CPU utilization is 100% for each of the 2 looping processes, but global CPU utilization would be 66%. On HP-UX 11.0, processes can have multiple threads, each of which can consume CPU time independently of the others. On a four-way MP system, with one process that has three threads looping, the process as a total uses 300% of the CPU. The application and global CPU utilization would report the CPU utilization at 75%.
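Stated as a formula (our restatement of the normalization just described, not the guide's own notation):

    Global CPU % = (CPU seconds consumed by all threads) / (interval length x number of CPUs) x 100

For the three-way example: (10 + 10) / (10 x 3) x 100 is about 66%; for the four-way example: 30 / (10 x 4) x 100 = 75%.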


3-6. SLIDE: gpm The Graphical User Interface

gpm The Graphical User Interface

Student Notes
gpm presents the same metrics as character-mode glance in graphical form. Significant global metrics, as well as bottleneck adviser symptom status and alarms, are shown in the main window. The process list, as well as other reports, is available via menu selections. The process list is very customizable (and customizations are preserved) with filters, sorting, highlights, chosen metrics, and column rearrangement. The online User's Guide is very useful. The ? button on every window is a shortcut into the on-item help, which is especially useful for metric definitions.


This is another screen shot of the gpm interface.

Note the icon reflecting an adviser alarm.


3-7. SLIDE: Process Information

Process Information

Process Information
    Detailed data on each active process
        CPU data
        Disk I/O data
        Memory use
        Wait reasons
        Open files

Process Features
    Access via Main Reports selection
        Process List
    Each process has:
        Process Resources
        Open Files

Student Notes
The Process Information screen in gpm presents the user with detailed information on each active process (including CPU utilization, disk I/O data, memory usage, wait state reasons, open() file information, and so on). This screen also allows the user to select a specific process and "drill down" to greater detail via the Reports selection menu.

Resource Diagnostic Monitoring


GlancePlus provides an abundant set of performance metrics to help analyze the current system. Careful thought and consideration have been given to ensure that the proper metrics are displayed. The product with its Motif GUI offers a way to efficiently display performance information, without overloading the customer with screen after screen of detailed data.


Customizable GUI
GlancePlus uses the power of Motif and its industry-leading approach to display technology, to provide the user with a powerful graphical user interface that can be customized to fit your needs. Fonts, color, window size and more are configuration options. Additional configuration choices are available in "list" windows to allow easy manipulation of column tabular data for display and sort uses. The gpm Process List and GlancePlus - Main screen provide a pull-down menu to access the numerous, detailed Report screens. These reports allow a logical approach to the extensive amount of system resources and process specific data.

Resource History Window:
    CPU Info
    Memory Info
    Disk Info
    Network Info
    System Info
    Global Info
    Swap Space
    Wait States
    Transaction Tracking
    Application List
    PRM Group List
    Process List
    Thread List

Next Level contains additional graphs and tables


3-8. SLIDE: Adviser Components

Adviser Components

Adviser Windows
    Symptom History
    Symptom Status/Snapshot
    Alarm History
    Adviser Syntax

Button Label Colors
    Alarm Button for Alarm Statements
    Graph Buttons for Symptom Statements
    Icon Border Color (in OpenView) Changes to Red or Yellow on Alarms

Student Notes
GlancePlus supports performance alarms and a rules-based adviser to help automate the interpretation of performance data. The alarm rules can be customized by the user to reflect local system characteristics.

Note: Both interfaces will report alarms, and the same syntax is used for alarms in glance and gpm. Alarms are configured through the /var/opt/perf/adviser.syntax file.


3-9. SLIDE: adviser Bottleneck Syntax Example

adviser Bottleneck Syntax Example


# The following symptoms are used by the default Alarm Window
# Bottleneck alarms.  They are re-evaluated every interval and the
# probabilities are summed.  These summed probabilities are checked
# by the bottleneck alarms.  The buttons on the gpm main window
# will turn yellow when a probability exceeds 50% for an interval,
# and red when a probability exceeds 90% for an interval.
# You may edit these rules to suit your environment:

symptom CPU_Bottleneck type=CPU
rule GBL_CPU_TOTAL_UTIL > 75 prob 25
rule GBL_CPU_TOTAL_UTIL > 85 prob 25
rule GBL_CPU_TOTAL_UTIL > 90 prob 25
rule GBL_PRI_QUEUE      >  3 prob 25

alarm CPU_Bottleneck > 50 for 2 minutes
  start
    if CPU_Bottleneck > 90 then
      red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
    else
      yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  repeat every 10 minutes
    if CPU_Bottleneck > 90 then
      red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
    else
      yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  end reset alert "End of CPU Bottleneck Alert"

Student Notes
The bottleneck alarms are a little complex. The CPU bottleneck symptom definition and corresponding alarm is shown. Just because a resource is fully utilized doesn't mean that it is a bottleneck. It is only a bottleneck if there is activity that is hindered waiting for that resource. Therefore, utilization alone is not a good bottleneck indicator. Both utilization and queue lengths are combined to define the symptom probability. Some of the key metrics for performance analysis are the ones we use in the default syntax to define bottleneck alarms.
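To make the arithmetic concrete (the numbers here are ours, not from the guide): if an interval ends with GBL_CPU_TOTAL_UTIL at 92% and GBL_PRI_QUEUE at 4, all four rules above fire, and the CPU_Bottleneck symptom probability is 25 + 25 + 25 + 25 = 100%. At 80% utilization with a queue of 2, only the first rule fires, so the probability is just 25%. The alarm then triggers only once the summed probability stays above 50% for two minutes.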


3-10. SLIDE: The parm File

The parm File


The application = keyword and its associated parameters define the logical groupings used to define each application on the machine. The recognized parameters are:

    application =
    user =
    file =
    priority =
    group =

Examples:

    application=Real Time
    priority=0-127

    application=Prog Dev Group 1
    file=vi,xdb,abb,ld,lint
    user=bill,debbie

    application=Prog Dev Group 2
    file=vi,xdb,abb,ld,lint
    user=ted,rebecc,test*

    application=Compilers
    file=cc,ccom,pc,pascomp

parm file application definitions are used by both GlancePlus and MeasureWare. A .parm in a user's $HOME directory will override the system parm file.

Student Notes
By now you are starting to see the range and scope of the performance metric data that glance and gpm display. While this is invaluable when it comes to understanding the behavior of a single process, many times what we really need is to evaluate and baseline the performance of an entire application suite. This could be achieved by adding up the individual metrics of all processes within the application suite, but that would be a daunting task for all but the simplest of applications.

Through the use of the configuration file /var/opt/perf/parm, glance and gpm can collect all the metrics from the individual processes within an application suite and present the information in a concise manner for your review.

One challenge is the definition of what constitutes an application. To address this issue, the parm file has several different methods for describing which processes belong to which application definition. Application member processes can be defined by their UID, the program file from which they were exec()'d, the priority at which they execute, their GID, or any combination of the above. This provides a very versatile framework for application profiling.


NOTE:

glance and gpm share the same application definitions (via the parm configuration file) as mwa.

# /var/opt/perf/parm for host system garat
id = garat

# Parameters for what data classes scopeux will log:
log global application process dev=disk,lvm transaction

# Parameters to control maximum size of scopeux logfiles:
size global=10, application=5, process=2, device=1, transaction=1.5

# Thresholds which determine what process data scopeux will log:
threshold cpu = 1, disk = 1, nonew, nokilled

# Web server:
application = WWW
user = www or file = httpd

# Untrustworthy users:
application = HighRisk
user = fred,barney,root

The order in which applications are defined is very important. Once a process meets the definition of an application, its data will be contributed to that application's metrics. Care must be taken to assure that ambiguity is avoided in the definition of applications.
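As a concrete reading of the example file above: a process executing httpd under user root matches the WWW definition first (through file = httpd), so its activity is charged to WWW, and it never reaches the HighRisk definition even though root is listed there. Swapping the order of the two definitions would move that same process into HighRisk.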


3-11. SLIDE: GlancePlus Data Flow

GlancePlus Data Flow


[Slide diagram: glance (terminal display and adviser output) and gpm (Motif display) both read the adviser definitions and the parm file (application definitions), and obtain their metrics from shared memory maintained by the midaemon, which in turn collects data from the HP-UX kernel through the KI interface.]

Student Notes
Without going into a lot of detail, note that both interfaces share a common instrumentation source and common application definitions. Instrumentation comes partly from interfaces also accessed by standard UNIX utilities such as vmstat, and partly from special HP-UX KI trace-based instrumentation. There is no generally available API to these interfaces. They are written specifically for use by GlancePlus and MeasureWare/OVPA.


Significant Directories
/opt/perf                Product files from installation media
/opt/perf/bin            Executables
/opt/perf/ReleaseNotes   Release Notes
/opt/perf/examples       Supplementary configuration examples
/opt/perf/paperdocs      Electronic versions of documentation
/var/opt/perf            Product and configuration files created during and after installation

Always check ReleaseNotes for version-specific information. (New for C.02.30 and later releases: example configuration files.) Config files come from /opt/perf/newconfig if they don't already exist under /var/opt/perf. Compare the new default parm file with that on your system if you are updating from a previous release. The directory /var/opt/perf contains the status and data files.


3-12. SLIDE: Key GlancePlus Usage Tips

Key GlancePlus Usage Tips


Use it for "What's going on right now."
The gpm online help is very useful, especially on-item help.
Drill down from higher-level reports to more detailed resource reports.
Understand what the adviser is telling you.
Sort, filter, and choose metrics in gpm, especially in the Process List.
In character-mode glance use:

    the ? screen to navigate
    h for help
    the o screen for setting thresholds and process list sorting

Edit the adviser alarms to be right for you.
Adjust the update interval to control CPU overhead.
Process details including thread lists, wait states, memory regions, open files, and system call reports can be used to impress your programming staff! 8^)

Student Notes


3-13. SLIDE: Global, Application, and Process Data

Global, Application, and Process Data


Global metrics reflect system-wide activity (sum of all applications).
Process metrics reflect specific per-process (including thread) activity.
Application metrics sum activity for a set of processes. They keep track of activity for all processes, however short-lived, even if they are not reported individually.
Glance updates all metric values at the same time.
MeasureWare summarizes Global, Application, and other class data over 5-minute intervals and summarizes Process data over 1-minute intervals.
Multiprocessor effects:
    Global and Application CPU percentages reflect normalization over the number of processors (percentage of availability for the entire system).
    Process and thread-level CPU percentages are not normalized by the number of processors.

Student Notes
It is important to understand the interrelationships among metric classes.


3-14. SLIDE: Can't Solve What's Not a Problem

Can't Solve What's Not a Problem!


A looping process by itself is not a problem.
Know what's normal for your environment.
Keep historical performance data for reference.
Measure response times.
Use the tools to find out what is affecting performance.
Isolate bottlenecks and address them when there is a problem.
When tuning, make only one change at a time and then measure its effect.
Document everything you do!
Optimize your time resource: don't fix what isn't broken; sometimes more hardware is the cheapest answer; set yourself up to react quicker next time.

Student Notes
One of the hardest skills is to determine what to measure and how to interpret its significance. After all, if the user's response time is satisfactory, then oftentimes there is no problem, even if an operational metric is higher than normal.


3-15. SLIDE: Metrics: "No Answers without Data"

Metrics: No Answers without Data


Rate and utilization metrics are more useful than counts and times, because they are independent of the collection interval.
Cumulative metrics measure over the total duration of collection.
Most metrics are broken down into subsets by type. Work from the top down.
Blocked states reflect individual process or thread wait reasons. Global queue metrics are derived from process blocked states.
CPU is a symmetric resource. The scheduler will balance load on a multiprocessor, whereas disk and network interface activity depend on where data is located.
Memory utilization is not as important as paging activity and buffer cache sizing.

Student Notes
CPU utilization and disk I/O rates compare well across different summarization intervals, whereas CPU time and I/O counts are always larger when the collection interval grows.

Examples of breakdowns: the global disk I/O rate is a sum of the BYDSK_ metrics; each class in turn breaks down activity between reads and writes, and between file system, raw, and system access. For disk bottlenecks, it is often useful to correlate between the DSK, FS, and LV classes.

Memory utilization is frequently nearly 100% with a dynamic buffer cache. If page-outs occur, or in raw disk access environments, shrink the buffer cache to avoid paging.

Programmers frequently don't know they can view specific system-call metrics, as well as memory region and open file information, on a per-process basis.
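A quick illustration of the first point (numbers ours): a workload running at a steady 100 disk I/Os per second shows up as 500 I/Os in a 5-second glance interval but as 30,000 I/Os in a 5-minute MeasureWare summary. The count grows with the interval while the rate stays at 100 I/Os per second, which is why rates and utilizations are the better basis for comparison.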


3-16. SLIDE: Summary

Summary
Don't try to understand all the capabilities and extensions of the tools, just the ones of most use to you.
Start with developing an understanding of what is normal on your systems.
Refine and develop alarms customized for your environment.
Work from examples in documentation, gpm online help, config files, and the example directories.

Student Notes
Remember that performance tuning is an art, and the following two rules apply to most engagements:

Rule #1:  When answering a question about computer system performance, the initial answer is always, "It depends."

Rule #2:  Performance tuning always involves a trade-off.

Suggested reading: HP-UX Tuning and Performance by Robert F. Sauers and Peter S. Weygant, available through the Hewlett-Packard Professional Books, Prentice Hall Press (ISBN 0-13-102716-6)


3-17. SLIDE: HP GlancePlus Guided Tour

HP GlancePlus Guided Tour

[Slide graphic: gpm guided tour screens showing memory, process, and CPU data.]

Topics
    Main Window
    CPU Bottlenecks
    Memory Bottlenecks
    Configuration Information
    Alarm and Symptoms

Student Notes
To take the guided tour of GlancePlus, run the gpm GUI and select Help on the menu bar. Next, select the Guided Tour option. This will introduce you to the product. It features captured windows of the actual product, with annotations to help point out the important features of certain screens or windows. Quick Tip: gpm provides an excellent online Help system. Click the right mouse button for the On-Item Help feature. For help in glance, press the h key.


3-18. LAB: gpm and glance Walk-Through


Directions
The following lab is intended to familiarize the student with gpm and glance. To achieve this result, the lab will walk the student through a number of windows and tasks in both the ASCII and X-Windows versions of gpm and glance.

The Graphical Version GlancePlus


1. Log in.

If you have not already done so, please log into the system with the user name and password provided by your instructor.

2. Start GlancePlus.

From a terminal window, invoke GlancePlus by entering gpm:

# gpm

In a few seconds gpm will come up. The first thing will be a license notification informing you that you are starting a trial version of GlancePlus, along with ordering and technical support information.

On the gpm Main screen, you will see four graphs for CPU, Memory, Disk, and Networking. By default, the graphs are in the resource history format. This means that for each interval (configurable) there will be a data point on the graph, up to the maximum number of intervals (also configurable).

3. Interval Customizations.

Click on Configure in the menu bar, and select Measurement. Set the sample interval to 10 seconds and the number of graph points to 50. This will allow you to see up to 500 seconds of system history. Click on OK.

NOTE: This setting will be saved for you in your home directory in a file called $HOME/.gpmhp-system_name. This means that all GlancePlus users will have their customizations saved.

Start a program from another window:

# cd /home/h4262/cpu/lab1
# ./RUN &

4. Main Window.

Below each graph within the GlancePlus Main window, you will find a button. These buttons display the status color of adviser symptoms. This is a powerful feature of GlancePlus that we will investigate later. Clicking on one of these buttons displays details of that particular graph.

To view the adviser symptoms from the main window, select:

Adviser -> Edit Adviser Syntax

This will display the definitions of the current symptoms being monitored by GlancePlus. Close the Edit Adviser Syntax window.


View CPU details:

Click the CPU button. To view a detailed report regarding the CPU, select:

Reports -> CPU Report

Select:

Reports -> CPU by Processor

This is a useful report, even on a single-processor system.

5. Online Help.

One method for accessing online help within GlancePlus is to click on the question mark (?) button. The cursor changes to a ?. Click on the column heading NNice CPU %. This opens a new window describing the NNice CPU % column. View descriptions for other columns, including the SysCall CPU %. When finished viewing online help for columns, click on the question mark one more time. This returns the cursor to normal.

6. Alarms and Symptoms.

A symptom is some characteristic of a performance problem. GlancePlus comes with predefined symptoms, or the user can define his own. An alarm is simply a notification that a symptom has been detected. From the main window, select:

Adviser -> Symptom History

For each defined symptom, a history of that particular symptom is displayed graphically. The duration is dependent on the glance history buffers, which are user-definable. Close the window.

Click on the ALARM button in the main window. This displays a history of all the alarms that have occurred since GlancePlus was started. Up to 250 alarms can be displayed. Close the window.

7. Process Details.

Close all windows except for the main window. Select:

Reports -> Process List

This shows the interesting processes on the system (interesting in terms of size and/or activity). To customize this listing, select:

Configure -> Choose Metrics

This will display an astonishing number of metrics, which can be chosen for display in this report. This is also a quick way to get an overview of all of the process-related


metrics available in GlancePlus. Note that the familiar ? button is also available from this window.

Use the scroll bar to find the metric PROC_NICE_PRI. Select this metric and click on OK. Close this window by clicking on OK.

8. Customizations.

Most display windows can be customized to sort on any metric, and to arrange the metrics in any user-defined order. To define the sort fields, select:

Configure -> Sort Fields

The sort order is determined by the order of the columns. Placing a particular metric into column one makes it the first sort field. If multiple entries have the same value within this field, then the second column is used to determine the order between those entries. If further sorting is needed, then the third column is used, and so forth down the line.

To sort on Cumulative CPU Percentage, click on the column heading CPU % Cum. The cursor will become a crosshair. Scroll the window back to column one, and click on column one. This makes CPU % Cum the first sort field. Arrange the sort order so that CPU % is followed by CPU % Cum. Click Done when finished. This sort order is automatically saved so that the next time processes are viewed, this will remain the sort order.

In a similar fashion, the order of the columns can also be arranged. To define the column order, select:

Configure -> Arrange Columns

Select a column to be moved (for example, CPU % Cum). The cursor will become a crosshair. Scroll the window to the location where the column is to be inserted, and click on the column where it should go. Arrange the first four columns to be in the following order: Process Name, CPU %, CPU % Cum, Res Mem. Click Done when finished. This display order is automatically saved so that the next time processes are viewed, this will remain the display order.

9. More Customizations.

It is possible to modify the definition of interesting processes by selecting:

Configure -> Filters

An easy way to limit the processes shown is to AND all the conditions (the default is to OR the conditions). In the Configure Filters window, select AND logic, then click on OK. A much smaller list of processes should be displayed.

Return to the Configure Filters window. Modify the filter definition for CPU % Cum as follows:

Change Enable Filter to ON
Change Filter Relation to >=

http://education.hp.com

H4262S C.00 3-31 2004 Hewlett-Packard Development Company, L.P.

Module 3 GlancePlus

Change Filter Value to 3.0 Change Change Change Change Enable Highlight to ON Highlight Relation to >= Highlight Value to 3.0 Highlight Color to any LOUD color

Reset the logic condition back to OR, then click OK. Verify the filter took effect.

10. Administrative Capabilities.

There are two administrative capabilities within GlancePlus. If working as root, processes in the Process List screen can be killed or reniced. In the Process List window, select the proc8 process. To access the Admin tools, select:

Admin -> Renice

Use the slider to set the new nice value for this process to be +19, then click OK. Note the impact on this process.

Now, select the proc8 process again. Select:

Admin -> Kill

Click OK, and note the process is no longer present.

11. Process Details.

Detailed metrics can be obtained on a per-process basis. To view process details, go to the Process List window and double-click on any process. Much of the detail in this report will be explained in the Process Management section of the course. The Reports menu provides much valuable information about the process, including the Files Open and the System Calls being generated. After surveying the information available through this window, close it and return to the Main window.

There are many other features available in GlancePlus. There are close to 1000 metrics available with it. Notice that when you iconify the GlancePlus Main window, all of the other windows are closed and the GlancePlus active icon is displayed. Alarms and histograms are displayed in this active icon. Exploding this icon will again open up all previously open windows.

12. Exit GlancePlus.

From the Main window, select:

File -> Exit GlancePlus


13. Glance, the ASCII version.

From a terminal window that has not been resized, type glance.

NOTE: Never run glance or gpm in the background.

If you are accessing the ASCII version of glance from an X terminal window, make sure you start up an hpterm window to enable the full glance softkeys. Do not resize the window, as ASCII glance expects a standard terminal size. You can make the hpterm window longer, but never wider. However, making it longer is frequently of no use.

# hpterm &

In the new window:

# glance

Display a list of keyboard functions by typing ?. This brings up a help screen showing all of the command keystrokes that can be used from the ASCII version of GlancePlus. Explore these to familiarize yourself with the interface.

14. Display Main Process Screen.

Type g to go to the Main Process Screen. This lists all interesting processes on the system. Retrieve online help related to this window by typing h, which brings up a help menu. Select:

Current Screen Metrics

Use the cursor keys to select CPU Util.

NOTE: This metric has two values. Use the online help to distinguish the difference between the two values. Use the space bar or the Page Down key to toggle to the next page of help.

Exit the online help CPU Util description by typing e. Exit the Screen Summary topics by typing e.

From the main Help menu, select:

Screen Summaries

Use the cursor keys to select Global Bars. From this help description, explain what R, S, U, N, and A mean in the CPU Util bar.

Exit the online help Global Bar description by typing e. Exit the Screen Summary topics by typing e. Exit the main Help menu by typing e. At any time, you can exit help completely, no matter how deep you are, by pressing the F8 key.


15. Modify Interesting Process Definition.

From the main Process List screen (type g), view the interesting processes. What makes these processes interesting?

Type o and select 1 (one) to view the process threshold screen. Cursor down to the Sort Key field, and indicate that the processes should be sorted by CPU usage. Before confirming that the other options are correct, note that any CPU usage (greater than zero) or any disk I/Os will cause a process to be considered interesting.

Run the KILLIT command to stop all lab loads.

16. Glance Reports.

This is the free-form part of the lab. Spend the rest of your lab time going through the various glance screens and GlancePlus windows. Use the table below to produce the different performance reports. Feel free to use this time to ask the instructor "How Do I . . .?" types of questions.
Glance COMMAND   FUNCTION                      GlancePlus (gpm) "REPORT"
*a               All CPUs Performance Stats    CPU by Processor
 b               Back one screen
*c               CPU Utilization Stats         CPU Report
*d               Disk I/O Stats                Disk Report
 e               Exit
 f               Forward one screen
*g               Global Process Stats          Process List
 h               Help
*i               I/O by Filesystem             I/O by Filesystem
 j               Change update interval
*l               Lan Stats                     Network by LAN
*m               Memory Stats                  Memory Report
*n               NFS Stats                     NFS Report
 o               Change Threshold Options
 p               Print current screen
 q               Quit
 r               Redraw screen
*s               Single process information    Process List, double-click process
*t               OS Table Utilization          System Table Report
*u               Disk Queue Length             Disk Report, double-click disk
*v               Logical Volume Mgr Stats      I/O by Logical Volume
*w               Swap Stats                    Swap Detail
 y               Renice process                Administrative Capabilities
 z               Zero all Stats
 !               Shell escape
 ?               Help with options
 <CR>            Update screen data


Module 4 Process Management

Objectives
Upon completion of this module, you will be able to do the following:

- Describe the components of a process.
- Describe how a process executes, and identify its process states.
- Describe the CPU scheduler.
- Describe a context switch and the circumstances under which context switching occurs.
- Describe, in general, the HP-UX priority queues.


41. SLIDE: The HP-UX Operating System

The HP-UX Operating System

[Slide diagram: the layers of HP-UX. At the user level, processes reach the kernel through the gateway of the system call interface. At the kernel level sit the file subsystem (with the buffer cache and the character/block device drivers of the I/O subsystem) and the process control subsystem (scheduler, memory management, and interprocess communication), above the hardware control interface. Below that is the hardware level with the hardware devices.]

Student Notes
The main purpose of an operating system is to provide an environment where processes can execute. This includes scheduling processes for time on the CPU, managing the memory which is assigned to processes, allowing processes to read data from disk, and many other things. When processes execute within the HP-UX operating system, there are two modes that they can be in: User mode and Kernel (system) mode.

User Mode and Kernel Mode

User mode refers to instructions that do not require the assistance of the kernel program in order to execute. These include numeric calculations, string manipulations, looping constructs, and many others. In general, it is good when a process can spend the majority of its time in user mode, because it implies the CPU is executing instructions that are related to the process, as opposed to instructions related to the kernel. Kernel mode refers to time spent in the kernel executing instructions on behalf of the process. Processes access the kernel through system calls, often referred to as the System Call Interface. Examples include performing I/O, creating new processes, and expanding data space.

Kernel mode is also used for background activities, performed by the kernel on behalf of processes. Examples include page faulting the program's text or data in from disk, initializing and growing a process's data space, paging a portion of the process to swap space, performing file system reads and writes, and many other things. In general, when a process spends too much time in kernel mode, it is considered bad for performance. This is because too much time (overhead) is being spent to manage the environment in which the process executes, and not enough time on executing the actual process itself (which is user mode).

Performance Tools
Almost all performance tools that track CPU utilization distinguish between time spent by the CPU in user mode and time spent in kernel mode. On a healthy system with plenty of memory resources, a typical ratio of user mode to kernel mode time is 4:1. This means the process spends roughly 75-80% of its execution in user mode and 20-25% in kernel mode. Another general rule of thumb is that kernel mode CPU time should not exceed 50%. When this happens, it generally means too much time is being spent managing the system (i.e. memory and swap space management, context switching), and not enough is being spent executing process code.
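A quick way to observe this split on a live system is with the standard time and sar commands (./myprog below is only a placeholder):

# time ./myprog     # reports real, user, and sys time for one command
# sar -u 5 3        # reports %usr, %sys, %wio, and %idle every 5 seconds, 3 times

If the sys figures regularly rival the usr figures, the 4:1 guideline above is not being met, and the kernel mode activity deserves a closer look.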


42. SLIDE: Virtual Address Process Space (PA-RISC)

Process Virtual Address Space (PA-RISC)

[Slide diagram: the PA-RISC process virtual address space. In the 32-bit layout, four quadrants of 1 GB each hold the text, the private data, and two quadrants of shared objects. In the 64-bit layout, the same four quadrants grow to 4 TB each.]

Student Notes
Each process views itself as starting at address 0 and ending at the maximum address addressable by 32 or 64 bits. This address space is known as the Virtual Address Space for a process. The virtual address space is a logical addressing scheme used internally by the process to reference related instructions and data variables. The physical memory address locations cannot be used, because a program does not know where in physical memory it will be loaded. In fact, a program could be loaded at different memory locations each time it executes.

The Four Quadrants (32-bit)

Each process segments its virtual address space into four quadrants, with each quadrant containing 1 GB of address space. The first quadrant is reserved for the program's instructions (also known as text). Though an address range of 1 GB is reserved for text, very rarely does the program need all these addresses. Most of the time, only a fraction (often less than 10%) of this space is needed to address the program's text.


The second quadrant holds the program's private data variables. Again, 1 GB of address space is reserved for data variables, and in general only a fraction of this space is used. Since this quadrant is limited to 1 GB of address space, a maximum global data size of approximately 900 MB is imposed. (In HP-UX, changes were made to allow the global data to use addresses in other quadrants for private data, thereby increasing its maximum size to 3.9 GB.) The third and fourth quadrants are usually used to address shared memory segments, shared text segments, shared memory-mapped files, and other shared structures, such as the System Call Interface.

64-Bit HP-UX 11.00 Update

With the introduction of HP-UX 11.00 and its 64-bit operating system, the virtual address space changes dramatically. A 32-bit process running under the 64-bit kernel is given the same space allocations as under a 32-bit kernel. With a 64-bit process, the addressable space increases to 16 terabytes. This limits each quadrant to 4 TB (for a total of 16 TB of virtual address space), but the capability exists to increase this address space, if necessary, in future releases. Notice also that the locations of the various components of the process have been shifted among the quadrants.


43. SLIDE: Virtual Address Process Space (IA-64)

Process Virtual Address Space (IA-64)

[Slide diagram: the IA-64 process virtual address space. In the 32-bit layout, the first four octants use only 1 GB each for text, data, and shared objects. In the 64-bit layout, eight octants of 2 EB each hold shared objects, text, private data, and more shared objects. In both layouts, the last octant holds the kernel.]

Student Notes
There is no 32-bit kernel running on the IA-64 processor. The virtual address space is always 16 EB in size, although it may not all be used or allocated while a particular process is running. The space is divided into eight equal-sized octants; each octant is 2 EB in size. When executing a PA-RISC 32-bit process, the first four octants are set up just like the PA-RISC 32-bit virtual address space, using only 1 GB out of each octant to simulate the four original quadrants. The last octant holds the kernel and all of its related structures.

64-Bit Processes
With a 64-bit process, the virtual address space changes dramatically. The first two octants are the equivalent of the first PA-RISC quadrant and hold shared objects. The third octant holds the text. The fourth and fifth octants are reserved for any process private data, and the sixth and seventh octants contain more shared objects. Only the last octant is laid out exactly the same for both 32-bit and 64-bit processes.


44. SLIDE: Physical Process Components

Physical Process Components

[Slide diagram: a process's physical components in memory. The proc table entry in the kernel's OS tables references the process's text, data, stack, and uarea, along with shared components such as memory-mapped files (MemMap), shared library text (LibTxt), and shared memory (ShMem).]

Student Notes
Each process executing in memory contains an entry in the kernel's process table. The entry in the proc table then references the locations of the program's four main components: text, data, stack, and uarea. The text segment contains the program's executable code. The data segment contains the program's global data structures and variables. The stack area contains the program's local data structures and variables. The uarea is an extension of the proc table entry. In a multithreaded process, each thread will have its own uarea. Other components that may or may not be associated with a process are shared libraries, shared memory segments, and memory-mapped files.

The text and initialized global data segments of the process are taken from the executed program file on disk during process startup. In an attempt to save on startup time, the uninitialized global data segments and the stack area are zero filled, and no pages of a program are loaded at startup. Copying the entire text and data into memory would generate long startup latency. This latency problem is avoided in HP-UX by demand paging the program's text and data as needed.


Using this demand paging approach, the program is loaded into memory in smaller pieces (pages) on an as-needed basis. On HP-UX 10.X, a page is fixed at 4 KB. On HP-UX 11.00, the page size is variable (meaning the initial program could page in sizes greater than 4 KB).


45. SLIDE: The Life Cycle of a Process

The Life Cycle of a Process

[Slide diagram: the life cycle of a process. At start, text is paged in from the file system on disk into main memory and swap space is reserved. Thereafter the process cycles between waiting on a resource (the stop sign), sitting in the CPU run queue (the triangle), and executing on the CPU (the rectangle), until it ends.]

Student Notes
The life cycle of a process can be generalized by the above slide. When a process is born (or starts), its text must be paged in from the file system on disk (on demand) in order to be executed. (Remember, the operating system only pages in a text page when it determines that a process needs that particular page in order to execute.) In addition, space must be reserved on the swap partition for the process in the event it needs to page portions of the data area out to swap.

Once the swap space is reserved and the process is initialized, the process can begin executing on the CPU. As the process executes, it often performs actions that require it to wait. These actions include reading data from the disk or the network, waiting for a user to enter a response at a terminal window, or waiting on a shared resource (like semaphores). Once the item the process is waiting on becomes available, the process puts itself in the CPU run queue so it can begin executing again.

This is the standard cycle that a process goes through: WAIT for a resource, enter the CPU run queue when the resource is available, execute on the CPU. The waiting on a resource is symbolized in the slide by the octagon (or stop sign). The entering of the CPU run queue is symbolized by the triangle, and the execution on the CPU is indicated by the CPU in the rectangle.


An advantage of the glance performance tool is that it displays, on a per-process basis or system-wide, the various reasons why a process is blocked and waiting rather than running on the CPU.


46. SLIDE: Process States

Process States

[Slide diagram: process state transitions. A completing fork moves a process from IDLE (SIDL) to RUNNABLE in memory (SRUN). A context switch moves it between runnable and running; while running it alternates between KERNEL MODE (SRUN) and USER MODE (SRUN). Waiting on an event moves it to SLEEP, in memory or on the swap device (SSLEEP), and a wakeup on event completion returns it to RUNNABLE, in memory or on the swap device (SRUN). A debugger or job control stop moves it to STOP (SSTOP), and exit moves it to ZOMBIE (SZOMB).]

Student Notes
The process table entry contains the process state. This state is logically divided into several categories of information covering scheduling, identification, memory management, synchronization, and resource accounting.

There are five major process states:

SRUN     The process is running or is runnable, in kernel mode or user mode, in memory or on the swap device.
SSLEEP   The process is waiting for an event, in memory or on the swap device.
SIDL     The process is being set up via fork.
SZOMB    The process has released all system resources except for the process table entry. This is the final process state.
SSTOP    The process has been stopped by job control or by process tracing and is waiting to continue.


Most processes, except the currently executing process, are placed in one of three queues within the process table: a run queue, a sleep queue, or a deactivation queue. Processes that are in a runnable state (ready for CPU) are placed on a run queue, processes that are blocked awaiting an event are located on a sleep queue, and processes that are temporarily out of the scheduling mix are placed on a deactivation queue. Deactivated processes typically occur only during a system memory management crisis.

Processes either terminate voluntarily through an exit system call or involuntarily as a result of a signal. In either case, process termination causes a status code to be returned to the parent of the terminating process. This termination status is returned to the parent process using a version of the wait() system call.

Within the kernel, a process terminates by calling the exit() routine. The exit() routine completes the following tasks: cancels any pending timers, releases virtual memory resources, closes open file descriptors, and handles stopped or traced child processes. Next, the process is taken off the list of active processes and is placed on a list of zombie processes, which is finally changed to the nonexistent (no process) state. The exit() routine continues to record the termination status in the proc structure, bundles up the process's accumulated resource usage for accounting purposes, and notifies the deceased process's parent. If a process in SZOMB state is found, the wait() system call will copy the termination status from the deceased process and then reclaim the associated process structure. The process table entry is taken off the zombie list and returned to the freeproc list.

As of HP-UX 10.10, the concept of a thread was introduced into the kernel. Processes became an environment in which one (or more) threads could execute. Each thread was visible to, and manageable by, the kernel separately. When this occurred, processes were in any of the following states:

SINUSE   The process structure is being used to define one or more threads.
SIDL     The process is being set up via fork.
SZOMB    The process has released all system resources except for the process table entry. This is the final process state.

Whereas threads now took on the previous states of the process:

TSRUN    The thread is running or is runnable, in kernel mode or user mode, in memory or on the swap device.
TSSLEEP  The thread is waiting for an event, in memory or on the swap device.
TSIDL    The thread is being set up via fork.
TSZOMB   The thread has released all system resources except for the thread table entry. This is the final thread state.
TSSTOP   The thread has been stopped by job control or by process tracing and is waiting to continue.

The generic UNIX tools have no awareness of threads, so they continue to report process states and all other metrics from the viewpoint of the process. Only the HP-specific tools (such as glance, gpm, PerfView/OVPM, and MeasureWare/OVPA) have the ability to look at individual threads and report their metrics separately from the process. Of course, the vast majority of processes are single-threaded. In those cases, there is no practical difference between the reports of the various tools.
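The generic tools do still show the state itself. For example, ps -el prints a one-letter process state in its S column; as a rough sketch (check ps(1) on your release for the exact letters):

# ps -el | more     # S column: R = running/runnable, S = sleeping, T = stopped,
                    #           Z = zombie, I = intermediate (being created)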


47. SLIDE: CPU Scheduler

CPU Scheduler
[Slide diagram: the CPU scheduler, part of the kernel, handles context switches and interrupts, selecting among the runnable processes in memory (Proc A pri=156, Proc C pri=172, Proc D pri=186, Proc B pri=220) for time on the CPU.]

Student Notes
Once the required data is available in memory, the process waits for the CPU scheduler to assign it CPU time. CPU scheduling forms the basis for the multitasking, multiuser operating system. By switching the CPU between processes that are waiting for other events, such as I/O, the operating system can function more productively.

HP-UX uses a round-robin scheduling mechanism. The CPU lets each process run for a preset maximum amount of time, called a quantum or timeslice (default = 1/10th second), until the process completes or is preempted to let another process run. Of course, the process can always voluntarily surrender the CPU before its timeslice expires when it realizes that it cannot continue. The CPU saves the status of the first process in a context and switches to the next process.

When a process is switched out due to its timeslice expiring, it drops to the bottom of the run queue to wait for its next turn. If it is preempted by a stronger priority process, it is placed back onto the front of the run queue. If it voluntarily gives up the CPU, it goes onto one of the sleep queues until the resource it's waiting for becomes available. When that resource does become available, the process moves to the end of the run queue.


As a multitasking system, HP-UX requires some way of changing from process to process. It does this by interrupting the CPU to shift to the kernel. The clock interrupt handler is the system software that processes clock interrupts. It performs several functions related to CPU usage including gathering system and accounting statistics and signaling a context switch. System performance is affected by how rapidly and efficiently these activities occur.

Terms
CPU scheduler            Schedules processes for CPU usage.
System clock             Maintains the system timing.
Clock interrupt handler  Executes the clock interrupt code and gathers system accounting statistics.
Context switching        Interrupts the currently running process and saves information about the process so that it can begin to run after the interrupt, as if it had never stopped.
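The 10-tick timeslice behind this scheduling is a kernel tunable. As a sketch (kmtune is the 11.x interface; later releases use kctune), it can be queried with:

# kmtune -q timeslice     # current timeslice in clock ticks; 10 ticks = 1/10 second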


48. SLIDE: Context Switching

Context Switching
A context switch occurs when:

- A timeslice expires (a thread accumulates 10 clock ticks) (Forced)
- A preemption occurs (a stronger priority thread is runnable) (Forced)
  - if the stronger thread is real-time, preemption is immediate
  - if the stronger thread is not real-time, preemption occurs at the next convenient time
- A thread becomes non-computable (Voluntary), i.e.:
  - it goes to sleep
  - it is stopped
  - it exits

Student Notes
A context switch is the mechanism by which the kernel stops the execution of one process and begins execution of another. A context switch occurs under the circumstances shown on the slide. There are two types of context switches: forced and voluntary. A forced context switch occurs when the process is forced to give up the CPU before it is ready. These include timeslice expiration or a stronger priority process becoming runnable. A voluntary context switch occurs when the process itself gives up the CPU without using its full timeslice. This happens when the process exits, or puts itself to sleep (waiting on a resource), or puts itself into a stopped state (debugging). The glance tool distinguishes between forced and voluntary context switches on a per process basis.
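The system-wide context switch rate can also be watched with sar, whose pswch/s column reports process switches per second:

# sar -w 5 3     # swapping and process switch activity every 5 seconds, 3 times

A rate far above your baseline suggests that processes are being forced off the CPU, or are blocking, more often than usual.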


49. SLIDE: Priority Queues

Priority Queues

[Slide diagram: the HP-UX priority spectrum from -32 to 255. POSIX real-time priorities (rtsched, -32 through -1) and HP-UX real-time priorities (rtprio, 0 through 127) use queues one priority wide. Time-shared priorities (128 through 255) use queues four priorities wide (128-131, 152-155, 172-175, 176-179, 180-183, ..., 252-255). The markers PSWP (128), PZERO (153), and PUSER (178) divide the system-level priorities, the nonsignalable priorities, and the signalable user-level priorities.]

Student Notes
Every process has a priority associated with it at creation time. These priorities determine the order in which processes execute on the CPU. Processes with stronger priorities always execute before processes with weaker priorities. In UNIX, stronger priorities are represented by smaller numbers and weaker priorities are represented by larger numbers.

HP-UX uses adjustable priorities to schedule its time slicing for general timeshare processes generated by all users (priorities 128-255). By that we mean a process's priority can be adjusted, up or down, by the kernel, according to how favored the process might be. In general, the more a process executes, the less favorably it will be treated by the kernel. However, since HP-UX also supports real-time processing, it must include priority-based scheduling for those processes (priorities 0-127). As of HP-UX 10.X, support is also provided for POSIX real-time processes (priorities -32 through -1). The /usr/include/sys/param.h file contains some extra information on the priorities used in the system.

Each processor in an HP system has its own run queue. Each run queue is further broken down into multiple priority queues, to make it easier for that processor to select the most deserving process to run.


Real-Time Process Priorities

Real-time priority queues are one wide, i.e. each queue represents one priority value. The strongest priority real-time process preempts all weaker priority processes and runs until it sleeps, exits, is preempted by a stronger real-time process, or is timesliced by an equal-priority real-time process. Equal priority real-time processes run in a round-robin fashion.

A process can be made to run with a real-time priority by using the rtprio(1) or rtsched(1) command. The rtsched command can also be used to disable timeslicing for a particular process, by assigning it a different scheduling policy. Because a real-time process will execute at the expense of all time-share processes, make sure that you consider the impact on your users before invoking the command. A CPU-bound, real-time process will halt all other use of the system. A POSIX real-time process (ttisr) runs on HP-UX at priority -32.
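For example, a job could be started at a real-time priority as follows (./batchjob is only a placeholder, and the warnings above apply):

# rtprio 100 ./batchjob &                    # start at HP-UX real-time priority 100
# rtsched -s SCHED_FIFO -p 20 ./batchjob &   # start under the POSIX SCHED_FIFO policy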

Time Share Process Priorities

Timeshare priority queues are four wide, i.e. each priority queue represents four adjacent priority values. For example, the first timeshare priority queue is used by processes with priorities of 128, 129, 130, and 131.

Timeshare processes are grouped into system and user processes. Priorities 128-177 are reserved for runnable system processes and sleeping processes (both system and user), and priorities 178-255 are for runnable user processes.

A nice value is assigned to each timeshare process; the kernel uses this value to help determine how to adjust the process's priority when calculating a new priority for it. Nice values have no effect on real-time processes.


410. SLIDE: Nice Values

Nice Values

[Slide graph: priority (from 177, weakening toward 255) plotted over time for ProcA and ProcB (one with nice 20, the other with nice 39) as they alternate between running and sleeping. Priorities weaken while a process runs and recover while it sleeps, and the process with the more favorable nice value regains priority faster.]

Student Notes
Time shared processes are all initially assigned the priority of the parent when they are spawned. The user can make modifications to how much the kernel favors a process with the nice value. Timeshare processes lose priority as they execute, and regain priority as they wait their turns. The rate at which a process loses priority is linear, but the rate at which it regains priority is exponential. A process's nice value is used as a factor in calculating how fast a process regains priority. The nice value is the only control a user has to give greater or less favor to a time share process. The default nice value is 20. Therefore, to make a process run at a weaker priority, it should be assigned a higher nice value (maximum value 39). The superuser can assign a lower nice value to a process (minimum value 0), effectively giving it a stronger priority.
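For example (./report is only a placeholder):

# nice -n 10 ./report &     # start with nice 30 (20 + 10), so it is less favored
# renice -n -5 <pid>        # as superuser, make an existing process more favored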


411. SLIDE: Parent-Child Process Relationship

Parent-Child Process Relationship

[Slide diagram: a chain of parent and child processes in memory, each tracked in the kernel's OS tables: ksh spawns sam; sam shells out (sh) and runs su; su starts a ksh login shell, which starts glance; glance shells out (sh), from which csh is started.]

Student Notes
One item to keep in mind related to process management is the relationship between parent and child processes. Every process started from a terminal window on the system has a parent process that spawned it. The parent process does not terminate once a child is spawned. Instead, it goes to sleep waiting for the child to terminate.

If a child process does not exit properly, for example if it spawns a new process rather than exiting back to its parent, then the system could end up with many processes sleeping in memory and using proc table entries unnecessarily.

The example in the slide shows a ksh shell that spawns a sam process. Within sam, the system administrator shells out and uses su to become a regular user. Once in the login shell, the user starts glance. From within glance, they shell out, and then decide they would rather be in a csh shell. This string of events causes eight different processes to be started. If the user decides he wants to return to sam by typing sam, would the previous sam process be reactivated, or would a new sam process be spawned? (Answer: A new sam process is spawned.)
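The chain can be traced with ps by matching each child's PPID back to its parent's PID:

# ps -ef | more     # read the PID and PPID columns to see who spawned whom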


412. SLIDE: glance Process List

glance Process List

B3692A GlancePlus B.10.12 14:52:27 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S S N N CPU Util | 22% 29% 51% F Disk Util | 1% 7% 13% Mem Util | 91% 91% 91% S S U U B B Swap Util | 25% 24% 35% U U R R -------------------------------------------------------------------------------PROCESS LIST Users= 11 User CPU Util Cum Disk Thread Process Name PID PPID Pri Name ( 100 max) CPU IO Rate RSS Count -------------------------------------------------------------------------------netscape 16013 12988 154 sohrab 12.9/14.0 64.9 0.0/ 0.6 14.7mb 1 supsched 18 0 100 root 2.9/ 2.1 942.6 0.0/ 0.0 16kb 1 lmx.srv 1219 1121 154 root 1.6/ 0.9 389.4 0.5/ 0.0 2.7mb 1 glance 15726 15396 156 root 0.6/ 0.9 2.0 0.0/ 0.2 4.0mb 1 statdaemon 3 0 128 root 0.6/ 0.7 302.1 0.0/ 0.0 16kb 1 midaemon 1051 1050 50 root 0.4/ 0.4 201.4 0.0/ 0.0 1.3mb 2 ttisr 7 0 -32 root 0.4/ 0.3 121.0 0.0/ 0.0 16kb 1 dtterm 15559 15558 154 roc 0.4/ 0.4 1.6 0.0/ 0.0 6.2mb 1 rep_server 1098 1084 154 root 0.2/ 0.1 23.7 0.0/ 0.0 2.0mb 1 syncer 325 1 154 root 0.2/ 0.0 20.2 0.1/ 0.0 1.0mb 1 xload 13569 13531 154 al 0.2/ 0.0 2.4 0.0/ 0.0 2.6mb 1 Page 1 of 13

Student Notes
The next four slides are designed to illustrate how the management of processes can be monitored through glance. Topics just covered (like kernel versus user CPU time, process components, process wait states, nice values, and process priorities) can all be viewed through glance.

The first global bar graph, which displays on every glance screen, is CPU Util. This displays how the CPU is being distributed:

S = System or kernel time
N = User time, executing processes that have had their nice value set greater than 20 (21-39)
U = User time, executing processes with a nice value of 20
A = User time, executing processes that have had their nice value set less than 20 (0-19); in other words, anti-nice
R = Real time, executing processes with priorities 127 and less


The Process List screen (g key), as shown on the slide, can be used to see process priorities. The order in which the processes are displayed can be configured (o key) to sort by CPU usage, memory usage, or disk I/O activity.

In HP-UX version 10.X, the thread count column was the blocked on column. The blocked on information can still be obtained by looking at an individual process's resource summary screen.


413. SLIDE: glance Individual Process

glance Individual Process

B3692A GlancePlus B.10.12 15:17:52 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S N N CPU Util S | 22% 29% 51% Disk Util F | 1% 7% 13% Mem Util | 91% 91% 91% S S U U B B Swap Util U | 25% 24% 35% U R R -------------------------------------------------------------------------------Resource Usage for PID: 16013, netscape PPID: 12988 euid: 520 User:sohrab -------------------------------------------------------------------------------CPU Usage (sec) : 3.38 Log Reads : 166 Wait Reason : SLEEP User/Nice/RT CPU: 2.43 Log Writes: 75 Total RSS/VSS : 22.4mb/ 28.3mb System CPU : 0.73 Phy Reads : 4 Traps / Vfaults: 414/ 8 Interrupt CPU : 0.14 Phy Writes: 61 Faults Mem/Disk: 0/ 0 Cont Switch CPU : 0.08 FS Reads : 4 Deactivations : 0 Scheduler : HPUX FS Writes : 29 Forks & Vforks : 0 Priority : 154 VM Reads : 0 Signals Recd : 339 Nice Value : 24 VM Writes : 0 Mesg Sent/Recd : 775/ 1358 Dispatches : 1307 Sys Reads : 0 Other Log Rd/Wt: 3924/ 957 Forced CSwitch : 460 Sys Writes: 32 Other Phy Rd/Wt: 0/ 0 VoluntaryCSwitch: 814 Raw Reads : 0 Proc Start Time Running CPU : 0 Raw Writes: 0 Fri Feb 6 15:14:45 1998 CPU Switches : 0 Bytes Xfer: 410kb

Student Notes
From the Process List screen, an individual process can be selected for further analysis (s key). The above slide shows some of the additional details available when analyzing a process further. Items of interest from the Individual Process screen include the process's nice value, the number of Forced versus Voluntary context switches, the current Wait reason, and the Parent PID.


414. SLIDE: glance Process Memory Regions

glance Process Memory Regions

B3692A GlancePlus B.10.12 10:17:41 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S S N N CPU Util | 22% 29% 51% F Disk Util | 1% 7% 13% Mem Util | 91% 91% 91% S S U U B B U U R R Swap Util | 25% 24% 35% -------------------------------------------------------------------------------Memory Regions for PID: 16013, netscape PPID: 14061 euid: 520 User:sohrab Type RefCt RSS VSS Locked File Name -------------------------------------------------------------------------------NULLDR/Shared 64 4kb 4kb 0kb <nulldref> TEXT /Shared 3 4.3mb 9.5mb 0kb /opt//netscape-bin DATA /Priv 1 5.8mb 8.6mb 0kb /opt//netscape-bin MEMMAP/Priv 1 4kb 20kb 0kb /opt//netscape-bin MEMMAP/Priv 1 36kb 36kb 0kb /opt//netscape-bin MEMMAP/Priv 1 12kb 12kb 0kb <memmap> STACK /Priv 1 28kb 28kb 0kb <stack> UAREA /Priv 1 16kb 16kb 0kb <uarea> LIBTXT/Shared 85 56kb 60kb 0kb /usr/lib/dld/sl Text RSS/VSS:4.3mb/9.5mb Shmem RSS/VSS: 0kb/ 0kb Data RSS/VSS:5.8mb/8.6mb Other RSS/VSS:4.1mb/5.7mb Stack RSS/VSS: 28kb/ 28kb

Student Notes
From the Individual Process screen, the memory regions (i.e. process components) corresponding to that process can be viewed (M key). The above slide shows the memory regions for the currently selected process. Items of interest from the Memory Region screen include the location of the process's Text, Data, Stack, and U-Area, along with its Shared/Private flag, its Resident Set Size and Virtual Set Size, and its reference count. If the process is associated with Memory Map files (MEMMAP), Shared Libraries (LIBTXT), or Shared Memory Segments (SHMEM), these will be displayed. In HP-UX version 11.X, glance no longer displays the addresses of each memory region. However, gpm still does.


415. SLIDE: glance Process Wait States

glance Process Wait States

B3692A GlancePlus B.10.12 10:23:03 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S N N CPU Util S | 22% 29% 51% Disk Util F | 1% 7% 13% Mem Util S S U | 91% 91% 91% U B B Swap Util U | 25% 24% 35% U R R -------------------------------------------------------------------------------Wait States for PID: 14205, netscape PPID: 14061 euid: 520 User:sohrab Event % Blocked On % -------------------------------------------------------------------------------IPC : 0.0 Cache : 0.0 CPU Util : 13.7 Job Control: 0.0 CDROM IO : 0.0 Wait Reason: SLEEP Message : 0.0 Disk IO : 0.0 Pipe : 0.0 Graphics : 0.0 RPC : 0.0 Inode : 0.0 Semaphore : 0.0 IO : 0.0 Sleep : 77.2 LAN : 0.0 Socket : 0.0 NFS : 0.0 Stream : 0.0 Priority : 9.1 Terminal : 0.0 System : 0.0 Other : 0.0 Virtual Mem: 0.0

C - cum/interval toggle

% - pct/absolute toggle

Page 1 of 1

Student Notes
From the Process List screen, the process wait states can be viewed (W key). The above slide shows the categories of wait states and where/what the selected process has waited on. Items of interest from the Process Wait State screen include the percentage of time the process has spent in each of the possible wait state categories.


416. LAB: Process Management

Directions
The following lab is designed to manage a group of processes. This includes observing the parent-child relationship and modifying process nice values (and thus, indirectly, priorities) with the nice/renice commands.

Modifying Process Priorities

This portion of the lab uses glance to monitor and modify nice values of competing processes.

1. Change directory to /home/h4262/baseline.

# cd /home/h4262/baseline

2. Start seven long processes in the background.

# ./long & ./long & ./long & ./long & ./long & ./long & ./long &

3. Start a glance session. Answer the following questions.

How much CPU time is each long process receiving? _______sec, _______%
How are the processes being context switched (forced or voluntary)? _______________
How many times over the interval is the process being dispatched? ____________
What is the ratio of system CPU time to user CPU time? __________
What are the processes being blocked on? _________________
What are the nice values for the processes? _________

4. Select one of the processes and favor it by giving it a more favorable nice value.

What is the PID of the process being favored? __________

To change the process's nice value, enter:

# renice -n -5 <PID of selected process>

Watch that process's percentage of the CPU over several display intervals with glance or top.

What effect did it have on the process? _____________________________
____________________________________________________________________


5. Select another long process and set its nice value to 30.

# renice -n 10 <PID of another selected process>

What effect did that have on that process? ___________________________________
______________________________________________________________________

6. You can either let the processes finish up on their own as the next module is covered, or you can kill them now with:

# kill $(ps -el | grep long | cut -c18-22)


Module 5 CPU Management

Objectives
Upon completion of this module, you will be able to do the following:

- Describe the components of the processor module.
- Describe how the TLB and CPU cache are used.
- List four CPU-related metrics.
- Identify how to monitor CPU activity.
- Discuss how best to use the performance tools to diagnose CPU problems.
- Specify appropriate corrections for CPU bottlenecks.


51. SLIDE: Processor Module

Processor Module

[Slide diagram: a processor module containing the CPU, TLB, cache, and coprocessor, connected by internal busses and attached to the system bus.]

Student Notes
A typical HP processor module consists of a central processing unit (CPU), a cache, a translation lookaside buffer (TLB), and a coprocessor. These components are connected via internal processor busses, with the entire processor module being connected to the system bus. The cache is made up of very high-speed memory chips. Cache can be accessed in one CPU cycle. Its contents are instructions and data that recently have been or are anticipated to be used soon by the CPU. Cache size varies between processors. The size of the cache can have a big effect on system performance. The translation lookaside buffer (TLB) is used to translate virtual addresses into physical addresses. It is a high-speed cache whose entries consist of pairs of recently accessed virtual addresses and their associated physical addresses, along with access rights and an access ID. The TLB is a subset of a system-wide translation table (page directory) that is held in memory. TLB size also affects system performance, and different HP 9000 processors have different TLB sizes.


The address translations kept in the TLB enable us to locate the appropriate data and instructions in memory. Memory is accessed via the physical address; without the translation in the TLB, we would not be able to find the information in memory.

Note these other points regarding the TLB:

- Each process has a unique virtual address space.
- Each TLB entry refers to a page of memory, not a single location.
- In all 64-bit architectures used by HP, pages are fundamentally 4 KB in size, but can be any multiple of 4 KB under various circumstances to reduce the number of entries needed in the TLB.


52. SLIDE: Symmetric Multiprocessing

Symmetric Multiprocessing

[Slide diagram: two identical processor modules, each with its own CPU, TLB, cache, and coprocessor, sharing the system bus.]

Student Notes
Symmetric Multiprocessing (SMP) refers to systems containing two or more processor units. SMP is implemented on all Hewlett-Packard workstations and servers capable of supporting more than one CPU. Each processor on an SMP system has exactly the same characteristics, including the same processing unit, the same CPU cache design, and the same size translation lookaside buffer (TLB).


53. SLIDE: Cell Module

Cell Module

[Slide diagram: a cell containing four processors and local memory on the cell internal bus, with I/O buses attached.]

Student Notes
A more recent design of HP systems is based on the cell architecture. In a cell, there are multiple processors, some memory and some I/O buses. Each cell could act as an independent SMP system, or as part of a collection of cells, forming a larger SMP system. Each processor in the cell has the same access speed (or latency) to the memory within the same cell. However, if one of those processors would have to access a location in the memory of a different cell, the latency would be greater. Each processor within the cell does have its own cache memory and TLB. Each processor has equal access to the I/O buses that are part of the same cell. They may also have access (with somewhat greater delays) to the I/O of other cells in the same system.


54. SLIDE: Multi-Cell Processing

Multi-Cell Processing

[Slide diagram: four cells, each with four processors (P), local memory, and I/O, joined by a high-speed memory interconnect.]

Student Notes
The best example HP currently has of an SMP using the cell architecture is the Superdome. On the slide we find four cells, each with four processors, some memory, and some I/O buses.

Each cell could be configured (using Node Partitioning, or NPars) into a separate and individual system capable of booting its own operating system. It would be functionally separate from the other cells. The only way that the operating system on that cell could communicate with the software running on any other cell would be through a network interface.

On the other hand, multiple cells could be configured to act as a unit. They would pool their resources and boot a single operating system, seamlessly acting as a single SMP system.

This architecture gives the customer and the system administrator tremendous flexibility in how to set up their hardware. They could even change it relatively easily from one configuration to another as their needs changed.

On a wider range of systems, you may be using Virtual Partitioning (VPars). These are similar to NPars, but are not limited to cell boundaries and are handled entirely by software. A system could use both NPars and VPars at the same time. Using software, processors can be moved from one VPar to another.


Finally, on an even wider range of systems, we have the concept of processor sets (psets). Multiple psets could exist within the same partition (either NPar or VPar). Each pset would be set aside for use by a particular application or group of applications. Using software, psets can be created and removed, and processors can be moved from one pset to another.
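As a sketch of the pset interface (see psrset(1M) for the exact options on your release; ./dbserver is only a placeholder):

# psrset -c 2 3            # create a new pset containing processors 2 and 3
# psrset -e 1 ./dbserver   # run a command inside pset 1
# psrset -d 1              # destroy pset 1, returning its processors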


55. SLIDE: CPU Processor

CPU Processor

[Slide diagram: the registers of the processor module (general registers, shadow registers, control registers, space registers, special function unit registers, coprocessor registers, the processor status word, and the instruction address queues) alongside the CPU, TLB, cache, and coprocessor.]

Student Notes
The CPU ultimately is responsible for your system speed. The kernel loads the process text for the CPU to execute. The processor module has many registers that assist in the execution of instructions. The definition of all these registers is beyond the scope of this course. The primary objective of this module is to focus on CPU clock speed, the size of the CPU cache, and the effects of the TLB on overall system performance.

Each HP 9000 server and workstation has a chip at its heart. The latest PA-RISC chips are the 64-bit PA-8xxx family. HP has also introduced systems using the 64-bit IA-64 Itanium chip. A selection of the range of current systems is listed on the following pages. Note the difference not only in clock speeds, but also in cache size.

The following tables list the specifics of several HP-UX servers and workstations. It is very difficult to keep a list of this nature up to date in training materials, but it has been included to demonstrate the wide variety of system characteristics present in the HP computing products family.


Business Servers

Model                   No. of CPUs     Clock Speed  Max. RAM (GB)  Cache               I/O Slots
rp3410-2 (PA-8800)      2               800 MHz      6              1.5MB(L1) 32MB(L2)  2 PCI (64-bit)
rp3440-4 (PA-8800)      4               1 GHz        24             1.5MB(L1) 32MB(L2)  4 PCI (64-bit)
rp4440-8 (PA-8800)      8               1 GHz        64             1.5MB(L1) 32MB(L2)  6 PCI
rp7420-16 (PA-8800)     16 (2 cells)    1 GHz        64             1.5MB(L1) 32MB(L2)  15 PCI
rp8420-32 (PA-8800)     32 (4 cells)    1 GHz        128            1.5MB(L1) 32MB(L2)  16 PCI
Superdome (PA-8800)     128 (16 cells)  1 GHz        1024           1.5MB(L1) 32MB(L2)  192 PCI
rx1600 (Itanium 2)      2               1 GHz        16             1.5MB(L3)           0/1/1 PCI *
rx2600 (Itanium 2)      2               1.5 GHz      24             6MB(L3)             0/4/0 PCI *
rx4640 (Itanium 2)      4               1.5 GHz      96             6MB(L3)             0/6/3 PCI *
rx5670 (Itanium 2)      2               1.5 GHz      64             6MB(L3)             0/4/2 PCI *
rx7620 (Itanium 2)      8 (2 cells)     1.5 GHz      64             6MB(L3)             15 PCI (128-bit)
rx8620 (Itanium 2)      16 (4 cells)    1.5 GHz      128            6MB(L3)             16+16 PCI (128-bit)
Superdome (Itanium 2)   64 (16 cells)   1.5 GHz      512            6MB(L3)             0/128/64 PCI *


Workstations

Model                No. of CPUs  Clock Speed  Max. RAM (GB)  Cache (KB)  I/O Slots
B2600 (PA-8600)      1            500 MHz      4              512/1024    2/2/0 PCI *
B3700 (PA-8700)      1            750 MHz      8              768/1536    2/3/1 PCI *
C3750 (PA-8700+)     1            875 MHz      8              768/1536    2/3/1 PCI *
J6750 (PA-8700+)     2            875 MHz      16             768/1536    0/0/3 PCI *
zx2000 (Itanium 2)   1            1.4 GHz      8              1536(L3)    5 PCI, 1 AGP
zx6000 (Itanium 2)   2            1.5 GHz      24             6144(L3)    3 PCI, 1 AGP

* 2/3/1 means 2 32-bit PCIs, 3 64-bit PCIs, and 1 128-bit PCI.

All Itanium 2 processors include 32KB of L1 cache and 256KB of L2 cache.

To determine the specifics of your system, refer online to http://www.hp.com/go/enterprise, select "Products Index", and scroll down to select your system platform name [i.e. J-Class (HP 9000)]. This will display the "Product Information" screen for the selected hardware.


56. SLIDE: CPU Cache

CPU Cache

[Slide diagram: the CPU presenting the address of the next instruction to execute. The TLB, cache, and coprocessor sit on the processor module; the process text resides in memory, reached over the system bus.]

Student Notes
The CPU loads instructions from memory and runs multiple instructions per cycle. To minimize the time that the CPU spends waiting for instructions and data, the CPU uses a cache. The cache is a very high-speed memory that can be accessed in one CPU cycle, with the contents being a subset of the contents of main memory. As the CPU requires instructions and data, they are loaded into the cache.

The size of the cache has a large bearing on how busy the CPU is kept. The larger the cache, the more likely it is to contain the instructions and data to be executed.

Most current processors support multi-level caches. The Level 1 cache (L1) is the fastest, operating at the same speed as the CPU; it is relatively small. The Level 2 cache (L2) operates at one-half the speed of the CPU and is somewhat larger. The IA-64 has a Level 3 cache (L3) that is even larger and slower.


57. SLIDE: TLB Cache

TLB Cache

[Slide diagram: virtual-to-physical address translation. The CPU presents the virtual address (VA) of the next instruction to the TLB, which caches VA/PA pairs. On a TLB miss, the page directory (PDIR) in memory supplies the translation. Once the physical address (PA) is known, the instruction is fetched from the cache, or from the process text in memory over the system bus.]

Student Notes
All 32-bit programs view their address space as starting at address 0, and ending at address 4 GB. All addresses referenced by the program are referenced relative to this address space. This is referred to as the program's virtual address space. A program's physical address is the address location in physical memory where the program is loaded at execution time. When the CPU executes a program, it is presented with the virtual address containing the instruction to be executed. In order to fetch this instruction from physical memory, the CPU must convert the virtual address (VA) into the corresponding physical address (PA). To do this, the CPU checks the TLB. If the VA->PA is present, it then knows the PA in memory of the instruction. If the VA is not present, it then needs to fetch the information from the PDIR (Page DIRectory) table in memory. This memory fetch of the PDIR table is relatively expensive from a performance standpoint. Once the PA is known, the CPU then checks the Instruction Cache on the CPU for the PA. If the PA is present, it then loads the instruction straight from Instruction Cache. If not present, it then needs to fetch the instruction from memory, which is relatively expensive (performance-wise).


The size of the TLB is anywhere from 96 to 160 entries (each entry points to a variable-sized memory page) on PA-RISC and IA-64 processors.


58. SLIDE: TLB, Cache, and Memory

TLB, Cache, and Memory

TLB    Cache  Memory  Consequence
Hit    Hit    Hit     1 CPU cycle fetch
Hit    Miss   Hit     Data/instruction memory fetch
Miss   X      Hit     PDIR memory fetch
Miss   X      Miss    Page fault

X = Don't care

Student Notes
The slide shows some of the permutations of hits and misses on memory, cache, and the TLB, as well as the consequences of each.

The best situation is when the VA has an entry in the TLB, and the corresponding PA has an entry in the CPU cache. This allows the instruction or data to be presented to the CPU in one clock cycle.

The next-best scenario is a hit on the TLB but a miss on the CPU cache. A representative cost to fetch a PA from memory into the CPU cache is 50 clock cycles.

Another scenario is a miss on the TLB but a hit on the CPU cache. The miss on the TLB requires the PDIR table in memory to be searched, and an appropriate entry to be loaded into the TLB. This takes a variable number of cycles to perform; on one model the average was 131 clock cycles. Therefore, a miss on the TLB is more expensive than a miss on the CPU cache.

A miss on both the TLB and the CPU cache translates into 131 + 50, or 181, clock cycles on average to access the instruction or data that the CPU needs. This could have been accessed in 1 clock cycle had the VA been in the TLB and the PA been in the CPU cache.


The worst scenario, performance-wise, is not having the instruction or data loaded in memory at all. In this case, a page fault would occur to retrieve the information from disk. Assuming a 1-GHz clock, a 10-ms disk transfer rate, and an idle disk drive, this would correspond to 10,000,000 clock cycles to access the data or instruction.
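Putting the sample numbers together, an illustrative average cost per access, assuming (made-up figures) a 99.5% TLB hit rate and a 95% cache hit rate with everything resident in memory, would be:

0.995 x (0.95 x 1 + 0.05 x 50) + 0.005 x (131 + 0.95 x 1 + 0.05 x 50)
  = 0.995 x 3.45 + 0.005 x 134.45
  = approximately 4.1 clock cycles

Even with hit rates this good, the average access costs roughly four times the ideal single cycle, which is why cache and TLB behavior matter so much to CPU performance.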


59. SLIDE: HP-UX Performance Optimized Page Sizes

HP-UX 11.00 Performance Optimized Page Sizes (POPS)

[Slide diagram: mapping a file's virtual addresses (VA) to physical addresses (PA) through the TLB on the CPU. With the HP-UX 10.x fixed 4 KB page size, every 4 KB page mapped into memory consumes its own TLB entry. With HP-UX 11.00 variable page sizes (range 4 KB to 64 MB), the same virtual address range is covered by far fewer, larger TLB entries.]

Student Notes
HP-UX 11.00 is the first release of the operating system to have general support for performance optimized page sizes (POPS), also known as variable page sizes. Partial support for variable memory page sizes has existed since HP-UX 10.20. HP-UX 11.00 allows customers to configure executables to use specific performance optimized page sizes, based on the program's text and data sizes. Page sizes can be selected from a range of 4 KB to 4 GB. The use of performance optimized page sizing can significantly increase performance of applications that have very large data or instruction sets.

NOTE: Performance-optimized page sizing works on PA-8000-based and IA-64-based systems.

Fixed Page Sizes (Prior to 11.00)


Prior to HP-UX 11.00, all page sizes were fixed at 4 KB. As a program executed, each 4 KB page would be mapped into physical memory, and a TLB entry would be created to map the virtual address corresponding to that page to the physical memory address. Selected models had a few Block TLB entries, which could map multiple pages into a single entry, if the pages were contiguous in both virtual and physical address spaces. These entries were reserved for mapping the kernel, the I/O pages, and other segments that were locked into memory.

At some point, the TLB would become full, and the virtual-to-physical address mapping would only be stored in the PDIR table in memory, not in the TLB on the CPU. This meant that if a virtual address needed to be translated, there would be a chance that the address would not have an entry in the TLB, and time would have to be spent to look up the address within the PDIR table in memory. This handling of the TLB miss was expensive in terms of performance.

Performance Optimized Page Sizes (11.00 and Beyond)


With the release of HP-UX 11.00, support for variable page sizes is available. With POPS, a larger portion of the process's virtual address space can be referenced within a single page, or within a few large pages. Therefore, a larger portion of the process can be referenced with far fewer TLB entries. Below are two tables showing which page sizes are available in the PA-RISC and IA-64 architectures.

PA-RISC:  4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G
IA-64:    4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 4G

Affecting Page Sizes


There are two methods of affecting page size in a process. One is through tunable kernel parameters: vps_pagesize determines the default page size when no other information is available; the size is given in 1 KB units, and the setting is typically 4. vps_ceiling determines how large the kernel can promote a page size for a process if it notices that the process is getting a very large number of TLB misses. The default setting for this is 16 (again in 1 KB units).

The second method is available to the system administrator. The chatr command can be used to provide the kernel with a hint of what page sizes would work best for a process. Following is an example of this command:

chatr +pi 16 +pd 256 /opt/app/bin/app

The above command hints to the kernel that this process would best execute with 16 KB pages for the instructions (text) and 256 KB pages for the data. This hint is stored in the header of the executable file and is visible to the kernel whenever the program is invoked. The kernel will do its best to see that the hint is followed. However, if memory pressure exists, the kernel may not be able to honor the request and may end up demoting the size of the page to be able to manage it in memory.

There is a third tunable parameter, vps_chatr_ceiling, that determines the maximum value a chatr command can assign to an executable file.
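The current values of these tunables, and the hints recorded in an executable, can be reviewed from the command line. A minimal sketch (kmtune is the kernel tuning utility on HP-UX 11.x; chatr invoked with no options simply reports an executable's current attributes; the path is the example executable above):

# kmtune -q vps_pagesize
# kmtune -q vps_ceiling
# kmtune -q vps_chatr_ceiling
# chatr /opt/app/bin/app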


5-10. SLIDE: CPU Metrics to Monitor Systemwide

CPU Metrics to Monitor Systemwide


- User CPU utilization
- Nice/Anti-nice utilization
- Real-time processes
- System CPU utilization
- System call rate
- Context switch rate
- Idle CPU utilization
- CPU run queues (load averages)

Student Notes
The load on the CPU can be monitored in a number of different ways. There are multiple tools and multiple metrics that monitor CPU performance.

User CPU Utilization


This is the percentage of time the CPU spent running in user mode. This corresponds to executing code within user processes, as opposed to code within the kernel. It is better to see user CPU utilization higher than system CPU utilization (preferably two to three times higher).

Nice/Anti-Nice Utilization
This is the percentage of time the CPU spent running user processes with nice values of 21-39 (Nice) or 0-19 (Anti-Nice). This is typically included in USER CPU utilization, but some tools, like glance, track this separately to see how much CPU time is being spent on weaker or stronger priority processes.


Real Time Processes


This is the amount of time spent executing real time processes that are running on the system. Real time processes get the CPU immediately when they are ready to execute, and can have a big impact on the performance of time-shared processes.

System CPU Utilization


This is the percentage of time the CPU spent running in system (or kernel) mode. This corresponds to executing code within the kernel. We have to have some kernel time just to do minimum management on the system. However, excessive time spent managing the system is bad for performance; system CPU utilization is generally considered excessive when it is greater than the user utilization.

System Call Rate


The system call rate is the rate at which system calls are being generated by the user processes. Every system call causes a switch to occur between user mode and system (or kernel) mode. A high system call rate typically corresponds to a high system CPU utilization. If the system call rate is high, it is recommended to investigate which system calls are being generated, the frequency of each system call, and the average duration of each system call.

Context Switch Rate


This is the number of times the CPU switched processes (on average) per second. This is typically included in system CPU utilization, but some tools, like glance, track this separately.

Idle CPU
This is the percentage of time the CPU spent doing nothing (i.e., it did not execute any user or kernel code). It is good to see some, even lots, of idle CPU time. A non-idle CPU means the CPU run queue is never exhausted (or emptied), which means processes are always having to wait before reaching the CPU. The size of the line (the CPU run queue) grows as idle CPU time approaches 0.

CPU Run Queues/Load Average


Both these terms reference the same thing. This is the number of processes in the CPU run queue. For best performance, the average load in the CPU run queue should not exceed three.
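The quickest way to read the run queue is the uptime command, whose three load averages cover the last 1, 5, and 15 minutes. The output below is illustrative:

# uptime
  3:05pm  up 12 days,  4:07,  4 users,  load average: 0.52, 0.41, 0.38

Sustained load averages above three suggest that processes are routinely queuing for the CPU.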


5-11. SLIDE: CPU Metrics to Monitor per Process

CPU Metrics to Monitor per Process


- Process priority
- Process nice value
- Amount of CPU user time
- Amount of CPU system time

Student Notes
Individual processes vary greatly in terms of the load they place on the CPU. Metrics to monitor on an individual process include the following.

Process Priority
This is the priority of the process. If the priority is 127 or less, we know it is a real time process. If the priority is 128-177, either it is a system process, or it is a user process that is sleeping. If the priority is 178-255, then we know the process is executing in USER mode.

Process Nice Value


This is the nice value associated with the process. This only applies to time-share processes. This value determines how fast the process regains priority while it is waiting for the CPU. Small nice values (0-19) should be given to more important processes allowing them to regain priority quickly. Large nice values (21-39) should be given to less important processes, causing them to regain priority slowly.
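Both of these per-process metrics can be checked with a long ps listing; the PRI column carries the priority and NI the nice value:

# ps -el | head -5

Per the ranges above, a PRI of 127 or less marks a real-time process, while time-share user processes appear between 178 and 255.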


User CPU Time vs. System CPU Time


This is the percentage of time the individual process spent in user mode (i.e. having the CPU execute user code) and system mode (i.e. having CPU execute kernel code). This is helpful in determining where the CPU spends its time when executing the process: user code or kernel code. It is generally desirable to see more time in user code.


5-12. SLIDE: Activities that Utilize the CPU

Activities that Utilize the CPU


- Process management
- File system I/O
- Memory management activities
- System calls
- Applications (for example, CAD-CAM and database processes)
- Batch jobs

Student Notes
Examples of activities that place a load on the CPU include the following.

System Activities
System activities are those activities which execute in kernel mode. Examples of system activities include system processes and user processes executing system calls:

- Process startup
- Process scheduling
- File system and raw I/O
- Memory management
- Handling of system calls


User Activities
User activities are those activities that execute in user mode:

- CAD/CAM applications
- Database processing
- Client/server applications
- Compute-bound applications
- Background jobs (i.e., batch jobs)


5-13. SLIDE: glance CPU Report

glance CPU Report

B3692A GlancePlus B.10.12 05:00:42 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S N N CPU Util S | 25% 20% 47% Disk Util F | 12% 6% 23% Mem Util S S U | 85% 83% 85% U B B Swap Util U | 18% 18% 18% U R R -------------------------------------------------------------------------------CPU REPORT Users= 4 State Current Average High Time Cum Time -------------------------------------------------------------------------------User 18.9 6.0 32.3 0.96 3.61 Nice 0.0 2.4 5.7 0.00 1.47 Negative Nice 0.4 0.8 16.2 0.02 0.51 RealTime 0.4 0.4 0.7 0.02 0.22 System 3.3 7.0 16.2 0.17 4.21 Interrupt 1.8 1.7 2.7 0.09 1.02 ContextSwitch 0.6 0.7 1.4 0.03 0.40 Traps 0.0 0.0 0.0 0.00 0.00 Vfaults 0.0 0.7 3.6 0.00 0.45 Idle 74.6 80.2 91.2 3.79 48.18 Top CPU user: PID Active CPUs: 1 2097, dthelpview 19.5% cpu util Page 1 of 2

Student Notes
The glance CPU report (c key) provides details on where the CPU is spending its time from a global perspective.

User mode: This is time spent by the CPU in user mode for all processes on the system. This includes processes with a nice value of 20 (user), processes with nice values between 21-39 (nice), processes with nice values between 0-19 (negative nice), and real-time priority processes.

System mode: This is time spent by the CPU in system mode for all processes on the system. It includes time spent handling general system calls (system), and time spent handling interrupts, context switches, traps, and Vfaults (virtual faults).

Load Average: This is the number of jobs in the CPU run queue averaged over three time intervals: the last 1 minute, the last 5 minutes, and the last 15 minutes. The CPU load average data is viewable on page 2 of this glance report. Also on page two are the System Call Rate, the Interrupt Rate, and the Context Switch Rate.


5-14. SLIDE: glance CPU by Processor

glance CPU by Processor

B3692A GlancePlus B.10.12 05:13:18 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S N N CPU Util S | 25% 20% 47% Disk Util F | 12% 6% 23% Mem Util | 85% 83% 85% S S U U B B Swap Util U | 18% 18% 18% U R R -------------------------------------------------------------------------------CPU BY PROCESSOR Users= 4 CPU State Util LoadAvg(1/5/15 min) CSwitch Last Pid -------------------------------------------------------------------------------0 Enable 25.4 0.6/ 0.4/ 0.3 72187 1061

Page 1 of 2 CPU Util User Nice NNice RealTm Sys Intrpt CSwitch Trap Vfault -------------------------------------------------------------------------------0 25.4 20.7 0.0 0.0 0.0 4.7 0.0 0.0 0.0 0.0

Page 2 of 2

Student Notes
The glance CPU-by-processor report (a key) provides details on a per-CPU basis.

CPU Utilization: This is the CPU utilization for the specific processor. If two or more processors exist on the system, the global CPU Util bar graph shows an average CPU utilization; that is, a CPU that is 100% utilized and a second CPU that is 0% utilized will display as 50% CPU utilization. This report displays utilization on a per-processor basis.

Load Average: This is the number of processes, on average, in the CPU run queue over the last 1 minute, 5 minutes, and 15 minutes. This report displays CPU run queue information on a per-processor basis.

Page two of this display shows the utilization broken down into user mode, nice, negative nice, real-time, system, interrupts, context switches, traps, and virtual faults on a per-processor basis.


5-15. SLIDE: glance Individual Process

glance Individual Process

B3692A GlancePlus B.10.12 15:17:52 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S N N CPU Util S | 22% 29% 51% Disk Util F | 1% 7% 13% Mem Util S S U | 91% 91% 91% U B B Swap Util U | 25% 24% 35% U R R -------------------------------------------------------------------------------Resource Usage for PID: 16013, netscape PPID: 12988 euid: 520 User:sohrab -------------------------------------------------------------------------------CPU Usage (sec) : 3.38 Log Reads : 166 Wait Reason : SLEEP User/Nice/RT CPU: 2.43 Log Writes: 75 Total RSS/VSS : 22.4mb/ 28.3mb System CPU : 0.73 Phy Reads : 4 Traps / Vfaults: 414/ 8 Interrupt CPU : 0.14 Phy Writes: 61 Faults Mem/Disk: 0/ 0 Cont Switch CPU : 0.08 FS Reads : 4 Deactivations : 0 Scheduler : HPUX FS Writes : 29 Forks & Vforks : 0 Priority : 154 VM Reads : 0 Signals Recd : 339 Nice Value : 24 VM Writes : 0 Mesg Sent/Recd : 775/ 1358 Dispatches : 1307 Sys Reads : 0 Other Log Rd/Wt: 3924/ 957 Forced CSwitch : 460 Sys Writes: 32 Other Phy Rd/Wt: 0/ 0 VoluntaryCSwitch: 814 Raw Reads : 0 Proc Start Time Running CPU : 0 Raw Writes: 0 Fri Feb 6 15:14:45 1998 CPU Switches : 0 Bytes Xfer: 410kb

Student Notes
The glance individual process report (s key followed by the PID) displays CPU usage for an individual process, and the distribution of CPU time when executing the process (user, system, interrupt, context switch). Ideally, a process should spend more time in User/Nice/RT mode than in any of the other three modes. Also displayed on a per-process basis are the priority and nice values for the selected process, along with the total number of forced context switches (time slice expiration or process preemption) and voluntary context switches (the process putting itself to sleep).


5-16. SLIDE: glance Global System Calls

glance Global System Calls

B3692A GlancePlus B.10.12 05:17:52 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S N N CPU Util S | 25% 20% 47% Disk Util F | 12% 6% 23% Mem Util S S U | 85% 83% 85% U B B Swap Util U | 18% 18% 18% U R R -------------------------------------------------------------------------------GLOBAL SYSTEM CALLS Users= 4 System Call Name ID Count Rate CPU Time Cum CPU -------------------------------------------------------------------------------syscall-0 0 16 3.1 0.05921 2.19037 fork 2 0 0.0 0.00000 0.01398 read 3 105 20.5 0.00210 0.07625 write 4 47 9.2 0.00208 0.13624 open 5 16 3.1 0.00143 0.03146 close 6 16 3.1 0.00040 0.00848 wait 7 1 0.1 0.00011 0.00031 time 13 46 9.0 0.00023 0.00446 chmod 15 0 0.0 0.00000 0.00009 ioctl 54 503 57.8 0.00900 0.79813 poll 269 277 48.5 0.00983 1.83466 Cumulative Interval: 87 secs Page 1 of 7

Student Notes
The glance global system calls report (Y key) displays all the system calls that have been executed system-wide. When system CPU utilization is high, this report can be used to identify on which system calls the CPU is spending most of its time.


5-17. SLIDE: glance System Calls by Process

glance System Calls by Process

B3692A GlancePlus B.10.12 05:39:20 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S N N CPU Util S | 22% 29% 51% Disk Util F | 1% 7% 13% Mem Util S S U | 91% 91% 91% U B B Swap Util U | 25% 24% 35% U R R -------------------------------------------------------------------------------System Calls for PID: 1822, netscape PPID: 1775 euid: 503 User:roc Elapsed Elapsed System Call Name ID Count Rate Time Cum Ct CumRate CumTime -------------------------------------------------------------------------------read 3 477 93.5 0.16884 742 49.1 0.24275 write 4 219 42.9 0.02831 352 23.3 0.06787 open 5 63 12.3 0.01396 99 6.5 0.02491 close 6 9 1.7 0.00046 20 1.3 0.00104 time 13 34 6.6 0.00031 89 5.8 0.00083 brk 17 27 5.2 0.00171 45 2.9 0.00264 lseek 19 69 13.5 0.00150 135 8.9 0.00304 stat 38 4 0.7 0.00131 13 0.8 0.00415 ioctl 54 636 124.7 0.01463 1167 77.2 0.02813 utssys 57 0 0.0 0.00000 3 0.1 0.00013 Cumulative Interval: 15 secs Page 1 of 3

Student Notes
While examining an individual process, the system calls generated by that particular process can be viewed using the L key. When the system time utilization is high for an individual process, this report can be used to view the specific system calls the process is performing, how many times the system calls are being invoked, and (most importantly) how much time is being spent by the CPU to execute the system calls. The read() and write() system calls often take the most time, as they require physical I/O to the disk drives.


5-18. SLIDE: sar Command

sar Command
$ sar <option> <interval size> <number of intervals>

Options:
    -u    CPU utilization (usr, sys, wio, idle)
    -q    Queue lengths/utilization (run, swap)
    -M    Above information in per-processor format
    -c    System calls

Student Notes
The sar command can be used to display global statistics on several important CPU operations. Using the -u option, information can be displayed on the time the system spent in user mode, in system mode, waiting for (disk) I/O, and idle. The waiting-for-(disk)-I/O state is not reported by any other tool; other tools simply lump it in with idle time. An example of the sar output with the -u option is shown below:
# sar -u 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24    %usr    %sys    %wio   %idle
08:32:29      64      36       0       0
08:32:34      61      39       0       0
08:32:39      61      39       0       0
08:32:44      61      39       0       0
Average       61      39       0       0


Using the -q option, information can be displayed on the length and utilization of the run queue and the swap queue. We are most interested at this time in the run queue. An example of the sar output with the -q option is shown below:
# sar -q 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:33:24   runq-sz  %runocc  swpq-sz  %swpocc
08:33:29         8      100        0        0
08:33:34         8      100        0        0
08:33:39         8      100        0        0
08:33:44         8      100        0        0
Average          8      100        0        0

The -M option is always used in conjunction with -u and/or -q. It causes the metrics to be broken down by processor, so you can see how each processor is being utilized.

The -c option shows the total number of system calls being executed per second and singles out four specific system calls for further detail: the read(), write(), fork(), and exec() system calls. Also reported on this display is the average number of characters transferred in or out each second. An example of this output follows:
# sar -c 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:33:24  scalls/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
08:33:29       332        3        9    0.00    0.00    38630     2657
08:33:34       435        4       24    0.00    0.00    30310     2662
08:33:39       270        3       14    0.00    0.00     6758        0
08:33:44       524       20       15    0.20    0.20    73523        0
Average        390        7       15    0.05    0.05    37187     1331


5-19. SLIDE: timex Command

timex Command
$ timex prime_med

real         25.65
user         20.71
sys           3.43

Student Notes
The timex command can be used to benchmark how long the execution of a particular process takes, in seconds. The command measures:

real time    The amount of elapsed time from when the program started to when the program completed (sometimes referred to as wall clock time).
user time    The amount of time spent by the program executing in user mode.
sys time     The amount of time spent by the program executing in kernel mode.

The example on the slide shows a total of 25.65 seconds elapsed from when the program prime_med started to when it completed. The execution spent 20.71 seconds executing in user mode and 3.43 seconds executing in kernel mode. The difference between user + system and real time is attributed to time the process spent not running on the CPU. The process may not get CPU time either because it was waiting on some resource (like disk or CPU) or because it was in a sleep state waiting for an event (like a child process waiting to finish executing).
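In other words, the off-CPU time in the example can be computed directly: real - (user + sys) = 25.65 - (20.71 + 3.43) = 1.51 seconds that prime_med spent waiting rather than executing.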


5-20. SLIDE: Tuning a CPU-Bound System Hardware Solutions

Tuning a CPU-Bound System Hardware Solutions


- Upgrade to a faster processor
- Upgrade the system with a larger data/instruction cache
- Add a processor to a multiprocessor system
- Spread applications to multiple systems

Student Notes
Practically speaking, the easiest performance gains are usually achieved by adding more and faster hardware. This could be upgrading to a faster processor, upgrading to a processor with more cache, adding another processor, or buying another system and off-loading some applications to the second system.

Upgrading to a faster processor may be possible with a simple module swap, but more than likely it would involve upgrading your entire system to a newer model. Some systems come with two or three possible processors, and yours may not have the fastest available. If so, you may be able to upgrade the system's processors to faster versions without touching the rest of the system.

Nowadays, it's unlikely that you'll be able to upgrade the cache memory or TLB to larger sizes. Each processor chip comes with a predetermined amount of cache and a fixed-size TLB. Only by going to a different processor chip (and thus a larger model) will you be able to affect the cache memory and TLB sizes.

If your system is not yet at its full complement of processors, it may relieve your workload to add more processors. If you have a cell-based architecture, you may be able to add more processors to each cell, or even add more cells. Some servers come with extra processors installed, but not enabled. These systems have a feature called ICOD (Instant Capacity On Demand). By simply contacting HP, you can have these disabled processors enabled, giving you more processing power with a minimum of time. If, at a later date, those processors are no longer needed, they can be disabled in a similar fashion.

Finally, if you have a system which is heavily loaded and another which is lightly loaded, it may be possible to transfer some of the tasks from the busy system to the less busy one.

The disadvantage of these solutions is that most of them cost money.


5-21. SLIDE: Tuning a CPU-Bound System Software Solutions

Tuning a CPU-Bound System Software Solutions


- Nice less-important processes
- Anti-nice more-important processes
- Consider using rtprio or rtsched on the most important processes
- Run batch jobs during non-peak hours
- Consider using PRM/WLM
- Consider using the processor affinity call mpctl()
- Optimize/recompile the application

Student Notes
If the easiest performance gains come from upgrading the hardware, then the greatest gains are likely to come from improving the software. A system with the fastest and most current hardware can still run slowly if the software is not configured properly.

One way to improve the performance of specific processes is to improve the priority of those processes. You can do this by improving the process's nice value or by making the process a real-time process. Or, you can reduce the nice value of other processes. Be careful when promoting a process to real time: if the process is not well-behaved, it can take over your entire system. By well-behaved, we mean that it is not compute bound and is free of serious bugs.

Running batch jobs at non-peak hours has been a standard performance solution for many years on many systems. Other software performance improvements can be realized by using PRM (Process Resource Manager), WLM (Workload Manager), or the mpctl() system call.
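As a sketch of the first three bullets on the slide (the program paths, PID, and priority value are illustrative; see nice(1), renice(1M), and rtprio(1) for details):

# Start a less important job with a weaker (larger) nice value:
# nice -n 10 /opt/app/bin/nightly_report &
# Weaken an already-running process:
# renice -n 5 -p 1234
# Run a critical, well-behaved program at a real-time priority:
# rtprio 64 /opt/app/bin/critical_job &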


5-22. SLIDE: CPU Utilization and MP Systems

CPU Utilization and MP Systems

(Slide diagram: a two-processor MP system. Each processor has its own CPU, TLB, cache, and coprocessor; both share memory across the system bus.)

Is each processor pulling its weight?


The sar -uqM command string can help you monitor the CPU loading on the individual processors in an MP system.

Student Notes
The sar command can be utilized to report CPU utilization for the overall system on a per-processor basis (when the -u and -M options are specified). In addition, the -q option will report the average run queue length while occupied, and the percent of time occupied. Both of these metrics can assist in the evaluation of CPU loading and should be considered before making processor affinity calls.

top can also show you how your CPU resource is being distributed over the system. It automatically breaks down the load and utilization percentages on a per-processor basis when invoked.

Remember, when you are running a system that supports partitions (nPars or vPars), these tools only show you what is happening within a partition, as each partition has booted its own copy of the operating system and is acting as an independent system.
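For example, to sample utilization and queue lengths per processor every 5 seconds, four times:

# sar -uqM 5 4

The output resembles the -u and -q examples shown earlier, but broken out with one line per processor, so an idle processor sitting next to a saturated one is immediately visible.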


5-23. SLIDE: Processor Affinity

Processor Affinity

(Slide diagram: the same two-processor system; an mpctl(proc2) call binds the process to processor 2 across the system bus.)

The mpctl() system call assigns the calling process to a specific processor.

Student Notes
The mpctl() system call provides a means for determining how many processors are installed in the system (or partition), how many processors are in this pset, and assigning processes or threads to run on specific processors (also known as processor affinity) or within specific psets, and much, much more. Refer to the man page for mpctl() on your system. Much of the functionality of this capability is highly dependent on the underlying hardware. An application that uses this system call should not be expected to be portable across architectures or implementations. Processor sets are supported by the pset() system call. If your version of the operating system supports psets, refer to the man page for pset() for full details.


5-24. LAB: CPU Utilization, System Calls, and Context Switches

Directions
General Setup
Create a working data file in a separate file system (on a separate disk, if possible). If another disk is available:

# vgdisplay -v | grep Name        (note which disks are already in use by LVM)
# ioscan -fnC disk                (note any disks not mentioned above; select one)
# pvcreate -f <raw disk device file>
# vgextend vg00 <block disk device file>

In either case:

# lvcreate -n vxfs vg00
# lvextend -L 1024 /dev/vg00/vxfs <block disk device file>
# newfs -F vxfs /dev/vg00/rvxfs
# mkdir /vxfs
# mount /dev/vg00/vxfs /vxfs
# prealloc /vxfs/file <75% of main memory in bytes>

The lab programs are under /home/h4262/cpu/lab0:

# cd /home/h4262/cpu/lab0

The tests should be run on an otherwise idle system; otherwise, results are unpredictable. If the executables are missing, generate them by typing:

# make all

CPU Utilization: System Call Overhead


The dd command lets us set the size of each read and write operation. The block size can be varied to change the number of system calls used to transfer the same amount of information, which exposes the overhead of the system call interface. The first command loads the entire file into the buffer cache:

# timex dd if=/stand/vmunix of=/dev/null bs=64k

Now we take our measurements.

# timex dd if=/stand/vmunix of=/dev/null bs=64k
real __________ user __________ system ____________


# timex dd if=/stand/vmunix of=/dev/null bs=2k
real __________ user __________ system ____________

# timex dd if=/stand/vmunix of=/dev/null bs=64
real __________ user __________ system ____________

System Calls and Context Switches


This lab shows you the maximum system call and context switch rates that your system can sustain. Three programs are supplied:

syscall       loads the system with system calls of one type
filestress    (shell script) generates file system-related system calls
cs            loads the system with context switches

1. What is the system call rate when your system is "idle"? ________________

2. Run filestress in the background. What is the system call rate now? What system calls are generated by filestress? Take an average with sar over about 40 seconds, i.e.
   # sar -c 10 4

3. Terminate the filestress process by entering the following commands:
   # kill $(ps -el | grep find | cut -c24-28)
   # kill $(ps -el | grep find | cut -c18-22)

4. Run the syscall program and again answer question 2. Is the system call rate lower or higher than with filestress? Why?
   _____________________________________________________________________
   Kill the syscall program before proceeding.
   # kill $(ps -el | grep syscall | cut -c18-22)

5. Using cs, compare the number of context switches on an idle system and a loaded system.
   Idle ________ Loaded ______________

6. Kill the cs program, remove /vxfs/file, and dismount the /vxfs filesystem.
   # kill $(ps -el | grep cs | cut -c18-22)
   # rm -f /vxfs/file
   # umount /vxfs


5-25. LAB: Identifying CPU Bottlenecks

Directions


The following labs are designed to show the symptoms of a CPU bottleneck.

Lab 1
1. Change directory to /home/h4262/cpu/lab1.
   # cd /home/h4262/cpu/lab1

2. Start the processes running in the background.
   # ./RUN

3. Start a glance session and answer the following questions.
   What is the CPU utilization? _______
   What are the nice values of the processes receiving the most CPU time? _______
   What is the average number of jobs in the CPU run queue? ______

4. Characterize the 8 lab processes that are running (proc1-8). Which are CPU hogs? Memory hogs? Disk I/O hogs, etc.? Identify processes that you think are in pairs.
   ________________________________________________________________________

5. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.
   # timex /home/h4262/baseline/short &
   How long did the program take to execute? _______

6. Compare your results to the baseline established in the lab exercise in module 1, step 5.

7. End the CPU load by executing the KILLIT script.
   # ./KILLIT


Lab 2
1. Change directory to /home/h4262/cpu/lab2.
   # cd /home/h4262/cpu/lab2

2. Start the processes running in the background.
   # ./RUN

3. In one terminal window, start glance. In a second terminal window, run:
   # sar -u 5 200
   Answer the following questions:
   What does glance report for CPU utilization? _______
   What does sar report for CPU utilization? ________
   What is the priority of the process receiving the most CPU time? _______
   How much time is the process spending in the sigpause system call? ______
   How is the process being context switched (forced or voluntary)? ______

4. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.
   # timex /home/h4262/baseline/short &
   How long did the program take to execute? _______

5. End the CPU load by executing the KILLIT script.
   # ./KILLIT


Module 6 Memory Management


Objectives
Upon completion of this module, you will be able to do the following:

- Describe how the HP-UX operating system performs memory management.
- Describe the main performance issues that involve memory management.
- Describe the UNIX buffer cache.
- Describe the sync process.
- Identify the symptoms of a memory bottleneck.
- Identify global and process memory metrics.
- Use performance tools to diagnose memory problems.
- Specify appropriate corrections for memory bottlenecks.
- Describe the function of the serialize command.


6-1. SLIDE: Memory Management

Memory Management

(Slide diagram: virtual memory spans both main memory and swap space; the pages of running processes are split between the two.)

Student Notes
Memory management refers to the subsystem within the kernel that is responsible for managing the main memory (also known as RAM) of the computer. When managing main memory, the kernel allocates memory pages (default size is 4 KB) to processes as they need space. When main memory runs low on free space, the kernel will try to free up some pages in memory by copying those pages out to swap space on disk. The swap space can be thought of as an extension of main memory (like an overflow area) that is used when main memory becomes full. Processes paged out to the swap area cannot be referenced again until they are paged back in to main memory. The term virtual memory refers to how much memory the kernel perceives as being available for allocation to processes. When the kernel allocates space to a process, it must track that page for the life of the process. Virtual memory includes main memory and swap space, as pages allocated to processes may be moved to swap space.

Example
In the slide, there are three different processes being tracked: a one-page process, a two-page process, and a three-page process. The one-page process started in main memory and was subsequently paged to swap space. The two-page process is entirely resident in main memory. And the three-page process has been partially paged to swap space (two of three pages are on swap). From a virtual memory standpoint, the three processes are taking up six pages of memory: three pages in main memory and three pages on swap.

The preceding example is pretty simple. Reality is a little more complex. Processes actually consist of two basic types of pages, text and data. The data pages have write capabilities and thus their contents must be preserved when they are moved out of memory (to swap space). The text pages cannot be modified by the executing program. They are initially read in from the file system. If the memory manager should want to release the space that a text page is taking, it does not have to copy it out to swap, or even back to the file system.


6-2. SLIDE: Memory Management Paging

Memory Management Paging

(Slide diagram: a grid of memory pages marked 1, 0, or F, swept by the two hands of the vhand process: the reference hand, which clears reference bits, and the free hand, which frees pages that are still unreferenced. Legend: 1 = page is being referenced; 0 = page is NOT being referenced; F = memory page freed by the vhand process.)

Student Notes
The vhand daemon is responsible for keeping a minimum amount of memory free on the system at all times. The vhand daemon does this by monitoring free pages and trying to keep their number above a threshold to ensure sufficient memory for efficient demand paging. The vhand daemon utilizes a "two-handed" clock algorithm as seen on the slide. The first hand (also known as the reference hand or age hand) clears reference bits on a group of pages in an active part of memory. If the bits are still clear by the time the second hand (also known as the free hand or steal hand) reaches them, the pages are paged out. The kernel automatically keeps an appropriate distance between the hands, based on the available paging bandwidth, the number of pages that need to be stolen, the number of pages already scheduled to be freed, and the frequency in which vhand runs. In essence, the distance between the hands determines how aggressive vhand is behaving. It behaves more aggressively as the memory pressure increases.


6-3. SLIDE: Paging and Process Deactivation

Paging and Process Deactivation

(Slide diagram: as free memory pages fall through the LOTSFREE, DESFREE, and MINFREE thresholds, the paging scanning rate rises. Below LOTSFREE, paging begins with the possibility of stabilization; below DESFREE, paging continues at the maximum rate with no possibility of stabilization; below MINFREE, process deactivation begins to occur.)

Student Notes
The system uses a combination of paging and deactivation to manage the amount of free memory. A minimum amount of free memory is needed to allow the demand paging system to work properly. No paging occurs until free memory falls below a threshold called LOTSFREE. Upon falling below LOTSFREE, paging occurs at a minimum level, becoming more aggressive as the number of free pages decreases. If the demand for memory continues, then paging continues. However, if the demand for memory subsides, then there is a possibility that the amount of free memory will stabilize below the LOTSFREE threshold.

If free memory falls below a second threshold called DESFREE, then there is no possibility of stabilization (until free memory goes back above DESFREE), and the paging rate becomes much more aggressive compared to the initial paging rate.

Finally, if free memory falls below MINFREE, then process deactivation begins. A process is chosen by the kernel to be deactivated, and it is placed on the deactivation queue. Because the process is deactivated (and therefore its pages are not being referenced), vhand will be able to page all its pages (including the uarea) out to the swap partition. The process will be reactivated automatically once free memory rises above MINFREE. When a process is reactivated, only the uarea is immediately paged in. Other pages are faulted in as needed.

Below are the default formulae for LOTSFREE, DESFREE, and MINFREE (NKM = Non-Kernel Memory):

                 <=32 MB            >=32 MB, <=2 GB     >2 GB
    LOTSFREE     1/8 of NKM         1/16 of NKM         64 MB
    DESFREE      1/16 of NKM        1/64 of NKM         12 MB
    MINFREE      1/2 of DESFREE     1/4 of DESFREE      5 MB

NOTE

The values of LOTSFREE, DESFREE, and MINFREE were made tunable kernel parameters in HP-UX 11.00. Prior to the 11.00 release, these values were fixed and could not be changed. It is recommended by HP, however, that the parameters not be tuned manually.


6-4. SLIDE: The Buffer Cache

The Buffer Cache


- Pool of memory designed to retain the most commonly accessed files from disk
- Used only for file system I/O (not raw I/O)
- Size of buffer cache controlled by dbc_min_pct and dbc_max_pct

(Slide diagram: a process reads a file through the buffer cache in memory rather than directly from the file system.)

Student Notes
The buffer cache exists to speed up file system I/O. The system tries to minimize disk access by going to disk as infrequently as possible, because disk access is often a bottleneck on most systems. Therefore, the most recently- or commonly-accessed files from disk persist in the portion of memory called the buffer cache. It is called dynamic because the size of the buffer cache grows or shrinks dynamically, depending on competing requests for system memory. Its minimum size is governed by the tunable parameter dbc_min_pct, and it cannot grow larger than the size specified in dbc_max_pct. These two parameters are expressed as percentages of total physical memory on the system. Let's say dbc_min_pct is set to 10, while dbc_max_pct is 50. This means that initially 10% of physical memory is allocated to the buffer cache. As the system needs more space to buffer files read in from disk, the buffer cache will allocate more memory, and this will continue until it occupies 50% of memory, its maximum size. Later, when the system requires more memory for another use, say processes, the buffer cache could shrink an appropriate amount, but will never be less than the 10% minimum value. Therefore, a larger buffer cache is able to hold more files and will minimize their access time but will leave less memory available for other uses.


NOTE:

The buffer cache is dynamic in nature only when two other tunable parameters, bufpages and nbuf, are both set to their default values of 0.

Another example: if dbc_min_pct and dbc_max_pct are both set to the same value, say 20, the kernel will always use exactly that percentage of physical memory for the buffer cache.
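A sketch of checking (and staging a change to) these parameters with kmtune, the HP-UX 11.x kernel tuning utility; the value 30 is only an example:

# kmtune -q dbc_min_pct
# kmtune -q dbc_max_pct
# Stage a smaller maximum, to take effect with the next kernel build/boot:
# kmtune -s dbc_max_pct=30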


6-5. SLIDE: The syncer Daemon

The syncer Daemon


- All entries stay in the buffer cache for a minimum of 30 seconds before being flushed.
- The syncer daemon runs once every 6 seconds and flushes 20% of the buffer cache to disk.

(Slide diagram: the syncer daemon flushes file data from the buffer cache in memory out to the file system.)

Student Notes
For disk writes, data flows from the buffer cache to disk. How does it get to the buffer cache? The kernel writes data to it. The syncer process takes care of flushing data in the buffer cache to the files on the disk. When a user edits a file, makes changes to that file, and saves the changes, those changes do not go to disk right away. The kernel writes the data to the buffer cache, and some time later (within 60 seconds) the data finally arrives at the disk. This time period is chosen as a balance between ensuring that the file system is fairly up-to-date in case of a crash and efficiently performing disk I/O. There are many applications that do not rely on the operating system's built-in processes to flush data to disk, but instead take over that operation themselves. In other words, they create their own buffers and manage the flushing at appropriate intervals. A common example is a database application that needs to guarantee the completion of a transaction within a specified time interval.
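The flush can also be requested by hand: the sync command schedules all buffered writes out to disk, which is worth doing before timing disk-sensitive benchmarks or shutting down:

# sync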


6-6. SLIDE: IPC Memory Allocation

IPC Memory Allocation


# ipcs -mob
IPC status from /dev/kmem as of Sat Feb 14 06:53:27 1998
T     ID     KEY         MODE         OWNER   GROUP   NATTCH    SEGSZ
Shared Memory:
m      5     0x06347849  --rw-rw-rw-  root    root       0      77384
m      7     0x000c0568  --rw-------  root    root       2     131516

(Slide diagram: two processes, each composed of text, data, shared memory, and shared library segments, attach the same shared memory segment in physical memory.)

Student Notes
UNIX implements interprocess communications using different mechanisms. Three mechanisms that require additional system memory are semaphores, shared memory, and message queues. Semaphores are used to synchronize memory resources between competing processes. Shared memory segments are resources capable of holding (in memory) large amounts of data that can be shared between processes. Message queues hold strings of information (messages) that can be transferred between processes. Two types of processes that utilize message queues are networking and database processes.

Shared memory provides a mechanism to reduce interprocess communication costs significantly. Two processes that are ready to share data map the same portion of shared memory into their address space. Changes made to the shared memory are seen immediately by all processes and do not require kernel services. So, from a kernel perspective, other than initially setting up the shared memory, there is very low cost in using shared memory.


On the slide, each process has a shared memory segment that references one and the same shared memory area. The more processes that allocate shared memory segments, the higher the memory usage. The shared memory segments in physical memory can be viewed with the ipcs -mob command or a reporting tool like glance. From time to time, they might have to be cleaned up or removed manually if an application terminates ungracefully. This is done by the superuser with the ipcrm command. A worthwhile baseline measurement for a system administrator is to run the ipcs -mob command during a quiet period. It is also eye-opening to repeat this command when the system is at its busiest.
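A sketch of the baseline measurement just described, plus the cleanup case (segment ID 7 is taken from the example output above; remove a segment only if no application still needs it):

# Record shared memory usage during a quiet period for later comparison:
# ipcs -mob > /tmp/ipcs.baseline
# Remove a leftover segment after an ungraceful application exit:
# ipcrm -m 7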


6-7. SLIDE: Memory Metrics to Monitor Systemwide

Memory Metrics to Monitor Systemwide


- Is vhand active?
  - Pages scanned by vhand (SR)
  - Pages freed by vhand (FR)
  - Pages paged out
- Is swapper active?
  - Processes deactivated (SO)
- Amount of free memory relative to lotsfree, desfree, and minfree
- Size of dynamic buffer cache
- Size of IPC shared memory segments

Student Notes
The utilization of memory can be monitored in a number of different ways. There are multiple tools and multiple metrics that monitor memory usage. The first metrics you want to look at are those that will tell you whether vhand is active.

Pages Scanned by vhand


This is the number of pages the vhand process has scanned (i.e., had their reference bits cleared by the reference hand) when looking for pages to free in memory. This tells you that vhand is actively scanning pages in an attempt to free them up; there is some memory pressure.

Pages Freed by vhand


This is the number of pages the vhand process has freed (i.e., the reference bit was still clear when the free hand looked at it). The ratio between pages scanned and pages freed indicates how successful the vhand process is when looking for memory pages to free.


Amount of Paging
This indicates the level of disk activity to the swap partition. If a consistent amount of paging to swap space is occurring, then performance is impacted (most likely significantly). Next, check to see if the swapper is active.

Process Deactivations
This indicates that processes are being deactivated, meaning free memory has fallen below the MINFREE threshold. There is severe memory pressure.

Amount of Free Memory


This indicates the severity of the free memory situation. If free memory has fallen below LOTSFREE, then we know some paging has taken place; vhand is active. If it is below DESFREE, then the situation is more severe, and much more paging is occurring; vhand is aggressively active. Finally, if free memory is below MINFREE, then a high level of paging and process deactivation is occurring; vhand and swapper are both active.

To determine the values of lotsfree, desfree, and minfree, use the following commands:

# echo lotsfree/D | adb -k /stand/vmunix /dev/mem
# echo desfree/D  | adb -k /stand/vmunix /dev/mem
# echo minfree/D  | adb -k /stand/vmunix /dev/mem

The settings for these three values in the kernel will then be displayed in 4 KB pages. You can then compare them to the current size of the free page list. These values will not change unless you change the size of Non-Kernel Memory. (Remember the formulas shown earlier?)

Size of Dynamic Buffer Cache


This is the amount of memory being consumed by the buffer cache. If memory is full and the buffer cache is large, it will most likely cause paging, since the buffer cache typically shrinks slower than the rate at which new memory is needed. Heavy disk I/O demands may prevent the buffer cache from shrinking at all.

Size of IPC Shared Memory Segments


This is the amount of memory used for interprocess communications. Of special interest will be the number and sizes of shared memory segments, as these can be quite large, especially if graphical applications or a database management system is being used.


6-8. SLIDE: Memory Metrics to Monitor per Process

Memory Metrics to Monitor per Process


- Size of RSS/VSS
- Size of text, data, and stack segments
- Number of shared memory segments
- Amount of time blocked on virtual memory

Student Notes
Individual processes vary greatly in terms of the amount of memory they use. Metrics to monitor memory utilization on a per-process basis include the following:

Size of RSS/VSS
The Resident Set Size (RSS) for a process is the portion of the process (in KB) that is currently resident in physical memory. Since the entire process does not have to be resident in memory in order to execute, this shows how much of the process is actually resident in memory. The Virtual Set Size (VSS) for a process is the total size of the process (in KB). This indicates that if the entire process were to be loaded, this is how much memory the entire process would consume. Very rarely is the entire process resident in memory. If the entire process were in memory, then the RSS value would be equal to the VSS value.
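One way to spot the largest virtual processes is the XPG4 (UNIX95) behavior of ps, which can report a vsz column in KB. A hedged sketch, since the exact columns supported vary by release:

# UNIX95= ps -e -o vsz,pid,comm | sort -rn | head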

Size of Text, Data, and Stack Segments


These are the RSS and VSS sizes for the three main components of a process. Since every process has a single text, data, and stack segment, these values should be monitored, especially for large processes. The data segment is the most likely to be large.


Each of these three segments has a maximum size to which it can grow, limited by tunable kernel parameters: maxtsiz, maxdsiz, and maxssiz for a 32-bit process, and maxtsiz_64bit, maxdsiz_64bit, and maxssiz_64bit for a 64-bit process. If a process tries to grow one of these segments beyond its maximum size, the process terminates (and in some cases core dumps).

Number and Size of Shared Memory Segments


These are the shared memory segments to which a process is attached. The maximum size of a shared memory segment is limited by the kernel parameter, shmmax. The number of shared memory segments a process can attach to is limited by the kernel parameter, shmseg.

Amount of Time Spent Blocked on Virtual Memory


This is the amount of time the process was prevented from executing because it was waiting (or blocked) on a text or data page to be paged in.


6-9. SLIDE: Memory Monitoring vmstat Output

Memory Monitoring vmstat Output

#=> vmstat -n 5

(Sample output: the first 5-second interval shows avm 9140, free 3824, re 3, at 4, no paging activity (pi, po, fr, de, and sr all 0), in 675, sy 824, cs 140, and the CPU 9% user, 5% system, 86% idle with 1 process in the run queue. Over the following intervals, avm grows from 9017 to 13448 while free memory falls from 3500 to roughly 400 pages as load increases.)

Student Notes
A useful command to view virtual memory statistics is vmstat. The slide shows vmstat's output being updated every 5 seconds. When viewing vmstat's output, always keep an eye on the po (pages paged out) column. Ideally, you want this to be zero, indicating no paging out is occurring. The fr (pages freed per second) and sr (pages scanned by the clock algorithm, per second) columns show the actual behavior of the vhand algorithm.

Output Headings
procs
    r      In run queue
    b      Blocked for resources (I/O, paging, and so on)
    w      Runnable or short sleeper (less than 20 seconds) but deactivated

memory
    avm    Active virtual pages (run during the last 20 seconds)
    free   Size of the free list (in 4K pages)

page
    re     Page reclaims per second
    at     Address translation faults per second (page faults)
    pi     Pages paged in per second
    po     Pages paged out per second
    fr     Pages freed per second
    de     Anticipated short-term memory shortfall
    sr     Pages scanned by the clock algorithm, per second

faults
    in     Non-clock device interrupts per second
    sy     System calls per second
    cs     CPU context switches per second

cpu
    us     Percentage of time CPU spent in user mode
    sy     Percentage of time CPU spent in system mode
    id     Percentage of time CPU is idle

with -S option
    si     Processes reactivated per second
    so     Processes deactivated per second


6-10. SLIDE: Memory Monitoring glance Memory Report

Memory Monitoring glance Memory Report

B3692A GlancePlus B.10.12 17:33:59 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S N N CPU Util S | 22% 29% 51% Disk Util F | 1% 7% 13% Mem Util S S U | 91% 91% 91% U B B Swap Util U | 25% 24% 35% U R R -------------------------------------------------------------------------------MEMORY REPORT Users= 19 Event Current Cumulative Current Rate Cum Rate High Rate -------------------------------------------------------------------------------Page Faults 78 287 7.5 24.3 139.3 Paging Requests 3 21 0.2 1.7 12.0 KB Paged In 52kb 336kb 5.0 28.4 189.3 KB Paged Out 0kb 0kb 0.0 0.0 0.0 Reactivations 0 0 0.0 0.0 0.0 Deactivations 0 0 0.0 0.0 0.0 KB Reactivated 0kb 0kb 0.0 0.0 0.0 KB Deactivated 0kb 0kb 0.0 0.0 0.0 VM Reads 3 6 0.2 0.5 2.0 VM Writes 0 0 0.0 0.0 0.0 Total VM : Active VM: 78.9mb 23.4mb Sys Mem : Buf Cache: 10.6mb 19.1mb User Mem: Free Mem: 78.0mb 20.3mb Phys Mem: 128.0mb Page 1 of 1

Student Notes
glance has extensive memory monitoring abilities. Like vmstat, it can give paging statistics, in addition to showing if any processes are being deactivated. Remember, this is an indication of severe memory shortage. There is other valuable information on this report, such as the statistics at the bottom showing the current Dynamic Buffer Cache size, the current amount of Free Memory, and the total Physical Memory in the system.


6-11. SLIDE: Memory Monitoring - glance Process List

Memory Monitoring - glance Process List

B3692A GlancePlus B.10.12    14:52:27   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util   |  22%   29%   51%
Disk Util   |   1%    7%   13%
Mem  Util   |  91%   91%   91%
Swap Util   |  25%   24%   35%
--------------------------------------------------------------------------------
PROCESS LIST                                                       Users=   11
                                     User     CPU Util    Cum    Disk           Thd
Process Name    PID    PPID   Pri   Name     (100 max)    CPU   IO Rate   RSS   Cnt
--------------------------------------------------------------------------------
netscape      16013   12988   154   sohrab   12.9/14.0   64.9   0.0/0.6  14.7mb   1
supsched         18       0   100   root      2.9/ 2.1  942.6   0.0/0.0    16kb   1
lmx.srv        1219    1121   154   root      1.6/ 0.9  389.4   0.5/0.0   2.7mb   1
glance        15726   15396   156   root      0.6/ 0.9    2.0   0.0/0.2   4.0mb   1
statdaemon        3       0   128   root      0.6/ 0.7  302.1   0.0/0.0    16kb   1
midaemon       1051    1050    50   root      0.4/ 0.4  201.4   0.0/0.0   1.3mb   2
ttisr             7       0   -32   root      0.4/ 0.3  121.0   0.0/0.0    16kb   1
dtterm        15559   15558   154   roc       0.4/ 0.4    1.6   0.0/0.0   6.2mb   1
rep_server     1098    1084   154   root      0.2/ 0.1   23.7   0.0/0.0   2.0mb   1
syncer          325       1   154   root      0.2/ 0.0   20.2   0.1/0.0   1.0mb   1
xload         13569   13531   154   al        0.2/ 0.0    2.4   0.0/0.0   2.6mb   1
                                                                   Page 1 of 13

Student Notes
The glance Process List report can be used to monitor process statistics, including how much memory processes are currently consuming. The highlighted column, RSS (Resident Set Size), shows memory being used on a per-process basis. Very simply put, this helps to identify the "memory hogs" on the system. For example, the process called netscape has an RSS of 14.7 MB, while statdaemon is minimal. Other large processes include glance, xload, and dtterm. What do all these processes have in common? They are all GUI (graphical user interface) programs running as windows in a graphical window environment. Moral: programs that open their own windows are relatively memory-intensive and should be minimized. Users should be encouraged not to leave several windows open on their screens if they do not have a continuing need for them.
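If glance is not running, a rough way to spot the biggest processes is the XPG4 ps; a sketch (setting UNIX95 enables the -o option on HP-UX, and vsz reports virtual size in KB rather than RSS):

# UNIX95=1 ps -e -o vsz=,pid=,comm= | sort -rn | head

The trailing = signs suppress the column headers so that sort sees only data lines.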


6-12. SLIDE: Memory Monitoring - glance Individual Process

Memory Monitoring - glance Individual Process

B3692A GlancePlus C.03.70.00   15:52:03   r206c42   9000/800   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util   |  15%   15%   15%
Disk Util   |   1%    0%    2%
Mem  Util   |  96%   96%   96%
Swap Util   |  15%   15%   15%
--------------------------------------------------------------------------------
Resources  PID: 28030, glance   PPID: 27993   euid: 0   User: root
--------------------------------------------------------------------------------
CPU Usage (util):  0.1   Log Reads :   1   Wait Reason    : STRMS
User/Nice/RT CPU:  0.1   Log Writes:   0   Total RSS/VSS  : 3.6mb/ 5.6mb
System CPU      :  0.0   Phy Reads :   0   Traps / Vfaults:  1/ 10
Interrupt CPU   :  0.0   Phy Writes:   0   Faults Mem/Disk:  6/  0
Cont Switch CPU :  0.0   FS Reads  :   0   Deactivations  :  0
Scheduler       : HPUX   FS Writes :   0   Forks & Vforks :  0
Priority        :  154   VM Reads  :   0   Signals Recd   :  0
Nice Value      :   10   VM Writes :   0   Mesg Sent/Recd :  0/  0
Dispatches      :    6   Sys Reads :   0   Other Log Rd/Wt: 38/ 172
Forced CSwitch  :    0   Sys Writes:   0   Other Phy Rd/Wt:  0/  0
VoluntaryCSwitch:    4   Raw Reads :   0   Proc Start Time
Running CPU     :    0   Raw Writes:   0   Tue Mar 16 15:49:14 2004
CPU Switches    :    0   Bytes Xfer:  0kb
         C - cum/interval toggle   % - pct/absolute toggle       Page 1 of 1

Student Notes
The glance Individual Process report displays memory usage for an individual process, including the RSS and VSS sizes for the process. Also displayed on a per-process basis are the VM reads and VM writes being performed by the process; these indicate how much paging from/to the swap device the individual process is doing. If performance is poor for an individual process, this is a good field to check.


6-13. SLIDE: Memory Monitoring - glance System Tables

Memory Monitoring - glance System Tables

B3692A GlancePlus C.03.70.00   15:58:40   r206c42   9000/800   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util   |  15%   15%   15%
Disk Util   |   0%    0%    4%
Mem  Util   |  96%   96%   96%
Swap Util   |  15%   21%   45%
--------------------------------------------------------------------------------
SYSTEM TABLES REPORT                                               Users=    1
System Table              Available   Requested      Used      High
--------------------------------------------------------------------------------
Inode Cache (ninode)           2884          na       645       645
Shared Memory                12.5gb                 11.1mb
Message Buffers               800kb          na       0kb       0kb
Buffer Cache                314.4mb          na   314.4mb        na
Buffer Cache Min             32.0mb
Buffer Cache Max            320.0mb
DNLC Cache                     8004

Model : 9000/800/A400-6X    OS Name : HP-UX    OS Release: B.11.11
OS Kernel Type: 64 bits     Phys Memory : 640.0mb
Number CPUs : 1             Number Disks: 2          Network Interfaces : 2
Number Swap Areas : 1       Avail Volume Groups: 2
Mem Region Max Page Size: 1024mb                                  Page 2 of 2

Student Notes
The glance System Tables report displays the size of kernel tables in memory, and the current utilization of these tables. It is important not to set the size of these tables too large, as the tables are memory resident (and the bigger the table, the more memory it consumes). Yet it is even more important that enough resources be allocated so that the kernel does not have to wait for a resource to become free (or even error out) when a particular resource is requested. The Available column displays the total size of the particular table, and the Used column shows how many entries within the table are currently being used. In general, the Used value should not be close to the Available value. If it is, then the kernel is close to running out of that particular resource. The High column shows the high-water mark for the resource since glance has been running. Also of interest in this report are the buffer cache statistics, especially the Buffer Cache line, which shows the current size of the buffer cache.


NOTE:

There are two pages to this report. Shown here is the second page of this report. More system tables are shown on the first page.


6-14. SLIDE: Tuning a Memory-Bound System - Hardware Solutions

Tuning a Memory-Bound System - Hardware Solutions

Add more physical memory
Reduce usage of X terminals

Student Notes
An obvious hardware solution to a memory bottleneck is to add more physical memory. While this solution requires an outlay of money, it may pay for itself quickly by saving the system administrator hours of time looking for ways to reduce memory consumption. If adding more memory is not an option, then a second hardware suggestion is to look at the use of X terminals on the system. An X terminal typically consumes a large portion of memory: 3 to 4 MB of memory for light application usage, and as much as 10 to 20+ MB for heavy application usage. These figures do not take into account any additional RAM that the system will use for window managers or any other X-related overhead.


6-15. SLIDE: Tuning a Memory-Bound System - Software Solutions

Tuning a Memory-Bound System - Software Solutions

Look for unnecessary processes:
   Extra windows
   Screen savers
   Long strings of child processes

Reduce dbc_max_pct (max size of dynamic buffer cache).
Identify programs with memory leaks.
Check for unreferenced shared memory segments.
Use the serialize command to reduce process thrashing.
Use PRM to prioritize memory allocation.

Student Notes
Quite often, users will run X Windows type programs to enhance the look of their desktop. Examples include an X-eyes program, a bouncing ball program, or fancy screen savers. All of these graphical programs consume system resources, including memory. The biggest consumer of memory will most likely be the buffer cache. We saw earlier that if the buffer cache is dynamic, it will grow to its maximum size, as long as memory is available. The problem arises when a process needs additional memory while free memory is below LOTSFREE: the buffer cache is slow to shrink (if it shrinks at all), causing paging to occur among the processes. To prevent this situation, the tunable parameter dbc_max_pct should be set to limit the maximum size to which the buffer cache can grow. A recommendation for dbc_max_pct is 25 or less. Programs with memory leaks allocate memory and then stop using it without returning it to the system for use elsewhere. These programs may require you to shut them down periodically to release the memory. They may even require you to reboot the system occasionally to reclaim the memory. There are a number of third-party tools, such as Purify, that can help you locate memory leaks in applications.


Unreferenced shared memory segments can also be a problem: an application sets one up and then forgets to deallocate it when the application exits. Here is a possible procedure for locating abandoned shared memory segments.

First, look for any shared memory segments that have no processes attached to them:

# ipcs -ma

Note which shared memory segments have a 0 in the NATTCH column. If they are owned by root, let them stay. Otherwise, write down their ID numbers and their CPID numbers.

Second, one at a time, find out whether the creating process still exists:

# ps -el | grep <CPID number>

If it does, it's probably just a quiescent segment. But if not, the segment is probably abandoned. Finally, remove the segment:

# ipcrm -m <ID number>
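The whole check can be scripted; a minimal sketch (the field positions assume the segment ID, NATTCH, and CPID are the 2nd, 9th, and 11th columns of ipcs -ma output on this release; verify against your header line before trusting it):

ipcs -ma | awk '$1 == "m" && $9 == 0 {print $2, $11}' |
while read id cpid
do
    # If the creating process is gone, the segment may be abandoned.
    if ps -p "$cpid" >/dev/null 2>&1
    then
        echo "segment $id: creator $cpid still alive; leaving it"
    else
        echo "segment $id: creator gone; candidate for: ipcrm -m $id"
    fi
done

Note that the sketch only prints candidates; removal with ipcrm is deliberately left as a manual step.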

The serialize command will be discussed later in this chapter. You may wish to use PRM to control your memory resource and its allocation.


6-16. SLIDE: PA-RISC Access Control

PA-RISC Access Control

Control-register-resident Access ID keys
Access ID keys stored in the kernel tables

(Diagram: a memory resource whose regions are guarded by Access ID keys.)

Student Notes
Since we are discussing system memory and performance, there is one other topic that we should think about: hardware-based memory page access control. The processor architecture has several features for ensuring that a process thread cannot access areas of physical memory that are not part of its process space. An in-depth discussion of page access control is presented in the HP-UX training course "Inside HP-UX" (course number H5081S), and we won't attempt to recreate it here. There is one particular aspect of this hardware feature that we will spend some time discussing, though, and that is protection IDs. Every discrete region of virtual memory assigned to a process (text space, private data space, shared memory space, shared library data space, etc.) is assigned a unique ID "key", called an Access Key. Any process attempting to access that memory space must have a copy of a matching ID "key", called a Protection Key. To speed things up, the most frequently or most likely used Protection Keys are kept in processor registers. (These registers are part of a process thread's "context" and are preserved across switches and interrupts.) The hardware performs the protection check as part of the actual memory access instruction.


Now here is the catch: there is only room in the control registers for a limited number of frequently used Protection Keys. The rest are stored in kernel space in memory management tables, which are accessed when a protection ID fault occurs. The fault handler will search for and find these other "keys" when they are needed, but at the cost of CPU cycles! To better understand the dynamics of this process, consider the following analogy:

The Key Ring

I have many keys to many locks around my home and office. It is not practical to carry all of my keys around with me all the time, due to their bulk and weight. To solve this problem I have two key rings. One is small and has only those keys that I need on a daily basis: my car key, house key, desk key, and garage key. The other key ring is large and bulky, with dozens of other miscellaneous keys: my workshop, tool boxes, garden shed, lawnmower (wish I could lose that one!), boat ignition, etc. This method is a blessing and a curse. When I need to start the car or unlock the front door, the key I need is readily available in my pocket and I can quickly gain access. When I actually have time to go fishing, it is always a hassle to go find my utility key ring and remember to take the boat key with me. (Once I actually hauled the boat all the way to the lake, several miles away from my home, only to realize that I had not remembered the boat key!) To somewhat address this problem, I now move the boat key to my everyday key ring during the summer months (replacing the snow-blower key) and reverse the procedure in the fall.

The HP-UX kernel performs a similar process every time a protection ID fault occurs: the fault handler moves the key it had to search for into the register context of the thread (replacing the least recently used key). PA-RISC 1.x has room for 4 keys in the register context, while PA-RISC 2.x has room for 8 keys. IA-64 has room for at least 16 keys. Depending on how frequently a process moves from one memory region to another, the number of protection ID faults will vary. With the larger number of protection registers in the later processors, protection register thrashing has become much less of a problem than it was in the past. It should also be noted that shared library regions on 11.x were modified to use a type of "skeleton" key, i.e., a key that always matches, so that attempted access to them will never result in a protection ID fault.


6-17. SLIDE: The serialize Command

The serialize Command

(Diagram: four large processes, I through L, competing for 500 MB of available memory, with the kernel/OS tables and swap space alongside. Each process is CPU bound, large (400 MB), and runs at timeshare priority.)

Student Notes
The serialize(1) command can help if a system has a number of large processes and is experiencing memory pressure. The serialize command allows these big processes to run one after another, instead of running all at the same time. By running the processes sequentially, rather than in parallel, the CPU can spend more time executing the process code (i.e., user mode) and less time managing the competing processes (i.e., kernel mode).

Thrashing
On systems with very demanding memory needs (for example, systems that run many large processes), the paging daemons can become so busy moving pages in and out that the system spends too much time paging and not enough time running processes. When this happens, system performance degrades rapidly, sometimes to such a degree that nothing seems to be happening. At this point, the system is said to be thrashing, meaning it is doing more overhead work than productive work.


How serialize Helps Reduce Thrashing

All processes marked via the serialize command will run serially with other processes marked the same way. The serialize command addresses the problem caused when a group of large processes all try to make forward progress at once, degrading throughput. In such a case, each process constantly faults in its working set, only to have the pages stolen when another process starts running. By using the serialize command to run large processes one at a time, the system can make more efficient use of the CPU, as well as system memory. Let's look at the example on the slide. We have a system with 500 MB of available memory, and we are trying to execute four processes. Each process is CPU bound, has large memory requirements (400 MB), and has a timeshare priority level. The first process (I) executes. As it executes, its pages are faulted into memory. At the end of its timeslice (typically 100 ms), it is switched out and process J is started. As it executes, it pages in a large number of pages, forcing the pages belonging to process I to be paged out. 100 ms later, process J is switched out and process K starts up, pulling its pages into memory and pushing the other processes' pages out. The system spends so much time pulling pages in and pushing pages out that it literally has no time left to perform any useful work. The culprit here is the timeslice. We could simply disable timeslicing altogether via the tunable parameter (timeslice), but that may be overkill; after all, it is just these four processes that are causing the thrashing. A better solution is to serialize these processes. When you do that, each process executes until it either voluntarily gives up the CPU or is preempted by a stronger-priority process, which happens much less frequently than the timeslice! Thus more real work gets done and much less paging is needed. In 10.20, the kernel was given the authority to serialize processes automatically, if it detects that memory thrashing is taking place and it can identify which processes are responsible for the thrashing.
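Marking the processes is straightforward; a sketch (big_batch_job and PID 4321 are hypothetical, and the -p form for marking an already-running process is our assumption here, so verify both against serialize(1) on your release):

# serialize ./big_batch_job
(starts the command with serialization enabled)

# serialize -p 4321
(marks already-running PID 4321 for serialization)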


6-18. LAB: Memory Leaks

There are several performance issues related to memory management: memory leaks, swapping/paging, and protection ID thrashing. Let's investigate a few of them.

1. Change directories to /home/h4262/memory/leak:

   # cd /home/h4262/memory/leak

   Memory leaks occur when a process requests memory (typically through the malloc() or shmget() calls) but doesn't free the memory once it finishes using it. The five processes in this directory all have memory leaks to different degrees.

2. Before starting the background processes, look up the current value for maxdsiz, using the kmtune command on 11i v1 and the kctune command on 11i v2.

   On the rp2430:
   # kmtune -lq maxdsiz

   On the rx2600:
   # kctune -avq maxdsiz

   The default maxdsiz on 11i v2 is 1 GB. This will make proc1 very slow in reaching its limits. You can change maxdsiz to a more reasonable number for this lab exercise by:
# kctune maxdsiz=0x10000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
==> Do you wish to update it to contain the current configuration before
    making the requested change? n
NOTE: The backup will not be updated.
* The requested changes have been applied to the currently running system.
Tunable   Value                  Expression   Changes
maxdsiz   (before)  1073741824   Default      Immed
          (now)     0x10000000   0x10000000

Also take some vmstat readings to satisfy yourself that the system is not under memory pressure. How much free memory do you have?

   # vmstat 2 2


3. Use the RUN script to start the background processes:

   # ./RUN

4. Open another window. Start glance. Sort the processes by CPU utilization (should be the default), and answer the following questions fairly quickly, before the memory leaks get too large.

   What is the current amount of free memory?
   What is the size of the buffer cache?
   Is there any paging to the swap space?
   How much swap space is currently reserved?
   Which process has the largest Resident Set Size (RSS)?
   What is the data segment size of the process with the largest RSS?

5. After several minutes, the proc1 process should reach its maximum data size. If your maxdsiz is set to 1 GB, this could take a while; please be patient. Observe the behavior of the system when this occurs.

   What happens when the process reaches its maximum data size?
   Why does disk utilization become so high at this point?

6. As the other processes grow towards their maximum data segment size, continue to monitor the following:

   Free memory
   Swap space reserved
   The size of the processes' data segments
   The RSS of the processes
   The number of page-outs/page-ins to the swap space


7. Run the two baseline programs, short and diskread.

   # timex /home/h4262/baseline/short
   # timex /home/h4262/baseline/diskread

   How does the performance of these programs compare to their earlier runs?

8. When finished monitoring the behavior of processes with memory leaks, clean up the processes. Exit glance. Execute the KILLIT script:

   # ./KILLIT

   If you changed maxdsiz, change it back:

# kctune maxdsiz=0x40000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
==> Do you wish to update it to contain the current configuration before
    making the requested change? n
NOTE: The backup will not be updated.
* The requested changes have been applied to the currently running system.
Tunable   Value                  Expression   Changes
maxdsiz   (before)  0x10000000   0x10000000   Immed
          (now)     0x40000000   0x40000000


Module 7 Swap Space Performance

Objectives

Upon completion of this module, you will be able to do the following:

   Describe the difference between swap usage and swap reservation.
   Interpret the output of the swapinfo command.
   Define and configure pseudo swap.
   Define and configure swap space priorities.
   Define and configure swchunk and maxswapchunks.


7-1. SLIDE: Swap Space Management - Simple View

Swap Space Management - Simple View

(Diagram: memory holds the kernel and OS tables plus 20 MB of processes; the 55-MB swap device shows Reserved: 20 MB, Used: 0 MB. A new program on disk wants to execute, but there is not enough space for it to fit into memory.)

Student Notes
The purpose of swap space is to relieve the pressure on memory when memory becomes too full. When free memory falls below a certain threshold, processes (or parts of processes) are written out to the swap partition on disk in order to free up space in memory for other processes. For simplicity, the above slide assumes each process is 1 MB in size, and the amount of available memory for process execution is 20 MB. The slide also assumes (for simplicity) that each process reserves 1 MB on the swap partition each time it executes. Therefore, since 20 processes are currently present in memory (as shown on the slide), 20 MB of swap space has been reserved: 1 MB for each process. The HP-UX operating system reserves swap space for each process that executes on the system. The reservation of swap space is done so that the operating system knows how much swap space potentially may be needed for all the processes currently running on the system. For example, if all the processes in memory were to be swapped out, the operating system would know it had enough swap space to perform that function.


Analogy
A good analogy for swap space reservation is a hotel that takes room reservations. When a hotel takes a reservation, it subtracts one from the count of available rooms. If a hotel had 55 rooms, and it took 20 reservations, then it would only have 35 rooms still available, even though none of the 55 rooms were currently occupied. The same holds true for swap space. In the above example, a total of 55 MB of swap space exists; 20 MB of the space is reserved by processes currently running in memory, even though none of the processes are currently using the swap space they have reserved. To take the analogy even further, the hotel does not earmark a particular room to satisfy a reservation. Room assignments are done when the occupant shows up at the front desk. Likewise, a swap reservation is not associated with a particular block out on the swap device. Only when the kernel actually wants to move a page in memory out to the swap device does it select a block. It knows it has the swap space available; it just doesn't know where it is until it needs to use it.

Current Situation
In the above slide, all the memory is in use by the 20 processes. Now assume a new program from disk wants to execute. What happens? How does it fit in memory if all the memory is in use?


7-2. SLIDE: Swap Space After a New Process Executes

Swap Space After a New Process Executes

(Diagram: steps 1 through 4 from the notes below; the swap device now shows Reserved: 20 MB, Used: 1 MB.)

Student Notes
Below is the basic sequence of steps that occurs when a new process wants to execute and there is not enough memory available: 1. The operating system selects a process (or portion of a process) to be written out to the swap partition on disk. The process selected is one that is not expected to execute in the near future. 2. Once the process is written to the swap partition, the amount of swap space used is incremented accordingly and the amount of swap space reserved is decremented by the same amount. 3. The new program which wants to execute reserves swap space for itself. The amount of swap space reserved is incremented accordingly. 4. The new program is copied into memory and the operating system initializes the process. The new process uses the physical memory that was just freed.


7-3. SLIDE: The swapinfo Command

The swapinfo Command

# swapinfo -mt
             Mb       Mb       Mb   PCT   START/       Mb
TYPE      AVAIL     USED     FREE  USED    LIMIT  RESERVE  PRI  NAME
dev          32        1       31    3%        0        -    1  /dev/vg00/lvol2
localfs      23        0       23    0%     none        0    1  /home/paging
reserve       -       20      -20
total        55       21       34   38%        -        0    -

Student Notes
The swapinfo command displays important swap-related information, including how much swap space is used and how much swap space is reserved. With today's systems, we recommend that you always use the -m option to display all spaces in MB rather than the default KB. The swapinfo -mt command shows information related to device (raw) swap partitions and file system swap space, and their totals, including:

Mb AVAIL      The total amount of swap space available. For file system
              swap, this value may vary, as more swap space is needed.

Mb USED       The current amount of swap space being used.

Mb FREE       The current amount of swap space free. Mb FREE plus
              Mb USED equals Mb AVAIL.

PCT USED      The percentage of swap space in use on that device.


START/LIMIT   Applies only to file system swap. START specifies the
              starting block within the file system of the paging file;
              LIMIT specifies the maximum size to which the paging file
              can grow.

Mb RESERVE    Applies only to file system swap, and only when no limit
              is given to the maximum size of the paging file. In these
              situations, this value specifies how much file system
              space to reserve for user files on the file system.

PRI           The priority of the swap area. The strongest priority swap
              areas are used first. Swap priorities range from 0 to 10.
              (Note: stronger-priority swap areas have smaller priority
              numbers.)

The swapinfo command also shows how much swap space all the processes on the system are reserving currently. This is indicated by the reserve entry. The columns described above for device and file system swap do not apply to the reserve entry in the output of the swapinfo command. In the example, there are 32 MB of device swap on a raw disk, and 23 MB of swap in the /home file system, making a total of 55 MB. 1 MB is in use on the device swap and 20 MB are reserved, leaving 34 MB available.


7-4. SLIDE: Swap Space Management - Realistic View

Swap Space Management - Realistic View

(Diagram: Initial Allocation shows Reserved: 0 MB, Used: 0 MB, Swap Avail: 55 MB. Current Allocation, with 20 MB of processes in memory, shows Reserved: 20 MB, Used: 0 MB, Swap Avail: 35 MB. A new program on disk wants to execute; there is not enough memory for it to fit.)

Student Notes
An earlier slide implied that specific space was allocated on a swap device for each process running in memory. The analogy was of a hotel subtracting one from the count of available rooms when a customer phoned in for a reservation. As mentioned earlier, specific space is not allocated on a swap device for a reservation. Instead, a variable called SWAP_AVAIL is maintained. The SWAP_AVAIL variable is initialized when the system boots to equal the total amount of swap space available. As each new process begins executing, this variable is decremented according to the amount of swap space the process would need if its entire contents were to be swapped out. When a process terminates, it returns the amount of swap space it reserved back to the SWAP_AVAIL variable. The slide above shows what the SWAP_AVAIL variable would contain when 20 MB worth of processes is executing on the system. Each process has caused the SWAP_AVAIL variable to be decremented, but no specific space has been allocated on the swap partition. No specific swap space is allocated until processes need to be paged out, as shown on the next slide.


7-5. SLIDE: Swap Space After a New Process Executes

Swap Space After a New Process Executes

(Diagram: steps 1 through 3 from the notes below. Current Allocation shows Reserved: 20 MB, Used: 1 MB, Swap Avail: 34 MB.)

Student Notes
This is an updated description of the sequence of events that occurs when a program is being executed and not enough memory is available: The operating system selects a process (or portion of a process) to be written out to the swap partition on disk. Since no specific swap space has been reserved, swap space is allocated from the strongest priority swap device, first available block. Once the process is written to the swap partition, the amount of swap space used is incremented accordingly, and the old program unreserves its swap space by incrementing the SWAP_AVAIL variable. Then the new program decrements SWAP_AVAIL to reserve its swap space. In effect, the amount of swap space reserved is decremented by the amount of space being moved out to swap space and then incremented by the new reservation amount. In the slide, the process being swapped out causes the USED swap to become 1 MB, causing the SWAP_AVAIL to become 34 MB. Then the old process releases its 1 MB reservation, causing the SWAP_AVAIL to increase back to 35 MB. Finally, the new process starts up and causes the SWAP_AVAIL to decrease from 35 to 34 MB.


The new program is copied into memory, and the operating system initializes the process after it has confirmed that it can successfully reserve the needed swap for the new process (SWAP_AVAIL does not go negative when the swap reservation is made).


7-6. SLIDE: Swap Space When Memory Equals Data Swapped

Swap Space When Memory Equals Data Swapped

(Diagram: 20 MB of processes in memory and another 20 MB paged out to swap. Current Allocation shows Reserved: 20 MB, Used: 20 MB, Swap Avail: 15 MB.)

Student Notes
The above slide shows the state of the system and the current swap space allocations when 20 MB (or all of available memory) has been paged out to the swap partition. The swap partition contains 20 MB worth of processes, which is the size of available memory. The initial 20 MB of processes is shaded in gray, to distinguish them from the second 20 MB of processes, which are filled with black. With this color code, we can see that only 4 MB of the original processes are still loaded in memory; everything else (including 4 MB of the 21st to 40th processes) has been paged to the swap partition. The swap space allocation reflects 20 MB worth of processes that have reserved swap space, and 20 MB that is currently in use. This would be analogous to stating that a hotel received 40 room reservations, and 20 of those reservations are currently being used. The SWAP_AVAIL variable is down to 15 MB, because the total amount of swap space is 55 MB and 40 MB of that space is reserved or in use.


7-7. SLIDE: Swap Space When Swap Space Fills Up

Swap Space When Swap Space Fills Up

(Diagram: Current Allocation shows Reserved: 20 MB, Used: 35 MB, Swap Avail: 0 MB. A new program attempts to start and fails with: ERROR: no more swap space.)

Q: Could this error have been prevented?
A: YES!! Use pseudo swap.

Student Notes
The above slide shows the situation when SWAP_AVAIL equals 0 MB. In this situation, the error message, ERROR: no swap space available is displayed, even though there is swap space to page an existing process to the swap partition and thus free up memory for a new program to load. The reason the system reports no swap space is available is because 35 MB of memory have been paged out, and the remaining 20 MB of swap space are reserved by the existing processes currently executing in memory.

Could this error have been prevented?


From a resource perspective, the new program should be able to execute, because memory is available for the new process. A tunable OS parameter, referred to as pseudo swap, would have allowed the program to execute under these conditions.


7-8. SLIDE: Pseudo Swap

Pseudo Swap
Definition: Pseudo swap is fictitious, make-believe swap space. It does NOT exist physically, but logically the operating system recognizes it.

Purpose: Pseudo swap allows more swap space to be made available than physically exists.

Benefit: Pseudo swap adds 75% of physical memory to the amount of swap space that the operating system thinks is available. This lessens swap space requirements (especially helpful on large-memory systems).

NOTE: Pseudo swap is NOT allocated in memory!

Student Notes
Pseudo swap is HP's solution for large-memory customers who do not wish to purchase a large number of disks to use for swap space. The justification for purchasing large-memory systems is to prevent paging and swapping; therefore, the argument becomes: why purchase a lot of device swap space if the system is not expected to page or swap? Pseudo swap is swap space that the operating system recognizes, but in reality it does not exist. Pseudo swap is make-believe swap space. It does not exist in memory; it does not exist on disk; it does not exist anywhere. However, the operating system does recognize it, which means more swap space can be reserved than physically exists. The purpose of pseudo swap is to allow more processes to run in memory than could be supported by the swap device(s). It allows the operating system (specifically the SWAP_AVAIL variable) to recognize more swap space, thereby allowing additional processes to start when all physical swap has been reserved. By having the operating system recognize more swap space than physically exists, large-memory customers can now operate without having to purchase large amounts of swap space, which they will most likely never use. The size of pseudo swap is dependent on the amount of memory in the system. Specifically, the size is (approximately) 75% of physical memory. This means the SWAP_AVAIL variable will have an additional amount (75% of physical memory) added to its content. This additional amount allows more processes to start when the physical swap has been completely reserved.

NOTE: Pseudo swap is enabled through a tunable OS parameter called swapmem_on. If the value for swapmem_on is 1, then pseudo swap is enabled (turned on). If the value for swapmem_on is 0, then pseudo swap is disabled (turned off).
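To check the current setting, a minimal sketch (kmtune on 11i v1 and earlier, kctune on 11i v2; swapmem_on is not a dynamic tunable, so changing it requires a kernel rebuild and reboot, as shown in the lab at the end of this module):

# kmtune -q swapmem_on        (11i v1 and earlier)
# kctune swapmem_on           (11i v2 and later)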

Analogy
A good analogy for pseudo swap is an airline overbooking a flight. Airlines know that customers sometimes don't show up for their flight. If they reserved only enough seats for the plane, they would likely depart with a plane that wasn't full: lost revenue. So they reserve more seats than actually exist on the plane, betting that a certain percentage of customers won't show. That way they can fly a plane that is much closer to full and get more revenue. Of course, they are occasionally wrong.


7-9. SLIDE: Total Swap Space Calculation with Pseudo Swap

Total Swap Space Calculation with Pseudo Swap

   Memory Size       =  32 MB
       x 0.75
   Pseudo Swap       =  24 MB
 + Physical Swap     =  55 MB
   -------------------------
   Total Swap        =  79 MB

Student Notes
The above slide shows how Total Available Swap Space (also known as SWAP_AVAIL) is calculated with pseudo swap turned on. The SWAP_AVAIL variable is calculated as all of the configured physical swap space (device and file system swap) PLUS 75% of physical memory (pseudo swap). (The calculation of the size of pseudo swap is actually more complex than given here; the resultant value can vary anywhere from 67% to 88% of physical memory. But we'll use 75% as a pretty typical figure.) In our example, the total amount of physical swap was 55 MB, and the amount of physical memory was 32 MB. Since the size of pseudo swap is estimated at 75% of physical memory, the pseudo swap size in our example is 24 MB.


This means the Total Available Swap Space (SWAP_AVAIL) is:

      55 MB  (Physical Swap)
    + 24 MB  (Pseudo Swap)
    -------
      79 MB  (Total Avail Swap)


7-10. SLIDE: Example Situation Using Pseudo Swap

Example Situation Using Pseudo Swap

(Diagram: Allocation without Pseudo Swap shows Reserved: 20 MB, Used: 35 MB, Swap Avail: 0 MB. Allocation with Pseudo Swap shows Reserved: 20 MB, Used: 35 MB, Swap Avail: 24 MB.)

A new program wants to execute, and there is not enough memory for it to fit. With pseudo swap turned ON, the program can now execute!

Student Notes
The above slide revisits our previous situation, this time with pseudo swap turned ON. In our previous situation, we had swap space of 55 MB, of which 35 MB was in use and the remaining 20 MB was reserved. With pseudo swap turned OFF, we saw that no new processes could start because no physical swap space was available for reservation purposes. With pseudo swap turned ON, the total available swap space is 79 MB (not 55 MB). Therefore, when the system runs out of physical swap, it still has 24 MB (due to pseudo swap), which it thinks it can allocate and therefore can reserve. Consequently, the operating system is able to support more processes without having to allocate more physical swap space. This is important for large-memory customers who do not want to purchase a lot of swap space on disk in order to support the large memory.


7-11. SLIDE: Swap Priorities

Swap Priorities
Equal Priorities                        Unequal Priorities

1st chunk of swap - disk 1, chunk 1     1st chunk of swap - disk 1, chunk 1
2nd chunk of swap - disk 2, chunk 1     2nd chunk of swap - disk 1, chunk 2
3rd chunk of swap - disk 1, chunk 2     3rd chunk of swap - disk 1, chunk 3
4th chunk of swap - disk 2, chunk 2     4th chunk of swap - disk 1, chunk 4
5th chunk: allocated on disk 1          5th chunk: allocated on disk 1

(Diagram: on the left, two devices both at swap priority 1 receive alternating chunks; on the right, the priority-1 device fills completely before the priority-2 device is used.)

Student Notes
When the HP-UX operating system needs to page something from memory to a swap device, it selects the smallest-numbered, strongest-priority swap device. A system administrator can define a priority number for each swap device on the system. The priority numbers range from 0 to 10, with 0 being the strongest priority, and 10 being the weakest priority. If multiple swap devices are available when the system needs to page out to swap, the strongest priority swap device is used. The slide shows two examples. The first example illustrates how the system behaves when two equal priority swap devices are available. In this situation, the system alternates between the two swap devices, with the first chunk of swap being allocated on swap device #1, and the second chunk of swap being allocated on swap device #2. The second example illustrates how the system behaves when two unequal priority swap devices are available. In this situation, the system will continue to allocate chunks of swap from the lowest-numbered (strongest priority) swap device. Only when that device is 100% full will the system begin allocating chunks from the second swap device.
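The priorities are assigned when the devices are activated; for example, the two scenarios on the slide could be set up like this (logical volume names are illustrative; the same commands appear in the lab at the end of this module):

# swapon -p 1 /dev/vg00/swap1     (strongest priority)
# swapon -p 1 /dev/vg00/swap3     (equal priority: interleaves with swap1)
# swapon -p 2 /dev/vg00/swap2     (weaker priority: used only when the priority-1 devices are full)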


7-12. SLIDE: Swap Chunks

Swap Chunks

(Diagram: two swap devices, both at priority 1, receiving alternating chunks 1 through 4.)

Space on the swap device is allocated to the kernel in increments called swap chunks. The default swap chunk size is 2 MB.

Student Notes
A swap chunk is the unit of space in which the operating system allocates swap from a swap device. The default swap chunk size is 2 MB. In the above example, two equal-priority swap devices are available to the system. The system allocates the first swap chunk on swap device #1; its size will be 2 MB by default. Once this swap chunk has been filled by 512 pages (page size = 4 KB), the system allocates a second swap chunk on swap device #2. The system continues alternating between the two devices in swap chunk increments. Swap chunks are also the unit in which swap space is allocated on file system swap devices. With file system swap, the operating system allocates space on the file system only if the space is actually needed; if it does not need the swap space, it does not allocate space. When it does need swap space, it allocates the file system space in swap chunk sizes: files are created, each of a size equal to a swap chunk, and named hostname.N, where N is a number from 0 on up.
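Enabling file system swap is just a matter of pointing swapon(1M) at a directory instead of a device; a minimal sketch, reusing the /home/paging area from the earlier swapinfo example:

# swapon -p 2 /home/paging

Nothing is allocated up front: the 2-MB hostname.N files appear only as the kernel actually pages to the area.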


7-13. SLIDE: Swap Space Parameters

Swap Space Parameters

DEV_BSIZE        Device block size. This is the size (in bytes) of a
                 block on the disk. The default size is 1024 bytes.

swchunk          The number of blocks to allocate to the kernel when it
                 needs swap space. The default is to allocate swap space
                 to the kernel in 2-MB increments: the default value is
                 2048. The maximum value is 65,536.

maxswapchunks    The maximum number of swchunks which can be allocated
                 to the kernel. The default value is 256. The maximum
                 value is 16,384.

Total swap space recognized by the kernel = maxswapchunks x swchunk x DEV_BSIZE
Defaults: 256 x 2048 x 1024 = 512 MB

Student Notes
There are two configurable parameters and one fixed, non-configurable parameter that affect swap space configurations and allocations.

DEV_BSIZE        The size in bytes of a block of disk space. The default
                 size is 1 KB. It is not configurable.

swchunk          The number of blocks (of size DEV_BSIZE) to associate
                 with a chunk of swap space, referred to as a swap chunk.
                 The default value is 2048 blocks, or 2 MB. The maximum
                 value is 65,536, or 64 MB.

maxswapchunks    The maximum number of swap chunks that will be
                 recognized systemwide. The default value is 256. The
                 maximum value is 16,384.

Using these defaults, the maximum amount of swap space that the operating system recognizes is 512 MB. This means if a system is configured physically for 1 GB of swap space, only 512 MB of the 1 GB will be used by the system. In order for the system to use the other 512 MB, the tunable OS parameter maxswapchunks needs to be increased to 512.
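The arithmetic is easy to verify from the POSIX shell; a sketch with the default values, and with maxswapchunks raised to 512:

# echo $((256 * 2048 * 1024 / 1048576)) MB
512 MB
# echo $((512 * 2048 * 1024 / 1048576)) MB
1024 MB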


If you were to install HP-UX on a system that had 2 GB of physical memory, the installation process would automatically increase maxswapchunks to accommodate the larger memory; in this example, it would set maxswapchunks to 1024. However, if you were to add more memory at a later date (without reinstalling the kernel), you would have to manually tune maxswapchunks to be able to allocate enough swap space and use all of your available memory. Or, use pseudo swap. In 11.23 (11i v2), maxswapchunks has been eliminated and is no longer an issue.


7-14. SLIDE: Summary

Summary
Swap space reservation
Pseudo swap
Swap priorities
Swap chunks
Swap space parameters

Student Notes
To summarize this module, all processes must reserve swap space by decrementing a variable called SWAP_AVAIL when they initialize. If this variable cannot be decremented, the process will not be able to start. To allow this variable to recognize more swap space than physically exists, setting a tunable parameter, swapmem_on, to 1 will turn on pseudo swap. This allows more processes to execute than the amount of swap space can support. This is not considered a problem on large memory systems, because these machines are not expected to swap. If a system does need to swap, it will swap to the lowest-numbered (strongest) priority swap device first. The priority of a swap device is specified when the device is activated. If two swap devices have the same priority, the system will alternate between the two devices. Swap chunks are the unit of disk space by which swap space is allocated. By default, the size of a swap chunk is 2 MB. By default, the system recognizes a maximum of 512 MB of swap space. If more swap space exists, the tunable parameter, maxswapchunks, must be increased, in order for the additional swap space to be recognized. If maxswapchunks is already set to the maximum value, then increase the value of swchunk.


7-15. LAB: Monitoring Swap Space

Preliminary Steps

A portion of this lab requires you to interact with the ISL and boot menus, which can only be accomplished via a console login. If you are using remote lab equipment, access your system's console interface via the GSP/MP. You may get some "file system full" messages while you are shutting down the system. You can ignore these messages.

Directions
The following lab illustrates swap reservation, configures and de-configures pseudo swap, and adds additional swap partitions with different swap priorities.

1. Use the swapinfo command to display the current swap space statistics on the system. List the MB Avail and MB Used for the following three items:

              MB Available      MB Used
   dev        ____________      _______
   reserve    ____________      _______
   memory     ____________      _______

2. To see total swap space available and total swap space reserved, enter:

   # swapinfo -mt

   What is the total swap space available (including pseudo swap)?
   What is the total space reserved?


3. Start a new shell process by typing sh. Re-execute the swapinfo command and verify whether any additional swap space was reserved when the new shell process started. In this case, the difference is going to be pretty small, so let's not use the -m option. Upon verification, exit the shell.

   Is the swap space returned upon exiting the shell process?

4. Start glance and observe the Global bars at the top of the display for the duration of this step. Start a large memory process and note how much the Current Swap Utilization percentage increases in glance. Type:

   # /home/h4262/memory/paging/mem256 &

   Use the process that most closely matches your physical memory size. This should reserve a large amount of swap space. Start as many mem256 processes as possible. For best results, wait until each swap reservation is complete, by observing the incremental increases in Current Swap Utilization in glance. The system will get slower and slower as you start more mem256 processes.

   What was the maximum number of mem256 processes that could be started?
   What prevented an additional mem256 process from being started?

   Kill all mem256 processes to restore performance.

5. Recompile the kernel, disabling pseudo swap. Use the following procedure:

   11i v1 or earlier:

   # cd /stand/build
   # /usr/lbin/sysadm/system_prep -s system
   # echo "swapmem_on 0" >> system
   # mk_kernel -s ./system
   # cd /
   # shutdown -ry 0


11i v2 and later:

# cd /
# kctune swapmem_on=0
NOTE: The configuration being loaded contains the following change(s) that
      cannot be applied immediately and which will be held for the next boot:
      -- The tunable swapmem_on cannot be changed in a dynamic fashion.
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
==> Do you wish to update it to contain the current configuration before
    making the requested change? no
NOTE: The backup will not be updated.
* The requested changes have been saved, and will take effect at next boot.
Tunable      Value            Expression
swapmem_on   (now)        1   Default
             (next boot)  0   0
# shutdown -ry 0

6. Reboot from the new kernel.

   Press any key to interrupt the boot process.
   Main menu> boot pri isl
   Interact with IPL> y
   ISL> hpux (;0)/stand/build/vmunix_test

7. Once the system reboots, log in and execute swapinfo.

   Is there a memory entry? Why or why not?
   Will the same number of mem256 processes be able to execute as earlier?
   How many mem256 processes can be started now?

   Kill all mem256 processes to restore performance.

8. If you have a two-disk system, add the second disk to vg00 (if this was not already done in a previous exercise) and build a second swap logical volume on it. This lvol should be the same size as the primary swap volume. If you do not have a second disk, continue this lab at question 13. If you did not add the second disk earlier:


# vgdisplay -v | grep Name    (Note the physical disks used by vg00)
# ioscan -fnC disk            (Note which disks are unused)
# pvcreate -f <raw_dev_file_of_unused_disk>
# vgextend /dev/vg00 <block_dev_file_of_second_disk>

To create the new swap device on the second disk:

# lvcreate -n swap1 /dev/vg00
# lvextend -L 512 /dev/vg00/swap1 <dev_file_of_second_disk>

Note that in our case the primary swap is 512 MB. Check swapinfo on your system and match the size of the new swap device to the primary swap.

9. Now add the new logical volume to swap space. Ensure that the priority is the same as the primary swap, and check your work:

# swapon -p 1 /dev/vg00/swap1
swapon: Device /dev/vg00/swap1 contains a file system. Use -e to page
        after the end of the file system, or -f to overwrite the file
        system with paging.

Oops! Problem 1: swapon is being overly cautious. If you get this message, the memory manager has detected what appears to be a file system already on the device (probably left over from some previous use). You need to override:

# swapon -p 1 -f /dev/vg00/swap1
swapon: The kernel tunable parameter "maxswapchunks" needs to be
        increased to add paging on device /dev/vg00/swap1.

Oops! Problem 2: the kernel cannot deal with this amount of swap. If you get this message, the tunable parameter maxswapchunks is set too small to accommodate all of the new swap space. We need to modify maxswapchunks and reboot. If you have this problem, use sam to double maxswapchunks, or recompile the kernel by hand, increasing maxswapchunks, with the following procedure:

# cd /stand/build
# echo "maxswapchunks 512" >> system
# mk_kernel -s ./system
# cd /
# shutdown -ry 0

(In 11i v2, maxswapchunks has been obsoleted and will not have to be modified.)

10. If you had to rebuild the kernel to increase maxswapchunks, reboot the system. Otherwise, skip to step 11.

    Press any key to interrupt the boot process.
    Main menu> boot pri isl
    Interact with IPL> y
    ISL> hpux (;0)/stand/build/vmunix_test


And now add the new swap device:

# swapon -p 1 -f /dev/vg00/swap1

Verify that the new swap space has been recognized by the kernel:

# swapinfo -mt

Done!

11. Start enough mem256 processes to make the system start paging.

12. Measure the disk I/O to see what is happening with swap space. Go to question 15 when you have finished.
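One way to take the measurement is with sar; a sketch (10 samples at 5-second intervals; device names will vary):

   # sar -d 5 10

Compare the %busy and blks/s figures for the disk holding primary swap against those for the second disk to see where the paging I/O is going.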

13. If you have a single-disk system, create three additional swap devices with sizes of 20 MB:

# lvcreate -L 20 -n swap1 vg00
# lvcreate -L 20 -n swap2 vg00
# lvcreate -L 20 -n swap3 vg00

List the current amount of swap space in use. If 10 MB is currently in use on a single swap device, and we activate an equal-priority swap device, what is the distribution if an additional 10 MB is paged out?

A) The distribution would be 10 MB and 10 MB.
B) The distribution would be 15 MB and 5 MB.

Prior to activating these swap devices, make note of the amount of swap space currently in use. When the new swap devices are activated with equal priority, all new paging activity will be spread evenly over these swap devices.


14. Activate the newly created swap devices. Activate two with a priority of 1, and the third with a priority of 2.

# swapon -p 1 /dev/vg00/swap1
# swapon -p 2 /dev/vg00/swap2
# swapon -p 1 /dev/vg00/swap3

Start enough mem256 processes to make the system start paging.

Is the new paging activity being distributed evenly across the paging devices?

15. When finished with the lab, reboot the system as normal (do not boot vmunix_test) to re-enable pseudo swap and remove the additional swap devices.

For 11i v1 and earlier, follow this procedure:

# cd /
# shutdown -ry 0

For 11i v2 and later, follow this procedure:

# cd /
# kctune swapmem_on=1
# shutdown -ry 0


Module 8 Disk Performance Issues

Objectives

Upon completion of this module, you will be able to do the following:

   List three ways disk space can be used.
   List disk device files.
   Identify disk bottlenecks.
   Identify kernel system parameters.


8-1. SLIDE: Disk Overview

Disk Overview

(Diagram: three views of a disk: a physical view of the platters and tracks, a logical view of the data blocks, and an internal view of cylinders 0 through N-1.)

Student Notes
Disks are used to store data for the operating system and the applications. A disk can be used several different ways, but they boil down to just two file system and raw. If a disk holds a file system, there are several structures which are built on the disk (using the data blocks of the disk) to help support the software in the kernel, which needs to access and manage the file system files and their contents. If the disks are to be used raw (such as a device swap space or an application database), no kernel structures are built out on the disk. The related code simply reads, manages and organizes the data blocks as it sees fit. There are several types of file systems available with the HP-UX 10.x and 11.x releases. The two primary types of local file systems are HFS (High performance File System), which was the original file system for HP-UX and has continually been enhanced since, and JFS (Journaled File System), which was introduced with the HP-UX 10.01 release and continues to grow in popularity and functionality. In the near future, you should see another type of file system become available for HP-UX the Advanced File System (AdvFS) ported over from Tru64 UNIX. In later modules, we will


discuss the performance issues that pertain to each of the available file systems. In this module, we'll address the issues pertaining to all disks.

Physical View
From a physical disk perspective, the disk drive upon which a file system is placed contains sectors, tracks, platters, and read/write heads. A key behavior of almost all disk drives is that the read/write heads move in parallel across the platters, in such a way that each read/write head is over the same track within each platter at the same time. To maximize the I/O throughput of the disk, it is desirable to minimize the amount of head movement. To help achieve this goal, all the sectors in a cylinder are addressed in sequential order.
Cylinder Analogy

Consider a health spa or gym with three floors. Each floor contains a jogging track, and the three jogging tracks are located directly above or beneath one another from floor to floor. From this point of view, a cylinder would be all the same lanes from each floor's jogging track. In other words, all lane 1 tracks would make up cylinder 1; all lane 2 tracks would make up cylinder 2, etc.

By organizing space on disks in cylinders, the software can logically distribute its sectors across all platters of the disk evenly and uniformly. For example, in the slide above, the first 6 sectors would be allocated as follows:

block #1: Platter #1, Track #1, Sector #1
block #2: Platter #1, Track #1, Sector #2
block #3: Platter #1, Track #1, Sector #3
block #4: Platter #1, Track #1, Sector #4
block #5: Platter #1, Track #1, Sector #5
block #6: Platter #1, Track #1, Sector #6

By allocating disk space in this manner, a multiple-block read (say, 6 blocks) can be satisfied in one operation.

Logical View
From a logical view, each cylinder is simply a repository for a certain amount of data, which can be read or written without having to move the heads. This data area is further broken down into blocks. The block is the most fundamental unit of data that can be read from or written to the disk. We mentioned in an earlier chapter a value in the kernel, called DEV_BSIZE. It is equal to 1024 bytes. This is the block size from the kernel's perspective.

The disk can be viewed as simply a series of blocks running from block 0 to block N-1, where N is the total number of blocks on the disk. The closer two blocks are to each other, the more likely they will be in the same cylinder. If they are in the same cylinder, a minimum amount of time is needed to read or write both blocks.


82. SLIDE: Disk I/O Read Data Flow

Disk I/O Read Data Flow


1. Process issues read system call (logical I/O generated).
2. Block to be read is not in buffer cache; physical I/O is issued.
3. Block on disk is accessed through seek, latency, and transfer.
4. Data is read into buffer cache, completing physical I/O request.
5. Data is returned to process, completing the logical I/O and system call.

[Slide figure: a read request flowing from the process to the file system's disk I/O queue, through the seek/latency/transfer step at the disk, into the buffer cache in memory, and back to the process's file data.]

Student Notes
Up to this point, we have looked at I/O from the standpoint of the disk. The following slide illustrates disk I/O activities from the standpoint of memory and the process initiating the I/O. The assumption here is that we are dealing with a disk that has a file system on it, so the buffer cache becomes a factor in the operation. If this were a raw disk, the buffer cache would be bypassed by all I/O operations.

Asynchronous vs. Synchronous Reads


There are two possible approaches to doing reads: synchronous and asynchronous. By default, any read will be synchronous; i.e., the process will wait (and sleep, if necessary) until the data can be transferred to the data area of the process. If the read is asynchronous, the process informs another driver (an asynchronous I/O driver) that it will need certain data in the future. The driver fetches the data from the disk and places it in the buffer cache, while the process continues with other operations. When the data is in the cache, the driver signals the process and the read is now executed. The data is guaranteed to be in the buffer and the process never has to sleep. Asynchronous reads are significantly more difficult to program, so they are used only in the more sophisticated applications.


Buffered Read Data Flow


The flow diagram on the slide highlights the main actions from the time a process issues a read() system call to when the data is returned to the process.

1. A process issues the read() system call. This is viewed by the kernel as a logical I/O, meaning the kernel will satisfy the request any way it can, either through the buffer cache or by performing a physical I/O.

2. The buffer cache is searched, looking for the data blocks being requested. If the data block is found in the buffer cache, the read() system call is returned with the corresponding data. If the data block is not found, the requesting process goes to sleep and a physical I/O request is generated to read the data block into the buffer cache. We will assume the data block was not found.

NOTE: Logical I/Os may or may not generate corresponding physical I/Os. The goal of the buffer cache is to handle as many logical I/Os with as few physical I/Os as possible.

3. The physical read is performed because the data was not in the buffer cache. Because physical I/O involves movement of the disk head (seek time), waiting for the data on the platter to rotate under the disk head (latency time), and moving the data from the platter into memory (transfer time), the cost of a physical I/O is high from a performance standpoint. Physical I/Os are the most time-consuming operations that the kernel performs. If the disk I/O queue is long (3 or more requests), the time spent waiting to be serviced can be longer than the time to actually service the I/O request.

4. Once the physical I/O request returns, the data is stored in the buffer cache so that future I/O requests for the same file system block can be satisfied without having to perform another physical I/O. This step completes the physical I/O initiated by the kernel.

5. The final step is to return the data to the original calling process that issued the read(). The sleeping process is awakened and transfers the desired data from the buffer (in buffer cache) to the data area of the process. Then the process returns from the read() system call. This step completes the logical I/O initiated by the process.
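The cost difference between a logical read satisfied from the buffer cache and one requiring physical I/O is easy to demonstrate from the shell. A minimal sketch (the file name is hypothetical; the lab at the end of this module does the same experiment with cat):

# timex dd if=/vxfs/file1 of=/dev/null bs=8k    # first pass: blocks must be read from disk
# timex dd if=/vxfs/file1 of=/dev/null bs=8k    # second pass: blocks come from the buffer cache

The real time of the second pass should be dramatically lower, even though the same number of logical I/Os is performed.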

Raw Read Data Flow


If the read operation is raw, the buffer cache is bypassed. Data is transferred directly from the disk to the data area of the calling process. All raw reads are synchronous and therefore result in the process sleeping until the data has been read in.


83. SLIDE: Disk I/O Write Data Flow (Synchronous)

Disk I/O Write Data Flow (Synchronous)


1. Process issues write system call.
2. Block is assigned on disk, and image for block is allocated in buffer cache.
3. Once data is written to buffer cache, a physical I/O to disk is generated.
4. Data is written to disk controller cache.
5. Data is then transferred from the disk controller to the corresponding platter.
6. Upon completion of I/O, the disk controller sends an acknowledgment to the kernel.
7. Write system call returns to process.
[Slide figure: a write flowing from the process through the buffer cache and disk I/O queue in memory to the disk controller cache and on to the platter.]

Student Notes
As with reads, there are two methods for performing write() system calls: asynchronous and synchronous. Although the default write operation is asynchronous (the writing process does not sleep waiting for the write to complete), it is quite simple for a program to choose synchronous writes. It can be done by simply setting a flag on the open file before issuing the write. This can be done when the file is opened or at some later time.

Synchronous Writes
The slide shows the data flow of a synchronous write, from the time the write() system call is issued to when the write call returns to the process.

1. The process issues a synchronous write() system call.

2. Assuming the process is writing to a new file data block, a new file system block is allocated on disk and an image of that block is allocated in the buffer cache.

3. Once the data is copied from the data area of the process to the buffer cache, an I/O request is placed in the disk I/O queue for that particular disk. The calling process goes to sleep until the write is reported to be complete.


4. When the physical write is performed, the data is first copied from the buffer cache to the firmware cache on the disk drive controller.

NOTE: Most SCSI disk drive controllers can be configured to return an I/O complete acknowledgment at this point, rather than waiting for the data to be transferred to the physical platters. This condition is called immediate reporting.

5. The data is transferred from the disk controller cache to the platter. This operation is often the most time-consuming part of the write, as it involves seek, latency, and data transfer operations.

6. Once the data has been successfully transferred to the platters, the disk drive controller returns an I/O complete acknowledgment to the kernel (assuming this was not done in step 4 with immediate reporting).

7. The kernel, upon receiving the I/O complete acknowledgment, wakes the sleeping process, which then returns from the write call.

Asynchronous Writes
An asynchronous write does not wait for the data to get to the disk. An asynchronous write system call returns immediately upon the data being written to the buffer cache. In the diagram on the slide, the write call would return following step 2.

The advantage of asynchronous writes is performance: the process does not have to wait for the physical I/O. The disadvantage is lack of data integrity. Because the process continues executing before the data is written to disk, it can perform additional actions that are dependent upon the data being written successfully. If for some reason the data does not get written (a disk goes offline or a disk head crashes), the additional actions can leave the system in an inconsistent state.

For example, assume a database record is written asynchronously. Because it is written asynchronously, the database process continues its execution. A subsequent action is to update a corresponding entry in another table of the database located on another disk. Assume the first asynchronous write is posted to a busy disk with a long queue, and the subsequent write is posted to a disk with an empty queue. The second write finishes before the first write begins! If the system were to crash after the second write, but before the first write, the database would be out of sync and corrupted, because the second write assumed that the first write succeeded.

There is no signaling to the writing process to let it know that a write has completed. For that, the process would have to do synchronous writes.


84. SLIDE: Disk Metrics to Monitor Systemwide

Disk Metrics to Monitor Systemwide


- Utilization of disk drives
- Disk I/O queue length
- Amount of physical I/O to:
  - Device (i.e., disk)
  - Logical volume
  - File system
- Buffer cache hit ratio

Student Notes
When monitoring disk I/O activity, the main metrics to monitor are:

Percent utilization of the disk drives: As utilization of the disk drives increases, so does the amount of time it takes to perform an I/O. According to the performance queuing theory, it takes twice as long to perform an I/O when the disk is 50% busy as it does when the disk is idle. Therefore, we consider that a disk may be experiencing a bottleneck if the disk is 50% busy or more.

Requests in the disk I/O queue: The number of requests in the disk I/O queue is one of the best indicators of a disk performance problem. If the average number of requests is above two, then requests are forced to wait in the queue longer than the amount of time needed to service their own requests. If the average number of requests is three or greater, you should also see that the average wait time for a request is greater than the average service time.

Amount of physical I/O: If the amount of disk activity is high, it is important to investigate on which disk, which logical volume, and which file system the activity is occurring.


Buffer cache hit ratio: One reason disk activity could be high is that read or write requests are not finding the corresponding disk blocks in the buffer cache. As a result, physical I/O requests are being generated to the disk.

The read cache hit ratio on the buffer cache indicates how frequently read data is found in the buffer cache. The read hit ratio should be 90% or higher for optimal performance. Less than 90% indicates the buffer cache may be too small, causing (potential) excess disk activity. It may also indicate that the application is not using the buffer cache in an efficient manner, e.g., doing a lot of random I/O or very large I/O.

The write cache hit ratio on the buffer cache indicates how frequently a write to a buffer does not trigger a physical read or write to the disk. (If only a portion of a block is being written, and the image of that block is not already in a buffer, it may be necessary to read the original contents of the block into the buffer cache before modifying it with the new write data.) The write cache hit ratio should be 70% or higher for optimal performance. Less than 70% indicates the buffer cache may be too small, causing (potential) excess disk activity. Again, the fault may lie with the application's use of the buffer cache.
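Both ratios can be spot-checked from the command line with sar, which is covered in detail later in this module:

# sar -b 5 12     # watch %rcache (should stay at or above 90%) and %wcache (at or above 70%)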


85. SLIDE: Disk Metrics to Monitor Per Process

Disk Metrics to Monitor Per Process


- Amount of physical and logical I/O being performed on a per-process basis
- Type and amount of system calls (I/O-related) being generated by processes performing large amounts of I/O
- Paging to swap device (VM reads/writes) on a per-process basis
- Files opened by processes performing large amounts of I/O

Student Notes
On a per-process basis, it is important to identify which processes are generating large amounts of disk I/O. Metrics that help to identify I/O activity on a per-process basis are:

Amount of physical and logical I/O: This indicates how much I/O the process is performing. For processes performing large amounts of I/O, the additional three metrics shown below should be investigated.

Type and amount of I/O-related system calls being generated: For each process performing high I/O, the number of read(), write(), and other I/O-related calls should be inspected.

Amount of VM reads and VM writes: If the I/O activity being generated is due to paging (VM reads and VM writes), then the problem is probably not a disk I/O problem, but more likely a memory problem.

Files opened with heavy access: For each process performing large amounts of file system I/O, the names of the files to which they are reading or writing should be inspected. For files receiving high I/O activity, consider relocating these files to other disks that are less busy. To determine how random the I/O requests are, hit <CR>


frequently while looking at the list of open files for that process (in glance), then inspect how quickly the offset to each file changes and whether it is monotonically increasing or varies up and down.


86. SLIDE: Activities that Create a Large Amount of Disk I/O

Activities that Create a Large Amount of Disk I/O


- Buffer cache misses
- Synchronous I/O
- Accessing sequentially with a small block size
- Accessing many files on a single disk
- Accessing many disk drives from a single disk controller card

Student Notes
Common causes of disk-related performance problems are shown on the slide.

- Buffer cache misses cause physical I/O to occur. When the appropriate buffer is not found in the buffer cache, a physical I/O is triggered. By the way, a buffer cache can be too large as well. A very large buffer cache takes more time to search to see if the appropriate buffer exists! More on how to properly size a buffer cache will be given later in this module.

- Synchronous I/O forces the write system calls to wait until the I/O physically completes. Very good for data integrity, very poor for performance.

- Sequential access, with a small block size, causes excessive amounts of physical I/O.

- Accessing lots of files on one disk, versus many disks, creates an imbalance of disk drive utilization. This leads to performance problems with the busy disks and underutilization with the less busy disks.

- Accessing lots of disks on the same disk controller creates contention problems on the SCSI bus. You can determine this by noticing that multiple disks on the same controller


have request queues that are consistently three or greater in length, and the average time a request waits to be serviced is greater than the average time it takes to actually service the request. The individual disks may not show a disk utilization of 50% or greater! If this situation occurs, it would be best to split the busiest disks onto separate controllers.


87. SLIDE: Disk I/O Monitoring sar -d Output

Disk I/O Monitoring sar -d Output


# sar -d 5 6

05:23:50   device    %busy    avque    r+w/s   blks/s   avwait   avserv
05:23:55   c1t5d0     0.60     0.50        2       35     1.55     5.07
           c0t4d0    62.40    10.51       46     2783   127.97   152.92
           c0t5d0    33.20     2.76       16     1226    42.89   143.96
           c0t6d0    54.80     8.10       31     2166   242.52   193.15
05:24:00   c1t5d0     1.20     0.50        3       39     1.97     6.72
           c0t4d0    63.80    10.84       48     2943   129.23   159.47
           c0t5d0    39.20     2.94       19     1427    38.85   154.55
           c0t6d0    61.80    19.60       36     2371   331.15   208.49
05:24:05   c1t5d0     2.20     0.50        3       45     3.85    13.04
           c0t4d0    56.40    18.40       39     2392   234.33   163.10
           c0t5d0    35.60     2.69       17     1258    39.96   138.81
           c0t6d0    62.80    18.41       36     2643   192.28   178.66
05:24:10   c1t5d0     0.20     0.50        2       35     1.01     4.86
           c0t4d0    68.60    13.00       51     3118   154.68   159.02
           c0t5d0    33.80     3.25       16     1226    47.82   147.32
           c0t6d0    60.00     5.72       33     2301   238.43   203.88
05:24:15   c0t4d0    24.40     4.25       15      823    60.83   180.68
           c0t5d0    23.00     3.46       14      851    43.33   118.87
           c0t6d0    50.60    18.77       28     1846   306.13   233.36
05:24:20   c1t6d0     0.60     0.50        0        2     4.63    11.53
           c1t5d0     1.40     1.17        2       23     9.85    21.50

Student Notes
The sar -d report shows disk activity on a per disk drive (spindle) basis. The key fields within this report are:

%busy     Indicates the average percent utilization of the disk over the interval (5 seconds in the slide).
avque     Indicates the average number of requests in the disk I/O queue.
avwait    Indicates the average amount of time a request spends waiting in the disk I/O queue.
avserv    Indicates the average amount of time taken to service a disk I/O request.

The sar -d report on the slide shows that when the disk had the most requests in the queue (19.60 and 18.77), the average wait time was at its highest. The slide also shows that there are five disk drives spread across two disk controllers. One disk controller (c0) appears to have two busy drives (t4 and t6), and a relatively low usage drive (t5). Disk controller (c1) has two disks that are mainly idle. One performance solution


here would be to balance the disk activity across the two controllers by moving one disk (say c0t4) over to the less busy disk controller (c1).
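If the data is in LVM logical volumes, one way to rebalance without recabling is to migrate the physical extents from the busy disk to a spare disk on the idle controller. A sketch (device files hypothetical; the target disk must already belong to the same volume group):

# pvmove /dev/dsk/c0t4d0 /dev/dsk/c1t4d0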


88. SLIDE: Disk I/O Monitoring sar -b Output

Disk I/O Monitoring sar -b Output

#=> sar -b 10 20

HP-UX e2403roc B.10.20 U 9000/856    02/09/98

05:51:04  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
05:51:14        0        0        0        1        1       25        0        0
05:52:04        0        0        0        0        1       85        0        0
05:52:14        0        0        0        1        8       87        0        0
05:52:24        0        0        0        0        4      100        0        0
05:52:34        0        0        0        0        1      100        0        0
05:52:54        1       68       99        0        0       33        0        0
05:53:04        7    11936      100        1        2       13        0        0
05:53:14        6    19506      100        1        1        0        0        0
05:53:24       28    24147      100        1        2       65        0        0
05:53:34       64    16659      100        0       14       99        0        0
05:53:44      118      118        0        2        3       46        0        0
05:53:54        0        0        0        3        3        0        0        0
05:54:04        0        0        0       18       19        4        0        0
05:54:14      179      179        0       18       18        3        0        0
05:54:24      179      179        0       13       14        4        0        0
Average        29     3639       99        3        5       39        0        0

Student Notes
The sar -b report shows disk activity related to the buffer cache. The key fields within this report are:

bread/s    Indicates the average number of physical I/O reads per second over the interval. The term bread refers to block reads.

lread/s    Indicates the average number of logical I/O reads per second over the interval.

%rcache    Indicates the average percent read cache hit rate. This shows what percentage of read requests were satisfied through the buffer cache. Ideally, this value should be consistently 90% or greater.

bwrit/s    Indicates the average number of physical I/O writes per second over the interval. The term bwrit refers to block writes.

lwrit/s    Indicates the average number of logical I/O writes per second over the interval.


%wcache    Indicates the average percent write cache hit rate. This shows what percentage of write requests were satisfied through the buffer cache. Ideally, this value should be consistently 70% or greater.

The sar -b report on the slide shows the two extreme situations. The first extreme is a 100% cache hit rate, which occurs when there are lots of logical I/O requests and all requests are satisfied through the buffer cache, rather than having to go to disk. This is a very desirable condition. The other extreme is a 0% cache hit ratio. This occurs when every logical I/O request required a physical I/O from disk. In this case, the number of physical reads or writes is equal to the number of logical reads or writes. This is most undesirable.


89. SLIDE: Disk I/O Monitoring glance Disk Report

Disk I/O Monitoring glance Disk Report

B3692A GlancePlus B.10.12    06:16:25  e2403roc  9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util  |100%  100%  100%
Disk Util  | 83%   22%   84%
Mem  Util  | 94%   95%   96%
Swap Util  | 21%   21%   22%
--------------------------------------------------------------------------------
DISK REPORT                                                             Users= 4
Req Type     Requests     %    Rate    Bytes   Cum Req     %   Cum Rate Cum Byte
--------------------------------------------------------------------------------
Local
 Logl Rds          68   2.7    13.6      5kb      1260   7.8        9.6    3.2mb
 Logl Wts        2455  97.3   491.0   19.2mb     14798  92.2      112.9  114.8mb
 Phys Rds          10   1.7     2.0     80kb       189   5.1        1.4    1.8mb
 Phys Wts         565  98.3   113.0   18.9mb      3520  94.9       26.8  112.4mb
  User            571  99.3   114.2   18.9mb      3448  93.0       26.3  112.2mb
  Virt Mem          0   0.0     0.0      0kb        66   1.8        0.5    968kb
  System            4   0.7     0.8     32kb       195   5.3        1.4    1.2mb
  Raw               0   0.0     0.0      0kb         0   0.0        0.0      0kb
Remote
 Logl Rds           0   0.0     0.0      0kb         0   0.0        0.0      0kb
 Logl Wts           0   0.0     0.0      0kb         0   0.0        0.0      0kb
 Phys Rds           0   0.0     0.0      0kb         1 100.0        0.0      0kb
 Phys Wts           0   0.0     0.0      0kb         0   0.0        0.0      0kb

Student Notes
The glance disk report (d key) shows local and remote I/O activity. The I/O distribution can be viewed from the following:

- Logical Perspective (logical reads and logical writes)
- Physical Perspective (physical reads and physical writes)
- I/O Type Perspective (User, Virtual Mem, System, Raw)

Items of interest in this report include the number of logical I/O requests (read and writes), the number of physical I/O requests (reads and writes), and the ratio between the two. In the slide, disk utilization is 94% (very high), with the majority of the I/Os being writes (92%) as opposed to reads. It is also interesting to note the logical to physical write ratio is 14,798 / 3,520 or approximately 4:1, which is an acceptable write performance ratio.


810. SLIDE: Disk I/O Monitoring glance Disk Device I/O

Disk I/O Monitoring glance Disk Device I/O

B3692A GlancePlus B.10.12    06:31:12  e2403roc  9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util  |100%  100%  100%
Disk Util  | 83%   22%   84%
Mem  Util  | 94%   95%   96%
Swap Util  | 21%   21%   22%
--------------------------------------------------------------------------------
IO BY DISK                                                              Users= 4
Idx  Device      Util    Qlen      KB/Sec        Logl IO     Phys IO
--------------------------------------------------------------------------------
  1  56/52.6.0   0/ 0     0.0      0.0/   1.8    na/  na     0.0/ 0.2
  2  56/52.5.0   1/ 1     0.0     16.0/   5.1    na/  na     2.0/ 0.7
  3  56/36.4.0  78/ 9    18.2   1584.8/ 178.4    na/  na    48.0/ 5.6
  4  56/36.5.0  52/ 6     3.8    932.8/ 120.5    na/  na    24.0/ 3.0
  5  56/36.6.0  68/ 9    10.6   1172.8/ 154.9    na/  na    35.8/ 4.6
  6  56/52.2.0   0/ 0     0.0      0.0/   0.0   0.0/ 0.0     0.0/ 0.0

Top disk user: PID 3280, disc      106.4 IOs/sec          S - Select a Disk

Student Notes
The glance disk device report (u key) shows current and average utilization of each disk drive on the system. The report also shows the current I/O queue length for each disk. This display shows basically the same information as sar -d.

In the slide, three disks show utilization greater than 50% and queue lengths greater than 3. This is normally a valid reason for further investigation. The 10.6 and 18.2 queue lengths are high, but, because the average utilization of both of those drives is 9%, this may just be a spike in disk activity. In this case, monitor the situation further to see if the high queue lengths persist or if they were just spikes in disk usage.


811. SLIDE: Disk I/O Monitoring glance Logical Volume I/O

Disk I/O Monitoring glance Logical Volume I/O

B3692A GlancePlus B.10.12    06:34:41  e2403roc  9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util  |100%  100%  100%
Disk Util  | 83%   22%   84%
Mem  Util  | 94%   95%   96%
Swap Util  | 21%   21%   22%
--------------------------------------------------------------------------------
IO BY LOGICAL VOLUME                                                    Users= 4
Idx  Vol Group/Log Volume    Open LVs     LV Reads      LV Writes
--------------------------------------------------------------------------------
  1  /dev/vg00                     10     0.0/  0.0       0.0/  0.0
  2  /dev/vg00/group                      0.0/  0.0       0.0/  0.0
  3  /dev/vg00/lvol3                      0.0/  0.0       0.2/  0.0
  4  /dev/vg00/lvol2                      0.0/  0.0       0.0/  0.0
  5  /dev/vg00/lvol1                      0.0/  0.0       0.0/  0.0
  9  /dev/vg00/lvol7                      0.0/  0.0       0.0/  0.0
 10  /dev/vg00/lvol4                      0.0/  0.0       0.0/  0.0
 12  /dev/vg01                      2     0.0/  0.0       0.0/  0.0
 13  /dev/vg01/lvol1                      0.0/  0.0     105.6/ 19.2

Open Volume Groups: 2                                   S - Select a Volume

Student Notes
The glance logical volume report (v key) shows disk activity on a per-logical-volume basis. Only physical I/O activity (not logical I/O activity) is shown with this report.

In the previous slide, we saw high activity across three disk drives (drives 4, 5, and 6). The logical volume report on the slide shows all this activity is being performed against one logical volume (/dev/vg01/lvol1), which implies that the logical volume is being spread across three disks (a good idea, since the I/O to the logical volume is so high).


812. SLIDE: Disk I/O Monitoring glance System Calls per Process

Disk I/O Monitoring glance System Calls per Process

B3692A GlancePlus B.10.12    06:48:15  e2403roc  9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util  |100%  100%  100%
Disk Util  | 83%   22%   84%
Mem  Util  | 94%   95%   96%
Swap Util  | 21%   21%   22%
--------------------------------------------------------------------------------
System Calls for PID: 4055, disc    PPID: 2410    euid: 0    User: root
                                             Elapsed                    Elapsed
System Call Name      ID    Count   Rate        Time  Cum Ct  CumRate   CumTime
--------------------------------------------------------------------------------
write                  4      377  754.0     0.10650   12851    477.7   4.10153
open                   5        3    6.0     0.05910     100      3.7   0.61923
close                  6        3    6.0     0.00006     100      3.7   0.00225
lseek                 19        0    0.0     0.00000      75      2.7   0.00204
ioctl                 54        3    6.0     0.00007     100      3.7   0.00259
vfork                 66        0    0.0     0.00000      25      0.9   0.34908
sigprocmask          185        0    0.0     0.00000      50      1.8   0.00088
sigaction            188        0    0.0     0.00000     150      5.5   0.01340
waitpid              200        0    0.0     0.00000      25      0.9   1.47745

Cumulative Interval: 27 secs

Student Notes
The glance system calls report (L key), available only from the select process report (s key), shows the names of the system calls being generated by the selected process. The system calls report can be viewed for individual processes (as shown on the slide), or globally for all processes on the system (Y key). Significant system calls, which typically consume a lot of time, are the file I/O related calls, such as read(), write(), open(), and close(). In the slide, the write() system call is being invoked heavily by the selected process (754 times/second) and has accounted for 4.1 seconds of the CPU's time over a 27-second period (approximately 15%).


813. SLIDE: Tuning a Disk I/O-Bound System Hardware Solutions

Tuning a Disk I/O-Bound System Hardware Solutions


- Add additional disk drives (and off-load busy drives).
- Add additional controller cards (and balance disk drive load across controllers).
- Add faster disk drives.
- Implement disk striping.
- Implement disk mirroring.

Student Notes
The hardware solutions on the above slide will help to lessen the performance impact of high disk I/O on a system.

- Add more disk drives and load balance across disks. This spreads the amount of I/O over more drives, decreasing the average number of I/O requests for each disk. Many smaller disks are better than a few large disks.

- Add more disk controllers and balance load across disk controllers. This spreads the amount of I/O over more controllers, decreasing the likelihood that any one disk controller will become overloaded with I/O requests.

- Add faster disk drives. This decreases the amount of time it takes to service an I/O request, which decreases the amount of time requests spend waiting in the disk I/O queue.

- Implement disk striping. This increases the number of disk heads having access to the striped data (the more disks striped across, the more heads accessing the data,


simultaneously). It also allows for overlapping seeks, meaning that one disk head can be seeking the next block while a second disk head is reading the current data block.

- Implement disk mirroring. This can increase read performance, as either the primary or mirrored copy of the data can be read. In fact, the data will be read from whichever disk has the fewest I/Os pending against it. However, it will negatively impact write performance. In order to maintain the integrity of the mirrors, duplicate writes must be done to each copy of the mirrored volume/disk. Mirroring is primarily a data integrity feature, but under the right circumstances (read-intensive data) it can improve performance as well.


814. SLIDE: Tuning a Disk I/O-Bound System Perform Asynchronous I/O

Tuning a Disk I/O-Bound System Perform Asynchronous I/O


- Configure individual disk drives to behave somewhat asynchronously with the immediate reporting feature of SCSI disk controllers.
- Configure immediate reporting with the scsictl command.

[Slide figure: the write path from the process through the buffer cache and disk I/O queue in memory to the disk controller cache, which can acknowledge the write before the data reaches the platter.]

Student Notes
Asynchronous I/O significantly improves write performance over synchronous I/O because the write requests (and thus the requesting processes) do not have to wait for the data to be written to the disk platters.

Immediate Reporting for Selected Disks


Immediate reporting can be turned on at boot time by setting the tunable parameter default_disk_ir to ON. An alternative to turning on default_disk_ir is to enable certain disk controllers selectively to report immediately to the kernel when the data reaches the disk controller cache.

For normal writes, the disk waits until data is transferred from the controller cache to the disk platters before returning to the kernel. By setting immediate reporting to ON for individual disk controllers, processes do not have to wait for the seek or latency times when writing to those disks.

The scsictl command can be used to turn immediate reporting ON (1) for a particular SCSI disk. The default for immediate reporting is OFF (0).


Examples

To view the device settings for the controller at SCSI adapter address 0 and SCSI target address 6:

# /usr/sbin/scsictl -m ir /dev/rdsk/c0t6d0
immediate_report = 0

To change the value of immediate reporting to ON:

# /usr/sbin/scsictl -m ir=1 /dev/rdsk/c0t6d0

To view the changes in the device settings:

# /usr/sbin/scsictl -a /dev/rdsk/c0t6d0
immediate_report = 1; queue_depth = 8


815. SLIDE: Tuning a Disk I/O-Bound System Load Balance across Disk Controllers

Tuning a Disk I/O-Bound System Load Balance across Disk Controllers

[Slide figure: volume group vg01 attached to the system through two controllers, with physical volume group PVG1 on controller C0 and PVG2 on controller C1.]

Student Notes
Another potential solution to a disk I/O performance problem is to spread the write requests across the disk controllers as evenly as possible. This helps ensure no one controller becomes overloaded with I/O requests.

Mirroring Logical Volumes


A popular feature of LVM is the ability to mirror logical volumes to separate disk drives. This involves writing one copy of the data to the primary disk and one copy to the mirrored disk. When the primary disk and mirror disk are on the same disk controller, a performance bottleneck often results because the disk controller has to service the writes for both the primary and mirrored data.
Physical Volume Groups

Physical volume groups (PVGs) allow disk drives to be grouped, based on the disk controller to which they're attached. Used in conjunction with LVM mirroring, this ensures the mirrored data not only goes to a different disk, but also goes to a different PVG (that is, a different disk controller).

How to Set Up PVGs

The PVG groups are defined in the /etc/lvmpvg file. This file can be manually edited or updated with the -g option to the vgcreate and vgextend commands. A sample /etc/lvmpvg file, based on the four disks on the slide, is:

VG /dev/vg01
PVG PV_group0
/dev/dsk/c0t6d0
/dev/dsk/c0t5d0
PVG PV_group1
/dev/dsk/c2t5d0
/dev/dsk/c2t4d0
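The same grouping can also be established when the volume group is created or extended, using the -g option (a sketch based on the sample file above; it assumes vg01 is being built from scratch):

# vgcreate -g PV_group0 /dev/vg01 /dev/dsk/c0t6d0 /dev/dsk/c0t5d0
# vgextend -g PV_group1 /dev/vg01 /dev/dsk/c2t5d0 /dev/dsk/c2t4d0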
Configuring LVM to Mirror to Different PVGs

The command to configure LVM mirroring for different PVGs is lvchange. The strict option to this command, -s, takes one of the following three arguments:

y    This indicates all mirrored copies must reside on different disks.
n    This indicates mirrored copies can reside on the same disk as the primary copy.
g    This indicates all mirrored copies must reside within different PVGs.

For example, to configure /dev/vg01/lvol1 to mirror to a different PVG:

lvchange -s g /dev/vg01/lvol1
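A new logical volume can also be created mirrored and PVG-strict in a single step. A minimal sketch (the size, mirror count, and volume name are hypothetical):

# lvcreate -L 100 -m 1 -s g -n data1 /dev/vg01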


816. SLIDE: Tuning a Disk I/O-Bound System Load Balance across Disk Drives

Tuning a Disk I/O-Bound System Load Balance across Disk Drives

[Slide figure: volume group vg01 shown without striping (logical blocks 1-6 on one disk; disk utilizations of 100%, 90%, 5%, and 20%) and with striping (blocks 1, 3, 5, 7, 9, 11 on one disk and 2, 4, 6, 8, 10, 12 on another; utilization balanced at roughly 52%, 52%, 20%, and 20%).]

Student Notes
Balancing the disk activity so that the utilization across drives is approximately the same helps to ensure that no one disk becomes overloaded with I/O requests (that is, 50% or greater utilization, with three or more requests in the disk queue). The slide illustrates a situation in which one disk is heavily utilized (100%) while another disk is only 5% utilized. One potential solution is to stripe the heavily utilized logical volume on the first disk to both disks.

LVM Striping
The ability to stripe a logical volume across multiple disks (at a file system block level) was introduced into LVM at the HP-UX 10.01 release. A logical volume must be configured for striping at the time of creation. Once a logical volume is created, it cannot be striped without recreating the logical volume.


The command to create a striped logical volume is lvcreate. The syntax, related to striping, for this command is:
lvcreate -i [number of disks] -I [stripe size] -L [size in MB] vg_name

Example:
lvcreate -i 2 -I 8 /dev/vg01
lvextend -L 50 /dev/vg01/lvol2 /dev/dsk/c0t5d0 /dev/dsk/c0t4d0
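The two-step example above creates the striped volume and then extends it onto specific disks. Under the same assumptions, the volume could instead be created at its full size in one step (stripe size in KB; the volume name is hypothetical):

lvcreate -i 2 -I 8 -L 50 -n lvstripe /dev/vg01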


817. SLIDE: Tuning a Disk I/O-Bound System Tune Buffer Cache

Tuning a Disk I/O-Bound System Tune Buffer Cache

[Slide figure: memory layout showing the kernel and OS tables, a fixed buffer cache of 5% (dbc_min_pct=5%, the default), an additional dynamic buffer cache area of 0-45% (dbc_max_pct=50%, the default), and the user process and shared memory area.]

Student Notes
With the introduction of HP-UX 10.0, the buffer cache becomes dynamic, growing and shrinking between a minimum size and a maximum size. NOTE: Space for the buffer cache is allocated in two different areas of memory: the minimum size is created in the O/S area of memory, and anything above the minimum size is allocated from the User Process area.

How the Buffer Cache Grows


As the kernel reads in files from the file system, it will try to store the data in the buffer cache. If memory is available and the buffer cache has not reached its maximum size, the kernel will grow the buffer cache to make room for the new data. As long as there is memory available, the kernel will keep growing the buffer cache until it reaches its maximum size (50% of memory, by default). If memory is not available, or the buffer cache is at its maximum size when new data is read, the kernel will select buffer cache entries that are least likely to be needed in the future, and reallocate those entries to store the new data.


The main point is that if there is available memory, the buffer cache will grow into this memory until there is no memory left (or until the buffer cache reaches its maximum size).

How the Buffer Cache Shrinks


As memory falls below LOTSFREE, the vhand paging daemon wakes up and begins paging out 4-KB pages of memory. The eligible pages include process segments (text, data, and stack), shared memory segments, and the buffer cache. In other words, the buffer cache is shrunk by having vhand page out its pages. The buffer cache is treated by vhand as just another structure in memory with pages that it can dereference and free.

Like process text pages, buffer cache pages are not written out to the swap space. But, since their contents may have been modified, they could be flushed out to the file system before being placed back on the free page list.

NOTE: The kernel global value, dbc_steal_factor, determines how aggressive the vhand daemon is at stealing buffer cache pages in comparison to process pages. A value of 16 says to treat buffer cache pages no differently than process pages; the default value of 48 says to steal buffer cache pages three times as aggressively! However, if the buffer cache is referencing those pages, vhand will find few buffers to free up.

Buffer Cache Performance Implications


Because the buffer cache grows quickly into free memory and shrinks slowly by requiring vhand to page it out, one consideration is to limit the maximum size to which the buffer cache can grow. The default maximum size is 50% of total memory. This probably was a fairly reasonable number when the parameter was introduced, but with the very large memory systems existing nowadays, it's probably much too high. By setting the dbc_max_pct tunable kernel parameter to a smaller number (say 20 or 25), the buffer cache can still grow to a significant size, but will not be so large that it takes a long time to shrink when more processes become ready to execute.

Prior to HP-UX 11i, there was a definite performance penalty for having a buffer cache that was too large: it took a long time to search the cache to determine if the needed buffer was already there. Improvements in the search algorithm in 11i have reduced that penalty significantly.

Fixed vs. Dynamic Buffer Cache


Should you use a fixed-size buffer cache or a dynamic buffer cache? If your buffer cache requirements are constant over time, you should of course use a fixed-size buffer cache. Simply set the dbc_min_pct and dbc_max_pct parameters to the same value.

If your buffer cache requirements change over time, consider whether they change rapidly or slowly. There is some overhead associated with growing and shrinking the buffer cache, and shrinking the buffer cache is not a very fast operation. If your buffer cache requirements change slowly over time, it would be best to use a dynamic buffer cache. The overhead of growing and shrinking would be spread out and become relatively insignificant.


If, however, your buffer cache requirements change rapidly over time, you would probably be better served with a fixed-size buffer cache, properly sized to give you adequate buffers most of the time. Only on relatively rare occasions would the buffer cache be a bottleneck, and only for short periods. In the long run, your performance would be better than trying to deal with the rapidly changing needs using a dynamic buffer cache.

Sizing Buffer Cache


Here is a set of recommendations for properly sizing your buffer cache.

1. Are you getting at least a 90% read cache hit rate and a 70% write cache hit rate? If so, your buffer cache may already be larger than necessary. If you are experiencing no memory pressure, and no apparent disk bottlenecks, leave the buffer cache as it is.

2. If you are experiencing memory pressure or apparent disk bottlenecks, try shrinking the size of your buffer cache. Adjust dbc_max_pct down, in increments, no more than 10% at a time, until your performance figures fall to 90%/70%.

3. If you are not getting 90%/70% performance from your buffer cache, it may be too small or your application may be using it in an inefficient manner. Try increasing its size. If the figures improve, keep increasing the size until either you reach 90%/70% or your performance ceases to improve. Leave the size there.

4. If increasing the size of the buffer cache does not produce an immediate improvement in performance, your application may need to be tuned to use the buffer cache more efficiently. However, your buffer cache may still be larger than it needs to be. After you have tuned your application, recheck your buffer cache performance, as above.
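This tuning loop can be driven from the command line. A sketch (assuming an 11i v2 or later system where dbc_max_pct is adjustable with kctune; earlier releases use kmtune and require a kernel rebuild and reboot):

# sar -b 60 10              # watch %rcache (>= 90%) and %wcache (>= 70%)
# kctune dbc_max_pct=25     # lower the ceiling in steps of 10% or less, then re-measure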


818. LAB: Disk Performance Issues


Directions
The following lab illustrates a number of performance issues related to disks.

1. A file system is required for this lab. One was created in an earlier exercise. Mount it now.

# mount /dev/vg00/vxfs /vxfs

We also need to ensure that the controller does not have SCSI immediate reporting enabled. Enter the following command to check your current state (fill in the device file name as appropriate):

# scsictl -m ir /dev/rdsk/cXtXdX      (to report current "ir" status)

If the current immediate_report = 1, then enter the following:

# scsictl -m ir=0 /dev/rdsk/cXtXdX    (ir=1 to set, ir=0 to clear)

2. Copy the lab files to the file system.

# cp /home/h4262/disk/lab1/disk_long /vxfs
# cp /home/h4262/disk/lab1/make_files /vxfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /vxfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs


4. Open a second terminal window and start glance. While in glance, display the Disk Report (d key). Zero out the data with the z key. From the first window, time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real:           user:           sys:
glance Disk Report    Logl Rds:           Phys Rds:

5. At this point, all 20 MB of data is resident in the buffer cache. Re-execute the same command and record the results below:

# timex cat file* > /dev/null

real:           user:           sys:
glance Disk Report    Logl Rds:           Phys Rds:

NOTE: The conclusion is that I/O is much faster coming from the buffer cache than having to go to disk to get the data.

6. The sar -d report. Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the file system (and then removes the files).

# timex ./disk_long

How busy did the disk get?
What was the average number of requests in the I/O queue?
What was the average wait time in the I/O queue?
How much real time did the task take?

7. The glance I/O by Disk report. Exit from the sar -d report, and start glance again. While in glance, display the I/O by Disk report (u key). From the first window, re-execute disk_long, timing the execution. Record results below:

# timex ./disk_long

glance I/O by Disk Report    Util:           Qlen:


8. The glance I/O by File System report. Reset the data with the z key, and display the I/O by File System report (i key). From the first window, re-execute disk_long, timing the execution. Record results below:

# timex ./disk_long

glance I/O by File System Report    Logl I/O:           Phys I/O:

9. Performance tuning: immediate reporting. Ensure the immediate reporting option is set for the disk that the file system is located on. If immediate reporting is not set, set it.

# scsictl -m ir /dev/rdsk/cXtXdX      (to report current "ir" status)
# scsictl -m ir=1 /dev/rdsk/cXtXdX    (ir=1 to set, ir=0 to clear)

Purge the contents of the buffer cache.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs

10. The sar -d report. Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program (which writes 400 MB to the file system and then removes the files).

# timex ./disk_long

How busy did the disk get?
What was the average number of requests in the I/O queue?
What was the average wait time in the I/O queue?
How much real time did the task take?

How do the results of step 10 compare to the results in step 6?


________________________________________________________________


Module 9 HFS File System Performance


Objectives
Upon completion of this module, you will be able to do the following:

- List three ways HFS file systems are used.
- List basic HFS file system data structures.
- Identify HFS file system bottlenecks.
- Identify HFS kernel system parameters.


91. SLIDE: HFS File System Overview

HFS File System Overview

[Slide figure: three views of an HFS file system. The physical view shows tracks grouped into cylinder groups; the logical view shows the primary superblock followed by cylinder groups 1 through N, each holding data blocks; the internal cylinder group view shows data blocks surrounding the redundant superblock, cylinder group header, and inode table.]

Student Notes
The HFS model is the foundation for all the other file system variants. We will begin our discussion of file system performance using the HFS file system model.

The HP-UX File System


The HFS file system strategically lays out its data structures on disk to most efficiently utilize the geometry of the disk. The design of the HFS file system can best be explained by looking at the file system from three perspectives.

Physical View
From a physical disk perspective, the disk drive upon which a file system is placed contains sectors, tracks, platters, and disk heads. A key behavior of most all disk drives is that the disk heads move in parallel across the platters in such a way that each disk head is over the same track within each platter at the same time. To maximize the file system I/O throughput of the disk, it is desirable to have as many file blocks close to each other as possible, to minimize the time it takes to read or write the various blocks of a file. To help achieve this goal, the blocks on the disk are allocated to


the HFS file system in units called cylinder groups. A cylinder group is all the tracks from every platter of several adjacent cylinders, grouped together.
Cylinder Group Analogy

Consider a health spa or gym with three floors. Each floor contains a jogging track, and the three jogging tracks are located directly above or beneath one another from floor to floor. From this point of view, a cylinder group would be the same group of lanes from each floor's jogging track. In other words, all lane 1, 2, and 3 tracks would make up cylinder group 1; all lane 4, 5, and 6 tracks would make up cylinder group 2, etc.

By organizing space on disks in cylinder group units, the HFS file system can logically keep all the blocks of a given file close to each other. For example, in the slide above, the first 6 blocks of a file might be allocated as follows:

File block #1: Platter #1, Track #1, Sector #1
File block #2: Platter #1, Track #1, Sector #2
File block #3: Platter #1, Track #3, Sector #5
File block #4: Platter #1, Track #3, Sector #6
File block #5: Platter #2, Track #7, Sector #10
File block #6: Platter #3, Track #9, Sector #7

By allocating file system space in this manner, a multiple-block read (say, 6 blocks) can be satisfied with fewer than six separate reads. In the example above, file blocks 1 and 2 could be read with one read operation, followed by a head switch (no carriage movement) to track 3, another read for file blocks 3 and 4, a short seek to the next cylinder and a head switch to read file block 5, and a repeat for file block 6. Four reads could then read the six blocks. The more contiguous the blocks that make up the file, the more efficient the reads and writes can be.

Logical View
From a logical perspective, an HFS file system contains a series of cylinder groups. Even though the physical cylinder groups are laid out from top to bottom, spanning all the platters, logically we view the cylinder groups as horizontal units going from left to right. The HFS file system is made up of multiple cylinder groups, where the number of cylinder groups depends on the size of the file system. In the slide, we assume the HFS file system takes the whole disk; therefore, there are N cylinder groups in the sample file system. Typically, they are numbered from 0 to N-1.

A critical data structure contained within every HFS file system is the primary superblock. The primary superblock is located at the start of every HFS file system, at the start of the first cylinder group, and contains the critical header information for the HFS file system. Data structures contained within the superblock include the free block list, the mount flag, the starting address of each cylinder group, and much more.


Internal Cylinder Group View


Within each cylinder group, the following data structures exist:

Data Blocks
The data blocks are where files are stored within the cylinder group. The data blocks are distributed in such a way that a portion of the data blocks come before the cylinder group header structures and the rest come after the cylinder group header structures. This ensures that the cylinder group header structures are randomly placed throughout the cylinder groups.

Redundant Superblock
A redundant copy of the primary superblock is contained within each cylinder group. These redundant copies are kept to protect against the loss of the primary superblock. The locations of the redundant superblocks can be viewed by displaying the contents of the /etc/sbtab file. Should the primary superblock become lost or corrupted, the file system could still be recovered by executing the fsck command and specifying the location of one of the alternate superblocks.

Cylinder Group Header
The cylinder group header contains the header information for the cylinder group. This information includes the free blocks within the cylinder group, the starting addresses of the inode tables for that group, and a list of free inodes for the local inode table.

Inode Table
The inode table contains all the inodes (file header structures) for files located within the cylinder group. Every file within a file system is managed by an inode, which describes the attributes and location of the file. The inode table is divided into equal-sized sections and a section is stored in each cylinder group. Inodes within a cylinder group point to files usually contained within the same cylinder group.
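For example, recovering a file system whose primary superblock is corrupted might look like this (a sketch; the block number comes from /etc/sbtab, and the device file is hypothetical):

# cat /etc/sbtab                        # list the alternate superblock locations
# fsck -F hfs -b 16 /dev/rdsk/c0t6d0    # run fsck using the alternate superblock at block 16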


92. SLIDE: Inode Structure

Inode Structure

[Slide figure: an inode for a file, holding the file type, permissions, link count, owner, group, size, access/modification/change time stamps, and an array of data block pointers that reference the file's data blocks.]

Student Notes
An inode contains all the header information for a particular file. Every file has a corresponding inode, usually located within the same cylinder group as the file. Fields contained within the inode include:

- File type
- File access permissions
- Number of hard links to the file
- Owner and group of the file
- Size of the file in bytes
- Time stamps (file access, file modification, inode changes)
- Data block pointers (direct and indirect)

NOTE: Although the size of the inode differs from one type of file system to another, the basic types of data contained are virtually the same; the main differences are in the data pointer structures.


93. SLIDE: Inode Data Block Pointers

Inode Data Block Pointers

[Slide figure: three panels. Direct access: the inode points straight to the data blocks (2 logical I/Os needed to access each 8 KB of data). Single indirection: the inode points to an inode extension, which points to the data blocks (3 logical I/Os per 8 KB). Double indirection: the inode points through two levels of inode extensions to reach the data blocks (4 logical I/Os per 8 KB).]

Student Notes
One of the structures within each HFS inode is the array of data block pointers that reference the data blocks within the file. The size of the data block pointer array is 15 entries, meaning there are a maximum of 15 file system block addresses within the array. The first 12 addresses within the data block pointer array are direct access addresses. The thirteenth entry is a single indirection block address, the fourteenth is a double indirection block address, and the fifteenth (and last) entry is a triple indirection block address.

Direct Access
A direct access address points directly to a file's data block. When accessing a file using a direct access address, a minimum of two logical I/Os are needed: one I/O to access the file's inode (containing the direct access address), and one I/O to access the file's corresponding data block.


Single Indirection
Single indirection implies the address within the inode references a block on disk that acts as an inode extension block. The inode extension block, in turn, contains addresses that point to the file's corresponding data blocks. It should be noted that three logical I/Os are needed to access a file's data blocks using single indirection: one I/O for the file's inode, one I/O for the inode extension block, and one I/O for the data block itself.

Double Indirection
Double indirection means that access to a file's data blocks requires going through two inode extension blocks. The first inode extension block references the address of a second inode extension block, which contains addresses referencing the file's data blocks. Double indirection is needed only for files above 16 MB (with a default block size of 8 KB). When accessing files requiring double indirection, a total of four logical I/Os are required: an I/O for the file's inode, an I/O for each of the two inode extension blocks, and an I/O for the file's data block.

Triple Indirection
Triple indirection (not shown on the slide) adds one more level of indirection when accessing a file's data blocks. Triple indirection is only needed to access files larger than 32 GB (with a default block size of 8KB). NOTE: Every level of indirection adds an additional logical I/O when accessing the file's data. In the case of triple indirection, five logical I/Os are needed compared to two I/Os for direct access data blocks.
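These thresholds are easy to reproduce. A back-of-the-envelope sketch, assuming the default 8 KB block size and 4-byte block addresses (so one 8 KB extension block holds 2048 pointers):

# echo $((12 * 8))                          # direct pointers: 96 KB
# echo $((2048 * 8 / 1024))                 # single indirection adds 16 MB
# echo $((2048 * 2048 * 8 / 1024 / 1024))   # double indirection adds 32 GB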

As you can see, the performance of an HFS file system tends to favor small files (12 blocks or less) and tends to penalize large files that have to use single, double, or even triple indirection. You can delay this performance degradation somewhat by building the file system with larger block sizes. (More on that later in the module.)
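For example, a file system intended for large sequential files could be created with the maximum sizes up front (a sketch; the raw logical volume name is hypothetical, and these values cannot be changed after creation):

# newfs -F hfs -f 8192 -b 65536 /dev/vg00/rcustomlv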


94. SLIDE: How Many Logical I/Os Does It Take to Access /etc/passwd?

How Many Logical I/Os Does It Take to Access /etc/passwd?

(Slide diagram: inode 2, the root directory inode, points to directory block 74, which lists etc with inode 504; inode 504 points to directory block 717, which lists passwd with inode 1824; inode 1824 points to data block 2240, which holds the contents of /etc/passwd.)

Student Notes
The above slide illustrates how a file within the HFS file system is accessed. It may surprise some people when they find out how many logical I/Os are needed to access the /etc/passwd file.

Starting from the Top


When the full pathname of a file is specified for access (as in /etc/passwd), the kernel starts with the only inode it knows: the inode of the root directory of the root file system. Inode number 2 is always the inode of the root directory of any file system. / symbolizes (in the kernel) the root directory of the root file system. Using the slide as an example, after reading inode 2 of the root file system (first logical I/O), the kernel discovers that the contents of the root directory (the listing of the files contained in that directory) are located at file system block 74. Upon reading block 74 (second logical I/O), the names of the files in the root directory and their corresponding inode numbers are known. Directories are primarily listings of file names and the numbers of the inodes that manage them.


From this information, the kernel discovers the inode for the etc directory (in /) is 504. Inode 504 is then read (third logical I/O), and from that the kernel learns the etc directory is located at file system block 717. Block 717 is read (fourth logical I/O), and the file names and inodes contained within that directory are now known. One of the entries within block 717 is the passwd file and its corresponding inode number, 1824. Inode 1824 is read (fifth logical I/O), and from this the kernel finally learns that block 2240 contains the contents of the /etc/passwd file. Block 2240 is read (sixth logical I/O), and the kernel finally has the data it set out to access. So, the answer to the question at the top of the slide, "How many logical I/Os does it take to access /etc/passwd?" is . . . 6.
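The directory-to-inode mapping can be observed from the shell (the inode numbers on your system will differ from those on the slide):

# ls -id / /etc
# ls -i /etc/passwd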


95. SLIDE: File System Blocks and Fragments

File System Blocks and Fragments

(Slide diagram, top half: a row of 8 KB file system blocks divided into 1 KB fragments, holding FileA (1 fragment), FileB (2), FileC (2), FileD (4), FileE (5), and FileF (6), with the end of the disk marked. Bottom half: the same blocks after FileA, FileC, and FileD each grow by one fragment.)

Student Notes
The concept of blocks and fragments was introduced when the HFS file system was designed. There is always a tradeoff when managing a resource based on a fixed allocation unit size (the file system "block" in this case). If the block size is large, blocks can be managed with fewer pointers (less system overhead), but if it is too large, the resource is used inefficiently (very small files still consume a whole block). In the case of the HFS file system, this concern was addressed by making the block capable of uniform subdivision. The fragment was created for this purpose.

Definitions
Sector: A sector is the smallest unit of space addressable on the physical disk. The sector size is used when the disk is formatted to appropriately place timing markers on the platter. The default sector size for HP-UX and most UNIX systems is 512 bytes.

Fragment: A fragment is the increment in which space is allocated to files within the HFS file system. The default fragment size is 1 KB. This can be tuned when the HFS file system is initially created. Allowable sizes are 1 KB, 2 KB, 4 KB, and 8 KB.


File system block: A file system block is the minimum amount of data transferred to or from the disk when performing a disk I/O on an HFS file system. The default file system block size is 8 KB. This can be tuned when the HFS file system is initially created. Allowable sizes are 4 KB, 8 KB, 16 KB, 32 KB, and 64 KB.

Example Top Half


The top half of the slide shows the allocation of disk space when the following six files are created (assuming only 5 file system blocks are free within the HFS file system).

File A (size 1 KB): The kernel searches for the first free fragment. On the slide, the first fragment in the first file system block is allocated.

File B (size 2 KB): The kernel searches for the first 2-KB contiguous fragment that is available. This is in the same file system block in which FileA was allocated. The fact that FileA has already been allocated in this file system block does not matter; multiple files can be allocated within the same file system block. The first basic rule is: best fit on close.

File C (size 2 KB): The kernel searches for the first 2-KB contiguous fragment available. This is in the same file system block as FileA and FileB. Hence, FileC is allocated 2 KB from this same file system block. If any of these three files is accessed, then all three files are read into the file system buffer cache as a single unit.

File D (size 4 KB): The kernel searches for the first four contiguous 1-KB fragments available (within the same file system block). This is in the second file system block. The kernel does not allocate 3 fragments from the first file system block and 1 fragment from the second file system block, because that would require two logical I/Os to read in the entire 4 KB. This is inefficient, as only one I/O is required if the file is contained within the same file system block. The second basic rule is: if the size of a file is 8 KB or less, the kernel will fit the entire file within a single file system block.

File E (size 5 KB) and File F (size 6 KB): The kernel searches for the first available file system block that can hold the entire file. On the slide, FileE is allocated in file system block 3, and FileF is allocated in file system block 4.


Example Bottom Half


The bottom half of the slide illustrates how the growth of three files affects allocation within the HFS file system.

FileA (1 KB -> 2 KB): When FileA grows, it cannot grow into the next fragment because FileB is occupying that spot. Therefore, the kernel relocates FileA to the first free 2 KB within the same file system block. (Why transfer another block into memory at this point?)

FileC (2 KB -> 3 KB): When FileC grows, it cannot grow into the next fragment, because FileA is now in that spot. Therefore, the kernel relocates FileC to the first free 3 KB, which is in a different file system block (it can no longer fit into the first block). It selects block three because that block has a space exactly suited for the three-fragment FileC. The third basic rule is: if a file owns multiple fragments within the same block, they must be contiguous.

FileD (4 KB -> 5 KB): When FileD grows, it simply grows into the next fragment because it is still free.


96. SLIDE: Creating a New File on a Full File System

Creating a New File on a Full File System

New FileG (4 KB) is created


(Slide diagram: the post-growth layout from the previous slide, with FileG's four fragments placed together in a new file system block at the end of the disk.)

What happens when new FileH (1 KB) is created?

Student Notes
As an HFS file system becomes full, the performance impact of creating a new file becomes significant. This is due to the behavior of the kernel when creating a new file: when a new file is created on an HFS file system, the kernel tries to allocate a block-sized buffer in the buffer cache for the file to grow into. When the file is closed, the kernel allocates the file's fragments within an already allocated file system block, if possible.

FileG Is Created
In the example on the slide, FileG is opened/created as a new file. Not knowing the size to which FileG will grow, the kernel allocates a block-sized buffer in buffer cache for FileG to grow into. When FileG is closed, the kernel searches for a set of four contiguous 1KB fragments in a block. Since there are no shared blocks that have four contiguous fragments, the file is written to a new, empty block.

What Happens When FileH Is Created?


The impact of creating new files on a full file system can be seen when FileH is created. When FileH is opened for creation, the kernel allocates a block-sized buffer in buffer cache.


As it turns out, FileH is closed after writing only 1 KB worth of data. Upon closure, FileH is moved to the first free fragment of file system block 1. The fourth basic rule is: no fragment belonging to another file will be moved to make room for this file.

NOTE: Performance on HFS file systems typically degrades when free space falls below 10%, due to the length of time it takes to find free file system blocks for new files. For this reason, it is recommended that MINFREE always be 10% or greater, even for large file systems (greater than 4 GB).
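If minfree was lowered on an existing file system, it can be raised back to 10% without recreating the file system (a sketch; the raw device name is hypothetical):

# tunefs -m 10 /dev/vg00/rhfsvol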


97. SLIDE: HFS Metrics to Monitor Systemwide

HFS Metrics to Monitor Systemwide


Utilization of the file systems
File system I/O queue lengths
Amount of physical I/O to the file systems
File system free space
Open files for each process

Student Notes
When monitoring disk I/O activity, the main metrics to monitor are:

Percent utilization of the file systems: As utilization of the file system increases, so does the amount of time it takes to perform an I/O. According to the performance queuing theory, it takes twice as long to perform an I/O when the file system is 50% busy as it does when the file system is idle.

Requests in the file system I/O queue: The number of requests in the file system I/O queue is one of the best indicators of a file system performance problem. If the average number of requests is three or greater, then requests are waiting in the queue longer than the amount of time needed to service them.

Amount of physical I/O: If the amount of file system activity is high, it is important to investigate on which file system the activity is occurring.

File system free space: As an HFS file system becomes full (greater than 90%), it takes longer and longer to find an available free fragment for a new file or to grow an existing file. This creates additional disk activity, leading to slow file system performance.


Files opened with heavy access: For each process performing large amounts of file system I/O, inspect the names of the files being read or written. For files receiving high I/O activity (in glance, press <CR> frequently and watch how quickly the offset to each file changes), consider relocating these files to other disks that are less busy.
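Most of these metrics can be sampled with tools already covered; for example (the interval and count are arbitrary):

# sar -d 5 10     # per-device %busy, avque, avwait, avserv
# bdf             # utilization and free space per file system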


98. SLIDE: Activities that Create a Large Amount of File System I/O

Activities that Create a Large Amount of File System I/O


File writes on an almost full file system
Long, inefficient PATH variables
Deep subdirectory structures
Accessing large files sequentially with a small READ block size
Accessing many files on a single disk

Student Notes
Common causes of disk-related performance problems are shown on the slide.

Full file systems cause excessive I/O due to the work of locating free fragments.

Long, inefficient PATH variables cause excessive directory I/O (especially when the command is found in the last directory within the PATH variable).

Deep subdirectories cause many logical I/Os (two logical I/Os for each subdirectory in the full path name).

Sequential file access with a small read block size causes excessive amounts of physical I/O.

Accessing many files on one file system, versus spreading them across several, creates an imbalance of utilization. This leads to performance problems on the busy file systems and underutilization of the others.
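A quick way to audit a search path is to print it one directory per line and look for duplicates or NFS-automounted entries (a sketch):

# echo $PATH | tr ':' '\n'
# echo $PATH | tr ':' '\n' | wc -l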


99. SLIDE: HFS I/O Monitoring: bdf Output

HFS I/O Monitoring: bdf Output

# bdf
Filesystem            kbytes     used    avail  %used  Mounted on
/dev/root              81920    38018    40901    48%  /
/dev/vg00/lvol1        47829    22403    20643    52%  /stand
/dev/vg00/lvol6       286720   257116    28003    90%  /usr
/dev/vg00/lvol4       360448   346127    13444    96%  /opt
/dev/dsk/c0t4d0      1177626  1113204        0   100%  /disk
/dev/vg00/lvol7       122880   102098    19257    84%  /var
/dev/vg00/lvol5        53248    22589    28549    44%  /tmp

Student Notes
The bdf report shows how much file system space is being used (and how much is free) for all file systems currently mounted on the system. The key fields are:

avail: Indicates the amount of disk space available on the file system (in KB).
%used: Indicates the percentage of disk space used.

The slide shows three file systems at 90% usage or more, and one file system at 100% utilization. Recall that when an HFS file system becomes full, performance on that file system suffers due to fragments being moved. The good news is that the amount of free space held back by the file system parameter MINFREE is already subtracted from these values. In fact, if you compare the kbytes, used, and avail columns, you'll see that something is missing: used + avail do not add up to kbytes. The difference is MINFREE. For example, look at /stand. Clearly, 22403 + 20643 does not equal 47829. In fact, 22403 + 20643 divided by 47829 equals 90%, indicating that MINFREE must be set to 10% for this file system.
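The arithmetic can be verified from the shell; the result comes out just under 90%, confirming a MINFREE of 10%:

# echo "scale=6; (22403 + 20643) / 47829 * 100" | bc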


910. SLIDE: HFS I/O Monitoring: glance File System I/O

HFS I/O Monitoring: glance File System I/O

B3692A GlancePlus B.10.12    06:39:52   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util                                                     |100%   100%  100%
Disk Util                                                     | 83%    22%   84%
Mem  Util                                                     | 94%    95%   96%
Swap Util                                                     | 21%    21%   22%
--------------------------------------------------------------------------------
IO BY FILE SYSTEM                                             Users=    4
Idx  File System   Device               Type     Logl IO       Phys IO
--------------------------------------------------------------------------------
  1  /             /dev/root            vxfs      0.3/  0.6     0.0/  0.0
  2  /stand        /dev/vg00/lvol1      hfs       0.0/  0.0     0.0/  0.0
  3  /var          /dev/vg00/lvol9      vxfs      1.0/  1.8     0.1/  0.3
  4  /usr          /dev/vg00/lvol8      vxfs      9.2/  2.8     1.5/  0.6
  5  /tmp          /dev/vg00/lvol7      vxfs      0.0/  0.0     0.1/  0.0
  6  /opt          /dev/vg00/lvol6      vxfs      0.0/  0.0     0.0/  0.0
  7  /home.lvol5   /dev/vg00/lvol5      vxfs      0.0/  0.0     0.0/  0.0
  8  /export       /dev/vg00/lvol4      vxfs      0.0/  0.0     0.0/  0.0
  9  /disk         /dev/vg01/lvol1      vxfs    463.8/ 86.4   105.8/ 20.1
 10  /cdrom        /dev/dsk/c1t2d0      cdfs      0.0/  0.0     0.0/  0.0
 11  /net          e2403roc:(pid604)    nfs       0.0/  0.0     0.0/  0.0
Top disk user: PID 3603, disc 104.0 IOs/sec          S - Select a Disk

Student Notes
The glance file system I/O report (i key) shows activity on a per file system basis. Only total I/O activity (not reads versus writes) is shown with this report. This report is similar to the logical volume report (discussed in the previous module) except this report shows logical I/O compared to physical I/O, and does not distinguish between read and write activities. The logical volume report shows reads compared against writes, but does not distinguish between logical and physical activities. From the report on the slide, we note that all the file system activity is being performed against one file system. Note: The file system I/O report shows I/O activity for all types of mounted file systems, including CDFS file systems and NFS-mounted file systems.


911. SLIDE: HFS I/O Monitoring: glance File Opens per Process

HFS I/O Monitoring: glance File Opens per Process

B3692A GlancePlus B.10.12    06:44:39   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util                                                     |100%   100%  100%
Disk Util                                                     | 83%    22%   84%
Mem  Util                                                     | 94%    95%   96%
Swap Util                                                     | 21%    21%   22%
--------------------------------------------------------------------------------
Open Files for PID: 3911, disc        PPID: 2410   euid: 0   User: root

                                                          Open   Open
FD  File Name                                       Type  Mode   Count  Offset
--------------------------------------------------------------------------------
 0  /dev/pts/1                                      chr   rd/wr  6      13582826
 1  /dev/pts/1                                      chr   rd/wr  6      13582826
 2  /dev/pts/1                                      chr   rd/wr  6      13582826
 3  <reg,vxfs,inode:3024,/...ol9,vnode:0x00f9e000>  reg   read   1      85
 4  /stand/file5                                    reg   write  1      32768
10  /dev/null                                       chr   read   2      0

Student Notes
The glance open files report (F key), available only from the select process report (s key), shows the names of files opened for the currently selected process. Sometimes the full path name of the file is shown; otherwise, the inode number and device name are shown, and you have to translate that information into a file name.

NOTE: To determine the full path name of a file, given its inode number and logical volume name, use the ncheck command:

# ncheck -F vxfs -i [inode #] [device name]

Another way to determine the full path name of a file, given its inode number and mount point, is to use the find command:

# find [mountpoint of device] -inum [inode #] -xdev
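Applying this to the report above, which shows inode 3024 on /dev/vg00/lvol9 (mounted on /var in the earlier file system report), either of the following should recover the path name (a sketch; the output depends on your system):

# ncheck -F vxfs -i 3024 /dev/vg00/lvol9
# find /var -inum 3024 -xdev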


To determine whether I/O activity is occurring against a file, enter the open file report for a particular process, and press <CR> multiple times in succession. Watch the offset field for each file. If the offset field is constantly changing, it indicates the file is currently being accessed.

Performance Scenario
A system is experiencing slow performance due to high file system utilization. Upon further investigation, not all file systems are heavily utilized. In fact, some show no activity at all. By sorting the processes within glance by disk I/O activity, then selecting those processes to obtain further details, you can determine which files are getting the majority of the activity. To take advantage of the underutilized file system, move the heavily accessed files to this file system and create a symbolic link to the file from its original location, thereby removing a heavily accessed file from a busy file system and putting it on an underutilized file system.
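The relocation itself is straightforward (a sketch; the paths are hypothetical, and any process holding the file open should be stopped first):

# mv /busyfs/hotfile /idlefs/hotfile
# ln -s /idlefs/hotfile /busyfs/hotfile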


912. SLIDE: Tuning an HFS I/O-Bound System: Tune Configuration for Workload

Tuning an HFS I/O-Bound System: Tune Configuration for Workload


Tune the following parameters, based on workload:

File system block and fragment sizes
Blocks per cylinder group (maxbpg)
File system mount options

The mkfs options apply when creating the file system; the tunefs options can modify parameters on existing file systems.

Tune other configurations, based on workload:

Optimize $PATH variables
Use flat directory structures when possible
Ensure sufficient free space exists on file systems

Student Notes
Every workload and every application is different. Each has different resource requirements, and each places different demands on the system. There is no one configuration that is optimal for all applications. For example, CAD-CAM applications stress memory (and graphics); accounting applications that do forecasting stress the CPU; NFS-based applications stress the disks (and the network); and RDBMS applications stress all resources.

File System Blocks and Fragments


Tips and notes for choosing the sizes of file system blocks and fragments follow:

Fragments

Fragment sizes can be 1, 2, 4, or 8 KB.
Fragments can be 1/8, 1/4, 1/2, or equal to the file system block size.
For large files that are opened and closed frequently during their growth, large fragments are recommended.
For file systems with many small files, small fragments are recommended.

File System Blocks

File system block sizes can be 4, 8, 16, 32, or 64 KB.
For file systems with large files, large file system blocks are recommended.
For file systems with large files, increase maxbpg (maximum blocks per cylinder group).
For applications that perform a lot of sequential I/O (with read-aheads and write-behinds), large file system blocks are recommended.

HFS Mount Options


The mount options affect performance by specifying when files on the file system are updated. These options can be specified in the options column of the /etc/fstab file. The HFS-specific mount options include:

behind: Enable, when possible, asynchronous writes to disk. This is the default for workstations. It does not use the sync daemon.

delayed: Enable delayed or buffered writes to disk. This is the default for servers. It does use the sync daemon.

fs_async: Enable relaxed (asynchronous) posting of file system metadata (changes to the superblocks, inodes, and so on). This option may improve file system performance, but increases exposure to file system corruption in the event of a power failure.

no_fs_async: Force rigorous (synchronous) posting of file system metadata to disk. This is the default.
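As a sketch (the device and mount point are hypothetical), a file system using delayed writes would carry the option in the fourth field of its /etc/fstab entry:

/dev/vg00/lvol4 /data hfs delayed 0 2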

mkfs Options
mkfs is usually not executed directly; it is called by newfs -F hfs instead. File system tuning is best accomplished when the file system is created. The workload for a file system should be well understood before serious attempts are made to tune one. Many options also depend on the type of physical device on which a file system is being created. The HFS-specific options include:

size: The size of the file system in DEV_BSIZE blocks (the default is the entire device).

largefiles: The maximum size of a file can be up to 128 GB.

nolargefiles: The maximum size of a file is limited to 2 GB.

ncpg: The number of cylinders per cylinder group (range 1-32; the default is 16).

minfree: The minimum percentage of free disk space reserved for non-root processes (the default is 10%). Beginning with HP-UX 10.20, the bdf command does not conceal this free space and, as a result, reports free disk space accurately. This means a file system can no longer show 111% utilization.


nbpi: The number of bytes per inode. This value determines how many inodes are allocated given a file system of a certain size. (The default is 6144.)

tunefs Options

Some parameters can be changed after the file system has been created, with tunefs(1M). These are minfree and maxbpg. minfree is explained above.

maxbpg: The maximum number of data blocks that a large file can use out of a cylinder group before it is forced to continue its growth in a different cylinder group. This value does not apply to any file whose size is 12 blocks or less.

tunefs can also be used to display the current parameters of an HFS file system:

# tunefs -v /dev/<vg>/<lvol>
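For example (a sketch; the raw device name is hypothetical), a file system dedicated to large sequential files might have maxbpg raised:

# tunefs -e 2048 /dev/vg00/rhfsvol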

Other Configurations
Optimize $PATH

The PATH variable in a user's environment specifies a list of directories to search when a command is entered. Having an excessive number of directories or duplicate directories to search can increase disk access, particularly when the user makes a mistake typing a command. This problem can be greatly exacerbated if the user's PATH variable contains directories that are mounted automatically with the NFS automount utility, causing the network mount of a file system because of a typographical error.
Use Flat Directory Structures

Long directory path names create more work for the system because each directory file and its associated inode entry require a disk I/O to bring them into memory. Recall that six logical I/Os were required to read the /etc/passwd file. Conversely, you don't want thousands of files in the same directory, as it would take many I/O operations to read and search the directory.
Ensure Sufficient Freespace

As the file system becomes full (greater than 90%), the kernel begins to take longer and longer to find available free fragments. The algorithm gets very lengthy when the file system free space falls below 10%. Of course, if you do not have any files that grow and you are not adding any new files, this would waste 10% of your file system free space for no reason.


913. SLIDE: Tuning an HFS I/O-Bound System: Use Fast Links

Tuning an HFS I/O-Bound System: Use Fast Links

(Slide diagram: on the left, a standard symbolic link; inode 12 for /usr/data points to data block 74, whose contents are the path /data. On the right, an HP fast link; the path /data is stored directly within inode 12, so no data block is needed.)

Student Notes
There are two ways symbolic links can be stored on HFS file systems.

Standard Symbolic Links


Standard symbolic links are implemented in the same way as they are on other UNIX systems. The inode for the symbolic link points to a data block on disk, and the contents of the data block contains the name of the file being referenced by the symbolic link. In the example on the slide, /usr/data is the symbolic link with an inode number of 12. The contents of inode 12 contain an address pointer to data block 74, and the contents of data block 74 contain the name of the file being referenced (in the example, /data). Two logical I/Os are required to resolve the symbolic link, one I/O to retrieve the inode and one I/O to retrieve the data block containing the referenced name.

HP Fast Links
HP fast links allow symbolic links to be resolved with one logical I/O instead of two. HP fast links store the name of the referenced file in the inode of the symbolic link itself, rather than in a data block that the inode references. In the example, when the inode (12) of the symbolic link is retrieved, the contents of the inode contain the name of the referenced file.


HP fast links can be configured by setting the tunable OS parameter create_fastlinks to 1 and recompiling the kernel. Upon booting from the new kernel, all future symbolic links created will use HP fast links. No existing standard symbolic links will be automatically converted to fast symbolic links; the standard symbolic links would have to be removed and then recreated to convert them. Fast symbolic links only work for link destinations that can be expressed in 59 characters or less, as this is the limit of the space within the inode where the fast link information is stored. If a symbolic link contains more than 59 characters, it will be stored as a standard symbolic link, regardless of the value of create_fastlinks.
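One possible sequence on an 11.x system (a sketch; kernel build steps vary by release, so treat this as an outline rather than a recipe):

# kmtune -s create_fastlinks=1   # stage the new tunable value
# mk_kernel                      # build the new kernel
# kmupdate                       # install it for the next boot, then reboot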

Transition Links
Saving one logical I/O when accessing a symbolic link may not seem significant, until you consider that HP-UX makes heavy use of transition links (which are an implementation of symbolic links). Transition links allow an HP-UX file system to contain older 9.x directory paths. The 9.x directory names are symbolic links that point to the correct, current location (for example, /bin -> /usr/bin). Many HP-UX installations have applications (including HP-UX applications) that rely on and make heavy use of transition links. A quick performance gain for all HP-UX systems is to convert these transition links from standard symbolic links to HP fast links. The procedure for making this conversion is:

1. Recompile the kernel to use HP fast links (that is, set create_fastlinks to 1).
2. Shut down and reboot the system.
3. Execute tlremove to remove all the transition links from the system. Over 500 links will be removed.
4. Execute tlinstall to reinstall (that is, recreate) the transition links. When the links are reinstalled, they will be created as HP fast links.


914. LAB: HFS Performance Issues: Directions


The following lab illustrates a number of performance issues related to HFS file systems.

1. A 512-MB HFS file system is required for this lab. Use the mount and bdf commands to determine whether such a file system is available.

# mount -v
# bdf

If there is no such HFS file system available, create one using the commands below:

# lvcreate -n hfs vg00
# lvextend -L 512 /dev/vg00/hfs /dev/dsk/cXtYdZ    (second disk)
# newfs -F hfs /dev/vg00/rhfs
# mkdir /hfs
# mount /dev/vg00/hfs /hfs

2. Copy the lab files to the newly created HFS file system.

# cp /home/h4262/disk/lab1/disk_long /hfs
# cp /home/h4262/disk/lab1/make_files /hfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /hfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /hfs
# mount /dev/vg00/hfs /hfs
# cd /hfs


4. Time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real:          user:          sys:

5. In a second window, start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

How busy did the disk get?
What was the average number of requests in the I/O queue?
What was the average wait time in the I/O queue?
How much real time did the task take?

6. Performance tuning: recreate the file system with larger fragment and file system block sizes. Tuning the size of the fragments and file system blocks can improve performance for sequentially accessed files. The procedure for creating a new file system with customized fragments of 8 KB and file system blocks of 64 KB is shown below:

# lvcreate -n custom-lv vg00
# lvextend -L 512 /dev/vg00/custom-lv /dev/dsk/cXtYdZ
# newfs -F hfs -f 8192 -b 65536 /dev/vg00/rcustom-lv
# mkdir /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs

7. Copy the lab files to the customized HFS file system, execute the make_files program, and purge the buffer cache.

# cp /hfs/disk_long /cust-hfs
# cp /hfs/make_files /cust-hfs
# cd /cust-hfs
# ./make_files
# cd /
# umount /cust-hfs


# mount /dev/vg00/custom-lv /cust-hfs
# cd /cust-hfs

8. Time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real:          user:          sys:

How do the results of step 8 compare to the default HFS block and fragment results from step 4?
_______________________________________________________________________

9. Performance tuning: change file system mount options. The manner in which the file system is mounted can impact performance. The fs_async mount option can improve performance, but metadata integrity is not as reliable in the event of a crash, and fsck could run into difficulties.

# cd /
# umount /hfs
# mount -o fs_async /dev/vg00/hfs /hfs
# cd /hfs

10. In a second window, start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

How busy did the disk get?
What was the average number of requests in the I/O queue?
What was the average wait time in the I/O queue?
How much real time did the task take?

How do the results of step 10 compare to the default mount options in step 5? _____________________________________________________________________



Module 10 VxFS Performance Issues


Understand JFS structure and version differences
Explain how to enhance JFS performance
Set block sizes to improve performance
Set intent log size and rules to improve performance
Understand and manipulate synchronous and asynchronous I/O
Identify JFS tuning parameters
Understand and control fragmentation issues
Evaluate the overhead of online backup snapshots


101. SLIDE: Objectives

Objectives
Upon completion of this lesson, you will be able to:

Understand JFS structure and version differences
Explain how to enhance JFS performance
Set block sizes to improve performance
Set intent log size and rules to improve performance
Understand and manipulate synchronous and asynchronous I/O
Identify JFS tuning parameters
Understand and control fragmentation issues
Evaluate the overhead of online backup snapshots

Student Notes
Upon completion of this module, you will be able to do the following:

Understand JFS structure and version differences

These course notes are based on the JFS Version 3.5 file system, built on the Version 4 disk layout. The next few slides describe the basic differences between versions and relate them to HP-UX releases. HP JFS 3.5 and HP OnlineJFS 3.5 are available for HP-UX 11i and later systems. The standard (base) version of HP JFS has been bundled with HP-UX since release 10.01. The advanced HP OnlineJFS is a purchasable product with additional administrative features for higher availability and tunable performance. These notes will make clear which features belong to the base product and which belong to the OnlineJFS version. The Operating Environment delivery model of HP-UX 11i includes JFS as follows:

HP-UX 11i OE                      BaseJFS 3.3
HP-UX 11i Enterprise OE           OnlineJFS 3.3
HP-UX 11i Mission Critical OE     OnlineJFS 3.3


You can download JFS 3.5 for HP-UX 11i for free from the HP Software Depot (http://www.software.hp.com), or you can request a free JFS 3.5 CD from the Software Depot. You can purchase HP OnlineJFS 3.3 (product number B3929CA for servers and product number B5118CA for workstations) for HP-UX 11.0 or HP-UX 11i from your HP sales representative. JFS 3.5 is included with HP-UX 11i systems.

Explain how to enhance JFS performance

The HFS file system uses block-based allocation schemes, which provide adequate random access and latency for small files but limit throughput for larger files. As a result, the HFS file system is less than optimal for commercial environments. VxFS addresses this file system performance issue through an alternative allocation scheme and increased user control over allocation, I/O, and caching policies.

Set block sizes to improve performance

It is often advantageous to match the block size of a file system to the I/O size of the application. We will show you how!

Set intent log size to improve performance

The JFS intent log provides for rapid fsck recovery after a system crash. In general, the intent log is not protecting your data; the focus is on structural integrity, not data integrity! Fast fsck comes at a price, and that price is performance. Setting the correct intent log size is important, as it cannot be changed once a file system is created.

Understand and manipulate synchronous and asynchronous I/O

Programmers and database providers do different types of I/Os to obtain the best possible balance between data integrity and performance. We will investigate all the gray areas and tune the JFS file system to meet our administrative and performance goals, which might be quite different from those of the programmer!

Identify JFS tuning parameters

The JFS is tunable through mount options, the command line, configuration files, and kernel parameters. We will learn where and how to tune.

Understand and control fragmentation issues

The extent-based file allocation design of JFS is ideal for the performance of large files. One weakness of this approach is the potential fragmentation of files and free space over the life of the file system. In general, this will only occur in dynamic, work-file-oriented JFS file systems (for example, a mail server) and is unlikely in file systems of fixed large files where the major I/O rates occur to static files (for example, a database). We will investigate ways of measuring and fixing fragmentation.


Evaluate the overhead of online backup snapshots

OnlineJFS supports online backups via snapshot mounts. We will discuss the performance issues involved when working with snapshots.


102. SLIDE: JFS History and Version Review

JFS History and Version Review


JFS introduced in 1995 with HP-UX 10.01
Version 2 structure at introduction
Version 3 structure at 10.20 allows 1 TB files
Version 4 structure at 11.00 allows more tunable controls and supports ACLs
Do not use V4 structure on 11.00 for /, /usr, /opt, /var
vxupgrade(1M) tool can migrate up through versions (not down!)
11i delivers JFS 3.5 software on V4 structure
Differences between Base JFS 3.5 and OnlineJFS 3.5

Student Notes
The HP-UX Journaled File System (JFS) was introduced by HP in August 1995, on the HP-UX 10.01 release. The journaled file system attempts to improve on the high-performance file system (HFS) by offering the following enhancements:

Extent-based allocation of disk space
Fast file system recovery through an intent log
Greater control and flexibility of file system behavior through new mount options and tunable options


Disk Layout Versions


Version 1: The Version 1 disk layout was never used in HP-UX.

Version 2: The Version 2 disk layout has the following changes and features:

Many internal JFS structures are dynamic files themselves.
Internal filesets separate data files (User Fileset) from structural files (Structural Fileset).
Allocation units now contain data and data map structures only; inode tables are elsewhere.
Inode allocation is dynamic and cannot run out.
Optional support for quotas.

Version 3: The Version 3 disk layout offers additional support for:

Files up to one terabyte
File systems up to one terabyte
Indirect inode extent maps that can address variable-length file extents. V2 restricts all indirect extents to the size of the first indirect extent. Hence, large files and sparse files are possible with less overhead.

Version 4: Version 4 is the latest disk layout:

The Version 4 disk layout supports Access Control Lists.
The Version 4 disk layout does not include significant physical changes from the Version 3 disk layout. Instead, the policies implemented for Version 4 are different, allowing for performance improvements, file system shrinking, and other enhancements.
HP-UX 11i with the Version 4 layout now supports both files and file systems up to 2 TB in size.

Table: Matching HP-UX version to JFS version


HP-UX Release           VxFS Version   Supported Disk Layouts   Default Disk Layout
10.10                   2.3            2                        2
10.20                   3.0            2,3                      3
11.00 with JFS 3.1      3.1            2,3                      3
11.00 with JFS 3.3      3.3            2,3,4                    3
11i v1                  3.3            2,3,4                    4
11i v2                  3.5            2,3,4                    4

vxupgrade(1M)
The vxupgrade command can upgrade an existing Version 3 VxFS file system to the Version 4 layout while the file system remains online. vxupgrade can also upgrade a Version 2 file system to the Version 3 layout. See vxupgrade(1M) for details on upgrading VxFS file systems. You cannot downgrade a file system that has been upgraded.
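For example, a mounted Version 3 file system could be moved to the Version 4 layout online (a sketch; the mount point is hypothetical):

# vxupgrade -n 4 /disk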


NOTE: You cannot upgrade the root (/) or /usr file systems to Version 4 on an 11.00 system running JFS 3.3. Additionally, we do not advise upgrading the /var or /opt file systems to Version 4 on an 11.00 system. These core file systems are crucial for system recovery. The HP-UX 11.00 kernel and emergency recovery media were built with an older version of JFS that does not recognize the Version 4 disk layout. If these file systems were upgraded to Version 4, your system might have errors booting with the 11.00 kernel as delivered, or booting with the emergency recovery media.

Comparing Base and Advanced JFS


Table: Comparing Base and OnlineJFS

Feature                                                     JFS 3.5   OnlineJFS 3.5
extent-based allocation                                        *           *
extent attributes                                              *           *
fast file system recovery                                      *           *
access control list (ACL) support                              *           *
enhanced application interface                                 *           *
enhanced mount options                                         *           *
improved synchronous write performance                         *           *
support for large files (up to two terabytes)                  *           *
support for large file systems (up to two terabytes)           *           *
enhanced I/O performance                                       *           *
support for BSD-style quotas                                   *           *
unlimited number of inodes                                     *           *
file system tuning [vxtunefs(1M)]                              *           *
online administration                                                      *
ability to reserve space for a file and set fixed
  extent sizes and allocation flags                                        *
online snapshot file system for backup                                     *
direct I/O, supporting improved database performance                       *
data synchronous I/O                                                       *
DMAPI (Data Management API)                                                *

How to Tell if JFS 3.5 Is Installed

To determine whether a vmunix file has JFS 3.5 compiled into it, you can run:

# what /stand/vmunix | grep libvxfs.a

or

# nm /stand/vmunix | grep vx_work

If you get output from either of these two commands, then the vmunix file has JFS 3.5 compiled into it, for example:



# what /stand/vmunix | grep libvxfs.a
        libvxfs.a: $Revision: CUPI80_BL2000_1108_2 Wed Nov 8 10:59:22 PST 2000 $
# nm /stand/vmunix | grep vx_work
[13585] |   9746968|   8|OBJT |LOCAL|0| .rodata|S$704$vx_worklist
[13587] |   9746976|   8|OBJT |LOCAL|0| .rodata|S$705$vx_worklist
        (many more vx_worklist_* and vx_workthread_* symbols follow)


103. SLIDE: JFS Extents

JFS Extents

(Slide diagram: a JFS inode whose data pointers are (start, length) pairs; extent 1 starts at block 40 with length 128, extent 2 starts at block 200 with length 64, and extent 3 starts at block 8 with length 5, with other files occupying the disk space in between.)
Student Notes
JFS allocates space to files in the form of extents - adjacent blocks of disk space treated as a unit. Extents can vary in size from a single block (minimum 1 KB in size) to many megabytes. Organizing file storage in this manner allows JFS to better support large I/O requests, with more efficient reading and writing to continuous disk space areas. JFS extents are represented by a starting block number and a block count. In the example on the slide, the first extent starts at block 40 and contains a length of 128 blocks (or 128 KB, assuming blocks are 1KB in size). When the file grew past the 128 KB size, JFS tried to increase the size of the last extent. Since another file was already occupying this location, a new extent was allocated, starting at block 200. This extent grew to a size of 64 KB, before encountering another file. At this point, a third extent was allocated at block 8. Initially, 8 KB were allocated to the third extent, but upon closing the file, any space not used by the last extent is returned to the operating system. Since only 5 KB were used, the extra 3 KB were returned.


Direct and Indirect Extents in Version 2 Disk Layout


Unlike the HFS inode, the vxfs inode is 184 (rather than 128) bytes long and contains direct and indirect pointers. In the HFS inode, the pointers address data blocks (8K by default) with 12 direct pointers and 3 additional indirect pointers for single, double and triple indirection. In reality triple indirection is rarely needed. Mapping large files in HFS is complex due to the levels of indirection needed to address many 8K blocks. The JFS (vxfs) inode has 10 direct pointers and three additional pointers for single, double, and triple indirect addressing. The pointers no longer address single blocks of data but rather large extents of data. It is unlikely that any indirect pointers will be needed at all as the 10 direct pointers can define large spaces due to the variant length of the extents themselves.

Version 3 and Version 4 Extent Mapping: Typed Extents


The above discussion is true only for the Version 2 disk layout. In addition to the above, in V3/V4 we also have typed extents, which basically allow any level of indirection, allowing very large files to be created from many small extents if required (this is not desirable, however!). Version 2 also imposes the limit that all indirect extents be the same size (direct extents can be variable in length). In V3/V4 we can have indirect extents of any size mix. V3/V4 will always attempt to use the simplest approach: the 10 direct pointers are used whenever possible. Inodes are converted to typed indirect when the file exceeds the capability of 10 direct extents.


104. SLIDE: Extent Allocation Policies

Extent Allocation Policies


Disk space allocation: the block size can be 1K, 2K, 4K, or 8K
Extents are predefined in free space (the power of 2 rule)
Preferred allocation rules
Largest single extent is 16 MB (with an 8K block size)
Full use of single indirection in default HFS would also be 16 MB
VxFS supports large files without indirection

Student Notes
Disk Space Allocation: The Block Size
Disk space is allocated by the system in 1024-byte device blocks (DEV_BSIZE). An integral number of device blocks are grouped together to form a file system block. VxFS supports file system block sizes of 1024, 2048, 4096, and 8192 bytes. The default block size is:

1024 bytes for file systems less than 8 gigabytes
2048 bytes for file systems less than 16 gigabytes
4096 bytes for file systems less than 32 gigabytes
8192 bytes for file systems 32 gigabytes or larger

The block size may be specified as an argument to the mkfs or newfs utility and may vary between VxFS file systems mounted on the same system. VxFS allocates disk space to files in extents. An extent is a set of contiguous blocks (up to 2048 blocks in size).
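For example (a sketch; the raw logical volume name is hypothetical), to force an 8 KB block size on a file system that would otherwise default to a smaller one:

# newfs -F vxfs -b 8192 /dev/vg00/rlvol5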


Extents in Free Space - Power of 2 Rule


Free space is described by bitmaps in each allocation unit. The allocation units are split into 16 sections. Each section has a series of bitmaps that represent all the possible extents with sizes from 1 block to 2048 blocks by powers of 2. The first bitmap represents all the blocks in the section as one block extents, the second as two block extents, the third as four block extents, etc. The first bitmap, of 2048 bits, represents the section as 2048 one-block extents. The second bitmap, of 1024 bits, represents the section as 1024 two-block extents. This continues for all powers of 2 up to the single bit that represents one 2048 block extent. The file system uses this bitmapping scheme to find an available extent closest in size to the space required. This keeps files as contiguous as possible for faster performance. The largest possible extent on a file in a VxFS file system (with the largest block size of 8 KB) is 2048 * 8 KB = 16 MB.

Preferred Allocation
The following rules are satisfied wherever possible starting with the preferred rules at the top and working down to less preferred rules. Allocate files using contiguous extent of blocks Attempt to allocate each file in one extent of blocks If not possible, attempt to allocate all extents for a file close to each other If possible, attempt to allocate all extents for a file in the same allocation unit

An allocation unit is an amount of contiguous (and therefore close together) file system space equal to 32 MB in size. It is roughly analogous to the HFS cylinder group, but is not dependent on the geometry of the disk drive in any way.


105. SLIDE: JFS Intent Log

JFS Intent Log

(Slide diagram: a timeline showing metadata updates in memory (superblock or inode table updates), JFS intent log writes, sync points, and a system crash.)

Student Notes
A key advantage of JFS is that all file system transactions are written to an Intent Log. The logging of file system transactions helps to ensure the integrity of the file system, and allows the file system to be recovered quickly in the event of a system crash.

How the Intent Log Works


When a change is made to a file within the file system, such as a new file being created, a file being deleted, or a file being updated, a number of updates must be made to the superblock, inode table, bit maps, and other structures for that file system. These changes are called metadata updates. Typically, there are multiple metadata updates, which take place every time a change is made to a file. With JFS, after every successful file change (also called a transaction), all the metadata updates related to that transaction get written out to a JFS Intent Log. The purpose of the Intent Log is to hold all completed transactions that have not yet been flushed out to disk. If the system were to crash, the file system could quickly be recovered by checking the file system and applying all transactions in the intent log. Since only completed transactions are logged, there is no risk of a file change being only partially updated (i.e. only some metadata


updates related to the transactions being logged, and other metadata updates related to the same transaction not being logged). The logging of only COMPLETED transactions prevents the file system from being out of sync due to a crash occurring in the middle of a transaction. Either the entire transaction is logged or none of it is. This allows the JFS intent log to be used in a recovery situation as opposed to a standard fsck. The JFS recovery is done in seconds, as opposed to a standard fsck that (on a big file system) could take minutes, or even hours.
Example

Using the example on the slide, assume that each file transaction requires from one to four metadata updates. After each successful file transaction, all the related metadata updates are written to the JFS intent log. After 30 seconds, all the metadata updates are written out to disk by the sync daemon, and a corresponding DONE record is written to the JFS intent log for each JFS transaction that was flushed during the sync. The system can now reuse that space in the JFS intent log for new JFS transactions. When a crash occurs (in our example, in the middle of a file transaction), the uncompleted transaction never has any metadata written to the JFS intent log; therefore only one transaction is in the JFS intent log since the last sync. Only this transaction needs to be redone and then the file system is recovered and in a stable state. Compare this with having to do a standard fsck.

Performance Impacts
The intent log size is chosen when a file system is created and cannot be subsequently changed. The mkfs utility uses a default intent log size of 1024 blocks. The default size is sufficient for most workloads. If the system is used as an NFS server, for intensive synchronous write workloads, or for dynamic work file loads with many metadata changes, performance may be improved using a larger log size. File data is not normally written to the intent log. However, if the application has designated to do synchronous writes and the writes are 32 KB or smaller, the file data will be written to the intent log, along with the meta-data. This behavior can be modified by mount options (discussed later in this module). With larger intent log sizes, recovery time is proportionately longer and the file system may consume more system resources (such as memory) during normal operation. There are several system performance benchmark suites for which VxFS performs better with larger log sizes. As with block sizes, the best way to pick the log size is to try representative system loads against various sizes and pick the fastest. The performance degradation occurs when the entire JFS intent log becomes filled with pending JFS transactions. In these situations, all new JFS transactions must wait for DONE records to arrive for the existing JFS transactions. Once the DONE records arrive, the space used by the corresponding transactions can be freed and reused for new transactions. Having to wait for DONE records to arrive can significantly decrease performance with JFS. In these cases, it is suggested the JFS file system be reinitialized with a larger JFS intent log.


CAUTION:

Network file systems (NFS) can generate a large number of metadata updates if accessed concurrently by multiple systems. For JFS file systems being exported for network access via NFS, it is strongly recommended that these file systems have an intent log size of 16 MB (the maximum size for the intent log).


10-6. SLIDE: Intent Log Data Flow

Intent Log Data Flow

[Slide diagram: a process issues a system call; the in-memory superblock, inodes, and bitmaps are updated and packaged as a JFS transaction; the transaction is written to the on-disk intent log, and the modified metadata then moves through the buffer cache to the on-disk superblock, inode, and allocation unit structures.]

Student Notes
The slide above shows a graphical representation of how JFS transactions are processed. A system call (for example, a write call) is issued; then:

1. All in-memory data structures related to the transaction are updated. These in-memory structures include the superblock, the inode table, and the bitmaps.
2. Once the in-memory structures are updated, a JFS transaction is packaged containing the modifications to the in-memory structures. This packaged transaction contains all the data needed to reproduce the transaction (should that be necessary).
3. Once the JFS transaction is created, it is written to the intent log. (When it is written depends on mount options.) At this point, control is returned to the system call.
4. Since the transaction is now stored on disk (in the intent log), there is no hurry to flush the in-memory data structures to their corresponding disk-based data structures. Therefore, the in-memory structures are transferred to the buffer cache, and the sync daemon flushes out these transactions within the next 30 seconds.
5. After the metadata structures are flushed out, a DONE record is written to the intent log indicating the transaction has been updated to disk, and the corresponding transaction no longer needs to be kept in the intent log.


10-7. SLIDE: Understand Your I/O Workload

Understand your I/O Workload


Is it data-intensive?
  few files, large chunks being shuffled around
Is it attribute-intensive?
  many files, small chunks being shuffled
Is the access pattern random or sequential I/O?
  Check for read(), write(), and lseek() system calls
What is the bandwidth and size of the I/Os? Are these consistent?
Spindles Win Prizes!
  LVM or VxVM Stripes
  Use XP Disk Arrays

Student Notes
Understand your I/O Workload
Tuning file system parameters to optimize performance can only be done effectively when you know what type of I/O the application is doing. It would be wrong to tune for a large block size and maximum contiguous space allocation if the application does many small random I/Os to many small files.

Data Intensive?
Commercial database applications generally deal with very large files in the table space and large I/Os to those files. Any high degree of small random I/O should be taken care of by the database's own buffers (System Global Area) and the HP-UX buffer cache (if it is being used). We may choose to increase the block size in this situation and tune for maximum read-ahead/write-behind. The following slides will cover this type of tuning.


Attribute Intensive?
Some applications generate many small I/Os to many small files. In this situation, a large block size and maximum read-ahead/write-behind would be inappropriate, generating more I/O than is necessary. A mail server or web server could be regarded as such an application.

Sequential or Random IO?


We need to characterize the I/O from an application as sequential or random. In general, sequential I/O will benefit from a larger block size and contiguous files, while random I/O will require a smaller block size to increase the number of blocks that can be maintained in the buffer cache. With sequential I/O we are more interested in maximizing the MB/sec throughput of the disk (as seen with sar or glance, etc.). With random I/O we will be looking at the I/Os-per-second metrics associated with the disk (r+w/s in sar). Remember that the fastest random I/Os we do are the ones that never go to the disk (!) because they are in the buffer cache (we hope). The Direct I/O feature of OnlineJFS 3.5 is an attempt to recognize when I/Os to a file are very large and sequential. Direct I/O will then attempt to bypass the buffer cache to benefit the generator of the large I/Os in question. Most applications are not designed to handle their own buffering and will lose a great deal of performance if they attempt to use Direct I/O.
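
One practical way to characterize the access pattern is to trace the application's system calls and look at the mix of read(), write(), and lseek() activity. For example, with the freely available (and unsupported) tusc trace utility, assuming it is installed and the application runs as process 1234:

# tusc -c 1234        attach to PID 1234 and print a summary count of its system calls

Frequent lseek() calls interleaved with reads and writes suggest random access; long runs of read() or write() with few seeks suggest sequential access.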

Disk Bandwidth
In the end we can only get so much performance out of a single spindle. Modern fast disks (10,000+ RPM, 5ms access time) can only provide an absolute maximum of approx 10 MB/s for very sequential I/O and around 150 I/Os Per Second. Once your file system is extracting these sorts of numbers (or even 50% of them!) you can consider that the hardware has become the limiting factor. Stop tuning and buy more disks! Remember that spindles win prizes. LVM or VxVM striping will help in this situation as the single spindle performance can be aggregated by the number of spindles. Using expensive RAID technology like the HP XP256, XP512, or XP1024 Disk Arrays will also improve apparent spindle performance. The author has seen a single XP512 logical device provide a sustained 60MB/s read performance for sequential I/O and over 1500 I/Os per second for a single threaded random application test to a single logical device.
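
To see whether a spindle is approaching these limits, watch the per-device figures in the standard sar disk report (interval and count below are illustrative):

# sar -d 5 12         report disk activity twelve times at 5-second intervals

The r+w/s column gives I/Os per second per device, and blks/s is in 512-byte blocks, so dividing blks/s by 2048 gives an approximate MB/s figure to compare against the numbers above.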


10-8. SLIDE: Performance Parameters

Performance Parameters
Things that an administrator can change to optimize JFS:
  Choosing a Block Size
  Choosing an Intent Log Size
  Choosing Mount Options
  Kernel Tunables
    Internal Inode Table Size
  Monitoring Free Space and Fragmentation
  Changing extent attributes on individual files
  I/O Tuning
    Tunable VxFS I/O Parameters
    Command Line
    Configuration file (/etc/vx/tunefstab)

Student Notes
We will discuss the following choices over the next slides. Note that some parameters can only be set when the file system is created.

At file system creation time (only):
  Choosing a Block Size
  Choosing an Intent Log Size

After file system creation:
  Choosing Mount Options
  Kernel Tunables
    Kernel Inode Table Size
  Monitoring Free Space and Fragmentation
  Changing extent attributes on individual files
  I/O Tuning
    Tunable VxFS I/O Parameters


10-9. SLIDE: Choosing a Block Size

Choosing a Block Size


Choose the right block size for the application.
Consider maximum block size (8K) for large file data base
  Small files will waste space
  System overhead will be less
  Files approaching 1GB are large
Consider minimum block size (1K) for small file mail server or web server
  More system overhead if files are large
Use large block size for sequential I/O application
Use small block size for random I/O application

Student Notes
You specify the block size when a file system is created; it cannot be changed later. The standard HFS file system defaults to a block size of 8K with a 1K fragment size. This means that space is allocated to small files (up to 12 blocks) in 1K increments. Allocations for larger files are done in 8K increments. Because many files are small, the fragment facility saves a large amount of space compared to allocating space 8K at a time. The unit of allocation in VxFS is a block. There are no fragments because storage is allocated in extents that consist of one or more blocks. The smallest block size available is 1K, which is also the default block size for VxFS file systems created on devices of less than 8 gigabytes. Choose a block size based on the type of application being run. For example, if there are many small files, a 1K block size may save space. For large file systems, with relatively few files, a larger block size is more appropriate. The trade-offs of specifying larger block sizes are: 1) a decrease in the amount of space used to hold the free extent bitmaps for each allocation unit, 2) an increase in the maximum extent size, and 3) a decrease in the number of extents used per file versus an increase in the amount of space wasted at the end of files that are not a multiple of the block size.


Larger block sizes use less disk space in file system overhead, but consume more space for files that are not a multiple of the block size. The easiest way to judge which block sizes provide the greatest system efficiency is to try representative system loads against various sizes and pick the fastest.
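
As a worked example of the trade-off: a 1025-byte file on an 8K block file system consumes one full 8K block (roughly 7 KB wasted), while on a 1K block file system it consumes two 1K blocks (under 1 KB wasted). Multiplied across tens of thousands of small files, the difference in usable space is substantial.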

Specifying the Block Size


The following newfs command creates a VxFS file system with the maximum block size and support for large files:

# newfs -F vxfs -b 8192 -o largefiles /dev/vgjfs/rlvol1

The block size for files on the file system represents the smallest amount of disk space that can be allocated to a file. It must be a power of 2 selected from the range 1024 to 8192. The default is 1024 for file systems less than 8 gigabytes, 2048 for file systems less than 16 gigabytes, 4096 for file systems less than 32 gigabytes, and 8192 for larger file systems.


10-10. SLIDE: Choosing an Intent Log Size

Choosing an Intent Log Size


Intent log size cannot be changed after file system creation
mkfs applies a default log size of 1024 blocks
Performance may improve when using a larger log size
  Synchronous write-intensive applications
NFS server will benefit from a 16MB (largest) log size

Student Notes
The intent log size is chosen when a file system is created and cannot be changed afterwards. The default intent log size chosen by mkfs is 1024 blocks and is suitable in most situations. For some types of applications (NFS server or intensive synchronous write loads), performance may be improved by increasing the size of the intent log. Note that recovery time will also be proportionally longer as the log size increases. Memory requirements for the log maintenance will also increase as the log size increases. Ensure that the log size is not more than 50% of the physical memory size of the system or fsck will not be able to fix it after a system crash. Ideal log size for NFS is 2048 with a file system block size of 8192.


Specifying the Intent Log Size


To create a VxFS file system with the default block size and a 16MB intent log:

# newfs -F vxfs -o logsize=16384 /dev/vgjfs/rlvol1

-o logsize= specifies the number of file system blocks to allocate for the transaction-logging area. It must be in the range of 32 to 16384 blocks. The minimum number for Version 2 disk layouts is 32 blocks. The minimum number for Version 3 and Version 4 disk layouts is the number of blocks that make the log no less than 256K. If the file system is:
  greater than or equal to 8MB, the default is 1024 blocks
  greater than or equal to 2MB, and less than 8MB, the default is 128 blocks
  less than 2MB, the default is 32 blocks

While logsize is specified in blocks, the maximum size of the intent log is 16384 KB. This means the maximum values for logsize are:
  16384 for a block size of 1024 bytes
  8192 for a block size of 2048 bytes
  4096 for a block size of 4096 bytes
  2048 for a block size of 8192 bytes
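
Putting the two limits together: to get the maximum 16384 KB intent log on a file system that also uses the maximum 8K block size, logsize must be expressed in 8K blocks (device name illustrative):

# newfs -F vxfs -b 8192 -o logsize=2048 /dev/vgjfs/rlvol1

(2048 blocks x 8 KB per block = 16384 KB of intent log.)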


10-11. SLIDE: Intent Log Mount Options

Intent Log Mount Options


Full logging*                        log
Delayed logging                      delaylog
Temporary logging                    tmplog
No logging                           nolog
Disallow small sync I/Os in log      nodatainlog (50% perf cost!)
Force clear new file blocks          blkclear (10% perf cost!)

*Note: only the first option is the default for mount

Student Notes
JFS offers mount options to delay or disable transaction logging to the intent log. This allows the system administrator to make trade-offs between file system integrity and performance. Following are the logging options:

Full logging (log)
File system structural changes are logged to disk before the system call returns to the application (synchronously). If the system crashes, fsck(1M) will complete logged operations that have not completed.

Delayed logging (delaylog)
Some system calls return before the intent log is written. This improves the performance of the system, but some changes are not guaranteed until a short time later when the intent log is written. This mode approximates traditional UNIX system guarantees for correctness in case of system failure.

Temporary logging (tmplog)
The intent log is almost always delayed. This improves performance, but recent changes may disappear if the system crashes. This mode is only recommended for temporary file systems.

No logging (nolog)
The intent log is disabled. The other three logging modes provide for fast file system recovery; nolog does not. With nolog mode, a full structural check must be performed after a crash. This may result in loss of substantial portions of the file system, depending upon activity at the time of the crash. Usually, a nolog file system should be rebuilt with mkfs(1M) after a crash. The nolog mode should only be used for memory resident or very temporary file systems.

nodatainlog
The nodatainlog mode should be used on systems with disks that do not support bad block revectoring. Normally, a VxFS file system uses the intent log for synchronous writes. The inode update and the data are both logged in the transaction, so a synchronous write only requires one disk write instead of two. When the synchronous write returns to the application, the file system has told the application that the data is already written. If a disk error causes the data update to fail, then the file must be marked bad and the entire file is lost. If a disk supports bad block revectoring, then a failure on the data update is unlikely, so logging synchronous writes can safely be allowed. If the disk does not support bad block revectoring, then a failure is more likely, so the nodatainlog mode should be used. A nodatainlog mode file system is approximately 50 percent slower than a standard mode VxFS file system for synchronous writes. Other operations are not affected.

blkclear
The blkclear mode is used in increased data security environments. The blkclear mode guarantees that uninitialized storage never appears in files. The increased integrity is provided by clearing extents on disk when they are allocated to a file. Extending writes are not affected by this mode. A blkclear mode file system is approximately 10 percent slower than a standard mode VxFS file system, depending on the workload.
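
The logging mode is selected per file system at mount time. A minimal sketch (device and mount point are illustrative):

# mount -F vxfs -o delaylog /dev/vgjfs/lvol1 /data

The same option can be made persistent by placing it in the options field of the file system's /etc/fstab entry, for example:

/dev/vgjfs/lvol1 /data vxfs delaylog 0 2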


10-12. SLIDE: Other JFS Mount Options

Other JFS Mount Options


mincache options (buffer cache)
  closesync*
  direct
  dsync
  unbuffered
  tmpcache
convosync options (synchronous I/O)
  closesync
  direct
  dsync
  unbuffered
  delay

* NOTE: This is the only additional option available with BaseJFS; all other options require OnlineJFS.

Student Notes
Understanding asynchronous, data synchronous (O_DSYNC) and fully synchronous (O_SYNC) application I/O.
When an application program opens a file with the open() system call, the programmer makes a decision on how the I/Os will occur between the application memory and the file system. The following three options are available, in order, ranging from highest performance (lowest integrity) to lowest performance (best integrity). In this discussion, integrity refers to the potential damage to file system structures and customer data during a system crash.

1. Asynchronous I/O (Standard Mode)    High performance / Low integrity

In asynchronous mode, all application I/Os are done to buffer cache, including data and inode modifications. The write() system call will return quickly to the application, which can continue in faith that the data will make it to the disk. Data integrity will be fully compromised by a system crash, and newly created files may even disappear.


2. Data Synchronous I/O (O_DSYNC)    Low performance / Good integrity

If the file is opened with the O_DSYNC flag, the file is in data synchronous mode. In this situation, write() system calls that modify data do not return until the disk has acknowledged the receipt of the data. However, some inode changes (time stamps, etc.) are still performed asynchronously and may not have arrived at the disk in the case of a system crash.

3. Synchronous I/O (O_SYNC)    Lowest performance / Best integrity

Fully synchronous behavior is obtained by opening the file with O_SYNC. All operations are now synchronous and write() system calls block for both data and inode modifications. Minimal damage will now occur in the event of a system crash.

mincache vs. convosync


mincache manipulates the behavior of the buffer cache. All of the mincache options except mincache=closesync require the OnlineJFS product (see slide).

convosync (convert osync) changes the behavior of data synchronous (O_DSYNC) and synchronous (O_SYNC) writes. All convosync options require OnlineJFS.

The mincache and convosync options generally control the integrity of the user data, whereas the log options (log, delaylog, tmplog, nolog) control the integrity of the metadata only.

mincache
mincache=closesync Flush data to disk synchronously when file is closed.

The mincache=closesync mode is useful in desktop environments where users are likely to shut off the power on the machine without halting it first. In this mode, any changes to the file are flushed to disk synchronously when the file is closed. To improve performance, most file systems do not synchronously update data and inode changes to disk. If the system crashes, files that have been updated within the past minute are in danger of losing data. With the mincache=closesync mode, if the system crashes or is switched off, only files that are currently open can lose data. A mincache=closesync mode file system is approximately 15 percent slower than a standard mode VxFS file system, depending on the workload.

mincache=direct        Bypass the buffer cache for all data and inode changes; forces fully synchronous behavior and totally skips the buffer cache.

mincache=unbuffered    Bypass the buffer cache for data only; inode changes are cached. Forces data synchronous-like behavior with no data in cache.

mincache=dsync         Equivalent to normal data synchronous behavior. Write does not return until data is on disk, but data does go through the buffer cache.

The mincache=direct, mincache=unbuffered, and mincache=dsync modes are used in environments where applications are experiencing reliability problems caused by the kernel buffering of I/O and delayed flushing of non-synchronous I/O. The mincache=direct and mincache=unbuffered modes guarantee that all non-synchronous I/O requests to files will be handled as if the VX_DIRECT or VX_UNBUFFERED caching advisories had been specified. The mincache=dsync mode guarantees that all non-synchronous I/O requests to files will be handled as if the VX_DSYNC caching advisory had been specified. Refer to vxfsio(7) for explanations of VX_DIRECT, VX_UNBUFFERED, and VX_DSYNC. The mincache=direct, mincache=unbuffered, and mincache=dsync modes also flush file data on close as mincache=closesync does.

mincache=tmpcache      Speeds up file growth by breaking data initialization rules.

The -o mincache=tmpcache option only affects write extending calls and is not available to files performing synchronous I/O. "Write extending" calls are write calls that cause new file system blocks to be assigned to the file, extending the size of the file in blocks. The normal behavior for write extending calls is to write the new user data first, and to write the metadata only after the user data. Write extending calls are expensive from a performance standpoint, because the write call has to wait for both the user data and the metadata to be written. A non-extending write call only requires the call to wait for the metadata.

With the -o mincache=tmpcache option, write extending calls do not have to wait for the user data to be written. This option allows the metadata to be written before the user data (and the write call to return before the user data is written), significantly improving performance.

CAUTION: The -o mincache=tmpcache option significantly increases the likelihood of non-initialized file system blocks (i.e. junk) appearing in files after a system crash. This is due to the file pointing to data blocks before the data is actually there. If the system crashes between the file's inode being updated (done first) and the user data being written (done second), then uninitialized data will appear in the file. The tmpcache option should only be used for memory resident or very temporary file systems.


convosync
NOTE: Use of the convosync=dsync option violates POSIX guarantees for synchronous I/O.

The convert osync (convosync) mode has five values: convosync=closesync, convosync=direct, convosync=dsync, convosync=unbuffered, and convosync=delay.

The convosync=closesync mode converts synchronous and data synchronous writes to non-synchronous writes and flushes the changes in the file to disk when the file is closed.

The convosync=delay mode causes synchronous and data synchronous writes to be delayed rather than to take effect immediately. No special action is performed when closing a file. This option effectively cancels any data integrity guarantees normally provided by opening a file with O_SYNC. See open(2), fcntl(2), and vxfsio(7) for more information on O_SYNC.

Caution! Extreme care should be taken when using the convosync=closesync or convosync=delay mode, because they actually change synchronous I/O into non-synchronous I/O. This may cause applications that use synchronous I/O for data reliability to fail if the system crashes and synchronously written data is lost.

The convosync=direct and convosync=unbuffered modes convert synchronous and data synchronous reads and writes to direct reads and writes, bypassing the buffer cache. The convosync=dsync mode converts synchronous writes to data synchronous writes. As with closesync, the direct, unbuffered, and dsync modes flush changes in the file to disk when it is closed.

These modes can be used to speed up applications that use synchronous I/O. Many applications that are concerned with data integrity specify O_SYNC in order to write the file data synchronously. However, this has the undesirable side effect of updating inode times and therefore slowing down performance. The convosync=dsync, convosync=unbuffered, and convosync=direct modes alleviate this problem by allowing applications to take advantage of synchronous writes without modifying inode times as well.

NOTE: Before using convosync=dsync, convosync=unbuffered, or convosync=direct, make sure that all applications that use the file system do not require synchronous inode time updates for O_SYNC writes.
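
For example, to relieve an O_SYNC-heavy application of the inode time updates described above, the file system could be mounted as follows (names are illustrative; verify first that nothing on the file system depends on synchronous inode time updates):

# mount -F vxfs -o convosync=dsync /dev/vg01/lvol4 /appdata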


10-13. SLIDE: JFS Mount Option: mincache=direct

JFS Mount Option: mincache=direct

[Slide diagrams. Left, "Data Flow with default mount options": Oracle process I/O passes through both the SGA database cache and the system buffer cache on its way to the ORACLE database on disk. Right, "Data Flow with mount option mincache=direct": the system buffer cache is bypassed, and I/O moves directly between the SGA database cache and the database.]

Student Notes
The above slide illustrates the impact of setting the -o mincache=direct option. By default, all JFS file system I/O goes through the system's buffer cache. When an application does its own caching (e.g. an Oracle database application), there are two levels of caching: one cache managed by the application, the other managed by the kernel. Using two caches is inefficient from both a performance and a memory usage standpoint (data exists in both caches). When the file system is mounted with the -o mincache=direct option, the system's buffer cache is bypassed and the data is written directly to disk. This improves performance and keeps the buffer cache available for other file systems that do not go through an application cache.
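
A typical combination for a file system holding only database files that are cached in the SGA pairs mincache=direct with convosync=direct, so that both regular and O_SYNC/O_DSYNC I/O bypass the buffer cache (device and mount point are illustrative):

# mount -F vxfs -o mincache=direct,convosync=direct /dev/vgora/lvdata /oradata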


CAUTION:

Use of the -o mincache=direct option can lead to a significant decrease in performance if used in the wrong situation. This option should only be used if:
1. An application creates and maintains its own data cache, and
2. All the files on the file system are cached in the application's data cache.
If there are some files being accessed on the mounted file system and these files are not being cached by the application, this option should not be used.

NOTE:

This option is only available with the OnlineJFS product.


10-14. SLIDE: JFS Mount Option: mincache=tmpcache

JFS Mount Option: mincache=tmpcache

[Slide diagrams. Left, "default": the user data is written to the newly allocated file system block (1) before the JFS transaction is written to the intent log (2). Right, "mincache=tmpcache": the JFS transaction is written to the intent log (1) before the user data is written (2). Both panels show the process, buffer cache, JFS transaction, intent log, allocation units, and the file on disk.]

Student Notes
By default, when a process performs a write extending call, the new data is written to disk before the file's inode is updated. In the slide above, the left side shows the default behavior:

1. Write data to the newly allocated file system block.
2. Write the JFS transaction metadata out to the disk. The system call returns.

The advantage of this behavior is that uninitialized data will not be found within the file should a system crash occur. This is important from a data integrity standpoint. The disadvantage of this behavior is slow performance, because the JFS transaction must wait for the user data I/O to complete before it can be written to the intent log.

Behavior with -o mincache=tmpcache Option


Performance can be improved (at the expense of data integrity) by mounting file systems with the -o mincache=tmpcache option. This option allows the JFS transactions to be written to the intent log before the user data is written to the file. In the slide, the right side shows the tmpcache behavior:

1. Write the JFS transaction out to disk. (The system call returns.)
2. Write data to the newly allocated file system block.

The advantage of this behavior is that performance of write extending calls is fast; the system does not wait for the user data to be written to disk. The disadvantage is that data integrity of the file is jeopardized, especially if the file is being updated at the time of a system crash. By updating the file's inode first, the file points to uninitialized data blocks which contain unknown data. The uninitialized file system blocks are expected to be initialized soon after the inode is updated; however, there still exists a small window of time when the file's inode references unknown data. If the system crashes during this small window, the file will still reference the uninitialized data after the crash.

CAUTION: The -o mincache=tmpcache option should only be used for memory resident or very temporary file systems.

H4262S C.00 10-34 2004 Hewlett-Packard Development Company, L.P.

http://education.hp.com

Module 10 VxFS Performance Issues

10-15. SLIDE: Kernel Tunables

Kernel Tunables
VxFS inodes are cached in memory, separate from HFS.
Kernel parameter ninode has no effect on VxFS.
When vx_ninode is zero (default), the inode cache is sized in proportion to system memory (see table).
vx_ncsize sets the directory name lookup cache (1KB)

Student Notes
Internal Inode Table Size
VxFS caches inodes in an inode table (see Table below, Inode Table Size). There is a tunable in VxFS called vx_ninode that determines the number of entries in the inode table. A VxFS file system obtains the value of vx_ninode from the system configuration file used for making the kernel (/stand/system for example). This value is used to determine the number of entries in the VxFS inode table. By default, vx_ninode is set to zero. The kernel then computes a value based on the system memory size.


Inode Table Size

Total Memory in Mbytes    Maximum Number of Inodes
      8                          400
     16                         1000
     32                         2500
     64                         6000
    128                         8000
    256                        16000
    512                        32000
   1024                        64000
   2048                       128000
   8192                       256000
  32768                       512000
 131072                      1024000

If the available memory is a value between two entries, the value of vx_ninode is interpolated.
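
For example, a system with 384 MB of memory falls between the 256 MB and 512 MB entries; assuming straight-line interpolation, the kernel would compute roughly 16000 + (384 - 256) / (512 - 256) x (32000 - 16000) = 24000 inode table entries. (The exact rounding is up to the kernel; this is only an illustration of the interpolation.)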

Other VxFS Kernel Parameters


vx_ncsize Controls the size of the DNLC (directory name lookup cache) in the kernel. Recent directory path names are stored in memory to improve performance. This parameter is set in DNLC entries. The size of the DNLC is set to the sum of ninode and vx_ncsize.
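
On HP-UX releases of this vintage, both values can be inspected and staged with kmtune; static tunables such as these take effect only after a kernel rebuild and reboot. A sketch, with illustrative values:

# kmtune -q vx_ninode            query the current setting
# kmtune -s vx_ninode=128000     stage a new value for the next kernel build
# kmtune -s vx_ncsize=16384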


10-16. SLIDE: Fragmentation

Fragmentation
Keep file system free space over 10%
Maintain free space distribution goals
Monitor with df(1M) or fsadm(1M)
Repack files and free space with fsadm -e
  Reduces the number of extents in large files
  Makes small files contiguous (one extent)
  Moves small recently used files closer to inode structures
  Optimizes free space into larger extents
Repack directories with fsadm -d
  Remove empty entries from directories
  Place recently used files at beginning of directory lists
  Pack small directories directly in inode if possible

Student Notes
Keep file system free space over 10%

In general, VxFS works best if the percentage of free space in the file system does not get below 10 percent. This is because file systems with 10 percent or more free space have less fragmentation and better extent allocation. Regular use of the df(1M) command to monitor free space is desirable. Full file systems should therefore have some files removed, or should be expanded (see fsadm(1M) for a description of online file system expansion).

Maintain free space distribution goals

Three factors can be used to determine the degree of fragmentation:
  percentage of free space in extents of less than 8 blocks in length
  percentage of free space in extents of less than 64 blocks in length
  percentage of free space in extents of 64 blocks or greater

An unfragmented file system will have the following characteristics:
  less than 1% of free space in extents of less than 8 blocks in length
  less than 5% of free space in extents of less than 64 blocks in length
  more than 5% of total file system size available as free extents 64 or more blocks in length

A fragmented file system will have the following characteristics:
  greater than 5% of free space in extents of less than 8 blocks in length
  more than 50% of free space in extents of less than 64 blocks in length
  less than 5% of total file system size available as free extents 64 or more blocks in length

Using df(1M)

The following example shows how to use df to map free space:
# df -F vxfs -o s /usr
/usr   (/dev/vg00/lvol7 ) :
Free Extents by Size
    1:   823      2:   206      4:    55      8:   206
   16:   158     32:    61     64:    48    128:    43
  256:    23    512:    14   1024:     3   2048:     3
 4096:     1   8192:     1  16384:     0  32768:     0

Repack files and free space

fsadm -e has the following goals for files and free data space:
  Make small files (default: <64k) one contiguous extent
  Ensure that large files are built from large extents
  Move small and recently used (default: <14 days) files near the inode area
  Move large or old (>14 days since last access) files to the end of the allocation unit
  Consolidate free space in the center of the data area

Repack directories

fsadm -d has the following goals for directories:
  Remove unused space from between used directory entries
  Pack directories and symbolic links into the inode immediate area if possible
  Place directories and symbolic links first, then other files
  Sort each area by time of last access

fsadm(1M) Overview
Because blocks are allocated and deallocated as files are added, removed, expanded, and truncated, block space can become fragmented. This can make it more difficult for JFS to take advantage of the benefits provided by a contiguous extent allocation. To remove fragmentation, HP OnlineJFS includes a utility called fsadm, which will take fragmented blocks and reallocate them as contiguous extents. The fsadm utility can be run on a live file system (including one containing active databases) safely without interrupting data access.


The fsadm utility will bring the fragmented extents of files closer together, group them by type and frequency of access, and compact and sort directories. The fsadm utility is typically run as a recurring scheduled job and is an effective tool for the management of a high-performance online file system. Even if database software used on top of the file system has its own defragmenter, this additional defragmentation is necessary to make the storage that the database engine sees as contiguous as possible. You can defragment (reorganize) your HP OnlineJFS file system using SAM or with fsadm(1M), directly from the command line.

To use SAM:
1. Invoke SAM.
2. Select the Disks and File Systems functional area.
3. Select the File Systems application.
4. Select the JFS file system that you wish to reorganize from the directories' list.
5. Select the Actions menu.
6. Select the VxFS Maintenance menu item.
7. View reports on extent and directory fragmentation, then select Reorganize Extents or Reorganize Directories to defragment your JFS file system.
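
Because fsadm can safely run on a live file system, it is commonly scheduled from cron. An illustrative root crontab entry (mount point, schedule, and log file are examples only) that reorganizes directories and extents on /home every Sunday at 02:00:

0 2 * * 0 /usr/sbin/fsadm -F vxfs -d -e /home > /var/adm/fsadm.home.log 2>&1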


10-17. TEXT PAGE: Monitoring and Repairing File Fragmentation


For optimal performance, the JFS extent allocator must be able to find large extents when it wants them. To maintain file system performance levels, the fsadm utility should be run periodically against all JFS file systems to reduce fragmentation. The fsadm utility should be run between once a day and once a month against each file system. The frequency depends on file system usage and activity patterns and the importance of performance. The -v option can be used to examine the amount of work performed by fsadm. The frequency of reorganization can be adjusted, based on the rate of file system fragmentation.

To perform both directory and extent reorganization and to output reports on the directory and extent fragmentation before and after reorganization, enter the following:

# fsadm -F vxfs -d -D -e -E /<jfs_mount_point>

Reorganizing Options
-F vxfs     Specify the JFS file system type.
-D          Report on directory fragmentation. If specified in conjunction with the -d option, the fragmentation report is produced both before and after the directory reorganization.
-E          Report on extent fragmentation. If specified in conjunction with the -e option, the fragmentation report is produced both before and after the extent reorganization.
-d          Reorganize directories. Directory entries are reordered to place subdirectory entries first, then all other entries in decreasing order of time of last access. The directory is also compacted to remove free space.
-e          Extent reorganization. Attempt to minimize fragmentation. Aged files are moved to the end of the allocation units to produce free space. Other files are reorganized to have the minimum number of extents possible.
-s          Print a summary of activity at the end of each pass.
-v          Verbose. Report reorganization activity.
-a days     Consider files not accessed within the specified number of days as aged files. The default is 14 days. Aged files are moved to the end of the directory by the -d option and reorganized differently by the -e option.
-p passes   Maximum number of passes to run. The default is 5 passes. Reorganizations are processed until reorganization is complete or until the specified number of passes have been run.
-t time     Maximum time to run. Reorganizations are processed until reorganization is complete or the time limit has expired. time is specified in seconds.


If both the -t and -p options are specified, the utility exits if either of the terminating conditions is reached. If both the -e and -d options are specified, the utility will run all the directory reorganization passes before any extent reorganization passes. fsadm uses the file .fsadm in the lost+found directory as a lock file. When fsadm is invoked, it opens the file lost+found/.fsadm in the root of the file system specified by mount_point. If the file does not exist, it is created. The fcntl(2) system call is used to obtain a write lock on the file. If the write lock fails, fsadm will assume that another fsadm is running and will fail. fsadm will report the process ID of the process holding the write lock on the .fsadm file.

Reporting on Directory Fragmentation


As files are allocated and freed, directories tend to grow and become sparse. In general, a directory is as large as the largest number of files it ever contained, even if some files have been subsequently removed. The command line to obtain a directory fragmentation report is:

# fsadm -D /mountpoint_dir

The following is example output from the fsadm -D command:

# fsadm -D /home
Directory Fragmentation Report
         Dirs       Total    Immed   Immeds   Dirs to   Blocks to
         Searched   Blocks   Dirs    to Add   Reduce    Reduce
au 0     15         3        12      0        0         0
au 1     0          0        0       0        0         0
total    15         3        12      0        0         0

The Dirs Searched column contains the total number of directories. A directory is associated with the extent-allocation unit containing the extent in which the directory's inode is located. The Total Blocks column contains the total number of blocks used by directory extents. The Immed Dirs column contains the number of directories that are immediate, meaning that the directory data is in the inode itself as opposed to being in an extent. Immediate directories save space and speed path name resolution. The Immeds to Add column contains the number of directories that currently have a data extent, but that could be reduced in size and contained entirely in the inode.

The Dirs to Reduce column contains the number of directories for which one or more blocks can be freed, if the entries in the directory are compressed to make the free space in the directory contiguous. Because directory entries vary in length, large directories may contain a block or more of total free space, but with the entries arranged in such a way that the space cannot be made contiguous. As a result, it is possible to have a non-zero Dirs to Reduce calculation immediately after running a directory reorganization. The -v (verbose) option of directory reorganization reports occurrences of failure to compress free space. The Blocks to Reduce column contains the number of blocks that can be freed if the entries in the directory are compressed.

Measuring Directory Fragmentation


If the totals in the Dirs to Reduce column are substantial, a directory reorganization should improve the performance of path name resolution. The directories that fragment tend to be the directories with the most activity. A small number of fragmented directories may account for a large percentage of name lookups in the file system.

Directory Reorganization
If the -d option is specified, fsadm will reorganize the directories on the file system whose mount point is mountpoint_dir. Directories are reorganized in two ways: compressing and sorting. For compression, the valid entries in the directory are moved to the front of the directory and the free space is grouped at the end of the directory. If there are no entries in the last block of the directory, the block is released and the directory size is reduced. If the directory entries are small enough, the directory is placed in the inode immediate data area. The entries in a directory are also sorted to improve path name lookup performance. Entries are sorted based on the last access time of the entry. The -a option is used to specify a time interval; 14 days is the default if -a is not specified. The time interval is broken up into 128 buckets, and all times within the same bucket are considered equal. All access times older than the time interval are considered equal, and those entries are placed last. Subdirectory entries are placed at the front of the directory and symbolic links are placed after subdirectories, followed by the most recently accessed files. The directory reorganization runs in one pass across the entire file system. The command line to reorganize directories of a file system is:
fsadm -d [-s] [-v] [-p passes] [-t timeout] [-r rawdev] [-D] /mountpoint_dir

The following example illustrates the output of the fsadm -d -s command:

# fsadm -d -s /home
Directory Reorganization Statistics
         Dirs       Dirs      Total    Failed   Blocks    Blocks    Immeds
         Searched   Changed   Ioctls   Ioctls   Reduced   Changed   Added
au 0     2343       1376      2927     1        209       3120      72
au 1     582        254       510      0        47        586       28
au 2     142        26        38       0        21        54        16
au 3     88         24        29       1        5         36        2
total    3155       1680      3504     2        282       3796      118


The Dirs Searched column contains the number of directories searched. Only directories with data extents are reorganized. Immediate directories are skipped. The Dirs Changed column contains the number of directories for which a change was made. The Total Ioctls column contains the total number of VX_DIRSORT ioctls performed. Reorganization of directory extents is performed using this ioctl. The Failed Ioctls column contains the number of requests that failed. The reason for failure is usually that the directory being reorganized is active. A few failures should be no cause for alarm. If the -v option is used, all ioctl calls and status returns are recorded. The Blocks Reduced column contains the total number of directory blocks freed by compressing entries. The Blocks Changed column contains the total number of directory blocks updated while sorting and compressing entries. The Immeds Added column contains the total number of directories with data extents that were compressed into immediate directories.

Reporting on Extent Fragmentation


As files are created and removed over time, the free extent map for an allocation unit will change from having one large free area to having many smaller free areas. This process is known as fragmentation. Also, when files are grown, particularly when growth occurs in small increments, small files can be allocated in multiple extents. In the ideal case, each file that is not sparse will have exactly one extent (containing the entire file), and the free-extent map is one continuous range of free blocks. Conversely, in a case of extreme fragmentation, there can be free space in the file system, none of which can be allocated. For example, on Version 2 JFS file systems, the indirect-address extent size is always 8 KB long. This means that to allocate an indirect-address extent to a file, an 8-KB extent must be available. If no extent of 8 KB or larger is available, even though more than 8 KB of free space is available, an attempt to allocate a file into indirect extents will fail and return ENOSPC.

Determining Fragmentation
To determine whether fragmentation exists for a given file system, the free extents for that file system need to be examined. If a large number of small extents are free, there is fragmentation. If more than half of the amount of free space is taken up by small extents (smaller than 64 blocks), or there is less than 5 percent of total file system space available in large extents, then there is serious fragmentation.

Running the Extent-Fragmentation Report


The extent-fragmentation report can be run to acquire detailed information about the degree of fragmentation in a given file system. The following is the command line to run an extent-fragmentation report:

fsadm -E [-l largesize] /mountpoint_dir

The extent reorganizer has the concept of an immovable extent: if the file already contains large extents, reallocating and consolidating these extents will not improve performance, so they are considered immovable. How large an extent must be to qualify as immovable can be controlled with the -l option. By default, largesize is 64 blocks, meaning that any extent larger than 64 blocks is considered to be immovable. For the purposes of the extent fragmentation report, the value chosen for largesize will affect which extents are reported as being immovable extents. The following is an example of the output generated by the fsadm -E command:

# fsadm -E /home
Extent Fragmentation Report

         Files with    Total      Total     Total
         Extents       Extents    Blocks    Distance
au 0     14381         18607      30516     4440997
au 1     2822          3304       24562     927841
au 2     2247          2884       22023     1382962
au 3     605           780        24039     679867
total    19992         25575      101140    7431667

         Consolidatable           Immovable
         Extents     Blocks       Extents    Blocks
au 0     928         2539         0          0
au 1     461         5225         99         13100
au 2     729         8781         58         11058
au 3     139         1463         49         17258
total    2257        18008        206        41416

Free Extents by Size

au 0  Free Blocks 217, Smaller Than 8 - 48%, Smaller Than 64 - 100%
    1:    15     2:    15     4:    15     8:     0
   16:    14    32:     0    64:     0   128:     0
  256:     0   512:     0  1024:     0  2048:     0
 4096:     0  8192:     0 16384:     0
au 1  Free Blocks 286, Smaller Than 8 - 41%, Smaller Than 64 - 100%
    1:    16     2:    21     4:    15     8:     4
   16:    13    32:     0    64:     0   128:     0
  256:     0   512:     0  1024:     0  2048:     0
 4096:     0  8192:     0 16384:     0
au 2  Free Blocks 510, Smaller Than 8 - 15%, Smaller Than 64 - 100%
    1:    10     2:    14     4:    10     8:     8
   16:    14    32:     6    64:     0   128:     0
  256:     0   512:     0  1024:     0  2048:     0
 4096:     0  8192:     0 16384:     0
au 3  Free Blocks 6235, Smaller Than 8 - 3%, Smaller Than 64 - 15%
    1:    29     2:    33     4:    27     8:    18
   16:    30    32:     8    64:     4   128:     2
  256:     3   512:     2  1024:     1  2048:     0
 4096:     1  8192:     0 16384:     0
au 4  Free Blocks 8551, Smaller Than 8 - 2%, Smaller Than 64 - 22%
    1:    29     2:    33     4:    30     8:    38
   16:    28    32:    29    64:    26   128:    11
  256:     8   512:     3  1024:     0  2048:     0
 4096:     0  8192:     0 16384:     0
total Free Blocks 15799, Smaller Than 8 - 4%, Smaller Than 64 - 24%
    1:    99     2:   116     4:    97     8:   109
   16:    58    32:    43    64:    30   128:    14
  256:    10   512:     5  1024:     1  2048:     1
 4096:     0  8192:     0 16384:     0
The numbers in the Files with Extents column indicate the total number of files that have data extents. A file is considered to be in the extent-allocation unit that contains the extent holding the file's inode. The Total Extents column contains the total number of extents belonging to files in the allocation unit. The extents themselves are not necessarily in the same allocation unit. The Total Blocks column contains the total number of blocks used by files in the allocation unit. If the total number of blocks is divided by the total number of extents, the resulting figure is the average extent size.

The Total Distance column contains the total distance between extents in the allocation unit. For example, if a file has two extents, the first containing blocks 100 through 107 and the second containing blocks 110 through 120, the distance between the extents is 110 - 107, or 3. In general, a lower number means that files are more contiguous. If an extent reorganization is run on a fragmented file system, the value for Total Distance should be reduced.

The Consolidatable Extents column contains the number of extents that are candidates to be consolidated. Consolidation means merging two or more extents into one combined extent. For files that are entirely in direct extents, the extent reorganizer will attempt to consolidate extents into extents up to size largesize. All files of size largesize or less typically will be contiguous in one extent after reorganization. Since most files are small, this will usually include about 98 percent of all files. The Consolidatable Blocks column contains the total number of blocks in Consolidatable Extents.

The Immovable Extents column contains the total number of extents that are considered to be immovable. In the report, an immovable extent appears in the allocation unit of the extent itself, as opposed to in the allocation unit of its inode. This is because the extent is considered to be immovable, and thus permanently fixed in the associated allocation unit. The Immovable Blocks column contains the total number of blocks in immovable extents.

The figures under the Free Extents by Size heading indicate per-allocation-unit totals for free extents of each size. The totals are for free extents of size 1, 2, 4, 8, 16, and so on, up to a maximum of the number of data blocks in an allocation unit. The totals should match the output of df -o s unless there has been recent allocation or deallocation activity (as this utility acts on mounted file systems). These figures give an indication of fragmentation and extent availability on a per-allocation-unit basis. For each allocation unit, and for the complete file system, the total free blocks and total free blocks by category are shown. The figures are presented as follows:

The Free Blocks figure indicates the total number of free blocks.
The Smaller Than 8 figure indicates the percentage of free blocks that are in extents less than 8 blocks in length.
The Smaller Than 64 figure indicates the percentage of free blocks that are in extents less than 64 blocks in length.

In the preceding example, 4 percent of free space is in extents less than 8 blocks in length, and 24 percent of the free space is in extents less than 64 blocks in length. This represents a typical value for a mature file system that is regularly reorganized. The total free space is about 10 percent.

Extent Reorganization
If the -e option is specified, fsadm will reorganize the data extents on the file system whose mount point is mountpoint_dir. The primary goal of extent reorganization is to defragment the file system. To reduce fragmentation, extent reorganization tries to place all small files in one contiguous extent. The -l option is used to specify the size of a file that is considered large. The default is 64 blocks. Extent reorganization also tries to group large files into large extents of at least 64 blocks.

In addition to reducing fragmentation, extent reorganization improves performance. Small files can be read or written in one I/O operation. Large files can approach raw-disk performance for sequential I/O operations.

Extent reorganization also tries to improve the locality of reference on the file system. Extents are moved into the same allocation unit as their inode. Within the allocation unit, small files and directories are migrated to the front of the allocation unit. Large files and inactive files are migrated towards the back of the allocation unit. (A file is considered inactive if the access time on the inode is more than 14 days old. The time interval can be varied using the -a option.) Extent reorganization should reduce the average seek time by placing inodes and frequently used data closer together.

fsadm will try to perform extent reorganization on all inodes on the file system. Each pass through the inodes will move the file system closer to the organization considered optimal by fsadm. The first pass might place a file into one contiguous extent. The second pass might move the file into the same allocation unit as its inode. Then, since the first file has been moved, a third pass might move extents for a file in another allocation unit into the space vacated by the first file during the second pass.

When the file system is more than 90 percent full, fsadm shifts to a different reorganization scheme. Instead of attempting to make files contiguous, extent reorganization tries to defragment the free-extent map into chunks of at least 64 blocks or the size specified by the -l option.


The following is the command line to perform extent reorganization:


fsadm -F vxfs -e [-sv][-p passes][-t time][-a days][-l largesize] /mountpoint_dir

The following example illustrates the output from the fsadm -F vxfs -e -s command:

# fsadm -F vxfs -e -s

Allocation Unit 0, Pass 1 Statistics

         Extents    Consolidations Performed     Total Errors
         Searched   Number   Extents   Blocks    File Busy   Not Free
au 0     2467       11       30        310       0           0
au 1     0          0        0         0         0           0
au 2     0          0        0         0         0           0
au 3     0          0        0         0         0           0
au 4     0          0        0         0         0           0
total    2467       11       30        310       0           0

         In Proper Location      Moved to Proper Location
         Extents    Blocks       Extents    Blocks
au 0     1379       8484         794        10925
au 1     0          0            0          0
au 2     0          0            0          0
au 3     0          0            0          0
au 4     0          0            0          0
total    1379       8484         794        10925

         Moved to Free Area      In Free Area        Could not be Moved
         Extents    Blocks       Extents   Blocks    Extents   Blocks
au 0     231        4851         4         133       0         0
au 1     0          0            0         0         0         0
au 2     0          0            0         0         0         0
au 3     0          0            0         0         0         0
au 4     0          0            0         0         0         0
total    231        4851         4         133       0         0

Allocation Unit 0, Pass 2 Statistics

         Extents    Consolidations Performed     Total Errors
         Searched   Number   Extents   Blocks    File Busy   Not Free
au 0     2467       0        0         0         0           0
au 1     0          0        0         0         0           0
au 2     0          0        0         0         0           0
au 3     0          0        0         0         0           0
au 4     0          0        0         0         0           0
total    2467       0        0         0         0           0

         In Proper Location      Moved to Proper Location
         Extents    Blocks       Extents    Blocks
au 0     2173       19409        235        4984
au 1     0          0            0          0
au 2     0          0            0          0
au 3     0          0            0          0
au 4     0          0            0          0
total    2173       19409        235        4984

         Moved to Free Area      In Free Area        Could not be Moved
         Extents    Blocks       Extents   Blocks    Extents   Blocks
au 0     0          0            0         0         0         0
au 1     0          0            0         0         0         0
au 2     0          0            0         0         0         0
au 3     0          0            0         0         0         0
au 4     0          0            0         0         0         0
total    0          0            0         0         0         0

Note that the default five passes were scheduled, but the reorganization finished in two passes. This file system had not had much activity since the last reorganization, with the result that little reorganization was required. The time it takes to complete extent reorganization varies, depending on fragmentation and disk speeds. In general, however, extent reorganization may be expected to take approximately one minute for every 10 megabytes of disk space used.

In the preceding example:

- The Extents Searched column contains the total number of extents examined.
- The Number column (under the Consolidations Performed heading) contains the total number of consolidations, or mergings of extents, performed.
- The Extents column (under the Consolidations Performed heading) contains the total number of extents that were consolidated. (More than one extent may be consolidated in one operation.)
- The Blocks column (under the Consolidations Performed heading) contains the total number of blocks that were consolidated.
- The File Busy column (under the Total Errors heading) contains the total number of reorganization requests that failed because the file was active during reorganization.
- The Not Free column (under the Total Errors heading) contains the total number of reorganization requests that failed because an extent that the reorganizer expected to be free was allocated at some time during the reorganization.
- The In Proper Location column contains the total extents and blocks that were already in the proper location at the start of the pass.
- The Moved to Proper Location column contains the total extents and blocks that were moved to the proper location during the pass.
- The Moved to Free Area column contains the total number of extents and blocks that were moved into a convenient free area in order to free up space designated as the proper location for an extent in the allocation unit being reorganized.
- The In Free Area column contains the total number of extents and blocks that were in areas designated as free areas at the beginning of the pass.
- The Could not be Moved column contains the total number of extents and blocks that were in an undesirable location and could not be moved. This occurs when there is not enough free space to allow sufficient extent movement to take place. It often occurs on the first few passes for an allocation unit if a large amount of reorganization needs to be performed.

If the next-to-last pass of the reorganization run indicates extents that cannot be moved, then the reorganization fails. A failed reorganization may leave the file system badly fragmented, since free areas are used when trying to free up reserved locations. To lessen this fragmentation, extents are not moved into the free areas on the final two passes of the extent reorganizer, and the last pass of the extent reorganizer only consolidates free space.

To defragment a BaseJFS, you need to perform the same steps you would for an HFS (a command sketch follows the list):

1. Back up the file system (with fbackup).
2. Make a new file system (with newfs).
3. Restore the data from tape (with frecover).
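A minimal sketch of that procedure follows. The tape device and volume paths are examples only; substitute the names from your own configuration:

# fbackup -f /dev/rmt/0m -i /vxfs       (1. back up the file system to tape)
# umount /vxfs
# newfs -F vxfs /dev/vg00/rvxfs         (2. make a new, unfragmented file system)
# mount -F vxfs /dev/vg00/vxfs /vxfs
# cd /vxfs
# frecover -f /dev/rmt/0m -r            (3. restore the data from tape)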


1018. SLIDE: Using setext

Using setext
The setext command can manipulate the extent allocation policies of the JFS file system on a file-by-file basis:

- Use setext to override default VxFS extent allocation policies:
  - Specify the extent size
  - Force files to be contiguous
  - Pre-reserve space for future contiguous growth
  - Prevent files from growing past the reservation
- Use getext to view file parameters
- Use ls -le to view extent parameters

Student Notes
setext specifies a fixed extent size for a file, and reserves space for a file. The file must already exist.
setext [-F vxfs] [-e extentsize] [-r reservation] [[-f flag]... ] file

Options:

-e extentsize     Specify fixed extent size (in file system blocks).

-r reservation    Pre-allocate space (in file system blocks).

-f align          All extents aligned on extentsize boundaries relative to the
                  start of allocation units.

-f contig         Reservation must be allocated contiguously.

-f noextend       File may not be extended once pre-allocated space has been
                  used.

-f chgsize        Reservation is incorporated into the file; the on-disk inode
                  is updated with size and block count information that
                  includes the reserved space.

-f noreserve      Reservation is made as a non-persistent allocation to the
                  file; the on-disk inode is not updated. The reservation is
                  associated with the file until last close, then trimmed to
                  the current file size.

-f trim           Reservation is trimmed to the current file size upon last
                  close by all processes that have the file open.

Example using setext:

# touch bigfile.0 bigfile.1 bigfile.2
# /usr/sbin/setext -F vxfs -r 4096 -f contig bigfile.1
# /usr/sbin/setext -F vxfs -f align -e 128 bigfile.2
# cp bigfile bigfile.0
# cp bigfile bigfile.1
# cp bigfile bigfile.2
# ls -l bigfile*
-rw-r--r--  1 root  other  2691000 Nov  2 10:52 bigfile.0
-rw-r--r--  1 root  other  2691000 Nov  2 10:53 bigfile.1
-rw-r--r--  1 root  other  2691000 Nov  2 10:53 bigfile.2
# /usr/sbin/getext -F vxfs bigfile.*
bigfile.0:  Bsize 1024  Reserve    0  Extent Size   0
bigfile.1:  Bsize 1024  Reserve 4096  Extent Size   0
bigfile.2:  Bsize 1024  Reserve    0  Extent Size 128

Example output from ls -le


# ls -le bigfile*
-rw-r--r--  1 root  other  2691000 Nov  2 10:52 bigfile.0
-rw-r--r--  1 root  other  2691000 Nov  2 10:53 bigfile.1 :res 4096 ext 0
-rw-r--r--  1 root  other  2691000 Nov  2 10:53 bigfile.2 :res 0 ext 128


1019. SLIDE: I/O Tunable Parameters

I/O Tunable Parameters


- JFS provides a set of eleven (11) tunable I/O parameters.
- If the default I/O parameters are not acceptable, the /etc/vx/tunefstab file can be used.
- mount_vxfs(1M) invokes the vxtunefs(1M) command to process the contents of the /etc/vx/tunefstab file.
- Failure to set I/O parameters does not prevent the mount from occurring.

Student Notes
JFS Tunable Parameters
I/O Parameter     Description

read_pref_io      The preferred read request size. The file system uses this
                  in conjunction with the read_nstream value to determine how
                  much data to read ahead. Default value is 64K.

write_pref_io     The preferred write request size. The file system uses this
                  in conjunction with the write_nstream value to determine how
                  to do flush-behind on writes. Default value is 64K.

read_nstream      The number of parallel read requests of size read_pref_io to
                  have outstanding at one time. The file system uses the
                  product of read_nstream multiplied by read_pref_io to
                  determine its read-ahead size. Default value for read_nstream
                  is 1.

write_nstream     The number of parallel write requests of size write_pref_io
                  to have outstanding at one time. The file system uses the
                  product of write_nstream multiplied by write_pref_io to
                  determine when to do flush-behind on writes. Default value
                  for write_nstream is 1.

(Only the first four parameters are described here. Refer to the vxtunefs(1M) man page for the remainder.)
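To confirm which values are actually in effect on a mounted file system, the current parameters can be printed with vxtunefs. The mount point below is only an example; the output resembles the listing on the following slide:

# vxtunefs -p /vxfs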


1020. SLIDE: vxtunefs Command for Tuning VxFS

vxtunefs Command for Tuning VxFS

# vxtunefs /tondir
Filesystem i/o parameters for /tondir
read_pref_io = 65536              # Preferred read request size is 64k
read_nstream = 1                  # Desired number of parallel read_pref_ios
read_unit_io = 65536
write_pref_io = 65536             # Preferred write request size 64k
write_nstream = 1                 # Desired number of parallel write_pref_ios
write_unit_io = 65536
pref_strength = 10
buf_breakup_size = 131072
discovered_direct_iosz = 262144   # Large I/Os treated like direct for speed
max_direct_iosz = 131072
default_indir_size = 8192
qio_cache_enable = 0
max_diskq = 1048576
initial_extent_size = 8
max_seqio_extent_size = 2048
max_buf_data_size = 8192

Student Notes
The slide shows the output of the vxtunefs command being used to query the configuration of a VxFS file system.

vxtunefs Command Details


/sbin/vxtunefs [-ps] [-f tunefstab] [-o parameter=value] [{mount_point|block_special}]...

Options:

-f filename           Use filename instead of /etc/vx/tunefstab as the file
                      containing tuning parameters.

-o parameter=value    Specify parameters for the file systems listed on the
                      command line. The parameters are listed below.

-p                    Print the tuning parameters for all the file systems
                      specified on the command line.

-s                    Set the new tuning parameters for the VxFS file systems
                      specified on the command line or in the tunefstab file.

vxtunefs sets or prints tunable I/O parameters of mounted file systems. vxtunefs can set parameters describing the I/O properties of the underlying device, parameters to indicate when to treat an I/O as direct I/O, or parameters to control the extent allocation policy for the specified file system. With no options specified, vxtunefs prints the existing VxFS parameters for the specified file systems.

vxtunefs works on a list of mount points specified on the command line, or on all the mounted file systems listed in the tunefstab file. The default tunefstab file is /etc/vx/tunefstab. You can change the default using the -f option.

vxtunefs can be run at any time on a mounted file system, and all parameter changes take immediate effect. Parameters specified on the command line override parameters listed in the tunefstab file. If /etc/vx/tunefstab exists, the VxFS-specific mount command invokes vxtunefs to set device parameters from /etc/vx/tunefstab.
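As a sketch of changing a parameter on the fly and verifying the result (the mount point and values here are illustrative, not recommendations):

# vxtunefs -s -o read_pref_io=131072 /vxfs
# vxtunefs -s -o read_nstream=4 /vxfs
# vxtunefs -p /vxfs

Remember that parameters set this way do not persist across a remount unless they are also placed in /etc/vx/tunefstab.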


1021. SLIDE: /etc/vx/tunefstab Configuration

/etc/vx/tunefstab Configuration
- The file is read every time a VxFS file system is mounted.
- Automatic, permanent vxtunefs options are implemented here.
- The file format is as follows:

      block-device      tunefs-options
      system-default    tunefs-options

- Options can be set for individual file systems or globally for all VxFS file systems.

Student Notes
The tunefstab file contains tuning parameters for VxFS file systems. vxtunefs sets the tuning parameters for mounted file systems by processing command line options or by reading parameters in the tunefstab file. Each entry in tunefstab is a line of fields in one of the following formats:

      block-device      tunefs-options
      system-default    tunefs-options

block-device is the name of the device on which the file system exists. If there is more than one line that specifies options for a device, each line is processed and the options are set in order. In place of block-device, system-default specifies tunables for each device to process. If there is an entry for both a block device and a system default, the system default value takes precedence.

Lines in tunefstab that start with the pound (#) character are treated as comments and ignored.


The tunefs-options correspond to the tunable parameters that vxtunefs and mount_vxfs set on the file system. Each option in this list is a name=value pair. Separate the options by commas, with no spaces or tabs between the options and commas. See the vxtunefs(1M) manual page for a description of the supported options.

Examples
If you have a four-column striped volume, /dev/vg01/lvol3, with a stripe unit size of 128 kilobytes per disk, set the read_pref_io and read_nstream parameters to 128 KB and four, respectively. You can do this in two ways:

/dev/vg01/lvol3    read_pref_io=128k,read_nstream=4

or:

/dev/vg01/lvol3    read_pref_io=128k
/dev/vg01/lvol3    read_nstream=4

To set the discovered direct I/O size so that it is always lower than the default, add the following line to the /etc/vx/tunefstab file:

/dev/dsk/c3t1d0    discovered_direct_iosz=128K


1022. SLIDE: Taking Snapshots and Performance

Taking Snapshots and Performance


Issues for the online (snapped) file system:

- Read performance should not be affected.
- Any writes after the snap will be 2-3 times slower.
- Subsequent writes to the same area will perform normally.
- Have the snapshot on a separate physical disk.
- Tests of OLTP show 15-20% degradation.

Issues for the backup (snapshot) file system:

- Snapshot performance should be equivalent to normal JFS.
- Read performance suffers if the snapped (online) half is busy.

Student Notes
Performance of the Advanced (Snapped) File System.
The write performance of the online (snapped) file system will be degraded, but the read performance will stay the same. It is important to ensure that the snapshot file system (the backup) resides on a different physical disk; otherwise, backup I/O will use up valuable bandwidth.

Initial writes to a block after the snapshot is started will be 2 to 3 times slower, because each such write requires three operations:

1. Read the old data.
2. Write the old data to the snapshot.
3. Write the new data.

Multiple snapshots would cause this process to be even slower. Only the initial write suffers; subsequent changes are not recorded in the snapshot and therefore proceed at normal speed.


Overall impact will depend on the read to write ratio and the mixing of the I/O operations. For example, Oracle running an OLTP workload on a snapped file system was measured about 15 to 20% slower than a file system that was not being snapped.

Performance of the Backup (Snapshot) File System.


Performance of the snapshot is maximized at the expense of writes to the snapped file system. Reads from a snapshot file system will typically be at the same rate as from a normal JFS file system, allowing backups to proceed at the full speed of JFS. Reads from the snapshot are impacted if the snapped file system is very busy. Remember the read data comes from the snapped file system unless it has been modified.


1023. LAB: JFS File System Tuning Directions


The following lab exercise compares performance of JFS with different mount options. The mount options used with JFS can have a big impact on JFS performance.

1. Mount a JFS file system to be used for this lab under /vxfs.

   # mount /dev/vg00/vxfs /vxfs

2. Because the above mount command specified no special mount options, the default mount options are used. Use the mount -v command to view the default options, including the option for transaction logging type.

   What type of transaction logging does JFS use by default?

3. Change directory to /vxfs. Time the execution of the disk_long program, which writes 400 MB of data to the file system in 20-MB increments. After each 20 MB is written, the files are deleted. Run the command three times and record the middle results.

   # cd /vxfs
   # timex ./disk_long
   # timex ./disk_long
   # timex ./disk_long

   Record middle results:  Real: ___________  User: ___________  Sys: ___________

4. Remount the JFS file system using the delaylog option. This helps performance of noncritical transactions. Run the command three times and record the middle results.

   # cd /
   # umount /vxfs
   # mount -o delaylog /dev/vg00/vxfs /vxfs
   # cd /vxfs
   # timex ./disk_long
   # timex ./disk_long
   # timex ./disk_long

   Record middle results:  Real: ___________  User: ___________  Sys: ___________


Based on the results, does the disk_long program perform many noncritical transactions?

5. Remount the JFS file system using the tmplog option. This causes the system call to return after the JFS transaction is updated in memory (step 1 from lecture), and before the transaction is written to the intent log. Run the command three times and record the middle results.

   # cd /
   # umount /vxfs
   # mount -o tmplog /dev/vg00/vxfs /vxfs
   # cd /vxfs
   # timex ./disk_long
   # timex ./disk_long
   # timex ./disk_long

   Record middle results:  Real: ___________  User: ___________  Sys: ___________

   Based on the results, why does the disk_long program show little improvement when mounted with tmplog?

6. Remount the JFS file system using the mincache=tmpcache option. This allows the JFS transaction to be created without having to wait for the user data to be written in extending write calls. Run the command three times and record the middle results.

   # cd /
   # umount /vxfs
   # mount -o mincache=tmpcache /dev/vg00/vxfs /vxfs
   # cd /vxfs
   # timex ./disk_long
   # timex ./disk_long
   # timex ./disk_long

   Record middle results:  Real: ___________  User: ___________  Sys: ___________


7. Remount the JFS file system using the mincache=direct option. This option requires all user data and all JFS transactions to bypass the buffer cache and go directly to disk. Run the command just once and record the results.

   # cd /
   # umount /vxfs
   # mount -o mincache=direct /dev/vg00/vxfs /vxfs
   # cd /vxfs
   # timex ./disk_long

   Record results:  Real: ___________  User: ___________  Sys: ___________

   Based on the results, why does the disk_long program show poor performance results when mounted with mincache=direct? When would this option be appropriate to use?

8. Dismount the VxFS file system.

   # umount /vxfs


Module 11 Network Performance


Objectives
Upon completion of this module, you will be able to do the following:

- List factors directly related to network performance.
- Describe how to determine network workloads (server and client).
- Evaluate UDP and TCP transport options.
- Identify a network bottleneck.
- List possible solutions for a network performance problem.


111. SLIDE: The OSI Model

The OSI Model

(Slide diagram: the layered protocol stack on the server and client, with the protocols used at each layer.)

Layer          Server                         Client
Application    nfsd, mountd, telnetd, ftpd    ftp, biod, telnet
Presentation   XDR                            XDR
Session        RPC                            RPC
Transport      UDP/TCP                        UDP/TCP
Network        IP                             IP
Data Link      Data Link                      Data Link
Physical       Physical                       Physical

Student Notes
Networking allows one computer (server) to communicate with and share its local files and directories with other computers (clients), even in a heterogeneous environment.

Network Protocols
NFSD, MOUNTD, FTPD, TELNETD
The networking server daemons respond to requests from clients and perform the requested operations.

BIOD, FTP, TELNET
The networking user applications request operations to be performed for them on the server.

XDR
External data representation is a machine-independent data format used by applications to translate machine-dependent data formats to a universal format that can be used by other networking hosts using XDR.


RPC/Session Layer
The remote procedure call mechanism allows a server machine to define a procedure that a client program can call. This is how a client can perform file system operations, such as creating, deleting, modifying, and viewing a directory; creating, deleting, modifying, and copying a file; and so on.

UDP/TCP
Transport protocols that move large amounts of data efficiently. Because there is no acknowledgement from the receiver, UDP is considered unreliable, whereas TCP is considered reliable. However, TCP generally has more overhead and therefore does not perform as well as UDP.

IP
Internet Protocol is a network protocol that is responsible for getting packets between hosts on one or more networks that are linked together.

Data Link
The data link defines how packets are assembled on the physical wire. Examples of data link protocols include IEEE 802.3 (CSMA/CD), IEEE 802.4 (Token Bus), and IEEE 802.5 (Token Ring).

Physical
The physical layer describes the actual transfer media and how data is transferred on the network. Examples of physical layer media include twisted pair, coaxial, and fiber optics.


112. SLIDE: NFS Read/Write Data Flow

NFS Read/Write Data Flow

(Slide diagram: an NFS client and an NFS server, with the server's file system mounted on the client as server:/data on /data. Numbered steps 1-8 trace a read request from the client process through the client's buffer cache and biod daemon, across the network to an nfsd daemon and the server's buffer cache on the exported file system, and back to the client. The steps are described below.)

Student Notes
As a prime example of how network performance can affect applications, let's look at how NFS works. The slide shows a high-level overview of the sequence of events that occur when an NFS client attempts to access data on an NFS server:

1. A user process issues the read() system call against an NFS-mounted file system. The user process goes into a wait state, waiting for the system call to return.

2. Upon checking the buffer cache for the requested data (assume the data is not in the buffer cache), the biod daemon immediately follows the original read with a read-ahead request. This is done by biod so subsequent I/O requests have a better chance of being satisfied through the buffer cache.

3. The NFS subsystem within the kernel on the client issues an RPC read request on behalf of the process (and a second on behalf of biod) to the NFS server.

4. The NFS server receives the request and schedules an nfsd process to handle it.


5. The nfsd daemon performs the file system read, and the data is returned to the nfsd daemon through the server's buffer cache.

6. The NFS subsystem within the kernel on the server schedules a reply to the client containing the requested data.

7. The data is returned to the client process through the buffer cache on the client. The data, plus the data read ahead by the biod daemon, is stored in both the client's and server's buffer caches to allow future I/O requests to come from the buffer caches.

8. The read system call is returned (along with the data) to the client process.

As you can see, NFS initiates a fair amount of traffic over the network. Other services, such as telnet and ftp, have their own performance profiles. Some are interactive, and response time is important. Others are task-oriented and rely mostly on throughput.


113. SLIDE: NFS on HP-UX with UDP

NFS on HP-UX with UDP


- NFS packets arrive in the UDP socket (port 2049).
- The UDP socket is a 256-KB FIFO queue.
- The UDP socket is emptied by the nfsds.
- Not enough nfsds causes NFS packets to back up in the queue.

(Slide diagram: nfsd daemons on the server draining NFS requests from the UDP socket queue on port 2049 and reading from the exported file system.)

Student Notes
NFS packets come into the NFS server through the UDP receive queue (port 2049). The size of this queue is 256 KB. The NFS packets are processed sequentially, FIFO. Upon receipt of an NFS packet, an nfsd daemon is awakened, removes the request from the queue, and processes the request.

If requests come into the server faster than the daemons can process them, the UDP queue quickly begins to back up with requests. If the UDP queue is full when a new request arrives, the new request is dropped off the back of the queue. This is known as a UDP socket overflow. To prevent this, always have a sufficient number of daemons running.

Regardless of how many nfsd daemons are running, only one will be awakened for each incoming request. This allows a site to meet the demands of peak workload without suffering performance problems during periods of light demand. NFS tuning can thus focus on file system and network performance rather than on CPU performance, since idle nfsd daemons have little performance impact.
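A quick way to check whether the server is dropping NFS requests for lack of nfsd daemons is to watch the socket overflow counter (covered in more detail later in this module); for example:

# netstat -p udp | grep -i overflow

A steadily increasing overflow count suggests that more nfsd daemons are needed.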


114. SLIDE: NFS on HP-UX with TCP

NFS on HP-UX with TCP


- 16 nfsd processes are started by default.
- Multiple nfsds respond to the UDP queue.
- A single multithreaded nfsktcpd process is dedicated to TCP.
- The client establishes the UDP or TCP method when mounting.

(Slide diagram: nfsd daemons servicing the UDP queue and the nfsktcpd process servicing the TCP socket on port 2049, both reading from the exported file system.)

Student Notes
Network File System (NFS) is now supported over the connection-oriented protocol TCP/IP for NFS versions 2 and 3, in addition to running over the User Datagram Protocol (UDP).

TCP transport increases dependability on wide-area networks (WANs). Generally, packets are delivered successfully more consistently because TCP provides congestion control and error recovery. As a result, with this new functionality, NFS is now supported over WANs: as long as TCP is supported on the WAN, NFS is supported as well.

The mount_nfs command now supports a proto= option on the command line, where the value for proto can be either UDP or TCP. (In the past, this option was ignored.) This change allows the administrator to specify which transport protocol to use when mounting a remote file system. If the proto= option is not specified, by default NFS will attempt a TCP connection; if that fails, it will then try a UDP connection. Thus, by default, you will begin using TCP instead of UDP for NFS traffic when you begin using the 11i version of HP-UX. This should have little impact on you. You do, however, have the option to specify either UDP or TCP connections.


If you specify a proto= option, only the specified protocol will be attempted. If the server does not support the specified protocol, the mount will fail.

nfsd now opens TCP transport endpoints to receive incoming TCP requests. For TCP, the nfsktcpd process is multithreaded; for UDP, nfsd is still multiprocessed. Kernel TCP threads execute under the process nfsktcpd.

When counting the number of nfsd processes, keep in mind the following algorithm: NUM_NFSDS nfsds that support UDP will be created per processor, and only one nfsd that supports TCP will be created. In the case of a four-way machine and NUM_NFSDS=4 (set in /etc/rc.config.d/nfsconf), 17 nfsds will be created: 16 for UDP (4 per processor) and 1 for TCP.

nfsstat will now report TCP RPC statistics for both client and server. The TCP statistics will be under the connection-oriented tag, and the UDP statistics will be under the connectionless-oriented tag.
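The daemon count itself is configured in /etc/rc.config.d/nfsconf, as noted above. A minimal excerpt might look like this (the values are illustrative):

NFS_SERVER=1    # start the NFS server daemons at boot
NUM_NFSDS=4     # per-processor UDP nfsds; a 4-way machine gets 16, plus 1 TCP nfsd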


115. SLIDE: biod on Client

biod on Client

(Slide diagram, two panels of client memory. Left: a process's read() triggers a read request, while a biod daemon issues the read-ahead request through the client's buffer cache. Right: a process's write() calls complete into the buffer cache while biod daemons issue the write requests to the server.)

Student Notes
The biod daemons help the NFS client maintain the illusion of having file systems on the local disks. The biod daemons assist in improving NFS client performance by performing read-aheads and write-behinds for the client processes.

Read-Ahead Requests
The biod daemons help read performance on NFS clients by reading ahead (that is, prefetching) data into the buffer cache so that when the client needs the data, it will be in its buffer cache. When an NFS client initiates a read request, and the data is not in its local buffer cache, the process performs the RPC read, itself. To prefetch data for the buffer cache, the kernel has the biod daemons send additional RPC read requests to the NFS server, just as if the NFS client process had requested this data. Subsequent read requests by the client (especially if reading sequentially) will find the data already in the buffer cache.


Write-Behind Requests
The biod daemons assist in write performance by allowing the NFS client process performing the write() call to return immediately rather than waiting for the write() call to complete. When an NFS client performs a write() call, the data is written to the client's buffer cache. Once the data is in the buffer cache, the kernel schedules an RPC write to occur. If there are available biod daemons, the kernel can schedule the write to be performed by a biod daemon rather than by the NFS client process. This allows the client process to continue its execution without having to wait for the write() call to return. Instead of the client process waiting for the write call, the biod daemon waits for the write call.

NOTE: Without any biod daemons on the client, NFS still works. The difference is that no read-aheads are done, causing NFS read performance to suffer, and all NFS clients performing writes are forced to wait for the RPC write requests to return, causing NFS write performance to suffer.
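On the client, the number of biod daemons started at boot is likewise set in /etc/rc.config.d/nfsconf. A minimal sketch, assuming the standard HP-UX nfsconf variable names (values illustrative):

NFS_CLIENT=1    # start the client-side NFS daemons at boot
NUM_NFSIOD=16   # number of biod daemons to run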


116. SLIDE: TELNET

TELNET

(Slide diagram: a telnet client process bound to an arbitrary local port exchanging packets with the telnetd daemon on server port 23; the numbers 1-7 mark the request/response steps described below.)

Student Notes
Telnet also uses sockets. A socket is simply a system-port pair; a connection is a pair of sockets.

On the client (when the user enters the telnet command), a port is assigned to the process from a pool of available ports. Thus a socket is formed on the client. A connection is established between that port and port 23 on the server (used exclusively to handle incoming telnet requests). On the server (as a result of the connection), a telnetd daemon is spawned and linked to port 23.

Now, the telnet process running on the client (1) issues a request to execute some command on the server. The command is placed in a packet and sent through the socket on the client (2) to the socket on the server (3). The command is removed from the packet and given to the telnetd daemon (4) to execute.

The telnetd daemon executes the command and places the result in a packet. That packet is sent through the socket on the server (5) to the socket on the client (6). The results are removed from the packet and sent to the telnet process (7).


By default, telnet uses TCP for its transfers, since it needs to establish a firm connection between the client process and the server daemon.


117. SLIDE: FTP

FTP

(Slide diagram: an ftp client process using two connections to the server's ftpd daemon: a control connection to port 21 and a data connection from port 20; the numbers 1-14 mark the steps described below.)

Student Notes
FTP also uses sockets. It uses a pair of connections to perform all its operations: one connection passes the commands and their results back and forth, while the other connection passes file data back and forth.

On the client (when the user enters the ftp command), a port is assigned to the process from a pool of available ports. Thus a socket is formed on the client. A connection is established between that port and port 21 on the server (used exclusively to handle incoming ftp requests). On the server (as a result of the connection), an ftpd daemon is spawned and linked to port 21.

Now, the ftp process running on the client (1) issues a request to execute some command on the server. The command is placed in a packet and sent through the socket on the client (2) to the socket on the server (3). The command is removed from the packet and given to the ftpd daemon (4) to execute.


The ftpd daemon executes the command and places the result in a packet. That packet is sent through the socket on the server (5) to the socket on the client (6). The results are removed from the packet and sent to the ftp process (7).

If the command involves the transfer of some file data, the ftp process on the client (or the ftpd daemon on the server) initiates the transfer of the data from one socket to the other using port 20 on the server and another available port on the client. For example, let's say that the user entered the ftp command:

    get /etc/hosts /tmp/hosts

When the command arrives at the ftpd daemon, it triggers a read of the /etc/hosts file from the server's file system into the server's buffer cache. Once there, the daemon (8) places the contents of the file into one or more packets (as necessary) and sends them to port 20 (9). The packets arrive at the socket on the client (10) and are reassembled into the image of the file in a network buffer; the image is then copied into a buffer in the buffer cache. The ftp process (11) acknowledges the receipt of the file by sending a packet to the socket (12) across the network to the socket on the server (13), where it is extracted and sent on to the daemon (14).

By default, ftp uses TCP for its transfers, since it needs to establish two firm connections between the client process and the server daemon.


118. SLIDE: Metrics to Monitor NFS

Metrics to Monitor NFS


- Number of nfsd daemons:
  - Monitor ratio between calls
  - Monitor CPU time used by all nfsd daemons
- Number of biod daemons:
  - Monitor number of waits due to no biods available
  - Monitor CPU time used by all biod daemons
- Number of badcalls
- Number of read and write NFS calls
- Number of UDP socket overflows, timeouts, retransmissions, and late responses

Student Notes
Number of nfsd daemons
Too few nfsd daemons can hinder performance on the NFS server. If all the nfsd daemons are busy when new NFS requests come in, then the requests have to wait until one of the daemons become free.

Monitor Ratio between calls and nfsd daemons


It is important to monitor the total RPC traffic (represented by the RPC calls field) relative to the NFS traffic (represented by the nfsdrun or NFS calls field). This can be especially helpful when there are multiple RPC-based applications (for example, NIS) running on the same system.

NOTE: The nfsdrun field is no longer present on the HP-UX 11.00 release, due to differences in how nfsd daemons were run in the 10.x release. To monitor the ratio of RPC to NFS traffic on HP-UX 11.00, use the calls fields.


Monitor CPU Time Used by All nfsd and biod Daemons


Distribution of NFS requests is spread over all nfsd daemons. This means each daemon is scheduled sequentially as NFS requests arrive. A sample CPU distribution on HP-UX would look something like:

CPU Utilization in minutes:
10 |
 9 |
 8 |
 7 |
 6 |
 5 |    X    X
 4 |    X    X    X    X
 3 |    X    X    X    X
 2 |    X    X    X    X
 1 |    X    X    X    X
 0 |____X____X____X____X__
NFSDs   1    2    3    4

While the scheduling algorithm evenly balances the NFS call load across all nfsd daemons, it makes it difficult to determine if enough nfsd daemons are running on the server.

The number of biod daemons


Too many biod daemons can cause NFS servers to be flooded with requests, since each biod daemon can have an outstanding NFS request pending. Increasing the number of biod daemons on a client increases the number of NFS requests the client can have pending. Too few biod daemons could mean an NFS request has to be performed by the client process itself (which means it has to wait) because no biod daemons are available. When the client process performs an RPC, the wait field (from the nfsstat -c command) is incremented by one. Too many waits indicate not enough biod daemons.

Number of Read/Write NFS Calls


The NFS read and NFS write RPC calls are the most resource-intensive of the NFS RPCs. Monitoring the percentage and quantities of these calls helps to give an indication of the total load these calls are placing on the NFS server.

Number of nullrecv
If the nfsd daemons are not being kept busy, this counter will be incremented. If this counter is incrementing, try reducing the number of nfsd daemons on the system until nullrecv is static.

Use netstat -p udp to View the Number of UDP Socket Overflows


UDP socket overflows can occur when too many NFS clients are sending requests to the NFS server, and too few nfsd daemons are running to handle the requests. When all the nfsd daemons are servicing RPC requests, none of them can read a new request from the UDP socket. Incoming RPCs are queued until the UDP socket structure becomes full. If the socket queue is full when a new request arrives, a UDP socket overflow condition occurs.


Number of badcalls
Bad calls indicate that the NFS server cannot process RPC requests. This could be due to authentication problems caused by having a user in too many groups, attempts to access exported file systems as root, or an improper secure RPC configuration. This can also be due to the server being down, or soft-mounted NFS file systems timing out.

Number of Time-Outs, Retransmissions, and Late Responses


A time-out indicates that the RPC call did not complete within the expected time period. The late responses (also known as badxid) refer to the NFS server responding to the client after the time period has expired. If time-outs and late responses are approximately equal, it indicates a healthy network, but an overloaded NFS server. If time-outs are high and late responses are low (or zero), it indicates packets are never making it to the server, and the network components (interface cards, cables, hubs) need to be examined.
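When timeouts and late responses track each other (that is, the server is healthy but slow), one remedy is to lengthen the client's RPC timeout on the mount. The sketch below uses the standard NFS mount options timeo (in tenths of a second) and retrans; the server name, paths, and values are illustrative only:

# mount -F nfs -o timeo=30,retrans=5 server:/data /data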


119. SLIDE: Metrics to Monitor Network

Metrics to Monitor Network


- Amount of NFS traffic:
  - Monitor number of collisions
  - Monitor server workload
  - Monitor client workload
- Type of network topology and hardware
- Number of subnets
- Number of routers

Student Notes
Some key metrics to monitor from an overall network perspective include:

Amount of Traffic: The amount of network traffic should be monitored across the entire LAN. However, unless network probes are available, this is very difficult to do. At a minimum, the amount of traffic into and out of the servers should be monitored with the netstat command. When monitoring network traffic, it is important to know the maximum packets per second on the LAN. In the case of a 10-Mbit Ethernet, this would be:

10 Mbit / 8 bits_per_byte        = ~1.2 MB per second  (total MB per second)
1.2 MB  / 1 KB_average_packet    = 1,200 packets/sec   (total packets per second)
1,200   * 30%_saturation_point   = 360 packets/sec     (max packets per second
                                                        with minimal collisions)


Type of Network Topology: Each network topology has different limitations. Ethernet is the most common, but it is the slowest. More recent Ethernet technologies are faster, offering 100 Mbits/sec or even 1000 Mbits/sec. FDDI is the fastest, but it is somewhat expensive. Token Ring has no collision issues (since it is token based), but it is not as pervasive. Number of Subnets: Subnetting is a method for localizing traffic to help reduce packet congestion. If too much traffic exists on a network, it may need to be split into multiple subnets. Number of Routers: Routers are another possible solution to help segment network traffic. In addition, routers can help with network security issues and routing of diverse packet types.


1110. SLIDE: Determining the NFS Workload

Determining the NFS Workload


- Run nfsstat -s or nfsstat -c to view total RPC calls.
- Each week, zero the counters; at the end of the week, divide total RPC calls by 5 days, 8 hours per day, 60 minutes per hour, 60 seconds per minute.
- Write a script to automate the data collection, calculation, and notification of the system administrator.
- Set up a cron job to execute the script at the necessary times.

Student Notes
The NFS workload on a server is defined as the total number of NFS packets received and processed. The NFS workload on a client is defined as the total number of NFS requests initiated from the client. It is important to establish a baseline regarding the NFS workload being placed on an NFS machine. This allows the system administrator to determine periods when the NFS workload is particularly high or low.

Sample Procedure for Calculating the NFS Workload


1. On Monday morning at 8:00 AM, run nfsstat -z. This zeroes out all the NFS registers.

2. On Friday evening at 5:00 PM, run nfsstat -rs (on the server) or nfsstat -rc (on the client). This shows the total number of NFS calls. Sample outputs are:


# nfsstat -rs
Server rpc:
calls       badcalls   nullrecv   badlen    xdrcall   nfsdrun
171792344   0          0          0         0         549734423

# nfsstat -rc
Client rpc:
Connection oriented: N/A
Connectionless oriented:
calls      badcalls   retrans   badxid    timeouts   waits   newcreds
17547240   0          0         240       360        0       0
badverfs   timers     toobig    nomem     cantsend   buflocks
0          7          0         0         0          0

3. Calculate the average number of NFS calls per second by dividing the total RPC calls by 5 days, 8 hours per day, 60 minutes per hour, and 60 seconds per minute:

   ((((171792344 calls / 5 days) / 8 hours) / 60 min) / 60 sec) = 1193 RPC calls/sec

Sample Script for Calculating an NFS Workload


A simple script for gathering and calculating the average number of NFS calls per second at the end of a week is shown below. Such a script could be called /usr/local/bin/create_nfs_report:

#!/usr/bin/sh
# Server (inbound) version
/usr/bin/nfsstat -rs | tail -1 | read calls trash
NFS_CALLS_PER_SEC=$(echo "$calls / 5 / 8 / 60 / 60" | bc)
HOST=$(hostname)
echo "The average NFS server calls (inbound) received for $(date +%x) was $NFS_CALLS_PER_SEC on $HOST" |
    mailx -s "NFS Report" root

#!/usr/bin/sh
# Client (outbound) version
/usr/bin/nfsstat -rc | tail -3 | read calls trash
NFS_CALLS_PER_SEC=$(echo "$calls / 5 / 8 / 60 / 60" | bc)
HOST=$(hostname)
echo "The average NFS client calls (outbound) initiated for $(date +%x) was $NFS_CALLS_PER_SEC on $HOST" |
    mailx -s "NFS Report" root

This report can be mailed to the performance analysis station or to the NFS machine.


Sample cron Entry for Automating the Procedure


To automate the process so that this happens every week without user intervention, the following two entries can be placed in root's crontab file:

0 8  * * 1  /usr/sbin/nfsstat -z
0 17 * * 5  /usr/local/bin/create_nfs_report

This is a very simplistic form of data collection. Much more involved scripts can be developed, for example, scripts that take into account the time of day when demand is heaviest, so that peak demand and demand patterns can be observed.

Doing this on all NFS clients of an NFS server is also key. It is important to establish a baseline for the NFS workload initiated from all NFS clients. This allows the system administrator to determine periods when the NFS workload is particularly high or low.


1111. SLIDE: NFS Monitoring - nfsstat Output

NFS Monitoring - nfsstat Output

# nfsstat -s
Server rpc:
Connection oriented:
calls     badcalls   nullrecv   badlen    xdrcall   dupchecks   dupreqs
0         0          0          0         0         0           0
Connectionless oriented:
calls     badcalls   nullrecv   badlen    xdrcall   dupchecks   dupreqs
428       0          6          0         0         0           0

# nfsstat -c
Client rpc:
Connection oriented:
calls     badcalls   badxids    timeouts  newcreds  badverfs    timers
0         0          0          0         0         0           0
cantconn  nomem      interrupts
0         0          0
Connectionless oriented:
calls     badcalls   retrans    badxids   timeouts  waits       newcreds
25345     304        1109       49        1410      0           0
badverfs  timers     toobig     nomem     cantsend  bufulocks
0         16         0          0         0         0

Student Notes
The nfsstat -s report shows NFS statistics on an NFS server. The report shows overall RPC statistics and detailed NFS type packets received.

Fields of Interest in This Report


calls (RPC)    This is the total RPC calls received. This should be compared to
               the total NFS calls received. Analyze the ratio of RPC calls to
               NFS calls to determine the percentage of RPC calls that are NFS
               related.

nullrecv       This is the number of times an NFS daemon (or other RPC daemon)
               was scheduled to run only to find nothing in the UDP queue. This
               was very common on a 9.x system, since every time an NFS packet
               was placed in the UDP queue, all the nfsd daemons were awakened.
               The first nfsd daemon would take the NFS packet, and the other
               daemons would find no packets in the UDP queue (incrementing the
               nullrecv field).

The example on the slide shows all the RPC packets received are NFS related. The six nullrecvs explain the difference between the RPC calls and NFS calls.


The reason for the nullrecv may be due to a client retransmission duplicate request. For example, if a client sends an NFS read request and does not receive a response within its time-out period, it will re-send the same request, which causes a duplicate entry to be in the server's UDP queue. When the first nfsd daemon removes the first NFS read request, it will also remove the duplicate request. This causes the second nfsd daemon to find an empty UDP queue when it executes.

nfsstat -s (Full Output)

# nfsstat -s
Server rpc:
Connection oriented:
calls     badcalls   nullrecv   badlen    xdrcall   dupchecks   dupreqs
0         0          0          0         0         0           0
Connectionless oriented:
calls     badcalls   nullrecv   badlen    xdrcall   dupchecks   dupreqs
0         0          0          0         0         0           0

Server nfs:
calls     badcalls
0         0
Version 2: (0 calls)
null       getattr    setattr    root       lookup     readlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
read       wrcache    write      create     remove     rename
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
link       symlink    mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
read       write      create     mkdir      symlink    mknod
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
remove     rmdir      rename     link       readdir    readdir+
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
fsstat     fsinfo     pathconf   commit
0 0%       0 0%       0 0%       0 0%

The nfsstat -c report shows NFS statistics on an NFS client. The report shows the amount of RPC calls generated by the client, as well as the specific NFS calls.

Fields of Interest in This Report


calls (RPC)    This is the total RPC calls generated by the client. This should
               be monitored relative to the total NFS calls generated. Analyze
               the ratio of RPC calls to NFS calls generated to determine the
               percentage of RPC calls that are NFS related.

waits          This is the number of times an NFS client process is put into a
               wait state because no biod daemons are available. An example of
               this would be during an NFS write. Normally, an NFS write is
               performed by a biod daemon, and the biod daemon waits for an
               acknowledgment to be returned by the NFS server. When no biod
               daemons are available, the client process itself performs the
               NFS write and has to wait for the acknowledgment to be returned.

timeouts       This is the number of times an NFS request was sent to the NFS
               server and no response was returned within the timeout period. A
               timeout can occur for two reasons: the NFS server machine is too
               busy and cannot get back to the client within the timeout
               period, or the network is having problems (collisions, bad
               interface card, bad hub) and the NFS request never makes it to
               the NFS server.

badxids        This indicates a bad or duplicate transfer ID number was
               returned from the NFS server. When a client sends a request to
               the server, a corresponding transfer ID is sent with the
               request. When the NFS server responds, it specifies which
               transfer ID it is responding to. For example, if a client sends
               an NFS request (with a corresponding transfer ID) and does not
               hear back from the NFS server, it transmits the request a second
               time. If the NFS server returns the first and second requests
               after the client had timed out the first time, the client will
               view the second response as a duplicate transfer ID.

NOTE: The ratio of timeouts to badxids is an excellent way to determine if timeouts are occurring due to a slow NFS server or due to a failed network component. If the badxids are approximately the same as the timeouts, then the NFS server is slow and the timeout period should be increased. If there are a lot of timeouts with few to no badxids, then the NFS requests are not making it to the server and there is most likely a failed LAN component.


retrans        This indicates the number of NFS requests retransmitted due to
               timeouts. Keep in mind, not every timeout causes a
               retransmission, as most clients error out after two to three
               retries.

badcalls       This indicates an NFS request has reached its retry count and
               has returned an error. This is most often due to the NFS client
               not being able to reach the NFS server (either because the NFS
               server is down, or because the network link between the client
               and server is down).

nfsstat -c (Full Output)

# nfsstat -c
Client rpc:
Connection oriented:
calls     badcalls   badxids    timeouts  newcreds  badverfs    timers
0         0          0          0         0         0           0
cantconn  nomem      interrupts
0         0          0
Connectionless oriented:
calls     badcalls   retrans    badxids   timeouts  waits       newcreds
55        0          0          0         0         0           0
badverfs  timers     toobig     nomem     cantsend  bufulocks
0         16         0          0         0         0

Client nfs:
calls     badcalls   clgets     cltoomany
55        0          55         0
Version 2: (55 calls)
null       getattr    setattr    root       lookup     readlink
0 0%       50 90%     0 0%       0 0%       3 5%       0 0%
read       wrcache    write      create     remove     rename
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
link       symlink    mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%       1 1%       1 1%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
read       write      create     mkdir      symlink    mknod
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
remove     rmdir      rename     link       readdir    readdir+
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
fsstat     fsinfo     pathconf   commit
0 0%       0 0%       0 0%       0 0%


1112. SLIDE: Network Monitoring - lanadmin Output

Network Monitoring - lanadmin Output

Network Management ID          = 4
Description                    = lan0 Hewlett-Packard LAN Interface Hw Rev 0
Type (value)                   = ethernet-csmacd(6)
MTU Size                       = 1500
Speed                          = 10000000
Station Address                = 0x800097bfb43
Administration Status (value)  = up(1)
Operation Status (value)       = up(1)
Last Change                    = 4834
Inbound Octets                 = 426550151
Inbound Unicast Packets        = 3380123
Inbound Non-Unicast Packets    = 1992200
Inbound Discards               = 0
Inbound Errors                 = 1277
Inbound Unknown Protocols      = 53618
Outbound Octets                = 1653363768
Outbound Unicast Packets       = 2626023
Outbound Non-Unicast Packets   = 1454
Outbound Discards              = 1
Outbound Errors                = 0
Outbound Queue Length          = 0
Specific                       = 655367
Press <Return> to continue

Ethernet-like Statistics Group

Index                          = 4
Alignment Errors               = 0
FCS Errors                     = 0
Single Collision Frames        = 6221
Multiple Collision Frames      = 10151
Deferred Transmissions         = 116267
Late Collisions                = 0
Excessive Collisions           = 0
Internal MAC Transmit Errors   = 0
Carrier Sense Errors           = 0
Frames Too Long                = 0
Internal MAC Receive Errors    = 0

LAN Interface test mode. LAN Interface Net Mgmt ID = 4

Student Notes
The lanadmin command displays general network packet transmission statistics for a single system.

Fields of Interest in This Report

Collision Frames       These fields indicate the number of collisions detected
                       by the system. Collisions slow NFS performance, as the
                       network has to subside before any packets can be sent
                       following a collision.

Inbound/Outbound       This is the total of all packet types being sent and
Packets                received from the system. Compare this to the total
                       number of daemon-related packets transmitted/received to
                       obtain a ratio of total network traffic relative to the
                       specific traffic.

The primary metric for determining if you have a network bottleneck is the ratio of collisions to outbound packets. In this example, you would take the total number of collisions (6221 + 10151 = 16371) and divide it by the total number of outbound packets (2626023 + 1454 = 2627477) to get the percentage of collisions per outbound packet (16371 / 2627477 = 0.6%). The commonly used threshold is 5%. Any network experiencing greater than a 5% collision rate is said to have a bottleneck. This system is well below that threshold. Of course, this metric only works on networks that experience collisions. Standard Ethernet does. Token rings do not.

The procedure for producing this report is:

1. Execute the lanadmin command.
2. From the main menu, select lan.
3. From the lan menu, select display.

Following is a complete output from this tool:
# lanadmin

          LOCAL AREA NETWORK ONLINE ADMINISTRATION, Version 1.0
                      Thu, Mar 25,2004  11:22:51

      Copyright 1994 Hewlett Packard Company.  All rights are reserved.

Test Selection mode.

        lan      = LAN Interface Administration
        menu     = Display this menu
        quit     = Terminate the Administration
        terse    = Do not display command menu
        verbose  = Display command menu

Enter command: lan

LAN Interface test mode. LAN Interface PPA Number = 0

        clear    = Clear statistics registers
        display  = Display LAN Interface status and statistics registers
        end      = End LAN Interface Administration, return to Test Selection
        menu     = Display this menu
        ppa      = PPA Number of the LAN Interface
        quit     = Terminate the Administration, return to shell
        reset    = Reset LAN Interface to execute its selftest
        specific = Go to Driver specific menu

Enter command: display

                       LAN INTERFACE STATUS DISPLAY
                        Thu, Mar 25,2004  11:23:02

PPA Number                     = 0
Description                    = lan0 HP PCI 10/100Base-TX Core
                                 [100BASE-TX,FD,AUTO,TT=1500]
Type (value)                   = ethernet-csmacd(6)
MTU Size                       = 1500
Speed                          = 100000000
Station Address                = 0x306e48c545
Administration Status (value)  = up(1)
Operation Status (value)       = up(1)
Last Change                    = 780
Inbound Octets                 = 1144058672
Inbound Unicast Packets        = 3513729
Inbound Non-Unicast Packets    = 2575374
Inbound Discards               = 0
Inbound Errors                 = 0
Inbound Unknown Protocols      = 13895
Outbound Octets                = 784916247
Outbound Unicast Packets       = 3600289
Outbound Non-Unicast Packets   = 379474
Outbound Discards              = 0
Outbound Errors                = 0
Outbound Queue Length          = 0
Specific                       = 655367

Press <Return> to continue
<CR>

Ethernet-like Statistics Group

Index                          = 1
Alignment Errors               = 0
FCS Errors                     = 0
Single Collision Frames        = 0
Multiple Collision Frames      = 0
Deferred Transmissions         = 0
Late Collisions                = 0
Excessive Collisions           = 0
Internal MAC Transmit Errors   = 0
Carrier Sense Errors           = 0
Frames Too Long                = 0
Internal MAC Receive Errors    = 0

LAN Interface test mode. LAN Interface PPA Number = 0

        clear    = Clear statistics registers
        display  = Display LAN Interface status and statistics registers
        end      = End LAN Interface Administration, return to Test Selection
        menu     = Display this menu
        ppa      = PPA Number of the LAN Interface
        quit     = Terminate the Administration, return to shell
        reset    = Reset LAN Interface to execute its selftest
        specific = Go to Driver specific menu

Enter command: quit
#


1113. SLIDE: Network Monitoring - netstat -i Output

Network Monitoring - netstat -i Output

# netstat -i
Name  Mtu   Network         Address                     Ipkts    Ierrs  Opkts    Oerrs  Coll
lan0  1500  156.153.208.0   r265c75.cup.edunet.hp.com   4546682  0      4138618  0      0
lo0   4136  loopback        localhost                   1178171  0      1178171  0      0

# netstat -p udp
udp:
        0 incomplete headers
        0 bad data length fields    (Deleted from later versions)
        0 bad checksums
        0 socket overflows
        0 data discards             (Deleted from later versions)

Student Notes
The netstat command can be used to monitor total collisions and total packet traffic in and out of a LAN card, as well as any UDP socket overflows. The -i option monitors input packets, input errors, output packets, output errors, and collisions for every LAN card on the system. Some versions of this tool did not show the Input Errors, Output Errors, and the Collisions. The output of this tool can also be used to calculate the collision/outbound packet ratio described in the previous topic. The -p udp option monitors overflows related to the UDP socket queue. If there are not enough nfsd daemons, the volume of incoming client NFS requests can exceed the server's ability to drain these requests from the UDP socket queue. When the socket queue becomes full and new NFS requests are received, the NFS request falls off the queue and a UDP socket overflow occurs.
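As a convenience, the collision-to-outbound-packet ratio described in the previous topic can be computed directly from this report. The awk sketch below assumes the column layout shown on the slide (Opkts in column 7, Coll in column 9); adjust the field numbers if your release prints different columns:

# netstat -i | awk 'NR > 1 && $7 > 0 { printf "%-6s %.2f%%\n", $1, 100 * $9 / $7 }'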


1114. SLIDE: glance NFS Report

glance NFS Report

B3692A GlancePlus B.10.12    10:47:57   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util   |100%   100%   100%
Disk Util   | 83%    22%    84%
Mem  Util   | 94%    95%    96%
Swap Util   | 21%    21%    22%
--------------------------------------------------------------------------------
NFS BY SYSTEM                                               Users=   13
                      Server (inbound)          Client (outbound)
Idx System            ReadRt  WriteRt  SvcTm    ReadRt  WriteRt  SvcTm   NetwkTm
--------------------------------------------------------------------------------
  1 e2403roc             0.0      0.0   0.00       0.0      0.0   0.00      0.00
  2 e2403sto             0.0      0.0   0.00       0.0      0.0   0.00      0.00
  3 e2403alf             0.0      0.0   0.00       0.0      0.0   0.00      0.00

S - Select a System       C - cum/interval toggle                   Page 1 of 1

Student Notes
The glance NFS report (the n key) monitors total inbound requests for NFS servers and total outbound requests for NFS clients. For NFS servers, the total number of inbound read/write requests received from each client is shown, along with the average amount of time for the server to service each request. For NFS client systems, the total number of outbound read/write requests sent to each NFS server is shown, along with the average amount of time, from the client perspective, for the requests to be serviced. For a detailed inspection of the types of NFS requests being sent (client) or the types of NFS requests being received (server), the specific client or server can be selected with the S key.


11-15. SLIDE: glance NFS System Report

glance NFS System Report

B3692A GlancePlus B.10.12      10:45:26  e2403roc   9000/856     Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util    |100%  100%  100%
Disk Util   | 83%   22%   84%
Mem Util    | 94%   95%   96%
Swap Util   | 21%   21%   22%
--------------------------------------------------------------------------------
NFS OPERATIONS for: e2403sto    Address = 15.19.83.75    PID = 1275
NFS GLOBAL ACTIVITY             Users=    1
                       Server (inbound)        Client (outbound)
                       Current      Cum        Current      Cum
--------------------------------------------------------------------------------
Read Rate                  0.0      0.0            0.0      0.0
Write Rate                 0.0      0.0            0.0      0.0
Read Byte Rate             0.0      0.0            0.0      0.0
Write Byte Rate            0.0      0.0            0.0      0.0
NFS Call Count               0        0              0        0
Bad Call Count               0        0              0        0
Service Time              0.00     0.00           0.00     0.00
Network Time                na       na           0.00     0.00
Read/Write Qlen             na       na              0        0
Idle biods                  na       na             16       na
                                                          Page 1 of 3

Student Notes
The glance NFS system report (the N key) displays the activity of NFS packets that are being received by an NFS server, or being sent as an NFS client. If a system is both a client and a server, separate columns are maintained for each. Fields of most interest in this report are the read and write rates, as these typically put the greatest load on a system. Note that this is page one of three. On the following two pages, the individual RPCs are broken down by type and counted. There are version 2 and version 3 counts to accommodate earlier and later versions of NFS.


11-16. SLIDE: glance Network by Interface Report

glance Network by Interface Report


B3692A GlancePlus B.10.12      10:47:57  e2403roc   9000/856     Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util    |100%  100%  100%
Disk Util   | 83%   22%   84%
Mem Util    | 94%   95%   96%
Swap Util   | 21%   21%   22%
--------------------------------------------------------------------------------
NETWORK BY INTERFACE                Users=    2                 Interval:  5
    Network                 In Packet     Out Packet      In KB        Out KB
Idx Interface   Type           Rate          Rate          Rate         Rate
--------------------------------------------------------------------------------
  1 lan0        Lan         30.9/ 96.6    31.1/ 98.3    2.3/ 23.1    2.1/ 45.1
  2 lo0         Loop          na/ na        na/ na       na/ na       na/ na

S - Select an Interface                                         Page 1 of 1

Student Notes
The glance Network by Interface report (the l key) displays the activity of inbound and outbound packets. Fields of most interest in this report are the inbound and outbound packet rates, as well as the KB transferred in and out by each network card. The lo0 interface is the internal loopback interface used for diagnostics.


11-17. SLIDE: Tuning NFS

Tuning NFS
Tune number of nfsd daemons
Turn on sticky bit for exported executables
Export file systems with asynchronous write option
Avoid using symbolic links on exported file systems
Tune number of biod daemons
Tune mount options when mounting NFS file system:
    rsize and wsize options
    retry and timeout options

Student Notes
There are a number of NFS tuning solutions that can help to improve performance on NFS servers:

Tune number of nfsd daemons: The default number of nfsd daemons in HP-UX 11.00 and earlier was four. This most likely is too small. The best recommendation for performance is to have two nfsd daemons for each simultaneous disk operation that can be performed. This allows a request to be received while another is awaiting disk service. For example, on a system with four SCSI controllers and NFS-exported file systems spanning disks on these controllers, schedule eight nfsd daemons. In 11i, the default number of nfsd was raised to 16. This seems a more reasonable number. The best indicator of too few nfsd daemons is UDP socket overflows. Increase the number of nfsd daemons if even one UDP socket overflow occurs. The size of the UDP socket queue can be viewed with the netstat -an | grep udp | grep 2049 command. Another indicator of too few nfsd server daemons is a high total of badxids being returned to NFS clients. Remember, only UDP requires the number of nfsds to be tuned. TCP uses multiple threads in the same daemon.
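For example, a quick check for overflow symptoms and the current daemon count might look like this (a sketch; NUM_NFSD in /etc/rc.config.d/nfsconf is the usual boot-time setting, but verify the variable name on your OS version):

# netstat -p udp | grep overflow     # any non-zero socket overflow count suggests too few nfsds
# ps -e | grep -c nfsd               # how many nfsd daemons are running now
# vi /etc/rc.config.d/nfsconf        # raise NUM_NFSD, then restart with /sbin/init.d/nfs.server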


Turn on sticky bit for exported executables: By default, text segments are not paged to swap, as their pages already exist on the file system. In the case of an executable program being loaded from an NFS server across the network to an NFS client, it is desirable to page the text locally, rather than return to the NFS server when the text page is needed again. This behavior can be achieved by setting the sticky bit to ON for the executable program. Below is an example of setting the sticky bit to ON for an executable:

# chmod 1555 prgm
# ls -l prgm
-r-xr-xr-t   1 root   bin   411089 Feb  3  1997 /opt/PGMS/bin/prgm

This also requires modifying the following tunable kernel parameter on the client:

page_text_to_local = 1

There are a number of NFS tuning solutions that can help to improve performance on NFS clients:

Tune number of biod daemons: The default number of biod daemons in HP-UX 11.00 and earlier was four. This most likely is too small. The best recommendation is to have a minimum of two biod daemons for every client process performing I/O to and from the NFS file system. Each biod daemon has, at most, one NFS request outstanding at any time, and as the number of biod daemons increases, the more disk requests the client can send. If the client has x processes performing file system I/O and y biod daemons, then the client could have x+y RPC requests outstanding at one time: one for each of the biod daemons, and one for each of the client processes. In 11i, the default number of biod was raised to 16. This seems a more reasonable number. The best indicator of too few biod daemons is the number of waits shown in the nfsstat -c command.

Tune the NFS Mount Options: There are a number of NFS mount options that can affect client performance, among them the NFS read and write buffer sizes. The NFS buffer size (specified with the rsize and wsize mount options) determines the increment in which data is transferred to and from the NFS file system. For example, if the file system block size is 8192 bytes and the NFS buffer size were 8500 bytes, two file system I/Os would be required before any NFS packet could be sent. The recommendation for NFS buffer size is to match the size of the file system block size. The default NFS buffer size is 8192 bytes, and this does match the default file system block size on HFS. For JFS, try to match the buffer size to the size of a typical extent.
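For example, a client mount specifying buffer sizes and retry behavior might look like this (a sketch; the values are illustrative and should be matched to the file system block size and network reliability, as discussed above):

# mount -F nfs -o rsize=8192,wsize=8192,timeo=7,retrans=5 server_hostname:/vxfs /vxfs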


11-18. SLIDE: Tuning the Network

Tuning the Network


Ways to Reduce LAN Congestion

Subnet the network
Use routers, not a computer, for IP gateway
Use higher speed LAN technology: Fast Ethernet, FDDI, ATM
Increase the number of LAN adapters on the server
Put the server on an FDDI network and use routers to segment client traffic
Put the server on an FDDI network and use switches to fan out to clients

Student Notes
Subnetting a network is an effective way to reduce congestion on a LAN. Using routers (as compared to bridges, Ethernet switches, and Ethernet-to-FDDI concentrators) provides a great deal of flexibility in the form of security, network segmentation, and routing of diverse types of packets. Routers usually provide good throughput and performance at a relatively low cost. Using an existing computer system as a gateway for traffic between NFS clients and the file server is often inefficient and limits the performance of the NFS clients. By making sure the maximum transmission unit (MTU) matches on the client system, the file server, and all routers in between them, the overhead on the routers caused by packet fragmentation and reassembly can be avoided.


Example

# netstat -i
Name   Mtu   Network      Address    Ipkts    Ierrs  Opkts    Oerrs  Coll
ni0*   0     none         none       0        0      0        0      0
ni1*   0     none         none       0        0      0        0      0
lo0    4608  loopback     localhost  6055     0      6055     0      0
lan0   1500  156.153.192  pr1w1      3724729  0      1705240  10     34739
lan1*  1500  none         none       0        0      0        0      0

The route command (-p option) can be used to set the path MTU size for a host route only.


11-19. SLIDE: Tuning the Network (Continued)

Tuning the Network (Continued)

The slide shows three configurations:

Add LAN Interfaces to server (several 10-Mb/s segments attached directly to the server)
Add Subnets (a router connects a 100-Mb/s backbone to multiple 10-Mb/s segments)
Use Ethernet Switches (a 100-Mb/s switch fans out to the clients)

Student Notes
If the average client demand on an NFS server is measured to be greater than the network bandwidth, the network itself becomes the bottleneck. Assuming 100 clients each demand 10 NFS requests per second (1000 requests per second in total), a single 10-Mb/s Ethernet segment (with a calculated maximum of 360 packets per second) could not handle this workload, even though the server itself may be able to (from a processing standpoint). To allow this client workload to be processed by the single NFS server, the following network configurations can be implemented:

1. Use at least three network interface cards, one for each segment, distributing 33-34 clients on a segment.
2. Use one or more high-speed network connections, which connect to multiple lower bandwidth LAN segments.

In the first example, we have added multiple LAN interfaces to our NFS server. In the second example, we have a 100-Mb/s FDDI card on the NFS file server. We also have a router on the same segment as the server that has an FDDI interface, as well as


several regular 10-Mb/s Ethernet interfaces. Here, the issue of the router's ability to do packet fragmentation and reassembly efficiently may become important. In our last example, we have a 100-Mb/s FDDI card on the NFS file server and a 100-Mb/s translating Ethernet switch on the same FDDI segment. Since this is not routed, the file server and clients share the same subnet address. There are many other possible network topologies.


11-20. LAB: Network Performance

Directions


The following two labs investigate network read and write performance. The labs use NFS and are performed against the JFS file system created in the JFS module.

Lab 1 Network Read Performance


To perform this lab, two systems are needed: an NFS server and an NFS client. Pair up with another student in the class for this lab.

1. Make sure the JFS file system on the NFS server contains the make_files program. Execute the make_files program to create files for the client to access.

# mount /dev/vg00/vxfs /vxfs
# cp /home/h4262*/disk/lab1/make_files /vxfs
# cd /vxfs
# ./make_files

2. Export the JFS file system so the client can mount it.

# exportfs -i -o root=client_hostname /vxfs
# exportfs

3. From the client system, mount the NFS file system.

# mount server_hostname:/vxfs /vxfs

4. Time how long it takes to read the 20 MB of files from the mounted file system. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

5. Now that the data is in the client's buffer cache, time how long it takes to read the exact same files again. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

Moral: Try to have a buffer cache on the client system big enough for a lot of data to be cached. Also, biod daemons help prefetch data.


6. Test to see if fewer biod daemons will change the initial performance.

# cd /
# umount /vxfs
# kill $(ps -e | grep biod | cut -c1-7)
# /usr/sbin/biod 4
# mount server_hostname:/vxfs /vxfs
# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

7. Once finished, remove the files and umount the file system.

# rm /vxfs/file*
# umount /vxfs


Lab 2 Network Write Performance


The following lab has the client perform many writes to an NFS file system. The following parameters will be investigated:

Number of biod daemons
NFS version 2 versus NFS version 3
TCP versus UDP

During this lab, the monitoring tools shown below should be used on the client and server.

CLIENT                                   SERVER
# nfsstat -c                             # nfsstat -s
# glance NFS report (n key)              # glance NFS report (n key)
# glance Global Process (g key)          # glance Global Process (g key)
  - monitor biod daemons                   - monitor nfsd daemons
# glance Disk report (d key)
  - monitor Remote Rds/Wrts

1. From the NFS client, mount the NFS file system as a version 2 file system.

# mount -o vers=2 server_hostname:/vxfs /vxfs

2. Terminate all the biod daemons on the client.

# kill $(ps -e | grep biod | cut -c1-7)

3. Time how long it takes to copy the vmunix file to the mounted NFS file system. Record the results. (The first command buffers the file.)

# cat /stand/vmunix > /dev/null
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____________ User: ____________ Sys: ____________

4. Now, start up the biod daemons, and retry timing the copy. Record the results:

# /usr/sbin/biod 4
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____________ User: ____________ Sys: ____________


5. Change the mount options to version 3 and retime the transfer:

# cd /
# umount /vxfs
# mount -o vers=3 server_hostname:/vxfs /vxfs
# cd /
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____________ User: ____________ Sys: ____________

6. Compare the speed of FTP to NFS. Transfer the file to the server using the ftp utility.

# ftp server_hostname
ftp> put /stand/vmunix /vxfs/vmunix.ftp

How long did the FTP transfer take? _________ Explain the difference in performance.

7. Test the potential performance benefit of turning off the new TCP feature of HP-UX 11i. First, mount the file system with the UDP protocol rather than the default TCP.

# umount /vxfs
# mount -o vers=3 -o proto=udp server_hostname:/vxfs /vxfs

Perform the copy test again and compare the results with the version 3 TCP mount data recorded in step 5. Is UDP quicker than TCP?

# timex cp /stand/vmunix /vxfs


Module 12 Tunable Kernel Parameters


Objectives
Upon completion of this module, you will be able to do the following:

Identify which tunable parameters belong to which category
Identify tunable kernel parameters that could impact performance
Tune both static and dynamic tunable parameters


12-1. SLIDE: Kernel Parameter Classes

Kernel Parameter Classes


Static
    requires a kernel rebuild and a reboot

Dynamic
    changes take place immediately
    changes survive a reboot

Automatic
    constantly being tuned by the kernel
    can be set manually to a fixed value

Student Notes
There are a number of tunable parameters within the kernel that can have a big impact on performance. Making changes to some of these parameters may require that a new kernel be compiled. As of 11i v1, about 12 parameters were converted to dynamically tunable parameters. That is, their values could be changed without rebuilding the kernel and without rebooting the system. As of 11i v2, there are now around 36 dynamically tunable parameters, plus a few traditional parameters that are now tuned by the kernel, so no manual tuning of them need be done at all. Static kernel parameters have been around since UNIX was first designed. In order to change one of these parameters, it was necessary to alter the contents of a system configuration file, system, rebuild the kernel using this altered configuration file, move the new kernel into place, and reboot the system to activate the new kernel. This tended to be time consuming and forced the system to become unavailable for a time. Recently, with HP-UX 11i v1, a few kernel parameters were converted to dynamic tuning. These parameters could be altered, using SAM or kmtune, and the changes would become effective immediately. There was no longer a need to rebuild the kernel or reboot the system. However, this only applied to those few kernel parameters. The vast majority of kernel parameters were still static. The dozen parameters that were made dynamically tunable


were ones that tended to be tuned by system administrators more frequently, but were relatively easy to convert to dynamic.

More recently, with HP-UX 11i v2, several more parameters were converted to dynamic tuning. These parameters were also tuned fairly frequently by system administrators, but were more difficult to convert to dynamic. At the same time, a new class of parameters was introduced: automatic. These parameters are tuned by the kernel constantly in response to changing conditions in the system. However, the system administrator can override the automatic handling by the kernel and force the parameter to some fixed value, if needed.

At HP-UX 11i v1, the following kernel parameters became dynamic:

core_addshmem_read    core_addshmem_write    maxfiles_lim
maxtsiz               maxtsiz_64bit          maxuprc
msgmax                msgmnb                 scsi_max_qdepth
semmsl                shmmax                 shmseg

At HP-UX 11i v2, the following additional kernel parameters became dynamic:

aio_listio_max        aio_max_ops            aio_monitor_run_sec
aio_prio_delta_max    aio_proc_thread_pct    aio_proc_threads
aio_req_per_thread    alloc_fs_swapmap       alwaysdump
dbc_max_pct           dbc_min_pct            dontdump
fs_symlinks           ksi_alloc_max          max_acct_file_size
max_thread_proc       maxdsiz                maxdsiz_64bit
maxssiz               maxssiz_64bit          nfile
nflocks               nkthread


nproc                 nsysmap                nsysmap64
physical_io_buffers   shmmni                 vxfs_ifree_timelag

Also at HP-UX 11i v2, the following kernel parameters are obsolete or automatic:

bootspinlocks         clicreservedmem        maxswapchunks
maxusers              mesg                   ncallout
netisr_priority       nni                    ndilbuffers
sema                  semmap                 shmem
spread_UP_drivers


12-2. SLIDE: Tuning the Kernel

Tuning the Kernel


Use system_prep, kmtune, or kctune to view current values of tunable kernel parameters.
Use SAM (or the new km/kc commands) to tune kernel parameters.
Tune only one parameter at a time.
Do not make parameters unnecessarily large.
Use glance to monitor system table sizes (ensure highest value is not equal to total table size).
Some kernel parameters are dynamic (no reboot); see kmtune and kctune.

Student Notes
Some general rules and notes regarding tuning and recompiling the kernel:

View the existing tunable parameters with the kctune command (HP-UX 11i v2), the kmtune command (HP-UX 11.00 and 11i v1), or the sysdef or system_prep commands (HP-UX 10.x). You can also use SAM with any version of HP-UX to view the current values. Examples of outputs are shown below.

Use the System Administration Manager (SAM) to tune the kernel parameters and rebuild the system. SAM has the advantage of displaying all available tunable parameters, their current values, and a range of acceptable values. SAM also knows which parameters can be tuned dynamically and will make changes to them immediately. As of HP-UX 11i v2, SAM calls a separate utility to do the actual tuning.

When tuning performance by modifying kernel parameters, modify only one value with each kernel rebuild. By changing several parameters at once, you may cloud the picture and make it much more difficult to determine what helped and what hurt the system's performance.


Avoid setting the tunable parameters too large. Many of the parameters create in-core memory data structures whose size is dependent upon the value of the tunable parameter (for example, nproc determines the size of the process table). Generally, it is a good rule of thumb to increase or decrease a parameter by no more than 20% while trying to find the best setting for it. Of course, if you are changing a parameter's value to accommodate some new application you are installing, always follow the manufacturer's suggested changes. Use glance to monitor system table sizes. Ensure the system tables are not running out of entries. In general, there should be around 20% of unused entries in any table. This will ensure that you have enough entries to handle any high demand periods.
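Outside of glance, the kernel tables can also be sampled with sar (a quick sketch; see sar(1M) for the column details):

# sar -v 5 5     # proc-sz, inod-sz, and file-sz show used/configured entries; ov counts overflows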

The step-by-step procedure for tuning and recompiling the kernel manually on HP-UX 11.X is shown below:

1. Log in as superuser.

2. Change directory.
   cd /stand/build

3. Create a system file from your current kernel.
   /usr/lbin/sysadm/system_prep -v -s system

4. Modify the /stand/build/system file as desired.

5. Build the kernel.
   /usr/sbin/mk_kernel -s system

6. Save your old system and kernel files, just in case you want to go back.
   cp /stand/system /stand/system.prev
   cp /stand/vmunix /stand/vmunix.prev
   cp -r /stand/dlkm /stand/dlkm_vmunix.prev

7. Schedule the kernel update on the next reboot.
   kmupdate

8. Shut down and reboot from your new kernel.
   /sbin/shutdown -ry 0

Understanding Dynamic Kernel Variables.


kctune(1M), kmtune(1M) or sam can be used on the fly to modify some kernel variables. Any changes take place immediately without the need to reboot. In HP-UX 11i v2, kmtune still exists, but simply calls kctune.


Example using kmtune to set and then activate a new value for a dynamic kernel variable.
# kmtune -q shmseg
Parameter           Current   Dyn   Planned   Module   Version
=====================================================
shmseg              120       Y     120

# kmtune -s shmseg=155

# kmtune -l -q shmseg
Parameter:   shmseg
Current:     120
Planned:     155
Default:     120
Minimum:
Module:
Version:
Dynamic:     Yes

# kmtune -u shmseg
shmseg has been set to 155 (0x9b).
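On 11i v2, the same query/set sequence can be done with kctune; this is a sketch of the equivalent commands (see kctune(1M)):

# kctune shmseg          # query the current value
# kctune shmseg=155      # set the new value; applied immediately for dynamic parameters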


12-3. SLIDE: Kernel Parameter Categories

Kernel Parameter Categories


File system
Message queues
Semaphores
Shared memory
Process
Swap
LVM
Networking
Miscellaneous

Student Notes
The next few slides will present the tunable kernel parameters in these categories.


12-4. SLIDE: File System Kernel Parameters

File System Kernel Parameters


Kernel Parameter   Default   Description
dbc_min_pct        5         Minimum size of dynamic buffer cache (dbc)
dbc_max_pct        50        Maximum size of dynamic buffer cache (dbc)
nbuf               0         Number of buffer headers (in 10.x and above, use DBC)
bufpages           0         Number of 4-KB buffer pages (in 10.x and above, use DBC)
fs_async           0         If on (1), forces all meta-data writes to disk to be asynchronous
maxfiles           60        Soft limit to the number of files a process can have open
maxfiles_lim       1024      Hard limit to the number of files a process can have open
nfile              formula   Size of file table in memory
ninode             formula   Size of inode table in memory
nflocks            200       Size of file-lock table in memory
vx_ncsize          1024      Size of vxfs directory name lookup cache (DNLC)

Student Notes
dbc_min_pct    dbc_min_pct specifies the minimum size that the system's buffer cache may shrink to as a percentage of physical memory. It is now dynamic in 11i v2.

dbc_max_pct    dbc_max_pct specifies the maximum size that the system's buffer cache may grow to as a percentage of physical memory. It is now dynamic in 11i v2.

nbuf           nbuf is used to specify the number of file system buffer cache headers. Set nbuf to zero if you want to use the system's ability to grow and shrink this important table dynamically, based on demand. It is not yet obsolete, but expect it to be so in a future release.

bufpages       bufpages specifies the number of 4-KB pages in memory that will be allocated for the file system buffer cache. Like nbuf, this parameter should be set to zero if you want to use the dynamic form of buffer cache allocation. If this value is non-zero, enough nbufs (one for every two bufpages) will be created as well, unless otherwise specified. It is not yet obsolete, but expect it to be so in a future release.


fs_async       fs_async specifies that file system data structures may be posted to disk asynchronously. While this can speed file system performance for some applications, it increases the risk that a file system will be corrupted in the event of system power loss.

maxfiles       maxfiles specifies the soft limit to the number of files that a single program may have open at one time. A program may exceed this soft limit up to the value of maxfiles_lim. In 11i v2, maxfiles is computed at boot and is set to 512 if memory is less than 1 GB; otherwise it is set to 2048.

maxfiles_lim   maxfiles_lim is the hard limit to the number of files that a single program can open at one time. This parameter was made dynamic in 11i v1, and the default value was set to 4096.

nfile          nfile is the size of the file table in memory, and therefore defines the maximum number of files that may be open at any one time on the system. Every process uses at least three file descriptors. Be generous with this number, as the required memory is minimal. nfile depends on the parameters nproc, maxusers, and npty. This parameter was made dynamic in 11i v2 and is no longer dependent on maxusers. Its value is computed at boot time and is set to 16384 if memory is less than 1 GB; otherwise it is set to 65536.

ninode         ninode is the size of the HFS in-core inode table. By caching inodes in memory, the amount of physical I/O is decreased when accessing files. Each unique HFS file open on the system has a unique inode. This table is hashed for performance. At boot time in 11i v2, it is set to 4880 if memory is less than 1 GB; otherwise it is set to 8196.

nflocks        nflocks is the number of file locks available on the system. File locks are a kernel service to enable applications to safely share files. Databases or other applications that make use of the lockf() system call can be large consumers of file locks. Note that one file may have several locks associated with it. This parameter was made dynamic in 11i v2; at boot time, if memory is less than 1 GB, it is set to 1200; otherwise it is set to 4096.

vx_ncsize      Along with ninode, this parameter controls the size of the DNLC (directory name lookup cache). Recent directory path names are stored in memory to improve performance. This parameter is set in bytes. This parameter has been obsoleted in 11i v2. VxFS 3.5 now uses its own internal DNLC.


12-5. SLIDE: Message Queue Kernel Parameters

Message Queue Kernel Parameters


Kernel Parameter   Default   Description
mesg               1         Enable or disable IPC messaging (700 only)
msgmap             formula   Size of message-free-space map
msgmax             8192      Maximum size in bytes of an individual message
msgmnb             16384     Maximum size in bytes of message queue space
msgmni             50        Maximum number of message queue identifiers
msgseg             2048      Number of segments in the system message buffer
msgssz             8         Size in bytes of segments to be allocated for messages
msgtql             40        Size of message header space (1 header per message)

Student Notes
Message queues are used by applications to transfer a small to medium amount of information from exactly one process to another process. This information could be in the form of a structure, a string, a numerical value, or any combination thereof. SVIPC message queues have been around for a long time. They are controlled by a number of tunable kernel parameters.

mesg      mesg, when set (mesg = 1), enables the message queue services in the kernel. This parameter is obsolete as of 11i v2.

msgmap    msgmap specifies the size of the free-space map used in allocating message buffer segments for messages.

msgmax    msgmax specifies the maximum size in bytes of an individual message. This parameter is dynamic at HP-UX 11i v1.

msgmnb    msgmnb specifies the maximum total space consumed by all messages in a queue. This parameter is dynamic at HP-UX 11i v1.

msgmni    msgmni specifies the maximum number of message queue identifiers allowed on the system at one time. Each message queue has an associated message


queue identifier stored in non-swappable kernel memory. In 11i v2, the default was raised to 512.

msgseg    msgseg is the number of segments in the system-wide message buffer. In 11i v2, the default was raised to 8192.

msgssz    msgssz is the size in bytes of each message buffer segment. In 11i v2, the default was raised to 96.

msgtql    msgtql is the total number of messages that can reside on the system at any one time. In 11i v2, the default was raised to 1024.

Any of these parameters could affect the performance of an application, simply by virtue of not having enough of the message queue resources available when needed. However, the msgssz and msgseg parameters also control the size of an in-memory message buffer that is shared by all SVIPC message queues. It needs to be large enough to handle all the messages that may be pending at any one time but, by the same token, should not be much larger than that, or it will take up far more memory than is necessary. It is not dynamic; it is fixed in size. POSIX message queues also exist in HP-UX 11.x. There are no tunable parameters for them. POSIX message queues have been shown to consistently out-perform SVIPC message queues.
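To compare these limits with actual usage on a running system, the ipcs command reports per-queue counts (a quick sketch; see ipcs(1) for the column meanings):

# ipcs -qa     # all message queues, with current bytes/messages (CBYTES, QNUM) and the per-queue limit (QBYTES)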


12-6. SLIDE: Semaphore Kernel Parameters

Semaphore Kernel Parameters


Kernel Parameter   Default   Description
sema               1         Enable or disable Semaphore code (700 only)
semaem             16384     Maximum amount a semaphore can be changed by undo
semmap             formula   Size of free-space map used for allocating new semaphores
semmni             64        Maximum number of sets of semaphores
semmns             128       Maximum number of semaphores, system-wide
semmnu             30        Maximum number of processes that can have undo operations pending on a given semaphore
semume             10        Maximum number of semaphores that a given process can have undo operations pending on
semvmx             32767     Maximum value a semaphore is allowed to reach
semmsl             2048      Maximum number of semaphores in a given set

Student Notes
Semaphores are another form of interprocess communication. Semaphores are used mainly to keep processes properly synchronized to prevent collisions when accessing shared data structures. Semaphores are typically incremented or decremented by a process to block other processes while it is performing a critical operation or using a shared resource. When finished, it decrements or increments the value, allowing blocked processes to then access the resource. Semaphores can be configured as binary semaphores with only two values (0 and 1), or they can serve as general semaphores (or counters), where one process increments/decrements the semaphore and one or more cooperating processes decrement/increment it. SVIPC semaphores have been around for a long time. They are controlled by several tunable parameters.

sema      sema (Series 700 only) enables or disables IPC semaphores at system boot time. This parameter is obsolete as of 11i v2.

semaem    semaem is the maximum value by which a semaphore can be changed in a semaphore undo operation.


semmap    semmap is the size of the free-semaphores resource map for allocating requested sets of semaphores. This parameter is obsolete as of 11i v2.

semmni    semmni is the maximum number of sets of IPC semaphores allowed on the system at any given time. In 11i v2, the default was raised to 2048.

semmns    semmns is the total system-wide number of individual IPC semaphores available to system users. In 11i v2, the default was raised to 4096.

semmnu    semmnu is the maximum number of processes that can have undo operations pending on any given IPC semaphore on the system. In 11i v2, the default was raised to 256.

semume    semume is the maximum number of IPC semaphores on which a given process can have undo operations pending. In 11i v2, the default was raised to 100.

semvmx    semvmx is the maximum value any given IPC semaphore is allowed to reach (this prevents undetected overflow conditions).

semmsl    Until 11i v2, semmsl was an untunable value in the kernel. It specified the maximum number of semaphores that could be allocated to a specific semaphore set. In 10.X it was set to 500. In 11.00, it was set to 2048. Now it is a dynamic tunable.

Any of these parameters could affect the performance of an application, simply by virtue of not having enough semaphore resources available when needed. POSIX semaphores also exist in HP-UX 11.x. There are no tunable parameters for them. POSIX semaphores have been shown to consistently out-perform SVIPC semaphores.


12-7. SLIDE: Shared Memory Kernel Parameters

Shared Memory Kernel Parameters


Kernel Parameter   Default   Description
shmem              1         Enable or disable Shared Memory (700 only)
shmmax             64 MB     Maximum shared memory segment size
shmmni             200       Maximum number of total shared memory segments
shmseg             120       Maximum number of shared memory segments that a single process may attach

Student Notes
Shared memory is reserved memory space for storing data shared between or among cooperating processes. Sharing a common memory space eliminates the need for copying or moving data to a separate location before it can be used by other processes, reducing processor time and overhead, as well as memory consumption. Shared memory is allocated in swappable, shared memory space. Data structures for managing shared memory are located in the kernel. Shared memory segments are much preferred by memory-intensive applications, such as databases, since they can be very large and can be accessed without using system calls. SVIPC shared memory uses the following tunable parameters.

shmem     shmem, when set to true, enables the shared memory subsystem at boot time. This parameter is obsolete in 11i v2.

shmmax    shmmax specifies the maximum shared memory segment size. Dynamic in 11i v1. Also in 11i v2, the default was raised to 1 GB.

shmmni    shmmni specifies the maximum number of shared memory segments allowed on the system at any one time. Dynamic in 11i v2. Also in 11i v2, the default was raised to 400.


shmseg    shmseg specifies the maximum number of shared memory segments that can be simultaneously attached (shmat()) to a single process. Dynamic in 11i v1. Also in 11i v2, the default was raised to 300.

Any of these parameters could affect the performance of an application, simply by virtue of not having enough shared memory resources available when needed. POSIX shared memory also exists in HP-UX 11.x. There are no tunable parameters for it. POSIX shared memory segments are implemented through the memory-mapped file architecture, so they could be affected by some of the file system tunable parameters described earlier.
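To see how close existing segments come to these limits, ipcs can list each segment with its size (a quick sketch; SEGSZ is the segment-size column):

# ipcs -mb               # every shared memory segment and its SEGSZ
# kmtune -q shmmax       # the configured maximum segment size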


12-8. SLIDE: Process-Related Kernel Parameters

Process-Related Kernel Parameters


Kernel Parameter              Default   Description
maxdsiz / maxdsiz_64bit       256 MB    Maximum 32- and 64-bit process data segment size
maxssiz / maxssiz_64bit       8 MB      Maximum 32- and 64-bit process stack size
maxtsiz / maxtsiz_64bit       64 MB     Maximum 32- and 64-bit process text segment size
maxressiz / maxressiz_64bit   8 MB      Maximum 32- and 64-bit process RSE stack size (IA-64 only)
maxuprc                       50        Maximum number of concurrent processes per user ID
nproc                         formula   Maximum number of processes system-wide
timeslice                     8         Maximum time a process can have the CPU before yielding to the next highest priority. Set in ticks (10 ms).

Student Notes
Manage the number of processes on the system and processes per user to keep system resources effectively distributed among users for optimal overall system operation. Manage allocation of CPU time to competing processes at equal and different priority levels. Allocate virtual memory among processes, protecting the system and competing users against unreasonable demands of abusive or run-away processes.

maxdsiz          maxdsiz defines the maximum size of the static data storage segment of an executing 32-bit process. In 11i v2, this default has been raised to 1 GB.

maxdsiz_64bit    maxdsiz_64bit defines the maximum size of the static data storage segment of an executing 64-bit process. In 11i v2, this default has been raised to 4 GB.

maxssiz          maxssiz defines the maximum size of the dynamic storage segment (DSS), also called the stack segment, of an executing 32-bit process.


maxssiz_64bit     maxssiz_64bit defines the maximum size of the dynamic storage segment (DSS), also called the stack segment, of an executing 64-bit process. In 11i v2, this default has been raised to 256 MB.

maxtsiz           maxtsiz defines the maximum size of the shared text segment (program storage space) of an executing process. Note maxtsiz_64bit for 64-bit HP-UX 11.

maxressiz         maxressiz defines the maximum size of the register stack engine (RSE) stack segment of an executing 32-bit process. This parameter is only found on an IA-64 kernel.

maxressiz_64bit   maxressiz_64bit defines the maximum size of the register stack engine (RSE) stack segment of an executing 64-bit process. This parameter is only found on an IA-64 kernel.

maxuprc           maxuprc establishes the maximum number of simultaneous processes available to each user on the system. The user ID number identifies a user. The superuser is immune to this limit. In 11i v2, this default is now set to 256.

nproc             nproc specifies the maximum total number of processes that can exist simultaneously in the system. This parameter has been made dynamic in 11i v2, and the new default setting is 4200.

timeslice         The timeslice interval is the amount of time one thread is allowed to accumulate before the CPU is given to the next thread at the same priority. The value of timeslice is specified in units of (10 millisecond) clock ticks.


12-9. SLIDE: Memory-Related Kernel Parameters

Memory-Related Kernel Parameters


Kernel Parameter     Default   Description
vps_ceiling          16        Maximum automatic page size (kbytes) the kernel selects
vps_chatr_ceiling    1048576   Maximum page size (kbytes) usable with chatr
vps_pagesize         4         Default page size used without chatr specification
swapmem_on           1         Enable or disable pseudo swap
nswapdev             10        Maximum number of device swap areas
nswapfs              10        Maximum number of file system swap areas
swchunk              2048      Size, in DEV_BSIZE (1-KB) units, of swap space chunks
maxswapchunks        256       Maximum number of swchunk units
page_text_to_local   0         Enable or disable process text to be swapped locally

Student Notes
Configurable kernel parameters for memory paging enforce operating rules and limits related to virtual memory (swap space).

vps_ceiling         This parameter is provided as a means to minimize lost cycle time caused by TLB (translation look-aside buffer) misses on systems using newer processors, such as the PA-8000 and the Itanium family, that have smaller TLBs and may not have a hardware TLB walker. If a user application does not use the chatr command to specify a page size for program text and data segments, the kernel selects a page size that, based on system configuration and object size, appears to be suitable. This is called transparent selection.

vps_chatr_ceiling   User applications can use the chatr command to specify a page size for program text and data segments, providing some flexibility for improving overall performance, depending on system configuration and object size. The specified size is then


compared to the page-size value limit defined by vps_chatr_ceiling that is defined in the kernel at system boot time. If the value specified is larger than vps_chatr_ceiling, vps_chatr_ceiling is used.

vps_pagesize         Specifies the default user-page size (in Kbytes) that is used by the kernel if the user application does not use the chatr command to specify a page size.

swapmem_on           swapmem_on enables or disables the creation of pseudo-swap, which is swap space designed to increase the apparent total swap space, so that real swap can be used completely, or large memory systems don't need corresponding swap space.

nswapdev             nswapdev specifies an integer value equal to the number of physical disk devices that can be configured for device swap, up to the maximum limit of 25.

nswapfs              nswapfs specifies an integer value equal to the number of file systems that can be made available for file-system swap, up to the maximum limit of 25.

swchunk              swchunk defines the chunk size for swap. This value must be an integer power of two. When the system needs swap space, one swap chunk is obtained from a device or file system. When that chunk has been used and another is needed, a new chunk is obtained. If the swap space is full or if there is another swap space at the same priority, the new chunk is taken from a different device or file system, thus distributing swap use over several devices.

maxswapchunks        maxswapchunks specifies the maximum amount of configurable swap space on the system. In 11i v2, this parameter is obsolete.

page_text_to_local   page_text_to_local allows NFS clients to write the text segment to local swap and retrieve it later. This eliminates two separate text-segment data transfers to and from the NFS server, thus improving NFS client program performance. This parameter does not seem to be defined in 11i v2, even though it has not been identified as an obsolete parameter.


12-10. SLIDE: LVM-Related Kernel Parameters

LVM-Related Kernel Parameters


Kernel Parameter   Default   Description
maxvgs             10        Maximum number of volume groups on the system
no_lvm_disks       0         Enable or disable system use of LVM (0 = false, LVM disks exist; 1 = true, no LVM disks exist)

Student Notes
Two configurable kernel parameters are provided that relate to kernel interaction with the logical volume manager.

maxvgs          maxvgs defines the maximum number of volume groups configured by the logical volume manager on the system.

no_lvm_disks    The no_lvm_disks flag notifies the kernel when no logical volumes exist on the system, i.e., LVM is disabled. This parameter does not seem to be defined in 11i v2, although it is not identified as an obsolete parameter.


12-11. SLIDE: Networking-Related Kernel Parameters

Networking-Related Kernel Parameters


Kernel Parameter   Default      Description
netisr_priority    -1           Priority to assign to the network packet processing daemon (-1 means handle on an interrupt basis, which gives the best packet processing performance)
netmemmax          10% of mem   Amount of memory, in bytes, to be allocated for the IP packet fragmentation reassembly queue

Student Notes
Two configurable kernel parameters are related to the kernel's interaction with the networking subsystems:

netisr_priority    netisr_priority sets the real-time interrupt priority for the networking interrupt service routine daemon. By default, it is set to -1 on uniprocessor systems and 100 on multiprocessor systems. This parameter is obsolete in 11i v2.

netmemmax          netmemmax specifies how much memory is reserved for use by networking for holding partial Internet protocol (IP) messages, which are typically held in memory for up to 30 seconds. When messages are transmitted using Internet protocol, they are sometimes broken into multiple "partial" messages (fragments). netmemmax simply establishes a maximum amount of memory that can be used for storing network-message fragments until they are reassembled. This parameter does not seem to be defined in 11i v2, although it is not identified as an obsolete parameter.


12-12. SLIDE: Miscellaneous Kernel Parameters

Miscellaneous Kernel Parameters


Kernel Parameter   Default   Description
create_fastlinks   0         Enable or disable creation of fast symbolic links
default_disk_ir    1         Enable or disable immediate reporting on all disks
maxusers           32        Maximum number of simultaneous users expected
ncallout           formula   Maximum number of timeouts (for example, alarms) pending
npty               60        Maximum number of concurrent pseudo-tty connections
rtsched_numpri     32        Number of distinct POSIX real-time priorities
unlockable_mem     0         Minimum amount of memory to be reserved for use by the paging system

Student Notes
The following parameters are more or less unrelated.

create_fastlinks   When create_fastlinks is non-zero, it causes the system to create HFS symbolic links in a manner that reduces the number of disk-block accesses by one for each symbolic link in a pathname lookup.

default_disk_ir    default_disk_ir enables or disables immediate reporting. With immediate reporting ON, disk drives that have data caches return from a write() system call when the data is cached, rather than returning after the data is written on the media. This sometimes enhances write performance, especially for sequential transfers. In 11i v2, this parameter is set to 0 by default.

maxusers           maxusers does not itself determine the size of any structures in the system; instead, the default value of other global system parameters depends on the value of maxusers. When other configurable parameter values are defined in terms of maxusers, the kernel is made smaller and more efficient by minimizing wasted space due to improperly balanced resource allocations. In


11i v2, the use of maxusers has been eliminated from the formula of every parameter that was dependent on it. Changing its value has no effect on 11i v2.

ncallout          ncallout specifies the maximum number of timeouts that can be scheduled by the kernel at any given time. A general rule is that one callout per process should be allowed unless you have processes that use multiple callouts. In 11i v2, this parameter is obsolete.

npty              npty specifies the maximum number of pseudo-tty data structures available on the system.

rtsched_numpri    rtsched_numpri specifies the number of distinct priorities that can be set for POSIX real-time processes running under the real-time scheduler.

unlockable_mem    unlockable_mem defines the minimum amount of memory that always remains available for virtual memory management and system overhead.


Module 13 Putting It All Together


Objectives
Upon completion of this module, you will be able to do the following:

Identify and characterize some network performance problems.
List some useful tools for measuring network performance problems and state how they might be applied.
Identify bottlenecks on other common system devices not associated directly with the CPU, disk, or memory.


13-1. SLIDE: Review of Bottleneck Characteristics

Review of Bottleneck Characteristics


CPU
    High CPU utilization

Disk
    High CPU utilization
    High disk utilization

Memory
    High CPU utilization
    High disk utilization
    High memory utilization (with swapping)

Student Notes
The above slide recaps the characteristics related to the three main performance bottlenecks.

CPU Bottlenecks
CPU bottlenecks often exhibit the following characteristics:

High CPU usage due to lots of processes competing for the CPU
Large number of processes in the CPU run queue
No disk bottleneck problems; disk utilization is low, few to no I/O requests in the disk queues
No memory bottleneck problems; vhand not needing to run much, no paging to swap devices

Disk Bottlenecks
Disk bottlenecks often exhibit the following characteristics:

High CPU usage due to the disk device drivers constantly executing to perform the I/O, and user/system processes continually running to submit the I/O requests


High disk utilization due to lots of I/O requests being continually submitted
No memory bottleneck problems; vhand not needing to run much, no paging to swap devices

Memory Bottlenecks
Memory bottlenecks often exhibit the following characteristics:

High CPU usage (system) due to vhand constantly running to free memory pages, the kernel spending lots of time in the memory management subsystem, and the disk device drivers writing memory pages to and from swap
High disk utilization due to memory pages being constantly written to and from the swap devices
High memory utilization (with swapping) due to free memory falling below LOTSFREE, DESFREE, and MINFREE

Given the above recap, in what order should the three main bottlenecks be checked? When arriving on the scene of an unknown system, where do you start? It would be wise to look for the bottleneck with the most specific symptoms first. Since the memory bottleneck is the only one to show signs of memory pressure, look for it first. Once you have eliminated that, look for disk bottlenecks. Finally, look for CPU bottlenecks.


13-2. SLIDE: Performance Monitoring Flowchart

Performance Monitoring Flowchart

Start glance, then work through the utilization bar graphs in this order:

1. Look at the memory utilization bar graph. Is memory utilization > 95? If yes, is there activity on the swap device? If yes: potential memory bottleneck.
2. If not, look at the disk utilization bar graph. Is disk utilization > 50? If yes, are there disk I/O requests in the queue? If yes: potential disk bottleneck.
3. If not, look at the CPU utilization bar graph. Is CPU utilization > 90? If yes, are there requests in the CPU run queue? If yes: potential CPU bottleneck.
4. If not, look for other kinds of bottlenecks, e.g., network.

Student Notes
The above performance monitoring flow chart assumes glance is being used as the performance-monitoring tool. If glance is not available, the same information can be obtained from a variety of other tools, such as sar and vmstat. The flow chart starts by first looking for symptoms of a memory bottleneck. Is memory utilization high? Is there activity to the swap device?

Memory bottlenecks are checked for first, since memory bottlenecks often exhibit symptoms of high disk and CPU utilization, which could initially be mistaken for disk or CPU bottlenecks. If the system is not bottlenecked on memory, the second bottleneck checked for through the flow chart is a disk bottleneck. Is disk utilization high? Are there disk I/O requests in the disk queue?


Disk bottlenecks are checked for second, as disk bottlenecks often exhibit symptoms of high CPU utilization, but not high memory utilization. If the system is not bottlenecked on disk, the final bottleneck to check for is a CPU bottleneck. Is CPU utilization high? Are there processes in the CPU run queue?

CPU bottlenecks are checked for after memory and disk bottlenecks, as CPU bottlenecks do not exhibit high memory or disk utilization. If none of these situations appear to exist, then it is time to check the less common bottlenecks. Networks would be a good possibility, but don't neglect other hardware or even software resources, such as file locks and semaphores.


13-3. SLIDE: Review Memory Bottlenecks

Review Memory Bottlenecks


Look at the memory utilization bar graph. Is memory utilization > 95?
If yes, is there activity on the swap device? If yes: potential memory bottleneck.

Check for swap activity with:
(m) Mem Report: look at VM writes
(d) Disk Report: look at Virt Memory
(v) I/O by LV: look at swap devices
(w) Swap Space: look at Used (ignore pseudo)

Student Notes
The primary symptoms of a memory bottleneck include high memory utilization and activity to the swap device. The glance reports that show activity on the swap device include:
(m) Memory Report         shows current number of VM reads/writes
(d) Disk Report           shows VirtMem I/O
(v) I/O by log. volume    shows I/O to the swap logical volumes
(w) Swap Space Report     shows currently used swap space

Also look at vhand and swapper as processes. Are they accumulating any CPU time? Look at the output of vmstat -S. Are pages being paged out? Are processes being swapped out?
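For example (a sketch; with -S, vmstat adds si and so columns reporting processes swapped in and out):

# vmstat -S 5 5     # five samples at five-second intervals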


13-4. SLIDE: Correcting Memory Bottlenecks

Correcting Memory Bottlenecks


Reduce maximum size of dynamic buffer cache.
Identify programs with large resident set size (RSS).
Use the serialize command to reduce thrashing.
Use PRM or WLM to prioritize memory allocations.
Add more physical memory.

Student Notes
The above slide reviews some of the ways to correct a memory bottleneck:

Limit the maximum size of the dynamic buffer cache. This can help to prevent unnecessary paging during periods when the dynamic buffer cache needs to shrink (see the kmtune sketch after this list).
Identify programs (and users) taking up large amounts of memory, and investigate whether the memory usage is warranted or whether the process has memory leaks.
Consider using the serialize command to keep several memory-intensive programs from competing with each other.
Consider using the Process Resource Manager (PRM) or Work Load Manager (WLM) to favor memory allocation to important processes.
Adding more physical memory will always help a memory-constrained system.
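As an illustration of the first bullet, the buffer cache ceiling can be lowered through the dynamic dbc_max_pct tunable (20 is an illustrative value only; kmtune -s plans the change and kmtune -u applies it, as in the kmtune example in the previous module):

# kmtune -s dbc_max_pct=20
# kmtune -u dbc_max_pct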


135. SLIDE: Review Disk Bottlenecks

Review Disk Bottlenecks

Look at the disk utilization bar graph. Is disk utilization > 50%?

If yes, are there disk I/O requests in the queue?

(u) I/O by Disk   - look at File System activity
(B) Global Waits  - look at Blocked on Disk I/O
(d) Disk Report   - look at the Logical I/O to Physical I/O ratio

If the answer to both questions is yes: Potential Disk Bottleneck.

Student Notes
The primary symptoms of a disk bottleneck include high disk utilization and multiple I/O requests in the disk queue. The glance reports that show disk I/O related activity include:

(u) I/O by Phys. Disk  - shows current number of reads/writes
(B) Global Waits       - shows percentage of processes blocked on Disk I/O
(d) Disk Report        - shows Logical I/O and Physical I/O activity

Also check the output of sar -u (%wio), sar -d, and sar -b (for read cache hit rate and write cache hit rate).
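For example (the interval and count are illustrative):

# sar -u 5 12
# sar -d 5 12
# sar -b 5 12

In the sar -u output, watch %wio; in sar -d, watch %busy and avque for each device; in sar -b, watch %rcache and %wcache for the cache hit rates mentioned above.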


136. SLIDE: Correcting Disk Bottlenecks

Correcting Disk Bottlenecks


Load balance across disk drives and disk controllers.
Consider asynchronous instead of synchronous I/O.
Tune file system block and fragment/extent size.
Tune file system (vxfs and hfs) mount options.
Tune vxfs file systems with vxtunefs.
Tune buffer cache for better hit ratios.
Add additional and faster disk drives and controllers.

Student Notes
The above slide reviews some of the ways to correct a disk bottleneck:

Spread the I/O activity, as evenly as possible, over the disk drives and disk controllers.

Consider using asynchronous I/O so applications do not have to wait for a physical I/O to complete. The trade-off here is a greater exposure to data loss in the event of a system failure.

For HFS file systems, increase the fragment and file system block size if large files are being accessed in a sequential manner. For VxFS file systems, increase the block size to improve read-ahead and write-behind. Consider using a fixed extent size.

Look at customizing file system mount options (especially for VxFS file systems). Recall that, by default, VxFS is mounted to favor integrity, and HFS is mounted to favor performance.

Consider using vxtunefs to tune the performance of VxFS. Match the preferred I/O size and read-ahead to the physical stripe depth (a sample invocation appears after this list).


Verify (and tune) the hit ratio on the file system buffer cache. The ratio of logical reads to physical reads should be a minimum of 10 to 1. The ratio of logical writes to physical writes should be a minimum of 3 to 1.

Add bigger, better, faster disks and disk controllers.
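As a sketch of the vxtunefs approach mentioned in the list above (the mount point /db01 and the parameter values are purely illustrative; always check the current settings first):

# vxtunefs /db01
# vxtunefs -o read_pref_io=65536 /db01
# vxtunefs -o read_nahead=4 /db01

Run with no options, vxtunefs prints the current I/O parameters for the mount point; the -o form changes a parameter such as the preferred read size (read_pref_io) or the read-ahead multiple (read_nahead), which should be matched to the underlying stripe geometry.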


137. SLIDE: Review CPU Bottlenecks

Review CPU Bottlenecks

Look at the CPU utilization bar graph. Is CPU utilization > 90%?

If yes, are there processes in the CPU run queue?

(a) CPU by Proc    - look at Load Average
(g) Global Report  - look at Processes Blocked on priority

If the answer to both questions is yes: Potential CPU Bottleneck.

Student Notes
The primary symptoms of a CPU bottleneck include high CPU utilization and multiple processes in the CPU run queue. The glance reports that show CPU activity include:

(a) CPU by Processor  - shows CPU load average over the last 1, 5, and 15 minutes
(c) CPU Report        - shows CPU activities
(g) Process Report    - shows CPU hogs in order (see note)

Note

Make sure you are looking at processes in CPU order. Use the Thresholds Page (o) of glance and set CPU as the sort criteria.

Also check sar -u and sar -q. Use the -M option if you have a multiprocessor.
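For example (interval and count are illustrative):

# sar -Mu 5 12
# sar -Mq 5 12

With -M, sar reports one line per processor plus a system-wide line, which makes an unbalanced load across CPUs easy to spot.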


138. SLIDE: Correcting CPU Bottlenecks

Correcting CPU Bottlenecks


Use nice to reduce the priority of less important processes.
Use nice to improve the priority of more important processes.
Use rtprio or rtsched on the most important processes.
Run batch jobs during non-peak hours.
Add another (or faster) processor.

Student Notes
The above slide reviews some of the ways to correct a CPU bottleneck:

Use the nice or renice commands on lower priority processes (set the nice value to 21-39). As a rule of thumb, favor I/O-bound programs over CPU-bound programs. I/O-bound programs will block frequently, allowing the CPU-bound programs to run.

Use the nice or renice command on higher priority processes (set the nice value to 0-19).

Use the rtprio or rtsched commands on the highest priority processes. BE CAREFUL! A poorly written process could take over your system and render it useless.

Schedule large batch jobs, long compiles, and other CPU-intensive activity for non-peak hours.

Add an additional CPU or a faster CPU to the system.
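A sketch of the commands involved (the PID and program names are hypothetical):

# renice -n 10 -p 2345
# nice -n -10 ./important_job
# rtprio 64 ./critical_job

The first lowers the priority of an already-running batch job; the second starts a job with a favored nice value (raising priority requires root); the third starts a job at real-time priority 64. Remember the warning above: a runaway real-time process can monopolize the system.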


139. SLIDE: Final Review Major Symptoms

Final Review Major Symptoms


Memory Bottleneck:   Both vhand and swapper active
Disk Bottleneck:     Disk utilization > 50%; request queues > 3
CPU Bottleneck:      CPU utilization > 90%; run queues > 3 per processor
Network Bottleneck:  Collisions/out-bound packets > 5%

All conditions sustained over time!

Student Notes
Let's summarize the major bottlenecks and their symptoms:

Memory Bottleneck: You know that you have a memory bottleneck if both vhand and swapper are active. This indicates severe memory pressure!

Disk Bottleneck: A disk bottleneck will be characterized by disk utilization of at least 50% and at least 3 requests waiting in the request queue. If a controller is the bottleneck, you will see multiple disks with lengthy queues on that controller. Their utilization may not be 50%! The queues are more important than the utilization.

CPU Bottleneck: If all of your CPUs are at least 90% busy and they each have run queues with 3 or more processes in them, you have a CPU bottleneck. If one or more of the processors has empty (or mostly empty) queues, either you are at the limit of your CPU resource, or something is unbalancing the loads on your processors.

Network Bottleneck: If the ratio between your collisions/sec and your packets-out/sec is greater than 5%, you have a network bottleneck.
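A quick way to approximate the collision ratio from the command line:

# netstat -i

Divide the Coll column by the Opkts column for each interface; sustained values above 5% point to an overloaded segment. (lanadmin, covered elsewhere in the course, gives more detailed per-card statistics.)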


As with any bottleneck symptom, it must be a constant condition sustained over time to be considered a true bottleneck. Otherwise, it's a momentary spike, which we will keep an eye on but otherwise ignore.


Appendix A Applying GlancePlus Data


This module is an optional self-study for students.

Objectives
Upon completion of this module, you will be able to do the following: Use case studies to demonstrate how GlancePlus screens can be used to analyze system performance. Observe how a performance specialist approaches a tuning task.


A1. TEXT PAGE: Case Studies: Using GlancePlus


The case studies stylized in this module come from the logbooks of HP-UX Performance Specialists and are presented for your consideration. The goal is to help you prepare for your own tasks and adventures. The examples show you possibilities and are not intended to be exact recommendations or solutions to situations that you may encounter. These examples may cause you to think up new questions, in addition to answering some of the classic tuning scenarios. As in most endeavors, there is often much to be gained from reviewing someone else's actions and trying to reverse-engineer their solutions.

An Approach to Monitoring System Behavior


The best approach to monitoring your system's performance is to become familiar with how your system usually behaves. This helps you recognize whether a sudden shift in activity is normal or a sign of a potential problem.

The first screen that appears when you start GlancePlus in character mode summarizes system-wide activity and lists all processes that exceed the usage thresholds set for the system. The information on this screen tells you if a resource is being used excessively or a process is monopolizing available resources. The Global screen is the usual starting point for any review of system activity and performance. You can use the statistics on the Global screen to monitor system-wide activity, or you may need to refer to the detailed data screens to focus on specific areas of system usage. The examples in this chapter highlight the use of all GlancePlus screens.

GlancePlus provides you with valuable information, but optimal use of this information depends on how well you understand your system's operation and what is the normal or usual behavior for that system. As you use GlancePlus to review your system's performance, you will learn to recognize patterns that differ from this norm: patterns that may indicate a problem.

Bottlenecks
A bottleneck is the most common type of problem on any system. It occurs whenever a hardware or software resource cannot meet the demands placed on it, and processes must wait until the resource becomes available. This results in blocks and long queues.

Your system handles processes much like a freeway system handles traffic. During normal hours, the freeway adequately carries the traffic load, and cars can travel at optimum speed. But, during rush hour, when too many cars try to access the freeway, the lanes become clogged and traffic can slow to a halt. The freeway becomes bottlenecked.

Similarly, a bottleneck can occur on your system if the processes you are running need more CPU time than is available or more memory than is configured for the system. A bottleneck also can occur if there isn't enough disk I/O bandwidth to move data, or if swap space isn't configured optimally.

A bottleneck can be a temporary problem that is easily fixed. The solution may be to rearrange workloads, such as rescheduling batch programs to run late at night. Solving a disk bottleneck may require only spreading disk loads among all the available disks.


A recurring bottleneck, however, can indicate a long-term situation that is worsening. Perhaps the system was configured to serve fewer users than are now using it, or workloads have gradually increased beyond the system's capacity. The only solution may be a hardware upgrade, but how do you know? If you can identify a bottleneck correctly, you can avoid randomly tuning the system (which can worsen the problem), and you can avoid adding extra hardware that doesn't help performance. You can also avoid expending resources solving a corollary bottleneck: one that is caused by the primary bottleneck.

Characteristics of Bottlenecks
Common system bottlenecks have several general characteristics or symptoms. By comparing these symptoms with the statistics on your GlancePlus screens, you can analyze the performance of your system and detect potential or existing bottlenecks. Although a single symptom may not indicate a problem, a combination of symptoms generally reflects a bottleneck situation.

Symptoms of a CPU Bottleneck


Long run queue without available idle time
High activity in user mode
Reasonable activity in system mode (high activity may indicate other bottlenecks as well)
Many processes frequently blocked on priority

Symptoms of a Memory Bottleneck


High swapping activity
High paging activity
Very little free memory available
High disk activity on swap devices
High CPU usage in system mode

Symptoms of a Disk Bottleneck


High disk activity
CPU idle, waiting for I/O requests to complete
High rate of physical reads/writes
Long disk queues

Symptoms of Other I/O Bottlenecks


High LAN activity
Low I/O throughputs

You may discover that solving one bottleneck uncovers or creates another. It is possible to have more than one bottleneck on a system. In fact, changing workloads are constantly reflected in changing system performance. The goal is not to seek a final solution, but to seek optimal performance at any given time.


Evaluating System Activity


One afternoon Doug noticed that system response had slowed. He ran GlancePlus and looked at the Global screen to view system-wide activity. He saw that the CPU usage was near 100%. Although this is not necessarily a problem, he decided to check it out.

Doug then looked at the process summary section of the Global screen, which lists all processes that exceed the usage thresholds set for the system. He noted that a single process accounted for a majority of the near-100% CPU usage. Wanting more information on that particular process, he checked it on the Individual Process screen, which provides detailed information about a specific process. Reviewing that screen, Doug noticed the process was doing no I/O and was spending all its time in user code. This suggested that the process might be trapped in a CPU loop. After identifying the user's name, he telephoned the user to find out if the process could be killed.

In this situation, the CPU use for the system did not drop after the user terminated the looping process, because other processes took up the slack. However, response time improved because other processes did not need to wait as often to be given their share of CPU time.

Evaluating CPU Usage


Dean was checking the system one afternoon when he noticed a sudden slowdown in system response time. He ran GlancePlus and looked at the Global screen to view system-wide activity. He saw that the CPU usage was near 100%. The other system resources, such as Memory, Disk, and Swap, showed much less use. Further checking revealed that several processes were blocked due to another process using the CPU (PRI), which meant they were waiting for higher priority processes to finish executing.

Dean accessed the CPU Detail screen to see how CPU time was allocated to different states (activities). He discovered that real-time activities were using a much larger percentage of the CPU than other activities. Dean returned to the Global screen to check priorities. One user was running with a priority of 127, an RTPRIO real-time processing priority. Dean knew that this particular program is CPU-intensive and that running it at such a high priority would keep other processes from executing. Already it was causing system performance to degrade. He reset the priority for that process to a lower timeshare priority by using the GlancePlus renice command. This allowed other processes more consistent access to the CPU, and system response time improved.

Evaluating Wait States


Jose's system was running fine until he installed a new application. Now, every time the application runs, response time degrades. Since the application is the only change to the system, Jose starts by checking how it is using the system. Looking at the Glance Individual Process screen, he sees that CPU utilization is about 7 percent, so that isn't the problem. Next, he checks overall CPU utilization on the system; it's averaging about 48 percent, which means there is sufficient CPU resource to accommodate the new application. Jose checks disk I/Os and notices the application is processing about 5 I/Os per second, most


of which is virtual memory I/O. That looks slow to Jose, so he looks at the Wait States screen to find out what the process is waiting on. Jose learns that the process is spending about 7 percent of its time utilizing the CPU (executing), 27 percent of its time waiting for terminal input, and 66 percent of its time waiting on virtual memory. That's a significant amount of time. Jose checks other processes on the system and discovers that they are experiencing similar waits for virtual memory. He realizes that the new application overloads the system's memory. He makes copies of the relevant screens so he can explain the situation to his manager.

Evaluating Disk Usage


Vivian's company often runs processes that tax available memory. She keeps track of the situation by checking the Disk Detail screen, which displays both logical and physical I/O requests for all mounted disk devices. It also categorizes the physical requests as User, Virtual Memory, System, and Raw requests. This screen shows her when large numbers of physical read and write requests are occurring, a situation that results from excessive page faults by processes. Vivian also checks the virtual memory request rate, since that also will be high when system demand is taxing its physical memory capacity. By paying attention to which processes are active when the virtual memory activity is high, Vivian can make intelligent decisions about redistributing activities to balance the system load. This helps increase overall throughput for the system.

Evaluating Memory Usage


Terri's system was experiencing a slowdown in system response time. She checked the Global screen to get an overall picture. All four system resources (CPU, Disk, Memory, and Swap) were near 100%. A large portion of the disk bar activity showed virtual memory activity. In addition, the swapper system daemon appeared to be running continuously. Terri realized that this indicated a possible memory bottleneck. She checked the Memory Detail screen, which provides statistics on memory management events such as page faults, number of pages paged-in and paged-out, and the number of accesses to virtual memory. The screen indicated that Free Memory was 0.0 MB, indicating a lack of usable memory, and the Swap In/Outs showed a rate above 1 per second. Concluding that the problem was a memory bottleneck, Terri returned to the Global screen to study the active processes. She knew that a memory bottleneck can be relieved by adding more memory or by reducing the memory demands of active processes. In this case, she suspected that high swap rates were caused by the large Resident Set Sizes (RSS) for the most active processes. One test program showed a large RSS that appeared to grow at a constant rate. Examining this situation more closely, Terri discovered the program had a "memory leak." It allocated memory using malloc() but did not free up memory using free(). The process's memory allocation increased steadily, causing memory pressure on the system. She talked with the developer, who studied the program code and found the memory leak. The test program was changed and recompiled to use far less memory, thus alleviating the memory bottleneck and improving system response time.


Evaluating I/O by File System


Ingrid noticed that system performance degraded drastically when the system was doing swapping. Looking at the Global screen, she observed that the swapper process was running and that virtual memory use accounted for a high percentage of the disk utilization. She checked the Disk I/O by File System screen to verify which disk was busiest. The Disk I/O by File System screen provides details of the I/O rates for each file system or mounted-disk partition. This information is useful for balancing disk loads.

When she looked at the Disk I/O by File System screen, Ingrid saw that one disk was being utilized more than all other disks on the system. The disk most utilized was a swap disk. Ingrid decided to add additional swap disk areas to the system to alleviate the load on that one disk. She also might have considered allocating dynamic swap areas on existing underutilized file systems.

Evaluating Disk Queue Lengths


Ray had already determined that his system had a disk I/O bottleneck. By reading the Global screen, he noticed that the disk utilization was almost always at 100%. He had checked the Disk I/O by File System screen, which showed that several disks were being heavily utilized. What Ray wanted to find was a way to ease the situation.

He studied the Disk Queue Lengths screen, which shows how well disks are able to process I/O requests. He wanted to determine which of the busy disks had the longest delays for service. He knew that "busy" disks did not necessarily have a long queue length. High disk utilization is not a problem unless processes must wait to use the disk. For example, using a high percentage of the lines on a telephone system is not a problem unless calls cannot get through. Ray also knew that long queue lengths meant several disk requests must wait while that drive is servicing other requests. For example, if all phone lines are busy, incoming calls must wait to connect.

Once he had a clear picture of the situation, Ray reduced the large queue lengths by moving several files to different file systems to distribute the workload more evenly.

Evaluating NFS Activity


Paul works on a system that is used as a network file system server. One local disk is NFS-mounted from several different nodes on the LAN. One afternoon, Paul noticed poor response time on the system. The file system mounted by remote systems was very active.

Paul reviewed the NFS Detail screen, which provides current statistics on in-bound and out-bound network file system (NFS) activity for a local system. He wanted to determine which remote system was using the disk the most. He observed a large In-bound Reads rate from one system. This led him to examine that remote system to find out why it was overutilizing the NFS-mounted disk. His examination pinpointed the situation to a single user on the remote system. The user was making repeated, unnecessary greps to files on the NFS-mounted disk. Paul explained the problem to the user and worked with her to lessen the heavy disk use. This reduced the load


on the NFS server and improved overall response time.

Evaluating LAN Activity


Lee noticed a slow response time for applications using datacom services to access data across the local area network. He checked the LAN Detail screen to see what was causing the problem. The LAN Detail screen describes four functions for each local area network card configured on the system. On networked systems, this information can show potential bottlenecks caused by heavy LAN activity. Lee noticed that the Collision and Errors rates were higher than usual. This information led him to investigate whether processes were competing for LAN resources or overloading the LAN software or hardware. In this case, an application that was improperly written using netipc() was causing a bottleneck. Once this program was stopped, other programs using the LAN were able to improve their response time.

Evaluating System Table Utilization


When Debbie was running a program on the system, the program failed, giving this error message:

sh: fork failed - too many processes

To decide whether or not to reconfigure the value of nproc in her kernel, Debbie needed to find out how much room she had in the Process Table. She referred to the System Table Utilization Detail screen, which provides information on the use and configured sizes of several important internal system tables. The information on this screen provides feedback on how well the kernel configuration is tuned to the average workload on the system. Debbie confirmed that she had indeed run out of room in the Proc Table. She knew that usually the system buffer cache is fully utilized. Other tables can be proactively monitored in order to reconfigure the appropriate kernel variable before a limit is reached.
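The same check can be made from the command line with sar; for example:

# sar -v 5 5

The proc-sz column shows the number of process table entries in use against the configured maximum (nproc), and the adjacent ov column counts overflows; inod-sz and file-sz report the inode and file tables in the same way.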

Evaluating an Individual Process


Cliff noticed that one process seemed to be running quite slowly. He ran GlancePlus and looked at the detail information for that process. He then examined the statistics on the Single Process Detail screen, which provides detailed information about a specific process. Cliff knew that if the process was running slowly because of a memory shortage, he would see an increase in context switches and fault counts. He noticed that the I/0 read and write counts were large and that the process was doing a lot of I/O. He checked what the process had been blocking on and noticed a high percentage for Disk 1/0 blocks. He suspected that the process was slowed because of competition for disk throughput capacity.


Had the process shown a high percentage of being blocked on priority, it would have meant the process was ready to run but was unable to do so because the CPU was being used by processes with higher dispatching priorities.

Evaluating Open Files


Kathryn is developing an application for communicating with remote systems. When a request is received, the application opens a socket and sends the specified data. However, when Kathryn tests the application, no data is received by the remote system. To find out what happened she checks the Glance Open Files screen. When she looks for the opened socket, she discovers that it never opened. She returns to her application to look for the coding error.

Evaluating Memory Regions


One day while reviewing Glance's Global Summary screen, Nancy notices that several processes have very large resident set sizes. Could this mean a potential problem with the applications' memory usage? She wonders if she should begin planning to increase physical memory size to accommodate additional users in the future. She knows current system performance is fine and memory size seems adequate, but she wants to prevent any future degradation in performance. Before making any decisions she reviews Glance's Memory Regions screen to analyze the situation more closely. She discovers that all of the affected processes have a shared memory region of about 200 KB. When added to the private DATA and TEXT regions, this accounts for the large resident set sizes. By checking the virtual address of the shmem region, she determines that the same shared memory region is being used by the processes. Because it is a shared region, it is physically in memory only once, but Glance displays it for each process attached to the shared memory region. Nancy smiles when she sees this, because it means that no problem exists. By using the shared memory region the processes are using far less memory than it appears.
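A command-line cross-check for this situation is ipcs; for example:

# ipcs -mob

This lists the shared memory segments on the system with their sizes (SEGSZ) and the number of attached processes (NATTCH), confirming that a single segment is shared by many processes rather than duplicated in each one.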

Evaluating All CPUs Statistics


Rosalie works on a multiprocessor system. While checking the All CPUs screen one day, she noticed that one CPU seemed to be consistently busier than the others. Realizing that overall system throughput would be improved if the load were balanced among the processors, she decided to investigate the situation. As she studied the All CPUs screen she noticed that one PID always seemed to be the last PID executing on CPU 1, her busiest CPU. When Rosalie checked mpctl(), she saw that the process had been assigned to CPU 1. Using mpctl -f, she reassigned the process to be a floater, so that the system could determine which processor should run the process. Rosalie then checked for other processes that had been assigned to CPU 1 and reassigned them as floating processes. After doing so, she rechecked the All CPUs screen and observed that the load appeared more even among all the processors, thus alleviating a potential bottleneck on any single CPU.

Evaluating Activity on Logical Volumes


Lately when Yuki uses GlancePlus to check his system, he notices that the Global Disk Utilization bar displayed in the top portion of every Glance screen is often close to 100%. Yuki's system has multiple disk drives, but he knows that the global disk utilization figure indicates the activity on the busiest disk.


Yuki would like to spread the disk I/O more evenly among the drives to avoid potential I/O bottlenecks. With that goal in mind, he first checks the Disk Detail screen. It shows that logical disk activity is high. For details, he goes to the Logical Volumes screen, where he notices high write activity on logical volume /dev/vg00/lvol12. Getting out of Glance and into the UNIX shell, he types vgdisplay -v /dev/vg00 to ascertain the physical disk names associated with the volume.

Back in Glance, Yuki views the Disk Queue Lengths screen to determine the busiest disks in the volume. Then he checks the Disk Detail screen to find out whether disk activity was caused by system or user activity. Yuki notices that the Virtual Memory physical accesses are low, indicating application rather than system activity. He checks the Open Files screen to find which application was creating so many writes to the disk. Voila! Fred is running his baseball pool again! Yuki pays a visit to Fred. After discussing Fred's I/O needs, Yuki returns to his console to balance the I/O load, using LVM commands to rearrange the logical volumes.

Now it's time to grab your toolbox, pop the hood, and take a look.

Good Luck!


Solutions


111. LAB: Establishing a Baseline


Directions
The following lab exercise establishes baselines for three CPU-bound applications and one disk-bound application. The objective is to time how long these applications take when there is no activity on the system. These same applications will be executed later in the course when other bottleneck activity is present. The impact of these bottlenecks on user response time will be measured through these applications.

1. Change directory to /home/h4262/baseline.

# cd /home/h4262/baseline

2. Compile three C programs (long, med, and short) by running the BUILD script.

# ./BUILD

3. Time the execution of the long program. Make sure there is no activity on the system.

# timex ./long

Record Execution Time: real: _____ user: _____ sys: _____

Answer: Varies with system configuration, on the order of tens of seconds to minutes. Example output follows from an rp2430 server:

# timex ./long
The last prime number is : 49999

real        3:37.89
user        3:35.68
sys            0.12

Example output follows from an rx2600 server:

# timex ./long
The last prime number is : 99991

real        2:53.24
user        2:51.74
sys            0.06


4. Time the execution of the med program. Make sure there is no activity on the system.

# timex ./med

Record Execution Time: real: _____ user: _____ sys: _____

Answer: Varies with system configuration; should be about one half of long. Example output follows from an rp2430 server:

# timex ./med
The last prime number is : 49999

real        1:52.68
user        1:51.55
sys            0.08

Example output follows from an rx2600 server:

# timex ./med
The last prime number is : 99991

real        1:33.71
user        1:33.02
sys            0.04

5. Time the execution of the short program. Make sure there is no activity on the system.

# timex ./short

Record Execution Time: real: _____ user: _____ sys: _____

Answer: Varies with system configuration; should be about one eighth to one tenth of med. Example output follows from an rp2430 server:

# timex ./short
The last prime number is : 49999

real          10.88
user          10.70
sys            0.05

Example output follows from an rx2600 server:


# timex ./short
The last prime number is : 99991

real           8.56
user           8.49
sys            0.03

6. Time the execution of the diskread program.

# timex ./diskread

Record Execution Time: real: _____ user: _____ sys: _____

Answer: Varies with system configuration, on the order of tens of seconds. Example output follows from an rp2430 server:

# timex ./diskread
DiskRead: System  : [HP-UX]
DiskRead: RawDisk : [/dev/rdsk/c1t15d0]
DiskRead: Start reading : 1024MB
1024+0 records in
1024+0 records out

real          28.01
user           0.02
sys            0.53

Example output follows from an rx2600 server:

# timex ./diskread
DiskRead: System  : [HP-UX]
DiskRead: RawDisk : [/dev/rdsk/c2t1d0s2]
DiskRead: Start reading : 2048MB
2048+0 records in
2048+0 records out

real          28.69
user           0.01
sys            0.13

7. In the case of the long, med, and short programs, the real time is (approximately) the sum of the usr and sys times. This is not the case with diskread. Explain why.

Answer: We first assume that there is no other load on the system. In the case of a classic number-crunching CPU hog (long, med, and short are all of these) there will be no system calls (except for the final terminal output), and the program only needs CPU time in usr mode.


As there is only one process, there is no waiting. This is shown by the real time being very close to the sum of the sys and usr times for the process. long, med, and short only do calculations and make no call on kernel resources during their execution, so the usr time is very high compared to the sys time.

This is not the case for diskread. The program makes very little demand on the CPU, shown by the sum of usr and sys being quite small compared to the real or wall-clock time. The huge difference between real time and usr+sys time proves that the program is waiting on disk I/O most of the time. Also note that sys is much higher than usr, meaning that the program is bound on system calls (disk I/O) rather than computation when it does execute.


112. LAB: Verifying the Performance Queuing Theory


Directions
The performance queuing theory states that as the number of jobs in a queue increases, so will the response time of the jobs waiting to use that resource. (This lab uses the short program compiled from /home/h4262/baseline/prime_short.c.) Example figures below are from a C200 workstation.

1. In terminal window 1, monitor the CPU queue with the sar command.

# sar -q 5 200

2. In a second terminal window, time how long it takes for the short program to execute.

# timex ./short &

Answer: rp2430:

# timex ./short &
[1] 10050
# The last prime number is : 49999

real          10.85
user          10.70
sys            0.05

rx2600:

root@r265c145:/home/h4262/baseline # timex ./short &
[1] 6486
The last prime number is : 99991

real           8.59
user           8.50
sys            0.03

How long did the program take to execute? 8 to 11 secs.

How does this compare to the baseline measurement from earlier? A little longer, due to the overhead of sar.


3. Time how long it takes for three short programs to execute.

# timex ./short & timex ./short & timex ./short &

How long did the slowest program take to execute? _____________________
How did the CPU queue size change in the first window? __________________

Answer: rp2430:

# timex ./short & timex ./short & timex ./short &
[1] 10203
[2] 10205
[3] 10206
# The last prime number is : 49999

real          29.86
user          10.68
sys            0.01

The last prime number is : 49999

real          32.07
user          10.67
sys            0.01

The last prime number is : 49999

real          32.35
user          10.67
sys            0.01

rx2600:

# timex ./short & timex ./short & timex ./short &
[1] 6690
[2] 6692
[3] 6694
# The last prime number is : 99991

real          25.08
user           8.48
sys            0.00

The last prime number is : 99991

real          25.56
user           8.48
sys            0.00

The last prime number is : 99991

real          25.60
user           8.48
sys            0.00

How long did the slowest program take to execute? 25 to 34 secs, around three times longer than one occurrence of the program. If you have a multiprocessor, the time will be distributed over the number of processors, with the lower limit being the time a single process would take. For example, if your system had two processors, the slowest process would complete in one-half the time it would take on a single-processor system. Since we're only running three processes here (not including sar), three processors or more than three processors would show the same results.

How did the CPU queue size change in the first window? The sar -q output shows that the average CPU queue length (first field) increases by three times when three programs are run concurrently.

4. Time how long it takes for five short programs to execute.

# timex ./short & timex ./short & timex ./short & timex ./short & timex ./short &

How long did the slowest program take to execute? _________
How did the CPU queue size change in the first window? ________

Answer: rp2430:

# timex ./short & timex ./short & timex ./short & \
timex ./short & timex ./short &
[1] 10212
[2] 10214
[3] 10216
[4] 10218
[5] 10220
# The last prime number is : 49999

real          53.98
user          10.68
sys            0.01

The last prime number is : 49999

real          54.08
user          10.68
sys            0.01

The last prime number is : 49999

real          54.08
user          10.68
sys            0.01

The last prime number is : 49999

real          54.08
user          10.67
sys            0.01

The last prime number is : 49999

real          54.15
user          10.68
sys            0.01

rx2600:

# timex ./short & timex ./short & timex ./short & \
timex ./short & timex ./short &
[1] 6737
[2] 6739
[3] 6741
[4] 6743
[5] 6745
# The last prime number is : 99991

real          42.52
user           8.49
sys            0.00

The last prime number is : 99991

real          42.56
user           8.48
sys            0.00

The last prime number is : 99991

real          42.59
user           8.48
sys            0.00

The last prime number is : 99991

real          42.67
user           8.48
sys            0.00

The last prime number is : 99991

real          42.75
user           8.48
sys            0.00

How long did the slowest program take to execute? 43 to 54 secs. If you have a multiprocessor, the time will be distributed over the number of processors, with the lower limit being the time a single process would take. For example, if your system had two processors, the slowest process would complete in one-half the time it would take on a single-processor system. Since we're only running five processes here (not including sar), five processors or more than five processors would show the same results.

How did the CPU queue size change in the first window? It increased by 5 while the test was being run.

5. Is the relationship between elapsed execution (real) time and the number of running programs linear?

Answer: Yes, very much so. The fastest program in the last case (where 5 programs are running) takes five times longer than with one program. You can draw a graph and go to 10 programs if you are unsure! Typing the command with more than 10 occurrences gets a little tedious (see the loop sketch below)! You will find a linear relationship in any case.

6. Comment on the overhead of switching from one process to another.

Answer: The overhead of task switching is very low. If it were not, the relationship in the above tests would not be linear. If there is an overhead, it looks like we will not see it unless there are hundreds of processes being switched.
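Rather than typing the command line repeatedly, a small loop makes the scaling test painless (a sketch for the POSIX shell):

# for i in 1 2 3 4 5 6 7 8 9 10
> do
>    timex ./short &
> done

Each iteration puts another copy of short in the run queue; plot the real times against the number of copies to see the linear relationship.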


268. LAB: Performance Tools Lab


The goal of this lab is to gain familiarity with performance tools. A secondary goal is to get familiar with the metrics reported by the tools, although they will be explored in depth over the following days.

Directions
Set up: Change directories to /home/h4262/tools and execute the setup script:

# cd /home/h4262/tools
# ./RUN

Use glance (or gpm if you have a bit-mapped display), sar, top, vmstat, and any other available tools to answer the following questions. List as many as possible, and include the appropriate OPTION or SCREEN, which will give the requested information. Specific numbers are not the important goal of this lab. The goal is to gain familiarity with a variety of performance tools. Always investigate what the basic UNIX tools can tell you before running glance or gpm. You may want to run through this lab with the solution from the back of this book for more guidance and discussion. These results were obtained on a C200 workstation running 11i. Remember, the absolute numbers are not important here, but you should be drawing similar conclusions.

1. How many processes are running on the system? Which tools can you use to determine this?

Answer:
top     Gives the number of running processes in the summary portion of the screen:
        119 processes: 96 sleeping, 17 running, 6 zombies
ps      ps -e | wc -l and subtract 1 for the headers and 1 each for ps and wc.
glance  Look at the table screen (t page) and see the current size of the proc table.
sar     sar -v 2 10 and look at the proc-sz field.
gpm     Gives the count at the top of the Process List report.

2. Are there any real-time priority processes running? If so, list the name and priority. What tools can you use to determine this?

Answer: syncer, midaemon, lab_proc2, sometimes swapper. ttisr and prm3d will also be seen on 11/11i systems running at pri 32. This is the POSIX real-time range, which is even higher than the normal UNIX real-time priorities.

glance  Global screen, PRI column (turn off all filters)
top     PRI column
gpm     Use the filters to filter priorities < 128 (Process List/Configure/Filters)
ps -el  PRI column

(Try this command: # ps -el | grep -v PRI | sort -k 7,7n | more
The highest priority processes will be listed at the top.)

Remember, a real-time priority is anything less than 128.

3. Are there any nice'd processes on the system? If so, list the name and priority for each. What tools can you use to determine this?

Answer:
glance  Go through each single process screen. (The default is 20; 21-39 is nice; 0 to 19 is nasty, i.e. anti-nice.)
gpm     Process List; select a process by double-clicking.
top     NI column
ps -el  NI column

(Try this command: # ps -el | grep -v NI | sort -k 8,8n | more
The nasty processes will be listed at the top and the niced processes will be at the bottom.)

On 11i the following were nasty: diagmond, diaglogd, psmctd, memlogd, krsd. The following were nice: all 6 <defunct> zombie processes (see below), lab_proc4.

4. Are there any zombie processes on the system? If so, how many are there? What tools can you use to determine this?

Answer: A zombie is a terminated process whose parent is still running, but has not called wait() for the child. Zombies whose parent has terminated should eventually be adopted by the init process, which will issue a wait() on the zombie. Therefore, a zombie whose parent has terminated should eventually disappear. What resources do zombies consume? Memory (<= 20 pages) and process table entries.


top             The number of zombie processes is shown in the summary portion of the screen:
                119 processes: 96 sleeping, 17 running, 6 zombies
glance and gpm  By design, they do not currently report zombies, unless the process entered the zombie state during the interval.
ps -el          Z in the S(tate) column and <defunct> in the COMMAND column

5. What is the length of the run queue? What are the load averages? What tools can you use to determine this?

Answer:
glance  CPU screen (c), page 2, shows a RUNNING LOAD AVERAGE, which has been labeled incorrectly in older versions as run-queue. The All CPUs screen (a) shows the 1-, 5-, and 15-minute load averages, whereas the CPU screen shows the interval load average.
gpm     CPU button or Reports/CPU Graph shows the interval load average. Reports/CPU Report shows the interval load average. Reports/CPU by Processor shows the 1-, 5-, and 15-minute load averages.
uptime  1-, 5-, and 15-minute load averages
top     Load averages: 5.39, 5.27, 5.20 and the interval load average
sar -q  Average run queue size over the interval
vmstat  The r column is the run queue size over the interval
xload   10-second load average over time

The run queue length on the test system was around 5 no matter how it was measured. There is also a hardware-dependent approach that can be used on servers, using the console Hex display code.

HEX display (front panel or console) shows size of runQ in the second digit. F31F means there are three processes in the runQ and one CPU. FA1F means 10 or more in the runQ. MPE uses this as a percent utilization number. The runQ is an instantaneous value and can never be a fractional number. The load average is based on the runQ, but includes short sleepers (discussed in CPU section).

6. How many system processes are running? What tools can you use to determine this? NOTE: A system process is defined as a process whose data space is the kernel's data space (e.g. swapper, vhand, statdaemon, unhashdaemon, supsched). ps reports their size as zero. Others are as below.

There are three ways this can be determined. If you get stuck on this question, move on. Don't spend more than a few minutes trying to answer this question.


Answer:
top     PA-RISC: RES = 16K (32-bit kernel) or 32K (64-bit kernel) per thread. IA-64: RES = 80K per thread.
glance  PA-RISC: 16K (32-bit kernel) or 32K (64-bit kernel) per thread on the global screen. IA-64: 80K per thread on the global screen.
ps -el  The second bit in the F column value indicates a system process (see the man page for ps). F column = 3, PPID column = 0, and SZ column = 0.

(Try this command: # ps -el | grep 3 | more
This will list all the system processes. No, technically, init is NOT a system process.)

This amounts to 17 processes on the test 11i system.

7. What percentage of time is the CPU spending in different states? What tools can you use to determine this?

Answer:
glance     Bar graph; CPU screen (c) displays detailed CPU state information. Per-process (S) details per-process CPU utilization. (The display can be toggled between cumulative/interval (C) and percent/absolute (%).)
gpm        Main window, Reports/CPU or Reports/CPU by Processor
sar        user/system/waiting for I/O/idle
top        user/nice/system/idle/block/swait/intr/ssys (context switch). SEE /usr/include/sys/dk.h for CPUSTATE (CP_USER, etc.). block is the spinlock percentage (on MP systems only); this figure is obsolete at 11i. swait is the alpha semaphore percentage (on MP systems only); this figure is obsolete at 11i.
vmstat     user/system/idle
iostat -t  us/ni/sy/id

NOTE:

Always watch the first line in vmstat or iostat. It is the average since bootup. Use vmstat -z to clear the sum structures for vmstat. There is no similar option for iostat.

8. What is the size of memory? What is the size of free memory? What tools can you use to determine this?

Answer:
glance  M(emory) screen: Free Memory, Phys Memory, Avail Memory, Total VM, Active VM, Buf Cache Size
gpm     Reports/Memory Report
vmstat  free (in pages); avm (active virtual memory, in pages) includes on-disk pages


top         Real (real active), virtual (virtual active), and free, in KB
/etc/dmesg  Amount of physical and available memory

The memory stats from top are misleading. The values in brackets are figures for processes that are regarded as busy, whatever top means by that. In most utilities, busy or active means that the process is in the RUN state or has executed within the last 20 seconds. The real figures are a summation of the resident set sizes of all processes (sum of the RES field); this is not the amount of physical memory in the system. The only way to get the true physical memory in the system is through glance/gpm or dmesg. The boot info seen in dmesg with the physical memory figure will be lost if general console messages (e.g. file system full) have overwritten the limited buffer space. vmstat figures are generally accurate, with the free field agreeing well with glance/gpm and top. Remember, top reports memory in 1K units while vmstat reports memory in 4K pages, so multiply the vmstat figures by 4 to compare them to top.

9. What is the size of the swap area(s)? What is the percentage of swap utilization? What tools can you use to determine this?

Answer:
glance  Bar graph (reserved/used); w (swap): %used (device and filesys), MB reserved by swap device, MB available, and MB used
gpm     Reports/Swap Space (glance w); Reports/System Table Info/System Tables Graph Report. NOTE: The graph shows the high-water mark (nice!)

sar -w       N/A (only shows the size of the swap queue and swapping rates)
vmstat -S    N/A (only shows paging and swapping rates)
top          N/A
swapinfo -t  KB avail/used/free/%used by swap device
bdf -b       File system swap space used/avail

swapinfo can be misleading unless you know what you are looking at. To remove the confusing issue (pseudo swap), enter swapinfo -t, which correctly calculates and includes pseudo swap by taking the total figures. This will be explained in detail in the module on swap space management.
# swapinfo -t
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev      524288   11276  513012    2%       0       -    1  /dev/vg00/lvol2
reserve       -  204360 -204360
memory   180940   84488   96452   47%
total    705228  300124  405104   43%       -       0    -

Note that we have used 43% of our swap space and not 2%.


10. What is the size of the kernel's in-core inode table? How much of the inode table is utilized? What tools can you use to determine this?

Answer:
glance  t(able) screen, page two
gpm     Reports/System Table Info/System Tables Report (NOT Graph)
sar -v  Used/size/overflows

Remember that the inode cache may contain entries for files that are closed, if it doesn't need to flush them out to open new files. Its size is the maximum number of unique files that can be open system-wide.

(Try this command: # ps -el | grep -v C | sort -k 6,6n
This will list the most active processes at the end.)

lab_proc5 and lab_proc3 are the main CPU users. They are consuming close to 100% of the CPU between them. This is not normal behavior!

12. Are there any processes running which are using a lot of memory? (A "lot" is relative, i.e. a large RSS size compared to other processes.) If so, what is the name of the process? What steps did you take to determine this? Is memory utilization changing?

Answer:
glance  Global screen: RSS; per-process screen: RSS and VSS sizes
gpm     Reports/Process List (glance g)
top     SIZE (KB: text/data/stack), RES (KB: resident size)
ps -el  SZ column (size in pages of the core image, including only text + data + stack)

(Try this command: # ps -el | grep -v SZ | sort -k 10,10n


This will list the largest processes at the end.)

lab_proc1 has a much larger SZ (ps -el output) than most other processes. This program is 8 MB in core and could be regarded as a memory hog. Remember that SZ is in pages; multiply by 4K to get the actual figure.

13. Are there any processes running which are doing any disk I/O? If so, what is the name of the process? What steps did you take to determine this? What are the I/O rates of the disk-bound processes? What files are open by this (these) process(es)? NOTE: No processes are really doing a lot of physical disk I/O. However, lab_proc3 is doing a LOT of logical I/O.

Answer:
glance  The i screen will periodically show lab_proc3 as the largest disk user. The s(ingle) process screen, open files, will show the actual open files and offset, which MIGHT be indicative of the amount of I/O.
gpm     Reports/Process List
sar -d  Reports physical disk I/O for the system overall
sar -b  Reports and compares logical I/O to physical I/O

Notice sar -b reporting very high logical read I/Os. The lab_proc3 process is very busy with disk reads, but the system has cached all the data in the buffer cache, preventing physical disk I/O.
# sar -b 2 2
HP-UX workstn B.11.11 U 9000/782    01/22/01

15:39:38 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
15:39:40       0   19646     100       1       2      60       0       0
15:39:42       0   21454     100       0       3      83       0       0

14. What is the current rate of semaphore or message queue usage? What tools can you use to determine this?

Answer:
sar -m  The ONLY tool to show message and semaphore ops/sec
glance  The single process screen shows messages sent/received

Semaphore and message usage was effectively zero in the lab test, as none of the test programs manipulate semaphores or messages. These resources will be covered in a later module. Relational databases (Oracle, Informix, Sybase, etc.) are big users of such resources.

# sar -m 2 2


HP-UX workstn B.11.11 U 9000/782    01/22/01

15:41:58   msg/s  sema/s
15:42:00    0.00    3.98
15:42:02    0.00    2.00

15. Is there any paging or swapping occurring? What tools can you use to determine this?

Answer:
glance     m(emory) screen: page faults, paging requests, KB paged in, KB paged out, deactivations/reactivations, KB swapped in, KB swapped out, VM reads, VM writes
gpm        Reports/Memory Graph (or Memory button): page OUTS, swap OUTS. Reports/Memory Report: in/out/etc.
sar -w     Swapping only
vmstat     Paging (pi/po)
vmstat -S  Swapping (si/so) and Paging (pi/po)

In terms of simple UNIX commands, vmstat is the way to go. The sar command does not understand paging (more in the module on memory management!) and is measuring the swap rate only. See the pi and po fields below from vmstat. This system is not paging at all so we can be confident that there is no swapping activity.
# vmstat 2 3
procs       memory            page                          faults       cpu
r  b  w    avm   free  re  at  pi  po  fr  de  sr   in    sy   cs  us sy id
4  0  0  78478   1593   6   1   0   0   0   0   0  108  2096  171   3  2 94
4  0  0  78478   1552   2   0   0   0   0   0   0  117

glance/gpm do give good paging and swapping detail on the m screen and the data should tie in with vmstat. 16. What is the system call rate? What tools can you use to determine this? Answer: glance: CPU screen, page 2; Single process screen (s), then (L) reads/writes/opens/closes/ioctls/forks/vforks/messages sent and received Reports/CPU Report Reports/Process List, select for single process screen Total/reads/writes/forks/execs First sy column (under faults)

gpm sar c vmstat

sar and vmstat give good data that should agree here. The system call rate can be used as an indication of how busy your system is once you have established the normal range for its value. There is no absolute good or bad figure as this depends on:


a) How powerful your system is.
b) How many CPUs you have.
c) What processes you are running.

Example live data from the test system (C200 running 11i):
# sar -c 2 10
HP-UX workstn B.11.11 U 9000/782    01/22/01

16:03:24  scall/s sread/s swrit/s fork/s exec/s    rchar/s  wchar/s
16:03:26     4417    1205    1188   0.00   0.00   85899336     2038
16:03:28     4623    1252    1249   0.00   0.00   88524288     4096

Note the system is doing over 4K system calls per second (scall/s), over half of which can be attributed to reads (sread/s) and writes (swrit/s). See how dramatically this number can be changed by adding a simple extra process (you might like to try this while monitoring sar -c).
# dd if=/stand/vmunix of=/dev/null bs=64 &
# sar -c 2 10
HP-UX workstn B.11.11 U 9000/782    01/22/01

16:10:37  scall/s sread/s swrit/s fork/s exec/s    rchar/s  wchar/s
16:10:39    21528   10461    9586   4.98   4.98  172712128    16302
16:10:41    19882    7790    6878   5.00   5.00  148155392     7168

The glance/gpm tools become invaluable when you want to know which processes are hitting the system with so many system calls. More on this later.

17. What is the buffer cache hit ratio? What tools can you use to determine this?

Answer:
sar -b  Read and write hit ratios
glance  Disk screen (d), page 2, shows both ratios
gpm     Reports/Disk Report shows both ratios

See answer 13 for example output of sar -b.

18. What is the tty I/O rate? What tools can you use to determine this?
Answer:
sar -y
iostat -t

The quickest tool to use here is iostat -t. In general, modern system administrators care less and less about terminal I/O, as almost all users connect to application servers and services over LAN networks. An exception to this rule would be the case of modems.


A system with multiple modems may experience a modem storm, with meaningless data being fired at the host by a bad modem line. iostat -t will catch this problem as a high tin (tty characters read per second) value.
# iostat -t
      tty                  cpu
 tin  tout          us  ni  sy  id
   0     5           3   1   3  94

19. Are there any traps (interrupts) occurring? What tools can you use to determine this?
Answer:
vmstat -s

NOTE: vmstat -s shows traps since bootup (you should probably zero the counters first). Examples of traps: page faults, overflow/underflow (integer and floating point), HPMC/LPMC, and floating point emulation traps.

When traps occur, the normal flow of a program is interrupted and work is done to take care of a problem before normal program instructions can continue. For example, trying to access a data page which is not in memory and is out on disk results in a page fault, causing the execution of the program to stop, waiting for the required data to come in from disk. As with system call rates (see 16), there is no absolute good or bad figure, but you are advised to monitor the trap rate as a sanity reference. Clear the vmstat counters with vmstat -z first. Note that the numbers seen have been generated in the time between the two vmstat commands! Example data from our C200 running 11i; the parameter list has been reduced.
# vmstat -z
# vmstat -s
      12 swap ins
      12 swap outs
       0 pages swapped in
       0 pages swapped out
    8636 total address trans. faults taken
    2633 page ins
       0 page outs
      20 pages paged in
       0 pages paged out
    6594 cpu context switches
    7640 device interrupts
   11335 traps
  153724 system calls
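A quick way to watch these counters move is to touch freshly allocated memory. The following is a minimal, hypothetical demo (faultdemo.c is our name; it is not part of the course labs): the first touch of each new page takes an address translation fault, which shows up in the vmstat -s counters.

/* faultdemo.c -- hypothetical demo, not part of the course labs.
 * Run vmstat -z, then this program, then vmstat -s and watch the
 * "total address trans. faults taken" counter climb.
 */
#include <stdio.h>
#include <stdlib.h>

#define MB (1024L * 1024L)

int main(void)
{
    long size = 64 * MB;            /* 64 MB = ~16000 4 KB pages */
    long off;
    char *buf = malloc(size);

    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    for (off = 0; off < size; off += 4096)
        buf[off] = 1;               /* first touch of each page traps */

    printf("touched %ld pages\n", size / 4096);
    free(buf);
    return 0;
}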


20. What information can you collect about network traffic? What tools can you use to determine this?
Answer:
glance:   l(an) screen: packets in/out, collisions, errors
          NFS global screen (N): rd/wr rates, calls, response time, etc.
          NFS by system (n): reads/writes/response time by system for both
          client and server requests
gpm:      Reports/LAN Graph (or NW button): packets in/out per second;
          Reports/Network by LAN; Reports/NFS Global Activity;
          Reports/NFS by System; Reports/NFS by Operation
netstat:  Sockets in use
          -m   memory buffers in use (NOTE: no longer works in HP-UX 11.X)
          -i   packets in/out, errors in/out, collisions by interface
          -s   packets, bytes, retransmissions, duplicates, acks, checksum
               errors, timeouts, etc.
          -rs  routing statistics
nfsstat:  Server rpc and nfs stats; client rpc and nfs stats

A later module will cover networking performance issues in more detail. The most important performance metric in a CSMA/CD (Ethernet or 802.3) network is the collision rate. This is available in glance/gpm, and in the Network module we will learn how to extract this data using lanadmin.

21. What information can be gathered on CPUs in an SMP environment? What tools can you use to determine this?
Answer:
glance:       a(ll) CPU detail: utilization and load averages by CPU
gpm:          Reports/CPU Info/CPU by Processor: util, ld avg, CS rate,
              fork rate, last PID
sar -Mu/-Mq:  Utilization/queue lengths by CPU (with -u or -q)
top:          CPU number on which a process is assigned and utilization
              per CPU

sar -M output has changed at 11i. The output of sar -M will look identical to sar -u (or -q) if the system only has one CPU. For MP systems you are presented with the sar (-u or -q) data on a per-CPU basis. This becomes helpful in measuring the balance of processes across processors. top displays its MP information by default, giving the CPU reference for each process. This information is hidden if there is only one processor or if the -h option is used. The a page of glance also measures the balance per CPU and indicates the last process to run on any given CPU. Very useful.


22. What information can be gathered on Logical Volumes? What tools can you use to determine this?
Answer:
glance:     v (LVM) screen: reads/writes/MWC hits and misses by LV or VG
gpm:        Reports/Disk Info/I/O by LV
vgdisplay, lvdisplay, pvdisplay:
            General information on Volume Groups, Logical Volumes and
            Physical Volumes; use -v for details
bdf, mount: Information on file systems on logical volumes

Physical disk layout (the positioning of data on disk) is important for performance. The lvdisplay -v and pvdisplay -v commands are the best way of finding out the precise layout of logical volumes on physical disks. In a later module we will look in detail at mirroring and striping techniques used to manipulate physical disk layout to our advantage.

23. What information can be gathered on Disk I/O? What tools can you use to determine this?
Answer:
glance:   d(isk): logical/physical reads/writes, user/VM/system/raw/NFS
          i(o): by file system, logical/phys/VM
          u (queue): queue length and utilization by spindle
          v (LV): see above
gpm:      press disk bottleneck button (queue);
          Reports/Disk Info/Disk Report (glance d);
          I/O by Disk (glance u + type [phys, logl, VM, FS, System, RAW]);
          I/O by File System (glance i + blocksize, util, logl, sys, VM);
          I/O by LV (glance v)
iostat:   KB/sec, seeks/sec, millisec/seek by spindle (NOTE: millisec/seek
          is no longer reported; it permanently reports 1.0)
sar -d:   %busy, average queue, I/O per sec, blocks/sec, average wait time,
          average service time

iostat is a redundant tool because its data is not as useful or as accurate as that obtained from sar -d. The most important place to start looking for disk I/O info lies with the disks themselves. sar cannot understand LVM layouts and only sees the disk as a whole. Use glance/gpm on the individual disks once you have identified them with sar -d. Below is some example data collected at the start of the tools lab. Stop the lab with ./KILLIT and start it again with ./RUN to see some disk I/O.
# ./KILLIT
Killing the lab procs
Removing the files
# ./RUN
cc -wall +DAportable cpu_hog.c -o cpu_hog
cc -wall +DAportable vm_bnd.c -o vm_bnd
cc -wall +DAportable io_bnd.c -o io_bnd
cc -wall +DAportable zombie.c -o zombie

# sar -d 2 4

HP-UX workstn B.11.11 U 9000/782    01/22/01

17:22:54   device   %busy   avque   r+w/s   blks/s   avwait   avserv
17:22:56   c0t6d0    3.50    0.50       6      160     3.55     9.24
17:22:58   c0t6d0    2.50    0.50       4       84     1.90    10.75
17:23:00   c0t6d0    3.00    0.50       5       84     2.57     8.75
17:23:02   c0t6d0    5.50    0.50       6      159     3.76    14.07
Average    c0t6d0    3.62    0.50       5      122     3.04    10.90

We would not consider 3-5% busy as being a bottleneck here. We will see much higher disk loads later!

Shut down the simulation by entering:

# ./KILLIT


3-18. LAB: gpm and glance Walk-Through


Directions
The following lab is intended to familiarize the student with gpm and glance. To achieve this result, the lab will walk the student through a number of windows and tasks in both the ASCII and X-Windows versions of gpm and glance.

The Graphical Version GlancePlus


1. Log in. If you have not already done so, please log into the system with the user name and password provided by your instructor.

2. Start GlancePlus. From a terminal window, invoke GlancePlus by entering gpm.

# gpm

In a few seconds gpm will come up. The first thing you will see is a license notification informing you that you are starting a trial version of GlancePlus, along with ordering and technical support information. On the gpm Main screen, you will see four graphs for CPU, Memory, Disk, and Networking. By default, the graphs are in the resource history format. This means that for each interval (configurable) there will be a data point on the graph, up to the maximum number of intervals (also configurable).

3. Interval Customizations. Click on Configure in the menu bar, and select Measurement. Set the sample interval to 10 seconds and the number of graph points to 50. This will allow you to see up to 500 seconds of system history. Click on OK.

NOTE: This setting will be saved for you in your home directory in a file called $HOME/.gpmhp-system_name. This means that all GlancePlus users will have their customizations saved.

Start a program from another window:

# cd /home/h4262/cpu/lab1
# ./RUN &

4. Main Window. Below each graph within the GlancePlus Main window, you will find a button. These buttons display the status color of adviser symptoms. This is a powerful feature of GlancePlus that we will investigate later. Clicking on one of these buttons displays details of that particular graph.

To view the adviser symptoms from the main window, select:

Adviser -> Edit Adviser Syntax

This will display the definitions of the current symptoms being monitored by GlancePlus. Close the Edit Adviser Syntax window.


View CPU details: Click the CPU button. To view a detailed report regarding the CPU, select:

Reports -> CPU Report

Select:

Reports -> CPU by Processor

This is a useful report, even on a single-processor system.

5. Online Help. One method for accessing online help within GlancePlus is to click on the question mark (?) button. The cursor changes to a ?. Click on the column heading, NNice CPU %. This opens a new window describing the NNice CPU % column. View descriptions for other columns, including the SysCall CPU %. When finished viewing online help for columns, click on the question mark one more time. This returns the cursor to normal.

6. Alarms and Symptoms. A symptom is some characteristic of a performance problem. GlancePlus comes with predefined symptoms, or the user can define his own. An alarm is simply a notification that a symptom has been detected. From the main window, select:

Adviser -> Symptom History

For each defined symptom, a history of that particular symptom is displayed graphically. The duration is dependent on the glance history buffers, which are user-definable. Close the window.

Click on the ALARM button in the main window. This displays a history of all the alarms that have occurred since GlancePlus was started. Up to 250 alarms can be displayed. Close the window.

7. Process Details. Close all windows except for the main window. Select:

Reports -> Process List

This shows the interesting processes on the system (interesting in terms of size and/or activity). To customize this listing, select:

Configure -> Choose Metrics


This will display an astonishing number of metrics, which can be chosen for display in this report. This is also a quick way to get an overview of all of the process-related metrics available in GlancePlus. Note that the familiar ? button is also available from this window. Use the scroll bar to find the metric PROC_NICE_PRI. Select this metric and click on OK. Close this window by clicking on OK.

8. Customizations. Most display windows can be customized to sort on any metric, and to arrange the metrics in any user-defined order. To define the sort fields, select:

Configure -> Sort Fields

The sort order is determined by the order of the columns. Placing a particular metric into column one makes it the first sort field. If multiple entries have the same value within this field, then the second column is used to determine the order between those entries. If further sorting is needed, then the third column is used, and so forth down the line.

To sort on Cumulative CPU Percentage, click on the column heading CPU % Cum. The cursor will become a crosshair. Scroll the window back to column one, and click on column one. This makes CPU % Cum the first sort field. Arrange the sort order so that CPU % is followed by CPU % Cum. Click Done when finished. This sort order is automatically saved so that the next time processes are viewed, this will remain the sort order.

In a similar fashion, the order of the columns can also be arranged. To define the column order, select:

Configure -> Arrange Columns

Select a column to be moved (for example, CPU % Cum). The cursor will become a crosshair. Scroll the window to the location where the column is to be inserted. Click on the column where the column is to be inserted. Arrange the first four columns to be in the following order: Process Name, CPU %, CPU % Cum, Res Mem. Click Done when finished. This display order is automatically saved so that the next time processes are viewed, this will remain the display order.

9. More Customizations. It is possible to modify the definition of interesting processes by selecting:

Configure -> Filters

An easy way to limit the processes shown is to AND all the conditions (the default is to OR the conditions). In the Configure Filters window, select AND logic, then click on OK. A much smaller list of processes should be displayed. Return to the Configure Filters window. Modify the filter definition for CPU % Cum as follows:

Change Enable Filter to ON


Change Filter Relation to >=
Change Filter Value to 3.0
Change Enable Highlight to ON
Change Highlight Relation to >=
Change Highlight Value to 3.0
Change Highlight Color to any LOUD color

Reset the logic condition back to OR, then click OK. Verify the filter took effect.

10. Administrative Capabilities. There are two administrative capabilities within GlancePlus. If working as root, processes in the Process List screen can be killed or reniced.

In the Process List window, select the proc8 process. To access the Admin tools, select:

Admin -> Renice

Use the slider to set the new nice value for this process to be +19, then click OK. Note the impact on this process.

Now, select the proc8 process again. Select:

Admin -> Kill

Click OK, and note the process is no longer present.

11. Process Details. Detailed metrics can be obtained on a per-process basis. To view process details, go to the Process List window and double-click on any process. Much of the detail in this report will be explained in the Process Management section of the course. The Reports menu provides much valuable information about the process, including the Files Open and the System Calls being generated. After surveying the information available through this window, close it and return to the Main window.

There are many other features available in GlancePlus; there are close to 1000 metrics available with it. Notice that when you iconify the GlancePlus Main window, all of the other windows are closed and the GlancePlus active icon is displayed. Alarms and histograms are displayed in this active icon. Exploding this icon will again open up all previously open windows.

12. Exit GlancePlus. From the Main window, select:

File -> Exit GlancePlus


13. Glance, the ASCII version. From a terminal window, which has not been resized, type glance.

NOTE: Never run glance or gpm in the background.

If you are accessing the ASCII version of glance from an X terminal window, make sure you start up an hpterm window to enable full glance softkeys. Do not resize the window, as ASCII glance expects a standard terminal size. You can make the hpterm window longer, but never wider; however, making it longer is frequently of no use.

# hpterm &

In the new window:

# glance

Display a list of keyboard functions by typing ?. This brings up a help screen showing all of the command keystrokes that can be used from the ASCII version of GlancePlus. Explore these to familiarize yourself with the interface.

14. Display Main Process Screen. Type g to go to the Main Process Screen. This lists all interesting processes on the system. Retrieve online help related to this window by typing h, which brings up a help menu. Select:

Current Screen Metrics

Use the cursor keys to select CPU Util.

NOTE: This metric has two values. Use the online help to distinguish the difference between the two values. Use the space bar or the Page Down key to toggle to the next page of help.

Exit the online help CPU Util description by typing e. Exit the Screen Summary topics by typing e.

From the main Help menu, select:

Screen Summaries

Use the cursor keys to select Global Bars. From this help description, explain what R, S, U, N, and A mean in the CPU Util Bar.

Exit the online help Global Bars description by typing e. Exit the Screen Summary topics by typing e. Exit the main Help menu by typing e. At any time, you can exit help completely, no matter how deep you are, by pressing the F8 key.


15. Modify Interesting Process Definition. From the main Process List window (select g), view the interesting processes. What makes these processes interesting?

Type o and select 1 (one) to view the process threshold screen. Cursor down to the Sort Key field, and indicate to sort the processes by CPU usage. While confirming the other options are correct, note that any CPU usage (greater than zero) or any disk I/O will cause a process to be considered interesting.

Run the KILLIT command to stop all lab loads.

16. Glance Reports. This is the free-form part of the lab. Spend the rest of your lab time going through the various Glance screens and GlancePlus windows. Use the table below to produce the different performance reports. Feel free to use this time to ask the instructor "How Do I . . . ?" types of questions.
COMMAND  FUNCTION                     GlancePlus (gpm) "REPORT"
*a       All CPUs Performance Stats   CPU by Processor
b        Back one screen
*c       CPU Utilization Stats        CPU Report
*d       Disk I/O Stats               Disk Report
e        Exit
f        Forward one screen
*g       Global Process Stats         Process List
h        Help
*i       I/O by Filesystem            I/O by Filesystem
j        Change update interval
*l       Lan Stats                    Network by LAN
*m       Memory Stats                 Memory Report
*n       NFS Stats                    NFS Report
o        Change Threshold Options
p        Print current screen
q        Quit
r        Redraw screen
*s       Single process information   Process List, double-click process
*t       OS Table Utilization         System Table Report
*u       Disk Queue Length            Disk Report, double-click disk
*v       Logical Volume Mgr Stats     I/O by Logical Volume
*w       Swap Stats                   Swap Detail
y        Renice process               Administrative Capabilities
z        Zero all Stats
!        Shell escape
?        Help with options
<CR>     Update screen data


4-16. LAB: Process Management


Directions
The following lab is designed to manage a group of processes. This includes observing the parent-child relationship and modifying process nice values (and thus indirectly priorities) with the nice/renice command.

Modifying Process Priorities


This portion of the lab uses glance to monitor and modify nice values of competing processes.

1. Change directory to /home/h4262/baseline.

# cd /home/h4262/baseline

2. Start seven long processes in the background.

# ./long & ./long & ./long & ./long & ./long & ./long & ./long &
[1] 15722
[2] 15723
[3] 15724
[4] 15725
[5] 15726
[6] 15727
[7] 15728

3. Start a glance session. Answer the following questions.

How much CPU time is each long process receiving? _________sec ________%
Answer: Hint: Change the sample period to 10 secs (hit the j key). This will give you more time to think and makes per second calculations easier!

The CPU should be balanced between the seven processes with each getting around 14% of the CPU (i.e. 5/7 seconds each for a 5 second interval and 10/7 seconds each for a 10 second interval). This is seen in the CPU Util field of the main glance window. Notice that the programs all have similar priority around 248-249 which is towards the bottom of the pile. If you have a multiprocessor, the processes will quickly distribute themselves among all available processors. However, the overall metrics should stay the same with the exception of the overall length of time that the processes take.
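The source of the long program is not shown in the lab materials; a minimal hypothetical stand-in (longish.c is our name) might look like the sketch below, assuming it simply crunches numbers and never makes a blocking system call, so every context switch it suffers is forced by the scheduler.

/* longish.c -- hypothetical stand-in for the lab's long program
 * (its source is not shown): a pure CPU hog.
 */
#include <stdio.h>

int main(void)
{
    unsigned long n, i, count = 0;

    for (n = 2; n < 50000; n++) {   /* naive prime count, CPU only */
        for (i = 2; i * i <= n; i++)
            if (n % i == 0)
                break;
        if (i * i > n)
            count++;
    }
    printf("%lu primes found\n", count);
    return 0;
}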


How are the processes being context switched (forced or voluntary)? ______________
Answer:

Select one of the long processes using the glance s key. Make sure the PID being suggested is the right one, or enter the correct PID. In the first column of info you will find the Forced CSwitch and Voluntary CSwitch metrics. You will notice that (almost!) all context switches are forced when you compare the two figures. This is normal for a CPU hog process: it never leaves the CPU of its own accord and is always told to leave by the scheduler. We saw 7.7-9.6 context switches per second for the period for each of the processes on an rp2430. All of the context switches were forced. On a multiprocessor, there would be the same number of context switches taking place; however, fewer processes would be sharing the same processor.

How many times over the interval is the process being dispatched? ___________
Answer:

Again, we can look to the first column of the selected process resource summary page. Find the Dispatches metric. This is a measure of how often the process is getting onto the CPU with the summation of Forced CSwitch + Voluntary CSwitch measuring how often the process gets switched out. On a multiprocessor, each processor would have fewer processes wanting its resource, so, each process would be selected more often. What is the ratio of system CPU time to user CPU time? ____________
Answer:

Look to the first column of the selected process info again and you will find the System CPU metric. This will be zero or close to zero on any system. By using the C (upper case) key we can switch between metrics for the last interval (10 seconds if you are following the solutions) or the total over the period of tracking. It makes no difference how you look at it, these processes do not process system calls. They are typical CPU hogs that crunch numbers and do nothing else. All the CPU is User/Nice/RT. What are the processes being blocked on? __________________
Answer: PRI (Priority)

The most frequent event that is blocking the process is shown by the Wait Reason metric at the bottom of the first column of Process Resource info (the same page we have been looking at all along). In this case it is PRI, short for Priority.


The process has been blocked because it is timeslicing with all the other processes. Each time it is switched out, it is placed at the end of the queue in true round-robin fashion. Thus, it is no longer the most eligible process to run and the scheduler has chosen another. For more stats, go to the Wait States page for this process (softkey F2, or hit W); notice that the process is blocked on Priority for 80-90% (6/7) of the time, and the rest of the time it is on the CPU. There are no other active wait states. The seven long processes are in a circular fight to get to the top of the pile(s).

What are the nice values for the processes? _______
Answer: 24

A Bourne-based shell (Bourne, Korn, POSIX, bash) always places background processes at a nice level 4 higher than the calling shell. The standard nice value of our shell is 20, so the child background jobs inherit 24 as the nice value. One exception is the C shell, which runs background processes at the same nice value as the shell.

4. Select one of the processes and favor it by giving it a more favorable nice value.

What is the PID of the process being favored? ____________
Answer:

To change the process's nice value, enter:

# renice -n -5 <PID of selected process>

Be careful! This forces a negative offset of 5 from 20 (the standard nice value), not from the current nice value (24). The nice value in this case will end up at 15, which is more favorable than the others, still at 24. Watch that process's percentage of the CPU over several display intervals with glance or top.

What effect did it have on the process? _____________________________
_______________________________________________________________________
Answer:

The effect on the process is that it will race away from the others, consuming approx 50-60% of the CPU! This might take a little time to settle down at 50-60%; give it several intervals to complete its adjustment. You can also confirm the new nice value programmatically; see the sketch below.
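If you want to verify the resulting nice value without glance or ps -l, a small helper using the standard getpriority() call could look like this (shownice.c is a hypothetical name; it is not part of the lab):

/* shownice.c -- hypothetical helper, not part of the lab: print a
 * process's nice value to verify the effect of renice.  getpriority()
 * returns the offset from the default, so 20 is added to match the
 * 0-39 nice column shown by ps -l and glance.
 */
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/resource.h>

int main(int argc, char *argv[])
{
    int prio;

    if (argc != 2) {
        fprintf(stderr, "usage: %s pid\n", argv[0]);
        return 1;
    }
    errno = 0;
    prio = getpriority(PRIO_PROCESS, atoi(argv[1]));
    if (prio == -1 && errno != 0) {  /* -1 is also a legal value */
        perror("getpriority");
        return 1;
    }
    printf("nice value: %d\n", prio + 20);
    return 0;
}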

5. Select another long process and set the nice value to 30.

# renice -n 10 <PID of another selected process>

What effect did that have on that process? ___________________________________
______________________________________________________________________

Answer:

This really turns the process into a loser! The priority of the process drops to 251-252, preventing the process from getting much action. If you select the process and look in the first column of the Process Resource page, you will see that it is being dispatched, but not very often. You will see the process getting around 2% of the CPU, but not much more. Each of the other processes will take up the excess, with the majority of the excess going to the process with the nice value of 15.

6. You can either let the processes finish up on their own as the next module is covered, or you can kill them now with:

# kill $(ps -el | grep long | cut -c18-22)


5-24. LAB: CPU Utilization, System Calls, and Context Switches

Directions
General Setup
Create a working data file in a separate file system (on a separate disk, if possible). If another disk is available:

# vgdisplay -v | grep Name      (Note which disks are already in use by LVM)
# ioscan -fnC disk              (Note any disks not mentioned above; select one)
# pvcreate -f <raw disk device file>
# vgextend vg00 <block disk device file>

In either case:

# lvcreate -n vxfs vg00
# lvextend -L 1024 /dev/vg00/vxfs <block disk device file>
# newfs -F vxfs /dev/vg00/rvxfs
# mkdir /vxfs
# mount /dev/vg00/vxfs /vxfs
# prealloc /vxfs/file <75% of main memory in bytes>

The lab programs are under /home/h4262/cpu/lab0:

# cd /home/h4262/cpu/lab0

The tests should be run on an otherwise idle system; otherwise, results are unpredictable. If the executables are missing, generate them by typing:

# make all

CPU Utilization: System Call Overhead


Use the dd command to size the read and write operations. Their number can then be varied to change the number of system calls used to transfer the same amount of information, exposing the overhead of the system call interface. The first command loads the entire file into the buffer cache.

# timex dd if=/stand/vmunix of=/dev/null bs=64k

Now we take our measurements.

# timex dd if=/stand/vmunix of=/dev/null bs=64k

real __________  user __________  system ____________


# timex dd if=/stand/vmunix of=/dev/null bs=2k

real __________  user __________  system ____________

# timex dd if=/stand/vmunix of=/dev/null bs=64

real __________  user __________  system ____________

Answer:

Results for an rp2430:

# timex dd if=/stand/vmunix of=/dev/null bs=64k
282+1 records in
282+1 records out

real        0.04
user        0.00
sys         0.03

# timex dd if=/stand/vmunix of=/dev/null bs=2k
9055+1 records in
9055+1 records out

real        0.15
user        0.02
sys         0.12

# timex dd if=/stand/vmunix of=/dev/null bs=64
289765+1 records in
289765+1 records out

real        3.82
user        0.56
sys         2.95

Results for an rx2600:

# timex dd if=/stand/vmunix of=/dev/null bs=64k
728+1 records in
728+1 records out

real        0.03
user        0.00
sys         0.03

# timex dd if=/stand/vmunix of=/dev/null bs=2k
23299+1 records in
23299+1 records out

real        0.18
user        0.02
sys         0.13

# timex dd if=/stand/vmunix of=/dev/null bs=64
745575+1 records in
745575+1 records out

real        4.57
user        0.54
sys         3.39

Notice that the last case is much slower due to the number of system calls being made. The block size is a factor of 1000 times less than in the first case, causing 1000 times more calls to the read() and write() system calls. Try a sar -c 2 10 in another window while the test is being run to see this effect. None of these effects have anything to do with physical disk I/O, as the whole vmunix file is coming from the buffer cache. Prove this to yourself with a sar -b 2 10 while the test is being run. Notice the 100% read cache hit rate.
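The same overhead can be demonstrated without dd. Below is a minimal, hypothetical copy loop (blkcopy.c is our name; it is not part of the lab) that mirrors the test above: the smaller the block size argument, the more read()/write() calls are needed for the same data, and the larger the sys time reported by timex.

/* blkcopy.c -- hypothetical demo of system call overhead, not part
 * of the lab: copy a file to stdout with a caller-chosen block size.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char *buf;
    long bs, calls = 0;
    int fd, n;

    if (argc != 3 || (bs = atol(argv[2])) <= 0) {
        fprintf(stderr, "usage: %s file blocksize\n", argv[0]);
        return 1;
    }
    buf = malloc(bs);
    fd = open(argv[1], O_RDONLY);
    if (buf == NULL || fd < 0) {
        perror("setup");
        return 1;
    }
    while ((n = read(fd, buf, bs)) > 0) { /* one syscall per block */
        if (write(1, buf, n) != n) {      /* plus one to write it  */
            perror("write");
            return 1;
        }
        calls += 2;
    }
    close(fd);
    fprintf(stderr, "%ld read/write calls\n", calls);
    return 0;
}

Running, for example, # timex ./blkcopy /stand/vmunix 64 > /dev/null and then the same with 65536 as the block size should reproduce the pattern seen above.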

System Calls and Context Switches


This lab shows you the maximum system call and context switch rates that your system can take. Three programs are supplied:

syscall      loads the system with system calls of one type
filestress   (shell script) generates file system-related system calls
cs           loads the system with context switches

1. What is the system call rate when your system is "idle"? ________________
Answer:
Around 400-500 on our test systems.

(rp2430)
# sar -c 2 2

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:18:56  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
11:18:58      602        3        1    0.00    0.00   203272     8151
11:19:00      264        4        1    0.00    0.00     4096      512
Average       434        3        1    0.00    0.00   103741     4341

(rx2600)
# sar -c 2 2

HP-UX r265c145 B.11.23 U ia64    04/06/04

10:57:02  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
10:57:04      719        3        1    0.00    0.00   260840        0
10:57:06      434        3        1    0.00    0.00     4096     4096
Average       577        3        1    0.00    0.00   132668     2043

2. Run filestress in the background. What is the system call rate now? What system calls are generated by filestress? Take an average with sar over about 40 seconds, i.e. # sar -c 10 4

Answer:
Around 20000-30000 on our test systems.

(rp2430)
# sar -c 10 4

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:19:43  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s
11:19:53    17423     3112     1158  130.07  130.07  29710218   147104
11:20:03    12420     3577     2627   63.40   63.40  32159540     8192
11:20:13    23240     4227     1337  192.60  192.60  39581900    17818
11:20:23    26279     3884      700  212.10  212.00  40309248   134963
Average     19840     3700     1456  149.54  149.51  35438766    77037

(rx2600)
# sar -c 10 4

HP-UX r265c145 B.11.23 U ia64    04/06/04

11:02:40  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s
11:02:50    39624     4530     1619  290.51  290.51  92426384    77746
11:03:00    28069     5618     3883  171.70  171.60  69435392    80282
11:03:10    27178     5214     3320  189.40  189.40  67771592    62259
11:03:20    31592     5057     2814  222.70  222.60  72799840    91750
Average     31618     5105     2909  218.60  218.55  75612445    78009

What system calls are generated by filestress?

Answer:
read() and write().

3. Terminate the filestress process by entering the following commands:

# kill $(ps -el | grep find | cut -c24-28)
# kill $(ps -el | grep find | cut -c18-22)

4. Run the syscall program and again answer question 2. Is the system call rate lower or higher than with filestress? Why?

Answer:
The system call rate is higher than with filestress. Non-blocking system calls produce rates up to 138,000 per second on an rp2430 and up to 290,000 on an rx2600.
(rp2430)
# sar -c 10 4

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:36:11  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
11:36:21   137619        2        0    0.00    0.00    42863     3376
11:36:31   136788        2        0    0.00    0.00     4506     1946
11:36:41   137887        2        0    0.00    0.00     5734     3277
11:36:51   138224        2        0    0.00    0.00     3686     1229
Average    137629        2        0    0.00    0.00    14171     2457

(rx2600)
# sar -c 10 4

HP-UX r265c145 B.11.23 U ia64    04/06/04

11:15:51  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
11:16:01   287322       27        1    0.50    0.40    60560     4092
11:16:11   288439        7        1    0.00    0.00   233472    20480
11:16:21   289239        9        1    0.00    0.00    27853     4096
11:16:31   290331        4        0    0.00    0.00    14746     3277
Average    288832       12        1    0.12    0.10    84104     7985

The syscall program uses the open() and close() system calls and does no I/O as such. These system calls do not block the process, which turns into a CPU hog, only blocking on Priority in the glance Wait States page. A sketch of such a loop appears below. Kill the syscall program before proceeding.

# kill $(ps -el | grep syscall | cut -c18-22)
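The source of syscall is not shown in the course materials; a minimal hypothetical reconstruction of the kind of loop described above might look like this, assuming it opens and closes /dev/null as fast as it can:

/* opencls.c -- hypothetical sketch of what the lab's syscall program
 * might look like (its source is not shown): a tight loop of
 * non-blocking open()/close() calls that drives scall/s up without
 * doing any real I/O.  Runs until killed.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd;

    for (;;) {
        fd = open("/dev/null", O_RDONLY);  /* system call 1 */
        if (fd < 0) {
            perror("open");
            return 1;
        }
        close(fd);                         /* system call 2 */
    }
}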
5. Using cs, compare the number of context switches on an idle system and a loaded system.

Idle ________    Loaded ______________

Answer:
(rp2430)
# sar -w 2 2

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:39:27 swpin/s bswin/s swpot/s bswot/s pswch/s
11:39:29    0.00     0.0    0.00     0.0      86
11:39:31    0.00     0.0    0.00     0.0      83
Average     0.00     0.0    0.00     0.0      85

# ./cs &
# sar -w 2 2

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:41:43 swpin/s bswin/s swpot/s bswot/s pswch/s
11:41:45    0.00     0.0    0.00     0.0   47733
11:41:47    0.00     0.0    0.00     0.0   47471
Average     0.00     0.0    0.00     0.0   47602

(rx2600)
# sar -w 2 2

HP-UX r265c145 B.11.23 U ia64    04/06/04

11:22:07 swpin/s bswin/s swpot/s bswot/s pswch/s
11:22:09    0.00     0.0    0.00     0.0     150
11:22:11    0.00     0.0    0.00     0.0     177
Average     0.00     0.0    0.00     0.0     164

# ./cs &
# sar -w 2 2

HP-UX r265c145 B.11.23 U ia64    04/06/04

11:22:57 swpin/s bswin/s swpot/s bswot/s pswch/s
11:22:59    0.00     0.0    0.00     0.0   81912
11:23:01    0.00     0.0    0.00     0.0   82728
Average     0.00     0.0    0.00     0.0   82319

Notice that we go from an idle context switch rate (pswch/s) of approx 100 per second up to 47000 or 82000! Additionally, you can look at the glance CPU Report (c); note how much of the CPU time is spent doing context switching (about 15%). A sketch of a context switch generator appears below.

6. Kill the cs program, remove the /vxfs/file, and dismount the /vxfs filesystem.

# kill $(ps -el | grep cs | cut -c18-22)
# rm -f /vxfs/file
# umount /vxfs
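The cs program source is likewise not shown. One classic technique for forcing voluntary context switches, which a hypothetical sketch like the following uses, is two processes bouncing a byte across a pair of pipes; each blocking read() hands the CPU to the peer:

/* pingpong.c -- hypothetical context switch generator; the lab's cs
 * source is not shown.  Every round trip of the byte forces two
 * voluntary context switches.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int ab[2], ba[2];               /* parent->child, child->parent */
    char c = 'x';

    if (pipe(ab) < 0 || pipe(ba) < 0) {
        perror("pipe");
        return 1;
    }
    switch (fork()) {
    case -1:
        perror("fork");
        return 1;
    case 0:                         /* child: echo the byte back */
        for (;;) {
            if (read(ab[0], &c, 1) != 1)
                _exit(0);
            write(ba[1], &c, 1);
        }
    default:                        /* parent: drive the ping-pong */
        for (;;) {
            write(ab[1], &c, 1);
            if (read(ba[0], &c, 1) != 1)
                break;
        }
    }
    return 0;
}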


5-25. LAB: Identifying CPU Bottlenecks


Directions
The following labs are designed to show symptoms of a CPU bottleneck.

Lab 1
1. Change directory to /home/h4262/cpu/lab1.

# cd /home/h4262/cpu/lab1

2. Start the processes running in the background.

# ./RUN

3. Start a glance session and answer the following questions.

What is the CPU utilization? _______
Answer: At or near 100%.

What are the nice values of the processes receiving the most CPU time? _______
Answer: 10

What is the average number of jobs in the CPU run queue? ______
Answer:
Varies with configuration; should be approx 3-5.

# uptime
12:05pm  up 4 days, 19:38,  7 users,  load average: 4.73, 3.31, 2.26

4. Characterize the eight lab processes that are running (proc1-proc8). Which are CPU hogs? Memory hogs? Disk I/O hogs, etc.? Identify processes that you think are in pairs.

Glance global (g) page output (rp2430):
PROCESS LIST               Users=    1
                                         User     CPU Util     Cum      Disk             Thd
Process Name    PID   PPID  Pri  Name  ( 100% max)     CPU    IO Rate      RSS    Cnt
--------------------------------------------------------------------------------
proc8         27425      1  215  root   50.1/49.4    138.6   0.0/ 0.0    168kb     1
proc3         27420      1  221  root   48.4/49.2    138.0   0.0/ 0.0    168kb     1
prm3d          1462      1  168  root    0.0/ 0.2   1125.1   0.0/ 0.0   26.6mb    19
proc5         27422      1  168  root    0.0/ 0.2      0.5   4.0/ 4.0    168kb     1
proc2         27419      1  168  root    0.0/ 0.2      0.5   3.8/ 3.9    168kb     1

Glance global (g) page output (rx2600):


PROCESS LIST               Users=    1
Process Name    PID   PPID  Pri  Name  ( 100% max)     CPU    IO Rate      RSS    Cnt
--------------------------------------------------------------------------------
proc3         26194      1  219  root   50.8/49.3     81.5   0.0/ 0.0    268kb     1
proc8         26199      1  216  root   48.5/49.4     81.6   0.0/ 0.0    268kb     1
scopeux        2105      1  127  root    0.0/ 0.0     13.3   0.0/ 0.0   20.7mb     1
prm3d          2139      1  168  root    0.0/ 0.1     77.5   0.0/ 0.0   49.5mb    19
ia64_corehw    2989      1  154  root    0.0/ 0.1     65.9   1.1/ 0.0    1.8mb     1
proc2         26193      1  168  root    0.0/ 0.1      0.2   7.6/ 7.7    256kb     1
proc5         26196      1  168  root    0.0/ 0.1      0.2   7.6/ 5.8    256kb     1

proc3 and proc8 are the main CPU hogs. They have been run with nice values of 10! The process pair accounts for almost 100% of the CPU between them. With the same CPU rates and RSS (Resident Set Size), it is likely that these are identical programs. Selecting one of these processes in glance reveals no disk I/O and a context switch profile which is always forced.

proc5 and proc2 also manage to execute with 0.2% CPU utilization each. Again, these look like a pair. If you select one of these programs and look at the Process Resource page, you can see a small amount of write disk I/O, most of which is logical. The main Wait Reason for this process is SLEEP. It would appear that these processes do a small amount of disk I/O and then call sleep() and pause for some time intentionally.

proc1 and proc7 are a pair. On selecting one of these, we see a nice value of 39! These processes find it nearly impossible to get CPU with proc3 and proc8 taking all the CPU resource. If you watch the Dispatches metric on the Process Resource page, they can be seen to get one or two slices of CPU very infrequently. You should also see that for every Dispatch (these are rare), there is always an accompanying Forced CSwitch. You can conclude that these processes would be CPU hogs if they were not so crippled by their own high nice values and the aggression of proc3 and proc8.

proc4 and proc6 are the last pair. They have standard nice values of 20 and seem to do nothing but call the sleep() system call. They are being dispatched slightly more frequently than proc1 and proc7, and they are always subject to Voluntary CSwitch. These processes are not CPU hogs. They also do no disk I/O of any kind.

None of the above processes had any significant memory size.

5. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

# timex /home/h4262/baseline/short &

How long did the program take to execute? _______
Answer:
(rp2430)
# timex /home/h4262/baseline/short &
The last prime number is : 49999

real       56.44
user       10.66
sys         0.01

(rx2600)
# timex /home/h4262/baseline/short &
The last prime number is : 99991

real     1:02.38
user        8.48
sys         0.00

6. Compare your results to the baseline established in the lab exercise in module 1, step 7.
Answer:

Total execution time is over 5 times longer than the baseline!

7. End the CPU load by executing the KILLIT script.

# ./KILLIT


Lab 2
1. Change directory to /home/h4262/cpu/lab2.

# cd /home/h4262/cpu/lab2

2. Start the processes running in the background.

# ./RUN

3. In one terminal window, start glance. In a second terminal window, run # sar -u 5 200. Answer the following questions:

What does glance report for CPU utilization? _______

Answer:
Should be greater than 50%. (The more, the merrier!) Output of the rp2430 glance (g) page below.
PROCESS LIST               Users=    1
                                         User     CPU Util     Cum      Disk             Thd
Process Name    PID   PPID  Pri  Name  ( 100% max)     CPU    IO Rate      RSS    Cnt
--------------------------------------------------------------------------------
proc2         27761      1    1  root   92.0/92.3    723.2   0.0/ 0.0    168kb     1
prm3d          1462      1  168  root    0.0/ 0.2   1137.2   0.0/ 0.0   26.6mb    19

Output of rp2430 glance (a) page below


CPU BY PROCESSOR           Users=    1
CPU  State    Util  LoadAvg(1/5/15 min)   CSwitch  Last Pid
--------------------------------------------------------------------------------
 0   Enable   93.2   0.5/ 0.6/ 1.7           1724     27761

Output of rx2600 glance (g) page below


PROCESS LIST               Users=    1
Process Name    PID   PPID  Pri  Name  ( 100% max)     CPU    IO Rate      RSS    Cnt
--------------------------------------------------------------------------------
proc2         26469      1    1  root   71.5/71.6     47.1   0.0/ 0.0    288kb     1
prm3d          1462      1  168  root    0.0/ 0.2   1137.2   0.0/ 0.0   26.6mb    19

Output of rx2600 glance (a) page below


CPU BY PROCESSOR           Users=    1
CPU  State    Util  LoadAvg(1/5/15 min)   CSwitch  Last Pid
--------------------------------------------------------------------------------
 0   Enable   73.1   0.0/ 0.2/ 0.8           1432     26469

What does sar report for CPU utilization? ________


Answer: sar reports the CPU is mostly idle. Util is less than 10%.
# sar -u 5 200

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

13:45:58  %usr  %sys  %wio  %idle
13:46:03     4     2     0     94
13:46:08     0     1     0     99
13:46:13     1     1     0     98
13:46:18     0     0     0    100
13:46:23     1     1     0     98

This is very strange; the tools totally disagree with each other. sar is reporting over 90% idle with glance reporting over 80% busy! They cannot both be right. Which one do you trust? The output of top is also confused. It sees the busy process but still reports 90% idle!
(rp2430)
Load averages: 0.50, 0.56, 1.41
112 processes: 99 sleeping, 13 running
Cpu states:                                                         Page# 1/8
LOAD   USER  NICE   SYS   IDLE  BLOCK  SWAIT  INTR  SSYS
0.50   0.6%  0.0%  2.2%  97.2%   0.0%   0.0%  0.0%  0.0%

Memory: 91236K (64076K) real, 365020K (299140K) virtual, 30120K free

TTY      PID   USERNAME PRI NI   SIZE    RES  STATE    TIME  %WCPU  %CPU  COMMAND
pts/tb   27761 root       1 20  1664K   148K  sleep   14:49  92.56  92.40 proc2

(rx2600)
Load averages: 0.03, 0.12, 0.68
128 processes: 107 sleeping, 20 running, 1 zombie
Cpu states:                                                         Page# 1/10
LOAD   USER  NICE   SYS   IDLE  BLOCK  SWAIT  INTR  SSYS
0.03   0.2%  0.0%  0.0%  99.8%   0.0%   0.0%  0.0%  0.0%

Memory: 197664K (154768K) real, 608492K (523032K) virtual, 23516K free

TTY      PID   USERNAME PRI NI   SIZE    RES  STATE    TIME  %WCPU  %CPU  COMMAND
tty1p0   26469 root       1 20  3304K   252K  sleep    4:08  71.77  71.64 proc2

What is the priority of the process receiving the most CPU time? _______
Answer

The proc2 process is the culprit and is running with the high UNIX real time priority of 1.

How much time is the process spending in the sigpause system call? ______
Answer

Now this is where the clues start!


The Wait States for proc2 show that it is blocked on SLEEP when it is not running. This wait state is the result of the process putting itself to sleep. To see the system calls that the process is making, hit the F6 softkey or the L key once you have selected the process. glance will collect the data and present it after about 10-20 seconds.

rp2430:
System Calls  PID: 27761, proc2  PPID: 1  euid: 0  User: root

                                           Elapsed                    Elapsed
System Call Name   ID   Count   Rate     Time     Cum Ct   CumRate   CumTime
--------------------------------------------------------------------------------
sigpause          111     449   99.7   0.35218      1497      74.1   1.17095
sigcleanup        139     450  100.0   0.00166      1500      74.2   0.00553

rx2600:
System Calls  PID: 26469, proc2  PPID: 1  euid: 0  User: root

                                           Elapsed                    Elapsed
System Call Name   ID   Count   Rate     Time     Cum Ct   CumRate   CumTime
--------------------------------------------------------------------------------
sigpause          111     525  100.9   1.49255      1500      74.2   4.26847
sigcleanup        139     525  100.9   0.00143      1500      74.2   0.00408

The sigpause() call is causing the sleep blocks that we see in the Wait States page. The interesting thing is that the rate at which the program calls sigpause() is always 100 times per second; that is, 10 ms (milliseconds) between calls. How can a program be so coordinated with the wall clock, and what is it using to achieve this synchronization? Can you tell what it is yet?

How is the process being context switched (forced or voluntary)? ______
Answer

Review the Resource Summary page again for proc2 and you will see that all the context switches are Voluntary. This is not the expected case for a CPU hog. How is it that a process can use so much CPU and never be seen by the scheduler and thrown off the CPU?

The Bottom Line

If you examine the code of the lab, you will see that the process arms a trap waiting for the system hardware clock (the tick) to pop. When this occurs, the program wakes up and wastes CPU for an amount of time that your instructor has tuned to be just under 10 ms (see waste.c). The program then arms the trap again and voluntarily goes to sleep waiting for the next hardware tick. Remember, the UNIX scheduler analyzes system activity on the hardware tick intervals, and our program has done a good job of never being around at these times! It's a free lunch.
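waste.c itself is not reproduced in the course notes, but the trick can be sketched as follows. This is a hypothetical reconstruction: the BURN constant is a placeholder that must be tuned per machine so the busy loop finishes in just under one 10 ms tick, and the exact sigpause() flavor varies between BSD and XSI conventions.

/* tickdodge.c -- hypothetical reconstruction of the trick described
 * above, not the actual lab source.
 */
#include <signal.h>
#include <sys/time.h>

#define BURN 500000L                /* tune: burn a bit less than 10 ms */

static void tick(int sig)
{
    (void)sig;                      /* nothing to do; just wake up */
}

int main(void)
{
    struct itimerval it;
    volatile long i;

    signal(SIGALRM, tick);
    it.it_interval.tv_sec = 0;      /* fire every 10 ms, in step    */
    it.it_interval.tv_usec = 10000; /* with the scheduler tick      */
    it.it_value = it.it_interval;
    setitimer(ITIMER_REAL, &it, 0);

    for (;;) {
        for (i = 0; i < BURN; i++)  /* waste CPU between ticks */
            ;
        sigpause(SIGALRM);          /* sleep until the next SIGALRM */
    }
}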


The standard UNIX tools (sar and top, for example) feed on the scheduler's internal statistics for measurement data, and so they get the wrong story. glance, however, uses the midaemon, which recalculates performance stats every time a process returns from a system call. And you cannot play this game without system calls.

4. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

# timex /home/h4262/baseline/short &

How long did the program take to execute? _______
Answer:
(rp2430)
# timex /home/h4262/baseline/short &
The last prime number is : 49999

real     2:32.86
user       10.88
sys         0.07

(rx2600)
# timex /home/h4262/baseline/short &
The last prime number is : 99991

real       30.86
user        8.51
sys         0.01

Our old benchmark figure was around 10 seconds (real), so this is significantly slower. This program is running in the gaps that the proc2 process is leaving. You could further modify waste.c to use more of the tick period.

5. End the CPU load by executing the KILLIT script.

# ./KILLIT


6-18. LAB: Memory Leaks


There are several performance issues related to memory management: memory leaks, swapping/paging, and protection ID thrashing. Let's investigate a few of them.

1. Change directories to /home/h4262/memory/leak:

# cd /home/h4262/memory/leak

Memory leaks occur when a process requests memory (typically through the malloc() or shmget() calls) but doesn't free the memory once it finishes using it. The five processes in this directory all have memory leaks to different degrees.

The following solution data came from an rp2430 server with 640MB of physical memory and 2GB of device swap, and an rx2600 server with 1012MB of physical memory and 2GB of device swap. The rp2430 was running HP-UX 11i v1 and the rx2600 was running 11i v2.
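The sources of proc1-proc5 are not shown; a minimal sketch of a leaking process might look like this (hypothetical, assuming a steady malloc() without a matching free()):

/* leaky.c -- hypothetical sketch of a leaking process, not the
 * actual lab source.  malloc() in a loop, touch the pages so they
 * count against RSS, and never free.  The data segment (VSS) grows
 * until the brk() behind malloc() hits maxdsiz and malloc() starts
 * returning NULL.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)         /* leak 1 MB per second */

int main(void)
{
    char *p;

    for (;;) {
        p = malloc(CHUNK);
        if (p == NULL) {            /* maxdsiz reached */
            perror("malloc");
            return 1;
        }
        memset(p, 1, CHUNK);        /* touch the pages: RSS grows */
        sleep(1);                   /* pointer dropped: leaked     */
    }
}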

2. Before starting the background processes, look up the current value for maxdsiz using the kmtune command on 11i v1 and the kctune command on 11i v2.

On the rp2430:

# kmtune -lq maxdsiz

Answer:
Varies with configuration; probably 64MB if you are pre-11i and 256MB for 11i v1.

# kmtune -lq maxdsiz
Parameter:  maxdsiz
Current:    0x10000000
Pending:    0x10000000
Default:    0x10000000
Minimum:
Module:
Version:
Dynamic:    No

The number is in hex. Converting this to decimal = 268435456 = 256MB.

On the rx2600:

# kctune -avq maxdsiz

Answer:
Varies with configuration; probably 1GB for 11i v2.

# kctune -avq maxdsiz
Tunable             maxdsiz
Description         Maximum size of the data segment of a 32-bit process (bytes)
Module              vm
Current Value       1073741824 [Default]
Value at Next Boot  1073741824 [Default]
Value at Last Boot  1073741824
Default Value       1073741824
Constraints         maxdsiz >= 262144
                    maxdsiz <= 4294963200
Can Change          Immediately or at Next Boot

The number is in decimal: 1073741824 = 1GB. The default maxdsiz on 11i v2 is 1GB. This will make proc1 very slow in reaching its limits. You can change maxdsiz to a more reasonable number for this lab exercise by:
# kctune maxdsiz=0x10000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this
         system.
     ==> Do you wish to update it to contain the current configuration
         before making the requested change? n
   NOTE: The backup will not be updated.
       * The requested changes have been applied to the currently
         running system.
Tunable            Value       Expression  Changes
maxdsiz  (before)  1073741824  Default     Immed
         (now)     0x10000000  0x10000000

Also take some vmstat readings to satisfy yourself that the system is not under memory pressure. How much free memory do you have?

rp2430:
# vmstat 2 2
    procs       memory             page                       faults        cpu
 r  b  w    avm    free  re  at  pi  po  fr  de  sr    in    sy   cs  us sy id
 3  0  0  75182   92519   3   0   0   0   0   0   0   104   408  138   1  0  99
 3  0  0  75182   92465   3   0   1   0   0   0   0   106   214   75   0  0 100

We have around 92000 free pages, which equates to 368MB.

rx2600:


# vmstat 2 2
    procs       memory              page                         faults         cpu
 r  b  w     avm    free   re   at   pi  po  fr  de   sr    in     sy   cs  us sy id
 2  0  0  124095   97927  466  165  298   0   0   0    2  1134  21509  476  14 19 67
 2  0  0  124095   96427  137   26   69  26   0   0  470   536  47856        3  3 94

We have around 97000 free pages, which equates to 388MB.

3. Use the RUN script to start the background processes:

# ./RUN

4. Open another window. Start glance. Sort the processes by CPU utilization (should be the default), and answer the following questions fairly quickly, before the memory leaks get too large. Go for the m page of glance for the best info. You have to be quick off the mark after starting the leak programs!
MEMORY REPORT              Users=    1
Event            Current  Cumulative  Current Rate  Cum Rate  High Rate
-------------------------------------------------------------------------------
Page Faults          588        1301         113.0     116.1      137.1
Page In                1          33           0.1       2.9        6.1
Page Out               0           0           0.0       0.0        0.0
KB Paged In          0kb        36kb           0.0       3.2        6.9
KB Paged Out         0kb         0kb           0.0       0.0        0.0
Reactivations          0           0           0.0       0.0        0.0
Deactivations          0           0           0.0       0.0        0.0
KB Deactivated       0kb         0kb           0.0       0.0        0.0
VM Reads               0           3           0.0       0.2        0.5
VM Writes              0           0           0.0       0.0        0.0

Total VM : 384.9mb   Active VM: 342.1mb
Sys Mem  : 182.3mb   Buf Cache: 32.4mb   User Mem: 96.9mb   Free Mem: 328.4mb
Phys Mem : 640.0mb

What is the current amount of free memory?
Answer: Varies with configuration. Already this has dropped to 328.4MB.

What is the size of the buffer cache?
Answer: Varies with configuration. In our case this is 32.4MB.

Is there any paging to the swap space?
Answer: Varies with configuration. No, not in the last sample; see KB Paged Out above.

How much swap space is currently reserved?
Answer: Varies with configuration. Get this from swapinfo; again, you need to do this just after the programs start. In our case, around 249MB.
# swapinfo -tm

             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048       0    2048    0%       0       -   1  /dev/vg00/lvol2
reserve       -     379    -379
memory     1013     330     683   33%
total      3061     709    2352   23%       -       0   -

The total swap space used (used = really used + reserved) is the figure on the total line: 709MB = 0 (dev) + 379 (reserve) + 330 (memory). More detail on swap management is in Module 7; for now, take the bottom-line figure above.

Which process has the largest Resident Set Size (RSS)?

Answer:
proc1. You can see that from the global process list in glance (the g key). As you watch it, it will grow until vhand kicks in and limits its RSS. However, the VSS will continue to grow. Select that process (with s) and observe the RSS/VSS figure.
PROCESS LIST               Users=    1
                                         User     CPU Util     Cum      Disk             Thd
Process Name    PID   PPID  Pri  Name  ( 100% max)     CPU    IO Rate      RSS    Cnt
-------------------------------------------------------------------------------
proc1          3267      1  168  root    0.0/ 0.2      1.0   0.0/ 0.0  275.8mb     1
proc2          3268      1  168  root    0.0/ 0.1      0.4   0.0/ 0.0  114.6mb     1
proc3          3269      1  168  root    0.0/ 0.0      0.2   0.0/ 0.0   56.7mb     1
proc4          3270      1  168  root    0.0/ 0.0      0.1   0.0/ 0.0   27.7mb     1
alarmgen       3277   3276  168  root    0.0/ 0.0      0.1   1.3/ 0.1    1.6mb     6
vhand             2      0  128  root    0.4/ 0.2      2.0  81.7/44.2     64kb     1

Resources  PID: 3267, proc1  PPID: 1  euid: 0  User: root
-------------------------------------------------------------------------------
CPU Usage (util):    0.0   Log Reads :     0   Wait Reason    :  SLEEP
User/Nice/RT CPU:    0.0   Log Writes:     0   Total RSS/VSS  :  275.7mb/479.1mb
System CPU      :    0.0   Phy Reads :     0   Traps / Vfaults:  0/  542
Interrupt CPU   :    0.0   Phy Writes:     0   Faults Mem/Disk:  0/    0
Cont Switch CPU :    0.0   FS Reads  :     0   Deactivations  :       0
Scheduler       :   HPUX   FS Writes :     0   Forks & Vforks :       0
Priority        :    168   VM Reads  :     0   Signals Recd   :       0
Nice Value      :     20   VM Writes :     0   Mesg Sent/Recd :  0/    0
Dispatches      :      5   Sys Reads :     0   Other Log Rd/Wt:  0/    0
Forced CSwitch  :      0   Sys Writes:     0   Other Phy Rd/Wt:  0/    0
VoluntaryCSwitch:      5   Raw Reads :     0   Proc Start Time:
Running CPU     :      0   Raw Writes:     0   Tue Apr  6 14:29:16 2004
CPU Switches    :      0   Bytes Xfer:   0kb

What is the data segment size of the process with the largest RSS?

Answer:
Select the Memory Regions page for proc1 with the M key.
Memory Regions PID: 3267, proc1 PPID: 1 euid: 0 User: root

Type           RefCt      RSS      VSS  Locked  File Name
-------------------------------------------------------------------------------
NULLDR/Shared     87      4kb      4kb     0kb  <nulldref>
TEXT  /Shared      2      4kb      4kb     0kb  /home/.../leak/proc1
DATA  /Priv        1  301.0mb  716.2mb     0kb  /home/.../leak/proc1
MEMMAP/Priv        1      0kb     16kb     0kb  /usr/lib/tztab
MEMMAP/Priv        1      4kb      4kb     0kb  <mmap>
MEMMAP/Priv        1      4kb      8kb     0kb  <mmap>
MEMMAP/Priv        1      0kb      8kb     0kb  <mmap>
MEMMAP/Priv        1     24kb     28kb     0kb  /usr/lib/hpux32/libc.so.
MEMMAP/Priv        1     40kb     40kb     0kb  <mmap>

Text RSS/VSS:  4kb/4kb      Data  RSS/VSS: 301mb/716mb
Shmem RSS/VSS: 0kb/0kb      Other RSS/VSS: 1.6mb/3.2mb
Stack RSS/VSS: 4kb/8kb

The data segment size in this example is 301/716 MB and growing!

5. After several minutes, the proc1 process should reach its maximum data size. If your maxdsiz is set to 1 GB, this could take a while; please be patient. Observe the behavior of the system when this occurs.

What happens when the process reaches its maximum data size?

Answer:
This is going to take several minutes. The maxdsiz limit is probably either 256MB or 1GB on the test system. Be careful! maxdsiz is a limit on the VSS (Virtual Set Size) and not the RSS (Resident Set Size). The system starts doing a LOT of disk I/O; look for the large F bar in the Disc Util global meter.

Why does disk utilization become so high at this point?

Answer:
The kernel is dumping the core file of the user process in our case. You will probably run out of disc space in the /home file system. You may want to remove the /home/h4262/memory/leak/core file! Remember, it is not the process that is doing the disk I/O; it is the kernel that is doing it to produce the core file.

6. As the other processes grow towards their maximum data segment size, continue to monitor the following:

Free memory:
# vmstat 2 2
    procs       memory             page                        faults        cpu
 r  b  w     avm   free  re  at   pi   po  fr  de   sr    in    sy   cs  us sy id
 2  0  0  321403  91118  54  19   79  285  16   0  359   548  4962  326   2  3  95
 2  0  0  321403  90413   1   0  115   12   0   0    0   397   552  191   0  0 100

Not a lot of free memory now. The system is under memory pressure and is paging out to stabilize the memory system.

Swap space reserved:
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048     715    1333   35%       0       -   1  /dev/vg00/lvol2
reserve       -     341    -341
memory     1013     340     673   34%
total      3061    1396    1665   46%       -       0   -

Swap space is up to 46% utilization!

The size of the processes' data segments:

All the proc(n) processes continue to grow (see VSS), just like proc1 did, and they are aborted in the same way when they cross the line (maxdsiz).

The RSS of the processes:

The running memory hog processes compete for the limited real memory resource. We didn't have a lot free at the start of the test, and the lab procs all want to grow to the maxdsiz limit. They cannot all fit together, so they fight. This is a classic memory thrash situation.

The number of page-outs/page-ins to the swap space:

This depends on when you look! These figures were taken while proc2 was still on the move and free memory was approaching its minimum.
# vmstat 2 10
    procs       memory             page                         faults        cpu
 r  b  w     avm   free  re  at   pi   po   fr  de    sr    in    sy   cs  us sy id
 2  0  0  166464   2692   0   0    0    0    0   0     0   103   173   82   0  0 100
 2  1  0  170444   1649   0   0   13    0    0   0     0   123   209   92   0  0 100
 2  1  0  170444   1028   0   0    8    5    4   0  1256   122   189   88   0  6  94
 2  1  0  170444   1146   8   0    6  101  109   0  9869   225   176  129   0  5  95
 2  1  0  170444   1392  12   0    5  263   69   0  9659   316   175  112   0  0 100
 2  1  0  170444   1366  12   0    5  312   44   0  8186   331   190  156   0  0 100
 1  0  0  169455   1090   9   0    5  304   28   0  6410   316   209  201   0  0 100
 1  0  0  169455   1112   6   0    3  351   31   0  5334   359   193  163   0  1  99
 1  0  0  169455   1048   3   0    2  332   19   0  3902   339   180  133   5  0  95
 1  0  0  169455   1600   5   0    0  396   12   0  2576   370   240  119   0  4  96

7. Run the two baseline programs, short and diskread.

# timex /home/h4262/baseline/short
# timex /home/h4262/baseline/diskread


rp2430:
# timex /home/h4262/baseline/short
The last prime number is : 49999

real       12.00
user       10.86
sys         0.02

# timex /home/h4262/baseline/diskread
DiskRead: System        : [HP-UX]
DiskRead: RawDisk       : [/dev/rdsk/c1t15d0]
DiskRead: Start reading : 1024MB
1024+0 records in
1024+0 records out

real       31.79
user        0.02
sys         0.53

rx2600:
# timex /home/h4262/baseline/short &
# The last prime number is : 99991

real        8.54
user        8.48
sys         0.00

# timex /home/h4262/baseline/diskread &
[1] 3841
root@r265c145:/home/h4262/memory/leak #
DiskRead: System        : [HP-UX]
DiskRead: RawDisk       : [/dev/rdsk/c2t1d0s2]
DiskRead: Start reading : 2048MB
2048+0 records in
2048+0 records out

real       29.60
user        0.01
sys         0.16

How does the performance of these programs compare to their earlier runs?

Answer:
short takes a little longer. The CPU is not under much pressure at this time, so compute-bound processes are not greatly affected (unless they need memory!). It is a different story for diskread: it took noticeably longer than in the first test case, due to the disk load already in progress from the paging activity. It is not good to have swap space on your application disks!

8. When finished monitoring the behavior of processes with memory leaks, clean up the processes.


Exit glance. Execute the KILLIT script:

# ./KILLIT

If you changed maxdsiz, change it back:

# kctune maxdsiz=0x40000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
    ==>  Do you wish to update it to contain the current configuration before
         making the requested change? n
NOTE:    The backup will not be updated.
       * The requested changes have been applied to the currently running
         system.
Tunable   Value               Expression  Changes
maxdsiz   (before) 0x10000000 0x10000000  Immed
          (now)    0x40000000 0x40000000


715. LAB: Monitoring Swap Space

Preliminary Steps


A portion of this lab requires you to interact with the ISL and boot menus, which can only be accomplished via a console login. If you are using remote lab equipment, access your system's console interface via the GSP/MP. You may get some file system full messages while you are shutting down the system; you can ignore them.

Directions
The following lab illustrates swap reservation, configures and de-configures pseudo swap, and adds additional swap partitions with different swap priorities.

1. Use the swapinfo -m command to display the current swap space statistics on the system. List the MB Avail and MB Used for the following three items:

             MB Available    MB Used
   dev       ____________    _______
   reserve   ____________    _______
   memory    ____________    _______

Answer
Varies with configuration; examples below (rp2430 figures):

             MB Available    MB Used
   dev            512             0
   reserve          -           139
   memory         451            27


(rp2430)

# swapinfo -m
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       -   1  /dev/vg00/lvol2
reserve       -     139    -139
memory      451      27     424    6%

(rx2600)

# swapinfo -m
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048      75    1973    4%       0       -   1  /dev/vg00/lvol2
reserve       -     189    -189
memory     1013     339     674   33%

2. To see total swap space available and total swap space reserved, enter:

# swapinfo -mt

What is the total swap space available (including pseudo swap)?

Answer
Varies with configuration; in our case it is 963 MB (rp2430) or 3061 MB (rx2600), as seen in the total lines below.


(rp2430)

# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       -   1  /dev/vg00/lvol2
reserve       -     139    -139
memory      451      27     424    6%
total       963     166     797   17%

(rx2600)

# swapinfo -mt
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048      74    1974    4%       0       -   1  /dev/vg00/lvol2
reserve       -     190    -190
memory     1013     339     674   33%
total      3061     603    2458   20%

What is the total space "reserved"?

Answer
Varies with configuration. Swap space is first reserved, and then it may (or may not) be used by the process that reserved it. The bottom line is that reserved swap space is no more available than used swap space, so the only figures that really matter here are the totals (166 MB and 603 MB). This space is unavailable to any other process.

3. Start a new shell process by typing sh. Re-execute the swapinfo command and verify whether any additional swap space was reserved when the new shell process started. In this case, the difference is going to be pretty small, so let's not use the -m option. Upon verification, exit the shell.

Is the swap space returned upon exiting the shell process?

Answer
It should be, and it is. But you have to be careful when you look: it is easy for some other activity on the system to spoil the results. You may want to try it 2 or 3 times to see if your results change. What SHOULD happen is that the reserve USED entry increases and then decreases by exactly the same amount.

rp2430:
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      524288       0  524288    0%       0       -   1  /dev/vg00/lvol2
reserve       -  144444 -144444
memory   462248   28384  433864    6%
# sh
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      524288       0  524288    0%       0       -   1  /dev/vg00/lvol2
reserve       -  144768 -144768
memory   462248   28384  433864    6%

# exit
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      524288       0  524288    0%       0       -   1  /dev/vg00/lvol2
reserve       -  144444 -144444
memory   462248   28388  433860    6%

rx2600:
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev     2097152   75652 2021500    4%       0       -   1  /dev/vg00/lvol2
reserve       -  194900 -194900
memory  1037064  346740  690324   33%
# sh
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev     2097152   75652 2021500    4%       0       -   1  /dev/vg00/lvol2
reserve       -  195540 -195540
memory  1037064  346740  690324   33%
# exit
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev     2097152   75652 2021500    4%       0       -   1  /dev/vg00/lvol2
reserve       -  194900 -194900
memory  1037064  346740  690324   33%
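To make the before/during/after comparison less error prone, the reserve figure can be captured programmatically. A minimal sketch; the field position ($3) assumes the default swapinfo layout in which the reserve row reads reserve - <used> <-used>:

# cat > /tmp/resv.sh <<'EOF'
swapinfo | awk '$1 == "reserve" { print $3 }'
EOF
# chmod +x /tmp/resv.sh
# echo "before: `/tmp/resv.sh` Kb"
# sh -c 'echo "during: `/tmp/resv.sh` Kb"'    # sampled while the extra shell is alive
# echo "after:  `/tmp/resv.sh` Kb"

Because the during sample runs inside the sh -c shell, it sees that shell's own reservation; the before and after figures should match.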

If you see that some swap was reserved and not released, then there is something else going on in the background that is skewing the figures.

4. Start glance and observe the Global bars at the top of the display for the duration of this step. Start a large memory process and note how much the Current Swap Util. percentage increases in glance. Type:

# /home/h4262/memory/paging/mem256 &

This should reserve a large amount of swap space. Start as many mem256 processes as possible. For best results, wait until each swap reservation is complete, by observing the incremental increases in Current Swap Util. in glance. The system will get slower and slower as you start more mem256 processes.

What was the maximum number of mem256 processes that could be started?

Answer
Varies with configuration; it depends on your swap space.


On the rp2430, after 12 copies of mem256 the test system swap space was almost gone. Below is what happened when the 13th process was introduced.
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512     461      51   90%       0       -   1  /dev/vg00/lvol2
reserve       -      51     -51
memory      451     399      52   88%
total       963     911      52   95%

# /home/h4262/memory/paging/mem256&
[13] 2864
# exec(2): insufficient swap or memory available.
[13] + Done(9)    /home/h4262/memory/paging/mem256&

On the rx2600, after 37 copies of mem256 the test system swap space was almost gone. Below is what happened when the 38th process was introduced.
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048    1978      70   97%       0       -   1  /dev/vg00/lvol2
reserve       -      70     -70
memory     1013     991      22   98%
total      3061    3039      22   99%

# ./mem256&
[38] 4159
exec(2): insufficient swap or memory available.

What prevented an additional mem256 process from being started?

Answer
Insufficient swap or memory available.
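Counting the processes by hand is tedious; the copies can also be started in a loop until exec(2) fails. A rough sketch (the path and the 5-second settle time are assumptions; each swap reservation needs time to complete before the next copy starts):

n=0
while /home/h4262/memory/paging/mem256 &
      pid=$!
      sleep 5                     # let the swap reservation complete
      kill -0 $pid 2>/dev/null    # still alive means the exec succeeded
do
    n=`expr $n + 1`
done
echo "started $n mem256 processes before failure"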

Kill all mem256 processes to restore performance.

5. Recompile the kernel, disabling pseudo-swap. Use the following procedure:

11i v1 and earlier:

# cd /stand/build
# /usr/lbin/sysadm/system_prep -s system
# echo "swapmem_on 0" >> system
# mk_kernel -s system
# cd /
# shutdown -ry 0

11i v2 and later:


# cd /
# kctune swapmem_on=0
NOTE:    The configuration being loaded contains the following change(s) that
         cannot be applied immediately and which will be held for the next
         boot:
         -- The tunable swapmem_on cannot be changed in a dynamic fashion.
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
    ==>  Do you wish to update it to contain the current configuration before
         making the requested change? no
NOTE:    The backup will not be updated.
       * The requested changes have been saved, and will take effect at next
         boot.
Tunable      Value        Expression
swapmem_on   (now)        1  Default
             (next boot)  0  0
# shutdown -ry 0

6. Reboot from the new kernel.

rp2430:
Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

rx2600:
(Nothing special needs to be done.)

7. Once the system reboots, log in and execute swapinfo. Is there a memory entry? Why or why not?

Answer
No. Pseudo-swap has been disabled.

Will the same number of mem256 processes be able to execute as earlier?

Answer
No.

How many mem256 processes can be started now?

Answer
Varies with configuration.

On the rp2430, only 6 processes could be started successfully. On the rx2600, only 27 processes could be started successfully. Kill all mem256 processes to restore performance.

8. If you have a two-disk system: add the second disk to vg00 (if this was not already done in a previous exercise) and build a second swap logical volume on it. This lvol should be the same size as the primary swap volume. If you do not have a second disk, continue this lab at question 13.


If you did not add the second disk earlier:

# vgdisplay -v | grep Name      (Note the physical disks used by vg00.)
# ioscan -fnC disk              (Note which disk is unused by LVM.)
# pvcreate -f <raw_dev_file_of_second_disk>
# vgextend /dev/vg00 <block_dev_file_of_second_disk>

To create the new swap device on the second disk:

# lvcreate -n swap1 /dev/vg00
# lvextend -L 512 /dev/vg00/swap1 <dev_file_of_second_disk>

Note: In our case the primary swap was 512 MB. See swapinfo on your system and match the size of the new swap device to the primary swap.

9. Now add the new logical volume to swap space. Ensure that the priority is the same as the primary swap:

# swapon -p 1 /dev/vg00/swap1

Check your work.

Answer:
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       0   1  /dev/vg00/lvol2
reserve       -     130    -130
total       512     130     382   25%

# swapon -p 1 /dev/vg00/swap1

swapon: Device /dev/vg00/swap1 contains a file system. Use -e to page after the end of the file system, or -f to overwrite the file system with paging.

Oops! Problem 1: swapon is being overly cautious. If you get this message, the memory manager has detected what appears to be a file system already on the device (probably left over from some previous use). You need to override:
# swapon -p 1 -f /dev/vg00/swap1
swapon: The kernel tunable parameter "maxswapchunks" needs to be increased
        to add paging on device /dev/vg00/swap1.

Oops! Problem 2: the kernel cannot deal with this amount of swap. If you get this message, the tunable parameter maxswapchunks is set too small to accommodate all of the new swap space. We need to modify maxswapchunks and reboot. If you have this problem, use sam to double maxswapchunks. In 11i v2, maxswapchunks has been obsoleted and does not have to be modified. Recompile the kernel (if necessary) to increase maxswapchunks, using the following procedure:


11i v1 and earlier (ONLY!)

# cd /stand/build
# echo "maxswapchunks 512" >> system
# mk_kernel -s system
# cd /
# shutdown -ry 0

10. If you had to rebuild the kernel to increase maxswapchunks, reboot the system. Otherwise, skip to step 11.

11i v1 and earlier (ONLY!)
Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

And now add the new swap device:
# swapon -p 1 -f /dev/vg00/swap1

Verify that the new swap space has been recognized by the kernel:
# swapinfo -mt          (rp2430)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       0   1  /dev/vg00/lvol2
dev         512       0     512    0%       0       0   1  /dev/vg00/swap1
reserve       -     141    -141
total      1024     141     883   14%

# swapinfo -tm          (rx2600)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048      86    1962    4%       0       0   1  /dev/vg00/lvol2
dev        2048       0    2048    0%       0       0   1  /dev/vg00/swap1
reserve       -     158    -158
total      4096     244    3852    6%

Done!

11. Start enough mem256 processes to make the system start paging.

Answer:
This depends on how much memory you have, but on an rp2430 with 640 MB, I found that 8 processes got things paging nicely! On an rx2600, 10 should do nicely.
# vmstat 2 2
    procs        memory                    page                           faults          cpu
    r  b  w     avm    free  re  at   pi   po   fr  de    sr   in    sy   cs  us sy  id
    9  0  0  180106    5064  34   0  192  340   99   0  3136  339   213  471 100  0   0
    9  0  0  180106    5056  23   0  122  217   63   0  2006  216   191  355 100  0   0

Note the system is paging constantly in the vmstat output, and free memory is very low.

12. Measure the disk I/O to see what is happening with swap space. Go to question 15 when you have finished.

Answer:
The I/O should be balanced across both disks!
# sar -d 5 2            (rp2430)

HP-UX r206c41 B.11.11 U 9000/800    03/18/04

14:22:12   device   %busy   avque   r+w/s   blks/s   avwait   avserv
14:22:17  c1t15d0   87.03   24.73     409    12222    33.45    13.86
          c3t15d0   60.68   23.21     406    12093    31.03     9.24
14:22:22  c1t15d0   82.60   22.01     395    12209    28.53    12.26
          c3t15d0   72.20   19.57     385    11976    25.00    10.57

Average   c1t15d0   84.82   23.39     402    12216    31.03    13.08
Average   c3t15d0   66.43   21.43     396    12034    28.10     9.89

# sar -d 5 2            (rx2600)

HP-UX r265c145 B.11.23 U ia64    04/07/04

11:28:10   device   %busy   avque   r+w/s   blks/s   avwait   avserv
11:28:15   c2t1d0    9.38    0.50      25      542     0.00     6.05
           c2t0d0    3.79    0.50      14      271     0.01     4.71
11:28:20   c2t1d0   21.40    6.75      79     2373     2.85     5.35
           c2t0d0    6.60   10.42      47     1229     3.86     3.94

Average    c2t1d0   15.38    5.25      52     1456     2.16     5.51
Average    c2t0d0    5.19    8.13      31      750     2.97     4.12

This has doubled the effective performance of swap space. The results would be even better if the swap disks were on different controllers.

13. If you have a single-disk system: create three additional swap devices of 20 MB each.

# lvcreate -L 20 -n swap1 vg00
# lvcreate -L 20 -n swap2 vg00
# lvcreate -L 20 -n swap3 vg00

Prior to activating these swap devices, make note of the amount of swap space currently in use. When the new swap devices are activated with equal priority, all new paging activity will be spread evenly over these swap devices.


List the current amount of swap space in use.

Answer
Varies with configuration. Use swapinfo -tm.

If 10 MB is currently in use on a single swap device, and we activate an equal-priority swap device, what is the distribution if an additional 10 MB is paged out?

A) The distribution would be 10 MB and 10 MB.
or
B) The distribution would be 15 MB and 5 MB.

Answer
B. vhand does not consider what the previous utilization was; the new 10 MB is spread 5 MB to each device, leaving 15 MB on the first device and 5 MB on the second.

14. Activate the newly created swap devices. Activate two with a priority of 1, and the third with a priority of 2.

# swapon -p 1 /dev/vg00/swap1
# swapon -p 2 /dev/vg00/swap2
# swapon -p 1 /dev/vg00/swap3

Start enough mem256 processes to make the system start paging.

Answer:
This depends on how much memory you have, but on a 640 MB system I found that 8 processes got things paging nicely!
# vmstat 2 2
    procs        memory                   page                           faults          cpu
    r  b  w     avm    free  re  at   pi   po  fr  de    sr   in    sy   cs  us sy  id
   10  0  0  175597    6489  12   8    2   31  11   0   467    0   271   58  26  4  70
   10  0  0  175597    6414  20   0   27   87  22   0  1316  103   300  254 100  0   0

Note the system is paging constantly in the vmstat output, and free memory is very low.

Is the new paging activity being distributed evenly across the paging devices?

Answer
No. It is confined to lvol2 (primary swap), swap1, and swap3: the priority 2 device (swap2) is not used until the priority 1 devices fill.
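The distribution can also be confirmed numerically. A minimal sketch; the awk fields assume the swapinfo -m column layout shown earlier in this lab ($3 is Mb USED, and the last field is the device name):

while true
do
    swapinfo -m | awk '$1 == "dev" { printf "%-22s %6s Mb used\n", $NF, $3 }'
    echo ----
    sleep 10
done

Only lvol2, swap1, and swap3 should show their usage growing while swap2 stays flat.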

15. When finished with the lab, reboot the system as normal (do not boot vmunix_test) to re-enable pseudo-swap and remove the additional swap devices. For 11i v1 and earlier, follow this procedure:


# cd /
# shutdown -ry 0

For 11i v2 and later, follow this procedure:

# cd /
# kctune swapmem_on=1
# shutdown -ry 0


818. LAB: Disk Performance Issues


Directions
The following lab illustrates a number of performance issues related to disks.

1. A file system is required for this lab. One was created in an earlier exercise. Mount it now.

# mount /dev/vg00/vxfs /vxfs

We also need to ensure that the controller does not have "SCSI immediate reporting" enabled. Enter the following command and check your current state (fill in the device file name as appropriate):

# scsictl -m ir /dev/rdsk/cXtXdX        (to report current "ir" status)

If the current immediate_report = 1, then enter the following:

# scsictl -m ir=0 /dev/rdsk/cXtXdX      (ir=1 to set, ir=0 to clear)

2. Copy the lab files to the file system.

# cp /home/h4262/disk/lab1/disk_long /vxfs
# cp /home/h4262/disk/lab1/make_files /vxfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /vxfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs


4. Open a second terminal window and start glance. While in glance, display the Disk Report (d key). Zero out the data with the z key. From the first window, time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real: _____  user: _____  sys: _____     glance Disk Report   Logl Rds: _____  Phys Rds: _____

Answer:

# timex cat file* > /dev/null      (rp2430)

real        0.73
user        0.01
sys         0.11

glance Disk Report:  Logl Rds: 2560   Phys Rds: 500

# timex cat file* > /dev/null      (rx2600)

real        0.34
user        0.00
sys         0.06

glance Disk Report:  Logl Rds: 2560   Phys Rds: 2560

5. At this point, all 20 MB of data is resident in the buffer cache. Re-execute the same command and record the results below:

# timex cat file* > /dev/null

real: _____  user: _____  sys: _____     glance Disk Report   Logl Rds: _____  Phys Rds: _____

Answer:

# timex cat file* > /dev/null      (rp2430)

real        0.06
user        0.01
sys         0.05

glance Disk Report:  Logl Rds: 2560   Phys Rds: 0

# timex cat file* > /dev/null      (rx2600)

real        0.02
user        0.00
sys         0.02

glance Disk Report:  Logl Rds: 2560   Phys Rds: 0


NOTE: The conclusion is that I/O is much faster coming from the buffer cache than having to go to disk to get the data.
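The cold-versus-warm comparison in steps 4 and 5 can be packaged as one repeatable test. A minimal sketch, assuming the /vxfs file system and device file used in this lab:

cd /
umount /vxfs && mount /dev/vg00/vxfs /vxfs   # remounting purges the buffer cache
cd /vxfs
echo "cold read (from disk):"  ; timex cat file* > /dev/null
echo "warm read (from cache):" ; timex cat file* > /dev/null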

6. The sar -d report. Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the VxFS file system (and then removes the files).

# timex ./disk_long

How busy did the disk get?
What was the average number of requests in the I/O queue?
What was the average wait time in the I/O queue?
How much real time did the task take?

Answer:
The disk got over 80% busy. The average number of requests in the I/O queue reached around 53 on the rp2430 and 442 on the rx2600. The average wait time of a request was around 65 ms on the rp2430 and 182 ms on the rx2600. The task took around 12.5 seconds on the rp2430 and 7.5 seconds on the rx2600.

7. The glance I/O by Disk report. Exit from the sar -d report, and start glance again. While in glance, display the I/O by Disk report (u key). From the first window, re-execute disk_long. Record the results below:

# ./disk_long

glance I/O by Disk Report    Util: _____   Qlen: _____

Answer:
Utilization reached 86% and queue length reached 55 on the rp2430. Utilization reached 85% and queue length reached 414 on the rx2600.

8. The glance I/O by File System report. Reset the data with the z key, and display the I/O by File System report (i key). From the first window, re-execute disk_long. Record results below:

# ./disk_long

glance I/O by File System Report    Logl I/O: _____   Phys I/O: _____


Answer:
Logical I/Os reached 4059 and Physical I/Os reached 806 on the rp2430. Logical I/Os reached 4702 and Physical I/Os reached 1528 on the rx2600.

9. Performance tuning: immediate reporting. Ensure the immediate reporting option is set for the disk that the file system is located on. If immediate reporting is not set, set it.

# scsictl -m ir /dev/rdsk/cXtXdX        (to report current "ir" status)
# scsictl -m ir=1 /dev/rdsk/cXtXdX      (ir=1 to set, ir=0 to clear)

Purge the contents of the buffer cache.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs

10. The sar -d report. Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program (which writes 400 MB to the file system and then removes the files).

# timex ./disk_long

How busy did the disk get?
What was the average number of requests in the I/O queue?
What was the average wait time in the I/O queue?
How much real time did the task take?

How do the results of step 10 compare to the results in step 6?

________________________________________________________________


914. LAB: HFS Performance Issues

Directions


The following lab illustrates a number of performance issues related to HFS file systems.

1. A 512 MB HFS file system is required for this lab. Use the mount and bdf commands to determine if such a file system is available.

# mount -v
# bdf

If there is no such HFS file system available, create one using the commands below:

# lvcreate -n hfs vg00
# lvextend -L 512 /dev/vg00/hfs /dev/dsk/cXtYdZ      (second disk)
# newfs -F hfs /dev/vg00/rhfs
# mkdir /hfs
# mount /dev/vg00/hfs /hfs

2. Copy the lab files to the newly created HFS file system.

# cp /home/h4262/disk/lab1/disk_long /hfs
# cp /home/h4262/disk/lab1/make_files /hfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /hfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /hfs
# mount /dev/vg00/hfs /hfs
# cd /hfs


4. Time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real: _____  user: _____  sys: _____

Answer:

# timex cat file* > /dev/null      (rp2430)

real        1.04
user        0.01
sys         0.16

# timex cat file* > /dev/null      (rx2600)

real        0.45
user        0.00
sys         0.05

The cat command took 1.04 seconds to complete on the rp2430 and 0.45 seconds on the rx2600.

5. In a second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

How busy did the disk get?
What was the average number of requests in the I/O queue?
What was the average wait time in the I/O queue?
How much real time did the task take?

Answer:
# sar -d 5 200          (rp2430)

HP-UX r206c41 B.11.11 U 9000/800    03/23/04

11:53:15   device   %busy     avque   r+w/s   blks/s    avwait   avserv
11:53:20  c1t15d0    5.20      0.50      13       66      5.09     4.54
          c3t15d0   33.60   6922.08     950    15049    629.53    14.85
11:53:25  c1t15d0    7.57      0.50      10       36      5.40     6.82
          c3t15d0   55.98   5215.11    1758    27980   2113.38    13.70
11:53:30  c1t15d0    2.01      0.50       6       44      3.92     5.01
          c3t15d0  100.00   8156.62    2983    47696   2591.43    16.45
11:53:35  c1t15d0    8.00      5.80      18      108     25.31    18.95
          c3t15d0   84.20   1237.19     558     8670   1555.06    17.68
11:53:40  c1t15d0    6.00      0.50      15       76      4.69     4.72
          c3t15d0   71.20   7379.94    2168    34537   1322.90    14.77
11:53:45  c1t15d0    0.20      0.50       1        5      0.08     8.35
          c3t15d0   25.80   2375.50     950    15206   3478.83    14.42
11:53:50  c3t15d0    9.20      0.50      16      258      5.06     5.21

The disk got up to 100% busy. The average number of requests in the request queue was about 5200. The average wait time in the request queue was about 1950 ms.
# timex ./disk_long

real       22.76
user        4.57
sys         3.45

The operation completed in 22.76 seconds.


# sar -d 5 200          (rx2600)

HP-UX r265c145 B.11.23 U ia64    04/07/04

13:20:25   device   %busy      avque   r+w/s   blks/s    avwait   avserv
13:20:30   c2t1d0    4.39       0.50      27      706      0.00     1.67
           c2t0d0   27.15       0.50      90      756      0.00     3.04
13:20:35   c2t1d0   41.00     104.29     245     4026    173.18    12.76
           c2t0d0   99.20   24004.63    3322    53129   2127.15     2.35
13:20:40   c2t1d0    1.40       0.50       3       51      0.00     4.62
           c2t0d0  100.00   20020.69    3895    62320   6436.22     2.04
13:20:45   c2t1d0    4.00       0.50      13      287      0.00     5.68
           c2t0d0   57.20    5030.77    2097    33482   9701.92     2.06
13:20:50   c2t1d0    2.40       0.50       7      164      0.00     6.94
13:20:55   c2t1d0    1.40       0.50       2       34      0.00     9.94

The disk got up to 100% busy. The average number of requests in the request queue was about 50,000. The average wait time in the request queue was about 6100 ms.
# timex ./disk_long

real       16.87
user        0.83
sys         1.96

The operation completed in 16.87 seconds.


6. Performance tuning: recreate the file system with larger fragment and file system block sizes. Tuning the size of the fragments and file system blocks can improve performance for sequentially accessed files. The procedure for creating a new file system with customized fragments of 8 KB and file system blocks of 64 KB is shown below:

# lvcreate -n custom-lv vg00
# lvextend -L 512 /dev/vg00/custom-lv /dev/dsk/cXtYdZ
# newfs -F hfs -f 8192 -b 65536 /dev/vg00/rcustom-lv
# mkdir /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs

7. Copy the lab files to the customized HFS file system, execute the make_files program, and purge the buffer cache.

# cp /hfs/disk_long /cust-hfs

# cp /hfs/make_files /cust-hfs
# cd /cust-hfs
# ./make_files
# cd /
# umount /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs
# cd /cust-hfs

8. Time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real: _____  user: _____  sys: _____

Answer:

# timex cat file* > /dev/null      (rp2430)

real        0.84
user        0.01
sys         0.10


# timex cat file* > /dev/null      (rx2600)

real        0.43
user        0.00
sys         0.03

The cat command took 0.84 seconds to complete on the rp2430 and 0.43 seconds on the rx2600.

How do the results of step 8 compare to the default HFS block and fragment results from step 4?
_______________________________________________________________________

Answer:
The larger block and fragment size resulted in I/O operations which were almost 20% faster on the rp2430 and marginally faster on the rx2600.

9. Performance tuning: change file system mount options. The manner in which the file system is mounted can impact performance. The fsasync mount option can improve performance, but data (metadata) integrity is not as reliable in the event of a crash, and fsck could run into difficulties.

# cd /
# umount /hfs
# mount -o fsasync /dev/vg00/hfs /hfs
# cd /hfs

10. In a second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

How busy did the disk get?
What was the average number of requests in the I/O queue?
What was the average wait time in the I/O queue?
How much real time did the task take?

Answer:

# sar -d 5 200          (rp2430)

HP-UX r206c41 B.11.11 U 9000/800    03/23/04

12:08:22   device   %busy      avque   r+w/s   blks/s    avwait   avserv
12:08:27  c1t15d0    6.20       0.50       9       38      4.18     6.19
          c3t15d0   61.20    5592.30    2120    33818   1376.80    13.94
12:08:32  c1t15d0    7.00       0.50      16       81      4.31     5.28
          c3t15d0   58.60    7186.64    1675    26765   1295.53    17.00
12:08:37  c1t15d0    8.40       3.94      24      146     20.12    13.03
          c3t15d0   92.80    4986.82    1860    29579   2678.62    16.11
12:08:42  c1t15d0    6.60       0.50      17      120      4.84     3.79
          c3t15d0  100.00   15588.44    2344    37493   2943.35    16.95
12:08:47  c3t15d0   71.20    5725.86    2292    36664   6159.69    15.69

The disk got up to 100% busy. The average number of requests in the request queue was about 7800. The average wait time in the request queue was about 2900 ms.

# timex ./disk_long

real       17.17
user        4.61
sys         3.72

The operation completed in 17.17 seconds.

# sar -d 5 200          (rx2600)

HP-UX r265c145 B.11.23 U ia64    04/07/04

13:39:39   device   %busy      avque   r+w/s   blks/s     avwait   avserv
13:39:44   c2t1d0    1.00       0.50       4       67       0.00     2.51
           c2t0d0   46.11   22190.48    1274    20184    1026.94     2.54
13:39:49   c2t1d0    2.00       0.50       5       77       0.00     5.94
           c2t0d0  100.00   30303.60    3684    58941    4021.91     2.15
13:39:54   c2t1d0    3.20       5.20       9      141      11.85    12.77
           c2t0d0   99.80   11176.41    3888    62008    8740.46     2.05
13:39:59   c2t1d0    0.80       0.50       2       30       0.00     4.42
           c2t0d0    5.60     716.00     287     4562   11067.58     1.51
13:40:04   c2t1d0    4.00       0.50       9       43       0.00     4.45

The disk got up to 100% busy. The average number of requests in the request queue was about 17500. The average wait time in the request queue was about 6100 ms.

# timex ./disk_long

real       14.46
user        0.86
sys         3.04

The operation completed in 14.46 seconds.


How do the results of step 10 compare to the default mount options in step 5?
_____________________________________________________________________

Answer:
With fsasync turned on, the operation was about 25% faster on the rp2430 and 14% faster on the rx2600.
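The default and fsasync runs can also be compared back to back with a small harness. A minimal sketch, assuming the /hfs file system and the disk_long program from this lab; each run remounts so it starts from a purged buffer cache:

for opt in default fsasync
do
    cd /
    umount /hfs
    if [ "$opt" = "default" ]
    then
        mount /dev/vg00/hfs /hfs
    else
        mount -o fsasync /dev/vg00/hfs /hfs
    fi
    cd /hfs
    echo "== $opt mount =="
    timex ./disk_long
done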


1023. LAB: JFS File System Tuning


Directions
The following lab exercise compares performance of JFS with different mount options. The mount options used with JFS can have a big impact on JFS performance.

1. Mount a JFS file system to be used for this lab under /vxfs.

# mount /dev/vg00/vxfs /vxfs

2. Because the above mount command specified no special mount options, the default mount options are used. Use the mount -v command to view the default options, including the option for transaction logging type.

What type of transaction logging does JFS use by default?

Answer
Full logging
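A quick way to confirm the logging mode of a mounted VxFS file system is to look at the options field of mount -v; the string to look for is log, delaylog, tmplog, or nolog. A minimal sketch:

# mount -v | grep /vxfs

With the default mount above, the options field should include log (full logging), matching the answer given.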

3. Change directory to /vxfs. Time the execution of the disk_long program, which writes 400 MB of data to the file system in 20 MB increments. After each 20 MB is written, the files are deleted. Run the command three times and record the middle results.

# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:  Real: _____  User: _____  Sys: _____

Answer
Varies with configuration; live data from the test:

# timex ./disk_long      (rp2430)

real       12.34
user        4.82
sys         3.45

# timex ./disk_long      (rx2600)

real        9.49
user        0.90
sys         1.62

If you look back to the HFS results, you will see that this is faster. See question 5 from the previous lab; the test time there was 23 seconds (rp2430) or 17 seconds (rx2600)!


4. Remount the JFS file system using the delaylog option. This helps performance of non-critical transactions. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o delaylog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:  Real: _____  User: _____  Sys: _____

Answer
Varies with configuration; should be faster than before:

# timex ./disk_long      (rp2430)

real       10.93
user        4.85
sys         3.52

# timex ./disk_long      (rx2600)

real        9.23
user        0.90
sys         1.64

Based on the results, does the disk_long program perform any non-critical transactions?

Answer

The answer is yes; the disk_long program is performing some non-critical transactions. This is seen in the modest improvement in execution time. Since the program does nothing but write data in 1 MB increments, just about every JFS transaction is critical, so mounting with delaylog versus full logging does not greatly affect performance in this case. It will in other cases.

5. Remount the JFS file system using the tmplog option. This causes the system call to return after the JFS transaction is updated in memory (step 1 from lecture), and before the transaction is written to the intent log. Run the command three times and record the middle results.

# cd /
# umount /vxfs


# mount -o tmplog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:  Real: _____  User: _____  Sys: _____

Answer
Varies with configuration; live test data:

# timex ./disk_long      (rp2430)

real       10.08
user        4.82
sys         3.40

# timex ./disk_long      (rx2600)

real       10.35
user        0.90
sys         1.60

Based on the results, why does the disk_long program show little or no improvement when mounted with tmplog?

Answer

The disk_long program shows little performance improvement because the program is performing extending write calls. When an extending write call is issued, by default JFS writes the user data first before writing the JFS transaction to the intent log. As a result, even JFS file systems mounted with tmplog or nolog will still have to wait for the user data to be written to disk. This waiting for the user data to be written hurts the performance of JFS.

6. Remount the JFS file system using the mincache=tmpcache option. This allows the JFS transaction to be created without having to wait for the user data to be written in extending write calls. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o mincache=tmpcache /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long


Record middle results:  Real: _____  User: _____  Sys: _____

Answer
Varies with configuration; live test data. Fastest yet!

# timex ./disk_long      (rp2430)

real        9.13
user        4.51
sys         2.69

# timex ./disk_long      (rx2600)

real        9.51
user        0.90
sys         1.65

Answer

When the mincache=tmpcache option is specified, under 2 MB out of 400 MB is physically written to disk. When this option is not specified, all 400 MB is physically written to disk. Major performance improvements should be seen with this option, especially for applications doing lots of extending write calls (like the one in the lab).

7. Remount the JFS file system using the mincache=direct option. This option requires all user data and all JFS transactions to bypass the buffer cache and go directly to disk. Run the command just once and record the results.

# cd /
# umount /vxfs
# mount -o mincache=direct /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long

Record results:  Real: _____  User: _____  Sys: _____

Answer
Varies with configuration; live test data, not very impressive!

# timex ./disk_long      (rp2430)

real     7:36.75
user        5.15
sys         5.41

# timex ./disk_long      (rx2600)

real     3:06.72
user        0.90
sys         2.45

Based on the results, why does the disk_long program show such poor performance results when mounted with mincache=direct? When would this option be appropriate to use?
Answer

The performance is poor because system calls have to wait while user data and JFS transactions are written out to disk. Normally, the JFS transactions are written to buffer cache, and the system calls do not have to wait for the transaction to be written to disk. This option is appropriate when the application performs its own caching, like with an RDBMS (for example, Oracle).
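All of the variants from steps 3 through 7 can be timed in one pass. A minimal sketch using the options and paths from this lab; each iteration remounts so every run starts with a purged buffer cache:

for opt in "" "-o delaylog" "-o tmplog" "-o mincache=tmpcache" "-o mincache=direct"
do
    cd /
    umount /vxfs
    mount $opt /dev/vg00/vxfs /vxfs      # $opt is deliberately unquoted so the option splits
    cd /vxfs
    echo "== mount $opt =="
    timex ./disk_long
done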

8. Dismount the VxFS file system.

# umount /vxfs


1120. LAB: Network Performance


Directions
The following two labs investigate network read and write performance. The labs use NFS and are performed against the JFS file system created in the JFS module.

Lab 1 Network Read Performance


To perform this lab, two systems are needed: an NFS server and an NFS client. Pair up with another student in the class for this lab.

1. Make sure the JFS file system on the server contains the make_files program. Execute the make_files program to create files for the client to access.

# mount /dev/vg00/vxfs /vxfs
# cp /home/h4262/disk/lab1/make_files /vxfs
# cd /vxfs
# ./make_files

2. Export the JFS file system so the client can mount it.

# exportfs -i -o root=<client_hostname> /vxfs
# exportfs

3. From the client system, mount the NFS file system.

# umount /vxfs
# mount server_hostname:/vxfs /vxfs

4. Time how long it takes to read the 20 MB of files from the mounted file system. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:  Real: _____  User: _____  Sys: _____

Answer
Varies with configuration; live test data below.

# timex cat /vxfs/file* > /dev/null      (rp2430)

real        1.80
user        0.01
sys         0.07

# timex cat /vxfs/file* > /dev/null      (rx2600)

real        1.17
user        0.00
sys         0.02


5. Now that the data is in the client's buffer cache, time how long it takes to read the exact same files again. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:  Real: _____  User: _____  Sys: _____

Answer
Varies with configuration; live data below. Much faster once buffered.

# timex cat /vxfs/file* > /dev/null      (rp2430)

real        0.05
user        0.01
sys         0.04

# timex cat /vxfs/file* > /dev/null      (rx2600)

real        0.02
user        0.00
sys         0.01

Moral: Try to have a buffer cache on the client system big enough for a lot of data to be cached. Also, biod daemons will help by prefetching data.

6. Test to see if fewer biod daemons will change the initial performance.

# cd /
# umount /vxfs
# kill $(ps -e | grep biod | cut -c1-7)
# /usr/sbin/biod 4
# mount server_hostname:/vxfs /vxfs
# timex cat /vxfs/file* > /dev/null

Record results:  Real: _____  User: _____  Sys: _____

Answer
Varies with configuration, but no significant change here. Large sequential access appears to be independent of the number of biods. Not what theory suggests? Well, this depends!

# timex cat /vxfs/file* > /dev/null      (rp2430)

real        1.80
user        0.01
sys         0.07

# timex cat /vxfs/file* > /dev/null      (rx2600)


real        1.15
user        0.00
sys         0.02

7. Once finished, remove the files and umount the file system.

# rm /vxfs/file*
# umount /vxfs
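Before moving on, the step 6 test can be wrapped in a loop to try several daemon counts. A rough sketch (the counts 4, 8, and 16 are arbitrary choices; server_hostname is a placeholder, as elsewhere in this lab):

for n in 4 8 16
do
    cd /
    umount /vxfs 2>/dev/null
    kill $(ps -e | grep biod | cut -c1-7) 2>/dev/null
    /usr/sbin/biod $n
    mount server_hostname:/vxfs /vxfs
    echo "== $n biod daemons =="
    timex cat /vxfs/file* > /dev/null
done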

Lab 2 Network Write Performance


The following lab has the client perform many writes to an NFS file system. The following parameters will be investigated:

   Number of biod daemons
   NFS version 2 versus NFS version 3
   TCP versus UDP

During this lab, the monitoring tools shown below should be used on the client and server.

CLIENT                                    SERVER
# nfsstat -c                              # nfsstat -s
# glance NFS report (n key)               # glance NFS report (n key)
# glance Global Process (g key)           # glance Global Process (g key)
  - monitor biod daemons                    - monitor nfsd daemons
                                          # glance Disk report (d key)
                                            - monitor Remote Rds/Wrts

1. From the NFS client, mount the NFS file system as a version 2 file system.

# mount -o vers=2 server_hostname:/vxfs /vxfs

2. Terminate all the biod daemons on the client.

# kill $(ps -e | grep biod | cut -c1-7)

3. Time how long it takes to copy the vmunix file to the mounted NFS file system. Record the results. The first command buffers the file.

# cat /stand/vmunix > /dev/null
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____________ User: ____________ Sys: ____________


Answer
Varies with configuration.

# timex cp /stand/vmunix /vxfs      (rp2430)

real       33.95
user        0.00
sys         0.44

# timex cp /stand/vmunix /vxfs      (rx2600)

real       20.64
user        0.00
sys         0.38

4. Now, start up the biod daemons, and retry timing the copy. Record the results:

# /usr/sbin/biod 4
# timex cp /stand/vmunix /vxfs

Record results:  Real: _____  User: _____  Sys: _____

Answer
Varies with configuration; the test data shows marked improvement. The biods are providing the write-behind service, which reduces the wait time experienced by the cp command.

# timex cp /stand/vmunix /vxfs      (rp2430)

real       29.27
user        0.00
sys         0.16

# timex cp /stand/vmunix /vxfs      (rx2600)

real       16.53
user        0.00
sys         0.14

5. Change the mount options to version 3 and retime the transfer:

# cd /
# umount /vxfs
# mount -o vers=3 server_hostname:/vxfs /vxfs
# cd /
# timex cp /stand/vmunix /vxfs

Record results:


Real: _____  User: _____  Sys: _____

Answer:
Interesting; it would appear that version 3 mounting is far better than version 2. The results were obtained using the same 4 biods started in step 4.

# timex cp /stand/vmunix /vxfs      (rp2430)

real        2.63
user        0.00
sys         0.18

# timex cp /stand/vmunix /vxfs      (rx2600)

real        4.13
user        0.00
sys         0.13

6. Compare the speed of FTP to NFS. Transfer the file to the server using the ftp utility.

# ftp server_hostname
ftp> put /stand/vmunix /vxfs/vmunix.ftp

How long did the FTP transfer take? _________ Explain the difference in performance.

Answer
The data below shows that ftp is well optimized to perform data transfer. The good news is that version 3 of NFS keeps up with it; remember that at 11i, NFS is using TCP/IP and not UDP/IP.

# ftp r265c69      (rp2430)
Connected to r265c69.cup.edunet.hp.com.
220 r265c69.cup.edunet.hp.com FTP server (Version 1.1.214.4(PHNE_23950) Tue May 22 05:49:01 GMT 2001) ready.
Name (r265c69:root):
331 Password required for root.
Password:
230 User root logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put /stand/vmunix /vxfs/vmunix.ftp
200 PORT command successful.
150 Opening BINARY mode data connection for /vxfs/vmunix.ftp.
226 Transfer complete.
27573440 bytes sent in 2.55 seconds (10554.31 Kbytes/s)
ftp>

# ftp r265c145      (rx2600)
Connected to r265c145.
220 r265c145.cup.edunet.hp.com FTP server (Revision 1.1 Version wuftpd-2.6.1 Tue Jul 15 07:42:07 GMT 2003) ready.
Name (r265c145:root):
331 Password required for root.
Password:
230 User root logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put /stand/vmunix /vxfs/vmunix.ftp
200 PORT command successful.
150 Opening BINARY mode data connection for /vxfs/vmunix.ftp.
226 Transfer complete.
47716848 bytes sent in 4.03 seconds (11557.24 Kbytes/s)
ftp>

7. Test the potential performance benefit of turning off the new TCP feature of HP-UX 11i. First, mount the file system with the UDP protocol rather than the default TCP.

# umount /vxfs
# mount -o vers=3 -o proto=udp server_hostname:/vxfs /vxfs

Perform the copy test again and compare the results with the TCP version 3 mount data in step 5. Is UDP quicker than TCP?

# timex cp /stand/vmunix /vxfs

Answer

# timex cp /stand/vmunix /vxfs      (rp2430)

real        2.44
user        0.00
sys         0.15

# timex cp /stand/vmunix /vxfs      (rx2600)

real        4.08
user        0.00
sys         0.13

It would appear that UDP is marginally quicker than TCP, but the difference is very small and probably not worth the risk. NFS version 3 with TCP on HP-UX 11i provides good performance and reliability.
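One practical check before committing to UDP is the client-side RPC retransmission counter. A minimal sketch, assuming your nfsstat supports the -z reset option; compare the retrans figure with the total calls for the transfer, since a high ratio on a UDP mount is the classic sign that TCP's reliability is worth its small overhead:

# nfsstat -z       (zero the counters; requires root)
# timex cp /stand/vmunix /vxfs
# nfsstat -cr      (client RPC statistics; look at the calls and retrans columns)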
