Top Ten Performance Tips for HP-UX 11i v2 on HP Integrity Servers
June 2006
Simplifying Integrity Performance Team
BCS Transition Engineering and Consulting
The information contained herein is subject to change without notice.
Slide 2
May 31, 2006. © 2006 Hewlett-Packard Development Company, L.P.
What Will Help Customers Get the Best Out-of-the-Box Performance?
- This list covers both operating system and application tuning.
- Gathered from direct discussions with HP-UX labs.
- These all apply to HP-UX on Integrity servers; some apply to HP-UX on PA-RISC as well.
- Performance tuning is an art; not every issue will apply to every customer.
- These ten represent SOME of the areas that will help customers, but not everything.
- The order of listing does not imply priority.
Slide 3
Tip #1: Use the Latest HP Compilers
- Best performance will be achieved using the HP Integrity compilers for HP-UX, rather than:
  - using open source compilers
  - executing PA-RISC binaries with the ARIES translator
- Use current versions of the HP compilers.
- Download or order from DSPP:
  - www.hp.com/go/acc
  - www.hp.com/go/fortran
  - www.hp.com/go/java
- Free for registered DSPP partners.
HP continues to evolve HP-UX, its compilers, and development tools on HP Integrity servers for improved performance. Improvements are being added to the HP compilers with every release. Some customers have been surprised to find that they are running PA-RISC binaries on their Integrity servers without realizing it. You can use the file command on a binary to determine if it is a PA-RISC binary or a native Integrity binary.
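As a rough sketch of the file check described in these notes, the helper below classifies one line of file(1) output. The function name is hypothetical, and the matched substrings ("PA-RISC", "IA64") are assumptions about typical HP-UX file output; confirm them against your own system before relying on them.

```shell
# classify_binary: hypothetical helper, not an HP-supplied tool.
# Takes one line of file(1) output and reports the binary type.
# The substrings matched here are assumptions about HP-UX file(1) output.
classify_binary() {
  case "$1" in
    *PA-RISC*) echo "pa-risc" ;;    # would run under the ARIES translator
    *IA64*)    echo "integrity" ;;  # native Integrity binary
    *)         echo "unknown" ;;
  esac
}

# Usage on an HP-UX system (the path is a placeholder):
#   classify_binary "$(file /opt/myapp/bin/myapp)"
```

A result of "pa-risc" suggests the application should be rebuilt with the native HP compilers.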
Slide 4
Performance Improvements over Time with New Compiler Releases
[Chart: application performance improvement (0% to 60%), normalized to the same hardware, for compiler releases May-02, Oct-02, Jun-03, Mar-04, Dec-04, Sep-05, Jun-06; series: Integer, Technical, Commercial]
The Integer and Technical results use the base options of +O4, Profile feedback, and large pages. The Commercial result corresponds to a large commercial application built with +O2, Profile feedback, and large pages. The points on the chart correspond to different compiler releases.
Slide 5
Tip #2: Optimize, Optimize, Optimize!
- Take advantage of the HP Integrity compiler optimizations.
  - Can optimize selectively or across an entire application.
- Use Profile-Based Optimization (PBO): +Oprofile=collect
- Invoke the Interprocedural Optimizer (IPO).
  - Can be used with level 2, 3, or 4 (for example, +O2 -ipo).
- Trade-off: performance versus the ability to debug.
- Use the latest version of the Caliper performance tool at: http://www.hp.com/go/caliper
- Optimizing Itanium-based Applications (April 2006): http://h21007.www2.hp.com/dspp/files/unprotected/Itanium/OptimizingApps-Itanium.pdf
Profile-based optimization (PBO) is a set of performance-improving code transformations that make use of an execution profile gathered for an application. There are three steps to PBO:
1. Instrumentation: compile the program with profiling turned on.
2. Data collection: run the program with representative data to collect execution profile statistics.
3. Optimization: generate optimized code based on the profile.
Note that instrumented programs run slower and should only be used to collect statistics for profile-based optimization. Better alias information and inlining improve optimization.
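The three PBO steps can be sketched as a small shell function. This is only an illustration under assumptions: the function name, the +O2 level, and the file names are hypothetical, and +Oprofile=use is assumed to be the counterpart of the +Oprofile=collect option named on the slide. Setting RUN=echo prints the commands instead of running them.

```shell
# pbo_build: hypothetical wrapper around the three PBO steps.
# Arguments: source file, executable name, then arguments for the training run.
# Set RUN=echo for a dry run that only prints the commands.
pbo_build() {
  src=$1
  exe=$2
  shift 2
  # 1. Instrumentation: compile with profiling turned on.
  ${RUN:-} cc +O2 +Oprofile=collect -o "$exe" "$src" &&
  # 2. Data collection: run with representative input to gather the profile.
  ${RUN:-} "./$exe" "$@" &&
  # 3. Optimization: recompile using the collected profile.
  ${RUN:-} cc +O2 +Oprofile=use -o "$exe" "$src"
}

# Dry-run usage (file names are placeholders):
#   RUN=echo pbo_build app.c app input.dat
```

The instrumented binary from step 1 is overwritten in step 3, which matches the advice not to use instrumented programs for anything other than collecting statistics.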
Slide 6
Performance Gains Through Optimization
[Chart: performance gain (0% to 80%) by compiler and options: gcc4.0 O2, gcc4.0 O3, gcc4.1 O2, gcc4.1 O3, HP DD64 -O, HP DD64 -O PBO, HP DD64 -ipo, HP DD64 -ipo -PBO, HP DD32 Base]
Compiler optimizations deliver greater performance on HP Integrity than on other architectures.
Slide 7
Tip #3: Investigate Memory Requirements
- Memory requirements for HP-UX on Integrity should be about the same as on PA-RISC.
- Differences may occur, depending on the applications being run or when comparing to other UNIX operating systems.
- Read the new article on DSPP, entitled Memory Usage on HP-UX Integrity Servers:
  - May see increases in code size with little system impact; larger binaries on disk do not imply more memory.
  - May need more memory if moving applications from 32-bit to 64-bit, or using lots of pointers in C++.
- http://h21007.www2.hp.com/dspp/files/unprotected/hpux/Itanium-Memory-Usage.pdf
The article covers three topics: code expansion, data expansion, and object file expansion. Expect an increase in code size compared to PA-RISC. This increase is a trade-off of using some of the new performance features of Itanium processors. The effect on overall system memory usage should be minor, because one copy of code is typically shared in memory by all instances of a program. Data size can also increase if, for example, a 32-bit application is migrated to the 64-bit programming model. Top and other tools may report huge virtual address spaces, especially for stacks, but these do not affect actual memory usage.
Slide 8
Tip #4: Use Large Pages
- The ability to dynamically change page sizes at run time is a competitive advantage for HP-UX.
- Goal is to reduce the number of data TLB misses.
- This can be done globally by increasing: vps_ceiling, vps_chatr_ceiling, vps_pagesize
- Or it can be done by process (up to vps_chatr_ceiling):
  - chatr +pd 1M sets data page size maximum to 1 MB
- Can change and rerun to find the best page size for each application in powers of 4 (4 K, 16 K, 64 K, ...).
To ensure that chatr works properly on all versions of HP-UX, the recommended technique is to:
1. Terminate all processes running the program.
2. Make a copy of the program file.
3. Run the chatr command on the copy.
4. Copy the file that the chatr command was run on back over the original program file.
5. Run the program.
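The copy, chatr, copy-back sequence above can be sketched as a shell function. The function name and the ".new" suffix for the working copy are hypothetical; terminating running instances beforehand and starting the program afterwards are left to the operator. Setting RUN=echo prints the commands without running them.

```shell
# set_data_page_size: hypothetical helper for the recommended chatr procedure.
# Arguments: program file, maximum data page size (for example 1M).
# Terminate all processes running the program BEFORE calling this.
# Set RUN=echo for a dry run that only prints the commands.
set_data_page_size() {
  prog=$1
  size=$2
  copy="$prog.new"   # assumed name for the working copy
  # Make a copy of the program file.
  ${RUN:-} cp -p "$prog" "$copy" &&
  # Run chatr on the copy, not on the original.
  ${RUN:-} chatr +pd "$size" "$copy" &&
  # Copy the modified file back over the original program file.
  ${RUN:-} cp -p "$copy" "$prog"
}

# Dry-run usage (the path is a placeholder):
#   RUN=echo set_data_page_size ./myapp 1M
```

Rerunning the function with a different size and timing the application each time is one way to search for the best page size per application.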
Slide 9
Tip #5: Get All of the Latest HP-UX 11i v2 Performance Patches
- Some critical performance patches:
  - PHKL_33583: high memory pressure/page synchronization
  - PHKL_33368: JFS direct I/O performance
- Some key performance enhancements:
  - Threads
    - PHCO_33675: pthread cumulative patch
    - PHKL_34032: ksleep cumulative patch
  - High-resolution timers
    - Six patches to help customers migrating to HP-UX
HP continuously analyzes application performance on HP-UX and offers HP-UX patches that improve throughput, responsiveness, and behavior. This list is an attempt to document some known patch/performance relationships and the suggested remedy. It is meant to be used as a quick check when a system is experiencing performance problems. This is not a complete list, as new patches continue to be introduced. Many HP-UX 11i v2 patches are the same for both PA-RISC and Integrity servers.
PHKL_33583 - High memory pressure is seen and some pages of physical memory are never used by the kernel. This problem is only seen on Integrity servers. This patch provides a page cache synchronization fix.
PHKL_33368 - Direct I/O reads after buffered I/O writes on a large file take a long time. This patch provides a JFS 3.5 direct I/O performance improvement.
PHCO_33675 - This patch caused a problem in one situation with Java applications, so it is no longer recommended on Integrity servers on which pthreads applications, such as Java, intermittently abort or exhibit other unexpected behavior. PHCO_34718 is planned to supersede PHCO_33675 at some point in 2006 and correct this problem.
PHKL_34032 - Higher-resolution timers have been requested by customers porting from Tru64 UNIX and AIX. A new resolution of 1 ms will be provided for usleep, nanosleep, and setitimer in a series of patches for HP-UX 11i v2. These six patches are numbered from PHKL_34356 through PHKL_34361, inclusive, and will go with patch PHKL_34032 to address this need.
Slide 10
Locating Performance Patches
- Search for performance-specific patches:
  - ITRC (patches released to customers): http://www1.itrc.hp.com/service/patch/mainPage.do
- Use TOUR V3.0 for networking patches (Transport Optional Upgrade Release):
  - Performance enhancements and bug fixes
  - http://software.hp.com (search on TOUR)
TOUR V3.0
TOUR packages are designed for releasing optional enhancements and bug fixes that some customers may want. Many customers may not need these enhancement features. TOUR release notes are available at http://docs.hp.com. A keyword search on "TOUR" (all in uppercase) will locate all of these release notes. Note that many of the TOUR documents are labeled for TOUR V2, but are still relevant to TOUR V3. In TOUR 3.0, HP has included a NOSYNC enhancement. This feature is only beneficial to systems using the link aggregation product (HP APA) or 10 gigabit links.
Slide 11
Tip #6: Use Kernel Threads (1x1)
- MxN (Kernel and User threads) versus 1x1 (Kernel)
- Different operating system versions have different defaults for threads, with different performance implications:
  - in HP-UX 11i v2, the default was originally MxN
  - in all HP-UX 11i v2 updates, the default changed from MxN to 1x1
- Stick to 1x1 threads for best performance.
- Install the performance patches for the pthread library (see Tip #5).
- Make sure you set all the right environment variables for threads, described in: POSIX Threads on HP-UX 11i: HP-UX 11i v2 Update 2 http://devresource.hp.com/drc/topics/hpux_hpux.jsp#a095b8d8480264033
The key to using 1x1 threads is to NOT set the environment variables for MxN. This is described in the paper listed here. Try to use process-private mutexes and condition variables rather than process-shared ones.
Slide 12
Tip #7: On Cell-based Systems, Use Cell Local Memory and psets to Improve Memory Latency
- Optimize performance by minimizing memory accesses across cell board boundaries (within the same hard partition).
- Cells fully populated with same-size DIMMs give optimal bandwidth.
- For applications such as BI on a big system, allocate up to 70% of memory as cell local.
- Bring up Oracle BI applications which use Parallel Query (PQ) slaves with a scattering policy of round-robin (mpsched RR).
- Use psets to separate applications and improve their memory locality.
From the HP-UX 11i v2 Release Notes for the subject of cell local memory, found at: http://docs.hp.com/en/5990-8153/ch12s02.html
This feature can improve system and application performance when the memory of the system is appropriately configured to the proper balance between interleaved and cell local memory for the particular work load running on the system. Further performance improvements are possible if applications are modified to advise the operating system of the usage model for the memory they request. This feature can degrade performance if the system memory configuration does not match the work load on the system: for example, if the workload largely requires interleaved memory but the system has been configured with mostly cell local memory. This feature can also degrade performance if multithreaded applications have their threads distributed across multiple locality domains while their memory is allocated cell local.
Refer to the white paper on ccNUMA: http://docs.hp.com/en/4913/ccNUMA_White_Paper.pdf
For best performance, consider putting the application and database layers in separate psets. With Oracle, consider putting the Oracle log writer in its own pset.
Slide 13
Tip #8: Monitor, Profile, and Tune Java Applications with the HP Free Java Performance Tools
HPjconfig and JavaOOB
- Configure your system for Java workloads: kernel parameters and latest OS patches.
HPjmeter
- Profile your application using the -Xeprof option to collect detailed performance metrics. Then run HPjmeter to view, navigate, and drill down to discover your performance bottlenecks.
- Use HPjmeter 2.0 in your production environment to monitor your application's performance and resource utilization, and to set up custom alerts.
Slide 14
Free HP Java Performance Tools
HPjtune
- Use -Xverbosegc to collect detailed metrics on memory use and garbage collection (GC) performance.
- Then use HPjtune to view results, discover inefficient GC behaviors, and compare and tune your heap sizes and GC algorithms.
For more info on Java performance:
- Webcasts: Learn how to use Java tools: www.presentationselect.com/hpinvent/archivec.asp?ctg=JAV
- HP Programmer's Guide for Java: www.hp.com/products1/unix/java/infolibrary/prog_guide/index.html
- HP-UX Performance Tuning Java Web site: http://h21007.www2.hp.com/dspp/tech/tech_TechDocumentDetailPage_IDX/1,1701,1602,00.html
When an application does its work, it frequently creates and uses new Java objects. When the JVM heap memory into which new objects are placed becomes full, a Garbage Collection occurs. However, if the garbage collector is doing long collections at frequent intervals, it can play havoc with application performance. The -Xverbosegc Java command line option prints out detailed information about the spaces within the Java Heap before and after garbage collection. The size of the heap determines the frequency and duration of garbage collections. The JVM command-line options to set the initial and maximum heap sizes are the -Xms and -Xmx options, respectively. A third option (available with Java 2 HotSpot JVM only), -Xmn, configures the New generation heap size, where newly created objects are stored. You should tune each application individually to get the best performance.
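As an illustration of how the heap options above fit together on one command line, the sketch below wraps a java invocation. The function name, the heap sizes in the usage line, and the gc.log file name are hypothetical; the ":file=" suffix on -Xverbosegc is an assumption about the HP JVM option syntax, so check the JVM documentation for your release. Setting RUN=echo prints the command instead of running it.

```shell
# java_tuned: hypothetical wrapper that sets heap sizes and GC logging.
# Arguments: initial heap (-Xms), maximum heap (-Xmx), new generation (-Xmn),
# followed by the usual java arguments.
# Set RUN=echo for a dry run that only prints the command.
java_tuned() {
  xms=$1
  xmx=$2
  xmn=$3
  shift 3
  # -Xverbosegc:file=gc.log is assumed HP JVM syntax for writing GC data to a file.
  ${RUN:-} java -Xms"$xms" -Xmx"$xmx" -Xmn"$xmn" -Xverbosegc:file=gc.log "$@"
}

# Dry-run usage (sizes and jar name are placeholders, not recommendations):
#   RUN=echo java_tuned 512m 512m 128m -jar app.jar
```

The resulting GC log can then be examined, for example with HPjtune, before changing one size at a time and rerunning.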
Slide 15
Tip #9: Application Specific Tips
- With SAP, minimize the number of work processes; do not allocate more than you need. Also do not oversize the buffers.
- With Oracle, use the SCHED_NOAGE parameter for best performance in I/O intensive environments.
- With Apache, the default tuning recommendations for v1 on apache.org can lead to issues with copy avoidance and the use of sendfile. This is fixed in Apache v2.
See Appendix A of the Oracle white paper on tuning for HP-UX at: http://www.oracle.com/technology/products/database/clustering/pdf/11iRACBM2.pdf
The parameter hpux_sched_noage should be set to 178.
With Apache v1, you should change the default so that you are not using memory mapped files.
Slide 16
Tip #10: Integrity Virtual Machines Guest Operating Systems
- When installing a guest operating system, make sure you install the HPVM-Guest bundle that includes a performance tuning script, /sbin/init.d/hpvmguest.
- The script does several things to improve I/O and network performance and behavior of the guest:
  - disables I/O forwarding
  - disables TOPS (Thread-Optimized Packet Scheduling)
  - extends the default SCSI disk timeout value
  - extends the timeout settings for the mpt SCSI driver
The HPVM host installation package (T2767AC) contains a depot for installation in HP-UX guests. This depot can be found on every HPVM host in: /opt/hpvm/guest-images/hpux/hpvm_guest_depot.sd
This depot contains a single bundle: HPVM-Guest A.01.10 Integrity VM Guest
It is technically not necessary to have the bundle installed on guests, but it is HIGHLY RECOMMENDED. The bundle includes HPVM tools such as hpvminfo and hpvmcollect, and a set of kernel tunes that improve I/O and network performance and behavior of the guest. The tuning is performed via an rc script, /sbin/init.d/hpvmguest. The script's actions include:
- disabling I/O forwarding
- disabling Thread-Optimized Packet Scheduling (TOPS)
- extending the default SCSI disk timeout value
- extending the timeout settings for the mpt SCSI driver
You can inspect, and potentially alter, the new settings one by one in the file /etc/rc.config.d/hpvmguest.
Warning: The usual restrictions apply when making changes to kernel tunables. The presence of a tunable in this file does not imply that it is supported for customers to change it.
Slide 17
Honorable Mention: Other Tips
- Use a transaction monitor like Tuxedo to multiplex down the number of processes.
- Select will run slowly on large systems, due to the number of devices that must be polled. Improve this by using event ports instead.
- Run kcweb or kctune to check your kernel tuning for things like maximum stack size.
- If at all possible, test on the same size server you will implement on. Applications that run well on a small server may have problems when combined with other applications on a large server.
Event ports are described in the poll(7) manpage. For more information, refer to the following Web site: http://docs.hp.com/en/B2355-60105/poll.7.html
Slide 18
Itanium is a trademark or registered trademark of Intel Corporation in the U.S. and other countries and is used under license. Java is a US trademark of Sun Microsystems, Inc. Oracle is a registered US trademark of Oracle Corporation, Redwood City, California.