
White Paper

Performance Guidelines for IBM InfoSphere DataStage Jobs Containing Sort Operations on Intel Xeon Servers

Revision: 1.0
Intended for public distribution
Date: January 31, 2011
Authors: Garrett Drysdale, Intel Corporation; Jantz Tran, Intel Corporation; Sriram Padmanabhan, IBM; Brian Caufield, IBM; Fan Ding, IBM; Ron Liu, IBM; Pin Lp Lv, IBM; Mi Wan Shum, IBM; Jackson Dong Jie Wei, IBM; Samuel Wong, IBM

Table of Contents

Performance Guidelines for IBM InfoSphere DataStage Jobs Containing Sort Operations on Intel Xeon Servers
1. Introduction
2. Overview of IBM InfoSphere DataStage
3. Overview of Intel Xeon Series X7500 Processors
4. Sort Operation in IBM InfoSphere DataStage
   Testing Configurations
5. Summary for Sort Performance Optimizations
6. Recommendations for Optimizing Sort Performance
   6.1 Optimal RMU Tuning
       Configuration / Job Tuning Recommendations
   6.2 Final Merge Sort Phase Tuning Using Linux Read Ahead
       Configuration / Job Tuning Recommendations
   6.3 Using a Buffer Operator to Minimize Latency for Sort Input
       Configuration / Job Tuning Recommendations
   6.4 Minimizing I/O for Sort Data Containing Variable-Length Fields
       Configuration / Job Tuning Recommendations
   6.5 Future Study: Using Memory for RAM-Based Scratch Disk
7. Conclusion
8. About the Authors
9. Legal Disclaimer (Intel)
10. Legal Disclaimer (IBM)

1. Introduction
This whitepaper is the first in an anticipated series intended to provide IBM InfoSphere DataStage customers with helpful performance tuning guidelines for deployment on Intel Xeon processor-based platforms. IBM and Intel began collaborating in 2007 to optimize the performance and ROI of the combination of IBM InfoSphere DataStage and Intel Xeon based platforms. Our goal is not only to optimize performance, and therefore reduce the total cost of ownership, of this powerful combination in future versions of IBM InfoSphere DataStage on future Intel processors, but also to pass along the tuning and configuration guidance we discover along the way.

In our work together, we are striving to understand the execution characteristics of DataStage jobs on Intel platforms. This information is used to determine the hardware configurations, operating system settings, and job design and tuning techniques that optimize performance. Because of the highly scalable capabilities of IBM InfoSphere DataStage, our tests focus on the latest Intel Xeon X7560 EX processors, which support 4- and 8-socket systems. Initially, we are testing with four-socket configurations.

We presented information about IBM InfoSphere DataStage on Intel platforms at the 2009 and 2010 IBM Information on Demand Conferences. In 2009, our audience applauded the great scalability of IBM InfoSphere DataStage on Intel platforms, but asked us to provide more information on the I/O requirements of jobs and how to get the most out of existing platform I/O capability. Since then, we have found ways to increase the overall performance of all jobs in the new Information Server 8.5 version of IBM InfoSphere DataStage, which is now a 64-bit binary on Intel platforms, and we have investigated the I/O requirements of sorting. This paper focuses on the key pieces of information we obtained regarding configuring the platform, the operating system, and DataStage jobs that contain sort operators.

Sort is a crucial operation in data integration software. Sort operations are I/O intensive and can place a significant load on the temporary or scratch file system. To optimize server CPU utilization, the scratch I/O storage system must be capable of providing the disk bandwidth demanded by the sort operations. A scratch storage system that cannot write or read data at a high enough bandwidth will lead to under-utilization of the computing capability of the system, observed as low CPU utilization. This paper provides recommendations that reduce the bandwidth demand placed on the scratch storage I/O system by sort operations. These I/O reductions result in improved performance that can be quite significant for systems where the scratch I/O storage system is significantly undersized in comparison to the compute capability of the processors; we show such a scenario in this paper. Ideally, the best solution is to upgrade the scratch I/O storage subsystem to match the compute capability of the server.

2. Overview of IBM InfoSphere DataStage


IBM InfoSphere DataStage is a product for data integration via Extract-Transform-Load (ETL) capabilities. It provides a designer tool that allows developers to visually create integration jobs. The term "job" is used within IBM InfoSphere DataStage to describe an extract, transform and load task. Jobs are composed from a rich palette of operators called stages. These stages include:

- Source and target access for databases, applications and files
- General processing stages such as filter, sort, join, union, lookup and aggregations
- Built-in and custom transformations
- Copy, move, FTP and other data movement stages
- Real-time, XML, SOA and message queue processing

Additionally, IBM InfoSphere DataStage allows pre- and post-conditions to be applied to all these stages. Multiple jobs can be controlled and linked by a sequencer, which provides the control logic used to process the appropriate data integration jobs. IBM InfoSphere DataStage also supports a rich administration capability for deploying, scheduling and monitoring jobs.

One of the great strengths of IBM InfoSphere DataStage is that job designs require very little consideration of the underlying structure of the system and do not typically need to change. If the system changes, is upgraded or improved, or if a job is developed on one platform and implemented on another, the job design does not necessarily have to change. IBM InfoSphere DataStage learns the shape and size of the system from the IBM InfoSphere DataStage configuration file, and it organizes the resources needed for a job according to what is defined in that file. When a system changes, the file is changed, not the jobs.

A configuration file defines one or more processing nodes on which the job will run. The processing nodes are logical rather than physical; the number of processing nodes does not necessarily correspond to the number of cores in the system. The following factors affect the optimal degree of parallelism:

- CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by a given system.
- Jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.
- Applications that are disk- or I/O-intensive, such as those that extract data from and load data into databases, benefit from configurations in which the number of logical nodes equals the number of I/O paths being accessed. For example, if a table is partitioned 16 ways inside a database or if a data set is spread across 16 disk drives, one should set up a node pool consisting of 16 processing nodes.

Another great strength of IBM InfoSphere DataStage is that it does not rely on the functions and processes of a database to perform transformations: while IBM InfoSphere DataStage can generate complex SQL and leverages databases, it is designed from the ground up as a multipath data integration engine equally at home with files, streams, databases, and internal caching in single-machine, cluster, and grid implementations. As a result, customers in many circumstances find they do not also need to invest in staging databases to support IBM InfoSphere DataStage.

3. Overview of Intel Xeon Series X7500 Processors


Servers using the Intel Xeon series 7500 processor deliver dramatic increases in performance and scalability versus previous-generation servers. The chipset includes new embedded technologies that give professionals in business, information management, creative, and scientific fields the tools to solve problems faster, process larger data sets, and meet bigger challenges.

With intelligent performance, a new high-bandwidth interconnect architecture, and greater memory capacity, platforms based on the Intel Xeon series 7500 processor are ideal for demanding workloads. A standard four-socket server provides up to 32 processor cores, 64 execution threads and a full terabyte of memory. Eight-socket and larger systems are in development by leading system vendors.

The Intel Xeon series 7500 processor also includes more than 20 new reliability, availability and serviceability (RAS) features that improve data integrity and uptime. One of the most important is Intel Machine Check Architecture Recovery, which allows the operating system to take corrective action and continue running when uncorrected errors are detected. These highly scalable servers can be used to support enormous user populations.

Server platforms based on the Intel Xeon series 7500 processor deliver a number of additional features that help to improve performance, scalability and energy efficiency:

- Next-generation Intel Virtualization Technology (Intel VT) provides extensive hardware assists in processors, chipsets and I/O devices to enable fast application performance in virtual machines, including near-native I/O performance. Intel VT also supports live virtual machine migration among current and future Intel Xeon processor-based servers, so businesses maintain a common pool of virtualized resources as they add new servers.
- Intel QuickPath Interconnect (QPI) Technology provides point-to-point links to distributed shared memory. The Intel Xeon 7500 series processors feature two integrated memory controllers and four QPI links to deliver scalable interconnect bandwidth, outstanding memory performance and flexibility, and tightly integrated interconnect RAS features. Technical articles on QPI can be found at http://www.intel.com/technology/quickpath/.
- Intel Turbo Boost Technology boosts performance when it's needed most by dynamically increasing core frequencies beyond rated values for peak workloads. Intel Intelligent Power Technology adjusts core frequencies to conserve power when demand is lower.
- Intel Hyper-Threading Technology can improve throughput and reduce latency for multithreaded applications and for multiple workloads running concurrently in virtualized environments.

For additional information on the Intel Xeon Series 7500 Processor for mission critical applications, please see http://www.intel.com/pressroom/archive/releases/20100330comp_sm.htm.

4. Sort Operation in IBM InfoSphere DataStage


A brief overall description of the Sort operation is given here. The Sort operator implements a segmented merge sort and accomplishes sorting in two phases. First, the initial sort phase sorts chunks of data into the correct order and stores them as temporary files on the scratch file system. The sort operator uses a buffer whose size is defined by the RMU parameter. This buffer is divided into two halves. The sorting thread inserts records into the half of the buffer it is working on until that half is full, then moves to the other half to continue inserts. The full half is sorted and then written out as a chunk to the scratch file system by a separate writer thread. See the figure below.

Figure 1 - Sort operation overview

The sort buffer is used during both the initial sort phase and the final merge phase of the sort operation. During the final merge phase, a block of data is read from the beginning of each of the temporary sorted files stored on the scratch file system. If the sort buffer is too small, there will not be enough memory to read a chunk of data from each of the temporary sort files produced by the initial sort phase. This condition is detected during the initial sort phase and, if it occurs, a second thread runs to pre-merge the temporary sort files. This reduces the number of temporary sort files so that the buffer has sufficient space to load a block of data from each of them during final merging. In the following tests, we will show several tuning and configuration settings that can be used to reduce the I/O demand placed on the system by sort operations.
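To make the two phases concrete, the sketch below implements a generic two-phase external merge sort in Python. It is a minimal illustration of the pattern described above, not DataStage code: the function names, the one-record-at-a-time buffer model, and the use of the system temp directory as stand-in scratch space are all assumptions for the example.

import heapq
import os
import tempfile

def _write_run(buf):
    # Initial sort phase: sort one buffer's worth of records and write it
    # out as a temporary sorted run (the temp dir stands in for scratch).
    buf.sort()
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.writelines(rec + "\n" for rec in buf)
    return path

def external_sort(records, buffer_records):
    # buffer_records plays the role of the RMU-sized sort buffer: it caps
    # how many records are held in memory before a sorted run is written.
    run_files = []
    buf = []
    for rec in records:
        buf.append(rec)
        if len(buf) >= buffer_records:
            run_files.append(_write_run(buf))
            buf = []
    if buf:
        run_files.append(_write_run(buf))

    # Final merge phase: stream a block (here, a line) at a time from each
    # temporary sorted run and merge them into the final ordered output.
    streams = [open(path) for path in run_files]
    try:
        for line in heapq.merge(*streams):
            yield line.rstrip("\n")
    finally:
        for s in streams:
            s.close()
        for path in run_files:
            os.remove(path)

if __name__ == "__main__":
    data = ["delta", "alpha", "echo", "charlie", "bravo"]
    print(list(external_sort(data, buffer_records=2)))

With too small a buffer the number of runs grows, which is exactly the condition that forces the pre-merge thread described above to run.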

Testing Configurations
The testing was done on a single Intel server with the Intel Xeon 7500 series chipset and four Intel Xeon X7560 processors. The X7560 processors are based on the Nehalem microarchitecture. The system has 4 sockets, 8 cores per socket, and 2 threads per core using Intel Hyper-Threading Technology, for a total of 64 threads of execution. Our test configuration uses 64 GB of memory, though the platform has a maximum capacity of 1 TB. The processor operating frequency is 2.26 GHz and each processor has 24 MB of L3 cache shared across its 8 cores.

The system uses 5 Intel X25-E solid state drives (SSDs) for temporary I/O storage, configured in a RAID-0 array using the on-board RAID controller. This storage is used as scratch storage for the sort tests. The bandwidth capability of the 5 SSDs was not sufficient to maximize the CPU utilization of the system given the high performance capabilities of DataStage; this is explained in more detail later. We recommend sizing the I/O subsystem to maximize CPU utilization, although we were not able to do so given the equipment available at the time of data collection.

The operating system is Red Hat* Enterprise Linux* 5.3, 64-bit version. The test environment is a standard Information Server two-tier configuration. The client tier runs just the DataStage client applications; all the remaining Information Server tiers are installed on a single Intel Xeon X7560 server.

Test Client:
- Platform: Windows Server 2003
- Processor: x86-based PC, 2.4 GHz
- Memory: 8 GB RAM

Information Server (IS) Tiers (Services + Repository + Engine), standalone topology, on the Intel Xeon X7560 server:
- Platform: Red Hat EL 5.3, 64-bit
- Processor: Intel Xeon X7560; 4 sockets, 32 cores, 64 threads
- Processor Speed: 2.26 GHz
- Memory: 64 GB RAM
- Metadata Repository: DB2/LUW 9.7 GA
- Scratch Space: 5 Intel X25-E SSDs configured as a RAID-0 array using the onboard controller

Figure 2 - System test configuration

The following table lists the specifics of the platform tested:

OEM                     Intel
CPU Model ID            7560
Platform Name           Boxboro
Sockets                 4
Cores per Socket        8
Threads per Core        2
CPU Code Name           Nehalem-EX
CPU Frequency (GHz)     2.24
QPI (GT/s)              6.4
Hyperthreading          Enabled
Prefetch Settings       Default
LLC Size (MB)           24
BIOS Version            R21
Memory Installed (GB)   64
DIMM Type               DDR3-1066
DIMM Size (GB)          4
Number of DIMMs         16
NUMA                    Enabled
OS                      RHEL 5.3, 64-bit

Table 1 - Intel platform tested

5. Summary for Sort Performance Optimizations


This section provides a brief summary of the recommendations from this performance study; Section 6 provides more detail for those seeking the deeper technical dive.

Reducing I/O contention is critical to optimizing Sort stage performance. Spreading sort I/O across different physical disks is a simple first step. A sample DataStage configuration file to implement this method is shown below.

{
    node "node1"
    {
        fastname "DataStage1.ibm.com"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets1" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch1" {pools ""}
    }
    node "node2"
    {
        fastname "DataStage2.ibm.com"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets2" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch2" {pools ""}
    }
}

In this configuration file, each DataStage processing node has its own scratch space defined in a directory that resides on a separate physical device. This helps prevent contention for I/O subsystem resources among DataStage processing nodes. This is a fairly well known technique and was not studied for this paper.

This paper describes additional techniques to achieve optimal performance for DataStage jobs containing Sort operations:

1) Setting the Restrict Memory Usage (RMU) parameter for sort operations to an appropriate value for large data sets will reduce I/O demand on the scratch file system. The recommended RMU size varies with the data set size and node count. The formula is shown in section 6.1, along with a reference table that summarizes the suggested RMU sizes for a variety of data set sizes and node counts. The RMU parameter gives users the flexibility of defining the sort buffer size to optimize memory usage of their system.

2) Increasing the default Linux read-ahead value for the disk storage system(s) used for scratch space can increase the performance of the final merge phase of sort operations. The recommended setting for the read-ahead value is 512 or 1024 sectors (256 KB or 512 KB) for the scratch file system. See section 6.2 for information on how to change the read-ahead value in Linux.

3) Sort operations can benefit from having a buffer operator inserted before the sort in the data flow graph. Because sort operations work on large amounts of data, a buffer operator provides extra storage to get the data to the sort operator as fast as possible. See section 6.3 for details.

4) Enabling the APT_OLD_BOUNDED_LENGTH setting can decrease I/O demand during sorting when bounded-length VARCHAR data is involved, potentially resulting in improved overall throughput for the job.

6. Recommendations for Optimizing Sort Performance


We investigated the Sort operation in detail and considered the effect of a number of performance tuning factors on its I/O characteristics. The input data size and the format of the data are critical input factors affecting sort. The layout of the I/O subsystem and the file cache and prefetching characteristics are also important. The RMU buffer size configuration parameter has a significant effect on the behavior of the sort as the input data set size is adjusted. These factors are considered in greater detail below.

In our tests, a job consisting of one sort stage and running on a one-node configuration was capable of sorting and writing data at the rate of 120 MB/s to the scratch file system. Increasing the node count of the job quickly resulted in more I/O requests to the scratch I/O storage array than it was able to service in a timely manner. Due to this limitation, the Intel Xeon server CPUs were greatly underutilized: the scratch I/O file system was simply under-configured for a server with such high computational capability. This illustrates the high compute power available on the Intel Xeon processors and the ability of IBM InfoSphere DataStage to efficiently harness it. Configuring sufficient I/O is of paramount importance to enable efficient utilization of this powerful combination of hardware and software.

For our test system, we chose a configuration for the scratch storage I/O system that was significantly undersized in comparison to the compute capability of the server. While we recommend always configuring for optimal performance, which would include a more capable scratch storage system, customer feedback has indicated that many deployed systems have insufficient bandwidth to the scratch storage system, and this is likely to remain the situation for many customers as growth in CPU processing performance continues to outpace the I/O capability of storage subsystems. The tuning and configuration tips in this paper are designed to increase performance on all systems, but will be especially beneficial for systems constrained by the scratch I/O storage system; in all cases, the amount of data transferred is reduced.

By adjusting the DataStage parallel node count, we were able to match the job's demand to the scratch storage capabilities and prevent the scratch storage system from being saturated. This allowed us to study and develop this tuning guidance in a balanced environment. Using this strategy, we developed several tuning guidelines that reduce the demand for scratch storage I/O and thereby increase performance. While these findings are significant, and we highly recommend them, we also want to make clear that there is no substitute for a high performance scratch storage system capable of supplying sufficient bandwidth and I/Os per second (IOPS) to maintain high CPU utilization. The tuning guidance given here will help even a high performance scratch I/O system deliver better performance to DataStage jobs using sort operations.

The remainder of this section describes the tuning results we found to improve sort performance through I/O reduction.

6.1 Optimal RMU Tuning

This section describes how to tune the sort buffer size parameter, called RMU, to minimize I/O demand on the scratch I/O system. An RMU value that is too small will result in intermediate merges of temporary files during the initial sort phase. These intermediate merges can significantly increase the I/O demand on the scratch file system. Tuning the RMU value appropriately can eliminate all intermediate merge operations and greatly increase the throughput of sort operations on systems with limited bandwidth to the scratch I/O file system. On many systems, the scratch disk I/O system is a performance bottleneck because the disks or the interconnect cannot supply the bandwidth needed to maximize CPU utilization. Eliminating pre-merging reduces the overall I/O demand on the scratch file system, allowing it to complete I/O faster, increasing throughput and decreasing job run time.

Configuration / Job Tuning Recommendations


Given knowledge of the size of the data to be sorted, it is possible to calculate the optimal RMU value that will prevent the pre-merge thread from running, thus reducing I/O demand. The RMU formula is:

    RMU (MB) >= SQRT( DataSizeToSort (MB) / NodeCount ) / 2

Notes about using the above formula:

1) The total data size is divided by the node count because the data sorted per node decreases with increasing node count. A node in this context refers to one of the parallel instances of the job when it is instantiated.

2) Our tests indicate that the RMU value can span a fairly large range and still provide good performance. Sometimes the amount of data to be sorted is not known precisely; we recommend estimating the input data size to within a factor of two of the actual value. In other words, overestimating the data set size by a factor of 2x will still result in an RMU value from the above equation that provides good performance.

3) The default RMU value is 20 MB, which can sort up to 1.6 GB of data per node while avoiding costly pre-merge operations. If your data set size divided by node count is less than 1.6 GB, then no change to the RMU is necessary.

The following table is a handy reference of RMU settings for different input data sizes and node counts. The table assumes the user knows the size of the data set to be sorted; as noted above, overestimating the data set size by up to a factor of two will still give good performance. The table contains the word "Default" where the formula results in less than the 20 MB default, indicating that the user should use the default value. It is not necessary to decrease the RMU value below the 20 MB default, though doing so is allowed.


Data Size          Min RMU (MB)
to Sort (GB)   1 Node   4 Nodes   8 Nodes   16 Nodes   24 Nodes   32 Nodes   48 Nodes   64 Nodes
1              Default  Default   Default   Default    Default    Default    Default    Default
1.5            Default  Default   Default   Default    Default    Default    Default    Default
3              28       Default   Default   Default    Default    Default    Default    Default
10             51       25        Default   Default    Default    Default    Default    Default
30             88       44        31        22         Default    Default    Default    Default
100            160      80        57        40         33         28         23         20
300            277      139       98        69         57         49         40         35
1000           506      253       179       126        103        89         73         63
3000           876      438       310       219        179        155        126        110
10000          1600     800       566       400        327        283        231        200

Table 2 - RMU buffer size table
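The table values follow directly from the formula. A small Python helper like the sketch below (a hypothetical function, not part of DataStage) reproduces them for arbitrary inputs:

import math

DEFAULT_RMU_MB = 20

def min_rmu_mb(data_size_gb, node_count):
    # Suggested Restrict Memory Usage value in MB, per the formula above:
    # RMU (MB) >= SQRT( DataSizeToSort (MB) / NodeCount ) / 2
    data_per_node_mb = data_size_gb * 1024 / node_count
    rmu = math.sqrt(data_per_node_mb) / 2
    # Values at or below the 20 MB default need no change ("Default").
    return max(DEFAULT_RMU_MB, math.ceil(rmu))

# Example: 100 GB sorted across 16 nodes -> 40 MB, matching Table 2.
print(min_rmu_mb(100, 16))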

Our test results for a job consisting of one sort stage running with 4 parallel nodes and two different RMU values are shown in Figure 3. The correct sizing of the RMU value resulted in a 36% throughput increase. In the tests, the I/O bandwidth did not decrease because the I/O subsystem was delivering the maximum bandwidth it was capable of in both cases. However, because the total quantity of data transferred was much lower, the CPU cores were able to operate at higher utilization and complete the sort in a shorter amount of time. This optimization is very effective for scratch disks that are unable to deliver enough scratch file I/O bandwidth to feed the high performing Intel Xeon server and highly efficient IBM InfoSphere DataStage software. The results shown here are for a sort-only job where we have isolated the effect of the RMU parameter. This optimization will help more complex jobs, but will only directly affect the performance of the sort operators within the job.

RMU Size   Read Ahead Setting       Run Time
10 MB      128 kB (Linux default)   4.05 minutes
30 MB      128 kB (Linux default)   2.97 minutes

Figure 3 - Performance tuning sort with the Sort operator RMU value

To modify the RMU setting for a Sort stage in a job, open the Sort stage on the DataStage Designer client canvas, select the Stage tab, then Properties; click Options in the left window, and select Restrict Memory Usage (MB) from the "Available properties to add" window to add it.


Figure 4 - Adding the RMU option

Once the Restrict Memory Usage option is added, its value can be set to the recommended value based on the formula above.


Figure 5 - Setting the RMU option

6.2 Final Merge Sort Phase Tuning Using Linux Read Ahead

During testing of the single node sort job, we found that CPU utilization of final merge can be improved by changing the scratch disk read ahead setting in Linux, resulting in substantial throughput improvements of the final merge sort phase.

Configuration / Job Tuning Recommendations


The default Linux file system read ahead value is 256 sectors. A sector is 512 bytes so the total default read ahead is 128 kB. Our testing indicated that increasing the read ahead value to 1024 sectors (512 kB) increased CPU utilization and reduced the final merge time by reducing the amount of time that DataStage had to wait for I/Os from the scratch file system. This resulted in an increase in throughput of the final merge phase of sort of approximately 30%. Test results for a job consisting of one sort stage running with 4 parallel nodes with two different values for the Linux read ahead setting are shown in Figure 6. Increasing the Linux default read ahead setting of 128 kB to 512 kB resulted in a 9% improvement in throughput of the job.


RMU Size   Read Ahead Setting       Run Time
30 MB      128 kB (Linux default)   2.97 minutes
30 MB      512 kB                   2.72 minutes

Figure 6 - Performance tuning the Sort operator with the Linux read ahead setting

The current read ahead setting for a disk device in Linux can be obtained using the following command:

>hdparm -a /dev/sdb1

To set the read ahead value for a specific disk device in Linux, use the following command:

>hdparm -a 1024 /dev/sdb1    (sets read ahead to 1024 sectors on disk device /dev/sdb1)

To make the setting persist across reboots, add the command to the /etc/init.d/boot.local file. Recommended settings to try are 512 sectors (256 kB) or 1024 sectors (512 kB). Increasing the read ahead size results in more data being read from the disk and stored in the OS disk cache. As a result, more read requests by the sort operator get the requested data directly from the OS disk cache instead of waiting for the full latency of a data read from the scratch storage system. (Note that the Linux file system cache is controlled by the kernel and uses memory that is not allocated to processes.)

In our tests, the scratch storage system consists of SSDs configured in a RAID-0 array, so I/O request latencies are low compared to typical rotating-media storage arrays. Increasing OS read ahead will benefit scratch storage arrays consisting of HDDs even more, and larger read ahead values than those tested may be more beneficial for HDD arrays. We chose SSDs because they provide higher bandwidth, much improved IOPS (I/Os per second) and much lower latency than an equivalent number of hard disk drives.

Many RAID controllers found in commercial storage systems also have the capability to perform read ahead on read requests and store the data in their cache. It is good to enable this feature if it is available on the storage array being used for scratch storage. It is still important to increase read ahead in the OS, since serving requests from the OS disk cache is faster than waiting for data from the RAID engine. The results shown here are for a job with a sort operation only; tuning read ahead will not impact the performance of other operations in the job that are not performing scratch disk I/O.
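The same setting can also be inspected and changed through sysfs, which is convenient to script. A minimal sketch, assuming the scratch array appears as the hypothetical block device sdb (sysfs reports the value in kilobytes rather than sectors):

# Read and (as root) set the kernel read ahead for one block device.
DEVICE = "sdb"  # hypothetical device name for the scratch array
PATH = f"/sys/block/{DEVICE}/queue/read_ahead_kb"

with open(PATH) as f:
    print(f"current read ahead: {f.read().strip()} kB")

# Writing requires root; 512 kB matches the 1024-sector tuning above.
# with open(PATH, "w") as f:
#     f.write("512")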

6.3 Using a Buffer Operator to Minimize Latency for Sort Input

The DataStage parallel engine employs buffering automatically as part of its normal operations. Because the initial sort phase has such a high demand for input data, it is especially sensitive to latency spikes in the data source feeding the sort. These latency spikes can occur due to data being sourced from local or remote disks, or due to scheduling of operators by the operating system. By adding an additional buffer in front of the sort, we were able to maintain the CPU utilization on the core running the sort thread at 100% during the entire initial sort phase, thus increasing the performance of the initial sort phase by nearly 7%.

Configuration / Job Tuning Recommendations


We recommend adding a buffer in front of the sort with a size equal to the RMU value. To do this, open the Sort stage in a DataStage job on the DataStage Designer client canvas, click on the Input tab, then Advanced. Select Buffer from the Buffering mode drop-down menu and modify the Maximum memory buffer size (bytes) field. Note that this field is specified in bytes: a buffer matching a 30 MB RMU, for example, is 31457280 bytes (30 x 1024 x 1024).

Figure 7 - Adding buffer in front of the sort
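To illustrate the principle, the toy sketch below (not DataStage code; the queue capacity and spike timing are arbitrary assumptions) shows how a bounded buffer between a bursty source and a consumer lets the source run ahead during quiet periods so the consumer can keep working through source latency spikes:

import queue
import threading
import time

buf = queue.Queue(maxsize=10000)  # plays the role of the buffer operator
SENTINEL = None

def source():
    # Produce records quickly, with occasional latency spikes
    # (e.g., a remote disk stall or an OS scheduling delay).
    for i in range(50000):
        if i % 10000 == 0:
            time.sleep(0.05)
        buf.put(i)
    buf.put(SENTINEL)

def sort_input():
    # Consume records; with a deep enough buffer this thread rarely blocks.
    count, start = 0, time.time()
    while buf.get() is not SENTINEL:
        count += 1  # a real sort would insert the record into its buffer here
    print(f"consumed {count} records in {time.time() - start:.2f}s")

threading.Thread(target=source).start()
sort_input()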

6.4 Minimizing I/O for Sort Data Containing Variable-Length Fields

By default, the parallel engine internally handles bounded-length VARCHAR fields (those that specify a maximum length) as essentially fixed-length character strings. If the actual data in the field is shorter than the maximum length, the string is padded to the maximum size. This behavior is efficient for CPU processing of records throughout the course of an entire job flow, but it increases the I/O demands for operations such as Sort. When the environment variable APT_OLD_BOUNDED_LENGTH is set, the data within each VARCHAR field is processed without additional padding, resulting in less data written to disk. This can decrease I/O bandwidth demand, and therefore increase job throughput, when the scratch disk subsystem is not able to keep up with the processing capability of DataStage and the Intel Xeon server. The trade-off is that additional CPU cycles are used to process the variable-length data: this setting spends CPU processing power to reduce the amount of I/O required from the scratch file system.

Our test of a job consisting of one sort stage running with 16 parallel nodes using APT_OLD_BOUNDED_LENGTH resulted in a 25% reduction in the size of temporary sort files and a 26% increase in throughput (a 21% reduction in runtime).

Normalized Comparison            Default   With APT_OLD_BOUNDED_LENGTH
Scratch Storage Space Consumed   1.0       0.75x (75% of the original storage space used)
Runtime                          1.0       0.79x (79% of the original runtime)
Throughput                       1.0       1.26x (26% increase in job processing rate)

Table 3 - Sort operation performance comparison using APT_OLD_BOUNDED_LENGTH

Please note that the performance benefit of this tuning parameter will vary based on several factors. It only applies to data records that have VARCHAR fields. The actual file size reduction realized on the scratch storage system will depend heavily on the maximum size specified by the VARCHAR fields, the size of the actual data contained in those fields, and whether the VARCHAR fields are a sort key for the records. The amount of performance benefit will depend on how much the total file size is reduced, along with the data request rate of the sort operations compared to the capability of the scratch file system to supply the data. In our test configuration, the 16-node test drove the scratch I/O system to its maximum bandwidth limit. By setting APT_OLD_BOUNDED_LENGTH, the amount of data written to and subsequently read from the disk decreased substantially over the length of the job, allowing faster completion.
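The effect is easy to see with a toy byte-level encoding. This is illustrative only and is not DataStage's actual record format: a VARCHAR(100) field is written either padded to its maximum length or prefixed with its actual length.

import struct

MAX_LEN = 100  # bound declared as VARCHAR(100)
cities = ["Austin", "San Jose", "Beijing"]

# Fixed-length handling: every value padded out to MAX_LEN bytes.
padded = b"".join(c.encode().ljust(MAX_LEN, b" ") for c in cities)

# Unpadded handling: a 4-byte length prefix plus the actual bytes.
prefixed = b"".join(struct.pack("<I", len(c)) + c.encode() for c in cities)

print(f"padded (fixed-length) bytes: {len(padded)}")    # 300
print(f"length-prefixed bytes:       {len(prefixed)}")  # 33

The shorter the actual values relative to the declared maximum, the larger the reduction in temporary sort file size, which is why the benefit is so data dependent.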

Configuration / Job Tuning Recommendations


This optimization will only affect data sets that use bounded-length VARCHAR data types. APT_OLD_BOUNDED_LENGTH is a user-defined environment variable for DataStage; it can be added at either the project level or the job level. You can follow the instructions in the IBM InfoSphere DataStage and QualityStage Administrator Client Guide and the IBM InfoSphere DataStage and QualityStage Designer Client Guide to add and set a new variable. We recommend trying this setting if low CPU utilization is observed during sorting or if it is known that the scratch file system is unable to keep up with job demands.

6.5 Future Study: Using Memory for RAM-Based Scratch Disk

As a future study, we intend to investigate performance when using a RAM-based disk for scratch storage. The memory bandwidth available in the Nehalem-EX test system is greater than 70 GB/s when correctly configured. While SSDs offer some bandwidth improvement over hard disk drives, they cannot begin to match the bandwidth of main memory. The system supports enough PCI Express lanes to reach approximately 35 GB/s of I/O in each direction if all PCIe lanes are utilized; however, such an I/O solution would be expensive. The currently available 4-socket Intel X7560 systems can address 1 TB of memory and 8-socket systems can address 2 TB. DRAM capacity will continue to rise with new product releases, and IBM X series systems also offer options to increase DRAM capacity beyond the baseline. While DRAM is expensive compared to disk drives on a per-capacity basis, it is more favorable when comparing bandwidth capability in and out of the system. We plan to evaluate the performance and cost-benefit of large in-memory storage compared to disk drive based storage solutions and provide the results in the near future.


7. Conclusion
We have shown how to optimize IBM InfoSphere DataStage sort performance on Intel Xeon processors using a variety of tuning options: the Sort buffer RMU size, the Linux read ahead setting, an additional buffer operator in front of the sort, and the APT_OLD_BOUNDED_LENGTH handling of bounded-length VARCHAR data.


Our results reinforce the necessity of correctly sizing I/O to optimize server performance. For sort, it is imperative to have sufficient scratch I/O storage performance to allow all sort operators running concurrently in the system to proceed at full speed, in order to fully utilize the server. Powerful mission critical servers like the Intel Xeon platforms based on the X7500 series processor running the IBM InfoSphere DataStage parallel engine can efficiently process data at extremely high rates. As a result, I/O and network bandwidth are extremely important for high performance; network interconnects like 10 Gbit/s Ethernet or 40 Gbit/s Fibre Channel are necessary to fully realize the computational potential of this powerful combination of hardware and software. In the near future, we plan to analyze the cost/benefit trade-off of using large DRAM capacity as a replacement for disk subsystems for scratch I/O. We will also be looking at tuning high bandwidth networking solutions to optimize performance.


8. About the Authors


Garrett Drysdale is a Sr. Software Performance Engineer at Intel. Garrett has analyzed and optimized software on Intel platforms since 1995, spanning the client, workstation, and enterprise server market segments. He currently works with enterprise software developers to analyze and optimize server applications, and with internal design teams to evaluate the impact of new technologies on software performance for future Intel platforms. Garrett has a BSEE from the University of Missouri-Rolla and an MSEE from the Georgia Institute of Technology. His email is garrett.t.drysdale@intel.com.

Jantz Tran is a Software Performance Engineer at Intel. He has been analyzing and optimizing enterprise software on Intel server platforms for 10 years. Jantz has a BSCE from Texas A&M University. His email is jantz.c.tran@intel.com.

Dr. Sriram Padmanabhan is an IBM Distinguished Engineer and Chief Architect for IBM InfoSphere Servers. Most recently, he led the Information Management Advanced Technologies team investigating new technical areas such as the impact of Web 2.0 information access and delivery. He was a Research Staff Member and then a manager of the Database Technology group at the IBM T.J. Watson Research Center for several years. He was a key technologist for DB2's shared-nothing parallel database feature and one of the originators of DB2's multi-dimensional clustering feature. He was also a chief architect for Data Warehouse Edition, which provides integrated warehousing and business intelligence capabilities enhancing DB2. Dr. Padmanabhan has authored more than 25 publications, including a book chapter on DB2 in a popular database textbook, several journal articles, and many papers in leading database conferences. His email is srp@us.ibm.com.

Brian Caufield is a Software Architect for InfoSphere* Information Server, responsible for the definition and design of new IBM InfoSphere DataStage features, and also works with the Information Server Performance Team. Brian represents IBM at the TPC, working to define an industry standard benchmark for data integration. Previously, Brian worked for 10 years as a developer on IBM InfoSphere DataStage, specializing in the parallel engine. His email is bcaufiel@us.ibm.com.

Fan Ding is currently a member of the Information Server Performance Team. Prior to joining the team, he worked in Information Integration Federation Server development. Fan has a Ph.D. in Mechanical Engineering and a Master's in Computer Science from the University of Wisconsin. His email is fding@us.ibm.com.

Ron Liu is currently a member of the IBM InfoSphere Information Server Performance Team with focus on performance tuning and information integration benchmark development. Prior to his current job, Ron had 7 years in Database Server development (federation runtime, wrapper, query gateway, process model, and database security). Ron has a Master of Science in Computer Science and Bachelor of Science in Physics. His email is ronliu@us.ibm.com.


Pin Lp Lv is a Software Performance Engineer at IBM. Pin has worked for IBM since 2006. He worked as a software tester for the IBM WebSphere Product Center Team and RFID Team from September 2006 to March 2009, and joined the IBM InfoSphere Information Server Performance Team in April 2009. Pin has a Master of Science degree in Computer Science from the University of West Scotland. His email is pinlv@cn.ibm.com.

Mi Wan Shum is the manager of the IBM InfoSphere Information Server performance team at the IBM Silicon Valley Lab. She graduated from the University of Texas at Austin and has years of software development experience at IBM. Her email is msshum@us.ibm.com.

Jackson (Dong Jie) Wei is a Staff Software Performance Engineer at IBM. He worked as a DBA at CSRC before joining IBM in 2006. Since then, he has been working on the Information Server product, and in 2009 he began to focus his work on ETL performance. Jackson is also the technical lead for the IBM China Lab Information Server performance group. He received his bachelor's and master's degrees in Electronic Engineering from Peking University in 2000 and 2003, respectively. His email is weidongj@cn.ibm.com.

Samuel Wong is a member of the IBM InfoSphere Information Server performance team at the IBM Silicon Valley Lab. He graduated from the University of Toronto and has 12 years of software development experience at IBM. His email is samwong@us.ibm.com.


9. Legal Disclaimer (Intel)


Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/.

THIS DOCUMENT AND RELATED MATERIALS AND INFORMATION ARE PROVIDED "AS IS" WITH NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION, OR SAMPLE. INTEL ASSUMES NO RESPONSIBILITY FOR ANY ERRORS CONTAINED IN THIS DOCUMENT AND HAS NO LIABILITIES OR OBLIGATIONS FOR ANY DAMAGES ARISING FROM OR IN CONNECTION WITH THE USE OF THIS DOCUMENT.

All products, product descriptions, plans, dates, and figures are preliminary, based on current expectations, and subject to change without notice. Availability may vary in different channels.

*Other names and brands may be claimed as the property of others.

© 2011, Intel Corporation. All rights reserved. Intel, the Intel logo, Core, Itanium, NetBurst, Pentium, and VTune are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the Intel Compiler User and Reference Guides under "Compiler Options." Many library routines that are part of Intel compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel Streaming SIMD Extensions 2 (Intel SSE2), Intel Streaming SIMD Extensions 3 (Intel SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not. Notice revision #20101101


10. Legal Disclaimer (IBM)


U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

THE INFORMATION CONTAINED IN THIS DOCUMENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS DOCUMENTATION, IT IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS DOCUMENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

The results reported in this document were achieved under controlled lab conditions that represent an optimal test case scenario. IBM does not guarantee these results, and individual results will vary.

IBM, the IBM logo, ibm.com, InfoSphere, Information Server, and DataStage are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.

