DataStage - Best Practices

Here is an overview of some design techniques for getting the best possible performance from the DataStage jobs you design.

Tips on performance and tuning of DS jobs

Translating Stages and Links to Processes


When you design a job you see it in terms of stages and links. When it is compiled, the DataStage engine sees it in terms of processes that are subsequently run on the server. How does the DataStage engine define a process? It is here that the distinction between active and passive stages becomes important. Active stages, such as the Transformer and Aggregator, perform processing tasks, while passive stages, such as the Sequential File stage and ODBC stage, read or write data sources and provide services to the active stages. At its simplest, active stages become processes, but the situation becomes more complicated where you connect active stages together or passive stages together.

What happens when you have a job that links two passive stages together? Obviously there is some processing going on. Under the covers DataStage inserts a cut-down Transformer stage between the passive stages which simply passes data straight from one stage to the other, and becomes a process when the job is run. What happens where you have a job that links two or more active stages together? By default this will all be run in a single process. Passive stages mark the process boundaries, with all adjacent active stages between them being run in a single process.
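The mapping described above can be pictured with a small illustrative model. This is plain Python, not DataStage code; the stage names and the group_processes helper are hypothetical and only mimic the grouping rule (passive stages mark the boundaries, adjacent active stages share one process):

    # Illustrative model only - NOT DataStage code.
    # Rule being modelled: passive stages mark process boundaries, and all
    # adjacent active stages between them run as a single process.
    def group_processes(pipeline):
        """pipeline: ordered list of (stage_name, kind), kind is 'active' or 'passive'."""
        processes, current = [], []
        for name, kind in pipeline:
            if kind == "active":
                current.append(name)           # adjacent active stages share a process
            elif current:
                processes.append(current)      # a passive stage closes the group
                current = []
        if current:
            processes.append(current)
        return processes

    # Example job: Sequential File -> Transformer -> Aggregator -> ODBC
    job = [("SeqFile_In", "passive"), ("Transformer", "active"),
           ("Aggregator", "active"), ("ODBC_Out", "passive")]
    print(group_processes(job))   # [['Transformer', 'Aggregator']] -> one process by default

With inter-process row buffering enabled (discussed later in this document), each active stage would instead run as its own process.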

Hash File Design

Poorly designed hashed files can be a cause of disappointing performance. Hashed files are commonly used to provide a reference table based on a single key. Performing lookups can be fast on a well designed file, but slow on a poorly designed one. Another use is to host slowly-growing dimension tables in a star-schema warehouse design. Again, a well designed file will make extracting data from dimension files much faster. There are various steps you can take within your job design to speed up operations that read and write hashed files.

Pre-loading - You can speed up read operations on reference links by pre-loading a hashed file into memory. Specify this on the Hash File stage Outputs page.

Write Caching - You can specify a cache for write operations such that data is written there and then flushed to disk. This ensures that hashed files are written to disk in group order rather than the order in which individual rows are written (which would by its nature necessitate time-consuming random disk accesses). If server caching is enabled, you can specify the type of write caching when you create a hashed file; the file then always uses the specified type of write cache. Otherwise you can turn write caching on at the stage level via the Outputs page of the Hash File stage.

Pre-allocating - If you are using dynamic files you can speed up loading the file by doing some rough calculations and specifying the minimum modulus accordingly. This greatly enhances operation by cutting down or eliminating split operations. You can calculate the minimum modulus as follows: minimum modulus = estimated data size / (group size * 2048). When you have calculated your minimum modulus you can create a file specifying it, or resize an existing file specifying it (using the RESIZE command).

Calculating static file modulus - You can calculate the modulus required for a static file using a similar method to that described above for calculating a pre-allocation modulus for dynamic files: modulus = estimated data size / (separation * 512). When you have calculated your modulus you can create a file specifying it (using the Create File feature of the Hash File dialog box) or resize an existing file specifying it (using the RESIZE command).
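As a worked example of the two modulus formulas above, here is a minimal sketch in Python. It assumes the estimated data size is expressed in bytes; the 400 MB figure, group size of 1, and separation of 4 are made-up values used only to show the arithmetic:

    import math

    def dynamic_minimum_modulus(estimated_data_bytes, group_size=1):
        # minimum modulus = estimated data size / (group size * 2048)
        return math.ceil(estimated_data_bytes / (group_size * 2048))

    def static_file_modulus(estimated_data_bytes, separation=4):
        # modulus = estimated data size / (separation * 512)
        return math.ceil(estimated_data_bytes / (separation * 512))

    data_bytes = 400 * 1024 * 1024                      # assume ~400 MB of reference data
    print(dynamic_minimum_modulus(data_bytes, 1))       # 204800
    print(static_file_modulus(data_bytes, 4))           # 204800

The resulting figure is what you would supply as the minimum modulus when creating the file, or when resizing an existing file with the RESIZE command.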

Diagnosing the Jobs

Once the jobs have been designed it is better to run some diagnostics to see whether performance could be improved. There may be two factors affecting the performance of your DataStage job: it may be CPU limited, or it may be I/O limited. You can obtain detailed performance statistics on a job to identify those parts of the job that might be limiting performance, and so make changes to increase performance. The collection of performance statistics can be turned on and off for each active stage in a DataStage job. This is done via the Tracing tab of the Job Run Options dialog box: select the stage you want to monitor and select the Performance statistics check box. Use shift-click to select multiple active stages to monitor from the list.

Interpreting Performance Statistics

The performance statistics relate to the per-row processing cycle of an active stage, and of each of its input and output links. The information shown is:

Percent - The percentage of overall execution time that this part of the process used.
Count - The number of times this part of the process was executed.
Minimum - The minimum elapsed time in microseconds that this part of the process took for any of the rows processed.
Average - The average elapsed time in microseconds that this part of the process took for the rows processed.

Care should be taken when interpreting these figures. For example, when in-process active-stage-to-active-stage links are used, the percent column will not add up to 100%. Also be aware that, in these circumstances, if you collect statistics for the first active stage, the entire cost of the downstream active stage is included in the active-to-active link. This distortion remains even where you are running the active stages in different processes (by having inter-process row buffering enabled), unless you are actually running on a multi-processor system. If the Minimum figure and Average figure are very close, this suggests that the process is CPU limited. Otherwise, poorly performing jobs may be I/O limited. If the Job Monitor window shows that one active stage is using nearly 100% of CPU time, this also indicates that the job is CPU limited.

Improve Job Performance

CPU Limited Jobs - Single Processor Systems: The performance of most DataStage jobs can be improved by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row. You can turn in-process row buffering on for the whole project using the DataStage Administrator. Alternatively, you can turn it on for individual jobs via the Performance tab of the Job Properties dialog box.

CPU Limited Jobs - Multi-processor Systems: The performance of most DataStage jobs on multi-processor systems can be improved by turning on inter-process row buffering and recompiling the job. This enables the job to run using a separate process for each active stage, and these processes can run simultaneously on separate processors. You can turn inter-process row buffering on for the whole project using the DataStage Administrator. Alternatively, you can turn it on for individual jobs via the Performance tab of the Job Properties dialog box.
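The "Minimum close to Average" rule of thumb from Interpreting Performance Statistics above can be written out as a small sketch. This is plain Python; the 10% tolerance and the sample timings are assumptions chosen for illustration, not values documented by DataStage:

    def looks_cpu_limited(minimum_us, average_us, tolerance=0.10):
        # If the minimum and average per-row elapsed times are very close,
        # the stage is probably CPU limited; a much larger average suggests
        # the rows are spending their time waiting on I/O.
        if minimum_us <= 0:
            return False
        return (average_us - minimum_us) / minimum_us <= tolerance

    # Hypothetical figures read from the performance statistics grid
    print(looks_cpu_limited(minimum_us=42, average_us=45))    # True  -> likely CPU limited
    print(looks_cpu_limited(minimum_us=42, average_us=310))   # False -> likely I/O limited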

CAUTION: You cannot use inter-process row-buffering if your job uses COMMON blocks in

transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

If you have one active stage using nearly 100% of CPU you can improve performance by running multiple parallel copies of the stage process. This is achieved by duplicating the CPU-intensive stage or stages (using a shared container is the quickest way to do this) and inserting a Link Partitioner stage before and a Link Collector stage after the duplicated stages.

I/O Limited Jobs

Although it can be more difficult to diagnose I/O limited jobs and improve them, there are certain basic steps you can take:

If you have split processes in your job design by writing data to a Sequential File and then reading it back again, you can use an Inter Process (IPC) stage in place of the Sequential File stage. This will split the process and reduce I/O and elapsed time, as the reading process can start reading data as soon as it is available rather than waiting for the writing process to finish.

If an intermediate Sequential File stage is being used to land a file so that it can be fed to an external tool, for example a bulk loader or an external sort, it may be possible to invoke the tool as a filter command in the Sequential File stage and pass the data directly to the tool.

If you are processing a large data set you can use the Link Partitioner stage to split it into multiple parts without landing intermediate files.

If a job still appears to be I/O limited after taking one or more of the above steps, you can use the performance statistics to determine which individual stages are I/O limited:

1. Run the job with a substantial data set and with performance tracing enabled for each of the active stages.
2. Analyze the results and compare them for each stage. In particular, look for active stages that use less CPU than others and which have one or more links where the average elapsed time is unusually high.

Once you have identified the stage, the actions you take might depend on the types of passive stage involved in the process. Poorly designed hashed files, in particular, can have performance implications. For all stage types you might consider: redistributing files across disk drives, changing memory or disk hardware, reconfiguring databases, or reconfiguring the operating system.
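The Link Partitioner / Link Collector pattern described above amounts to fanning rows out to several copies of an expensive stage and merging the outputs again. A generic sketch of that idea follows (plain Python with multiprocessing, not DataStage code; the round-robin split and the expensive_transform function are purely illustrative):

    from multiprocessing import Pool

    def expensive_transform(row):
        # Stands in for the CPU-intensive stage that was duplicated.
        return {**row, "total": row["qty"] * row["price"]}

    def transform_partition(part):
        return [expensive_transform(r) for r in part]

    def partition_round_robin(rows, n):
        # What a Link Partitioner does conceptually: split one stream into n.
        return [rows[i::n] for i in range(n)]

    if __name__ == "__main__":
        rows = [{"qty": q, "price": 9.99} for q in range(1000)]
        parts = partition_round_robin(rows, 4)
        with Pool(4) as pool:                                   # four parallel copies of the stage
            results = pool.map(transform_partition, parts)
        collected = [row for part in results for row in part]   # Link Collector: merge streams
        print(len(collected))                                   # 1000 rows back out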

Points of Caution

How you organize a DataStage project depends on the number of jobs you expect to have. The number of jobs roughly corresponds to the number of tables in each target. There is a trade-off between the benefits gained by having fewer projects and the added complexity this creates for MetaStage and Reporting Assistant, the tools that can extract the ETL business rules and allow you to report against them. We would try to keep the number of jobs under 250 in each project. If it gets over 1000 then you see some performance loss when browsing through the jobs; DataStage itself seems to take longer to do things like pull a job up. Some platforms have much less of an issue with this. If you can separate your jobs into projects that never overlap, then do it. If there is some overlap in functionality, then you cannot easily run jobs in two separate projects. Reusability is not an issue: jobs usually cannot be reused, and routines are easily copied from one project to another. Routines are seldom changed; either they work or they do not. Replicating metadata is not a problem either: it does not take long to re-import table definitions, or to export them and import them into another project.

NOTE - If you do not separate projects then you may have issues isolating sensitive data. Financial data, for example, may be sensitive and need specific developers working on it.

Here are some of our observations based on our experience with DataStage:

500+ jobs in a project causes a long refresh time in the DataStage Director. During this refresh, your Director client is completely locked up, and any open edit windows hang until the refresh completes. Increasing the refresh interval to 30 seconds reduces how often the refresh occurs, but does not lessen its impact.

The usage analysis links created on import add a lot of overhead to the import process.

Compiling a routine can take minutes, even for a one-line routine, depending on how many jobs there are and how many jobs use the routine.

A Director refresh will hang a Monitor dialog box until the refresh completes.
