
DataStage Enterprise Edition

Advanced Architecture and Best Practices

NOTE: These slides are Copyright 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Last revision: June 22, 2004


2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.

Welcome!
 This course is intended to provide:
- An overview of the development and runtime architecture of DataStage Enterprise Edition
- Recommendations for parallel job design and best practices

 There is purposely a combination of baseline and advanced material


- Most of this information does not exist in the current course offerings or DataStage documentation
- This material will eventually be rolled into future Essentials and Advanced course offerings

January 21, 2012


DataStage Enterprise Edition Advanced Architecture and Best Practices

 Agenda:
Day 1
- Module 1: Parallel Framework Architecture
- Module 2: Partitioning, Collecting, and Sorting
- Module 3: The Parallel Job Score

Day 2
- Module 4: Best Practices and Job Design Tips
- Module 5: Environment Variables
- Module 6: Introduction to Performance Tuning


DataStage Enterprise Edition


Module 01: Parallel Framework Architecture


Paul Christensen, Solution Architect

Last revision: June 23, 2004



Why You Need to Know This


 DataStage Client is a developer productivity tool
- The GUI is not intended as a replacement for understanding parallel, flow-based ETL design
- DataStage Designer includes intelligence to facilitate quick development of simple flows
- But this is a development environment, not Visio (picture drawing)

 The key to mastering Enterprise Edition is in understanding the DataStage Parallel Framework
- Parallel ETL is a fundamentally different process
- Complex, high-volume flows require an understanding of the underlying engine architecture

 For now (v7.x), you'll ALWAYS need a copy of the OEM (Orchestrate) documentation
Documentation for the DataStage Parallel Framework

DataStage Enterprise Edition Parallel Framework Architecture


DataStage Enterprise Edition Component Architecture

[Diagram: layered component architecture]
- Ascential Applications (DataStage Client); Third Party Applications
- Ascential Data Management Components; Ascential Data Analysis Components; Transformer, BuildOp Components; Third Party Components
- DataStage Parallel Application Framework and Runtime System
- UNIX Operating System / Networking
- Parallel Hardware (SMP, Cluster, MPP)

Introduction to Enterprise Edition


 Parallel processing = executing your application on multiple CPUs
Scalable processing = add more resources (CPUs and disks) to increase system performance
- Example system containing 6 CPUs (or processing nodes) and disks
- Run an application in parallel by executing it on 2 or more CPUs
- Scale up the system by adding more CPUs
- Can add new CPUs as individual nodes, or add CPUs to an SMP node


Traditional Batch Processing

[Diagram: Source (Operational Data, Archived Data) → Transform → Clean → Load → Data Warehouse (Target), with disk staging between each operation]
- Write to disk and read from disk before each processing operation
- Sub-optimal utilization of resources: a 10 GB stream leads to 70 GB of I/O, and processing resources can sit idle during I/O
- Very complex to manage (lots and lots of small jobs)
- Becomes impractical with big data volumes: disk I/O dominates the processing, and terabytes of disk are required for temporary staging


Pipeline Multiprocessing
Think of a conveyor belt moving rows from process to process!
- Transform, clean, and load processes are executing simultaneously
- Rows are moving forward through the flow

[Diagram: Operational Data / Archived Data (Source) → Transform → Clean → Load → Data Warehouse (Target)]

- Start a downstream process while an upstream process is still running
- This eliminates intermediate storing to disk, which is critical for big data
- This also keeps the processors busy
- Still have limits on scalability


Partition Parallelism
 Divide large data into smaller subsets (partitions) across resources
- Goal is to evenly distribute data
- Some transforms require all data within the same group to be in the same partition

[Diagram: Source Data divided into subset1 through subset4 across Node 1 through Node 4, each node running Transform]

 Requires the same transform on all partitions
- BUT: each partition is independent of the others; there is no concept of global state
 Facilitates near-linear scalability (correspondence to hardware resources)


- 8X faster on 8 processors
- 24X faster on 24 processors


Enterprise Edition Combines Partition and Pipeline Parallelism


 Within the Parallel Framework, Pipelining and Partitioning Are Always Automatic
The job developer need only identify:
- Sequential vs. parallel operations (by stage)
- Method of data partitioning
- Configuration file (there are advanced topics here)
- Advanced per-stage options (buffer tuning, combination, etc.)
[Diagram: pipelining from Source Data through Transform, Clean, and Load into the Data Warehouse]

Job Design vs. Execution


User assembles the flow using DataStage Designer

At runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)

No need to modify or recompile your job design!



Example: Three Types of Parallelism

[Diagram: job flow with Sample, Derivation, Link Constraint, Lookup, and Sort stages, annotated with pipeline and data-partition parallelism]

- Explicit parallelism
- Implicit pipeline "parallelism"
- Implicit data-partition parallelism

Defining Parallelism
 Execution mode (sequential/parallel) is controlled by stage definition and properties
- Default = parallel for most Ascential-supplied stages
- Can override the default in most cases through Advanced stage properties; examples where stage usage defines parallelism:
  - Sequential File reads (unless number of readers per node is set)
  - Sequential File targets
  - Oracle Enterprise sources (unless partition table is set)
  - others...

 Degree of parallelism is determined by the configuration file:
- Total number of logical nodes in the nameless default pool, or
- Nodes listed in a [nodemap] or in a named [nodepool]
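As an illustrative sketch (hostnames and filesystem paths are hypothetical), a minimal two-node configuration file might look like the following. Both nodes belong to the default (nameless) pool, and the second also belongs to a named "sort" pool:

```
{
  node "node1" {
    fastname "etlhost"
    pool ""
    resource disk "/orch/node1/data" {}
    resource scratchdisk "/orch/node1/scratch" {}
  }
  node "node2" {
    fastname "etlhost"
    pool "" "sort"
    resource disk "/orch/node2/data" {}
    resource scratchdisk "/orch/node2/scratch" {}
  }
}
```

A job run with $APT_CONFIG_FILE pointing at this file would execute stages in the default pool with two-way parallelism.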

The Parallel Configuration File


 Configuration Files separate configuration (hardware/software) from job design
- Specified per job at runtime by $APT_CONFIG_FILE
- Alter hardware and resources without changing job design

 Defines #nodes = logical processing units with corresponding resources (need not match physical CPUs)
- Dataset, Scratch, Buffer disk (filesystems)
- Optional resources (eg. Database, SAS, etc)
- Advanced topics (pools: named subsets of nodes)

 Multiple configuration files should be used


- Optimize overall throughput and match job characteristics to overall hardware resources
- Provide a runtime throttle on resource usage on a per-job basis

The Parallel Configuration File


{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}


Key aspects:
1. Number of nodes defined (LOGICAL processing entities)
2. Resources assigned to each node (order of entries within each node is significant!)
3. Advanced resource optimizations and configuration (named pools, database, SAS)


DataStage Enterprise Edition Job Compilation


DataStage Designer Parallel Canvas Job Compilation


 The DataStage Designer client generates all code
- Validates link requirements, mandatory stage options, transformer logic, etc.
- Generates an OSH representation of the job data flow and stages
  - GUI stages are representations of Framework operators
  - Stages in parallel shared containers are statically inserted in the job flow
  - Each server shared container becomes a dsjobsh operator

 Generates transform code for each parallel Transformer


- Compiled on the DataStage server into C++ and then into corresponding native operators
- To improve compilation times, previously compiled Transformers that have not been modified are not recompiled
- Force Compile recompiles all Transformers (use after client upgrades)

Buildop stages must be compiled manually, either within the GUI or using the buildop UNIX command line


Viewing Generated OSH


Enable viewing of generated OSH in DS Administrator:

OSH is visible in:


- Job Properties
- Job run log
- View Data
- Table Definitions (Show Schema)


Example Stage / Operator Mapping


Within Designer, stages represent operators, but there is not always a 1:1 correspondence. Examples:
 Sequential File
- Source: import
- Target: export

 Oracle
- Source: oraread
- Sparse Lookup: oralookup
- Target Load: orawrite
- Target Upsert: oraupsert

 DataSet: copy
 Sort (DataStage): tsort
 Aggregator: group
 Row Generator, Column Generator, Surrogate Key Generator: generator

 Lookup File Set
- Target: lookup -createOnly

See OEM OperatorsRef.PDF

Generated OSH Primer


Example of generated OSH for first 2 stages of this job:

- Designer inserts comment blocks to assist in understanding the generated OSH
- Operator order within the generated OSH is the order a stage was added to the job canvas
- OSH uses the familiar syntax of the UNIX shell to create applications for DataStage Enterprise Edition

Each operator specification includes:
- operator name
- schema (for generator, import, export)
- operator options (using "-name value" format)
- input (indicated by n< where n is the input #)
- output (indicated by n> where n is the output #; may include modify)

####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
  (
    a:int32;
    b:string[max=12];
    c:nullable decimal[10,2] {nulls=10};
  )
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a' -asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify ( keep a,b,c; )] 'SortSt:lnk_sorted.v'
;

For every operator, input and/or output datasets (links) are numbered sequentially starting from 0. For example:
    op1 0> dst
    op1 1< src

The following operator input/output data sources are generated by DataStage Designer:
- Virtual data set (name.v)
- Persistent data set (name.ds or [ds] name)

Virtual data set (link) name is used to connect output of one operator to input of another


Terminology

Framework                   DataStage
schema                      table definition
property                    format
type                        SQL type + length [and scale]
virtual dataset             link
record/field                row/column
operator                    stage
step, flow, OSH command     job
Framework                   DS engine

- The GUI uses both terminologies
- Log messages (info, warnings, errors) use Framework terms

DataStage Enterprise Edition Runtime Architecture


Enterprise Edition Runtime Execution


 Generated OSH and the configuration file are used to compose a job SCORE, similar to the way an RDBMS builds a query optimization plan
- Identifies degree of parallelism and node assignment for each operator
- Inserts sorts and partitioners as needed to ensure correct results
- Defines connection topology (datasets) between adjacent operators
- Inserts buffer operators to prevent deadlocks (eg. fork-joins)
- Defines the number of actual UNIX processes
  - Where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements

 Job SCORE is used to fork UNIX processes with communication interconnects for data, message, and control
Set $APT_PM_SHOW_PIDS to show UNIX process IDs in the DataStage log

 It is only after these steps that processing begins


This is the startup overhead of an Enterprise Edition job

 Job processing ends when


- The last row (end of data) is processed by the final operator in the flow, or
- A fatal error is encountered by any operator, or
- The job is halted (SIGINT) by DataStage Job Control or human intervention (eg. DataStage Director STOP)

Viewing the Job SCORE


 Set $APT_DUMP_SCORE to output the Score to the DataStage job log
 For each job run, 2 separate Score dumps are written to the log:
- The first score is actually from the license operator
- The second score entry is the actual job score
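The environment variables discussed in this module can be set in the job or project environment before a run; as a sketch (the configuration file path is illustrative):

```shell
# Point the job at a parallel configuration file (path is illustrative)
export APT_CONFIG_FILE=/opt/dstage/configs/4node.apt

# Dump the job score to the DataStage job log
export APT_DUMP_SCORE=1

# Show UNIX process IDs in the DataStage log
export APT_PM_SHOW_PIDS=1
```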



Example Job Score


 Job scores are divided into two sections:
- Datasets: partitioning and collecting
- Operators: node/operator mapping

 Both sections note sequential or parallel processing

Why 9 Unix processes?



Job Execution: The Orchestra


Conductor: the initial Framework process
- Composes the Score
- Creates Section Leader processes (one per node)
- Consolidates messages to the DataStage log
- Manages orderly shutdown

Section Leader (one per node)
- Forks Player processes (one per stage)
- Manages up/down communication

Players: the actual processes associated with stages
- Combined players: one process only
- Send stderr, stdout to the Section Leader
- Establish connections to other players for data flow
- Clean up upon completion

[Diagram: Conductor (C) on the conductor node communicating with a Section Leader (SL) on each processing node, each managing Player (P) processes]

Default communication:
- SMP: Shared Memory
- MPP: Shared Memory (within a hardware node) and TCP (across hardware nodes)

Runtime Control and Data Networks


[Diagram: the Conductor connected to Section Leader,0 through Section Leader,2 over a control channel (TCP); each Section Leader connected to its generator and copy players by stdout/stderr channels (pipes); players exchange data through an APT_Communicator]

$ osh generator -schema record(a:int32) [par] | roundrobin | copy



Parallel Data Flow


 Think of job runtime as a series of conveyor belts transporting rows for each link
If the stage is parallel, each link will have multiple independent belts (partitions)

Row order is undefined (non-deterministic) across partitions, or across multiple links


Order within a particular link and partition is deterministic
- based on partition type
- and, optionally, sort order

For this reason, job designs cannot include circular references


eg. cannot update a source or reference used in the same flow



DataStage Enterprise Edition Data Types, Conversions, Nullability

January 21, 2012

2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.

31

Data Formats
 The Framework processes only datasets
 For external data, Enterprise Edition must perform conversion operations:
- Format translation using data type mappings
- May also require:
  - Recordization
  - Columnization


 External data formats fall in two major categories:


 Automatic: the conversion is automatic or semi-automatic
- data stored in a relational database (DB2, Informix, Oracle, Teradata)
- data stored in a SAS data set
- mapping rules are documented in OperatorsRef.pdf

 Manual: the user needs to manually specify formats
- everything else: flat text files, binary files
- use the Sequential File stage


Data Sets
Data Sets are the structured internal representation of data within the Parallel Framework
 Consist of:
- Framework schema (format = name, type, nullability)
- Data records (data)
- Partition (subset of rows for each node)

 Virtual Data Sets exist in-memory


Correspond to DataStage Designer links

 Persistent Data Sets are stored on-disk


- Descriptor file (metadata, configuration file, data file locations, flags)
- Multiple data files (one per node, stored in disk resource file systems), eg:
    node1:/local/disk1/
    node2:/local/disk2/

 There is no DataSet operator; the Designer GUI inserts a copy operator



When to Use Persistent Data Sets


 When writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints)
- Stored in native internal format (no conversion overhead)
- Retain data partitioning and sort order (end-to-end parallelism across jobs)
- Maximum performance through parallel I/O

 Data Sets are not intended for long-term or archive storage


- Internal format is subject to change with new DataStage releases
- Requires access to named resources (node names, file system paths, etc)
- Binary format is platform-specific

 For fail-over scenarios, servers should be able to cross-mount filesystems


- Can read a dataset as long as your current $APT_CONFIG_FILE defines the same NODE names (fastnames may differ)
- orchadmin x lets you recover data from a dataset if the node names are no longer available

Caution on using Plug-In MetaData


 DataStage Server plug-ins do not always match the data type definitions used by native Enterprise database stages
Do not use a Plug-In to import Oracle table definitions

 Instead, use ORCHDBUTIL to import Oracle table definitions


Runtime Column Propagation


 Runtime Column Propagation (RCP) allows you to define only part of your table definition (schema). When RCP is enabled, if your job encounters extra columns not defined in the metadata, it will adopt these extra columns and propagate them through the rest of the job.  RCP must be enabled at the project level (it is off by default)
- Can then be enabled/disabled at the job level (Job Properties/Execution)
- Can also be enabled/disabled at the stage level (Output Columns)

 RCP allows maximum re-use of parallel shared containers


- Input and output table definitions only need the columns required by the container stages
- A Parallel Shared Container can be used by multiple jobs with different schemas, as long as the core input/output columns exist
- Must enable RCP in every stage within the parallel shared container


Output Mapping With RCP Disabled


 When RCP is Disabled (default)
- DataStage Designer will enforce Stage Input Column to Output Column mappings
- At job compile time, modify operators are inserted on output links in the generated OSH


Output Mapping With RCP Enabled


 When RCP is Enabled
 DataStage Designer will not enforce mapping rules
- Modify is still inserted at compile time, but:
  - Columns are not removed from output
  - Columns are not renamed unless explicitly dragged to a derivation
 In this example, a runtime error occurs because "Name" will not map to "NAME" (RCP maps by case-sensitive column name)
- Must drag the column name to the derivation column


Type Conversions
 Enterprise Edition provides numerous conversion functions between source and target data types
Default type conversions take place across the output mappings of any parallel stage when runtime column propagation is disabled for that stage
- Variable-length to fixed-length string conversions will pad the remaining length with ASCII NULL (0x0) characters
- Use $APT_STRING_PADCHAR to change the default padding (also used by target Sequential File stages)

- Non-default type conversions require use of a Transformer or Modify (the recommended method)
- Look for warnings in the DataStage log that indicate unexpected conversions!
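As an illustrative sketch of the recommended Modify approach (the column name is hypothetical, and the typed-assignment form shown should be checked against the modify operator section of OperatorsRef.pdf), an explicit string-to-int32 conversion might appear in OSH as:

```
modify 'cust_id:int32 = cust_id'
```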

Source Type to Target Type Conversions


Source Field → Target Field conversion matrix:
- d = there is a default type conversion from the source field type to the destination field type
- e = you can use a Modify or a Transformer conversion function to convert from the source type to the destination type
- a blank cell indicates that no conversion is provided

Types covered as both source and target: int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, ustring, raw, date, time, timestamp

[The individual matrix cells did not survive extraction; see the type conversion table in OperatorsRef.pdf]


Enterprise Edition Nullable Data


 Out-of-band: an internal data value marks a field as NULL
 In-band: a specific user-defined field value indicates a NULL
- Required for Transformer processing
- Disadvantage: must reserve a field value that cannot be used as valid data elsewhere in the flow
- Examples:
  - a numeric field's most negative possible value
  - an empty string

 To convert a NULL representation from an out-of-band to an in-band and vice-versa:


Transformer stage:
- Stage variables: IF ISNULL(linkname.colname) THEN ... ELSE ...
- Derivations: SetNull(linkname.colname)

Modify stage:
- destinationColumnName = handle_null(sourceColumnName, value)
- destinationColumnName = make_null(sourceColumnName, value)
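Expressed as OSH, the Modify options above might look like this (the column name and the in-band value -128 are illustrative):

```
## replace out-of-band NULLs in 'age' with the in-band value -128
modify 'age = handle_null(age, -128)'

## turn the in-band value -128 back into an out-of-band NULL
modify 'age = make_null(age, -128)'
```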

Null Transfer Rules


 When mapping between source and destination columns of different nullability, the following rules apply:
Source Field     Destination Field     Result
not_nullable     not_nullable          Source value propagates to destination.
nullable         nullable              Source value or null propagates.
not_nullable     nullable              Source value propagates; destination value is never null.
nullable         not_nullable          WARNING messages in log. If source value is null, a fatal error occurs. Must handle in Transformer or Modify.


NULLS and Sequential Files


 For NULLABLE columns, the following properties are used when reading from or writing to Sequential Files:
null_field
y A number, string, or C-style literal escape value (eg. \xAB) that defines the NULL value representation

null_length
y Field length that indicates a NULL value (only appropriate for variable-length files)

 Null field representation can be any string, regardless of valid values for actual column data type
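A hedged sketch of the null_field idea during a flat-file import: any field whose raw text equals the configured marker becomes a true NULL on read. The marker string "NULL" here is a hypothetical choice for illustration:

```python
import csv
import io

# Sketch of the null_field property on import: any field whose raw text
# equals the configured marker string becomes a true NULL (None).
# "NULL" is a hypothetical marker; DataStage also accepts C-style
# literal escapes such as \xAB.
NULL_FIELD = "NULL"

raw = io.StringIO("id,name\n1,Henry\n2,NULL\n")
rows = [
    {col: (None if val == NULL_FIELD else val) for col, val in rec.items()}
    for rec in csv.DictReader(raw)
]
# rows[1]["name"] is None; rows[0]["name"] == "Henry"
```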


Lookup and Nullable Columns


 When using Lookup with "If Not Found = Continue", unmatched output rows follow the nullability attributes of the reference link for non-key columns:
If the non-key columns of the reference link are defined as non-nullable, the Lookup stage assigns a "default value" on unmatched records
y The default value depends on the data type*. For example:
 Integer columns default to zero
 Varchar defaults to a zero-length string (distinctly different from a NULL value)
 Char defaults to a string of $APT_STRING_PADCHAR characters, padded to the fixed length
If the non-key columns of the reference link are defined as nullable, the Lookup stage will place NULL values in these columns for unmatched records

[Diagram: Lookup stage with "If Not Found = Continue"; unmatched rows follow the nullability attributes of non-key reference link columns]

TIP: When changing column attributes, be careful to propagate the change through the remaining links of your job design
(including the output column definition of the Lookup stage in this example).

* Data type default values are documented in the OEM UserGuide.pdf

Outer JOINs and Nullable Columns


 Similar to Lookup, when performing an OUTER JOIN (Left Outer, Right Outer, Full Outer), unmatched output rows follow the nullability attributes of the corresponding outer link(s):
If the non-key columns of the outer link(s) are defined as non-nullable, the Join stage will assign a "default value" on unmatched records, based on their data type
If the non-key columns of the outer link(s) are defined as nullable, the Join stage will place NULL values in these columns for unmatched records

[Diagram: Left Outer JOIN and Full Outer JOIN; unmatched rows follow the nullability attributes of non-key columns of the outer link(s)]



Transformer and Null Expressions


 Within a parallel Transformer, any expression that includes a NULL value will produce a NULL result:
1 + NULL = NULL
"John" : NULL : "Doe" = NULL
 When the result of a link constraint or output derivation is NULL, the Transformer will output that row to its reject link (dashed line)
Always create a Transformer reject link in DataStage Designer
Always test for null values before using them in an expression
y IF ISNULL(link.col) THEN ... ELSE ...
y Use stage variables if the test is re-used

v7 Transformer now warns when rows are rejected
v7 also clarifies the naming of output link constraints ("Otherwise")

Framework OEM Documentation


 UserGuide.PDF
Covers framework architecture, parallel processing, partitioning/collecting data, data sets, data types, conversion functions, OSH
Also includes detailed documentation on buildops

 OperatorsRef.PDF
Detailed reference for every built-in operator

 RecordSchema.PDF
Format of Framework schema definition (including import, export, generator)

 DevGuide.PDF, HeaderSorted.PDF, ClassSorted.PDF


low-level Orchestrate C++ APIs for building custom operators

Available in the documentation section (Orchestrate) of Ascential eServices public website



For More Information


 Framework OEM Documentation
User Guide
Operators Reference
Record Schema

 DataStage Enterprise Edition Best Practices and Performance Tuning document


PLEASE send your comments and feedback to:
y cpaul@ascential.com

 Don't be afraid to try!


DataStage Enterprise Edition


Module 01: Parallel Framework Architecture

NOTE: These slides are Copyright 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Paul Christensen Solution Architect

Last revision: June 23, 2004



DataStage Enterprise Edition


Module 02: Partitioning, Collecting, and Sorting Data

NOTE: These slides are Copyright 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Paul Christensen Solution Architect

Last revision: June 22, 2004



Partitioners, Collectors, and Sorting


 Partitioners distribute rows of a single link into smaller segments that can be processed independently in parallel
ONLY before parallel stages

partitioner

collector

 Collectors combine parallel partitions of a single link for sequential processing


ONLY before sequential stages

 Sorting is used to arrange rows into specific groupings and order.


May be parallel or sequential

[Diagram: partitioner between stages running in parallel; collector before a stage running sequentially]


Partitioning and Collecting Icons

Fan-Out Partitioner
Sequential to Parallel

Collector
(Fan-In) Parallel to Sequential

NOTE: Partitioner and Collector icons are ALWAYS drawn Left to Right regardless of how the link is drawn!

Partitioning Data

[Diagram: partitioner feeding a stage running in parallel]


Partitioners
 Partitioners distribute rows of a single link (data set) into smaller segments that can be processed independently in parallel
 Partitioners exist before ANY parallel stage. The previous stage may be running:
Sequentially
y Results in a fan-out operation (and link icon)
In Parallel
y If the partitioning method changes, data is repartitioned

[Diagram: sequential-to-parallel fan-out partitioner; parallel-to-parallel repartitioning icon]

Partition Numbers and Director Job Log


 At runtime, the Parallel Framework determines the degree of parallelism for each stage using:
$APT_CONFIG_FILE
(and optionally) a stage's node pool (Advanced properties)

 Partitions are assigned numbers, starting at zero


Partition number is appended to the stage name for messages written to the DataStage Director job log
stage name

partition #

System Variables for Parallel Derivations


To facilitate parallel calculations regardless of the actual runtime configuration, system variables are provided in the Column / Row Generator and Transformer stages.

 Within Column / Row Generator, two reserved words are provided for numeric cycles:
part: actual partition # (starts at zero)
partcount: total number of partitions at runtime

Example Generator Sequence:
Type = Cycle
Initial value = part
Increment = partcount

Assuming incoming data is round-robin partitioned, for a 4-node configuration file:

Row#   Part   Partcount   Result
1      0      4           0
2      1      4           1
3      2      4           2
4      3      4           3
5      0      4           4
6      1      4           5
7      2      4           6
8      3      4           7

(Rows 1-4 show the initial values; rows 5-8 show the first increment.)

 Starting with v7.1, the Surrogate Key Generator stage can generate a sequence of integer values in parallel:
Internally similar to using the Column Generator stage with the part and partcount keywords
Also supports an initial value for the sequence(s)

 Within the Transformer, the @INROWNUM system variable is generated independently for each node. Instead, use:
@PARTITIONNUM: actual partition number (starts at zero)
@NUMPARTITIONS: total number of partitions

For a 4-node configuration file:
@NUMPARTITIONS = 4
@PARTITIONNUM = 0 through 3
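The cycle above (initial value = part, increment = partcount) can be simulated in Python; generate() is a hypothetical stand-in for the Column Generator cycle logic:

```python
# Simulate the Column Generator cycle: each partition starts at its own
# partition number (part) and increments by the total partition count
# (partcount), so the union across partitions is a gap-free sequence.
# generate() is a hypothetical stand-in for the stage's cycle logic.
def generate(part, partcount, rows):
    return [part + i * partcount for i in range(rows)]

partcount = 4
seqs = [generate(part, partcount, 2) for part in range(partcount)]
# seqs == [[0, 4], [1, 5], [2, 6], [3, 7]]
```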


Selecting a Partitioning Method


 Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition
Ensures that processing is evenly distributed across nodes
y Greatly varied partition sizes (skew) increase processing time

Enable Show Instances in DataStage Director Job Monitor to show data distribution (skew) across partitions:

Setting the environment variable $APT_RECORD_COUNTS outputs row counts per partition to the DataStage log as each stage/node (operator) completes processing

Selecting a Partitioning Method


 Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition Ensures that processing is evenly distributed across nodes
y Greatly varied partition sizes (skew) increase processing time

 Objective 2: Partition method MUST match the stage logic, assigning related records to the same partition if required
Any stage that operates on groups of related data (often using key columns)
y Examples: Aggregator, Join, Merge, Sort, Remove Duplicates, etc. (perhaps also Transformers, Buildops)
The partitioning method needed to ensure correct results may violate Objective #1, depending on the actual data distribution
 Objective 3: Partition method should not be overly complex
The simplest method that meets Objectives 1 and 2
If possible, leverage partitioning performed earlier in a flow

Specifying Partitioning
 Partitioning method is defined on the Input properties, Partitioning category, of any stage running in parallel


Partitioning Methods
Keyless Partitioning (rows are distributed independent of actual data values):
 Same: existing partitioning is not altered
 Round Robin: rows are evenly alternated among partitions
 Random: rows are randomly assigned to partitions
 Entire: each partition gets the entire dataset (rows duplicated)

Keyed Partitioning (rows are distributed at runtime based on values in specified key column(s)):
 Hash: rows with the same key column value(s) go to the same partition
 Modulus: assigns each row of an input dataset to a partition, as determined by a specified numeric key column
 Range: similar to Hash, but the partition mapping is user-determined and partitions are ordered
 DB2: matches DB2 EEE partitioning (discussed in the database chapter)

Auto (the default method):
 DataStage EE chooses an appropriate partitioning method
 Round Robin, Same, or Hash are most commonly chosen


SAME Partitioning
 Keyless partitioning method
 Rows retain their current distribution and order from the output of the previous parallel stage
Doesn't move data between nodes
Retains carefully partitioned data (such as the output of a previous sort)
 Fastest partitioning method (no overhead)

[Diagram: row IDs 0/3/6, 1/4/7, 2/5/8 stay in their existing partitions; SAME partitioning icon]


Impact of SAME Partitioning


 Don't over-use SAME partitioning in a job flow
 Because SAME does not alter existing partitions, the degree of parallelism remains unchanged in the downstream stage
If you read a Sequential File using SAME partitioning (without specifying the Readers Per Node option), the downstream stage will run sequentially!
If you read a persistent Data Set using SAME partitioning, the downstream stage runs with the degree of parallelism used to create the data set, regardless of the current $APT_CONFIG_FILE / specified node pool


Round Robin and Random Partitioning


 Keyless partitioning methods
 Rows are evenly distributed across partitions
Good for initial import of data if no other partitioning is needed
Useful for redistributing data
 Fairly low overhead
 Round Robin assigns rows to partitions like dealing cards
Row/partition assignment will be the same for a given $APT_CONFIG_FILE
 Random distributes rows in random order
Higher overhead than Round Robin
Not subject to regular patterns that might exist in the source data
Row/partition assignment will differ between runs of the same input data

[Diagram: input rows 8..0 dealt round robin into partitions (0 3 6), (1 4 7), (2 5 8)]
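Round robin "dealing" can be sketched as follows (an illustration of the assignment rule only, not the parallel engine itself):

```python
# Round robin "deals" rows to partitions in order; for a fixed partition
# count the row-to-partition assignment is fully deterministic.
def round_robin(rows, partitions):
    parts = [[] for _ in range(partitions)]
    for i, row in enumerate(rows):
        parts[i % partitions].append(row)
    return parts

parts = round_robin(list(range(9)), 3)
# parts == [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```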


Parallel Runtime Example


 Remember, row order is undefined (non-deterministic) across partitions, or across multiple links
 Consider this example job: a Row Generator producing 10 rows {A: Integer, initial_value=1, incr=1}, followed by Round Robin partitioning
Round robin partitioning distributes rows in a specific order to the number of nodes at runtime
But, across nodes, the order in which a particular node outputs its results may change with each run

Results with a 4-node $APT_CONFIG_FILE:
Node 0: 1, 5, 9
Node 1: 2, 6, 10
Node 2: 3, 7
Node 3: 4, 8

With round robin partitioning, rows are distributed in the same order for the same input data and $APT_CONFIG_FILE


ENTIRE Partitioning
 Keyless partitioning method
 Each partition gets a complete copy of the data
Useful for distributing lookup and reference data
y May have performance impact in MPP / clustered environments
On SMP platforms, the Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data
 ENTIRE is the default partitioning for Lookup reference links with Auto partitioning
On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)

[Diagram: input rows 8..0 duplicated by ENTIRE, so every partition receives rows 0, 1, 2, 3, ...]


HASH Partitioning
 Keyed partitioning method
 Rows are distributed according to the values in one or more key columns
Guarantees that rows with an identical combination of values in the key column(s) are assigned to the same partition
Needed to prevent matching rows from "hiding" in other partitions
y eg. Join, Merge, RemDup
Partition size will be relatively equal if the data across the source key column(s) is evenly distributed

[Diagram: key values 0 3 2 1 0 2 3 2 1 1 hashed into partitions (0 3 0 3), (1 1 1), (2 2 2)]
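The hash assignment rule (partition = hash of key column values, mod partition count) can be sketched in Python; crc32 below is an arbitrary stand-in for the Framework's internal hash function, so the actual partition numbers will differ:

```python
import zlib

# Hash partitioning sketch: partition = hash(key column values) mod
# partition count. crc32 is an arbitrary stand-in for the Framework's
# internal hash function -- the real partition numbers will differ.
def hash_partition(row, keys, partitions):
    key = "|".join(str(row[k]) for k in keys)
    return zlib.crc32(key.encode()) % partitions

rows = [{"LName": "Ford"}, {"LName": "Dodge"}, {"LName": "Ford"}]
assigned = [hash_partition(r, ["LName"], 4) for r in rows]
# Identical key values always map to the same partition:
# assigned[0] == assigned[2]
```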


Hash Key Selection


 HASH ensures that rows with the same combination of all key column values are assigned to the same partition

Source Data:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

 Hash on LName with a 4-node config file distributes as:

Partition 0:
ID  LName  FName   Address
5   Dodge  Horace  17840 Jefferson
6   Dodge  John    75 Boston Boulevard

Partition 1:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

 NOTE:
Partition distribution matches source data distribution
In this example, the number of distinct hash key values limits parallelism!

Another Hash Key Example


 Using the same source data, Hash on LName, FName with a 4-node config file:

Part 0:
ID  LName  FName  Address
2   Ford   Clara  66 Edison Avenue
8   Ford   Clara  4901 Evergreen

Part 1:
ID  LName  FName   Address
3   Ford   Edsel   7900 Jefferson
5   Dodge  Horace  17840 Jefferson
9   Ford   Edsel   1100 Lakeshore

Part 2:
ID  LName  FName    Address
4   Ford   Eleanor  7900 Jefferson
6   Dodge  John     75 Boston Boulevard
10  Ford   Eleanor  1100 Lakeshore

Part 3:
ID  LName  FName  Address
1   Ford   Henry  66 Edison Avenue
7   Ford   Henry  4901 Evergreen

 NOTE:
Improved distribution
Only rows with the same unique combination of key columns appear in the same partition
For partitioning purposes, the order of HASH key columns is insignificant
y NOTE: To avoid repartitioning, key column order should be consistent across stages with the same keys

Modulus Partitioning
 Keyed partitioning method
 Rows are distributed according to the values in one integer key column
Simpler (and faster) calculation than HASH, using the modulus (remainder) of division:
partition = MOD(key_value, #partitions)
Guarantees that rows with identical values in the key column end up in the same partition
Partition size will be relatively equal if the data within the key column is evenly distributed

[Diagram: key values 0 3 2 1 0 2 3 2 1 1 assigned by modulus into partitions (0 3 0 3), (1 1 1), (2 2 2)]
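A minimal sketch of the modulus rule, reproducing the key values shown in the diagram:

```python
# Modulus partitioning: a single integer key, no hash computation.
# partition = MOD(key_value, #partitions)
def modulus_partition(key_value, partitions):
    return key_value % partitions

key_values = [0, 3, 2, 1, 0, 2, 3, 2, 1, 1]   # values from the diagram
parts = [[] for _ in range(3)]
for v in key_values:
    parts[modulus_partition(v, 3)].append(v)
# parts == [[0, 3, 0, 3], [1, 1, 1], [2, 2, 2]]
```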


RANGE Partitioning
 Keyed partitioning method
 Rows are evenly distributed according to the values in one or more key columns
Requires pre-processing the data to generate a range map
y More expensive than HASH partitioning
y Must read the entire data TWICE to guarantee results
Guarantees that rows with identical values in the key columns end up in the same partition
 The Write Range Map stage is used to generate the range map file
If the source data distribution is consistent over time, it may be possible to re-use the map file
Values outside of a given range map will land in the first or last partition as appropriate

[Diagram: key values 4 0 5 1 6 0 5 4 3 range-partitioned into ordered partitions (0 1 0), (4 4 3), (5 6 5)]

QUIZ! If incoming data is ordered on the key, something bad happens. WHAT?
ANSWER: The process runs sequentially (key value adjacency)!
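The range-map lookup amounts to a binary search over pre-computed boundary values. This sketch assumes a two-boundary range map for illustration (DataStage builds the boundaries from the data itself) and reproduces the diagram above:

```python
import bisect

# Range partitioning sketch: a pre-computed range map of boundary values
# assigns each key to an ordered partition. The two boundaries below are
# assumed for illustration; DataStage derives them from the data itself.
range_map = [2, 5]   # 3 partitions: keys < 2, 2 <= keys < 5, keys >= 5

def range_partition(key_value, boundaries):
    # Out-of-range keys naturally fall into the first or last partition.
    return bisect.bisect_right(boundaries, key_value)

keys = [4, 0, 5, 1, 6, 0, 5, 4, 3]   # values from the diagram
parts = [[] for _ in range(3)]
for k in keys:
    parts[range_partition(k, range_map)].append(k)
# parts == [[0, 1, 0], [4, 4, 3], [5, 6, 5]]
```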

Example Partitioning Icons


fan-out
Sequential to Parallel

SAME partitioner

re-partition
watch for this!

AUTO partitioner


Automatic Partitioning
 By default, the Parallel Framework inserts partition components as necessary to ensure correct results (check the job score)
Before any stage with Auto partitioning
Generally chooses between ROUND-ROBIN or SAME
Inserts HASH on stages that require matched key values (eg. Join, Merge, RemDup)
Inserts ENTIRE on Normal (not Sparse) Lookup reference links
y NOT always appropriate for MPP/clusters

 Since the Framework has limited awareness of your data and business rules, it is usually best to explicitly specify HASH partitioning when key groupings are required
The Framework has no visibility into Transformer logic
Explicit key-based partitioning is required before SORT and AGGREGATOR (hash method) stages
The Framework may insert un-needed or non-optimal partitioning

Preserve Partitioning Flag


 The preserve partitioning flag is designed for stages that use Auto partitioning
Flag has 3 possible settings:
y Set: instructs downstream stages to attempt to retain partitioning and sort order
y Clear: downstream stages need not retain partitioning and sort order
y Propagate: passes (if possible) the flag setting from input to output links

Set automatically by some operators (eg. Sort, Hash partitioning)
Can be manually set by users through stage Advanced properties
Functionally equivalent to explicitly specifying SAME partitioning
y But allows the Parallel Framework to over-ride and optimize for performance (eg. if the degree of parallelism differs)

 Preserve Partitioning setting is part of the data set structure


y In memory (virtual) and on disk (persistent)

 At runtime, if the Preserve Partitioning flag is set and a downstream operator cannot use the previous partitioning, a warning is issued


Summary: Partitioning Strategy


 Use HASH partitioning when stage requires grouping of related values
Specify only the key columns that are necessary for correct grouping (as long as the number of unique values is sufficient)
Use MODULUS if the group key is a single Integer column
RANGE may be appropriate in rare instances when data distribution is uneven but consistent over time
Know your data!
y How many unique values in the hash key column(s)?

 If grouping is not required, use ROUND ROBIN to redistribute data equally across all partitions
Framework will often do this with AUTO partitioning

 Try to optimize partitioning for the entire job flow



Job SCORE: Data Sets


 The Job SCORE can be used to verify partitioning and collecting methods that are used at runtime
Partitioners and Collectors are associated with datasets (top portion of the SCORE)
Datasets connect a source and a target:
- operator(s) (see the lower portion of the SCORE)
- persistent Dataset(s)
The Partitioner / Collector method is shown between the source and target


Interpreting the Job Score - Partitioning


 The DataStage Parallel Framework implements a producer-consumer data flow model
Upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets)

Producer Consumer
y Partitioning method is associated with the producer
y Collector method is associated with the consumer
y Separated by an indicator:
->  Sequential to Sequential
<>  Sequential to Parallel
=>  Parallel to Parallel (SAME)
#>  Parallel to Parallel (not SAME)
>>  Parallel to Sequential
>   No producer or no consumer

y May also include [pp] notation when Preserve Partitioning flag is set


Optimizing Partitioning
 Minimize the number of re-partitions within and across job flows
Within a flow
y Examine up-stream partitioning and sort order, and attempt to preserve them for down-stream stages using SAME partitioning
y May require re-examining key column usage within stages and processing (stage) order

Across jobs, through a persistent data set
y Data sets retain partitioning AND sort order across flows
y If sort order is significant, write to a persistent data set with the Preserve Partitioning flag SET
y Useful if downstream jobs are run with the same degree of parallelism and require the same partition and sort order

Collecting Data

[Diagram: collector feeding a stage running sequentially]


Collectors
 Collectors combine partitions of a dataset into a single input stream to a sequential Stage
[Diagram: data partitions (NOT links) feeding a collector into a sequential Stage]


Specifying Collector Type


 Collector method is defined on the Input properties, Partitioning category, of any stage running sequentially when the previous stage is running in parallel

Stage running in Parallel

Stage running Sequentially

collector icon


Collector Methods
(Auto)
y Eagerly read any row from any input partition
y Output row order is undefined (non-deterministic)
y This is the default collector method

Round Robin
y Patiently pick rows from input partitions in round robin order
y Slower than Auto; rarely used

Ordered
y Read all rows from the first partition, then the second, ...
y Preserves the order that exists within partitions

Sort Merge
y Produces a single (sequential) stream of rows sorted on specified key column(s), for input already sorted on those keys
y Row order is undefined for non-key columns

Choosing a Collector Method

- In most instances, the Auto collector (the default) is the fastest and most efficient method of collecting data into a sequential stream

- To generate a single stream of sorted data, use the Sort Merge collector for previously-sorted input
  - Input data must be sorted on the collector keys to produce a sorted result
  - Sort Merge does not perform a sort; it simply defines the order in which rows are read from the partitions, using the values in one or more key columns

- The Ordered collector is only appropriate when sorted input has been range-partitioned
  - No sort is required to produce sorted output

- The Round Robin collector can be used to reconstruct the original (sequential) row order of round-robin partitioned inputs
  - As long as intermediate processing (e.g. sort, aggregator) has not altered row order or reduced the number of rows
  - Rarely used in real-life scenarios


Collectors vs. Funnels

Don't confuse a collector with a Funnel stage!

- Collector
  - Operates on a single, partitioned link (a single virtual dataset)
  - Consolidates partitions as the input to a sequential stage
  - Always identified by a fan-in link icon

- Funnel stage
  - A stage that runs in parallel
  - Merges data from multiple links (multiple virtual datasets) into a single output link
  - Table Definitions (schemas) of all input links must match

[Diagram: a collector (fan-in link) compared with a Funnel stage]

Sorting Data


Traditional (Sequential) Sort

- Traditionally, the process of sorting data uses one primary key column and (optionally) multiple secondary key columns to generate a sequential, ordered result set
  - The order of the key columns determines the sequence (and groupings)
  - Each key column specifies an ascending or descending sort order

- This is the method SQL uses for an ORDER BY clause

Source Data:

  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Sorted Result (sort on LName ascending, FName descending):

  ID  LName  FName    Address
  6   Dodge  John     75 Boston Boulevard
  5   Dodge  Horace   17840 Jefferson
  1   Ford   Henry    66 Edison Avenue
  7   Ford   Henry    4901 Evergreen
  4   Ford   Eleanor  7900 Jefferson
  10  Ford   Eleanor  1100 Lakeshore
  3   Ford   Edsel    7900 Jefferson
  9   Ford   Edsel    1100 Lakeshore
  2   Ford   Clara    66 Edison Avenue
  8   Ford   Clara    4901 Evergreen


Parallel Sort

- In most cases, there is no need to globally sort data to produce a single sequence of rows

- Instead, sorting is most often needed to establish order within specified groups of data
  - Join, Merge, Aggregator, Remove Duplicates, etc.
  - This sort can be done in parallel!

- Partitioning is used to gather related rows
  - Assigns rows with the same key column value(s) to the same partition

- Sorting is used to establish grouping and order within each partition, based on one or more key column(s)
  - Key values are adjacent

- Partition and sort keys need not be the same!
  - Often the case before Remove Duplicates
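The two steps — hash partitioning to gather related rows, then an independent sort per partition — can be sketched as follows. This is a simplified model for illustration; `hash_partitions` is a stand-in for the framework's hash partitioner, not its actual code.

```python
# Hash-partition rows on a key, then sort each partition independently.
rows = [
    (1, "Ford", "Henry"), (2, "Ford", "Clara"), (5, "Dodge", "Horace"),
    (6, "Dodge", "John"), (7, "Ford", "Henry"), (8, "Ford", "Clara"),
]

def hash_partitions(rows, key, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        # Same key value always hashes to the same partition
        parts[hash(key(row)) % n].append(row)
    return parts

key = lambda r: (r[1], r[2])            # LName, FName
parts = hash_partitions(rows, key, 4)   # 4-node configuration

# Each partition is sorted independently (conceptually, in parallel)
sorted_parts = [sorted(p, key=key) for p in parts]

# Rows with equal keys are now adjacent, within a single partition
for p in sorted_parts:
    print(p)
```

No partition sees another partition's rows, which is why grouping operations downstream (Aggregator, Remove Duplicates, etc.) can run fully in parallel.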

Example Parallel Sort

- Using the same source data, Hash partition on LName, FName (4-node configuration)
- Within each partition, sort on LName, FName:

Part 0, before and after the parallel sort (unchanged):
  ID  LName  FName    Address
  2   Ford   Clara    66 Edison Avenue
  8   Ford   Clara    4901 Evergreen

Part 1, before:
  ID  LName  FName    Address
  3   Ford   Edsel    7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  9   Ford   Edsel    1100 Lakeshore

Part 1, after the parallel sort:
  ID  LName  FName    Address
  5   Dodge  Horace   17840 Jefferson
  3   Ford   Edsel    7900 Jefferson
  9   Ford   Edsel    1100 Lakeshore

Part 2, before:
  ID  LName  FName    Address
  4   Ford   Eleanor  7900 Jefferson
  6   Dodge  John     75 Boston Boulevard
  10  Ford   Eleanor  1100 Lakeshore

Part 2, after the parallel sort:
  ID  LName  FName    Address
  6   Dodge  John     75 Boston Boulevard
  4   Ford   Eleanor  7900 Jefferson
  10  Ford   Eleanor  1100 Lakeshore

Part 3, before and after the parallel sort (unchanged):
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  7   Ford   Henry    4901 Evergreen


Stages that require Sorted Data

- Stages that process data in groups
  - Aggregator
  - Remove Duplicates
  - Compare (perhaps)
    - If only comparing values, not order, between two sources
  - Transformer, Buildop (perhaps)
    - Depending on internal stage-variable logic

- Lightweight stages that minimize memory usage by requiring data in key-column sort order
  - Join
  - Merge
  - Sort Aggregator

Parallel (Grouped) Sorting Methods

- DataStage Designer provides two methods for parallel (grouped) sorting:
  - Sort stage (in parallel execution mode)
  OR
  - A sort specified on a link, when the partitioning method is not Auto
    - Links with a sort defined display a Sort icon

- By default, both methods use the same internal sort package (the tsort operator)

Sorting on a Link

[Screenshot: right-click on a key column to specify sort options; specify key usage for Sorting, Partitioning, or Both]

- Easier job maintenance (fewer stages on the job canvas)
- BUT fewer options (tuning, features)

Sort Stage

- The Sort stage offers more options than a link sort

- Always specify the DataStage Sort Utility (much faster than the UNIX sort)

Stable Sorts

- A stable sort preserves the order of non-key columns within each sort group

- Stable sorts are slightly slower than non-stable sorts for the same data and keys
  - Only use a stable sort when needed
  - By default, stable sort is enabled on Sort stages!
  - Stable sort is NOT the default for link sorts
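The stability property is easy to demonstrate with Python's built-in sort, which happens to be stable (illustrative only — DataStage's tsort exposes stability as an option rather than guaranteeing it):

```python
# Rows: (LName, ID). Sorting on LName only; a stable sort keeps the
# original relative order of the IDs within each LName group.
rows = [("Ford", 1), ("Dodge", 5), ("Ford", 2), ("Dodge", 6), ("Ford", 3)]

stable = sorted(rows, key=lambda r: r[0])
print(stable)
# [('Dodge', 5), ('Dodge', 6), ('Ford', 1), ('Ford', 2), ('Ford', 3)]
```

A non-stable sort would be free to emit the Ford rows as, say, 3, 1, 2 — correct on the key, but losing the incoming order of the non-key column.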

Resorting on Sub-Groups

- Use the Sort Key Mode property to re-use key column groupings from previous sorts
  - Uses significantly less memory/disk!
    - The sort now operates on previously-sorted key-column groups, not the entire data set
    - Outputs rows after each group

- Key column order is important!

- Don't forget to retain the incoming sort order (e.g. SAME partitioning)
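The idea can be sketched with itertools.groupby — a model of the "Don't Sort, Previously Sorted" key mode, not the tsort implementation. Only one group is buffered at a time, and rows are emitted as each group completes:

```python
from itertools import groupby

# Input already sorted on the primary key (LName); we add a secondary
# sort on ID within each existing group, one group at a time.
rows = [("Dodge", 6), ("Dodge", 5), ("Ford", 3), ("Ford", 1), ("Ford", 2)]

def subgroup_sort(rows, group_key, sub_key):
    for _, group in groupby(rows, key=group_key):
        # Only this group is held in memory; rows are output after
        # each group, not after the entire data set.
        yield from sorted(group, key=sub_key)

result = list(subgroup_sort(rows, lambda r: r[0], lambda r: r[1]))
print(result)
# [('Dodge', 5), ('Dodge', 6), ('Ford', 1), ('Ford', 2), ('Ford', 3)]
```

This also shows why the incoming order must be retained: groupby (like the Sort stage in this mode) assumes equal group-key values are adjacent, and produces wrong groupings if they are not.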

Partitioning and Sort Order

- When partitioning data (with any method except SAME), sort order is not maintained

- To restore row order / groupings, a sort is required after any repartitioning

[Diagram: sorted input partitions (1 2 3 and 103 102 101) pass through a partitioner; in the resulting partitions (2 101 3 and 1 103 102) the order is lost]


Sequential (Total) Sorting Methods

- Within Enterprise Edition, DataStage provides two methods for generating a sequential (totally sorted) result:
  - Sort stage (in sequential execution mode)
  OR
  - Sort Merge collector (for sorted input)

- In general, a parallel Sort plus a Sort Merge collector will be MUCH faster than a sequential Sort
  - Unless the data is already sequential
  - (Similar to how databases sort in parallel)

Automatic Sorting

- By default, the Parallel Framework inserts sort operators as necessary to ensure correct results
  - Before any stage that requires matched key values (e.g. Join, Merge, Remove Duplicates)
  - Only inserted when the user has NOT explicitly defined an input sort
  - Check the job SCORE for inserted tsort operators:

    op1[4p] {(parallel inserted tsort operator
      {key={value=LastName}, key={value=FirstName}}(0))
      on nodes (
        node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3]
      )}

- For versions 7.0.1 and later, set $APT_SORT_INSERTION_CHECK_ONLY to change the behavior of automatically inserted sorts
  - Instead of actually performing the sort, the inserted sort operators only VERIFY that the data is sorted; the score then shows the keys flagged with subArgs={sorted}:

    op1[4p] {(parallel inserted tsort operator
      {key={value=LastName, subArgs={sorted}},
       key={value=FirstName, subArgs={sorted}}}(0))
      on nodes (
        node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3]
      )}

  - If the data is not sorted properly at runtime, the job will fail
  - Recommended only on a per-job basis, during performance tuning

Sort Resource Usage

- By default, each sort uses a 20 MB per-partition internal memory buffer
  - This includes user-defined (link, stage) and framework-inserted sorts
  - A different size can be specified for each Sort stage using the "Restrict Memory Usage" option
    - Increasing this value can improve performance, especially if the entire data partition (or group) fits into memory
    - Decreasing this value may hurt performance, but will use less memory (the minimum is 1 MB per partition)
    - From Designer, this option is unavailable for link sorts

- When the memory buffer is filled, sort uses temporary disk space in the following order:
  - Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
  - Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
  - The default directory specified by $TMPDIR
  - The UNIX /tmp directory

Optimizing Sort Performance

- Minimize the number of sorts within a job flow
  - Each sort interrupts the parallel pipeline - it must read all rows before generating output

- Specify only the necessary key columns

- Avoid stable sorts unless needed to retain the order of non-key column data

- If possible, use the Sort Key Usage key column option to re-use previous sort keys

- Within the Sort stage, adjusting the "Restrict Memory Usage" option may improve performance

Partitioning Examples


Partitioning Example 1

- Scenario: assign an average value to existing detail rows

- Standard solution (3 hash/sorts):
  - Copy the data; Hash and Sort on all inputs to the Aggregator and the Join
  - This is also the method the framework would use with Auto partitioning to ensure correct results

[Job design: Copy feeds the Aggregator and the Join; the Aggregator output also feeds the Join]

- Notice that all 3 hash partitioners and sorts use the same key columns and order!

Example 1 - Optimized Solution

- Optimize partitioning keys (and sort order) across multiple stages in a single flow
  - To minimize re-sorts and re-partitions

- Optimized solution (1 hash/sort):
  - Move the Hash/Sort upstream, before the Copy
  - Use SAME partitioning to preserve partitioning and sort order

[Job design: Partition and Sort on key column(s) -> Copy -> Aggregate and Join; SAME partitioning retains the previous sort order, so the inputs to the Join do not need to be sorted again]
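The optimization can be modeled in a few lines (illustrative only, not DataStage stages): sort once on the group key, then both the aggregate and the join-back consume the already-ordered data.

```python
from itertools import groupby

# One sort on the group key; both downstream consumers
# (the aggregate and the join back to detail rows) reuse it.
rows = [("Ford", 10.0), ("Dodge", 4.0), ("Ford", 20.0), ("Dodge", 6.0)]
rows.sort(key=lambda r: r[0])          # the single hash/sort

out = []
for k, group in groupby(rows, key=lambda r: r[0]):
    group = list(group)
    avg = sum(v for _, v in group) / len(group)   # Aggregator
    for lname, value in group:                    # Join back to detail rows
        out.append((lname, value, avg))

print(out)
# [('Dodge', 4.0, 5.0), ('Dodge', 6.0, 5.0), ('Ford', 10.0, 15.0), ('Ford', 20.0, 15.0)]
```

The point mirrors the job design: sorting (and partitioning) once upstream lets every later grouping step ride on the same key order.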

Example 1: Sort Insertion

- Looking at the job SCORE for the optimized solution, the framework inserts sorts before each Join input to ensure correct results, regardless of the partitioning method chosen:

    op3[4p] {(parallel inserted tsort operator
      {key={value=LastName}, key={value=FirstName}}(0))
      on nodes (
        node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3]
      )}
    op4[4p] {(parallel inserted tsort operator
      {key={value=LastName}, key={value=FirstName}}(0))
      on nodes (
        node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3]
      )}

- In this example we don't want these extra sorts

- To change the behavior of framework-inserted sorts for this job, set $APT_SORT_INSERTION_CHECK_ONLY
  - The inserted sorts will verify row order at runtime, but will not actually sort the data

Partitioning Example 2: Header / Detail

- Know your data - HASH guarantees correct grouping results, but it is not always the most efficient

- Scenario: Header / Detail processing
  - Assign data from the header row to all detail rows
  - Use a Transformer to:
    - Separate the header and detail data
    - Add a Join key column (constant value) to both outputs

[Job design: Src -> Transformer, splitting into Header and Detail links that are joined back to produce Out]

- NOTE: since the Join key value is constant, the inputs to the JOIN stage should NOT be sorted

Partitioning Example 2: Solutions

- Solution 1: Hash on the key columns and Join
  - This is the standard approach
  - It is also the method the framework would use with Auto partitioning
  - BUT there is only one hash key value, so the Join runs sequentially

- Solution 2: Use Entire to copy the header data to all partitions
  - Distribute the detail data using Round Robin
  - The Join will now run in parallel

- For either solution, to counteract framework-inserted sorts, set $APT_SORT_INSERTION_CHECK_ONLY

- But there is still a possible problem with either solution!

Introducing the Buffer Operator

- At runtime, the framework automatically inserts buffer operators to prevent deadlocks and to optimize overall performance
  - For job flows with a fork-join, buffer operators are inserted on all inputs to the downstream joining operator
    - Any link split that is later combined in the same job flow

[Diagram: Stage 1 forks to Stage 2 and directly to Stage 3; buffer operators sit on both inputs to Stage 3]

  - Buffer operators may also be inserted in an attempt to match producer and consumer rates

- Data is never repartitioned across a buffer operator
  - First-in, first-out row processing

- Some stages (e.g. Sort, Hash Aggregator) internally buffer the entire data set before outputting a row
  - Buffer operators are never inserted after these stages

Identifying Buffer Operators

- At runtime, buffers are identified in the operators section of the job SCORE:

    It has 6 operators:
    op0[1p] {(sequential Row_Generator_0)
      on nodes ( ecc3671[op0,p0] )}
    op1[1p] {(sequential Row_Generator_1)
      on nodes ( ecc3672[op1,p0] )}
    op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3)
      on nodes ( ecc3671[op2,p0] )}
    op3[4p] {(parallel buffer(0))
      on nodes ( ecc3671[op3,p0] ecc3672[op3,p1] ecc3673[op3,p2] ecc3674[op3,p3] )}
    op4[4p] {(parallel APT_CombinedOperatorController:
        (APT_LUTProcessImpl in Lookup_3)
        (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
        (PeekNull) )
      on nodes ( ecc3671[op4,p0] ecc3672[op4,p1] ecc3673[op4,p2] ecc3674[op4,p3] )}
    op5[1p] {(sequential APT_RealFileExportOperator in Sequential_File_12)
      on nodes ( ecc3672[op5,p0] )}
    It runs 12 processes on 4 nodes.

- For more details on buffering, see the OEM UserGuide.PDF, Appendix A


How Buffer Operators Work

- The primary goal of a buffer operator is to prevent deadlocks

- This is accomplished by holding rows until the downstream operator is ready to process them
  - Rows are held in memory up to the size defined by $APT_BUFFER_MAXIMUM_MEMORY
    - The default is 3 MB per buffer, per partition
  - When buffer memory is filled, rows are spilled to disk
    - By default, up to the amount of available scratch disk, unless a QUEUE UPPER BOUND limit has been set

[Diagram: Producer -> Buffer -> Consumer]
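A toy model of this hold-and-spill behavior (illustrative only; the real buffer operator works on blocks of serialized rows and real scratch-disk files):

```python
from collections import deque

class BufferOperator:
    """Toy buffer: holds rows in memory up to a limit, then spills.

    max_rows stands in for $APT_BUFFER_MAXIMUM_MEMORY; the spill
    list stands in for scratch-disk files.
    """
    def __init__(self, max_rows=3):
        self.memory = deque()
        self.spill = []              # stand-in for scratch disk
        self.max_rows = max_rows

    def put(self, row):              # producer side: never blocks
        if len(self.memory) < self.max_rows:
            self.memory.append(row)
        else:
            self.spill.append(row)   # memory full: spill to "disk"

    def get(self):                   # consumer side: strict FIFO
        row = self.memory.popleft()
        if self.spill:               # refill memory from "disk"
            self.memory.append(self.spill.pop(0))
        return row

buf = BufferOperator(max_rows=3)
for r in range(6):                   # fast producer, idle consumer
    buf.put(r)
held = (len(buf.memory), len(buf.spill))
drained = [buf.get() for _ in range(6)]
print(held)     # (3, 3)
print(drained)  # [0, 1, 2, 3, 4, 5]
```

Because the producer never blocks, a fork-join flow cannot deadlock on a full pipe; the cost is memory and, past the limit, scratch-disk I/O — while row order stays strictly first-in, first-out.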

Buffer Flow Control

- When buffer memory usage reaches $APT_BUFFER_FREE_RUN, the buffer operator offers resistance to new rows, slowing down the rate of the upstream producer
  - The default is 0.5 (50%)
  - Setting $APT_BUFFER_FREE_RUN > 1 (100%) prevents the buffer from slowing down the upstream producer until a data size of $APT_BUFFER_MAXIMUM_MEMORY * $APT_BUFFER_FREE_RUN has been buffered
    - This assumes the overhead of disk I/O for buffer scratch usage is less than the impact of slowing down the upstream operator

[Diagram: Producer -> Buffer -> Consumer; once usage passes $APT_BUFFER_FREE_RUN, the buffer offers resistance to new rows, slowing the upstream producer]


Tuning Buffer Settings

- On a per-job basis, through environment variables:
  - $APT_BUFFER_MAXIMUM_MEMORY
  - $APT_BUFFER_FREE_RUN
  - $APT_BUFFER_DISK_WRITE_INCREMENT
  - And many other advanced options

- On a per-link basis (Inputs/Outputs -> Advanced)
  - Buffer options are defined per link (virtual dataset); hence the Output of one stage is the Input of the following stage
  - In general, Auto buffering (the default) is recommended
    - Don't change it unless you really understand your job flow and data!
    - Disabling buffering may cause the job to deadlock (hang)

- In general, buffer tuning is an advanced topic. The default settings should be appropriate for most job flows. For very wide rows, it may be necessary to increase the default buffer size to hold enough rows in memory
  - Calculate the total record width using the internal storage for each data type / length / scale. For variable-length (varchar) columns, use the maximum length.


Buffer Resource Usage

- By default, each buffer operator uses 3 MB per partition of virtual memory
  - This can be changed through the Advanced link properties, or globally using $APT_BUFFER_MAXIMUM_MEMORY

- When buffer memory is filled, temporary disk space is used in the following order:
  - Scratch disks in the $APT_CONFIG_FILE "buffer" named disk pool
  - Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
  - The default directory specified by $TMPDIR
  - The UNIX /tmp directory

End of Data / End of Data Group

- Stages that process groups of data (e.g. Join, Merge, Remove Duplicates, Sort Aggregator) cannot output a row until:
  - Data in the grouping key column(s) changes (a logical End of Data Group), or
  - All rows have been processed (End of Data)

- For stages that process groups, rows are buffered in memory until an End of Data Group or End of Data

- Some stages (e.g. Sort, Hash Aggregator) must read the entire input data set (until End of Data) before outputting a single record
  - Setting the "Don't Sort, Previously Sorted" key option changes the Sort behavior to output on groups instead of the entire data set

Revisiting Example 2: Buffering Impact

- For large data volumes, buffering introduces a possible problem with this solution:
  - At runtime, buffer operators are inserted for this fork-join scenario
  - The Join stage, operating on key-column groups, is unable to output rows until End of Data Group or End of Data

- Since the Transformer generates only one header row, with no subsequent change in the join column, data is buffered until End of Data

[Job design: Src -> Transformer forking into Header and Detail links; buffer operators sit on both inputs to the Join producing Out]

- Solution: use stage variables to hold the header data values, and output multiple header rows with different join-key values

- This additional logic may impact Transformer performance
  - The proper solution ultimately depends on data volume and available hardware resources

Revisiting Example 2: Buffering Solution

- Define stage variables to hold the header-row values
  - Set the initial values to empty
  - Only set the header values when the header row is identified

- Header link:
  - Use output link constraints to only output data after the header values have been captured
  - Assign more than one join key value using @INROWNUM
    - Assumes only one header row

- Detail link:
  - Assign a constant value to the detail join column

Join Stage: Internal Buffering

- Even for inner joins, there is a difference between the inputs of a Join stage!
  - The first link (#0, LEFT within Link Ordering) establishes the driver input - rows are read one at a time
  - For non-unique key values, all rows within the same key value group are read into memory from the second link (#1, RIGHT by Link Ordering)

- For Example 2, the single header row must be the second input link (#1) to the Join stage
  - Otherwise, all input data will be read into virtual memory
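This asymmetry can be sketched as a simplified sort-merge inner join (not the Join stage's actual code): the left input streams row by row, while only the current right-side key group is held in memory.

```python
from itertools import groupby

def merge_join(left, right, key):
    """Inner join of two inputs sorted on `key`.

    Left rows stream one at a time; only the current right-side
    key group is buffered in memory.
    """
    right_groups = groupby(right, key=key)
    cur_key, cur_group = None, []
    for lrow in left:                      # driver input: row at a time
        while cur_key is None or cur_key < key(lrow):
            try:
                cur_key, grp = next(right_groups)
                cur_group = list(grp)      # one key group in memory
            except StopIteration:
                return
        if cur_key == key(lrow):
            for rrow in cur_group:
                yield lrow, rrow

left = [("a", 1), ("b", 2), ("b", 3), ("c", 4)]
right = [("a", 10), ("b", 20), ("b", 21)]
joined = list(merge_join(left, right, key=lambda r: r[0]))
print(joined)
```

Swap the links and the memory cost swaps with them: if the large input sits on the buffered side with a single key value, the whole input ends up in memory — which is exactly the Example 2 trap.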


Avoiding Buffer Contention

- Datasets do not buffer - there is no upstream operation that would prevent rows from being output

- In some cases, the best solution to avoiding fork-join buffer contention is to split the job, landing results to intermediate datasets
  - Develop a single job first
  - If performance / volume testing indicates a buffering-related performance issue that cannot be resolved by adjusting buffering settings, then split the job across intermediate datasets

Example 2: Why Not Use Lookup?

- Lookup cannot output any rows until ALL reference link data has been read into memory (End of Data)
  - Except for Sparse database lookups

- NEVER generate Lookup reference data using a fork-join of the source data
  - Separate the creation of the lookup reference data from the lookup processing

[Job design: the Header reference data is created separately (HeaderRef); Src -> Detail processing performs the lookup against it to produce Out]


Summary

- Partitioning
  - The method should ensure correct results AND (if possible) evenly distribute data
  - Be aware of data distribution and its impact on processing

- Collecting
  - Used to consolidate partitioned data into a sequential process

- Sorting
  - Parallel sorting establishes row order within groups
    - Partitioning gathers related rows
  - Sequential sorting is only needed to produce a single, globally sorted sequential result set

DataStage Enterprise Edition

Module 02: Partitioning, Collecting, and Sorting Data

NOTE: These slides are Copyright 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Paul Christensen
Solution Architect

Last revision: June 22, 2004

DataStage Enterprise Edition

Module 03: The Parallel Job Score

Paul Christensen
Solution Architect

Last revision: June 22, 2004

The Parallel Job SCORE

- The job SCORE details the optimization plan used by the DataStage Parallel Framework to run a given job design, based on the specified $APT_CONFIG_FILE
  - Similar to the way a parallel RDBMS builds a query plan

- Identifies the degree of parallelism and node assignment(s) for each operator
  - Details the mappings between functional (stage/operator) and actual UNIX processes
  - Includes buffer operators inserted to prevent deadlocks and optimize data flow rates between stages
  - Can be used to identify sorts and partitioners that have been automatically inserted to ensure correct results

- Outlines the connection topology (datasets) between adjacent operators and/or persistent datasets

- Defines the number of actual UNIX processes
  - Where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements

Viewing the Job SCORE

- Set $APT_DUMP_SCORE to output the score to the DataStage job log
  - Can be enabled at the project level to apply to all jobs

- For each job run, 2 separate score dumps are written to the log
  - The first score is actually from the license operator
  - The second score entry is the actual job score

[Screenshot: job log showing the license operator score followed by the actual job score]

Example Job Score

- Job scores are divided into two sections:
  - Datasets (partitioning and collecting)
  - Operators (node/operator mapping)

- Both sections note sequential or parallel processing

Job SCORE: Operators


The operators (lower) section of the job score details the mapping between stages and the actual processes created at runtime:
- Operator combination
- Operator-to-node mappings
- Degree of parallelism per operator
- Framework-inserted sorts
- Buffer operators

op0[1p] {(sequential APT_CombinedOperatorController:
    (Row_Generator_0)
    (inserted tsort operator {key={value=LastName}, key={value=FirstName}})
  ) on nodes (
    node1[op0,p0]
)}
op1[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0)) on nodes (
    node1[op1,p0]
    node2[op1,p1]
    node3[op1,p2]
    node4[op1,p3]
)}
op2[4p] {(parallel buffer(0)) on nodes (
    node1[op2,p0]
    node2[op2,p1]
    node3[op2,p2]
    node4[op2,p3]
)}


Operator Combination
At runtime, the DataStage Parallel Framework can only combine stages (operators) that:
- Use the same partitioning method
  - Repartitioning prevents operator combination between the corresponding producer and consumer stages
  - Implicit repartitioning (e.g. sequential operators, node maps) also prevents combination
- Are combinable
  - Set automatically within the stage/operator definition
  - Can also be set within DataStage Designer: Advanced stage properties

Composite Operator Example: Lookup


The Lookup stage is a composite operator
- Internally it contains more than one component, but to the user it appears to be one stage
  - LUTCreateImpl: reads the reference data into memory
  - LUTProcessImpl: performs the actual lookup processing once the reference data has been loaded
- At runtime, each internal component is assigned to operators independently

op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3)
  on nodes (
    ecc3671[op2,p0]
)}
op3[4p] {(parallel buffer(0)) on nodes (
    ecc3671[op3,p0]
    ecc3672[op3,p1]
    ecc3673[op3,p2]
    ecc3674[op3,p3]
)}
op4[4p] {(parallel APT_CombinedOperatorController:
    (APT_LUTProcessImpl in Lookup_3)
    (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
    (PeekNull)
  ) on nodes (
    ecc3671[op4,p0]
    ecc3672[op4,p1]
    ecc3673[op4,p2]
    ecc3674[op4,p3]
)}


Job SCORE: Data Sets


The job score can be used to verify the partitioning and collecting methods used at runtime
- Partitioners and collectors are associated with datasets (the top portion of the score)
- Datasets connect a source and a target:
  - operator(s) (see the lower portion of the score)
  - persistent dataset(s)
- The partitioner / collector method is shown between the source and target


Interpreting the Job Score - Partitioning


The DataStage Parallel Framework implements a producer-consumer data flow model
- Upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets)
- The partitioning method is associated with the producer; the collector method is associated with the consumer
  - eCollectAny is specified for parallel consumers, although no collection occurs!
- Producer and consumer are separated by an indicator:
  ->  Sequential to Sequential
  <>  Sequential to Parallel
  =>  Parallel to Parallel (SAME)
  #>  Parallel to Parallel (not SAME)
  >>  Parallel to Sequential
  >   No producer or no consumer
- May also include [pp] notation when the Preserve Partitioning flag is set

Using the Job SCORE


$APT_DUMP_SCORE=1 (True) is a recommended default (project-level) setting for all jobs
At runtime, the job score can be examined to identify:
- Number of UNIX processes generated for a given job and $APT_CONFIG_FILE
- Operator combination
- Partitioning methods between operators
- Framework-inserted components
  - Including sorts, partitioners, and buffer operators


DataStage Enterprise Edition


Module 03: The Parallel Job Score


Paul Christensen, Solution Architect

Last revision: June 22, 2004



DataStage Enterprise Edition


Module 04: Best Practices and Job Design Tips


Paul Christensen, Solution Architect

Last revision: June 22, 2004



Assumptions
This module assumes that you have an understanding of the topics covered in:
- Module 01: Parallel Framework Architecture
- Module 02: Partitioning, Collecting, and Sorting
- Module 03: The Parallel Job Score
- Material covered in DS324PX: DataStage Enterprise Edition Essentials


DataStage Enterprise Edition


Job Design Tips


Overall Job Design


Ideal job design must strike a balance between performance, resource usage, and restartability
In theory, best performance results from processing all data in memory without landing to disk
- Requires hardware resources (e.g. CPU, memory) and UNIX resources (e.g. ulimit, nfiles, etc.)
  - Resource usage grows rapidly with the degree of parallelism and the number of stages in a flow
  - Must also consider what else is running on the server(s)
- May not be possible with very large amounts of data
  - e.g. Sort will use scratch disk if the data is larger than its memory buffer
- Business rules may dictate job boundaries
  - e.g. dimensional maintenance before Fact table processing/load
  - e.g. Lookup reference data must be created before lookup processing

Modular Job Design


Parallel shared containers facilitate modular job design by creating re-usable components (stages, logic)
- Runtime column propagation allows maximum parallel shared container re-use (only need to define the columns used within the container logic)
- The total number of stages in a job includes the total of all stages in all parallel shared containers
Job parameters and multi-instance job properties facilitate job re-use
Land intermediate results to parallel data sets

Establishing Job Boundaries


Business requirements
Functional / DataStage requirements
Establish restart points in the event of a failure
- Segment long-running steps
- Separate the final database Load from Extract and Transformation steps
Resource utilization (number of stages, etc.)
Performance
- Fork-join job flows may run faster if split into two separate jobs with intermediate datasets
  - Depends on processing requirements and the ability to tune buffering

Job Sequences
Job Sequences can be used to combine individual jobs into functional modules to perform a sequence of activities
Starting with DataStage release 7.1, Job Sequences can be restartable:
- In the event of a failure, rerunning the sequence will not rerun activities that completed successfully
- It is the developer's responsibility to ensure that an individual job can be re-run after a failure
- The "do not checkpoint run" sequence stage property will execute that step on every sequence run
- Enable sequence restart in Job Properties (enabled by default)


Job Design Stage Usage Tips


Sequential File
- Optimizing performance
- Reading and writing fixed-width files
- Adjusting write buffer size
Column Import
Lookup
Sort
Aggregator
Transformer
Database Stages



Reading a Sequential File in Parallel


By default, Sequential File reads are not parallel unless multiple files are specified
The Readers Per Node option can be used to read a single input file in parallel at evenly spaced offsets
- Note that sequential row order cannot be maintained when reading a file in parallel
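The even-offset approach can be sketched in Python (a hypothetical illustration of the general technique, not DataStage internals — `split_reads` and `record_boundary` are invented names): each reader takes an evenly spaced byte range and aligns it to record boundaries so that no row is split between readers.

```python
def record_boundary(data: bytes, pos: int) -> int:
    """Offset of the first record start at or after pos, where records
    are newline-terminated."""
    if pos <= 0:
        return 0
    nl = data.find(b"\n", pos - 1)
    return len(data) if nl < 0 else nl + 1

def split_reads(data: bytes, readers: int):
    """Divide a newline-delimited buffer among `readers` at evenly
    spaced byte offsets, aligned so no record is split between readers."""
    size = len(data)
    chunks = []
    for r in range(readers):
        start = record_boundary(data, r * size // readers)
        # The last reader runs to end-of-file; the others stop at the
        # first record boundary past their nominal range end.
        end = size if r == readers - 1 else record_boundary(data, (r + 1) * size // readers)
        chunks.append(data[start:end])
    return chunks
```

Because the readers run concurrently, rows from the different ranges interleave downstream, which is why sequential row order cannot be maintained.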


Partitioning and Sequential Files


Sequential File sources (import operator) create one partition for each input file
- Always follow a Sequential File with ROUND ROBIN or another appropriate partitioning type
- NEVER follow a Sequential File source with SAME partitioning
  - If reading from one file, this will cause the downstream flow to run sequentially!
  - SAME is only appropriate in unusual scenarios where the source data is already separated into multiple files by partition


Capturing Sequential File Rejects


The Sequential File stage supports an optional reject link to capture rows that do not match the source or target format
- The reject schema is a single raw (binary) column
- Be careful writing rejects to another Sequential File
- Easiest to output rejects to a Data Set (with a Peek for debugging)


Sequential File Tips


To write fixed-length files from variable-length fields, use the following column properties:
- field width: specifies the output column width
- pad string: specifies the character used to pad data to the specified field width (if not specified, an ASCII NULL character 0x0 is used for padding)

When reading delimited files, extra characters are silently truncated for source file values longer than the maximum specified length of VARCHAR columns
- Starting with v7.01, set the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to reject these records instead

Buffering Sequential File Writes


By default, Sequential File targets (export operator) buffer writes to optimize performance
- Buffers are automatically flushed when the job completes successfully

For real-time applications, the environment variable $APT_EXPORT_FLUSH_COUNT can be used to specify the number of rows to buffer
- For example, $APT_EXPORT_FLUSH_COUNT=1 flushes to disk after every row
- Setting this value low incurs a SIGNIFICANT performance penalty!


Using Column Import

The Column Import stage can be used to improve the performance of non-parallel Sequential File reads and FTP sources
- Allows column parsing to run in parallel
- Separates parsing (CPU) from sequential source I/O
- Define the source file/FTP as a single column
  - Type RAW or [VAR]CHAR
  - Maximum length = record size
  - Note that there are metadata implications
- Define columns, data types, and other format options in the Column Import stage
  - Similar to a Sequential File definition

Lookup Stage Usage


The Lookup stage is most appropriate when reference data is small enough to fit into physical (shared) memory
- For reference datasets larger than available memory, use the JOIN or MERGE stage

Limit use of Sparse Lookup (for DB2 and Oracle reference tables)
- Per-row database lookups are extremely expensive (slow)
  - For small numbers of rows, can be used for database-generated variables / function results
- ONLY appropriate when the number of input rows is significantly smaller (e.g. 1:100) than the number of reference rows

Lookup Reference Data Partitioning


ENTIRE is the default partitioning for Lookup reference links with Auto partitioning
- On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)

On SMP platforms, the Lookup stage uses shared memory instead of duplicating ENTIRE reference data

To minimize data movement across nodes on clustered / MPP platforms, it may be appropriate to select a keyed partitioning method
- Especially if data is already partitioned on those keys
- Input and reference data partitioning must match

Lookup Reference Data
- NEVER generate Lookup reference data using a fork-join of source data
  - Lookup cannot output rows until all reference data has been read into memory (except for Oracle or DB2 Sparse reference links)

(Diagram: a fork-join flow — Src forks into Header and Detail paths; Header feeds the HeaderRef reference link of the Lookup on the Detail path, producing Out)
 Use Lookup File Sets to separate the creation of lookup reference data from lookup processing

Lookup File Sets


Lookup File Sets should be used to store reference data on disk
- Data is stored in native format, partitioned, and pre-indexed on the lookup key column(s)
- Key column(s) and partitioning are specified when the file is created

Lookup File Sets can only be used as the reference input link to a Lookup stage
- The partitioning method and key columns specified when the Lookup File Set is created will be used to process the reference data on subsequent Lookups that use this file

Particularly useful when static reference data can be reused in multiple jobs (or runs of the same job)

Aggregator
The Aggregator stage summarizes data based on groupings of key-column values
- Input partitioning must match the desired groupings

Use the Hash method for inputs with a limited number of distinct key-column values
- Uses 2K of memory per group
- Incoming data does not need to be pre-sorted
- Results are output after all rows have been read
- Output row order is undefined
  - Even if the input data is sorted

Use the Sort method with a large (or unknown) number of distinct key-column values
- Requires input pre-sorted on key columns
- Results are output after each group

Sequential (Total) Aggregations


To summarize all input rows:
- Generate a constant-value key column
  - Column Generator
  - Transformer (if already in the upstream job flow)
- Sequentially aggregate on the generated key column
  - No need to sort or hash-partition the input data!

Use two Aggregators to prevent the sequential aggregation (and collector) from slowing down the upstream data flow:
- The first Aggregator runs in parallel, grouping on the generated key column
  - Round-robin the input if it is not evenly distributed
- The second Aggregator runs sequentially, grouping on the generated key column
  - Auto collector
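The two-Aggregator pattern can be sketched in Python (a hypothetical analogy — in a real job these are Aggregator stages, and `total_sum` is an invented name): stage one produces one partial result per partition in parallel; stage two combines the handful of partials sequentially.

```python
def total_sum(partitions):
    """Two-stage total aggregation over pre-partitioned rows."""
    CONST_KEY = 1  # stands in for the generated constant-value key column
    # Stage 1: one partial aggregate per partition (parallel in DataStage),
    # grouping on the constant key
    partials = [{CONST_KEY: sum(rows)} for rows in partitions]
    # Stage 2: sequential aggregator over the few partial rows
    return sum(p[CONST_KEY] for p in partials)
```

The sequential stage only ever sees one row per partition, so the collector no longer throttles the upstream parallel flow.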

Transformer vs. Other Stages


For optimum performance, consider more appropriate stages instead of a Transformer in parallel job flows:
- Use the Copy stage as a placeholder
  - This is different from DataStage Server!
  - Unless FORCE=TRUE, Copy is optimized out at runtime
- Leverage stage (e.g. Copy) Output Mappings (RCP off) to:
  - Rename columns
  - Drop columns
  - Perform default type conversions
- Modify is the most efficient stage. Use it for:
  - Non-default type conversions
  - Null handling (converting between in-band and out-of-band)
  - String trimming (v7.01 and later)
- NOTE: starting with v7.01, Transformer output link constraints are FASTER than the Filter stage! (Filter is always interpreted)

Transformer vs. Lookup


- Consider implementing lookup tables for expressions that depend on value mapping
- For example, instead of using Transformer expressions such as:
  - link.A=1 OR link.A=3 OR link.A=5
  - link.A=2 OR link.A=7 OR link.A=15 OR link.A=20
- Create a lookup table for the source-value pairs, and use the Lookup stage to assign values:

  A    Result
  1    1
  3    1
  5    1
  2    2
  7    2
  15   2
  20   2

- This method can also be used to simplify output link constraints
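The same mapping expressed as a keyed table, sketched in Python (a hypothetical analogy — the Lookup stage performs the equivalent keyed access; `RESULT_BY_A` and `map_result` are invented names):

```python
# Keyed table equivalent to the chained OR expressions:
#   link.A = 1, 3, 5       -> Result 1
#   link.A = 2, 7, 15, 20  -> Result 2
RESULT_BY_A = {1: 1, 3: 1, 5: 1, 2: 2, 7: 2, 15: 2, 20: 2}

def map_result(a):
    # A keyed lookup replaces the OR chain; an unmatched value behaves
    # like a lookup miss (here None; a Lookup stage could reject it).
    return RESULT_BY_A.get(a)
```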


Transformer Performance Guidelines


Minimize the number of Transformers by combining derivations from multiple Transformers
NEVER use the Server-side BASIC Transformer in high-volume data flows
- Intended to provide a migration path for existing DataStage Server applications that use DataStage BASIC routines
- Starting with v7, the parallel Transformer supports user-defined functions (external object files or libraries, not through DataStage BASIC routines)

Replace Transformer stages that do not meet performance requirements with BuildOps
- It is generally not necessary to replace all Transformers, just those that are bottlenecks
- Remember, BuildOps require more knowledgeable developers than equivalent Transformer logic

Optimizing Transformer Expressions


The parallel Transformer uses the following evaluation algorithm:
- Evaluate each stage variable's initial value
- For each input row:
  - Evaluate each stage variable derivation, unless the derivation is empty
  - For each output link:
    - Evaluate each column derivation value
    - Write the output record
- Stage variables and columns within a link are evaluated in the order displayed in the Transformer editor
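The evaluation order can be sketched as a Python loop (a simplified, hypothetical model of the Transformer — all names are invented for illustration):

```python
def run_transformer(rows, stage_var_inits, stage_var_derivs, output_links):
    """Model of the stated evaluation order: stage variable initial
    values once, then per row: stage variable derivations (in display
    order), then each output link's column derivations."""
    state = {name: init() for name, init in stage_var_inits}   # initial values
    outputs = {link: [] for link, _ in output_links}
    for row in rows:
        for name, deriv in stage_var_derivs:                   # display order
            state[name] = deriv(row, state)
        for link, columns in output_links:                     # each output link
            record = {col: deriv(row, state) for col, deriv in columns}
            outputs[link].append(record)                       # write the record
    return outputs
```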


Optimizing Transformer Expressions


Given the Transformer evaluation order, use stage variables instead of per-column derivations to minimize repeated evaluation of the same derivation:
- Move repeated expressions outside of loops
- Examples:
  - Portions of output column derivations that are used in multiple derivations
  - Where an expression includes calculated constant values
    - Use the stage variable's Initial Value to evaluate once for all rows
  - Where an expression requiring a type conversion is used as a constant, or is used in multiple places
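The payoff of hoisting a repeated expression into a stage variable can be sketched in Python (a hypothetical illustration; `derive_slow` and `derive_fast` are invented names):

```python
# Per-column derivations: the same strip/upper expression is evaluated
# three times per row.
def derive_slow(row):
    return {
        "name_key":  row["name"].strip().upper() + "_K",
        "name_sort": row["name"].strip().upper(),
        "name_tag":  "T_" + row["name"].strip().upper(),
    }

# Stage-variable style: the shared expression is evaluated once per row
# and each column derivation reuses the result.
def derive_fast(row):
    sv_name = row["name"].strip().upper()   # the "stage variable"
    return {
        "name_key":  sv_name + "_K",
        "name_sort": sv_name,
        "name_tag":  "T_" + sv_name,
    }
```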


Transformer Decimal Arithmetic


Starting with v7.0.1 and v6.0.2, the Transformer supports DECIMAL arithmetic (earlier releases converted to dfloat)
- Default internal decimal variables are precision 38, scale 10, but this can be changed by specifying:
  - $APT_DECIMAL_INTERM_PRECISION
  - $APT_DECIMAL_INTERM_SCALE
- Set $APT_DECIMAL_INTERM_ROUND_MODE to specify:
  - ceil: rounds toward positive infinity
    - 1.4 -> 2, -1.6 -> -1
  - floor: rounds toward negative infinity
    - 1.6 -> 1, -1.4 -> -2
  - round_inf: rounds or truncates to the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity
    - 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
  - trunc_zero: discards any fractional digits to the right of the rightmost supported fractional digit, regardless of sign. If $APT_DECIMAL_INTERM_SCALE is smaller than the result of an internal calculation, round or truncate to the scale size
    - 1.56 -> 1.5, -1.56 -> -1.5
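The four modes map onto Python's decimal rounding constants (an analogy for experimenting with the behavior, not DataStage code; round_inf's tie-breaking matches ROUND_HALF_UP, i.e. ties away from zero):

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP, ROUND_DOWN

MODES = {
    "ceil": ROUND_CEILING,        # toward positive infinity
    "floor": ROUND_FLOOR,         # toward negative infinity
    "round_inf": ROUND_HALF_UP,   # nearest, ties away from zero
    "trunc_zero": ROUND_DOWN,     # discard extra fractional digits
}

def round_mode(value, mode, scale=0):
    # Quantize to the requested scale, e.g. scale=1 -> multiples of 0.1
    q = Decimal(1).scaleb(-scale)
    return Decimal(value).quantize(q, rounding=MODES[mode])
```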


Conditionally Aborting a Job


Use the Abort After Rows setting in the output link constraints of the parallel Transformer to conditionally abort a parallel job
- Create a new output link and assign a link constraint that matches the abort condition
- Set Abort After Rows for this link to the number of rows allowed before the job aborts

When the Abort After Rows threshold is reached, the Transformer immediately aborts the job flow, potentially leaving uncommitted database rows or un-flushed file buffers

More Transformer Best Practices


Always include a reject link
- Captures NULL errors from Transformer expressions
Always test for a null value before using a column in a function
Avoid type conversions
- Try to maintain the data type as imported
Be aware of column and stage variable data types
- It is easy to neglect setting the proper stage variable type

First Row Transformer Derivations


Within a Transformer, stage variables can be used to identify the first row of an input group:
- Define one stage variable for each grouping key column
- Define a stage variable to flag when the input key column(s) do not match the previous value(s)
- On a new group (flag set), set the stage variable(s) to the incoming key column value(s)
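The pattern can be sketched in Python (a hypothetical model of the stage-variable logic; `flag_first_rows` is an invented name). It assumes the input is already hash-partitioned and sorted on the key columns:

```python
def flag_first_rows(rows, keys):
    """Flag the first row of each group by comparing key columns to
    values remembered from the previous group (stage-variable style)."""
    prev = {k: object() for k in keys}   # sentinels: never match row 1
    out = []
    for row in rows:
        is_first = any(row[k] != prev[k] for k in keys)  # key-change flag
        if is_first:
            prev = {k: row[k] for k in keys}  # remember the new group's keys
        out.append((is_first, row))
    return out
```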


Last Row Transformer Derivations


Since the Transformer cannot read ahead, other methods must be used when derivations depend on the last row of a group
- For aggregate calculations within the Transformer, generate a running total for each group, then use Remove Duplicates, retaining the Last row
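The running-total-then-dedupe technique can be sketched in Python (a hypothetical illustration; in the job this is a Transformer followed by Remove Duplicates, and `group_totals` is an invented name). The input must be sorted on the grouping key:

```python
def group_totals(rows, key, value):
    """Emit a running total per group, then keep only the last row of
    each group (Remove Duplicates, retaining Last)."""
    running = []
    total, prev = 0, object()   # sentinel so the first row starts a group
    for row in rows:
        if row[key] != prev:
            total, prev = 0, row[key]
        total += row[value]
        running.append({key: row[key], "total": total})
    # Remove Duplicates, retaining the Last row per key
    last = {}
    for r in running:
        last[r[key]] = r        # later rows overwrite earlier ones
    return list(last.values())
```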


Identifying Last Row in a Group


In general, it is a bad idea to perform multiple, back-to-back sorts
The Sort stage, however, can be used for more than just sorting:
- Sub-sorting on groups (instead of complete sorts)
- Creating key change columns

Example: For derivations that cannot output a running total, use three Sort stages before the Transformer to generate a key change column for the last row in the group
- Often, data is already sorted earlier in the same flow
- Hash-partition/Sort on key columns before the first sort
- Use SAME partitioning to ensure that subsequent stages keep grouping and sort order

(Diagram: Sort -> KeyChange -> SubSort)

Last Row Sort Details


- First Sort
  - Sorts on key columns
  - Sorts Descending on the group order column
- Second Sort
  - Does no sorting; creates the key change column
  - Specify only key columns
- Final Sub-Sort
  - Does not sort on key columns
  - Sub-sorts Ascending on the group order column


DataStage Enterprise Edition


Database Stage Usage


Database Stage Usage


Overall Database Guidelines
Native Parallel vs. Plug-In Stages
DB2 Guidelines
Oracle Guidelines
Teradata Guidelines
SQL or DataStage?


Optimizing Select Lists for Read


For source database stages, limit the use of SELECT * to read all columns
- Uses more memory, and may impact job performance
- Only needed for dynamic source / target flows (uncommon)

Instead, explicitly specify ONLY the columns needed in the flow
- For the Table read method, specify the Select List property
- Or, use Auto-Generated or User-Defined SQL



Native Parallel Database Stages


Starting with release 7, DataStage Enterprise Edition offers database connectivity through native parallel and plug-in stage types.
In general, for maximum parallel performance, scalability, and features, it is best to use the native parallel database stages:
- Parallel read and write
- OPEN and CLOSE commands


Upsert (API) vs. Load Methods


For database targets, most Enterprise stages provide the choice of Upsert or Load methods
- Upsert method uses database APIs
  - Allows concurrent processing with other jobs and applications
  - Does not bypass database constraints, indexes, or triggers
- Load method uses the corresponding database-specific parallel load utility
  - Can be significantly faster than the Upsert method for large data volumes
  - Subject to database-specific limitations of load utilities
    - May be issues with index maintenance, constraints, etc.
    - May not work with tables that have associated triggers
  - Requires exclusive access to the target table



OPEN and CLOSE commands


 OPEN command allows user to specify SQL to be executed before the stage begins reading or writing
Example: Create temporary table used to write rows

 CLOSE command allows user to specify SQL to be executed after the stage completes reading or writing
Example: INSERT INTO the actual table with a SELECT FROM the temporary table
Example: Delete temporary table(s)

 Available only in the native parallel (Enterprise) database stages
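As a sketch of how these might be used (the table and column names are hypothetical), an OPEN/CLOSE pair could stage rows through a temporary table:

```sql
-- OPEN command: run before the stage starts writing
CREATE TABLE TMP_ORDERS (ORDER_ID INTEGER, ORDER_AMT DECIMAL(10,2));

-- CLOSE command: run after the stage finishes writing
INSERT INTO ORDERS (ORDER_ID, ORDER_AMT)
  SELECT ORDER_ID, ORDER_AMT FROM TMP_ORDERS;
DROP TABLE TMP_ORDERS;
```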


Plug-In Database Stages


 Plug-in stage types are intended to provide connectivity to database configurations not offered by native parallel stages.
Cannot read in parallel
Cannot span multiple servers in clustered or MPP configurations


Designer Palette Customization


 The DataStage repository window displays all stages available on the parallel canvas
Stage Types/Parallel category

 Not all of these stages are included in the default Designer palette

 Customize the palette to add stage types (eg. Teradata API)


Enterprise Edition DB2 Stages


 DB2 Enterprise stage
Should always be used when reading from, performing lookups against, or writing to DB2 Enterprise Server Edition with Database Partitioning Feature (DPF)
y In DB2 v7.x this was called DB2 EEE

Tightly coupled with DB2; communicates directly with each DB2 database node, using the same partitioning as the DB2 table
Supports Parallel Read, Upsert, Load, Sparse Lookup

 DB2 API stage


Provides connectivity to non-UNIX DB2 databases (such as mainframe editions through DB2-CONNECT)

DB2 Upsert Commit Interval


 For target DB2 tables using the Upsert method, the DB2 Enterprise stage provides options to specify the database commit interval for each stage

 Rows are committed after a period of time or number of rows, whichever comes first:
Default is every 2 seconds or 2000 rows


Cleaning Up Failed DB2 Loads


 In the event of a failure during a DB2 Load operation, the DB2 Fast Loader marks the table inaccessible (quiesced exclusive or load pending state)

 To reset the target table state to normal mode:
Re-run the job specifying the CleanupOnFailure=True option
Any rows that were inserted before the load failure must be deleted manually


Enterprise Edition Oracle Stages


 Oracle Enterprise
Source
y Supports sequential (default) or parallel reads

Target
y Upsert: uses Oracle API y Load: invokes SQL*Loader, subject to its limitations

 Oracle OCI Load


ONLY used for heterogeneous loads
y When the target database's hardware platform differs from the Oracle client (DataStage server) platform


Specifying Oracle Remote Server


 The Oracle Enterprise Remote Server connection option is intended for Oracle instances on remote hosts

 In general, avoid using this option for local Oracle databases (on same host as DataStage server)
Specifying it for local Oracle instances forces a TCP (network) database connection instead of shared memory
Instead, set the environment variable $ORACLE_SID
y The Oracle environment is typically defined within the DataStage dsenv file

Reading from Oracle in Parallel


 By default, Oracle Enterprise reads sequentially. Use the partition table option to read in parallel from Oracle sources

 Limitations of Parallel Read:


The source table can only be non-partitioned or range-partitioned
Cannot run queries containing a GROUP BY clause that are not also partitioned by the same field
Cannot perform a non-collocated join

Oracle Schema Owner


 To access Oracle tables that were created by a different user, fully-qualify the table name
Syntax: ownername.tablename
NOTES:
y Parameterize ownername
y Database permissions must allow access
y CANNOT use an unqualified synonym
 A synonym provides no access to the Oracle system catalog information required by the Oracle Enterprise stage
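For illustration (the job parameter and table names are hypothetical, not from the course), a parameterized owner qualification in user-defined SQL might read:

```sql
-- #ORA_SCHEMA# is a DataStage job parameter holding the table owner's name
SELECT ORDER_ID, ORDER_AMT
FROM   #ORA_SCHEMA#.ORDERS
```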


Improving Oracle Upsert Performance


 In Upsert write mode, the Oracle Enterprise stage:
Executes the Insert statement (if present) first
If the Insert fails with a unique-constraint violation, it then executes the Update statement

 For larger data volumes, it is often faster to identify Insert and Update data within the job and separate them into different Oracle Enterprise targets
Set Upsert Mode=Update Only for rows to be updated
Set Upsert Mode=Update and Insert for rows to be inserted
Prevents double-processing of update records

 Insert processing uses Oracle host arrays to improve performance

The optional InsertArraySize parameter can enhance performance (default is 500 rows)

Oracle Upsert Commit Interval


 For target Oracle tables using the Upsert method, two environment variables specify the database commit interval
As environment variables, commit settings apply to all Oracle stages in a job

 Rows are committed after a period of time or number of rows, whichever comes first, for each Oracle stage/partition:
$APT_ORAUPSERT_COMMIT_ROW_INTERVAL
y Default is every 5000 rows (per stage/partition)

$APT_ORAUPSERT_COMMIT_TIME_INTERVAL
y Default is every 2 seconds
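As a sketch, the two settings might be exported in a job's environment as follows (the values shown are illustrative, not recommendations):

```shell
# Commit every 10000 rows or every 5 seconds, whichever comes first,
# for each Oracle Upsert stage/partition in the job (illustrative values)
export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=10000
export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=5
```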

Oracle Load into Indexed Tables


 By default, Oracle Enterprise will not Load an indexed table
Must drop indexes before the load and recreate them after the load (requires appropriate Oracle privileges)
y Can use OPEN and CLOSE commands

 In Append or Truncate modes, the IndexMode option can allow load into an indexed table:
Rebuild: bypasses indexes during load, rebuilds indexes after the load completes
y uses the Oracle ALTER INDEX REBUILD command
y indexes cannot be partitioned
Maintenance: maintains the index on load
y Loads each partition sequentially
y Table and index must be partitioned
y Index must be local range-partitioned using the same range values used to partition the table

Alternate: Load into Indexed Tables


 If index mode options are not possible, or if you do not have proper Oracle permissions, it is still possible to Load into an indexed table:
Set the Oracle Enterprise stage to run sequentially
Set the environment variable $APT_ORACLE_LOAD_OPTIONS to:
y OPTIONS(DIRECT=TRUE,PARALLEL=FALSE)
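Sketched as an environment setting (per the slide above):

```shell
# Direct-path load without parallel SQL*Loader sessions, so existing
# indexes remain usable; run the Oracle Enterprise stage sequentially
export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=TRUE,PARALLEL=FALSE)'
```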


Teradata Stage Usage


 Because of limitations imposed by the Teradata utilities, it is sometimes appropriate to use plug-in stages for Teradata sources or targets
Teradata imposes a system-wide limit on the number of concurrent database utilities
y Can be adjusted by the DBA, but cannot be greater than 15
y Within a parallel job, each use of the Teradata Enterprise, Teradata MultiLoad, or Teradata Load stages counts against this limit when the job is run

 Which Teradata stage to use?
Source or Target Teradata Enterprise uses the FastExport and FastLoad utilities
y High-volume parallel reads and writes
y Targets are limited to Insert operations (empty table or Append)
y Supports OPEN and CLOSE commands

Teradata Enterprise DBOptions


 For Teradata instances with a large number of AMPs (VPROCs), it may be necessary to set the optional SessionsPerPlayer and RequestedSessions in the DBOptions string in the Teradata Enterprise stage
It is a good idea to parameterize these settings
Syntax is:
y user=[user],password=[password],SessionsPerPlayer=nn,RequestedSessions=nn


Teradata Enterprise Sessions


 RequestedSessions determines the total number of distributed connections to the Teradata source or target
When not specified, it equals the number of Teradata VPROCs (AMPs) (your DBA can provide this)
Can be set between 1 and the number of VPROCs

 SessionsPerPlayer determines the number of connections each player will have to Teradata. Indirectly, it also determines the number of players (degree of parallelism).
Default is 2 sessions per player
The number selected should be such that SessionsPerPlayer * number of nodes * number of players per node = RequestedSessions
Setting the value of SessionsPerPlayer too low on a large system can result in so many players that the job fails due to insufficient resources. In that case, the value of SessionsPerPlayer should be increased.
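The relationship above can be sketched as a quick shell calculation. The figures are illustrative only (they mirror an 8-node configuration with 16 requested sessions):

```shell
# Check SessionsPerPlayer settings against the formula:
# RequestedSessions = SessionsPerPlayer * nodes * players per node
requested_sessions=16    # total Teradata sessions (at most the number of AMPs)
nodes=8                  # logical nodes in the configuration file
sessions_per_player=2    # DBOptions SessionsPerPlayer value

# players the framework starts on each node for this stage
players_per_node=$(( requested_sessions / (sessions_per_player * nodes) ))
# recompute the total from the formula to confirm the settings balance
total=$(( sessions_per_player * nodes * players_per_node ))

echo "players per node: ${players_per_node}, total sessions: ${total}"
```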

Teradata SessionsPerPlayer Example

[Diagram: DataStage Server connected to a Teradata Server (MPP with 4 TPA nodes, 4 AMPs per TPA node)]

Example Settings:

Configuration File    Sessions Per Player    Total Sessions
16 nodes              1                      16
8 nodes               2                      16
8 nodes               1                      8
4 nodes               4                      16


Teradata Plug-Ins
Target Teradata MultiLoad plug-in (MultiLoad utility)
y Targets allow Insert, Update, Delete, or Upsert of moderate data volumes (stage cannot run in parallel)
y Do NOT use as a source in an EE flow! (runs FastExport sequentially)

Target Teradata MultiLoad plug-in (TPump utility)
y Targets allow Insert, Update, Delete, or Upsert of small data volumes in a large database
y Does NOT lock the target table exclusively
y Stage cannot run in parallel

Source or Target Teradata API stage does not use database utilities
y Intended for small volumes of data
y Does not count against Teradata limits, but slower than TPump
y Cannot read in parallel (parallel writes are allowed)

Teradata Stage Usage Guidelines


 Stages that use Teradata utilities (database-wide limit):
Teradata Enterprise will always have maximum performance for high volumes of data
y ONLY stage that will read in parallel
y Limited target capabilities (insert, append)

Teradata MultiLoad for moderate data volumes
y Inserts, Updates, Deletes, Upserts
y Target stage ONLY!
y Must run sequentially

Teradata MultiLoad (TPump option)
y Similar to MultiLoad, but does not lock the target table exclusively

 Stages that do not use Teradata utilities:
Teradata API



SQL or DataStage?
 When reading data from multiple tables in the same database, it is possible to use either SQL or DataStage for some tasks

 In general, the optimal implementation leverages the strengths of each technology:
When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to the DataStage job
Use a SQL JOIN to combine data from tables with a small-to-medium number of rows, especially when the join columns are indexed
In general, avoid SQL SORTs; the DataStage SORT is much faster and runs in parallel without the overhead of sort-merge
Use DataStage SORT and JOIN to combine data from very large tables, or when the join condition is complex
Avoid the use of database stored procedures (eg. Oracle PL/SQL) on a per-row basis. Implement these routines using native DataStage components.

 When the direction is not obvious, the decision is often made by actual tests, or influenced by other factors such as metadata needs and developer skill sets
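As a sketch of the first guideline (table and column names are hypothetical), pushing the filter into the source SQL keeps unneeded rows out of the job entirely:

```sql
-- The database discards out-of-scope rows before they reach DataStage;
-- the parallel job then handles the heavy transformation work
SELECT ORDER_ID, CUST_ID, ORDER_AMT
FROM   ORDERS
WHERE  ORDER_STATUS = 'OPEN'
```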

For More Information


 Orchestrate OEM Documentation (available in the documentation section of Ascential eServices public website)
User Guide
Operators Reference
Record Schema

 DataStage Enterprise Edition Best Practices and Performance Tuning document

 Don't be afraid to try!


DataStage Enterprise Edition


Module 04: Best Practices and Job Design Tips

NOTE: These slides are Copyright 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Paul Christensen Solution Architect

Last revision: June 22, 2004



DataStage Enterprise Edition


Module 05: Environment Variables

NOTE: These slides are Copyright 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Paul Christensen Solution Architect

Last revision: June 22, 2004



Understanding a Job's UNIX Environment


 Jobs inherit environment variables at runtime based on this order of evaluation:

Environment variables defined in $DSHOME/dsenv
y Shared by all projects on the DataStage server

Project-level environment variables defined by DS Administrator
y Duplicate variables over-ride $DSHOME/dsenv
y NOTE: when migrating between environments, project-level environment variables are NOT exported

Job-level environment variables set in Job Parameters
y Duplicate variables over-ride $DSHOME/dsenv and project-level settings
y Cannot be set / passed in Job Sequences (bug!)
y To avoid hard-coding job parameters, use special values:
 $ENV pulls the value from the operating system environment
 $PROJDEF uses the project default value


Copying Project-Level Environment Variables


 Project-level environment variables are not exported when performing a full export using DataStage Manager

 With care, project-level environment variables can be copied between projects by editing the DSParams file located at the top level of the project directory
User-defined settings are near the end of this file:

  [InternalSettings]
  DisableParSCCheck=0
  [AUTO-PURGE]
  PurgeEnabled=0
  DaysOld=0
  PrevRuns=0
  [EnvVarValues]
  "ORACLE_SID"\1\"cpaul"
  "APT_SORT_INSERTION_CHECK_ONLY"\1\"1"

 IMPORTANT: Always make a backup copy of the DSParams file before any manual editing. It is possible to render a project unusable through improper editing of DSParams.

Environment Variables For All Jobs


 The following environment variables are recommended for all jobs. Although these can be set at the project level, it is better to specify them within job properties
Provides a runtime parameter
Specify in your Job template(s)

 $APT_CONFIG_FILE=[filepath]

 $APT_DUMP_SCORE=1

 $APT_RECORD_COUNTS=1
Outputs record counts to the job log as each operator completes processing

 $OSH_ECHO=1
Outputs generated OSH to job log

 $APT_PM_SHOW_PIDS=1
Places UNIX process ID entries in the job log for each process started at runtime
Does not show DataStage phantom or Server processes

 $APT_BUFFER_MAXIMUM_TIMEOUT=1
Maximum buffer delay in seconds

 $APT_COPY_TRANSFORM_OPERATOR=1
For clusters/MPP only: copies Transform operator(s) to remote nodes
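Collected as a dsenv-style sketch (the configuration file path is a placeholder, not a real installation path):

```shell
# Recommended settings for all jobs, per the list above
export APT_CONFIG_FILE=/path/to/default.apt   # placeholder path
export APT_DUMP_SCORE=1                # write the parallel job score to the log
export APT_RECORD_COUNTS=1             # per-operator record counts on completion
export OSH_ECHO=1                      # echo generated OSH to the job log
export APT_PM_SHOW_PIDS=1              # log UNIX process IDs at startup
export APT_BUFFER_MAXIMUM_TIMEOUT=1    # maximum buffer delay in seconds
export APT_COPY_TRANSFORM_OPERATOR=1   # clusters/MPP only
```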

Job Monitoring Environment Variables


 Starting with DataStage v7, the Director Job Monitor captures results on a time interval
Captured row counts are shown in Director, Job Monitor, and Designer (Show Performance Statistics)
This data is also stored in the DataStage repository, and can be extracted using Job Control or XML reports

 The following environment variables alter Job Monitor characteristics:


$APT_MONITOR_TIME=[seconds]
y Specifies time interval for capturing job monitor information at runtime.

$APT_MONITOR_SIZE=[rows]
y If set, specifies that the job monitor capture information on a row (not time) basis. This is the method used in DataStage release 6.x

$APT_NO_JOBMON=1
y Disables job monitoring completely; no statistics will be captured
y In rare instances, this may improve performance

Job Design Environment Variables


 $APT_STRING_PADCHAR=[char]
Overrides the default pad character of 0x0 (ASCII NULL)
Can be a string character, or C-notation
Used for all variable-length to fixed-length string conversions
May have implications for some target database stages (eg. Oracle)

 $APT_DECIMAL_INTERM_PRECISION=[precision] $APT_DECIMAL_INTERM_SCALE=[scale]
Specify the precision and scale used for internal Transformer derivations
Default precision/scale is [38,10], maximum is [255,255]

 $APT_DECIMAL_INTERM_ROUND_MODE=[mode]
ceil: rounds toward positive infinity
y 1.4 -> 2, -1.6 -> -1

floor: rounds toward negative infinity


y 1.6 -> 1, -1.4 -> -2

round_inf: rounds or truncates to nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity
y 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2

trunc_zero: discard any fractional digits to the right of the rightmost fractional digit supported regardless of sign. If $APT_DECIMAL_INTERM_SCALE is smaller than the results of an internal calculation, round or truncate to the scale size
y 1.56 -> 1.5, -1.56 -> -1.5

Job Debugging Environment Variables


 The following environment variables can assist with debugging a job flow:
$OSH_PRINT_SCHEMAS=1
y Outputs the actual schema used at runtime for each dataset in a job flow. This is useful for determining if actual schema matches what the job designer expected.

$APT_PM_PLAYER_TIMING=1
y When set, prints detailed information in the job log for each operator, including CPU utilization and elapsed processing time

$APT_PM_PLAYER_MEMORY=1
y When set, prints detailed information in the job log for each operator when additional memory is allocated

$APT_BUFFERING_POLICY=FORCE
$APT_BUFFER_FREE_RUN=1000
y Used in conjunction, these two environment variables effectively isolate each operator from slowing upstream production. Using the job monitor statistics, this can identify which part of a job flow is impacting overall performance.
y NOT recommended for production job runs!

Buffer Environment Variables


 The following environment variables may also be specified on a per-stage basis within Designer:
$APT_BUFFERING_POLICY
$APT_BUFFER_MAXIMUM_MEMORY
$APT_BUFFER_FREE_RUN
$APT_BUFFER_DISK_WRITE_INCREMENT


Sequential File Stage Environment Variables


Environment Variable / Setting / Description:

$APT_EXPORT_FLUSH_COUNT=[nrows]
  Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty from increased I/O.

$APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS (DataStage v7.01 and later)
  Directs DataStage to reject Sequential File records with strings longer than their declared maximum column length. By default, imported string fields that exceed their maximum declared length are truncated.

$APT_IMPORT_BUFFER_SIZE=[Kbytes]
$APT_EXPORT_BUFFER_SIZE=[Kbytes]
  Define the size of the I/O buffer for Sequential File reads (imports) and writes (exports) respectively. Default is 128 (128K), with a minimum of 8. Increasing these values on heavily-loaded file servers may improve performance.

$APT_CONSISTENT_BUFFERIO_SIZE=[bytes]
  In some disk array configurations, setting this variable to a value equal to the read / write size in bytes can improve performance of Sequential File import/export operations.

$APT_DELIMITED_READ_SIZE=[bytes]
  Specifies the number of bytes the Sequential File (import) stage reads ahead to get the next delimiter. The default is 500 bytes, but this can be set as low as 2. This setting should be set to a lower value when reading from streaming inputs (eg. socket, FIFO) to avoid blocking.

$APT_MAX_DELIMITED_READ_SIZE=[bytes]
  Controls the upper bound, which is 100,000 bytes by default. When more than 500 bytes of read-ahead is desired, use this variable instead of $APT_DELIMITED_READ_SIZE.


DB2 Environment Variables


Environment Variable / Setting / Description:

$INSTHOME=[path]
  Specifies the DB2 install directory. This variable is usually set in a user's environment from .db2profile.

$APT_DB2INSTANCE_HOME=[path]
  Used as a backup for specifying the DB2 installation directory (if $INSTHOME is undefined).

$APT_DBNAME=[database]
  Specifies the name of the DB2 database for DB2/UDB Enterprise stages if the Use Database Environment Variable option is True. If $APT_DBNAME is not defined, $DB2DBDFT is used to find the database name.

$APT_RDBMS_COMMIT_ROWS=[rows]
  Specifies the number of records to insert between commits. The default value is 2000. Can also be specified with the Row Commit Interval stage input property.

$DS_ENABLE_RESERVED_CHAR_CONVERT
  Allows DataStage to handle DB2 databases which use the special characters # and $ in column names.


Informix Environment Variables

Environment Variable / Setting / Description:

$INFORMIXDIR=[path]
  Specifies the Informix install directory.

$INFORMIXSQLHOSTS=[filepath]
  Specifies the path to the Informix sqlhosts file.

$INFORMIXSERVER=[name]
  Specifies the name of the Informix server matching an entry in the sqlhosts file.

$APT_COMMIT_INTERVAL=[rows]
  Specifies the commit interval in rows for Informix HPL Loads. The default is 10000.


Oracle Environment Variables


Environment Variable / Setting / Description:

$ORACLE_HOME=[path]
  Specifies the installation directory for the current Oracle instance. Normally set in a user's environment by scripts.

$ORACLE_SID=[sid]
  Specifies the Oracle service name, corresponding to a TNSNAMES entry.

$APT_ORAUPSERT_COMMIT_ROW_INTERVAL=[num]
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL=[seconds]
  These two environment variables work together to specify how often target rows are committed for target Oracle stages with the Upsert method. Commits are made whenever the time interval period has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.

$APT_ORACLE_LOAD_OPTIONS=[SQL*Loader options]
  Specifies Oracle SQL*Loader options used in a target Oracle stage with the Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE).

$APT_ORACLE_LOAD_DELIMITED=[char] (DataStage 7.01 and later)
  Specifies a field delimiter for target Oracle stages using the Load method. Setting this variable makes it possible to load fields with trailing or leading blank characters.

$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM
  When set, a target Oracle stage with Load method will limit the number of players to the number of datafiles in the table's tablespace.

$APT_ORA_WRITE_FILES=[filepath]
  Useful in debugging Oracle SQL*Loader issues. When set, the output of a target Oracle stage with Load method is written to files instead of invoking the Oracle SQL*Loader. The filepath specified by this environment variable specifies the file with the SQL*Loader commands.

$DS_ENABLE_RESERVED_CHAR_CONVERT
  Allows DataStage to handle Oracle databases which use the special characters # and $ in column names.


Teradata Environment Variables


$APT_TERA_SYNC_DATABASE [name]
    Starting with v7, specifies the database used for the terasync table. By default, EE uses the

$APT_TERA_SYNC_USER [user]
    Starting with v7, specifies the user that creates and writes to the terasync table.

$APT_TERA_SYNC_PASSWORD [password]
    Specifies the password for the user identified by $APT_TERA_SYNC_USER.

$APT_TERA_64K_BUFFERS
    Enables 64K buffer transfers (32K is the default). May improve performance depending on network configuration.

$APT_TERA_NO_ERR_CLEANUP
    This environment variable is not recommended for general use. When set, it may assist in job debugging by preventing the removal of error tables and a partially written target table.

$APT_TERA_NO_PERM_CHECKS
    Disables permission checking on the Teradata system tables that must be readable during the Teradata Enterprise load process. This can be used to improve the startup time of the load.
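For example, a job using Teradata Enterprise stages might export the sync settings before the run. The database name, user, and password below are placeholders, not real recommendations:

```shell
# Placeholder values -- substitute your own sync database and credentials.
export APT_TERA_SYNC_DATABASE=etl_sync_db    # hypothetical database name
export APT_TERA_SYNC_USER=etl_sync_user      # hypothetical user
export APT_TERA_SYNC_PASSWORD='change_me'    # never hard-code a real password
export APT_TERA_64K_BUFFERS=1                # try 64K transfers; benefit depends on the network
```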


For More Information


- Orchestrate OEM Documentation (available in the documentation section of the Ascential eServices public website)
  - Admin Install Guide, Chapter 11: Environment Variables
  - Operators Reference

- DataStage Enterprise Edition Best Practices and Performance Tuning document


DataStage Enterprise Edition


Module 05: Environment Variables

Paul Christensen, Solution Architect

Last revision: June 22, 2004

DataStage Enterprise Edition


Module 06: Introduction to Performance Tuning


Assumptions
- This module assumes that you have an understanding of the topics covered in:
  - Module 01: Parallel Framework Architecture
  - Module 02: Partitioning, Collecting, and Sorting
  - Module 03: Parallel Job Score
  - Module 04: Best Practices and Job Design Tips
  - Material covered in DS324PX: DataStage Enterprise Edition Essentials

Optimizing Performance
- The ability to process large volumes of data in a short period of time requires optimizing all aspects of the job flow and environment for maximum throughput and performance:
  - Job Design
  - Stage Properties
  - DataStage Parameters
  - Configuration File
  - Disk Subsystem (especially RAID arrays / SANs)
  - Source and Target database
  - Network
  - etc.



Enterprise Edition Performance


- Within DataStage, examine (in order):
  - End-to-end process flow
    - Intermediate results, sources/targets, disk usage
  - DataStage Configuration File(s) for each job
    - Degree of parallelism
    - Impact on overall system resources
    - File system mappings, scratch disk
  - Individual job design (including shared containers)
    - Stages chosen, overall design approach
    - Partitioning strategy
    - Combination
    - Buffering (as a last resort)

- Ultimate job performance may be constrained by external sources / targets
  - e.g. disk subsystem, network, database, etc.
  - May be appropriate to scale back the degree of parallelism to conserve unused resources

Performance Tuning Methodology


- Performance tuning is an iterative process:
  - Test in isolation (nothing else should be running)
    - DataStage Server
    - Source and Target databases
  - Change one item at a time, then examine the impact
  - Use the Job Score to determine:
    - Number of processes generated
    - Operator combination
    - Framework-inserted sorting and partitioning
  - Use the DataStage Job Monitor to verify:
    - Data distribution (partitioning)
    - Throughput and bottlenecks
  - Use UNIX system monitoring tools to determine resource utilization (CPU, memory, disk, network)

Using DataStage Director Job Monitor


- Enable "Show Instances" to show data distribution (skew) across partitions
  - Best performance with an even distribution
- Enable "Show %CP" to display CPU utilization


Selectively Disabling Operator Combination


- Operator combination is intended to improve overall performance and lower resource usage
  - Generally separates I/O from CPU activity

- There may be instances when operator combination hurts performance
  - One process cannot use more than 100% of a CPU
  - It is also a good idea to separate I/O from CPU tasks

- Use the DataStage Job Monitor to identify CPU bottlenecks

- Selectively disable combination through Designer stage properties

- In unusual circumstances, disable all combination by setting $APT_DISABLE_COMBINATION=TRUE
  - Generates significantly more UNIX processes
  - May negatively impact performance
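A debugging session might look like the following sketch; the dsjob project and job names are hypothetical:

```shell
# Debug-only: disable ALL operator combination for this shell's job runs,
# then compare the job score and monitor output against a combined run.
export APT_DISABLE_COMBINATION=TRUE
# dsjob -run -jobstatus myproject myjob   # hypothetical project/job names
# Remember to unset APT_DISABLE_COMBINATION before returning to normal runs.
```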

Operator Combination Example


- In this example, the combined operator is using 100% CPU
- Disabling operator combination allows each stage to use more CPU, and separates I/O from CPU

With Operator Combination:

    It has 2 operators:
    op0[1p] {(parallel APT_CombinedOperatorController:
          (FileSetIn.InStream)
          (APT_TransformOperatorImplJob_Transformer in Transformer)
          (APT_RealFileExportOperator in File_Set_6.ToOutput)
        ) on nodes (
          node1[op0,p0]
        )}
    op1[1p] {(sequential APT_WriteFilesetExportOperator in File_Set_6.ToOutput)
        on nodes (
          node1[op1,p0]
        )}
    It runs 2 processes on 1 node.

Without Operator Combination:

    It has 4 operators:
    op0[1p] {(parallel FileSetIn.InStream) on nodes ( node1[op0,p0] )}
    op1[1p] {(parallel APT_TransformOperatorImplJob_Transformer in Transformer) on nodes ( node1[op1,p0] )}
    op2[1p] {(parallel APT_RealFileExportOperator in File_Set_6.ToOutput) on nodes ( node1[op2,p0] )}
    op3[1p] {(sequential APT_WriteFilesetExportOperator in File_Set_6.ToOutput) on nodes ( node1[op3,p0] )}
    It runs 4 processes on 1 node.

Configuration File Guidelines


- Minimize I/O overlap across nodes
  - If multiple filesystems are shared across nodes, alter the order of file systems within each node definition
  - Pay particular attention to the mapping of file systems to physical controllers / drives within a RAID array or SAN
  - Use local disks for scratch storage if possible

- Named pools can be used to further separate I/O
  - "buffer" file systems are only used for buffer overflow
  - "sort" file systems are only used for sorting

- On clustered / MPP configurations, named pools can be used to further specify resources across physical servers
  - Through careful job design, can minimize data shipping
  - Specifies server(s) with database connectivity
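One way to sketch these guidelines in a configuration file is to give a node dedicated scratch file systems plus separate "buffer" and "sort" pools. The host name and mount paths below are assumptions to be replaced with real values:

```shell
# Write an illustrative one-node configuration file; "etlserver" and the
# /data*, /scratch* paths are hypothetical mounts.
cat > example_config.apt <<'EOF'
{
  node "node1" {
    fastname "etlserver"
    pools ""
    resource disk "/data1/datasets" {pools ""}
    resource scratchdisk "/scratch1" {pools ""}
    resource scratchdisk "/scratch2/buffer" {pools "buffer"}
    resource scratchdisk "/scratch3/sort" {pools "sort"}
  }
}
EOF
```

With this layout, buffer overflow and sort spill go to their own file systems instead of competing with general scratch I/O.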

Use Parallel Data Sets


- Use Parallel Data Sets to land intermediate results between parallel jobs
  - Stored in native internal format (no conversion overhead)
  - Retains data partitioning and sort order (end-to-end parallelism across jobs)
  - Maximum performance through parallel I/O
  - But, can only be used by other DataStage Enterprise Edition parallel jobs

- When generating Lookup reference data to be used in subsequent jobs, use Lookup File Sets
  - Internal format, partitioned
  - Pre-indexed

Impact of Partitioning
- Ensure data is as close to evenly distributed as possible
  - When business rules dictate otherwise, re-partition to a more balanced distribution as soon as possible to improve performance of downstream stages

- Minimize repartitions by optimizing the flow to re-use upstream partitioning
  - Especially in clustered / MPP environments

- Know your data
  - Choose hash key columns that generate sufficient unique key combinations (while meeting business requirements)

- Use SAME partitioning carefully
  - Maintains degree of parallelism

Impact of Sorting
- Use parallel sorts if possible (sort by key-column groups)
  - Where a sequential sort is required, a parallel sort + Sort Merge collector is generally much faster than a sequential sort

- Complete sorts are expensive
  - Interrupts the pipeline
    - Rows cannot be output until all rows have been read
  - Uses scratch disk for intermediate storage
    - Unless the data set is small enough to fit in the sort buffer

- Minimize and combine sorts where possible
  - Use the "Don't Sort, Previously Sorted" key-column option to leverage previous sort groupings
    - Uses much less memory
    - Outputs rows after each key-column group
  - Parallel data sets maintain sort order and partitioning across jobs

- Stable sorts are slower than non-stable sorts; use only when necessary

- Use the "Restrict Memory Usage (MB)" option to increase the amount of memory per partition (default is 20 MB)

Impact of Transformers
- Minimize the number and use of Transformers
  - Consider more appropriate stages / methods
    - Copy, Output Mappings, Modify, Lookup
  - Combine derivations from multiple Transformers

- Use stage variables to perform calculations used by multiple derivations

- Replace complex Transformers that do not meet performance requirements with BuildOps

- And NEVER use the BASIC Transformer for high-volume flows!


Impact of Buffering
- Consider maximum row width
  - For very wide rows, it may be necessary to increase the buffer size to hold more rows in memory (default is 3 MB / partition)
  - Set through stage properties, or for the entire job using $APT_BUFFER_MAXIMUM_MEMORY

- Tune all other factors (job design, configuration file, disk, resources, etc.) before tuning buffer settings

- Be careful changing buffering mode
  - Disabling buffering might cause deadlocks (job hang)

- In some cases, the best solution to avoiding fork-join buffer contention may be to split the job, landing to intermediate data sets
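For instance, a job moving very wide rows might raise the per-buffer limit job-wide; the 8 MB figure below is only an example, not a recommendation:

```shell
# Raise the default ~3 MB buffer to 8 MB per buffer (value is in bytes).
export APT_BUFFER_MAXIMUM_MEMORY=8388608
```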


Isolating Buffers from Overall Performance


- Buffer operators may make it difficult to identify performance bottlenecks in a job flow

- Setting the following environment variables effectively isolates each stage (by inserting buffers), and prevents the buffers from slowing down upstream stages (by spilling to disk):
  - $APT_BUFFERING_POLICY=FORCE
    - Inserts buffers between each operator (isolates)
  - $APT_BUFFER_FREE_RUN=1000
    - Writes excess buffer to disk instead of slowing down the producer
    - The buffer will not slow down the producer until it has written 1000 * $APT_BUFFER_MAXIMUM_MEMORY to disk

- Important notes:
  - These settings will generate a significant amount of disk I/O!
  - Use configuration file "buffer" disk pools to isolate buffer file systems from scratch and resource disks
  - Do NOT use these settings for production jobs!
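A sketch of a debug run with these isolation settings in place:

```shell
# Debug-only: force a buffer between every pair of operators and let
# buffers spill freely to disk instead of throttling producers.
export APT_BUFFERING_POLICY=FORCE
export APT_BUFFER_FREE_RUN=1000
# (run the job, inspect the Job Monitor, then unset both before normal runs)
```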

Other Performance Tips


- Remove un-needed columns as early as possible within the flow
  - Minimizes memory usage, optimizes buffering
  - Use a select list when reading from database sources
  - To remove columns on Output Mapping, disable runtime column propagation

- Always specify a maximum length for VARCHAR columns
  - Significant performance benefits

- Avoid type conversions if possible
  - Verify with $OSH_PRINT_SCHEMAS
  - Always import Oracle table definitions using orchdbutil
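For example, schema printing can be switched on for a test run to spot silent conversions:

```shell
# Debug aid: have the framework print the record schema at each operator
# in the job log, making unintended type conversions visible.
export OSH_PRINT_SCHEMAS=1
```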

Tuning Sequential File Performance


- On heavily loaded file servers or some RAID/SAN configurations, setting these environment variables may improve performance (specify a number in Kbytes; the default is 128):
  - $APT_IMPORT_BUFFER_SIZE
  - $APT_EXPORT_BUFFER_SIZE

- In some disk array configurations, set the following environment variable equal to the read/write size in bytes:
  - $APT_CONSISTENT_BUFFERIO_SIZE
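A sketch with illustrative sizes — the 1024 KB buffers and the 1 MB consistent-I/O size assume a hypothetical array stripe size, and should be matched to your own hardware:

```shell
export APT_IMPORT_BUFFER_SIZE=1024            # KB; the default is 128
export APT_EXPORT_BUFFER_SIZE=1024            # KB
export APT_CONSISTENT_BUFFERIO_SIZE=1048576   # bytes; match the array read/write size
```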


For More Information


- Orchestrate OEM Documentation (available in the documentation section of the Ascential eServices public website)
  - User Guide
  - Operators Reference

- DataStage Enterprise Edition Best Practices and Performance Tuning document

- Don't be afraid to try!


DataStage Enterprise Edition


Module 06: Introduction to Performance Tuning

