NOTE: These slides are Copyright 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.
Welcome!
This course is intended to provide:
- An overview of the development and runtime architecture of DataStage Enterprise Edition
- Recommendations for parallel Job Design and Best Practices
2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Agenda:
Day 1
- Module 1: Parallel Framework Architecture
- Module 2: Partitioning, Collecting, and Sorting
- Module 3: The Parallel Job Score
Day 2
- Module 4: Best Practices and Job Design Tips
- Module 5: Environment Variables
- Module 6: Introduction to Performance Tuning
The key to mastering Enterprise Edition is in understanding the DataStage Parallel Framework
- Parallel ETL is a fundamentally different process
- Complex, high-volume flows require an understanding of the underlying engine architecture
For now (v7.x1), you'll ALWAYS need a copy of the OEM (Orchestrate) documentation
Documentation for the DataStage Parallel Framework
[Figure: Source (Operational Data, Archived Data) -> Transform -> Clean -> Load -> Target (Data Warehouse), with intermediate results staged on Disk between steps]
Write to disk and read from disk before each processing operation
- Sub-optimal utilization of resources
  - a 10 GB stream leads to 70 GB of I/O
  - processing resources can sit idle during I/O
- Very complex to manage (lots and lots of small jobs)
- Becomes impractical with big data volumes
  - disk I/O consumes the processing
  - terabytes of disk required for temporary staging
Pipeline Multiprocessing
Think of a conveyor belt moving rows from process to process!
- Transform, clean and load processes are executing simultaneously
- Rows are moving forward through the flow
[Figure: Source (Operational Data, Archived Data) -> Transform -> Clean -> Load -> Target (Data Warehouse)]
- Start a downstream process while an upstream process is still running.
- This eliminates intermediate storing to disk, which is critical for big data.
- This also keeps the processors busy.
- Still have limits on scalability.
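The conveyor-belt idea can be sketched with Python generators. This is only an illustration of rows moving stage to stage without being staged to disk; it is not how the Parallel Framework is implemented, and it runs in a single thread, whereas the Framework runs each stage as a separate process. The function names and sample rows are hypothetical.

```python
from typing import Iterable, Iterator


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Derive a new column and hand each row downstream as soon as it is ready."""
    for row in rows:
        yield {**row, "amount_cents": round(row["amount"] * 100)}


def clean(rows: Iterable[dict]) -> Iterator[dict]:
    """Drop rows with a negative amount; nothing is written to disk in between."""
    for row in rows:
        if row["amount_cents"] >= 0:
            yield row


def load(rows: Iterable[dict], target: list) -> int:
    """Append rows to the target and return how many were loaded."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count


# Hypothetical sample data standing in for the operational source.
source = [
    {"id": 1, "amount": 1.5},
    {"id": 2, "amount": -3.0},
    {"id": 3, "amount": 0.25},
]
warehouse: list = []

# Each row flows through transform -> clean -> load before the next row is read.
loaded = load(clean(transform(source)), warehouse)
print(loaded, [r["id"] for r in warehouse])  # prints: 2 [1, 3]
```

Because the stages are chained, the first row reaches the target before the last row has left the source, which is the property the slide describes.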
Partition Parallelism
Divide large data into smaller subsets (partitions) across resources
- Goal is to evenly distribute data
- Some transforms require all data within same group to be in same partition
[Figure: Source Data divided into subset1 through subset4, each processed by a Transform on Node 1 through Node 4]
[Figure: Source Data (Source) -> Transform -> Clean -> Load -> Data Warehouse (Target)]
January 21, 2012
At runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes).
[Figure: job design with Sample, Derivation, Link Constraint, Lookup and Sort, annotated with explicit data-partition parallelism, implicit pipeline "parallelism" and implicit data-partition parallelism]
Defining Parallelism
Execution mode (sequential/parallel) is controlled by stage definition and properties
- Default = parallel for most Ascential-supplied stages
- Can override default in most cases through Advanced stage properties; examples where stage usage defines parallelism:
  - Sequential File reads (unless number of readers per node is set)
  - Sequential File targets
  - Oracle Enterprise sources (unless partition table is set)
  - others...
Degree of parallelism is determined by config. file
- Total number of logical nodes in nameless default pool, or
- Nodes listed in [nodemap] or in named [nodepool]
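As a rough model of the rule above, the degree of parallelism can be thought of as a count of logical nodes that belong to a given pool. The sketch below is a hypothetical in-memory model, not a parser for the real configuration file; the `Node` class, the node names and the `sort` pool are assumptions made for illustration, and the empty string stands for the nameless default pool.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one logical node entry in a configuration file."""
    name: str
    pools: tuple = ("",)  # "" represents the nameless default pool


def degree_of_parallelism(nodes, pool: str = "") -> int:
    """Count the logical nodes that are members of the requested pool."""
    return sum(1 for node in nodes if pool in node.pools)


# Four logical nodes: three in the default pool, two in a named "sort" pool.
config = [
    Node("node1"),
    Node("node2"),
    Node("node3", pools=("", "sort")),
    Node("node4", pools=("sort",)),
]

print(degree_of_parallelism(config))          # default pool -> 3
print(degree_of_parallelism(config, "sort"))  # named pool   -> 2
```

A stage constrained to the named pool would run 2-way in this model, while an unconstrained stage would run 3-way.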
Defines #nodes = logical processing units with corresponding resources (need not match physical CPUs)
- Dataset, Scratch, Buffer disk (filesystems)
- Optional resources (e.g. Database, SAS, etc.)
- Advanced topics (pools: named subsets of nodes)
Key aspects:
1. Number of nodes defined (LOGICAL processing entities)
2. Resources assigned to each node (order of entries within each node is significant!)
3. Advanced resource optimizations and configuration (named pools, database, SAS)
Compile
DataStage server
Buildop stages must be compiled manually within the GUI or using buildop UNIX command line
Transformer Components
- Job Properties
- Job run log
- View Data
- Table Definitions (Show Schema)
Oracle
- Source: oraread
- Sparse Lookup: oralookup
- Target Load: orawrite
- Target Upsert: oraupsert
DataSet: copy
Sort (DataStage): tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key Generator: generator
Designer inserts comment blocks to assist in understanding the generated OSH
Note that operator order within the generated OSH is the order a stage was added to the job canvas
OSH uses the familiar syntax of the UNIX shell to create applications for DataStage Enterprise Edition
- operator name
- schema (for generator, import, export)
- operator options (use -name value format)
- input (indicated by n< where n is the input #)
- output (indicated by n> where n is the output #); may include modify
####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
  a:int32;
  b:string[max=12];
  c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
  keep
    a,b,c;
)] 'SortSt:lnk_sorted.v'
;
For every operator, input and/or output datasets (links) are numbered sequentially starting from 0. For example:
  op1 0> dst
  op1 1< src
The following operator input/output data sources are generated by DataStage Designer:
- Virtual data set (name.v)
- Persistent data set (name.ds or [ds] name)
Virtual data set (link) name is used to connect output of one operator to input of another
Terminology
Framework               | DataStage
schema                  | table definition
property                | format
type                    | SQL type + length [and scale]
virtual dataset         | link
record/field            | row/column
operator                | stage
step, flow, OSH command | job
Framework               | DS engine
- GUI uses both terminologies
- Log messages (info, warnings, errors) use Framework terms
Job SCORE is used to fork UNIX processes with communication interconnects for data, message, and control
Set $APT_PM_SHOW_PIDS to show UNIX process IDs in the DataStage log
Operators
- node/operator mapping
Conductor
- Score Composer
- Creates Section Leader processes (one/node)
- Consolidates messages to DataStage log
- Manages orderly shutdown
Players
- The actual processes associated with Stages
- Combined players: one process only
- Sends stderr, stdout to Section Leader
- Establish connections to other players for data flow
- Clean up upon completion
Default Communication:
- SMP: Shared Memory
- MPP: Shared Memory (within hardware node) and TCP (across hardware nodes)
[Figure: Processing Nodes, each with a Section Leader (SL) and Players (P)]
Section Leader,0
Section Leader,1
Section Leader,2
generator,0
generator,1
generator,2
copy,0
copy,1
copy,2
Data Flow
Data Formats
The Framework processes only datasets
For external data, Enterprise Edition must perform conversion operations:
- Format translation using data type mappings
- May also require:
  - Recordization
  - Columnization
[Figure: External Data -> Conversion -> Parallel Framework (DataSet format) -> Conversion -> External Data]
Data Sets
Data Sets are the structured internal representation of data within the Parallel Framework
Consist of:
- Framework Schema (format=name, type, nullability)
- Data Records (data)
- Partition (subset of rows for each node)
- Columns are not removed from output
- Columns are not renamed unless explicitly dragged to derivation
In this example, a runtime error occurs because Name will not map to NAME (RCP maps by case-sensitive column name)
Must drag column name to derivation column
Type Conversions
Enterprise Edition provides numerous conversion functions between source and target data types
Default type conversions take place across the output mappings of any parallel stage when runtime column propagation is disabled for that stage
- Variable to Fixed-length string conversions will pad remaining length with ASCII NULL (0x0) characters
- Use $APT_STRING_PADCHAR to change default padding (also used by target Sequential File stages)
Non-default type conversions require use of Transformer or Modify (recommended method)
Look for warnings in DataStage log to indicate unexpected conversions!
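The padding behaviour described above can be illustrated with a small Python function. It is a sketch of the idea only: `to_fixed_length` is a hypothetical helper, its `pad_char` argument plays the role of $APT_STRING_PADCHAR, and values longer than the target length are rejected because that case is not covered here.

```python
def to_fixed_length(value: str, length: int, pad_char: str = "\x00") -> str:
    """Pad a variable-length string to a fixed length.

    The default pad character is ASCII NULL (0x0), mirroring the default
    described in the text; pass a different pad_char to model an override.
    """
    if len(pad_char) != 1:
        raise ValueError("pad_char must be a single character")
    if len(value) > length:
        # Over-length values are outside the scope of this sketch.
        raise ValueError("value is longer than the fixed length")
    return value.ljust(length, pad_char)


default_padded = to_fixed_length("abc", 5)      # padded with 0x0 characters
space_padded = to_fixed_length("abc", 5, " ")   # padded with spaces instead

print(repr(default_padded))  # prints: 'abc\x00\x00'
print(repr(space_padded))    # prints: 'abc  '
```

The NULL-padded result looks identical to "abc" in many viewers yet compares unequal to it, which is one reason unexpected conversions are worth checking in the log.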
[Table: source-to-destination type conversion matrix for int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, ustring, raw, date, time and timestamp, with each supported conversion marked d and/or e]
Examples:
- a numeric field's most negative possible value
- an empty string
Modify stage:
- destinationColumnName = handle_null(sourceColumnName,value)
- destinationColumnName = make_null(sourceColumnName,value)
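The two Modify conversions can be pictured with Python analogues in which `None` stands for NULL. These functions only mirror the intent of `handle_null` (replace NULL with an in-band value) and `make_null` (turn an in-band value back into NULL); they are an assumption-laden sketch, not the Modify implementation.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def handle_null(source: Optional[T], value: T) -> T:
    """Replace a NULL (None) source with the given in-band value."""
    return value if source is None else source


def make_null(source: T, value: T) -> Optional[T]:
    """Turn the given in-band value back into NULL (None)."""
    return None if source == value else source


# Use an empty string as the in-band representation of NULL.
names = ["Henry", None, "Clara"]
not_nullable = [handle_null(n, "") for n in names]
round_trip = [make_null(n, "") for n in not_nullable]

print(not_nullable)  # prints: ['Henry', '', 'Clara']
print(round_trip)    # prints: ['Henry', None, 'Clara']
```

The round trip is only lossless when the chosen in-band value never occurs as real data, which is why values such as a numeric field's most negative value are typical choices.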
Source Field | Destination Field | Result
not_nullable | not_nullable      | Source value propagates to destination.
nullable     | nullable          | Source value or null propagates.
not_nullable | nullable          | Source value propagates; destination value is never null.
nullable     | not_nullable      | WARNING messages in log. If source value is null, a fatal error occurs. Must handle in Transformer or Modify.
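The nullability rules can be restated as a short Python function, with `None` standing for NULL. The function name and the exception types are hypothetical choices for this sketch; only the outcomes follow the rules given here.

```python
def propagate(value, source_nullable: bool, dest_nullable: bool):
    """Apply the source/destination nullability rules to a single value."""
    if value is None and not source_nullable:
        raise ValueError("a not_nullable source field cannot hold NULL")
    if value is None and not dest_nullable:
        # nullable -> not_nullable with an actual NULL: fatal at runtime.
        raise RuntimeError("NULL written to a not_nullable destination")
    return value


# nullable -> nullable: value or NULL propagates unchanged.
ok_null = propagate(None, source_nullable=True, dest_nullable=True)

# nullable -> not_nullable: works only while no NULL actually arrives.
ok_value = propagate(42, source_nullable=True, dest_nullable=False)

try:
    propagate(None, source_nullable=True, dest_nullable=False)
    fatal = False
except RuntimeError:
    fatal = True

print(ok_null, ok_value, fatal)  # prints: None 42 True
```

The last case is the one that must be handled explicitly, for example by replacing NULL with an in-band value before the mapping.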
null_length
- Field length that indicates a NULL value (only appropriate for variable-length files)
Null field representation can be any string, regardless of valid values for actual column data type
Lookup
If Not Found = Continue
- Integer columns: default value is zero
- Varchar: a zero-length string (this is distinctly different from a NULL value)
- Char: a string of fixed length, filled with $APT_STRING_PADCHAR characters
TIP: When changing column attributes, be careful to propagate the change through the remaining links of your job design (including the output column definition of the Lookup stage in this example).
If the non-key columns of the reference link are defined as nullable, the Lookup stage will place NULL values in these columns for unmatched records
* Data type default values are documented in OEM UserGuide.pdf
- v7 Transformer now warns when rows reject
- v7 also clarifies naming of output link constraints (Otherwise)
OperatorsRef.PDF
Detailed reference for every built-in operator
RecordSchema.PDF
Format of Framework schema definition (including import, export, generator)
partitioner
collector
Partitioner (Fan-Out): Sequential to Parallel
Collector (Fan-In): Parallel to Sequential
NOTE: Partitioner and Collector icons are ALWAYS drawn Left to Right regardless of how the link is drawn!
Partitioning Data
[Figure: a partitioner on the link between two stages, each running in parallel]
Partitioners
Partitioners distribute rows of a single link (data set) into smaller segments that can be processed independently in parallel
Partitioners exist before ANY parallel stage. The previous stage may be running:
- Sequentially
  - Results in a fan-out operation (and link icon)
  [Figure: stage running sequentially -> partitioner -> stage running in parallel]
- In Parallel
  - If partitioning method changes, data is repartitioned
  [Figure: stage running in parallel -> repartitioning icon -> stage running in parallel]
partition #
Starting with v7.1, the Surrogate Key Generator stage can generate a sequence of integer values in parallel:
- Internally similar to using Column Generator stage with part and partcount keywords
- Also supports initial value for the sequence(s)

Part | Partcount | Result
0    | 4         | 0
1    | 4         | 1
2    | 4         | 2
3    | 4         | 3
0    | 4         | 4
1    | 4         | 5
2    | 4         | 6
3    | 4         | 7

Within the Transformer, @INROWNUM system variable is generated for each node. Instead, use:
- @PARTITIONNUM: actual partition number (starts at zero)
- @NUMPARTITIONS: total number of partitions
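The Part / Partcount / Result values above are consistent with each partition starting at its own partition number and stepping by the partition count. The Python sketch below assumes that formula (initial value + part + partcount * n); it is inferred from the table rather than taken from the stage's documentation, and `partition_keys` is a hypothetical name.

```python
def partition_keys(part: int, partcount: int, rows: int, initial: int = 0) -> list:
    """Generate the key values one partition would produce for `rows` rows.

    part      -- this partition's number, starting at zero
    partcount -- total number of partitions
    initial   -- optional starting value for the sequence
    """
    return [initial + part + partcount * n for n in range(rows)]


# Four partitions, two rows each, as in the table.
per_partition = [partition_keys(part, 4, rows=2) for part in range(4)]
print(per_partition)  # prints: [[0, 4], [1, 5], [2, 6], [3, 7]]

# No value is produced by more than one partition.
all_keys = sorted(k for keys in per_partition for k in keys)
print(all_keys)       # prints: [0, 1, 2, 3, 4, 5, 6, 7]
```

Each partition can generate its values independently, without coordinating with the others, which is what allows the sequence to be produced in parallel. Gaps appear if the partitions hold different numbers of rows.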
Enable Show Instances in DataStage Director Job Monitor to show data distribution (skew) across partitions:
Setting the environment variable $APT_RECORD_COUNTS outputs row counts per partition to the DataStage log as each stage/node (operator) completes processing
Objective 2: Partition method MUST match the stage logic, assigning related records to the same partition if required
- Any stage that operates on groups of related data (often using key columns)
  - Examples: Aggregator, Join, Merge, Sort, Remove Duplicates, etc. (perhaps also Transformers, Buildops)
- Partitioning method needed to ensure correct results may violate Objective #1, depending on actual data distribution
Objective 3: Partition method should not be overly complex
- The simplest method that meets Objectives 1 and 2
- If possible, leverage partitioning performed earlier in a flow
Specifying Partitioning
Partitioning method is defined on the Input properties, Partitioning category, of any stage running in parallel
Partitioning Methods
Keyless Partitioning: Rows are distributed independent of actual data values
- Same: Existing partitioning is not altered
- Round Robin: Rows are evenly alternated among partitions
- Random: Rows randomly assigned to partitions
- Entire: Each partition gets the entire dataset (rows duplicated)
Keyed Partitioning: Rows are distributed at runtime based on values in specified key column(s)
- Hash: Rows with same key column value go to the same partition
- Modulus: Assigns each row of an input dataset to a partition, as determined by a specified numeric key column in the input dataset
- Range: Similar to hash, but partition mapping is user-determined and partitions are ordered
- DB2: Matches DB2 EEE partitioning (discussed in database chapter)
SAME Partitioning
Keyless partitioning method
Rows retain current distribution and order from output of previous parallel stage
- Doesn't move data between nodes
- Retains carefully partitioned data (such as the output of a previous sort)
[Figure: Row IDs in each partition are identical before and after SAME: (0 3 6), (1 4 7), (2 5 8)]
Round Robin
- Fairly low overhead
Round Robin assigns rows to partitions like dealing cards
- Row/Partition assignment will be the same for a given $APT_CONFIG_FILE
[Figure: rows dealt to three partitions: (0 3 6), (1 4 7), (2 5 8)]
Row Generator: 10 rows {A: Integer, initial_value=1, incr=1}
Results with a 4-node $APT_CONFIG_FILE:
  Node 0: 1, 5, 9
  Node 1: 2, 6, 10
  Node 2: 3, 7
  Node 3: 4, 8
With round robin partitioning, rows are distributed in the same order for the same input data and $APT_CONFIG_FILE
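The 4-node result above can be reproduced with a few lines of Python. This is a sketch of the dealing-cards rule only (row i goes to partition i modulo the partition count); `round_robin` is a hypothetical helper, not a Framework API.

```python
def round_robin(rows, num_partitions: int) -> list:
    """Deal rows to partitions in turn, like dealing cards."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


# Ten generated rows (1..10) dealt across four nodes.
nodes = round_robin(range(1, 11), 4)
for number, rows in enumerate(nodes):
    print(f"Node {number}: {rows}")
# Node 0: [1, 5, 9]
# Node 1: [2, 6, 10]
# Node 2: [3, 7]
# Node 3: [4, 8]
```

The assignment depends only on row order and the number of partitions, so the same input with the same number of nodes always lands the same way.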
ENTIRE Partitioning
Keyless partitioning method
Each partition gets a complete copy of the data
- Useful for distributing lookup and reference data
  - May have performance impact in MPP / clustered environments
- On SMP platforms, Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data
ENTIRE is the default partitioning for Lookup reference links with Auto partitioning
- On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)
[Figure: input rows 0 through 8 pass through ENTIRE; every partition receives the full set of rows 0, 1, 2, 3, ...]
HASH Partitioning
Keyed partitioning method
Rows are distributed according to the values in one or more key columns
- Guarantees that rows with identical combination of values in key column(s) are assigned to the same partition
- Needed to prevent matching rows from hiding in other partitions
  - e.g. Join, Merge, RemDup
- Partition size will be relatively equal if the data across the source key column(s) is evenly distributed
[Figure: HASH output partitions by key value: (0 3 0 3), (1 1 1), (2 2 2)]
Source data:
ID | LName | FName   | Address
1  | Ford  | Henry   | 66 Edison Avenue
2  | Ford  | Clara   | 66 Edison Avenue
3  | Ford  | Edsel   | 7900 Jefferson
4  | Ford  | Eleanor | 7900 Jefferson
5  | Dodge | Horace  | 17840 Jefferson
6  | Dodge | John    | 75 Boston Boulevard
7  | Ford  | Henry   | 4901 Evergreen
8  | Ford  | Clara   | 4901 Evergreen
9  | Ford  | Edsel   | 1100 Lakeshore
10 | Ford  | Eleanor | 1100 Lakeshore

Partition:
ID | LName | FName  | Address
5  | Dodge | Horace | 17840 Jefferson
6  | Dodge | John   | 75 Boston Boulevard

Partition:
ID | LName | FName   | Address
1  | Ford  | Henry   | 66 Edison Avenue
2  | Ford  | Clara   | 66 Edison Avenue
3  | Ford  | Edsel   | 7900 Jefferson
4  | Ford  | Eleanor | 7900 Jefferson
7  | Ford  | Henry   | 4901 Evergreen
8  | Ford  | Clara   | 4901 Evergreen
9  | Ford  | Edsel   | 1100 Lakeshore
10 | Ford  | Eleanor | 1100 Lakeshore

NOTE:
- Partition distribution matches source data distribution
- In this example, number of distinct hash key values limits parallelism!
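The effect of key choice on distribution can be sketched in Python. The Framework's actual hash function is not given here, so `zlib.crc32` is used purely as a deterministic stand-in; `hash_partition` and the abbreviated rows are hypothetical, and the specific partition a key lands in will differ from the real engine.

```python
import zlib


def hash_partition(rows, key_columns, num_partitions: int) -> list:
    """Assign each row to a partition based on a hash of its key column values."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Join key values with a separator so ("ab", "c") differs from ("a", "bc").
        key = "\x1f".join(str(row[c]) for c in key_columns).encode("utf-8")
        partitions[zlib.crc32(key) % num_partitions].append(row)
    return partitions


first_names = ["Henry", "Clara", "Edsel", "Eleanor", "Horace", "John",
               "Henry", "Clara", "Edsel", "Eleanor"]
rows = [
    {"ID": i + 1, "LName": "Dodge" if i in (4, 5) else "Ford", "FName": fname}
    for i, fname in enumerate(first_names)
]

by_lname = hash_partition(rows, ["LName"], 4)
by_both = hash_partition(rows, ["LName", "FName"], 4)

# Only two distinct LName values exist, so at most two partitions receive rows.
print(sum(1 for p in by_lname if p))
# Six distinct (LName, FName) combinations can spread over more partitions.
print(sum(1 for p in by_both if p))
```

Whatever hash function is used, rows sharing a key value always share a partition, so a key with few distinct values caps the usable parallelism.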
Partition:
ID | LName | FName | Address
2  | Ford  | Clara | 66 Edison Avenue
8  | Ford  | Clara | 4901 Evergreen

Partition:
ID | LName | FName  | Address
3  | Ford  | Edsel  | 7900 Jefferson
5  | Dodge | Horace | 17840 Jefferson
9  | Ford  | Edsel  | 1100 Lakeshore

Partition:
ID | LName | FName   | Address
4  | Ford  | Eleanor | 7900 Jefferson
6  | Dodge | John    | 75 Boston Boulevard
10 | Ford  | Eleanor | 1100 Lakeshore

Partition:
ID | LName | FName | Address
1  | Ford  | Henry | 66 Edison Avenue
7  | Ford  | Henry | 4901 Evergreen

NOTE:
- Improved distribution
- Only the unique combination of key columns appears in the same partition
- For partitioning purposes, order of HASH key columns is insignificant
  - NOTE: To avoid repartitioning, key column order should be consistent across stages with same keys
Modulus Partitioning
Keyed partitioning method
Rows are distributed according to the values in one integer key column
- Simpler (and faster) calculation than HASH using modulus (remainder) of division:
  partition = MOD(key_value, #partitions)
- Guarantees that rows with identical values in key column end up in the same partition
- Partition size will be relatively equal if the data within the key column is evenly distributed
[Figure: MODULUS output partitions by key value: (0 3 0 3), (1 1 1), (2 2 2)]
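The modulus rule is simple enough to state directly in Python. This sketch assumes non-negative integer keys and three partitions; `modulus_partition` and the input list are hypothetical.

```python
def modulus_partition(keys, num_partitions: int) -> list:
    """Assign each integer key to partition (key mod num_partitions)."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[key % num_partitions].append(key)
    return partitions


# Hypothetical input; identical key values always share a partition.
parts = modulus_partition([0, 3, 1, 2, 0, 1, 2, 3, 1, 2], 3)
print(parts)  # prints: [[0, 3, 0, 3], [1, 1, 1], [2, 2, 2]]
```

Keys 0 and 3 share partition 0 because both leave a remainder of 0 when divided by 3, so the balance of the result depends on how the key values are spread.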
RANGE Partitioning
Keyed partitioning method
Rows are evenly distributed according to the values in one or more key columns
- Requires pre-processing data to generate a range map
  - More expensive than HASH partitioning
  - Must read entire data TWICE to guarantee results
- Guarantees that rows with identical values in key columns end up in the same partition
The Write Range Map stage is used to generate the range map file
- If the source data distribution is consistent over time, it may be possible to re-use the map file
- Values outside of a given range map will land in the first or last partition as appropriate
[Figure: values of key column 4 0 5 1 6 0 5 4 3 pass through RANGE into partitions (0 1 0), (4 4 3), (5 6 5)]
QUIZ! If incoming data is ordered on key, something bad happens. WHAT?
ANSWER: The process runs sequentially (key value adjacency)!
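The two passes over the data (build a range map, then partition with it) can be sketched in Python. How the Write Range Map stage actually chooses its boundaries is not described here, so the quantile-style rule in `make_range_map` is an assumption, and both function names are hypothetical.

```python
from bisect import bisect_right


def make_range_map(sample_keys, num_partitions: int) -> list:
    """First pass: pick boundary values that split the sorted keys evenly."""
    ordered = sorted(sample_keys)
    return [ordered[len(ordered) * i // num_partitions]
            for i in range(1, num_partitions)]


def range_partition(keys, range_map) -> list:
    """Second pass: route each key to the partition whose range contains it."""
    partitions = [[] for _ in range(len(range_map) + 1)]
    for key in keys:
        partitions[bisect_right(range_map, key)].append(key)
    return partitions


keys = [4, 0, 5, 1, 6, 0, 5, 4, 3]
range_map = make_range_map(keys, 3)
parts = range_partition(keys, range_map)

print(range_map)  # prints: [3, 5]
print(parts)      # prints: [[0, 1, 0], [4, 4, 3], [5, 6, 5]]

# Values outside the mapped range fall into the first or last partition.
print(range_partition([-10, 99], range_map))  # prints: [[-10], [], [99]]
```

If the input arrives already sorted on the key, partition 0 receives all of its rows before partition 1 receives any, so downstream work proceeds one partition at a time rather than in parallel.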
[Figure: partitioning icons on job links: SAME partitioner, re-partition (watch for this!), AUTO partitioner]
Automatic Partitioning
By default, the Parallel Framework inserts partition components as necessary to ensure correct results (check the job score)
- Before any stage with Auto partitioning
- Generally chooses between ROUND-ROBIN or SAME
- Inserts HASH on stages that require matched key values (e.g. Join, Merge, RemDup)
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links
  - NOT always appropriate for MPP/clusters
Since the Framework has limited awareness of your data and business rules, it is usually best to explicitly specify HASH partitioning when key groupings are required
- Framework has no visibility into Transformer logic
- Required before SORT and AGGREGATOR (hash method) stages
- Framework may insert un-needed or non-optimal partitioning
Preserve Partitioning flag
- Set automatically by some operators (e.g. Sort, Hash partitioning)
- Can be manually set by users through stage Advanced properties
- Functionally equivalent to explicitly specifying SAME partitioning
  - But allows the Parallel Framework to override and optimize for performance (e.g. if the degree of parallelism differs)
- At runtime, if the Preserve Partitioning flag is set and a downstream operator cannot use the previous partitioning, a warning is issued
If grouping is not required, use ROUND ROBIN to redistribute data equally across all partitions
Framework will often do this with AUTO partitioning
Producer / Consumer
- Partitioning method is associated with the producer
- Collector method is associated with the consumer
- Separated by an indicator:
    ->  Sequential to Sequential
    <>  Sequential to Parallel
    =>  Parallel to Parallel (SAME)
    #>  Parallel to Parallel (not SAME)
    >>  Parallel to Sequential
    >   No producer or no consumer
- May also include [pp] notation when the Preserve Partitioning flag is set
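When reading many scores, the indicators can be decoded with a small lookup. The sketch below is a hypothetical Python helper: the function name and the sample strings are illustrative assumptions, not output copied from a real job score.

```python
# Indicator meanings exactly as listed on the slide.
INDICATORS = {
    "->": "Sequential to Sequential",
    "<>": "Sequential to Parallel",
    "=>": "Parallel to Parallel (SAME)",
    "#>": "Parallel to Parallel (not SAME)",
    ">>": "Parallel to Sequential",
    ">": "No producer or no consumer",
}


def split_link(text):
    """Split 'partitioner<indicator>collector' text into its three parts.

    Two-character indicators are tried first so that '>>' or '=>' is not
    mistaken for the single-character '>' indicator.
    """
    for indicator in sorted(INDICATORS, key=len, reverse=True):
        left, found, right = text.partition(indicator)
        if found:
            return left, indicator, right
    raise ValueError(f"no indicator found in {text!r}")


# Illustrative strings only (hypothetical, not taken from a real score).
partitioner, indicator, collector = split_link("eAny<>eCollectAny")
meaning = INDICATORS[indicator]
```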
Optimizing Partitioning
Minimize the number of re-partitions within and across job flows
Within a flow
- Examine up-stream partitioning and sort order and attempt to preserve them for down-stream stages using SAME partitioning
- May require re-examining key column usage within stages and processing (stage) order
If sort order is significant, write to a persistent data set with the Preserve Partitioning flag SET
- Useful if downstream jobs are run with the same degree of parallelism and require the same partition and sort order
Collecting Data
[Figure: a collector feeding a Stage running Sequentially]
Collectors
Collectors combine partitions of a dataset into a single input stream to a sequential Stage
[Figure: data partitions (NOT links) flow through a collector into a sequential Stage]
Collector Methods
(Auto)
- Eagerly read any row from any input partition
- Output row order is undefined (non-deterministic)
- This is the default collector method
Round Robin
- Patiently pick row from input partitions in round robin order
- Slower than auto, rarely used
Ordered
- Read all rows from first partition, then second, and so on
- Preserves order that exists within partitions
Sort Merge
- Produces a single (sequential) stream of rows sorted on specified key column(s) for input sorted on those keys
- Row order is undefined for non-key columns
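The three deterministic collector methods can be modelled on in-memory lists. This is a sketch in Python with assumed function names, not the Framework's collectors; the Auto method is omitted because its output order is undefined.

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # sentinel for partitions that have run out of rows


def ordered_collect(partitions):
    """Read all rows from the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))


def round_robin_collect(partitions):
    """Take one row from each partition in turn, skipping exhausted partitions."""
    collected = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(row for row in group if row is not _MISSING)
    return collected


def sort_merge_collect(partitions, key=None):
    """Merge partitions that are each already sorted on the key into one sorted stream."""
    return list(heapq.merge(*partitions, key=key))


# Each partition is sorted, as a Sort Merge collector requires.
partitions = [[1, 4, 7], [2, 5], [3, 6]]
ordered = ordered_collect(partitions)
round_robin = round_robin_collect(partitions)
merged = sort_merge_collect(partitions)
```

Note that the ordered result is only globally sorted when the sorted input was range-partitioned, for example partitions `[0, 0, 1]`, `[3, 4, 4]`, `[5, 5, 6]`.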
Ordered collector is only appropriate when sorted input has been range-partitioned
- No sort required to produce sorted output
Round robin collector can be used to reconstruct original (sequential) row order for round-robin partitioned inputs
- As long as intermediate processing (e.g. sort, aggregator) has not altered row order or reduced the number of rows
- Rarely used in real-life scenarios
Funnel stage
- Stage that runs in parallel
- Merges data from multiple links (multiple virtual data sets) to a single output link
- Table Definitions (schema) of all links must match
[Figure: Collector and Funnel]
Sorting Data
Before sort:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

Sort on: LName (asc), FName (desc)

After sort:
ID  LName  FName    Address
6   Dodge  John     75 Boston Boulevard
5   Dodge  Horace   17840 Jefferson
1   Ford   Henry    66 Edison Avenue
7   Ford   Henry    4901 Evergreen
4   Ford   Eleanor  7900 Jefferson
10  Ford   Eleanor  1100 Lakeshore
3   Ford   Edsel    7900 Jefferson
9   Ford   Edsel    1100 Lakeshore
2   Ford   Clara    66 Edison Avenue
8   Ford   Clara    4901 Evergreen
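The mixed-direction sort (LName ascending, FName descending) can be reproduced outside DataStage as a check. The Python sketch below is an illustration only: it relies on Python's stable sort and sorts twice, least significant key first, which is an assumption about approach and not how the tsort operator works internally.

```python
# (ID, LName, FName) in the original input order.
rows = [
    (1, "Ford", "Henry"),
    (2, "Ford", "Clara"),
    (3, "Ford", "Edsel"),
    (4, "Ford", "Eleanor"),
    (5, "Dodge", "Horace"),
    (6, "Dodge", "John"),
    (7, "Ford", "Henry"),
    (8, "Ford", "Clara"),
    (9, "Ford", "Edsel"),
    (10, "Ford", "Eleanor"),
]


def sort_lname_asc_fname_desc(rows):
    """Sort on LName ascending, then FName descending within each LName."""
    # Least significant key first; the second (stable) sort keeps that order
    # among rows that share the same LName.
    by_fname_desc = sorted(rows, key=lambda row: row[2], reverse=True)
    return sorted(by_fname_desc, key=lambda row: row[1])


sorted_ids = [row[0] for row in sort_lname_asc_fname_desc(rows)]
```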
Parallel Sort
- In most cases, there is no need to globally sort data to produce a single sequence of rows
- Instead, sorting is most often needed to establish order within specified groups of data
  - Join, Merge, Aggregator, RemDup, etc.
  - This sort can be done in parallel!
- Partitioning is used to gather related rows
  - Assigns rows with the same key column(s) values to the same partition
- Sorting is used to establish grouping and order within each partition based on one or more key column(s)
  - Key values are adjacent
- Partition and Sort keys need not be the same!
  - Often the case before Remove Duplicates
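A rough Python model of "partition to gather related rows, then sort within each partition". The CRC32-based hash, the function names, and the choice of four partitions are assumptions for illustration; the Framework's own hash partitioner will assign rows differently, and the partitions here are sorted one after another instead of in parallel.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Assign rows with the same key value to the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        digest = zlib.crc32(str(key(row)).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return partitions


def parallel_sort(rows, partition_key, sort_key, num_partitions):
    """Partition on one key, then sort each partition independently on another."""
    partitions = hash_partition(rows, partition_key, num_partitions)
    return [sorted(partition, key=sort_key) for partition in partitions]


# (LName, FName): partition on FName, sort on LName (keys need not match).
rows = [
    ("Ford", "Henry"), ("Ford", "Clara"), ("Ford", "Edsel"), ("Ford", "Eleanor"),
    ("Dodge", "Horace"), ("Dodge", "John"), ("Ford", "Henry"), ("Ford", "Clara"),
    ("Ford", "Edsel"), ("Ford", "Eleanor"),
]
result = parallel_sort(rows, lambda row: row[1], lambda row: row[0], 4)
```

No global order exists across partitions; only the order within each partition is established.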
[Figure: Parallel Sort. Each partition is shown before and after sorting]

Part 0, before sort:
  ID  LName  FName  Address
  2   Ford   Clara  66 Edison Avenue
  8   Ford   Clara  4901 Evergreen
Part 0, after sort:
  ID  LName  FName  Address
  2   Ford   Clara  66 Edison Avenue
  8   Ford   Clara  4901 Evergreen

Part 1, before sort:
  ID  LName  FName   Address
  3   Ford   Edsel   7900 Jefferson
  5   Dodge  Horace  17840 Jefferson
  9   Ford   Edsel   1100 Lakeshore
Part 1, after sort:
  ID  LName  FName   Address
  5   Dodge  Horace  17840 Jefferson
  3   Ford   Edsel   7900 Jefferson
  9   Ford   Edsel   1100 Lakeshore

Part 2, before sort:
  ID  LName  FName    Address
  4   Ford   Eleanor  7900 Jefferson
  6   Dodge  John     75 Boston Boulevard
  10  Ford   Eleanor  1100 Lakeshore
Part 2, after sort:
  ID  LName  FName    Address
  6   Dodge  John     75 Boston Boulevard
  4   Ford   Eleanor  7900 Jefferson
  10  Ford   Eleanor  1100 Lakeshore

Part 3, before sort:
  ID  LName  FName  Address
  1   Ford   Henry  66 Edison Avenue
  7   Ford   Henry  4901 Evergreen
Part 3, after sort:
  ID  LName  FName  Address
  1   Ford   Henry  66 Edison Avenue
  7   Ford   Henry  4901 Evergreen
Lightweight stages that minimize memory usage by requiring data in key-column sort order
- Join
- Merge
- Sort Aggregator
[Figure: two ways to sort: a sort on a link OR a Sort stage]
By default, both methods use the same internal sort package (tsort operator)
Sorting on a Link
- Easier job maintenance (fewer stages on job canvas)
- BUT fewer options (tuning, features)
Sort Stage
The Sort stage offers more options than a link sort
Always specify DataStage Sort Utility (much faster than UNIX sort)
Stable Sorts
- Stable sort preserves the incoming order of rows (and so of their non-key column values) within each sort group
- Stable sorts are slightly slower than non-stable sorts for the same data/keys
  - Only use Stable sort when needed
- By default, stable sort is enabled on Sort stages!
- Stable sort is not the default for Link sorts
Resorting on Sub-Groups
Use Sort Key Mode property to re-use key column groupings from previous sorts
- Uses significantly less memory/disk!
  - Sort is now on previously-sorted key-column groups, not the entire data set
  - Outputs rows after each group
- Key column order is important!
- Don't forget to retain incoming sort order (e.g. SAME partitioning)
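The memory saving comes from sorting one group at a time. A minimal Python sketch of the idea, with assumed names; it models the behaviour on an in-memory iterable and is not the Sort stage implementation.

```python
from itertools import groupby


def resort_within_groups(rows, group_key, sub_key):
    """Sort on sub_key inside each run of rows that share group_key.

    Assumes the input is already sorted (or at least grouped) on group_key,
    so only one group is held in memory at a time and rows are emitted
    as soon as each group ends.
    """
    for _, group in groupby(rows, key=group_key):
        yield from sorted(group, key=sub_key)


# Input previously sorted on LName only: (LName, FName).
rows = [
    ("Dodge", "John"),
    ("Dodge", "Horace"),
    ("Ford", "Henry"),
    ("Ford", "Clara"),
    ("Ford", "Edsel"),
]
resorted = list(resort_within_groups(rows, lambda row: row[0], lambda row: row[1]))
```

If the incoming order on the leading key is lost (for example by re-partitioning), the groups are no longer contiguous and the result is wrong, which is why the incoming sort order must be retained.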
[Figure: Partitioner. Values shown: 1 2 3 and 2 101 3 1 103 102]
OR
SortMerge Collector
- For sorted input
In general, parallel Sort + SortMerge collector will be MUCH faster than a sequential Sort
- Unless data is already sequential
Automatic Sorting
By default, the Parallel Framework inserts sort operators as necessary to ensure correct results
- Before any stage that requires matched key values (e.g. Join, Merge, RemDup)
- Only inserted when the user has NOT explicitly defined an input sort
- Check the Job SCORE for inserted tsort operators

  op1[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
      on nodes (
        node1[op2,p0]
        node2[op2,p1]
        node3[op2,p2]
        node4[op2,p3]
      )}

  op1[4p] {(parallel inserted tsort operator {key={value=LastName, subArgs={sorted}}, key={value=FirstName}, subArgs={sorted}}(0))
      on nodes (
        node1[op2,p0]
        node2[op2,p1]
        node3[op2,p2]
        node4[op2,p3]
      )}

For versions 7.01 and later, set $APT_SORT_INSERTION_CHECK_ONLY to change behavior of automatically inserted sorts
- Instead of actually performing the sort, the inserted sort operators only VERIFY the data is sorted
- If data is not sorted properly at runtime, the job will fail
- Recommended only on a per-job basis during performance tuning
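The difference between sorting and merely verifying can be sketched in Python. The generator below is a hypothetical model of "check only" behaviour (the name and error message are assumptions): it passes rows through unchanged and fails on the first row that is out of order.

```python
def verify_sorted(rows, key):
    """Yield rows unchanged, raising ValueError if they are not sorted on key."""
    previous = None
    for position, row in enumerate(rows):
        current = key(row)
        if position > 0 and current < previous:
            raise ValueError(f"row {position} is out of order: {row!r}")
        previous = current
        yield row


# Sorted input passes straight through without being buffered or re-sorted.
checked = list(verify_sorted([("Dodge", "John"), ("Ford", "Clara")], key=lambda row: row[0]))
```

Verification needs only the previous key value, which is why it is so much cheaper than a sort.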
When the memory buffer is filled, sort uses temporary disk space in the following order:
- Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
- Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
- The default directory specified by $TMPDIR
- The UNIX /tmp directory
- Specify only necessary key columns
- Avoid stable sorts unless needed to retain order of non-key column data
- If possible, use the Sort Key Mode key column option to re-use previous sort keys
- Within the Sort stage, adjusting the Restrict Memory Usage option may improve performance
Partitioning Examples
Partitioning Example 1
Scenario: Assign average value to existing detail rows
Standard Solution (3 hash/sorts):
- Copy Data, Hash and Sort on all inputs to Aggregator, Join
- This is also the method the framework would use with Auto partitioning to ensure correct results
[Figure: job flow with Copy, Aggregate, and Join stages]
Notice that all 3 hash partitioners and sorts use the same key columns and order!
Inserted sorts will verify row order at runtime, but will not actually sort data
NOTE: since the Join Key value is constant, inputs to the JOIN stage should NOT be sorted
BUT there is only one hash key value, so the Join runs sequentially
[Figure: job flow showing an inserted Buffer operator and Stage 3]
Buffer operators may also be inserted in an attempt to match producer and consumer rates
Some stages (e.g. Sort, Hash Aggregator) internally buffer the entire data set before outputting a row
- Buffer operators are never inserted after these stages
[Figure: Producer, Buffer, Consumer, annotated with $APT_BUFFER_FREE_RUN]
Buffer will offer resistance to new rows, slowing down upstream producer
In general, buffer tuning is an advanced topic. The default settings should be appropriate for most job flows. For very wide rows, it may be necessary to increase default buffer size to handle more rows in memory
- Calculate total record width using internal storage for each data type / length / scale. For variable-length (varchar) columns, use maximum length.
$APT_BUFFER_MAXIMUM_MEMORY
When buffer memory is filled, temporary disk space is used in the following order:
- Scratch disks in the $APT_CONFIG_FILE "buffer" named disk pool
- Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
- The default directory specified by $TMPDIR
- The UNIX /tmp directory
- Data in the grouping key column(s) changes (logical End of Data Group)
- All rows have been processed (End of Data)
For stages that process groups, rows are buffered in memory until an End of Data Group or End of Data
- Some stages (e.g. Sort, Hash Aggregator) must read an entire input data set (until End of Data) before outputting a single record
  - Setting the "Don't Sort, Previously Sorted" key option changes Sort behavior to output on groups instead of the entire data set
Generating one header row with no subsequent change in join column, data is buffered until end of data
[Figure: job flow with Src, Header and Detail links, Buffer operators, and Out]
Solution: Use stage variables to hold header data values.
- Output multiple header rows with different join-key values
- This additional logic may impact Transformer performance
Proper solution ultimately depends on data volume and available hardware resources
Header Link:
- Use output link constraints to only output data after header values have been captured
- Assign more than one join key value using @INROWNUM
  - Assumes only one header row
Detail Link:
Assign constant value to detail join column
For Example 2, single Header row must be the second input link (#1) to the Join stage
Otherwise, all input data will be read into virtual memory
Summary
Partitioning
- Method should ensure correct results AND (if possible) evenly distribute data
- Must be aware of data distribution and impact on processing
Collecting
- Used to consolidate partitioned data into a sequential process
Sorting
- Parallel sorting establishes row order within groups
  - Partitioning gathers related rows
- Sequential sorting only needed to produce a single, globally sorted sequential result set
- Outlines connection topology (datasets) between adjacent operators and/or persistent data sets
- Defines number of actual UNIX processes
  - Where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements
For each job run, 2 separate Score Dumps are written to the log
- First score is actually from the license operator
- Second score entry is the actual job score
[Figure: log entries for the License Operator score and the job score]
Operators
- node/operator mapping
Operator Combination
At runtime, the DataStage Parallel Framework can only combine stages (operators) that:
Use the same partitioning method
- Repartitioning prevents operator combination between the corresponding producer and consumer stages
- Implicit repartitioning (e.g. Sequential operators, node maps) also prevents combination
Are Combinable
- Set automatically within the stage/operator definition
- Can also be set within DataStage Designer: Advanced stage properties
- LUTCreateImpl
  - Reads the reference data into memory
- LUTProcessImpl
  - Performs actual lookup processing once reference data has been loaded

  op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3)
      on nodes (
        ecc3671[op2,p0]
      )}
  op3[4p] {(parallel buffer(0))
      on nodes (
        ecc3671[op3,p0]
        ecc3672[op3,p1]
        ecc3673[op3,p2]
        ecc3674[op3,p3]
      )}
  op4[4p] {(parallel APT_CombinedOperatorController:
        (APT_LUTProcessImpl in Lookup_3)
        (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
        (PeekNull)
      ) on nodes (
        ecc3671[op4,p0]
        ecc3672[op4,p1]
        ecc3673[op4,p2]
        ecc3674[op4,p3]
      )}
Producer / Consumer
- Partitioning method is associated with the producer
- Collector method is associated with the consumer
  - eCollectAny is specified for parallel consumers, although no collection occurs!
- Separated by an indicator:
    ->  Sequential to Sequential
    <>  Sequential to Parallel
    =>  Parallel to Parallel (SAME)
    #>  Parallel to Parallel (not SAME)
    >>  Parallel to Sequential
    >   No producer or no consumer
- May also include [pp] notation when the Preserve Partitioning flag is set
Assumptions
This module assumes that you have an understanding of the topics covered in:
- Module 01: Parallel Framework Architecture
- Module 02: Partitioning, Collecting, and Sorting
- Module 03: Parallel Job Score
- Material covered in DS324PX: DataStage Enterprise Edition Essentials
- Job parameters and multi-instance job properties facilitate job re-use
- Land intermediate results to parallel data sets
Fork-join job flows may run faster if split into two separate jobs with intermediate datasets
- Depends on processing requirements and ability to tune buffering
Job Sequences
- Job Sequences can be used to combine individual jobs into functional modules to perform a sequence of activities
- Starting with DataStage release 7.1, Job Sequences can be restartable
  - In the event of a failure, rerunning the sequence will not rerun activities that successfully completed
  - It is the developer's responsibility to ensure that an individual job can be re-run after a failure
  - The "Do not checkpoint run" sequence stage property will execute that step on every Sequence run
  - Enable Sequence restart in Job Properties (enabled by default)
When reading delimited files, extra characters are silently truncated for source file values longer than the maximum specified length of VARCHAR columns
Starting with v7.01, set the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to reject these records instead
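The two behaviours can be contrasted with a small Python model. This is an illustrative sketch only: the function name and the `reject_overruns` parameter are assumptions standing in for whether the environment variable is set, and the code does not reproduce the import operator itself.

```python
def import_varchar(value: str, max_length: int, reject_overruns: bool = False):
    """Return (accepted_value, rejected) for one VARCHAR field.

    reject_overruns=False models the default: overlong values are silently truncated.
    reject_overruns=True models rejecting the record instead.
    """
    if len(value) <= max_length:
        return value, False
    if reject_overruns:
        return None, True
    return value[:max_length], False


default_result = import_varchar("Jefferson", 4)
reject_result = import_varchar("Jefferson", 4, reject_overruns=True)
```

Silent truncation changes data without any warning, so rejecting overruns is the safer choice when source lengths are not guaranteed.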
For realtime applications, the environment variable $APT_EXPORT_FLUSH_COUNT can be used to specify the number of rows to buffer
- For example, $APT_EXPORT_FLUSH_COUNT=1 flushes to disk for every row
- Setting this value low incurs a SIGNIFICANT performance penalty!
The Column Import stage can be used to improve performance of non-parallel Sequential File reads and FTP sources
Allows column parsing to run in parallel
Separates parsing (CPU) from sequential source I/O
Define source file/FTP as a single column
- Type RAW or [VAR]CHAR
- Maximum length = record size
- Note that there are metadata implications
Define Columns, Data Types, and other format options in Column Import stage
- Similar to Sequential File definition
Limit use of Sparse Lookup (for DB2 and Oracle reference tables)
Per-row database lookups are extremely expensive (slow)
- For small numbers of rows, can be used for database-generated variables / function results
ONLY appropriate when the number of input rows is significantly smaller (e.g. 1:100) than the number of reference rows
On SMP platforms, the Lookup stage uses shared memory instead of duplicating ENTIRE reference data
To minimize data movement across nodes in clustered / MPP platforms, it may be appropriate to select a keyed partitioning method
- Especially if data is already partitioned on those keys
- Input and Reference data partitioning must match
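The requirement that input and reference partitioning match can be illustrated with a small Python sketch. It assumes a simple hash partitioner (crc32 is used only because it is deterministic); the function names and sample rows are hypothetical and do not reflect the framework's own partitioner.

```python
import zlib


def partition_of(key, num_partitions):
    """Deterministic hash partitioner, for illustration only."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_rows(rows, key, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[partition_of(row[key], num_partitions)].append(row)
    return parts


def partitioned_lookup(input_rows, reference_rows, key, num_partitions):
    """Look up each input row against the reference rows of its own partition only."""
    in_parts = partition_rows(input_rows, key, num_partitions)
    ref_parts = partition_rows(reference_rows, key, num_partitions)
    output = []
    for in_part, ref_part in zip(in_parts, ref_parts):
        ref_index = {r[key]: r for r in ref_part}  # per-partition reference table
        for row in in_part:
            match = ref_index.get(row[key])
            if match is not None:
                output.append({**match, **row})
    return output


reference = [{"cust": "A", "region": "East"},
             {"cust": "B", "region": "West"},
             {"cust": "C", "region": "North"}]
inputs = [{"cust": "B", "amt": 10}, {"cust": "A", "amt": 7}, {"cust": "Z", "amt": 1}]
result = partitioned_lookup(inputs, reference, "cust", num_partitions=4)
print(sorted((r["cust"], r["region"], r["amt"]) for r in result))
# [('A', 'East', 7), ('B', 'West', 10)]
```

Because both sides use the same partitioner on the same key, a matching reference row always sits in the same partition as its input row, so no partition needs the ENTIRE reference data.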
Lookup Reference Data
NEVER generate Lookup reference data using a fork-join of source data
Lookup cannot output rows until all reference data has been read into memory (except for Oracle or DB2 Sparse reference links)
[Diagram labels: Src, Header, HeaderRef, Detail, Out]
Use Lookup File Sets to separate the creation of lookup reference data from lookup processing
Lookup File Sets can only be used as reference input link to a Lookup stage
Partitioning method and key columns specified when the Lookup File Set is created will be used to process the reference data on subsequent Lookups that use this file
Particularly useful when static reference data can be reused in multiple jobs (or runs of the same job)
Aggregator
The Aggregator stage summarizes data based on groupings of key-column values
Input partitioning must match desired groupings
Use Hash method for inputs with a limited number of distinct key-column values
- Uses 2K of memory/group
- Incoming data does not need to be pre-sorted
- Results are output after all rows have been read
- Output row order is undefined, even if input data is sorted
Use Sort method with a large (or unknown) number of distinct key-column values
- Requires input pre-sorted on key columns
- Results are output after each group
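The trade-off between the two methods can be sketched in Python. This is a conceptual analogy, not the Aggregator implementation: the hash variant holds one entry per group until all input is read, while the sort variant holds only the current group.

```python
from itertools import groupby
from operator import itemgetter


def hash_aggregate(rows):
    """Hash method: one in-memory entry per group; results only after all input is read."""
    totals = {}
    for key, amount in rows:
        totals[key] = totals.get(key, 0) + amount
    return totals  # treat the order as undefined, as with the stage itself


def sort_aggregate(sorted_rows):
    """Sort method: input must be pre-sorted on the key; each group is emitted as it ends."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)


rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
print(hash_aggregate(rows))                # {'b': 5, 'a': 5}
print(list(sort_aggregate(sorted(rows))))  # [('a', 5), ('b', 5)]
```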
Use 2 aggregators to prevent sequential aggregation (and collector) from slowing down upstream data flow
First aggregator runs in parallel, grouping on generated key column
- Round-robin input if not evenly distributed
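The two-aggregator pattern amounts to computing partial results per partition and then combining them in a small final step. A minimal Python sketch, assuming a simple count/total aggregation (the function names are hypothetical):

```python
def partial_aggregate(partition):
    """First aggregator: runs once per partition and emits one partial result."""
    return {"count": len(partition), "total": sum(partition)}


def final_aggregate(partials):
    """Second aggregator: sequential, but it only sees one row per partition."""
    return {
        "count": sum(p["count"] for p in partials),
        "total": sum(p["total"] for p in partials),
    }


values = list(range(1, 101))
num_partitions = 4
# Round-robin distribution keeps the partitions evenly sized.
partitions = [values[i::num_partitions] for i in range(num_partitions)]
partials = [partial_aggregate(p) for p in partitions]  # would run in parallel in a real job
print(final_aggregate(partials))  # {'count': 100, 'total': 5050}
```

The sequential step handles four rows instead of one hundred, so it no longer throttles the upstream flow.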
- NOTE: starting with v7.01, Transformer output link constraints are FASTER than Filter stage! (Filter is always interpreted)
- Create a Lookup table for the source-value pairs, and use the Lookup stage to assign values
[Example table with columns A and Result; cell values not recoverable from the slide]
Replace Transformer stages that do not meet performance requirements with BuildOps
It is generally not necessary to replace all Transformers, just those that are bottlenecks
Remember, BuildOps require more knowledgeable developers than equivalent Transformer logic
Stage variables and columns within a link are evaluated in the order displayed in the Transformer editor
Use the stage variable Initial Value to evaluate once for all rows
- Where an expression requiring a type conversion is used as a constant, or is used in multiple places
Default internal decimal variables are precision 38 scale 10, but this can be changed by specifying
- $APT_DECIMAL_INTERM_PRECISION
- $APT_DECIMAL_INTERM_SCALE
Rounding behavior is set with $APT_DECIMAL_INTERM_ROUND_MODE
- round_inf: rounds or truncates to nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity
- trunc_zero: discard any fractional digits to the right of the rightmost fractional digit supported, regardless of sign. If $APT_DECIMAL_INTERM_SCALE is smaller than the results of an internal calculation, round or truncate to the scale size
When the Abort After Rows threshold is reached, the Transformer immediately aborts the job flow, potentially leaving uncommitted database rows or un-flushed file buffers
Always test for null value before using a column in a function
Avoid type conversions
- Try to maintain data type as imported
Be aware of Column and Stage Variable data types
- It is easy to neglect setting the proper Stage Variable type
Example: For derivations that cannot output a running total, use 3 Sort stages before Transformer to generate a change key column for the last row in the group
Often, data is already sorted earlier in the same flow
Hash/Sort on key columns before first sort
Use SAME partitioning to ensure that subsequent stages keep grouping and sort order
[Diagram labels: Sort, KeyChange, SubSort]
Second Sort
- Does no sorting; creates key-change column
- Specify only key columns
Final Sub-Sort
- Does not sort on key columns
- Sub-sorts Ascending on group order column
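The three-sort technique can be sketched in Python. One assumption is made here that the slides do not state: the first sort orders each group descending on the group order column, so that the key-change flag (set on the first row of each group) ends up on what becomes the last row after the final sub-sort. The column names `key`, `seq`, and `key_change` are hypothetical.

```python
from itertools import groupby


def first_sort(rows):
    # Assumption: sort on the key, with the group order column descending.
    return sorted(rows, key=lambda r: (r["key"], -r["seq"]))


def add_key_change(rows):
    # No sorting here: flag the first row of each key group.
    out, previous = [], object()
    for r in rows:
        out.append({**r, "key_change": 1 if r["key"] != previous else 0})
        previous = r["key"]
    return out


def final_sub_sort(rows):
    # Key order is left alone; each group is re-ordered ascending on the group order column.
    out = []
    for _, group in groupby(rows, key=lambda r: r["key"]):
        out.extend(sorted(group, key=lambda r: r["seq"]))
    return out


rows = [{"key": "A", "seq": 1}, {"key": "B", "seq": 1}, {"key": "A", "seq": 2},
        {"key": "A", "seq": 3}, {"key": "B", "seq": 2}]
flagged = final_sub_sort(add_key_change(first_sort(rows)))
print([(r["key"], r["seq"], r["key_change"]) for r in flagged])
# [('A', 1, 0), ('A', 2, 0), ('A', 3, 1), ('B', 1, 0), ('B', 2, 1)]
```

A downstream Transformer can then keep a running total per group and emit it only on rows where the flag is 1.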
May be issues with index maintenance, constraints, etc.
May not work with tables that have associated triggers
CLOSE command allows user to specify SQL to be executed after the stage completes reading or writing
Example: INSERT INTO the actual table using a SELECT FROM the temporary table
Example: Delete temporary table(s)
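The kind of SQL a CLOSE command might carry can be tried out with Python's built-in sqlite3 module. SQLite stands in for the real target database here, and the table names `orders` and `orders_tmp` are hypothetical; the exact SQL dialect depends on the database in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE orders_tmp (id INTEGER PRIMARY KEY, amount REAL);
""")

# The job itself writes only to the temporary table.
conn.executemany("INSERT INTO orders_tmp VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Statements of the kind a CLOSE command could run once the write has completed.
close_command = """
    INSERT INTO orders SELECT id, amount FROM orders_tmp;
    DROP TABLE orders_tmp;
"""
conn.executescript(close_command)

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
tmp_exists = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE name = 'orders_tmp'"
).fetchone()[0]
print(row_count, tmp_exists)  # 2 0
```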
Tightly coupled with DB2: communicates directly with each DB2 database node, using same partitioning as DB2 table
Supports Parallel Read, Upsert, Load, Sparse Lookup
Target
- Upsert: uses Oracle API
- Load: invokes SQL*Loader, subject to its limitations
In general, avoid using this option for local Oracle databases (on same host as DataStage server)
Specifying it for local Oracle instances forces a TCP (network) connection instead of a shared memory database connection
Instead, set the environment variable $ORACLE_SID
- Oracle environment is typically defined within the DataStage dsenv file
For larger data volumes, it is often faster to identify Insert and Update data within the job and separate into different Oracle Enterprise targets
Set Upsert Mode=Update Only for rows to be updated
Set Upsert Mode=Update and Insert for rows to be inserted
Prevents double-processing of update records
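The routing decision itself is simple, as the following Python sketch shows. It assumes the set of keys already present in the target is available to the job (for example through a lookup); `split_inserts_updates` and the sample rows are hypothetical.

```python
def split_inserts_updates(rows, existing_keys, key="id"):
    """Route each row to the insert or update target based on whether its key exists."""
    inserts, updates = [], []
    for row in rows:
        (updates if row[key] in existing_keys else inserts).append(row)
    return inserts, updates


existing = {1, 2}
rows = [{"id": 1, "v": "x"}, {"id": 3, "v": "y"}, {"id": 2, "v": "z"}]
inserts, updates = split_inserts_updates(rows, existing)
print([r["id"] for r in inserts])  # [3]
print([r["id"] for r in updates])  # [1, 2]
```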
Rows are committed after a period of time or number of rows, whichever comes first, for each Oracle stage/partition:
$APT_ORAUPSERT_COMMIT_ROW_INTERVAL
- Default is every 2000 rows (per stage/partition)
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL
- Default is every 2 seconds
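The "whichever comes first" rule can be modeled with a small Python class. This is a simplified sketch, not the Oracle stage's logic: it only checks the timer when a row arrives, and `CommitPolicy` and its parameters are hypothetical names. The clock is injectable so the behavior can be demonstrated without waiting.

```python
import time


class CommitPolicy:
    """Signal a commit when either the row interval or the time interval is reached."""

    def __init__(self, row_interval, time_interval, clock=time.monotonic):
        self.row_interval = row_interval
        self.time_interval = time_interval
        self.clock = clock
        self.rows_pending = 0
        self.last_commit = clock()

    def row_written(self):
        """Record one row; return True if a commit is due now."""
        self.rows_pending += 1
        due = (self.rows_pending >= self.row_interval
               or self.clock() - self.last_commit >= self.time_interval)
        if due:
            self.rows_pending = 0
            self.last_commit = self.clock()
        return due


fake_now = [0.0]
policy = CommitPolicy(row_interval=3, time_interval=2.0, clock=lambda: fake_now[0])
events = []
for step in range(5):
    fake_now[0] += 0.5 if step < 3 else 2.5  # fast rows first, then slow rows
    events.append(policy.row_written())
print(events)  # [False, False, True, True, True]
```

The third row triggers a commit by row count; the last two trigger it by elapsed time.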
In Append or Truncate modes, the IndexMode option can allow load into an indexed table:
Rebuild: bypasses indexes during load, rebuilds indexes after load completes
- uses Oracle ALTER INDEX REBUILD command
- indexes cannot be partitioned
Maintenance:
- Loads each partition sequentially
- Table and Index must be partitioned
- Index must be local range-partitioned using same range values used to partition the table
Which Teradata stage to use?
Source or Target: Teradata Enterprise uses FastExport and FastLoad utilities
- High-volume parallel reads and writes
- Targets are limited to Insert operations (empty table or Append)
- Supports OPEN and CLOSE commands
SessionsPerPlayer determines the number of connections each player will have to Teradata. Indirectly, it also determines the number of players (degree of parallelism).
Default is 2 sessions / player
The number selected should be such that SessionsPerPlayer * number of nodes * number of players per node = RequestedSessions
Setting the value of SessionsPerPlayer too low on a large system can result in so many players that the job fails due to insufficient resources. In that case, the value for SessionsPerPlayer should be increased.
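The relationship above is plain arithmetic and can be checked with a few lines of Python. The helper names are hypothetical, and players per node defaults to 1 here purely for illustration.

```python
def requested_sessions(sessions_per_player, nodes, players_per_node=1):
    """RequestedSessions = SessionsPerPlayer * nodes * players per node."""
    return sessions_per_player * nodes * players_per_node


def sessions_per_player_for(requested, nodes, players_per_node=1):
    """Solve the same relationship for SessionsPerPlayer."""
    players = nodes * players_per_node
    if requested % players != 0:
        raise ValueError("RequestedSessions is not a multiple of the player count")
    return requested // players


print(requested_sessions(2, nodes=8))        # 16
print(sessions_per_player_for(16, nodes=4))  # 4
```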
DataStage Server
Teradata Server: MPP with 4 TPA nodes, 4 AMPs per TPA node
Example Settings
Configuration File    Total Sessions
16 nodes              16
8 nodes               16
8 nodes               8
4 nodes               16
Teradata Plug-Ins
Target: Teradata MultiLoad plug-in (MultiLoad utility)
- Targets allow Insert, Update, Delete, or Upsert of moderate data volumes (stage cannot run in parallel)
- Do NOT use as a source in an EE flow! (runs FastExport sequentially)
Source or Target: Teradata API stage does not use database utilities
- Intended for small volumes of data
- Does not count against Teradata limits, but slower than TPump
- Cannot read in parallel (parallel writes are allowed)
SQL or DataStage?
When reading data from multiple tables in the same database, it is possible to use either SQL or DataStage for some tasks. In general, the optimal implementation leverages the strengths of each technology:
When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to the DataStage job
Use a SQL JOIN to combine data from tables with a small-to-medium number of rows, especially when the join columns are indexed
In general, avoid SQL SORTs: DataStage SORT is much faster and runs in parallel without the overhead of sort-merge
Use DataStage SORT and JOIN to combine data from very large tables, or when the join condition is complex
Avoid the use of database stored procedures (e.g. Oracle PL/SQL) on a per-row basis. Implement these routines using native DataStage components.
When the direction is not obvious, the decision is often made by actual tests, or influenced by other factors such as metadata needs and developer skill sets
DataStage Enterprise Edition Best Practices and Performance Tuning document
Don't be afraid to try!
$ENV pulls value from operating system environment
$PROJDEF uses project default value
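How the two markers resolve can be modeled in a few lines of Python. This is a simplified mental model rather than DataStage internals: `resolve_parameter`, the `project_defaults` dictionary, and the sample values are all hypothetical.

```python
import os


def resolve_parameter(name, job_value, project_defaults, environ=os.environ):
    """Resolve a job parameter value, honoring the $ENV and $PROJDEF markers."""
    if job_value == "$ENV":
        return environ.get(name)           # value from the operating system environment
    if job_value == "$PROJDEF":
        return project_defaults.get(name)  # project-level default
    return job_value                       # literal value set on the job


project_defaults = {"APT_CONFIG_FILE": "/opt/configs/default.apt"}
fake_environ = {"APT_CONFIG_FILE": "/opt/configs/4node.apt"}

print(resolve_parameter("APT_CONFIG_FILE", "$ENV", project_defaults, fake_environ))
# /opt/configs/4node.apt
print(resolve_parameter("APT_CONFIG_FILE", "$PROJDEF", project_defaults, fake_environ))
# /opt/configs/default.apt
```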
IMPORTANT: Always make a backup copy of the DSParams file before any manual editing. It is possible to render a project unusable through improper editing of DSParams
$OSH_ECHO=1
Outputs generated OSH to job log
$APT_PM_SHOW_PIDS=1
Places UNIX process ID entries in job log for each process started at runtime Does not show DataStage phantom or Server processes
$APT_BUFFER_MAXIMUM_TIMEOUT=1
Maximum buffer delay in seconds
$APT_COPY_TRANSFORM_OPERATOR=1
For clusters/MPP only: copies Transform operator(s) to remote nodes
$APT_MONITOR_SIZE=[rows]
- If set, specifies that the job monitor capture information on a row (not time) basis. This is the method used in DataStage release 6.x
$APT_NO_JOBMON=1
- Disables job monitoring completely; no statistics will be captured
- In rare instances, this may improve performance
$APT_DECIMAL_INTERM_PRECISION=[precision]
$APT_DECIMAL_INTERM_SCALE=[scale]
Specifies internal precision and scale used for internal Transformer derivations
Default precision/scale is [38,10], maximum is [255,255]
$APT_DECIMAL_INTERM_ROUND_MODE=[mode]
ceil: rounds toward positive infinity
- 1.4 -> 2, -1.6 -> -1
round_inf: rounds or truncates to nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity
- 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
trunc_zero: discard any fractional digits to the right of the rightmost fractional digit supported, regardless of sign. If $APT_DECIMAL_INTERM_SCALE is smaller than the results of an internal calculation, round or truncate to the scale size
- 1.56 -> 1.5, -1.56 -> -1.5
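The three modes behave like rounding modes in Python's standard decimal module, which makes the examples above easy to reproduce. The mapping below (ceil to ROUND_CEILING, round_inf to ROUND_HALF_UP, trunc_zero to ROUND_DOWN) is an analogy chosen for illustration, not a statement about how the framework is implemented; `apply_round_mode` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_HALF_UP, ROUND_DOWN

# Assumed correspondence between the mode names and Python rounding modes.
MODES = {"ceil": ROUND_CEILING, "round_inf": ROUND_HALF_UP, "trunc_zero": ROUND_DOWN}


def apply_round_mode(value, mode, scale=0):
    """Round a decimal string to `scale` fractional digits using the named mode."""
    quantum = Decimal(1).scaleb(-scale)  # scale=1 gives Decimal('0.1')
    return Decimal(value).quantize(quantum, rounding=MODES[mode])


print(apply_round_mode("1.4", "ceil"), apply_round_mode("-1.6", "ceil"))            # 2 -1
print(apply_round_mode("1.5", "round_inf"), apply_round_mode("-1.5", "round_inf"))  # 2 -2
print(apply_round_mode("1.56", "trunc_zero", scale=1),
      apply_round_mode("-1.56", "trunc_zero", scale=1))                             # 1.5 -1.5
```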
$APT_PM_PLAYER_TIMING=1
- When set, prints detailed information in the job log for each operator, including CPU utilization and elapsed processing time
$APT_PM_PLAYER_MEMORY=1
- When set, prints detailed information in the job log for each operator when additional memory is allocated
$APT_BUFFERING_POLICY=FORCE
$APT_BUFFER_FREE_RUN=1000
- Used in conjunction, these two environment variables effectively isolate each operator from slowing upstream production. Using the job monitor statistics, this can identify which part of a job flow is impacting overall performance.
- NOT recommended for production job runs!
$APT_EXPORT_FLUSH_COUNT=[nrows]
Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty from increased I/O.
$APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS
Setting this environment variable directs DataStage to reject Sequential File records with strings longer than their declared maximum column length. By default, imported string fields that exceed their maximum declared length are truncated.
$APT_IMPORT_BUFFER_SIZE=[Kbytes]
$APT_EXPORT_BUFFER_SIZE=[Kbytes]
Defines size of I/O buffer for Sequential File reads (imports) and writes (exports) respectively. Default is 128 (128K), with a minimum of 8. Increasing these values on heavily-loaded file servers may improve performance.
$APT_CONSISTENT_BUFFERIO_SIZE=[bytes]
In some disk array configurations, setting this variable to a value equal to the read / write size in bytes can improve performance of Sequential File import/export operations.
$APT_DELIMITED_READ_SIZE=[bytes]
Specifies the number of bytes the Sequential File (import) stage reads ahead to get the next delimiter. The default is 500 bytes, but this can be set as low as 2. This setting should be set to a lower value when reading from streaming inputs (e.g. socket, FIFO) to avoid blocking.
$APT_MAX_DELIMITED_READ_SIZE=[bytes]
This variable controls the upper bound, which is by default 100,000 bytes. When more than 500 bytes read-ahead is desired, use this variable instead of $APT_DELIMITED_READ_SIZE.
$INSTHOME=[path]
Specifies the DB2 install directory. This variable is usually set in a user's environment from .db2profile.
$APT_DB2INSTANCE_HOME=[path]
Used as a backup for specifying the DB2 installation directory (if $INSTHOME is undefined).
$APT_DBNAME=[database]
Specifies the name of the DB2 database for DB2/UDB Enterprise stages if the Use Database Environment Variable option is True. If $APT_DBNAME is not defined, $DB2DBDFT is used to find the database name.
$APT_RDBMS_COMMIT_ROWS=[rows]
Specifies the number of records to insert between commits. The default value is 2000. Can also be specified with the Row Commit Interval stage input property.
$DS_ENABLE_RESERVED_CHAR_CONVERT
Allows DataStage to handle DB2 databases which use the special characters # and $ in column names.
$INFORMIXDIR=[path]
Specifies the Informix install directory.
$INFORMIXSQLHOSTS=[filepath]
Specifies the path to the Informix sqlhosts file.
$INFORMIXSERVER=[name]
Specifies the name of the Informix server matching an entry in the sqlhosts file.
$APT_COMMIT_INTERVAL=[rows]
Specifies the commit interval in rows for Informix HPL Loads. The default is 10000.
$ORACLE_HOME=[path]
Specifies installation directory for current Oracle instance. Normally set in a user's environment by scripts.
$ORACLE_SID=[sid]
Specifies the Oracle service name, corresponding to a TNSNAMES entry.
$APT_ORAUPSERT_COMMIT_ROW_INTERVAL=[num]
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL=[seconds]
These two environment variables work together to specify how often target rows are committed for target Oracle stages with Upsert method. Commits are made whenever the time interval period has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.
$APT_ORACLE_LOAD_OPTIONS
Specifies Oracle SQL*Loader options used in a target Oracle stage with Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE)
$APT_ORACLE_LOAD_DELIMITED
Specifies a field delimiter for target Oracle stages using the Load method. Setting this variable makes it possible to load fields with trailing or leading blank characters.
$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM
When set, a target Oracle stage with Load method will limit the number of players to the number of datafiles in the table's tablespace.
$APT_ORA_WRITE_FILES=[filepath]
Useful in debugging Oracle SQL*Loader issues. When set, the output of a Target Oracle stage with Load method is written to files instead of invoking the Oracle SQL*Loader. The filepath specified by this environment variable specifies the file with the SQL*Loader commands.
$DS_ENABLE_RESERVED_CHAR_CONVERT
Allows DataStage to handle Oracle databases which use the special characters # and $ in column names.
$APT_TERA_SYNC_DATABASE=[name]
Starting with v7, specifies the database used for the terasync table. By default, EE uses the
$APT_TERA_SYNC_USER=[user]
Starting with v7, specifies the user that creates and writes to the terasync table.
$APT_TERA_SYNC_PASSWORD=[password]
Specifies the password for the user identified by $APT_TERA_SYNC_USER.
$APT_TERA_64K_BUFFERS
Enables 64K buffer transfers (32K is the default). May improve performance depending on network configuration.
$APT_TERA_NO_ERR_CLEANUP
This environment variable is not recommended for general use. When set, this environment variable may assist in job debugging by preventing the removal of error tables and partially written target table.
$APT_TERA_NO_PERM_CHECKS
Disables permission checking on Teradata system tables that must be readable during the Teradata Enterprise load process. This can be used to improve the startup time of the load.
Assumptions
This module assumes that you have an understanding of the topics covered in:
Module 01: Parallel Framework Architecture
Module 02: Partitioning, Collecting, and Sorting
Module 03: Parallel Job Score
Module 04: Best Practices and Job Design Tips
Material covered in DS324PX: DataStage Enterprise Edition Essentials
Optimizing Performance
The ability to process large volumes of data in a short period of time requires optimizing all aspects of the job flow and environment for maximum throughput and performance:
Job Design
Stage Properties
DataStage Parameters
Configuration File
Disk Subsystem
- Especially RAID arrays / SANs
Number of processes generated
Operator combination
Framework-inserted sorting and partitioning
Data distribution (partitioning)
Throughput and bottlenecks
- Use UNIX system monitoring tools to determine resource utilization (CPU, memory, disk, network)
Use DataStage Job Monitor to identify CPU bottlenecks
Selectively disable combination through Designer stage properties
On clustered / MPP configurations, named pools can be used to further specify resources across physical servers
Through careful job design, named pools can minimize data shipping
Named pools can also specify the server(s) with database connectivity
When generating Lookup reference data to be used in subsequent jobs, use Lookup File Sets
Internal format, partitioned
Pre-indexed
Impact of Partitioning
Ensure data is as close to evenly distributed as possible
When business rules dictate otherwise, re-partition to a more balanced distribution as soon as possible to improve performance of downstream stages
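One way to see how uneven a distribution is: count rows per partition and compare the largest partition to the average. The sketch below is illustrative only. It uses Python's zlib.crc32 as a stand-in hash (the framework's own hash partitioner is different), and the function names partition_counts and skew_ratio are hypothetical.

```python
import zlib
from collections import Counter

def partition_counts(keys, num_partitions):
    """Count rows per partition under a simple hash-by-key scheme."""
    counts = Counter({p: 0 for p in range(num_partitions)})
    for key in keys:
        # crc32 is a stand-in hash, NOT the framework's hash partitioner.
        counts[zlib.crc32(str(key).encode("utf-8")) % num_partitions] += 1
    return [counts[p] for p in range(num_partitions)]

def skew_ratio(counts):
    """Largest partition divided by the mean partition size (1.0 = even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0

# One dominant key value forces most rows into a single partition.
skewed_keys = ["US"] * 9000 + ["CA", "MX", "BR", "FR"] * 250
# A high-cardinality key spreads rows across partitions.
even_keys = range(10000)

print("skewed:", partition_counts(skewed_keys, 4), skew_ratio(partition_counts(skewed_keys, 4)))
print("even:  ", partition_counts(even_keys, 4), skew_ratio(partition_counts(even_keys, 4)))
```

A ratio well above 1.0 means one partition does most of the work while the others sit idle, which is the case where re-partitioning on a more balanced key pays off.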
Impact of Sorting
Use parallel sorts if possible (sort by key-column groups)
Where sequential sort is required, parallel sort + sort merge collector is generally much faster than a sequential sort
Parallel data sets maintain sort order and partitioning across jobs
Stable sorts are slower than non-stable sorts; use only when necessary
Use the Restrict Memory Usage (MB) option to increase the amount of memory per partition (default is 20MB)
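The parallel sort plus sort merge pattern can be sketched in a few lines. This is a conceptual illustration, not the framework's implementation: the partitions here are sorted one after another in a loop, whereas the framework sorts them concurrently, and the function name parallel_sort_then_merge is hypothetical.

```python
import heapq

def parallel_sort_then_merge(partitions, key=None):
    """Sort each partition independently, then merge the sorted streams.

    The merge step only compares the current head row of each partition,
    so it is far cheaper than re-sorting the whole data set sequentially.
    """
    sorted_partitions = [sorted(part, key=key) for part in partitions]
    return list(heapq.merge(*sorted_partitions, key=key))

partitions = [[42, 7, 19], [3, 88, 21], [64, 1, 30]]
merged = parallel_sort_then_merge(partitions)
print(merged)
```

The result is one globally sorted stream, which is what a sequential sort would produce, but the expensive sorting work was spread over the partitions.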
Impact of Transformers
Minimize number and use of Transformers
Consider more appropriate stages / methods
y Copy, Output Mappings, Modify, Lookup
Use stage variables to perform calculations used by multiple derivations
Replace complex Transformers that do not meet performance requirements with BuildOps
And NEVER use the BASIC Transformer for high-volume flows!
Impact of Buffering
Consider maximum row width
For very wide rows, it may be necessary to increase buffer size to hold more rows in memory (default is 3MB / partition)
Set through stage properties or for entire job using $APT_BUFFER_MAXIMUM_MEMORY
Tune all other factors (job design, configuration file, disk, resources, etc.) before tuning buffer settings
Be careful changing buffering mode
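A rough calculation shows why row width matters for the default buffer. The sketch below assumes 1MB = 1,048,576 bytes and ignores any per-row overhead the framework adds, so treat the results as estimates. The function names are hypothetical.

```python
# 3MB per partition is the default stated above; 1MB = 1,048,576 bytes is an assumption.
DEFAULT_BUFFER_BYTES = 3 * 1024 * 1024

def rows_per_buffer(row_width_bytes, buffer_bytes=DEFAULT_BUFFER_BYTES):
    """Whole rows that fit in one buffer, ignoring per-row overhead."""
    if row_width_bytes <= 0:
        raise ValueError("row width must be positive")
    return buffer_bytes // row_width_bytes

def buffer_bytes_for_rows(row_width_bytes, target_rows):
    """Buffer size needed to hold target_rows rows of the given width."""
    return row_width_bytes * target_rows

for width in (200, 8_000, 500_000):
    print(f"{width:>7} byte rows -> {rows_per_buffer(width)} rows per buffer")

# Bytes needed to hold 100 rows of 500,000 bytes each.
print(buffer_bytes_for_rows(500_000, 100))
```

With very wide rows the default buffer holds only a handful of rows, which is the case where raising the buffer size may be worth considering after the other factors have been tuned.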
$APT_BUFFER_FREE_RUN=1000
- Writes excess buffer to disk instead of slowing down producer
- Buffer will not slow down producer until it has written 1000*$APT_BUFFER_MAXIMUM_MEMORY to disk
Important notes:
These settings will generate a significant amount of disk I/O!
Use configuration file buffer disk pools to isolate buffer file systems from scratch and resource disks
Do NOT use these settings for production jobs!
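To get a feel for the disk I/O involved, the free-run threshold can be worked out as simple arithmetic. The sketch below assumes the default 3MB maximum memory (taken as 3 * 1,048,576 bytes) and treats the number of buffers and partitions as hypothetical inputs; it gives a worst-case upper bound, not a measurement.

```python
DEFAULT_MAX_MEMORY_BYTES = 3 * 1024 * 1024  # assumed default of 3MB per buffer

def free_run_disk_bytes(free_run, max_memory_bytes=DEFAULT_MAX_MEMORY_BYTES):
    """Bytes one buffer may write to disk before it slows its producer."""
    return int(free_run * max_memory_bytes)

def job_spill_upper_bound(free_run, buffers, partitions,
                          max_memory_bytes=DEFAULT_MAX_MEMORY_BYTES):
    """Worst case if every buffer in every partition fills to the threshold."""
    return free_run_disk_bytes(free_run, max_memory_bytes) * buffers * partitions

per_buffer = free_run_disk_bytes(1000)
print(f"per buffer: {per_buffer:,} bytes (~{per_buffer / 1024**3:.2f} GiB)")

# Hypothetical job: 4 buffered links running 8 ways parallel.
print(f"job upper bound: {job_spill_upper_bound(1000, 4, 8):,} bytes")
```

Even a modest job can reach tens of gigabytes of buffer files under these settings, which is why the buffer file systems should be isolated from scratch and resource disks.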
In some disk array configurations, set the following environment variable equal to the read/write size in bytes:
$APT_CONSISTENT_BUFFERIO_SIZE
DataStage Enterprise Edition Best Practices and Performance Tuning document
Don't be afraid to try!