Professional Documents
Culture Documents
Day 1
Review of EE Concepts Sequential Access Best Practices DBMS as Source
Day 2
EE Architecture Transforming Data DBMS as Target Sorting Data
Day 3
Combining Data Configuration Files Extending EE Meta Data in EE
Day 4
Job Sequencing Testing and Debugging
Online Help
Intro Part 1
Introduction to DataStage EE
What is DataStage?
Ideal tool for data integration projects such as data warehouses, data marts, and system migrations. Import, export, create, and manage metadata for use within jobs.
Schedule, run, and monitor jobs all within DataStage. Administer your DataStage development and execution environments.
DataStage Administrator
Client Logon
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
Define global and project properties in Administrator. Import meta data into Manager. Build job in Designer.
Compile in Designer.
Validate, run, and monitor in Director.
DataStage Projects
DataStage Designer is used to build and compile your ETL jobs. Manager is used to import and manage metadata and other DataStage components. Director is used to execute your jobs after you build them. Administrator is used to set global and project properties.
Intro Part 2
Configuring Projects
Module Objectives
Project Properties
Projects can be created and deleted in Administrator Project properties and defaults are set in Administrator
To set project properties, log onto Administrator, select your project, and then click Properties
Licensing Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
Intro Part 3
Module Objectives
What Is Metadata?
(Diagram: data flows from Source through Transform to Target; meta data describes the data at each point)
DataStage Manager
Manager Contents
Metadata describing sources and targets: Table definitions DataStage objects: jobs, routines, table definitions, etc.
Any object in Manager can be exported to a file Can export whole projects Use for backup Sometimes used for version control Can be used to move DataStage objects from one project to another Use to share DataStage jobs and projects with other developers
Export Procedure
In Manager, click Export > DataStage Components. Select DataStage objects for export. Specify the type of export: DSX or XML.
You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.
The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
Import Procedure
Import Options
Exercise
Metadata Import
Import format and column definitions from sequential files. Import relational table column definitions. Imported as Table Definitions.
In Manager, click Import > Table Definitions > Sequential File Definitions. Select the directory containing the sequential file and then the file. Select the Manager category. Examine the format and column definitions and edit if necessary.
Intro Part 4
Module Objectives
What Is a Job?
Executable DataStage program Created in DataStage Designer, but can use components from Manager Built using a graphical user interface
Designer Toolbar
Provides quick access to the main functions of Designer Show/hide metadata markers
Job properties
Compile
Tools Palette
Stages can be dragged from the tools palette or from the stage type branch of the repository view Links can be drawn from the tools palette or by right clicking and dragging from one stage to another
Used to extract data from, or load data to, a sequential file Specify full path to the file Specify a file format: fixed width or delimited
Transformer Stage
Used to define constraints, derivations, and column mappings. A column mapping maps an input column to an output column. In this module we will just define column mappings (no derivations).
Result
Job Properties
Short and long descriptions show in Manager
Annotation stage
Is a stage on the tool palette Shows on the job GUI (work area)
Compiling a Job
Intro Part 5
Running Jobs
Module Objectives
DataStage Director
Can schedule, validate, and run jobs. Can be invoked from DataStage Manager or Designer.
Tools > Run Director
Schedule job to run on a particular date/time Clear job log Set Director options
Row limits Abort after x warnings
Module 1
ANY SOURCE CRM ERP SCM RDBMS Legacy Real-time Client-server Web services Data Warehouse Other apps.
PREPARE
ANY TARGET CRM ERP SCM BI/Analytics RDBMS Real-time Client-server Web services Data Warehouse Other apps.
Data Quality
Course Objectives
Course Agenda
Day 1
Review of EE Concepts Sequential Access Standards DBMS Access
Day 2
EE Architecture Transforming Data Sorting Data
Day 3
Combining Data Configuration Files
Day 4
Extending EE Meta Data Usage Job Control Testing
Module Objectives
Skip this module if you recently completed the DataStage EE essentials modules
Review Topics
Client-Server Architecture
Command & Control
ANY SOURCE
ANY TARGET CRM ERP SCM BI/Analytics RDBMS Real-Time Client-server Web services Data Warehouse Other apps.
Designer
Director
Administrator
Repository Manager
Discover Extract
Extend Integrate
Server
Repository
Process Flow
Administrator: add/delete projects, set defaults. Manager: import meta data, back up projects. Designer: assemble jobs, compile, and execute. Director: execute jobs, examine job run logs.
DataStage Manager
Designer Workspace
Stages
Can now customize the Designer's palette. Select desired stages and drag to Favorites.
Row generator
Peek
Row Generator
Repeatable property
Peek
Why EE is so Effective
Emphasis on memory
Data read into memory and lookups performed like hash table
DataStage EE Enables parallel processing = executing your application on multiple CPUs simultaneously
If you add more resources (CPUs, RAM, and disks) you increase system performance
(Diagram: Source → Transform → Clean → Load → Target, with operational data landed to disk between each step)
Traditional approach to batch processing: write to disk and read from disk before each processing operation. Sub-optimal utilization of resources: a 10 GB stream leads to 70 GB of I/O, and processing resources can sit idle during I/O. Very complex to manage (lots and lots of small jobs). Becomes impractical with big data volumes: disk I/O consumes the processing time, and terabytes of disk are required for temporary staging.
Pipeline Multiprocessing
Data Pipelining
Transform, clean, and load processes execute simultaneously on the same processor; rows move forward through the flow.
(Diagram: operational and archived data flow from Source through Transform, Clean, and Load into the Data Warehouse without landing to disk)
Start a downstream process while an upstream process is still running. This eliminates intermediate storing to disk, which is critical for big data. This also keeps the processors busy. Still has limits on scalability
Partition Parallelism
Data Partitioning
Break up big data into partitions
(Diagram: source data is partitioned, e.g. A-F, G-M, N-T, U-Z, across Nodes 1-4; each node runs Transform, Clean, and Load in parallel from Source to the Target data warehouse)
Repartitioning
Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly
(Diagram: pipelining combined with partition parallelism: partitioned source data, e.g. A-F, G-M, N-T, U-Z, flows through Transform, Clean, and Load into the data warehouse; the data is repartitioned on the fly between stages, for example on customer zip code and then on credit card number, without landing to disk)
EE Program Elements
Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).
- All the partitions of a dataset follow the same schema: that of the dataset
DataStage EE Architecture
DataStage:
Provides data integration platform
Orchestrate Framework:
Provides application scalability
Orchestrate Program
(sequential data flow)
Flat Files
Import
Relational Data
Inter-node communications
Parallelization of operations
Introduction to DataStage EE
DSEE:
Automatically scales to fit the machine Handles data flow among multiple CPUs and disks
You get parallel access, propagation, transformation, and load. The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file; there is no need to modify or recompile the design.
Exercise
Module 2
Module Objectives
Sequential
Fixed or variable length
Data Set
The EE Framework processes only datasets. For files other than datasets, such as flat files, Enterprise Edition must perform import and export operations; these are performed by import and export OSH operators generated by Sequential or FileSet stages.
During import or export, DataStage performs format translations into, or out of, the EE internal format.
Data is described to the Framework in a schema
Generates Import/Export operators, depending on whether stage is source or target Performs direct C++ file I/O streams
Importing/Exporting Data
Data import:
EE internal format
Data export
EE internal format
Recordization
Divides input stream into records Set on the format tab
Columnization
Divides the record into columns Default set on the format tab but can be overridden on the columns tab Can be incomplete if using a schema or not even specified in the stage if using RCP
(Diagram: a delimited record — Field 1 , Field 2 , ... , Last field , nl: fields are separated by commas and the record is terminated by a newline)
Will reject any records not matching meta data in the column definitions
Stage categories
Show records
Format Tab
Read Methods
Reject Link
Target
All records that are rejected for any reason
Can read or write file sets Files suffixed by .fs File set consists of:
1. Descriptor file contains location of raw data files + meta data 2. Individual raw data files
Descriptor file
Can create file sets Usually used in conjunction with Lookup stages
Data Set
Operating system (Framework) file Suffixed by .ds Referred to by a control file Managed by Data Set Management utility from GUI (Manager, Designer, Director) Represents persistent data Key to good performance in set of linked jobs
Persistent Datasets
input.ds
record ( partno: int32; description: string; )
Data file(s)
Contain the data as multiple Unix files (one per node), accessible in parallel
node1:/local/disk1/ node2:/local/disk2/
Quiz!
True or False?
Everything that has been data-partitioned must be collected in same job
Occurs on import
From sequential files or file sets From RDBMS
Occurs on export
From datasets to file sets or sequential files From datasets to RDBMS
Engine most efficient when processing internally formatted records (I.e. data contained in datasets)
Managing DataSets
GUI (Manager, Designer, Director) tools > data set management Alternative methods
Orchadmin
Unix command line utility List records Remove data sets (will remove all components)
Dsrecords
Display data
Schema
Orchadmin
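As a rough command-line sketch (the data set name is hypothetical, and the available subcommands can vary by release), counting, inspecting, and removing a persistent data set might look like:

    dsrecords input.ds          # report the number of records in the data set
    orchadmin dump input.ds     # display the records
    orchadmin delete input.ds   # remove the descriptor and all component data files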
Exercise
Module 3
Objectives
Job Presentation
Naming conventions
Container
Use a Copy or Peek stage as a stub. Test the job in phases: small first, then increasing in complexity. Use the Peek stage to examine records.
Copy stage
Suggestions: Always include a reject link. Always test for a null value before using a column in a function. Try to use RCP and only map columns that have a derivation other than a copy (more on RCP later). Be aware of column and stage variable data types.
Often the user does not pay attention to the stage variable type. Try to maintain the data type as imported.
Can be inserted on: the input link (Partitioning page): partitioners, Sort, Remove Duplicates
the output link (Mapping page): Rename, Drop
Developing Jobs
1.
Keep it simple
Jobs with many stages are hard to debug and maintain.
2.
3.
4.
Final Result
Use job parameters. Some helpful environment variables to add as job parameters:
$APT_DUMP_SCORE
Reports the OSH score to the message log
$APT_CONFIG_FILE
Establishes runtime parameters for the EE engine, i.e. the degree of parallelization. Make a set for 1X, 2X, ... Use different ones for test versus production. Include as a parameter in each job.
Exercise
Module 4
DBMS Access
Objectives
Understand how DSEE reads and writes records to an RDBMS Understand how to handle nulls on DBMS lookup Utilize this knowledge to:
Read and write database tables Use database tables to lookup data Use null handling options to clean data
(Diagram: Enterprise Edition clients sorting and loading through parallel connections to a parallel RDBMS, versus traditional clients each holding a single connection)
Parallel server runs the applications; each application has parallel connections to the RDBMS. Suitable for large data volumes. Higher levels of integration possible.
Traditional access: each application has only one connection. Suitable only for small data volumes.
RDBMS Access
Supported Databases
RDBMS Access
Automatically convert RDBMS table layouts to/from Enterprise Edition Table Definitions RDBMS nulls converted to/from nullable field values Support for standard SQL syntax for specifying:
field list for SELECT statement filter for WHERE clause
Can write an explicit SQL query to access RDBMS EE supplies additional information in the SQL query
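For example, a user-defined SELECT might look like the sketch below (table and column names are hypothetical); EE then supplies its own additional clauses, such as partitioning information, when it runs the query:

    SELECT custid, custname, region
    FROM customers
    WHERE region = 'WEST'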
RDBMS Stages
RDBMS Usage
As a source
Extract data from table (stream link)
Extract as table, generated SQL, or user-defined SQL. User-defined SQL can perform joins and access views.
As a target
Inserts Upserts (Inserts and updates) Loader
Stream link
Columns in SQL statement must match the meta data in columns tab
Exercise
User-defined SQL
Exercise 4-1
Reject link
Null Handling
Must handle the null condition if a lookup record is not found and the Continue option is chosen. Can be done in a Transformer stage.
Link name
Must have same column name in input and reference links. You will get the results of the lookup in the output column.
DBMS as a Target
DBMS As Target
Write Methods
Delete Load Upsert Write (DB2)
Target Properties
Generated code can be copied
Use a Transformer stage to test for fields with null values (use the IsNull function). In the Transformer, you can reject the row or load a default value.
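A minimal sketch of such a derivation, assuming a reference link named lkp_prices and a column named price (both hypothetical); it substitutes 0 when the lookup did not find a row:

    If IsNull(lkp_prices.price) Then 0 Else lkp_prices.price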
Exercise
Module 5
Platform Architecture
Objectives
Understand how Enterprise Edition Framework processes data You will be able to:
Read and understand OSH Perform troubleshooting
Concepts
Parallel or Sequential
Partitioner
EE Stage
Business Logic
Pipeline
Partition
Execution mode (sequential/parallel) is controlled by the Stage; the default is parallel for most Ascential-supplied Stages. The developer can override the default mode. A parallel Stage inserts the default partitioner (Auto) on its input links; a sequential Stage inserts the default collector (Auto) on its input links. The developer can override the default execution mode (parallel/sequential) of a Stage on the Advanced tab, and the choice of partitioner/collector on the Input > Partitioning tab.
Constraints are assigned to specific pools as defined in configuration file and can be referenced in the stage
OSH
OSH Script
Where:
op is an Orchestrate operator, in.ds is the input data set, and out.ds is the output data set
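A minimal sketch of the resulting script form, with op1 and op2 standing in for any Orchestrate operators:

    osh "op1 < in.ds | op2 > out.ds"

The pipe connects the two operators into a single data flow; < and > attach the persistent input and output data sets.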
OSH Operators
OSH Operator is an instance of a C++ class inheriting from APT_Operator Developers can create new operators Examples of existing operators:
Import Export RemoveDups
Operator
Schema
OSH Practice
Datasets
Consist of Partitioned Data and Schema Can be Persistent (*.ds) or Virtual (*.v, Link) Overcome 2 GB File Limit
What you program: What gets processed:
Node 1 Node 2
Operator A
Node 3
Operator A
Node 4
Operator A
GUI
Operator A
. . .
Shared Disk
Shared Nothing
Disk
Disk
CPU CPU CPU CPU
Disk
Disk
Disk
Disk
CPU
CPU
CPU
CPU
CPU
Memory
Shared Memory
Memory
Memory
Memory
Memory
Uniprocessor
PC Workstation Single processor server
SMP System
(Symmetric Multiprocessor)
IBM, Sun, HP, Compaq 2 to 64 processors Majority of installations
Processing Node
SL
Section Leader
Forks Player processes (one per Stage); manages up/down communication.
Processing Node
SL
Players
The actual processes associated with Stages. Combined players: one process only. Send stderr to the SL.
Communication:
Establish connections to other players for data flow Clean up upon completion.
'1-node' file - for testing (no parallelism)
'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism
'BigN-nodes' file - aims at full data-partitioned parallelism
$APT_CONFIG_FILE
Nodes = # of logical nodes declared in the config file. Ops = # of operators (approx. the # of blue boxes in the visual representation). Processes = # of Unix processes (approximately Nodes * Ops). CPUs = # of available CPUs.
Re-Partitioning
Parallel to parallel flow may incur reshuffling: Records may jump between nodes
Partitioning Methods
Collectors
Collectors combine partitions of a dataset into a single input stream to a sequential Stage ...
data partitions
collector
sequential Stage
Partitioner
Collector
Set APT_DUMP_SCORE to true Can be specified as job parameter Messages sent to Director log If set, parallel job will produce a report showing the operators, processes, and datasets in the running job
Exercise
Module 6
Transforming Data
Module Objectives
Understand ways DataStage allows you to transform data Use this understanding to:
Create column derivations using user-defined code or system functions Filter records based on business criteria Control data flow based on data conditions
Transformed Data
Stages Review
Flow Control
Separate records flow down links based on data condition specified in Transformer stage constraints Transformer stage can filter records Other stages can filter records but do not exhibit advanced flow control
Sequential can send bad records down reject link Lookup can reject records based on lookup failure Filter can select records based on data value
Rejecting Data
Reject links (from the Lookup stage) result from the reject option of the If Not Found property
Lookup failed All columns on reject link (no column mapping option)
Reject constraints are controlled from the constraint editor of the transformer
Can control column mapping Use the Other/Log checkbox
Reject mechanisms: Transformer constraint with the Other/Log option; Sequential stage property Reject Mode = Output; Lookup If Not Found property
First of transformer stage entities to execute Execute in order from top to bottom
Can write a program by using one stage variable to point to the results of a previous stage variable
Multi-purpose
Counters. Hold values from previous rows to make comparisons. Hold derivations to be used in multiple field derivations. Can be used to control execution of constraints.
Stage Variables
Show/Hide button
Transforming Data
Derivations
Using expressions Using functions
Date/time
Nulls can get introduced into the dataflow because of failed lookups and the way in which you choose to handle this condition. Can be handled in constraints, derivations, stage variables, or a combination of these.
Constraint Rejects
A row goes to the reject link when all constraint expressions are false and the reject row option is checked
All > Processing > Transformer is the non-Universe (parallel) transformer. It has a specific set of functions; no DS routines are available.
Parallel > Processing > Basic Transformer makes server-style transforms available on the parallel palette.
Date & Time Logical Null Handling Number String Type Conversion
Exercise
Module 7
Sorting Data
Objectives
Understand DataStage EE sorting options Use this understanding to create sorted list of data to enable functionality within a transformer stage
Sorting Data
Important because
Some stages require sorted input. Some stages may run faster, e.g. the Aggregator.
Can be performed
As an option within stages (use the Input > Partitioning tab and set partitioning to anything other than Auto). As a separate Sort stage (for more complex sorts).
Sorting Alternatives
Sort Stage
Sort Utility
Sort Stage
Tunable to use more memory before spilling to scratch.
Note: Spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE
Removing Duplicates
OR
Exercise
Module 8
Combining Data
Objectives
Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages Use this understanding to create jobs that will
Combine data from separate input streams Aggregate data to form summary totals
Combining Data
Vertically: one input link, one output link with columns combining values from all input rows. E.g.,
Aggregator
These "three Stages" combine two or more input links according to values of user-designated "key" column(s). They differ mainly in:
Memory usage Treatment of rows with unmatched key values Input requirements (sorted, de-duplicated)
Naming convention:
Joins Left Right Lookup Source LU Table(s) Merge Master Update(s)
Tip: Check "Input Ordering" tab to make sure intended Primary is listed first
Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)
One of four variants: Inner Left Outer Right Outer Full Outer
No pre-sort necessary. Allows multiple keys and multiple LUTs. Flexible exception handling for source input rows with no match.
Lookup
Output
Reject
Lookup tables should be small enough to fit into physical memory (otherwise there is a performance hit due to paging). On an MPP you should partition the lookup tables using the Entire partitioning method, or partition them the same way you partition the source link. On an SMP, no physical duplication of a lookup table occurs.
RDBMS LOOKUP
NORMAL
SPARSE
Combines
one sorted, duplicate-free master (primary) link with one or more sorted update (secondary) links. Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup).
Unmatched ("Bad") update rows in input link can be captured in a "reject" link Matched update rows are consumed.
Lightweight
Output
Joins / Lookup / Merge comparison:
Model: Joins = RDBMS-style relational; Lookup = Source - in RAM LU Table; Merge = Master - Update(s)
Memory usage: Joins = light; Lookup = heavy; Merge = light
# and names of inputs: Joins = exactly 2 (1 left, 1 right); Lookup = 1 Source, N LU Tables; Merge = 1 Master, N Update(s)
Mandatory input sort: Joins = both inputs; Lookup = no; Merge = all inputs
Duplicates in primary input: Joins = OK (x-product); Lookup = OK; Merge = Warning!
Duplicates in secondary input(s): Joins = OK (x-product); Lookup = Warning!; Merge = OK only when N = 1
Options on unmatched primary: Joins = NONE; Lookup = [fail] | continue | drop | reject; Merge = [keep] | drop
Options on unmatched secondary: Joins = NONE; Lookup = NONE; Merge = capture in reject set(s)
On match, secondary entries are: Joins = reusable; Lookup = reusable; Merge = consumed
# Outputs: Joins = 1; Lookup = 1 out (1 reject); Merge = 1 out (N rejects)
Captured in reject set(s): Joins = nothing (N/A); Lookup = unmatched primary entries; Merge = unmatched secondary entries
(The "-" above separates primary from secondary input links, and out from reject links.)
Zero or more key columns that define the aggregation units (or groups) Columns to be aggregated Aggregation functions:
count (nulls/non-nulls) sum max/min/range
Grouping Methods
Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
Doesn't require sorted data; good when the number of unique groups is small. The running tally for each group's aggregate calculations needs to fit easily into memory. Requires about 1 KB of RAM per group. Example: average family income by state requires about 0.05 MB of RAM.
Sort: results for only a single aggregation group are kept in memory; when a new group is seen (the key value changes), the current group is written out.
Requires input sorted by the grouping keys; can handle unlimited numbers of groups. Example: average daily balance by credit card.
Aggregator Functions
Sum Min, max Mean Missing value count Non-missing value count Percent coefficient of variation
Aggregator Properties
Aggregation Types
Aggregation types
Containers
Two varieties
Local Shared
Local
Simplifies a large, complex diagram
Shared
Creates reusable object that many jobs can include
Creating a Container
Create a job Select (loop) portions to containerize Edit > Construct container > local or shared
Using a Container
Exercise
Module 9
Configuration Files
Objectives
Understand how DataStage EE uses configuration files to determine parallel behavior Use this understanding to
Build an EE configuration file for a computer system. Change node configurations to support adding resources to processes that need them. Create a job that will change resource allocations at the stage level.
Determines the processing nodes and the disk space connected to each node. When the system changes, you need only change the configuration file; there is no need to recompile jobs. When a DataStage job runs, the platform reads the configuration file
and automatically scales the application to fit the system.
Locations on which the framework runs applications Logical rather than physical construct Do not necessarily correspond to the number of CPUs in your system
Typically one node for two CPUs
Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node
Optimizing Parallelism
Degree of parallelism determined by number of nodes defined Parallelism should be optimized, not maximized
Increasing parallelism distributes work load but also increases Framework overhead
Hardware influences degree of parallelism possible System hardware partially determines configuration
SMP: leave some processors for the operating system. It is desirable to equalize the partitioning of data. Use an experimental approach:
start with small data sets; try different degrees of parallelism while scaling up data set sizes
Configuration File
Node Options
Fastname
Name of node as referred to by fastest network in the system Operators use physical node name to open connections NOTE: for SMP, all CPUs share single connection to network
Pools
Names of pools to which this node is assigned Used to logically group nodes Can also be used to group resources
Resource
Disk Scratchdisk
Disk Pools
Disk pools allocate storage. By default, EE uses the default pool, specified by "". A disk or scratch disk can also be assigned to a named pool, e.g.:
pool "bigdata"
Sorting Requirements
Resource pools can also be specified for sorting:
The Sort stage looks first for scratch disk resources in a sort pool, and then in the default disk pool
{"sort"}
{}
2
1
{}
{}
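A minimal two-node configuration file sketch (the host name and directory paths are hypothetical); each node's scratch disk is also placed in a "sort" pool as described above:

    {
      node "node1" {
        fastname "etl_server"
        pools ""
        resource disk "/data/node1" {pools ""}
        resource scratchdisk "/scratch/node1" {pools "" "sort"}
      }
      node "node2" {
        fastname "etl_server"
        pools ""
        resource disk "/data/node2" {pools ""}
        resource scratchdisk "/scratch/node2" {pools "" "sort"}
      }
    }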
Resource Types
Number of CPUs CPU speed Available memory Available page/swap space Connectivity (network/back-panel speed)
Is the machine dedicated to EE? If not, what other applications are running on it? Get a breakdown of the resource usage (vmstat, mpstat, iostat) Are there other configuration restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them?
Exercise
Module 10
Extending DataStage EE
Objectives
Understand the methods by which you can add functionality to EE Use this understanding to:
Build a DataStage EE stage that handles special processing needs not supplied with the vanilla stages Build a DataStage EE job that uses the new stage
EE Extensibility Overview: sometimes it will be to your advantage to leverage EE's extensibility. This extensibility includes:
Wrappers: good if you cannot or do not want to modify the application and performance is not critical. Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces. Custom (C++ coding using the framework API): good if you need custom coding and need dynamic input and output interfaces.
Wrapped applications must be pipe-safe: they read rows sequentially, with no random access to the data.
Wrappers (Contd)
User must know at design time the intended behavior of the wrapper and its schema interface
If the wrappered application needs to see all records prior to processing, it cannot run in parallel.
LS Example
Creating a Wrapper
Name of stage
Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.
If your stage will have properties, complete the Properties page
Interfaces: input and output columns. These should first be entered into the table definitions meta data (DS Manager); let's do that now.
Interface schemas
Layout interfaces describe what columns the stage:
Needs for its inputs (if any) Creates for its outputs (if any) Should be created as tables with columns in Manager
Define the schema for export and import Schemas become interface schemas of the operator and allow for by-name column access
This wrapper will have no input interface, i.e. no input link. The location will come as a job parameter that is passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
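For the ls example, the output interface could be a single-column table definition, sketched here in schema form (the column name and length are illustrative assumptions):

    record ( file_name: string[max=255]; )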
Resulting Job
Wrapped stage
Job Run
Hardware Environment:
IBM SP2, 2 nodes with 4 CPUs per node.
Software:
DB2/EEE, COBOL, EE
Buildops
Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing executable (unlike the wrapper). Reasons to use Buildop include:
Speed / Performance
Complex business logic that cannot be easily represented using existing stages
Lookups across a range of values Surrogate key generation Rolling aggregates
Build once and reusable everywhere within project, no shared container necessary Can combine functionality from different stages into one
BuildOps
The DataStage programmer encapsulates the business logic
The Enterprise Edition interface called buildop automatically performs the tedious, error-prone tasks: invoke needed header files, build the necessary plumbing for a correct and efficient parallel execution.
Exploits extensibility of EE Framework
General Page
Identical to Wrappers, except:
Enter Business C/C++ logic and arithmetic in four pages under the Logic tab
The main code section goes in the Per-Record page; it will be applied to all rows. NOTE: the code must be ANSI C/C++ compliant. If the code does not compile outside of EE, it won't compile within EE either!
Write row
Transfers all columns from input to output. If the page is left blank, or Auto Transfer = "False" (and RCP = "False"), only the columns in the output Table Definition are written.
Example - sumNoTransfer
Add input columns "a" and "b"; ignore other columns that might be present in input Produce a new "sum" column Do not transfer input columns
a:int32; b:int32
sumNoTransfer
sum:int32
No Transfer
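A sketch of the Per-Record logic for sumNoTransfer, assuming the default behavior where records are read and written automatically and columns are referenced directly by name:

    // Per-Record page: add the two input columns into the new output column
    sum = a + b;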
From Peek:
Transfer
TRANSFER
- RCP set to "True" in stage definition or - Auto Transfer set to "True"
Effects:
- new column "sum" is transferred, as well as - input columns "a" and "b" and - input column "ignored" (present in input, but not mentioned in stage)
C/C++ types; need declaration (in the Definitions or Pre-Loop page). Values persist throughout the "loop" over rows, unless modified in code.
Exercise
Exercise
Custom Stage
Custom Stage
DataStage Manager > select Stage Types branch > right click
Custom Stage
The Result
Module 11
Objectives
Understand how EE uses meta data, particularly schemas and runtime column propagation Use this understanding to:
Build schema definition files to be invoked in DataStage jobs Use RCP to manage meta data usage in EE jobs
Data definitions
Recordization and columnization Fields have properties that can be set at individual field level
Described as properties on the format/columns tab (outputs or inputs pages) OR Using a schema file (can be full or partial)
Schemas
Can be imported into Manager. Can be pointed to by some job stages (e.g. Sequential).
Format tab Meta data described on a record basis Record level properties
Column Overrides
Edit row from within the columns tab Set individual column properties
Editing Columns
Schema
Alternative way to specify column definitions for data used in EE jobs Written in a plain text file Can be written as a partial record definition Can be imported into the DataStage repository
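A small schema file sketch (column names are illustrative), using the same record syntax shown earlier for the data set descriptor:

    record (
      partno: int32;
      description: string[max=40];
      price: nullable decimal[8,2];
    )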
Creating a Schema
Importing a Schema
Data Types
String
Time Timestamp
DataStage EE is flexible about meta data. It can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). RCP is always on at runtime; the setting controls design-time and compile-time column mapping enforcement. RCP is off by default. Enable it first at the project level (Administrator project properties), then at the job level (job properties, General tab), and at the Stage (link Output Column tab).
Go to the output link's Columns tab. For the Transformer, you can find the output link's Columns tab by first going to stage properties.
To utilize runtime column propagation in the sequential stage you must use the use schema option Stages with this restriction:
Sequential File Set External Source External Target
Exercise
Module 12
Objectives
Understand how the DataStage job sequencer works Use this understanding to build a control job to run a sequence of DataStage jobs
Job Sequencer
Build a controlling job much the same way you build other jobs. Comprised of stages and links. No BASIC coding.
Job Sequencer
Build like a regular job Type Job Sequence Has stages and links Job Activity stage represents a DataStage job Links represent passing control
Stages
Example
Job Activity stage contains conditional triggers
Trigger appears as a link in the diagram Custom options let you define the code
Options
Can add wait for file to execute Add execute command stage to drop real tables and rename new tables to current tables
Sequencer Stage
Notification Stage
Notification
Notification Activity
E-Mail Message
Exercise
Module 13
Objectives
Understand spectrum of tools to perform testing and debugging Use this understanding to troubleshoot a DataStage job
Environment Variables
Environment Variables
Stage Specific
Environment Variables
Environment Variables
Compiler
The Director
Typical Job Log Messages:
Tracing/Debug output
Troubleshooting
If you get an error during compile, check the following:
Compilation problems: if a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH. If Buildop errors occur, try buildop from the command line. Some stages may not support RCP, which can cause a column mismatch. Use the Show Error and More buttons. Examine the generated OSH. Check environment variable settings. Very little integrity checking is done during compile; you should run Validate from Director.
A Row Generator plus Lookup stages provides a good way to create robust test data from pattern files