
Example of ETL Best Practices for DataStage

Overview: As with any programming tool, DataStage jobs must follow the principles of the SDLC in the delivery of components. To assist in the development, management and support of the DataStage environment, there is a set of best practices identified by Ascential and the DataStage community that should be implemented. These best practices are additional to a standard DataStage install. Failure to implement them usually results in problems further down the track, such as a higher cost of future development, increased time spent on administration tasks, and problems with reliability.

Best Practices: The following best practices have been identified for Phase 1 of the Data Quality project:

Naming Conventions: will be in place from the start of the development.
Folder Management: DataStage components will be organized into a set of folders for each type of extract and a subset of folders for each type of job.
Failover and Recovery: because the loads are not large, the strategy in this area will be to restart the extract from the beginning. The standard best practices of milestone points and staging will not be required.
Error Handling: will target 100% error trapping of jobs and components called by DataStage.
Release Management: will use the DataStage Version Control tool to provide a process to deliver read-only components to testing and production. Components will only be moved to production after first passing through the System Test environment.
Process Reporting: status and row counts of jobs will be retrieved for accurate process reporting.
Notification: email will be the primary notification vehicle, and it will be integrated with the error handling and process reporting routines to ensure information about successful and unsuccessful runs is delivered to the administrator and stakeholders.
Parameter Management: the ability to manage job parameters across environments so that components can be delivered and run without requiring any modifications.
Optimization: instantiation will be built in where data is written out to staging files through each part of the process. Multiple instances will not be required in Phase 1 due to the small size of the files; instead, parallelism will be used when different types of extracts run at the same time. Standard optimization techniques such as hash file lookups will be in place.
Simple Design: each job will have up to six stages and be designed to carry out one part of the processing.
Metadata Management: DataStage will be the central repository of metadata information on source and final output files. File definitions will be set up using the standard DataStage import table definition functions. Metadata on the addresses processed by QualityStage will be managed by the QualityStage repository and imported into DataStage. Version numbering of metadata will be in place in DataStage to map changes in metadata against the jobs affected.
Best of breed processing: DataStage will handle those tasks best suited to ETL while QualityStage will handle all validation and standardisation tasks. This will result in a solution which is inherently easier to maintain and support.

Naming Conventions

DataStage components

All jobs will include the two-character code "DQ" to indicate they belong to the data quality project. Sequence jobs will be prefixed by "Seq:" and batch jobs will be prefixed by "Batch:".

All jobs will have a two-character code identifying which type of extract they belong to, i.e. BL, CP, or DM. After all prefixes, the job name will be a description of the primary job function. Descriptions will be standard across extract types. Routines will have the prefix DQ indicating they were written for the data quality project. The rest of the routine name will describe the primary routine function to make it easy to identify in a transformer. Saved files will retain the same name they had in the source system.

Job Stages

A prefix will be used on all stage names. This helps identify the stage when metadata about the jobs is extracted into a customer report:

Transformer : xfm
Sequential file stage : seq
Hash file stage : hsh
Complex flat file stage : cff

The rest of the stage name will describe the primary function of the stage.

File Names

Source files are given names by the processes or Data Analysts who create them. These names are expected to be unique. DataStage will retain these names throughout the processing cycle and add extra identifiers to the names as the file passes through each process stage, as follows (assuming the input file is named customers010108.dat):

Filename : Contents

Customers010108.dat : The original source file; the name must be unique.
Init_customers010108.dat : The original file content updated with a unique identifier for each record; the original filename is prepended with the Init_ prefix.
Messages_customers010108.dat : The log file that receives all the trapped error and warning messages from QualityStage, DataStage, and Unix scripts.
Clean_addresses.hsh : A file of clean addresses in hash file format to allow lookup.
Rejects_customers010108.dat : Addresses that are rejected by DataStage due to validation or transformation problems.
Apid_customers010108.dat : The text file created for the APID application.
Apid_stat_customers010108.dat : The APID statistics file showing APID success rates and error rates.
Log_Apid_customers010108.dat : The log file produced by the APID process.
Err_Apid_customers010108.dat : The error rows created by the APID process.
Apid_matches.hsh : The APID output loaded into a hash file for lookups.
Out_customers010108.dat : The output from the APID application containing unique IDs and matching APID codes.
DQ_customers010108.dat : The final file containing clean addresses and APIDs.
Failed_APID_customers010108.dat.txt : The report containing addresses without an APID.
Before_After_customers010108.dat.xml : The before / after report containing all clean and original address information.
Mail.msg : Temporary file that contains the text of an email notification.
Attach.txt : Temporary file containing all the text for an email file attachment.
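As a rough illustration of this convention, the shell sketch below derives the main working file names for one incoming source file. The names mirror the table above, while the script itself and any paths are illustrative assumptions rather than part of the delivered solution.

    #!/bin/sh
    # Illustrative derivation of the working file names for one landed source file.
    # The source name follows the example above; this script is a sketch only.
    SRC=customers010108.dat

    INIT_FILE="Init_${SRC}"        # records tagged with a unique identifier
    MSG_FILE="Messages_${SRC}"     # trapped error and warning messages
    REJ_FILE="Rejects_${SRC}"      # addresses rejected during validation or transformation
    APID_FILE="Apid_${SRC}"        # text file handed to the APID application
    OUT_FILE="Out_${SRC}"          # APID output with unique IDs and matching APID codes
    FINAL_FILE="DQ_${SRC}"         # final file of clean addresses and APIDs

    echo "Run for ${SRC}: final output will be ${FINAL_FILE}"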
QualityStage components

All jobs will include a two-character code (CP, BL, or DM) indicating which source system they belong to: CP for Customer Correspondence, BL for the Billing file, and DM for Direct Marketing. The three routine types within QualityStage are Investigation, Standardization and Matching; each routine will be prefixed with INV for investigation processing, STN for standardization processing, or MTC for matching processing. There is a detailed description field for each routine to make it easy to identify. Saved files will retain the same name they had in the source system.

Job Stages

A prefix will be used on all stage names. This helps identify the stage when metadata about the jobs is extracted into a customer report.

Investigation : INV
Standardization : STN
Matching : MTC

The rest of the stage name will describe the primary function of the stage.

Example of the naming convention used in the Investigation routines:

Customer-id
INVTCUS : Investigate Customer-id (T) pattern
INVCCUS : Investigate Customer-id (C) content
INVDCUS : Investigate Customer-id (D) duplicate

Customer Name
INVWNAM : Investigation Name (W) word

Street Address
INVWAD1 : Investigation Address1 (W) word

Postcode Address
INVWAD2 : Investigation Address2 (W) word

Locality Address
INVWAD3 : Investigation Address3 (W) word

State Address
INVTSTA : Investigate State (T) pattern
INVCSTA : Investigate State (C) content
INVDSTA : Investigate State (D) duplicate

File Names

Source files are given names by the Data Analysts or automated processes that create them. These names are unique. QualityStage will retain these names throughout the processing cycle unless a change is requested during testing. Examples of the naming standards used in the Standardization and Matching routines are shown below.

QualityStage Standardization and Matching Naming Standard

Customer Correspondence file
CP10030A : Standardization file
CP10040 : Handled Standardization file
CP10050 : Unhandled Standardization file (Exception Report)
CP1MTCHI : CP Individual Match file
CP1MTCHI.RPT : CP Individual Match report
CP1MTCHO : CP Organisational Match file
CP1MTCHO.RPT : CP Organisational Match report

Billing file
BL0003A : Standardization file
BL0003C : Handled Standardization file
BL0003E : Unhandled Standardization file (Exception Report)
BLMTCHI : BL Individual Match file
BLMTCHI.RPT : BL Individual Match report
BLMTCHO : BL Organisational Match file
BLMTCHO.RPT : BL Organisational Match report

Direct Marketing file
DM00030A : Standardization file
DM00088A : Handled Standardization file (with DM business rules)
DM00050 : Unhandled Standardization file (Exception Report)
DMMTCHI : DM Individual Match file
DMMTCHI.RPT : DM Individual Match report
DMMTCHO : DM Organisational Match file
DMMTCHO.RPT : DM Organisational Match report

Error Handling

The target is 100% error handling.

Sequence job error handling

An exception handling stage will exist in each sequence job to trap all unexpected errors; this stage will trigger notification. The return code of each server job called will be checked and any non-success status trapped. Server jobs will be checked for reject records. The return code of each Unix script will also be checked: Unix scripts will follow a template where any error returns a non-zero code to DataStage to flag the failure, and they will echo success and failure messages back to the DataStage log for inclusion in notification.
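A minimal sketch of such a Unix script template is shown below. The step name, arguments and the mv command are illustrative assumptions; the point is the non-zero exit code and the echoed messages that DataStage picks up.

    #!/bin/sh
    # Illustrative template for Unix scripts called from DataStage sequence jobs.
    # The step and file arguments are assumptions; the pattern is what matters:
    # echo progress for the DataStage log and return non-zero on any failure.

    STEP="move landed file"        # hypothetical step name
    SRC="$1"                       # arguments passed in by the sequence job
    DEST="$2"

    echo "INFO: starting step: $STEP ($SRC -> $DEST)"

    if ! mv "$SRC" "$DEST"; then
        echo "ERROR: step failed: $STEP ($SRC -> $DEST)"
        exit 1                     # non-zero return code flags the error to DataStage
    fi

    echo "SUCCESS: completed step: $STEP"
    exit 0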

Server job error handling

Errors in the opening and creation of files will be trapped by the calling sequence job. Errors in data will be trapped by a reject link from each transform writing the record out to a reject file. Minimal transformation and validation will be carried out in server jobs; most of the data validation will be performed in QualityStage, thereby reducing the risk of DataStage errors.

Release Management

The DataStage Version Control tool is used to deliver components to testing and production. This tool handles all components in a DataStage project and any Unix scripts or applications placed in custom folders under the project directory.

Components for delivery are collected into a numbered release. The Version Control tool attaches to the development project and the required components are imported into the version control project as a release batch. The batch is then promoted to testing as read-only components. The tool will compile the jobs as they are delivered. Parameters are then updated to reflect the paths on the new system.

Notification

DataStage will be responsible for all communication with support and users. The main form of communication will be emails sent from the xmail application on the DataStage server. The following files are collected and written out in an email message:

Errors_filename : All error messages logged by DataStage server and sequence jobs. If a job aborts without a trappable error message, a message will be created for the log and this file.
Warnings_filename : Contains up to 50 warning messages from each job. If a job has more than 50 warnings it aborts. The full text of each warning is retrieved to make troubleshooting easier.
Rejects_filename : This file is sent as an attachment. Any job that has a complex transformer will have a reject link, and all these reject links append rows to this file. Since it is of unknown length and width it is sent as a tab-delimited attachment that can be opened in Excel.
Apid_stats_filename : This file contains APID statistics on the number of rows processed and the success of APID processing. Since this information is also available in the row counts, this file is ignored.

Links_filename : This file contains important link row counts such as the rows at the beginning of processing, the rows passing APID, rejected row counts and the final output row counts.
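As an illustration only, the message could be assembled on the DataStage server along the following lines; the recipient address is hypothetical and the mailx/uuencode send step is a stand-in assumption for the project's xmail application.

    #!/bin/sh
    # Illustrative assembly of the notification email for one run.
    # File names follow the conventions above; the recipient and the use of
    # mailx/uuencode (instead of the site's xmail application) are assumptions.

    BASE=customers010108.dat
    RECIPIENT=dq.support@example.com       # hypothetical address

    # Build the message body (Mail.msg) from the process reporting files.
    {
        echo "Data Quality run report for ${BASE}"
        echo "--- Errors ---";          cat "Errors_${BASE}"   2>/dev/null
        echo "--- Warnings ---";        cat "Warnings_${BASE}" 2>/dev/null
        echo "--- Link row counts ---"; cat "Links_${BASE}"    2>/dev/null
    } > Mail.msg

    # Rejected rows travel as a tab-delimited attachment (Attach.txt).
    cp "Rejects_${BASE}" Attach.txt 2>/dev/null

    # Stand-in send step; the delivered solution uses xmail instead.
    ( cat Mail.msg; uuencode Attach.txt Attach.txt ) | mailx -s "DQ run report: ${BASE}" "$RECIPIENT"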

Metadata Management

Source file definitions are imported into DataStage via the import function. Imports of flat files will use the headings already appearing in the first row of those files. Imports of COBOL files will use the COBOL definition files provided by DIT.

COBOL definition workaround

The COBOL definition files provided by DIT needed the following amendments so they could be imported and used by DataStage:

A 01 entry was added to the start of the file, as DataStage requires this numbering to identify a table.
All comments were removed, as the import function was failing to parse these rows.
All 88-level entries were removed, as these commands are intended for report formatting and altered the raw data when they were parsed.
A Unique Id field was added to the start of each row.

The definition files with the above changes were then used to create the table definitions in DataStage, defining the layout of the Billing and Correspondence files.
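A minimal sketch of how these amendments could be scripted is shown below; the file names, the inserted 01 entry and the Unique Id field are simplified assumptions rather than the actual DIT definitions.

    #!/bin/sh
    # Illustrative preprocessing of a DIT COBOL definition file before import.
    # Names and the inserted lines are simplified examples only.

    IN=billing_copybook.cbl        # hypothetical input definition file
    OUT=billing_copybook_ds.cbl    # amended copy used for the DataStage import

    {
        echo "       01  BILLING-RECORD."                  # 01 entry required by DataStage
        echo "           05  UNIQUE-ID        PIC 9(10)."  # Unique Id field added to the layout
        # Drop comment lines (asterisk in column 7) and 88-level entries.
        grep -v '^.\{6\}\*' "$IN" | grep -v '^[ ]*88[ ]'
    } > "$OUT"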

When a file definition is first imported it is given a version number of v_1 to indicate this is the first version. It retains this number until it reaches testing. If a table definition needs to be changed, for example when the Direct Marketing file receives a new address field for validation, the new format is imported and numbered as v_2. Jobs that use this file load the new field using the v_2 definition. It is then possible to extract a metadata report from the Reporting tool that shows all jobs using the old table definition that need to be updated to the new format.

Optimisation

To ensure the shortest possible delivery time of processed files, the DataStage controlling sequence job will run in continuous mode, waiting for the arrival of files in the Landing directory (a minimal watcher sketch follows the optimization guidelines below). This caters for files arriving on a schedule and files delivered ad hoc by Data Analysts. Parallel processing will be possible by running more than one type of extract at the same time; for example, a billing extract can run at the same time as a direct marketing extract in the event that the files arrive around the same time. The use of multiple instance jobs, where more than one instance of a billing job can run at the same time, was investigated and deemed an unnecessary development overhead for Phase 1 given the small size of the files. The following guidelines are used to optimize each job:

First transform reduces breadth. Where possible the first transform in any job will only carry through those columns that are needed later in the job. Any columns that are not required are ignored.

First transform reduces depth. Where a subset of rows needs to be processed, the first transform in a job, or a complex flat file stage, will attempt to filter out those rows that do not need to be processed to increase the performance of the rest of the job.

Unique Id will be used for all look-ups. A single unique id is the fastest way to look up and match different sets of data.
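As a rough sketch of the continuous-mode arrangement described above, the controlling process could poll the Landing directory along these lines; the directory, project and job names are assumptions, and dsjob options can vary between DataStage versions.

    #!/bin/sh
    # Illustrative polling loop for the Landing directory; all names are assumptions.

    LANDING=/data/dq/landing
    PROJECT=DQPROJ
    JOB=DQBatchControl

    while true; do
        for f in "$LANDING"/*.dat; do
            [ -e "$f" ] || continue        # nothing has landed yet
            echo "INFO: found $f, starting $JOB"
            dsjob -run -param SourceFile="$f" -wait "$PROJECT" "$JOB" \
                || echo "ERROR: $JOB reported a failure for $f"
            mv "$f" "$LANDING/processed/"  # assumed archive step so the file is not picked up again
        done
        sleep 60                           # poll once a minute
    done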
