Design
Design jobs for restartability.

Code
Set $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to reject rows when a data value exceeds the column size.
Use "Hash" aggregation for a limited number of distinct key values. Output is produced only after all rows are read.
Use "Sort" aggregation for a large number of distinct key values. Data must be pre-sorted. Output is produced after each aggregation group.
Use multiple aggregators to reduce collection time when aggregating all rows. Define a constant key column using a Row Generator. The first aggregator sums in parallel; the second aggregator sums sequentially.
Make sure sequences are not too long. Break them up into logical units of work.
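The Hash-versus-Sort aggregation trade-off above can be sketched in plain Python (this is not DataStage code; `hash_aggregate` and `sort_aggregate` are illustrative names for the two behaviors):

```python
from itertools import groupby

def hash_aggregate(rows):
    """Hash-style aggregation: keeps one accumulator per distinct key
    in memory and emits nothing until every input row has been read."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())  # available only after the full read

def sort_aggregate(sorted_rows):
    """Sort-style aggregation: input must already be sorted by key;
    each group's total is emitted as soon as the key changes, so only
    one accumulator is held at a time."""
    for key, group in groupby(sorted_rows, key=lambda r: r[0]):
        yield key, sum(value for _, value in group)

rows = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]
hash_result = hash_aggregate(rows)              # all groups at once
sort_result = list(sort_aggregate(sorted(rows)))  # group by group, pre-sorted
```

This is why the guideline ties "Hash" to a limited number of distinct keys (memory holds every group) and "Sort" to a large number (memory holds only the current group, at the price of a pre-sort).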
Source
What is the source (database or files, and which category of each)?
Is this the best source from which to extract this data?
Is this source maintained manually or systematically?
If data is entered manually, what additional checks are used to ensure that data quality issues caused by manual mistakes are handled?
Is the source cleaned and loaded every time?
If the source is cleaned and loaded, how are we making sure we are not getting incomplete data?
Do we get changed data or complete data (and if complete, why can't we get delta data)?
How frequently is the data updated in the source?
What is the volume of the source (delta or complete)?
Can we increase the frequency of the load if the volume of delta records is huge?
What is the percentage of growth after 2 and 5 years?
What is the record length of the source?
Is the source partitioned?
Is the source in Detroit or another location?
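The delta-versus-complete question above usually comes down to whether the source carries a reliable change marker. A minimal Python sketch, assuming a hypothetical `updated_at` last-modified field on each source row and a persisted watermark from the previous run:

```python
from datetime import datetime

def extract_delta(rows, last_watermark):
    """Return only rows changed since the previous run, plus the new
    watermark to persist for the next run. Assumes the source maintains
    a trustworthy last-modified timestamp on every row -- exactly the
    point to confirm with the source system owner."""
    delta = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in delta),
                        default=last_watermark)
    return delta, new_watermark

source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
]
delta, watermark = extract_delta(source, datetime(2024, 1, 2))
```

If the source cannot guarantee the timestamp (manual maintenance, truncate-and-reload), delta extraction like this silently misses rows, which is why the checklist asks both questions together.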
Target
Is this target table populated from multiple source systems' data?
If yes to the above question, how are we differentiating each source system?
Is this target table populated from multiple tables' data?
If yes to the above question, how are we differentiating each table's detail?
Are there any integrity/referential constraints that should be satisfied?
What would happen if an integrity constraint is not satisfied during the load?
If data comes from multiple locations, how are we making sure that it gets into the system with a single format for dates and other fields in this table?
What are the business rules around this table load?
How are we making sure that data is not shown to the business while we are in the process of loading it?
Scheduling
How many jobs are we going to have as part of this review?
Is the number of jobs more or less than the standard?
Can we combine or split jobs to reduce the number of jobs or their complexity, respectively?
In what time window should this job be executed?
Can we run this schedule during a non-busy time if we have a resource bottleneck in the specified time window?
How is the dependency scheduled for the master-child relationship?
How do we get the initial data into the system?
Will there be a separate review if we have a separate process for the initial load?
What would be the result if two schedules are deployed in the system at the same time?
How do we avoid the above scenario?
Will this flow satisfy all the requirements?
Exception Handling
Do we expect any rejections in the process?
Do we have exception handling at the stage level or for a group of stages (for the records that don't satisfy a condition)?
How is an abort in the middle of the flow handled?
Are we going to restart or re-run if there is an abort?
How are restart and re-run handled so that they don't impact data integrity?
What if the environment is not up and the process is failing because of it?
In what ways could data be populated wrongly in the target table?
Is there a recovery plan if data gets populated wrongly?
Would it be acceptable if the process loads only part of the data?
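Stage-level rejection handling can be pictured as splitting each row set into an accepted stream and a reject stream, keeping every field of the rejected record so it can be debugged and reprocessed. A small Python sketch (the function and field names are illustrative, not DataStage API):

```python
def split_rejects(rows, is_valid):
    """Route rows that fail the stage's condition to a reject stream.
    The whole row is kept, not just the failing field, so the reject
    output is enough to diagnose and later reload the record."""
    accepted, rejected = [], []
    for row in rows:
        (accepted if is_valid(row) else rejected).append(row)
    return accepted, rejected

rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]
ok, bad = split_rejects(rows, lambda r: r["amount"] >= 0)
```

Writing `bad` to a reject file rather than dropping it is what makes the later questions (recovery plan, partial loads) answerable.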
Restartability
Is this flow restartable at any point (job) of failure?
Should it be re-run if a failure happens?
What would need to be done if source data is cleaned up by the source system (for re-processing, or during a failure of processing)?
Is there a possibility of the same data getting processed more than once?
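A checkpoint is the usual answer to both restartability questions: record each completed unit of work, and skip it on restart so the same data is never loaded twice. A minimal Python sketch, assuming hypothetical batch identifiers and an in-memory checkpoint standing in for a persisted file or control table:

```python
def run_batches(batches, load, checkpoint):
    """Process batches in order, recording each completed batch id in
    the checkpoint. On restart after a mid-flow failure, completed
    batches are skipped, so a re-run cannot double-load the target."""
    for batch_id, rows in batches:
        if batch_id in checkpoint:
            continue  # already loaded in a previous run
        load(rows)
        checkpoint.add(batch_id)  # a real job would persist this durably

loaded = []
checkpoint = set()
batches = [("b1", [1, 2]), ("b2", [3])]
run_batches(batches, loaded.extend, checkpoint)
run_batches(batches, loaded.extend, checkpoint)  # restart: nothing reloaded
```

The key design point is that the checkpoint is written only after the load succeeds, so a crash between the two leaves the batch eligible for a safe retry.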
Knowledge Sharing
How busy is the system during the scheduled time window?
Update resource availability and resource utilization/sharing.
Partitioning/sorting.
Environmental
Is all metadata available?
Are the ETL-related objects available, and does the developer have access?
Are the database-related objects available?
Is all connection information available?
Is source database access available?
Is target database access available?
ETL Code Review Checklist
Guidelines to be followed when preparing for Design and Code Reviews, presented in checklist format.

Guideline
1. Design jobs for restartability; if they are not designed for it, what is the reason?
3. Check that the APT_CONFIG_FILE parameter is added. This is required to change the number of nodes at runtime.
7. Use "Hash" aggregation for a limited number of distinct key values. Output is produced only after all rows are read.
8. Use "Sort" aggregation for a large number of distinct key values. Data must be pre-sorted. Output is produced after each aggregation group.
9. Use multiple aggregators to reduce collection time when aggregating all rows. Define a constant key column using a Row Generator. The first aggregator sums in parallel; the second aggregator sums sequentially.
10. Make sure sequences are not too long. Break them up into logical units of work.
11. Is the error handling done properly? It is preferred to propagate errors from lower jobs to the higher jobs (sequence).
12. What is the volume of extracted data (is there a WHERE clause in the SQL)?
13. Are the correct scripts invoked to clean up datasets after the job completes?
16. It is not recommended to increase the number of nodes if there are too many stages (this increases the number of processes spun off).
17. Is volume and growth information available for the Lookup/Join tables?
18. Check whether there is a SELECT * in any of the queries. It is not advised to use SELECT *; instead, the required columns have to be named in the statement.
20. When a sequence is used, make sure none of the parameters passed are left blank.
21. Check that there are separate jobs for at least extract, transform, and load.
22. Check that there is an annotation for each stage and for the job; the job properties should have the author filled out.
23. Check the naming conventions of the jobs, stages, and links.
24. Try to avoid Peeks in production jobs; Peeks are generally used for debugging in development.
25. Make sure the developer has not suppressed warnings that are valid.
27. Verify that the jobs conform to the Flat File and Dataset naming specification. This is especially important for cleaning up files and logging errors appropriately.
28. Verify that all fields are written to the reject flat files. This is necessary for debugging and recovery.
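Guideline 9's two-stage pattern can be sketched in plain Python (not DataStage code): each partition computes a partial sum in parallel, a constant key column ties all partials to one group, and a single sequential aggregator combines them:

```python
def two_stage_sum(partitions):
    """Two-stage aggregation over all rows: stage 1 sums each partition
    independently (runs in parallel in the real job); the constant key
    'ALL' puts every partial in one group, so stage 2 can combine them
    with a single sequential aggregator."""
    partials = [sum(p) for p in partitions]   # first aggregator, per partition
    keyed = [("ALL", s) for s in partials]    # constant key column
    total = sum(v for _, v in keyed)          # second aggregator, sequential
    return total

grand_total = two_stage_sum([[1, 2], [3, 4], [5]])  # 15
```

The win is that the expensive pass over the full data volume happens in parallel, and only the handful of partials funnels through the sequential stage.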