
File Stages

There are 8 file stages.

1) Complex Flat File (CFF)

- The Complex Flat File (CFF) stage is a file stage.
- You can use the stage to read a file or write a file, but you cannot use the same stage to do both.
- The stage can have a single input link or a single output link, as well as a single reject link.
- When used as a source, the stage allows you to read data from one or more complex flat files, including MVS datasets with QSAM and VSAM files. A complex flat file may contain one or more GROUPs, REDEFINES, OCCURS, or OCCURS DEPENDING ON clauses. Complex Flat File source stages execute in parallel mode when they are used to read multiple files, but you can configure the stage to execute sequentially if it is only reading one file with a single reader.
- When used as a target, the stage allows you to write data to one or more complex flat files. It does not write to MVS datasets.

To use the CFF stage, specify the stage properties in the File Options tab. If reading a file or files:
- Specify the type of file you are reading.
- Give the name of the file or files you are going to read.
- Specify the record type of the files you are reading.
- Define what action to take if files are missing from the source.
- Define what action to take with records that fail to match the expected metadata.

If writing a file or files:
- Specify the type of file you are writing.
- Give the name of the files you are writing.
- Specify the record type of the files you are writing.
- Define what action to take if records fail to be written to the target file(s).

In the Record Options tab, describe the format of the data you are reading or writing. In the Stage page Columns tab, define the column definitions for the data you are reading or writing with this stage.

2) Data Set

The Data Set stage is a file stage. What is a data set? DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. If you open the files directly in the operating system, the data appears in an internal format. A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments. Each partition of a data set is stored on a single processing node. Each data segment

contains all the records written by a single DataStage job. So a segment can contain files from many partitions, and a partition has files from many segments.

The descriptor file for a data set contains the following information:
- Data set header information.
- Creation time and date of the data set.
- The schema of the data set.
- A copy of the configuration file used when the data set was created.

For each segment, the descriptor file contains:
- The time and date the segment was added to the data set.
- A flag marking the segment as valid or invalid.
- Statistical information such as the number of records in the segment and the number of bytes.
- Path names of all data files, on all processing nodes.

This information can be accessed through the Data Set Manager. A data set can also act as a lookup reference. By using the command orchadmin rm <<path>>/<<dataset name>> you can delete the data set from UNIX. You can Append to or Overwrite a data set. Through the Data Set management utility you can delete a partition, a segment, or the entire data set. The stage can have a single input link or a single output link. It does not support a reject link. It can be configured to execute in parallel or sequential mode. As a source you cannot specify the partitioning to use when reading data; as a target you can specify the partitioning technique to be used. A data set does not maintain unique records: if you send duplicate records to it, it stores the duplicates as they are. You can change the configuration file each time you load data into the same data set, but for Append the descriptor keeps the configuration file details recorded when the data set was first created, whereas for Overwrite the descriptor holds only the new configuration details. If you set the update policy to Append and send more columns than the data set contains, it loads the data for the existing matching columns and logs a warning.
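A minimal command-line sketch of the data set utilities mentioned above and in the questions below; the path and data set name are hypothetical, and the exact options may vary by DataStage version:

  dsrecords /data/ds/customers.ds    # report the number of records in the data set
  orchadmin rm /data/ds/customers.ds # delete the descriptor file and its data files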

Questions:
1) How to count the number of records in a data set?
Ans) dsrecords <<DataSet Name>>
Note: You cannot use the same data set name as both a source and a target. You cannot update a data set.

3) External Source

The External Source stage is a file stage. It allows you to read data that is output from one or more source programs. The stage calls the program and passes appropriate arguments. The stage can have a single output link and a single reject link. It can be configured to execute in parallel or sequential mode. It cannot be used as a target and does not accept an input link. The External Source stage allows you to perform actions such as interfacing with databases not currently supported by DataStage Enterprise Edition.

When reading output from a program, DataStage needs to know something about its format. The information required is how the data is divided into rows and how rows are divided into columns. You specify this on the Format tab.

4) External Target

The External Target stage is a file stage. It allows you to write data to one or more programs. The stage can have a single input link and a single reject link. It can be configured to execute in parallel or sequential mode. It cannot act as a source. The External Target stage allows you to perform actions such as interfacing with databases not currently supported by the DataStage Parallel Extender. When writing to a program, DataStage needs to know something about how to format the data. The information required is how the data is divided into rows and how rows are divided into columns. You specify this on the Format tab. Settings for individual columns can be overridden on the Columns tab using the Edit Column Metadata dialog box.

5) File Set

The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can have a single input link or a single output link, and a single reject link. It only executes in parallel mode. What is a file set? DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns. The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on:
- The number of processing nodes in the default node pool
- The number of disks in the export or default disk pool connected to each processing node in the default node pool
- The size of the partitions of the data set
Unlike data sets, file sets carry formatting information that describes the format of the files to be read or written. The descriptor file name has the suffix .fs. The stage creates the data files under each node, in the directories specified in the configuration file.

6) FTP Plug-in
- It allows only one input link or one output link.
- You can load data to, or get data from, a remote server.
- As a source, it cannot produce a reference or reject link.
- As a target, it cannot have a reference link.

7) Lookup File Set

The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating lookup file sets, one file is created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs. As a source (i.e. as a reference) you cannot see the execution mode. When using a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, note that the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup. Specify the key that the lookup on this file set will ultimately be performed on. You can repeat this property to specify multiple key columns. You must specify the key when you create the lookup file set; you cannot specify it when performing the lookup. The key column is case sensitive by default. The stage allows you to store duplicate records if you set the Allow Duplicates option to True. By default, if the Lookup File Set contains duplicate records, the duplicates are sent to the target. If the lookup uses more than one Lookup File Set containing duplicates, the job aborts.

8) Sequential File

Default options available for this stage: Execution Mode, Combinability Mode, Preserve Partitioning, Buffering Mode, Partition Type, Sorting.

As a Source
-----------
- You can read one or more delimited or fixed-width files at once, but there is only one Columns tab for all the files, which means all the files must have the same column structure. While processing, the stage combines all the data as one source.
- The stage executes in parallel mode if reading multiple files, but executes sequentially if it is only reading one file.
- By default, it reads data in sequential mode. If you want to read data in parallel mode, you have to set the Read From Multiple Nodes option; the mode then changes automatically from sequential to parallel, and you cannot select the execution mode yourself. (If you specify the number of nodes as 1, it is sequential; any other value, such as 2, 3, 0, -1, -2, etc., gives parallel mode.)
- The stage executes in parallel if writing to multiple files, but executes sequentially if writing to a single file. Each node writes to a single file, but a node can write more than one file.

If you specify the read method as File Pattern, then specify the pattern (e.g. *.txt). Again, there is only one Columns tab for all the files, which means all the files must have the same column structure. You can filter the data at the source itself (see the sketch below). You can generate row numbers per file. You can read a number of records from the top or the bottom. You can have one reject link.
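As a rough sketch of the source-side filter, assuming the Filter property behaves as a UNIX command that the raw data is piped through before the stage parses it (the '#' comment convention in the sample file is hypothetical):

  grep -v '^#'

Rows removed by the command never reach the column parsing, so this is a cheap way to drop unwanted lines before they enter the job.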

As a Target
-----------
- If you specify multiple files as target files, the data is split across them: some of the data goes to one file, some to another, and so on.
- You can load the data into delimited or fixed-width flat files.
- You can filter the data while storing it.
- You can have one reject link.

Processing Stages

1) Aggregator
It supports one input link and one output link. It does not support a reject link. It is used to perform aggregation functions on data. You can group by any number of columns and you can also specify case sensitivity. Everything is based on the group-by column(s). There are three aggregation types: Calculation, Re-calculation, and Count Rows.

Calculation
- Maximum value output column
- Mean value output column. Formula: mean = Sum(X) / N
- Minimum value
- Missing value - which value to treat as a missing value
- Corrected sum of squares. Formula: Sum(x^2) = Sum(X^2) - (Sum(X))^2 / N, or Sum(xy) = Sum(XY) - (Sum(X) * Sum(Y)) / N
- Missing values count - how many records contain NULL values
- Non-missing values count - how many records contain non-NULL values
- Percent coefficient of variation
- Range (Max - Min)
- Standard deviation
- Standard error. Formula: standard deviation / square root of the population size, i.e. STDEV(range of values) / SQRT(number), where number is the total number of records
- Sum of weights - number of records per group key
- Sum
- Summary
- Uncorrected sum of squares
- Variance
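A small worked example of the mean and corrected sum of squares formulas above, using the hypothetical values 2, 4, 6 within one group:

  N = 3, Sum(X) = 2 + 4 + 6 = 12, mean = 12 / 3 = 4
  Sum(X^2) = 4 + 16 + 36 = 56
  Corrected sum of squares = 56 - (12^2) / 3 = 56 - 48 = 8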

Re-calculation
- It contains all the properties of the Calculation category.

Count Rows
- Count output column
- Weighting column

Important Points:
- We can use one Aggregator stage for one aggregation type only.

Questions:
1) If a column contains null values while doing group by, what will happen?
2) What are the 2 aggregate methods?
Ans) Hash and Sort.

2) Change Apply
- It is used to apply the changes.
- It takes one input as the original data set, and for the second input it takes the data set created by (i.e. after) a Change Capture stage.
- By using the change code it decides which records need to be inserted.
- It creates a data set which does not contain the deleted records; all other records will be there.

3) Change Capture
- It is used to capture the changes between two different files.
- You can specify the following options for comparison:
  o Explicit Keys & Values - key columns and value columns both need to be specified explicitly.
  o Explicit Keys, All Values - all columns are value columns; key columns need to be specified explicitly.
  o All Keys, Explicit Values - all columns are key columns; value columns need to be specified explicitly.
- Other options on data:
  o Drop output for copy - records which are not modified.
  o Drop output for delete - records which are not there in the new file.
  o Drop output for edit - records which are modified.
  o Drop output for insert - records which are not in the old file and are present in the new file.
- Change codes:
  0 - common record without changes
  1 - new record (does not exist in the old file, exists in the new file)
  2 - deleted record (the record is in the old file but not in the new file)
  3 - the record is modified
- Exclude as a key means ignore this column as a key for comparison.

4) Compare
- It is used to compare two files.

5) Compress
- It compresses the data using gzip.

6) Copy
It can have only one input link, i.e. the primary link. It can have many output links (the output links may be reference links); it does not support a reject link as an output link. We can change the column names as well. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop columns or change the order of columns (to copy with more modification, for example changing column data types, use the Modify stage as described in Chapter 28). Copy lets you make a backup copy of a data set on disk while performing an operation on another copy. Where you are using a Copy stage with a single input and a single output, you should ensure that you set the Force property in the stage editor to True. This prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job. You can copy specified columns to specified output links without modifying the data types. We need to set the Force property (True or False) for the Copy stage.

Force: set to True to specify that DataStage should not try to optimize the job by removing a Copy operation where there is one input and one output. It is set to False by default.

7) Decode
- It decodes data using a UNIX decoding command (i.e. data previously encoded with the Encode stage).

8) Difference
- It compares two data sets and outputs the differences.

9) Encode
- It encodes data using a UNIX encoding command.

10) Expand
- It is used to uncompress (unzip) data that was compressed with gzip.

11) External Filter
- It filters data using a UNIX command.

12) Filter
- It filters records and routes them to specific links, based on conditions like a=2.
- It can have a single input link.
- It can have many output links, by specifying many conditions.
- It can have a single reject link, by setting the property.
- It supports standard SQL expressions, except when comparing strings.
- You can specify multiple conditions by using the AND or OR operators.
- You cannot write if-then-else conditions.
- You cannot use predefined functions.
- You can restrict a record to go only to the first link whose condition it matches by setting the Output Row Only Once property to True.

13) FTP Enterprise
- It uses a URL to transfer the files.

14) Funnel
- It is used to combine multiple data sets.
- It can have any number of input links, one output link, and no reject link.
- We need to specify the funnel type; the default is Continuous Funnel.
- There are 3 types of funnels: Continuous Funnel, Sort Funnel, and Sequence.
- Continuous Funnel combines the records of the input data in no guaranteed order. It takes one record from each input link in turn. If data is not available on an input link, the stage skips to the next link rather than waiting. For this mode, link ordering is important.
- Sort Funnel combines the input records in the order defined by the value(s) of one or more key columns, and the order of the output records is determined by these sorting keys. For this we must specify the key column and the order mode. We can specify whether null values come first or last.
- Sequence copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on.

15) Generic
- It is used to call an Orchestrate operator directly.

16) Join
- It performs join operations on two or more data sets.
- In the Join stage, the input data sets are notionally identified as the right set, the left set, and intermediate sets. You can specify which is which.
- It has any number of input links and a single output link (except for a full outer join, which accepts only 2 input links).
- The output link can be either a stream or a reference link.
- It does not support a reject link.
- Apart from the left and right sets, all other inputs are intermediate sets (e.g. intermediate1, intermediate2, etc.).
- The key column(s) must have the same name in all input data sets.
- It performs inner, left outer, right outer, and full outer joins on input data.
- You can specify the Execution Mode.
- You can specify the link ordering, i.e. which input is the left, intermediate, or right set(s).
- If the left, intermediate, and right sets have the same column, it takes the left set's data only.
- If the left and intermediate sets have the same column, it takes the left set's data only.
- If the intermediate and right sets have the same column, it takes the intermediate set's data only. That is, it follows the ascending order of the input link order.
- The Join stage sorts the data coming into it and then performs the join operations.

Performance Tips - The data sets input to the Join stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done. If sorting and partitioning are carried out on separate stages before the Join stage, DataStage in auto mode will detect this and not repartition (alternatively you could explicitly specify the Same partitioning method).

17) Lookup
It is used to perform lookup operations on a data set read into memory from any other parallel job stage that can output data. The Lookup stage can have one primary input link, many reference links (at least one reference link is required), a single output link, and a single reject link (optional). You can change the order of the reference links. The lookup key columns do not have to have the same names in the primary and the reference links. You can specify a condition on each of the reference links. If the condition fails, the stage does not do the lookup for that record even though a match is found. You can specify a condition on the input data or on previous links according to the link ordering; you cannot specify a condition on that link itself. If you want the comparison performed on a column to ignore case, select the Caseless checkbox. The following are the Condition Not Met and Lookup Failure options:
- Continue: it produces the output without the lookup values
- Drop: it drops that record
- Fail: it makes the job abort
- Reject: it sends the non-matching records to the reject link
If a reference link contains duplicate records, warnings are logged to the job log (i.e. the Director). If the input contains duplicate records, nothing special happens. If you want to send duplicate records to the target from only one link, specify that link name in the Multiple rows returned from link drop-down list. Generally, while doing a lookup the stage fetches all the reference data into memory and performs the lookup there. If the reference link is a database, the following lookup types are available:
- Normal
- Sparse
In a Normal lookup, the reference data is fetched into memory and the lookup is performed there. In a Sparse lookup, the lookup happens in the database at run time for each record. With Sparse mode you cannot have other reference links to the Lookup stage; it can have only one reference link, which is the database (in sparse mode). The Build tab is required when you want C++ code to be used for the lookup. It overrides the project defaults and uses the specified C++ compiler and linker flags.

Imp:
1) Make sure the reference data is small. If the reference data does not fit into memory, the job will abort.
2) Normal lookup is faster than Sparse. If you are performing a large number of operations, go for Sparse.

3) If you load all the reference data into a Lookup File Set or Data Set and use that as the reference, the stage performs a direct lookup, so it works faster. In a Lookup File Set, a lookup table is created. A Lookup File Set link will not be visible in the Multiple rows returned from link option.

Drawbacks: be aware, though, that large in-memory lookup tables will degrade performance because of their paging requirements.

18) Merge
It can have any number of input links (at least two: one master set and one or more update sets), a single output link, and either the same number of reject links as there are update input links or no reject links. It sends all the master data to the output link, but if the master data contains duplicate records, only the first record gets the update values; the remaining records with the same key contain null values in place of the update values. Unmatched records from the update sets are sent to the associated reject links. As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well. If an update set contains duplicate records, this stage uses the first record as the update record and sends the other matching records, along with the non-matching records, to the reject link. This happens only when the job contains more than one update set. If it contains only one update set, the stage performs a join: it does not send the other matching records to the reject link, it sends them to the output link. If any link contains duplicate records, the job raises warnings; to overcome this, some properties need to be set in the Merge stage. If the targets are Sequential File stages, then in the Merge stage you need to set Preserve Partitioning to Clear, otherwise it raises warnings. Input data to Merge must be key partitioned and sorted.

Performance Key Tips: the data sets input to the Merge stage must be key partitioned and sorted.

Questions:
1) What are the input link names?
Ans) The primary input is Master; the others are Update1, Update2, Update3, etc.
2) What are the output link names?
Ans) The primary output carries the merged data; the reject links are named Update1 reject, Update2 reject, Update3 reject, etc.
3) How many input links can a Merge stage have?
Ans) One primary, and many update links.
4) How many output links can a Merge stage have?
Ans) One primary output, plus as many reject links as there are update sets.
5) How many reject links can a Merge stage have?
Ans) None, or as many as there are input update sets.
6) What will happen if the primary input contains duplicate records?
Ans) Matching will happen only for one record out of all the duplicate records. For the other duplicate records the update values will be null (for string columns) or 0 (for integer columns).
7) What will happen if update sets contain duplicate records?
Ans) For multiple update sets, only one record out of all the duplicates is considered for matching and the rest of the records are moved to the reject links. For a single update set, all the duplicate records are considered for matching, which means an equi-join is performed.
8) What is the minimum requirement for Merge?
Ans) The key column(s) should have the same name in all inputs.


19) Modify
It can have a single input link and a single output link. It cannot produce reject links. The Modify stage alters the record schema of its input data set. The modified data set is then output. You can drop or keep columns from the schema, or change the type of a column. A specification must be entered for each column that requires modification (see the sketch below). KEEP col_name1 [, col_name2] means it keeps only these columns and drops the others. DROP col_name1 [, col_name2] means it drops only these columns and keeps the others. For newly added columns: new_col_name:string = string_from_decimal(CUSTID). We can modify existing columns as well, e.g. for handling null values.
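A minimal sketch of the specifications described above; each line is one Specification entry, and the column names (CUSTID, ADDRESS) are hypothetical:

  DROP ADDRESS
  CUSTID_STR:string = string_from_decimal(CUSTID)

The first specification drops the ADDRESS column from the schema; the second adds a new string column converted from the decimal CUSTID column. A KEEP CUSTID, CUSTNAME specification could be used instead of DROP to keep only the listed columns.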

20) Pivot
It converts columns into rows. In the derivation you list the input columns to be pivoted, e.g. a,b (these are input columns); see the illustration below. It does not support a reject link.
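A tiny illustration of the idea, with hypothetical columns: if an input row has ID=1, a=10, b=20 and the pivoted output column's derivation is a,b, the stage produces two output rows:

  ID=1, value=10
  ID=1, value=20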

21) Remove Duplicates


It sends unique records, based on the key column(s), to the target. It has only one input link and one output link. It does not support a reject link. Under the Key property you can specify the key column name; the drop-down list for this property contains all the input columns. You can choose which record to retain from the duplicates: the first one or the last one. You can also pass specific columns to the target.

22) Sort
It has a single input link and a single output link only. It does not support a reject link. It sorts the data using the specified key columns. It uses temporary disk space while performing the sort (i.e. TMPDIR, the scratch space defined for the nodes). We can specify where null values should appear in the sort order. We can stop or process duplicate records. We can specify how much memory is to be used. We can set the Stable Sort option to True to preserve the previous ordering of records that have equal key values. You can log output statistics by setting that property to True. You can specify the Sort Utility as the DataStage built-in sort or UNIX sort by setting this property.

23) Surrogate Key Generator
- It generates a series of numbers, used as surrogate keys.

24) Switch
- It routes records. It works like a C switch statement.
- It can have a single input link.
- It can have many output links (up to 128 output links are allowed).
- It can have a single reject link, by choosing the property.
- You can specify only one column to route the records on.
- Link order is important while routing the records.
- Selector, Selector Mode, and Case need to be specified for routing.
- Selector indicates which column is used for routing.
- The Selector Modes are:


o User-defined Mapping: this is the default, and means that you must provide explicit mappings from case values to outputs. If you use this mode you specify the switch expression under the User-defined Mapping category.
o Auto: this can be used where there are as many distinct selector values as there are output links.
o Hash: the incoming rows are hashed on the selector column modulo the number of output links and assigned to an output link accordingly.
We can discard records that are not required at all. In user-defined mapping you specify cases in the form value=link number, for example (see the sketch below):
Case = 1=0
Case = 2=0
This means all records having the value 1 or 2 will go to link 0.
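A minimal sketch of a user-defined mapping, assuming a hypothetical selector column with values 1-4 and three output links (0, 1, 2):

  Case = 1=0
  Case = 2=0
  Case = 3=1
  Case = 4=2

Values 1 and 2 route to link 0, value 3 to link 1, and value 4 to link 2; any other value is handled according to how the stage is configured (for example discarded, sent to the reject link, or treated as a failure).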

25) Transformer
- It does not use the generic user interface (stage editor).
- You can pass data through or change the data on its way to the target.
- You can create stage variables.
- It supports only one input link.
- It supports many output links.
- It supports only one reject link.
- It supports many reference links.
- You can change the execution order of the targets.
- You can make the job abort by using the Abort After Rows option.
- You can add new columns.
- You cannot designate a specific link to receive rejected records by using a rejected keyword in constraints.

RCP - Runtime Column Propagation
DataStage is also flexible about metadata. It can cope with the situation where metadata isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the metadata when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). This can be enabled for a project via the DataStage Administrator and set for individual links via the Outputs page Columns tab for most stages, or in the Outputs page General tab for Transformer stages. You should always ensure that runtime column propagation is turned on if you want to use schema files to define column metadata.

