
Dataset components

Dataset components represent data records or act upon data records as follows:
INPUT FILE represents data records read as input to a graph from one or more serial
files or from a multifile.
INPUT TABLE unloads data records from a database into an Ab Initio graph, allowing
you to specify as the source either a database table, or an SQL statement that selects data
records from one or more tables.
INTERMEDIATE FILE represents one or more serial files or a multifile of
intermediate results that a graph writes during execution, and saves for your review after
execution.
LOOKUP FILE represents one or more serial files or a multifile of data records small
enough to be held in main memory, letting a transform function retrieve records much
more quickly than it could retrieve them if they were stored on disk.
OUTPUT FILE represents data records written as output from a graph into one or
multiple serial files or a multifile.
OUTPUT TABLE loads data records from a graph into a database, letting you specify
the records' destination either directly as a single database table, or through an SQL
statement that inserts data records into one or more tables.
READ MULTIPLE FILES sequentially reads from a list of input files.
WRITE MULTIPLE FILES writes records to a set of output files.

INPUT FILE
Purpose
Input File represents data records read as input to a graph from one or multiple serial files
or from a multifile.
Parameters
The Input File Properties dialog does not have a Parameters tab.
However, you can specify values for parameters on the Description, Access, and Ports
tabs. This includes parameters such as input file location, file handling behavior and
permissions, and the input record format.

File Properties: Description Tab


To display the File Properties: Description Tab:
Double-click a file component to open the File Properties dialog.
Click the Description tab.
Use this tab to view information about the component, to display Help for the component,
to convert the File type to another type, or to set or change any of the following:
Label

Displays the name of the component by default. We recommend that you edit the label to
give the component a meaningful name, for example, XYZcustomers or
ABCtransactions.
Phase
Indicates the component's runtime stage. You can reassign the phase number here.
Name
Converts Label into a format that can be stored with the graph. For example, any spaces
in the Label are automatically converted to underscores in the Name.
File Type
This option lets you set a file type for a new file or change a File component from one
type of file to another. For example, you might want to use an Output File from one graph
as an Input File for another graph. Select one of the following file types:
Input Changes the File component to an INPUT FILE.
Output Changes the File component to an OUTPUT FILE.
Note: To write to a file that can be used as a lookup file in a later Phase, click Output,
then select Add To Catalog.
Intermediate Changes the File component to an INTERMEDIATE FILE.
Lookup Changes the File component to a LOOKUP FILE.
Add To Catalog Makes a file available as a Lookup File in later phases of the
graph.
When you select Add to Catalog, a Parameters tab becomes available to set the key
parameter for the Lookup File so you can specify the key field against which you want
Lookup File to match its arguments.
URL
Displays the location of the input, output, or temporary file. An example of the location
of an input file is:
file://mynode.abinitio.com/tmp/mfs/input.dat
This example refers to a serial file located on a computer named mynode.abinitio.com in
the directory /tmp/mfs/, called input.dat.
You can edit the URL, or click Browse to access the Open File dialog and search for the
file you want to use.
NOTE: If you use a multifile URL to specify a parallel layout, the format is:
mfile://nodename/directory name/file name.
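As a rough illustration of how these URLs break down, the following Python sketch splits a dataset URL into its scheme, node name, directory, and file name using the standard urllib.parse module. It is not part of Ab Initio: the function name describe_dataset_url is hypothetical, and the mfile URL shown simply reuses the example path above.

```python
import posixpath
from urllib.parse import urlsplit


def describe_dataset_url(url):
    """Split a file:// or mfile:// URL into the parts described above."""
    parts = urlsplit(url)
    directory, file_name = posixpath.split(parts.path)
    return {
        "scheme": parts.scheme,
        "node": parts.netloc,          # computer that holds the file
        "directory": directory,
        "file": file_name,
        "is_multifile": parts.scheme == "mfile",
    }


serial = describe_dataset_url("file://mynode.abinitio.com/tmp/mfs/input.dat")
print(serial["node"], serial["directory"], serial["file"])

# Hypothetical multifile URL built from the same node and path.
multi = describe_dataset_url("mfile://mynode.abinitio.com/tmp/mfs/input.dat")
print(multi["is_multifile"])
```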
Partitions
Creates a multifile without creating a multidirectory. Select Partitions, then click Edit to
define or modify the partitions.
Export Displays the Export as Parameter Dialog. Click Help in this dialog for
instructions on how to export the graph or subgraph parameters.
Edit Data Displays the Editing Dataset Window. Click Help in this window for
instructions on how to modify the data manually and then save to the data file.

Owner Displays the creator of the file. If the box is blank, enter your name.
Comment Displays a brief description of the component. You can add your own
notes to the end of the description or overwrite any existing text.

File Properties: Access Tab


To display the File Properties: Access Tab:
Double-click a file component to view the File Properties dialog.
Click the Access tab.
Use this tab to view, assign, or change File Handling preferences and File Protection
options on an Input, Output, or Intermediate File component, or to display Help for the
component.
File Handling
You set File Handling preferences in Dataset components using checkboxes. There are no
default settings for Input datasets; the defaults for Output and Intermediate datasets are
Delete file and Create file. Different options are available, depending on whether this is
an Input, Output, or Intermediate dataset.
Select this checkbox:                      To:
Maintain exclusive access to file          Prohibit use by others.
Delete file after last use                 Remove the file after the phase that reads it completes.
Delete file if it exists first             Remove any files with this name before proceeding.
                                           The default for Intermediate and Output.
Create file if it does not exist           Generate a file with this name if it does not already
                                           exist. The default for Intermediate and Output.
Append output to end of file when writing  Add output to the end of this file when writing to file.
Truncate forward when reading              Delete segments of the file after they are read. This
                                           option only applies to a segmented file.
Don't roll back file if job fails          Keep this file during recovery if the Co>Operating
                                           System encounters an error.


File Protection
These checkboxes represent the permissions assigned, during file creation, to the Input,
Output, and Intermediate files, also called Dataset components. The checkboxes match
the UNIX file protection standards: r (read), w (write), and x (execute) for User, Group,
and Other. You can change the settings in these checkboxes for Output or Intermediate
files only.
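To make the mapping between the checkboxes and UNIX permissions concrete, here is a minimal Python sketch that turns a set of r, w, and x selections for User, Group, and Other into a numeric file mode. The function name mode_from_checkboxes is hypothetical, and the sketch only illustrates the UNIX convention; it does not show how the GDE stores these settings.

```python
import stat

# Permission bits for each class of user, taken from the standard stat module.
_BITS = {
    "user": {"r": stat.S_IRUSR, "w": stat.S_IWUSR, "x": stat.S_IXUSR},
    "group": {"r": stat.S_IRGRP, "w": stat.S_IWGRP, "x": stat.S_IXGRP},
    "other": {"r": stat.S_IROTH, "w": stat.S_IWOTH, "x": stat.S_IXOTH},
}


def mode_from_checkboxes(user="", group="", other=""):
    """Combine the selected r/w/x checkboxes into a UNIX file mode."""
    mode = 0
    for who, flags in (("user", user), ("group", group), ("other", other)):
        for flag in flags:
            mode |= _BITS[who][flag]
    return mode


# Read and write for User, read-only for Group and Other.
print(oct(mode_from_checkboxes(user="rw", group="r", other="r")))  # 0o644
```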

Properties: Ports tab

Use the Ports tab to set the record formats for a component's ports, to validate the record
format, or to display Help for the selected component.
NOTE: Yellow boxes indicate ports that require record formats.
To display this tab, do one of the following:
Double-click any component that has ports to view its Properties, and then click
the Ports tab.
Double-click a component port that does not have a record format associated with
it to view its Properties, and then click the Ports tab.
Defining a Record Format
The Ports tab offers the following choices for defining a record format.
Propagate From Neighbors
Select Propagate From Neighbors to automatically copy record formats to this port
from connected components or, if Propagate Through is selected, to copy record
formats between the in and out ports of this component. Propagate From Neighbors is
the default.
NOTE: Although Propagate Through is the default on most components, it is not the
default on many transform components, due to the fact that a transform function most
often results in different record formats on the out and in ports. Therefore, you will often
see yellow To-Do Cues on the out or in port of a transform component even though all
flows are connected and Propagate From Neighbors is selected. If you encounter a
transform component that uses the same record format on both ports, you can select
Propagate Through manually for that component.
Same As
Select Same As to apply the record format from another port in the graph to the port you
have selected in the Output or Input Ports pane.
To use the Same As option:
Select Same As.
A menu appears that lists the names of all the other components in the graph.
On this menu, point to the name of the component that has the port with the
record format you want to use. (A menu with a list of the ports on the component
you point to appears.)
Click the name of the port that has the record format you want to use. (The name
of the port and component you selected appears in the Same As box.)
Goto
The Goto button becomes available when you choose:
Same As
Propagate from Neighbors and select a port that has a propagated record format
(the flows in the graph must be connected in order for propagation to occur).
Click Goto to take you to the Ports tab for the port which is supplying the record format
for the current port. That port is identified in the Same As box.

Path
Select Path to specify the name of a file that contains the record format you want to use
for the port you have selected in the Input or Output Ports pane.
To specify the filename, type the path to the file you want to use in the Path box, or do
the following:
Select Path.
Beneath the Path box, select Local File, EME Datastore, or Host File, depending
on the location of the file you want to use.
Do one of the following:
o In the Path box, type the path to the file you want to use
o Click Browse and navigate to the file you want to use.
Embedded
Select Embedded if you want to write the record format for the port you have selected in
the Input or Output Ports pane and save it as a string embedded in the graph. Using this
option limits the number of small record format files you need to save outside the graph.
Once you have selected Embedded, various combinations of four buttons become
available to the right of the Embedded box. Possible buttons are:
New
Edit
Generate
Validate
When you select a port that has a default embedded record format, the DML string that
defines that default format appears in the Embedded box. If you want to edit the default
format, you can edit it in the box, or click Edit to open the Record Format Editor and
edit the format.
To write an embedded record format, first select Embedded. Then do one of the
following:
Type the record format in the Embedded box.
Click New to open the Record Format Editor and create the format.
Other Choices
Depending on the option you choose for defining a record format, various combinations
of buttons become available along the right side of the tab. The possible buttons are
described below.
Propagate Through
This option is available regardless of whether you choose Propagate From Neighbors,
Same As, Path, or Embedded.
Click Propagate Through to copy record formats between a pair of ports within a
particular component. The pair of ports must be one of the following:
in and out
in and out0
in0 and out
inN and outN, for all N
read and write
The ports must have these exact names. If a component does not have a pair of ports with
these names, Propagate Through is not enabled. A record format can be propagated in
either direction. That is, you can propagate from in to out or from out to in.
In a Reformat component that has multiple output ports labeled out0, out1, out2, and so
on, selection of Propagate Through affects the out0 port, but not the other outN ports.
Similarly, in a Join component that has multiple input ports labeled in0, in1, and so on,
selection of Propagate Through affects the in0 port, but not the other inN ports.
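The naming rules above can be sketched in Python as follows. This is only an illustration of the port-name matching described in this section, not GDE code; the function name propagate_through_pairs and the port lists passed to it are assumptions made for the example.

```python
import re

# Port pairs with fixed names that qualify for Propagate Through.
FIXED_PAIRS = [("in", "out"), ("in", "out0"), ("in0", "out"), ("read", "write")]


def propagate_through_pairs(port_names):
    """Return the (input, output) port pairs that qualify for Propagate Through."""
    names = set(port_names)
    pairs = [(a, b) for a, b in FIXED_PAIRS if a in names and b in names]
    # Numbered ports pair up only when the numbers match: in0/out0, in1/out1, ...
    for name in sorted(names):
        match = re.fullmatch(r"in(\d+)", name)
        if match and "out" + match.group(1) in names:
            pairs.append((name, "out" + match.group(1)))
    return pairs


# A Reformat-style component: only out0 takes part.
print(propagate_through_pairs(["in", "out0", "out1", "out2"]))
# A Join-style component: only in0 takes part.
print(propagate_through_pairs(["in0", "in1", "out"]))
```

A component whose ports match none of these names yields an empty list, which corresponds to Propagate Through not being enabled.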
New
Click New to open the Record Format Editor. Use the Editor to create a record format
for the selected port. New only becomes available when the selected port has no record
format attached to it.
Edit
Click Edit to open the Record Format Editor. Use the Editor to edit an existing record
format displayed in the Same As, Path, or Embedded boxes. Edit only becomes
available when the selected port has a record format attached to it.
Generate
Click Generate to automatically create the record format for the selected port from the
record format of a database table or SAS dataset. Generate only becomes available for
Input Table and Output Table components.
Validate
Click Validate to check a record format for correct syntax.
More/Less
Click More for access to features that enable you to:
Modify the interpretation of a record format
Select one of the choices in the Interpretation box.
Export a record format from a component as a parameter of its containing graph
or subgraph.
Click Export to open the Export as Parameter dialog. In the dialog, click Help for
more information.

Layout and Record Format Propagation


To expedite the graph building process, the GDE automatically assigns component layout
and port properties when you connect flows. This is called propagation, and it is the
default behavior.

NOTE: Before setting the layouts or record formats for program components, connect the
flows in the graph. Most program component layouts and record formats propagate
automatically.

Layouts
When the GDE propagates a layout, it puts an asterisk (*) next to the layout marker
(L1*). The GDE propagates layout between components connected by either straight or
all-to-all flows. If a component has both a straight and an all-to-all flow connected to it,
the GDE propagates the layout for that component from the component connected to it by
the straight flow.
To display or hide layout markers, on the menu bar of the GDE, choose View > Options
and select or clear the Layouts checkbox.
Record Formats
When the GDE propagates a record format, it puts an asterisk (*) next to the port name
(out*). The GDE propagates record formats between components connected by any type
of flow.
NOTE: The GDE will not propagate record formats through a transform component, so
you must set these record formats manually.
Disabling/Re-enabling Propagation
You can disable or re-enable layout and record format propagation at the component and
at the graph level.
To disable/re-enable propagation on a component:
Double-click the component to display the Properties dialog.
On the Layout and Ports tabs, clear or select Propagate from Neighbors.
To disable/re-enable propagation across a graph:
From the Edit menu, clear or select Propagation>Record Format or
Propagation>Layout.

What is Layout?
Before you can run an Ab Initio graph, you must specify layouts to describe the following
to the Co>Operating System:
The location of files
The number and locations of the partitions of multifiles

The number of, and the locations in which, the partitions of program components
execute.
A layout is one of the following:
A URL that specifies the location of a serial file
A URL that specifies the location of the control partition of a multifile
A list of URLs that specifies the locations of:
The partitions of an ad hoc multifile
The working directories of a program component
Every component in a graph, both dataset and program components, has a layout. Some
graphs use one layout throughout; others use several layouts and repartition data as
needed for processing by a greater or lesser number of processors.
During execution, a graph writes various files in the layouts of some or all of the
components in it. For example:
An INTERMEDIATE FILE component writes to disk all the data that passes
through it.
A phase break, checkpoint, or watcher writes to disk, in the layout of the
component downstream from it, all the data passing through it.
A buffered flow writes data to disk, in the layout of the component downstream
from it, when its buffers overflow.
Many program components (Sort is one example) write, then read and remove,
temporary files in their layouts.
A checkpoint in a continuous graph writes files in the layout of every component
as it moves through the graph.
Critical Concerns
The layouts you choose can be critical to the success or failure of a graph. In order for a
layout to be effective, it must fulfill the following conditions:
The Co>Operating System must be installed on the computers specified by the
layout.
The run host must be able to connect to the computers specified by the layout.
The layout must allow enough space for the files the graph needs to write there.
The permissions in the directories of the layout must allow the graph to write files
there.
If a layout does not fulfill these conditions, the graph will fail. See Actual Working
Directories for details about exactly where the Co>Operating System locates the files it
writes during the execution of a graph.
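Two of these conditions, free space and write permission, can be checked ahead of time for a directory on the local computer. The following Python sketch shows one way to do that with the standard library. It is an illustration only: the function name check_layout_directory and the space estimate passed to it are hypothetical, and it does not check whether the Co>Operating System is installed or whether the run host can reach a remote computer.

```python
import os
import shutil


def check_layout_directory(path, bytes_needed):
    """Return a list of problems found for one local layout directory."""
    if not os.path.isdir(path):
        return [f"{path}: directory does not exist"]
    problems = []
    # Creating files in a directory needs write and search permission on it.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path}: no permission to write files here")
    if shutil.disk_usage(path).free < bytes_needed:
        problems.append(f"{path}: less than {bytes_needed} bytes free")
    return problems


# An empty list means no problem was found for this directory.
print(check_layout_directory(os.getcwd(), bytes_needed=1))
```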

INTERMEDIATE FILE
Purpose

Intermediate File represents one or more serial files or a multifile of intermediate results
that a graph writes during execution, and saves for your review after execution.
Parameters
The Intermediate File Properties dialog does not have a Parameters tab.
However, you can specify values for parameters on the Description, Access, and Ports
tabs. This includes parameters such as intermediate file location, file handling behavior
and permissions, and the intermediate record format.
Runtime behavior
The upstream component writes to INTERMEDIATE FILE through INTERMEDIATE
FILE's write port. After the flow of data records into the write port is complete, the
downstream component reads from Intermediate File's read port. This guarantees that the
writing and reading processes are in two separate phases.
NOTE: You cannot add an Intermediate File component to a continuous flows graph.
When the target of an Intermediate File component is a special file (such as /dev/null,
NUL, a named pipe, or some other special file), the Co>Operating System treats the
target file in a special way. The Co>Operating System never deletes and recreates such a
file, nor does it ever truncate a special file.

OUTPUT FILE
Purpose
Output File represents data records written as output from a graph into one or more serial
files or a multifile.
When the target of an Output File component is a special file (such as /dev/null, NUL, a
named pipe, or some other special file), the Co>Operating System treats the target file in
a special way. The Co>Operating System never deletes and recreates such a file, nor does
it ever truncate a special file.
Parameters
The Output File Properties dialog does not have a Parameters tab. However, you can
specify values for parameters on the Description, Access, and Ports tabs of the Output
File Properties dialog. This includes parameters such as output file location, file handling
behavior and permissions, and the output record format.
Output File and Continuous Flow Applications
Output File components are not continuously enabled; they are not supported in
continuous flow applications. To write to a file from a continuous flow application use a
Publish component. See Guide to Continuous Flows.
Converting an Output File to a Lookup File
You can convert an OUTPUT FILE generated in one phase of a graph to a Lookup File
used in a later phase. To do this:

Create an Output File to contain the data records you want to use as a Lookup
File.
On the Description tab of the File Properties dialog box for that Output File,
check Add to Catalog.

LOOKUP FILE
Purpose
Lookup File represents one or more serial files or a multifile. Lookup File associates key
values with corresponding data values to index records and retrieve them. Lookup File is
not connected to other graph components; however, its associated data is accessible from
other components.
Using Lookup File can make graph processing faster and more efficient. The amount of
data you associate with a Lookup File should be small enough to be held in main
memory. Then you can define a transform function in another component to access
Lookup File and retrieve associated records from main memory much more quickly than
retrieving them from disk.
Parameters for Lookup File
Key (key specifier, required)
Name(s) of the key field(s) against which Lookup File matches its arguments.
RecordFormat (record format, required)
The record format you want Lookup File to use when returning data records.
How to Use Lookup File
Unlike other dataset components, Lookup File is not connected to other components in a
graph. In other words, it has no ports. However, its contents are accessible from other
components in the same or later phases of a graph.
You use the Lookup File in other components by calling one of the following DML
functions in any transform function or expression parameter: lookup, lookup_count, or
lookup_next.
The first argument to these lookup functions is the name of the Lookup File. The
remaining arguments are values to be matched against the fields named by the key
parameter. The lookup functions return a record that matches the key values and has the
format given by the RecordFormat parameter. For details, see the Data Manipulation
Language Reference.
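The idea behind these functions can be pictured with a small in-memory index. The Python sketch below is an analogy, not DML and not the Co>Operating System implementation: the class InMemoryLookup and the sample records are hypothetical, it assumes that lookup returns the first matching record, and it does not model lookup_next.

```python
from collections import defaultdict


class InMemoryLookup:
    """Index records by their key fields so matches are found without scanning."""

    def __init__(self, records, key_fields):
        self.key_fields = tuple(key_fields)
        self._index = defaultdict(list)
        for record in records:
            key = tuple(record[field] for field in self.key_fields)
            self._index[key].append(record)

    def lookup(self, *key_values):
        """Return the first record matching the key values, or None."""
        matches = self._index.get(tuple(key_values))
        return matches[0] if matches else None

    def lookup_count(self, *key_values):
        """Return how many records match the key values."""
        return len(self._index.get(tuple(key_values), ()))


customers = InMemoryLookup(
    [
        {"id": 1, "name": "Ada"},
        {"id": 2, "name": "Grace"},
        {"id": 2, "name": "Grace (duplicate)"},
    ],
    key_fields=["id"],
)
print(customers.lookup(1)["name"])
print(customers.lookup_count(2))
```

Because every record is held in the index, memory use grows with the size of the data, which is why the file must be small enough to fit in main memory.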
A file you want to use as a Lookup File must be small enough to fit into main memory. If
a file is too large to fit into memory, use INPUT FILE followed by JOIN.
Information about Lookup Files is stored in a catalog, which allows you to share them
with other graphs.
Restriction on Special Modifiers for Lookup File Key Fields
The key specifier for Lookup File can be one of the following:
There is one key field and it has the interval modifier.
There is one key field and it has the regex modifier.
There are two or more key fields. Two fields have the interval_top and
interval_bottom modifiers. Any other fields have the exact modifier.
There are any number of key fields that all have the exact modifier.
For example, suppose you want to determine if a particular code in a range of codes is
valid during a range of dates. You cannot do this with the interval modifier because you
would need to use two key fields with the interval modifier, which is not allowed. To
obtain the information you want, you need to make the codes exact matches. You can
then use interval_top and interval_bottom to modify the date fields.
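The four permitted combinations can be expressed as a short check. The Python sketch below takes the modifier of each key field, in order, and reports whether the combination is one of those listed above. The function name is_allowed_key_spec is hypothetical, and the sketch checks only the combinations described in this section, not DML syntax.

```python
from collections import Counter

KNOWN_MODIFIERS = {"exact", "interval", "regex", "interval_top", "interval_bottom"}


def is_allowed_key_spec(modifiers):
    """Return True if the per-field modifiers form one of the permitted combinations."""
    counts = Counter(modifiers)
    if not modifiers or set(counts) - KNOWN_MODIFIERS:
        return False
    # interval and regex are allowed only on a single-field key.
    if counts["interval"] or counts["regex"]:
        return len(modifiers) == 1
    # interval_top and interval_bottom must appear together, once each;
    # every remaining field is then exact.
    if counts["interval_top"] or counts["interval_bottom"]:
        return counts["interval_top"] == 1 and counts["interval_bottom"] == 1
    return True  # all fields are exact


# Code matched exactly, dates matched as a range: allowed.
print(is_allowed_key_spec(["exact", "interval_bottom", "interval_top"]))
# Two interval fields: not allowed.
print(is_allowed_key_spec(["interval", "interval"]))
```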
Converting an Output File to a Lookup File
You can convert an OUTPUT FILE generated in one phase of a graph to a Lookup File
used in a later phase. To do this:
Create an Output File to contain the data records you want to use as a Lookup
File.
On the Description tab of the File Properties dialog box for that Output File,
check Add to Catalog.
About Interval Lookup Files
If the key specifier for a Lookup File contains the word interval, interval_bottom, or
interval_top, then the Lookup File is an interval lookup. The following special rules
apply:
Each record in the Lookup File represents an interval, that is, a range of values.
The lower and upper bounds of the range are usually given in two separate fields
of the same type.
A key field marked interval_bottom holds the lower endpoint of the interval.
A key field marked interval_top holds the upper endpoint.
If a field in the Lookup File's key specifier is marked as interval, it must be the
only key field for that Lookup File. You cannot specify a multipart key as an
interval lookup.
For example, in a Lookup File that is an interval lookup, the following DML function
returns the record, if any, for which arg is between the lower and upper endpoints.
lookup("Lookup_File_name",arg)
By default, the interval endpoints are inclusive, but you can add the modifier exclusive to
specify otherwise. For example, suppose Lookup File insurance_coverage has the
following key specifier:
"{coverage_start interval_bottom exclusive; coverage_end interval_top}"
This identifies the fields coverage_start and coverage_end as the endpoints of the
interval.
The following DML function returns record R if arg is greater than R.coverage_start and
less than or equal to R.coverage_end.
lookup("insurance_coverage",arg)

Your Lookup File must contain well-formed intervals. This means that for each record,
the value of the interval_bottom field(s) must be less than or equal to the value of the
interval_top field(s). The intervals must not overlap and must be sorted into ascending
order.
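Because the intervals are sorted and do not overlap, the matching record can be found with a binary search. The Python sketch below illustrates this for the insurance_coverage key specifier shown earlier, with an exclusive lower endpoint and an inclusive upper endpoint. It is an analogy rather than the actual lookup implementation, and the function name interval_lookup, the plan field, and the sample records are hypothetical.

```python
from bisect import bisect_left, bisect_right


def interval_lookup(records, arg, bottom="coverage_start", top="coverage_end",
                    bottom_exclusive=True):
    """Find the record whose interval contains arg, or return None.

    records must be sorted in ascending order with non-overlapping intervals.
    """
    starts = [record[bottom] for record in records]
    # Position of the last record whose lower endpoint admits arg.
    if bottom_exclusive:
        index = bisect_left(starts, arg) - 1   # last start strictly below arg
    else:
        index = bisect_right(starts, arg) - 1  # last start at or below arg
    if index < 0:
        return None
    candidate = records[index]
    return candidate if arg <= candidate[top] else None


coverage = [
    {"coverage_start": 0, "coverage_end": 10, "plan": "A"},
    {"coverage_start": 10, "coverage_end": 20, "plan": "B"},
]
print(interval_lookup(coverage, 10)["plan"])  # A: 10 is excluded from B's start
print(interval_lookup(coverage, 0))           # None: 0 is excluded from A's start
```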
Sometimes it is not convenient to specify both endpoints of an interval in every record.
You can specify interval boundaries with only one key field f by setting the key specifier
to {f interval}. In this case, a given record's value for field f is interpreted as the bottom
(inclusive) of the interval for that record. The value in the subsequent record's field f is
interpreted as the top (exclusive) of the interval. The last record in the table is an orphan.
It specifies no interval because there is no subsequent record from which to determine the
upper endpoint.
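The single-field form can be sketched the same way. In the Python illustration below, each record's value for f is the inclusive bottom of its interval, the next record's value is the exclusive top, and the last record never matches. The function name single_field_interval_lookup, the label field, and the sample data are hypothetical.

```python
from bisect import bisect_right


def single_field_interval_lookup(records, arg, field="f"):
    """Find the record whose interval [f, next record's f) contains arg."""
    bounds = [record[field] for record in records]
    index = bisect_right(bounds, arg) - 1  # last record with f <= arg
    # Below the first bound, or at the orphaned last record: no interval.
    if index < 0 or index >= len(records) - 1:
        return None
    return records[index]


bands = [
    {"f": 0, "label": "low"},
    {"f": 10, "label": "medium"},
    {"f": 20, "label": "end marker"},  # orphan: supplies only the top of "medium"
]
print(single_field_interval_lookup(bands, 10)["label"])  # medium
print(single_field_interval_lookup(bands, 20))           # None
```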
