"A data warehouse is a repository of integrated information, available
for queries and analysis. Data and information are extracted from
heterogeneous sources as they are generated. This makes it much
easier and more efficient to run queries over data that originally came
from different sources." Another definition of a data warehouse is: "A
data warehouse is a logical collection of information gathered from
many different operational databases used to create business
intelligence that supports business analysis activities and
decision-making tasks; primarily, a record of an enterprise's past
transactional and operational information, stored in a database designed
to favor efficient data analysis and reporting (especially OLAP)."
Generally, data warehousing is not meant for current "live" data,
although 'virtual' or 'point-to-point' data warehouses can access
operational data. A 'real' data warehouse is generally preferred to a
virtual DW because stored data has been validated and is set up to
provide reliable results to common types of queries used in a business.

Ab-Initio Interview Questions

1. How to create a multifile system in Windows?
2. What does vector field mean? Explain with an example.
3. What is the difference between rollup and scan?
4. How to create a repository in Ab Initio for a stand-alone machine?
5. What does dependency analysis mean in Ab Initio?
Ans. Dependency Analysis
It analyses the Project for the dependencies within and between
the graphs. The EME examines the Project and develops a survey
tracing how data is transformed and transferred field by field
from component to component. Dependency analysis has two
basic steps:
• Translation
• Analysis

Analysis Level:

In the check in wizard’s advanced options, the analysis level can be specified as
one of the following:

• None:
No dependency analysis is performed during the check in.
• Translation only:
Graph being checked in is translated to data store format but no error
checking is done. This is the minimum requirement during check in.
• Translation with checking: (Default)
Along with the translation, errors, which will interfere with dependency
analysis, are checked for. These include:
• Absolute paths
• Undefined parameters
• dml syntax errors
• Parameter reference to objects that can’t be resolved
• Wrong substitution syntax in parameter definition
• Full Dependency Analysis:
Full dependency analysis is done during check in. It is not recommended
because it takes a long time and can in turn delay the check-in process.

What to analyse:

• All files:
Analyse all files in the Project
• All unanalysed files:
Analyse all files that have been changed or which are dependent on or
required by files that have changed since the last time they were analysed.
• Only my checked in files:
All files checked in by you would be analysed if they have not been
before.
• Only the file specified:
Apply analysis to the file specified only.

6. How to create a surrogate key using Ab Initio?

Ans. A key is a field or set of fields that uniquely identifies a record in a file or table.

A natural key is a key that is meaningful in some business or real-world sense. For
example, a social security number for a person, or a serial number for a piece of
equipment, is a natural key.

A surrogate key is a field that is added to a record, either to replace the natural key or in
addition to it, and has no business meaning. Surrogate keys are frequently added to
records when populating a data warehouse, to help isolate the records in the warehouse
from changes to the natural keys by outside processes.
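
As a rough illustration of the idea (plain Python, not Ab Initio DML), the sketch below assigns a sequential surrogate key to each distinct natural key. The names assign_surrogate_keys, ssn and customer_sk are hypothetical and only serve the example.

```python
from itertools import count

def assign_surrogate_keys(records, natural_key, start=1):
    """Add a 'customer_sk' surrogate key to each record.

    Records sharing a natural key get the same surrogate key;
    each new natural key receives the next integer in sequence.
    """
    next_key = count(start)
    key_map = {}  # natural key -> surrogate key
    out = []
    for rec in records:
        nk = rec[natural_key]
        if nk not in key_map:
            key_map[nk] = next(next_key)
        out.append({**rec, "customer_sk": key_map[nk]})
    return out

rows = [
    {"ssn": "111-22-3333", "name": "Ann"},
    {"ssn": "444-55-6666", "name": "Bob"},
    {"ssn": "111-22-3333", "name": "Ann B."},
]
keyed = assign_surrogate_keys(rows, "ssn")
```

Because downstream records refer to customer_sk rather than ssn, a later change to the natural key by an outside process would only require updating the key map.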

7. I have a control file which consists of 4 MFS files. I want to add
another 4 files to the same control file. How can I do that, and how
do I convert it into a directory?
8. What components are new in 2.14 compared to 1.8? State the usage
of the components.
9. For data parallelism, we can use partition components. For
component parallelism, we can use the replicate component. Likewise,
which component(s) can we use for pipeline parallelism?
10. What is .abinitiorc and what does it contain?
11. What do you mean by .profile in Ab Initio and what does it
contain?
12. What are data mapping and data modeling?

Data mapping deals with the transformation of the extracted data at
FIELD level, i.e. the transformation of the source field to the target
field is specified by the mapping defined on the target field. The data
mapping is specified during the cleansing of the data to be loaded.

For Example:

source;

string(35) name = "Siva Krishna ";

target;

string("01") nm=NULL("");/*(maximum length is string(35))*/

Then we can have a mapping like:

Straight move. Trim the leading or trailing spaces.

The above mapping specifies the transformation of the field nm.

13. How to execute the graph from start to end stages? And how to
run a graph on a non-Ab Initio system?
14. Can we load multiple files?
15. How can we test Ab Initio graphs manually and with automation?
16. What is the difference between sandbox and EME? Can we
perform checkin and checkout through the sandbox? Can anybody
explain checkin and checkout?
17. What does layout mean in terms of Ab Initio?
18. Can anyone please explain the environment variables with an
example?
19. How do we handle DML that changes dynamically?

There are many ways to handle DMLs which change dynamically within a
single file. Some suitable methods are to use a conditional DML or to
call the vector functionality while calling the DMLs.

I think we can use the MULTIREFORMAT component to handle dynamically
changing DMLs.
20. How will you use EME to view/publish metadata reports?

21. Explain the differences between api and utility mode?

API and UTILITY are the two possible interfaces to connect to the
databases to perform certain user specific tasks. These interfaces
allow the user to access or use certain functions (provided by the
database vendor) to perform operation on the databases. The
functionality of each of these interfaces depends on the databases.

API has more flexibility but is often considered a slower process
compared to UTILITY mode. The trade-off is between their performance
and usage.

22. Please let me know whether we have Ab-Initio GDE version 1.14,
and what are the latest GDE version and Co-op version?

23. What are graph parameters?

Graph parameters are parameters that are added to the respective
graph. You can add graph parameters by selecting Edit > Parameters
from the menu. Here is an example of graph parameters.

If you want to run the same graph for n files in a directory, you can
assign a graph parameter to the input file name and supply the
parameter value from the script before invoking the graph.

How to schedule graphs in Ab Initio, like workflow scheduling in
Informatica? And where must we use Unix shell scripting in Ab Initio?

24. How to improve the performance of graphs in Ab Initio? Give some
examples or tips.

There are many ways to improve the performance of graphs in Ab Initio.

I have a few points from my side.

1. Use an MFS with Partition by Round-robin.
2. If needed, use lookup local rather than lookup when there is a
large amount of data.
3. Take out unnecessary components such as Filter by Expression;
instead provide the filter in Reformat/Join/Rollup.
4. Use Gather instead of Concatenate.
5. Tune MAX-CORE for optimal performance.
6. Try to avoid too many phases.

To improve the performance of the graph:

1. Go parallel as soon as possible using Ab Initio partitioning
techniques.
2. Once data is partitioned, do not bring it to serial and then back
to parallel. Repartition instead.
3. For small processing jobs serial may be better than parallel.
4. Do not access large files across NFS; use the FTP component.
5. Use an ad hoc MFS to read many serial files in parallel and use
the Concatenate component.

1. Using phase breaks lets you allocate more memory to individual
components and makes your graph run faster.
2. Use a checkpoint after the sort rather than landing data on disk.
3. Use the in-memory feature of Join and Rollup.
4. Best performance is gained when components can work in memory
within MAX-CORE.
5. MAX-CORE for Sort is calculated from the size of the input data
file.
6. For an in-memory join, the memory needed is equal to the
non-driving data size plus overhead.
7. If an in-memory join cannot fit its non-driving inputs in the
provided MAX-CORE, it will drop all the inputs to disk and in-memory
no longer makes sense.
8. Use Rollup and Filter by Expression as soon as possible to reduce
the number of records.
9. When joining a very small dataset to a very large dataset, it is
more efficient to broadcast the small dataset to the MFS using the
Broadcast component, or to use the small file as a lookup.

1. Use MFS; use round-robin partitioning or load balancing if you are
not joining or rolling up.
2. Filter the data at the beginning of the graph.
3. Take out unnecessary components such as Filter by Expression;
instead use the select expression in Join, Rollup, Reformat, etc.
4. Use lookups instead of joins if you are joining a small table to a
large table.
5. Replace old components with new ones, for example Join instead of
math merge.
6. Use Gather instead of Concatenate.
7. Use phasing if you have too many components.
8. Tune MAX-CORE for optimal performance.
9. Avoid sorting data by using in-memory joins for smaller datasets.
10. Use an Ab Initio layout instead of the database default to achieve
parallel loads.
11. Change the AB_REPORT parameter to increase the monitoring duration ( )
12. Use catalogs for reusability.

The performance can be improved in several ways. Here are some that I
remember:
1. Use Sort after the partition component instead of before.
2. Partition the data as early as possible and departition the data as
late as possible.
3. Filter unwanted fields/records as early as possible.
4. Try to avoid using the Join with DB component.
25. What are the most commonly used components in an Ab Initio graph?

Can anybody give me a practical example of a transformation of data,
say customer data in a credit card company, into meaningful output
based on business rules?

The most commonly used components in any Ab Initio project are:

input file/output file

input table/output table

lookup file

reformat, gather, join, run sql, join with db, compress components,
sort, trash, partition by expression, partition by key, concatenate

26. How to work with parameterized graphs?

One of the main purposes of parameterized graphs is that if we need to
run the same graph n times for different files, we set up graph
parameters like $INPUT_FILE, $OUTPUT_FILE etc. and supply their values
in Edit > Parameters. These parameters are substituted at run time. We
can set different types of parameters, such as positional, keyword,
local etc.
The idea here is that, instead of maintaining different versions of the
same graph, we can maintain one version for different files.

27. Can anyone tell me what happens when a graph runs? That is, the
Co>Operating System is on the host and we are running the graph from
some other place. How does the Co>Operating System interact with the
native OS?

The Co>Operating System is layered on top of the native OS.

When a graph is executed it has to be deployed using the host settings
and a connection method such as rexec, telnet, rsh or rlogin. This is
how the graph interacts with the Co>Op.

Whenever you press the Run button in the GDE, the GDE generates a
script, and the generated script is transferred to the host specified
in your GDE run settings. The Co>Operating System then interprets this
script and executes it on different machines (if required) as
sub-processes (threads). After completion, each sub-process returns a
status code to the main process, and the main process in turn returns
the error or success code of the job to the GDE.

28. What is the difference between conventional loading and direct
loading? When is each used in real time?

Conventional load:
Before loading the data, all the table constraints are checked
against the data.

Direct load (faster loading):
All the constraints are disabled and the data is loaded directly.
Later the data is checked against the table constraints, and the bad
data won't be indexed.

API mode: conventional loading.

Utility mode: direct loading.

29. How to find the number of arguments defined in a graph?

$* is the list of shell arguments.

Then what are $# and $? ...

$# - number of positional parameters

$? - the exit status of the last executed command.

30. Soft links to MFS files on Unix for Ab Initio? What is this?
2) $pound: what is this?
3) $?: what is it used for?
4) Types of loading
5) Overwrite: when is it used?

A link is created in Unix with the ln command. With a hard link, as in
the example below, the data stays reachable through the second name
even if the original file name is deleted.

Example: ln file1 file2

$# Total number of positional parameters.

$? Exit status of the last executed command.

The types of loading are conventional loading and direct loading.

Override, not overwrite.

31. What is the difference between a .dbc and a .cfg file?

The .cfg file is for the remote connection and the .dbc file is for
connecting to the database.

.cfg contains :

1. The name of the remote machine

2. The username/pwd to be used while connecting to the db.

3. The location of the operating system on the remote machine.

4. The connection method.

and .dbc file contains the information:

1. The database name

2. Database version
3. Userid/pwd

4. Database character set and some more...

32. How do we run sequences of jobs, where the output of job A is the
input to job B? How do we co-ordinate the jobs?

By writing wrapper scripts we can control the sequence of execution of
more than one job.

33. How would you do performance tuning for already built graph ?
Can you let me know some examples?

Example: suppose Sort is used in front of a Merge component. There is
no use in using Sort, because sorting is built into the Merge
component.

2) We use a lookup instead of the Join or Merge component.

3) Suppose we want to join the data coming from 2 files and we don't
want duplicates. We will use the union function instead of adding an
additional component to remove duplicates.

34. What is a semi-join?

A left semi-join on two input files, connected to ports in0 and in1, is
an inner join in which the dedup0 parameter is set to "Do not dedup
this input" and dedup1 is set to "Dedup this input before joining".

Duplicates are removed only from the in1 port, that is, from Input
File 2.

Semi-joins can be achieved by using the Join component with the
parameter Join Type set to explicit join and the parameters
record_required0 and record_required1 set one to true and the other to
false, depending on whether you require a left outer or right outer
join.

In Ab Initio there are 3 types of join:

1. inner join, 2. outer join and 3. semi join.

For an inner join the 'record_requiredn' parameter is true for all in
ports.

For an outer join it is false for all the in ports.

For a semi join, set 'record_requiredn' to true for the required input
and false for the other inputs.

35. How to get DML using Utilities in UNIX?

By using the command

m_db gendml <dbc file> -table <tablename>

36. What is the driving port? When do you use it?

When you set the sorted-input parameter of the JOIN component to "In
memory: Input need not be sorted", you can specify the driving port.

Generally the driving port is used to improve performance in a graph.

The driving input is the largest input. All other inputs are read into
memory.

For example, suppose the largest input to be joined is on the in1 port.
Specify a port number of 1 as the value of the driving parameter. The
component reads all other inputs to the join — for example, in0, and
in2 — into memory.

Default is 0, which specifies that the driving input is on port in0.

Join also improves performance by loading all records from all inputs
except the driving input into main memory.

The driving port in a join supplies the data that drives the join. That
means every record from the driving port is compared against the data
from the non-driving ports.

We have to set the driving port to the larger dataset so that the
non-driving data, which is smaller, can be kept in main memory to speed
up the operation.
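
The behaviour described above can be sketched in plain Python (this is an illustration of the general technique, not the JOIN component itself): the smaller, non-driving input is loaded into an in-memory table, and the larger, driving input is streamed past it one record at a time. The function and field names are hypothetical.

```python
from collections import defaultdict

def in_memory_join(driving, non_driving, key):
    """Inner join that holds only the non-driving input in memory."""
    lookup = defaultdict(list)
    for rec in non_driving:            # small input: fully loaded
        lookup[rec[key]].append(rec)
    for rec in driving:                # large input: streamed
        for match in lookup.get(rec[key], []):
            yield {**match, **rec}

transactions = [                       # driving (larger) input
    {"cust_id": 1, "amount": 50},
    {"cust_id": 2, "amount": 75},
    {"cust_id": 1, "amount": 20},
    {"cust_id": 3, "amount": 5},
]
customers = [                          # non-driving (smaller) input
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Bob"},
]
joined = list(in_memory_join(transactions, customers, "cust_id"))
```

Memory use grows with the size of the non-driving input only, which is why the larger dataset is the better choice for the driving port.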

37. What is skew and skew measurement?

Skew is the measure of data flow to each partition.

Suppose the input comes from 4 files with a total size of about 1 GB:

100 MB + 200 MB + 300 MB + 500 MB = 1100 MB

1100 MB / 4 = 275 MB (average partition size)

(100 - 275) / 500 = -175 / 500 = -0.35, a negative value.

The same calculation for 200, 300 and 500 gives -0.15, 0.05 and 0.45.

A positive value of skew is always desirable.

Skew is an indirect measure of the graph.
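
The formula implied by the worked example is (partition size - average partition size) / largest partition size. A minimal Python sketch of that calculation follows; the average is computed from the four listed sizes rather than hard-coded, and the function name partition_skew is hypothetical.

```python
def partition_skew(sizes):
    """Skew per partition: (size - average size) / largest size."""
    average = sum(sizes) / len(sizes)
    largest = max(sizes)
    return [(size - average) / largest for size in sizes]

sizes_mb = [100, 200, 300, 500]
skews = partition_skew(sizes_mb)
for size, skew in zip(sizes_mb, skews):
    print(f"{size} MB partition: skew {skew:+.2f}")
```

Partitions smaller than the average get a negative value and partitions larger than the average get a positive one.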

38. What is the difference between a Scan component and a Rollup
component?

Rollup is for group by and Scan is for successive totals. Basically,
when we need to produce a running summary we use Scan. Rollup is used
to aggregate data.

Rollup:

            value   total
group 1     10
            20
            30      60
group 2     40
            30      70
all                 130

Rollup generates data records that summarize groups of data records.
Rollup can be used to perform data aggregation such as sum, avg, max,
min etc.

The Scan component generates a series of cumulative summary records,
such as successive year-to-date totals, for groups of data records.
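
The difference can be sketched in plain Python with the same numbers as the illustration above. This is an analogy, not Ab Initio code: rollup and scan here are hypothetical helper functions, and the input is assumed to be already sorted by group.

```python
from itertools import accumulate, groupby
from operator import itemgetter

records = [("group1", 10), ("group1", 20), ("group1", 30),
           ("group2", 40), ("group2", 30)]

def rollup(recs):
    """One summary record per group (like GROUP BY with SUM)."""
    return [(key, sum(value for _, value in grp))
            for key, grp in groupby(recs, key=itemgetter(0))]

def scan(recs):
    """One output record per input record, with a running total
    that restarts at each new group."""
    out = []
    for key, grp in groupby(recs, key=itemgetter(0)):
        values = [value for _, value in grp]
        out.extend((key, value, total)
                   for value, total in zip(values, accumulate(values)))
    return out

summary = rollup(records)
running = scan(records)
```

Rollup shrinks five input records to two summary records, while scan keeps all five and attaches the cumulative total to each.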

39. What are local and formal parameters?

Both are graph-level parameters, but a local parameter must be given a
value at the time of declaration, whereas a formal parameter need not
be initialised; the graph will prompt for its value at run time.

A local parameter is like a local variable in the C language, whereas
a formal parameter is like a command line argument that we need to
pass at run time.

40. What are BROADCAST and REPLICATE?

Broadcast can do everything that Replicate does. Broadcast can also
send a single file to an MFS without splitting it, making multiple
copies of the single file across the MFS. Replicate combines the data
records it receives into a single flow, in arbitrary order, and writes
a copy of that flow to each of its output flows.

Replicate generates multiple straight flows as its output, whereas
Broadcast results in a single fan-out flow.

Replicate improves component parallelism, whereas Broadcast improves
data parallelism.

Broadcast - Takes data from multiple inputs, combines it and sends it
to all the output ports.

E.g. you have 2 incoming flows (this can be data parallelism or
component parallelism) on the Broadcast component, one with 10 records
and the other with 20 records. Then every outgoing flow (there can be
any number of flows) will have 10 + 20 = 30 records.

Replicate - Replicates the data for a particular partition and sends
it out to multiple out ports of the component, but maintains the
partition integrity.

E.g. your incoming flow to Replicate has a data parallelism level of 2,
with one partition having 10 records and the other having 20 records.
Now suppose you have 3 output flows from Replicate. Then each flow
will have 2 data partitions with 10 and 20 records respectively.
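
The two examples above can be checked with a small Python sketch that models partitions as lists of records. The functions broadcast and replicate are hypothetical stand-ins written for this illustration, not the components themselves.

```python
def broadcast(in_partitions, n_out):
    """Combine all input partitions; every output gets the full set."""
    combined = [rec for part in in_partitions for rec in part]
    return [list(combined) for _ in range(n_out)]

def replicate(in_partitions, n_flows):
    """Copy the flow to each output flow, keeping partitions separate."""
    return [[list(part) for part in in_partitions]
            for _ in range(n_flows)]

# Two incoming partitions with 10 and 20 records.
parts = [list(range(10)), list(range(10, 30))]

broadcast_out = broadcast(parts, 3)
replicate_out = replicate(parts, 3)

print([len(p) for p in broadcast_out])
print([[len(p) for p in flow] for flow in replicate_out])
```

The first line printed shows the record count on each broadcast output; the second shows, for each replicated flow, the record count in each of its partitions.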

41. What is the importance of EME in Ab Initio?

EME is a repository in Ab Initio. It is used for checkin and checkout
of graphs and also maintains graph versions.

42. What is m_dump?

It is a Co>Operating System command that we use to view data from the
command prompt.

m_dump command prints the data in a formatted way.

m_dump <dml> <file.dat>

43. what is the syntax of m_dump command?

m_dump <dmlfile> <datafile>

44. Can anyone give me an example of a real-time start script in the
graph?
45. What are the differences between the different GDE versions (1.10,
1.11, 1.12, 1.13 and 1.15)? What are the differences between the
different versions of the Co-op?

1.10 is a non-key version and the rest are key versions.

Many components were added and revised in the following versions.

46. How to run the graph without GDE?

In the run directory a graph can be deployed as a .ksh file. Now, this
.ksh file can be run at the command prompt as:

ksh <script_name> <parameters if any>

47. What are the different versions and releases of Ab Initio (GDE and
Co-op version)?

49. What is the difference between a DML expression and an XFR
expression?

A DML expression refers to Ab Initio DML that is stored in a file. DML
describes data in terms of expressions that perform simple
computations, transform functions that control data transforms, and
keys that specify grouping or non-grouping. In that sense, DML files
are non-embedded record format files.

An .xfr file is simply a non-embedded transform file. A transform
function expresses business rules, local variables and statements, as
well as the connections between these elements and the input and
output fields.

50. How does MAXCORE work?

Maxcore is temporary memory used to sort the records.

Maxcore is a value (it will be in KB). Whenever a component is
executed, it uses the amount of memory we specified for its execution.

Maxcore is the maximum memory that can be used by a component in its
execution.

51. What is $mpjret? Where it is used in ab-initio?


$mpjret is the return value of the shell command "mp run", which
executes an Ab-Initio graph.

It is generally treated as the return value for the graph's execution
status.

53. How do you convert 4-way MFS to 8-way mfs?

54. What is the latest version that is available in Ab-initio?

The latest version of GDE is 1.15 and of the Co>Operating System is 2.14.

55. What is meant by Co>Operating System and why is it special for
Ab Initio?

The name Co>Operating System itself says a lot: it is not merely an
engine or interpreter. As the name says, it is an operating system
which co-exists with another operating system. What does that mean? In
layman's terms, Ab Initio, unlike other applications, does not sit as a
layer on top of an OS. It has quite a lot of operating-system-level
capabilities of its own, such as multifiles, memory management and so
on, and in this way it integrates completely with the other OS and
works jointly with it on the available hardware resources. This synergy
with the OS optimizes the utilization of available hardware resources.
Unlike other applications (including most other ETL tools), it does not
work as a layer that interprets commands. That is the major difference
from other ETL tools, and it is the reason why Ab Initio is much faster
than any other ETL tool and obviously much costlier as well.

56. How to take the input data from an Excel sheet?

There is a Read Excel component that reads the Excel file either from
the host or from a local drive. The DML will be a default one.

Through the Read Excel component in $AB_HOME we can read Excel files
directly.

57. How will you test a dbc file from the command prompt?

You can test a dbc file from the command prompt (Unix) using the
m_db test <name-of-dbc file> command, which checks the database
connection, version of the database, user

58. Which one is faster for processing, fixed-length DMLs or delimited
DMLs, and why?

Fixed-length DMLs are faster because the data of that length is read
directly without any comparisons, whereas with delimited DMLs every
character has to be compared, which causes delays.

59. What are the continuous components in Ab Initio?

Continuous components are used to create graphs that produce useful
output files while running continuously.

Ex: Continuous Rollup, Continuous Update, Batch Subscribe

60. What is meant by fancing in Ab Initio?

The word Ab Initio means "from the beginning".

Did you mean "fanning"? "Fan-in"? "Fan-out"?

61. How to retrieve data from a database to the source? Which
component is used for this?

To unload (retrieve) data from a database such as DB2, Informix or
Oracle, we have components like Input Table and Unload DB Table. Using
these two components we can unload data from the database.

62. What is the relation between EME, GDE and Co-operating system?

EME stands for enterprise metadata environment, GDE for graphical
development environment, and the Co-operating system can be described
as the Ab Initio server.

The relation between the Co-op, EME and GDE is as follows.

The Co-operating system is the Ab Initio server. It is installed on a
particular OS platform, which is called the native OS. The EME is just
like the repository in Informatica: it holds the metadata,
transformations, DB config files, and source and target information.
The GDE is the end-user environment where we can develop graphs (like
mappings in Informatica).

The designer uses the GDE to design graphs and saves them to the EME or
a sandbox. The sandbox is on the user side, whereas the EME is on the
server side.

63. What is the use of Aggregate when we have Rollup? As we know, the
Rollup component in Ab Initio is used to summarise groups of data
records. Then where would we use Aggregate?

Aggregate and Rollup can both summarise data, but Rollup is much more
convenient to use. Rollup is also much more self-explanatory than
Aggregate when you need to understand how a particular summarisation is
done. Rollup can also do other things, such as input and output
filtering of records.

64. What kinds of layouts does Ab Initio support?

Basically there are serial and parallel layouts supported by Ab Initio.
A graph can have both at the same time. The parallel one depends on the
degree of data parallelism. If the multifile system is 4-way parallel,
then a component in a graph can run 4-way parallel if its layout is
defined to match that degree of parallelism.

65. How can you run a graph infinitely?

To run a graph infinitely, the end script in the graph should call the
.ksh file of the graph. Thus if the name of the graph is abc.mp then in
the end script of the graph there should be a call to abc.ksh.
Like this the graph will run infinitely.

66. How do you add default rules in transformer?

Double-click the transform parameter on the Parameters tab of the
component properties; it will open the Transform Editor. In the
Transform Editor click the Edit menu and then select Add Default Rules
from the dropdown. It will show two options: 1) Match Names
2) Wildcard.

67. Do you know what a local lookup is?

If your lookup file is a multifile and is partitioned/sorted on a
particular key, then the local lookup function can be used in
preference to the lookup function call. The lookup is then local to a
particular partition, depending on the key.

A lookup file consists of data records which can be held in main
memory. This lets the transform function retrieve the records much
faster than retrieving them from disk. It allows the transform
component to process the data records of multiple files quickly.

68. What is the difference between look-up file and look-up, with a
relevant example?

Generally a lookup file represents one or more serial files (flat
files). The amount of data is small enough to be held in memory. This
allows transform functions to retrieve records much more quickly than
they could from disk.

I have no general idea about lookup.

A lookup is a component of an Ab Initio graph where we can store data
and retrieve it by using a key parameter.

A lookup file is the physical file where the data for the lookup is
stored.

69. how to handle if DML changes dynamically in abinitio

If the DML changes dynamically then both dml and xfr has to be
passed as graph level parameter during the runtime.

By parametrization or by conditional record format or by metadata

70. Explain what is lookup?

A lookup is basically a specific dataset which is keyed. It can be used
to map values according to the data present in a particular file
(serial or multifile). The dataset can be static or dynamic (for
example, when the lookup file is generated in a previous phase and used
as a lookup file in the current phase). Sometimes hash joins can be
replaced by Reformat with a lookup, if one of the inputs to the join
contains a small number of records with a slim record length.

Ab Initio has built-in functions to retrieve values using the key for
the lookup.

71. What is a ramp limit?

The limit parameter contains an integer that represents a number of
reject events.

The ramp parameter contains a real number that represents a rate of
reject events in the number of records processed.

Number of bad records allowed = limit + number of records * ramp.

Ramp is basically a rate expressed as a value from 0 to 1.
The two together provide the threshold value of bad records.
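
A small Python sketch of the threshold formula stated above. The function names are hypothetical, and treating "more rejects than the threshold" as the point at which a component stops is an assumption made for illustration.

```python
def reject_threshold(limit, ramp, records_processed):
    """Bad records allowed = limit + records processed * ramp."""
    return limit + records_processed * ramp

def exceeds_threshold(rejects, limit, ramp, records_processed):
    # Assumption: the threshold is exceeded only when rejects are
    # strictly greater than the allowed number.
    return rejects > reject_threshold(limit, ramp, records_processed)

# Example: limit of 50 rejects plus a ramp of 0.25 over 400 records.
allowed = reject_threshold(limit=50, ramp=0.25, records_processed=400)
```

With ramp set to 0 the threshold is the fixed limit; with limit set to 0 it scales purely with the number of records processed.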

72. Have you worked with packages?

Multistage transform components use packages by default. However, a
user can create their own set of functions in a transform function and
include it in other transform functions.

73. Have you used rollup component? Describe how.

If the user wants to group the records on particular field values then
rollup is best way to do that. Rollup is a multi-stage transform function
and it contains the following mandatory functions.
1. initialise
2. rollup
3. finalise
Also need to declare one temporary variable if you want to get counts
of a particular group.

For each of the group, first it does call the initialise function once,
followed by rollup function calls for each of the records in the group
and finally calls the finalise function once at the end of last rollup call.
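
The calling sequence described above can be mimicked in plain Python. This is only an analogy for the three-stage pattern, not Ab Initio transform code: the temporary record is modelled as a dict, the input is assumed to be sorted by cust_id, and all names are hypothetical.

```python
from itertools import groupby

def initialise(record):
    """Called once per group: set up the temporary variable."""
    return {"count": 0, "total": 0}

def rollup(temp, record):
    """Called for every record in the group: update the temporary."""
    temp["count"] += 1
    temp["total"] += record["amount"]
    return temp

def finalise(temp, record):
    """Called once after the last record: build the output record."""
    return {"cust_id": record["cust_id"],
            "count": temp["count"], "total": temp["total"]}

def run_rollup(records):
    out = []
    for _, group in groupby(records, key=lambda r: r["cust_id"]):
        temp, last = None, None
        for rec in group:
            if temp is None:
                temp = initialise(rec)
            temp = rollup(temp, rec)
            last = rec
        out.append(finalise(temp, last))
    return out

rows = [{"cust_id": 1, "amount": 10}, {"cust_id": 1, "amount": 20},
        {"cust_id": 2, "amount": 5}]
result = run_rollup(rows)
```

The count field plays the role of the temporary variable mentioned above for counting the records in a group.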

74. How do you add default rules in transformer?

In the case of Reformat, if the destination field names are the same
as, or a subset of, the source fields, then there is no need to write
anything in the reformat xfr, unless you want a real transform beyond
reducing the set of fields or splitting the flow into a number of
flows.

1) If it is not already displayed, display the Transform Editor Grid.
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.

Add Default Rules: opens the Add Default Rules dialog. Select one of
the following:
Match Names: generates a set of rules that copies input fields to
output fields with the same name.
Use Wildcard (.*) Rule: generates one rule that copies input fields to
output fields with the same name.

75. What is the difference between partitioning with key and round
robin?

Partition by key (hash partition): this is a partitioning technique
used to partition data when the keys are diverse. If one key value is
present in large volume, there can be large data skew. But this method
is used more often for parallel data processing.
Round-robin partitioning is another technique, which distributes the
data uniformly across the destination data partitions. The skew is zero
in this case when the number of records is divisible by the number of
partitions. A real-life example is how a pack of 52 cards is
distributed among 4 players in a round-robin manner.

Suppose you have some 30 cards taken at random from a 52-card pack. If
you take the card colour as the key (red or black) and distribute on
it, the number of cards in each partition may vary a lot. But in round
robin we distribute with a block size, so the variation is limited to
the block size.

Partition by Key: distributes according to the key value.

Partition by Round Robin: distributes a predefined number of records to
one flow, then the same number of records to the next flow, and so on.
After the last flow it resumes the pattern, distributing the records
almost evenly. This pattern is called round-robin fashion.
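
The card example can be sketched in Python as follows. Both functions are hypothetical illustrations of the two techniques; zlib.crc32 is used only as a stand-in hash that is stable across runs, not as the hash any particular tool uses.

```python
import zlib

def partition_by_key(records, key, n):
    """Send each record to the partition chosen by hashing its key."""
    parts = [[] for _ in range(n)]
    for rec in records:
        idx = zlib.crc32(str(rec[key]).encode()) % n
        parts[idx].append(rec)
    return parts

def partition_round_robin(records, n, block_size=1):
    """Deal records out in blocks, one partition after another."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[(i // block_size) % n].append(rec)
    return parts

# A 52-card pack: 26 red and 26 black cards.
deck = [{"rank": r, "colour": c}
        for c in ("red", "black") for r in range(26)]

by_colour = partition_by_key(deck, "colour", 4)
dealt = partition_round_robin(deck, 4)

print([len(p) for p in by_colour])
print([len(p) for p in dealt])
```

With only two distinct key values, partitioning by key can fill at most two of the four partitions, while round robin deals the same number of cards to every partition. The trade-off is that only partitioning by key guarantees that records with the same key land in the same partition.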

76. How do you truncate a table? (Each candidate would say only 1 of
the several ways to do this.)

From Ab Initio, use the Run SQL component with the DDL "truncate table".

By using the Truncate Table component in Ab Initio.

There are many ways to do it.

1. Probably the easiest way is to use Truncate Table.

2. Run SQL or Update Table can be used to do the same thing.

3. Run Program.

77. Have you ever encountered an error called "depth not equal"? (This
occurs when you extensively create graphs; it is a trick question.)

When two components are linked together and their layouts do not match,
this problem can occur during the compilation of the graph. A solution
is to use a partitioning component in between if there is a change in
layout.

You have talked about a situation where you have linked 2 components,
each of them having a different layout.

Think about a situation where the component on the left-hand side is
linked to a serial dataset and the downstream component on the
right-hand side is linked to a multifile. Layout is going to be
propagated from neighbours.

So without a partitioning component the jump in depth cannot be
achieved, and I suppose you need one partitioning component to resolve
this depth discrepancy.

78. What is the function you would use to transfer a string into
a decimal?

In this case no specific function is required if the size of the string
and the decimal is the same. A decimal cast with the size in the
transform function will suffice. For example, say the field name is
field, the source field is defined as string(8) and the destination as
decimal(8):

out.field :: (decimal(8)) in.field;

If the destination field size is smaller than the input, the
string_substring function can be used as follows. Say the destination
field is decimal(5):

out.field :: (decimal(5)) string_lrtrim(string_substring(in.field, 1, 5)); /*
string_lrtrim is used to trim leading and trailing spaces */

Hope this solution works.
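For readers who do not know DML, the same "cut to size, trim, then cast" steps can be sketched in Python. This is an analogy only; the function name string_to_decimal and the sample values are assumptions, and Python's Decimal is merely a stand-in for the DML decimal type.

```python
from decimal import Decimal

def string_to_decimal(text, width):
    """Keep the first `width` characters, trim surrounding spaces, then convert."""
    return Decimal(text[:width].strip())

padded = string_to_decimal("  123   ", 5)   # the substring "  123" trims to "123"
cut = string_to_decimal("00012345", 5)      # only "00012" is kept
```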

79. How many parallelisms are in Ab Initio? Please give a definition of
each.

There are 3 kinds of parallelism:

1) Data parallelism
2) Component parallelism
3) Pipeline parallelism

When the data is divided into small chunks (partitions) and copies of
the same component process those chunks simultaneously, we call it
data parallelism.

When different components work on different data sets at the same
time, it is called component parallelism.

When multiple connected components of a graph work on the same data
flow simultaneously, each on a different record, we call it pipeline
parallelism.
80. What are primary keys and foreign keys?

In an RDBMS the relationship between two tables is represented as a
primary key and foreign key relationship. The table holding the
primary key is the parent table and the table holding the foreign key
is the child table. The requirement for both tables is that there
should be a matching column.

81. What is the difference between clustered and non-clustered
indices? ...and why do you use a clustered index?

82. What is an outer join?

An outer join is used when one wants to select all the records from a
port - whether it has satisfied the join criteria or not.

If you want to see all the records of one input file regardless of
whether there is a matching record in the other file, then it is an
outer join.
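A minimal Python sketch of that behaviour, keeping every record from one input whether or not the other input has a match. The function name and the customer/order records are made up for the illustration; this is not how the Join component is configured.

```python
def left_outer_join(left, right, key):
    """Keep every left record; pair it with each matching right record, or None."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    joined = []
    for l in left:
        for r in index.get(l[key], [None]):   # [None] marks "no match"
            joined.append((l, r))
    return joined

customers = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
orders = [{"id": 1, "amount": 10}]
result = left_outer_join(customers, orders, "id")
```

Customer 2 has no order but still appears in the result, paired with None.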

83. What are Cartesian joins?

A Cartesian join joins two tables without a join key. The key should
be {}.

A Cartesian join will get you a Cartesian product. A Cartesian join is
when you join every row of one table to every row of another table.
You can also get one by joining every row of a table to every row of
itself.
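As a small illustration (Python, with made-up rows), a Cartesian product pairs every row of one input with every row of the other, so the output size is the product of the input sizes:

```python
from itertools import product

sizes = ["S", "M", "L"]
colors = ["red", "blue"]

# Every size is paired with every color: 3 x 2 = 6 rows.
pairs = list(product(sizes, colors))

# Joining a table to itself the same way gives 3 x 3 = 9 rows.
self_pairs = list(product(sizes, sizes))
```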

84. How is referential integrity enforced in a SQL Server database?

85. What is the difference between a DB config and a CFG file?

A .dbc file has the information required for Ab Initio to connect to the
database to extract or load tables or views, while a .cfg file is the
table configuration file created by db_config when using components
like Load DB Table.

Both DBC and CFG files are used for database connectivity; basically
both are of similar use. The only difference is that the cfg file is
used for the Informix database, whereas dbc files are used for other
databases such as Oracle or SQL Server.

http://www.coolinterview.com/
86. What is the difference between a Scan component and a RollUp component?

Ans. Rollup is for group by and Scan is for successive (running)
totals. Basically, when we need to produce a cumulative summary we use
Scan. Rollup is used to aggregate data into one record per group.
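The difference can be sketched in Python as follows. The function names mirror the component names only for readability, and the sales records are invented; this is an analogy, not component code.

```python
from itertools import groupby
from operator import itemgetter

def rollup(records, key, value):
    """One output per key group: the total for the whole group."""
    ordered = sorted(records, key=itemgetter(key))
    return [(k, sum(r[value] for r in grp))
            for k, grp in groupby(ordered, key=itemgetter(key))]

def scan(records, key, value):
    """One output per input record: the running total within its key group."""
    ordered = sorted(records, key=itemgetter(key))
    out = []
    for k, grp in groupby(ordered, key=itemgetter(key)):
        running = 0
        for r in grp:
            running += r[value]
            out.append((k, running))
    return out

sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 20},
]
totals = rollup(sales, "region", "amount")
running_totals = scan(sales, "region", "amount")
```

Rollup emits one record per region, while Scan emits as many records as it reads, each carrying the total so far.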

87. What are local and formal parameters?

Ans. Both are graph-level parameters, but for a local parameter you
need to initialize the value at the time of declaration, whereas for a
formal parameter there is no need to initialize it; the graph will
prompt for that parameter's value at run time.

88. How will you test a dbc file from the command prompt?
Ans. try "m_db test myfile.dbc"
89. Explain the difference between the "truncate" and "delete" commands?
Ans. Truncate: It is a DDL command, used to remove all the rows from a
table or cluster. Since it is a DDL command it commits automatically
and a rollback can't be performed. It is faster than delete.

Delete: It is a DML command, generally used to delete records from a
table or cluster. A rollback can be performed to retrieve the deleted
rows. To make the deletion permanent, the "commit" command should be
used.
90. How do you retrieve data from a database to the source, and which
component is used for this?
Ans. To unload (retrieve) data from a DB2, Informix, or Oracle
database we have components like Input Table and Unload DB Table. By
using these two components we can unload data from the database.
91. How many components are there in your most complicated graph?
Ans. This is a tricky question; the number of components in a graph has
nothing to do with the level of knowledge a person has. On the
contrary, a properly standardized, modular and parametric approach
will reduce the number of components to very few. In a well thought
out modular and parametric design, most graphs will have 3 or 4
components, which will do a particular task and then call another set
of graphs to do the next, and so on. This way the total number of
distinct graphs comes down drastically, and support and maintenance
are much simpler. The bottom line is that there are a lot of other
things to plan rather than adding components.
92. Do you know what a local lookup is?
Ans. This function is similar to a mlookup; the difference is that this
function returns NULL when there is no record having the value that
was passed in the arguments of the function.
If it finds a matching record it returns the complete record, that is,
all the fields along with their values corresponding to the expression
mentioned in the lookup_local function.

e.g. lookup_local("LOOKUP_FILE", 81) -> NULL

if the key on which the lookup file is partitioned does not hold the
value mentioned.
_______________

Local Lookup files are small files that can be accommodated into
physical memory for use in transforms. Details like country
code/country, Currency code/currency, forexrate/value can be used in
a lookup file and mapped during transformations. Lookup files are not
connected to any component of the graph but available to reformat for
mapping.
93. How to Create Surrogate Key using Ab Initio?

Ans. A key is a field or set of fields that uniquely identifies a record
in a file or table.

A natural key is a key that is meaningful in some business or
real-world sense. For example, a social security number for a person,
or a serial number for a piece of equipment, is a natural key.

A surrogate key is a field that is added to a record, either to replace
the natural key or in addition to it, and has no business meaning.
Surrogate keys are frequently added to records when populating a
data warehouse, to help isolate the records in the warehouse from
changes to the natural keys by outside processes.
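The answer above defines a surrogate key but does not show how one is generated. One common idea, sketched here in Python as an assumption and not as the Ab Initio method, is to number the records sequentially and, when the data is partitioned, to interleave the numbers so that no two partitions ever produce the same key. The function name, the "sk" field and the sample records are hypothetical.

```python
def add_surrogate_keys(records, partition, n_partitions, offset=0):
    """Assign keys partition+1, partition+1+n, partition+1+2n, ... so that
    keys generated independently in different partitions never collide."""
    return [
        {"sk": offset + seq * n_partitions + partition + 1, **record}
        for seq, record in enumerate(records)
    ]

part0 = add_surrogate_keys([{"ssn": "A"}, {"ssn": "B"}], partition=0, n_partitions=2)
part1 = add_surrogate_keys([{"ssn": "C"}, {"ssn": "D"}], partition=1, n_partitions=2)
```

Here partition 0 receives keys 1 and 3 and partition 1 receives keys 2 and 4; the offset parameter lets a later run continue after the highest key already loaded.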

94. How to improve performance of graphs in Ab Initio? Give some
examples or tips.
Ans. There are many ways to improve the performance of graphs in
Ab Initio.

I have a few points from my side.

1. Use an MFS (multifile system) with Partition by Round-robin.

2. If needed, use lookup_local rather than lookup when the data is
large.

3. Take out unnecessary components such as Filter by Expression;
instead put the filter conditions in Reformat/Join/Rollup.

4. Use Gather instead of Concatenate.

5. Tune max-core for optimal performance.

6. Try to avoid extra phases.

There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum max-core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join components and if possible replace them with
in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash
join with the proper driving port
8) For large datasets don't use broadcast as partitioner
9) Minimise the use of regular expression functions like re_index in the
transform functions
10) Avoid repartitioning of data unnecessarily

95. Describe the process steps you would perform when defragmenting a data
table. This table contains mission critical data ?
Ans. There are several ways to do this:

1) We can move the table within the same tablespace or to another
tablespace and rebuild all the indexes on the table.

alter table <table_name> move tablespace <tablespace_name>

This activity reclaims the fragmented space in the table.

analyze table <table_name> compute statistics

This captures the updated statistics.

2) A reorg can be done by taking a dump of the table, truncating the
table and importing the dump back into the table.

96. How do we handle a DML that changes dynamically?

Ans. There are many ways to handle DMLs that change dynamically within
a single file. Some suitable methods are to use a conditional DML or
to use the vector functionality when calling the DMLs.
97. What are the graph parameters?
Ans. There are 2 types of graph parameters in Ab Initio:

1. Local parameters
2. Formal parameters (those parameters working at run time)

98. What is meant by fancing in Ab Initio?

Ans. The word Ab Initio means from the beginning.
99. What is a ramp limit?
Ans. Limit and Ramp.

For most graph components we can manually set the error threshold
(reject threshold) after which the graph exits. There are three
threshold settings. "Never Exit" and "Exit on First Occurrence" are
self-explanatory and represent the two extremes. The third one is
Limit along with Ramp. Limit is an absolute maximum number of rejected
records, whereas Ramp is expressed as a percentage of the processed
records. For example, a ramp value of 5 means that if fewer than 5% of
the total records are rejected, the graph continues running; if the
rejects cross the ramp, the graph exits. Typically development starts
with Never Exit, followed by Ramp, and finally "Exit on First
Occurrence" in production. On a case-by-case basis Ramp can be used in
production, but it is definitely not a desired approach.
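The threshold check can be sketched in Python. The way Limit and Ramp are combined here (tolerated rejects = limit + ramp percentage of the records processed so far) is an assumption made for the illustration, as are the function and parameter names; check the component documentation for the exact rule.

```python
def should_abort(rejects, processed, limit=0, ramp_percent=0.0):
    """Abort once rejects exceed limit + ramp_percent% of processed records.
    (Assumed combination rule, for illustration only.)"""
    tolerated = limit + (ramp_percent / 100.0) * processed
    return rejects > tolerated

# Ramp of 5: with 100 records processed, up to 5 rejects are tolerated.
ok = should_abort(rejects=4, processed=100, ramp_percent=5)
too_many = should_abort(rejects=6, processed=100, ramp_percent=5)

# Limit 0 and ramp 0 behaves like "exit on first occurrence".
first_reject = should_abort(rejects=1, processed=100)
```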

100. What is the difference between sandbox and EME? Can we perform checkin and
checkout through a sandbox? Can anybody explain checkin and checkout?
Ans. Sandboxes are work areas used to develop, test or run code
associated with a given project. Only one version of the code can be
held within the sandbox at any time.
The EME Datastore contains all versions of the code that have been
checked into it. A particular sandbox is associated with only one
project, whereas a project can be checked out to a number of
sandboxes.

101. What is skew and skew measurement?


Ans. Skew is the measure of how unevenly data flows to each partition.

Suppose the input comes from 4 files with a total size of about 1 GB:

100 MB + 200 MB + 300 MB + 500 MB = 1100 MB

Average partition size: 1100 MB / 4 = 275 MB

Skew of the first partition: (100 - 275) / 500 = -175/500 = -0.35,
a negative value.

Calculate the skew for 200, 300 and 500 in the same way.

A skew value close to zero is always desirable.

Skew is an indirect measure of graph performance.
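The calculation above follows the pattern (partition size - average size) / largest partition size. A short Python sketch of that formula, applied to the four sizes from the example (the function name is an assumption):

```python
def partition_skew(sizes):
    """Skew of each partition: (size - average size) / largest size."""
    average = sum(sizes) / len(sizes)
    largest = max(sizes)
    return [(size - average) / largest for size in sizes]

# Partition sizes in MB, taken from the example.
skews = partition_skew([100, 200, 300, 500])
balanced = partition_skew([250, 250, 250, 250])
```

Partitions smaller than the average get a negative skew, larger ones a positive skew, and a perfectly balanced layout gives zero everywhere.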

102. What is the syntax of the m_dump command?

Ans. The general syntax is "m_dump metadata data [action]"

103. What is the latest version that is available in Ab Initio?

Ans. The latest version of GDE is 1.15 and of the Co>Operating System is 2.14.

104. What is the Difference between DML Expression and XFR Expression ?
Ans. The main difference between DML and XFR is that:

DML represents the format of the metadata.

XFR represents the transform functions, which contain the business
rules.

105. What are the most commonly used components in an Ab Initio graph?
Can anybody give me a practical example of a transformation of data, say customer
data in a credit card company, into meaningful output based on business rules?
Ans. The most commonly used components in any Ab Initio project
are:

input file/output file

input table/output table

lookup file

reformat, gather, join, run sql, join with db, compress components,
sort, trash, partition by expression, partition by key, concatenate

106. Have you used the rollup component? Describe how.

Ans. The Rollup component can be used in a number of different ways.
It basically acts on a group of records based on a certain key.

The simplest application would be to count the number of records in a
certain file or table.

In this case there would not be any "key" associated with it. A
temporary variable would be created, e.g. 'temp.count', which would be
incremented with every record that flows through the transform (since
there is no key here, all the records are treated as one group), like
temp.count = temp.count + 1.
Again, the Rollup component can be used to discard duplicates from a
group; Rollup basically acts as the Dedup component in this case.

107. What is the difference between partitioning with key and round robin?
Ans. PARTITION BY KEY:

In this, we have to specify the key based on which the partitioning
will occur. Since it is key based, it results in well balanced data
only when the key values are evenly distributed. It is useful for key
dependent parallelism.

PARTITION BY ROUND ROBIN:

In this, the records are partitioned in a sequential way, distributing
data evenly in blocksize chunks across the output partitions. It is
not key based and results in well balanced data, especially with a
blocksize of 1. It is useful for record independent parallelism.

108. How to work with parameterized graphs?

Ans. One of the main purposes of parameterized graphs is that if we
need to run the same graph n times for different files, we set up
graph parameters like $INPUT_FILE, $OUTPUT_FILE etc. and supply the
values for these in Edit > Parameters. These parameters are
substituted at run time. We can set different types of parameters like
positional, keyword, local etc.

The idea here is that instead of maintaining different versions of the
same graph, we can maintain one version for different files.

109. How does MAXCORE work?

Ans. Maxcore is a value (it will be in Kb). Whenever a component is
executed, it can use up to the amount of memory we specified.
110. What does layout mean in terms of Ab Initio?

Ans.Before you can run an Ab Initio graph, you must specify layouts to
describe the following to the Co>Operating System:

• The location of files


• The number and locations of the partitions of multifiles
• The number of, and the locations in which, the partitions of
program components execute

A layout is one of the following:

• A URL that specifies the location of a serial file


• A URL that specifies the location of the control partition of a
multifile
• A list of URLs that specifies the locations of:
o The partitions of an ad hoc multifile
o The working directories of a program component

Every component in a graph (both dataset and program components) has a
layout. Some graphs use one layout throughout; others use several
layouts and repartition data as needed for processing by a greater
or lesser number of processors.
During execution, a graph writes various files in the layouts of some or
all of the components in it. For example:

• An Intermediate File component writes to disk all the data that
passes through it.
• A phase break, checkpoint, or watcher writes to disk, in the layout of
the component downstream from it, all the data passing through
it.
• A buffered flow writes data to disk, in the layout of the component
downstream from it, when its buffers overflow.
• Many program components — Sort is one example — write, then
read and remove, temporary files in their layouts.
• A checkpoint in a continuous graph writes files in the layout of
every component as it moves through the graph.

111. Can we load multiple files?

Ans. From my perspective, loading multiple files means writing into
more than one file at a time. If that is your case too, Ab Initio
provides a component called Write Multiple Files (in the Dataset
component group) which can write multiple files at a time. But the
files to be written must be local files, i.e. they should reside on
your local PC. For more information on this component, see the help
file.

112. How would you do performance tuning for an already built graph? Can you let me
know some examples?

Ans. 1) Suppose a Sort is used in front of a Merge component: there is
no use in that Sort, because sorting is built into the Merge
component.

2) We use lookup instead of the Join or Merge components.

3) Suppose we want to join the data coming from 2 files and we do not
want duplicates: we will use the union function instead of adding an
additional component to remove duplicates.
113. Which one is faster for processing, fixed length DMLs or delimited DMLs, and
why?

Ans. Fixed-length DMLs are faster because the data of that length is
read directly without any comparisons, but in delimited ones every
character has to be compared with the delimiter, which causes delays.
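The reasoning can be illustrated with two toy parsers in Python (the function names and sample record are assumptions; real record parsing is done by the Co>Operating System, not by code like this). The fixed-length parser jumps straight to known offsets, while the delimited parser has to inspect every character.

```python
def parse_fixed(data, widths):
    """Slice fields at known offsets; no character is inspected."""
    fields, pos = [], 0
    for width in widths:
        fields.append(data[pos:pos + width])
        pos += width
    return fields

def parse_delimited(data, delimiter="|"):
    """Compare every character with the delimiter to find field boundaries."""
    fields, current = [], []
    for ch in data:
        if ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

fixed = parse_fixed("ABC12345", [3, 5])
delimited = parse_delimited("ABC|12345")
```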

114. What is the function you would use to transfer a string into a decimal?

Ans. For converting a string to a decimal we need to typecast it using
the following syntax,

out.decimal_field :: ( decimal( size_of_decimal ) ) string_field;

The above statement converts the string to decimal and populates it to
the decimal field in output.

115. What is the importance of EME in Ab Initio?

Ans. EME is a repository in Ab Initio. It is used for checkin and
checkout of graphs and also maintains graph versions.

116. How do you add default rules in transformer?

Ans. Double click on the transform parameter on the Parameters tab
page of the component properties; it will open the transform editor.
In the transform editor click on the Edit menu and then select Add
Default Rules from the dropdown. It will show two options: 1) Match
Names 2) Wildcard.

117. What does dependency analysis mean in Ab Initio?

Ans.

118. What is data mapping and data modelling?

Ans. Data mapping deals with the transformation of the extracted data
at FIELD level, i.e. the transformation of the source field to the
target field is specified by the mapping defined on the target field.
The data mapping is specified during the cleansing of the data to be
loaded.

For Example:

source;

string(35) name = "Siva Krishna ";


target;

string("01") nm=NULL("");/*(maximum length is string(35))*/

Then we can have a mapping like:

Straight move.Trim the leading or trailing spaces.

The above mapping specifies the transformation of the field nm.

119. What is the difference between conventional loading and direct loading? When
is each used in real time?

Ans. Conventional load:
Before loading the data, all the table constraints are checked
against the data.

Direct load (faster loading):
All the constraints are disabled and the data is loaded directly.
Later the data is checked against the table constraints and the bad
data is not indexed.

API mode uses conventional loading.

Utility mode uses direct loading.

120. What are the continuous components in Ab Initio?

Ans. Continuous components are used to create graphs that produce
useful output while running continuously.

e.g. Continuous Rollup, Continuous Update, Batch Subscribe

121. What is meant by Co>Operating System and why is it special for
Ab Initio?

Ans. Co>Operating System:

It converts the Ab Initio specific code into a format that
UNIX/Windows can understand and feeds it to the native operating
system, which carries out the task.

122. How do you add default rules in transformer?

Ans. Click on the transformer, then go to Edit, then click Add Default
Rules.
_______________________
Submitted by Nandy Chandan (cnandy@visa.com)

In Ab Initio there is a concept called Rule Priority, in which you can
assign priority to rules in a transformer.

Let's have an example:

Output.var1 :1: input.var1 + 10

Output.var1 :2: 100

This example shows that the output variable is assigned the input
variable + 10, or, if the input variable does not have a value, the
default value 100 is set to the output variable.

The numbers 1 and 2 represent the priority.
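The fall-through behaviour of prioritized rules can be mimicked in Python. This is a loose analogy, not DML: the rule list, the function name and the choice to treat a missing or NULL input as a failed rule are all assumptions made for the sketch.

```python
def apply_prioritized_rules(rules, record):
    """Try rules in priority order; use the first one that yields a value."""
    for _priority, rule in sorted(rules, key=lambda pr: pr[0]):
        try:
            value = rule(record)
        except (KeyError, TypeError):   # missing field or None input: rule fails
            continue
        if value is not None:
            return value
    return None

# Priority 1: input.var1 + 10; priority 2: the constant 100.
rules = [
    (1, lambda rec: rec["var1"] + 10),
    (2, lambda rec: 100),
]

with_value = apply_prioritized_rules(rules, {"var1": 5})
with_null = apply_prioritized_rules(rules, {"var1": None})
```

When var1 holds 5 the first rule succeeds and gives 15; when var1 has no value the first rule fails and the second rule supplies the default of 100.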

123. How do we run sequences of jobs, such that the output of job A is
the input to job B? How do we co-ordinate the jobs?

Ans. By writing wrapper scripts we can control the sequence of
execution of more than one job.

124. What are BROADCAST and REPLICATE?

Ans. Broadcast - Takes data from multiple inputs, combines it and
sends it to all the output ports.

E.g. - You have 2 incoming flows (this can be data parallelism or
component parallelism) on the Broadcast component, one with 10 records
and the other with 20 records. Then all the outgoing flows (there can
be any number of flows) will have 10 + 20 = 30 records.

Replicate - It replicates the data for a particular partition and
sends it out to multiple out ports of the component, but maintains the
partition integrity.

E.g. - Your incoming flow to Replicate has a data parallelism level of
2, with one partition having 10 records and the other one having 20
records. Now suppose you have 3 output flows from Replicate. Then each
flow will have 2 data partitions with 10 and 20 records respectively.
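The two examples can be reproduced with a small Python model in which a flow is a list of partitions and a partition is a list of records. The function names and this list-of-lists representation are assumptions for the illustration only.

```python
def broadcast(input_partitions, n_outputs):
    """Combine all input partitions and send the full set to every output."""
    combined = [rec for part in input_partitions for rec in part]
    return [list(combined) for _ in range(n_outputs)]

def replicate(input_partitions, n_outputs):
    """Copy the flow to every output, keeping each partition separate."""
    return [[list(part) for part in input_partitions] for _ in range(n_outputs)]

# Two incoming partitions with 10 and 20 records.
inputs = [list(range(10)), list(range(20))]

broadcast_out = broadcast(inputs, 3)   # 3 outputs, each with 30 records
replicate_out = replicate(inputs, 3)   # 3 flows, each with partitions of 10 and 20
```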

125. When using multiple DML statements to perform a single unit of work, is it
preferable to use implicit or explicit transactions, and why.

Ans. Because implicit transactions are used for internal processing
and explicit transactions are used when the user needs to open the
required data.

126. What kinds of layouts does Ab Initio support?

Ans. Basically there are serial and parallel layouts supported by
Ab Initio. A graph can have both at the same time. The parallel one
depends on the degree of data parallelism. If the multifile system is
4-way parallel, then a component in a graph can run 4-way parallel if
its layout is defined to match that degree of parallelism.

127. What is the difference between look-up file and look-up, with a relevant
example?

Ans. A lookup is a component of an Ab Initio graph where we can store
data and retrieve it by using a key parameter.

A lookup file is the physical file where the data for the lookup is stored.
Data Warehouse Questions –
1. What is snapshot
Ans. You can disconnect the report from the catalog to which it is
attached by saving the report with a snapshot of the data. However,
you must reconnect to the catalog if you want to refresh the data.
2. What is the difference between datawarehouse and BI?
Ans. Simply speaking, BI is the capability of analyzing the data of a
data warehouse to the advantage of the business. A BI tool analyzes
the data of a data warehouse and arrives at business decisions
depending on the result of the analysis.
3.
