for queries and analysis. Data and information are extracted from
heterogeneous sources as they are generated. This makes it much
easier and more efficient to run queries over data that originally came
from different sources". Another definition for a data warehouse is: "A
data warehouse is a logical collection of information gathered from
many different operational databases used to create business
intelligence that supports business analysis activities and decision-
making tasks, primarily, a record of an enterprise's past transactional
and operational information, stored in a database designed to favor
efficient data analysis and reporting (especially OLAP)". Generally, data
warehousing is not meant for current "live" data, although 'virtual' or
'point-to-point' data warehouses can access operational data. A 'real'
data warehouse is generally preferred to a virtual DW because stored
data has been validated and is set up to provide reliable results to
common types of queries used in a business.
Analysis Level:
In the check in wizard’s advanced options, the analysis level can be specified as
one of the following:
• None:
No dependency analysis is performed during the check in.
• Translation only:
Graph being checked in is translated to data store format but no error
checking is done. This is the minimum requirement during check in.
• Translation with checking: (Default)
Along with the translation, errors, which will interfere with dependency
analysis, are checked for. These include:
• Absolute paths
• Undefined parameters
• DML syntax errors
• Parameter reference to objects that can’t be resolved
• Wrong substitution syntax in parameter definition
• Full Dependency Analysis:
Full dependency analysis is done during check in. It is not recommended,
as it takes a long time and can delay the check-in process.
What to analyse:
• All files:
Analyse all files in the Project
• All unanalysed files:
Analyse all files that have been changed or which are dependent on or
required by files that have changed since the last time they were analysed.
• Only my checked in files:
All files checked in by you would be analysed if they have not been
analysed before.
• Only the file specified:
Apply analysis to the file specified only.
Ans. A key is a field or set of fields that uniquely identifies a record in a file or table.
A natural key is a key that is meaningful in some business or real-world sense. For
example, a social security number for a person, or a serial number for a piece of
equipment, is a natural key.
A surrogate key is a field that is added to a record, either to replace the natural key or in
addition to it, and has no business meaning. Surrogate keys are frequently added to
records when populating a data warehouse, to help isolate the records in the warehouse
from changes to the natural keys by outside processes.
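The surrogate-key idea can be sketched in Python (a language analogy, not Ab Initio DML; the function name and record fields below are illustrative):

```python
# Sketch: assigning counter-based surrogate keys while loading records
# into a warehouse table. The natural key (here "ssn") is kept as-is;
# the surrogate key is a meaningless, stable identifier added on load.

def load_records(records, start_key=1):
    """Attach a surrogate key to each record; keep the natural key unchanged."""
    loaded = []
    for offset, rec in enumerate(records):
        row = dict(rec)
        row["surrogate_key"] = start_key + offset  # no business meaning
        loaded.append(row)
    return loaded

rows = load_records([
    {"ssn": "123-45-6789", "name": "Ann"},   # natural key: ssn
    {"ssn": "987-65-4321", "name": "Bob"},
])
# rows[0]["surrogate_key"] == 1, rows[1]["surrogate_key"] == 2
```

If an outside process later reissues a person's SSN, the warehouse rows are still uniquely identified by their surrogate keys.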
For Example:
source;
target;
13. How do you execute a graph from start to end stages? And how do
you run a graph on a non-Ab Initio system?
14. Can we load multiple files?
15. How can we test Ab Initio graphs manually and through automation?
16. What is the difference between sandbox and EME, can we
perform checkin and checkout through sandbox/ Can anybody
explain checkin and checkout?
17. What does layout mean in terms of Ab Initio?
18. Can anyone please explain the environment variables with an
example?
19. How do we handle it if the DML changes dynamically?
There are many ways to handle DMLs that change dynamically within a
single file. Some of the suitable methods are to use a conditional DML,
or to use the vector functionality while calling the DMLs.
API and UTILITY are the two possible interfaces to connect to the
databases to perform certain user specific tasks. These interfaces
allow the user to access or use certain functions (provided by the
database vendor) to perform operation on the databases. The
functionality of each of these interfaces depends on the databases.
Graph parameters are the ones added to the respective graph. You can
add graph parameters by selecting Edit > Parameters from the menu.
2. If needed, use lookup local rather than lookup when the data is large.
3. Take out unnecessary components such as Filter by Expression;
instead, provide the filter condition in the Reformat/Join/Rollup itself.
lookup file
27. Hi can anyone tell me what happens when the graph run?
i.e The Co-operating System will be at the host, We are
running the graph at some other place. How the Co-operating
System interprets with Native OS?
Whenever you press the Run button in your GDE, the GDE generates a
script.
Conventional Load:
Before loading the data, all the Table constraints will be checked
against the data.
30. Soft links to MFS files on Unix for Ab Initio? What is this?
2) $pound: what is this?
3) $?: what is it used for?
4) types of loading
5) overwrite: when is it used?
A link in Unix is a reference to another file. With a soft (symbolic)
link, if the original file is deleted, the link is left dangling; with a
hard link, the file's contents remain accessible through the link.
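The link behaviour can be demonstrated with Python's standard library (a Unix-only sketch; the file names are illustrative):

```python
# Sketch: a symbolic (soft) link dangles once its target is removed,
# while a hard link keeps the file's contents alive.
import os
import tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "data.txt")
with open(target, "w") as f:
    f.write("hello")

soft = os.path.join(d, "soft.lnk")
hard = os.path.join(d, "hard.lnk")
os.symlink(target, soft)   # soft link stores only the path
os.link(target, hard)      # hard link shares the same inode

os.remove(target)          # delete the original file

print(os.path.exists(soft))  # False: the soft link now dangles
print(os.path.exists(hard))  # True: contents survive via the hard link
```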
A .cfg file is for the remote connection, and a .dbc file is for
connecting to the database.
.cfg contains :
2. Database version
3. Userid/pwd
33. How would you do performance tuning for already built graph ?
Can you let me know some examples?
3) Suppose we want to join the data coming from 2 files and we don't
want duplicates; we can use the union function instead of adding an
additional component for duplicate removal.
Duplicates were removed from only the in1 port, that is, from Input File
2.
The driving input is the largest input. All other inputs are read into
memory.
For example, suppose the largest input to be joined is on the in1 port.
Specify a port number of 1 as the value of the driving parameter. The
component reads all other inputs to the join — for example, in0, and
in2 — into memory.
Join also improves performance by loading all records from all inputs
except the driving input into main memory.
driving port in join supplies the data that drives join . That means, for
every record from the driving port, it will be compared against the data
from non driving port.
We have to set the driving port to the larger dataset so that the
smaller, non-driving data can be kept in main memory, speeding up the
operation.
1 GB ≈ 100 MB + 200 MB + 300 MB + 500 MB
1000 MB / 4 = 250 MB per partition on average
(100 − 250) / 500 = −150 / 500 = −0.3 (a negative value)
Rollup is for group-by aggregation and Scan is for successive (running)
totals. Basically, when we need a running summary for every record we
use Scan; Rollup is used to aggregate data into one record per group.
rollup:
group1: 10 + 20 + 30 = 60
group2: 40 + 30 = 70
overall total: 130
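The rollup/scan distinction above can be sketched in Python (an analogy, not DML), using the same group values:

```python
# Sketch: rollup emits one aggregate per group; scan emits a running
# total for every record.
from itertools import groupby

data = [("group1", 10), ("group1", 20), ("group1", 30),
        ("group2", 40), ("group2", 30)]

# Rollup: one summary record per group.
rollup = {g: sum(v for _, v in recs)
          for g, recs in groupby(data, key=lambda r: r[0])}
# rollup == {"group1": 60, "group2": 70}

# Scan: a cumulative total after every record.
scan = []
for g, recs in groupby(data, key=lambda r: r[0]):
    running = 0
    for _, v in recs:
        running += v
        scan.append((g, running))
# scan == [("group1", 10), ("group1", 30), ("group1", 60),
#          ("group2", 40), ("group2", 70)]
```

Note that `groupby` assumes the records are already grouped by key, just as Rollup/Scan expect key-sorted input.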
Both are graph-level parameters, but for a local parameter you need to
initialize the value at the time of declaration, whereas for a global
parameter there is no need to initialize; it will prompt for a value at
run time.
In the run directory a graph can be deployed as a .ksh file. This .ksh
file can then be run at the command prompt, for example: ksh abc.ksh
(for a graph named abc.mp).
A DML expression means Ab Initio DML stored or saved in a file. DML
describes the data in terms of record formats, expressions that perform
simple computations, transform functions that control data
transformations, and keys that specify grouping or non-grouping. In
other words, DML expressions are non-embedded record format files.
The Co>Operating System, as the name itself suggests, is not merely an
engine or interpreter. It is an operating system layer that co-exists
with another operating system. In layman's terms: unlike other
applications, Ab Initio does not simply sit as a layer on top of the OS.
It has quite a lot of operating-system-level capabilities of its own,
such as multifiles and memory management, and in this way it integrates
with the host OS and works jointly on the available hardware resources.
This synergy with the OS optimizes the utilization of the available
hardware. Unlike other applications (including most other ETL tools), it
does not work as a layer that merely interprets commands. That is the
major difference from other ETL tools, and it is the reason Ab Initio is
much faster than any other ETL tool, and obviously much costlier as
well.
There is a Read Excel component that reads an Excel file either from the
host or from a local drive. The DML will be a default one.
57. How will you test a dbc file from command prompt ??
You can test a dbc file from the command prompt (Unix) using the command
m_db test <name-of-dbc-file>, which checks the database connection, the
database version, and the user.
The designer uses the GDE to design graphs and saves them to the EME or
the sandbox. The sandbox is on the user side, whereas the EME is on the
server side.
To run a graph infinitely, the end script in the graph should call the
.ksh file of the graph. Thus if the name of the graph is abc.mp then in
the end script of the graph there should be a call to abc.ksh.
Like this the graph will run infinitely.
A lookup file is the physical file where the data for the lookup is stored.
If the DML changes dynamically then both dml and xfr has to be
passed as graph level parameter during the runtime.
AbInitio has built-in functions to retrieve values using the key for the
lookup.
If the user wants to group the records on particular field values, then
rollup is the best way to do that. Rollup is a multi-stage transform
function and it contains the following mandatory functions:
1. initialise
2. rollup
3. finalise
You also need to declare a temporary variable if you want to get counts
for a particular group.
For each group, it first calls the initialise function once, followed by
a call to the rollup function for each of the records in the group, and
finally calls the finalise function once after the last rollup call.
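That initialise/rollup/finalise call sequence can be sketched in Python (an analogy of the lifecycle, not Ab Initio's API; the field names are illustrative):

```python
# Sketch: per group, initialise once, call the rollup step once per
# record, and finalise once at the end. A temp counter tracks the
# group's record count, as described above.

def run_rollup(records, key_of):
    groups = {}
    for rec in records:                      # gather records per key
        groups.setdefault(key_of(rec), []).append(rec)
    results = []
    for key, recs in groups.items():
        temp = {"count": 0, "sum": 0}        # initialise (once per group)
        for rec in recs:                     # rollup (once per record)
            temp["count"] += 1
            temp["sum"] += rec["amt"]
        results.append({"key": key, **temp}) # finalise (once per group)
    return results

out = run_rollup([{"k": "a", "amt": 1}, {"k": "a", "amt": 2},
                  {"k": "b", "amt": 5}], key_of=lambda r: r["k"])
# [{'key': 'a', 'count': 2, 'sum': 3}, {'key': 'b', 'count': 1, 'sum': 5}]
```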
Add Default Rules — Opens the Add Default Rules dialog. Select one of
the following:
• Match Names — Generates a set of rules that copies input fields to
output fields with the same name.
• Use Wildcard (.*) Rule — Generates one rule that copies input fields
to output fields with the same name.
From Ab Initio, run the Run SQL component using the DDL "truncate table".
3. Run Program
78. What is the function you would use to transfer a string into
a decimal?
In this case no specific function is required if the size of the string
and the decimal are the same. Just use a decimal cast with the size in
the transform function and it will suffice. For example, if the source
field is defined as string(8) and the destination as decimal(8), and the
field name is field1:
out.field1 :: (decimal(8))in.field1;
If the destination field size is smaller than the input, the
string_substring function can be used, like the following.
Say the destination field is decimal(5):
out.field :: (decimal(5))string_lrtrim(string_substring(in.field,1,5)) /*
string_lrtrim used to trim leading and trailing spaces */
When the data is divided into small chunks and processed on different
components simultaneously, we call it data parallelism.
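Data parallelism can be sketched in Python (an analogy of the idea, not the Co>Operating System's mechanism): the same transformation runs concurrently over separate chunks of the data.

```python
# Sketch: split the data into chunks, run the same transform on each
# chunk concurrently, then combine the results.
from concurrent.futures import ThreadPoolExecutor

def transform(chunk):
    return [x * 2 for x in chunk]

data = list(range(8))
chunks = [data[i::4] for i in range(4)]     # 4-way partition of the data
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(transform, chunks))

# Combine the partition outputs back into one result.
result = sorted(x for part in parts for x in part)
# result == [0, 2, 4, 6, 8, 10, 12, 14]
```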
An outer join is used when one wants to select all the records from a
port - whether it has satisfied the join criteria or not.
If you want to see all the records of one input file regardless of
whether there is a matching record in the other file, then it is an
outer join.
A .dbc file has the information required for Ab Initio to connect to the
database to extract or load tables or views. While .CFG file is the table
configuration file created by db_config while using components like
Load DB Table.
Both DBC and CFG files are used for database connectivity, and basically
both serve a similar purpose. The only difference is that a cfg file is
used for the Informix database, whereas dbc files are used for other
databases such as Oracle or SQL Server.
86. What is the difference between a Scan component and a RollUp component?
Ans. Rollup produces one summary record per group (like a group-by
aggregate), whereas Scan produces a cumulative, running summary record
for every input record.
88. How will you test a dbc file from command prompt ??
Ans. try "m_db test myfile.dbc"
89. Explain the difference between the “truncate” and "delete" commands ?
Ans. Truncate: it is a DDL command, used to remove all rows from a table
or cluster. Since it is a DDL command it auto-commits, and a rollback
can't be performed. It is faster than delete.
Lookup files are small files that can be accommodated in physical memory
for use in transforms. Details like country code/country, currency
code/currency, or forex rate/value can be kept in a lookup file and
mapped during transformations. Lookup files are not connected to any
component of the graph but are available to a Reformat for mapping.
93. How to Create Surrogate Key using Ab Initio?
There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
95. Describe the process steps you would perform when defragmenting a data
table. This table contains mission critical data ?
Ans. There are several ways to do this:
1) We can move the table within the same tablespace or to another
tablespace and rebuild all the indexes on the table.
1. Local parameters
2. Formal parameters (those parameters resolved at run time)
For most of the graph components, we can manually set the error
threshold limit, after which the graph exits. Normally there are three
levels of threshold. "Never Exit" and "Exit on First Occurrence" are
self-explanatory and represent the two extremes. The third one is Limit
along with Ramp: Limit specifies a maximum absolute count, whereas Ramp
is expressed in terms of the percentage of processed records. For
example, a ramp value of 5 means that if less than 5% of the total
records are rejected, the graph continues running; if the ramp is
crossed, the graph exits. Typically development starts with Never Exit,
followed by Ramp, and finally, in production, Exit on First Occurrence.
Case by case, Ramp can be used in production, but it is definitely not a
desired approach.
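The limit/ramp decision can be sketched in Python. This assumes the commonly cited tolerance formula tolerance = limit + ramp × records processed (treat the exact formula as an assumption to verify against the Co>Operating System documentation):

```python
# Sketch: decide whether to abort a run based on a reject-threshold
# limit plus a ramp proportional to records processed so far.

def should_abort(rejects, processed, limit, ramp):
    tolerance = limit + ramp * processed   # allowed rejects at this point
    return rejects > tolerance

# A ramp of 0.05 roughly means "keep going while under ~5% rejects".
print(should_abort(rejects=3, processed=100, limit=0, ramp=0.05))  # False
print(should_abort(rejects=9, processed=100, limit=0, ramp=0.05))  # True
```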
100. What is the difference between sandbox and EME, can we perform checkin and
checkout through sandbox/ Can anybody explain checkin and checkout?
Ans. Sandboxes are work areas used to develop, test or run code
associated with a given project. Only one version of the code can be
held within the sandbox at any time.
The EME Datastore contains all versions of the code that have been
checked into it. A particular sandbox is associated with only one
project, whereas a project can be checked out to a number of sandboxes.
104. What is the Difference between DML Expression and XFR Expression ?
Ans. The main difference between DML and XFR is that DML describes the
record format of the data, whereas XFR contains the transform functions
(the rules applied to the data).
105. What are the most commonly used components in an Ab Initio graph?
lookup file
In this case there would not be any "key" associated with it. A temp
variable would be created, e.g. 'temp.count', which would be incremented
with every record that flows through the transform (since there is no
key here, all the records are treated as one group), like
temp.count = temp.count + 1.
Again, the rollup component can be used to discard duplicates from a
group; rollup is basically acting as the dedup component in this case.
107. What is the difference between partitioning with key and round robin?
Ans. PARTITION BY KEY:
In this, we have to specify the key based on which the partition will
occur. Records with the same key always go to the same partition, so it
is useful for key-dependent parallelism; however, the data can become
unbalanced if the key values are skewed. ROUND ROBIN, by contrast, deals
records evenly across the partitions in turn, which gives well-balanced
data but no key locality.
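The two partitioning schemes can be sketched in Python (an analogy of the idea, not Ab Initio's components):

```python
# Sketch: key-based partitioning sends equal keys to the same partition
# (but can skew); round robin deals records out evenly regardless of key.

def partition_by_key(records, n, key):
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(rec[key]) % n].append(rec)  # same key -> same partition
    return parts

def partition_round_robin(records, n):
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)               # deal records in turn
    return parts

recs = [{"k": "a"}, {"k": "a"}, {"k": "a"}, {"k": "b"}]
by_key = partition_by_key(recs, 2, "k")
rr = partition_round_robin(recs, 2)
# by_key keeps all "a" records in one partition (skewed 3/1 at best);
# rr gives each partition exactly 2 records.
```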
The idea here is, instead of maintaining different versions of the same
graph, we can maintain one version for different files.
Ans.Before you can run an Ab Initio graph, you must specify layouts to
describe the following to the Co>Operating System:
112. How would you do performance tuning for already built graph ? Can you let me
know some examples?
3) Suppose we want to join the data coming from 2 files and we don't
want duplicates; we can use the union function instead of adding an
additional component for duplicate removal.
113. Which one is faster for processing fixed length dmls or delimited dmls and why
?
Ans. Fixed-length DMLs are faster, because the reader can jump directly
by field length without any comparisons; with delimited DMLs, every
character must be compared against the delimiter, which causes delays.
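The difference can be sketched in Python (an analogy of the parsing work, not the Co>Operating System's reader):

```python
# Sketch: fixed-length records are parsed by slicing at known offsets;
# delimited records must be scanned character by character for the
# separator, which is why fixed layouts parse faster.

def parse_fixed(rec, widths):
    fields, pos = [], 0
    for w in widths:
        fields.append(rec[pos:pos + w])  # direct slice, no scanning
        pos += w
    return fields

def parse_delimited(rec, sep="|"):
    return rec.split(sep)                # must examine every character

assert parse_fixed("AnnNY10", [3, 2, 2]) == ["Ann", "NY", "10"]
assert parse_delimited("Ann|NY|10") == ["Ann", "NY", "10"]
```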
114. What is the function you would use to transfer a string into a decimal?
Ans.
Ans. Data mapping deals with the transformation of the extracted data at
the FIELD level, i.e., the transformation of a source field to a target
field is specified by the mapping defined on the target field. The data
mapping is specified during the cleansing of the data to be loaded.
For Example:
source;
119. Difference between conventional loading and direct loading ? when it is used
in real time .
Ans.Conventional Load:
Before loading the data, all the Table constraints will be checked
against the data.
It converts the Ab Initio-specific code into a format which
UNIX/Windows can understand and feeds it to the native operating
system, which carries out the task.
125. When using multiple DML statements to perform a single unit of work, is it
preferable to use implicit or explicit transactions, and why.
127.What is the difference between look-up file and look-up, with a relevant
example?
A lookup file is the physical file where the data for the lookup is stored.
Data Warehouse Questions –
1. What is snapshot
Ans. You can disconnect the report from the catalog to which it is
attached by saving the report with a snapshot of the data. However,
you must reconnect to the catalog if you want to refresh the data.
2. What is the difference between datawarehouse and BI?
Ans. Simply speaking, BI is the capability of analyzing the data of a
data warehouse to the advantage of that business. A BI tool analyzes the
data of a data warehouse and comes to some business decision depending
on the result of the analysis.
3.