Project: db FEEDS
Stage: Release 1
Phase: Design
Work stream: Interfaces
Document: Performance Tuning Guide
Subject: Informatica – Performance tuning
Version History:
Version | Date | By | Changes

Contributors:
Name | Role | Location | Remarks

Approval:
Name | Role | Location | Remarks
Sanjay De Costa
Joseph Houben

Reference Documents:
Name | Author | Version | Date
Unix system (OS) tuning | Informatica
1 Document Description
This document describes practices the ETL development team can follow to get the best out of Informatica PowerCenter (ETL). It concentrates mainly on optimising the performance of the core ETL. For the ETL to achieve optimal performance, it is imperative to strike a good balance between the hardware, the OS, the RDBMS and Informatica PowerCenter 7.1.1. This document can be used as a reference by the development and administration teams.
2 Document Organisation
This document is divided into the following parts:
o Primary guidelines - practices necessary for the ETL to perform optimally; the fundamental approach to ETL design with Informatica PC 7.1.1
o Advanced guidelines - guidelines that can be applied on a case-by-case basis, depending on the problem scenario / environment
o Optimising the Unix system - performance tuning of the OS (Unix/Linux system)
3.2 Localisation
Localise the relational objects as far as possible, and try not to use synonyms that point to a remote database. Using remote links for data processing and loading certainly slows things down.
Using a database-generated sequence number requires a wrapper function / stored procedure call, and utilising such stored procedures has been seen to degrade performance by a factor of three. This slowness is not easily debugged; it can only be spotted in the Write Throughput column. Copy the map and replace the stored procedure call with an internal sequence generator for a test run - this shows how fast the map COULD run. If a database-generated sequence number must be used, follow the instructions for staging table usage. When dealing with gigabytes or terabytes of information, this can save many hours of tuning. If you MUST have a shared sequence generator, build a staging table from the flat file, add a SEQUENCE ID column, and call a POST TARGET LOAD stored procedure to populate that column. Place the post-target-load procedure in the flat-file-to-staging-table load map. A single call into the database, followed by a batch operation to assign sequences, is the fastest method of utilising shared sequence generators, as sketched below.
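A minimal sketch of this pattern, assuming an Oracle database and the sqlplus client; the staging table STG_FEED, the sequence STG_FEED_SEQ and the connect string are placeholders, and the UPDATE could equally live inside the post-target-load stored procedure itself:

#!/bin/sh
# After the flat-file-to-staging-table map has populated STG_FEED,
# assign the shared sequence numbers in one batch statement instead
# of one stored-procedure call per row.
sqlplus -s scott/tiger@dbfeeds <<'EOF'
UPDATE stg_feed
   SET sequence_id = stg_feed_seq.NEXTVAL
 WHERE sequence_id IS NULL;
COMMIT;
EXIT;
EOF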
The index cache and data cache are allocated dynamically first: as soon as the session is initialised, the memory for the data and index caches is set up, and their sizes depend upon the session settings.
The reader DTM is also based on a dynamic allocation algorithm; it uses the available memory in chunks, and the size of each chunk is determined by the session setting "Default Buffer block size".
NOTE: if the reader session to a flat file just doesn't ever "get fast", there is some basic map tuning to do. Try to merge expression objects, set lookups to unconnected (for re-use where possible), and check the index and data cache settings if aggregation or lookups are being performed. If the writer is slow, change the map to write to a single target table at a time to see which target is causing the "slowness", and tune it. Make copies of the original map and break the copies down. Once the "slower" of the N targets is discovered, talk to the DBA about partitioning the table, updating statistics, removing indexes during the load, and so on - there are many database-side remedies here.
Remember that the overall TIMING is determined by the READER/TRANSFORMER/WRITER threads. With complex mappings, don't forget that each ELEMENT (field) must be weighed; in this light, a firm understanding of how to read the performance statistics generated by Informatica becomes important. In other words, if the reader is slow, the rest of the threads suffer; if the writer is slow, the effect is the same. A pipe is only as big as its smallest diameter, and a chain is only as strong as its weakest link.
As far as possible, avoid APIs that call external objects, as these have proven slow. External modules might exhibit speed problems; instead, try pre-processing / post-processing with SED, AWK or GREP, as sketched below.
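A minimal pre-processing sketch in this spirit; the file names, the pipe delimiter and the filter rule are assumptions to adapt to the actual feed:

#!/bin/sh
# Strip comment lines and rows with an empty third field before the
# session reads the file - cheap row filtering outside Informatica.
grep -v '^#' /data/in/feed.dat \
    | awk -F'|' '$3 != ""' \
    > /data/in/feed_clean.dat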
4.7 Expressions
Expressions like IS_SPACES and IS_NUMBER affect performance: they are data-validation expressions that must scan the entire string to determine the result. Avoid these expressions unless there is an absolute requirement for their usage.
4.10 Aggregator
If a mapping contains more than one aggregator, the session will run slowly unless the cache directory is fast and disk drive access speed is high. Placing the aggregator towards the end might be another option; however, this also brings down performance, as all the I/O activity becomes a bottleneck in Informatica.
Mapplets are a good way to replicate data logic, but if a mapplet contains an aggregator, the performance of the mapping that contains the mapplet will still be affected. Reduce the number of aggregators in the entire mapping to one if you can; if possible, split the mapping into several mappings to break down the logic.
Sorted input to the aggregator increases performance to a large extent; however, if sorted input is enabled and the data passed to the aggregator is not sorted, the session will fail. Set the cache sizes to the amounts calculated with the formulae below (a worked example follows).
Index size = (sum of column sizes in group-by ports + 17) x number of groups
Data size = (sum of column sizes of output ports + 7) x number of groups
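For illustration only, with assumed figures: suppose the aggregator groups on ports totalling 30 bytes, its output ports total 50 bytes, and 10,000 distinct groups are expected. Then:

Index size = (30 + 17) x 10,000 = 470,000 bytes
Data size = (50 + 7) x 10,000 = 570,000 bytes

Round the session cache settings up from such estimates rather than cutting them fine.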
4.11 Joiner
Perform joins in a database where possible: performing a join in the database is faster than performing it in the session. For example, create a pre-session stored procedure that joins the tables in the database.
Designate as the master source the source with the smaller number of records. For optimal performance and disk storage, the master source should be the one with the lower row count: with a smaller master source, the data cache is smaller and the search time is shorter. Set the cache sizes to the amounts calculated with the formulae below (a worked example follows).
Index size = (sum of master column sizes in the join condition + 16) x number of rows in the master table
Data size = (sum of master column sizes NOT in the join condition but on output ports + 8) x number of rows in the master table
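For illustration only, with assumed figures: suppose the master source has 100,000 rows, its join-condition columns total 20 bytes, and its output-port columns not in the join condition total 40 bytes. Then:

Index size = (20 + 16) x 100,000 = 3,600,000 bytes
Data size = (40 + 8) x 100,000 = 4,800,000 bytes

Both caches scale with the master row count, which is why the smaller source should be the master.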
Attribute | Method
Minimum Index Cache | 200 x [(Σ column size) + 16] over all condition ports
Maximum Index Cache | (# rows in lookup table) x [(Σ column size) + 16] x 2 over all condition ports
Minimum Data Cache | (# rows in lookup table) x [(Σ column size) + 8] over all output ports (not condition ports)
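For illustration only, with assumed figures: for a lookup table of 50,000 rows whose condition ports total 24 bytes and whose output ports total 40 bytes:

Minimum index cache = 200 x (24 + 16) = 8,000 bytes
Maximum index cache = 50,000 x (24 + 16) x 2 = 4,000,000 bytes
Minimum data cache = 50,000 x (40 + 8) = 2,400,000 bytes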
4.15 Loading
Make sure indexes and constraints are removed before loading into relational targets; they can be recreated as soon as the load has completed. This helps boost performance in bulk data loads (a sketch of the pattern follows).
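A minimal sketch of the drop/recreate pattern, assuming an Oracle database and the sqlplus client, with hypothetical index and table names; in practice the two steps would typically run as pre- and post-session commands:

#!/bin/sh
# Pre-session: drop the index so the bulk load doesn't have to maintain it.
sqlplus -s scott/tiger@dbfeeds <<'EOF'
DROP INDEX tgt_feed_ix1;
EXIT;
EOF

# ... the Informatica session performs the bulk load here ...

# Post-session: recreate the index once the load has completed.
sqlplus -s scott/tiger@dbfeeds <<'EOF'
CREATE INDEX tgt_feed_ix1 ON tgt_feed (feed_key);
EXIT;
EOF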
The smaller the commit interval, the longer the session takes to complete; set an appropriate commit interval - anything above 50K is good. Partitioning the data while loading is another wise option. Informatica provides the following partition types:
o Key Range
o Hash Key
o Round Robin
o Pass Through
o Lookup Cache - use the hash auto-key partition type with an equality condition
4.22.3 Place some good server load monitoring tools on the PM Server in development
Watch it closely to understand how the resources are being utilised and where the hot spots are. Try to follow the recommendations - it may mean upgrading the hardware to achieve the desired throughput. Look into EMC's disk storage arrays: while expensive, they appear to be extremely fast and may improve performance in some cases by up to 50%.
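If no dedicated monitoring tool is available, even a crude logging loop helps locate hot spots. A minimal sketch; the log path and the 60-second interval are assumptions, and vmstat/iostat column layouts vary by platform:

#!/bin/sh
# Append a coarse CPU/memory/disk snapshot every 60 seconds.
while true
do
    date                 >> /var/tmp/pmserver_load.log
    vmstat 1 2 | tail -1 >> /var/tmp/pmserver_load.log   # current activity, not boot averages
    iostat               >> /var/tmp/pmserver_load.log
    sleep 60
done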
Prioritising the database login that the connections use (set up in Server Manager) can change the priority given to the executing Informatica tasks; once logged in to the database, these tasks can then override others. If priorities are to be changed, memory for these tasks (in shared global areas and server settings) must be sized accordingly. If BCP, SQL*Loader or some other bulk-load facility is utilised, these priorities must also be set. This can greatly improve performance. Again, it is only suggested as a last-resort method, and it doesn't substitute for tuning the database or the mappings themselves.
4.26 Balance between Informatica and the power of SQL and the database
Try to utilise the DBMS for what it was built for: reading/writing/sorting/grouping/filtering data en masse. Use Informatica for the more complex logic: outer joins, data integration, multiple source feeds, and so on. The balancing act is difficult without DBA knowledge. To achieve a balance, we must be able to recognise which operations are best done in the database and which are best done in Informatica. This does not detract from the use of the ETL tool - rather it enhances it, and it is a MUST if you are performance tuning for high-volume throughput.
Run vmstat -S 5 to confirm memory problems, and check for the following (a usage sketch is shown below):
o Are page-outs occurring consistently? If so, you are short of memory.
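A minimal usage sketch; the exact column names vary by platform (on some systems the swapping columns appear only with -S), so check the local man page:

# Report virtual-memory activity every 5 seconds; consistent non-zero
# values in the page-out column ("po", or "so" with -S) mean memory is short.
vmstat -S 5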
If memory seems to be the bottleneck of the system, try the following remedial steps:
o Reduce the size of the buffer cache, if your system has one, by decreasing BUFPAGES. (The buffer cache is not used on System V.4 and SunOS 4.X systems.) Note that making the buffer cache smaller will hurt disk I/O performance.
o If you have statically allocated STREAMS buffers, reduce the number of large (2048- and 4096-byte) buffers. This may reduce network performance, but netstat -m should give you an idea of how many buffers you really need.
o Reduce the size of your kernel's tables. This may limit the system's capacity (number of files, number of processes, etc.).
o Try running jobs that require a lot of memory at night. This may not help the memory problems, but you may care about them less.
o Write a find script that detects old core dumps, editor backup and auto-save files, and other trash, and deletes them automatically; run the script through cron (a sketch follows this list).
o Use a smaller block size on file systems that hold mostly small files (e.g., source code files, object modules, and small data files).
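A minimal sketch of such a cleanup script; the starting directory, the file patterns and the seven-day age threshold are assumptions to adapt locally:

#!/bin/sh
# cleanup.sh - delete old core dumps, editor backup (*~) and
# auto-save (#*#) files not accessed for a week.
find /home \( -name core -o -name '*~' -o -name '#*#' \) \
    -type f -atime +7 -exec rm -f {} \;

Schedule it through cron with an entry such as:

0 2 * * * /usr/local/bin/cleanup.sh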