
Insert / Update ordering in Informatica mappings

ETL-Performance.com
Stephen Barr

Does the order of inserts & updates to a target make a substantial difference to the overall performance
of the mapping, and perhaps more importantly to the overall scalability of the solution? The resounding
answer is YES!

Test case
My source and target tables exist in the same database but within different schemas. I’ve designed the
data such that 50% of the rows from the source will be updates and 50% will be inserts.

SOURCE@INFADB>select count(*)
2 from insert_update_source
3 /

COUNT(*)
----------
202992

Elapsed: 00:00:01.45
SOURCE@INFADB>select action, count(*)
2 from insert_update_source
3 group by action
4 /

ACTION COUNT(*)
------ ----------
UPDATE 101496
INSERT 101496

Elapsed: 00:00:00.48

Using these sources and targets I created two mappings.

Mapping 1 – interleaved inserts / updates

In this mapping, the target receives an insert, then an update, then an insert, and so on. This has been designed to represent a worst-case scenario.

Mapping 2 – inserts / updates routed to separate targets

In this mapping, there are two versions of the target. The inserts are routed to one target, while the
updates are routed to the other. We then use the “Target Load Plan” to choose which one we should
load first.
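Conceptually, the second mapping is just a partition of the row stream on the action flag before loading. A minimal Python sketch of that routing step (illustrative only, not Informatica's actual Router transformation; the `action` field name matches the test tables at the end of this document):

```python
# Split a stream of rows into inserts and updates, mimicking the routing
# that feeds the two target instances in mapping 2.
def route_rows(rows):
    inserts = [r for r in rows if r["action"] == "INSERT"]
    updates = [r for r in rows if r["action"] == "UPDATE"]
    return inserts, updates

rows = [{"id": i, "action": "INSERT" if i % 2 == 0 else "UPDATE"}
        for i in range(6)]
inserts, updates = route_rows(rows)
print(len(inserts), len(updates))  # 3 3
```

Each partition can then be loaded in whichever order the Target Load Plan dictates.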
The scripts for creating the source and target tables are available at the bottom of this document.

Results

Overall run times –

Mapping 1 – 6 minutes 14 seconds
Mapping 2 – 2 minutes 25 seconds

As you can see, there is a massive difference in the run times between the two mappings. Obviously, something fundamental is happening in the first mapping which is making it perform so poorly – and from looking at the Oracle trace files we can see exactly what the issue is.

From the trace of the target we can see the overall statistics for the insert statement from the first
mapping –

INSERT INTO INSERT_UPDATE_TARGET(ID,OWNER,OBJECT_NAME,SUBOBJECT_NAME,
OBJECT_ID,DATA_OBJECT_ID,OBJECT_TYPE,CREATED,LAST_DDL_TIME,TIMESTAMP,STATUS,
TEMPORARY,GENERATED,SECONDARY,ACTION)
VALUES
( :1, :2, :3, :4, :5, :6, :7, :8, :9, :10, :11, :12, :13, :14, :15)

call     count       cpu    elapsed       disk      query    current       rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse        1      0.00       0.00          0          0          0          0
Execute 100922     26.60      27.14         46       1990     323326     101496
Fetch        0      0.00       0.00          0          0          0          0
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total   100923     26.60      27.14         46       1990     323326     101496

We can see that there were 100922 executions of the insert statement, resulting in 323326 current block gets.

However, if we look at the second mapping –

INSERT INTO INSERT_UPDATE_TARGET(ID,OWNER,OBJECT_NAME,SUBOBJECT_NAME,
OBJECT_ID,DATA_OBJECT_ID,OBJECT_TYPE,CREATED,LAST_DDL_TIME,TIMESTAMP,STATUS,
TEMPORARY,GENERATED,SECONDARY,ACTION)
VALUES
( :1, :2, :3, :4, :5, :6, :7, :8, :9, :10, :11, :12, :13, :14, :15)

call     count       cpu    elapsed       disk      query    current       rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse        1      0.00       0.00          0          0          0          0
Execute    705      3.50       5.50          1       4005      28802     101496
Fetch        0      0.00       0.00          0          0          0          0
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total      706      3.50       5.50          1       4005      28802     101496

You can see that there were only 705 executions of the insert statement, with only 28802 current block gets. If we stack the figures up side by side, the contrast is even starker –

                  Map 1 – insert   Map 2 – insert   Map 1 – update   Map 2 – update
Executions                100922              705           101497           101497
CPU time (s)                26.6              3.5            37.39            30.68
Elapsed time (s)           27.14              5.5            50.58            42.67
Block gets                323326            28802           110023           107386

As you can see, there is a huge difference in the inserts – especially when it comes to cpu time and the
number of block gets. The reason? Array inserts.

Informatica uses the native Oracle Call Interface (OCI) to communicate with the Oracle server. One of the features of the OCI is its ability to allow a client to perform array inserts / updates. This means that for a single “execution” of the statement, multiple rows of data are processed. We can see this is happening because the rows-to-executions ratio for our insert statement is > 1.
In fact, the average array size in this case works out at ~144 rows of data (101496 rows over 705 executions). These array operations are much more efficient than ordinary single-row insert operations.
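The array-size arithmetic is just the tkprof figures divided out; a quick check in Python (the row and execution counts are taken from the statistics above):

```python
# rows / executions from the tkprof output gives the average array size.
map1_rows, map1_execs = 101496, 100922   # mapping 1: interleaved
map2_rows, map2_execs = 101496, 705      # mapping 2: routed

print(round(map1_rows / map1_execs, 2))  # 1.01 - effectively one row per call
print(round(map2_rows / map2_execs, 2))  # 143.97 - large array inserts
```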

So why is one mapping performing array operations while the other is not? Informatica has implemented its OCI interface in a very simple, generic way. If an insert statement is received by the writer process, it will start to build an array. If another insert statement comes through, it is simply added to the existing array. When the array is full, Informatica sends that array to Oracle for processing as a single message. However, if the writer is in the middle of building an array of inserts and receives an update, Informatica will send the insert array as it currently stands, followed by the update. Therefore, if we have interleaved inserts and updates, we are effectively not using arrays at all.
We can see this from the raw trace files –
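The flush-on-change behaviour described above can be modelled in a few lines of Python. This is a sketch of the described algorithm, not Informatica's actual writer code, and the array size of 200 is an arbitrary assumption:

```python
def count_executions(ops, array_size=200):
    """Simulate a writer that appends consecutive identical operations to an
    array and flushes it (one OCI execution) when the operation type changes
    or the array fills up."""
    execs = 0
    current_op, pending = None, 0
    for op in ops:
        if op != current_op and pending:
            execs += 1              # partial array flushed on op change
            pending = 0
        current_op = op
        pending += 1
        if pending == array_size:   # full array sent as a single message
            execs += 1
            pending = 0
    if pending:
        execs += 1                  # final flush at end of load
    return execs

interleaved = ["INSERT", "UPDATE"] * 1000        # mapping 1's worst case
grouped = ["INSERT"] * 1000 + ["UPDATE"] * 1000  # mapping 2's ordering
print(count_executions(interleaved))  # 2000 - one execution per row
print(count_executions(grouped))      # 10 - 2 x (1000 / 200)
```

The same row count produces wildly different execution counts purely because of ordering, which is exactly what the tkprof figures show.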

In mapping 1, we can see the inserts and updates are interleaved almost perfectly –

EXEC #1:c=0,e=287,p=0,cr=0,cu=3,mis=0,r=1,dep=0,og=1,tim=26087356343
WAIT #1: nam='SQL*Net message to client' ela= 6 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=26087356647
WAIT #1: nam='SQL*Net message from client' ela= 873 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=26087357719
EXEC #2:c=0,e=330,p=0,cr=2,cu=1,mis=0,r=1,dep=0,og=1,tim=26087358483
WAIT #2: nam='SQL*Net message to client' ela= 6 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=26087358803
WAIT #2: nam='SQL*Net message from client' ela= 877 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=26087359884
EXEC #1:c=0,e=268,p=0,cr=0,cu=3,mis=0,r=1,dep=0,og=1,tim=26087360720
WAIT #1: nam='SQL*Net message to client' ela= 6 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=26087361027
WAIT #1: nam='SQL*Net message from client' ela= 885 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=26087362116
EXEC #2:c=0,e=331,p=0,cr=2,cu=1,mis=0,r=1,dep=0,og=1,tim=26087362877
WAIT #2: nam='SQL*Net message to client' ela= 7 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=26087363197
WAIT #2: nam='SQL*Net message from client' ela= 869 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=26087364264

EXEC #1 is our insert, EXEC #2 is our update.
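Tallying this pattern mechanically is straightforward; a small Python sketch that counts executions and rows per cursor from the `EXEC #n:` lines of a raw 10046 trace file:

```python
import re
from collections import Counter

# Matches "EXEC #<cursor>:..." and captures the r=<rows> field.
EXEC_RE = re.compile(r"EXEC #(\d+):.*?,r=(\d+),")

def tally(trace_lines):
    """Return (executions, rows) Counters keyed by cursor number."""
    execs, rows = Counter(), Counter()
    for line in trace_lines:
        m = EXEC_RE.match(line)
        if m:
            cursor, r = m.group(1), int(m.group(2))
            execs[cursor] += 1
            rows[cursor] += r
    return execs, rows

sample = [
    "EXEC #1:c=0,e=287,p=0,cr=0,cu=3,mis=0,r=1,dep=0,og=1,tim=26087356343",
    "EXEC #2:c=0,e=330,p=0,cr=2,cu=1,mis=0,r=1,dep=0,og=1,tim=26087358483",
    "EXEC #1:c=0,e=268,p=0,cr=0,cu=3,mis=0,r=1,dep=0,og=1,tim=26087360720",
]
e, r = tally(sample)
print(dict(e))  # {'1': 2, '2': 1}
```

In mapping 1 every execution carries r=1; in mapping 2 the rows-per-execution climb as the arrays fill.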

However, looking at the trace file from mapping 2, we can see that the operations are grouped together –

EXEC #1:c=0,e=205,p=0,cr=2,cu=1,mis=0,r=1,dep=0,og=1,tim=28092846779
WAIT #1: nam='SQL*Net message to client' ela= 4 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=28092846876
WAIT #1: nam='SQL*Net message from client' ela= 488 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=28092847418
EXEC #1:c=0,e=264,p=0,cr=2,cu=1,mis=0,r=1,dep=0,og=1,tim=28092847781
WAIT #1: nam='SQL*Net message to client' ela= 5 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=28092847887
WAIT #1: nam='SQL*Net message from client' ela= 425 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=28092848366
EXEC #1:c=0,e=209,p=0,cr=2,cu=1,mis=0,r=1,dep=0,og=1,tim=28092848672
WAIT #1: nam='SQL*Net message to client' ela= 4 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=28092848771
WAIT #1: nam='SQL*Net message from client' ela= 414 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=28092849238
EXEC #1:c=0,e=207,p=0,cr=2,cu=1,mis=0,r=1,dep=0,og=1,tim=28092849540
WAIT #1: nam='SQL*Net message to client' ela= 4 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=28092849638
WAIT #1: nam='SQL*Net message from client' ela= 403 driver id=1413697536 #bytes=1 p3=0
obj#=-1 tim=28092850093
EXEC #1:c=0,e=207,p=0,cr=2,cu=1,mis=0,r=1,dep=0,og=1,tim=28092850387

We can see the massive difference in the traffic generated between Informatica and Oracle when comparing the two mappings –

Mapping 1 –

Event waited on                   Times Waited  Max. Wait  Total Waited
SQL*Net message to client               100922       0.02          0.74
SQL*Net message from client             100922       0.08         63.88

Mapping 2 –

Event waited on                   Times Waited  Max. Wait  Total Waited
SQL*Net message to client                  705       0.00          0.00
SQL*Net message from client                705       0.08          5.02

This is a massive reduction in the time the mapping spends communicating with the Oracle database – and if we scale these figures up to production volumes, you can see that this sort of issue needs to be taken seriously.
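To put a number on it: the wait summaries above give an average cost per round trip, and when every row is its own round trip that cost scales linearly with row count. The 10-million-row projection below is a hypothetical illustration, not a measured figure:

```python
# 'SQL*Net message from client' figures from the mapping 1 wait summary.
trips, total_wait = 100922, 63.88        # round trips, total seconds waited

per_trip = total_wait / trips            # average seconds per round trip
print(round(per_trip * 1000, 3), "ms")   # roughly 0.633 ms each

# Hypothetical projection: network wait alone for a 10M-row interleaved load.
print(round(per_trip * 10_000_000 / 60, 1), "minutes")
```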

Those of you with a good eye will have spotted something a bit strange. The figures for the update
statement are effectively the same for both mappings. It actually looks like Informatica does not
support array updates. If this is true then it seems like a glaring hole in its OCI implementation.
However, if you have evidence to the contrary let me know!

Implications

This was a very contrived test on a small, single-CPU box. I was using relatively small volumes and very simple structures. However, the Oracle trace files reflect the magnitude of the difference between the two approaches, and that difference will hold true even for the biggest of systems or the most complex of mappings.

It’s very easy to detect whether or not you’re experiencing these issues, and if you are, the performance benefit you could gain from separating your inserts and updates could be fantastic – especially given how little effort is required to make a change like this.

Scripts to create the source & target tables –

SOURCE

create sequence insert_update_seq;

create table insert_update_source
as
select insert_update_seq.nextval as id,
owner,
object_name,
subobject_name,
object_id,
data_object_id,
object_type,
created,
last_ddl_time,
timestamp,
status,
temporary,
generated,
secondary,
decode(mod(rownum,2),0,'INSERT','UPDATE') as action
from dba_objects
/

insert into insert_update_source
(
select insert_update_seq.nextval,
owner,
object_name,
subobject_name,
object_id,
data_object_id,
object_type,
created,
last_ddl_time,
timestamp,
status,
temporary,
generated,
secondary,
decode(mod(rownum,2),0,'INSERT','UPDATE') as action
from insert_update_source
)
/
rem each extra "/" re-executes the insert, doubling the row count
/
/
commit;

exec dbms_stats.gather_table_stats(user, 'INSERT_UPDATE_SOURCE');

TARGET

create table insert_update_target
(
ID NUMBER,
OWNER VARCHAR2(30),
OBJECT_NAME VARCHAR2(128),
SUBOBJECT_NAME VARCHAR2(30),
OBJECT_ID NUMBER,
DATA_OBJECT_ID NUMBER,
OBJECT_TYPE VARCHAR2(19),
CREATED DATE,
LAST_DDL_TIME DATE,
TIMESTAMP VARCHAR2(19),
STATUS VARCHAR2(7),
TEMPORARY VARCHAR2(1),
GENERATED VARCHAR2(1),
SECONDARY VARCHAR2(1),
ACTION VARCHAR2(6)
)
/

insert into insert_update_target
(
select *
from source.insert_update_source
where action = 'UPDATE'
)
/

commit;

create index id_idx on insert_update_target(id);

exec dbms_stats.gather_table_stats(user, 'INSERT_UPDATE_TARGET', cascade=>TRUE);
