IBM InfoSphere Change Data Capture

[Performance recommendation for Production Environment]
Prepared by: Ashutosh Chandra

Create Date: April 26th, 2016
Version: 1.0
Copyright
Copyright Mark's Work Wearhouse. All rights reserved.
This document is confidential and proprietary to Mark's Work Wearhouse and/or its affiliated or related entities. This document
shall not be duplicated, transmitted, used, or otherwise disclosed, in whole or in part, to anyone other than the organization or
specific individuals to whom this document is delivered on a need-to-know basis. The recipient may only use this
document to assist the recipient in its provision of services to Mark's Work Wearhouse and/or its affiliated or related entities.
These restrictions are applicable to the entirety of this document and all of its constituent parts individually. Mark's Work
Wearhouse reserves the right to require the recipient to return all copies of this document at any time. In the event that the
organization or specific individual to whom this document is given and Mark's Work Wearhouse enter into an agreement
applicable to any of the information contained in this document, in the event of a conflict between such agreement and this
notice, the use, disclosure and return of the copies of this document will be governed by the terms and conditions of such
agreement.
Mark's Work Wearhouse
30, 1035 - 64 Ave. S.E.
Calgary, AB T2H 2J7
TELEPHONE 403.255.9220
FACSIMILE 403.255.6005
WEB SITE WWW.MARKS.COM
Contents
1. About IBM InfoSphere CDC
2. Current specifications for CDC server in Production
3. Major issues faced in CDC so far are as below
4. Resolution/immediate actions taken for above issues
5. Major Factors affecting performance of CDC for Oracle
   A. Disk Space
   B. RAM allocation for each instance
   C. Archive log retention period
   D. Maintaining active TCP connections in a network environment
   E. Local Log reading and Remote log reading
6. Improvement suggestions
1. About IBM InfoSphere CDC

IBM InfoSphere Change Data Capture (InfoSphere CDC) is a replication solution that
captures database changes as they happen and delivers them to target databases,
message queues, or an ETL solution such as InfoSphere DataStage based on table
mappings configured in the InfoSphere CDC Management Console GUI application.
InfoSphere CDC provides low impact capture and fast delivery of data changes for key
information management initiatives including dynamic data warehousing, master data
management, application consolidations or migrations, operational BI, and enabling SOA
projects. InfoSphere CDC also helps reduce processing overheads and network traffic by
only sending the data that has changed. Replication can be carried out continuously or
periodically. When data is transferred from a source server, it can be remapped or
transformed in the target environment.
The following diagram illustrates the key components of InfoSphere CDC.
The key components of the InfoSphere CDC architecture are described below:
Access Server: Controls all of the non-command line access to the replication
environment. When you log in to Management Console, you are connecting to Access
Server. Access Server can be closed on the client workstation without affecting active
data replication activities between source and target servers.
Admin API: Operates as an optional Java-based programming interface that you can
use to script operational configurations or interactions.
Apply agent: Acts as the agent on the target that processes changes as sent by the
source.
Command line interface: Allows you to administer datastores and user accounts, as well
as to perform administration scripting, independent of Management Console.

Communication Layer (TCP/IP): Acts as the dedicated network connection between
the Source and the Target.
Source and Target Datastore: Represents the data files and InfoSphere CDC instances
required for data replication. Each datastore represents a database to which you
want to connect and acts as a container for your tables. Tables made available for
replication are contained in a datastore.
Management Console: Allows you to configure, monitor and manage replication on
various servers, specify replication parameters, and initiate refresh and mirroring
operations from a client workstation. Management Console also allows you to
monitor replication operations, latency, event messages, and other statistics
supported by the source or target datastore. The monitor in Management Console is
intended for time-critical working environments that require continuous analysis of
data movement. After you have set up replication, Management Console can be
closed on the client workstation without affecting active data replication activities
between source and target servers.
Metadata: Represents the information about the relevant tables, mappings,
subscriptions, notifications, events, and other particulars of a data replication
instance that you set up.
Mirror: Performs the replication of changes to the target table or accumulation of
source table changes used to replicate changes to the target table at a later time. If
you have implemented bidirectional replication in your environment, mirroring can
occur to and from both the source and target tables.
Refresh: Performs the initial synchronization of the tables from the source database to
the target. This is read by the Refresh reader.
Replication Engine: Serves to send and receive data. The process that sends replicated
data is the Source Capture Engine and the process that receives replicated data is
the Target Engine. An InfoSphere CDC instance can operate as a source capture
engine and a target engine simultaneously.
Single Scrape: Acts as a source-only log reader and a log parser component. It checks
and analyzes the source database logs for all of the subscriptions on the selected
datastore.
Note: Not all InfoSphere CDC engines use Single Scrape. For InfoSphere CDC for DB2 for
i, there is a Scraper job (that acts as a log reader) and a Mirror job that performs the
function of mirroring (see Mirror above).
Source transformation engine: Processes row filtering, critical columns, column filtering,
encoding conversions, and other data to propagate to the target datastore engine.
Source database logs: Maintained by the source database for its own recovery purposes.
The InfoSphere CDC log reader inspects these in the mirroring process, but filters out
the tables that are not in scope for replication.
Target transformation engine: Processes data and value translations, encoding
conversions, user exits, conflict detections, and other data on the target datastore
engine.
There are two types of target-only destinations for replication that are not databases:
JMS Messages: Acts as a JMS message destination (queue or topic) for row-level
operations that are created as XML documents.
InfoSphere DataStage: Processes changes delivered from InfoSphere CDC that can be
used by InfoSphere DataStage jobs.
2. Current specifications for CDC server in Production.

Server Role  | Server Name (assigned by Server team) | IP (allocated by Server team) | OS      | vCPU | Memory (GB) | Disk Size (GB) | Services
mcdcesb4ap01 | cdc01.pd                              | 10.100.116.59                 | RHL 6.4 |      |             | 150            | 1. Datastage Services Tier; 2. CDC (Change Data Capture) Access Data
mcdcesb4ap02 | cdc02.pd                              | 10.100.116.61                 | RHL 6.4 |      |             | 150            | 1. Datastage Services Tier; 2. CDC (Change Data Capture) Access Data
The CDC servers use the configuration above. The CDC Access Server is installed on these
servers and is used by Access Manager and Management Console. For each subscription,
source and target CDC agents are installed on the respective servers, for example:
- Oracle AIX server (for Oracle)
- iSeries server (for DB2)
- Linux box (for DataStage)
Each instance is configured as a datastore in Access Server and is then used by CDC
subscriptions for replication.
3. Major issues faced in CDC so far are as below:

- Subscription failures due to unavailability of redo logs.
- Subscription failures due to network connection issues between two CDC instances.
- Datastore down due to a memory exception (RAM allocated to the CDC instance on the DB
  server).
- Long-running transactions in Oracle resulting in unavailability of redo logs (archive
  retention period).
- Bugs in the CDC agent version in use (the upgraded version is not in Production).
- Errors archiving redo logs due to insufficient disk space.
- Corrupted bookmark journal entry (DB2 instance).
- Subscription in a hung state and not replicating data.
4. Resolution/immediate actions taken for above issues:

- Recovery of archive logs (this usually takes longer if the logs are backed up to disk/tape).
- Memory cleanup, or addition of memory to the instance having the issue.
- Instance restart to clear out hung connections and to restart the respective datastores.
- Set the bookmark value to the last commit position, mark table capture, and restart
  the subscriptions.
- Clearing of logs, staging store information, and temp files to free memory.
- Increase the retention period of archive logs and increase the global disk space.
- Reset the journal entry for the DB2 instance.
- Changing the source ID/name of subscriptions based on input from IBM.
5. Major Factors affecting performance of CDC for Oracle:

Current instance configuration is attached at the end of this document.
A. Disk Space:
Recommendation from IBM:
CDC source system:
100 GB: Default value for the Staging Store Disk Quota for each instance of InfoSphere
CDC. The minimum is 1 GB. Although the minimum is 1 GB, prepare for more disk
space since there is a staging store on the source. Use the InfoSphere
CDC configuration tool to configure disk space for this quota.
5 GB: For installation files, data queues, and log files.
Global disk quota: Disk space is required on your source system for this quota, which is
used to store in-scope change data that has not been committed in your database.
The amount of disk space required is determined by your replication environment
and the workload of your source database. Use the mirror_global_disk_quota_gb system
parameter to configure the amount of disk space used by this quota.
CDC target system:
1 GB: The minimum amount of disk space allowed for the disk quota for each
instance of InfoSphere CDC. The minimum value for this quota is sufficient for all
instances created on your target system. Use the InfoSphere CDC configuration tool
to configure the disk space for this quota.
5 GB: For installation files, data queues, and log files.
Global disk quota: Disk space is required on your target system for this quota, which is
used to store LOB data received from your InfoSphere CDC source system. The
amount of disk space required is determined by your replication environment and the
amount of LOB data you are replicating. To improve performance, InfoSphere
CDC will only persist LOB data to disk if RAM is not available on your target system.
Use the mirror_global_disk_quota_gb system parameter to configure the amount of disk
space used by this quota.
InfoSphere CDC may require additional disk space in the following situations:
- You are running large batch transactions in the database on your source system.
- You are configuring multiple subscriptions and one of your subscriptions is latent. In
  this type of scenario, InfoSphere CDC on your source system may persist transaction
  queues to disk if RAM is not available.
- You are replicating large LOB data types.
- You are replicating "wide" tables that have hundreds of columns.
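The disk figures above can be combined into a rough per-instance budget. The sketch below is illustrative only: the function name is hypothetical, it assumes the 5 GB for installation files, data queues, and log files is counted once per instance, and the global disk quota value in the example is a made-up input, not a measured value from our environment.

```python
def instance_disk_budget_gb(staging_store_quota_gb=100.0,
                            global_disk_quota_gb=0.0,
                            install_and_logs_gb=5.0):
    """Return a rough per-instance disk budget in GB.

    Defaults follow the figures quoted in this section: 100 GB default
    staging store quota (1 GB minimum) and 5 GB for installation files,
    data queues, and log files. Counting the 5 GB per instance is an
    assumption made for this sketch.
    """
    if staging_store_quota_gb < 1:
        raise ValueError("staging store disk quota has a 1 GB minimum")
    if global_disk_quota_gb < 0 or install_and_logs_gb < 0:
        raise ValueError("disk sizes cannot be negative")
    return staging_store_quota_gb + global_disk_quota_gb + install_and_logs_gb


# Example: 10 GB staging store plus a hypothetical 50 GB global disk quota.
print(instance_disk_budget_gb(staging_store_quota_gb=10, global_disk_quota_gb=50))
```

The global disk quota dominates for workloads with large batch transactions or LOB data, so that input should come from observation of the source database rather than from a default.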
B. RAM allocation for each instance:
Each instance of CDC requires memory for the Java Virtual Machine (JVM). The
following default values for memory are assigned:
1024 MB of RAM: Default value for each 64-bit instance of InfoSphere CDC. The actual
requirement varies with the number of subscriptions and the usage of the instance.
InfoSphere CDC source deployments may require additional RAM in the following
scenarios:
- You are replicating large LOB data types with your InfoSphere CDC source
  deployment. These data types are sent to the target while being retrieved from the
  source database. The target waits until all LOBs (for each record) are received before
  applying a row. LOBs are stored in memory as long as there is adequate RAM;
  otherwise they are written to disk on the target.
- You are replicating "wide" tables with hundreds of columns.
- You are performing large batch transactions in your source database rather than
  online transaction processing (OLTP).
- Multiple subscriptions are using the same datastore/CDC instance.
C. Archive log retention period:

CDC replication is based on the transaction logs created by the database. Enough
archive logs should be retained so that CDC can still refer to them during periods of
frequent log creation or heavy database activity/load.
Archive logs are usually kept on disk and therefore occupy global disk space. Archive
logs must be retained for a minimum of 72 hours. IBM's recommendation is 5 days.
The retention period should be based on the restart position of the instance and/or the
longest-running session in the database. Insufficient retention is a major factor in
CDC subscription failures. Recovering archive logs is a time-consuming process and can
impact the business.
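As a rough illustration of the sizing reasoning above, the following sketch estimates the archive directory space for a given retention period and checks whether a retention period covers the oldest log CDC might still need. The function names, the 20% headroom, and the log generation rate used in the example are assumptions for illustration; the actual rate must be measured on each source Oracle instance.

```python
def archive_disk_needed_gb(log_gb_per_hour, retention_hours=72, headroom=0.2):
    """Estimate archive directory size for a retention period.

    log_gb_per_hour is the average archive log volume (a measured input);
    headroom is an assumed safety margin for load spikes.
    """
    if log_gb_per_hour < 0 or retention_hours <= 0 or headroom < 0:
        raise ValueError("inputs must be non-negative and retention positive")
    return log_gb_per_hour * retention_hours * (1 + headroom)


def retention_covers_restart(retention_hours,
                             longest_transaction_hours,
                             restart_position_age_hours):
    """True if retained logs reach back to the oldest log CDC may need."""
    oldest_needed = max(longest_transaction_hours, restart_position_age_hours)
    return retention_hours >= oldest_needed


# Example with a hypothetical rate of 2 GB of archive logs per hour.
print(archive_disk_needed_gb(2, retention_hours=72))
print(archive_disk_needed_gb(2, retention_hours=120))  # 5 days
print(retention_covers_restart(72, longest_transaction_hours=80,
                               restart_position_age_hours=10))
```

The second function reflects the point that a long-running transaction older than the retention window leaves CDC without the logs it needs, regardless of how much disk is available.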
D. Maintaining active TCP connections in a network environment

If your deployment of CDC is in a network environment that uses a firewall, VPN gateway,
or local system tools to detect idle TCP connections, it may be necessary to configure the
product to prevent these connections from being closed during periods of application
inactivity between the source and target.
By default, InfoSphere CDC sends a message over TCP connections every 20 seconds to
ensure these connections remain active during periods of inactivity. If your network policies
close TCP connections for idle periods of less than 20 seconds, you must change the
configuration of each instance of InfoSphere CDC to ensure the TCP connections remain
open.
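The rule above reduces to a comparison between the network's idle timeout and the keep-alive interval. A minimal sketch follows; the function names are hypothetical, and the "half the idle timeout" suggestion is an assumed rule of thumb, not an IBM recommendation. The actual setting is changed per instance through the product's own configuration.

```python
DEFAULT_KEEPALIVE_SECONDS = 20  # default interval stated in this section


def keepalive_needs_tuning(network_idle_timeout_s,
                           keepalive_interval_s=DEFAULT_KEEPALIVE_SECONDS):
    """True if the network drops idle connections before a keep-alive is sent."""
    if network_idle_timeout_s <= 0 or keepalive_interval_s <= 0:
        raise ValueError("timeouts must be positive")
    return network_idle_timeout_s < keepalive_interval_s


def suggested_keepalive_seconds(network_idle_timeout_s):
    """Assumed rule of thumb: half the idle timeout, at least 1 second."""
    if network_idle_timeout_s <= 0:
        raise ValueError("timeout must be positive")
    return max(1, network_idle_timeout_s // 2)


# Example: a firewall that closes connections idle for 15 seconds.
print(keepalive_needs_tuning(15), suggested_keepalive_seconds(15))
```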
E. Local Log reading and Remote log reading
CDC supports both local and remote log reading. By default, the product is
configured to read both online redo log files and archived log files. This provides
low-latency replication, because the online log is continuously written by Oracle and
read by the log reader of InfoSphere CDC for Oracle databases. However, the product can
also be configured to read archive log files only.
6. Improvement suggestions:
- Increase the archive log retention period to a minimum of 72 hours for source Oracle
  instances (IBM's recommendation is 5 days). The disk allocated to the archive
  directory would need to be increased to accommodate 72 hours of archive logs.
- Increase the staging area disk quota to a minimum of 10 GB wherever it is less than 10 GB.
- MDR is being used as both a source instance and a target instance, creating additional
  overhead for the datastore. We can create a new instance to be used as the source
  instance for subscriptions reading from MDR. In the end we will have 19 subscriptions
  using MDR as source and 39 subscriptions writing to MDR from various source databases.
- Limit the number of subscriptions to 40 per datastore/instance. A new instance will be
  created if more subscriptions are needed.
- For all source Oracle instances, increase the memory allocation to 4 GB where it is less
  than that.
- Analyze network bandwidth to avoid network contention and connection issues.
- Limit the number of replicating tables to 20 per subscription. If this is exceeded,
  create a new subscription to accommodate the additional tables.
- We can propose a separate AIX box for all CDC instances and use remote log reading.
  This will reduce the dependency on the Oracle server.
- Propose the use of Continuous Capture. Continuous Capture is a product feature that is
  designed to accommodate those replication environments in which it is necessary to
  separate the reading of the database logs from the transmission of the logical
  database operations. This is useful when you want to continue processing log data
  even if replication and your subscriptions stop due to issues such as network
  communication failures over a fragile network, target server maintenance, or some
  other issue. You can enable or disable Continuous Capture without stopping
  subscriptions.
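The two proposed limits (40 subscriptions per datastore/instance and 20 tables per subscription) can be checked with simple arithmetic. The sketch below uses hypothetical helper names and placeholder table names. The MDR figures are the ones quoted above: 19 + 39 = 58 subscriptions on a single instance would exceed the proposed limit of 40, which is consistent with splitting MDR into separate source and target instances.

```python
MAX_SUBSCRIPTIONS_PER_INSTANCE = 40  # proposed limit from this section
MAX_TABLES_PER_SUBSCRIPTION = 20     # proposed limit from this section


def split_tables(tables, max_tables=MAX_TABLES_PER_SUBSCRIPTION):
    """Split a table list into groups of at most max_tables, one per subscription."""
    return [tables[i:i + max_tables] for i in range(0, len(tables), max_tables)]


def instances_needed(subscription_count,
                     max_subscriptions=MAX_SUBSCRIPTIONS_PER_INSTANCE):
    """Number of instances needed to stay within the subscription limit."""
    if subscription_count < 0:
        raise ValueError("subscription count cannot be negative")
    return -(-subscription_count // max_subscriptions)  # ceiling division


# MDR today: 19 subscriptions as source plus 39 as target on one instance.
print(instances_needed(19 + 39))

# Placeholder table names: 45 tables need three subscriptions of 20, 20 and 5.
groups = split_tables([f"TABLE_{n}" for n in range(45)])
print([len(g) for g in groups])
```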
7. Current configuration with recommended value:

We are currently working with the AIX team to get the missing values.
Attachment: CDC recommendations.xlsx
