You are on page 1of 18

BIK Information Services, Inc.

Oracle's Change Data Capture


By Boris Knizhnik (borisk@bikinfo.com) To the guys who dared to grant me DBA_ROLE Dave LaPoint, Jim Kirsch, and Ran Pan
This article in slightly different form appeared in Pinnacle - in May issue of 2003 and in SELECT journal - in second quarter, 2004.

We appreciate any links to this site as they show your appreciation and work as free ads. If you decide to include a link to our site on your site, please drop us a note.

TABLE OF CONTENTS
Setting up Subscriptions Setting up a Subscriber Creating and activating a subscription Extending the window Extend window and create a view Processing change records Source_colmap$ Dropping views and purging the window Review Clean-up Conclusion References

Introduction
Oracle Change Data Capture, or CDC, is one of the latest features of Oracle 9i to support Data Warehousing. To set the record straight - CDC is NOT a Data Warehousing solution - it is just a tool to help manage data changes. The most typical need for Data Warehousing is the need to know what data has been changed since the last load. There are other uses of this feature outside of Data Warehousing - daily activity reports against the transactional database, for instance. Unless the online database of record contains timestamps, it is virtually impossible to find out which data has changed since last time you ran the refresh job. The usual alternatives are to either load the entire database every time you need to refresh, or use scripts using various business rules to determine which data has changed. Ad-hoc methods have their shortfalls and can't compete with a more direct way of just knowing what has changed. CDC was designed to solve this problem. Unfortunately, Oracle documentation is inadequate in explaining the overall process. Individual parts are explained, but it is hard to piece the entire process together. So, in this document we are trying to explain how various pieces of CDC fit together. Installation problems aren't addressed here.

Basic Concept
The basic concept behind CDC is simple. For each table that we wish to track changes, we create an INSERT, DELETE and UPDATE trigger that stores the updated information into a specified table. The key here is that CDC is transparent and automatic. Using CDC, the source system DBA doesn't have to do much to set it up and it works on a known technology of triggers. In CDC, for EACH table involved we have to create a table to keep track of the changes to the original table. This results in a group of tables with changes only. The developer or DBA then can decide what to do with these changes. This is very critical to understand. CDC doesn't provide developers with any tools to handle changes, but it provides them with a clean cut (point in time, committed transactions) set of tables containing changes to each corresponding source table. What to do with these changes - is up to developers. Third part tools maybe used here to go against the change tables or custom scripts written to go through changes and make adjustments to the datawarehouse fact and dimension tables.

General comments
Even though strictly speaking CDC process may all be done within one instance, it is easier to explain it assuming that the data source is in one instance and the "subscriber" is in a different instance. CDC is using publish - subscribe pattern.

Preparations
There are a few tedious processes involved in setting up the CDC process. They include the creation of change tables that most likely will be very similar to the original ones. If you din't want to spend most of your time working on typos resulting in infamous Oracle message "invalid column", we recommend using the file etl_util.sql - it contains a little package we have designed to help automate this process. There are two functions in this package that we need at the moment: get_cols_definition and get_col_names. The first one (get_cols_definition) creates a string that is usually used to define a table:
ABC VARCHAR2(10),BCD NUMBER(20,2), CDE DATE . . .

It creates this list based on the table passed to the function call as a source. By default, it creates the entire list of columns but there is a parameter that allows you to skip several columns in the table definition. The other function (get_col_names) creates a list of column names, again, based on the source table:
ABC, BCD, CDE, . . .

There is also a possibility to skip a few columns. Both functions list columns in alphabetical order. Compile this package as 'system' and give it a public synonym - to make it callable from any schema. Armed with these two functions, let's see how we can create one coherent example of setting up CDC:

Actors

In my opinion, part of the confusion about CDC is that no one makes clear who does which part of the setup. Let's settle it now. We will be working with the notorious tables emp and dept owned by SCOTT. We need three more actors in this setup - dba (SYSTEM), a publisher (BORIS_PUBLISHER) and a subscriber (BORIS_SUBSCRIBER). For each step, we will try to specify exactly who is running it.

Setting up publications
Setting up a publisher In addition to the regular set of privileges and grants, BORIS_PUBLISHER has to be granted EXECUTE_CATALOG_ROLE and SELECT_CATALOG_ROLE privileges. So our dba SYSTEM connects to the source instance and grants these privileges to BORIS_PUBLISHER:
connect system/manager@whatever GRANT EXECUTE_CATALOG_ROLE to boris_publisher; GRANT SELECT_CATALOG_ROLE to boris_publisher;

Now, a publisher has to have read access to all the tables. So, now as SCOTT (the table owner) we have to grant the publisher permissions to the source tables:
connect scott/tiger@whatever GRANT SELECT on emp to boris_publisher; GRANT SELECT on DEPT to boris_publisher;

Creating Change Tables To publish the tables a publisher basically has to choose which columns of the source tables he/she wants to publish and create a table that would contain just these columns. I am not sure why simply listing the column names isn't enough, but Oracle requires specifying the column names along with data type and length. I guess one can run into a situation where the original table would contain the column VARCHAR2(10) while in the change table one would want to see it as VARCHAR2(12), but if you just need the exact replica of each column you can use our little helper package. Consider the following script (you can also find it at etl_publish.sql)
Create Change Tables connect boris_publisher/boris_publisher@whatever set serveroutput on size 1000000 set linesize 120 /* Login as a publisher to run this */ BEGIN DECLARE work_sql VARCHAR2(32767); v_publisher_id VARCHAR2(20) := 'boris_publisher'; v_source_schema VARCHAR2(20) := 'scott'; v_source_table VARCHAR2(100); v_cdc_table VARCHAR2(100); CURSOR c_tables IS SELECT 'emp' table_name FROM dual UNION SELECT 'dept' table_name FROM dual ; BEGIN FOR tables IN c_tables LOOP v_source_table := tables.table_name; v_cdc_table := 'CDC_'||v_source_table; work_sql := etl_util.get_cols_definition(

v_source_schema,v_source_table, '',''); DBMS_LOGMNR_CDC_PUBLISH.CREATE_CHANGE_TABLE ( OWNER => v_publisher_id, SOURCE_SCHEMA => v_source_schema ,CHANGE_SET_NAME => 'SYNC_SET', CAPTURE_VALUES => 'both' ,RS_ID => 'y', ROW_ID => 'n', USER_ID => 'y' , TIMESTAMP => 'y' ,OBJECT_ID => 'n' -- leave it as 'N' or you will have -- "table has no columns" error ,SOURCE_COLMAP => 'y', TARGET_COLMAP => 'y' ,OPTIONS_STRING => null ,SOURCE_TABLE => v_source_table ,CHANGE_TABLE_NAME => v_cdc_table ,COLUMN_TYPE_LIST => work_sql); DBMS_output.put_line('Change table '||v_cdc_table ||' was created successfully'); END LOOP; EXCEPTION WHEN OTHERS THEN DBMS_output.put_line( 'Error start ********************************************'); DBMS_output.put_line('Error during change table ' || v_cdc_table || ' creation:'); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm); DBMS_output.put_line( 'Error end ********************************************'); END; END; /

Let's talk about this publishing script. It consists of a loop that creates a change table for each of the two tables we are setting up this feature for. Of course, instead of
CURSOR c_tables IS SELECT 'emp' table_name FROM dual UNION SELECT 'dept' table_name FROM dual

you could have defined a view that contains all the tables for which you need to set up your CDC, or as the other extreme manually cut and paste similar code for each table. We tried to set it up to minimize manual labor. Now, let's look at the line:
work_sql := etl_util.get_cols_definition(v_source_schema,v_source_table, '','');

What we get in variable work_sql is something like this:


comm number(7,2), deptno number(2), empno number(4), ename varchar2(10) , hiredate date, job varchar2(9), mgr number(4), sal number(7,2)

The package creates this string for you to save you some aggravation. Now, if you didn't want to include columns hiredate and job in a list of columns to publish you would call the function like this:
work_sql := etl_util.get_cols_definition(v_source_schema,v_source_table, 'hiredate,job','');

Here we chose to name change tables as CDC_{original_table_name}. You are free to choose your own naming convention. Parameter CAPTURE_VALUES => 'both' means that you want to capture both values before and after change (you will have two records in your change tables). Parameters SOURCE_COLMAP => 'y', TARGET_COLMAP => 'y' mean that in your change tables you want to include special columns that would allow you to figure out which of the columns have actually been changed. If you look at the change table definitions, you will see that in addition to the columns requested, our tables have several additional columns.

describe cdc_emp Name Type ----------------- -----------OPERATION$ CHAR(2) CSCN$ NUMBER COMMIT_TIMESTAMP$ DATE RSID$ NUMBER USERNAME$ VARCHAR2(30) TIMESTAMP$ DATE SOURCE_COLMAP$ RAW(128) TARGET_COLMAP$ RAW(128) COMM NUMBER(7,2) DEPTNO NUMBER(2) EMPNO NUMBER(4) ENAME VARCHAR2(10) HIREDATE DATE JOB VARCHAR2(9) MGR NUMBER(4) SAL NUMBER(7,2)

describe cdc_dept Name Type ----------------- -----------OPERATION$ CHAR(2) CSCN$ NUMBER COMMIT_TIMESTAMP$ DATE RSID$ NUMBER USERNAME$ VARCHAR2(30) TIMESTAMP$ DATE SOURCE_COLMAP$ RAW(128) TARGET_COLMAP$ RAW(128) DEPTNO NUMBER(2) DNAME VARCHAR2(14) LOC VARCHAR2(13)

What exactly will these change table track? Whenever the source table is changed one or two records are inserted in its corresponding change table. For Delete operation operation$ column will contain 'D' and the columns published by a publisher will contain the values of the deleted row. For Insert operation$ column will contain 'I' and published columns will contain new values. Updated rows will generate a record with operation$='UU' (if you chose to capture both - old and new values) and all the old values, and another record with operation$='UN' or 'UL' with new values. Note, that these records for update will be generated even if a publisher didn't choose to publish columns that were changed (for instance, if publisher didn't publish column job, and the program updated only job column the record would still be inserted in the change table). So it would appear that old and new records are the same. Technically the target_colmap and source_colmap should help in figuring out what exactly has been changed, but there is no official documentation on how these columns work. Also, another shortcoming in CDC logic is that even if the program changes values of columns to the same ones it still considers them "changes" and creates a record in change tables. Our guess is that internal processing logic relies on a list of columns in UPDATE statements rather than really on change. Basically, we are done with what a publisher does to set everything up. There are other more complicated situations, which I will not get into now - for instance, a publisher may want to set up several publishing scenarios for the same table - publish all the columns in one change table and only few columns in another change tables for the same source table. This may be a part of elaborate security implementation. Again, once the basics are understood one can easily find the details in documentation - see References. Let's look what we have accomplished: We as publishers have just set up the change tables for all (well, all two) our tables - promising to publish changes to the original tables. The change tables belong to a publisher's schema (which doesn't have to be the same as the owner's schema) and reside within the same instance as the source. Assuming that the subscribers belong to a different instance this setup isolates change tables from connectivity problems. Whatever happens to the database links the changes keep being collected. It also gives publisher's better control over data.

NOTE: As soon as the change table is created it starts collecting the changes.

Setting up Subscriptions
Setting up a subscriber The subscriber BORIS_SUBSCRIBER has to be granted select permissions to source and change tables:
CONNECT scott/tiger@whatever GRANT SELECT ON emp TO boris_subscriber; GRANT SELECT ON dept TO boris_subscriber; CONNECT boris_publisher/boris_publisher GRANT SELECT ON cdc_emp TO boris_subscriber; GRANT SELECT ON cdc_dept TO boris_subscriber;

Note, that one would think that subscriber doesn't need permissions to the source tables, but Oracle security is such that unless the subscriber is allowed access to the source tables he/she won't be able to access the change tables. Note also, that the subscriber is the user on the publisher's machine. If the real subscriber is on a different instance of Oracle boris_subscriber would be the one used for a dblink. Creating and activating a subscription The fact that a publisher published a few tables doesn't automatically mean that the subscriber will need them all. Moreover, there could be several subscribers potentially subscribing to different sets of tables from the published set. And to add more complexity to the situation even when the subscriber chooses a specific table he/she may not want to see all the columns that a publisher defined. So, the subscriber's goal is to indicate which tables and which columns in them he/she is interested in. Since potentially there could be many tables involved and various lists of columns Oracle decided to give the subscriber a way to collectively manage all the table/column combinations he/she is working with. This is done through something called HANDLE. The idea is to request and receive a handle to your subscription first, and then tie all your table/column combinations to this handle so you have something tangible to work with. When you create a handle, you can specify a description. We find description to be much more useful for purposes of working with CDC than using a handle simply because it is a text rather then a number and when you need to drop subscription and start over the handle will change, but the description won't (you are the one assigning it a value). We will demonstrate this a bit later. But let's see what we need to do to subscribe to all the columns of both tables: EMP and DEPT. Consider the script below (you can also find it online at etl_subscribe.sql) Please note that now you have to be a subscriber to run this.
Subscribe all the tables, activate subscription, extend CDC window connect boris_subscriber/boris_subscriber@whatever set serveroutput on size 1000000 set linesize 120 /* Login as a subscriber to run this */ BEGIN DECLARE vSubhandle NUMBER; v_subscription_description VARCHAR2(40):='scott -> Datawarehouse'; v_publisher_id VARCHAR2(20) := 'BORIS_PUBLISHER'; v_source_schema VARCHAR2(20) := 'SCOTT'; v_subscriber_id VARCHAR2(20) := 'BORIS_SUBSCRIBER'; v_source_table VARCHAR2(100);

v_cdc_table VARCHAR2(100); col_names VARCHAR2(32767); v_err VARCHAR2(10):='OK'; CURSOR c_tables IS SELECT 'emp' table_name FROM dual UNION SELECT 'dept' table_name FROM dual ; BEGIN -- find/create a subscription BEGIN vSubhandle :=0; -- get the handle SELECT handle INTO vSubhandle FROM all_subscriptions WHERE description = v_subscription_description; EXCEPTION WHEN OTHERS THEN vSubhandle := 0; dbms_output.put_line('Subscription ' ||v_subscription_description || ' does not exist in the database'); END; BEGIN IF (vSubhandle = 0) THEN -- create handle DBMS_CDC_SUBSCRIBE.GET_SUBSCRIPTION_HANDLE( CHANGE_SET => 'SYNC_SET', DESCRIPTION => v_subscription_description, SUBSCRIPTION_HANDLE => vSubhandle ); dbms_output.put_line('Obtained handle ' ||TO_CHAR(vSubhandle) || ' for subscription '||v_subscription_description ); END IF; EXCEPTION WHEN OTHERS THEN DBMS_output.put_line( 'Error start ***********************************'); DBMS_output.put_line( 'Error getting a handle ' || ' for a CDC subscription ' || v_subscription_description|| ':'); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm || ' Handle = ' ||TO_CHAR(vSubhandle)); DBMS_output.put_line( 'Error end ***********************************'); v_err :='ERROR'; END; -- Subscribe for all the tables FOR tables IN c_tables LOOP v_source_table := UPPER(TRIM(tables.table_name)); BEGIN IF (v_err = 'ERROR') THEN RETURN; END IF; v_cdc_table := 'CDC_'||v_source_table; col_names := etl_util.get_col_names( v_source_schema,v_source_table, '',''); DBMS_LOGMNR_CDC_SUBSCRIBE.SUBSCRIBE (vSubhandle , v_source_schema, v_source_table, col_names); DBMS_output.put_line('Added table '|| v_source_schema ||'.' || v_source_table || ' to the CDC subscription ''' || v_subscription_description || ''' successfully. Handle = '||TO_CHAR(vSubhandle)); EXCEPTION WHEN OTHERS THEN

DBMS_output.put_line( 'Error start ***********************************'); DBMS_output.put_line('Error adding table ' || v_source_schema||'.'||v_source_table || ' to the CDC subscription ''' || v_subscription_description || ''':'); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm || ' Handle = ' ||TO_CHAR(vSubhandle)); DBMS_output.put_line( 'Error end ***********************************'); v_err :='ERROR'; END; END LOOP; BEGIN -- Activate the subscription IF (v_err = 'ERROR') THEN RETURN; END IF; DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(vSubhandle); DBMS_output.put_line('CDC subscription ''' || v_subscription_description ||''' was successfully activated. Handle = ' ||TO_CHAR(vSubhandle)); EXCEPTION WHEN OTHERS THEN DBMS_output.put_line( 'Error start ***********************************'); DBMS_output.put_line( 'Error activating the CDC subscription ''' || v_subscription_description || ''':'); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm || ' Handle = ' ||TO_CHAR(vSubhandle)); DBMS_output.put_line( 'Error end ***********************************'); v_err :='ERROR'; END; -- Extend the window immediately BEGIN -- Extend the window for subscription IF (v_err = 'ERROR') THEN RETURN; END IF; DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(SUBSCRIPTION_HANDLE=>vSubhandle); DBMS_output.put_line('Window to the CDC subscription ''' || v_subscription_description || ''' was successfully extended'); EXCEPTION WHEN OTHERS THEN DBMS_output.put_line( 'Error start ***********************************'); DBMS_output.put_line( 'Error during window extention to CDC subscription ''' ||v_subscription_description || ''':'); DBMS_output.put_line('Handle # '||TO_CHAR(vSubhandle)); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm); DBMS_output.put_line( 'Error end ***********************************'); v_err :='ERROR'; END; END; END; /

Here is the output:


Added table SCOTT.DEPT to the CDC subscription 'scott -> Datawarehouse' successfully. Handle = 87 Added table SCOTT.EMP to the CDC subscription 'scott -> Datawarehouse' successfully. Handle = 87 CDC subscription 'scott -> Datawarehouse' was successfully activated.

Handle = 87 Window to the CDC subscription 'scott -> Datawarehouse' was successfully extended

Again, let's talk about the script above. The first step gets a new subscription handle and associates the description with it (more on that later). The next two steps just add tables to the subscription. The fact that we added tables doesn't automatically mean that everything starts working. You need to activate the subscription. This is done in the third step here. You could do this at a later time, but then you would need to use the same handle you created for the subscription. In the next piece of code we will show you the way we propose to do that, though it is certainly not the only one. In our demonstration case we just don't see any reason not to activate the subscription right away. If you no longer want to process data you need to drop the subscription. Now, there is a little logistical problem here - let's say we processed data in a change table. It may take some time. In the meantime new records could have been stored in this change table. You have to make sure not to see any new records that may come in the middle of your processing. Plus, there is another side to it - let's say everything went smoothly and you have processed the records. Next time your processing program kicks in you may have a few more records in those tables. How are you going to tell the old processed records from the new ones? In a minute we are going to explain the way to handle it, but the last step with an mysterious call to Extend_window procedure belongs to this logic. We don't see why we shouldn't run the "Extend window" step right away, which is why it is included in this script. You can organize the sequence in a different way, but remember that you always have to mention the same handle value you got when created the subscription. Important: Oracle documentation leaves the impression that you have to get a new handle every time you want to do something in CDC. But the problem is that if you call get_handle again you will get a different value, the one to which you never tied your subscription (get_handle method is apparently just a call to a brainless sequence generator). So whatever else you want to do - activate, extend window, create view, etc. all these steps will fail, because you don't refer to the same subscription. There is also another caveat - even if you use the same value, but have logged in as somebody other then the one who initially got that handle (subscriber), the system will behave exactly in the same manner - as if you supplied a wrong handle.

Extending the window


This step seems to be the most confusing part of the entire CDC process, because the name of this procedure is somewhat misleading. You hear "extending" and you want to know the time period through which you are extending the window and can't find it. The reason for that is that this procedure doesn't do that at all. By contrast, it does just the opposite - it marks the end of a data collection cycle for the view that you will need to create in the next step, and while it still allows the new data to be collected in the change table, they are "invisible" to your code. Here, the basic idea is this: you will be accessing the change data only through the view. This view is created based on System Change Numbers (SCN) collected in change tables. When you request a view creation, the system creates a view for you, which includes all the change records collected between the last two extend_vindow calls. That is why we are calling the first extend_window call now.

More on this further.

Extend window and create a view


What we were doing so far is supposed to be done only once. Now we are getting into a cyclical part. You may remember we mentioned in the beginning that CDC process is NOT a datawarehouse solution, but rather lays a nice ground work for it. So, at this point on one end Publisher created change tables and is constantly collecting change records. On another end Subscriber specified which of these tables he/she is interested in. It is time now to somehow start processing change records. Reading change tables directly is not recommended by Oracle, because the tables are not stable - number of records keeps growing while your datawarehouse process reads these records. The solution is to create views that give you a fixed set of records for each underlying change table. After your datawarehouse script finishes processing records you may drop this view and when you are ready for a next cycle the process begins again. To create the view you have to first issue an extend_window call and then a create_view call. That way the create_view call would include only records that were posted from your previous call to extend_window until your last call to extend_window. Remember, we made a call to extend_window right after creating the subscription? So, if first time you run your cycle you call extend_window and then create_view the first capture would be from the time you activated your subscription to this first run. Next time, it would be from the first run to the second run, etc. Take a look at the script below (and also etl_extend_create_views.sql)
Extend substription window and create CDC views connect boris_subscriber/boris_subscriber@whatever set serveroutput on size 1000000 set linesize 200 /* Login as a subscriber to run this */ BEGIN DECLARE vSubhandle NUMBER; v_subscription_description VARCHAR2(40):='scott -> Datawarehouse'; v_publisher_id VARCHAR2(20) := 'BORIS_PUBLISHER'; v_source_schema VARCHAR2(20) := 'SCOTT'; v_subscriber_id VARCHAR2(20) := 'BORIS_SUBSCRIBER'; v_source_table VARCHAR2(100); v_cdc_table VARCHAR2(100); v_cdc_view_name VARCHAR2(40); our_view_name VARCHAR2(200); vSQL VARCHAR2(1000); CURSOR c_tables IS SELECT 'emp' table_name FROM dual UNION SELECT 'dept' table_name FROM dual ; BEGIN -- extend the window BEGIN -- get the handle SELECT handle INTO vSubhandle FROM all_subscriptions WHERE description = v_subscription_description; -- Extend the window for subscription DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(SUBSCRIPTION_HANDLE=>vSubhandle); DBMS_output.put_line('Window to the CDC subscription ''' || v_subscription_description || ''' was successfully extended'); EXCEPTION WHEN OTHERS THEN DBMS_output.put_line( 'Error start *************************************');

Error start

);

DBMS_output.put_line( 'Error during window extention to CDC subscription ''' ||v_subscription_description || ''':'); DBMS_output.put_line('Handle # '||TO_CHAR(vSubhandle)); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm); DBMS_output.put_line( 'Error end *************************************'); END; -- create subscriber views/synonyms BEGIN FOR tables IN c_tables LOOP -- create a view v_source_table := UPPER(TRIM(tables.table_name)); BEGIN v_cdc_table := 'CDC_'||v_source_table; v_cdc_view_name:=v_cdc_table|| '_vw'; DBMS_CDC_SUBSCRIBE.PREPARE_SUBSCRIBER_VIEW( SUBSCRIPTION_HANDLE=> vSubhandle ,SOURCE_SCHEMA => v_source_schema ,SOURCE_TABLE => v_source_table ,VIEW_NAME => our_view_name); DBMS_output.put_line('Subscriber view ''' ||our_view_name|| ''' was successfully created for table ' || v_source_schema||'.'||v_source_table); EXCEPTION WHEN OTHERS THEN DBMS_output.put_line( 'Error start **********************************'); DBMS_output.put_line( 'Error during creation of subscriber view' || for source table ' ||v_source_table||': '); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm || ' Handle #'||TO_CHAR(vSubhandle)); DBMS_output.put_line( 'Error end **********************************'); END; -- drop the previous synonym BEGIN vSQL := 'DROP SYNONYM ' || v_cdc_view_name; EXECUTE IMMEDIATE vSQL; EXCEPTION WHEN OTHERS THEN NULL; END; -- create a private synonym to point to this view: BEGIN vSQL := 'CREATE SYNONYM ' || v_cdc_view_name ||' FOR '|| our_view_name; EXECUTE IMMEDIATE vSQL; DBMS_output.put_line('Private synonym ''' || v_cdc_view_name || ''' for view ''' || our_view_name || ''' was successfully created.'); EXCEPTION WHEN OTHERS THEN DBMS_output.put_line( 'Error start ******************************'); DBMS_output.put_line( 'Error during creation of the synonym ''' || v_cdc_view_name || ''' for view ''' || our_view_name || ''': '); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm); DBMS_output.put_line( 'Error end ******************************'); END; END LOOP; END; END; END; /

When you run this you script you will get output similar to this:
Window to the CDC subscription 'scott -> Datawarehouse' was successfully extended Subscriber view 'CDC#CV$8757846' was successfully created for table SCOTT.DEPT Private synonym 'CDC_DEPT_vw' for view 'CDC#CV$8757846' was successfully created. Subscriber view 'CDC#CV$8757848' was successfully created for table SCOTT.EMP Private synonym 'CDC_EMP_vw' for view 'CDC#CV$8757848' was successfully created.

A couple of words about this script: As you can see, the view names that CDC system assigns to you aren't good to be used in your PL/SQL or Java program. So we propose to assign private synonyms to the views assigned by CDC. That way when you run this script on subsequent occasions you will still have the same view name to work with. If you happen to look at the views you just created you will see something like this:
CREATE OR REPLACE VIEW CDC#CV$8757846 ( OPERATION$, CSCN$, COMMIT_TIMESTAMP$, TIMESTAMP$, USERNAME$, TARGET_COLMAP$, SOURCE_COLMAP$, RSID$, DEPTNO, DNAME, LOC ) AS SELECT OPERATION$, CSCN$, COMMIT_TIMESTAMP$, TIMESTAMP$, USERNAME$, TARGET_COLMAP$, SOURCE_COLMAP$, RSID$, "DEPTNO", "DNAME", "LOC" FROM "BORIS_PUBLISHER"."CDC_DEPT" WHERE CSCN$ >= 40802127 AND CSCN$ <= 41013754 WITH READ ONLY

Condition WHERE CSCN$ >= 40802127 AND CSCN$ <= 41013754 is really the whole point of this view - these two CSCN$ values are collected as you issue an EXTEND_WINDOW call and this is how CDC keeps your transactions separate: The ones you processed previously have CSCN$ < 40802127 (or whatever it will be in your case) and your new ones, that are posted to CDC_dept change table will have CSCN$ > 41013754. It is really that simple. Also, this is the first time we really need to call a handle that we created in previous runs. Oracle leaves it up to you to somehow know the handle you need. We think selecting it by trying to match the description is a good way of doing it (assigning the subscription ID would probably be a better way, but this is not an option Oracle gives you).

Processing change records


This is really where you write your own code to do whatever you want to do with the change transactions. There are several comments to make here. First of all - remember that if you are using CDC feature you get only table by table changes. If you were thinking of joins of your CDC views to update your resulting tables - you better think again. For example if your target database table contains emlpoyee information and in addition to dept_no you decided to keep dept_name in it, you cannot do the join of cdc_emp_vw and cdc_dept_vw to get the name, because when they update an employee, they don't have to update department records and therefore you won't have a matching dept record in your cdc_dept_vw. You may need to go to the real table to do that. Another comment is that most likely your target database will be different from the source so you may need to establish a dblink, and a set of synonyms on your target environment to access the CDC views on your source machine. And the last comment is that using CDC gives you an interesting possibility to do the initial load using the same set of scripts that you will have for going against CDC views. Just consider the following:
CREATE VIEW CDC_EMP_VW AS SELECT'I' operation$, 1 cscn$, SYSDATE commit_timestamp$, 1 rsid$ , 'initial_load' username$, SYSDATE timestamp$, HEXTORAW('FEFFFFFF)' SOURCE_COLMAP$ , HEXTORAW('FEFFFFFF') TARGET_COLMAP$ , t.* FROM emp t;

e p t;

This view looks exactly as the matching view that we built using the CDC feature. Now, granted, it requires some more work to set up (for instance, the code above is really a view, whereas CDC_EMP_VW was really a synonym to a view with a weird name), but potential savings on not doing double work are well worth considering. The data in CDC views (real ones) has different operation$ codes - UU (old values), UN or UO (new values), 'I' (inserted values), and 'D' (deleted values). You can build your logic accordingly. Important: Don't forget when processing records from the views order them by CSCN$. This is the only way to ensure that your script will read "UN" after "UU" for the same changed record.

Source_colmap$
Both SOURCE_COLMAP$ and TARGET_COLMAP$ columns are left completely unexplained by Oracle. Since it is undocumented and therefore is not supported by Oracle please take the rest of this paragraph for what it's worth. These columns are supposed to have information about which columns have been changed in update statement. We couldn't figure out the TARGET_COLMAP$ column, but received some encouraging results about SOURCE_COLMAP$ column. In all_tab_columns view for the source table there is a column column_id. It represents the column number in a table definition. It seems Oracle uses a bit map to present the columns starting not with bit 0 but with bit 1.
Column id 1 2 3 4 5 6 7 8 9 ... 16 10 100 1000 10000 100000 1000000 10000000 00000000 00000001 00000000 00000010 ... Binary Value Hex Value 2 4 8 10 20 40 80 00 01 00 02 ...

00000001 00000000 00000000 01 00 00

The complications come when you display the columns in hexadecimal format. Apparently Oracle's inner presentation of the values is as set of binary words (two bytes) and they are usually for historical reasons reversed in memory presentation - the least significant byte comes first and the most significan byte follows. So in our case should the columns 2 and 3 were changed the SOURCE_COLMAP$ would be displayed as 0C (1002 + 10002 = 11002=0Chex). Change in columns 2, 3 and 5 would be displayed as 2C (1002+10002+1000002=1011002=2Chex). But when it gets to column 8 we get something that looks weird even though it makes sense: change in column 2, 3, and 8 will result in code: 0C01hex - the number spils to the byte to the right, rather than what one would expect - towards more significant digits. Anyway, we are including the third function in package etl utils get modified names which for a given table and SOURCE COLMAP$

package etl_utils, get_modified_names, which for a given table and SOURCE_COLMAP$ will build a list of column names that were changed. The functions is included for illustration purposes only (The idea of using the all_tab_columns may be not a bad one, but doing it dynamically is not exactly a performance enhancer). Another comment is that while learning what columns have been changed maybe important to your script, doing it using SOURCE_COLMAP$ may not give you correct results anyway, since Oracle doesn't check whether values really changed. It grabs columns that were mentioned in the UPDATE statement even if this statement is assigning the same values back.

Dropping views and purging the window


Now that you have run your update scripts, it's time to finish the loop. Remember, we started the loop by extending the window and creating cdc views. Now we need to drop these views, so nothing will run against them, necessitating us to purge the data in the change tables that correspond to those views. Here is the script (and also at etl_drop_views.sql):
Drop CDC Views and Purge CDC Window connect boris_subscriber/boris_subscriber@whatever set serveroutput on size 1000000 set linesize 200 /* Login as a subscriber to run this */ BEGIN DECLARE vSubhandle NUMBER; v_subscription_description VARCHAR2(40):= 'scott -> Datawarehouse'; v_publisher_id VARCHAR2(20) := 'BORIS_PUBLISHER'; v_source_schema VARCHAR2(20) := 'SCOTT'; v_subscriber_id VARCHAR2(20) := 'BORIS_SUBSCRIBER'; v_source_table VARCHAR2(100); v_cdc_table VARCHAR2(100); v_cdc_view_name VARCHAR2(40); our_view_name VARCHAR2(200); vSQL VARCHAR2(1000); v_err VARCHAR2(10):='OK'; CURSOR c_tables IS SELECT 'emp' table_name FROM dual UNION SELECT 'dept' table_name FROM dual ; BEGIN -- find a subscription BEGIN vSubhandle :=0; -- get the handle SELECT handle INTO vSubhandle FROM all_subscriptions WHERE description = v_subscription_description; EXCEPTION WHEN OTHERS THEN vSubhandle := 0; dbms_output.put_line('Subscription ' ||v_subscription_description || ' does not exist in the database' ); END; FOR tables IN c_tables LOOP v_source_table := UPPER(TRIM(tables.table_name)); BEGIN v_cdc_table := 'CDC_'||v_source_table; v_cdc_view_name:=v_cdc_table || '_vw'; -- drop the synonym

BEGIN vSQL := 'DROP SYNONYM ' || v_cdc_view_name; EXECUTE IMMEDIATE vSQL; EXCEPTION WHEN OTHERS THEN NULL; END; -- drop the subscriber view DBMS_CDC_SUBSCRIBE.DROP_SUBSCRIBER_VIEW( SUBSCRIPTION_HANDLE=> vSubhandle ,SOURCE_SCHEMA => v_source_schema , SOURCE_TABLE => v_source_table); DBMS_output.put_line('Subscriber View for CDC table ''' || v_cdc_table || ''' was successfully dropped. Handle # ' ||TO_CHAR(vSubhandle)); EXCEPTION WHEN OTHERS THEN DBMS_output.put_line( 'Error start ********************************'); DBMS_output.put_line( 'Error during dropping the subscriber view for ''' || v_cdc_table || ''' ' || ' to the CDC subscription ''' || v_subscription_description || ''':'); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm || ' Handle = '||TO_CHAR(vSubhandle)); DBMS_output.put_line( 'Error end ********************************'); END; END LOOP; BEGIN -- purge window DBMS_CDC_SUBSCRIBE.PURGE_WINDOW( SUBSCRIPTION_HANDLE=> vSubhandle); DBMS_output.put_line( 'Subscriber Window for subscription ''' || v_subscription_description || ''' was successfully purged '); EXCEPTION WHEN OTHERS THEN DBMS_output.put_line( 'Error during subscriber view drop: '); DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm ||' Handle # '||TO_CHAR(vSubhandle)); END; END; END; /

Here is the output:


Subscriber View for CDC table 'CDC_DEPT' was successfully dropped. Handle # 86 Subscriber View for CDC table 'CDC_EMP' was successfully dropped. Handle # 86 Subscriber Window for subscription 'scott -> Datawarehouse' was successfully purged

Purging the views is an interesting step. This is another case of misleading names. Remember, our subscription may not necessarily be the only subscription in the system. We may be done with our processing of the data but someone else's application may not. Purging data in this case may not be a good idea. What call to purge_window really means is that we report being done with the data up to the last CSCN$ we used for our views. When all the subscribers report that they are done with theirs, then publisher will be able to really purge the data to the least CSCN$ of all the subscribers.

Review

Review
So, you are going to design some kind of a process that would kick in every so often and do the following: Extend a window and create CDC views for all CDC tables Process CDC tables applying updates to the target tables (again, don't forget to order records in CSCN$ ascending order) Drop CDC views and purge the CDC window The frequency of this cycle obviously depends on how much time your update process takes. Note 1: If your source database is really on another instance your update process will be the one with a lot of @db_link tables. Note 2: It would be wise for you to design the update process in such a way that it could be applied again without causing problems. For instance, you may want to treat inserts as updates if the key already exists in a target database or that deletes don't really delete anything (this will happen if you are running your update script the second time) - that allows better debugging the scripts. Note 3: We recommend slightly different sequence of steps for production environmen step 1 - drop CDC views (this will fail the first time, since there are none), step 2 - purge CDC window (this will also fail the first time), step 3 - extend the windows, create CDC views, create synonyms, step 4 - process updates This sequence leaves your CDC views intact between runs and you can do the research what went wrong between runs. Our view of the correctly designed CDC process is this:

Note 4: Your update script may run quickly or take a long time, depending on the intensity of updates in the system. You better design your scripts in such way that they won't run into each other.

Clean-up
From our own experience, we can guarantee that you are going to make a lot of mistakes before setting everything up "just so", so we are providing a script that you may use to undo the changes and start over (See etl_undo_cdc.sql)

Conclusion
Oracle CDC is a new and exciting feature that nicely complements the Oracle suite of tools for data warehousing and generic replication. Mastering it and making it work flawlessly requires a fair degree of effort, but it can help your organization handle data movement in a more efficient and systematic manner.

References
Something strange is happening to Oracle's CDC documentation - it disappears from servers that used to have it I updated these links at least 3 times and every time at some point the

that used to have it. I updated these links at least 3 times and every time at some point the links become broken. Sorry, folks, but I tried. Here is where I found them last time: 1. DBMS_LOGMNR_CDC_PUBLISH http://hoth.stsci.edu/public/920doc/appdev.920/a96612/d_logpub.htm. 2. DBMS_LOGMNR_CDC_SUBSCRIBE http://hoth.stsci.edu/public/920doc/appdev.920/a96612/d_logsub.htm. 3. Change Data Capture http://os.cc.biu.ac.il/documentation/oracle/9.0.2_doc/server.920/a96520/cdc.htm.
We hope you found this article useful. If you have any comments about this article please send them to Boris Knizhnik borisk@bikinfo.com. We appreciate any links to this site as they show your appreciation and work as free ads. If you decide to include a link to our site on your site, please drop us a note.
Copyright 2002-2004 BIK Information Services, Inc. All Rights Reserved.

Last modified on Saturday, 09-Apr-2005 09:52:57 EDT

Back to BIK Information Services home page

Go to top

You might also like