
WHITE PAPER

How to Obtain Flexible, Cost-effective Scalability and Performance through Pushdown Processing

Under the Hood of the Pushdown Optimization Option Now Available Through Informatica PowerCenter 8

This document contains Confidential, Proprietary and Trade Secret Information ("Confidential Information") of Informatica Corporation and may not be copied, distributed, duplicated, or otherwise reproduced in any manner without the prior written consent of Informatica. While every attempt has been made to ensure that the information in this document is accurate and complete, some typographical errors or technical inaccuracies may exist. Informatica does not accept responsibility for any kind of loss resulting from the use of information contained in this document. The information contained in this document is subject to change without notice. The incorporation of the product attributes discussed in these materials into any release or upgrade of any Informatica software product, as well as the timing of any such release or upgrade, is at the sole discretion of Informatica.

Protected by one or more of the following U.S. Patents: 6,032,158; 5,794,246; 6,014,670; 6,339,775; 6,044,374; 6,208,990; 6,208,990; 6,850,947; 6,895,471; or by the following pending U.S. Patents: 09/644,280; 10/966,046; 10/727,700.

This edition published April 2006

Table of Contents

Executive Summary
Historical Approaches to Data Integration
The Combined Engine- and RDBMS-based Approach to Data Integration
How Pushdown Optimization Works
    Overview of Pushdown Processing
    Two-Pass Pushdown Processing
    Partial Pushdown Processing
    Full Pushdown Processing
    Platform-specific Pushdown Optimization
    Limitations on the Types of Transformations that Can Be Pushed to the Database
Benefits of Pushdown Optimization
    Increased Performance
    Increased IT Team Productivity
    Reduced Risk and Enhanced Flexibility
Conclusion and Next Steps

Executive Summary

Over the next five to 10 years and beyond, the two dominant variables in the enterprise data integration equation are painfully clear: more data and less time. Given these, what's the right data integration strategy to effectively manage terabytes or even hundreds of terabytes of data with enough flexibility and adaptability to cope with future growth?

Historically, data integration was performed by developing hand-coded programs that extract data from source systems, apply business/transformation logic, and then populate the appropriate downstream system, be it a staging area, data warehouse, or other application interface.

Hand-coding has been replaced, in many instances, by data integration software that performs the access, discovery, integration, and delivery of data using an "engine" or "data integration server" and visual tools to map and execute the desired process. Driven by accelerated productivity gains and ever-increasing performance, state-of-the-art data integration platforms, such as Informatica PowerCenter, handle the vast majority of today's scenarios quite effectively. PowerCenter has enjoyed wide acceptance and use by high-volume customers representing companies and government organizations of all sizes. Based on this use, Informatica has identified performance scenarios where processing data in a source or target database, instead of within the data integration server, can lead to significant performance gains.

These scenarios are primarily where data is co-located within a common database instance, such as when staging and production reside in a single Oracle relational database management system (RDBMS), or where a large investment has been made in database hardware and software that can provide additional processing power.

With these scenarios in mind, Informatica Corporation set out to deliver a solution that offers the best of both worlds without incurring undue configuration and management burden: a solution that leverages the performance capabilities of its data integration server and/or the processing power of a relational database interchangeably to optimize the use of available resources.

Helping to overcome the challenges of implementing data integration as an enterprise-wide function, PowerCenter 8 offers key new features that can enable near-universal data access, deliver greater performance and scalability, and significantly increase developer productivity.

"The push-down logic will allow us to take further advantage of our database processing power."
Mark Cothron, Data Integration Architect, Ace Hardware

Informatica has developed a solution that offers IT architects flexibility and ease of performance optimization through "push down" processing into a relational database using the same metadata-driven mapping and execution architecture: the PowerCenter Pushdown Optimization Option, now available through Informatica PowerCenter 8. PowerCenter 8 is the latest release of Informatica's single, unified enterprise data integration platform for accessing and integrating data from virtually any business system, in any format, and delivering that data throughout the enterprise at any speed.

This white paper describes the flexibility, performance optimization, and leverage provided by the PowerCenter 8 Pushdown Optimization Option. It examines the historical approaches to data integration and describes how a combined engine- and RDBMS-based approach to data integration can help the enterprise:

- Cost-effectively scale by using a flexible, adaptable data integration architecture
- Increase developer and team productivity
- Save costs through greater leverage of RDBMS and hardware investments
- Eliminate the need to write custom-coded solutions
- Easily adapt to changes in underlying RDBMS architecture
- Maintain visibility and control of data integration processes

After reading this paper, you will understand how pushdown processing works, the option's technical capabilities, and how these capabilities will benefit your environment.

Historical Approaches to Data Integration

Historically, there have been four approaches to data integration:

1. Hand-coding. Since the early days of data processing, IT has attempted to solve integration problems through development of hand-coded programs. These efforts still proliferate in many mainframe environments, data migration projects, and other scenarios where manual labor is applied to extract, transform, and move data for the purposes of integration. The high risks, escalating costs, and lack of compliance associated with hand-coded efforts are well documented, especially in today's environment of heightened regulatory oversight and the need for data transparency. Early on, automation solutions emerged as a cost-effective alternative to hand-coding.

2. Code generators. The first attempts at increasing IT efficiency led to the development of code generation frameworks that leveraged visual tools to map out processes and data flow but then generated and compiled code as the resultant run-time solution. Code generators were a step up from hand-coding for developers, but this approach did not gain widespread adoption: as solution requirements and IT architecture complexity grew, issues around code maintenance, lack of visibility through metadata, and inaccuracies in the generation process led to higher rather than lower costs.

3. RDBMS-centric SQL code generators. An offspring of early-generation code generators emerged from the database vendors themselves. Using the database as an "engine" and SQL as a language, RDBMS vendors delivered offerings that centered on their "flavor" of database programming. Unfortunately, these products exposed the inability of the SQL language and the database-specific extensions (e.g., PL/SQL, stored procedures) to handle cross-platform data issues; XML data; the full range of functions such as data quality, profiling, and conditional aggregation; and the rest of the complete range of business logic needed for enterprise data integration. What these products did prove was that for certain scenarios, the horsepower of the relational database can be effectively used for data integration.

4. Metadata-driven engines. Informatica pioneered a data integration approach that leveraged a data server, or "engine," powered by open, interpreted metadata as the workhorse for transformation processing. This approach addressed complexity and met the needs for performance. It also provided the added benefit of re-use and openness due to its metadata-centricity. Others have since copied this approach through other types of engines and languages, but it wasn't until this metadata-driven, engine-based approach was widely adopted by the market as the preferred method for saving costs and rapidly delivering on data integration requirements that extraction, transformation, and loading (ETL) was established as a proven technology. Figure 1 shows this engine-based data integration approach.
THE POWERCENTER PUSHDOWN OPTIMIZATION OPTION

Automatically generates and pushes down mapping logic
- Generates database-specific logic that represents overall data flow
- Pushes the execution of the logic into the database to perform data transformation processing

Provides a single design environment with an easy-to-use GUI
- Decouples data transformation logic from the physical execution plan
- Controls where processing takes place
- Dynamically creates and executes database-specific transformation language
- Allows you to preview the processing you can push to the database

Leverages a single, unified data integration platform
- Applies pushdown optimization to all data integration processing available on the PowerCenter platform, including data cleansing, data profiling, and unstructured and semi-structured data processing

[Figure 1 diagram: Data Sources, Data Integration Server, Metadata Repository, Data Target]

Figure 1: Informatica Pioneered the Metadata-driven Engine Approach to Data Integration

The Combined Engine- and RDBMS-based Approach to Data Integration

Using an engine-based approach, Informatica PowerCenter has become the industry performance leader for enterprise data integration. This leadership has been demonstrated in industry benchmarks, with continued success in complex, high-volume customer environments and in head-to-head evaluations with competitive offerings. Performance capabilities, such as source-specific partitioning, 64-bit support, threaded architecture, and continued testing and refinement of the data server, have led organizations to choose PowerCenter to meet their most strenuous requirements.

In reviewing requirements for the latest version of the product, PowerCenter 8, Informatica evaluated certain scenarios in which it made sense to process transformations in a relational database in order to limit the movement of data out of, and subsequently back into, the database while the data is co-resident. It is with these scenarios in mind that Informatica developed the pushdown optimization capabilities to round out the optimal performance architecture of its enterprise data integration platform.

Pushdown optimization is enabled through PowerCenter's metadata-driven architecture, which decouples the data transformation logic from the physical execution plan. This unique architecture allows processing to be "pushed down" inside an RDBMS when possible. PowerCenter 8 is the only software on the market that offers engine-based and RDBMS-based integration technology in a single, unified platform. This approach supports a broad spectrum of data integration initiatives and enables IT to save costs through intelligent use of existing computing resources. Both approaches are required for organizations looking to develop an Integration Competency Center where all integration efforts are developed and/or managed by an expert team faced with varying solution requirements.

Although processing is spread between the data integration engine and the database engine, with PowerCenter 8 developers use a single design environment and the same standard set of PowerCenter tools. For example, a developer can design the data flow using the PowerCenter Designer, and can design job workflow using the PowerCenter Workflow Manager. Metadata continues to be generated and managed within PowerCenter. By simply selecting pushdown optimization in the PowerCenter graphical user interface (GUI), developers can control where processing takes place, and database-specific transformation language will be dynamically created and executed as appropriate.

Pushdown optimization ensures that existing IT assets are fully utilized, helping organizations maximize their investment in RDBMS horsepower. A SQL code generator-only approach to data integration hampers IT's ability to deliver on the various needs of the enterprise due to the limitations of SQL as a comprehensive language for data integration efforts.

How Pushdown Optimization Works

The Pushdown Optimization Option increases systems performance by providing the flexibility to push data transformation processing to the most appropriate processing resource, whether within a source or target database or through the PowerCenter server. This section explains how pushdown processing works, including two-pass, partial, and full pushdown processing. It describes platform-specific pushdown processing and outlines the limitations on the types of transformations that can be pushed to the database.

Overview of Pushdown Processing

Separating business logic from physical run-time execution, the Pushdown Optimization Option is coupled with the creation and management of workflows. Workflows tie the execution of a metadata-based mapping to an actual physical environment. This environment spans not only the PowerCenter Data Integration Services that may reside on multiple hardware systems, but also the relational databases where pushdown processing will occur. As shown in Figure 2, data integration solution architects can configure the pushdown strategy through a simple drop-down menu in the PowerCenter 8 Workflow Manager.

Figure 2: Data Integration Solution Architects Can Configure the Pushdown Strategy through a Simple Drop-Down Menu in the PowerCenter 8 Workflow Manager

Pushdown optimization can be used to push data transformation logic to the source or target database. The amount of work data integration solution architects can push to the database depends on the pushdown optimization configuration, the data transformation logic, and the mapping configuration.

When pushdown optimization is used, PowerCenter writes one or more SQL statements to the source or target database based on the data transformation logic. PowerCenter analyzes the data transformation logic and mapping configuration to determine the data transformation logic it can push to the database. At run time, PowerCenter executes any SQL statement generated against the source or target tables, and it processes any data transformation logic within PowerCenter that it cannot push to the database.

Using pushdown processing can improve performance and optimize available resources. For example, PowerCenter can push the data transformation logic for the mapping seen in Figure 2 to the source database.

Figure 3 shows a mapping that can be pushed to the source database.

Figure 3: Sample Mapping Pushed to Source Database

The mapping contains a filter transformation that filters out all items except for those with an ID greater than 1005. PowerCenter can push the data transformation logic to the database, and it generates the following SQL statement to process the data transformation logic:

    INSERT INTO ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC, n_PRICE)
    SELECT ITEMS.ITEM_ID, ITEMS.ITEM_NAME, ITEMS.ITEM_DESC, CAST(ITEMS.PRICE AS INTEGER)
    FROM ITEMS
    WHERE (ITEMS.ITEM_ID > 1005)

PowerCenter generates an INSERT SELECT statement to obtain and insert the ID, NAME, and DESCRIPTION columns from the source table, and it filters the data using a WHERE clause. PowerCenter does not extract any data from the database during this process. Because PowerCenter does not need to extract and load data, performance improves and resources are maximized.

Two-Pass Pushdown Processing

Pushdown processing is based on a two-pass scan of the mapping metadata. In the first pass, PowerCenter scans the mapping objects starting with the source definition object and moving toward the target definition object. When the scan encounters an object containing data transformation logic that cannot be represented in SQL, the scanning process stops, and all transformations upstream of this transformation are grouped together with equivalent SQL for execution inside the source system.

In the second pass, PowerCenter scans in the opposite direction (i.e., from the target definitions toward the source definitions). When the scan encounters an object containing data transformation logic that cannot be represented in SQL, the scanning process stops, and all transformation objects downstream of this transformation are grouped together with equivalent SQL for execution inside the target system. PowerCenter executes any remaining data transformation logic.
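The two-pass scan described above can be sketched in a few lines of Python. This is only an illustrative model, not PowerCenter's implementation: it assumes the mapping is a simple linear chain of steps, and the `Step` class, its `sql_ok` flag, and the `two_pass_scan` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One transformation in a linear mapping (hypothetical model)."""
    name: str
    sql_ok: bool  # assumption: True if the step's logic can be expressed in SQL


def two_pass_scan(steps: List[Step]) -> Tuple[List[Step], List[Step], List[Step]]:
    """Split a mapping into (source-side, engine, target-side) groups."""
    # Pass 1: walk from the source toward the target and stop at the
    # first step that cannot be represented in SQL.
    src_end = 0
    while src_end < len(steps) and steps[src_end].sql_ok:
        src_end += 1

    # Pass 2: walk from the target back toward the source, stopping at the
    # first non-SQL step (or at the group already claimed by pass 1).
    tgt_start = len(steps)
    while tgt_start > src_end and steps[tgt_start - 1].sql_ok:
        tgt_start -= 1

    return steps[:src_end], steps[src_end:tgt_start], steps[tgt_start:]


# Illustrative mapping; the sql_ok flags are made up for the example.
mapping = [
    Step("Source Qualifier", True),
    Step("Filter", True),
    Step("Custom step", False),
    Step("Expression", True),
    Step("Target", True),
]

source_side, engine, target_side = two_pass_scan(mapping)
print([s.name for s in source_side])  # ['Source Qualifier', 'Filter']
print([s.name for s in engine])       # ['Custom step']
print([s.name for s in target_side])  # ['Expression', 'Target']
```

In this model, when every step is representable in SQL the first pass claims the whole chain and the engine group comes back empty.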
When you configure PowerCenter to use pushdown optimization, it can process the data transformation logic using full or partial pushdown optimization.

Partial Pushdown Processing

Partial pushdown processing occurs when either the source and target systems are in different database instances, or only some of the data transformation logic can be represented in SQL. In such cases, some processing may be pushed into the source database, some processing occurs inside PowerCenter, and some processing may be pushed to the target database. Figure 4 shows an example of partial pushdown processing.

Figure 4: Mapping for Partial Pushdown Processing

In Figure 4, all transformations up to and including the Aggregator transformation are pushed into the source database. The Update Strategy transformation is executed within PowerCenter, and the Expression transformation is executed inside the target database.

Full Pushdown Processing

Full pushdown processing occurs when the source and target relational database management systems are the same instance, and data transformation logic can be completely represented in SQL. In this case, PowerCenter pushes the entire mapping processing down into the database system. Figure 5 shows an example mapping that is fully processed inside the database system.

Figure 5: Mapping for Full Pushdown Processing

In Figure 5, the sources and targets are in the same instance, and the data transformation logic can be pushed to the database. The work of filtering, joining, and sorting the data is performed by the database, freeing PowerCenter resources to perform other tasks. However, the transformation logic is represented in PowerCenter, so it is platform independent and easy to modify. The visual representation makes it simple to review the flow of logic, and the Pushdown Optimization Viewer allows you to preview the SQL statements PowerCenter will execute at run time.
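The difference between full and partial pushdown processing described above can be tried out with a small, self-contained Python sketch. It rests on several assumptions: in-memory SQLite databases stand in for the RDBMS instances, the table names and sample rows are invented, and the Python code plays the role of the data integration engine. It is not SQL generated by PowerCenter.

```python
import sqlite3

# Hypothetical sample data: (region, amount)
SAMPLE_ORDERS = [("east", 10), ("east", 20), ("west", 5), ("north", -5)]


def full_pushdown() -> list:
    """Source and target share one instance: a single INSERT ... SELECT."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    db.execute("CREATE TABLE big_orders (region TEXT, amount INTEGER)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", SAMPLE_ORDERS)
    # The filter runs inside the database; no rows are extracted.
    db.execute(
        "INSERT INTO big_orders (region, amount) "
        "SELECT region, amount FROM orders WHERE amount >= 10"
    )
    return db.execute(
        "SELECT region, amount FROM big_orders ORDER BY region, amount"
    ).fetchall()


def partial_pushdown() -> list:
    """Source and target are separate instances: work is split three ways."""
    src = sqlite3.connect(":memory:")
    tgt = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    src.executemany("INSERT INTO orders VALUES (?, ?)", SAMPLE_ORDERS)
    tgt.execute("CREATE TABLE region_totals (region TEXT, total INTEGER)")

    # Source-side pushdown: the aggregation runs in the source database.
    rows = src.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()

    # Engine-side step: a row-level rule kept outside the databases
    # (a stand-in for logic that cannot be pushed down).
    kept = [(region, total) for region, total in rows if total >= 0]

    # Target-side pushdown: the UPPER() expression is evaluated by the
    # target database as part of the INSERT.
    tgt.executemany(
        "INSERT INTO region_totals (region, total) VALUES (UPPER(?), ?)", kept
    )
    return tgt.execute(
        "SELECT region, total FROM region_totals ORDER BY region"
    ).fetchall()


print(full_pushdown())     # [('east', 10), ('east', 20)]
print(partial_pushdown())  # [('EAST', 30), ('WEST', 5)]
```

In the full case no rows leave the database. In the partial case only the aggregated rows travel through the engine, which is typically far fewer than the raw source rows.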

Platform-specific Pushdown Optimization

When pushdown optimization is applied to a specific database type, the PowerCenter Data Integration Services generate SQL statements using native database SQL. Standards-based generation for ODBC is also supported, in which case PowerCenter generates SQL statements using ANSI SQL. PowerCenter can generate a greater variety of transformation functions when a specific database type is used and ensures optimal generation of the fastest execution plan.

Limitations on the Types of Transformations that Can Be Pushed to the Database

PowerCenter includes a number of transformations that perform functions that cannot be performed within a database. Figure 6 summarizes the transformation types that can be pushed to the database.

Figure 6: Transformation Types that Can Be Pushed to the Database Using PowerCenter

Columns: Transformation, Source-side, Target-side
Transformations listed: Aggregator, Application Source Qualifier, Custom, Expression, External Procedure, Filter, Java, Joiner, Lookup, Normalizer, Rank, Router, Sequence Generator, Sorter, Source Qualifier, Stored Procedure, Target, Transaction Control, Union, Update Strategy, XML Generator, XML Parser, XML Source Qualifier
[The table marks eleven Source-side or Target-side cells with an "x"; the row each mark belongs to is not recoverable from this copy.]

With the PowerCenter Pushdown Optimization Option, data integration solution architects can leverage both the database and PowerCenter's capabilities by pushing some transformation logic to the database and processing other data transformation logic using PowerCenter.

For example, users might have a mapping that filters and sorts data and then outputs the data to an XML target. To utilize database and PowerCenter capabilities to their fullest potential, data integration solution architects might push the transformation logic for the Source Qualifier, Filter, and Sorter transformations to the source database, and then extract the data to output it to the XML target. Figure 7 shows a mapping that uses database capabilities and PowerCenter's XML capabilities.

Figure 7: Mapping Pushes Transformation Logic to the Source and Writes to an XML Target

Benefits of Pushdown Optimization

The PowerCenter Pushdown Optimization Option offers many benefits, including:

- Increased performance by using optimal resources
- Increased ease-of-use with a metadata-driven architecture that provides metadata lineage
- Increased IT team productivity with simplified debugging and performance tuning
- Reduced risk and enhanced flexibility through database neutrality

Increased Performance

The PowerCenter Pushdown Optimization Option increases systems performance by providing the flexibility to push data transformation processing to the most appropriate processing resource, whether within a source or target database or through the PowerCenter server. With this option, PowerCenter is the only enterprise data integration software on the market that allows data integration solution architects to choose when pushing down processing offers a performance advantage.

With the PowerCenter Pushdown Optimization Option, data integration solution architects can choose to push all or part of the data transformation logic to the source or target database. They can select the database they want to push transformation logic to, and they can choose to push some sessions to the database while allowing PowerCenter to process other sessions.

For example, let's say an IT organization has an Oracle source database with very low user activity. This organization may choose to push transformation logic for all sessions that run on this database. In contrast, let's say an IT organization has a Teradata source database with heavy user activity. This organization may choose to allow PowerCenter to process the transformation logic for sessions that run on this database. In this way, the sessions can be tuned to work with the load on each database, optimizing performance.

With the PowerCenter Pushdown Optimization Option, data integration solution architects can also use variables to choose to push different volumes of transformation logic to the source or target database at different times during the day. For example, partial pushdown optimization may be used during the peak hours of the day, while full pushdown optimization is used from midnight until 2 a.m., when activity is low.
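The time-of-day example above amounts to a small scheduling rule. The sketch below shows one way to express it in Python; the function name `pushdown_level` and the returned labels are hypothetical and do not reflect PowerCenter's actual variable syntax. It also assumes that every hour outside the midnight-to-2 a.m. window is treated as peak time.

```python
from datetime import time


def pushdown_level(now: time) -> str:
    """Return a hypothetical pushdown setting for a session starting at `now`."""
    low_activity_start = time(0, 0)  # midnight
    low_activity_end = time(2, 0)    # 2 a.m.
    # Low database activity: let the database do all of the work.
    if low_activity_start <= now < low_activity_end:
        return "full"
    # Assumption: all other hours count as peak, so push only part of the logic.
    return "partial"


print(pushdown_level(time(1, 30)))  # full
print(pushdown_level(time(14, 0)))  # partial
```

A real deployment would more likely drive this choice from measured database load than from a fixed clock window; the fixed window is used here only because it mirrors the example in the text.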

Increased IT Team Productivity

With its unique metadata-driven architecture, the PowerCenter Pushdown Optimization Option increases IT team productivity in several ways:

- Ease-of-use on different platforms. PowerCenter's metadata-driven architecture allows transformation logic to be easily ported to different platforms. The same transformation logic can easily be performed on different databases. The same session can be assigned different database connections, and the same data transformation can be performed without rewriting code or using different SQL syntax.

- Ease of maintenance. PowerCenter's metadata-driven architecture makes it easy to track data for the purposes of error-logging, data cleansing, or data profiling. In addition, metadata for repository objects is maintained in the PowerCenter repository. Modifications to repository objects, as well as metadata imports and exports, can be tracked, and a history of changes to repository objects can be maintained.

- Simplified debugging and performance tuning. When data transformation logic is configured in PowerCenter, it is represented in a mapping, which provides a visual representation of the data flow, making it simple to debug and edit the transformation logic.

Because PowerCenter is a single, unified platform, different functions can be applied to the same metadata without exiting the PowerCenter GUI. For example, a developer might create a mapping to represent data transformation, and then launch the Data Profiling tool to assess the status of the data. Later, the developer can open the Workflow Manager to perform the transformation and launch the Workflow Monitor to track the data as it moves from the source to the target.

A tool called the Pushdown Optimization Viewer lets data integration solution architects preview the flow of data to the source or target database. This tool shows the data flow, the amount of transformation logic that can be pushed to the source or target database, the SQL statements that will be generated at run time, and any messages related to pushdown optimization. Figure 8 shows the mapping from Figure 6 displayed in the Pushdown Optimization Viewer.

Figure 8: Pushdown Optimization Viewer

Reduced Risk and Enhanced Flexibility

IT organizations typically support several different relational databases. Even when they are able to standardize on a single RDBMS, changing business conditions (resulting from mergers and acquisitions, cost cutting, etc.) dictate that they need to be prepared to support multiple relational database architectures. IT organizations need to be able to fully leverage the capabilities of each type of database, and yet stay agile enough to rapidly integrate other types of databases as the need arises. New regulatory and governance requirements also dictate increased visibility and control into the business rules applied to data as it moves throughout the enterprise.

PowerCenter reduces the risk of changing database architectures and enhances flexibility by being database-neutral. PowerCenter's metadata-driven architecture extends to mappings that leverage the Pushdown Optimization Option. The appropriate database-specific logic can be easily regenerated after a database change, providing flexibility of choice and ease of change. Leveraging metadata analysis and reporting, rather than having business logic tied to vendor-specific hand-coded logic, enables effective data governance and transparency.

Conclusion and Next Steps

Today's challenges to save costs and also drive revenue are pushing organizations to examine their current data integration infrastructure needs and choose solutions that provide flexibility and maximum leverage of current assets. Informatica PowerCenter provides IT organizations with the flexibility to optimize performance in response to changing runtime demands, peak processing needs, or other dynamic aspects of the production environment, helping IT organizations achieve cost-effective scalability and performance.

By delivering a combined engine-centric and RDBMS-centric approach to data integration in a single, unified platform, PowerCenter 8, with its Pushdown Optimization Option, ensures optimal performance for the broad spectrum of data integration projects and helps IT save costs through the intelligent use of existing computing resources. With the Pushdown Optimization Option, PowerCenter 8 can help the enterprise:

- Cost-effectively scale through a flexible, adaptable data integration architecture
- Increase developer and team productivity
- Save costs through greater leverage of RDBMS and hardware investments
- Eliminate the need to write custom-coded solutions
- Easily adapt to changes in underlying RDBMS architecture
- Maintain visibility and control of data integration processes

To find out how using a combined engine- and RDBMS-centric approach can benefit your data integration initiatives, or to find out more about PowerCenter 8, please visit us at www.informatica.com or call us at (800) 653-3871.


Worldwide Headquarters, 100 Cardinal Way, Redwood City, CA 94063, USA
phone: 650.385.5000 fax: 650.385.5500 toll-free in the US: 1.800.653.3871 www.informatica.com

Informatica Offices Around The Globe: Australia, Belgium, Canada, China, France, Germany, Japan, Korea, the Netherlands, Singapore, Switzerland, United Kingdom, USA

© 2006 Informatica Corporation. All rights reserved. Printed in the U.S.A. Informatica, the Informatica logo, and PowerCenter are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be tradenames or trademarks of their respective owners. J50701 6650 (04/25/06)
