Informatica PowerCenter
(Versions 8.1.1-8.6)
Informatica Data Quality Integration for PowerCenter Guide Versions 8.1.1, 8.5.1, and 8.6 January 2009 Copyright (c) 1998-2008 Informatica Corporation. All rights reserved. This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S. and international Patents and other Patents Pending. Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable. The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing. Informatica, PowerCenter, PowerExchange, Informatica B2B Data Exchange, Informatica B2B Data Transformation, Informatica Data Quality, Informatica Data Explorer, Informatica Identity Resolution and Matching, Informatica On Demand, PowerMart, PowerBridge, PowerConnect, PowerChannel, PowerPartner, PowerAnalyzer, PowerCenter Connect and PowerPlug are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners. Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright Melissa Data Corporation. 
All rights reserved. Copyright MySQL AB. All rights reserved. Copyright Platon Data Technology GmbH. All rights reserved. Copyright Seaview Software. All rights reserved. Copyright Sun Microsystems. All rights reserved. Copyright Oracle Corporation. All rights reserved. This product includes software developed by the Apache Software Foundation (http://www.apache.org/), software developed by lf2prod.com (http://common.l2fprod.com) and other software which is licensed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. This product includes software which was developed by the JFreeChart project (http://www.jfree.org/freechart/), software developed by the JDIC project (https:// jdic.dev.java.net/) and other software which is licensed under the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/ lgpl.html. The materials are provided free of charge by Informatica, as-is, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (c) 1993-2006, all rights reserved. This product includes ICU software which is copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. 
Permissions and limitations regarding this software are subject to terms available at http://www-306.ibm.com/software/globalization/icu/license.jsp. This product includes software which is licensed under the MIT License, which may be found at http://www.opensource.org/licenses/mit-license.html. This product includes software which is licensed under the Eclipse Public License, which may be found at http://www.eclipse.org/org/documents/epl-v10.html. Tcl is copyrighted by the Regents of the University of California, Sun Microsystems, Inc., Scriptics Corporation and other parties. The authors hereby grant permission to use, copy, modify, distribute, and license this software and its documentation for any purpose. This product includes software developed by the JDOM Project (http://www.jdom.org/). Copyright 2000-2004 Jason Hunter and Brett McLaughlin. All rights reserved. This product includes software which is licensed under the Open LDAP Public License, which may be found at http://www.openldap.org/software/release/license.html. Portions of this software use the Swede product developed by Seaview Software (www.seaviewsoft.com). This Software may be protected by U.S. and international Patents and Patents Pending. DISCLAIMER: Informatica Corporation provides this documentation as is without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this product or documentation is error free. The information provided in this product or documentation may include technical inaccuracies or typographical errors.
Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Informatica Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Associating and Consolidating Data from a Single Source . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Associating and Consolidating Data from Multiple Data Sources . . . . . . . . . . . . . . . . . . . . 34
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Preface
This guide describes the Data Quality Integration components developed for Informatica Data Quality 8.6. It is provided for the following audiences:
- PowerCenter systems administrators who will install and register the Data Quality Integration components on their PowerCenter systems.
- PowerCenter users who will run data quality plans embedded in PowerCenter mappings.
Note: The Data Quality Integration transformation has changed in significant ways in the Data Quality 8.6 release. Read the Data Quality Integration Release Notes before installing and registering these components.
Informatica Resources
Informatica Customer Portal
As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica How-To Library, the Informatica Knowledge Base, Informatica Documentation Center, and access to the Informatica user community.
Informatica Documentation
The Informatica Documentation team makes every effort to create accurate, usable documentation. If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation team by email at infa_documentation@informatica.com. We will use your feedback to improve our documentation. Let us know if we can contact you regarding your comments. The Documentation team updates documentation as needed. To get the latest documentation for your product, navigate to the Informatica Documentation Center from http://my.informatica.com.
- support@informatica.com for technical inquiries
- support_admin@informatica.com for general customer service requests
WebSupport requires a user name and password. You can request a user name and password at http://my.informatica.com. Use the following telephone numbers to contact Informatica Global Customer Support:
North America / South America
Informatica Corporation Headquarters, 100 Cardinal Way, Redwood City, California 94063, United States
Standard Rate: United States: +1 650 385 5800; Brazil: +55 11 3523 7761; Mexico: +52 55 1168 9763

Europe / Middle East / Africa
Informatica Software Ltd., 6 Waltham Park, Waltham Road, White Waltham, Maidenhead, Berkshire SL6 3TN, United Kingdom
Standard Rate: Belgium: +32 15 281 702; France: +33 1 41 38 92 26; Germany: +49 1805 702 702; Netherlands: +31 306 022 797; Spain and Portugal: +34 93 480 3760; United Kingdom: +44 1628 511 445

Asia / Australia
Informatica Business Solutions Pvt. Ltd., Diamond District, Tower B, 3rd Floor, 150 Airport Road, Bangalore 560 008, India
Toll Free: Australia: 1 800 151 830; Singapore: 001 800 4632 4357
Standard Rate: India: +91 80 4112 5738
CHAPTER 1
Introduction
This chapter includes the following topics:
Overview, 1
Overview
Informatica Data Quality components can add significant data quality management capabilities to your PowerCenter projects. The Data Quality Workbench application allows you to design data quality management processes, called plans, and write them to the PowerCenter repository. The Data Quality Integration plug-in installs a set of transformations to PowerCenter that allow you to run plans in sessions and to create mappings that identify and consolidate groups of duplicate records. Table 1-1 summarizes the tasks you can perform by integrating PowerCenter and Data Quality:
Table 1-1. Data Quality-PowerCenter Integration Options
- Save a data quality plan to the PowerCenter repository as a mapplet: Use Data Quality Workbench to export the plan from the Data Quality repository to the PowerCenter repository.
- Import a plan file as a PowerCenter mapplet: Use Data Quality Workbench to export the plan from the Data Quality repository to an XML file. Import the XML file using PowerCenter Repository Manager.
- Run a plan in a PowerCenter session using the PowerCenter Integration Service: Use PowerCenter to run a mapplet in a session.
- Link related data rows for consolidation: Use the Data Quality Association transformation in PowerCenter.
- Consolidate duplicate or overlapping data rows: Use the Data Quality Consolidation transformation in PowerCenter.
Note: The Integration plug-in does not install the Data Quality Integration transformation. Informatica provided this transformation with Data Quality version 8.5, and it is now deprecated. The current integration supports instances of the Data Quality Integration transformation that are saved in the PowerCenter repository, but it does not permit you to edit these transformations or to create new instances of this transformation.
Data Quality allows you to export data quality plans from the Data Quality repository to the PowerCenter repository as mapplets. Follow this path to integrate your plans into PowerCenter processes. You can also convert instances of the Data Quality Integration transformation in your PowerCenter repository to mapplets.
CHAPTER 2
Integrating Data Quality and PowerCenter
This chapter includes the following topics:
- Overview, 3
- Integrating with Data Quality Workbench, 3
- Using Association and Consolidation Transformations, 4
- Using the Data Quality Integration Transformation, 5
- Deploying Data Quality Plans in PowerCenter, 5
- Adding Address Validation Functionality to PowerCenter, 6
- Adding Identity Matching Functionality to PowerCenter, 6
Overview
The Data Quality Integration plug-in adds a set of transformations to your PowerCenter client-side and server-side installations. Several of these transformations correspond to Data Quality Workbench components. For information on the features and functionality of these components, see the Informatica Data Quality User Guide. The plug-in also installs the Data Quality Association and Consolidation transformations, which are not found in Data Quality Workbench. For information on the Association and Consolidation transformations, see page 15 and page 19 respectively. There are client-side and server-side versions of the plug-in.
- Install the client version locally to the PowerCenter Designer.
- Install the server version locally to the PowerCenter Integration Service that runs a workflow containing either transformation.
A data quality plan is composed of components that each perform a discrete data quality management task. Plan designers configure these components to read data sources, perform data analysis or data enhancement tasks on source columns, and write the results to data targets. Plan designers save plans to the Data Quality repository and can write saved plans from the Data Quality repository to the PowerCenter repository, where they appear as mapplets. When a plan enters the PowerCenter repository, PowerCenter stores the plan as a mapplet and converts each component in the data quality plan to a PowerCenter transformation. Workbench users can also save a data quality plan as XML to the file system for import to the PowerCenter repository. The result in both cases is the same: the plan appears as a mapplet in the PowerCenter repository. A session containing one or more data quality mapplets runs with the PowerCenter Integration Service and does not require a Data Quality engine. The Data Quality and PowerCenter repositories can reside on separate machines.
Note: Not all Workbench components are installed as transformations to PowerCenter. For more information about the Data Quality components that install to PowerCenter, see page 10.
- Association transformation. Allows you to create links between related records so that they are treated as members of a single set in data consolidation. The Association transformation allows you to link records that do not share a group ID but share other characteristics that make them candidates for consolidation. This transformation generates an association ID that you can use to link such records.
- Consolidation transformation. Allows you to create a single, consolidated record using field values from one or more records with a common association ID. You can create expressions to determine how fields in the consolidated record are defined.
Informatica provides the Association and Consolidation transformations to consolidate duplicate records identified by data quality plans, but you can use these transformations on data from any source in a mapplet or mapping.
Use the Association transformation to order related records into groups for processing and to generate an association ID for each group of associated records. Use this association ID in downstream transformations to enable operations on grouped records.
Use the Consolidation transformation to create a single, consolidated record from a group of associated records.
You can use the Association and Consolidation transformations in the same mapping as a Data Quality Integration transformation. You can also use the transformations in any PowerCenter mapping, independent of each other and the Data Quality Integration transformation. The Association and Consolidation transformations operate independently of Data Quality and do not require that their inputs originate from data quality plans.
Note: You cannot use multiple partitions, grids, incremental recovery, real-time processing, or web services with the Association and Consolidation transformations.
The Data Quality Integration transformation provided a means for PowerCenter users to read the Data Quality repository and save data quality plan information into the PowerCenter repository as metadata extensions to the transformation. When you ran a session containing this transformation, PowerCenter loaded an instance of the Data Quality engine to process the plan information. This capability was introduced in an earlier version of the transformation. The current Integration components support any Data Quality Integration transformations saved in your repository, but you can no longer use the Data Quality Integration transformation to read plans from the Data Quality repository. The current Data Quality-PowerCenter integration model does not require a Data Quality engine to run a plan from PowerCenter, and Informatica recommends that you re-save any plans embedded in a Data Quality Integration transformation as mapplets in the PowerCenter repository.
The Data Quality Integration transformation provides a Convert to DQ Mapplet option to facilitate the process of saving a data quality plan that has been embedded in a Data Quality Integration transformation as a mapplet. Right-click on the transformation in Mapping Designer to access this option.
Note: Do not convert a Data Quality Integration transformation to a mapplet if the transformation is currently saved within a mapplet. PowerCenter does not support the presence of mapplets within mapplets. If you wish to run a session containing a Data Quality Integration transformation in which a data quality plan is embedded, verify that an instance of the Data Quality engine is installed locally to the Integration Service that runs the session.
- Use Data Quality Workbench to export the plan as a mapplet to the PowerCenter repository, and add the mapplet to a mapping.
- Use Data Quality Workbench to export the plan as an XML file and import this file to the PowerCenter repository. PowerCenter imports this file as a mapplet. Add the mapplet to a mapping.
- Locate a Data Quality Integration transformation that contains a data quality plan, and use this transformation in a mapping. You cannot create a Data Quality Integration transformation and add a plan to it.
The PowerCenter Integration Service runs the session containing a data quality mapplet wholly within the PowerCenter engine. When PowerCenter runs a session containing a Data Quality Integration transformation, it calls an instance of the Data Quality engine to process the plan information embedded in the transformation.
In the latter case, PowerCenter requires an instance of the Data Quality engine on the same machine as the PowerCenter Integration Service that runs the session. Data quality plan information is stored as XML in the PowerCenter repository. Plan information added through the Data Quality Integration transformation was stored with the transformation XML when the transformation was saved. Plans that you export from Data Quality Workbench are stored as mapplets, and no additional PowerCenter steps are necessary to save the mapplet.
Note: Ensure that resources required by Data Quality mapplets are present on the machine running the PowerCenter Integration Service. Examples of resources used in Data Quality mapplets are dictionary files, database dictionaries, and address validation files. For more information about writing plans to the PowerCenter repository, see the Informatica Data Quality User Guide.
Identity matching transformations use the index as a reference dataset when searching for duplicate identities in the source data. For information on identity transformations, see the Informatica Data Quality User Guide.
Note: When you run a mapping containing an identity match mapplet in a session, you must ensure that your session properties reference the location of the index file. How you do so depends on your PowerCenter version. For more information on setting the location of the key index, see page 41.
6 Chapter 2: Integrating Data Quality and PowerCenter
Population Files
Before you can use identity transformations, you must install population files. A population file contains key-building algorithms, search strategies, and matching schemas that enable duplicate analysis of identity information. Population files can allow for multiple languages and character sets within the source data. Informatica provides proprietary population files for use in Data Quality and PowerCenter. Before you begin, ensure you have a suitable population file for your source data installed on your computer. Use the Data Quality Content Installer to install your population files.
CHAPTER 3
Working with Plans in the PowerCenter Repository
This chapter includes the following topics:
- Overview, 9
- Running Plans as Mapplets, 10
- Running Plans with the Data Quality Integration Transformation, 12
- Data Quality Integration Transformation Properties, 13
Overview
Integrate your data quality plans with PowerCenter mappings and sessions in one of the following ways:
- Export the plan from the Data Quality repository to the PowerCenter repository. Use Data Quality Workbench to export a plan from the Data Quality repository directly to the PowerCenter repository. The Workbench components used to create the plan appear in PowerCenter as transformations, and the plan is saved to the PowerCenter repository as a mapplet. Create a mapping that includes this mapplet, and add the mapping to a session task.
- Export the plan to an XML file from the Data Quality repository, and import the XML file to the PowerCenter repository. Use this method when you cannot connect to the required PowerCenter repository from Data Quality Workbench. Use Data Quality Workbench to export the plan as an XML file to the file system, and use PowerCenter Repository Manager to import the plan to the repository. The Workbench components used to create the plan appear in PowerCenter as transformations, and the plan is saved to the PowerCenter repository as a mapplet. Create a mapping that includes this mapplet, and add the mapping to a session task.
- Use a plan embedded in a Data Quality Integration transformation. Your PowerCenter repository may contain one or more Data Quality Integration transformations in which a plan is embedded. Informatica recommends converting the embedded plan to a data quality mapplet. Create a mapping that includes this mapplet, or add it to the mapping that contained the original transformation, and add the mapping to a session task.
Note: You cannot use the Data Quality Integration transformation to connect to the Data Quality repository.
A data quality mapplet contains the operational components configured for the plan in Data Quality Workbench. These components appear onscreen as transformations. The Integration plug-in installs these transformations to PowerCenter. You cannot edit the properties of these transformations in PowerCenter. Figure 3-2 shows the transformations in this mapplet:
Figure 3-2. PowerCenter Mapplet Containing Data Quality Components
Add data quality mapplets to the PowerCenter repository through Data Quality Workbench. For information about exporting data quality plans as mapplets from the Data Quality repository to the PowerCenter repository, see the Informatica Data Quality User Guide. For information about importing XML files to the PowerCenter repository and for information about adding a mapplet to a mapping, see your PowerCenter Designer online help.
for each port. Do not edit any other aspect of the transformation, as this will invalidate the session in which the transformation is used. Not all Workbench components are added as transformations in PowerCenter, and not all Workbench sources and targets can convert to mapplet inputs and outputs.
Table 3-1 lists the data quality plan components that are usable in PowerCenter. If your data quality plan contains components not listed in this table, the plan will not function in PowerCenter.
Table 3-1. Data Quality Components Usable In PowerCenter
- Bigram. Calculates levels of similarity between pairs of strings. Outputs a match score for two strings based on pairs of consecutive characters that are common to both strings.
- Parses free-text fields containing multiple tokens into multiple single-token fields.
- Creates a character-by-character profile of data values in a data field.
- CSV Identity Group Source. File-based source in an identity matching plan. Performs identity matching on CSV sources using keys created by the Identity Group Target. Converts to an Identity Match Pair Generator. An Identity Match Pair Generator configures pairs of data values that will be subjected to match analysis in an identity data matching operation.
- CSV Identity Match Target. File-based target in an identity matching plan. Converts to an Identity Match Identifier. A Match Identifier appends the match score and match cluster information calculated by matching components to each output record at the end of the matching process.
- CSV Match Source. File-based source in a matching plan. Converts to a Match Pair Generator. A Match Pair Generator configures pairs of data values that will be subjected to match analysis in a data matching operation.
- CSV Match Target. File-based target in a matching plan. Converts to a Match Identifier. A Match Identifier appends the match score and match cluster information calculated by matching components to each output record at the end of the matching process.
- CSV Source. File-based source in a non-matching plan. Converts to a mapplet input.
- CSV Target. File-based target in a non-matching plan. Converts to a mapplet output.
- DB Identity Group Source. Database source in an identity matching plan. Performs identity matching on database sources using keys created by the Identity Group Target. Converts to an Identity Match Pair Generator. A Match Pair Generator configures pairs of data values that will be subjected to match analysis in a data matching operation.
- DB Source. Database source in a non-matching plan. Converts to a mapplet input.
- DB Target. Database target in a non-matching plan. Converts to a mapplet output.
- Edit Distance. Calculates levels of similarity between pairs of strings. Outputs a match score for two strings by calculating the minimum cost of transforming one string into another by the insertion, deletion, and replacement of characters.
- Global address validation component. Enables Data Quality to evaluate input addresses against address reference data with third-party validation engines.
- Hamming Distance. Calculates levels of similarity between pairs of strings. Outputs a match score for two strings by calculating the number of positions in which characters differ between them.
- Identity Group Target. Generates keys for groups of input data for use by the CSV Identity Group Source and the DB Identity Group Source. Converts to an Identity Key Store.
- Identifies similar or duplicate strings at identity level. An identity is a set of fields providing name and address information for a person or organization.
- Jaro Distance. Calculates levels of similarity between pairs of strings. Outputs a match score for two strings by calculating the minimum cost of transforming one string into another by the insertion, deletion, and replacement of characters. Reduces this score if the two strings do not share a common prefix.
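Several of the matching components above score similarity between pairs of strings. The following Python sketch is illustrative only, not Informatica code: it shows the textbook definitions of three such measures (bigram similarity, character-position distance, and edit distance); the actual components weight and normalize their scores in product-specific ways.

```python
def bigrams(s):
    """Pairs of consecutive characters, e.g. 'john' -> {'jo', 'oh', 'hn'}."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_score(a, b):
    """Dice-style similarity over shared bigrams, from 0.0 to 1.0."""
    ba, bb = bigrams(a.lower()), bigrams(b.lower())
    return 2 * len(ba & bb) / (len(ba) + len(bb)) if ba or bb else 1.0

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Minimum insertions, deletions, and replacements to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # replacement
        prev = cur
    return prev[-1]

print(bigram_score("smith", "smyth"))         # 0.5
print(hamming("smith", "smyth"))              # 1
print(edit_distance("peterson", "petersen"))  # 1
```

A prefix-sensitive measure in the style of the Jaro-family components can be built on top of these by boosting the score of string pairs that share leading characters.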
The current integration supports mappings that contain Data Quality Integration transformations, but you cannot create a new instance of this transformation or edit an instance that already exists in your repository. Informatica recommends converting a plan saved with a Data Quality Integration transformation to a data quality mapplet.
Tip: The name of the deprecated transformation is the Data Quality Integration transformation. The name of the installer that adds Informatica Data Quality components and capabilities to PowerCenter is the Integration installer. Take care not to confuse these names.
PowerCenter need not communicate with a Data Quality engine or Data Quality repository to run a plan embedded in a Data Quality Integration transformation. However, you cannot re-connect to the Data Quality repository to change or refresh the plan.
12 Chapter 3: Working with Plans in the PowerCenter Repository
Figure 3-3 shows a Data Quality Integration transformation in iconic form in a mapping:
Figure 3-3. Mapping Containing a Data Quality Integration Transformation
A Data Quality Integration transformation contains a single data quality plan. You can add multiple Data Quality Integration transformations to a mapping. A session that runs one mapping containing a series of Data Quality Integration transformations runs faster than a series of sessions that each run a mapping with a single Data Quality Integration transformation.
For information about the other tabs on this dialog box, consult PowerCenter Designer online help. Table 3-2 describes the options on this tab:
Table 3-2. Configurations Tab Options List
- Plan Name. Identifies the plan added to the transformation.
- Grouping Port. Buffers data on the selected field before sending it to the Data Quality engine. Used in matching plans.
- Plan Location. Lists the location of the Data Quality repository from which PowerCenter read the plan and the original path to the plan within that repository.
- Status. Describes the last connection state between PowerCenter and Data Quality.
- I/O Ports. Lists any pass-through ports added to the transformation. These ports enable data to pass through the transformation unchanged. They are not included in the input and output ports created by the data quality plan and are added in PowerCenter.
- Activates pass-through ports on the transformation.
CHAPTER 4
Association Transformations
This chapter includes the following topics:
Overview
In data quality, association is an extension of the data matching process and a precursor of the data consolidation process. The Association transformation creates links between records that share duplicate characteristics across more than one data field. The transformation generates an association ID value for each row in a group of associated records and writes the association ID values as a new output port. Use a Consolidation transformation to create a master record based on the records with common association ID values.
Note: You cannot use multiple partitions, grids, incremental recovery, real-time processing, or web services with the Association transformation.
The Association Ports tab is unique to this transformation. When you configure ports as association ports, PowerCenter creates a common association ID for all records with common values in either port. Configure at least two association ports for each Association transformation. The transformation creates a new AssociationID port that contains the ID values. You can configure any input/output port as an association port. To associate data from Data Quality matching plans, configure the cluster ID ports as association ports. The port data is of type Integer(10).
Note: In addition to the association ID port, you can create input/output ports to pass related data to downstream transformations. Create and configure all Association transformation ports on the Association Ports tab.
Example
The following data fragment includes cluster IDs generated by data quality plans that matched last names and addresses:
First_Name  Last_Name  Address           ClusterID_LastName  ClusterID_Address
John        Smith      11, Bridge ST     9                   14
Mary        Anne       345, tracy blvd   10                  15
Stuart      Peterson   1 Main street     11                  16
Kevin       Smith      11 Bridge street  9                   14
Paul        Smith      11 Bridge st      9                   14
When you route data to an Association transformation and configure each ClusterID port as an association port, the Association transformation evaluates cluster IDs and generates a single association ID for associated rows:
First_Name  Last_Name  Address           ClusterID_LastName  ClusterID_Address  AssociationID
John        Smith      11, Bridge ST     9                   14                 1
Kevin       Smith      11 Bridge street  9                   14                 1
Paul        Smith      11 Bridge st      9                   14                 1
Mary        Anne       345, tracy blvd   10                  15                 2
Stuart      Peterson   1 Main street     11                  16                 3
16
To create an Association transformation:
1. In the Mapping Designer, click Transformation > Create. Select the Association transformation and enter the name of the transformation. The naming convention for the Association transformation is AT_TransformationName. Click Create, and then click Done.
2. Select and drag ports from an upstream transformation to the Association transformation. Copies of these ports appear as input/output ports in the Association transformation. Or, in the Association transformation properties, click the Association Ports tab and create each port manually.
Note: To make this transformation reusable, you must create each port manually within the transformation.
3. On the Association Ports tab, select Associate to define a port as an association port. Select two or more association ports.
4. Click OK.
CHAPTER 5
Consolidation Transformations
This chapter includes the following topics:
Overview, 19
Consolidation Transformation Ports, 20
Consolidation Expressions, 21
Consolidation Functions, 21
Creating a Consolidation Transformation, 22
Overview
Use the Consolidation transformation to create a single, consolidated record from a group of associated records. When you use a Consolidation transformation, you configure the following components:
Group by port. The port used to group related records. The Consolidation transformation generates one record for each group.
Consolidation expressions. The expression used to define each field of the consolidated record. You can use consolidation functions as well as other standard PowerCenter functions to define expressions.
Use the Consolidation transformation with the Association transformation to consolidate duplicate records from Data Quality matching plans. When used with an Association transformation, configure the Consolidation transformation to group records by association ID.
Note: You cannot use multiple partitions, grids, incremental recovery, real-time processing, or web services with the Consolidation transformation.
[Sample data table: State, City, and Stores columns; only the Stores values 15, 3, and 4 are recoverable.]
Association transformations provide data sorted by association ID. When you use the Consolidation transformation with an Association transformation and configure AssociationID as the group by port, you do not need to perform additional sorting on input data. Otherwise, you can use a Sorter transformation to sort data.
Default Values
The Consolidation transformation uses the fields in the first record in a group as default data for the consolidated record. You can configure consolidation expressions to provide specific results for the consolidated record. When you do not enter consolidation expressions or when consolidation expressions do not generate a result, the field defaults to the value in the first record of the group. For example, if you do not configure any consolidation expressions for the sample data, above, the consolidated record for Maryland is MD, Montgomery, 10.
Group By Port
In a Consolidation transformation, the group by port determines how records are grouped for consolidation. When you create a Consolidation transformation, configure at least one group by port. For each group of consecutive records that has the same value in the group by port, the Consolidation transformation generates a consolidated record. You can select more than one group by port. When you create a composite group identifier, the Consolidation transformation creates a consolidated record for each composite group identifier. You can configure any input/output port as a group by port. However, when you use a Consolidation transformation with an Association transformation, use the association ID as the group by port. The Association transformation provides data that is already associated and sorted by association ID. Configure the group by port using the GroupBy option on the Consolidated Ports tab in the transformation properties. Data in the group by port should be sorted to ensure expected results. For more information, see Sorting Input Data on page 19.
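The requirement that input be sorted follows from consecutive grouping. A minimal Python sketch (using `itertools.groupby`, which has the same consecutive-records semantics) shows how an unsorted group by field fragments a group:

```python
# Hypothetical illustration of why the group by port needs sorted input:
# itertools.groupby, like the Consolidation transformation, only groups
# CONSECUTIVE records that share a key value.
from itertools import groupby

rows = [("MD", 10), ("VA", 3), ("MD", 5)]           # unsorted on the group key
unsorted_groups = [k for k, _ in groupby(rows, key=lambda r: r[0])]
print(unsorted_groups)                              # ['MD', 'VA', 'MD'] - MD is split

rows.sort(key=lambda r: r[0])                       # sort on the group by field first
sorted_groups = [k for k, _ in groupby(rows, key=lambda r: r[0])]
print(sorted_groups)                                # ['MD', 'VA'] - one group per key
```

Without the sort, the two MD records would produce two separate consolidated records rather than one.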
IsConsolidatedRecord
The Consolidation transformation provides an IsConsolidatedRecord output port to indicate if a record was consolidated from a group of records.
Table 5-1 describes the flags that are used in the IsConsolidatedRecord field:
Table 5-1. Consolidation Flags
Consolidation Flag   Description
1                    The record was consolidated. It represents a group of records that
                     share the same group by value.
0                    The record was not consolidated. It represents a group consisting
                     of a single record.
Consolidation Expressions
You can create expressions for any input/output port in the Consolidation transformation except group by ports. The expression determines how the port is defined for the consolidated record. For example, you can configure an expression to return the most frequently appearing value in the port for each group. If you do not enter an expression for a port, the Consolidation transformation uses the default value for the port. For more information, see Default Values on page 20. Use the Expression Editor to create and validate expressions. You can use any valid PowerCenter function or variable in expressions. You can also use the consolidation functions installed with the Consolidation transformation.
Consolidation Functions
The Consolidation transformation installation includes a set of new consolidation functions:
Store. Stores the value of the port or of an expression related to the current record.
Stored. Returns the stored value of the port.
Most_Frequent. Returns the most frequently occurring value for the port within a group, including blank and null values.
Most_Frequent_NonBlank. Returns the most frequently occurring value for the port within a group, excluding blank and null values.
For information about other PowerCenter functions, see the Transformation Language Reference.
Store
Store uses the following syntax:
STORE(port)
or
STORE(port, expression)
STORE(port) stores the value of the port as the candidate value for the consolidated record. STORE(port, expression) stores the value of the expression as the candidate value for the consolidated record. The expression must return a value of the same datatype as the port. Use only the port that you are configuring with Store. Do not use Store to generate string literals.
Stored
Stored uses the following syntax:
STORED(port)
Stored returns the stored candidate value of the port. When using Stored, include Store in the same expression to store candidate values. If you use Stored without Store, the Integration Service returns null values. For example, to store the most recent date in a Date port, you might use the following expression:
IIF (ISNULL (STORED(Date)), STORE(Date), IIF (Date > STORED(Date), STORE(Date)))
The return value must be the same datatype as the port you are configuring. Do not use Stored to generate string literals.
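The STORE/STORED expression above implements a running "candidate value" for the group. A hypothetical Python sketch of the same pattern (not PowerCenter expression code) makes the iteration explicit:

```python
# Hypothetical sketch of the STORE/STORED candidate-value pattern: walk a
# group row by row, replacing the stored candidate whenever the current
# row's date is later, exactly as the IIF expression above does.
from datetime import date

def most_recent_date(group_dates):
    stored = None                       # STORED(Date) starts out null
    for d in group_dates:
        if stored is None or d > stored:
            stored = d                  # STORE(Date): adopt the new candidate
    return stored

dates = [date(2008, 3, 1), date(2008, 7, 15), date(2008, 5, 2)]
print(most_recent_date(dates))  # 2008-07-15
```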
Most_Frequent
The Most_Frequent function uses the following syntax:
MOST_FREQUENT(port)
Most_Frequent evaluates all values within a group and returns the most frequently occurring value in a group, including null and blank values. The return value must be the same datatype as the port you are configuring. Do not use aggregate functions with Most_Frequent. Do not use Most_Frequent to generate string literals.
Most_Frequent_NonBlank
Most_Frequent_NonBlank uses the following syntax:
MOST_FREQUENT_NONBLANK(port)
Most_Frequent_NonBlank ignores blank and null values when it evaluates values within a group. It returns the most frequently occurring value in a group, excluding null and blank values. The return value must be the same datatype as the port you are configuring. Do not use aggregate functions with Most_Frequent_NonBlank. Do not use Most_Frequent_NonBlank to generate string literals.
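The difference between the two frequency functions is only in how blank and null values are treated. A hypothetical Python sketch (the tie-breaking here follows `collections.Counter` and may differ from the Integration Service's behavior) illustrates it:

```python
# Hypothetical sketch of MOST_FREQUENT vs MOST_FREQUENT_NONBLANK semantics:
# the first counts blanks and nulls as ordinary values, the second drops them.
from collections import Counter

def most_frequent(values):
    # Blanks ("") and None count like any other value.
    return Counter(values).most_common(1)[0][0]

def most_frequent_nonblank(values):
    kept = [v for v in values if v not in ("", None)]
    return Counter(kept).most_common(1)[0][0] if kept else None

group = ["Smith", "", "", "Smyth"]
print(most_frequent(group))           # '' (blank occurs most often)
print(most_frequent_nonblank(group))  # 'Smith'
```

With two blank values in the group, MOST_FREQUENT returns the blank while MOST_FREQUENT_NONBLANK falls through to the most frequent non-blank name.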
1.	In the Mapping Designer, click Transformation > Create. Select Consolidation transformation and enter the name of the transformation. The naming convention for Consolidation transformations is CON_TransformationName. Click Create, and then click Done.
2.	Select and drag ports from an upstream transformation to the Consolidation transformation. Copies of these ports appear as input/output ports in the Consolidation transformation. Or, in the Consolidation transformation properties, click the Consolidation Ports tab and create each port manually.
	Note: To make this transformation reusable, you must create each port manually within the transformation.
3.	On the Consolidation Ports tab, select GroupBy to define a group by port. Define at least one group by port.
4.	To enter an expression for a port, click the button in the Expression field and use the Expression Editor. To prevent typographic errors, use the listed port names and functions when possible.
5.	Click Validate to validate the expression. To close the Expression Editor, click OK.
6.	For each port that requires an expression, repeat steps 4 - 5.
7.	Click OK.
CHAPTER 6
Overview, 25
Creating a Mapping to Cleanse, Parse, or Validate Data, 27
Creating a Mapping to Match Data from a Single Source, 28
Creating a Mapping to Match Data from Two Data Sources, 28
Creating a Mapping to Match Identity Information, 29
Overview
This section describes how to define mappings that apply the data quality plans you exported or imported to the PowerCenter repository within PowerCenter processes. The mapping descriptions demonstrate how data quality transformations interact with standard PowerCenter transformations and what dependencies may apply. How you define a mapping depends on the type of data quality plan you use.
Note: The descriptions illustrate the ways that mappings can be configured for different types of plan but do not represent every possible configuration.
Add a Sorter transformation to sort the data on a key field upstream of the data quality mapplet or Data Quality Integration transformation. Optionally, add a Sequence Generator transformation if your data lacks unique IDs. The Sorter transformation performs the same task as a pre-match grouping plan in Data Quality. Grouping plans are not necessary in PowerCenter. The Grouping port must be set on any Data Quality Integration transformation that contains a matching plan. Plan designers can create both single-source and dual-source matching plans in Data Quality Workbench. PowerCenter runs single-source matching plans only. Therefore, you must combine two data sources into a single stream to match across them in PowerCenter. Use a Union transformation to combine data from two data sources.
Identity Match

[Figure: CSV Identity Group Source or DB Identity Group Source → Identity Match (or other data quality match component) → CSV Identity Match Target]
The Integration installer installs the identity transformations to PowerCenter. For more information on identity matching in Data Quality, see the Informatica Data Quality User Guide.
You can configure a mapping like this one with multiple data quality mapplets to conduct cleansing, standardization, parsing, or validation operations in sequence in a single mapping. If your repository contains a Data Quality Integration transformation, you can use that transformation in place of the mapping.
To create a mapping that includes a single data quality mapplet or transformation:
1.	Add a Source Definition to the Mapping Designer workspace and connect to your source data.
2.	Add a Source Qualifier transformation. This reads the data from the source file and enables the data to be read by other transformations.
3.	Add the required data quality mapplet or transformation.
4.	Connect the outputs from the Source Qualifier to the input ports of the data quality mapplet. Connect like fields. For example, connect an output port carrying name data to an input port that anticipates name data.
5.	Add a Target Definition and connect the mapplet output ports to it.
1.	Add a Source Definition to the Mapping Designer workspace and connect to your source data.
2.	Add a Source Qualifier transformation. This reads the data from the source file and enables the data to be read by other transformations in the mapping.
	Note: If the input records lack unique identifiers, add a Sequence Generator transformation. This transformation will generate a series of incremented values for the records passed into it, creating a column of unique IDs. If the input records have unique IDs, you can omit this step.
3.	Add a Sorter transformation. Set this transformation to sort the input records according to values in a suitable field. To do so, open the transformation on the Ports tab and check the Key column box for the required port name.
4.	Add the required data quality mapplet.
5.	Connect all required ports. In the data quality mapplet, verify that you have selected as a group key port the field you set as the Key column in the Sorter transformation.
6.	Add a Target Definition and connect the mapplet output ports to it.
1.	Add two Source Definitions to the Mapping Designer workspace and connect to your source data.
2.	Add a Source Qualifier transformation for each source definition. These read the data from the source files and enable the data to be read by other transformations in the mapping.
3.	Add two Expression transformations. Use each Expression transformation to flag the data from each source as Source A and Source B. This facilitates matching across the sources.
4.	Add a Union transformation. Use this transformation to combine the Source A and Source B data into a single dataset, as required by the matching plan.
5.	Add a Sequence Generator transformation. This generates a series of incremented values for the records passed into it, creating a column of unique IDs. If the input records contain a field that constitutes a unique ID, you can omit this step.
6.	Add a Sorter transformation. Set the Sorter transformation to sort the input records according to values in a suitable field. To do so, open the transformation on the Ports tab and check the Key column box for the required port name.
7.	Add the required data quality mapplet.
8.	Connect all required ports. In the data quality mapplet, verify that you have selected as a group key port the field you set as the Key column in the Sorter transformation.
9.	Add a Target Definition and connect the mapplet output ports to it.
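The data-preparation steps of this procedure can be sketched in plain Python. This is a hypothetical illustration of the data flow only, not PowerCenter objects; the field names are invented:

```python
# Hypothetical sketch of the dual-source preparation steps: flag each source,
# union the rows, add sequence IDs, and sort on the match key before the data
# reaches the matching mapplet.

source_a = [{"name": "John Smith"}, {"name": "Mary Anne"}]
source_b = [{"name": "Jon Smith"}]

# Expression transformations: flag the origin of each record.
for rec in source_a:
    rec["source"] = "A"
for rec in source_b:
    rec["source"] = "B"

# Union transformation: combine both sources into a single dataset.
combined = source_a + source_b

# Sequence Generator: add a column of unique, incrementing IDs.
for seq, rec in enumerate(combined, start=1):
    rec["id"] = seq

# Sorter transformation: sort on the field chosen as the group key.
combined.sort(key=lambda r: r["name"])

for rec in combined:
    print(rec)
```

After these steps every record carries a source flag and a unique ID, and records with similar key values sit next to each other, which is what the single-source matching plan downstream requires.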
Population files install through the Data Quality Content Installer. For more information on installing population files, see the Informatica Data Quality Installation Guide.
To create a mapping that performs index key generation:
1.	Add a Source Definition to the Mapping Designer workspace and connect to your source data.
2.	Add a Source Qualifier transformation. This reads the data from the source file and enables the data to be read by other transformations.
3.	Add the required data quality mapplet or transformation.
Note: The Identity Key Store transformation in your mapplet creates an index of key values and writes it to
a folder on the system. You must ensure that any identity matching components that must read the index are able to do so. For information on setting the index directory, see page 41.
4.	Connect the outputs from the Source Qualifier to the input ports of the data quality mapplet. Connect like fields. For example, connect an output port carrying name data to an input port that anticipates name data.
5.	Add a Target Definition and connect the mapplet output ports to it.
The high-level steps to create a mapping that matches a source dataset against the contents of the key index are almost identical.
Note: You can perform identity matching across two datasets by selecting a different source dataset in each plan. For this form of matching to succeed, both datasets must have the same structure and their respective columns must contain the same types of information.
To create a mapping that performs identity match analysis between a source dataset and a key index:
1.	Add a Source Definition to the Mapping Designer workspace and connect to your source data.
2.	Add a Source Qualifier transformation. This reads the data from the source file and enables the data to be read by other transformations.
3.	Add the required data quality mapplet or transformation.
Note: The identity matching components within the mapplet used in this mapping must refer to the index
key folder specified in the preceding mapping. For information on setting the index directory, see page 41.
4.	Connect the outputs from the Source Qualifier to the input ports of the data quality mapplet. Connect like fields. For example, connect an output port carrying name data to an input port that anticipates name data.
5.	Add a Target Definition and connect the mapplet output ports to it.
Figure 6-4 shows the Mapping Designer view of a mapping created for identity match analysis above the Mapplet Designer view of the identity mapplet exported from Data Quality:
Figure 6-4. Identity Matching Mapping and Mapplet
[Figure 6-4: mapplet components include Mapplet Input, Identity Match Pair Generator, Edit Distance, Identity Match Identifier, and Mapplet Output]
Note: If a workflow containing an identity match analysis mapplet fails, you must delete the key index folder
that was read by the identity components in the mapplet. The workflow failure corrupts the index data. Recreate the index by running the workflow that contains the index key generation mapplet.
CHAPTER 7
Overview, 33
Associating and Consolidating Data from a Single Source, 33
Associating and Consolidating Data from Multiple Data Sources, 34
Overview
Informatica provides the Association and Consolidation transformations to process records that have been identified as potential duplicates. The transformations are designed to process data from data quality matching plans, although they are not limited to such data. Use the Association and Consolidation transformations to link groups of matching records and to generate a consolidated master record for each group. You can create mappings that use the Association and Consolidation transformations with Data Quality data in the following general formats:
Single source mapping. A single pipeline that routes all data through more than one matching plan before passing to Association and Consolidation transformations. Create this type of mapping when you want to run different matching plans on different ports in the same source dataset.
Multiple source mappings. Multiple pipelines that merge data from different sources before passing through a matching plan to Association and Consolidation transformations.
To understand how matching plans run in PowerCenter mappings, read chapter 6, Creating Mappings for Association and Consolidation before you read this chapter.
This type of mapping lets you run different matching plans on different ports in the same source data and create a single, consolidated record for each group of related records. Figure 7-1 shows an example of this type of mapping:
Figure 7-1. Associating and Consolidating Data from a Single Source
Sequence Generator transformation. If source data does not include a column of unique identifiers that can be used as a key column, you can use a Sequence Generator transformation to generate keys.
Data Quality Integration transformations or data quality mapplets. Use two Data Quality Integration transformations or two data quality mapplets that each contain a matching plan. For example, one plan may match surname data and the other may match city name data in an address dataset. They pass their respective cluster IDs to the Association transformation.
Sorter and Joiner transformations. Use the Sorter transformations to ensure that all records with common cluster IDs are sequenced together in the data streams. As the Association transformation allows only one input group, use a Joiner transformation to merge data. When you use more than two data sources, you can use multiple Joiner transformations in the pipeline.
Tip: When you use relational sources, you can configure the Number of Sorted Ports option in the Source Qualifier transformation to sort input data.
Association transformation. Use the cluster IDs generated by the matching plans to associate related data records. This transformation generates a single association ID for each associated group and provides output data, sorted according to the new groups, to the Consolidation transformation.
Consolidation transformation. Use to generate a single, consolidated record for each group of consecutive records that share a common association ID. The value for each field is created based on the expressions configured for each port.
As in the dual-source mappings described in Chapter 6, the two data sources are joined before matching takes place.
Data quality mapplets. Use two data quality mapplets that each contain a matching plan. Each plan may perform the same type of matching operations on the same or similar fields in each dataset. The plans pass their respective cluster IDs to the Association transformation.
Sorter and Joiner transformations. Use the Sorter transformations to ensure that all records with common cluster IDs are sequenced together in the data streams. As the Association transformation allows only one input group, use a Joiner transformation to merge data. When you use more than two data sources, you can use multiple Joiner transformations in the pipeline.
Tip: When you use relational sources, you can configure the Number of Sorted Ports option in the Source Qualifier transformation to sort input data.
Association transformation. Use the cluster IDs generated by the matching plans to associate related data records. This transformation generates a single association ID for each associated group and provides output data, sorted according to the new groups, to the Consolidation transformation.
Consolidation transformation. Use to generate a single, consolidated record for each group of consecutive records that share a common association ID. The value for each field is created based on the expressions configured for each port.
APPENDIX A
Overview, 37
Overview
Data matching plans identify duplicate records in a dataset or across datasets. Matching plans operate differently from other types of plan. To understand matching plans, you must understand how data quality mapplets handle matching in PowerCenter. A matching plan compares all values on a user-selected input port with one another and generates a match score for every comparison pair. The score represents the degree of similarity between the two matched values, taking the first value as a baseline and calculating the percentage similarity of the second value to the first. All values are matched against all other values in this way, and the plan designer sets a match threshold value that defines the level of similarity that constitutes a strong match. A set of records whose values demonstrate a high degree of similarity to one another is called a cluster.
Data Quality users typically create a separate grouping plan that runs before the matching plan. The grouping plan creates a set of temporary files or database entries that indicate the records that belong to each group. These files or database entries are read by the matching plan to determine which records are matched together. They can be discarded when the matching plan has run, and they can be recreated by re-running the grouping plan. Every time you run the grouping plan in Workbench, you overwrite the group data. In PowerCenter, pre-match grouping is not necessary. Use a Sorter transformation instead.
When designing a mapping with a data quality mapplet, add a Sorter transformation before the mapplet and sort the data records according to a key field. Link the Group Key port on the mapplet input to the Sorter transformation key field.
When designing a mapping with a Data Quality Integration transformation, add a Sorter transformation before the Data Quality Integration transformation and sort the data records according to a key field. Set the Grouping Port in the Data Quality Integration transformation to this field to enable match processing on grouped data.
As well as assigning data to groups, a grouping plan may create columns of potential group keys. You can select one of these columns as the group key in the Sorter transformation. Always select the same group key port in the mapplet input as you selected in the Sorter transformation.

Any column containing a statistically meaningful range of values can be used as a group key column, so long as the range of values has a meaningful association with the main focus of the matching exercise. For example, if your data quality plan focuses on matching person names, you could select date of birth information as a group key, on the basis that two records with common values for name and date of birth are likely to be the same person. In such a case, a City or Town name column would be a poor choice of group key, as there may be many people with similar or identical names in a city whose records are not duplicates of one another.

In Workbench you can create composite group keys composed of data from two or more existing fields. For example, you could create a composite group key that included both date of birth and city or town of residence.
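A composite group key of the kind described above can be sketched as a simple concatenation of normalized field values. This is a hypothetical illustration; the field names and the `|` separator are invented:

```python
# Hypothetical sketch of a composite group key built from date of birth and
# city: concatenating the two fields narrows each group so that matching
# only compares plausible duplicates.

def composite_key(record, fields=("dob", "city")):
    # Normalize each part so trivial case/spacing differences
    # do not split a group.
    return "|".join(str(record[f]).strip().lower() for f in fields)

rec = {"name": "J. Smith", "dob": "1970-01-02", "city": "Bethesda "}
print(composite_key(rec))  # 1970-01-02|bethesda
```

Records are then sorted on this key, so two J. Smith records born on the same day in the same city land in the same group and get compared by the matching plan.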
ClusterID Format
Data Quality Workbench and PowerCenter create ClusterID values in different ways. In Data Quality Workbench, the ClusterID values created by a matching plan are numbers that increment for each new cluster. In PowerCenter, the ClusterID value contains additional information that ensures it is unique within the system. The output format for a data row on the ClusterID port in PowerCenter is as follows:
<hostname>:<process id>:<thread id>:<internal cluster id for the row>
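The format above can be composed and decomposed mechanically. A hypothetical Python sketch (the helper names are invented, and this only mimics the format, not the Integration Service itself):

```python
# Hypothetical sketch of composing and parsing the PowerCenter-style
# ClusterID format <hostname>:<process id>:<thread id>:<internal cluster id>.
import os
import socket
import threading

def make_cluster_id(internal_id):
    return "{}:{}:{}:{}".format(
        socket.gethostname(), os.getpid(), threading.get_ident(), internal_id
    )

def parse_cluster_id(value):
    # rsplit from the right so a hostname containing ':' cannot
    # shift the numeric fields.
    hostname, process_id, thread_id, internal_id = value.rsplit(":", 3)
    return {"hostname": hostname, "process_id": int(process_id),
            "thread_id": int(thread_id), "internal_id": int(internal_id)}

cid = make_cluster_id(7)
print(parse_cluster_id(cid)["internal_id"])  # 7
```

The hostname, process ID, and thread ID prefixes are what make the PowerCenter value unique within the system, while the final field carries the per-run cluster number that Workbench would emit on its own.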
Note: If you are performing identity match analysis in PowerCenter, you must run an index key generation
process prior to an identity match analysis process. This creates the index of key values from which PowerCenter creates a set of possible identities within the source data. Data Quality plans use an Identity Group Target to create the key index. This target component installs to PowerCenter as the Identity Key Store transformation. When you create your mapping, add a Sorter transformation before the data quality mapplet and sort the data on the same port that you use as the grouping port. This ensures that all data rows with common values on this port are grouped together as they enter the mapplet.
APPENDIX B
Overview, 41
Overview
When you create a plan to generate a key index in Informatica Data Quality, you specify the location of the index within the Data Quality folder structure. When you create a plan to read that key index, you must set the path to the index in the plan's identity components. You must follow similar steps when you use a mapplet created from an identity matching plan in a PowerCenter mapping. When you add the parent mapping to a session, you must specify the index location as session-level properties for any identity component that reads or writes an identity index in the session. Three transformations can read or write an index: Identity Match Identifier, Identity Key Store, and Identity Match Pair Generator.
Figure B-1 shows the Edit Task dialog box with an Identity Key Store transformation selected.
Figure B-1. Setting the Identity Index Location, Edit Tasks dialog box
In PowerCenter 8.5.1, the Identity Key Store and Identity Match Pair Generator transformations do not read the index location from the session-level properties. Instead, they read the index location from the mapplet variables. This location is typically a folder within the Data Quality folder structure on the PowerCenter Integration Service machine. If the data quality plan read an index folder at this location:
C:\Program Files\Informatica Data Quality\Identity\MyIndexes
Then the Identity Key Store and Identity Match Pair Generator transformations in PowerCenter 8.5.1 will read the index folder at this location:
C:\Informatica\PowerCenter8.5.1
You must ensure that the Identity Match Identifier reads the index from the same folder. Set the folder path for the Identity Match Identifier as a session-level property.
Note: PowerCenter provides an EBF that enables the transformations affected on the Integration Service
machine to read the session properties. For more information, contact Global Customer Support. Table B-1 summarizes how different versions of PowerCenter read the key index folder location.
Table B-1. Index Folder Location Settings in PowerCenter
Transformation name:           Identity Key Store
Data Quality component name:   Identity Group Target
PowerCenter 8.1.1 SP5:         Reads the index folder location from session-level properties
PowerCenter 8.5.1:             Reads the index folder location at a path specified in the mapplet
PowerCenter 8.6.0:             Reads the index folder location from session-level properties
NOTICES
This Informatica product (the Software) includes certain drivers (the DataDirect Drivers) from DataDirect Technologies, an operating company of Progress Software Corporation (DataDirect) which are subject to the following terms and conditions:
1. THE DATADIRECT DRIVERS ARE PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.