
Data Profiling Guide

Informatica PowerCenter®
(Version 7.1.2)
Informatica PowerCenter® Data Profiling Guide
Version 7.1.2
February 2005

Copyright (c) 2003-2005 Informatica Corporation.


All rights reserved. Printed in the USA.

This software and documentation contain proprietary information of Informatica Corporation; they are provided under a license agreement
containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No
part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)
without prior consent of Informatica Corporation.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software
license agreement as provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR
12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this document is subject to change without notice. If you find any problems in the documentation, please report them to
us in writing. Informatica Corporation does not warrant that this documentation is error free.
Informatica, PowerMart, PowerCenter, PowerCenter Connect, PowerConnect, and PowerChannel are trademarks or registered trademarks of
Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade
names or trademarks of their respective owners.

Portions of this software are copyrighted by DataDirect Technologies, 1991-2000.

Informatica PowerCenter products contain ACE (TM) software copyrighted by Douglas C. Schmidt and his research group at Washington
University and University of California, Irvine, Copyright © 1993-2002, all rights reserved.

DISCLAIMER: Informatica Corporation provides this documentation “as is” without warranty of any kind, either express or implied,
including, but not limited to, the implied warranties of non-infringement, merchantability, or fitness for a particular purpose. The information
provided in this documentation may include technical inaccuracies or typographical errors. Informatica could make improvements and/or
changes in the products described in this documentation at any time without notice.
Table of Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
New Features and Enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xiv
About Informatica Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
About this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
Document Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xvi
Other Informatica Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Visiting Informatica Customer Portal . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Visiting the Informatica Webzine . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Visiting the Informatica Web Site . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Visiting the Informatica Developer Network . . . . . . . . . . . . . . . . . . . . xvii
Obtaining Technical Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xviii

Chapter 1: Data Profiling Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


Understanding Data Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Profiling Source Data During Mapping Development . . . . . . . . . . . . . . . 2
Profiling Source Data to Monitor Data Quality . . . . . . . . . . . . . . . . . . . . 2
Steps for Profiling Source Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Step 1. Create the Data Profiling Warehouse . . . . . . . . . . . . . . . . . . . . . . 3
Step 2. Create a Data Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Step 3. Create and Run the Profile Session . . . . . . . . . . . . . . . . . . . . . . . 4
Step 4. View Profile Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Using the Profile Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Profile View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Source View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Chapter 2: Installing and Configuring Data Profiling . . . . . . . . . . . . . 9


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Installing and Configuring Data Profiling . . . . . . . . . . . . . . . . . . . . . . . 10
Upgrading from a Previous Version . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Installing PowerAnalyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Configuring a Data Profiling Warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Creating a Data Profiling Warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Upgrading the Data Profiling Warehouse . . . . . . . . . . . . . . . . . . . . . . . 13
Configuring a Relational Database Connection to the Data Profiling
Warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Installing PowerAnalyzer Data Profiling Reports . . . . . . . . . . . . . . . . . . . . . 16
Installing Data Profiling XML Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Importing Data Profiling Schema and Reports . . . . . . . . . . . . . . . . . . . 16
Configuring a Data Source for the Data Profiling Warehouse . . . . . . . . . 20
Creating a Data Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Adding a Data Profiling Data Source to a Data Connector . . . . . . . . . . . 23
Configuring Default Data Profile Options . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Purging the Data Profiling Warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

Chapter 3: Managing Data Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . .31


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Profiling Sources and Mapplet Output Data . . . . . . . . . . . . . . . . . . . . . . 32
Profile Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Creating an Auto Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Auto Profile Naming Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Steps to Create an Auto Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Creating a Custom Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Step 1. Specify a Profile Name and Profile Description . . . . . . . . . . . . . 38
Step 2. Add Sources to the Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Step 3. Add Functions and Enable Session Configuration . . . . . . . . . . . . 39
Step 4. Configure Profile Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Step 5. Configure the Profile Session . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Generating the Profile Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Editing a Data Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Deleting a Data Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Working with Profile Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Copying Data Profiling Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Combining Data Profile Mappings with Other Mappings . . . . . . . . . . . . 51
Using Mapplets to Extend Data Profiling Functions . . . . . . . . . . . . . . . . . . . 52
Extending Data Profiling Functionality with Mapplets . . . . . . . . . . . . . . 52
Profiling Aggregate Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Profiling Multiple Sources with One Matching Port . . . . . . . . . . . . . . . . 55

Profiling Sources that Use All Matching Ports . . . . . . . . . . . . . . . . . . . . 57
Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

Chapter 4: Running Profile Sessions . . . . . . . . . . . . . . . . . . . . . . . . . 61


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Working with Data Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Running Sessions from the Profile Manager . . . . . . . . . . . . . . . . . . . . . . . . 63
Configuring a Session in the Profile Wizard . . . . . . . . . . . . . . . . . . . . . 64
Running a Session when You Create a Data Profile . . . . . . . . . . . . . . . . 67
Running a Session for an Existing Data Profile . . . . . . . . . . . . . . . . . . . 67
Monitoring Interactive Sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Profiling Data Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Step 1. Select a Function Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Step 2. Select a Data Sampling Mode . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Profiling Relational Data Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Performing Sampling as the PowerCenter Server Reads Data . . . . . . . . . 70
Delegating Sampling to the Database . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Improving Data Sampling Accuracy and Performance . . . . . . . . . . . . . . 72
Creating a Session in the Workflow Manager . . . . . . . . . . . . . . . . . . . . . . . 74
Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

Chapter 5: Viewing Profile Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 77


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
PowerCenter Data Profiling Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Auto Profile Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Custom Profile Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Viewing PowerCenter Data Profiling Reports . . . . . . . . . . . . . . . . . . . . 83
PowerAnalyzer Data Profiling Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Chapter 6: Working with Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . 89


Working with Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Prepackaged Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Custom Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
List of Values Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Regular Expression Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Domain Definition Filename Domains . . . . . . . . . . . . . . . . . . . . . . . . . 98
Specifying Localization and Code Page Information . . . . . . . . . . . . . . . 99

Editing a Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Deleting a Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

Chapter 7: Working with Functions . . . . . . . . . . . . . . . . . . . . . . . . . 103


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Source-Level Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Row Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Business Rule Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Candidate Key Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Redundancy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Row Uniqueness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Column-Level Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Business Rule Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Domain Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Domain Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Aggregate Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Distinct Value Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Intersource Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Orphan Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Join Complexity Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

Appendix A: Data Profiling Views . . . . . . . . . . . . . . . . . . . . . . . . . . 127


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
List of Data Profiling Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
DPR_LATEST_PRFLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
DPR_PRFL_AUTO_COL_FN_METRICS . . . . . . . . . . . . . . . . . . . . . . . . 130
DPR_PRFL_CART_PROD_METRICS . . . . . . . . . . . . . . . . . . . . . . . . . . 131
DPR_PRFL_COL_FN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
DPR_PRFL_COL_FN_METRICS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
DPR_PRFL_COL_FN_VERBOSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
DPR_PRFL_CP_FN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
DPR_PRFL_FN_DTLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
DPR_PRFL_OJ_FN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
DPR_PRFL_OJ_FN_VERBOSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
DPR_PRFL_OJ_METRICS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
DPR_PRFL_RUN_DTLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
DPR_PRFL_SRC_FN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

DPR_PRFL_SRC_FN_METRICS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
DPR_PRFL_SRC_FN_VERBOSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
DPR_PRFL_VER_DTLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

Appendix B: Code Page Compatibility . . . . . . . . . . . . . . . . . . . . . . . 151


Code Page Compatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

Appendix C: Data Profiling Error Messages . . . . . . . . . . . . . . . . . . 155


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Designer Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Server Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
DP Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

Appendix D: Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169


Glossary Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

List of Figures
Figure 1-1. Profile Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Figure 1-2. Profile Manager Source View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Figure 2-1. Configuring Default Profile Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Figure 3-1. Profile Wizard - Auto Profile Column Selection Page . . . . . . . . . . . . . . . . . . . . . . 36
Figure 3-2. Profile Wizard - Profile Sources Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 3-3. Profile Wizard - Function-Level Operations Page . . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 3-4. Profile Wizard - Function Details Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Figure 3-5. Profile Wizard - Function Role Details Page (Row Count Function) . . . . . . . . . . . 42
Figure 3-6. Data Profile Mappings in the Designer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Figure 3-7. Sample Mapplet to Aggregate and Filter Mapplet Output Data . . . . . . . . . . . . . . . 53
Figure 3-8. Sample Mapplet - Aggregator Transformation Settings . . . . . . . . . . . . . . . . . . . . . 54
Figure 3-9. Sample Verbose Summary Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 3-10. Sample Mapplet to Join Data from Two Sources . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 3-11. Sample Data Profiling Report for Mapplet Output Data . . . . . . . . . . . . . . . . . . . 57
Figure 3-12. Sample Mapplet to Merge Mapplet Output Data . . . . . . . . . . . . . . . . . . . . . . . . 58
Figure 3-13. Sample Data Profiling Report for the Union Mapplet . . . . . . . . . . . . . . . . . . . . . 59
Figure 4-1. Profile Wizard - Profile Run Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Figure 4-2. Profile Wizard - Session Setup Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 4-3. Profile Session Status in the Profile Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Figure 4-4. Manual PowerCenter Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 4-5. Automatic Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 4-6. Data Sampling Delegated to the Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 4-7. Data Profile of an Historical Sample of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 5-1. Sample Auto Profile Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Figure 5-2. Sample Verbose Data Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Figure 5-3. Sample Custom Profile Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Figure 5-4. PowerAnalyzer Data Profiling Analytic Workflow . . . . . . . . . . . . . . . . . . . . . . . . 85
Figure 6-1. Regular Expression Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Figure 6-2. Test Data Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Figure 7-1. Row Count Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Figure 7-2. Business Rule Validation Source-Level Function . . . . . . . . . . . . . . . . . . . . . . . . 107
Figure 7-3. Business Rule Validation Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Figure 7-4. Candidate Key Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Figure 7-5. Redundancy Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Figure 7-6. Row Uniqueness Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Figure 7-7. Business Rule Validation Column-Level Function . . . . . . . . . . . . . . . . . . . . . . . 114
Figure 7-8. Domain Validation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Figure 7-9. Domain Browser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Figure 7-10. Domain Inference Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Figure 7-11. Aggregate Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

Figure 7-12. Distinct Value Count Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Figure 7-13. Orphan Analysis Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Figure 7-14. Join Complexity Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Figure B-1. Configuring Code Page Compatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

List of Tables
Table 2-1. Scripts for Creating a Data Profiling Warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Table 2-2. Scripts for Upgrading the Data Profiling Warehouse . . . . . . . . . . . . . . . . . . . . . . . 13
Table 2-3. Data Connector Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Table 2-4. Default Data Profile Options - General Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Table 2-5. Default Data Profile Options - Prefixes Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Table 3-1. Employee Expenses Sample Source Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Table 4-1. Profile Run Properties for Profile Sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Table 4-2. Session Setup Properties for Profile Sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Table 4-3. Function Behavior with Data Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Table 4-4. Recommended Percentages for Manual Random Sampling . . . . . . . . . . . . . . . . . . . 73
Table 5-1. Auto Profile Report Summary Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Table 5-2. Custom Profile Report Summary Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Table 5-3. Custom Profile Report Function Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Table 5-4. PowerAnalyzer Data Profiling Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Table 6-1. Prepackaged Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Table 6-2. perl Syntax Guidelines for Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Table 6-3. COBOL Syntax and perl Syntax Compared . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Table 6-4. SQL Syntax and perl Syntax Compared . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Table 7-1. Row Count Function Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Table 7-2. Business Rule Validation Source-Level Function Options . . . . . . . . . . . . . . . . . . . 107
Table 7-3. Candidate Key Evaluation Function Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Table 7-4. Redundancy Evaluation Function Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Table 7-5. Row Uniqueness Function Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Table 7-6. Business Rule Validation Column-Level Function Options . . . . . . . . . . . . . . . . . . 114
Table 7-7. Domain Validation Function Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Table 7-8. Aggregate Functions to Add Based on the Column Datatype. . . . . . . . . . . . . . . . . 119
Table 7-9. Aggregate Function Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Table 7-10. Distinct Value Count Function Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Table 7-11. Orphan Analysis Function Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Table A-1. DPR_LATEST_PRFLS Column Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Table A-2. DPR_PRFL_AUTO_COL_FN_METRICS Column Information . . . . . . . . . . . . . 130
Table A-3. DPR_PRFL_CART_PROD_METRICS Column Information . . . . . . . . . . . . . . . 131
Table A-4. DPR_PRFL_COL_FN Column Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Table A-5. DPR_PRFL_COL_FN_METRICS Column Information . . . . . . . . . . . . . . . . . . . 134
Table A-6. DPR_PRFL_COL_FN_VERBOSE Column Information . . . . . . . . . . . . . . . . . . . 136
Table A-7. DPR_PRFL_CP_FN Column Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Table A-8. DPR_PRFL_FN_DTLS Column Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Table A-9. DPR_PRFL_OJ_FN Column Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Table A-10. DPR_PRFL_OJ_FN_VERBOSE Column Information . . . . . . . . . . . . . . . . . . . 143
Table A-11. DPR_PRFL_OJ_METRICS Column Information . . . . . . . . . . . . . . . . . . . . . . . 144

Table A-12. DPR_PRFL_RUN_DTLS Column Information . . . . . . . . . . . . . . . . . . . . . . . 145
Table A-13. DPR_PRFL_SRC_FN Column Information . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Table A-14. DPR_PRFL_SRC_FN_METRICS Column Information . . . . . . . . . . . . . . . . . 148
Table A-15. DPR_PRFL_SRC_FN_VERBOSE Column Information . . . . . . . . . . . . . . . . . 149
Table A-16. DPR_PRFL_VER_DTLS Column Information . . . . . . . . . . . . . . . . . . . . . . . . 150



Preface

Welcome to PowerCenter, Informatica’s integrated suite of software products that delivers an
open, scalable data integration solution addressing the complete life cycle for data warehouse
and analytic application development. PowerCenter combines the latest technology
enhancements for reliably managing data repositories and delivering information resources in
a timely, usable, and efficient manner.
PowerCenter Data Profiling is a technique for analyzing source data. PowerCenter Data
Profiling lets you infer implicit datatypes, suggest candidate keys, detect data patterns, and
evaluate join criteria.
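PowerCenter performs these analyses internally, but the underlying idea of inferring an implicit datatype is simple: check whether every string-typed source value would parse as a narrower type. The following is a minimal, generic Python sketch of that idea only, and is not PowerCenter code:

```python
def infer_implicit_type(values):
    """Guess the narrowest datatype that fits every non-null value."""
    def fits(parse):
        try:
            for v in values:
                if v is not None:  # nulls do not constrain the type
                    parse(v)
            return True
        except (TypeError, ValueError):
            return False

    if fits(int):
        return "integer"
    if fits(float):
        return "decimal"
    return "string"

# A column stored as strings may still hold numeric data:
print(infer_implicit_type(["42", "7", "1003"]))   # integer
print(infer_implicit_type(["3.14", "2", None]))   # decimal
print(infer_implicit_type(["A-101", "B-207"]))    # string
```

A real profiler extends the same check to dates, fixed precision, and field widths, but the narrowest-type-that-fits principle is the same.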

New Features and Enhancements
This section describes new features and enhancements to Data Profiling 7.1.1.
♦ Data sampling. You can create a data profile for a sample of source data instead of the
entire source. You can view a profile from a random sample of data, a specified percentage
of data, or for a specified number of rows starting with the first row.
♦ Verbose data enhancements. You can specify the type of verbose data you want the
PowerCenter Server to write to the Data Profiling warehouse. The PowerCenter Server can
write all rows, only the rows that meet the business rule, or only the rows that do not meet
the business rule.
♦ Profile session enhancements. You can save sessions that you create from the Profile
Manager to the repository.
♦ Domain Inference function tuning. You can configure the Profile Wizard to filter the
Domain Inference function results. You can configure a maximum number of patterns and
a minimum pattern frequency. You may want to narrow the scope of patterns returned to
view only the primary domains, or you may want to widen the scope of patterns returned
to view exception data.
♦ Row Uniqueness function. You can determine unique and duplicate rows for a source
based on a selection of its columns.
♦ Define mapping, session, and workflow prefixes. You can define default mapping,
session, and workflow prefixes for the mappings, sessions, and workflows generated when
you create a data profile.
♦ Profile mapping display in the Designer. The Designer now displays profile mappings
under a profile mappings icon in the repository folder.
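The three sampling modes described above — a random sample, a specified percentage of data, and the first N rows — are standard techniques. The following generic Python sketch illustrates each mode in isolation; the function names and the fixed-seed choice are illustrative assumptions, not the PowerCenter implementation:

```python
import random

def first_n(rows, n):
    """Profile only the first n rows of the source."""
    return rows[:n]

def percentage_sample(rows, pct, seed=0):
    """Keep roughly pct percent of rows, chosen at random.

    A fixed seed makes repeated profile runs comparable.
    """
    rng = random.Random(seed)
    return [r for r in rows if rng.random() * 100 < pct]

def random_sample(rows, n, seed=0):
    """Draw exactly n rows uniformly at random (all rows if n exceeds the source)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

rows = list(range(1000))
print(len(first_n(rows, 100)))        # 100
print(len(random_sample(rows, 50)))   # 50
```

First-N sampling is cheapest but reflects only the top of the source; random and percentage sampling cost a full read but give statistics that generalize to the whole source.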

About Informatica Documentation
The complete set of documentation for PowerCenter includes the following books:
♦ Data Profiling Guide. Provides information about how to profile PowerCenter sources to
evaluate source data and detect patterns and exceptions.
♦ Designer Guide. Provides information needed to use the Designer. Includes information to
help you create mappings, mapplets, and transformations. Also includes a description of
the transformation datatypes used to process and transform source data.
♦ Getting Started. Provides basic tutorials for getting started.
♦ Installation and Configuration Guide. Provides information needed to install and
configure the PowerCenter tools, including details on environment variables and database
connections.
♦ PowerCenter Connect® for JMS® User and Administrator Guide. Provides information
to install PowerCenter Connect for JMS, build mappings, extract data from JMS messages,
and load data into JMS messages.
♦ Repository Guide. Provides information needed to administer the repository using the
Repository Manager or the pmrep command line program. Includes details on
functionality available in the Repository Manager and Administration Console, such as
creating and maintaining repositories, folders, users, groups, and permissions and
privileges.
♦ Transformation Language Reference. Provides syntax descriptions and examples for each
transformation function provided with PowerCenter.
♦ Transformation Guide. Provides information on how to create and configure each type of
transformation in the Designer.
♦ Troubleshooting Guide. Lists error messages that you might encounter while using
PowerCenter. Each error message includes one or more possible causes and actions that
you can take to correct the condition.
♦ Web Services Provider Guide. Provides information you need to install and configure the Web
Services Hub. This guide also provides information about how to use the web services that the
Web Services Hub hosts. The Web Services Hub hosts Real-time Web Services, Batch Web
Services, and Metadata Web Services.
♦ Workflow Administration Guide. Provides information to help you create and run
workflows in the Workflow Manager, as well as monitor workflows in the Workflow
Monitor. Also contains information on administering the PowerCenter Server and
performance tuning.
♦ XML User Guide. Provides information you need to create XML definitions from XML,
XSD, or DTD files, and relational or other XML definitions. Includes information on
running sessions with XML data. Also includes details on using the midstream XML
transformations to parse or generate XML data within a pipeline.

About this Book
The Informatica PowerCenter Data Profiling Guide provides information about installing Data
Profiling, building data profiles, running profile sessions, and viewing profile results. It is
written for database administrators and developers who are responsible for building
PowerCenter mappings and running PowerCenter workflows.
This book assumes you have knowledge of relational database concepts, database engines, and
PowerCenter. You should also be familiar with the interface requirements for other
supporting applications.
The material in this book is available for online use.

Document Conventions
This guide uses the following formatting conventions:

italicized text: The word or set of words is especially emphasized.

boldfaced text: Emphasized subjects.

italicized monospaced text: The variable name for a value you enter as part of an operating
system command. This is generic text that you replace with user-supplied values.

Note: The paragraph that follows provides additional facts.

Tip: The paragraph that follows provides suggested uses.

Warning: The paragraph that follows notes situations where you can overwrite or corrupt data
if you do not follow the specified procedure.

monospaced text: A code example.

bold monospaced text: An operating system command you enter from a prompt to run a task.

Other Informatica Resources
In addition to the product manuals, Informatica provides these other resources:
♦ Informatica Customer Portal
♦ Informatica Webzine
♦ Informatica web site
♦ Informatica Developer Network
♦ Informatica Technical Support

Visiting Informatica Customer Portal


As an Informatica customer, you can access the Informatica Customer Portal site at
http://my.informatica.com. The site contains product information, user group information,
newsletters, access to the Informatica customer support case management system (ATLAS),
the Informatica Knowledgebase, Informatica Webzine, and access to the Informatica user
community.

Visiting the Informatica Webzine


The Informatica Documentation team delivers an online journal, the Informatica Webzine.
This journal provides solutions to common tasks, detailed descriptions of specific features,
and tips and tricks to help you develop data warehouses.
The Informatica Webzine is a password-protected site that you can access through the
Customer Portal. The Customer Portal has an online registration form for login accounts to
its webzine and web support. To register for an account, go to http://my.informatica.com.
If you have any questions, please email webzine@informatica.com.

Visiting the Informatica Web Site


You can access Informatica’s corporate web site at http://www.informatica.com. The site
contains information about Informatica, its background, and upcoming events, and tells you
how to locate your nearest sales office. You will also find product information, as well as
literature and partner information. The services area of the site includes important
information about technical support, training and education, and implementation services.

Visiting the Informatica Developer Network


The Informatica Developer Network is a web-based forum for third-party software
developers. You can access the Informatica Developer Network at the following URL:
http://devnet.informatica.com

The site contains information on how to create, market, and support customer-oriented
add-on solutions based on Informatica’s interoperability interfaces.

Obtaining Technical Support


There are many ways to access Informatica technical support. You can call or email your
nearest Technical Support Center listed below or you can use our WebSupport Service.
WebSupport requires a user name and password. You can request a user name and password at
http://my.informatica.com.

North America / South America:
Informatica Corporation
100 Cardinal Way
Redwood City, CA 94063
Phone: 866.563.6332 or 650.385.5800
Fax: 650.213.9489
Hours: 6 a.m. - 6 p.m. (PST/PDT)
email: support@informatica.com

Africa / Asia / Australia / Europe:
Informatica Software Ltd.
6 Waltham Park
Waltham Road, White Waltham
Maidenhead, Berkshire SL6 3TN
Phone: 44 870 606 1525
Fax: +44 1628 511 411
Hours: 9 a.m. - 5:30 p.m. (GMT)
email: support_eu@informatica.com

Belgium
Phone: +32 15 281 702
Hours: 9 a.m. - 5:30 p.m. (local time)

France
Phone: +33 1 41 38 92 26
Hours: 9 a.m. - 5:30 p.m. (local time)

Germany
Phone: +49 1805 702 702
Hours: 9 a.m. - 5:30 p.m. (local time)

Netherlands
Phone: +31 30 602 2797
Hours: 9 a.m. - 5:30 p.m. (local time)

Singapore
Phone: +65 322 8589
Hours: 9 a.m. - 5 p.m. (local time)

Switzerland
Phone: +41 800 81 80 70
Hours: 8 a.m. - 5 p.m. (local time)

Chapter 1

Data Profiling Overview

This chapter includes the following topics:


♦ Understanding Data Profiling, 2
♦ Steps for Profiling Source Data, 3
♦ Using the Profile Manager, 6

Understanding Data Profiling
Data profiling is a technique for analyzing source data. PowerCenter Data Profiling can
help you evaluate source data and detect patterns and exceptions. PowerCenter lets you profile
source data to suggest candidate keys, detect data patterns, evaluate join criteria, and
infer information such as implicit datatypes.
You can use Data Profiling to analyze source data in the following situations:
♦ During mapping development
♦ During production to maintain data quality
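As a rough illustration of the kind of analysis data profiling performs, implicit datatype detection can be thought of as finding the narrowest type that fits every value in a column. The following Python sketch is illustrative only and is not PowerCenter's implementation; the type names and sample data are assumptions for the example:

```python
# Illustrative sketch of implicit datatype detection: column values arrive
# as strings, and we infer the narrowest type that fits every non-empty value.
# Not PowerCenter code; the recognized types here are an assumption.
from datetime import datetime

def infer_implicit_datatype(values):
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    def is_date(v):
        try:
            datetime.strptime(v, "%Y-%m-%d")
            return True
        except ValueError:
            return False

    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "string"
    if all(is_int(v) for v in non_empty):
        return "integer"
    if all(is_float(v) for v in non_empty):
        return "decimal"
    if all(is_date(v) for v in non_empty):
        return "date"
    return "string"

print(infer_implicit_datatype(["10", "42", "7"]))             # integer
print(infer_implicit_datatype(["10.5", "42", "7"]))           # decimal
print(infer_implicit_datatype(["2005-02-01", "2005-02-02"]))  # date
```

A column declared as a string but whose values all parse as integers, for example, would be reported with an implicit integer datatype.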

Profiling Source Data During Mapping Development


You can use Data Profiling during mapping development for the following purposes:
♦ To learn more about your source data. You can profile the data to understand data
patterns and exceptions. As a result, you can base the design of mappings and workflows
on actual data, rather than on theoretical assumptions about the sources.
♦ To validate documented business rules about the source data. For example, if you have a
business rule that columns in a source table must contain U.S. zip codes, you can profile
the source data to verify that the rows in this table contain the proper values.
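The zip code rule above amounts to a validation check over the column values. The following Python sketch shows the spirit of such a check; it is illustrative only, not PowerCenter code, and the regular expression and sample rows are assumptions:

```python
# Illustrative sketch of a business rule validation (U.S. zip codes):
# count the rows that satisfy the rule and collect the rows that do not.
# Not PowerCenter code; the rule pattern and data are invented for the example.
import re

ZIP_RULE = re.compile(r"^\d{5}(-\d{4})?$")  # 5-digit zip or ZIP+4

def validate_zip_codes(rows):
    """Return (valid_count, invalid_rows) for a list of zip code values."""
    invalid = [r for r in rows if not ZIP_RULE.match(r)]
    return len(rows) - len(invalid), invalid

valid, invalid = validate_zip_codes(["94063", "94063-1234", "9406", "ABCDE"])
print(valid)    # 2
print(invalid)  # ['9406', 'ABCDE']
```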

Profiling Source Data to Monitor Data Quality


You can use Data Profiling during production for the following purposes:
♦ To verify that the assumptions you made about your source data during project
development are still valid. For example, you may want to view statistics about how many
rows satisfied a business rule and how many did not.
♦ To verify the validity of business intelligence (BI) reports. When you use a BI tool, you
can profile your data to verify the validity of your BI reports. For example, if you run a BI
report that states that your most profitable customers are in California, you might want to
profile your source data to verify what percentage of data about your customers contains a
state code. If a certain percentage of the data does not contain a state code, the results of
your BI report may not be reliable.
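The state code check described above is a completeness measurement: what percentage of customer rows carry a state code at all? A minimal Python sketch, with hypothetical customer records rather than PowerCenter structures:

```python
# Illustrative sketch of a completeness check used to sanity-check a BI
# report: the percentage of rows with a non-empty state code.
# Hypothetical data and field names; not PowerCenter code.
def state_code_coverage(customers):
    """Return the percentage of rows that have a non-empty state code."""
    if not customers:
        return 0.0
    populated = sum(1 for c in customers if c.get("state", "").strip())
    return 100.0 * populated / len(customers)

customers = [
    {"name": "Acme", "state": "CA"},
    {"name": "Globex", "state": ""},
    {"name": "Initech", "state": "NY"},
    {"name": "Umbrella"},          # no state key at all
]
print(state_code_coverage(customers))  # 50.0
```

If the coverage came back at, say, 50 percent, a report ranking customers by state would be drawing on only half the data.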



Steps for Profiling Source Data
You create data profiles to profile source data in PowerCenter. A data profile contains
functions that perform calculations on the source data. When you create a data profile, the
Designer generates a mapping. The PowerCenter repository stores your data profile and the
associated mapping.
You can run profile sessions for the mapping to gather information about your source data.
The Data Profiling warehouse stores the results of profile sessions. After you run profile
sessions, you can view reports that display the profile session results.
To profile a source, mapplet, or groups in a source or mapplet, perform the following tasks:
1. Create the Data Profiling warehouse (one-time step).
2. Create a data profile.
3. Create and run a session for the data profile.
4. View reports based on the profile results.
The PowerCenter Designer provides a Profile Manager and Profile Wizard to complete these
tasks.

Step 1. Create the Data Profiling Warehouse


The Data Profiling warehouse is a relational database that stores data profile information from
profile sessions. When you run multiple sessions for a data profile, the Data Profiling
warehouse retains the information for each session run. If you update your data profile, the
Data Profiling warehouse creates a new version of the data profile.
The PowerCenter Client and other applications can access the Data Profiling warehouse for
reporting purposes. For more information about viewing Data Profiling reports, see “Viewing
Profile Results” on page 77.
You create a separate Data Profiling warehouse for each repository when you configure Data
Profiling. You can create the warehouse on any relational database that PowerCenter supports.
You can also create the warehouse on the same database instance as your PowerCenter
repository.
Note: If you create a Data Profiling warehouse on Informix, you cannot use PowerAnalyzer to
view Data Profiling reports.

Step 2. Create a Data Profile


To profile a source, you must first create a data profile. Data profiles contain functions that
perform calculations on the source data. For example, you can use an Aggregate function or
Business Rule Validation function in a data profile. For more information about profile
functions, see “Working with Functions” on page 103.



The PowerCenter repository stores the profile as an object in the repository. You can apply
profile functions to a column within a source, to a single source, or to more than one source.
You can create the following types of data profiles:
♦ Auto profile. Contains a predefined set of functions for profiling source data. You can use
an auto profile during mapping development to learn more about your source data. For
more information about auto profiles, see “Creating an Auto Profile” on page 34.
♦ Custom profile. A profile you define with the functions you need to profile source data.
You can use a custom profile during mapping development to validate documented
business rules about the source data. You also use a custom profile to monitor data quality.
In addition, you can use a custom profile to validate the results of BI reports. For more
information about creating custom profiles, see “Creating a Custom Profile” on page 38.
You use the Designer to create an auto profile. You can use the Designer or the Profile
Manager to create a custom profile by invoking the Profile Wizard. Once you create an auto
profile or custom profile, the Designer generates a mapping based on the profile information.
PowerCenter stores the data profile and profile mapping in your repository.
You can configure a data profile to write verbose data to the Data Profiling warehouse during
a profile session. Verbose data provides more details about the data that results from a profile
function. For example, for a function that validates business rules, verbose data may include
the invalid rows in the source. For a function that determines the number of distinct values,
verbose data can include a list of distinct values.
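The difference between summary results and verbose data can be sketched as follows. The dictionary structures here are hypothetical illustrations of the idea, not the Data Profiling warehouse schema:

```python
# Illustrative sketch of summary vs. verbose output for a distinct-values
# profile function. Hypothetical result structures; not the actual
# Data Profiling warehouse schema.
from collections import Counter

def profile_distinct(values, verbose=False):
    counts = Counter(values)
    result = {"distinct_count": len(counts)}
    if verbose:
        # Verbose mode also records the distinct values themselves.
        result["distinct_values"] = sorted(counts)
    return result

data = ["CA", "NY", "CA", "TX", "NY"]
print(profile_distinct(data))                # {'distinct_count': 3}
print(profile_distinct(data, verbose=True))  # adds the list ['CA', 'NY', 'TX']
```

Verbose data costs more storage per session run, which is why it is an option you enable per profile rather than the default.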
After you create a data profile, you can view profile details from the Profile Manager. You can
also modify and remove the data profile. For more information about the Profile Manager, see
“Using the Profile Manager” on page 6. For more information about creating, editing, and
deleting data profiles, see “Managing Data Profiles” on page 31.

Step 3. Create and Run the Profile Session


Once you define a profile, run a PowerCenter session for the profile mapping. The
PowerCenter Server writes the profile session results to the Data Profiling warehouse.
You can run profile sessions from the following places:
♦ Profile Manager. You can create and run temporary and persistent profile sessions from
the Profile Manager. A temporary session runs on demand and is not stored in the
repository. A persistent session can run on demand and is stored in the repository.
♦ Workflow Manager. You can create and run persistent sessions from the Workflow
Manager, which are stored in the repository. Use the Workflow Manager to create a session
for a profile mapping and include the session in a workflow. You can run the workflow
from the Workflow Manager or at a scheduled time.
For more information about running profile sessions, see “Running Profile Sessions” on
page 61.



Step 4. View Profile Reports
After the profile session completes, the PowerCenter Server loads the session results to the
Data Profiling warehouse. You can view the session results using profile reports.
You can view reports on the profile session results in the following ways:
♦ View PowerCenter Data Profiling reports from the Profile Manager.
♦ View customizable reports in PowerAnalyzer.
For more information about viewing profile reports, see “Viewing Profile Results” on page 77.



Using the Profile Manager
The Profile Manager is a tool in the Designer to help you manage data profiles. Use the
Profile Manager to set default profiling options, work with data profiles in your repository,
run profile sessions, view profile results, and view sources and mapplets with at least one
profile defined for them. When you launch the Profile Manager, you can access profile
information for the open folders in your repository.
There are two views in the Profile Manager:
♦ Profile View. The Profile View tab displays the data profiles in the open folders in your
repository.
♦ Source View. The Source View tab displays the source definitions in the open folders in
your repository for which you have defined data profiles.
Note: If the repository folder is read-only, you can only view data profiles in the Profile View.

From the Profile View and the Source View, you can perform the following tasks to manage,
run, and view data profiles:
♦ Create a custom profile. For more information, see “Creating a Custom Profile” on
page 38.
♦ View data profile details. You can view the details of an auto profile or custom profile.
♦ Edit a data profile. For more information, see “Editing a Data Profile” on page 47.
♦ Delete a data profile. For more information, see “Deleting a Data Profile” on page 49.
♦ Run a session. For more information, see “Running Sessions from the Profile Manager” on
page 63.
♦ Regenerate a profile mapping. You can regenerate a profile mapping to validate the
mapping.
♦ Check in profile mappings. You can check in profile mappings for versioned repositories
to commit the changes to the repository. When you check in an object, the repository
creates a new version of the object and assigns it a version number. For more information
on versioning, see the Repository Guide.
♦ Configure default data profile options. For more information, see “Configuring Default
Data Profile Options” on page 26.
♦ Configure domains for profile functions. For more information, see “Working with
Domains” on page 89.
♦ Purge the Data Profiling warehouse. For more information, see “Purging the Data
Profiling Warehouse” on page 29.
♦ Display the status of interactive profile sessions. For more information, see “Monitoring
Interactive Sessions” on page 67.
♦ Display PowerCenter Data Profiling reports. For more information, see “PowerCenter
Data Profiling Reports” on page 79.



By default, the Profile Manager launches automatically after you create a data profile. You can
also manually launch the Profile Manager from the Source Analyzer or Mapplet Designer. To
launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch Profile
Manager. To launch the Profile Manager from the Mapplet Designer, select Mapplets-
Profiling-Launch Profile Manager.
Tip: If you do not want the Profile Manager to launch automatically after you create a data
profile, you can change the default data profile options in the Profile Manager. For more
information about configuring default data profile options, see “Configuring Default Data
Profile Options” on page 26.

Profile View
The Profile View tab displays all of the data profiles in the open folders in your repository.
Use the Profile View to determine the data profiles that exist for a particular repository folder.
Figure 1-1 shows the Profile Manager:

Figure 1-1. Profile Manager
(Figure callouts: repository, folder navigator, data profile, profile properties window,
session status)



Source View
The Source View displays the source definitions with data profiles in the open folders in your
repository. Use the Source View to determine if a specific source definition already has data
profiles defined. The Source View shows if the data profile is an auto profile or custom
profile.
You can also use the Source View when you want to work with a data profile but are more
familiar with the source name than the data profile name. For example, you may want to run
a profile session, and you know the source definition name but not the data profile name.
Figure 1-2 shows the Profile Manager Source View:

Figure 1-2. Profile Manager Source View
(Figure callouts: repository folder name, source name, data profile name)

When you select the Source View tab in the Profile Manager, a source view tree displays data
profiles as nodes under the source definition for which you defined the data profile.
If you change or delete a data profile or a source or mapplet with a data profile, you can click
View-Refresh to refresh the Source View.



Chapter 2

Installing and Configuring


Data Profiling
This chapter contains information on the following topics:
♦ Overview, 10
♦ Configuring a Data Profiling Warehouse, 12
♦ Installing PowerAnalyzer Data Profiling Reports, 16
♦ Configuring Default Data Profile Options, 26
♦ Purging the Data Profiling Warehouse, 29

Overview
Data Profiling installs automatically with the PowerCenter Server and Client when you
purchase the Data Profiling Option. After you install PowerCenter with the Data Profiling
Option, you must complete configuration steps before you can create data profiles, run
profile sessions, or view profile reports.

Installing and Configuring Data Profiling


To install and configure Data Profiling, complete the following steps:
1. Install and configure PowerCenter 7.1.2. Install and configure the PowerCenter 7.1.2
Server and Client with the Data Profiling Option.
♦ You must register the PowerCenter Server before you can view session properties when
you create or edit data profiles. For more information about registering the
PowerCenter Server, see “Registering the PowerCenter Server” in the Installation
Guide.
♦ You must ensure that the code pages for the operating systems that host the
PowerCenter Client and Server and the Data Profiling warehouse and domains are
compatible. For more information about configuring compatible code pages for
PowerCenter and Data Profiling components, see “Code Page Compatibility” on
page 152.
2. Create a Data Profiling warehouse. Create a Data Profiling warehouse for each
PowerCenter repository in which you want to store data profiles. For more information,
see “Configuring a Data Profiling Warehouse” on page 12.
3. Install PowerAnalyzer reports (optional). For more information, see “Installing
PowerAnalyzer Data Profiling Reports” on page 16.
4. Configure the default data profile options (optional). You can configure default options
for the Profile Wizard, data profile mappings, and data profile workflows and sessions.
For more information about configuring default data profile options, see “Configuring
Default Data Profile Options” on page 26.

Upgrading from a Previous Version


If you are upgrading from a previous version of Data Profiling, complete the following steps:
1. Install and configure PowerCenter 7.1.2. Install and configure the PowerCenter 7.1.2
Server and Client with the Data Profiling Option.
2. Upgrade the Data Profiling warehouse. For more information, see “Upgrading the Data
Profiling Warehouse” on page 13.



3. Install Data Profiling PowerAnalyzer reports (optional). For more information, see
“Installing PowerAnalyzer Data Profiling Reports” on page 16.
Note: When you upgrade Data Profiling, the Designer maintains the default Data Profiling
options you configured in the previous version.

Installing PowerAnalyzer
Before installing and configuring Data Profiling, install and configure PowerAnalyzer if you
want to view Data Profiling reports in PowerAnalyzer.
To use PowerAnalyzer for Data Profiling, you need the following product licenses:
♦ Application server license
♦ PowerAnalyzer license

Application Server License


JBoss Application Server does not require a license. However, if you are using another
application server, you may need a license. You should have received the license when you
purchased the application server product.

PowerAnalyzer License
PowerAnalyzer Data Profiling reports use PowerAnalyzer and require a PowerAnalyzer license.
Informatica does not ship the PowerAnalyzer license key with the Data Profiling installation.
To obtain a PowerAnalyzer license, send a request to: productrequest@informatica.com.
Informatica provides the license key in an email with instructions on how to apply the license
to the product.
You do not need the license key to install PowerAnalyzer. If you have the PowerAnalyzer
license key at the time of installation, you can apply the license during the installation.
Otherwise, you can complete the installation and apply the license key afterward.
The PowerAnalyzer license is a restricted license. The license can only be used to run Data
Profiling and PowerCenter Metadata Reporter. For more information about the license,
contact Informatica Technical Support.
For more information about activating the PowerAnalyzer license or installing and
configuring PowerAnalyzer, see the PowerAnalyzer Installation Guide.

Configuring a Data Profiling Warehouse
When you install Data Profiling for the first time, create a Data Profiling warehouse for each
PowerCenter repository in which you want to store data profiles. When you upgrade from a
previous version, you must upgrade your Data Profiling warehouse.
Once you create a Data Profiling warehouse, you must configure a relational database
connection to the warehouse in the Workflow Manager.

Creating a Data Profiling Warehouse


You can create a Data Profiling warehouse on any relational database that PowerCenter
supports. You can also create the Data Profiling warehouse on the same database instance as
your PowerCenter repository.
Note: If you create a Data Profiling warehouse on Informix, you cannot use PowerAnalyzer to
view Data Profiling reports.
To create a Data Profiling warehouse, run a script to create a warehouse schema for your
database type. When you install Data Profiling, the installation program places the script in
the following directory:
<PowerCenter Client>\Extensions\DataProfile\ProfileWarehouse

Table 2-1 shows the script you must run for each database type:

Table 2-1. Scripts for Creating a Data Profiling Warehouse

Database Type           Script
IBM DB2                 create_schema_db2.sql
Informix                create_schema_inf.sql
Microsoft SQL Server    create_schema_sql.sql
Oracle                  create_schema_ora.sql
Sybase                  create_schema_syb.sql
Teradata                create_schema_ter.sql

Creating a Data Profiling Warehouse on IBM DB2


To create a Data Profiling warehouse on IBM DB2, you must connect to DB2. Then, enter
the following command from the command line:
db2 -tvf create_schema_db2.sql

Creating a Data Profiling Warehouse on Informix


To create a Data Profiling warehouse on Informix, run the dbaccess command line program.
Select create_schema_inf.sql as the file you want to execute. Select Run.



Creating a Data Profiling Warehouse on Microsoft SQL Server
To create a Data Profiling warehouse on Microsoft SQL Server, open Query Analyzer and run
the create_schema_sql.sql script.
Note: Informatica recommends that you use Query Analyzer rather than isql to run the script.
Otherwise, the script may not run successfully.

Creating a Data Profiling Warehouse on Oracle


To create a Data Profiling warehouse on Oracle, run the following command from the
command line:
sqlplus <user name>/<password>@<database server> @create_schema_ora.sql

Creating a Data Profiling Warehouse on Sybase


To create a Data Profiling warehouse on Sybase, run the following command from the
command line:
isql -U <user name> -P <password> -S <database server>
-i create_schema_syb.sql

Creating a Data Profiling Warehouse on Teradata


To create a Data Profiling warehouse on Teradata, open Teradata SQL Assistant. Open the file
create_schema_ter.sql and click Execute.

Upgrading the Data Profiling Warehouse


To upgrade a Data Profiling warehouse, run a script to upgrade the warehouse schema for
your database type. The script you run depends on your database type and the version of the
PowerCenter Data Profiling installation you are upgrading from.
When you install Data Profiling, the installation program places the script in the following
directory:
<PowerCenter Client>\Extensions\DataProfile\ProfileWarehouse

Table 2-2 shows the script you must run for each database type and PowerCenter Data
Profiling installation version:

Table 2-2. Scripts for Upgrading the Data Profiling Warehouse

Upgrade             Database Type           Script
From 7.1 to 7.1.2   IBM DB2                 upgrade_schema02-03_db2.sql
From 7.1 to 7.1.2   Informix                upgrade_schema02-03_inf.sql
From 7.1 to 7.1.2   Microsoft SQL Server    upgrade_schema02-03_sql.sql
From 7.1 to 7.1.2   Oracle                  upgrade_schema02-03_ora.sql
From 7.1 to 7.1.2   Sybase                  upgrade_schema02-03_syb.sql
From 7.1 to 7.1.2   Teradata                upgrade_schema02-03_ter.sql
From 7.0 to 7.1.2   IBM DB2                 upgrade_schema01-03_db2.sql
From 7.0 to 7.1.2   Informix                upgrade_schema01-03_inf.sql
From 7.0 to 7.1.2   Microsoft SQL Server    upgrade_schema01-03_sql.sql
From 7.0 to 7.1.2   Oracle                  upgrade_schema01-03_ora.sql
From 7.0 to 7.1.2   Sybase                  upgrade_schema01-03_syb.sql
From 7.0 to 7.1.2   Teradata                upgrade_schema01-03_ter.sql

Upgrading a Data Profiling Warehouse on IBM DB2


To upgrade a Data Profiling warehouse on IBM DB2, you must connect to DB2. Then, enter
the following command from the command line:
db2 -tvf <upgrade schema script>

Upgrading a Data Profiling Warehouse on Informix


To upgrade a Data Profiling warehouse on Informix, run the dbaccess command line
program. Select the appropriate script from Table 2-2 as the file you want to execute. Select
Run.

Upgrading a Data Profiling Warehouse on Microsoft SQL Server


To upgrade a Data Profiling warehouse on Microsoft SQL Server, open Query Analyzer and
run the upgrade script indicated in Table 2-2.
Note: Informatica recommends that you use Query Analyzer rather than isql to run the script.
Otherwise, the script may not run successfully.

Upgrading a Data Profiling Warehouse on Oracle


To upgrade a Data Profiling warehouse on Oracle, run the following command from the
command line using the appropriate script from Table 2-2:
sqlplus <user name>/<password>@<database server>
@<upgrade schema script>



Upgrading a Data Profiling Warehouse on Sybase
To upgrade a Data Profiling warehouse on Sybase, run the following command from the
command line using the appropriate upgrade script listed in Table 2-2:
isql -U <user name> -P <password> -S <database server>
-i <upgrade schema script>

Upgrading a Data Profiling Warehouse on Teradata


To upgrade a Data Profiling warehouse on Teradata, open Teradata SQL Assistant. Open the
upgrade script indicated in Table 2-2 and click Execute.

Configuring a Relational Database Connection to the Data Profiling


Warehouse
After you create a Data Profiling warehouse, you must configure a relational database
connection to the warehouse in the Workflow Manager. When you run profile sessions, use
this relational database connection as the target connection. For more information about
configuring a relational database connection, see the Workflow Administration Guide. For
more information about running profile sessions, see “Running Profile Sessions” on page 61.



Installing PowerAnalyzer Data Profiling Reports
If you want to view Data Profiling reports on PowerAnalyzer, you must install and configure
PowerAnalyzer. Informatica recommends that a PowerAnalyzer user with the System
Administrator role set up the Data Profiling reports.
To install PowerAnalyzer Data Profiling reports for the first time, complete the following
steps:
1. Install Data Profiling XML scripts. For more information, see “Installing Data Profiling
XML Scripts” on page 16.
2. Import the Data Profiling schema and reports into PowerAnalyzer. For more
information, see “Importing Data Profiling Schema and Reports” on page 16.
3. Create a data source for the Data Profiling warehouse in PowerAnalyzer. For more
information, see “Configuring a Data Source for the Data Profiling Warehouse” on
page 20.
4. Create a data connector in PowerAnalyzer. If you do not already have a data connector,
create one, and add your Data Profiling warehouse data source to it when you do. If you
create a data connector in this step, you can skip step 5. For more information, see
“Creating a Data Connector” on page 21.
5. Add the Data Profiling warehouse data source to a data connector in PowerAnalyzer. If
you already have a data connector, add your Data Profiling warehouse data source to it.
For more information, see “Adding a Data Profiling Data Source to a Data Connector”
on page 23.

Installing Data Profiling XML Scripts


The Data Profiling installation includes XML scripts to create the following PowerAnalyzer
objects:
♦ Reports
♦ Schemas
Locate the XML scripts in the following directory:
<PowerCenter Client>\Extensions\DataProfile\IPAReports

Importing Data Profiling Schema and Reports


Before you import the Data Profiling schema and reports, ensure that PowerAnalyzer works
properly. To import the Data Profiling schema and reports, log in to PowerAnalyzer with
system administrator privileges.



Import the following PowerAnalyzer objects from the XML files in the following order:
1. Schemas. Import the schema tables, attributes, and metrics from an XML file.
2. Reports. Import the reports from an XML file.

Importing the Data Profiling Schema


The dpr_schema.xml file contains the schema, attribute, and metric definitions required for
Data Profiling reports.

To import the Data Profiling schema:

1. Click Administration > XML Export / Import > Import Schemas.


The Import Schemas page displays.

2. On the Import Schemas page, select Validate XML Against DTD.


3. Click Browse to locate the dpr_schema.xml file.
4. Click Import XML.



The Import Schemas page displays the schema objects to be imported.

5. Click Continue.
The Import Schemas page displays the following message:
The objects have been successfully imported to the target repository.

Importing the Data Profiling Reports


The dpr_reports.xml file contains the definitions of the Data Profiling reports. Locate the file
in the Data Profiling Reports folder on the PowerCenter installation CD. You can copy the
file to your PowerAnalyzer machine.

To import the Data Profiling reports:

1. Click Administration > XML Export / Import > Import Reports.



The Import Reports page displays.

2. On the Import Reports page, select Validate XML Against DTD.


3. Click Browse to locate the dpr_reports.xml file.
4. Click Import XML.
The Import Reports page displays the reports to be imported.

By default, the Publish to Everyone and Run Scheduled Reports After Import options are
selected.
5. If you do not want to give all PowerAnalyzer users access to the reports, clear the Publish
to Everyone option.
You can give specific individuals access to the Data Profiling reports. For more
information about providing access to a PowerAnalyzer report, see the PowerAnalyzer
Administrator Guide.
6. Clear the Run Scheduled Reports After Import option.


7. Click Continue.
The Import Reports page displays the following message:
The objects have been successfully imported to the target repository.

Configuring a Data Source for the Data Profiling Warehouse


You must configure a data source for each Data Profiling warehouse that contains profiling
data for which you want to view PowerAnalyzer Data Profiling reports.

To create the Data Profiling warehouse data source:

1. Click Administration > Schema Design > Data Sources.


2. On the Data Sources page, click Add.
3. Enter a name and description for the data source.

4. Select the server type of your Data Profiling warehouse.


PowerAnalyzer provides JDBC drivers to connect to the Data Profiling warehouse. When
you select the server type, PowerAnalyzer supplies the driver name and connection string
format for the JDBC drivers that PowerAnalyzer provides.
The server type list includes the following databases:
♦ Oracle. Select to connect to an Oracle database.
♦ SQL Server. Select to connect to a Microsoft SQL Server database.
♦ DB2. Select to connect to an IBM DB2 database.
♦ Sybase. Select to connect to a Sybase database.


♦ Teradata. Select to connect to a Teradata database.
Note: You cannot create an Informix data source.

5. Customize the JDBC connection string with the information for your Data Profiling
warehouse database.
6. Enter the user name and password to connect to the warehouse database.
7. Test the connection.
If the connection fails, verify that the database information is correct. Consult
your database administrator if necessary.
8. Click OK to save the data source.
For more information about PowerAnalyzer data sources, see the PowerAnalyzer Administrator
Guide.
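As an illustration of steps 4 and 5, the sketch below fills in generic JDBC URL templates for each supported server type. The templates and function names are illustrative only; PowerAnalyzer supplies its own driver names and connection string formats when you select the server type, and those may differ from these.

```python
# Generic JDBC URL shapes for the supported server types (illustrative only --
# not the exact formats that PowerAnalyzer supplies for its bundled drivers).
URL_TEMPLATES = {
    "Oracle": "jdbc:oracle:thin:@{host}:{port}:{db}",
    "SQL Server": "jdbc:sqlserver://{host}:{port};databaseName={db}",
    "DB2": "jdbc:db2://{host}:{port}/{db}",
    "Sybase": "jdbc:sybase:Tds:{host}:{port}/{db}",
    "Teradata": "jdbc:teradata://{host}/DATABASE={db}",
}

def connection_string(server_type, host, port, db):
    """Fill in the URL template for the selected server type."""
    if server_type == "Informix":
        raise ValueError("You cannot create an Informix data source.")
    return URL_TEMPLATES[server_type].format(host=host, port=port, db=db)
```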

Creating a Data Connector


Create a data connector, and then add your data source to it. You can create data
connectors for specific users and groups.
Note: If you already have a data connector to which you want to add your data source, you can
skip this step. However, you must add your Data Profiling data source to your data connector.
For more information about adding a data source to a data connector, see “Adding a Data
Profiling Data Source to a Data Connector” on page 23.

To create a data connector:

1. Click Administration > Schema Design > Data Connectors.


The list of data connectors defined in the repository displays in the task area.

2. Click Add.


The Data Connectors page displays.

3. Enter the data connector properties listed in Table 2-3:

Table 2-3. Data Connector Properties

♦ System Name (Required). Enter the name of the data connector. The data connector
name must be unique. The system name can include any character except a space, tab,
newline character, and the following special characters:
\ / : * ? " < > | ' &
♦ Description (Optional). Enter a description for the data connector. The connector
description can be between 1 and 255 characters.
♦ Primary Data Source (Required). Select the primary data source from the list of data
sources available in PowerAnalyzer. PowerAnalyzer uses this data source to connect to
the Data Profiling warehouse and read the metric and attribute data for a report. For
information about how PowerAnalyzer connects to the primary and additional data
sources, see the PowerAnalyzer Administrator Guide.
♦ Primary Time Dimension (Optional). This option does not apply to Data Profiling
reports.
♦ Additional Schema Mappings (Optional). This option does not apply to Data Profiling
reports.

4. If you do not want to set up additional data sources, click OK.


If this is the first data connector in PowerAnalyzer, PowerAnalyzer saves the data
connector to the repository as the system data connector, making it available to all users.
You cannot edit the list of users who can use this data connector.
If this is not the system data connector, you can edit the list of users who can use this data
connector. You can activate this data connector in addition to the system data connector.
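The system name restrictions in Table 2-3 can be checked with a small validator (a sketch; the forbidden character set is copied from the table, and the function name is illustrative):

```python
# Characters that Table 2-3 forbids in a data connector system name:
# space, tab, newline, and \ / : * ? " < > | ' &
FORBIDDEN = set('\\/:*?"<>|\'&') | {" ", "\t", "\n"}

def is_valid_system_name(name: str) -> bool:
    """Return True if the name satisfies the Table 2-3 restrictions."""
    return bool(name) and not any(ch in FORBIDDEN for ch in name)
```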


Adding a Data Profiling Data Source to a Data Connector
PowerAnalyzer uses a data connector to connect to a data source and read the data for Data
Profiling reports. Typically, PowerAnalyzer uses the system data connector to connect to all
the data sources required for reports. To enable PowerAnalyzer to run Data Profiling reports,
add the Data Profiling data source to the system data connector. If you have several data
connectors and you want to use a specific data connector for Data Profiling reports, add the
Data Profiling data source to the specific data connector.
Note: If you configured a data connector in the previous step, verify that you have already
added your Data Profiling data source. Click the data connector name to see if your data
source is listed as the primary data source.
For more information about PowerAnalyzer data connectors, see the PowerAnalyzer
Administrator Guide.

To add the Data Profiling warehouse data source to the system data connector:

1. Click Administration > Schema Design > Data Connectors.


The Data Connectors page displays.

2. Click the name of the system data connector.


PowerAnalyzer displays the properties of the system data connector.

3. In the Additional Schema Mappings section, click Add.


PowerAnalyzer expands the section and displays the available schemas in the repository.
4. In the Data Source list, select the Data Profiling data source you created in “Configuring
a Data Source for the Data Profiling Warehouse” on page 20.


5. In the Available Schemas section, select DP_SCHEMA, and click Add >>.

6. Click Add.
PowerAnalyzer displays the additional schema mappings for the system data connector.

7. Click OK.
You can now run Data Profiling reports using the system data connector.


Configuring Default Data Profile Options
After you install and configure Data Profiling client components, you can configure the
default data profile options for the PowerCenter Client machine. Each data profile you create
in the Designer uses these default options. You configure the default data profile options from
the Profile Manager.
Figure 2-1 shows the options dialog box for configuring default data profile options:

Figure 2-1. Configuring Default Profile Options


To configure default data profile general options:

1. Launch the Profile Manager.


To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch
Profile Manager. To launch the Profile Manager from the Mapplet Designer, select
Mapplets-Profiling-Launch Profile Manager.
2. In the Profile Manager, click Tools-Options.
The General tab for the Options dialog box displays.


3. Configure the following options as necessary:

Table 2-4. Default Data Profile Options - General Tab

♦ Always save changes before interactive run (Optional). Select to always save changes
to the profile mapping before running a profile session interactively. If you clear this
option, the Designer prompts you to save changes before you run an interactive session.
♦ Display Profile Manager after creating a profile (Optional). Select to always launch
the Profile Manager after you create a data profile. If you clear this option, the Profile
Manager does not launch automatically after you create a data profile.
♦ Always run profile interactively (Optional). Select to always run a profile session
interactively when you create a data profile. If you clear this option, you can still run
auto and custom profile sessions interactively from the Profile Manager. For more
information about running interactive sessions, see “Running Sessions from the Profile
Manager” on page 63. For more information about creating custom profiles, see
“Creating a Custom Profile” on page 38.
♦ Check in profile mapping when profile is saved (Optional). Select if you want the
Designer to check in profile mappings when you save changes for versioned
repositories. Saving versions of profile mappings in your repository can consume large
amounts of disk space. Make sure you have enough disk space on the machine hosting
the repository.
♦ Always invoke auto profiling dialog (Optional). Select to display the Auto Profiling
dialog box when you create a new auto profile. If you clear this option, the Auto
Profiling dialog box does not display when you create a new data profile. Also, you
cannot configure Domain Inference tuning options and verbose data loading options
when you create the auto profile. However, if you clear this option and create an auto
profile for a source with 25 or more columns, the Auto Profiling dialog box displays.
For more information about creating auto profiles, see “Creating an Auto Profile” on
page 34.
♦ Use source owner name during profile mapping generation (Optional). Select to add
the table owner name to relational sources when the Designer generates a profile
mapping.
Note: If the owner name changes after you generate the profile mapping, you must
regenerate the mapping. You can regenerate a profile mapping in the Profile Manager.
For more information about regenerating profile mappings, see “Using the Profile
Manager” on page 6.

4. Enter the name of a text editor for the session log file.
By default, the Profile Manager selects WordPad as the text editor.
5. Enter the location where you want your session log files to be written.
6. Click the Prefixes tab.


7. Configure the following options as necessary:

Table 2-5. Default Data Profile Options - Prefixes Tab

♦ Profile Mapping Prefix (Required). Edit the prefix to use with all profile mapping
names. Profile mappings use the following naming convention:
<prefix><Profile Name>
The default prefix is m_DP_. The prefix must be 1 to 10 characters. It cannot contain
spaces.
♦ Profile Workflow Prefix (Required). Edit the prefix to use with all profile workflow
names. Profile workflows use the following naming convention:
<prefix><Profile Name>
The default prefix is wf_DP_. The prefix must be 1 to 10 characters. It cannot contain
spaces.
♦ Profile Session Prefix (Required). Edit the prefix to use with all profile session names.
Profile sessions use the following naming convention:
<prefix><Profile Name>
The default prefix is s_DP_. The prefix must be 1 to 10 characters. It cannot contain
spaces.
♦ Revert All (Optional). Revert prefixes back to the default settings.

8. Click OK.
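The prefix rules and naming convention above can be sketched as follows (an illustrative helper, consistent with the m_DP_AP_CustomerData example in Chapter 3; the function name is hypothetical):

```python
# Default prefixes from Table 2-5.
DEFAULT_PREFIXES = {"mapping": "m_DP_", "workflow": "wf_DP_", "session": "s_DP_"}

def profile_object_name(kind, profile_name, prefixes=DEFAULT_PREFIXES):
    """Apply the <prefix><Profile Name> convention and enforce the prefix rules:
    1 to 10 characters, no spaces."""
    prefix = prefixes[kind]
    if not 1 <= len(prefix) <= 10 or " " in prefix:
        raise ValueError("A prefix must be 1 to 10 characters with no spaces.")
    return prefix + profile_name
```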


Purging the Data Profiling Warehouse
When you delete data profiles from a repository, the related metadata and profile session
results remain in the Data Profiling warehouse. You can purge metadata and profile session
results from the Data Profiling warehouse for deleted data profiles. You can also purge profile
session results that are no longer associated with data profile metadata.
When you purge the Data Profiling warehouse, you can purge all metadata and session
results, all metadata and session results except the latest, or metadata and session results
older than a date you specify. You can purge metadata and session results associated with
a particular repository folder or associated with all folders in the repository.

To purge the Data Profiling warehouse:

1. Launch the Profile Manager.


To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch
Profile Manager. To launch the Profile Manager from the Mapplet Designer, select
Mapplets-Profiling-Launch Profile Manager.
2. Select Target Warehouse-Connect to connect to your Data Profiling warehouse.
3. Select Target Warehouse-Purge.
The Purge Data Profiling Warehouse dialog box displays.

4. Select the folder from which you want to purge metadata and/or profile session results.
The folder must be open to purge it. Or, select All Open Folders to purge metadata
and/or profile session results from all open folders.


5. Select one of the following options to purge from your Data Profiling warehouse:
♦ Profiles. Purges metadata and profile session results for deleted data profiles.
♦ Orphan Profiles Only. Purges profile session results that are no longer associated with
data profile metadata.
6. If you selected Profiles, select one of the following options for Purge Profile Metadata and
Runs. Or, if you selected Orphan Profiles Only, select one of the following options for
Purge Profile Runs Only:
♦ All. Purges all profile metadata and session run results for deleted data profiles.
♦ All Except Latest. Purges all profile metadata and session run results for deleted data
profiles except the latest instance of each.
♦ Older Than. Select a date to purge all profile metadata and session run results for
deleted data profiles prior to that date.
7. Select Purge Metadata and Runs for Orphan Profiles to delete both metadata and session
results that are no longer associated with data profile metadata.
8. Click Purge.
The Data Profiling warehouse may take a few minutes to purge.
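The purge options in step 6 can be sketched as a selection over one deleted profile's session runs (illustrative only; the function and parameter names are hypothetical):

```python
from datetime import date

def select_runs_to_purge(runs, option, cutoff=None):
    """runs: list of (run_date, results) tuples for one deleted data profile.
    Returns the runs the purge would remove under each Table option:
    "All", "All Except Latest", or "Older Than" (with a cutoff date)."""
    ordered = sorted(runs, key=lambda r: r[0])
    if option == "All":
        return ordered
    if option == "All Except Latest":
        return ordered[:-1]          # keep the latest instance
    if option == "Older Than":
        return [r for r in ordered if r[0] < cutoff]
    raise ValueError(option)
```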


Chapter 3

Managing Data Profiles

This chapter includes the following topics:


♦ Overview, 32
♦ Creating an Auto Profile, 34
♦ Creating a Custom Profile, 38
♦ Editing a Data Profile, 47
♦ Deleting a Data Profile, 49
♦ Working with Profile Mappings, 51
♦ Using Mapplets to Extend Data Profiling Functions, 52
♦ Troubleshooting, 60

Overview
You can create, edit, and delete data profiles. Data profiles contain a set of functions to apply
to a specified set of source data. The functions return the metadata about the profile sources
that make up the data profile reports.
You can create the following types of data profiles:
♦ Auto profile. Contains predefined functions for profiling source data. Use an auto profile
during mapping or mapplet development to learn more about source data. For more
information about creating an auto profile, see “Creating an Auto Profile” on page 34.
♦ Custom profile. You create and add functions to create a custom profile. Use a custom
profile during mapping or mapplet development to validate documented business rules
about the source data. You can also use a custom profile to monitor data quality. For more
information about creating a custom profile, see “Creating a Custom Profile” on page 38.
When you create a data profile, the Designer generates a mapping based on the profile
functions. When you run a session for the mapping, the PowerCenter Server writes profile
data to the Data Profiling warehouse.
Once you create a data profile, you can edit and delete the data profile. For more information
about editing a data profile, see “Editing a Data Profile” on page 47. For more information
about deleting a data profile, see “Deleting a Data Profile” on page 49.

Profiling Sources and Mapplet Output Data


You can profile sources and output data from connected ports in mapplet output groups. A
source is any external data source that you create or import in PowerCenter as a source
definition. You can profile the following types of sources:
♦ Relational database sources
♦ Flat file sources
♦ XML sources
♦ VSAM sources
♦ Application sources, such as SAP R/3, PeopleSoft, and MQSeries
When you profile a multi-group source, such as an XML source, you can select the groups in
the source you want to profile or you can profile the entire source.
When you profile mapplet output data, the Profile Wizard creates a data profile based on the
data output from the connected ports in the Output transformation.
You can profile output data from mapplets that meet the following conditions:
♦ The mapplet contains a source definition for input.
♦ The mapplet contains no Input transformation.
♦ The mapplet contains no Transaction Control transformations.


♦ The mapplet contains at least one Output transformation.
♦ The ports in the Output transformation are connected.
Note: You can profile sources with ASCII and non-ASCII port names.

Profiling Multiple Sources


If you want to profile multiple sources, you can create a mapplet that combines multiple
sources and create a data profile based on the mapplet output data. For example, you might
use several Joiner transformations within the mapplet to join source data from multiple
sources, and profile the output data from this mapplet.

Profiling SAP R/3 Sources


When you create a data profile for SAP R/3 sources, you must generate an ABAP program for
the profile mapping before you can run a session for the data profile. For more information
about generating an ABAP program, see the PowerCenter Connect for SAP R/3 User and
Administrator Guide.

Profile Functions
You can add multiple profile functions to a data profile. Profile functions are calculations you
perform on the source data that return information about various characteristics of the source
data.
When you add a function to a data profile, you can choose from the following types of
functions:
♦ Source-level functions. Performs calculations on two or more columns of a source, source
group, or mapplet group. For example, you can evaluate a business rule for groups in an
XML source.
♦ Column-level functions. Performs calculations on one column of a source. For example,
you can evaluate the data in a column to find patterns that frequently occur in the data.
♦ Intersource functions. Performs calculations on two or more sources. These functions
generate information about the relationship between the sources. For example, you might
compare the values of columns in two sources to find out the percentage of identical data
that appears in both sources.
Each function type has a subset of functionality that you can configure when you add a
function to the data profile. For more information about profile functions, see “Working with
Functions” on page 103.
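The intersource comparison described in the example above can be sketched as follows (a hypothetical helper; the function and parameter names are illustrative):

```python
def identical_value_pct(source_a, source_b):
    """Percentage of distinct values in one source's column that also appear
    in another source's column -- a sketch of an intersource comparison."""
    a, b = set(source_a), set(source_b)
    return 100.0 * len(a & b) / len(a) if a else 0.0
```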

Creating an Auto Profile
Create an auto profile to learn more about source data or mapplet output data during
mapping development. When you create an auto profile, the Designer creates a data profile
with the following functions:
♦ Row Count. Counts the number of rows read from the source during the profile session.
When you create a data profile that uses the Row Count function with data samples, the
Row Count function estimates the total row count.
♦ Candidate Key Evaluation. Calculates the number and percentage of unique values in one
or more columns.
♦ Redundancy Evaluation. Calculates the number of duplicate values in one or more
columns of the source.
♦ Domain Inference. Reads all values in a column and infers patterns that occur in the data.
You can configure the Profile Wizard to filter the Domain Inference results.
♦ Distinct Value Count. Reads all the values in a column and returns the number of distinct
values for the column. You can configure the auto profile to load verbose data to the Data
Profiling warehouse.
♦ Aggregate functions. Calculates an aggregate value for numeric or string values in a
column. Use aggregate functions to count null values, determine average values, and
determine minimum or maximum values.
For more information about functions, see “Working with Functions” on page 103.
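In plain Python terms, the column-level calculations above amount to something like the following sketch (illustrative only; the actual functions run inside the profile mapping that the Designer generates):

```python
from collections import Counter

def column_metrics(values):
    """Sketch of the per-column calculations the auto profile functions perform."""
    counts = Counter(values)
    unique = sum(1 for c in counts.values() if c == 1)
    return {
        "row_count": len(values),                      # Row Count
        "distinct_value_count": len(counts),           # Distinct Value Count
        "unique_pct": 100.0 * unique / len(values),    # Candidate Key Evaluation
        "duplicate_values": sum(c for c in counts.values() if c > 1),  # Redundancy
        "null_count": counts[None],                    # Aggregate: count null values
    }
```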

Auto Profile Naming Conventions


The Designer uses the following naming conventions for an auto profile:
AP_<source/mapplet name>

For example, if you generate an auto profile for the source CustomerData, the Designer
names the auto profile AP_CustomerData.
After you create the auto profile, the Designer generates a mapping based on the profile
functions. The Designer uses the following naming convention when it saves the profile
mapping to the repository:
m_DP_AP_<source/mapplet name>

For example, if you create an auto profile called AP_CustomerData, the profile mapping
name is m_DP_AP_CustomerData.
Tip: You can rename an auto profile in the Profile Manager. You can click Description on the
Auto Profile Column Selection page to change the name or description of the profile. Or, you
can change the naming convention for profile mappings in the default data profile options.
For information about changing default data profile options, see “Configuring Default Data
Profile Options” on page 26.


If you have an auto profile for a source, and you generate another auto profile for the same
source, the Designer does not overwrite the existing auto profile. It creates a new auto profile
using the following naming convention:
AP_<source/mapplet name>N

where N is the latest version number of the previous auto profile plus 1. For example, if you
have an auto profile AP_CustomerData, and you generate a new auto profile for the source
CustomerData, the auto profile name is AP_CustomerData1.
The mapping the Designer generates for the new auto profile uses the following naming
convention:
m_DP_AP_<source/mapplet name>N

where N is the latest version number of the auto profile plus 1.
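The versioned naming behavior described above can be sketched as follows (a hypothetical helper, not the Designer's actual implementation):

```python
import re

def next_auto_profile_name(source_name, existing_names):
    """Return the name the Designer would give a new auto profile:
    AP_<source name>, then AP_<source name>N for subsequent profiles."""
    base = "AP_" + source_name
    if base not in existing_names:
        return base
    versions = [0]  # the unsuffixed name counts as version 0
    for name in existing_names:
        m = re.fullmatch(re.escape(base) + r"(\d+)", name)
        if m:
            versions.append(int(m.group(1)))
    return base + str(max(versions) + 1)
```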


Tip: As source data changes, you may need to create new auto profiles for the source. However,
you can preserve existing auto profiles and their corresponding mappings in the Data
Profiling warehouse for tracking or auditing purposes.

Steps to Create an Auto Profile


When you create an auto profile, you can select groups or columns in the source to profile. Or
you can profile the entire source.

To create an auto profile:

1. Select the source definition in the Source Analyzer or mapplet in the Mapplet Designer
you want to profile.
2. In the Source Analyzer, select Sources-Profiling-Create Auto Profile. In the Mapplet
Designer, select Mapplets-Profiling-Create Auto Profile.
The Auto Profile Column Selection page displays in the following cases:
♦ You set the default data profile options to open the Auto Profile Column Selection
page when you create an auto profile.
♦ The source definition contains 25 or more columns.


Figure 3-1 shows the Auto Profile Column Selection page:

Figure 3-1. Profile Wizard - Auto Profile Column Selection Page


If the Auto Profile Column Selection page does not display, the Designer generates an
auto profile and profile mapping based on the profile functions. Go to step 10.
Note: If you skip this dialog box, you cannot configure verbose data loading settings or
Domain Inference tuning settings.
3. Optionally, click Description to add a description for the data profile.
If you do not want to add a description, go to step 5.
4. Enter a description up to 200 characters for the profile, and click OK to return to the
Auto Profile Column Selection page.
5. Optionally, select the groups or columns in the source that you want to profile.
By default, all columns or groups are selected.
6. Specify the type of verbose data for the Distinct Value Count function you want the
PowerCenter Server to write to the Data Profiling warehouse during the profile session.
The PowerCenter Server can write the following types of verbose data to the Data
Profiling warehouse for auto profile functions:
♦ No rows. The PowerCenter Server writes no verbose data to the Data Profiling
warehouse.


♦ Duplicate rows only. The PowerCenter Server writes duplicate values to the Data
Profiling warehouse.
♦ All rows. The PowerCenter Server writes distinct and duplicate values to the Data
Profiling warehouse.
Note: If you load verbose data for columns with a precision greater than 1,000 characters,
the PowerCenter Server might write truncated data to the Data Profiling warehouse
during the profile session.
For more information about working with verbose data, see “Configuring a Function for
Verbose Mode” on page 43.
7. Click Profile Settings to specify settings for Domain Inference tuning.
The Profile Settings dialog box opens.
You can specify the following criteria to filter domain inference data:
♦ Maximum number of patterns. The PowerCenter Server returns the most frequently
occurring patterns up to the number of patterns you specify.
♦ Minimum pattern frequency. The PowerCenter Server returns patterns that occur at
or above the frequency you specify.
For more information about the Domain Inference function and Domain Inference
tuning, see “Domain Inference” on page 117.
8. Select Configure Session to configure the session properties after you create the data
profile.
9. Click Next if you selected Configure Session, or click Finish if you cleared Configure
Session.
The Designer generates a data profile and profile mapping based on the profile functions.
10. Click Repository-Save.
If you clicked Next, the Profile Run page opens. If you clicked Finish, the Profile Wizard
closes.
11. Configure the Run Profile options and click Next.
For information about configuring the Run Profile options for a session, see “Running a
Session when You Create a Data Profile” on page 67.
The Session Setup page opens.
12. Configure the Session Setup page options.
For information about configuring the Session Setup page options, see “Running a
Session when You Create a Data Profile” on page 67.
If you selected Run Session, the Profile Manager starts the session. If you cleared the Run
Session option, the Profile Wizard saves the session properties you configured and closes.
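The verbose data options in step 6 and the Domain Inference tuning criteria in step 7 can be sketched in plain Python (illustrative only; the actual filtering happens in the profile session):

```python
from collections import Counter

def verbose_rows(values, mode):
    """Rows the PowerCenter Server would write in each verbose data mode:
    "No rows", "Duplicate rows only", or "All rows"."""
    counts = Counter(values)
    if mode == "No rows":
        return []
    if mode == "Duplicate rows only":
        return [v for v in values if counts[v] > 1]
    if mode == "All rows":
        return list(values)
    raise ValueError(mode)

def filter_inferred_patterns(pattern_counts, max_patterns, min_frequency):
    """Apply the Domain Inference tuning criteria: keep patterns at or above
    the minimum frequency, then the most frequent up to the maximum."""
    frequent = [(p, n) for p, n in pattern_counts.items() if n >= min_frequency]
    frequent.sort(key=lambda item: item[1], reverse=True)
    return frequent[:max_patterns]
```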


Creating a Custom Profile
You can create a custom profile from the following areas of the Designer:
♦ Source Analyzer. Select Sources-Profiling-Create Custom Profile.
♦ Mapplet Designer. Select Mapplets-Profiling-Create Custom Profile.
♦ Profile Manager. Select Profile-Create Custom.
Note: If you select a multi-group source definition, the Profile Wizard adds all groups in the
source definition to the profile. Use the Profile Wizard to remove any groups you do not want
to profile.
You can also edit an existing data profile or delete a data profile. For more information about
editing data profiles, see “Editing a Data Profile” on page 47. For more information about
deleting data profiles, see “Deleting a Data Profile” on page 49.
To create a custom profile, complete the following steps:
1. Enter a profile name and optionally add a description of the profile. For more
information, see “Step 1. Specify a Profile Name and Profile Description” on page 38.
2. Add sources to the profile. For more information, see “Step 2. Add Sources to the
Profile” on page 38.
3. Add, edit, or delete a profile function and enable session configuration. For more
information, see “Step 3. Add Functions and Enable Session Configuration” on page 39.
4. Configure profile functions. For more information, see “Step 4. Configure Profile
Functions” on page 41.
5. Configure the profile session if you enabled session configuration. For more
information, see “Step 5. Configure the Profile Session” on page 45.

Step 1. Specify a Profile Name and Profile Description


When you start the Profile Wizard, the General Properties page prompts you to specify a
name for the data profile and add a description.
When you name a data profile, you cannot use the following characters:
.+-=~`!%^&*()[]{}'\";:/?,<>\\|\t\r\n

Also, you cannot use an at sign (@) or a number at the beginning of a data profile name.
When you are finished, click Next.
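The naming rules above can be checked with a small validator (a sketch; the forbidden character list is copied from the guide, and reading the at sign and number restrictions as applying to the first character is an assumption):

```python
# Characters the Profile Wizard does not allow in a data profile name
# (\t, \r, \n are the tab, carriage return, and newline characters).
FORBIDDEN_CHARS = set(".+-=~`!%^&*()[]{}'\";:/?,<>\\|\t\r\n")

def is_valid_profile_name(name: str) -> bool:
    """Return True if the name satisfies the data profile naming rules."""
    if not name or name[0] == "@" or name[0].isdigit():
        return False
    return not any(ch in FORBIDDEN_CHARS for ch in name)
```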

Step 2. Add Sources to the Profile


After you specify the general properties, the Add Profile Sources page prompts you to select
the sources you want to profile. If you selected the source definitions or mapplets you wanted
to profile before you launched the Profile Wizard, this page does not display.


You can add source definitions, mapplets, and groups in a multi-group source definition or
output from multiple groups in mapplets to the profile. When you profile a multi-group
source, such as an XML source, you can select the groups in the source definition you want to
profile or the entire source definition. For more information about eligible sources for
profiling, see “Profiling Sources and Mapplet Output Data” on page 32.
Figure 3-2 shows the Profile Sources page:

Figure 3-2. Profile Wizard - Profile Sources Page


To add sources to the data profile, select a source definition, group in a source definition, or
mapplet and click the Add Source button. To remove a source, select the source definition,
group, or mapplet and click the Remove Source button.
When you are finished adding sources, click Next.
Tip: If you want to profile multiple sources, you can create a mapplet that combines multiple
sources and create a data profile based on the mapplet output data.

Step 3. Add Functions and Enable Session Configuration


After you add sources to the data profile, use the Function-Level Operations page to perform
the following tasks:
♦ Add functions to the profile. When you add functions to the profile, the Profile Wizard
opens the Function Details page for you to configure details about the functions. For


information about configuring functions, see “Step 4. Configure Profile Functions” on
page 41.
♦ Edit existing functions. You can edit existing functions for the data profile.
♦ Delete functions from the profile. You can remove functions from the data profile.
♦ Organize functions. Use the up and down arrows to organize the functions in a data
profile. The order of the functions does not affect the data profile results.
♦ Configure Domain Inference function parameters. You can widen or narrow the scope for
the results from the Domain Inference function depending on whether you want to view
the primary domains or exception data. For information about configuring Domain
Inference function parameters, see “Domain Inference” on page 117.
♦ Enable session configuration. When you enable session configuration, the Profile Wizard
prompts you to configure the profile session for the mapping. If you configured the default
data profile options to always run profile sessions interactively, this option is selected by
default. For more information about configuring default data profile options, see
“Configuring Default Data Profile Options” on page 26.
Figure 3-3 shows the Function-Level Operations page:

Figure 3-3. Profile Wizard - Function-Level Operations Page



If you finish adding functions to the profile and you have not enabled session configuration,
click Finish. The Profile Wizard generates the mapping. For more information about
generating the profile mapping, see “Generating the Profile Mapping” on page 45.


If you finish adding functions to the profile and you enabled session configuration, click
Next. The Profile Wizard prompts you to configure the profile session. For more information
about configuring a profile session, see “Step 5. Configure the Profile Session” on page 45.

Step 4. Configure Profile Functions


When you add a function to the profile, the Function Details page prompts you to perform
the following tasks:
♦ Name the function. Function names are not case sensitive.
♦ Optionally, enter a description of the function. You can enter text to describe the
function.
♦ Specify the type of function. You can specify source-level, column-level, or intersource
functions. For details about these function types, see “Working with Functions” on
page 103.
♦ Specify the functionality you want to use. The functionality you can configure depends
on the type of function you choose. For details about functionality, see “Working with
Functions” on page 103.
Figure 3-4 shows the Function Details page:

Figure 3-4. Profile Wizard - Function Details Page


If you specify an intersource function, you must select at least two sources or two groups from
different sources to apply the function to. For more information about profile functions, see
“Working with Functions” on page 103.

If you added multiple sources to the profile, you must select the source you want to apply the
function to.
When you finish specifying the function type and functionality, click Next. The Profile
Wizard prompts you to specify the functionality details for the function. The configuration of
the Functionality Details window and available options change depending on the type of
function you specify.
Figure 3-5 shows the Function Role Details page for a Row Count function:

Figure 3-5. Profile Wizard - Function Role Details Page (Row Count Function)

Each function type has a subset of functionality you can configure to perform calculations on
the source data. For more information about configuring functions, see “Working with
Functions” on page 103.
When you finish configuring the function, the Profile Wizard returns to the Function-Level
Operations page described in “Step 3. Add Functions and Enable Session Configuration” on
page 39. You can then continue to add and configure functions for the profile.

Configuring a Function with Group-by Columns


Some functions let you generate profile data in a profile session run by group. When you
configure a function, you can determine the column by which you want to group the data.

To select a group-by column:

1. Configure a source-level function or column-level function.


2. Select Generate Profile Data By Group.
3. Click Group By Columns.
The Group By Columns dialog box displays.

4. Select the columns you want to group by.


You cannot select the column for which you created the function as a group-by column in
a column-level function. For example, if you created a Business Rule Validation function
for the column Agreement_Status, you cannot select this column to group by.
5. Click OK.
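As an illustration only, the effect of a group-by column can be sketched outside PowerCenter. The column names and rows below are invented; the sketch shows a NULL count computed per value of a hypothetical Region group-by column.

```python
from collections import defaultdict

# Hypothetical rows; Region is the group-by column, Agreement_Status is
# the profiled column. Neither the data nor the names come from PowerCenter.
rows = [
    {"Region": "East", "Agreement_Status": "ACTIVE"},
    {"Region": "East", "Agreement_Status": None},
    {"Region": "West", "Agreement_Status": "CLOSED"},
]

# A NULL count aggregated per group-by value, the way a profile session
# reports function results by group.
null_count_by_group = defaultdict(int)
for row in rows:
    if row["Agreement_Status"] is None:
        null_count_by_group[row["Region"]] += 1
# null_count_by_group -> {"East": 1}
```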

Configuring a Function for Verbose Mode


When you configure a function for verbose mode, the PowerCenter Server writes verbose data
to the Data Profiling warehouse when you run a profile session.
You can configure verbose mode for the following functions:
♦ Source-level Business Rule Validation. For information about the source-level Business
Rule Validation function, see “Business Rule Validation” on page 106.
♦ Column-level Business Rule Validation. For information about the column-level Business
Rule Validation function, see “Business Rule Validation” on page 113.
♦ Domain Validation. For information about the Domain Validation function, see “Domain
Validation” on page 115.
♦ Distinct Value Count. For information about the Distinct Value Count function, see
“Distinct Value Count” on page 121.
♦ Orphan Analysis. For information about the Orphan Analysis function, see “Orphan
Analysis” on page 123.
♦ Row Uniqueness. For information about the Row Uniqueness function, see “Row
Uniqueness” on page 110.
The type of verbose data the PowerCenter Server can load to the target depends on the
function for which you configure verbose mode.

You can load the following types of verbose data:
♦ Rows that meet the business rule. The PowerCenter Server writes rows to the Data
Profiling warehouse that meet the business rule in the function. For example, for a
Domain Validation function, you might load the values that match the specified domain
pattern.
♦ Rows that do not meet the business rule. The PowerCenter Server writes rows to the Data
Profiling warehouse that do not meet the business rule in the function. For example, when
you create a Row Uniqueness function, you might load only duplicate rows.
♦ All rows. The PowerCenter Server writes all verbose data rows to the Data Profiling
warehouse. For example, when you create a Domain Validation function, you might load
verbose data for values that match the domain pattern as well as those that do not.
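The three options amount to filtering rows on the outcome of the business rule before loading them. The following sketch is only an illustration of that idea; the function name, rule, and rows are invented and do not reflect PowerCenter internals.

```python
# Illustrative sketch of the three verbose-data load options. The rule
# here (a non-null check) and the row data are hypothetical; a profile
# function applies whatever business rule you define.

def select_verbose_rows(rows, rule, mode):
    """Return the rows to load, based on the verbose-data option."""
    if mode == "valid":        # rows that meet the business rule
        return [r for r in rows if rule(r)]
    if mode == "invalid":      # rows that do not meet the business rule
        return [r for r in rows if not rule(r)]
    return list(rows)          # all rows

rows = [{"status": "ACTIVE"}, {"status": None}, {"status": "CLOSED"}]
rule = lambda r: r["status"] is not None

valid = select_verbose_rows(rows, rule, "valid")      # 2 rows
invalid = select_verbose_rows(rows, rule, "invalid")  # 1 row
```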
When you configure a function, you can select the columns to profile in verbose mode. If the
precision for the data in the column you select exceeds 1,000 characters, the PowerCenter
Server may write truncated data to the Data Profiling warehouse.

To configure a function for verbose mode:

1. Configure a function that can output data in verbose mode.


2. Select the type of verbose data to load.
3. Click Select Columns.
The Columns for Verbose Data dialog box displays.

By default, all columns in the source are selected.


4. Clear the columns you do not want to profile in verbose mode.

5. Click OK.

Step 5. Configure the Profile Session


If you enabled session configuration on the Function-Level Operations page, the Profile
Wizard opens the Profile Run page so that you can configure the profile session properties.
You can configure and run a profile session or save the session configuration and run the
profile session at another time. For information about configuring profile sessions, see
“Running Sessions from the Profile Manager” on page 63.

Generating the Profile Mapping


After you create the profile, the Designer generates a mapping based on the profile metadata.
You must save changes to store the new data profile and profile mapping in the repository.
The Designer saves the data profile and profile mapping in the repository folder that stores
the source or mapplet output you profiled.
You can view profile mappings in the Designer. The Designer denotes profile mappings in the
Repository Navigator with a Profile Mappings icon.
Figure 3-6 shows Data Profile mappings in the Designer:

Figure 3-6. Data Profile Mappings in the Designer

The profile mapping name is based on the data profile name. By default, the mapping name
contains the prefix m_DP_. For example, if you name the data profile SalaryValidation, the
mapping name for the data profile is m_DP_SalaryValidation.
You can change the naming convention for profile mappings in the default data profile
options. For more information about changing default data profile options, see “Configuring
Default Data Profile Options” on page 26.

46 Chapter 3: Managing Data Profiles


Editing a Data Profile
You can edit a data profile to change any of the properties you configured. When you edit a
data profile, the Designer replaces the profile mapping.
You can edit an auto profile to change the profile name or description. However, if you add a
source to an auto profile, it becomes a custom profile. For more information about auto
profiles, see “Creating an Auto Profile” on page 34.
When you delete a source for which you have defined functions in a data profile, the Designer
marks the functions as invalid. If a function is invalid, the data profile is invalid. You can edit
the functions to use a valid source or sources.
Similarly, if you delete a column, change the column datatype, or rename the column, all
functions using this column are invalid. You can edit the functions to use the modified
version of the column or another column.

To edit a data profile:

1. Launch the Profile Manager.


To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch
Profile Manager. To launch the Profile Manager from the Mapplet Designer, select
Mapplets-Profiling-Launch Profile Manager.
The Profile Manager displays.

2. From the Profile View, select the profile you want to edit.
3. Select Profile-Edit.
The Profile Wizard displays.
4. Use the Profile Wizard to change any of the data profile properties.
For more information about using the Profile Wizard, see “Creating a Custom Profile” on
page 38.

Deleting a Data Profile
You can delete any data profile from the repository. When you delete a data profile, the Profile
Manager gives you the option to delete the associated profile mapping from the repository.
However, if you delete a profile mapping, the Designer does not delete the data profile
associated with the mapping. You must delete the data profile manually.
Tip: If you delete a profile mapping, but do not want to delete the data profile, you can save
the data profile and regenerate the profile mapping in the Profile Manager later.

To delete a data profile:

1. Launch the Profile Manager.


To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch
Profile Manager. To launch the Profile Manager from the Mapplet Designer, select
Mapplets-Profiling-Launch Profile Manager.
The Profile Manager displays.

2. From the Profile View, select the profile you want to delete.
3. Select Profile-Delete.
The Profile Manager asks if you want to delete the selected data profile.

4. Click Yes to delete the data profile and the associated profile mapping.
5. Click OK.
6. Click Repository-Save.

Working with Profile Mappings
When you create a data profile, the Designer generates mappings to create the metadata for
the data profile. You can copy or deploy profile mappings to other folders or other
repositories. Or, you can combine profile mappings with other mappings in the Designer.

Copying Data Profiling Objects


After you create a data profile and the Designer generates the profile mapping, you can copy
the mapping to another repository or repository folder. When you copy a profile mapping,
the Designer does not copy the data profile the mapping is associated with. To copy all the
objects associated with a profile, you must copy or deploy the entire folder to another
repository or repository folder.
You cannot copy a profile mapping with a reusable domain. If you copy a mapping with a
reusable domain and run a session for the mapping, the session fails.
For information about copying folders, see the Repository Guide.

Combining Data Profile Mappings with Other Mappings


You can combine profile mappings with other mappings. This allows the PowerCenter Server
to read the source data once to perform the mapping logic and profile the data. To do this,
make a copy of the profile mapping and edit the copy rather than the original profile
mapping. If you edit the original profile mapping, you may lose your changes. This occurs
because the Designer regenerates the mapping each time you edit the data profile from the
Profile Manager. When the Designer regenerates the mapping, it overwrites any changes.

Using Mapplets to Extend Data Profiling Functions
A function can operate on a column, a source, or multiple sources. Sometimes you need to
combine data from multiple sources or multiple columns before you can apply a particular function.
Or, you may need to aggregate data in order to get the profiling results you want. For
example, you want to create a Business Rule Validation function that operates on aggregate
values from a source. You need to aggregate the values before you can profile the data using
the Business Rule Validation function. To do this, you can use a mapplet.
You can use a mapplet when you want to profile the following information:
♦ Aggregate data from a single source
♦ Data from two or more sources with one or more matching ports
♦ Data from two sources with all matching ports

Extending Data Profiling Functionality with Mapplets


Complete the following steps to extend data profiling functionality with mapplets:
1. Create the mapplet. Create a mapplet to aggregate data or join or merge sources.
2. Create a data profile using the mapplet output data as a source. Create an auto profile
based on the mapplet output data. Or, create a custom profile based on the mapplet
output data and add functions to the data profile.
3. Run the data profile. Run the profile session. When you run the profile session from the
Profile Manager, it processes the mapplet data as it would if you were running a
workflow. You do not need to run a workflow to aggregate or join the data.
4. View the Data Profiling report. Open the Data Profiling report in the Profile Manager
or PowerAnalyzer to view the results.

Profiling Aggregate Data


When you want to profile aggregate data, you can use a mapplet to aggregate the data before
you create the data profile.
Use the mapplet to aggregate data when you want to use a column-level function on aggregate
data. For example, you have an Employee Expenses flat file source that provides information
on employee expenditures.
Table 3-1 shows the data from the Employee Expenses flat file source:

Table 3-1. Employee Expenses Sample Source Data

EID Spending Date Amount Reason

12 12/3/2003 123.22 Acquired new books

19 4/09/2004 600.21 Purchased ticket to Boston

213 6/29/2004 215.61 Purchased new software

12 6/12/2004 921.56 Acquired new books

19 6/16/2004 740.21 Purchased ticket to New York

21 7/21/2004 712.88 Purchased a new computer

To test data consistency, you want to create a data profile that shows employees who spent
over $1,000 in the last six months. To get this information, filter and aggregate the spending
amounts before you create the data profile.

Creating the Mapplet


Create a mapplet to aggregate and filter the data. Later, you can use the Profile Manager to
create a data profile based on the mapplet output.
Figure 3-7 shows a mapplet that aggregates and filters the Spendings source data:

Figure 3-7. Sample Mapplet to Aggregate and Filter Mapplet Output Data

When you create the mapplet, add a Filter transformation to filter out purchases older than
six months. Add the following condition to the Filter transformation:
DATE_DIFF (SYSDATE,SPENDING_DATE,'MM')<6

After you filter the data, add an Aggregator transformation to aggregate the cumulative
spending for each employee.

Figure 3-8 shows the Aggregator transformation settings:

Figure 3-8. Sample Mapplet - Aggregator Transformation Settings

In the Aggregator transformation, add the cumulative_emp_spending output port to
aggregate the amount of money each employee spent. Group the results by employee ID
(EID), so you can see the total for each employee. Connect the EID port and the
cumulative_emp_spending port to the Output transformation. You can then use the Profile
Manager to profile the mapplet output data.
Note: You do not need to run a session to generate the correct mapplet output data. When you
run a profile session, the Profile Manager processes the mapplet before it profiles the mapplet
output data.

Creating the Data Profile


After you create the mapplet to filter and aggregate the data, you can profile the mapplet
output data. From the Profile Manager, create a custom profile.
The custom profile locates the employees who spent more than $1,000 in the last six months
using the source-level Business Rule Validation function. Create a Business Rule Validation
function using the following expression:
cumulative_emp_spending > 1000

When you specify the type of verbose data to load to the Data Profiling warehouse, select
valid rows only. This ensures that you can view the verbose data for the employees who spent
over $1,000 in the last 6 months.
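The filter-then-aggregate logic of this example can be checked outside PowerCenter. The following sketch is an illustration, not the mapplet itself: it approximates DATE_DIFF with a month-difference helper and assumes a fixed reference date of August 1, 2004 in place of SYSDATE so that the six-month window matches the sample data.

```python
from datetime import date
from collections import defaultdict

# Sample rows from Table 3-1: (EID, spending date, amount).
expenses = [
    (12,  date(2003, 12, 3), 123.22),
    (19,  date(2004, 4, 9),  600.21),
    (213, date(2004, 6, 29), 215.61),
    (12,  date(2004, 6, 12), 921.56),
    (19,  date(2004, 6, 16), 740.21),
    (21,  date(2004, 7, 21), 712.88),
]

# Assumed reference date standing in for SYSDATE in the Filter condition
# DATE_DIFF(SYSDATE, SPENDING_DATE, 'MM') < 6.
today = date(2004, 8, 1)

def months_between(later, earlier):
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

# Filter transformation: keep purchases from the last six months.
recent = [(eid, amt) for eid, d, amt in expenses if months_between(today, d) < 6]

# Aggregator transformation: cumulative spending grouped by EID.
cumulative_emp_spending = defaultdict(float)
for eid, amt in recent:
    cumulative_emp_spending[eid] += amt

# Business Rule Validation: employees who spent over $1,000.
over_limit = {eid: round(total, 2)
              for eid, total in cumulative_emp_spending.items() if total > 1000}
# over_limit -> {19: 1340.42}
```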

Viewing the Data Profiling Report
Once you create the data profile, you can run a profile session. After you run a profile
session, you can view the profile results in the PowerCenter Data Profiling report or the
PowerAnalyzer Data Profiling report. For example, when you view a PowerCenter Data
Profiling report, you can view the verbose data for the rows that satisfy the business rule.
You can see that the employee with employee ID 19 spent $1,340.42. All other employees
spent under $1,000, and therefore do not appear in the Verbose Report Summary.
Figure 3-9 shows the verbose summary report:

Figure 3-9. Sample Verbose Summary Report

Profiling Multiple Sources with One Matching Port


When you want to profile two related sources with one or more matching ports, you can
create a mapplet with a Joiner transformation to join the sources. Then, you can profile the
mapplet output data.
For example, you have an Items relational source and a Manufacturers relational source. The
Items relational source contains information about items, such as item description, wholesale
cost, and item price. The Manufacturers relational source contains information about the
manufacturers who manufacture the items. You want to find the manufacturers whose items
sell with a markup that is greater than 50 percent of the wholesale cost.

Creating the Mapplet
To determine this information, you need to join the two sources before you can create a data
profile. Create the following mapplet using a Joiner transformation to join the two sources.
Figure 3-10 shows a mapplet used to join data from two sources:

Figure 3-10. Sample Mapplet to Join Data from Two Sources

Use the following join condition:


MANUFACTURER_ID1 = MANUFACTURER_ID

Creating the Data Profile


After you join the sources, you can profile the mapplet output data to find the manufacturers
whose items sell with a markup that is greater than 50 percent of the wholesale cost.
Create a custom profile with a source-level Business Rule Validation function and enter the
following expression in the Rule Editor:
PRICE > (WHOLESALE_COST +(WHOLESALE_COST * .50))

When you specify the type of verbose data to load to the Data Profiling warehouse, select
valid rows only. This allows you to see the verbose data for the rows that meet the business
rule.
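The join and business rule can be pictured with a small sketch. The item and manufacturer rows below are invented for illustration; only the join condition and the markup expression come from the example above.

```python
# Illustrative join of two hypothetical sources, mirroring the Joiner
# transformation and the markup rule. Column values are invented.

items = [
    {"ITEM": "Lamp",  "WHOLESALE_COST": 20.0, "PRICE": 32.0, "MANUFACTURER_ID": 101},
    {"ITEM": "Chair", "WHOLESALE_COST": 40.0, "PRICE": 55.0, "MANUFACTURER_ID": 102},
]
manufacturers = {101: "Brightco", 102: "Seatwell"}  # MANUFACTURER_ID1 -> name

# Join condition: MANUFACTURER_ID1 = MANUFACTURER_ID.
joined = [dict(row, MANUFACTURER=manufacturers[row["MANUFACTURER_ID"]])
          for row in items if row["MANUFACTURER_ID"] in manufacturers]

# Business rule: PRICE > (WHOLESALE_COST + (WHOLESALE_COST * .50)).
high_markup = [row["MANUFACTURER"] for row in joined
               if row["PRICE"] > row["WHOLESALE_COST"] * 1.5]
# Lamp: 32 > 30, so Brightco qualifies; Chair: 55 > 60 is false.
```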

Viewing the Data Profiling Report


Once you create a data profile, you can run a profile session. After you run a profile session,
you can view the profile results in the PowerCenter Data Profiling report or the
PowerAnalyzer Data Profiling report. For example, when you view a PowerCenter Data
Profiling report, you can see the information for the companies whose items sell with a
markup greater than 50 percent of the wholesale cost.

Figure 3-11 shows the Data Profiling Report for the mapplet output data:

Figure 3-11. Sample Data Profiling Report for Mapplet Output Data

Profiling Sources that Use All Matching Ports


When you want to profile data from two sources that use the same ports, you can use a
mapplet to merge the sources and then profile the mapplet output data. For example, you
have the Orders1 and Orders2 flat file sources that contain order information from two
different stores. The sources use the same structure, but are stored in two different files. Both
sources contain the following ports:
♦ OrderID
♦ CID
♦ DATE
♦ AMOUNT
You want to create an auto profile that displays profile data from both sources. You also want
to ensure that the order IDs are distinct for order items in both stores.

Creating the Mapplet


To create an auto profile that displays data from both sources, create a mapplet that uses a
Union transformation to merge the sources.

Figure 3-12 shows a mapplet that uses a Union transformation to merge source data:

Figure 3-12. Sample Mapplet to Merge Mapplet Output Data

After you merge the data using the Union transformation, you can profile the mapplet output
data using the Profile Manager.

Creating the Data Profile


Create an auto profile based on the merged source data. This ensures that functions, such as
the Row Uniqueness function, can generate reports based on all the data rather than only a
part of the source data.
When you specify the type of verbose data to load to the Data Profiling warehouse, select all
rows to see both valid and exception data.

Viewing the Data Profiling Report


Once you create the data profile, you can run a profile session. After you run a profile session,
you can view the profile results in the PowerCenter Data Profiling report or the
PowerAnalyzer Data Profiling report. For example, when you view a PowerCenter Data
Profiling report, you can view data from both stores. The percentage of duplicate values for
the ORDERID column is 0 percent. Therefore, the Order ID for each order item is unique.
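The duplicate-percentage check behind this conclusion can be sketched as follows. The order rows are invented; the point is that a Union transformation keeps every row from both sources, so a Row Uniqueness check on ORDERID covers both stores.

```python
# Hypothetical order rows: (OrderID, CID, DATE, AMOUNT), one list per store.
orders1 = [(1001, "C01", "6/12/2004", 99.50), (1002, "C02", "6/13/2004", 24.00)]
orders2 = [(2001, "C03", "6/12/2004", 87.25), (2002, "C01", "6/14/2004", 15.75)]

# A Union transformation keeps all rows from both sources (it does not
# deduplicate), so concatenation models the merged output.
merged = orders1 + orders2

order_ids = [row[0] for row in merged]
duplicates = len(order_ids) - len(set(order_ids))
pct_duplicate = 100.0 * duplicates / len(order_ids)
# pct_duplicate -> 0.0, so every order ID is unique across both stores
```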

Figure 3-13 shows the Data Profiling report for the Union mapplet:

Figure 3-13. Sample Data Profiling Report for the Union Mapplet

Troubleshooting
I received a message that my data profile updated successfully but the mapping associated
with the data profile is not valid and the data profile must be run in interactive mode.
When you update a data profile, the Designer validates the profile mapping. After validating
the mapping, the Designer displays a message if the mapping includes blocking
transformations. Mappings that include blocking transformations may hang when you run
the associated session. For information about how blocking transformations affect mapping
validation, see the Designer Guide.

Chapter 4

Running Profile Sessions

This chapter includes the following topics:


♦ Overview, 62
♦ Running Sessions from the Profile Manager, 63
♦ Profiling Data Samples, 68
♦ Profiling Relational Data Samples, 70
♦ Creating a Session in the Workflow Manager, 74
♦ Troubleshooting, 75

Overview
To generate information about source data from a data profile, you must create and run a
profile session. You can create and run profile sessions from the following tools:
♦ Profile Manager. You can create and run profile sessions from the Profile Manager. This
allows you to run sessions immediately to quickly obtain profile results from a source. You
can also run a session from the Profile Manager when you want to profile a sample of
source data instead of the entire source. When you create sessions from the Profile
Manager, the Profile Manager creates a workflow and associates it with the session. For
information about running sessions from the Profile Manager, see “Running Sessions from
the Profile Manager” on page 63.
♦ Workflow Manager. If you want to monitor ongoing data quality issues, you can create a
persistent session and workflow for the profile mapping in the Workflow Manager and add
a scheduling task. This allows you to perform a time-dimensional analysis of data quality
issues. You can also edit and run persistent sessions that you create in the Profile Manager
from the Workflow Manager. For more information about creating and running sessions
and workflows, see the Workflow Administration Guide.

Working with Data Samples


When you run a session from the Profile Manager, you can create a data profile based on a
sample of data rather than the entire source. You use data sampling when you want to
understand general trends within the data or view some exception data. For information
about data sampling, see “Profiling Data Samples” on page 68 and “Profiling Relational Data
Samples” on page 70.

Running Sessions from the Profile Manager
You can run a profile session from the Profile Manager to quickly profile sources during
mapping development. When you run a session from the Profile Manager, the session is run
interactively. You can also run sessions from the Profile Manager for existing data profiles at
any time.
You can run sessions from the Profile Manager in the following cases:
♦ Immediately after you create a data profile.
♦ At any time for an existing data profile from the Profile Manager.
You can run temporary or persistent sessions. A temporary session runs on demand and is not
stored in the repository. A persistent session can run on demand and is stored in the
repository. When you run a temporary session, the PowerCenter Server uses normal mode to
load data to the target.
You can configure the default data profile options to run a session immediately after you
create a data profile. For more information about configuring default data profile options, see
“Configuring Default Data Profile Options” on page 26.

Configuring a Session in the Profile Wizard
You must configure a session before you can run it. When you configure a session to run from
the Profile Manager, you create and configure the session in the Profile Wizard. When you
create a session in the Profile Wizard, you can run it immediately or save it and run it later.
Figure 4-1 shows the Profile Run page of the Profile Wizard:

Figure 4-1. Profile Wizard - Profile Run page

Table 4-1 describes the session property settings on the Profile Run page:

Table 4-1. Profile Run Properties for Profile Sessions

Profile Run Options   Required/Optional   Description

Server        Required   Select a PowerCenter Server to run the profile session.

Profile Run   Required   Select one of the following options:
                         - Create a persistent session for the profile. A persistent session is
                           stored in the repository.
                         - Create a temporary session for this profile. A temporary session is
                           not stored in the repository.

Sampling      Required   Select one of the following options:
                         - No Sampling. Select to disable profile sampling. When you disable
                           sampling, the Designer generates a profile based on all selected
                           source data.
                         - Sample First N Rows. Select to read the first N rows from the source
                           (up to the number of rows you specify).
                         - Automatic Random Sampling. Select to run a data profile for a random
                           sample of source data and allow the PowerCenter Server to determine
                           the percentage of data to sample. The percent sampled represents a
                           percentage of the total row count. The PowerCenter Server scales the
                           percent of data to sample based on the total row count.
                         - Manual Random Sampling. Select to specify a percentage of data to
                           sample. You can specify 1 to 100 percent. The PowerCenter Server
                           selects random row sets from all parts of the source data.
                         For more information about using data samples, see “Profiling Data
                         Samples” on page 68.
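As a rough illustration of two of the sampling options, the following sketch applies Sample First N Rows and Manual Random Sampling to a hypothetical row list. The PowerCenter Server's actual sampling algorithm is internal; this only shows the intent of each option.

```python
import random

# Stand-in for 100 source rows; the data is invented for illustration.
rows = list(range(1, 101))

# Sample First N Rows: read the first N rows from the source.
first_n = rows[:10]

# Manual Random Sampling: sample a user-specified percentage (1 to 100)
# of rows drawn from all parts of the source.
def manual_random_sample(rows, percent, seed=0):
    k = max(1, len(rows) * percent // 100)
    return random.Random(seed).sample(rows, k)

sample = manual_random_sample(rows, 20)  # roughly 20 percent of the rows
```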

After you configure the options for the Profile Run page, you must configure options for the
Session Setup page.

Figure 4-2 shows the Session Setup page of the Profile Wizard:

Figure 4-2. Profile Wizard - Session Setup Page

Table 4-2 describes the session property settings on the Session Setup page:

Table 4-2. Session Setup Properties for Profile Sessions

Session Setup Options   Required/Optional   Description

Source Properties       Required   Configure source connection properties on the Connections
                                   tab. Configure source properties on the Properties tab.
                                   Configure reader properties on the Reader tab. The source
                                   properties are the same as those in the session properties
                                   you configure in the Workflow Manager. For more information
                                   about session property settings for sources, see the
                                   Workflow Administration Guide.

Target Connections      Required   The relational database connection to the Data Profiling
                                   warehouse database. This is the relational database
                                   connection you configured for the Data Profiling warehouse
                                   in the Workflow Manager. For more information about
                                   configuring relational database connections, see the
                                   Workflow Administration Guide.

Reject File Directory   Required   Directory for session reject files. The default reject file
                                   directory is $PMBadFileDir\.

Run Session             Optional   Select to run the session immediately. Otherwise, the
                                   Profile Manager saves the session configuration information
                                   and exits.

Running a Session when You Create a Data Profile
You can configure and run profile sessions when you create a data profile. To do this, select
Configure Session on the Auto Profile Column Selection page of the Profile Wizard when you
create an auto profile. Or, select Configure Session on the Function-Level Operations page of
the Profile Wizard when you create a custom profile. After the Profile Wizard generates the
profile mapping, it prompts you to configure and run the session.
If you want to run a session immediately after creating a data profile, configure the Always
Run Profile Interactively option in the default data profile options.
For more information about configuring default data profile options to always run profile
sessions, see “Configuring Default Data Profile Options” on page 26. For more information
about running sessions from the Profile Manager, see “Using the Profile Manager” on page 6.

Running a Session for an Existing Data Profile


If you want to run a session for an existing data profile, select the data profile in the Profile
Manager and click Profile-Run. The Profile Wizard prompts you to configure the session
before running it. For more information about the Profile Manager, see “Using the Profile
Manager” on page 6.

Monitoring Interactive Sessions


After you configure an interactive session, the PowerCenter Server runs the session. You can
monitor the progress of the session from the Profile Manager. The appearance of the profile
node icon in the Navigator Window changes depending on the status of the session. The
session status displays in the Profile Session Status window.
Figure 4-3 shows profile session status in the Profile Manager:

Figure 4-3. Profile Session Status in the Profile Manager

You can also monitor the profile session from the Workflow Monitor. The PowerCenter
Server creates a workflow for profile sessions. For more information about monitoring
workflows, see the Workflow Administration Guide.
If an interactive session fails, the PowerCenter Server writes a session log. You can review the
session log, correct any errors, and restart the session. To view the session log, click
View-Session Log in the Profile Manager.
When the session successfully finishes, you can view reports that contain the profile results.
For more information about viewing Data Profiling reports, see “Viewing Profile Results” on
page 77.

Profiling Data Samples
When you run a session from the Profile Manager, you can create a data profile based on a
sample of data rather than the entire source. Use data sampling when you want to understand
general trends within the data or view some exception data.
For example, you have a large data source and you want to quickly verify that a business rule
you created in a custom profile returns the data you expect. Instead of profiling the entire
source, you can run a profile session against a sample of the data. This allows you to test the
business rule quickly.
When you run a profile session with data sampling, the Data Profiling report shows the
estimated total number of rows in the source, the number of rows sampled, the mode of
sampling used, and the percentage of the total source sampled. For information about viewing
data profiling reports from the Profile Manager, see “Overview” on page 78.
When you sample relational data, the PowerCenter Server can sample data by delegating
sampling to the database or by sampling as the PowerCenter Server reads data from the
database. You may be able to improve data sampling performance and accuracy by enabling
the PowerCenter Server to delegate sampling to the database. For more information about
working with relational data samples, see “Profiling Relational Data Samples” on page 70.
To profile samples of data, complete the following steps:
1. Select a function type.
2. Select a data sampling mode.

Step 1. Select a Function Type


When you create data profiles, some functions can return useful data using data samples.
Other functions need to perform calculations on all source data to return useful information.
Table 4-3 describes function behavior with data samples:

Table 4-3. Function Behavior with Data Samples

Function Type Behavior when used with Data Samples

Source-level functions Profiles created with source functions and data samples can display general patterns within
the data.

68 Chapter 4: Running Profile Sessions


Column-level functions Data profiles created with the following column-level functions and data samples can display
general patterns within the data:
- Domain Inference
- Business Rule Validation
- Distinct Value Count
Data samples for NULL Count and Average Value Aggregate functions display general
patterns within the data. However, Minimum and Maximum Value Aggregate functions can
have inconsistent results because a column can have unusually high maximum values or
unusually low minimum values.

Intersource functions You cannot use data samples with intersource functions.
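The caution about Minimum and Maximum Value Aggregate functions can be illustrated with a short sketch. The following Python example (illustrative only, not product code; the data and the 1 percent rate are assumptions) shows that a random sample tracks an average closely but can easily miss a column's few extreme values:

```python
import random

random.seed(7)
# Hypothetical source: 100,000 mostly mid-range values with two outliers.
data = [random.gauss(500, 50) for _ in range(100_000)]
data[12345] = 99_999.0   # one unusually high maximum
data[54321] = -99_999.0  # one unusually low minimum

# A 1 percent random sample, similar in spirit to profile data sampling.
sample = [v for v in data if random.random() < 0.01]

# Averages computed from the sample track the full data closely...
full_avg = sum(data) / len(data)
samp_avg = sum(sample) / len(sample)

# ...but the sample usually misses the two outliers entirely, so the
# sampled minimum and maximum can differ wildly from the true values.
print(full_avg, samp_avg)
print(min(data), min(sample))
print(max(data), max(sample))
```

This is why Aggregate functions that depend on extreme values, and intersource functions, give more reliable results when they read all source data.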

Step 2. Select a Data Sampling Mode


When you use sample data to create a data profile, you may want to use a different sampling
mode depending on the percentage of data you want to sample. Choose from the following
data sampling modes:
♦ Automatic Random Sampling. The PowerCenter Server determines the percentage of data
to sample, scaling the percentage based on the size of the source data. Use this option
when you do not know the size of the source data.
♦ Manual Random Sampling. You can specify a percentage of data to sample from 1 to 100.
The PowerCenter Server selects random data from all parts of the source. Use this option if
you want to control the percentage of data to sample.
♦ Sample First N Rows. The PowerCenter Server reads the first N rows from the source, up
to the number of rows you specify. Use First N Rows sampling when you cannot use
manual random sampling for the data profile function, or when you want to specify the
exact number of rows to profile. For example, if you have a very large source, you might
sample the first 100 rows to understand some basic information about the source.

Profiling Data Samples 69


Profiling Relational Data Samples
When you create a data profile based on relational data samples, the PowerCenter Server can
obtain data samples in the following ways:
♦ By sampling as the PowerCenter Server reads data. The PowerCenter Server performs a
sampling algorithm on the source data as it reads it from the database.
♦ By delegating sampling to the database. The PowerCenter Server delegates the sampling
operation to the database, and the PowerCenter Server reads the sampled data after the
sampling operation has taken place on the database.
When you run a data profile configured to sample data, the PowerCenter Server first attempts
to delegate sampling to the database. If it cannot delegate sampling to the database, it runs an
algorithm on the data as it reads the data from the database.
When you create a data profile, the PowerCenter Server can delegate sampling to the
following databases:
♦ Oracle
♦ DB2
Often, you can optimize performance and accuracy by enabling the PowerCenter Server to
delegate sampling to the database.
To configure the Profile Manager to allow the PowerCenter Server to delegate sampling to the
database, ensure that you use native database drivers to connect to the database. You do not
need to complete any other steps to configure the PowerCenter Server to delegate sampling.

Performing Sampling as the PowerCenter Server Reads Data


When the PowerCenter Server performs sampling as it reads data, it selects the data samples
using the C rand function. You can specify that the PowerCenter Server samples data in the
following ways:
♦ Manual random sampling. You specify a percentage of data to sample, from 1 to 100. The
PowerCenter Server selects random data from the source up to the percentage you specify.
♦ Automatic random sampling. The PowerCenter Server determines the percentage of data
to sample based on the size of the source data.

Using Manual Random Sampling


When you use manual random sampling, you specify a percentage of data for the
PowerCenter Server to sample, and it samples that percentage of data as it reads rows. For
example, if you have 100,000 rows and you sample 50 percent of the data, the PowerCenter
Server reads 50 percent of the first 10,000 rows, and 50 percent of the next 10,000 rows, until
it samples 50 percent of the data across the data set.
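As a sketch of this behavior, the following Python example (illustrative only, not Informatica's actual algorithm, which uses the C rand function) keeps each row independently with the specified probability, so roughly the same percentage of every 10,000-row stretch is sampled:

```python
import random

def manual_random_sample(rows, percent, seed=None):
    """Keep roughly `percent` percent of rows, deciding row by row as
    the rows are read. A sketch of sampling while reading, not the
    PowerCenter Server's actual implementation."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() * 100 < percent]

rows = list(range(100_000))
sample = manual_random_sample(rows, 50, seed=1)

# Because each row is kept with the same probability, every
# 10,000-row block of the source contributes about 50 percent.
first_block = [r for r in sample if r < 10_000]
print(len(sample), len(first_block))
```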



Figure 4-4 shows a manual PowerCenter sample:

Figure 4-4. Manual PowerCenter Sample

% sampled:  50%   50%   50%   50%   50%   50%   50%   50%   50%   50%
Rows read:  10 K  20 K  30 K  40 K  50 K  60 K  70 K  80 K  90 K  100 K

Using Automatic Random Sampling


When the PowerCenter Server performs automatic random sampling, it determines the
percentage of data to sample as it reads data from the database. Because you do not specify the
percentage of data to sample, the PowerCenter Server begins by sampling 100 percent of the
data and samples progressively less data as it reads from the database. This ensures that it
samples sufficient data for accuracy.
For example, you have 100,000 rows of data to sample. When the PowerCenter Server reads
the data, it samples 100 percent of the first 10,000 rows, 90 percent of the next 10,000 rows,
80 percent of the next 10,000 rows, and progressively less data for each group of 10,000 rows.
Figure 4-5 shows how data is sampled using automatic random sampling:

Figure 4-5. Automatic Random Sampling

% sampled:  100%  90%   80%   70%   60%   50%   40%   30%   20%   10%
Rows read:  10 K  20 K  30 K  40 K  50 K  60 K  70 K  80 K  90 K  100 K
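The decreasing schedule above can be sketched as a simple function. This is illustrative only; the actual percentages are determined by the PowerCenter Server at run time, and the 10-point steps and 10 percent floor are assumptions based on the example:

```python
def automatic_sample_rate(row_number, block_size=10_000, floor=10):
    """Sketch of a decreasing sampling schedule: 100 percent for the
    first block, 90 percent for the second, and so on, never dropping
    below a floor."""
    block = row_number // block_size      # 0-based block index
    return max(100 - 10 * block, floor)

# The first 10,000 rows are sampled at 100 percent, the next at
# 90 percent, and so on down the schedule.
rates = [automatic_sample_rate(b * 10_000) for b in range(10)]
print(rates)
```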

Delegating Sampling to the Database


When the PowerCenter Server delegates sampling to the database, the database uses native
algorithms to sample data. The database samples data in the same way for both automatic and
manual random sampling because in both cases the database uses the same algorithm to
perform the sampling function.
When the PowerCenter Server delegates sampling to the database, the database performs the
sampling operation, and the PowerCenter Server then reads the sampled data. For example, if
you have 100,000 rows of data and you sample 50 percent of the data, the database samples
50 percent of the first 10,000 rows, 50 percent of the next 10,000 rows, and so on. The
database selects an equal percentage of data samples from each part of the data source, so
while the samples are random, they are drawn equally from across the data set.
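As an illustration of what pushdown sampling can look like, the following hypothetical helper builds the documented row-sampling clauses of the two supported databases (Oracle's SAMPLE clause and DB2's TABLESAMPLE clause). The exact SQL that PowerCenter generates is not shown in this guide; the helper and table name are assumptions for illustration:

```python
def sampling_query(table, percent, database):
    """Hypothetical helper returning a sampling query for the kind of
    pushdown each database offers. Not PowerCenter's generated SQL."""
    if database == "oracle":
        # Oracle row sampling clause: SAMPLE(percent)
        return f"SELECT * FROM {table} SAMPLE({percent})"
    if database == "db2":
        # DB2 row-level Bernoulli sampling
        return f"SELECT * FROM {table} TABLESAMPLE BERNOULLI({percent})"
    raise ValueError("sampling is not delegated for this database")

print(sampling_query("orders", 50, "oracle"))
print(sampling_query("orders", 50, "db2"))
```

For other databases, the PowerCenter Server falls back to sampling the rows itself as it reads them.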

Profiling Relational Data Samples 71


Figure 4-6 shows sampling delegated to the database:

Figure 4-6. Data Sampling Delegated to the Database

% sampled:  50%   50%   50%   50%   50%   50%   50%   50%   50%   50%
Rows read:  10 K  20 K  30 K  40 K  50 K  60 K  70 K  80 K  90 K  100 K

Note: To enable database sampling, use native database drivers to connect to the source
database.

Improving Data Sampling Accuracy and Performance


You can improve data sampling accuracy and performance when you work with the following
data sources:
♦ Data sources with historical data
♦ Large data sources

Working with Historical Data


If you have historical data, you can improve data accuracy by enabling the PowerCenter
Server to delegate sampling to the database. For example, you have 100,000 rows of data, and
the latest data is stored in the first 30,000 rows, the newer data is stored in the next 30,000
rows, and the oldest data is stored in the last 40,000 rows. When you select automatic random
sampling and the PowerCenter Server samples as it reads data, it samples the greatest
percentage of the data from the latest data, and the least percentage of data from the oldest
data. If there are great differences between the historical data and the latest data, the resulting
data profile reflects the latest data more accurately than the older data.
Figure 4-7 shows how Data Profiling processes an historical sample of data:

Figure 4-7. Data Profile of an Historical Sample of Data

% sampled:  100%  90%   80%   70%   60%   50%   40%   30%   20%   10%
Rows read:  10 K  20 K  30 K  40 K  50 K  60 K  70 K  80 K  90 K  100 K
(Rows 1-30,000: Newest Data, 2000-2004. Rows 30,000-60,000: Newer Data, 1995-2000.
Rows 60,000-100,000: Oldest Data, 1990-1995.)



To improve data accuracy, you can ensure the PowerCenter Server delegates sampling to the
database. When you enable database sampling, the database samples data equally from across
the data set.
When the PowerCenter Server cannot delegate sampling to the database, you can improve
data profiling results for historical data by using manual random sampling. Manual random
sampling, like database sampling, samples a fixed percentage of data across the data set.
Table 4-4 shows recommended sampling percentages for the most accurate results with
manual random sampling:

Table 4-4. Recommended Percentages for Manual Random Sampling

Data Size                  Percentage of Data to Sample
1 million or more rows     1 percent
100,000 or more rows       10 percent
100 - 1,000 rows           Greater than 50 percent

Working with Large Data Sources


You can improve performance when you have a large source by ensuring the PowerCenter
Server delegates sampling to the database. For example, you have a data source with
100,000,000 rows, and you select automatic random sampling. When the PowerCenter Server
delegates sampling to the database, the database selects a sample of 1,000,000 rows and the
PowerCenter Server reads only 1,000,000 rows. However, if the PowerCenter Server performs
the sampling algorithm, it must first read 100,000,000 rows and then sample from those
rows.



Creating a Session in the Workflow Manager
You can create persistent profile sessions and workflows in the Workflow Manager. You may
want to create and configure a profile session and workflow from the Workflow Manager to
monitor ongoing data quality issues, or to have more control over the session and workflow
properties. You may also want to add workflow tasks, such as a scheduler that allows you to
schedule the workflow to run regularly. This allows you to perform a time-dimensional
analysis of data quality issues.
When you create a persistent session and workflow in the Workflow Manager, the following
rules apply:
♦ You must run the session from the Workflow Manager.
♦ You cannot run the session in real time.
♦ You must view the session status in the Workflow Monitor.
You can view Data Profiling reports that you generate from the Workflow Manager in the
Profile Manager or PowerAnalyzer.
For more information about using the Workflow Manager and Workflow Monitor, see the
Workflow Administration Guide. For more information about viewing reports, see “Viewing
Profile Results” on page 77.



Troubleshooting
An interactive session failed due to an invalid SQL query.
The interactive session failed because you need to specify an owner name for the Target Name
Prefix in the session properties. However, you cannot specify the owner name for an
interactive session. To specify an owner name for the session, configure a persistent session in
the Workflow Manager. Specify the owner name for the Target Name Prefix in the session
properties. For more information about configuring persistent sessions, see “Creating a
Session in the Workflow Manager” on page 74.

I tried to run an interactive session for an SAP R/3 source, but the session failed.
When you create a data profile for an SAP R/3 source, the Designer does not generate an
ABAP program for the data profile. If you run an interactive session immediately after
creating the data profile, the session fails. Create an ABAP program for the profile mapping
and then run the session.

An interactive session failed with an error message stating that the buffer block size is too
low.
You ran an interactive session for a source with a large number of rows with high precision.
Or, you ran an interactive session for a multi-group source with a large number of groups. As
a result, the PowerCenter Server could not allocate enough memory blocks to hold the data,
and the session failed.
Create a persistent session for the profile mapping in the Workflow Manager. In the session
properties, set the value for the buffer block size that the error message in the session log
recommends. For more information about optimizing sessions and setting the buffer block
size, see the Workflow Administration Guide.

My profile session fails but no error message displays.


If a data profile contains a Business Rule Validation function and a column name on which
the Business Rule Validation function is based changes, the profile session fails. The
PowerCenter Server does not display the error on screen, but does write the error to the log
file.

I tried to restart a persistent session from the Profile Manager and I received the following
error message:
ERROR: Failed to fetch Workflow [id = <workflow ID>] from the repository.
Please check the session log.

However, when I try to locate the session log, no session log for this session run exists.
You deleted the workflow for the profile session in the Workflow Manager. When you delete a
workflow in the Workflow Manager, the PowerCenter Server cannot run the session or
generate a session log. To run the profile session from the Profile Manager, reconfigure and
save the session information in the Profile Wizard.

I edited session properties for a persistent session from the Workflow Manager. However,
the properties I configured in the Workflow Manager do not appear on the Session Setup
page of the Profile Wizard.
You edited the profile session from the Workflow Manager while the Profile Wizard was open.
To refresh the session properties in the Profile Wizard, close and reopen the Profile Wizard.

I edited session properties for a persistent session from the Profile Wizard, but the session
properties I configured did not get saved.
You edited the profile session from the Profile Wizard while the session was open in the
Workflow Manager. Close the workflow in the Workflow Manager before you edit the session
properties from the Profile Wizard.



Chapter 5

Viewing Profile Results

This chapter covers the following topics:


♦ Overview, 78
♦ PowerCenter Data Profiling Reports, 79
♦ PowerAnalyzer Data Profiling Reports, 85

Overview
After you run a profile session, you can view the session results in a report. There are two
types of Data Profiling reports:
♦ PowerCenter Data Profiling reports. Reports you can view from the Profile Manager after
running a profile session. PowerCenter Data Profiling reports display data for the latest
session run. Use PowerCenter Data Profiling reports to quickly view profile results during
mapping development.
♦ PowerAnalyzer Data Profiling reports. Reports you can view from PowerAnalyzer after
running a profile session. PowerAnalyzer reports provide a time-dimensional view of your
data. They also display information about rejected rows in your profile results. Use
PowerAnalyzer Data Profiling reports when you want to monitor data quality during
production.

78 Chapter 5: Viewing Profile Results


PowerCenter Data Profiling Reports
After you run a profile session, you can view PowerCenter Data Profiling reports from the
Profile Manager. The reports provide information based on the last profile session run. The
reports display data profile information based on the functions in the data profile. Use
PowerCenter Data Profiling reports to view the latest profile information about your source
data during mapping development. For more information about running profile sessions, see
“Running Profile Sessions” on page 61.
The format of PowerCenter Data Profiling reports is static and cannot be modified except for
resizing columns. PowerCenter stores the reports as XML files in the PowerCenter
Client\Extensions\DataProfile\ProfileReports directory.
You can view two types of PowerCenter Data Profiling reports:
♦ Auto profile. An auto profile report contains profiling information for the predefined
functions in an auto profile. Viewing an auto profile report gives you an overall view of the
profile information in your source.
♦ Custom profile. Custom profile reports contain profiling information for the functions
you defined in your custom profile. The amount of information in custom profile reports
varies depending on the functions you specified in your data profile.

Auto Profile Reports


An auto profile report displays information about source data based on the functions in an
auto profile. The report displays information for the following functions:
♦ Row Count. Counts the number of rows read from the source. When profiling data
samples, the PowerCenter Server estimates the number of rows.
♦ Distinct Value Count. Returns the number of distinct values for the column. You can view
verbose data for this function if you selected this mode when creating the auto profile.
♦ Aggregate. Calculates an aggregate value for a numeric or string value or calculated
expression applied to one column of a profile source.
♦ Domain Inference. Reads all values in the column and infers a pattern that fits the data.
♦ Candidate Key Evaluation. Calculates the number and percentage of unique values for
one or more columns in a source.
♦ Redundancy Evaluation. Calculates the number of duplicate values in one or more
columns of a source.
For more information about creating an auto profile, see “Creating an Auto Profile” on
page 34. For more information about profile functions, see “Working with Functions” on
page 103.
An auto profile report contains information about the data profile in a report summary. It also
contains profile results in the body of the report.

PowerCenter Data Profiling Reports 79


Table 5-1 describes the attributes that display in the report summary of an auto profile report:

Table 5-1. Auto Profile Report Summary Attributes

Attribute Description

Auto Profile Name Name of the auto profile.

Profile Description Description of the auto profile.

Profile Run Time Date and time of the profile session run.

Folder Name Name of the folder in which the data profile is stored.

Repository Name Name of the repository in which the data profile is stored.

Source Name of the source definition on which the auto profile is based in the following format:
<database name>::<source definition name>

Sample Type The type of data sampling used in the report. For more information about using data
samples, see “Profiling Data Samples” on page 68.

Profiled Rows The number of rows applied to the data profile.

Total Rows The total number of rows in the source.

% Profiled The percentage of the total number of source rows profiled.

Groups The groups in the source definition or mapplet on which the auto profile is based, where
applicable.

The body of an auto profile report provides general and detailed information. In an auto
profile report, you can click the hyperlinks to view information about verbose data for the
Distinct Value Count function, Redundancy Evaluation function, and Domain Inference
function.



Figure 5-1 shows a sample auto profile report:

Figure 5-1. Sample Auto Profile Report

In the report, you can hide the report summary attributes, launch the report in a browser,
and click a hypertext link to view verbose data.

Figure 5-2 shows verbose report results:

Figure 5-2. Sample Verbose Data Report

In the verbose data report, you can hide the report summary field and return to the report
summary.

Custom Profile Reports


You can view PowerCenter Data Profiling reports for custom profiles. The data the report
displays depends on the functions you included in your data profile. The custom profile
report displays data in the order of the functions you defined in your data profile. For more
information about functions, see “Working with Functions” on page 103. For more
information about creating a custom profile, see “Creating a Custom Profile” on page 38.
A custom profile report contains information about the data profile in a report summary. It
also contains profile results in the body of the report. Each section in the body of the report
corresponds to a different function.
Table 5-2 describes the attributes that display in the report summary of a custom profile
report:

Table 5-2. Custom Profile Report Summary Attributes

Attribute Description

Custom Profile Name Name of the custom profile.

Profile Description Description of the custom profile.

Profile Run Time Date and time of the profile session run.

Folder Name Name of the folder in which the data profile is stored.

Repository Name Name of the repository in which the data profile is stored.

Source Name The type and name of the source definitions upon which the custom profile is based.

Sample Type The type of data sampling used in the report. For more information about using data
samples, see “Profiling Data Samples” on page 68.

Profiled Rows The number of rows applied to the data profile.

Total Rows The total number of rows in the source.

% Profiled The percentage of the total number of source rows profiled.

Show report for all Click to show Data Profiling reports for all sources upon which the custom profile is
sources based. By default, the report shows all sources. To view a particular source, click on the
source name on the Sampling Summary page. This filters out results for other sources.

Table 5-3 describes the attributes that display for each function in a custom profile report:

Table 5-3. Custom Profile Report Function Attributes

Attribute Description

Function Name The name of the function in the custom profile.

Function Type The type of function.

Source The name of the source definitions in the function.

Group By Columns Columns selected for grouping by in the report.

Column Names Name of the column to which the profile business rule is being applied.

Rule Name of the profile business rule.



Figure 5-3 is an example of a custom profile report:

Figure 5-3. Sample Custom Profile Report

You can click the hypertext links in the report to view information about verbose data.

Viewing PowerCenter Data Profiling Reports


Use the following procedure to view an auto or custom PowerCenter Data Profiling report.

To view a PowerCenter Data Profiling report:

1. Launch the Profile Manager.


To launch the Profile Manager from the Source Analyzer, select Sources-Profiling-Launch
Profile Manager. To launch the Profile Manager from the Mapplet Designer, select
Mapplets-Profiling-Launch Profile Manager.
2. Select the data profile for which you want to view a report.
You can select a data profile from the Navigator window. Or, you can select a data profile
from the Session Status bar.



3. Select View-View Report.
If the Designer requires the connection information for the Data Profiling warehouse, the
Connect to an ODBC Data Source dialog box displays. If the Designer does not require
the connection information for the Data Profiling warehouse, the report displays.

4. Select the ODBC data source for the Data Profiling warehouse.
5. Enter the user name and password for the Data Profiling warehouse.
6. Click Connect.
The PowerCenter Data Profiling report displays.



PowerAnalyzer Data Profiling Reports
PowerAnalyzer displays Data Profiling reports in an analytic workflow. You can run the
reports to analyze source data from profile sessions.
The reports display data from the latest version of a profile. You can view reports for profile
information across session runs for a time-dimensional analysis of the data. PowerAnalyzer
Data Profiling reports display information across session runs by default. You can also view
reports with data from the most recent profile session run. This is useful during production to
monitor data quality.
When you display a PowerAnalyzer Data Profiling report, you can drill down to row-level
information, drill up for a summary view, and drill across data. You can also see reports for
verbose data.
PowerAnalyzer lets you modify the Data Profiling reports to fit your business needs. You can
set alerts for the Data Profiling reports to call attention to fluctuations in the source data. You
can also add or change metrics in the reports or build your own reports.
Figure 5-4 shows the PowerAnalyzer Data Profiling analytic workflow:

Figure 5-4. PowerAnalyzer Data Profiling Analytic Workflow

You access the PowerAnalyzer Data Profiling reports folder in the Public Folders on the Find
tab.
Informatica recommends that you do not use the Summary option when viewing Data
Profiling reports. If you use the Summary option, the values that display in the report may
not accurately reflect your profile session results.
Tip: When you view a PowerAnalyzer Data Profiling report, select +Workflows to open the
analytic workflow.

PowerAnalyzer Data Profiling Reports 85


Table 5-4 describes the reports you can view in the PowerAnalyzer Data Profiling reports
analytic workflow:

Table 5-4. PowerAnalyzer Data Profiling Reports

Report Name Description

Profile List This report displays information for all data profiles in a folder. The report displays
data for the latest data profile version.

Profile Function List This report displays the functions for the selected profile. The report displays data for
the latest data profile version.

Source Function Statistics This report displays details about source functions for the selected profile. The report
displays data for the latest data profile version.

Column Function Statistics This report displays details about column functions for the selected profile. The
report displays data for the latest data profile version.

Redundancy Evaluation This report displays details from the Redundancy Evaluation source-level function for
Statistics the selected profile. The report displays data for the latest data profile version.

Candidate Key Statistics This report displays details from the Candidate Key Evaluation source-level function
for the selected profile. The report displays data for the latest data profile version.

Row Count Statistics This report displays details from the Row Count source-level function for the selected
profile. The report displays data for the latest data profile version.

Source Business Rule This report displays details from the Business Rule Validation source-level function
Validation Statistics for the selected profile. The report displays data for the latest data profile version.

Rejected Rows - Source This report displays details about the rows that do not satisfy the specified business
Business Rule Validation rule for a source (verbose data details). The report displays data for the latest data
profile version.

Column Business Rule This report displays details about the Business Rule Validation column-level function
Validation Statistics for the selected profile. The report displays data for the latest data profile version.

Aggregate Column Statistics This report displays details about the Aggregate function for the selected profile. The
report displays data for the latest data profile version.

Distinct Count Statistics This report displays details about the Distinct Value Count function for the selected
profile. The report displays data for the latest data profile version.

Domain Validation Statistics This report displays details about the Domain Validation function for the selected
profile. The report displays data for the latest data profile version.

Domain Inference Statistics This report displays details about the Domain Inference column function type for the
selected profile. The report displays data for the latest data profile version.

Rejected Rows - Column This report displays details about the rows that do not satisfy the specified business
Business Rule Validation rule for a column (verbose data details). The report displays data for the latest data
profile version.

Rejected Rows - Domain This report displays details about the rows that do not satisfy the specified domain
Validation validation rule for a column (verbose data details). The report displays data for the
latest data profile version.

Outer Join Analysis This report displays the results of all the Orphan Analysis functions for the selected
profile. The report displays data for the latest data profile version.



Rejected Rows - Outer Join This report displays the unmatched rows of the Orphan Analysis function for the
Analysis selected profile. The report displays data for the latest data profile version.

Cartesian Product Analysis This report displays details about the Join Complexity Evaluation function for the
selected profile. The report displays data for the latest data profile version.

Auto Profile - Column Statistics This report displays the results of column-level statistics for a group generated as
part of an auto profile for column functions. The report displays data for the latest
data profile version.

Inter Source Function List This report displays details about intersource functions for the selected data profiles.
The report displays data for the latest data profile version.



Chapter 6

Working with Domains

This chapter covers the following topics:


♦ Working with Domains, 90
♦ Prepackaged Domains, 91
♦ Custom Domains, 92
♦ Editing a Domain, 101
♦ Deleting a Domain, 102

Working with Domains
Domains are sets of all valid values for a column. When you create a custom profile, you can
create domains or you can use existing domains that Informatica provides. Some domains
contain a list of all valid values that the source column can contain. Some domains contain a
regular expression that describes a range or pattern of values that the source column can
contain. You can use prepackaged domains or create your own domains.
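The two kinds of domains can be sketched in Python. The values and the pattern below are illustrative assumptions, not the contents of the prepackaged domains, which ship with their own lists and regular expressions:

```python
import re

# A list domain enumerates every valid value (abbreviated here).
us_state_codes = {"AL", "AK", "AZ", "CA", "NY", "TX"}

# A pattern domain describes valid values with a regular expression;
# this assumed pattern accepts ZIP and ZIP+4 formats.
us_zip_pattern = re.compile(r"^\d{5}(-\d{4})?$")

def in_list_domain(value):
    return value in us_state_codes

def in_pattern_domain(value):
    return bool(us_zip_pattern.match(value))

print(in_list_domain("CA"), in_list_domain("ZZ"))
print(in_pattern_domain("94103"), in_pattern_domain("94103-1234"))
```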
You can use the following domains to profile your data:
♦ Prepackaged domains. For more information about prepackaged domains, see
“Prepackaged Domains” on page 91.
♦ Custom domains. For more information about creating custom domains, see “Custom
Domains” on page 92.

90 Chapter 6: Working with Domains


Prepackaged Domains
Informatica provides a set of prepackaged domains to verify data, such as phone numbers,
postal codes, and email addresses. You can use prepackaged domains when you create a
custom profile with a Domain Validation function. For more information about creating
custom profiles, see “Creating a Custom Profile” on page 38.
You can view prepackaged domains in the Domain Browser. The prepackaged domains are
installed the first time you open the Domain Browser, so the browser may take a few
moments to open the first time.
From the Domain Browser, you can edit or delete prepackaged domains. For more
information about working with domains, see “Custom Domains” on page 92.
Table 6-1 shows the prepackaged domains and their descriptions:

Table 6-1. Prepackaged Domains

Domain Name Description

Country-Codes Contains a list of country abbreviations.

Country-Names Contains a list of country names.

Email-Address Validates source against email patterns.

URL Validates source against accepted URL patterns. Validates HTTP, HTTPS, and FTP URLs.

North-American-Phone-Number Validates source against North American phone number patterns.

Dollar-Currency Displays source currency in U.S.A. currency values.

US-Zip-Codes-Pattern (faster) Validates the source against the U.S.A. zip code pattern.

US-Zip-Codes-List (more-accurate) Contains a list of United States Postal Service zip codes.

US-State-Codes (50-states) Contains a list of U.S.A. state abbreviations.

US-State-Codes (extended) Contains a list of U.S.A. state abbreviations with additional values for
territories and outlying areas served by the United States Postal Service.

US-State-Names (50-states) Contains a list of the names of all U.S.A. states.

US-State-Names (extended) Contains a list of the names of all U.S.A. states with additional values for
territories and outlying areas.

US-Social-Security-Number Validates the source against the U.S.A. social security number pattern.

Canadian-State-Codes Contains a list of Canadian province abbreviations.

Canadian-State-Names Contains a list of Canadian province names.

Canadian-Zip-Codes Validates the source against the Canadian zip code pattern.

UK-Postal-Codes Validates the source against the U.K. postal code pattern.

North-American-Industry-Classification- Contains a list of the North American Industry Classification System
System (NAICS)-codes codes.

Custom Domains
You can create custom domains when you create a Domain Validation function. During a
profile session, the Domain Validation function uses the domains you specify to validate
source values or to help you infer patterns from source data.
You can create the following types of domains:
♦ List of Values. Domains defined by a comma-delimited list of values.
♦ Regular Expression. Domains defined by a range of values in an expression.
♦ Domain Definition Filename. Domains defined by an external file containing values.
You can create reusable and non-reusable custom domains. Apply a reusable domain to
multiple Domain Validation functions. Apply a non-reusable domain to one Domain
Validation function. For more information about configuring a Domain Validation function,
see “Column-Level Functions” on page 113.
You can create a domain from the Profile Manager or when you configure a function. Any
domain you create from the Profile Manager is a reusable domain. When you create a domain
from a function, you can make the domain reusable or non-reusable.
When you view domains in the Domain Browser from the Profile Manager, you can only view
reusable and prepackaged domains. You can view non-reusable domains from the Domain
Browser when you define a function to which the non-reusable domain applies.
Once you create a domain, you can edit or delete the domain. For more information about
editing domains, see “Editing a Domain” on page 101. For more information about deleting a
domain, see “Deleting a Domain” on page 102.

List of Values Domains


You can create a List of Values domain from the Profile Manager or when you configure a
function. When you create a List of Values domain, you add the values you want to apply to
the domain. For example, if you want to use a domain that lists area codes in a region, you
create a List of Values domain. You can also use a List of Values domain to evaluate your data
for non-null white spaces.
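Conceptually, a List of Values domain is a set-membership test: a source value is valid only when it appears in the list. The sketch below illustrates the idea in Python; it is not the product's implementation, and the area codes and source values are hypothetical examples:

```python
# Sketch of how a List of Values domain validates a column.
# The area codes and source values are hypothetical.
valid_area_codes = {"415", "510", "650"}

source_column = ["650", "408", "  510  ", "650"]

# The Designer ignores spaces before or after entered values,
# so strip whitespace before testing membership.
valid = [v for v in source_column if v.strip() in valid_area_codes]
invalid = [v for v in source_column if v.strip() not in valid_area_codes]

print(len(valid), len(invalid))  # 3 valid rows, 1 invalid row
```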

To create a List of Values domain:

1. To create a List of Values domain from the Profile Manager, select Tools-Domains.
To create a domain when you define a function, click the Domains button on the Profile
Function Details page.



The Domain Browser dialog box displays.

2. Click New to create a new domain.


The Domain Details dialog box displays.


3. Enter a name for the domain.


Note: The domain name cannot contain spaces.

4. Clear Reusable Domain if you do not want to be able to use this domain in other
Domain Validation functions.
Note: If you configure a domain from the Profile Manager, the domain is automatically
reusable. You cannot make the domain non-reusable.

5. Select List of Values as the domain type.
6. In the Value box, enter a new domain value to manually add values to the list of values. If
you want to add a file with a list of values, go to step 8.
When you enter a domain value, the Designer ignores any spaces before or after the
value.
7. Click Add to add the domain values you entered to the list of values.
8. Click Values File to add a list of values.
If you want to add domain values manually, go to step 6.
9. Navigate to the file, and select the file to use.
10. Select the appropriate code page from the drop-down list.
The code page you specify must be a subset of the code page for the operating system that
hosts the PowerCenter client. You can specify localization and code page information in
the file list. If you do not specify the localization information, the PowerCenter Server
uses default values. For information about specifying localization information, see
“Specifying Localization and Code Page Information” on page 99.
11. Repeat steps 6 and 7 for each domain value you want to add.
12. If you want to remove a domain value, select the value from the list of values and click
Remove.
13. Click OK to save the domain.
14. Click Close.

Regular Expression Domains


You can create a Regular Expression domain from the Profile Manager or when you configure
a function. When you create a Regular Expression domain, you create a range of values you
want to apply to the domain. For example, if you want to use a domain that includes values
for 5-digit employee ID numbers, create a Regular Expression Domain.

Using perl Syntax in a Regular Expression Domain


You must use perl syntax when you create a Regular Expression domain.
Table 6-2 provides perl syntax guidelines to create a regular expression:

Table 6-2. perl Syntax Guidelines for Regular Expressions

Syntax Description

. (a period) Matches any one character.

[a-z] Matches one instance of a letter. For example, [a-z][a-z] can match ab or CA.

\d Matches one instance of any digit from 0-9.



Table 6-2. perl Syntax Guidelines for Regular Expressions

Syntax Description

() Groups an expression. For example, the parentheses in (\d\d-\d\d) group the expression \d\d-\d\d,
which finds any two numbers followed by a hyphen and any two numbers, as in 12-34.

{} Matches the number of characters exactly. For example, \d{3} matches any three numbers, such as
650 or 510. Or, [a-z]{2} matches any two letters, such as CA or NY.

? Matches the preceding character or group of characters zero or one time. For example,
\d{3}(-\d{4})? matches any three numbers, which can be followed by a hyphen and any four
numbers.

* (an asterisk) Matches zero or more instances of the character or group that precedes the asterisk. For
example, 0* matches zero or more instances of 0.

For example, to create a regular expression for U.S.A. zip codes, you can enter the following
perl syntax:
\d{5}(-\d{4})?

This expression lets you find 5-digit U.S.A. zip codes, such as 93930, as well as 9-digit zip
codes, such as 93930-5407.
In this example, \d{5} refers to any five numbers, such as 93930. The parentheses surrounding
-\d{4} group this segment of the expression. The hyphen represents the hyphen of a 9-digit
zip code, as in 93930-5407. \d{4} refers to any four numbers, such as 5407. The question
mark makes the hyphen and last four digits optional: they can appear zero times or one time.
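The zip code expression above can be exercised in any perl-compatible regex engine. For illustration only, the following Python sketch tests the same pattern (Python's re module accepts the perl-style constructs used here); anchoring with fullmatch ensures partial matches do not count as valid:

```python
import re

# The example pattern above: 5 digits, optionally followed
# by a hyphen and 4 more digits.
zip_pattern = re.compile(r"\d{5}(-\d{4})?")

for value in ["93930", "93930-5407", "9393", "93930-54"]:
    # fullmatch requires the whole value to fit the pattern.
    print(value, bool(zip_pattern.fullmatch(value)))
```

The first two values match; the 4-digit value and the truncated 9-digit value do not.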
Figure 6-1 shows an example of a Regular Expression domain:

Figure 6-1. Regular Expression Domain

Click to enter test data to validate a regular expression.

When you enter a regular expression, you can validate the regular expression with test data.
You can use the test data instead of the data in your repository. The test data should represent
the source data you plan to use the Regular Expression domain against.

Figure 6-2 shows an example of test data in the Test Data dialog box:

Figure 6-2. Test Data Dialog Box

The dialog box shows the regular expression to test, a box to enter test data, and the test results.

Tips for Converting COBOL Syntax to perl Syntax


If you are familiar with COBOL syntax, you can use the following information to help you
write perl regular expressions for a Regular Expression Domain.
Table 6-3 shows examples of COBOL syntax and their perl equivalents:

Table 6-3. COBOL Syntax and perl Syntax Compared

COBOL Syntax perl Syntax Description

9 \d Matches one instance of any digit from 0-9.

9999 \d\d\d\d Matches any four digits from 0-9, as in 1234 or 5936.
or
\d{4}

x [a-z] Matches one instance of a letter.

9xx9 \d[a-z][a-z]\d Matches any number followed by two letters and another number, as in 1ab2.
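As a quick check of the last row in Table 6-3, the COBOL picture 9xx9 translates to \d[a-z][a-z]\d. This Python sketch is illustrative only, using the table's own sample value:

```python
import re

# perl equivalent of COBOL picture 9xx9:
# a digit, two letters, then a digit.
pattern = re.compile(r"\d[a-z][a-z]\d")

print(bool(pattern.fullmatch("1ab2")))  # the table's example value matches
print(bool(pattern.fullmatch("12ab")))  # digits and letters in the wrong order
```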

Tips for Converting SQL Syntax to perl Syntax


If you are familiar with SQL syntax, you can use the following information to help you write
perl regular expressions for a Regular Expression domain.
Table 6-4 shows examples of SQL syntax and their perl equivalents:

Table 6-4. SQL Syntax and perl Syntax Compared

SQL Syntax perl Syntax Description

% .* Matches any string.

A% A.* Matches the letter “A” followed by any string, as in Area.

_ . (a period) Matches any one character.

A_ A. Matches “A” followed by any one character, such as AZ.
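The correspondence in Table 6-4 can be exercised directly. As an illustrative sketch (not the product's implementation), SQL LIKE 'A%' behaves like the anchored regex A.*, and LIKE 'A_' like A. :

```python
import re

# SQL LIKE 'A%'  ->  perl A.*  (the letter A followed by any string)
# SQL LIKE 'A_'  ->  perl A.   (the letter A followed by exactly one character)
like_a_percent = re.compile(r"A.*")
like_a_underscore = re.compile(r"A.")

print(bool(like_a_percent.fullmatch("Area")))     # A followed by any string
print(bool(like_a_underscore.fullmatch("AZ")))    # A followed by one character
print(bool(like_a_underscore.fullmatch("Area")))  # too many characters after A
```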



Creating a Regular Expression Domain
Use the following procedure to create a Regular Expression domain.

To create a Regular Expression domain:

1. To create a domain from the Profile Manager, select Tools-Domains.


To create a domain when you define a function, click the Domains button on the Profile
Function Details page.
The Domain Browser dialog box displays.
2. Click New to create a new domain.
The Domain Details dialog box displays.
3. Enter a name for the domain.
The domain name cannot contain spaces.
4. Clear Reusable Domain if you do not want to be able to use this domain in other
Domain Validation functions.
If you configure a domain from the Profile Manager, the domain is automatically
reusable. You cannot make the domain non-reusable.
5. In the Expression box, enter a new domain value.

6. If you want to validate the regular expression, click Test Data.


7. Enter the test data you want to test the regular expression against.
8. Click Test to validate the regular expression.
You can view the results of the test in the Result Box at the bottom of the Test Data
Dialog box.
9. When you finish validating the regular expression, click Done.

10. Click OK to save the domain.
11. Click Close.

Domain Definition Filename Domains


You can create a Domain Definition Filename domain from the Profile Manager or when you
configure a function. Create a Domain Definition Filename domain when you have a text file
you can use to validate source data. For example, you have a file that contains a list of area
codes in a region. Create a Domain Definition Filename domain so that the PowerCenter Server
uses the file to validate the source data against this list.
When you create a Domain Definition Filename domain, you select a filename with the
domain values you want to use. You must store the file on the same machine as the
PowerCenter Server on which you want to run the session. You can use the $PMSourceDir
server variable when you specify the filename path.

To create a Domain Definition Filename domain:

1. To create a domain from the Profile Manager, select Tools-Domains.


To create a domain when you define a function, click the Domains button on the Profile
Function Role Details page.
The Domain Browser dialog box displays.
2. Click New to create a new domain.
The Domain Details dialog box displays.
3. Enter a name for the domain.
The domain name cannot contain spaces.
4. Clear Reusable Domain if you do not want to be able to use this domain in other
Domain Validation functions.
If you configure a domain from the Profile Manager, the domain is automatically
reusable. You cannot make the domain non-reusable.



5. In the Filename box, enter a new domain value.

You can use server variables, such as $PMSourceDir and $PMRootDir, when you specify a
filename and path for the domain value.
Note: The file you specify must use a code page that is a subset of the PowerCenter Server
code page. You can specify a code page by entering valid syntax on the first line of the
file. For information about entering localization and code page information, see
“Specifying Localization and Code Page Information” on page 99.
6. Click OK to save the domain.
7. Click Close.

Specifying Localization and Code Page Information


When you import a list of values file or a domain definition file, the file can contain
localization information.
Use the following guidelines to specify localization information:
♦ Enter localization information in the first line of the file.
♦ Enter localization information using 7-bit ASCII.
♦ Use the following syntax:
locale=<language>_<territory>.<codepage>@<sort>

where language, territory, code page, and sort represent the following information:
− Language. Specifies the translation to use for month names and names for days of the week.
− Territory. Specifies country dependent information such as currency symbols, numeric
and monetary formatting rules, and Date/Time formats.
− Code page. Specifies the character encoding to use. The code page you specify must be a
subset of the code page for the operating system that hosts the PowerCenter client.

− Sort. Specifies the collation sequence to use. For example, you can use Binary.
For example, you can specify the following localization information for a U.S. English file:
locale=English_UnitedStates.US-ASCII@binary

For a Japanese file, you can specify the following localization information:
locale=Japanese_Japan.JapanEUC@binary
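Because the locale line is plain text at the top of the file, its pieces can be pulled apart with a simple pattern. The parsing sketch below is illustrative only; the regex and field names are mine, not part of the product:

```python
import re

# Illustrative parser for the locale line syntax described above:
#   locale=<language>_<territory>.<codepage>@<sort>
locale_line = "locale=English_UnitedStates.US-ASCII@binary"

# The @<sort> suffix is optional, so it is a non-capturing optional group.
match = re.fullmatch(r"locale=(\w+)_(\w+)\.([\w-]+)(?:@(\w+))?", locale_line)
language, territory, code_page, sort = match.groups()

print(language, territory, code_page, sort)
```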

For more information about code page compatibility for Data Profiling components, see
“Code Page Compatibility” on page 152.



Editing a Domain
You can edit a reusable domain to change the domain name and domain value, expression, or
filename. When you define a function, you can edit a non-reusable domain for the function
to change the domain name, domain type, and domain value, expression, or filename. You
can edit prepackaged domains to change the domain value, expression, or filename.

To edit a domain:

1. To edit a reusable or prepackaged domain from the Profile Manager, select Tools-
Domains.
To edit a domain when you define a function, click the Domains button on the Profile
Function Role Details page.
The Domain Browser dialog box displays.
2. Select the domain you want to edit, and click Edit.
The Domain Details dialog box displays.
3. If you are editing a custom domain, optionally change the domain name.
If you are editing a List of Values domain, add or remove domain values as necessary.
If you are editing a Regular Expression domain, modify the domain expression as
necessary.
If you are editing a Domain Filename Definition domain, modify the filename that
contains the domain values as necessary.
4. Click OK to save your changes.
5. Click Close.



Deleting a Domain
You can delete a domain if you no longer want to apply it to Domain Validation functions. If
you delete a domain, the Designer invalidates all of the data profiles and related profile
mappings that reference the domain.

To delete a domain:

1. To delete a domain from the Profile Manager, select Tools-Domains.


To delete a domain when you define a function, click the Domains button on the Profile
Function Role Details page.
The Domain Browser dialog box displays.
2. Select the domain you want to delete, and click Delete.
3. Click Close.



Chapter 7

Working with Functions

This chapter contains information on the following topics:


♦ Overview, 104
♦ Source-Level Functions, 105
♦ Column-Level Functions, 113
♦ Intersource Functions, 123

Overview
You include functions in a profile to perform calculations on sources during a profile session.
When you create an auto profile, the Designer adds a predefined set of functions to your
profile. When you create a custom profile, you create functions that meet your business needs,
and add them to your profile. You can add the following types of functions to a profile:
♦ Source-level functions. Perform calculations on two or more columns of a source, source
group, or mapplet group. For more information about source-level functions, see “Source-
Level Functions” on page 105.
♦ Column-level functions. Perform calculations on one column in a source. For more
information about column-level functions, see “Column-Level Functions” on page 113.
♦ Intersource functions. Perform calculations on two or more sources, source groups, or
mapplet groups. For more information about intersource functions, see “Intersource
Functions” on page 123.
For many profile functions, you can write data to the Data Profiling warehouse in verbose
mode. When you select this option, the PowerCenter Server writes verbose data to the Data
Profiling warehouse during the profile session. You can use Data Profiling reports to view
more information about the verbose data. For more information about viewing
PowerAnalyzer reports, see “Viewing Profile Results” on page 77.
For more information about creating auto profiles, see “Creating an Auto Profile” on page 34.
For more information about creating custom profiles, see “Creating a Custom Profile” on
page 38.



Source-Level Functions
Source-level functions perform calculations on two or more columns of a source, source
group, or mapplet group.
You can add the following source-level functions to a profile:
♦ Row Count. Counts the number of rows read from the source during the profile session. If
you enable data sampling, the Row Count function returns an estimated row count based
on the rows sampled. For information about the Row Count function, see “Row Count”
on page 105.
♦ Business Rule Validation. Evaluates the number of rows for one or more source columns
that satisfy a specified business rule and the number of rows that do not satisfy the business
rule. For information about the Business Rule Validation function, see “Business Rule
Validation” on page 106.
♦ Candidate Key Evaluation. Calculates the number and percentage of unique values in one
or more source columns. This helps you identify source columns you might use as a
primary key. For information about the Candidate Key Evaluation function, see
“Candidate Key Evaluation” on page 108.
♦ Redundancy Evaluation. Calculates the number of duplicate values in one or more source
columns. This helps you identify columns to normalize into separate tables. For
information about the Redundancy Evaluation function, see “Redundancy Evaluation” on
page 109.
♦ Row Uniqueness. Calculates the number of unique and duplicate values in the source
based on the columns selected. You can profile all columns in the source row or choose
individual columns to profile. For information about the Row Uniqueness function, see
“Row Uniqueness” on page 110.
When you specify a source-level function on the Function Details page of the Profile Wizard,
the Profile Wizard prompts you to configure the function on the Function Role Details page.
The options available on the Function Role Details page for source-level functions depend on
the function you select. For more information about configuring function details, see “Step 4.
Configure Profile Functions” on page 41.

Row Count
The Row Count function returns the number of rows in a source. It can also report the
number of rows in each group. If you configure a session for manual random sampling or
automatic random sampling, the Row Count function returns an estimate of the total source
rows. If you configure a session to sample First N Rows, the Row Count function returns the
number of rows read during the session. For more information about using data samples, see
“Profiling Data Samples” on page 68.



Figure 7-1 shows the Function Role Details page for the Row Count function:

Figure 7-1. Row Count Function

Table 7-1 shows the properties of the Row Count function:

Table 7-1. Row Count Function Options

Required/
Property Description
Optional

Generate profile Optional Select to group source rows by a particular column. You can view a result for each
data by group group in the Data Profiling report.

Group by Columns Optional If you selected Generate Profile Data by Group, select a column to group by. You
can select a column of any datatype except Binary. If you select a column of
numeric datatype, the precision must be between 1 and 28 digits. If you select a
column of String datatype, the precision must be between 1 and 200 characters.

Business Rule Validation


The Business Rule Validation source-level function calculates the number of rows for one or
more columns in a source that satisfy a business rule and the number of rows for one or more
columns in a source that do not. A business rule is a valid Boolean expression that you create
with the Business Rule Validation Editor.
For example, you have a business rule that applies to products with a price of $50.00 or more.
You want to profile your source data during mapping development to learn how many rows in



the Price column in the source contain values of $50.00 or more. You can define a business
rule in a Business Rule Validation source-level function that evaluates each source row in the
Price column to see how many values are $50.00 or higher.
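In effect, the function partitions rows by a Boolean test and counts each side. A minimal sketch of the $50.00 example (the sample prices and the expression form are hypothetical; the product evaluates the rule inside the profile session, not in Python):

```python
# Sketch of the Business Rule Validation example above: count rows
# whose Price value is $50.00 or more. Sample prices are hypothetical.
prices = [19.99, 50.00, 75.50, 49.99, 120.00]

satisfy = sum(1 for p in prices if p >= 50.00)
violate = len(prices) - satisfy

print(satisfy, violate)  # 3 rows satisfy the rule, 2 do not
```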
Figure 7-2 shows the Function Role Details page for the Business Rule Validation source-level
function:

Figure 7-2. Business Rule Validation Source-Level Function

Click to enter a business rule.

Table 7-2 shows the properties of the Business Rule Validation source-level function:

Table 7-2. Business Rule Validation Source-Level Function Options

Required/
Property Description
Optional

Rule Summary Required Click the Rule Editor button to enter a business rule. Once you enter a business
rule, the rule displays in the Rule Summary dialog box. Use a valid Boolean
expression. You can only enter Business Rule Validation functions in the Business
Rule Editor. If you enter other functions available through Business Rule
Validation, such as Date or String functions, the session may generate
unexpected results.

Generate profile Optional Select to group source rows by a particular column. You can view a result for each
data by group group.

Group by Columns Optional If you selected Generate Profile Data by Group, select a column to group by. You
can select a column of any datatype except Binary. If you select a column of
numeric datatype, the precision must be between 1 and 28 digits. If you select a
column of String datatype, the precision must be between 1 and 200 characters.



Table 7-2. Business Rule Validation Source-Level Function Options

Required/
Property Description
Optional

Specify the type of Required Select for the PowerCenter Server to write verbose data to the Data Profiling
verbose data to warehouse.
load Into the You can load the following types of verbose data:
warehouse - No Rows
- Valid rows only
- Invalid rows only
- All Rows
The character limit is 1,000 bytes/K, where K is the maximum number of bytes for
each character in the Data Profiling warehouse code page. If the column exceeds
this limit, the PowerCenter Server writes truncated data to the Data Profiling
warehouse. For more information about configuring verbose mode, see
“Configuring a Function for Verbose Mode” on page 43.

Select Columns Optional Click to select the columns you want to profile in verbose mode. For more
information about configuring a function for verbose mode, see “Configuring a
Function for Verbose Mode” on page 43.

Figure 7-3 shows the Business Rule Validation Editor for a simple business rule:

Figure 7-3. Business Rule Validation Editor

Candidate Key Evaluation


The Candidate Key Evaluation function calculates the number and percentage of unique
values for one or more columns in a source. You can use the Candidate Key Evaluation
function to determine the column in a source to use as a primary key. You may want to use
the column with the highest percentage of unique values as the primary key.
This function can evaluate unique values in columns of numeric datatypes with a precision of
1 to 28 digits or columns of the String datatype with a precision between 1 and 10 characters.
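The metric itself is simple: for each column, the share of distinct values among all rows. A rough sketch in Python (the sample table and column names are hypothetical, not from the product):

```python
# Percentage of unique values per column, as the Candidate Key
# Evaluation function reports it. The sample rows are hypothetical.
rows = [
    {"emp_id": "E01", "dept": "Sales"},
    {"emp_id": "E02", "dept": "Sales"},
    {"emp_id": "E03", "dept": "IT"},
]

for column in ("emp_id", "dept"):
    values = [r[column] for r in rows]
    pct_unique = 100.0 * len(set(values)) / len(values)
    print(column, pct_unique)
```

Here emp_id, with 100% unique values, would be the stronger primary key candidate.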



Figure 7-4 shows the Function Role Details page for the Candidate Key Evaluation function:

Figure 7-4. Candidate Key Evaluation Function

Table 7-3 shows the properties of the Candidate Key Evaluation function:

Table 7-3. Candidate Key Evaluation Function Options

Required/
Property Description
Optional

Select Column(s) Required Select the columns to profile. By default, the Designer selects all columns
of numeric datatype with a precision between 1 and 28 digits or String
datatype with a precision of 1 to 10 characters.

Enable duplicate count for Optional Select to evaluate candidate keys based on pairs of columns.
pairs of selected columns

Redundancy Evaluation
The Redundancy Evaluation function calculates the number of duplicate values in one or
more columns of a source. You can use the Redundancy Evaluation function to identify
columns to normalize into separate tables. You may want to normalize the columns that have
the highest percentage of redundant values.
This function can evaluate values in columns of numeric datatypes with a precision of
28 digits or less or columns of the String datatype with a precision of 10 characters or less.
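Counting duplicates per column can be sketched with a frequency count: each value contributes one less than its number of occurrences. The sample data below is hypothetical, and this is an illustration of the idea rather than the product's implementation:

```python
from collections import Counter

# Count duplicate values in a column, as the Redundancy Evaluation
# function reports them. The sample city values are hypothetical.
cities = ["Austin", "Boston", "Austin", "Austin", "Boston", "Denver"]

counts = Counter(cities)
# A value occurring n times contributes (n - 1) duplicates.
duplicates = sum(n - 1 for n in counts.values())

print(duplicates)  # 3 duplicate values in the column
```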



Figure 7-5 shows the Function Role Details page for the Redundancy Evaluation function:

Figure 7-5. Redundancy Evaluation Function

Table 7-4 shows the properties of the Redundancy Evaluation function:

Table 7-4. Redundancy Evaluation Function Options

Required/
Property Description
Optional

Select Column(s) Required Select the columns you want to profile using the Redundancy Evaluation
function. By default, the Designer selects all columns of numeric datatype
with a precision between 1 and 28 digits or String datatype with a precision
of 1 to 10 characters.

Enable duplicate count for Optional Select to perform redundancy evaluation based on pairs of columns.
pairs of selected columns

Row Uniqueness
The Row Uniqueness function calculates the number of unique and duplicate values based on
the columns selected. You can profile all columns in the source row or choose columns to
profile. This helps you identify columns to normalize into a separate table. You can also use
this function to test for distinct rows. This function is particularly useful for flat files, which
have no internal validation tools such as primary key constraints or unique indexes. For
example, if you have a flat file source that uses unique employee ID values to identify each



row, you can select all columns to test for duplicate rows. Because you have no primary key
constraints, this allows you to verify that you did not create duplicate entries for employees.
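For a flat file with no key constraints, the duplicate check amounts to counting repeated tuples over the selected columns. A sketch of the idea (the employee rows and the choice of columns are hypothetical):

```python
from collections import Counter

# Row Uniqueness over selected columns: count unique vs. duplicate rows.
# The sample employee rows are hypothetical.
rows = [
    ("1001", "Smith", "Sales"),
    ("1002", "Jones", "IT"),
    ("1001", "Smith", "Sales"),   # duplicate entry
]

counts = Counter(rows)
unique_rows = sum(1 for n in counts.values() if n == 1)
duplicate_rows = sum(n for n in counts.values() if n > 1)

print(unique_rows, duplicate_rows)  # 1 unique row, 2 rows in duplication
```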
Figure 7-6 shows the Function Role Details page for the Row Uniqueness function:

Figure 7-6. Row Uniqueness Function

Table 7-5 shows the properties of the Row Uniqueness function:

Table 7-5. Row Uniqueness Function Options

Required/
Property Description
Optional

Select Column(s) Required Select the columns you want to profile using the Row Uniqueness function.
By default, the Designer selects all columns of numeric datatype with a
precision between 1 and 28 digits or String datatype with a precision of 1 to
10 characters.

Generate profile data by Optional Select to group source rows by a particular column. You can view a result
group for each group in the Data Profiling report.



Table 7-5. Row Uniqueness Function Options

Required/
Property Description
Optional

Group by Columns Optional If you selected Generate Profile Data by Group, select a column to group
by. You can select a column of any datatype except Binary. If you select a
column of numeric datatype, the precision must be between 1 and 28 digits. If
you select a column of String datatype, the precision must be between 1 and
200 characters.

Specify the type of Required Select for the PowerCenter Server to write verbose data to the Data
verbose data to load Into Profiling warehouse. You can load the following types of verbose data:
the warehouse - All rows
- No rows
- Duplicate rows only
The character limit is 1,000 bytes/K, where K is the maximum number of
bytes for each character in the Data Profiling warehouse code page. If the
column exceeds this limit, the PowerCenter Server writes truncated data to
the Data Profiling warehouse. For more information about configuring
verbose mode, see “Configuring a Function for Verbose Mode” on page 43.



Column-Level Functions
Column-level functions perform a calculation on one column in a source. You can add the
following column-level functions to a profile:
♦ Business Rule Validation. For information about the Business Rule Validation function,
see “Business Rule Validation” on page 113.
♦ Domain Validation. For information about the Domain Validation function, see “Domain
Validation” on page 115.
♦ Domain Inference. For information about the Domain Inference function, see “Domain
Inference” on page 117.
♦ Aggregate Functions. For information about Aggregate functions, see “Aggregate
Functions” on page 119.
♦ Distinct Value Count. For information about the Distinct Value Count function, see
“Distinct Value Count” on page 121.
When you specify a column-level function on the Function Details page of the Profile
Wizard, the Profile Wizard prompts you to configure the function. The options available on
the Function Role Details page for column-level functions depend on the function you select.
For more information about the Function Details page of the Profile Wizard, see “Step 3. Add
Functions and Enable Session Configuration” on page 39. For more information about the
Function Role Details page of the Profile Wizard, see “Step 4. Configure Profile Functions”
on page 41.
When you create a Domain Validation column-level function, you include domains in the
function. For more information about creating domains, see “Custom Domains” on page 92.
Note: If you enable the PowerCenter Server to write verbose data to the Data Profiling
warehouse, the PowerCenter Server loads all selected rows to the Data Profiling warehouse
except those with the datatype Raw. For more information about configuring a function for
verbose mode, see “Configuring a Function for Verbose Mode” on page 43.

Business Rule Validation


The Business Rule Validation column-level function calculates the number of rows in a single
source column that satisfy a business rule and the number of rows that do not. A business rule
is a valid Boolean expression that you create with the Rule Editor.



Figure 7-7 shows the Function Role Details page for the Business Rule Validation column-
level function:

Figure 7-7. Business Rule Validation Column-Level Function


Table 7-6 shows the properties of the Business Rule Validation column-level function:

Table 7-6. Business Rule Validation Column-Level Function Options

Property | Required/Optional | Description

Selected Column | Required | Select the column you want to apply the business rule to.

Generate Profile Data by Group | Optional | Select to group source rows in a particular column. You can view a result for each group.

Group by Columns | Optional | If you selected Generate Profile Data by Group, select a column to group by. You can select a column of any datatype except Binary. If you select a column of numeric datatype, the precision must be between 1 and 28 digits. If you select a column of String datatype, the precision must be between 1 and 200 characters.

114 Chapter 7: Working with Functions



Rule Summary | Required | Click the Rule Editor button shown in Figure 7-3 to enter a business rule. Once you enter a business rule, the rule displays in the Rule Summary dialog box. Use a valid Boolean expression. Enter only Business Rule Validation expressions in the Rule Editor. If you enter other functions available through the Rule Editor, such as Date or String functions, the session may generate unexpected results.

Specify the type of verbose data to load into the warehouse | Required | Select for the PowerCenter Server to write verbose data to the Data Profiling warehouse. You can load the following types of verbose data:
- No rows
- Valid rows only
- Invalid rows only
- All rows
The character limit is 1,000/K, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.

Domain Validation
The Domain Validation function calculates the number of values in the profile source
column that fall within a specified domain and the number of values that do not. A domain is
the set of all possible valid values for a column. For example, a domain might include a list of
abbreviations for all of the states in the U.S.A., or a list of valid U.S.A. zip code patterns.
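The two kinds of domain check the function performs can be sketched in Python. The state list and zip code pattern below are hypothetical examples, not the shipped domains, and the server's actual matching logic may differ:

```python
import re

# Sketch of Domain Validation: count values inside and outside a domain.
def validate_domain(values, domain):
    """domain: a set of valid values, or a compiled regular expression."""
    if isinstance(domain, set):
        in_domain = sum(1 for v in values if v in domain)
    else:
        in_domain = sum(1 for v in values if domain.fullmatch(v))
    return in_domain, len(values) - in_domain

states = {"CA", "NY", "TX"}                         # list-of-values domain
print(validate_domain(["CA", "TX", "ZZ"], states))  # (2, 1)

zip_pattern = re.compile(r"\d{5}(-\d{4})?")         # pattern domain
print(validate_domain(["94063", "9406", "94063-1234"], zip_pattern))  # (2, 1)
```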



Figure 7-8 shows the Function Role Details page for the Domain Validation function:

Figure 7-8. Domain Validation Function

Table 7-7 shows the properties of the Domain Validation function:

Table 7-7. Domain Validation Function Options

Property | Required/Optional | Description

Selected Column | Required | Select the column you want to evaluate against the domain.

Domain Summary | Required | Click the Domains button to select a reusable domain or to create a non-reusable domain. Once you enter a domain, it displays in the Domain Summary box. For more information about domains, see “Working with Domains” on page 89.

Specify the type of verbose data to load into the warehouse | Required | Select for the PowerCenter Server to write verbose data to the Data Profiling warehouse. You can load the following types of verbose data:
- No rows
- Valid rows only
- Invalid rows only
- All rows
The character limit is 1,000/K, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.



When you click the Domains button to add a domain, the Domain Browser displays.
Figure 7-9 shows the Domain Browser:

Figure 7-9. Domain Browser

Select the domain you want to use for your domain validation function. Click the Close
button to return to the Domain Validation Function Role Details page.
If you validate the column against a List of Values domain or a Domain Definition Filename
domain, the list must use a code page that is two-way compatible with the PowerCenter
Server. For information about specifying a compatible code page for these domains, see
“Custom Domains” on page 92.

Domain Inference
The Domain Inference function reads all values in the column and infers a pattern that fits
the data. The function determines whether the values fit a list of values derived from the
column values or a pattern that describes the structure of the source data. For example, you have a
column with social security numbers in your source. Use the Domain Inference function to
determine the pattern of numbers in the column. The Domain Inference function can also
infer a pattern of ‘STRING WITH ONLY SPACES’ for columns containing non-null blank
space data. This is useful for determining data quality.
This function can infer domains for columns with a numeric datatype with a precision of 28
digits or less or a String datatype with a precision of 200 characters or less. This function can
also infer domains for columns of the Date/Time datatype.
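A simplified sketch of the inference step: each value is mapped to a character-class pattern (digits to 9, letters to X here, a hypothetical notation) and the patterns are tallied. The actual pattern notation and inference logic the PowerCenter Server uses may differ:

```python
from collections import Counter

# Sketch of Domain Inference: derive a pattern for each value and tally
# the patterns that fit the column, including the blank-space case.
def infer_patterns(values):
    def to_pattern(value):
        if value and value.strip() == "":
            return "STRING WITH ONLY SPACES"
        return "".join("9" if c.isdigit() else "X" if c.isalpha() else c
                       for c in value)
    return Counter(to_pattern(v) for v in values if v is not None)

ssns = ["123-45-6789", "987-65-4321", "   "]
print(infer_patterns(ssns))
# Counter({'999-99-9999': 2, 'STRING WITH ONLY SPACES': 1})
```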



Figure 7-10 shows the Function Role Details page for the Domain Inference function:

Figure 7-10. Domain Inference Function

When you create a Domain Inference function, select a column for which you want to infer a
domain. Click Finish.

Configuring Domain Inference Settings


When you work with the Domain Inference function, you can configure the Profile Wizard to
filter the Domain Inference results. You may want to narrow the scope of patterns returned to
view only the primary domains, or you may want to widen the scope of patterns returned to
view exception data. You configure these settings from the Domain Inference Tuning dialog
box when you create custom or auto profiles.
When you use a Domain Inference function in a data profile, you can configure the following
settings:
♦ Maximum number of patterns. The PowerCenter Server returns the most frequently
occurring patterns up to the number of patterns you specify. For example, you create an
auto profile for a source, and you set the maximum number of patterns to 20. The
PowerCenter Server returns the top 20 patterns.
♦ Minimum pattern frequency. The PowerCenter Server returns patterns that occur at or
above the frequency you specify. For example, if you set the minimum pattern frequency to
30 percent, the PowerCenter Server returns patterns that occur 30 percent of the time or
more and filters the rest.



For example, you have a source that contains customers with Canadian and United States zip
codes. Most of your customers are in the United States, but you also have Canadian
customers. You want to view exception data and see what percentage of the zip codes are
Canadian, so you configure the auto profile with a maximum of 300 patterns and a minimum
pattern frequency of 1 percent.
When you run the auto profile, the PowerCenter Server returns a maximum of 300 patterns.
This allows you to view a wide range of exception data. The auto profile also infers a domain
for Canadian zip codes, which occur in a small percentage of the data.
Note: When you configure minimum pattern frequency, the PowerCenter Server calculates the
percentage of patterns to return based on non-null data and ignores null data. Therefore, if
you have null data in the source row, the percentage of patterns returned may not represent a
percentage of the total data.
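The two settings above reduce to a ranking plus a frequency cutoff over the non-null rows. The following sketch (with hypothetical pattern counts) illustrates that behavior; the server's actual filtering logic may differ:

```python
from collections import Counter

# Sketch of the Domain Inference tuning settings: keep the most frequent
# patterns up to max_patterns, then drop any pattern that occurs below
# min_frequency percent of the non-null rows.
def filter_patterns(pattern_counts, max_patterns, min_frequency):
    total = sum(pattern_counts.values())  # non-null rows only
    return [(p, n) for p, n in pattern_counts.most_common(max_patterns)
            if 100.0 * n / total >= min_frequency]

# Hypothetical mix of U.S. and Canadian zip code patterns.
zip_patterns = Counter({"99999": 950, "X9X 9X9": 40, "99999-9999": 10})
print(filter_patterns(zip_patterns, max_patterns=300, min_frequency=1))
# [('99999', 950), ('X9X 9X9', 40), ('99999-9999', 10)]
```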

Aggregate Functions
An Aggregate function calculates an aggregate value for a numeric or string column of a
profile source.
You can add the following aggregate functions to a profile:
♦ NULL Value Count. The number of rows with NULL values in the source column.
♦ Average Value. The average value of the rows in the source column.
♦ Minimum Value. The minimum value of the rows in the source column.
♦ Maximum Value. The maximum value of the rows in the source column.
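As a sketch, the four aggregates above can be computed as follows. None stands in for NULL, and the assumption here (not stated by the source) is that NULL values are counted separately but excluded from the average, minimum, and maximum:

```python
# Sketch of the four Aggregate functions over one column.
def aggregate_column(values):
    non_null = [v for v in values if v is not None]
    return {
        "NULL_ROWS": len(values) - len(non_null),
        "AVG_VALUE": sum(non_null) / len(non_null) if non_null else None,
        "MIN_VALUE": min(non_null) if non_null else None,
        "MAX_VALUE": max(non_null) if non_null else None,
    }

print(aggregate_column([10, 20, None, 30]))
# {'NULL_ROWS': 1, 'AVG_VALUE': 20.0, 'MIN_VALUE': 10, 'MAX_VALUE': 30}
```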
The Aggregate function you can add to a source column depends on the datatype of the
column.
Table 7-8 describes the Aggregate functions you can add based on the datatype of the source
column:

Table 7-8. Aggregate Functions to Add Based on the Column Datatype

Aggregate Function | Allowed Datatypes and Precision

NULL Value Count | Binary, Date/Time, Numeric, or String, with a precision of 4,000 or less.

Minimum Value | Date/Time, Numeric, or String, with a precision of 200 or less.


Maximum Value | Date/Time, Numeric, or String, with a precision of 200 or less.

Average Value | Numeric, with a precision of 1,000 or less.

Figure 7-11 shows the Function Role Details page for Aggregate functions:

Figure 7-11. Aggregate Functions

Table 7-9 shows the properties of Aggregate functions:

Table 7-9. Aggregate Function Options

Property | Required/Optional | Description

Selected Column | Required | Select the column you want to apply the aggregation to.

Aggregate Functions | Required | Select the functions you want to apply to the source column.


Generate Profile Data by Group | Optional | Select to group source rows in a particular column. You can view a result for each group.

Group by Columns | Optional | If you selected Generate Profile Data by Group, select a column to group by. You can select a column of any datatype except Binary. If you select a column of numeric datatype, the precision must be between 1 and 28 digits. If you select a column of String datatype, the precision must be between 1 and 200 characters.

Distinct Value Count


The Distinct Value Count function returns the number of distinct values for the column. You
can also enter a calculated expression to return the number of distinct values for the column
based on the result of the expression.
For example, you have work orders organized in a table by Job ID. To find out how many
unique work orders exist, you can use the Distinct Value Count function.
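A sketch of the computation, with a hypothetical normalizing expression playing the role of a Rule Editor expression:

```python
# Sketch of Distinct Value Count: count distinct values in a column,
# optionally after applying an expression to each value first.
def distinct_value_count(values, expression=None):
    if expression is not None:
        values = [expression(v) for v in values]
    return len(set(values))

# Hypothetical Job ID column with inconsistent casing.
job_ids = ["J-100", "j-100", "J-200", "J-300"]
print(distinct_value_count(job_ids))             # 4
print(distinct_value_count(job_ids, str.upper))  # 3
```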
Figure 7-12 shows the Function Role Details page for the Distinct Value Count function:

Figure 7-12. Distinct Value Count Function




Table 7-10 shows the properties of the Distinct Value Count function:

Table 7-10. Distinct Value Count Function Options

Property | Required/Optional | Description

Selected Column | Required | Select the column you want to apply the function to.

Rule Summary | Optional | Click the Rule Editor button to enter an expression. Once you enter an expression, it displays in the Rule Summary dialog box. If you enter other functions available through the Rule Editor, such as Date or String functions, the session may generate unexpected results.

Generate Profile Data by Group | Optional | Select to group source rows in a particular column. You can view a result for each group.

Group by Columns | Optional | If you selected Generate Profile Data by Group, select a column to group by. You can select a column of any datatype except Binary. If you select a column of numeric datatype, the precision must be between 1 and 28 digits. If you select a column of String datatype, the precision must be between 1 and 200 characters.

Specify the type of verbose data to load into the warehouse | Required | Select for the PowerCenter Server to write verbose data to the Data Profiling warehouse. You can load the following types of verbose data:
- All rows
- No rows
- Duplicate rows only
The character limit is 1,000/K, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.



Intersource Functions
Intersource functions perform calculations on two or more sources, source groups from
different sources, or mapplet output groups, and generate information about their
relationship. You can add the following intersource functions to a profile:
♦ Orphan Analysis. For information about the Orphan Analysis function, see “Orphan
Analysis” on page 123.
♦ Join Complexity Evaluation. For information about the Join Complexity function, see
“Join Complexity Evaluation” on page 125.
Note: If you enable the PowerCenter Server to write verbose data to the Data Profiling
warehouse, the PowerCenter Server loads all selected rows to the Data Profiling warehouse
except those with the datatype Raw. You must select at least one column to write verbose data
to the Data Profiling warehouse. For more information about configuring a function for
verbose mode, see “Configuring a Function for Verbose Mode” on page 43.

Orphan Analysis
The Orphan Analysis function compares the values of columns in two sources. When you
create this function, you select columns that you want to analyze. During the profile session,
the PowerCenter Server reports the number and percentage of rows that appear in a specified
column in the master source but not in the detail source. This is useful for referential integrity
analysis, also known as orphan analysis.
This function can evaluate unique values in columns of numeric datatypes with a precision of
28 digits or less or columns of the String datatype with a precision of 200 characters or less.
The columns can use any combination of datatypes except Date with Numeric and non-Raw
with Raw. You can use Raw with Raw when you disable verbose mode.
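The core calculation can be sketched as follows; the customer and order keys are hypothetical master and detail columns, and the real function's key comparison may differ in detail:

```python
# Sketch of Orphan Analysis: count master rows whose key does not appear
# in the detail source, and report the orphan percentage.
def orphan_analysis(master_keys, detail_keys):
    detail = set(detail_keys)
    orphans = [k for k in master_keys if k not in detail]
    return len(orphans), 100.0 * len(orphans) / len(master_keys)

customers = ["C1", "C2", "C3", "C4"]       # master source keys
orders = ["C1", "C1", "C3"]                # detail source keys
print(orphan_analysis(customers, orders))  # (2, 50.0)
```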

Intersource Functions 123


Figure 7-13 shows the Function Role Details page for the Orphan Analysis function:

Figure 7-13. Orphan Analysis Function

Table 7-11 shows the properties of the Orphan Analysis function:

Table 7-11. Orphan Analysis Function Options

Property | Required/Optional | Description

Source | Required | The source you selected to perform the intersource function. The source is listed using the following syntax: <DBD Name>:<Source Name> or <DBD Name>:<Mapplet Name>.

Port Name | Required | Select the columns you want to profile for each source.


Datatype | Required | The datatype for the columns you want to profile in the corresponding source. The columns can use any combination of datatypes except Date with Numeric and non-Raw with Raw. You can use Raw with Raw when you disable verbose mode.

Specify the type of verbose data to load into the warehouse | Required | Select for the PowerCenter Server to write all unmatched or orphaned rows to the Data Profiling warehouse. You can load the following types of verbose data:
- No rows
- Orphan rows
The character limit is 1,000/K, where K is the maximum number of bytes for each character in the Data Profiling warehouse code page. If the column exceeds this limit, the PowerCenter Server writes truncated data to the Data Profiling warehouse. For more information about configuring verbose mode, see “Configuring a Function for Verbose Mode” on page 43.

Join Complexity Evaluation


The Join Complexity Evaluation intersource function measures the columns in multiple
sources that do not have a join condition. This function provides information to help you
analyze join complexity. This is useful for designing and optimizing queries. You can select up
to five sources for the Join Complexity Evaluation function.



Figure 7-14 shows the Function Role Details page for the Join Complexity Evaluation
function:

Figure 7-14. Join Complexity Evaluation Function

When you configure the details for the Join Complexity Evaluation function, you must select
the column in each source that you want to profile. The columns can use any datatype except
Binary. The columns can use any combination of datatypes except Date with Numeric, Raw
with Raw, and non-Raw with Raw.



Appendix A

Data Profiling Views

This chapter includes the following topics:


♦ Overview, 128
♦ List of Data Profiling Views, 128

Overview
PowerCenter Data Profiling provides views to create metrics and attributes in the Data
Profiling schema. The prepackaged reports use these metrics and attributes to provide
information about your source data.
Your business needs may require you to create custom metrics, attributes, or reports using the
prepackaged Data Profiling schema tables. The schema tables are views built on top of the
Data Profiling tables. To understand the Data Profiling schema, you must understand the
views and the information contained in each view.
This chapter provides the following information about each view:
♦ Description. Provides a description of the view.
♦ Usage. Provides information about how you can use the view for analysis.
♦ Column Name. Provides the name of the column used in the view.

List of Data Profiling Views


PowerCenter Data Profiling prepackages the following views:
♦ DPR_LATEST_PRFLS, 129
♦ DPR_PRFL_AUTO_COL_FN_METRICS, 130
♦ DPR_PRFL_CART_PROD_METRICS, 131
♦ DPR_PRFL_COL_FN, 132
♦ DPR_PRFL_COL_FN_METRICS, 134
♦ DPR_PRFL_COL_FN_VERBOSE, 136
♦ DPR_PRFL_CP_FN, 137
♦ DPR_PRFL_FN_DTLS, 139
♦ DPR_PRFL_OJ_FN, 141
♦ DPR_PRFL_OJ_FN_VERBOSE, 143
♦ DPR_PRFL_OJ_METRICS, 144
♦ DPR_PRFL_RUN_DTLS, 145
♦ DPR_PRFL_SRC_FN, 146
♦ DPR_PRFL_SRC_FN_METRICS, 148
♦ DPR_PRFL_SRC_FN_VERBOSE, 149
♦ DPR_PRFL_VER_DTLS, 150

128 Appendix A: Data Profiling Views


DPR_LATEST_PRFLS
♦ Description. This view contains the latest version and latest session run details for every
profile ID.
♦ Usage. Use this view to get the latest version and session run details.

Table A-1. DPR_LATEST_PRFLS Column Information

Column Name Description

PROFILE_ID Identifies the data profile in every repository and repository folder.

CURR_PRFL_VER_KEY The surrogate key that the Designer generates for a data profile. This identifies
the current version of a data profile.

CURR_PRFL_RUN_KEY The surrogate key that the Designer generates for a profile session run. This
identifies the current version of a profile session run.

CURR_VER_DT Identifies the last saved date of the data profile.

CURR_RUN_DT Identifies the last saved date of the profile session run.
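A custom query against this view is a simple lookup by profile ID. The sketch below stages the view's columns in an in-memory SQLite table as a stand-in for the Data Profiling warehouse; your warehouse connection and SQL dialect will differ:

```python
import sqlite3

# Sketch: fetch the latest run key and run date for one data profile.
# The table here mimics the DPR_LATEST_PRFLS view's columns.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE DPR_LATEST_PRFLS (
    PROFILE_ID INTEGER, CURR_PRFL_VER_KEY INTEGER,
    CURR_PRFL_RUN_KEY INTEGER, CURR_VER_DT TEXT, CURR_RUN_DT TEXT)""")
conn.execute("INSERT INTO DPR_LATEST_PRFLS VALUES (1, 10, 100, "
             "'2005-02-01', '2005-02-02')")
row = conn.execute(
    "SELECT CURR_PRFL_RUN_KEY, CURR_RUN_DT FROM DPR_LATEST_PRFLS "
    "WHERE PROFILE_ID = 1").fetchone()
print(row)  # (100, '2005-02-02')
```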

DPR_LATEST_PRFLS 129
DPR_PRFL_AUTO_COL_FN_METRICS
♦ Description. This view contains the results (metric statistics) of the column-level
functions of an auto profile.
♦ Usage. Use this view to get all metric statistics of all the column-level functions of an auto
profile. This view does not apply to custom profiles.

Table A-2. DPR_PRFL_AUTO_COL_FN_METRICS Column Information

Column Name Description

FUNCTION_KEY The surrogate key that the Designer generates for a data profile function.

PRFL_RUN_KEY The profile run key that the Designer generates for a profile session. Refers to
DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.

TOTAL_ROWS The total row count in a column. This applies to the column-level Business Rule
Validation, Distinct Value Count, and Aggregate functions.

NULL_ROWS The null row count in a column. This applies to the Aggregate function.

DISTINCT_ROWS The distinct count of values in a column. This applies to the Distinct Value Count
function.

AVG_VALUE The average value of the numeric fields of the selected column in a source. This
applies to the Aggregate function.

MIN_VALUE The minimum value of a column selected in a source. This applies to the
Aggregate function.

MAX_VALUE The maximum value of a column selected in a source. This applies to the
Aggregate function.



DPR_PRFL_CART_PROD_METRICS
♦ Description. This view contains the results (metric statistics) of the Join Complexity
Evaluation functions.
♦ Usage. Use this view to get the result set (metric statistics) of the Join Complexity
Evaluation functions.

Table A-3. DPR_PRFL_CART_PROD_METRICS Column Information

Column Name Description

FUNCTION_KEY The surrogate key that the Designer generates for a data profile function.

PRFL_RUN_KEY The profile run key that the Designer generates for a profile session. Refers to
DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.

CP_VAL The value for two or more source tables based on a joined column.

COLUMN_VALUE The column value for the selected joined columns on two or more source tables.

DPR_PRFL_COL_FN
♦ Description. This view contains details about column-level functions, such as Aggregate
and Domain Validation functions.
♦ Usage. Use this view to get all data profile column-level function details.

Table A-4. DPR_PRFL_COL_FN Column Information

Column Name Description

FUNCTION_LEVEL Identifies the function level for a function in a data profile. The function level is
COLUMN.

FUNCTION_KEY The surrogate key that the Designer generates for a data profile column-level
function.

PRFL_VER_KEY The profile version key that the Designer generates for a data profile. Refers to
DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.

SOURCE_NAME1 Identifies the name of the source object that is being profiled.

SOURCE_NAME2 This column does not apply to column-level functions.

SOURCE_NAME Identifies the name of the source object that is being profiled.

SOURCE_TYPE1 Identifies the type of source object that is being profiled, such as a database
table or mapplet.

SOURCE_TYPE2 This column does not apply to column-level functions.

SOURCE_TYPE Identifies the type of source object that is being profiled, such as a database
table or mapplet.

DBD_NAME1 Identifies the DBD name for the source in a data profile.

DBD_NAME2 This column does not apply to column-level functions.

DBD_NAME Identifies the DBD name for a source in a data profile.

GROUP_NAME1 Identifies the group name of the source object.

GROUP_NAME2 This column does not apply to column-level functions.

GROUP_NAME Identifies the group name of the source object.

FUNCTION_NAME Identifies the name of a function in a data profile.

FUNCTION_TYPE Identifies the function type for the column-level functions.

FN_TYPE_TEXT The detailed text of the function type for the column-level functions.

RULE_VAL Identifies the rule applied to a function. This applies to the column-level Business
Rule Validation function and the Distinct Value Count function.

COLNAME1 Identifies the column for which the profile function is defined.

COLNAME2 This column does not apply to column-level functions.

COLNAME3 This column does not apply to column-level functions.


COLTYPE1 Identifies the datatype of the source column to which the profile function applies.
This column applies to column-level and intersource functions.

COLTYPE2 This column does not apply to column-level functions.

COLTYPE3 This column does not apply to column-level functions.

GRP_BY_COLNAME1 Identifies the first column on which the profiling function is grouped.

GRP_BY_COLNAME2 Identifies the second column on which the profiling function is grouped.

GRP_BY_COLNAME3 Identifies the third column on which the profiling function is grouped.

GRP_BY_COLUMNS Identifies the concatenated values of the three group-by columns.

GRP_BY_COLTYPE1 Identifies the datatype of the first column on which the profiling function is
grouped.

GRP_BY_COLTYPE2 Identifies the datatype of the second column on which the profiling function is
grouped.

GRP_BY_COLTYPE3 Identifies the datatype of the third column on which the profiling function is
grouped.

GRP_BY_COLTYPES Identifies the datatype of all the columns on which the profiling function is
grouped.

DOMAIN_NAME Identifies the domain name used for domain validation.

DOMAIN_TYPE Identifies the domain type for domain validation.

DOMAIN_VALUE Identifies the domain value for domain validation. This applies to Regular
Expression and Domain Validation Filename domains.

DPR_PRFL_COL_FN_METRICS
♦ Description. This view contains the results (metric statistics) of all column-level functions.
♦ Usage. Use this view to get all metric statistics of all the column-level functions. You can
use this view to analyze the result set of column-level functions.

Table A-5. DPR_PRFL_COL_FN_METRICS Column Information

Column Name Description

FUNCTION_KEY The surrogate key that the Designer generates for a data profile column-level
function.

PRFL_RUN_KEY The profile run key that the Designer generates for a profile session. Refers to
DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.

GRP_BY_COL1_VAL Identifies the first column value on which the profile function is grouped.

GRP_BY_COL2_VAL Identifies the second column value on which the profile function is grouped.

GRP_BY_COL3_VAL Identifies the third column value on which the profile function is grouped.

GRP_BY_COL_VALUES Identifies the concatenated values of the three group-by column values.

FUNCTION_TYPE Identifies the function types for the column-level functions.

FN_TYPE_TEXT The detailed text of the function type for the column-level functions.

TOTAL_ROWS The total row count in a column. This applies to the column-level Business
Rule Validation, Distinct Value Count, and Aggregate functions.

SATISFIED_ROWS The row count for rows that satisfied a rule or condition in a column. This
applies to the column-level Business Rule Validation function.

UNSATISFIED_ROWS The row count for rows that did not satisfy a rule or condition in a column. This
applies to the column-level Business Rule Validation function.

DUPLICATE_ROWS The percentage of duplicate rows in a column.

NULL_ROWS The null row count in a column. This applies to the Aggregate function.

DISTINCT_ROWS The distinct count of values in a column. This applies to the Distinct Value
Count function.

AVG_VALUE The average value of the numeric fields of the selected column in a source.
This applies to the Aggregate function.

MIN_VALUE The minimum value of a selected column in a source. This applies to the
Aggregate function.

MAX_VALUE The maximum value of a selected column in a source. This applies to the
Aggregate function.

COLUMN_PATTERN The list of values for domain validation and an inferred pattern for the Domain
Inference function.

DOMAIN_TYPE The domain type of domain inference. This could be a list of values or a regular
expression.


DOMAIN_TOTAL_ROWS The total row count in a source for Domain Validation and Domain Inference
functions.

DOMAIN_SATISFIED_ROWS The total row count for rows that satisfied a domain validation rule or inferred
pattern. This applies to the Domain Validation and Domain Inference functions.

DOMAIN_UNSATISFIED_ROWS The total row count for rows that did not satisfy a domain validation rule or
inferred pattern. This applies to Domain Validation and Domain Inference
functions.

DOMAIN_NULL_ROWS The null row count for Domain Validation and Domain Inference functions.

DPR_PRFL_COL_FN_VERBOSE
♦ Description. This view contains the rejected row information for column-level functions.
♦ Usage. Use this view to get all rejected row information for the Business Rule Validation
and Domain Validation column-level functions.

Table A-6. DPR_PRFL_COL_FN_VERBOSE Column Information

Column Name Description

FUNCTION_KEY The surrogate key that the Designer generates for a data profile column-level
function.

PRFL_RUN_KEY The profile run key that the Designer generates for a profile session run. Refers
to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.

COLUMN_VALUE Identifies the column value in a source table.

GRP_BY_COL1_VAL Identifies the first column value on which the profile function is grouped.

GRP_BY_COL2_VAL Identifies the second column value on which the profile function is grouped.

GRP_BY_COL3_VAL Identifies the third column value on which the profile function is grouped.

GRP_BY_COL_VALUES Identifies all the column values on which the profile function is grouped.



DPR_PRFL_CP_FN
♦ Description. This view contains details about Join Complexity Evaluation functions.
♦ Usage. Use this view to get all the Join Complexity Evaluation function details.

Table A-7. DPR_PRFL_CP_FN Column Information

Column Name Description

FUNCTION_LEVEL Identifies the function level for a function in a data profile. The function level is
INTER SOURCE.

FUNCTION_KEY The surrogate key that the Designer generates for a data profile Join Complexity
Evaluation function.

PRFL_VER_KEY The profile version key that the Designer generates for a data profile. Refers to
DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.

SOURCE_NAME1 Identifies the name of the source object that is being profiled.

SOURCE_NAME2 This column does not apply to Join Complexity Evaluation functions.

SOURCE_NAME Identifies the name of the source object that is being profiled.

SOURCE_TYPE1 Identifies the type of source object that is being profiled, such as a database
table or mapplet.

SOURCE_TYPE2 This column does not apply to Join Complexity Evaluation functions.

SOURCE_TYPE Identifies the type of source object that is being profiled, such as a database
table or mapplet.

DBD_NAME1 Identifies the DBD name for the source in a data profile.

DBD_NAME2 This column does not apply to Join Complexity Evaluation functions.

DBD_NAME Identifies the DBD name for a source in a data profile.

GROUP_NAME1 Identifies the group name of the source object.

GROUP_NAME2 This column does not apply to Join Complexity Evaluation functions.

GROUP_NAME Identifies the group name of the source object.

FUNCTION_NAME Identifies the name of a function in a data profile.

FUNCTION_TYPE Identifies the function type for the Join Complexity Evaluation function.

FN_TYPE_TEXT The detailed text of the function type for the Join Complexity Evaluation function.

RULE_VAL This column does not apply to Join Complexity Evaluation functions.

COLNAME1 Identifies the column for which the profile function is defined.

COLNAME2 This column does not apply to Join Complexity Evaluation functions.

COLNAME3 This column does not apply to Join Complexity Evaluation functions.

COLTYPE1 Identifies the datatype of the source column to which the profile function applies.

COLTYPE2 This column does not apply to Join Complexity Evaluation functions.


COLTYPE3 This column does not apply to Join Complexity Evaluation functions.

GRP_BY_COLNAME1 This column does not apply to Join Complexity Evaluation functions.

GRP_BY_COLNAME2 This column does not apply to Join Complexity Evaluation functions.

GRP_BY_COLNAME3 This column does not apply to Join Complexity Evaluation functions.

GRP_BY_COLUMNS This column does not apply to Join Complexity Evaluation functions.

GRP_BY_COLTYPE1 This column does not apply to Join Complexity Evaluation functions.

GRP_BY_COLTYPE2 This column does not apply to Join Complexity Evaluation functions.

GRP_BY_COLTYPE3 This column does not apply to Join Complexity Evaluation functions.

GRP_BY_COLTYPES This column does not apply to Join Complexity Evaluation functions.

DOMAIN_NAME This column does not apply to Join Complexity Evaluation functions.

DOMAIN_TYPE This column does not apply to Join Complexity Evaluation functions.

DOMAIN_VALUE This column does not apply to Join Complexity Evaluation functions.

138 Appendix A: Data Profiling Views


DPR_PRFL_FN_DTLS
♦ Description. This view contains details about all source-level, column-level, and
intersource functions. This view is registered in the operational schema to get all the
function details.
♦ Usage. Use this view to obtain information about all profile functions.

Table A-8. DPR_PRFL_FN_DTLS Column Information

Column Name Description

FUNCTION_LEVEL Identifies the function level for a function in a data profile. The function levels are:
SOURCE, COLUMN, and INTER SOURCE.

FUNCTION_KEY The surrogate key that the Designer generates for a data profile function.

PRFL_VER_KEY The profile version key that the Designer generates for a data profile. Refers to
DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.

SOURCE_NAME1 Identifies the name of the source object that is being profiled. For Orphan
Analysis functions, this contains the master source name.

SOURCE_NAME2 Identifies the name of the detail source object that is being profiled. This only
applies to Orphan Analysis functions.

SOURCE_NAME Identifies the name of the source object that is being profiled.

SOURCE_TYPE1 Identifies the type of source object that is being profiled, such as a database
table or mapplet. For Orphan Analysis functions, this identifies the type of the
master source object.

SOURCE_TYPE2 Identifies the type of detail source object that is being profiled, such as a
database table or mapplet. This only applies to Orphan Analysis functions.

SOURCE_TYPE Identifies the type of source object that is being profiled, such as a database
table or mapplet.

DBD_NAME1 Identifies the DBD name for the source in a data profile. For Orphan Analysis
functions, this is populated with the DBD name of the master source.

DBD_NAME2 Identifies the DBD name for the detail source in a data profile. This only applies
to Orphan Analysis functions.

DBD_NAME Identifies the DBD name for a source in a data profile.

GROUP_NAME1 Identifies the group name of the source object.

GROUP_NAME2 Identifies the group name of the source object.

GROUP_NAME Identifies the group name of the source object.

FUNCTION_NAME Identifies the name of a function in a data profile.

FUNCTION_TYPE Identifies the function type for each of the function levels: source level, column
level, and intersource.

FN_TYPE_TEXT The detailed text of the function type for each of the function levels: source level,
column level, and intersource.


RULE_VAL Identifies the rule applied to a function. This applies to the source-level and
column-level Business Rule Validation functions and the Distinct Value Count
function.

COLNAME1 Identifies the column for which the profile function is defined. This column does
not apply to source-level functions.

COLNAME2 Identifies one of the source columns to which the profile function applies. If the
profile function applies to various columns, this column populates. This column
only applies to Orphan Analysis functions.

COLNAME3 Identifies one of the source columns to which the profile function applies. If the
profile function applies to various columns, this column populates. This column
only applies to Orphan Analysis functions.

COLTYPE1 Identifies the datatype of the source column to which the profile function applies.
This column applies to column-level and intersource functions.

COLTYPE2 Identifies the datatype of the source column to which the profile function applies.
This column applies to Orphan Analysis functions.

COLTYPE3 Identifies the datatype of the source column to which the profile function applies.
If a function applies to various source columns, this column populates. This
column applies to Orphan Analysis functions.

GRP_BY_COLNAME1 Identifies the first column on which the profiling function is grouped.

GRP_BY_COLNAME2 Identifies the second column on which the profiling function is grouped.

GRP_BY_COLNAME3 Identifies the third column on which the profiling function is grouped.

GRP_BY_COLNAMES Identifies the concatenated names of all the columns on which the profiling
function is grouped.

DOMAIN_NAME Identifies the domain name used for domain validation.

DOMAIN_TYPE Identifies the domain type for domain validation.

DOMAIN_VALUE Identifies the domain value for domain validation. This applies to Regular
Expression and Domain Validation Filename domains.
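Because PRFL_VER_KEY refers to the DPR_PRFL_VER_DTLS.PRFL_VER_KEY column, you can join the two views to list the functions in the latest version of each profile. The following sketch assumes that LATEST_PRFL_VER_FLAG stores 'Y' for the latest version, which this appendix does not specify; adjust the literal to match your warehouse.

```sql
-- List every function in the latest saved version of each data profile.
-- The 'Y' literal for LATEST_PRFL_VER_FLAG is an assumption.
SELECT v.PROFILE_NAME,
       f.FUNCTION_LEVEL,
       f.FUNCTION_NAME,
       f.FN_TYPE_TEXT
FROM   DPR_PRFL_FN_DTLS f
       JOIN DPR_PRFL_VER_DTLS v
         ON f.PRFL_VER_KEY = v.PRFL_VER_KEY
WHERE  v.LATEST_PRFL_VER_FLAG = 'Y'
ORDER  BY v.PROFILE_NAME, f.FUNCTION_LEVEL, f.FUNCTION_NAME;
```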



DPR_PRFL_OJ_FN
♦ Description. This view contains details about Orphan Analysis functions.
♦ Usage. Use this view to get all the Orphan Analysis function details.

Table A-9. DPR_PRFL_OJ_FN Column Information

Column Name Description

FUNCTION_LEVEL Identifies the function level for a function in a data profile. The function level is
INTER SOURCE.

FUNCTION_KEY The surrogate key that the Designer generates for a data profile Orphan Analysis
function.

PRFL_VER_KEY The profile version key that the Designer generates for a data profile. Refers to
DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.

SOURCE_NAME1 Identifies the name of the source object that is being profiled. For Orphan
Analysis functions, this contains the master source name.

SOURCE_NAME2 Identifies the name of the detail source object that is being profiled. This column
only applies to Orphan Analysis functions.

SOURCE_NAME Identifies the name of the source object that is being profiled.

SOURCE_TYPE1 Identifies the type of source object that is being profiled, such as a database
table or mapplet. For Orphan Analysis functions, this identifies the type of the
master source object.

SOURCE_TYPE2 Identifies the type of detail source object that is being profiled, such as a
database table or mapplet. This only applies to Orphan Analysis functions.

SOURCE_TYPE Identifies the type of source object that is being profiled, such as a database
table or mapplet.

DBD_NAME1 Identifies the DBD name for the source in a data profile. For Orphan Analysis
functions, this is populated with the DBD name of the master source.

DBD_NAME2 Identifies the DBD name for the detail source in a data profile. This only applies
to Orphan Analysis functions.

DBD_NAME Identifies the DBD name for a source in a data profile.

GROUP_NAME1 Identifies the group name of the source object.

GROUP_NAME2 Identifies the group name of the source object.

GROUP_NAME Identifies the group name of the source object.

FUNCTION_NAME Identifies the name of a function in a data profile.

FUNCTION_TYPE Identifies the function type for the Orphan Analysis functions.

FN_TYPE_TEXT The detailed text of the function type for the Orphan Analysis functions.

RULE_VAL This column does not apply to Orphan Analysis functions.

COLNAME1 Identifies the column for which the profile function is defined.


COLNAME2 Identifies one of the source columns to which the profile function applies. If the
profile function applies to various columns, this column populates. This column
only applies to Orphan Analysis functions.

COLNAME3 Identifies one of the source columns to which the profile function applies. If the
profile function applies to various columns, this column populates. This column
only applies to Orphan Analysis functions.

COLTYPE1 Identifies the datatype of the source column to which the profile function applies.
This column applies to column-level and intersource functions.

COLTYPE2 Identifies the datatype of the source column to which the profile function applies.
This column only applies to Orphan Analysis functions.

COLTYPE3 Identifies the datatype of the source column to which the profile function applies.
If a function applies to various source columns, this column populates. This
column only applies to Orphan Analysis functions.

GRP_BY_COLNAME1 This column does not apply to Orphan Analysis functions.

GRP_BY_COLNAME2 This column does not apply to Orphan Analysis functions.

GRP_BY_COLNAME3 This column does not apply to Orphan Analysis functions.

GRP_BY_COLNAMES This column does not apply to Orphan Analysis functions.

GRP_BY_COLTYPE1 This column does not apply to Orphan Analysis functions.

GRP_BY_COLTYPE2 This column does not apply to Orphan Analysis functions.

GRP_BY_COLTYPE3 This column does not apply to Orphan Analysis functions.

GRP_BY_COLTYPES This column does not apply to Orphan Analysis functions.

DOMAIN_NAME This column does not apply to Orphan Analysis functions.

DOMAIN_TYPE This column does not apply to Orphan Analysis functions.

DOMAIN_VALUE This column does not apply to Orphan Analysis functions.



DPR_PRFL_OJ_FN_VERBOSE
♦ Description. This view contains the rejected row information for Orphan Analysis
functions.
♦ Usage. Use this view to get all rejected row information for the Orphan Analysis function.

Table A-10. DPR_PRFL_OJ_FN_VERBOSE Column Information

Column Name Description

FUNCTION_KEY The surrogate key that the Designer generates for a data profile function.

PRFL_RUN_KEY The profile run key that the Designer generates for a profile session run. Refers
to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.

COL1_VALUE Identifies the column value of the first joined column.

COL2_VALUE Identifies the column value of the second joined column.

COL3_VALUE Identifies the column value of the third joined column.

COL_VALUE Identifies the concatenated text of the three column values of the joined columns.

MASTER_ROW_FLAG Identifies whether the verbose record pertains to the master source or the detail
source.
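Because PRFL_RUN_KEY refers to the DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column, you can restrict the rejected rows to the latest session run. The following sketch assumes that LATEST_PRFL_RUN_FLAG stores 'Y' for the latest run, which this appendix does not specify.

```sql
-- Retrieve rejected (orphan) row values from the latest profile session run,
-- separated into master and detail rows by MASTER_ROW_FLAG.
SELECT o.FUNCTION_KEY,
       o.MASTER_ROW_FLAG,
       o.COL_VALUE
FROM   DPR_PRFL_OJ_FN_VERBOSE o
       JOIN DPR_PRFL_RUN_DTLS r
         ON o.PRFL_RUN_KEY = r.PRFL_RUN_KEY
WHERE  r.LATEST_PRFL_RUN_FLAG = 'Y'   -- assumed flag value
ORDER  BY o.FUNCTION_KEY, o.MASTER_ROW_FLAG;
```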

DPR_PRFL_OJ_METRICS
♦ Description. This view contains the results (metric statistics) of Orphan Analysis
functions.
♦ Usage. Use this view to get all the Orphan Analysis function run details.

Table A-11. DPR_PRFL_OJ_METRICS Column Information

Column Name Description

FUNCTION_KEY The surrogate key that the Designer generates for a data profile function.

PRFL_RUN_KEY The profile run key that the Designer generates for a profile session run.
Refers to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.

TOTAL_PARENT_ROWS The total row count in the master source.

TOTAL_CHILD_ROWS The total row count in the detail source.

UNSATISFIED_PARENT_ROWS The row count for rows that did not satisfy join conditions in the master source.

UNSATISFIED_CHILD_ROWS The row count for rows that did not satisfy join conditions in the detail source.

SATISFIED_PARENT_ROWS The row count for rows that satisfied join conditions in the master source.

SATISFIED_CHILD_ROWS The row count for rows that satisfied join conditions in the detail source.

NULL_PARENT_ROWS The null rows count in the master source.

NULL_CHILD_ROWS The null rows count in the detail source.
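For example, the following sketch derives the percentage of master and detail rows that did not satisfy the join conditions in each run. NULLIF guards against division by zero; the exact SQL syntax depends on the warehouse database.

```sql
-- Percentage of rows in each source that failed the join conditions.
SELECT FUNCTION_KEY,
       PRFL_RUN_KEY,
       100.0 * UNSATISFIED_PARENT_ROWS / NULLIF(TOTAL_PARENT_ROWS, 0)
           AS PCT_MASTER_ORPHANS,
       100.0 * UNSATISFIED_CHILD_ROWS / NULLIF(TOTAL_CHILD_ROWS, 0)
           AS PCT_DETAIL_ORPHANS
FROM   DPR_PRFL_OJ_METRICS;
```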



DPR_PRFL_RUN_DTLS
♦ Description. This view contains session run details for all data profiles.
♦ Usage. Use this view to get all session run details for all data profiles.

Table A-12. DPR_PRFL_RUN_DTLS Column Information

Column Name Description

PRFL_VER_KEY The profile version key that the Designer generates for a data profile. Refers to
DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.

PRFL_RUN_KEY The surrogate key that the Designer generates for a profile session run.

PRFL_REPO_NAME Identifies the name of the repository that stores the data profile.

PRFL_FOLDER_NAME Identifies the folder that stores the data profile.

PROFILE_ID Identifies the data profile in every repository and repository folder.

PROFILE_NAME Identifies the name of the data profile within a folder and repository.

PRFL_VER_DT Identifies the last saved date of the data profile.

PRFL_RUN_DT Identifies the last run date of the session for a data profile.

PRFL_RUN_STATUS Identifies the status of profile session runs. You can run every profile session one
or more times. The run status displays the status of every session run. The run
status can be: RUNNING, SUCCESS, or FAILURE.

LATEST_PRFL_RUN_FLAG Identifies the latest profile session run for all functions.
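For example, the following sketch lists failed session runs by profile, most recent first. The FAILURE status value is documented above; the SQL syntax itself is illustrative.

```sql
-- List failed profile session runs, most recent first.
SELECT PROFILE_NAME,
       PRFL_FOLDER_NAME,
       PRFL_RUN_DT,
       PRFL_RUN_STATUS
FROM   DPR_PRFL_RUN_DTLS
WHERE  PRFL_RUN_STATUS = 'FAILURE'
ORDER  BY PRFL_RUN_DT DESC;
```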

DPR_PRFL_SRC_FN
♦ Description. This view contains details about the source-level functions, such as
Candidate Key Evaluation and Redundancy Evaluation.
♦ Usage. Use this view to get all source-level function details.

Table A-13. DPR_PRFL_SRC_FN Column Information

Column Name Description

FUNCTION_LEVEL Identifies the function level for a function in a data profile. The function level is
SOURCE.

FUNCTION_KEY The surrogate key that the Designer generates for a data profile source-level
function.

PRFL_VER_KEY The profile version key that the Designer generates for a data profile. Refers to
DPR_PRFL_VER_DTLS.PRFL_VER_KEY column.

SOURCE_NAME1 Identifies the name of the source object that is being profiled.

SOURCE_NAME2 This column does not apply to source-level functions.

SOURCE_NAME Identifies the name of the source object that is being profiled.

SOURCE_TYPE1 Identifies the type of source object that is being profiled, such as a database
table or mapplet.

SOURCE_TYPE2 This column does not apply to source-level functions.

SOURCE_TYPE Identifies the type of source object that is being profiled, such as a database
table or mapplet.

DBD_NAME1 Identifies the DBD name for the source in a data profile.

DBD_NAME2 This column does not apply to source-level functions.

DBD_NAME Identifies the DBD name for a source in a data profile.

GROUP_NAME1 Identifies the group name of the source object.

GROUP_NAME2 This column does not apply to source-level functions.

GROUP_NAME Identifies the group name of the source object.

FUNCTION_NAME Identifies the name of a function in a data profile.

FUNCTION_TYPE Identifies the function type for the source-level functions.

FN_TYPE_TEXT The detailed text of the function type for source-level functions.

RULE_VAL Identifies the rule applied to a function. This applies to the source-level Business
Rule Validation function.

COLNAME1 This column does not apply to source-level functions.

COLNAME2 This column does not apply to source-level functions.

COLNAME3 This column does not apply to source-level functions.


COLTYPE1 This column does not apply to source-level functions.

COLTYPE2 This column does not apply to source-level functions.

COLTYPE3 This column does not apply to source-level functions.

GRP_BY_COLNAME1 Identifies the first column on which the profiling function is grouped.

GRP_BY_COLNAME2 Identifies the second column on which the profiling function is grouped.

GRP_BY_COLNAME3 Identifies the third column on which the profiling function is grouped.

GRP_BY_COLNAMES Identifies the concatenated values of the three group-by columns.

GRP_BY_COLTYPE1 Identifies the datatype of first column on which the profiling function is grouped.

GRP_BY_COLTYPE2 Identifies the datatype of second column on which the profiling function is
grouped.

GRP_BY_COLTYPE3 Identifies the datatype of third column on which the profiling function is grouped.

GRP_BY_COLTYPES Identifies the datatype of all the columns on which the profiling function is
grouped.

DOMAIN_NAME This column does not apply to source-level functions.

DOMAIN_TYPE This column does not apply to source-level functions.

DOMAIN_VALUE This column does not apply to source-level functions.

DPR_PRFL_SRC_FN_METRICS
♦ Description. This view contains the results (metric statistics) of all source-level functions.
Use this view to analyze the results of various source-level functions.
♦ Usage. Use this view to get all metric statistics of all the source-level functions.

Table A-14. DPR_PRFL_SRC_FN_METRICS Column Information

Column Name Description

FUNCTION_KEY The surrogate key that the Designer generates for a data profile source-level
function.

PRFL_RUN_KEY The profile run key that the Designer generates for a profile session. Refers to
DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.

COLUMN_NAME Identifies the concatenated text of one column or a combination of two columns
used for the Candidate Key Evaluation and Redundancy Evaluation functions.

GRP_BY_COL1_VAL Identifies the first column value on which the profiling function is grouped.

GRP_BY_COL2_VAL Identifies the second column value on which the profiling function is grouped.

GRP_BY_COL3_VAL Identifies the third column value on which the profiling function is grouped.

GRP_BY_COL_VALUES Identifies the concatenated values of the three group-by column values.

FUNCTION_TYPE Identifies the function types for the source-level functions.

FN_TYPE_TEXT The detailed text of the function type for the source-level functions.

TOTAL_ROWS The total rows in a source. This applies to the source-level Business Rule
Validation and Row Count functions.

TOTAL_ROWS_EVAL The total rows in a source. This applies to the Candidate Key Evaluation and
Redundancy Evaluation functions.

SATISFIED_ROWS The total row count for rows that satisfied a rule or condition. This applies to the
source-level Business Rule Validation and Row Count functions.

PERC_SATISFIED_ROWS The percentage of rows that satisfied a rule or condition in a source. This applies
to the source-level Business Rule Validation and Row Count functions.

UNSATISFIED_ROWS The row count for rows that did not satisfy a rule or condition. This applies to the
source-level Business Rule Validation and Row Count functions.

DUP_ROWS The duplicate row count in a source. The duplicate row count is the total number
of rows minus the distinct number of rows. This applies to the Redundancy
Evaluation function.
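For example, the following sketch ranks column combinations by duplicate row count from Redundancy Evaluation results. Filtering on DUP_ROWS avoids depending on the exact function type text, which this appendix does not enumerate.

```sql
-- Rank column combinations by the number of duplicate rows found.
SELECT COLUMN_NAME,
       TOTAL_ROWS_EVAL,
       DUP_ROWS
FROM   DPR_PRFL_SRC_FN_METRICS
WHERE  DUP_ROWS > 0
ORDER  BY DUP_ROWS DESC;
```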



DPR_PRFL_SRC_FN_VERBOSE
♦ Description. This view contains the rejected row information for source-level functions.
♦ Usage. Use this view to get all rejected row information for the Business Rule Validation
source-level function.

Table A-15. DPR_PRFL_SRC_FN_VERBOSE Column Information

Column Name Description

FUNCTION_KEY The surrogate key that the Designer generates for a source-level function.

PRFL_RUN_KEY The profile run key that the Designer generates for a profile session run. Refers
to DPR_PRFL_RUN_DTLS.PRFL_RUN_KEY column.

COLUMN_NAME Identifies the column name in a source.

COLUMN_TYPE Identifies the column datatype in a source.

COLUMN_VALUE Identifies the column value in a source.

GRP_BY_COL1_VAL Identifies the first column value on which the profile function is grouped.

GRP_BY_COL2_VAL Identifies the second column value on which the profile function is grouped.

GRP_BY_COL3_VAL Identifies the third column value on which the profile function is grouped.

GRP_BY_COL_VALUES Identifies all the column values on which the profiling function is grouped.

DP_ROW_NUMBER Identifies the row to which the column belongs. Use this row number to group
the columns that belong to one row.
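Because each rejected source row is stored as one record per column, DP_ROW_NUMBER reassembles the rows. The following sketch groups the column values of each rejected row together; the SQL is illustrative.

```sql
-- Group the column values that belong to one rejected source row.
SELECT DP_ROW_NUMBER,
       COLUMN_NAME,
       COLUMN_VALUE
FROM   DPR_PRFL_SRC_FN_VERBOSE
ORDER  BY DP_ROW_NUMBER, COLUMN_NAME;
```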

DPR_PRFL_VER_DTLS
♦ Description. This view contains version details for all data profiles.
♦ Usage. Use this view to get details about profile versions.

Table A-16. DPR_PRFL_VER_DTLS Column Information

Column Name Description

PRFL_VER_KEY The surrogate key that the Designer generates for a data profile.

PRFL_REPO_NAME Identifies the name of the repository that stores the data profile.

PRFL_FOLDER_NAME Identifies the folder that stores the data profile.

PROFILE_ID Identifies the data profile in every repository and repository folder.

PROFILE_NAME Identifies the name of the data profile within a folder and repository.

PRFL_VER_DT Identifies the last saved date of the data profile.

LATEST_PRFL_VER_FLAG Identifies the latest version of the profile.

PROFILE_TYPE Identifies whether the data profile is an auto profile or custom profile. Values are
AUTO PROFILE and CUSTOM PROFILE.
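For example, the following sketch lists the latest version of every data profile with its type. It assumes that LATEST_PRFL_VER_FLAG stores 'Y' for the latest version, which this appendix does not specify.

```sql
-- List the latest saved version of each data profile and whether it is an
-- auto profile or a custom profile.
SELECT PRFL_REPO_NAME,
       PRFL_FOLDER_NAME,
       PROFILE_NAME,
       PROFILE_TYPE,
       PRFL_VER_DT
FROM   DPR_PRFL_VER_DTLS
WHERE  LATEST_PRFL_VER_FLAG = 'Y'   -- assumed flag value
ORDER  BY PROFILE_NAME;
```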



Appendix B

Code Page Compatibility

This appendix provides details about the following topics:


♦ Code Page Compatibility, 152

Code Page Compatibility
When you use Data Profiling, configure code page compatibility between all PowerCenter
components and Data Profiling components and domains.
Follow the instructions in “Code Pages” in the Installation and Configuration Guide to ensure
that the code pages of each PowerCenter component have the correct relationship with each
other.
When you work with data profiles, ensure that the code pages for the Data Profiling
components have the correct relationship with each other:
♦ The PowerCenter Server must use a code page that is a subset of the Data Profiling
warehouse code page.
♦ The PowerCenter Server and the Repository Server must use two-way compatible code
pages.
♦ The code page for the PowerCenter Client must be two-way compatible with the code
page for the Repository Server.
♦ A Domain Definition Filename domain must use a code page that is a subset of the
PowerCenter Server code page. For more information about creating the Domain
Definition Filename domain, see “Custom Domains” on page 92.
♦ A List of Values domain must use a code page that is a subset of the code page for the
operating system that hosts the PowerCenter Client. For more information about creating
a List of Values domain, see “Custom Domains” on page 92.
♦ To view reports from the Profile Manager, the Data Profiling warehouse must use a code
page that is two-way compatible with the code page of the operating system that hosts the
PowerCenter Client. For more information about viewing Data Profiling reports from the
Profile Manager, see “PowerCenter Data Profiling Reports” on page 79.



Figure B-1 shows code page compatibility requirements for Data Profiling:

Figure B-1. Configuring Code Page Compatibility

The PowerCenter Client, the PowerCenter Repository Server, and the PowerCenter
Server each use the operating system code page. The diagram relates these
components to the List of Values file, the Domain Definition file, and the Data
Profiling warehouse, and indicates where a two-way compatible code page is
required, where a two-way compatible code page is required for Data Profiling
reports, and where one code page must be a subset of another.


Appendix C

Data Profiling Error Messages

This appendix provides details about the following topics:


♦ Overview, 156
♦ Designer Messages, 157
♦ Server Messages, 162

Overview
PowerCenter returns messages when you perform data profiling tasks. Some messages are
errors and some are informational. You can use this chapter to help determine what causes
error messages to appear and what measures you can take to correct the error.
When an error message is a result of an internal error, causes are listed when possible, but
contacting Informatica Technical Support is still the best recourse. For contact information,
see “Obtaining Technical Support” on page xviii.
Data profiling messages can originate from the following sources:
♦ PowerCenter Designer
♦ PowerCenter Server
The Designer displays messages when you create and modify data profiles. The PowerCenter
Server displays messages in the session log when you run profile sessions.
For other PowerCenter messages, see the Troubleshooting Guide.



Designer Messages
The Designer displays messages in message boxes, in the output window, and in the status bar
as you perform a task. The messages in this section relate to data profiles and profile
mappings in the Designer. For information about other error messages that appear in the
Designer, see the Troubleshooting Guide.

Cannot access session log file <log file name>.


Cause: The log file might have been deleted.
Action: Rerun the session to create a session log.

Cannot regenerate mapping for the profile since it is running.


Cause: The Designer cannot regenerate a mapping for this data profile because a
session for the mapping is running.
Action: Wait for the session to complete before you regenerate the profile mapping.

Datatype ‘number’ of port <port name> and datatype ‘character’ of port <port name> are not the
same.
Cause: You specified ports with different datatypes for a Join Complexity
Evaluation function.
Action: Select ports with the same datatypes for the Join Complexity Evaluation
function.

Failed to add plug-in menu.


Cause: Internal error.
Action: Contact Informatica Technical Support.

Failed to launch the Profile Wizard.


Cause: Internal error.
Action: Contact Informatica Technical Support.

Failed to load the mapping.


Cause: The mapping was deleted.
Action: Recreate the data profile.

Failed to load the mapplet.


Cause: The mapplet was deleted.
Action: Recreate the data profile.



Failed to validate server connection: <server name>.
Cause: You attempted to run a profile against a PowerCenter Server that is down.
Action: Run the profile on a PowerCenter Server that is running or start the
PowerCenter Server that you wish to run the profile against.

<filename> file is not a valid domain definition file. Reason: <error message>.
Cause: The domain definition file that you specified for the List of Values domain is
empty or contains a value longer than 200 characters. The domain definition
file must contain entries of 200 or fewer characters.
Action: Make sure that your domain definition file contains valid content before you
import it.
or
Action: See the additional error message for more information.

Mapping <mapping name> representing this profile is not valid. This profile cannot be run in
interactive mode.
Cause: The mapping is invalid because it has been modified.
Action: Regenerate the profile mapping.
or
Action: If this is the first time that you generated the mapping, and you did not
modify it, contact Informatica Technical Support.

No report is available for the profile <data profile>.


Cause: The PowerCenter Server has not generated a report for the selected data profile
because you have not run a profile session for the data profile.
Action: You must run a profile session before you can view a report.

No session log was created for this session.


Cause: The session failed.
Action: Check the workflow log or the Event Viewer for a message that indicates the
reason for the failure. Correct the error and rerun the session.

Not connected to the warehouse.


Cause: You specified an incorrect ODBC database connection for viewing reports.
Action: Enter the correct ODBC database connection in the Profile Manager. Make
sure that the target warehouse connection for viewing reports matches the
relational database connection you specified when you ran the profile session.



Orphan Analysis is not valid for groups in the same source.
Cause: You specified more than one group from the same source while adding or
editing an Orphan Analysis function.
Action: Move groups back to the Available Sources field, leaving only one group from
each source in the Selected Sources field.

Port <port name> cannot be used in a Join Complexity Evaluation function, because it is of the
datatype ‘binary’.
Cause: You specified a port with a Binary datatype while adding or editing a Join
Complexity Evaluation function.
Action: Specify a port with a datatype other than Binary.

Profile information does not exist in the target warehouse.


Cause: You did not run a profile session against the Data Profiling warehouse for
which you are trying to view reports.
or
Cause: The profile was modified, and you did not run a profile session after the
profile was modified.
Action: Run a profile session against the Data Profiling warehouse for which you are
trying to view reports.

Profile was created successfully but mapping <mapping name> representing this profile is not
valid. This profile cannot be run in interactive mode.
Cause: Internal error.
Action: Contact Informatica Technical Support.

Profile was updated successfully but mapping <mapping name> representing this profile is not
valid. This profile cannot be run in interactive mode.
Cause: Internal error.
Action: Contact Informatica Technical Support.

Profiling for this source is not supported.


Cause: Internal error.
Action: Contact Informatica Technical Support.

Regular expression is invalid.


Cause: You specified an invalid regular expression.
Action: Make sure that your regular expression contains valid characters.



Regular expression validation failed at character position <position>: <error message>.
Cause: The Designer could not validate the regular expression.
Action: Check the additional error message for more information.

Some fields in source <source file name> group <group name> cannot be found. The source
might be corrupt.
Cause: The source has been modified since you last edited the data profile.
Action: Create a new data profile for the modified source.

Target warehouse does not exist.


Cause: The Data Profiling warehouse connection that you specified for viewing
reports does not match the relational database connection that you specified
for your session target.
Action: In the Profile Manager, modify your connection for viewing reports to match
the relational database connection that you specified for your session target.
or
Cause: Your database does not contain Data Profiling warehouse tables.
Action: Rerun the Data Profiling warehouse script in the PowerCenter Client
directory. Commit your SQL script after you run it. Rerun the profile session
and try to view the report again.

The mapping corresponding to this profile cannot be found. The repository might be corrupt.
Cause: The mapping corresponding to this data profile might have been deleted.
Action: Delete the data profile and create a new one.

The session log editor could not be invoked.


Cause: The path that you entered for the session log editor is invalid.
Action: In the Profile Manager, enter a valid path for your session log editor.

The target warehouse does not contain profile results for repository <repository name>.
Cause: The Data Profiling warehouse connection that you specified for viewing
reports does not match the relational database connection that you specified
for your session target.
Action: In the Profile Manager, modify your connection for viewing reports to match
the relational database connection that you specified for your session target.

There are no relational connections in this repository.


Cause: The Connection Browser does not contain any relational database connections
for the selected repository.
Action: Import a relational database connection from another repository, or create a
relational database connection in the selected repository.



This is already used for repository <repository name>.
Cause: The Data Profiling warehouse that you specified for the profile session has
been used for a different repository. A Data Profiling warehouse can contain
information for only one repository.
Action: Specify the correct Data Profiling warehouse for the profile session that you
are trying to run.

This mapplet is not valid. Invalid mapplet cannot be profiled.


Cause: The mapplet that you used to create an auto profile is invalid.
Action: Open your mapplet in the Mapplet Designer, and choose Mapplet-Validate to
view the mapplet error. The error displays in the Output window. Correct the
error and try to create the auto profile again.

Unable to load the profile.


Cause: The Designer cannot load the data profile.
Action: See the additional error message for more information.

You cannot profile a mapplet with transformations that generate transaction controls.
Cause: You tried to create a data profile for a mapplet with a transformation that
generates transaction controls. For example, you tried to create a data profile
for a mapplet with a Custom transformation configured for transaction
control.
Action: Make sure the transformations in your mapplet are not configured for
transaction control.

Designer Messages 161


Server Messages
The PowerCenter Server writes error and informational messages about profile sessions to
the session log. This section lists error messages specific to profile sessions. For
information about other error messages in the session log, see the Troubleshooting Guide.

DP Codes
The following messages may appear when you run a profile session:

DP_90001 Invalid target type. Profile mapping targets should either be relational or null.
Cause: The profile mapping contains target definitions that are not relational or null.
The targets in the profile mapping might have been modified.
Action: Recreate the profile.

DP_90002 All the targets in a profiling mapping should use the same connection and have
the same connection attributes.
Cause: There are two or more relational database connections configured for targets
in the profile mapping.
Action: Make sure you use the same relational database connection for all targets in a
profiling mapping.

DP_90003 Create server database connection failed.


Cause: The database driver might not be set up correctly.
Action: Check your database driver configuration.

DP_90004 Connection to the database using user <username>, connect string <database
connect string> failed. Reason: <error message>.
Cause: The username or connect string is invalid.
Action: Verify that your username and connect string values are valid.
or
Action: See the additional error message for more information.

DP_90005 Cannot find the profile object for this mapping.


Cause: You ran a session with a copied or deployed mapping that contains reusable
domains. You cannot run a session with a copied or deployed profile mapping
that contains reusable domains.
Action: Run a profile session with the original profile mapping.



DP_90007 Profile metadata deserialization failed with error <error message>.
Cause: Your Data Profiling Client or Server installation might be missing a DTD file.
The file may have been moved from the PowerCenter Client or Server
installation directory.
Action: If the PowerCenter Client or Server installation directory does not contain the
DTD file, reinstall the PowerCenter Server or Client. If the problem persists,
contact Informatica Technical Support.
or
Cause: The DTD file does not have read permission.
Action: Verify that you have read permission on the DTD file.

DP_90008 Commit to the database failed with error <database error>.


Cause: A database error prevents the PowerCenter Server from loading data into your
Data Profiling warehouse tables.
Action: Fix the database error indicated in the message and run the session again. If
Data Profiling warehouse tables are incomplete, recreate the Data Profiling
warehouse.

DP_90009 SQL Prepare failed for statement <SQL statement> with error <database error>.
Cause: The SQL query failed.
Action: Fix the database error indicated in the message, and rerun the Data Profiling
warehouse script in the Data Profiling installation directory. Commit your
SQL script after you run it.

DP_90010 SQL Bind failed for statement <SQL statement> with error <database error>.
Cause: The Data Profiling warehouse tables are invalid.
Action: Rerun the Data Profiling warehouse script in the Data Profiling installation
directory. Commit your SQL script after you run it.

DP_90011 SQL Execute failed for statement <SQL statement> with error <database error>.
Cause: The Data Profiling warehouse tables are invalid.
Action: Rerun the Data Profiling warehouse script in the Data Profiling installation
directory. Commit your SQL script after you run it.

DP_90012 SQL Fetch failed for statement <SQL statement> with error <database error>.
Cause: The Data Profiling warehouse tables are invalid.
Action: Rerun the Data Profiling warehouse script in the Data Profiling installation
directory. Commit your SQL script after you run it.



DP_90013 Fetching the key for type <key number> failed.
Cause: A Data Profiling warehouse table is missing surrogate keys. The following
table lists the surrogate keys that correspond to the key types in error messages:

Type  Surrogate Key
0     Profile run key
1     Column key
2     Source function key
3     Column function key
4     Orphan Analysis function key
5     Join Complexity Evaluation function key

Action: Rerun the Data Profiling warehouse script in the Data Profiling installation
directory. Commit your SQL script after you run it.

DP_90014 There must be exactly one input group and one output group for this
transformation.
Cause: The Custom transformation in the profile mapping has been modified and is
invalid.
Action: Regenerate the profile mapping.

DP_90015 The output port <port name> datatype should be long.


Cause: The Custom transformation in the profile mapping has been modified and is
invalid.
Action: Regenerate the profile mapping.

DP_90016 The target warehouse is already used for repository <repository name> with
GUID <global unique identifier>. Either drop the warehouse tables or use a
different one.
Cause: You tried to use two repositories for the same Data Profiling warehouse.
Action: Create a second Data Profiling warehouse. Also, create a new relational
database connection to the second Data Profiling warehouse in the Workflow
Manager.

DP_90017 The profile warehouse tables are not present in the target database connection.
Please check the target connection information.
Cause: The Data Profiling warehouse tables are not in the target database.
Action: Run the Data Profiling warehouse script in the Data Profiling installation
directory. Commit your SQL script after you run it.



DP_90019 Failed to get the folder information from repository.
Cause: Internal error.
Action: Contact Informatica Technical Support.

DP_90020 Failed to get the metadata extension <metadata extension> for the mapping.
Cause: You copied a profile mapping without copying the metadata extensions.
Action: Copy the metadata extensions from the original profile mapping to the copied
mapping and run the session again.
or
Cause: You are running a session with an original profile mapping that corresponds to
a data profile, but the metadata extensions are deleted.
Action: Regenerate the profile mapping.

DP_90022 Profile has some invalid functions. Session run failed.


Cause: You tried to run a profile session for a data profile. The data profile contains
invalid functions.
Action: Edit the data profile to modify or remove the invalid functions.

DP_90023 Warehouse table PMDP_WH_VERSION is missing. The target warehouse version
is incorrect. You may need to upgrade the warehouse.
Cause: The version of the Data Profiling warehouse does not match the version of
PowerCenter.
Action: Upgrade the Data Profiling warehouse using the upgrade script for your
database type.

DP_90024 The target warehouse uses schema version <version> and data version
<version>. You may need to upgrade the warehouse.
Cause: The version of the Data Profiling warehouse does not match the version of
PowerCenter.
Action: Upgrade the Data Profiling warehouse using the upgrade script for your
database type.

DP_90026 Warehouse table PMDP_WH_VERSION is missing or column SCHEMA_VERSION/
DATA_VERSION/DATABASE_TYPE is missing. You need to upgrade the
warehouse.
Cause: The version of the Data Profiling warehouse does not match the version of
PowerCenter.
Action: Upgrade the Data Profiling warehouse using the upgrade script for your
database type.



DP_90029 Source index of mapping parameter <mapping parameter> is invalid.
Cause: The profile mapping is modified.
Action: Regenerate the profile mapping.

DP_90030 Missing mapping parameter(s) for source with index <index>.


Cause: The profile mapping is modified.
Action: Regenerate the profile mapping.

DP_90031 Source Qualifier transformation <transformation name> was not found in this mapping.
Cause: The profile mapping is modified.
Action: Regenerate the profile mapping.

DP_90401 Failed to load List of Values from the file <filename>.


Cause: You tried to load a domain definition file for a List of Values domain type.
However, the file path might be incorrect.
Action: Verify your domain definition file path.
or
Cause: The file does not exist in the specified location.
Action: Make sure that the file exists in the specified location.
or
Cause: The file is empty.
Action: Make sure that the domain definition file contains valid entries.
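The causes above (wrong path, missing file, empty file) can be ruled out before the session runs. A minimal pre-flight sketch in Python — the file names, the one-value-per-line format, and the wording of the problems are illustrative assumptions, not part of the product:

```python
import os

def check_domain_file(path):
    """Pre-flight checks for a List of Values domain definition file.

    Returns a list of problems; an empty list means the file looks usable.
    """
    problems = []
    if not os.path.isfile(path):
        problems.append("file does not exist at the specified location")
        return problems
    with open(path) as f:
        # Assumes one domain value per line; blank lines are ignored.
        values = [line.strip() for line in f if line.strip()]
    if not values:
        problems.append("file is empty; the domain needs at least one value")
    return problems

# An empty file triggers the DP_90403-style complaint.
open("colors.txt", "w").close()
print(check_domain_file("colors.txt"))
print(check_domain_file("no_such_file.txt"))
```

Running checks like these before starting the profile session surfaces all three failure modes at once instead of one per failed run.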

DP_90403 List of Values domain cannot be empty.


Cause: You tried to load a domain definition file for a List of Values domain type.
However, the file might be empty.
Action: Make sure that your domain definition file contains valid content.

DP_90404 Failed to expand List of Values file <file path>.


Cause: You tried to load a domain definition file for a List of Values domain type.
However, the domain definition file path contains an unrecognized variable.
Action: Check the domain definition file path that you entered.

DP_90405 Failed to open Domain definition file <expanded file path>.


Cause: You tried to load a domain definition file for a List of Values domain type.
However, the domain definition file does not exist in the specified location.
Action: Make sure that you place the domain definition file in the file path location
that you enter.



DP_90406 Fetching Custom transformation input group failed.
Cause: Internal error.
Action: Contact Informatica Technical Support.

DP_90407 Fetching Custom transformation output group(s) failed.


Cause: Internal error.
Action: Contact Informatica Technical Support.

DP_90408 Unexpected number of input/output groups received.


Cause: You tried to run a session with a copied or deployed profile mapping that
contains reusable domains.
Action: You must use the original profile mapping.
or
Cause: You copied a profile transformation to use in another mapping.
Action: You must use the original profile mapping.

DP_90409 Could not fetch output port <port number>.


Cause: You tried to run a session with a copied or deployed profile mapping that
contains reusable domains.
Action: You must use the original profile mapping.
or
Cause: You copied a profile transformation to use in another mapping.
Action: You must use the original profile mapping.

DP_90410 Unexpected number of input/output ports received.


Cause: You tried to run a session with a copied or deployed profile mapping that
contains reusable domains.
Action: You must run a profile session with the original profile mapping if the
mapping contains reusable domains.
or
Cause: You copied a profile transformation to use in another mapping.
Action: You must run a profile session with the original profile mapping.



DP_90411 Corrupt input received.
Cause: You tried to run a session with a copied or deployed profile mapping that
contains reusable domains.
Action: You must run a profile session with the original profile mapping if the
mapping contains reusable domains.
or
Cause: You copied a profile transformation to use in another mapping.
Action: You must run a profile session with the original profile mapping.

DP_90603 Regular expression is invalid.


Cause: The regular expression is not valid. As a result, the session failed.
Action: Edit the Regular Expression function in the data profile and verify the
expression against your source data.
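Because the session fails outright on a bad expression, it can save a run to compile the expression and spot-check it against sample source values first. A sketch in Python, whose re module accepts most perl-style syntax (the ZIP-code pattern and sample values are invented for illustration):

```python
import re

def check_expression(pattern, samples):
    """Compile a perl-style regular expression and report which sample
    values match. Raises re.error if the expression itself is invalid."""
    compiled = re.compile(pattern)
    return {s: bool(compiled.fullmatch(s)) for s in samples}

# A U.S. ZIP-code-style pattern checked against sample source data.
result = check_expression(r"\d{5}(-\d{4})?", ["10001", "10001-1234", "ABCDE"])
print(result)

# An invalid expression is caught up front instead of failing the session.
try:
    re.compile(r"[0-9")
except re.error as exc:
    print("invalid expression:", exc)
```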

DP_90604 Unexpected condition encountered.


Cause: Internal error.
Action: Contact Informatica Technical Support.

DP_90802 No input port(s) found for output port <port>.


Cause: The profile mapping is modified.
Action: Regenerate the profile mapping.

DP_90803 Invalid number of input port(s) associated with output port <port>.
Cause: The profile mapping is modified.
Action: Regenerate the profile mapping.

DP_90804 There are input port(s) with no corresponding output port.


Cause: The profile mapping is modified.
Action: Regenerate the profile mapping.



Appendix D

Glossary

This appendix contains information on the following topics:


♦ Glossary Terms, 170

Glossary Terms
Aggregate functions
Functions that calculate an aggregate value for a numeric or string column of a profile
source.

analytic workflow
A list of PowerAnalyzer reports linked together in a hierarchy consisting of a primary report
and related workflow reports. See PowerAnalyzer Data Profiling reports.

auto profile
A data profile containing a predetermined set of functions for profiling source data.

auto profile report


A PowerCenter Data Profiling report that displays information about source data based on the
functions in an auto profile.

Average Value function


A column-level aggregate function that calculates the average value of the rows in the
source column.

Business Rule Validation column-level function


A column-level function that calculates the number of rows in a single source column that
satisfy a business rule and the number of rows that do not.

Business Rule Validation source-level function


A source-level function that calculates the number of rows for one or more columns of a
source that satisfy a business rule and the number of rows for one or more columns of a source
that do not.

Candidate Key Evaluation function


A source-level function that calculates the number of duplicate values in one or more columns
of a source.
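The calculation behind this function can be pictured with a small sketch — the rows and column names below are hypothetical, and the real function runs inside a generated profile mapping:

```python
from collections import Counter

def duplicate_count(rows, columns):
    """Count rows whose values in the given columns also occur in
    at least one other row. A good candidate key has a count of zero."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return sum(n for n in counts.values() if n > 1)

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},  # duplicate email value
]
print(duplicate_count(rows, ["id"]))     # id has no duplicates
print(duplicate_count(rows, ["email"]))  # two rows share an email
```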

column-level functions
Functions that perform calculations on one column of a source, source group, or mapplet
group.

custom domain
A domain you create to validate source values or to infer patterns from source data. You can
create a custom domain when you create a Domain Validation function. See domain and
Domain Validation function.



custom profile
A data profile for which you define functions to profile source data.

custom profile report


A PowerCenter Data Profiling report that displays information about source data based on the
functions in a custom profile.

data connector
An object that PowerAnalyzer requires to connect to a data source and read data for Data
Profiling reports. Typically, PowerAnalyzer uses the system data connector to connect to
all the data sources required for reports.

data profile
A profile of source data in PowerCenter. A data profile contains functions that perform
calculations on the source data.

Data Profiling views


Views to create metrics and attributes in the Data Profiling schema. PowerCenter Data
Profiling reports and PowerAnalyzer Data Profiling reports use these metrics and attributes.
You can also use these views to create reports in your BI tool.

Data Profiling warehouse


A relational database that stores data profile information from profile sessions.

Distinct Value Count function


A column-level function that returns the number of distinct values for a column.

domain
A set of all valid values for a source column. A domain can contain a regular expression, list of
values, or the name of a file that contains a list of values. Informatica provides prepackaged
domains. You can also create your own domains.

Domain Definition Filename domain


A domain defined by an external file that contains a list of values.

Domain Inference function


A column-level function that reads all values in the column and infers a domain that fits
the data. The function determines whether the values fit a list of values derived from the
column values or a pattern that describes the shape of the source data.
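As a rough illustration of the decision this function makes — the distinct-value threshold and the character-shape notation below are invented, not the product's actual inference rules:

```python
import re

def infer_domain(values, max_list_size=10):
    """Toy domain inference: prefer a list of values when the column has
    few distinct values; otherwise describe each value's character shape
    (letters -> X, digits -> 9) and report a pattern if all values share it."""
    distinct = sorted(set(values))
    if len(distinct) <= max_list_size:
        return ("list of values", distinct)
    shapes = {re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "X", v)) for v in values}
    if len(shapes) == 1:
        return ("pattern", shapes.pop())
    return ("no domain inferred", None)

print(infer_domain(["red", "green", "red", "blue"]))
print(infer_domain([f"{n:05d}-{n % 100:02d}" for n in range(50)]))
```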



Domain Validation function
A column-level function that calculates the number of values in the profile source column
that fall within a specified domain and the number of values that do not. The Domain
Validation function requires that you specify a domain.
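The in-domain and out-of-domain counts can be sketched as follows; the domain here is a plain set of values, while the product also supports regular-expression and file-based domains (the state codes are hypothetical sample data):

```python
def domain_validation(values, domain):
    """Count column values inside and outside a domain.

    Returns (in_domain, out_of_domain); values not in the set, including
    empty strings, count as out of domain.
    """
    inside = sum(1 for v in values if v in domain)
    return inside, len(values) - inside

states = {"CA", "NY", "TX"}
print(domain_validation(["CA", "NY", "ZZ", "TX", ""], states))
```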

group-by columns
Columns by which you want to group data for a custom profile. When you configure a
function, you can determine the column by which you want to group the data.

interactive session
A profile session that you run from the Profile Manager.

intersource functions
Functions that perform calculations on two or more sources, source groups, or mapplet
groups.

Join Complexity Evaluation function


An intersource function that measures the columns in multiple sources that do not have a join
condition.

List of Values domain


A domain defined by a comma-delimited list of values.

Maximum Value aggregate function


A column-level aggregate function that calculates the maximum value of the rows in the
source column.

Minimum Value aggregate function


A column-level aggregate function that calculates the minimum value of the rows in the
source column.

non-reusable domain
A domain that applies to one Domain Validation function. See also reusable domain.

NULL Value Count aggregate function


A column-level aggregate function that counts the number of rows with NULL values in
the source column.

Orphan Analysis function


An intersource function that compares the values of columns in two sources.



persistent session
A session stored in the repository that you run from the Profile Manager or the Workflow
Manager. You can use a persistent session to run a profile mapping more than once.

PowerAnalyzer Data Profiling reports


Reports you can view from PowerAnalyzer after running a profile session. PowerAnalyzer
reports provide a time-dimensional view of your data and information about rejected rows in
your profile results.

PowerCenter Data Profiling reports


Reports you can view from the Profile Manager after running a profile session. PowerCenter
reports provide information based on the last profile session run.

prepackaged domains
Domains that Informatica provides to verify data such as phone numbers, postal codes,
and email addresses. See domain.

Profile Manager
A tool in the Designer that manages data profiles. Use the Profile Manager to set default data
profile options, work with data profiles in your repository, run profile sessions, view profile
results, and view sources and mapplets with at least one profile defined for them.

profile mapping
A mapping the Designer generates when you create a data profile. The PowerCenter
repository stores the data profile and the associated mapping.

profile session
A session for a profile mapping that gathers information about your source data. The Data
Profiling warehouse stores the results of profile sessions. See persistent session and interactive
session.

Redundancy Evaluation function


A source-level function that calculates the number of duplicate values in one or more columns
of the source.

Regular Expression domain


A domain defined by a regular expression that matches a set of valid values.

reusable domain
A domain you can apply to multiple Domain Validation functions in one or more data
profiles. See domain.



Row Count function
A source-level function that counts the number of rows read from the source during a profile
session.

Row Uniqueness function


A source-level function that calculates the number of unique and duplicate values based on
the columns selected. You can profile all columns in the source row or choose individual
columns to profile.

source-level function
A function that performs calculations on one or more columns of a source, source group,
or mapplet group.

temporary session
A session that you run from the Profile Manager and that is not stored in the repository.

verbose mode
An option to view verbose data that the PowerCenter Server writes to the Data Profiling
warehouse during a profile session. You can specify the type of verbose data to load when you
configure the data profile.



Index

A
ABAP program
  generating for SAP R/3 sources 33
adding
  functions 39
aggregate data
  profiling 52
Aggregate functions
  description 119
auto profile reports
  description 79
auto profiles
  auto profile reports 79
  creating 34
  deleting 49
  editing 47
  running a session after creating an auto profile 67
automatic random sampling
  description 71

B
buffer block size
  See Workflow Administration Guide
Business Rule Validation function
  column-level function description 113
  source-level function description 106

C
Candidate Key Evaluation function
  description 108
  profiling primary key columns 108
checking in
  profile mappings 27
COBOL syntax
  converting to perl syntax 96
code pages
  configuring compatibility 152
  Data Profiling report requirements 152
  rules for Domain Definition Filename domain 152
  rules for List of Values domain 152
  specifying for a Domain Definition Filename domain 99
  specifying for a List of Values domain 99
  syntax for Domain Definition Filename domain 99
  syntax for List of Values domain 99
column-level functions
  Aggregate functions 119
  Business Rule Validation 113
  description 113
  Distinct Value Count 121
  Domain Inference 117
  Domain Validation 115

configuring
  default data profile options 26
  functions 41
  sessions 64
copying
  profile mappings 51
creating
  auto profiles 34
  custom profiles 38
  data connectors 21
  Data Profiling warehouse on IBM DB2 12
  Data Profiling warehouse on Informix 12
  Data Profiling warehouse on Microsoft SQL Server 13
  Data Profiling warehouse on Oracle 13
  Data Profiling warehouse on Sybase 13
  Data Profiling warehouse on Teradata 13
  Domain Definition Filename domain 98
  List of Values domain 92
  Regular Expression domain 94
custom domains
  description 92
custom profile reports
  viewing 81
custom profiles
  creating 38
  deleting 49
  editing 47
  running a session 67

D
data connectors
  adding the Data Profiling data source 23
  creating 21
  primary data source 22
  properties 22
data profiles
  creating a custom profile 38
  creating an auto profile 34
  deleting 49
  editing 47
Data Profiling
  during mapping development 2
  during production 2
  overview 2
Data Profiling reports
  auto profile reports 79
  custom profile reports 81
  importing into PowerAnalyzer 16
  requirements for viewing from the Profile Manager 152
  sample verbose summary report 55
  viewing in PowerCenter 83
Data Profiling schemas
  importing 16
Data Profiling warehouse
  configuring a data source 20
  configuring a relational database connection 15
  creating 12
  overview 3
  upgrading 13
data samples
  profiling 68
  profiling relational data samples 70
data sources
  adding to data connector 23
  configuring for the Data Profiling warehouse 20
  primary in data connector 22
  supported databases 20
datatypes
  Raw 113, 123
default data profile options
  configuring 26
deleting
  data profiles 49
  domains 102
Distinct Value Count function
  description 121
documentation
  conventions xvi
  description xv
  online xvi
Domain Definition Filename domain
  code page compatibility 152
  creating 98
  description 92
  syntax for specifying code pages 99
Domain Inference function
  configuring Domain Inference settings 118
  description 117
Domain Validation function
  description 115
domains
  creating 90
  custom 92
  deleting 102
  description 90
  Domain Definition Filename domain 92
  editing 101

  List of Values domain 92
  non-reusable 92
  prepackaged 91
  Regular Expression domain 92
  reusable 92

E
editing
  data profiles 47
  domains 101
error messages
  designer 157
  DP codes 162
  overview 156
  session logs 162

F
functions
  adding 39
  column-level functions 113
  configuring 41
  configuring for verbose mode 43
  configuring group-by columns 42
  intersource functions 123
  source-level functions 105

G
generating
  ABAP program for SAP R/3 sources 33
  profile mappings 45
glossary
  terms 170
group-by columns
  configuring for functions 42

H
historical data
  sampling 72

I
IBM DB2
  creating a Data Profiling warehouse 12
  upgrading a Data Profiling warehouse 14
IMA
  views 128
importing
  Data Profiling schemas and reports 16
Informatica
  documentation xv
  Webzine xvii
Informix
  creating a Data Profiling warehouse 12
  upgrading a Data Profiling warehouse 14
installing
  PowerAnalyzer Data Profiling reports 16
  XML scripts 16
interactive profiling
  See interactive sessions
interactive sessions
  configuring 45
  definition 63
  monitoring in the Profile Manager 67
  monitoring with the Workflow Monitor 67
  viewing the session log 67
intersource functions
  description 123
  Join Complexity Evaluation 125
  Orphan Analysis 123

J
Join Complexity Evaluation function
  description 125

K
keys
  profiling primary key columns 108

L
List of Values domain
  code page compatibility requirements 152
  creating 92
  description 92
  syntax for specifying code pages 99
localization
  specifying for a Domain Definition Filename domain 99
  specifying for a List of Values domain 99

M
manual random sampling
  description 70
mappings
  generating a profile mapping 45
  prefix for the profile mapping name 28
mapplets
  extending data profiling functions 52
  joining output data from two sources 56
  merging mapplet output data 58
  profiling 32
Microsoft SQL Server
  creating a Data Profiling warehouse 13
  upgrading a Data Profiling warehouse 14
modifying
  profile mappings 51
monitoring
  interactive sessions 67
multiple sources
  profiling 33

N
non-reusable domains
  description 92
normal mode
  running temporary sessions 63

O
Oracle
  creating a Data Profiling warehouse 13
  upgrading a Data Profiling warehouse 14
Orphan Analysis function
  description 123

P
perl syntax
  using in a Regular Expression domain 94
persistent sessions
  definition 63
  running from the Profile Manager 63
  running in real time 74
PowerAnalyzer
  adding data sources 23
  configuring a data source 20
  creating data connectors 21
  importing Data Profiling schemas and reports 16
PowerAnalyzer Data Profiling reports
  description 78
  installing 16
  viewing verbose data 85
PowerCenter Data Profiling reports
  auto profile reports 79
  custom profile reports 81
  description 78
PowerCenter Server
  performing automatic random sampling 71
  performing manual random sampling 70
prepackaged domains
  description 91
primary data sources
  data connector 22
primary keys
  profiling with the Candidate Key Evaluation function 108
profile functions
  description 33
  extending functionality 52
Profile Manager
  checking in profile mappings 6
  creating custom profiles 6
  creating domains 90
  deleting data profiles 6
  editing data profiles 6
  Profile View 7
  regenerating profile mappings 6
  running interactive sessions 6
  running persistent sessions 63
  running sessions 67
  running temporary sessions 63
  Source View 8
  using 6
  viewing data profile details 6
profile mappings
  checking in 6
  copying 51
  copying with reusable domains 51
  modifying 51
  prefix for the profile mapping name 28
  regenerating 6
profile sessions
  See also interactive sessions
  See also persistent sessions
  configuring to use data samples 68
  prefix for the profile session name 28
  troubleshooting 75
Profile View
  in Profile Manager 7
Profile Wizard
  configuring sessions 64
  creating a custom profile 38
  creating an auto profile 34
  creating domains 90
profile workflows
  prefix for the profile workflow name 28
  requirements for creating in the Workflow Manager 74
profiling
  aggregate data 52
  data samples 68
  eligible sources 32
  mapplets 32
  relational data samples 70
  SAP R/3 sources 33
  sources with matching ports 55, 57

R
Raw datatype
  verbose mode sessions 113, 123
real-time sessions
  running 74
Redundancy Evaluation function
  description 109
Regular Expression domain
  creating 94
  description 92
  using perl syntax 94
relational data samples
  delegating to the database 71
  improving data accuracy for historical data 72
  increasing performance 73
  sampling as the PowerCenter Server reads data 70
relational database connections
  See also Workflow Administration Guide
  configuring for the Data Profiling warehouse 15
reports
  auto profile reports 79
  custom profile reports 81
  PowerAnalyzer Data Profiling reports 85
  PowerCenter Data Profiling reports 79
reusable domains
  copying a profile mapping 51
  description 92
Row Count function
  description 105
Row Uniqueness function
  description 110

S
SAP R/3 sources
  profiling 33
scripts
  See XML scripts
session configuration
  enabling 39
session logs
  description 162
  viewing for interactive sessions 67
sessions
  See also interactive sessions
  See also persistent sessions
  configuring 64
  creating in the Workflow Manager 74
  running after creating a custom profile 67
  running for auto profiles 67
  running from the Profile Manager 67
Source View in Profile Manager 8
source-level functions
  Business Rule Validation 106
  Candidate Key Evaluation 108
  description 105
  loading the Raw datatype for verbose mode 113, 123
  Redundancy Evaluation 109
  Row Count 105
  Row Uniqueness 110
sources
  eligible sources for profiling 32
  profiling multiple sources 33
SQL syntax
  converting to perl syntax 96
Sybase
  creating a Data Profiling warehouse 13
  upgrading a Data Profiling warehouse 15

T
temporary sessions
  definition 63

  running from the Profile Manager 63
  running in normal mode 63
Teradata
  creating a Data Profiling warehouse 13
  upgrading a Data Profiling warehouse 15
troubleshooting
  profile sessions 75

U
upgrading
  Data Profiling warehouse on IBM DB2 14
  Data Profiling warehouse on Informix 14
  Data Profiling warehouse on Microsoft SQL Server 14
  Data Profiling warehouse on Oracle 14
  Data Profiling warehouse on Sybase 15
  Data Profiling warehouse on Teradata 15

V
verbose data
  description 4
  viewing in PowerAnalyzer Data Profiling reports 85
verbose mode
  configuring a function 43
verbose mode sessions
  Raw datatype 113, 123
versioned objects
  See also Repository Guide
  checking in profile mappings 6
viewing
  auto profile reports 79
  custom profile reports 81
  PowerAnalyzer Data Profiling reports 85
  PowerCenter Data Profiling reports 83
views
  descriptions 128
  list of 128

W
warehouse
  See Data Profiling warehouse
webzine xvii
Workflow Manager
  creating a session and workflow 74
Workflow Monitor
  monitoring interactive sessions 67
workflows
  creating in the Workflow Manager 74

X
XML scripts
  installing 16

