At the schema level, when a fact table records only the count of occurrences of events, and nothing is aggregated as a measure at the fact table, we call such a fact table (one that has no measures, only events/occurrences) a factless fact table.
Generally, we use a factless fact table when we want to capture events at the information level only, not at the calculation level: just a record of an event that happened over a period.
Fact Table
The centralized table in a star schema is called the FACT table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys.
In the example figure 1.6, "Sales Dollar" is a fact (measure), and it can be added across several dimensions. Fact tables store different types of measures: additive, non-additive, and semi-additive.
Measure Types
Additive - Measures that can be added across all dimensions.
Non-Additive - Measures that cannot be added across any dimension.
Semi-Additive - Measures that can be added across some dimensions but not others.
A fact table might contain either detail level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables).
In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called Factless Fact tables.
iGATE Internal
Determine the lowest level of summary in a fact table(sales dollar).
Example of a Fact Table with an Additive Measure in Star Schema: Figure 1.6
In the example figure 1.6, the sales fact table is connected to the location, product, time, and organization dimensions. The measure "Sales Dollar" in the sales fact table can be added across all dimensions independently or in a combined manner, as explained below.
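As a hedged sketch of what "added across several dimensions" means (the rows and dimension values below are invented for illustration, not taken from figure 1.6), an additive measure can be summed over any subset of the fact table's dimensions:

```python
from collections import defaultdict

# Hypothetical sales fact rows: (location, product, year, sales_dollar).
fact_rows = [
    ("NY", "Laptop", 2004, 100.0),
    ("NY", "Phone",  2004, 50.0),
    ("CA", "Laptop", 2005, 200.0),
]

def total_sales(rows, by):
    """Sum the additive 'Sales Dollar' measure grouped by any subset of
    dimension positions (location=0, product=1, time=2)."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in by)] += row[3]
    return dict(totals)

by_location = total_sales(fact_rows, by=[0])     # totals per location
by_loc_prod = total_sales(fact_rows, by=[0, 1])  # totals per location and product
```

The same function works for any dimension combination, which is exactly the property that makes a measure additive.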
--
In the Snowflake schema, the example diagram shown below has 4 dimension tables, 4 lookup tables, and 1 fact table. The reason is that the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and shown separately. In OLAP, this snowflake approach increases the number of joins and degrades performance when retrieving data. A few organizations normalize their dimension tables this way to save space, but since dimension tables occupy comparatively little space, the snowflake approach is often avoided.
Example of Snowflake Schema: Figure 1.7
Star schema
General Information
In general, an organization is started to earn money by selling a product or by providing a service for the product. An organization may be at one place or may have several branches.
In a star schema, the fact table sits at the center, with foreign keys pointing towards the dimension tables. The advantages of the star schema are easy slicing of data, improved performance, and easy understanding of the data.
Glossary:
Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can
be used to define data aggregation; for example, in a time dimension, a hierarchy might
be used to aggregate data from the Month level to the Quarter level, from the Quarter
level to the Year level. A hierarchy can also be used to define a navigational drill path,
regardless of whether the levels in the hierarchy represent aggregated totals or not.
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that
represents data at the Month, Quarter, and Year levels.
Fact Table
A table in a star schema that contains facts and is connected to dimensions. A fact table
typically has two types of columns: those that contain facts and those that are foreign
keys to dimension tables. The primary key of a fact table is usually a composite key that
is made up of all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact table
usually contains facts with the same level of aggregation.
Example of Star Schema: Figure 1.6
In the example figure 1.6, the sales fact table is connected to the location, product, time, and organization dimensions. This shows that data can be sliced across all dimensions, and it is also possible for the data to be aggregated across multiple dimensions. "Sales Dollar" in the sales fact table can be calculated across all dimensions independently or in a combined manner, as explained below.
--
Data Warehouse & Data Mart
A data warehouse is a relational/multidimensional database that is designed for query and
analysis rather than transaction processing. A data warehouse usually contains historical
data that is derived from transaction data. It separates analysis workload from transaction
workload and enables a business to consolidate data from several sources.
Data Mart - A data mart is a subset of the data warehouse, and it supports a particular region, business unit, or business function.
Data warehouses and data marts are built on dimensional data modeling, where fact tables are connected with dimension tables. This is most useful for users accessing data, since the database can be visualized as a cube of several dimensions. A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.
Data Mart: A data mart is a subset of data warehouse that is designed for a particular
line of business, such as sales, marketing, or finance. In a dependent data mart, data can
be derived from an enterprise-wide data warehouse. In an independent data mart, data can
be collected directly from sources.
--
Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product price changes over time; people change their names for some reason; country and state names may change over time. These are a few examples of Slowly Changing Dimensions, since changes happen to them over a period of time.
Slowly Changing Dimensions are often categorized into three types, namely Type 1, Type 2, and Type 3. The following section deals with how to capture and handle these changes over time.
The "Product" table mentioned below contains a product named Product1, with Product ID being the primary key. In the year 2004, the price of Product1 was $150, and over time, Product1's price changed from $150 to $350. With this information, let us explain the three types of Slowly Changing Dimensions.
Product
The problem with the above-mentioned data structure is that "Product ID" cannot store duplicate values for Product1, since "Product ID" is the primary key. Also, the current data structure doesn't clearly specify the effective date and expiry date of Product1, that is, when the change to its price happened. So, it would be better to change the current data structure to overcome the above primary key violation.
Product
Product ID(PK) | Year | Effective DateTime(PK) | Product Name | Product Price | Expiry DateTime
1              | 2004 | 01-01-2004 12.00AM     | Product1     | $150          | 12-31-2004 11.59PM
1              | 2005 | 01-01-2005 12.00AM     | Product1     | $250          |
In the changed Product table structure, "Product ID" and "Effective DateTime" form a composite primary key, so there would be no violation of the primary key constraint. The addition of the new columns "Effective DateTime" and "Expiry DateTime" provides information about the product's effective and expiry dates, which adds more clarity and enhances the scope of this table. The Type 2 approach may need additional space in the database, since for every changed record an additional row has to be stored. But since dimensions are not that big in the real world, the additional space is negligible.
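The Type 2 mechanics described above (expire the current row, then insert a new row) can be sketched in Python. The in-memory list of dicts below stands in for the Product table; it is an illustration of the idea, not a warehouse implementation:

```python
from datetime import datetime

# An in-memory stand-in for the Type 2 Product table. A row with
# expiry None is the currently active version.
product_dim = [
    {"product_id": 1, "name": "Product1", "price": 150,
     "effective": datetime(2004, 1, 1), "expiry": None},
]

def apply_type2_change(rows, product_id, new_price, change_time):
    """Close out the active row, then append a new version row.
    History is preserved because nothing is overwritten."""
    for row in rows:
        if row["product_id"] == product_id and row["expiry"] is None:
            row["expiry"] = change_time
    rows.append({"product_id": product_id, "name": "Product1",
                 "price": new_price, "effective": change_time,
                 "expiry": None})

apply_type2_change(product_dim, 1, 250, datetime(2005, 1, 1))
# The table now holds both the $150 (2004) and $250 (2005) versions.
```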
Product
The problem with the Type 3 approach is that, over the years, if the product price continuously changes, the complete history may not be stored; only the latest change will be stored. For example, in year 2006, if Product1's price changes to $350, then we would not be able to see the complete history of the 2004 price, since the old value would have been updated with the 2005 product information.
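The Type 3 loss of history can be sketched the same way, again with an invented, in-memory stand-in for the Product table that assumes a "current price"/"previous price" column pair:

```python
# Type 3 keeps only a fixed pair of columns: current and previous price.
product = {"product_id": 1, "name": "Product1",
           "current_price": 250, "previous_price": 150,
           "effective_year": 2005}

def apply_type3_change(row, new_price, year):
    """Shift current into previous and overwrite current. Whatever was
    in 'previous_price' before the shift is lost for good."""
    row["previous_price"] = row["current_price"]
    row["current_price"] = new_price
    row["effective_year"] = year

apply_type3_change(product, 350, 2006)
# Now only $350 and $250 are visible; the 2004 price of $150 is gone.
```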
Product
--
Time Dimension
In a relational data model, for normalization purposes, year lookup, quarter lookup, month lookup, and week lookups are not merged as a single table. In dimensional data modeling (star schema), these tables would be merged into a single table called TIME DIMENSION for performance and for slicing data.
This dimension helps to find the sales done on a daily, weekly, monthly, and yearly basis. We can also do trend analysis by comparing this year's sales with the previous year's, or this week's sales with the previous week's.
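A merged time dimension row can be derived from a single calendar date, as this hedged Python sketch shows (the column names are illustrative, not taken from the document's figures):

```python
from datetime import date

def time_dimension_row(d):
    """Compute the year/quarter/month/week attributes that would live
    in a single merged TIME DIMENSION row for date d."""
    return {
        "date": d,
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,   # months 1-3 -> Q1, etc.
        "month": d.month,
        "week_of_year": d.isocalendar()[1],  # ISO week number
    }

row = time_dimension_row(date(2005, 2, 1))
# row["year"] == 2005, row["quarter"] == 1, row["month"] == 2
```

With every date pre-expanded this way, daily, weekly, monthly, and yearly sales roll-ups become simple group-bys on one table.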
Year Lookup
Quarter Lookup
Month Lookup
5 May 1/1/2005 11:23:31 AM
6 June 1/1/2005 11:23:31 AM
7 July 1/1/2005 11:23:31 AM
8 August 1/1/2005 11:23:31 AM
9 September 1/1/2005 11:23:31 AM
10 October 1/1/2005 11:23:31 AM
11 November 1/1/2005 11:23:31 AM
12 December 1/1/2005 11:23:31 AM
Week Lookup
Time Dimension
Organization Dimension
In a relational data model, for normalization purposes, corporate office lookup, region lookup, branch lookup, and employee lookups are not merged as a single table. In dimensional data modeling (star schema), these tables would be merged into a single table called ORGANIZATION DIMENSION for performance and for slicing data.
This dimension helps us find the products sold or serviced within the organization by the employees. In any industry, we can calculate the sales on a region, branch, or employee basis. Based on performance, an organization can provide incentives to employees and subsidies to branches to increase further sales.
Product Dimension
In a relational data model, for normalization purposes, product category lookup, product sub-category lookup, product lookup, and product feature lookups are not merged as a single table. In dimensional data modeling (star schema), these tables would be merged into a single table called PRODUCT DIMENSION for performance and data slicing requirements.
Example of Product Dimension: Figure 1.9
Dimension Table
A dimension table is one that describes the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables.
Location Dimension
In relational data modeling, for normalization purposes, country lookup, state lookup, county lookup, and city lookups are not merged as a single table. In dimensional data modeling (star schema), these tables would be merged into a single table called LOCATION DIMENSION for performance and data slicing requirements. This location dimension helps to compare the sales in one region with another. We may see good sales profit in one region and a loss in another. If it is a loss, the reasons may be a new competitor in that area, failure of our marketing strategy, etc.
Relational data modeling is used in OLTP systems, which are transaction oriented, and dimensional data modeling is used in OLAP systems, which are analytical. In a data warehouse environment, the staging area is designed on OLTP concepts, since data has to be normalized, cleansed, and profiled before being loaded into a data warehouse or data mart. In an OLTP environment, lookups are stored as independent tables in detail, whereas these independent tables are merged into a single dimension in an OLAP environment like a data warehouse.
Relational vs Dimensional
Relational Data Modeling                                            | Dimensional Data Modeling
Data is stored in an RDBMS                                          | Data is stored in an RDBMS or multidimensional databases
Tables are units of storage                                         | Cubes are units of storage
Data is normalized and used for OLTP; optimized for OLTP processing | Data is denormalized and used in data warehouses and data marts; optimized for OLAP
Several tables and chains of relationships among them               | Few tables; fact tables are connected to dimension tables
Volatile (several updates) and time variant                         | Non-volatile and time invariant
SQL is used to manipulate data                                      | MDX is used to manipulate data
Detailed level of transactional data                                | Summary of bulky transactional data (aggregates and measures) used in business decisions
Normal reports                                                      | User-friendly, interactive, drag-and-drop multidimensional OLAP reports
Dimensional data modeling comprises one or more dimension tables and fact tables. Good examples of dimensions are location, product, time, promotion, organization, etc. Dimension tables store records related to that particular dimension, and no facts (measures) are stored in these tables.
For example, a Product dimension table will store information about products (product category, product sub-category, product, and product features), and a Location dimension table will store information about locations (country, state, county, city, zip). A fact (measure) table contains measures (sales gross value, total units sold) and dimension columns. These dimension columns are actually foreign keys from the respective dimension tables.
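The foreign-key relationship between a fact row and its dimension tables can be pictured with a small Python sketch (the keys, names, and measures below are invented for illustration):

```python
# Hypothetical dimension tables, keyed by surrogate keys.
product_dim  = {10: {"name": "Laptop", "category": "Electronics"}}
location_dim = {20: {"city": "Albany", "state": "NY", "country": "USA"}}

# Fact rows carry only foreign keys plus measures.
sales_fact = [{"product_key": 10, "location_key": 20,
               "sales_dollar": 500.0, "units_sold": 2}]

# Joining a fact row back to its dimensions yields a readable report row.
fact = sales_fact[0]
report_row = {
    "product": product_dim[fact["product_key"]]["name"],
    "city": location_dim[fact["location_key"]]["city"],
    "sales_dollar": fact["sales_dollar"],
}
```

Keeping only keys and measures in the fact table is what lets it stay narrow while the descriptive attributes live once in each dimension.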
Example of Dimensional Data Model: Figure 1.6
In the example figure 1.6, the sales fact table is connected to the location, product, time, and organization dimensions. This shows that data can be sliced across all dimensions, and it is also possible for the data to be aggregated across multiple dimensions. "Sales Dollar" in the sales fact table can be calculated across all dimensions independently or in a combined manner, as explained below.
In dimensional data modeling, hierarchies for the dimensions are stored in the dimension table itself. For example, the location dimension will have all of its hierarchy levels from country and state down through county to city. There is no need for individual hierarchical lookups like country lookup, state lookup, county lookup, and city lookup to be shown in the model.
--
Logical vs Physical Data Modeling
Logical Data Model                                         | Physical Data Model
Represents business information and defines business rules | Represents the physical implementation of the model in a database
Entity                                                     | Table
Attribute                                                  | Column
Primary Key                                                | Primary Key Constraint
Alternate Key                                              | Unique Constraint or Unique Index
Inversion Key Entry                                        | Non Unique Index
Rule                                                       | Check Constraint, Default Value
Relationship                                               | Foreign Key
Definition                                                 | Comment
Reports:
» Generate reports from the data model.
Review:
» Review the data model with the functional and technical teams.
Creation of database:
» Create SQL code from the data model and coordinate with DBAs to create the database.
» Check that the data models and databases are in sync.
Support & Maintenance:
» Assist developers, ETL, BI team and end users to understand the data model.
» Maintain change log for each data model.
These are the general guidelines to create a standard data model; in practice, a data model may not be created in the same sequential manner as shown above. Based on the enterprise's requirements, some of the steps may be excluded, or others included in addition to these.
Sometimes, a data modeler may be asked to develop a data model based on an existing database. In that situation, the data modeler has to reverse engineer the database and create the data model.
--
http://www.learndatamodeling.com/
Informatica:
Informatica is a widely used ETL tool for extracting source data and loading it into the target after applying the required transformations. In the following section, we will try to explain the usage of Informatica in a data warehouse environment with an example. We are not going into the details of data warehouse design here; this tutorial simply provides an overview of how INFORMATICA can be used as an ETL tool.
Note: The exchanges/companies that are explained here are for illustrative purposes only.
Bombay Stock Exchange (BSE) and National Stock Exchange (NSE) are two major stock exchanges in India, on which the shares of ABC Corporation and XYZ Private Limited are traded Monday through Friday, except holidays. Assume that a software company, "KLXY Limited", has taken on the project to integrate the data between the two exchanges, BSE and NSE.
In order to complete this task of integrating the raw data received from NSE & BSE, KLXY Limited allots responsibilities to data modelers, DBAs, and ETL developers. Many IT professionals may be involved during the entire ETL process, but we are highlighting the roles of these three only, for easy understanding and better clarity.
Data Modelers analyze the data from these two sources(Record Layout 1 &
Record Layout 2), design Data Models, and then generate scripts to create
necessary tables and the corresponding records.
DBAs create the databases and tables based on the scripts generated by the data
modelers.
ETL developers map the extracted data from source systems and load it to target
systems after applying the required transformations.
Overall Process:
The complete process of data transformation from the external sources to our target data warehouse is explained in the following sections. Each section will be explained in detail.
Data from the external sources (source 1 - .CSV (comma separated) file; source 2 - Oracle table)
Source(s) table layout details
Look up table details
Target table layout details
Defining Source table and target table in Informatica
Implementing extraction mapping in Informatica (Mapping Designer)
Implementing transformation and loading mapping in Informatica
Workflow creation in Informatica (Workflow Manager)
Verifying records through Informatica (Workflow Monitor)
http://learnbi.com/informatica2.htm
ETL Testing:
Testing is an important phase in the project lifecycle. A structured, well-defined testing methodology involving comprehensive unit testing and system testing not only ensures a smooth transition to the production environment but also a system without defects.
The testing phase can be broadly classified into the following categories:
Integration Testing
System Testing
Regression Testing
Performance Testing
Operational Qualification
Test Strategy:
A test strategy is an outline that describes the test plan. It is created to inform the project team of the objective and high-level scope of the testing process. This includes the testing objective, methods of testing, resources, estimated timelines, environment, etc.
The test strategy is created based on the high-level design document. A test strategy needs to be created for each testing component. Based on this strategy, the testing process is detailed out in the test plan.
Test Planning:
Test planning is key to successfully implementing the testing of a system. The deliverable is the actual "Test Plan". A software project test plan is a document that describes the purpose, system overview, approach to testing, test planning, defect tracking, test environment, test prerequisites, and references.
A key prerequisite for preparing a successful test plan is having approved (functional and non-functional) requirements. Without frozen requirements and specifications, the test plan will lack validation for the project's testing efforts.
The process of preparing a test plan is a useful way to work out how testing of a particular system can be carried out within the given timeline, provided the test plan is thorough enough.
The test plan outlines and defines the strategy and approach taken to perform end-to-end
testing on a given project. The test plan describes the tasks, schedules, resources, and
tools for integrating and testing the software application. It is intended for use by project
personnel in understanding and carrying out prescribed test activities and in managing
these activities through successful completion.
To verify the functional and non functional requirements are met.
To coordinate resources, environments into an integrated schedule.
To provide a plan that outlines the contents of detailed test cases scenarios for
each of the four phases of testing.
To determine a process for communicating issues resulting from the test phase.
The test plan, thus, summarizes and consolidates information that is necessary for the
efficient and effective conduct of testing. Design Specification, Requirement Document
and Project plan supporting the finalization of testing are located in separate documents
and are referenced in the test plan.
Test Estimation:
Effective software project estimation is one of the most challenging and important
activities in the testing activities. However, it is the most essential part of a proper project
planning and control. Under-estimating a project leads to under-staffing it, running the
risk of low quality deliverables and resulting in loss of credibility as deadlines are
missed. So it is imperative to do a proper estimation during the planning stage.
After receiving the requirements, the tester analyses the mappings that are created/modified and studies the changes made. Based on the impact analysis, the tester comes to know how much time is needed for the whole testing process, which consists of mapping analysis, test case preparation, test execution, defect reporting, regression testing, and final documentation. This calculated time is entered in the estimation time sheet.
Integration Testing:
Prerequisites:
System Testing:
This environment integrates the various components and runs them as a single unit. Testing should include sequences of events that enable those different components to run as a single unit, and it should validate the data flow.
Prerequisites:
Regression Testing:
Regression testing is performed after the developer fixes a reported defect. This testing verifies that the identified defects are fixed and that fixing these defects does not introduce any new defects into the system/application. This testing will also be
performed when a Change Request is implemented on an existing production system.
After the Change Request (CR) is approved, the testing team takes the impact analysis as
input for designing the test cases for the CR.
Prerequisites:
Performance Testing:
I'm going to work on an ETL project where the source is Oracle, the target is Teradata, and the ETL tool to be used is Informatica.
There are two levels: one is loading into staging (staging is also Teradata), and the second is loading into the target tables.
I query the Oracle source tables and load into the staging area.
Which approach is good:
1. Create a one-to-one mapping to do this, or
2. Use any of the tools offered by Teradata, like MLoad, TPump, etc., in Informatica.
Please advise on the second level as well (from staging to target): whether to use a one-to-one mapping or Teradata tools.
I'm really worried because there is an automatic primary index getting created on Teradata tables, and this leads to rejection of records in some cases.
Ans: For phase 1 you can either use an Informatica mapping with Teradata loader connections (like FastLoad, TPump, MultiLoad) or directly use Teradata loader scripts to load data into the staging table. It's preferable to use Informatica with loader connections, which will be faster for development, and it will only be a one-to-one mapping.
For phase 2, stage to target, you must go for Informatica and use joins in case there are any. You'll have to write SQL override queries which would replace the SCD
transformations in Informatica. The query must distinguish new records from update records using a flag and perform the insert or update operation according to the flag value.
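A minimal sketch of that flag logic, assuming an invented staging structure (this stands in for the SQL override described above; it is not Informatica or Teradata code):

```python
# Keys already present in the target table.
target_keys = {1, 2}

# Rows arriving from staging.
staged = [{"product_id": 1, "price": 250},
          {"product_id": 3, "price": 99}]

def flag_rows(rows, existing_keys):
    """Mark each staged row UPDATE (key exists in target) or INSERT
    (new key), replacing a Lookup + Update Strategy pair."""
    for row in rows:
        row["flag"] = ("UPDATE" if row["product_id"] in existing_keys
                       else "INSERT")
    return rows

flag_rows(staged, target_keys)
```

Downstream, rows flagged INSERT are appended and rows flagged UPDATE overwrite their matching target rows.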
The primary index is the lifeline of a Teradata table. You need to provide the name of the primary index while creating a table, else it automatically takes the first column as the primary index. Data gets loaded into Teradata using the PI, and retrieval is done using the PI as well.
Phase 1:
To make things clearer: you suggested I use a loader connection instead of a relational connection.
As you indicated, will the loader connection be faster than the relational connection?
I hope this would help in the one-time/history load, as the historical data is provided in the form of Oracle dumps.
But in the incremental loads, the source data is being pulled from the Oracle database's views (obviously there will be performance issues, but the client's requirement is this way; can't help it). Do you suggest that here as well the loader connection will be faster than the relational connection?
Phase 2:
I'm not very clear on the explanation given by you here.
I'll be using a Lookup transformation to check whether the record is present or not.
Accordingly, I'll be inserting into or updating the target tables (by tagging them accordingly) using an Update Strategy transformation.
Basically, I'm using the Lookup and Update Strategy transformations to achieve my SCD concept here.
Is this the concept you have explained here?
Ans:
The concept is the same, but the implementation process is different. You can do it the way you have described, using Update Strategy and Lookup transformations, or the way I mentioned in the previous post. The SCD can be implemented with override queries for faster development, instead of going for transformations like Lookup and Update Strategy. When using queries for SCD, joins replace the lookups, and the Update Strategy is replaced by a flag column which says whether the record is going to be an insert or an update, provided the granularity of record change is one day.
For extracting from Oracle views to Teradata, you'll have to go for relational connections only. Loaders can only load data from flat files to Teradata.
Informatica pushdown optimization is a new concept introduced in the Informatica 8 version series.
Pushdown optimization is a concept where you try to make most of the calculations on the database side rather than at the Informatica level. For example, if you need some kind of aggregation, you can push those computations to be done at the database end. It basically utilizes the power of the database, on the source side as well as the target side.
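The idea can be illustrated with a rough Python sketch (this shows only the concept, not Informatica's actual pushdown API): rather than pulling every row into the tool and aggregating there, generate SQL so the database computes the aggregate itself.

```python
def pushed_down_query(table, group_col, measure_col):
    """Build an aggregate query to be executed on the database side,
    so only the small grouped result crosses the network."""
    return (f"SELECT {group_col}, SUM({measure_col}) "
            f"FROM {table} GROUP BY {group_col}")

sql = pushed_down_query("sales_fact", "region", "sales_dollar")
# "SELECT region, SUM(sales_dollar) FROM sales_fact GROUP BY region"
```

The table and column names here are hypothetical; the point is that the heavy SUM/GROUP BY work runs in the database, not in the ETL tool.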
Take a look at pdf from wisdomforce page: Best Practices for high-speed data transfer
from Oracle to Teradata
http://www.wisdomforce.com/resources/docs/FastReaderBestPracticesforTeradata.pdf
http://dwhtechstudy.com/Docs/informatica/New_Features_Enhancements_PC8.pdf
We can push transformation logic to the source or target database when the Lookup transformation contains a lookup override.
To perform source-side, target-side, or full pushdown optimization for a session containing lookup overrides, configure the session for pushdown optimization and select a pushdown option that allows us to create views. We can also use full pushdown optimization when we use the target loading option to treat source rows as delete.
http://www.docstoc.com/docs/6205071/Powercenter-Informatica-80
http://www.informatica.com/INFA_Resources/br_powercenter_6659.pdf
http://www.informatica.com/INFA_Resources/ds_high_availability_6674.pdf
http://www.docstoc.com
*Object Permissions*
Effective in version 8.1.1, you can assign object permissions to users when
you add a user account, set user permissions, or edit an object.
Effective in version 8.1, you configure the gateway node and location for
log event files on the Properties tab for the domain. Log events describe
operations and error messages for core and application services, workflows
and sessions.
We can configure the maximum size of logs for automatic purge in megabytes.
PowerCenter 8.1 also provides enhancements to the Log Viewer and log event formatting.
*Unicode compliance*
You may notice an increase in memory and CPU resource usage on machines
running PowerCenter Services.
*License Usage*
*High Availability* HA
*Repository Security*
*Partitioning*
*Recovery*
The recovery of workflows, sessions, and tasks is more robust now. The state of
the workflow/session is now stored in the shared file system and not in
memory.
*FTP*
We have options for partitioned FTP targets and indirect FTP file sources (with file lists).
*Performance*
Pushdown optimization
We can create more efficient data conversions using the new version.
One can specify a command for a source or target file in a session; for example, the command can create a source file, like 'cat a file'.
*Pmcmd/infacmd*
*Mappings *
We can now build custom transformation enhancements through the API using C++ and Java code.
User-defined functions, similar to macros in Excel, are now supported. Some new functions are added, such as COMPRESS, DECOMPRESS, AES_ENCRYPT, AES_DECRYPT, IN, GREATEST, LEAST, PMT, RATE, RAND, etc.
http://etlpowercenter.blogspot.com/
--
In this lesson, we create a very simple mapping and execute it through a session and workflow. The mapping is based on the last lesson's source and target definitions. Between the source and target definitions is a Source Qualifier.
Source Qualifier
- When you add a relational or a flat file source definition to a mapping, you need to connect it to a Source Qualifier transformation. The Source Qualifier transformation represents the rows that the PowerCenter Server reads when it runs a session.
- Source Qualifier transformation can perform the following tasks:
Join data originating from the same source database
Filter rows when the PowerCenter Server reads source data
Specify sorted ports
Select only distinct values from the source
Create a custom query to issue a special SELECT statement for the PowerCenter
Server to read source data.
The session and workflow for this lesson are very simple; it looks like most of the logic is in the mapping. The session is just like a diagram to link the mapping, and the workflow is just a runtime instance of the session.
Creating Target Tables:
In the Target Designer, you can generate SQL statements from your target table definitions and create the tables in the database.
1. Click Targets > Generate/Execute SQL.
2. In the File Name field, enter the SQL DDL file name.
3. Select the Create Table, Drop Table, Foreign Key and Primary Key options.
4. Click the Generate and Execute button.
So from what I did and saw, it's simple to create source and target definitions, just like creating new tables in a database. One thing I don't like: after importing a source definition from the database, the layout view is not well arranged like other database logical views; you have to rearrange it yourself. So much software can do this job automatically, I don't know why PowerCenter Designer can't.
This lesson is pretty simple; it just shows you how to connect to a repository. My environment was already set up by my company's administrator, so I didn't need to worry about installing and setting up the repository server and service.
If you don't have an environment, you have to install the server and do the configuration yourself. I took a look at that part; it just needs a very powerful machine, so I skipped it.
This lesson includes connecting to an Informatica repository (domain, server, port, username, and password), creating groups, and creating a folder, then setting up permissions on that created folder. All this is done in Informatica PowerCenter Repository Manager. Nothing special; everybody can understand it easily.
This lesson also includes creating the source tables and data for this tutorial. You need to have a database and execute the appropriate SQL file in PowerCenter Designer. I used smpl_ms.sql for SQL Server. This SQL file includes the tables' schemas and data. Informatica PowerCenter connects to a database target or source by ODBC. I installed MS SQL Server Express on my local box. It's free and seems OK, and it doesn't eat too many resources compared to Oracle. When you set up ODBC, the server is "localhost\sqlexpress".
After the SQL executes, you can see a few new tables with some data in the SQL Server master database.
Before the six lessons, I must understand the PowerCenter architecture. This is not copied and pasted from the help; I tried to write it down in my own understanding. It's not a big deal from a developer's view, but it may help a lot in a job interview.
- Node:
Logical representation of a machine in a domain.
One node in each domain serves as a gateway for the domain.
All processes in PowerCenter run as services on a node.
- Services:
Two type of services: Core services and Application services.
Core services: support the domain and application services. E.g. Domain service,
Log service.
Application services: represent PowerCenter server-based functionality. E.g.
Repository service, Intergration service...
- Informatica Repository:
Contains a set of metadata tables within the repository database that Informatica
applications and tools access.
- Informatica Client:
Manages users, defines sources and targets, builds mappings and mapplets with
transformation logic, and creates workflows to run the mapping logic. The
Informatica Client has four client applications: Repository Manager, Designer,
Workflow Manager and Workflow Monitor.
What's my plan
1. First follow the tutorial from the PowerCenter Help.
This includes 6 lessons in total.
2. If I can get Informatica PowerCenter 8 Developer training, I'll follow the training agenda.
I'm trying to get an on-site training combining level one and level two together.
Agenda:
o Data Integration
o Mapping and Transformations
o Metadata
o PowerCenter Architecture
Source Qualifier
o Velocity Methodology
o Source Pipelines
o Expression Editor
o Filter Transformation
o File Lists
o Workflow Scheduler
o Joiner Transformation
o Shortcuts
o Lab B - Features and Techniques I
o Lookup Transformation
o Reusable Transformations
Debugger
o Debugging Mappings
Sequence Generator
o Lookup Caching
o Sorter Transformation
o Aggregator Transformation
o Data Concatenation
o Self-Join
o Router Transformation
o Expression Default Values
o Target Override
o Dynamic Lookup
o Error Logging
o System Variables
Mapplets
o Mapplets
Mapping Design
o Designing Mappings
o Workshop
o Link Conditions
o Workflow Variables
o Assignment Task
o Decision Task
o Email Task
o Lab - Load Product Weekly Aggregate Table
o Command Task
o Reusable Tasks
o PMCMD Utility
o Worklets
o Timer Task
o Control Task
Workflow Design
o Designing Workflows
o Workshop (Optional)
Agenda:
o Architectural overview
o Administration Console
o Configuring services
o High Availability
o Mapping parameters and variables and parameter files
o File lists
o Incremental aggregation
o Denormalization
Workflow Techniques
o Using Tasks
o Workflow Alerts
o Dynamic Scheduling
o Pseudo-looping techniques
Workflow Recovery
o State of operation
Transaction Control
o Database Transactions
o Transformation scope
Error Handling
o Error categories
o Error logging
o Migration
o Comparing objects
o Repository Reporting
o Metadata Reports
o Repository reports
Memory Allocation
o Auto-cache sizing
o Session dynamics
o Measuring performance
o Optimization techniques
Pipeline Partitioning
o Pipeline types
o Multi-partition sessions
Informatica Resources
I really can't find many useful websites except Informatica's own website,
but you have to be a partner or customer to access those sites, or have to buy the documents.
The lucky thing is that I found some Informatica PowerCenter 7 documents free in
Chinese.
As an Informatica customer, you can access the Informatica Customer Portal site at
http://my.informatica.com. The site contains product information, user group information,
newsletters, access to the Informatica customer support case management system
(ATLAS), the Informatica Knowledge Base, and access to the Informatica user
community.
You can access the Informatica corporate web site at http://www.informatica.com. The
site contains information about Informatica, its background, upcoming events, and sales
offices. You will also find product and partner information. The services area of the site
includes important information about technical support, training and education, and
implementation services.
Why I create this blog
I'm a J2EE developer now, and I'm interested in becoming an Informatica PowerCenter ETL
developer. When I google online I can't find many useful websites or articles, so I
think maybe I'll create a blog to record my study trail, and later perhaps create an
Informatica PowerCenter website.
First I'll follow the tutorial in the Informatica PowerCenter help as a starting point; after that I don't
know yet. I hope I can find something to continue with, or base it on a real project in my job.
@@@@
http://blogs.hexaware.com/informatica_way/informatica-powercenter-8x-
key-concepts-5.html
5. Repository Service
> OperatingMode: Values are Normal and Exclusive. Use Exclusive mode to perform
administrative tasks like enabling version control or promoting a local repository to a global repository.
> EnableVersionControl: Creates a versioned repository
Advanced Properties
> CommentsRequiredForCheckin: Requires users to add comments when checking in
repository objects.
> Error Severity Level: Level of error messages written to the Repository Service log.
Specify one of the following message levels: Fatal, Error, Warning, Info, Trace & Debug
Environment Variables
You can configure a Repository Service process to use a different value for the
database client code page environment variable than the value set for the node.
You might want to configure the code page environment variable for a Repository
Service process when the Repository Service process requires a different database client
code page than the Integration Service process running on the same node.
For example, the Integration Service reads from and writes to databases using the UTF-8
code page. The Integration Service requires that the code page environment variable be
set to UTF-8. However, you have a Shift-JIS repository that requires that the code page
environment variable be set to Shift-JIS. Set the environment variable on the node to
UTF-8. Then add the environment variable to the Repository Service process properties
and set the value to Shift-JIS.
We shall look at the fundamental components of the Informatica PowerCenter 8.x suite.
The key components are:
1. PowerCenter Domain
2. PowerCenter Repository
3. Administration Console
4. PowerCenter Client
5. Repository Service
6. Integration Service
PowerCenter Domain
Node
A node is the logical representation of a machine in a domain. The machine on which
PowerCenter is installed acts as the domain and also as the primary node. We can add other
machines as nodes in the domain and configure them to run application services such
as the Integration Service or Repository Service. All service requests from other nodes in
the domain go through the primary node, also called the 'master gateway'.
The Service Manager runs on each node within a domain and is responsible for starting
and running the application services. The Service Manager performs the following
functions:
Application services
The services that essentially perform data movement, connect to different data sources
and manage data are called application services; namely the Repository Service,
Integration Service, Web Services Hub, SAP BW Service, Reporting Service and
Metadata Manager Service. The application services run on each node based on the way
we configure the node and the application service.
Domain Configuration
Configuring a domain involves assigning host names and port numbers to
the nodes, setting Resilience Timeout values, and providing connection information for the
metadata database, SMTP details, etc. All the configuration information for a domain is
stored in a set of relational database tables within the repository. Some global
properties that apply to application services, such as 'Maximum Restart Attempts'
and 'Dispatch Mode' ('Round Robin'/'Metric Based'/'Adaptive'), are also configured under
Domain Configuration.
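The 'Round Robin' dispatch mode mentioned above can be sketched in a few lines of Python; the node and task names here are invented for illustration, not taken from any real domain.

```python
from itertools import cycle

# Hypothetical node names; real node names come from the domain configuration.
nodes = ["node01", "node02", "node03"]

# Round Robin dispatch: hand each incoming task to the next node in turn.
dispatcher = cycle(nodes)
tasks = ["sess_A", "sess_B", "sess_C", "sess_D", "sess_E"]
assignments = {task: next(dispatcher) for task in tasks}
print(assignments)  # sess_A -> node01, sess_B -> node02, ..., wrapping around
```

Metric Based and Adaptive modes would replace the simple `cycle` with a choice based on node load statistics.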
Informatica 7.x vs 8.x
Ans
1. SQL transformation
2. Java transformation
3. Support for unstructured data like emails, Word documents and PDFs.
4. In the Custom transformation we can build the transformation using Java or VC++.
5. The concept of flat file update is also introduced in 8.x.
Object Permissions
Effective in version 8.1.1, you can assign object permissions to users when
you add a user account, set user permissions, or edit an object.
Effective in version 8.1, you configure the gateway node and location for
log event files on the Properties tab for the domain. Log events describe
operations and error messages for core and application services, workflows
and sessions.
We can configure the maximum size of logs for automatic purge in megabytes.
PowerCenter 8.1 also provides enhancements to the Log Viewer and log event
formatting.
Unicode compliance
You may notice an increase in memory and CPU resource usage on machines
running PowerCenter Services.
License Usage
High Availability
High availability is the PowerCenter option that eliminates a single point
of failure in the PowerCenter environment and provides minimal service
interruption in the event of failure. High availability provides the
following functionality:
Resilience: Resilience is the ability of services to tolerate transient
failures, such as loss of connectivity to the database or network failure.
http://www.dwhlabs.com/vs_links/vs_link_informatica7.xvs8.x.aspx
A. In Informatica 8.x, multiple Integration Services can be enabled under one node. If
there is a need to determine the process associated with an Integration Service or Repository
Service, it can be done as follows.
If there are multiple Integration Services enabled on a node, there are multiple pmserver
processes running on the same machine. In PowerCenter 8.x, it is not possible to differentiate
between the processes and correlate each to a particular Integration Service, unlike in 7.x where
every pmserver process is associated with a specific pmserver.cfg file. Likewise, if there are
multiple Repository Services enabled on a node, there are multiple pmrepagent processes
running on the same machine, and it is not possible to differentiate between
the processes and correlate each to a particular Repository Service.
To do this in 8.x, do the following:
5. Use the PID from this column to identify the process as follows:
UNIX:
Windows:
a. Run Task Manager.
b. Select the Processes tab.
c. Scroll to the value in the PID column that is displayed in the PowerCenter Administration
Console.
B. Sometimes the PowerCenter Administration Console URL is inaccessible from some
machines even when the Informatica services are running. The following error is displayed in
the browser:
This is due to an invalid or missing configuration in the hosts file on the client
machine.
1. Edit the hosts file located in the windows/system32/drivers/etc folder on the machine
from which the Administration Console is being accessed.
2. Add the host IP address and the host name (for the host where the PowerCenter
services are installed).
Example
10.1.2.10 ha420f3
3. Launch the Administration Console and access the login page by typing the URL
http://<host>:<port>/adminconsole in the browser address bar.
Ensure that the host name in the URL matches the host entry in the hosts file.
Problem Description
Data is written correctly to the target, but the above error appears multiple times in the session
log, which grows to a very large size. Is there a way to stop this message from being
written to the session log?
Solution
More Information
Setting the custom property XMLWarnDupRows to "NO" will not resolve this issue, as the
custom property has been replaced by the XMLWarnDupRows Integration Service property.
Reference
PowerCenter Administrator Guide > “Creating and Configuring the Integration Service”
> “Configuring the Integration Service Properties” > “Configuration Properties” > “Table
9-6. Configuration Properties for an Integration Service”
Solution
Using High Availability for Web Services Hub, a client application can access the Web
Service even in the event of failover of the Web Service to another node (when the URL
changes). To achieve this, use either of the following:
With a load balancer, the client application sends a request to the load balancer and the
load balancer routes the request to an available Web Services Hub. Any of the Web
Services Hub services can process requests from the client application. The load balancer
does not verify that the host names and port numbers given for the Web Services Hub
services are valid or that the services are running.
Before you send requests through the load balancer, ensure that the Web Services Hub
services are available.
Some of the third-party load balancers that can be used (available on the Web) are Apache
tcpmon and Apache JMeter.
Informatica provides a sample third-party load balancer (to be used only in a test environment).
This can be used to understand the usage of a load balancer with the Web Services Hub. The
sample load balancer is located at
infa_home/server/samples/WebServices/samples/SoftwareLoadBalancer.
Before using the load balancer, read the Readme.txt file present in the same path.
This guides you on:
Note
The sample third-party load balancer can be used only in a test environment and not in a
production environment.
http://blog.mydwbi.com/?p=126
http://www.dwhlabs.com/dwh_concepts/normalization.aspx
@@=>
Normalization
What is Normalization?
Normalization is the process of efficiently organizing data in a database. There are two
goals of the normalization process:
First Normal Form
First Normal Form (1NF) sets the very basic rules for an organized database:
Create separate tables for each group of related data and identify each
row with a unique column or set of columns (the primary key).
Second Normal Form
Second Normal Form (2NF) further addresses the concept of removing duplicative data.
Third Normal Form
Third Normal Form (3NF) removes columns that are not dependent upon the primary
key.
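The move toward third normal form can be illustrated with a small Python sketch; the table and column names below are hypothetical, not from any particular schema.

```python
# Denormalized rows: the customer's city repeats on every order,
# even though it depends only on the customer, not the order.
orders = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Pune",   "amount": 100},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Pune",   "amount": 250},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Mumbai", "amount": 75},
]

# 3NF: customer_city depends on customer_id, not on the primary key of
# orders, so move it to a separate customers table keyed by customer_id.
customers = {r["customer_id"]: {"city": r["customer_city"]} for r in orders}
orders_3nf = [{"order_id": r["order_id"],
               "customer_id": r["customer_id"],
               "amount": r["amount"]} for r in orders]

print(customers)   # the city is now stored once per customer
print(orders_3nf)  # orders no longer repeat the city
```

An update to a customer's city now touches a single row instead of every order for that customer.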
Normally, a fatal Oracle error may not be registered as a warning or row error and the
session may not fail; conversely, a non-fatal error may cause a PowerCenter session to
fail. This can be changed with a little tweaking in the server settings.
C. Server Settings
Adding an entry for this error in the ora8err.act file and enabling the
OracleErrorActionFile option does not change this behavior (both ora8err.act and
OracleErrorActionFile are discussed later in this blog).
When this exception (NO_DATA_FOUND) is raised in PL/SQL, it is sent to the Oracle
client as an informational message, not an error message, and the Oracle client passes this
message to PowerCenter. Since the Oracle client does not return an error to PowerCenter,
the session continues as normal and does not fail.
20991, F
E.g.,
C:\Informatica\PowerCenter8.1.1\server\bin
Examples:
To fail a session when the ORA-03114 error is encountered change the 03114 line in the
file to the following:
03114, F
To return a row error when the ORA-02292 error is encountered change the 02292 line to
the following:
02292, R
Note that the Oracle action file only applies to native Oracle connections in the session. If
the target is using the SQL*Loader external loader option, the message status will not be
modified by the settings in this file.
C. Once the file is modified, the following changes need to be made at the server level.
Infa 8.x
Set the OracleErrorActionFile Integration Service Custom Property to the name of the
file (ora8err.act by default) as follows:
4. Under the Properties tab, click Edit in the Custom Properties section.
7. Click OK.
PowerCenter 7.1.x
UNIX
1. Using a text editor, open the PowerCenter server configuration file (pmserver.cfg).
2. Add the following entry at the end of the file:
OracleErrorActionFile=ora8err.act
Windows
For the server running on Windows:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\
PowerMart\Parameters\Configuration
Select Edit > New > String Value and enter "OracleErrorActionFile" as the string value name.
Select Edit > Modify.
Enter the directory and the file name of the Oracle error action file:
\ora8err.act
Example:
Click OK
HINTS used in a SQL statement send instructions to the Oracle optimizer, which can
reduce the query processing time. Can we make use of these hints in SQL overrides within
our Informatica mappings to improve query performance?
On a general note, any Informatica help material would suggest: you can enter any valid SQL
statement supported by the source database in a SQL override of a Source Qualifier or a Lookup
transformation, or at the session properties level.
While using hints in a Source Qualifier has no complications, using them in a Lookup SQL
override gets a bit tricky. Use of a forward slash followed by an asterisk ("/*") in a Lookup SQL
override [generally used for commenting purposes in SQL, and at times for Oracle hints] results in
session failure with an error like:
This is because Informatica's parser fails to recognize this special character sequence when used
in a Lookup override. A parameter was made available starting with the PowerCenter 7.1.3
release which enables the use of the forward slash, and hence hints.
Infa 7.x
1. Using a text editor open the PowerCenter server configuration file (pmserver.cfg).
2. Add the following entry at the end of the file:
LookupOverrideParsingSetting=1
3. Re-start the PowerCenter server (pmserver).
Infa 8.x
1. Connect to the Administration Console.
2. Stop the Integration Service.
3. Select the Integration Service.
4. Under the Properties tab, click Edit in the Custom Properties section.
5. Under Name enter LookupOverrideParsingSetting
6. Under Value enter 1.
7. Click OK.
8. And start the Integration Service.
Starting with PowerCenter 8.5, this change could be done at the session task itself
as follows:
STAR SCHEMA
Star schema architecture is the simplest data warehouse design. The main feature of a star
schema is a table at the center, called the fact table, and the dimension tables which allow
browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form,
while dimension tables are de-normalized (second normal form).
Fact table
The fact table is not a typical relational database table, as it is de-normalized on purpose
to enhance query response times. The fact table typically contains records that are ready
to explore, usually with ad hoc queries. Records in the fact table are often referred to as
events, due to the time-variant nature of a data warehouse environment.
The primary key for the fact table is a composite of all the columns except the numeric
values / scores (like QUANTITY, TURNOVER, exact invoice date and time).
Typical fact tables in a global enterprise data warehouse are (apart from those, there may be
some company- or business-specific fact tables):
Dimension table
Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining dimension tables is to allow
browsing the categories quickly and easily.
The primary keys of the dimension tables are linked together to form the
composite primary key of the fact table. In a star schema design, there is only one
de-normalized table for a given dimension.
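A star schema query can be sketched in Python to show how an additive measure is summed across a dimension; the tables and values below are invented for illustration.

```python
# Hypothetical star schema: one fact table with foreign keys into
# two de-normalized dimension tables.
dim_product = {10: {"name": "Laptop", "brand": "Acme"},
               20: {"name": "Mouse",  "brand": "Acme"}}
dim_location = {1: {"city": "Pune"}, 2: {"city": "Mumbai"}}

# Fact rows: a composite of dimension keys plus the measure (sales dollar).
fact_sales = [
    {"product_key": 10, "location_key": 1, "sales_dollar": 500.0},
    {"product_key": 20, "location_key": 1, "sales_dollar": 25.0},
    {"product_key": 10, "location_key": 2, "sales_dollar": 700.0},
]

# Additive measure: sum sales dollar across the location dimension.
sales_by_city = {}
for row in fact_sales:
    city = dim_location[row["location_key"]]["city"]
    sales_by_city[city] = sales_by_city.get(city, 0.0) + row["sales_dollar"]

print(sales_by_city)  # {'Pune': 525.0, 'Mumbai': 700.0}
```

The same fact rows could just as easily be summed by product or brand, which is the "browsing across dimensions" a star schema is designed for.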
An example of a star schema architecture is depicted below.
SNOWFLAKE SCHEMA
Snowflake schema architecture is a more complex variation of a star schema design. The
main difference is that dimensional tables in a snowflake schema are normalized, so they
have a typical relational database design.
Snowflake schemas are generally used when a dimension table becomes very big and
when a star schema can't represent the complexity of a data structure. For example, if a
PRODUCT dimension table contains millions of rows, the use of a snowflake schema
should significantly improve performance by moving some data out to another table (with
BRANDS, for instance).
The problem is that the more normalized the dimension tables are, the more complicated
the SQL joins that must be issued to query them; to answer a query, many tables need to
be joined and aggregates generated.
GALAXY SCHEMA
For each star schema or snowflake schema it is possible to construct a fact constellation
schema.
This schema is more complex than the star or snowflake architecture because it
contains multiple fact tables. This allows dimension tables to be shared amongst many
fact tables.
That solution is very flexible; however, it may be hard to manage and support.
The main disadvantage of the fact constellation schema is a more complicated design,
because many variants of aggregation must be considered.
In a fact constellation schema, different fact tables are explicitly assigned to the
dimensions that are relevant for the given facts. This may be useful when some
facts are associated with a given dimension level and other facts with a deeper dimension
level.
Use of this model is reasonable when, for example, there is a sales fact table (with
details down to the exact date and invoice header id) and a sales forecast fact table
calculated based on month, client id and product id. In that case, using two fact tables
at different levels of grouping is realized through a fact constellation model.
Data Integration Challenge – Understanding Lookup Process – I
One of the basic ETL steps that we use in most ETL jobs during
development is the 'lookup'. We shall discuss further what a lookup is, when to use it, how
it works, and some points to be considered while using a lookup process.
During the process of reading records from a source system and loading them into a target
table, if we query another table or file (called the 'lookup table' or 'lookup file') to retrieve
additional data, then it's called a 'lookup process'. The 'lookup table or file' can reside on
the target or the source system. Usually we pass one or more column values that have been
read from the source system to the lookup process in order to filter and get the required
data.
Direct Query: Run the required query against the table or file whenever the
‘lookup process’ is called up
Join Query: Run a query joining the source and the lookup table/file before
starting to read the records from the source.
Cached Query: Run a query to cache the data from the lookup table/file local to
the ETL server as a cache file. When the data flows from the source, run the
required query against the cache file whenever the 'lookup process' is called up
Most of the leading products, like Informatica and DataStage, support all three ways in
their product architecture. We shall see the pros and cons of these approaches and how
they work in Part II.
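A minimal sketch of a lookup process in Python, assuming a hypothetical customer lookup table; the column names are made up for illustration.

```python
# Source rows read from the source system (hypothetical data).
source_rows = [{"cust_id": "C1", "amount": 100},
               {"cust_id": "C2", "amount": 75},
               {"cust_id": "C9", "amount": 10}]   # no match in the lookup

# Lookup table keyed by the column passed into the lookup process.
lookup_table = {"C1": {"cust_name": "Alpha"},
                "C2": {"cust_name": "Beta"}}

def lookup(row):
    """Direct-query style: query the lookup table each time it is called."""
    match = lookup_table.get(row["cust_id"])
    return {**row, "cust_name": match["cust_name"] if match else None}

enriched = [lookup(r) for r in source_rows]
print(enriched[0])  # {'cust_id': 'C1', 'amount': 100, 'cust_name': 'Alpha'}
print(enriched[2])  # unmatched key, so cust_name is None
```

A Join Query would instead fold `lookup_table` into the source read itself, and a Cached Query would build `lookup_table` once from the lookup source before the rows start flowing.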
DESIGNER ::
Use the Normalizer transformation with COBOL sources, which are often stored in a
denormalized format.
You can also use the Normalizer transformation with relational sources to create multiple
rows from a single row of data.
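The Normalizer's single-row-to-multiple-rows behavior can be sketched in Python; the repeating quarterly columns below are a hypothetical example, not the tutorial's actual layout.

```python
# Hypothetical denormalized record: one row carries a repeating group of
# quarterly sales columns, as a COBOL OCCURS clause would.
row = {"store": "S1",
       "q1_sales": 100, "q2_sales": 120, "q3_sales": 90, "q4_sales": 150}

def normalize(record):
    """Emit one output row per occurrence of the repeating group."""
    out = []
    for qtr in (1, 2, 3, 4):
        out.append({"store": record["store"],
                    "quarter": qtr,
                    "sales": record[f"q{qtr}_sales"]})
    return out

rows = normalize(row)
print(len(rows))  # 4 rows produced from a single input row
print(rows[1])    # {'store': 'S1', 'quarter': 2, 'sales': 120}
```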
MAPPING ::
Objective : To create a mapping which converts a single row into multiple rows.
Mapping Flow : Source Definition (Flat File) > Source Qualifier > Expression (column
names) > Normalizer transformation (converts single row into multiple rows)> Target
Definition (flat file)
Description :
Source Definition
Target Definition
Download :
XML FILE DESIGNER
m_norm_col_rows Single row into multiple rows
Exceptions in Informatica – 2
Let us see a few more strange exceptions in Informatica.
There might be several reasons for this. One possible reason could be the way the
function SUBSTR is used in the mappings, for example the length argument of the SUBSTR
function being specified incorrectly.
Example:
IIF(SUBSTR(MOBILE_NUMBER, 1, 1) = '9',
SUBSTR(MOBILE_NUMBER, 2, 24),
MOBILE_NUMBER)
To solve this, correct the length argument so that it does not go beyond the length of the
field, or omit the length argument to return the entire string starting from the start
position.
Example:
IIF(SUBSTR(MOBILE_NUMBER, 1, 1) = '9',
SUBSTR(MOBILE_NUMBER, 2, 23),
MOBILE_NUMBER)
OR
IIF(SUBSTR(MOBILE_NUMBER, 1, 1) = '9',
SUBSTR(MOBILE_NUMBER, 2),
MOBILE_NUMBER)
“TE_11015 Error in xxx: No matching input port found for output port OUTPUT_PORT
TM_6006 Error initializing DTM for session…”
This error occurs when there is corruption in the transformation.
To resolve it, recreate the transformation in the mapping having this error.
1. When opening designer, you get “Exception access violation”, “Unexpected condition
detected”.
2. Unable to see the navigator window, output window or the overview window in
designer even after toggling it on.
These are all indications that the pmdesign.ini file might be corrupted. To solve this,
delete or rename the existing pmdesign.ini file and reopen the Designer.
When PowerMart opens the Designer, it creates a new pmdesign.ini if it doesn't find
an existing one; even reinstalling the PowerMart clients will not recreate this file if one
is found.
http://www.dwhlabs.com/transformation_mappings/UpdateStrategy.aspx
Informatica Exceptions – 3
1. There are occasions where sessions fail with the following error in the Workflow
Monitor:
“First error code [36401], message [ERROR: Session task instance [session XXXX]:
Execution terminated unexpectedly.] “
To determine the error, do the following:
a. If the session fails before initialization and no session log is created, look for errors in
the Workflow log and pmrepagent log files.
b. If the session log is created and the log shows errors like
then a core dump has been created on the server machine. In this case, Informatica
Technical Support should be contacted with specific details. This error may also occur
when the PowerCenter server log becomes too large and the server can no longer
write to it; in this case a workflow and session log may not be completed, and deleting or
renaming the PowerCenter Server log (pmserver.log) file will resolve the issue.
2. Given below is not an exception, but a scenario which most of us would have come
across.
A rounding problem occurs with columns in the source defined as Numeric with precision
and scale, or lookups fail to match on the same columns. Floating point arithmetic is
always prone to rounding errors (e.g. the number 1562.99 may be represented internally
as 1562.988888889, very close but not exactly the same). This can also affect functions
that work with scale, such as the ROUND() function. To resolve this, do the following:
b. Define all numeric ports as Decimal datatype with the exact precision and scale
desired. When high precision processing is enabled, the PowerCenter Server supports
numeric values up to 28 digits. However, the tradeoff is a performance hit (actual
performance depends on how many decimal ports there are).
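The floating point rounding issue can be demonstrated directly in Python; the standard decimal module plays the role of the Decimal datatype with high precision enabled.

```python
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly,
# which is why float-based comparisons and lookups can fail to match.
print(f"{1562.99:.17f}")  # shows the stored value is not exactly 1562.99

assert 0.1 + 0.2 != 0.3   # the classic floating point rounding surprise

# A Decimal type keeps exact precision and scale, so equality behaves
# the way the source data intends.
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
print("decimal arithmetic is exact")
```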
Exceptions in Informatica
No product or tool exists without strange exceptions and errors; we will look at some of
them.
1. You get the below error when you do "Generate SQL" in the Source Qualifier and try to
validate it:
"Query should return exactly n field(s) to match field(s) projected from the Source
Qualifier"
where n is the number of fields projected from the Source Qualifier.
Possible reasons:
1. The order of the ports may be wrong.
2. The number of ports in the transformation may be more or fewer than expected.
3. Sometimes you will have the correct number of ports in the correct order and still face
this error; in that case, make sure that the owner name and schema name are specified
correctly for the tables used in the Source Qualifier query.
E.g., TC_0002.EXP_AUTH@TIMEP
"[/export/home/build80/zeusbuild/vobs/powrmart/common/odl/oracle8/oradriver.cpp] line [xxx]"
where xxx is some line number, mostly 241, 291 or 416.
This happens when there is not enough memory on the system. To resolve this we can either
http://blogs.hexaware.com/business-intelligence/data-integration-challenge-
%E2%80%93-understanding-lookup-process-%E2%80%93-iii.html
measurements.
==
DATA MINING VS WEB MINING
Data mining involves using techniques to find underlying structure and relationships in
large amounts of data. Data mining products tend to fall into five categories: neural
networks, knowledge discovery, data visualization, fuzzy query analysis and case-based
reasoning.
Web mining involves the analysis of the Web server logs of a Web site. The Web server
logs contain the entire collection of requests made by a potential or current customer
through their browser and the responses by the Web server.
OLTP VS OLAP
Ans
OLTP | OLAP
On Line Transaction Processing | On Line Analytical Processing
Continuously updates data | Read-only data
Tables are in normalized form | Partially normalized / denormalized tables
Single record access | Multiple records for analysis purposes
Holds current data | Holds current and historical data
Records are maintained using a primary key field | Records are based on a surrogate key field
Can delete the table or record | Cannot delete the records
Complex data model | Simplified data model
In Part II we discussed when to use and when not to use each type of lookup
process: the direct query lookup, the join-based lookup and the cache file based lookup.
Now we shall see the points to be considered for better performance of these
'lookup' types.
In the case of Direct Query the following points are to be considered
Index on the lookup condition columns
Selecting only the required columns
In the case of Join based lookup, the following points are to be considered
Index on the columns that are used as part of Join conditions
Selecting only the required columns
In the case of Cache file based lookup, let us first try to understand the process of how
these files are built and queried.
The key aspects of a Lookup Process are the
SQL that pulls the data from lookup table
Cache memory/files that holds the data
Lookup Conditions that query the cache memory/file
Output Columns that are returned back from the cache files
In the case of Informatica, the cache file consists of a separate index file and data file:
the index file holds the fields that are part of the 'lookup condition' and the data file holds
the fields that are to be returned. DataStage cache files are called hash files, which are
optimized based on the 'key fields'.
Process:
1. Get the inputs for the lookup: the query, the lookup condition and the columns to be returned
2. Load the cache file into memory
3. Search for the record(s) matching the lookup condition values; in the case of
Informatica this search happens on the 'index file'
4. Pull the required columns matching the condition and return them; in the case of
Informatica, with the result from the 'index file' search, the data from the 'data file' is
located and retrieved
In the search process, based on the memory availability there could be many disk hits and
page swapping.
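The build and query phases above can be sketched in Python, loosely mirroring the separate index and data cache files; the rows are invented for illustration.

```python
# Rows pulled by the lookup SQL: (lookup key, returned fields...).
lookup_rows = [("C1", "Alpha", "Pune"), ("C2", "Beta", "Mumbai")]

# Build phase: the "index cache" maps lookup condition values to a
# position in the "data cache", which holds only the returned fields.
index_cache = {}
data_cache = []
for key, name, city in lookup_rows:
    index_cache[key] = len(data_cache)
    data_cache.append({"cust_name": name, "city": city})

def cached_lookup(key):
    """Query phase: search the index, then fetch from the data cache."""
    pos = index_cache.get(key)
    return data_cache[pos] if pos is not None else None

print(cached_lookup("C2"))  # {'cust_name': 'Beta', 'city': 'Mumbai'}
print(cached_lookup("C9"))  # None, since there is no matching entry
```

In the real engine both structures live in memory-backed files, which is why sorted input and lean column lists matter for paging behavior.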
The following table lists the points to consider for better performance of a cache-file-based lookup.

Category: Optimize cache file building process
• While retrieving the records to build the cache file, sort them by the lookup condition columns; this sorting speeds up the index (file) building process, because the search tree of the index file is built faster with less node realignment
• Select only the required fields, thereby reducing the cache file size
• Reuse the same cache file for multiple requirements with the same or slightly varied lookup conditions

Category: Optimize cache file query process
• Sort the records that come from the source to query the cache file by the lookup condition columns; this ensures less page swapping. If the subsequent input source records come in a continuous sorted order, the hit rate of the required index data in memory is high and disk swapping is reduced
• Have a dedicated separate disk: this ensures reserved space for the lookup cache files and also improves the response of writing to and reading from the disk
• Avoid querying recurring lookup conditions by sorting the incoming records by the lookup condition
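The first building point, that sorted input speeds up index construction, can be illustrated with an ordered index: when keys arrive pre-sorted, the build is append-only (no mid-structure realignment), and a binary search then serves the lookup-condition query. This is a hand-rolled sketch using Python's `bisect`; the function and field names are made up for the example.

```python
import bisect

# Sketch: an ordered index built from records already sorted by the lookup
# key. Because keys arrive in order, the build is append-only, mirroring the
# cheaper search-tree construction described above.

def build_sorted_index(sorted_records, key_col):
    keys, rows = [], []
    for rec in sorted_records:
        keys.append(rec[key_col])   # sorted input: no mid-list insertion
        rows.append(rec)
    return keys, rows

def search(keys, rows, value):
    """Binary-search the ordered keys for the lookup condition value."""
    i = bisect.bisect_left(keys, value)
    if i < len(keys) and keys[i] == value:
        return rows[i]
    return None

records = sorted(
    [{"id": 30, "amt": 5.0}, {"id": 10, "amt": 1.5}, {"id": 20, "amt": 9.9}],
    key=lambda r: r["id"],
)
keys, rows = build_sorted_index(records, "id")
print(search(keys, rows, 20))   # {'id': 20, 'amt': 9.9}
```

The same locality argument applies at query time: if the incoming probe values are also sorted, successive searches land near each other in the index, which is the in-memory analogue of the reduced page swapping described above.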
When we receive data from source systems, the data file will not carry a flag indicating whether a record is new or has changed. We need to build a process to determine the changes and then push them to the target table.
Step 1: Pull the incremental data from the source file or table
If the source system has audit columns (such as a date), we can identify the new records; otherwise we cannot, and we have to consider the complete data set.
For a source file or table that has audit columns, follow these steps:
1. While reading the source records for a day (session), find the maximum value of the date (audit field) and store it in a persistent variable or a temporary table
2. Use this persistent variable value as a filter the next day to pull the incremental data from the source table
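The two steps above amount to a high-water-mark filter. A minimal sketch, assuming a JSON file stands in for the persistent variable or temporary table and that the audit column holds sortable date strings; the file name and column names are invented for the example.

```python
import json
import os

# The JSON state file stands in for a persistent variable or temporary
# table (hypothetical name).
STATE_FILE = "payment_state.json"

def load_high_water_mark():
    """Return the stored maximum audit date, or '' on the first run."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["max_audit_date"]
    return ""  # first run: pull the complete data

def pull_incremental(source_rows, audit_col="updated_at"):
    """Filter rows newer than the mark, then advance the mark."""
    hwm = load_high_water_mark()
    fresh = [r for r in source_rows if r[audit_col] > hwm]
    if fresh:
        new_hwm = max(r[audit_col] for r in fresh)
        with open(STATE_FILE, "w") as f:
            json.dump({"max_audit_date": new_hwm}, f)
    return fresh
```

Calling `pull_incremental` a second time over the same rows returns an empty list, because the stored mark now equals the latest audit date seen.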
Step 2: Determine the impact of each record on the target table as an insert, update or delete
Following are the scenarios we may face and the suggested approach for each.
1. The data file has only incremental data, either from Step 1 or because the source itself provides only incremental data
o Do a lookup on the target table to determine whether it is a new record or an existing record
o If it is an existing record, compare the required fields to determine whether it is an updated record
o Have a process to find the aged records in the target table and clean them up as 'deletes'
2. The data file has the full, complete data because no audit columns are present
o The data is of higher volume
   Keep a backup of the previously received file
   Compare the current file with the prior file and create a 'change file' by determining the inserts, updates and deletes; ensure both the 'current' and 'prior' files are sorted by the key fields
   Have a process that reads the 'change file' and loads the data into the target table
   Based on the 'change file' volume, decide whether to do a 'truncate & load' instead
o The data is of lower volume
   Do a lookup on the target table to determine whether it is a new record or an existing record
   If it is an existing record, compare the required fields to determine whether it is an updated record
   Have a process to find the aged records in the target table and clean up or delete them
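The higher-volume scenario, comparing a sorted 'current' file against a sorted 'prior' file to build a 'change file', can be sketched as a merge-style comparison. This is an illustrative Python version, not tool-specific code; the record layout and the key field `id` are assumptions for the example.

```python
# Merge-style comparison of two full extracts, both sorted by the key field,
# producing the inserts/updates/deletes of a "change file".

def diff_sorted_files(prior, current, key=lambda r: r["id"]):
    inserts, updates, deletes = [], [], []
    i = j = 0
    while i < len(prior) and j < len(current):
        kp, kc = key(prior[i]), key(current[j])
        if kp == kc:
            if prior[i] != current[j]:       # same key, changed fields
                updates.append(current[j])
            i += 1
            j += 1
        elif kp < kc:                        # key vanished from current file
            deletes.append(prior[i])
            i += 1
        else:                                # key new in current file
            inserts.append(current[j])
            j += 1
    deletes.extend(prior[i:])                # leftovers only in prior
    inserts.extend(current[j:])              # leftovers only in current
    return inserts, updates, deletes

prior = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 30}]
current = [{"id": 1, "amt": 10}, {"id": 2, "amt": 25}, {"id": 4, "amt": 40}]
ins, upd, dele = diff_sorted_files(prior, current)
```

Because both inputs are sorted by the key, the comparison is a single linear pass, which is why the text insists that both files be sorted before the diff.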
How can we effectively manage data storage while still leveraging the benefit of a timestamp field?
One way of managing the storage of timestamp fields is to introduce a process id field and a process table. The following steps describe how to apply this method to the table structures and to the ETL process.
Data Structure
1. Consider a table named PAYMENT with two fields of timestamp data type, INSERT_TIMESTAMP and UPDATE_TIMESTAMP, used for capturing the changes for every record present in the table
2. Create a table named PROCESS_TABLE with columns PROCESS_NAME Char(25), PROCESS_ID Integer and PROCESS_TIMESTAMP Timestamp
3. Now drop the fields of the TIMESTAMP data type from the table PAYMENT
4. Create two fields of integer data type in the table PAYMENT, INSERT_PROCESS_ID and UPDATE_PROCESS_ID
5. These newly created id fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID are logically linked to the table PROCESS_TABLE through its field PROCESS_ID
ETL Process
1. Consider an ETL process called 'payment process' that loads data into the table PAYMENT
2. Create a pre-process that runs before the 'payment process'. In the pre-process, build the logic by which a record with the values ('payment process', sequence number, current timestamp) is inserted into the PROCESS_TABLE table. The PROCESS_ID in the PROCESS_TABLE table could be generated by a database sequence function
3. Pass the newly generated PROCESS_ID of PROCESS_TABLE from the pre-process step to the 'payment process' ETL process as 'current_process_id'
4. In the 'payment process', if a record is to be inserted into the PAYMENT table, the current_process_id value is set in both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID; if a record is being updated in the PAYMENT table, the current_process_id value is set only in the column UPDATE_PROCESS_ID
5. The timestamp values for the records inserted or updated in the table PAYMENT can now be picked from the PROCESS_TABLE by joining its PROCESS_ID with the INSERT_PROCESS_ID and UPDATE_PROCESS_ID columns of the PAYMENT table
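The whole pattern can be sketched end to end with SQLite standing in for the warehouse database; the table and column names follow the text, while an AUTOINCREMENT key emulates the database sequence mentioned in step 2. A minimal sketch, not production ETL code:

```python
import sqlite3

# End-to-end sketch of the process-id pattern with SQLite standing in for
# the warehouse database. AUTOINCREMENT emulates the database sequence.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PROCESS_TABLE (
        PROCESS_ID        INTEGER PRIMARY KEY AUTOINCREMENT,
        PROCESS_NAME      CHAR(25),
        PROCESS_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE PAYMENT (
        PAYMENT_ID        INTEGER,
        AMOUNT            REAL,
        INSERT_PROCESS_ID INTEGER,
        UPDATE_PROCESS_ID INTEGER
    );
""")

def start_process(name):
    """Pre-process step: register the run and return its PROCESS_ID."""
    cur = conn.execute(
        "INSERT INTO PROCESS_TABLE (PROCESS_NAME) VALUES (?)", (name,))
    return cur.lastrowid

# Insert: both process-id columns get the current process id.
pid = start_process("payment process")
conn.execute("INSERT INTO PAYMENT VALUES (?, ?, ?, ?)", (101, 50.0, pid, pid))

# Update (a later run): only UPDATE_PROCESS_ID is set to the new process id.
pid2 = start_process("payment process")
conn.execute(
    "UPDATE PAYMENT SET AMOUNT = ?, UPDATE_PROCESS_ID = ? WHERE PAYMENT_ID = ?",
    (75.0, pid2, 101))

# Step 5: recover the update timestamp by joining back to PROCESS_TABLE.
row = conn.execute("""
    SELECT p.PAYMENT_ID, pt.PROCESS_TIMESTAMP
    FROM PAYMENT p
    JOIN PROCESS_TABLE pt ON p.UPDATE_PROCESS_ID = pt.PROCESS_ID
""").fetchone()
```

Each PAYMENT row carries only two small integers, yet both its insert and update timestamps remain recoverable through the join.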
Benefits
The fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID occupy less space than the timestamp fields they replace
Both columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID are index friendly
It is easier to handle these process id fields when picking the records for determining incremental changes or for any audit reporting
TRANSFORMATION DEVELOPER VS MAPPLET DESIGNER
Ans
TRANSFORMATION DEVELOPER: Used to create reusable transformations. A reusable transformation can be used in multiple mappings.
MAPPLET DESIGNER: Used to create mapplets. A mapplet contains a set of transformations and allows you to reuse that transformation logic in multiple mappings.
Ans
ETL TOOL: Used to extract, transform and load data: cleansing data, validating data and applying business logic on the data. Developed by an ETL developer. Tools: Informatica, DataStage, Oracle Warehouse Builder, etc.
REPORTING TOOL: Used to generate reports, converting ETL data into cubes, pie charts, graphs, etc. Developed by a reporting developer; end users access the data in the form of business reports. Tools: Business Objects, Hyperion, Cognos, etc.
FACT TABLE VS DIMENSION TABLE
Ans
FACT TABLE: A fact table in a data warehouse describes the transaction data; it contains characteristics and key figures. In a data model schema, fewer fact tables are observed.
DIMENSION TABLE: A table whose entries describe the data in a fact table. A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up; it contains only the textual attributes. Dimension tables contain the data from which dimensions are created. In a data model schema, more dimension tables are observed.
Ans
RDBMS SCHEMA:
* Used for OLTP systems
* Traditional and older schema
* Normalized
* Difficult to understand and navigate
* Extraction and complex problems are difficult to solve
* Poorly modelled for analysis

DWH SCHEMA:
* Used for OLAP systems
* Newer-generation schema
* Denormalized
* Easy to understand and navigate
* Extraction and complex problems can be easily solved
* A very good model for analysis
Oracle VS Teradata:
Both databases have their advantages and disadvantages, and there are many factors to take into consideration before deciding which database is better.
If you are talking about OLTP systems, then Oracle is far better than Teradata. Oracle is more flexible in terms of programming: you can write packages, procedures and functions.
Teradata is useful if you want to generate reports on a very large database. That said, recent versions of Oracle such as 10g are quite good and contain a lot of features to support data warehousing.
Teradata is an MPP system which can process complex queries very fast. Another advantage is the uniform distribution of data through unique primary indexes without any overhead. Recently we had an evaluation with experts from both Oracle and Teradata for an OLAP system, and they were really impressed with the performance of Teradata over Oracle.