IBM® InfoSphere™
Data Integration
Fundamentals
Boot Camp
Lab Review
January 2014
Table of Contents
Task: Create a DDL script for the new table .............................................................. 124
Task: Export the physical data models ....................................................................... 128
Task: Importing physical data models using the ODBC Connector in DataStage ..... 130
Lab 09: Creating Mapping Specifications .............................................. 132
Task: Creating a FastTrack project ............................................................................. 133
Task: Import Metadata using FastTrack ..................................................................... 136
Task: Creating source to target specifications ............................................................ 138
Task: Use the source to target specification to generate a DataStage Job .................. 143
Lab 10: Combining and Sorting Data.................................................... 156
Task: Creating a source to target specification with a lookup table ........................... 157
Task: Completing the lookup stage job ...................................................................... 167
Task: Range lookup on reference link ........................................................................ 180
Task: Using the Sort stage .......................................................................................... 187
Task: Using the Remove Duplicates stage.................................................................. 191
Task: Using the Join stage .......................................................................................... 194
Task: Using the Merge stage....................................................................................... 201
Task: Using the Funnel stage ...................................................................................... 204
Task: Perform an impact analysis using the Repository window ............................... 208
Task: Find the differences between two jobs .............................................................. 210
Lab 11: Aggregating Data ....................................................................... 213
Task: Using the Aggregator stage ............................................................................... 213
Lab 12: Transforming Data .................................................................... 217
Task: Create a parameter set ....................................................................................... 217
Task: Add a Transformer stage to a job and define a constraint ................................ 219
Task: Define an Otherwise link .................................................................................. 225
Task: Define derivations ............................................................................................. 229
Lab 13: Operating and Deploying Data Integration Jobs ................... 234
Task: View the Metadata Lineage ............................................................................. 234
Task: Building a Job Sequence .................................................................................. 238
Task: Add a user variable .......................................................................................... 245
Task: Add a Wait For File stage ................................................................................ 248
Task: Add exception handling ................................................................................... 250
Lab 14: Real Time Data Integration ...................................................... 253
Task: Revisiting our Project Blueprint ....................................................................... 253
Task: Creating a Service Enabled Job ........................................................................ 256
Task: Create an Information Service project with Information Services Director ..... 262
Task: Create an Information Application and Service ................................................ 266
Summary ..................................................................................................................... 275
2. In the labs, we will use the term "VM Machine" to refer to the VMware environment running IBM InfoSphere Information Server, and the term "Host Machine" to refer to the machine running VMware Player or Workstation that loads and hosts the VMware image.
3. All the required data files are located at: /DS_Fundamentals/Labs. You will be using
the DataStage project called “dif”.
¹ IS admin: InfoSphere Information Server administrator
² WAS admin: WebSphere Application Server administrator
3. If the browser does not display the web page, ask your instructor for help and follow
these instructions together.
7. To change to the db2inst1 user, type the following command: su - db2inst1
10. To start the WebSphere Application Server, execute the following commands (this is a time-consuming process that can take several minutes to complete):
/opt/IBM/WebSphere/AppServer/bin/startServer.sh server1
/opt/IBM/InformationServer/ASBNode/bin/NodeAgents.sh start
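If you want to verify that the application server came up successfully, you can query its status. A minimal sketch, assuming the default install path used above and the wasadmin credentials introduced earlier in this guide:
/opt/IBM/WebSphere/AppServer/bin/serverStatus.sh server1 -username wasadmin -password inf0server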
The data file that we will discover, map, transform and load contains social
media sentiment information. The data was extracted from the web using a
web crawler and then processed on our InfoSphere BigInsights cluster,
IBM’s big data processing platform. The BigInsights processing reduced
the size of the data, which made it easier to store in our traditional
data warehouse infrastructure. It is now our job to make the information
available for consumption by our business users through our data
warehouse.
In this first exercise we will use Blueprint Director to better understand the
class scenario and our goals.
Log off from the IBM Information Server Web Console if you are connected to it.
1 Open Blueprint Director by double-clicking the 'InfoSphere Blueprint Director' icon, or click Start > All Programs > IBM InfoSphere > IBM Blueprint Director 2.2.0.
2 When you work on a project inside Blueprint Director, the information regarding the
project is stored into a workspace. A workspace is a folder on your system. In this
environment, it is located in directory C:\Users\Administrator\IBM\bpd\workspace.
You can also build your own template from scratch using Rational Method
Composer.
5 We will create a new blueprint from a template; in this case, we use the “Business
Driven BI Development” template. Save the blueprint in the destination folder
“Miscellaneous Project”, and name your blueprint ‘My_BI_Blueprint.bpt’.
6 Click ‘Finish’ to create the blueprint. The new blueprint, based on the ‘Business
Driven BI Development’ template, should now be visible.
8 Explore the content of the palette. There are various categories: Groups,
Operations, Analytics, Consumers and Delivery, Data Stores, Files, Conceptual
Models, Connections. Each category contains a number of elements. This list is
extensible, so you can add your own elements through Blueprint > Extend Palette.
9 The top level blueprint diagram of our BI project already uses a number of elements
from the palette.
Each domain in the diagram has one or more high level elements. For example,
the Data Sources domain contains a series of group elements called Asset Sets
such as Structured and Unstructured Sources, External Feeds and Enterprise
Applications. The Data Integration domain contains the Integrate Sources
Routine element, an element in the Operations category.
General Flow connectors link elements together within a domain and across
domains. These general flow links help you visualize the flow of information in
your information project.
10 Many elements on the diagram contain a sub-diagram, with lower level, more
detailed information. You can tell if there is a linked, lower level subject area if there
is an orange plus sign at the top left corner of an element. And again, any element
in a sub-diagram can itself contain a sub-diagram. This hierarchical representation of diagrams lets you keep higher-level diagrams uncluttered by unnecessary detail.
11 Click this '+' sign, or double-click the 'Integrate Sources' element, to drill down into more detail of the "Extract, Transform and Load" (ETL) process.
12 The ETL sub-diagram is now open. The highlighted tab at the top of the canvas
shows the diagram you are currently working on.
13 In this sub-diagram, notice that the elements on the left and the right side are in gray
italics. This indicates that these elements have been added from another diagram
by dragging them from the Blueprint Navigator. Changes to these elements are
kept synchronized across diagrams.
14 On the right-hand side of the Blueprint Director workspace, you may have noticed three content browsers.
The Method Browser displays the outline of the method that is associated with
the template diagram. A method provides guidance on recommended roles,
tasks, deliverables and dependencies for the overall project.
The Asset Browser browses IBM InfoSphere Information Server metadata
repositories based on a connection profile. You can drag & drop entries (e.g. a
database, a job, etc.) from the asset browser onto elements on the canvas.
These elements will be automatically linked so that you can open IBM InfoSphere
Metadata Workbench to view the metadata details from the blueprint.
The Glossary Browser, which is the standard IBM InfoSphere Business
Glossary eclipse plug-in, displays the glossary categories and terms in a tree
view and the detailed definition in the property view. You can drag & drop
glossary terms onto the blueprint diagram to define conceptual entities or tag
elements with terms.
15 In the Method Browser, expand the Business Driven BI development scenario. This template scenario provides you with a high-level overview and guidance for the required steps in a particular project. When you define and manage a new project, you have access to the corresponding method in a hierarchical view of high-level phases and activities, plus detailed descriptions of the method's activities for a selected project based on a template.
17 Click the ‘+’ sign in front of any phase, activity, or task to see the details.
18 In this boot camp, we will focus on the Discover Sources pattern and discovering,
defining and developing Information Integration activities.
19 Import the existing blueprint DIF_Scenario.bpt, by selecting from the menu File >
Import Blueprint.
Our blueprint has milestones defined and the elements are already assigned to
milestones.
24 View the Timeline tab on the bottom left side of the screen.
26 You can view the evolution of the blueprint by using the slider in the timeline
window.
27 Select the milestones that you want to visualize at each phase of the project. This
capability helps your team to understand the end-to-end project vision.
28 Move the timeline slider from ‘End of workshop’ to ‘Adding Sources’ and then to
‘Data Quality’. You will notice that a yellow circle appears around the Integrate
Sources routine element. This feature informs you about lower-level diagram
activity.
30 Double-click the Web Data asset set in the Data Sources domain, or alternatively click the plus sign. This will open the sub-level diagram.
31 In our scenario, data is extracted from the web and saved, together with additional user information, in a data file using the InfoSphere BigInsights product. Once the data was processed on our BigInsights cluster, the result was written to a file. We will need to read this file later on using DataStage. Note that DataStage has a connector, called the HDFS connector, that allows direct connections to the BigInsights/Hadoop file system. Since we don't have access to the BigInsights server in our workshop environment, we chose to export the file to the local file system.
34 Notice that the elements on the left and on the right side are in gray italics. This
indicates that these elements are the connection points from the top level diagram.
Changes to these elements are kept synchronized across diagrams.
35 The Integrate diagram shows a classic extract, transform and load (ETL) pattern.
The first part of an ETL process involves extracting the data from the source
systems. In our case this is reading the data from the web data feed file. We will
store the extracted data in a database.
The load phase loads the data into the end target which, in our case, is a data
warehouse.
Click Finish.
This server connection is now listed in the Manage Server Connections window.
You can click on the + sign in front of each database to discover the underlying schemas, tables, and columns.
43 We are now going to create a connection for a BI Report. We already have existing
customer data reports that access our warehouse. We will now link our blueprint to
one of the existing reports. Once the customer sentiment analysis report is built, we
could go back to our blueprint and include this asset link. In your blueprint, right click
on the Reports element.
44 Move the timeline to Advanced Analytics and make sure the 'Enable read-only Blueprint view by timeline' option is deselected. Then right-click the Reports element and select Add Asset Link.
47 Click Next.
49 Click Next.
52 Click Finish.
53 A green arrow has now been added to the Reports element. It indicates that one or more assets are associated with this element.
54 Click the green arrow and then the Customer Report link.
55 You can now browse the BI report representation in the Information Server
Metadata Repository using Metadata Workbench. The window is embedded in
Blueprint Director.
58 The above diagram shows part of the full lineage between the Customer Report and
the associated warehouse tables and the existing operational sources. You can use
the slider or the + / - sign on the tool bar to zoom in or zoom out.
59 Notice that by clicking on any of the links, information is displayed regarding the source and the target of the link, and the link type (model, design, operational, user-defined data).
60 Our task is to enrich our customer warehouse with additional customer information
that we have gained from the sentiment analysis. Exit the maximized view.
Click the method icon on the Web Data asset set. The next actions associated with this first set are:
Analyze Sources
In this exercise we are going to analyze the big data customer sentiment results file that
was exported from our BigInsights/Hadoop cluster. InfoSphere Discovery will help us get
a better understanding of the new data source and how we can process it in DataStage.
We will also check if we can discover any relationships between the text file and our
existing customer master database table.
This lab requires that DB2 and the InfoSphere Discovery services are up and running.
DB2 must be up and running because any request made in Discovery Studio is
processed by the Discovery Engine, which retrieves and/or stores objects in the
Discovery Repository database, a DB2 database.
Start the DB2 service.
Note: if the DB2 services or the Discovery services were not started, you would get
a Java error message when you try to bring up Discovery Studio.
2 In the Source Data Discovery tab, select the New Project icon.
4 Click OK.
Data Sources
Every Source Data Discovery project contains at least one data set.
A data set can contain physical tables from one or more databases,
text files, or a combination of tables and files.
5 We will define a data set for the data source we are interested in. This process will
consist of:
o Naming the data set
o Specifying a connection to the data set
o Selecting the tables or files to be included in the data set.
o Importing the physical tables to the Discovery staging database
o Defining logical views (Logical Tables), which let you eliminate unnecessary
columns, or perform pre-joins.
Task: Adding the customer sentiment text file to the data set
8 Highlight the Text File Formats & Files section.
10 Navigate to C:\bootcamp\dif\SAMPLE_Brand_Retail_Feedback.csv
12 Keep the Delimited File format and the other settings like Row Delimiter.
13 Check the Heading Line check box: the first line contains column names.
14 Click Next.
15 Keep the Column Delimiter as Comma and set the Text Delimiter to “ (double
quote).
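For illustration, here is what a record in such a file might look like. The values and the exact column order are hypothetical; only CATEGORY as the second column and TEXT as the last column are known from later steps of this course:
USERID,CATEGORY,BRAND,PRODUCT,POLARITY,FOLLOWERSCOUNT,FULLNAME,CREATEDTIME,TEXT
"100234","Retail","JK Outdoor","Trail Jacket","positive","57","Dale Hemmingway","2013-05-14 10:22:31","Love the new jacket, great quality"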
16 Click Next.
Password: inf0server
23 Highlight the CustomerMaster connection and click on the green plus sign to import
a database table.
26 Click Finish.
27 We have now defined all data sets for our project, and can start the analysis.
29 Click on the ‘Run Next Steps…’ button on the lower right side.
30 In the Processing Options window, notice the arrow pointing to Column Analysis;
you could drag that arrow down to any of the other tasks (PF Keys, Data Objects,
Overlaps). All the tasks up to the task being pointed to will be executed. Usually, it
is advised to perform each task separately, verify the results delivered by Discovery,
and make any appropriate modifications before proceeding to the next task.
31 Click the Run button.
32 While the task is executed, notice an information message informing you that the
project is locked and that you cannot perform changes. However, you can monitor
the progress of the task(s) being processed. A status indicator is displayed in the
top-right corner that indicates the number of tasks currently active.
33 You can click on 'Currently 1 Active tasks'. This opens the Activity Viewer window, which lists all the activities³, whether queued, running, completed, or completed with errors.
34 From the ‘Activity Viewer’ window, you can monitor the trace and error logs for each
activity.
35 Close the Activity Viewer window.
³ If all the activities remain in a queued state, this could mean that the InfoSphere Discovery Engine was not started.
36 Once the processing completes, the message 'Project is locked. You cannot make any change' will disappear, and InfoSphere Discovery will display column profiles.
37 The Column Analysis tab will appear with a green status icon indicating that the
analysis run was successful.
39 Review the column analysis results for that table. Notice that the analysis results are
composed of column metadata information on the left side, and column statistics
information on the right side. (Metadata, Statistics columns)
40 In the metadata section, we discovered that the native data type corresponds to the
defined data type.
41 Scroll to the right to see the statistics columns (alternatively, you can click on the
‘Column Chooser’ icon to select the columns that you would like to display).
47 Note that the BigInsights text analytics processing was not always able to identify which product category, brand, product family, and product name are associated with the sentiment data.
48 Highlight the BRAND column, click on Value Frequency and review the actual data
values in this column by frequency. You will find missing values or the value <null>
in it. Note that <null> is an actual value, since we have been analyzing a text file, which does not support NULL values. We can later convert the <null> strings to NULL values in DataStage when loading the data into an RDBMS.
50 So far, we have analyzed each table separately. We will now let the system discover
the relationships between tables in each data set. Text files cannot include primary-
foreign key metadata. This step is critical to identify keys that we can use to join the
data.
51 From the Column Analysis tab, click Run Next Steps.
52 In the Processing Options window, make sure the arrow points to PF Keys, and
click Run.
53 Once processing has completed, look at the result of the PF key discovery for the SocialData data set.
54 You can position the different objects on the graph as you wish, so that the lines
between tables do not overlap.
56 Click the arrow that connects the two tables. Two key associations were identified: FULLNAME and USERID. While FULLNAME has a relatively high hit rate, only the SOCIAL_USERID and USERID relationship has a 100% value row hit rate on both sides.
57 Only one relationship actually makes common sense. You can now exclude the FULLNAME key relationship by highlighting it and clicking the red X to delete it.
1. Between any pair of tables, there may be zero, one or several column matches.
2. Some column matches may be coincidental, and it is your responsibility to
remove the matches that have no meaning from a business perspective.
3. Column matches and the resulting PF keys are influenced by parameter values
defined in the processing options.
4. Table classification and PF key identification are also influenced by the data set classification type (operational or data warehouse).
5. Table classification is automatically determined by Discovery during PF key
analysis.
6. Discovery users (SMEs) can modify the table classification and the link type between tables.
7. Discovery users (SMEs) can add links between tables when desired.
8. Discovery will never fail to identify a link between two columns: a link is identified whenever the proportion of values common to the columns is above a threshold specified as a parameter in the processing options.
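To make the hit-rate idea concrete, the following SQL is a hedged illustration of the kind of value-overlap check this implies. It is not Discovery's actual implementation, and for concreteness it uses the DB2 table names the data will have once it is loaded later in this course:
-- Illustrative only: fraction of distinct USERID values that also
-- occur in CUSTOMER_MASTER.SOCIAL_USERID
SELECT CAST(COUNT(DISTINCT s.USERID) AS DECIMAL(9,4)) /
       (SELECT COUNT(DISTINCT USERID) FROM DIF.CSTSENTIMENT) AS hit_rate
FROM DIF.CSTSENTIMENT s
INNER JOIN DIF.CUSTOMER_MASTER m ON s.USERID = m.SOCIAL_USERID;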
Data Objects
Discovery organizes related tables into structures called data objects,
based on the primary-foreign keys.
In most cases, tables classified as root entity tables become root tables
(parents). Tables with foreign keys classified as child entity or
reference tables usually become child tables in data objects.
A table with no primary key or foreign keys is also considered a data
object.
Data objects never span across multiple data sets: if a table in one
data set is related to a table in another data set, this relationship will be
discovered in subsequent steps.
62 Review the objects generated for the SocialData data set. CustomerMaster was
classified as the root entity.
Data Archival
When archiving data, your goal is to minimize the traffic of data
between primary storage and the archive. Therefore, related tables
should be archived together.
The Data Object Discovery phase identified sets of related tables as
‘data objects’.
You can export these Discovery data objects to Optim for archival purposes. A filter (WHERE clause) can be applied to a data object to further restrict the size of the archive.
65 Export the data object: click Project > Export > Optim Data Models.
This concludes the Discovery lab. We now have a better picture of the data that we need to process. Save the Discovery project and exit the studio.
4. The Information Server Suite Administrator user ID, isadmin, is displayed. The WebSphere Application Server administrator user ID, wasadmin, the DataStage Administrator dsadm, and the DataStage user dsuser are also listed. We will be using the user ID that was created for the Data Integration Fundamentals class, called dif.
6. Note the properties of this user. Expand the Suite Component. Note the Suite Roles and the Component Roles that have been assigned to this user. Our user has access to most of the Information Server components in various roles (User, Administrator, etc.). The DataStage Administrator and User roles are checked.
7. Return to the Users main window by clicking on the Cancel button (you might have to
scroll down in order to see it).
8. Click Log Out on the upper right corner of the screen and then close the browser.
11. Specify the host name of the Information Server services tier computer (infosrvr),
followed by a colon, and followed by the port number (9080) to connect. Use dsadm
as the User name and inf0server as the password to attach to the DataStage
server. In our case the host name of the Information Server engine is the same as
the one for the services tier. Click Login.
12. Click the Projects tab. Here, you can add, delete, and move DataStage projects.
Select the “dif” project and click the Properties button.
Note that all users have the DataStage and QualityStage Administrator role except
dsuser.
15. Click the Environment button to open up the Environment variables window.
16. There are many environment variables that affect the design and running of parallel
jobs in DataStage. Commonly used ones are exposed in the DataStage
Administrator client, and can be set or unset using the Administrator. In the Parallel
folder, note that the APT_CONFIG_FILE parameter points to the default
configuration file location. We will learn about configuration files in a later module.
In the Reporting folder we have enabled additional reporting information that will
help us debug our jobs.
18. Go to the Parallel tab and browse the parameters and available settings. The parallel
page allows you to specify certain defaults for parallel jobs in the project, for example
format defaults for time and date. Click OK when done.
2. Use the dif / inf0server combination to log into the INFOSRVR/dif DataStage
project.
3. Once you log on to the Designer client, you will see this screen:
5. Save the job now with the name SampleSentiment into the Jobs folder in the repository by clicking File > Save As…
6. Type the name and save it into the Jobs folder. Click Save.
7. Add a Sequential File stage (‘File’ category in the Palette), a Copy stage
(‘Processing’ category in the Palette), and a second Sequential File stage. Draw links
between them. You can draw links by selecting the link element from the General tab
in the palette. The quickest way to draw links, however, is to right click onto the
originating stage (Sequential File) and drag the link onto the target stage (Copy
Stage).
Copy Stage
The Copy stage has one input link and can have multiple output links. It
copies all incoming records to all output links.
The single output link in our case means that the records are simply
passed along without any operation. It serves as a placeholder for
future logic.
8. Name the stages and links as shown. To rename a stage and link, select the object
and start typing over it. You can also right click on the object and select Rename.
9. We are going to read from the SampleSentiment data file. When reading or writing
sequential files using the sequential files stage, DataStage needs to know three
important facts:
Table Definitions
Table definitions describe the column layout of a file. They contain
information about the structure of your data. Within a table definition
are column definitions, which contain information about column names,
length, data type, and other column properties, such as keys and null
values.
Table definitions are stored in the metadata repository and can be used
in multiple DataStage jobs.
10. First, we need to check if we already have a Table Definition for the file we want to process. Browse the Repository window in the Table Definitions > Sequential folder.
12. Choose the /bootcamp/dif directory by clicking the button to the right of the Directory field.
Note: The files will not be displayed, because you are just selecting the directory.
Note: We are browsing the engine tier file system on the Linux server VM, not
the Windows client VM.
13. After you click OK in the directory browser, all text files will be displayed in the Files area.
15. Make sure you are saving the Table Definition to the \Table Definitions\Sequential\
folder and click Import.
16. Check the box First line is column names and then go to the Define tab.
17. By default, all fields are non-nullable. Since we have analyzed the file beforehand in
Discovery, we know that most fields contain empty values. We plan to use USERID
as an identifier and should keep this field as non-nullable. Define all other fields as
Nullable.
18. Change the SQL type for CREATEDTIME to Timestamp and remove the 255 value
from the length field.
USERID    VarChar    25    Nullable: No
20. Your sequential file meta data should look like this:
22. Close the import window. The new Table Definition will be displayed in the
Repository window under the Table Definitions > Sequential folder.
23. Double click on the source Sequential File stage. We need to specify the file to read
in the Properties tab. Select the File property and then use the right arrow to browse
for a file to find the SAMPLE_Brand_Retail_Feedback.csv file. Click OK. Hit the
Enter key to see the file path updated in the File property.
24. Set the ‘First Line is Column Names’ property to True. If you don’t, your job will have
trouble reading the first row and issue a warning message in the log.
25. Next, go to the Format tab and click the Load button to load the format from the
SAMPLE_Brand_Retail_Feedback.csv table definition under folder /Table
Definitions/Sequential.
26. Note that DataStage was able to identify the file format during the import of the Table Definition. This is how the file looks in raw format:
27. Next go to the Columns tab and load the columns from the same table definition in
the repository. Click OK to accept the columns.
28. Click View Data and then OK to verify that the metadata has been specified properly. If it has, you will see the data window; otherwise, you will get an error message. Close the View Data window and click OK to close the Sequential File stage editor.
29. Open the Copy stage. In the Copy stage Output tab > Mapping tab, select all source
columns and drag them across from the source to the target.
32. Ensure that in the Format tab, the Delimiter setting in the Field defaults folder is set
to comma delimited.
33. In the Properties tab, in the File property, type the directory name /bootcamp/dif/ and name the file SAMPLE_Brand_Retail_Feedback_Copy.csv. Instead of typing, you can use the right arrow button to 'browse for file', pick the SAMPLE_Brand_Retail_Feedback.csv file, and then edit the name to append "_Copy" before the extension.
34. Set option ‘First Line is Column Names’ to true. The File Update Mode should
continue to be set to ‘Overwrite’ every time the job is run. Click OK to save your
settings.
37. After the compilation is finished you can close the Compile Job window.
38. Right-click over an empty part of the canvas. Select or verify that “Show performance
statistics” is enabled (a checkmark should be present in front of “Show performance
statistics”). This will show, for each link, how many rows were processed and the
throughput per second.
39. Ensure that you have the Job Log view open. To open the window, click on the menu
View > Job Log. Enlarge the job log window.
42. Scroll through the messages in the log. There should be no warnings (yellow) or
errors (red). If there are, double-click on the messages to examine their contents.
Fix any problem and then recompile and run.
43. Rearrange the job log window to make more canvas space available.
44. You can view the result data by right clicking on the target sequential file and
choosing ‘View SampleSentimentCopy data…’.
50. Open up the job properties window by clicking the 5th icon on the tool bar.
51. Go to the Parameters tab. We will define two job parameters. Define the first job parameter, named TargetFile, of type String: double-click the Parameter name field, type the name, and then fill out the other fields. Create an appropriate default filename, e.g., TargetFile.txt. The second job parameter, TargetDirectory, will contain the target directory, /bootcamp/dif/.
Hit the Enter key to retain the changes. Click OK to close the window.
53. Open up your target Sequential File stage to the Properties tab.
54. Select the File property. Delete the content of the File property.
55. Click on the black arrow button on the right side of the text box.
57. Select the TargetDirectory parameter first. Place the cursor at the end of the inserted
#TargetDirectory# string. Repeat the step with the TargetFile parameter.
58. Your final File box string should look like: #TargetDirectory##TargetFile#
59. Note that the parameters are enclosed in # signs. If you did not add a final / to your target directory earlier, you can place one manually between the two parameters. At run time, the File property resolves to the concatenation of the two values, for example /bootcamp/dif/TargetFile.txt.
62. Run your job. Note that DataStage prompts for the parameter values. Leave the
default values intact. Click Run.
64. Scroll through the messages in the log. There should be no warnings (yellow) or
errors (red). If there are, double-click on the messages to examine their contents.
Fix any problem and then recompile and run.
65. Right-click the final sequential stage 'Target File'. Select 'View Target_File data…'.
71. Rename the stage and link names as shown for good standard practice.
72. Edit the SampleSentimentData Sequential File stage. Change the file name in the File property to SAMPLE_Brand_Retail_Feedback_Reject.csv and set the Reject Mode property to Output. This way, the rejected records will flow to the sequential file. This input file contains a few records that do not fit the Table Definition; these will be rejected.
73. Modify the Sequential File stage Sentiment_Rejects to write the output to a file called
Sentiment_Rejects.txt, located in /bootcamp/dif/.
75. Switch to the Columns tab. Note that there is only one output column and that it is greyed out. Often, column layout issues cause records to be rejected. DataStage outputs these records as a binary object.
78. Run the job and view the job log. The result will be as shown below. To see the number of records on the links, remember to enable Show performance statistics from the canvas if it is not already on.
79. The input file we used had one additional field defined for three records. Since these
records did not fit into the table definition where TEXT was defined as the last field,
they were sent down the reject link. Note that these data quality issues should be
caught upstream during the discover phase.
80. Replace the Sentiment_Rejects Sequential File stage with a Peek stage from the
Development/Debug category in the palette.
82. Observe the job run log. Instead of storing the records in a text file, the peek stage
has caused the records to be output to the log. You will notice two entries (one for
each processing node) that contain the actual rejected data records. You can
double-click on the log entry to view the full text.
85. Open the source sequential file stage and change the file attribute file name to
SAMPLE_Brand_Retail_Feedback_Null.csv.
86. We will now process an input file that has empty string values in it. The values occur
in the CATEGORY field on three records. We will define these as NULL.
87. Click the Columns tab of the source Sequential File stage.
88. Double-click the column number 2 (to the left of the column name) to open up the
Edit Column Meta Data window.
89. In the Properties section, click on the Nullable folder and then add the Null field value
property. Here, we will treat the empty string as meaning NULL. To do this specify “”
(back-to-back double quotes). Click on Apply and then Close to close the window.
91. Click the Columns tab of the target Sequential File stage. Double-click the
CATEGORY column number 2 (to the left of the column name) to open up the Edit
Column Meta Data window.
92. In the Properties section, click on the Nullable folder and then add the Null field value
property. Here, we will write the string NO CATEGORY when a NULL is
encountered. Click on Apply and then Close to close the window.
95. View the data at the target Sequential File stage by right-clicking on the stage and
selecting View TargetFile data…. Notice that DataStage prints the word “NULL” in all
records with empty strings. The NO CATEGORY value is not displayed. This is
because DataStage knows that these represent a NULL value. Let’s take a look at
the file on the DataStage server.
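For example, in the putty session you could inspect the written file like this (a sketch, assuming you kept the default TargetFile.txt name and /bootcamp/dif/ directory from the earlier parameter exercise):
grep "NO CATEGORY" /bootcamp/dif/TargetFile.txt | head -n 5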
102. You will see that the records contain the string that we assigned, “NO
CATEGORY”, to represent a NULL value.
103. You can keep the putty window open for now.
Task: Read data from multiple sequential files using File Pattern
In this task, we will create a job that will read data from multiple sequential files and write
to a sequential file. We will use the File Pattern option to read multiple files in a
Sequential File stage.
106. Edit the source Sequential File stage read method to File Pattern. Accept the
warning message.
108. This will read all the files matching the file pattern in the specified directory.
110. Compile and run the job. As can be seen on the auto partitioning icon on the link,
the source stage reads data from all the source files matching the pattern and writes
it to the output file.
111. You can right-click on the target sequential file and view the data of this stage. You may want to increase the number of rows to be displayed to 600. We have processed two input files with this file pattern. Check the results in the output file and verify that it has all the records from the files that satisfy the file pattern.
114. Delete the target sequential file and replace it with a Data Set file from the File
category in the Palette and name the link and stage TargetDataSet.
115. Edit the target Data Set stage properties. Write to a file named TargetDataSet.ds
in the /bootcamp/dif/ directory.
116. Verify the columns tab and that all columns are there. Click OK to close the stage
editor.
119. In Designer click on Tools > Data Set Management. Select the Data Set that was
just created.
121. Click the Show Data icon to view the data of the Data Set (3rd icon).
122. Close the data viewer. Click the Show Schema icon (2nd icon) to view the Data Set
schema.
123. The Data Set Management utility can be used to view the internal schema format.
Close the Dataset Management Utility.
2. In the Configurations box, select the default configuration. You might want to expand the window so that the lines do not wrap, which makes them easier to read.
3. Your file should look like the picture below, with two nodes already defined. If only one node is listed, make a copy of the node definition including its curly braces (that is, copy from the first "node" keyword to the matching "}"), paste it right after the end of the definition section for node1, and change the name of the new node to "node2". Be careful that you have a total of three pairs of block curly brackets: one encloses all the nodes, one encloses the node1 definition, and one encloses the node2 definition.
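For reference, a two-node configuration file has the following general shape. This is a sketch: the fastname and resource paths shown are assumptions based on default install locations and will likely differ in your environment. (The inline {pools ""} clauses on the resource lines are separate from the three block brace pairs counted above.)
{
  node "node1"
  {
    fastname "infosrvr"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "infosrvr"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}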
2. Note how the partitioning indicator is showing the ‘fan in’ symbol before the target
stage. This means the two partitions are currently collected into a single file.
3. In the target Sequential File stage, define two files, TargetFile1.txt and
TargetFile2.txt, in order to see how DataStage data partitioning works. To define
more than one target file, click on File property.
6. View the job log. Notice how the data is exported to the two different partitions (0
and 1).
7. Go back to the putty window where you should still be logged on as dsadm /
inf0server on the server.
Let’s view the first output file of the job by typing:
head /bootcamp/dif/TargetFile1.txt -n 5
8. Note the associated FULLNAME records for the first entries. In this case these were
Dale Hemmingway, Garth Karlson, Ralph Monk and Bill Lanford.
9. Next, view the first ten rows of the source file by typing:
head /bootcamp/dif/SAMPLE_Brand_Retail_Feedback.csv -n 10
Notice how the data is partitioned. Here, we see that the 1st, 3rd, 5th, etc. records go
into one file and the 2nd, 4th, 6th, etc. records go in the other file. This is because
the default partitioning algorithm is Round Robin.
2. Compile and run the job again. Open the target files and examine. Notice how the
data gets distributed. Experiment with different partitioning algorithms!
3. The following table shows the results for several partitioning algorithms. You will also find the row count in the log: observe the messages for the export of the TargetFile Sequential File operator for partitions 0 and 1, TargetFile,0: Export complete and TargetFile,1: Export complete.
Partitioning Algorithm    Records in File1    Records in File2    Comments
2. Open the Properties tab of the source Sequential File stage. Click the Options folder
and add the “Number of Readers Per Node” property.
6. In the job log, you will find log messages from Import SampleSentimentData,0 and
SampleSentimentData,1. These messages are from reader 1 and reader 2. In
addition, you can see that DataStage is now using the same partitioning before the
copy stage since the incoming data stream already has two partitions.
7. You may also notice that one record was dropped because the data string did not
match the timestamp format of the CREATEDTIME column. We sent the first column
name record into the data stream as well. The ‘First line is column names’ property is
invalid when reading with multiple readers per file.
2. Select the Other folder and then Data Connection. Click OK.
3. Name the Data Connection JKLW_DB and type ‘Sample Outdoor Operations
Database’ as a short description.
4. Switch to the Parameters tab. Browse for a stage type in the 'Connect using Stage Type' section. Note that there are many different stage types for which you can create Data Connections. Select the DB2 Connector stage type from Parallel > Database and click 'Open'.
ConnectionString: JKLW_DB
Username: db2admin
6. Click OK.
7. Save the Connection Object as JKLW_DB in the folder Jobs > Shared.
Task: Load the data from the sequential file to a DB2 UDB table
using a DB2 Connector stage
In this task, we will create a job that reads data from the sentiment data sequential file
and loads the records into a DB2 UDB table. We will use a DB2 Connector stage to
write data into a new DB2 database table.
9. Create a new parallel job named SequentialSentimentToDB2. Note: To save time, you can use the SampleSentiment job as a template and remove the target stage and link.
10. Drag and drop the JKLW_DB Connection Object that you just created onto the
canvas. Change the link to an input link. Connect the link to the Copy stage.
11. Rename the stage and link names as shown for good standard practice.
12. We will load the data into our database for further processing.
13. In the SampleSentimentData source stage, open the Output properties. Ensure that
you remove the Multiple Readers per Node property if it’s active. We need to read
the file with the column names. Ensure that the First Line is Column Names property
is True.
14. Open the Copy stage. Go to the Output tab and map all columns to the output link.
16. DataStage associated the stage with the Connection Object. Test the connection.
17. In the Properties tab, expand the Usage section. Change the Generate SQL option
to Yes. Specify the following Table name: DIF.CSTSENTIMENT.
21. Observe the job log. Note the SQL statements that were generated for each partition.
22. You will find a create table DIF.CSTSENTIMENT statement and also an INSERT
INTO DIF.CSTSENTIMENT table statement.
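The generated statements will look roughly like the following sketch. The exact column list, order, and lengths come from your Table Definition, and the SQL DataStage generates may differ in detail; at this point all fields are still character types as imported, except CREATEDTIME:
CREATE TABLE DIF.CSTSENTIMENT (
  USERID VARCHAR(25) NOT NULL,
  CATEGORY VARCHAR(255),
  BRAND VARCHAR(255),
  PRODUCT VARCHAR(255),
  POLARITY VARCHAR(255),
  FOLLOWERSCOUNT VARCHAR(255),
  FULLNAME VARCHAR(255),
  CREATEDTIME TIMESTAMP,
  TEXT VARCHAR(255)
);
INSERT INTO DIF.CSTSENTIMENT (USERID, CATEGORY, BRAND, PRODUCT, POLARITY,
  FOLLOWERSCOUNT, FULLNAME, CREATEDTIME, TEXT)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);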
23. In the following exercises we will use this table for further processing. This means we
will have to pay closer attention to the SQL data types in our table definitions. During
the discover phase, we found out that the UserID and Followerscount fields consist
entirely of numbers. We can use the default DataStage type conversion in the copy
stage to convert these two varchar fields to a numeric data type. This will make it
easier to run operations like joins, aggregations and sorts on these fields.
24. Open the properties of the Copy stage. Change to the Output tab. We are mapping all input columns to the output of the stage. Open the Columns tab.
25. Change USERID to the Numeric and FOLLOWERSCOUNT to the Integer SQL type.
USERID Numeric 25
27. Click OK. If we want to run the job again, we need to change the table action from
create to replace. You can make this change in the DB2 Connector stage properties.
30. View the job log. You will notice that two warnings were issued, one for each default type conversion that was carried out in the Copy stage. You will also notice that a drop table statement is issued before the create table statement.
31. You can view the data in the table by opening the DB2 Connector stage and then
clicking on the View Data button. DB2 has now created a table using native DB2
data types.
In this lab, we will build a physical data model for a new warehouse table. This table will
store the customer sentiment data that we have just loaded into the operational
database. For this, you will use IBM InfoSphere Data Architect to view the overall
database structure and manage system design changes.
3. From the main menu, click File > New > Data Design Project. A new data design project wizard will open.
11. Uncheck all default Database elements except ‘Tables’. Click Finish.
12. You can now browse the discovered table in the DIF schema. The CUSTOMER_MASTER table has a defined Primary Key (PK). It also has the SOCIAL_USERID column which, as we learned during the Discovery phase, has the same key values as our Customer Sentiment data.
In this step, we will create a new table to store our customer polarity information in the existing data warehouse.
13. We will now import the existing DIF schema from our data warehouse.
14. Right-click the Data Models folder and select New Physical Data Model.
15. Rename the data model to ‘Warehouse Physical Data Model’. Select ‘Create from
reverse engineering’.
20. Expand the JKLW_DWH database and the DIF schema. Right-click the DIF schema entry. Select 'Add Data Object' and then 'Table'.
21. Name the new table CUSTOMER_POLARITY. This table will store data about
identified positive or negative product experiences. This table will contain the
following fields:
Column      SQL Type        Description
UserID      DECIMAL(31)     Identifying key; helps us to join records.
Polarity    VARCHAR(255)    Positive or negative sentiment.
22. Add these columns by right-clicking on the CUSTOMER_POLARITY table and selecting 'Add Data Object' > 'Column'.
24. Go to the Properties view on the bottom right side of the screen. Switch to the Type
section. For the UserID column, change the type to Decimal, Precision 31.
25. Repeat these steps for the other three columns. Make sure you define the correct SQL Types for each column. This is how your CUSTOMER_POLARITY table should look in the end:
27. Right click on the CUSTOMER_POLARITY table and select Generate DDL…
30. In the Objects selection, Deselect All and then select Tables. Click Next.
31. The script is now created. We can now run the script on the server. Check the ‘Run
DDL on server’ option and click Next.
34. The SQL Results window will appear and the status should be ‘Succeeded’.
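The generated script should contain a statement along these lines. This is a sketch: the NUMFOLLOWERS and CATEGORY types are assumptions based on the mapping defined later in this course, and Data Architect may emit additional options:
CREATE TABLE DIF.CUSTOMER_POLARITY (
  USERID DECIMAL(31, 0),
  POLARITY VARCHAR(255),
  NUMFOLLOWERS INTEGER,
  CATEGORY VARCHAR(255)
);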
In this task, we will export the InfoSphere Data Architect Model to disk. This will enable
us to bring the physical model metadata into the Information Server Metadata
Repository.
36. To export the warehouse physical data model that we just created, go to File > Export.
37. Open the General folder and select File System. Click Next.
38. Browse for a directory to save the export files in. Choose
C:\bootcamp\dif\DataModels. Select the Physical Data Model and the Warehouse
Physical Data Model for export. Click Finish.
40. Before you can use the ODBC connector in a job, you need to configure database drivers, driver managers, and data source names. Our server already has two Data Source Names (DSNs) defined, for JKLW_DB and JKLW_DWH. We are set to import Table Definitions using the ODBC Connector.
41. Open DataStage Designer and log on to the dif project using dif / inf0server. Then,
go to Import > Table Definitions > ODBC Table Definitions.
45. DataStage is now searching for tables in the data source. Select the DIF.CSTSENTIMENT and DIF.CUSTOMER_MASTER tables from the list. Save them in the default \Table Definitions\ODBC\JKLW_DB folder. Click Import.
46. The Table Definitions are now available in the Repository window.
We will use FastTrack to specify a mapping of our customer sentiment source data into the new customer polarity data warehouse structure. InfoSphere FastTrack helps automate much of this process and provides a centralized location for tracking and auditing specifications for a data integration project.
Let's quickly revisit the steps we have taken so far by looking at the BI development method guidance in our blueprint. We have analyzed our source data and developed the necessary physical warehouse table; now we are going to define our mapping specification. The next step after that will be developing our information integration logic.
This also ensures that the mapping specifications are linking to the same metadata
artifacts that the data integration developers are using.
1. Start the Information Server FastTrack client from the desktop, or select Start > All Programs > IBM InfoSphere Information Server > IBM InfoSphere FastTrack Client.
5. In the description field, fill in ‘Mappings for customer sentiment tables’. Click Finish.
6. Double click on the new project. This will open the mappings tab.
7. Expand the DIF Customer Warehouse folder. You will notice the three folders in the
DIF Customer Warehouse project.
8. The Mapping Specifications folder holds the created and imported source to target
mapping specifications. The Mapping Components folder holds mapping
components which are the direct equivalent of DataStage shared containers. They
are integrated into mapping specifications as sources or targets. The Mapping
Compositions folder stores mapping compositions which consist of a set of
mapping specifications that share a relationship, for example, the same target
mapping.
You can use FastTrack to import metadata from existing physical tables.
10. Highlight the INFOSRVR host and, in the Tasks bar on the right, click on Import
Metadata.
11. Expand the JKLW_DB database connection with the JK Life & Wealth Operational
Database description.
13. Expand JKLW_DB > DIF. Select the CSTSENTIMENT table and the
CUSTOMER_MASTER table. Click Import.
17. Choose to import to the existing INFOSRVR host and click Next. Import to the JKLW_DWH database. Click Finish.
21. Switch over to the mappings view. Also note the Database Metadata folder in the
Browser window. Tip: If you do not see the Browser tab, click View > Browser.
26. Right-click the CUSTOMER_POLARITY table and select Map to.
27. The target fields are now populated. Highlight the source fields, right-click the highlighted area, and select 'Discover More…'.
31. FastTrack will find three results for the four columns. Add the three results as source
fields.
32. Drag and drop the FOLLOWERSCOUNT column from the CST_SENTIMENT source
table into the NumFollowers target field column. Your mapping table:
CSTSENTIMENT.POLARITY → CUSTOMER_POLARITY.Polarity
CSTSENTIMENT.USERID → CUSTOMER_POLARITY.UserId
CSTSENTIMENT.FOLLOWERSCOUNT → CUSTOMER_POLARITY.NumFollowers
CSTSENTIMENT.CATEGORY → CUSTOMER_POLARITY.Category
Validation
35. Ensure that you have no validation errors. Tip: To view the validation tab, click View
> Validation. It will appear in the top right area.
37. Select the DIF > Jobs > Warehouse Jobs folder to store the job. Store the Table
Definitions in the Table Definitions Folder. Click Next.
38. Now we can define the data source connection information. In the Connection
Configuration section, click ‘Manage’.
Name            Database Name    Connector    Write Mode
JKLW_DB_DB2     JKLW_DB          DB2          Insert
JKLW_DWH_DB2    JKLW_DWH         DB2          Insert
40. Name the connection JKLW_DB_DB2 and select the DB2 Connector. Specify
JKLW_DB for the Database Name property. For the authentication information,
select ‘Manage Parameters…’.
41. For the Userid field, create a new parameter called DB2 User. The default value is
db2admin. Click OK. Assign the DB2 User Parameter to the Userid field.
42. Switch to the Password field. Create another parameter called DB2 Password. You cannot assign a default value here. Click OK. Assign the DB2 Password parameter to the Password field.
43. To create the JKLW_DWH_DB2 Configuration, select New and repeat the steps
using the JKLW_DWH database name while reusing the default parameters.
47. You can now close the source to target mapping specification.
48. Let’s observe the generated job in DataStage. Open up DataStage Designer. Log on
to the dif project. Tip: If you still had the Designer window open, click Repository >
Refresh.
49. Open the new job from the Jobs > WarehouseJobs folder.
50. The job consists of a source DB2 connector stage, a transformer stage, and a target DB2 connector stage. Also note that FastTrack has automatically created an annotation on the canvas that documents the specification from which the job was created, along with the time and date.
51. We will now edit the job parameters. Open the job properties.
52. Switch to the Parameters tab. Specify 'inf0server' as the DB2 Password default value and 'db2admin' as the DB2 User default value. Click OK.
53. Open the properties of the source DB2 connector stage. You will notice that the user
name and password fields are already filled with those job parameters.
55. On the target connector stage, we can keep the Append table action with the Insert
write mode since we already created the empty table from InfoSphere Data Architect.
57. Double click on the transformer stage. Here you can see the simple 1:1 mapping we
created in our specification.
59. Open the source connector stage. Note that the Generate SQL property is set to
Yes.
60. Switch to the Columns tab. Only the four relevant columns from the
DIF.CSTSENTIMENT table are part of the Table Definition. This will result in a SQL
statement that will only select these columns, thus keeping the data extraction
process as efficient as possible. Cancel out of the source connector stage.
62. You should see 588 rows being loaded into the CUSTOMER_POLARITY table.
63. Go into the properties of the target DB2 connector and view the data of the last job
run.
64. You will notice that we have transferred all records into the data warehouse target,
including those that do not carry information about product polarity.
65. We need to make sure that we are only reading records from our source that have
this field populated. This can be done by adding a where clause to our select SQL
statement in the source DB2 connector stage.
66. Close the View Data Window and the target stage properties.
68. Switch the Generate SQL option to No. In the Select Statement field, click on the
Tools button.
70. On the left side of the screen, navigate into the Table Definitions folder. From there, open the JKLW_DB database and the DIF schema, and then drag the CSTSENTIMENT table into the area that says 'Drag tables to here'. Note that this is the Table Definition that we imported through FastTrack earlier.
73. Select the expression and copy it to your clipboard. Add the following statement at
the end of the current statement:
OR CSTSENTIMENT_ALIAS.POLARITY = 'negative'
The entire statement should look like this:
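A sketch of the resulting statement is shown below. The alias DataStage generates and the first predicate, assumed here to test for 'positive', may differ in your SQL builder session:
SELECT CSTSENTIMENT_ALIAS.POLARITY, CSTSENTIMENT_ALIAS.USERID,
       CSTSENTIMENT_ALIAS.FOLLOWERSCOUNT, CSTSENTIMENT_ALIAS.CATEGORY
FROM DIF.CSTSENTIMENT CSTSENTIMENT_ALIAS
WHERE CSTSENTIMENT_ALIAS.POLARITY = 'positive'
   OR CSTSENTIMENT_ALIAS.POLARITY = 'negative'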
74. Switch to the SQL tab and view the entire SQL statement.
77. View the data and make sure the select statement with the WHERE clause is working correctly.
78. Open the target DB2 connector stage and change the table action property to
Replace.
80. You can view the data after this second run. If you have specified the WHERE
condition correctly, the performance statistics will already tell you that this time only
133 records were processed.
We will now create a new table that combines the product information from the
CSTSENTIMENT table with the address information from our CUSTOMER_MASTER
table.
During the Discovery phase we found out that these two tables share a key:
CSTSENTIMENT.USERID = CUSTOMER_MASTER.SOCIAL_USERID. Remember that
CSTSENTIMENT was created when we extracted the records from the
SAMPLE_BRAND_RETAIL_FEEDBACK source file. We will use this key to join these
tables together.
5. From the metadata browser window, select the JKLW_DB > DIF > CSTSENTIMENT
> BRAND, CATEGORY, POLARITY, PRODUCT and USERID columns.
6. Drag and drop them into the Source Field side of the mapping.
9. Click OK.
10. Drag and drop the JKLW_DB.DIF.CUSTOMER_MASTER table into the Lookup
Column.
11. Drag and drop the JKLW_DB.DIF.CSTSENTIMENT table into the Sources column.
15. Click OK. The key association appears in the Keys and Fields box.
16. We just added a lookup table to our mapping specification. Save the mapping
specification.
18. Note that FastTrack supports a wide range of simple transformation options.
19. We could have also joined these two tables. Switch to the Mappings view.
20. Now we can add the columns from our lookup table to our mapping.
21. Right-click the next empty Source Field and click Add Lookup Field…
22. Expand ProductGeo and CUSTOMER_MASTER. Select ADDRESS and click OK.
23. Repeat the previous steps and add CITY and STATE.
25. We can now define the target fields. Click into the first Target Field. Switch the field
from a Physical column to a Candidate column.
26. We will create a candidate table called ProductGeo. Fill in the Table Name and Column Name information. You can keep the same column names and the STRING, length 250 data type, except for USERID, which is DECIMAL, length 31. Also make sure that you map the correct fields to each other, since your fields may appear in a different order.
27. Do this for all columns until every target field is defined: ProductGeo.Userid as
DECIMAL 31, and the remaining columns as STRING 250.
33. This time, we will use the ODBC Connector to read and write the data. In the
Connection Configuration, click Manage.
38. You can use the DB2 User and DB2 Password authentication parameters. Double-
click to select each parameter.
40. Choose the JKLW_DB_ODBC connection for the source. Do not define a connection
for the target. Click Finish.
41. Once the job is generated, you can close the Mapping Specification and the
FastTrack client.
DataStage Containers
Containers are reusable objects that hold groupings of stages and links. Containers
create a level of reuse that allows you to use the same set of logic several times
while reducing maintenance effort.
There are two kinds of containers:
Local container
A local container simplifies your job design. A local container can be used in only
one job. However, you can have one or more local containers within a job.
Shared container
A shared container facilitates reuse. They can be used in many jobs. As with local
containers, you can have one or more shared containers within a job.
47. Double-click on the container. The container content will open in a new tab.
48. Switch back to the Product_Geography_Mapping main job. For job design simplicity
we are going to deconstruct this local container. Right-click on the local container
and select ‘Deconstruct’.
51. Rearrange the stages. FastTrack created a job that reads from the DB2 data source
using an ODBC Connector, looks up records from the CUSTOMER_MASTER
reference table, and then writes out through an ODBC Connector stage. This
Lookup stage has only one reference link, but the stage allows for multiple
reference data sets.
53. Switch to the Parameters tab. Specify ‘inf0server’ as the DB2 Password default
value.
54. We will now create a parameter set from these two parameters. This will allow us to
reuse the database connection information in other jobs and centralize the database
authentication management.
Parameter Set
Use parameter set objects to define job parameters that you are likely to use over
and over again in different jobs. Then, whenever you need this set of parameters in
a job design, you can insert them into the job properties from the parameter set
object.
57. Verify that both parameters are present in the Parameters tab and that the DB2
Password has a default value. Click OK.
60. Notice that both parameters have collapsed into a single Parameter Set Object. Click
OK.
61. We now need to update the source ODBC Connector stage with the parameter set
object. Open the CSTSENTIMENT source stage.
62. In the Connection area, click on the #DB2_User# parameter in the User name row.
Specify the parameter from our new Parameter Set Object called DB2Authentication.
Repeat this step for the Password parameter.
64. Open the CUSTOMER_MASTER lookup ODBC Connector stage. In the Connection
properties, update the User name and Password properties with the new Parameter
Set object. Your connection properties should look like this:
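After the change, both connector stages reference the parameters through the
parameter set, using DataStage's #ParameterSet.Parameter# notation (assuming
the parameters kept the names DB2_User and DB2_Password inside the set):

User name: #DB2Authentication.DB2_User#
Password:  #DB2Authentication.DB2_Password#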
66. Double-click on the Lookup stage. You can see the columns of each table. The
source and lookup tables are on the left and the target table on the right. FastTrack
already defined the lookup key for us. Note that the keys are defined in the table
definitions in the lower part of the screen. The keys are also defined in the
CUSTOMER_MASTER table. The key type is = which stands for equality match.
Lookup Stage
For each record of the source data set from the primary link, the Lookup stage
performs a table lookup on each of the lookup tables attached by reference links.
The table lookup is based on the values of a set of lookup key columns, one set for
each table.
68. Notice the options for Condition Not Met and Lookup Failure. The Condition field is
empty.
Lookup Failure
When DataStage cannot find a corresponding record in the reference set based on
the defined key, you can choose one of four options: Continue (processing), Drop
(the record), Fail (the job), or Reject (send the record to a reject link).
71. Update the Connection Username and Password with Parameter set information.
72. The job is not yet ready to run. We need to define the target stage information first.
Open the properties of the ProductGeo ODBC Connector stage.
73. In the Connection area, highlight the Data source row and define JKLW_DWH as the
data source.
75. Use the Parameter Set for user name and password.
77. Define a schema for our ProductGeo table in the Table name section:
DIF.ProductGeo. You may want to update the description, too.
Change the Table action to Create. We will now create a table from the candidate
schema.
78. Switch to the Columns tab. We need to finish up the Table Definition that was
created by FastTrack. Define all columns as Nullable except UserID.
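For reference, with Table action set to Create, the connector will generate DDL
roughly like the following sketch (illustrative only; the exact type mapping depends
on the ODBC driver, and NOT NULL applies only to Userid per the step above):

CREATE TABLE DIF.ProductGeo (
    Userid   DECIMAL(31) NOT NULL,
    Brand    VARCHAR(250),
    Category VARCHAR(250),
    Polarity VARCHAR(250),
    Product  VARCHAR(250),
    Address  VARCHAR(250),
    City     VARCHAR(250),
    State    VARCHAR(250)
);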
79. Click the Save… button. Specify ODBC as the Data source type, JKLW_DWH
as the Data source name and ProductGeo as the Table/file name.
81. Save the Table Definition in the Table Definition\ODBC folder. Click OK.
84. The job should complete successfully. If not, go back and fix the errors. Look through
the log messages of the job. Everything should look OK with a few warnings for
default type conversions.
85. View the result set in the target stage by clicking on View Data.
2. Open the FollowerCountLevel Sequential File stage. On the Properties tab, specify
the file /bootcamp/dif/FollowerCountLevel.txt to be read.
4. In the Format tab, keep the Field Delimiter as comma but change Quote to ‘none’.
5. In the Columns tab, load the following table definition: dif > Table Definitions >
Sequential > FollowerCountLevel.txt. Click OK.
7. Click View Data to verify that the metadata has been specified properly. Select Close
and OK to close the Sequential File stage.
8. Open the Lookup stage and map the input columns from CSTSENTIMENT to the
output as shown below.
9. Set the FOLLOWERSCOUNT column as a Range Key by checking the box in the
‘Range’ column.
10. Then right-click on the FOLLOWERSCOUNT row and select “Edit Key Expression”.
The expression editor will be displayed.
13. In the FollowerLevel lookup set, switch the Key Type to Range (a..z).
14. Drag the Key Expression from the Link_CSTSENTIMENT table into the Key
Expression fields from FollowerLevel.FollowerFrom and FollowerLevel.FollowerTo
rows. From the FollowerLevel lookup table, drag the FollowerLevel column over to the
output table.
15. Click on the Constraints icon. Make sure the Link “FollowerLevel” is selected.
For the Lookup Failure option, select “Reject” and click OK twice.
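If it helps to think of the range lookup in SQL terms, the logic is roughly equivalent
to the following sketch (the table names stand in for the two input links; with Lookup
Failure set to Reject, source rows with no matching range go to the reject link
instead of the output):

SELECT c.*, f.FollowerLevel
FROM CSTSENTIMENT c
INNER JOIN FollowerCountLevel f
    ON c.FOLLOWERSCOUNT BETWEEN f.FollowerFrom AND f.FollowerTo;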
16. Add a sequential file stage above the lookup stage. Name the stage
FollowerCountRejects. Link the lookup stage to the sequential file stage. Name the
link Rejects.
17. Open the FollowerCountRejects sequential file stage and define the following file
property: /bootcamp/dif/FollowerCountRejects.txt. Click OK.
18. Open the Transformer Stage. Map all input columns to the output, replacing the
existing columns.
19. In the Transformer stage, click on the properties icon. In the Stage > General tab,
ensure that Legacy null processing is disabled. This means that we can process
NULL values inside the Transformer stage. Click OK.
21. Open the CUSTOMER_POLARITY target DB2 Connector stage properties. Switch
the table action property to Replace. Click OK.
23. Run the job and after it’s finished, validate the results by opening the
CUSTOMER_POLARITY target stage and viewing the data.
26. In the Repository window, right-click on the SampleSentiment job. Select Create
copy.
28. Edit the SampleSentimentData source sequential file stage table definition. Read the
USERID field as Numeric 25.
29. Edit the Sort stage to specify the key as USERID and Sort Order is ascending as
shown in the snapshot below:
30. Note that you could add additional sorting keys when the Sorting Keys folder is
highlighted. This allows you to sort records within the first sort key group. Keep
USERID as the only sorting key.
31. Don’t forget to map all the input columns to the output in the Output tab of the Sort
stage. If you did not delete the links earlier, the mappings will still be there. In this
case, update the target Table Definition by switching the UserID field to Numeric.
32. In the target Sequential File stage, update the file parameter to
/bootcamp/dif/SAMPLE_Brand_Retail_Feedback_Sorted.csv
33. Save and compile the Job. Run the job and check the results. The output should
contain data sorted by USERID in ascending order.
34. Our source and target stages are sequential while the Sort stage is a parallel stage.
In our case, DataStage is automatically collecting the records by the USERID column
to produce a sorted sequential output. You may also specify the collection method
explicitly. Go to the Partitioning tab in the target Sequential File stage. Select the
Sort Merge collector and specify USERID as the key.
35. The sorted output shows that we have multiple records per UserID. But what if we
are only interested in unique records? We will learn how to achieve this in the next
task.
1. Use the Sort stage when you need your data to be sorted in a specific order.
2. You specify sorting keys as the criteria on which to perform the sort.
3. The first column you specify as a key to the stage is the primary key, but you can
specify additional secondary keys.
38. Edit the Remove Duplicates stage and specify the Key column as USERID. Note the
options: you can choose to retain either the first or the last record of each duplicate
key group. If you want to retain records by a specific logic, you would have to apply
a multi-key sort (multiple sorting keys) to the input records as they pass into the
Remove Duplicates stage.
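As an illustrative aside (not part of the lab), the retain-first behavior corresponds
roughly to this SQL, where the ORDER BY inside the window carries your
secondary sort keys (BRAND here is just a hypothetical example of such a key,
and feedback is a hypothetical table holding the sorted input):

SELECT *
FROM (SELECT f.*,
             ROW_NUMBER() OVER (PARTITION BY USERID
                                ORDER BY BRAND) AS rn
      FROM feedback f) t
WHERE rn = 1;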
39. From the Output tab, click the Mapping tab, and specify the mapping between input
and output columns as shown below. Click OK to close the stage.
40. Open the target Sequential File stage and specify the output file as
/bootcamp/dif/SAMPLE_Brand_Retail_Feedback_NoDups.csv
42. Run the job and verify the results. You can already see from the performance
statistics that 588 rows entered the Remove Duplicates stage and only 207 rows
were carried over to the target stage.
43. Observe the job log. You will notice that the target sequential stage issued a
warning.
A stage can request that the next stage in the job preserves whatever
partitioning it has implemented. This is defined by the Preserve
Partitioning flag. If the next stage ignores this request, a warning is
displayed on the log to notify the developer.
In our case, the Remove Duplicates stage had the default Preserve
Partitioning flag, which is Propagate. Since we are writing to a
sequential target, the parallel partitions have to be collected and
cannot be propagated.
In the meantime, our QualityStage expert has improved our customer master data that is
stored in the CUSTOMER_MASTER table. The FULLNAME field is now split into first
and last names and the GENDER field values are now complete thanks to
QualityStage’s Country Rule Set processing.
We now have a file that contains the following fields: Identifier, Gender, Firstname and
Lastname. We will now join the new file with our existing master data and replace the
names and gender values. Processing a set of master records with update records is a
good use case for the Join and Merge stages that we will be looking at next.
45. Build a new parallel job that reads from the new file using a sequential file stage and
from the CUSTOMER_MASTER table using a DB2 Connector Stage. Both source
stages are joined and then the data is written to a dataset.
46. Click New > Parallel Job. Save this job as SampleSentimentJoin in the Jobs folder.
47. Properly name the stages and links as good standard practice.
48. Open the source Sequential File. For the File property, specify
/bootcamp/dif/CST_FIRST_LAST_NAME_GENDER.txt. This is what the file looks
like:
49. The file does not contain column names. The format is comma-separated with
double-quoted values. Load the table definition for CST_FIRST_LAST_NAME_GENDER.txt
from the Table Definitions > Sequential folder for the Format and the Columns. Click
the View Data button.
50. View the data again to make sure the file can be read. Click Close and OK to close
the Sequential File stage.
51. Open the properties of the CUSTOMER_MASTER DB2 Connector stage. Load the
JKLW_DB data connection and specify to generate the SQL. Specify
DIF.CUSTOMER_MASTER as the table name.
52. In the Columns tab, load the table definition from Table Definitions >
DIF.CUSTOMER_MASTER.
53. Check IDENTIFIER as your Key. Your Table Definition should look like this:
54. The stage will use this table definition when it’s generating the select statement. We
will not be using the GENDER and FULLNAME fields anymore since these are now
coming from the new master data file. For job performance, it is good practice not to
include these columns in the select statement. Delete these two columns from the
Columns list by highlighting them in the Columns tab and pressing the Delete button.
Your new column list should look like this:
55. Go back to the Properties tab and view the data to make sure your settings are fine.
Click Close and OK to close the DB2 Connector stage. You may save the job.
56. Open the Join Stage. In the Properties tab, specify the join key as IDENTIFIER and
Join Type as Inner as below.
57. Check the Link Ordering tab. It is important to identify the correct left link and right
link when doing either a left outer join or right outer join. Since we are doing an Inner
join, it only serves to identify which link the key column is coming from. You can keep
the default.
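In SQL terms, what the Join stage is configured to do is roughly the following
(a sketch; cst_names is a hypothetical table standing in for the sequential file's
records, and the remaining CUSTOMER_MASTER columns would be listed as well):

SELECT m.IDENTIFIER, n.FIRSTNAME, n.LASTNAME, n.GENDER
FROM DIF.CUSTOMER_MASTER m
INNER JOIN cst_names n
    ON m.IDENTIFIER = n.IDENTIFIER;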
58. Click on the Output > Mapping tab and map the columns to the target. Click OK.
59. Open the target DataSet Stage NEW_CUSTOMER_MASTER. In the Properties tab
specify the path and file to write the output records
/bootcamp/dif/NEW_CUSTOMER_MASTER.ds
61. Save and compile the job. Run the job. It should finish successfully.
62. View the generated file from the Dataset Stage and verify that First Name and Last
Name fields are now separate and that the Gender field is now fully populated.
68. Open the Merge stage and specify the Key which will be used for matching records
from the two files. Select IDENTIFIER. We will keep unmatched master records.
69. Check the Link Ordering tab to make sure that you have the two input sources set
correctly as Master and Update links. For this exercise, OldMasterData should be
the Master link and NewMasterData should be the Update link.
70. Click on the Output > Mapping tab. Verify that all columns are mapped and that they
are mapped correctly.
71. We will overwrite the joined dataset with this job run. You can keep the existing file
properties of the target Dataset Stage.
73. Observe the produced output and the job log. There is one warning message:
Merge,1: Master record (87) has no updates.
74. All 207 rows are passed down to the dataset since we decided to keep unmatched
master records in the Merge Stage properties.
75. NULL values were passed for the unmatched master record.
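Contrast this with the inner join from the previous task: keeping unmatched master
records makes the Merge behave roughly like a left outer join (a sketch; old_master
and new_master are hypothetical names for the Master and Update link data):

SELECT m.IDENTIFIER, u.FIRSTNAME, u.LASTNAME, u.GENDER
FROM old_master m
LEFT OUTER JOIN new_master u
    ON m.IDENTIFIER = u.IDENTIFIER;

Master record 87 has no match in the update data, so its update columns come
back as NULL, exactly as observed.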
77. Create a new parallel job called SampleSentimentFunnel with two Sequential File
stages. We will combine the two files that we split earlier in our job
SampleSentimentPartition.
Note: If you did not run the data partitioning and collection job that produced the two
files, you can load them from /bootcamp/dif/solutions.
78. Open Sequential File stage SampleSentiment1. On the Properties tab, specify the
file to read as /bootcamp/dif/TargetFile1.txt. Set the First Line is Column Names
property to True.
79. Click on the Columns tab, then on the Load button to load the
SAMPLE_Brand_Retail_Feedback.csv table definition from the folder /Table
Definitions/Sequential.
80. Click View Data to verify that the metadata has been specified properly. Click Close
and OK to close the source Sequential File.
81. Open the Sequential File stage SampleSentiment2. On the Properties tab specify the
file to read as /bootcamp/dif/TargetFile2.txt. Don’t forget to set the First Line is
Column Names to True.
82. Click on the Columns tab, then on the Load button to add the column definitions
from the SAMPLE_Brand_Retail_Feedback.csv table definition.
83. Click View Data to verify that the metadata has been specified properly. Click Close
and OK.
84. Open the Funnel stage and view the properties. Keep the Continuous Funnel mode.
85. Select the Output tab and map the input columns to the output columns.
87. Open target Sequential File stage SentimentCombined. On the Properties tab
specify the path and file to write the output records
/bootcamp/dif/CustomerSentimentCombined.txt. Set First Line is Column Names to
True.
Impact analysis can help you identify related assets. It is useful to identify affected
assets when you are about to change an asset like a Table Definition.
93. The result is shown in the Repository Advanced Find window. These are the jobs
that use this Table Definition.
94. Click the right mouse button over the SampleSentiment job and then click “Show
dependency path to…”
95. Maximize the window or use the Zoom button to adjust the size of the dependency
path. Notice that you have a detailed view of the stages and links that use this Table
Definition. The graph shows you in detail which stages require attention when you
are about to change the Table Definition.
96. Close the Path Viewer window at the bottom of the screen.
97. Mark the SampleSentiment and the SampleSentimentSort jobs in the Repository
Advanced Find window. Right-click on one of the highlighted jobs and choose
‘Compare selected’.
98. Once the result is available, close the Repository Advanced Find window.
99. DataStage displays the two jobs as well as the Comparison Results window. It
contains a detailed account of the changes made, e.g. that the Copy stage was
replaced with a Sort stage and that the USERID field was changed from VarChar to
Numeric to enable the calculation in the Sort stage. Note that you can also compare
Table Definitions with each other in the same way.
In this task, we will calculate the revenue generated by our customers this month.
2. Open the source sequential file properties and set the file path to
/bootcamp/dif/Sales.txt. Don’t forget to load the table definition in the Format and
Columns tab. It is located at \Table Definitions\Sequential\Sales.txt.
3. Click on View data and make yourself familiar with the input columns before they are
processed. We will use the Aggregator stage to calculate the total revenue for the
sales data in this file.
4. Edit the Aggregator stage to add the grouping key, CustomerID. Also set the
property Aggregation Type = Calculation, as shown below.
5. Select the Column for Calculation = TotalPrice and, in the lower-right portion of the
screen, select Sum Output Column.
6. A new column will be generated with the aggregation results. Name the new column
MonthlyRevenue.
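The Aggregator configuration corresponds to this SQL sketch (Sales is a
hypothetical table standing in for /bootcamp/dif/Sales.txt; the column names follow
the Sales.txt table definition):

SELECT CustomerID,
       SUM(TotalPrice) AS MonthlyRevenue
FROM Sales
GROUP BY CustomerID;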
7. Click on the Output tab. In the Mapping sub-tab, map both input fields into the target
file.
8. Click OK.
9. Open the MonthlySales target Sequential File stage. In the Properties tab specify
/bootcamp/dif/MonthlySales.txt as the file to write. Set First Line is Column Names to
True.
10. In the Partitioning tab, select Sort Merge for collector type; check the Perform sort
box, and select MonthlyRevenue as the key with an option of Descending order.
Run the job and verify the results. The final file should contain the Grouping Key as
CustomerID and the MonthlyRevenue column in descending order.
4. On the Parameters tab, define the parameters as shown. Don’t forget the last slash
for the directory value.
5. In the Values tab, specify a name for the Value File that holds all the job parameters
within this Parameter Set. Click OK.
8. In the menu bar, go to Edit and open up your Job Properties. Select the Parameters
tab. Click Add Parameter Set. Select your SourceTargetData parameter set.
9. Click OK.
10. Click OK.
12. Replace it with a sequential file stage called PositiveSentiment. Add another
sequential file stage called NegativeSentiment as a second output of the transformer.
13. In the Transformer stage, map all the columns from the source link to both target
links. Select all the source columns and drag and drop them to each output link. The
Transformer editor should appear as shown below:
14. Open the Transformer stage constraints by clicking on the chain icon. We will
now create a constraint that identifies records with positive or negative values for the
POLARITY column. Open the Constraint editor by double-clicking into the Constraint
field of the PositiveSentiment row. To insert a column name without typing it, click on
the … icon and select “Input Column”.
15. For the PositiveSentiment link, define the following constraint: Transform.POLARITY
= “positive”. Make sure to use all lowercase letters. Hit Enter.
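The constraint for the NegativeSentiment link follows the same pattern, again in all
lowercase:

Transform.POLARITY = "negative"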
19. Configure the properties for the target Sequential File stages. Open the
PositiveSentiment output stage. Use the Dir and PositiveTargetFile parameters
included in the SourceTargetData parameter set to define the File property as shown.
20. Open the NegativeSentiment output stage. Use the Dir and NegativeTargetFile
parameter included in the SourceTargetData parameter set to define the File
property as shown. Also, set the option First Line is Column Names as True.
22. View the data in the targets and verify that the records were split up correctly.
23. In the log you may notice warnings saying Exporting nullable field without null
handling properties for the three target Sequential File stages. We see this warning
since we are reading from a database table with a Table Definition that allows
NULL values. We are then writing to Sequential File stages, where NULL values
must have some character representation.
24. We can define this character representation in the target Sequential File stages by
adding the Null field value parameter in the Field defaults folder of the Format tab.
You may choose a number, string or escape character.
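For example (an arbitrary illustrative choice, not a required value):

Null field value = 'NULL'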
25. Note: When you read these sequential files again as source files in another job, you
will have to specify in the stage properties that this string represents the NULL
value.
27. Add a new Sequential File stage linked as an output to the Transformer stage and
name it as shown below.
28. In the Transformer, map all the input columns across to the new target link.
29. Open the Constraints window for the Otherwise output link. Note: You can also
double-click the Constraint box.
33. Add a null field value in the Format tab. Click OK.
34. Save, compile, and run your job. No rows should be going into the Otherwise link.
Our custom SQL select statement in the DB2 Connector stage has a WHERE clause
that reads only positive or negative POLARITY values. Let’s change that.
35. Open the source connector stage. Remove the WHERE clause from the select
statement.
36. Click OK. Compile and run the job again. You should now see records getting
passed into the otherwise link that do not satisfy the transformer constraint condition.
39. Open the Transformer stage. Right-click in the Stage Variables window and click
Stage Variable Properties…
40. Under the Stage Variables tab, create a stage variable named DateProcessed with
Date as the SQL type.
42. Double-click in the derivation editor for the DateProcessed stage variable. Define a
derivation that returns the current date using the function CurrentDate(). You can
either type it in or look the function up.
43. Create a new column named ProcessedDate with Date as the SQL type for each of
the three output links by typing the new column name and its corresponding
properties in the next empty row of the output column definition grid, located at the
bottom right, as shown here.
44. Define the derivations for these columns using the Stage Variable DateProcessed by
dragging the DateProcessed variable and dropping it into the Derivation space of the
ProcessedDate fields. The Transformer editor should look like this:
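In summary, the derivations end up as follows:

DateProcessed (stage variable):        CurrentDate()
ProcessedDate (on each output link):   DateProcessed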
45. Exit the Stage Editor by clicking OK, save, compile and run the job.
46. When you view the result data files, you will find the ProcessedDate column filled.
5. You must define a null value representation when writing nulls to sequential files.
6. The Transformer stage uses a set of functions to transform your data.
7. You can define constraints that allow you to pass data that meets the constraint
condition to a specific output link.
5. Select the dif project, which has new jobs, and then click the Detect Associations
icon. Note: Keep the dstage1 project checked. Confirm to run the service.
6. Once this step has completed, we are ready to create a data lineage from our
CUSTOMER_POLARITY data warehouse table back to the data source.
10. Our data lineage path shows two database tables and five jobs lined up in the data
lineage flow. The solution jobs that are in place are included as well.
11. You can zoom in and out, export the lineage as PDF or JPG, and also view specific
relationship types.
Relationship types
You can select to view Design, Operational, and User-Defined relationships. Our job
here only contains design information since at this point we haven’t imported our
operational data from the engine tier.
Job Design Relationships
Displays data items that the job reads from or writes to. Displays the previous and
next jobs based on job design information that is interpreted by the automated
services. Displays job design parameters and whether runtime column propagation
is enabled.
Job Operational Relationships
Displays the previous and next jobs based on the values of parameters at run time,
based on operational metadata that is interpreted by the automated services.
Job User-Defined Relationships
Displays the data items that a job reads from or writes to, based on the results of
manual linking actions that are performed by the Metadata Workbench
Administrator.
13. We can now build a job sequence that will run the SequentialSentimentToDB2 job
and then the CUSTOMER_SENTIMENT_TO_CUSTOMER_POLARITY_LOOKUP
job that also contains the follower category field.
14. Log out from the Metadata Workbench and close the browser.
17. Drag and drop two Job Activity stages to the canvas, link them, and name the stages
and links as shown.
18. Open the Job (Sequence) Properties. In the General tab, verify that all the
compilation options are selected.
19. Click the Parameters tab and add the parameter sets SourceTargetData and
DB2Authentication as shown. Load these parameters through the Add Parameter Set
button. These parameters will be available to all the stages within the job sequence
during execution.
21. Open up each of the Job Activity stages and associate the parallel job you want to
execute with each stage.
SeqJob Activity Stage    Parallel Job
CustomerSentiment        SequentialSentimentToDB2
CustomerPolarity         CUSTOMER_SENTIMENT_TO_CUSTOMER_POLARITY_LOOKUP
22. For the Job Activity stage CustomerSentiment, change the Execution action to
“Reset if required, then run”.
23. For the Job Activity stage CustomerPolarity, we want it to be executed only when the
upstream job ran without any error, although possibly with warnings.
24. In the first Job Activity stage CustomerSentiment, open the Triggers tab and set the
Expression Type to Custom (Conditional). The expression references the stage name
followed by $JobStatus.
28. The result for the CustomerSentiment stage should look like:
CustomerSentiment.$JobStatus = DSJS.RUNOK or
CustomerSentiment.$JobStatus = DSJS.RUNWARN
30. Open the job log for the job sequence. Verify that each job ran successfully. Locate
and examine the job sequence summary.
31. Examine what happens if the first job aborts. To cause that, open up the job
SequentialSentimentToDB2 and replace in the source Sequential File name
SAMPLE_Brand_Retail_Feedback.csv with the non-existent dummy.csv as shown
below. Save and compile SequentialSentimentToDB2.
32. Execute the job sequence SeqJob and check the log showing the job is aborted. The
first error message in the job log should contain the relevant error.
Note: you don’t need to recompile the job sequence to execute it since nothing was
changed in the job sequence.
33. Open the SequentialSentimentToDB2 job, replacing the dummy.csv source file with
the original SAMPLE_Brand_Retail_Feedback.csv in the source Sequential File
stage File property. Then save and compile the job.
34. Save the job sequence SeqJob as SeqJobVar. Add a User Variable Activity stage as
shown.
35. Open the User Variables Activity stage and select the User Variables tab. Right-click
in the gray space and select Add Row to create a variable named
EnableCustomerPolarity with value 0. Click OK.
36. We want to enable the execution of CustomerPolarity only if the value of the
EnableCustomerPolarity variable is 1. To specify this condition, open the Trigger tab
in the CustomerSentiment Job Activity stage and modify the expression as shown.
Note: you can refer to the User Variable Activity stage variables within any stage in
the job sequence using the syntax:
UserVariableActivityName.UservariableName
(CustomerSentiment.$JobStatus = DSJS.RUNOK or
CustomerSentiment.$JobStatus = DSJS.RUNWARN) and
UserVars.EnableCustomerPolarity = 1
38. Start the job using the DataStage and QualityStage Director client. The Director is
the client component that validates, runs, schedules, and monitors jobs. You can
invoke the Director client through Tools > Run Director.
39. Switch into the Jobs folder, highlight SeqJobVar and click on ‘Run now..’ in the
shortcut icon bar to execute the job sequence again. Click Run.
40. Switch to the job log view by clicking on the ‘Notebook’ icon.
42. Edit the UserVars stage and change the EnableCustomerPolarity value to 1. This
will cause CustomerPolarity to execute.
43. Compile and run the job sequence again and verify in the logs that CustomerPolarity
was executed.
46. Open the Wait For File stage and set the filename of the file as shown below.
Note: the “Do not timeout” option makes the stage wait indefinitely until the file
StartRun appears in the specified location.
48. Compile and run your job. Notice that after the job starts it waits for the file StartRun
to appear in the expected folder.
51. Create a file named StartRun in the directory /bootcamp/dif. You can use the
command “touch /bootcamp/dif/StartRun” for this purpose.
52. Switch back to the log view. Notice the log messages and the job sequence
execution should now continue by running the stage following the Wait For File
Activity.
55. Edit the Terminator stage so that any running job is stopped when an exception
occurs.
56. To see how the exception handling takes control over the job sequence, you will
have to make one of the jobs that are part of the Job Sequence fail. Modify the job
SequentialSentimentToDB2 replacing the SAMPLE_Brand_Retail_Feedback.csv file
name in the source sequential file stage with dummy.csv and compile the job.
57. Compile and run the job sequence again and check the log with the Director client.
Note that as SequentialSentimentToDB2 did not finish successfully, the sequence is
aborted.
3. You should find your blueprint with the timeline feature enabled. Make sure you are
viewing the End of workshop milestone as shown below.
4. We have extracted our Web Data from the file, transformed the data and loaded it
into the warehouse. Now our BI Analysts can start building reports based on the
customer sentiment found in our source data.
5. In this final exercise, we will learn how to expose the polarity and follower count data
that we loaded into the warehouse as a web service and therefore make it
consumable by other applications.
8. Create more space on the blueprint by moving the advanced analytics elements up
and making the Data Repositories and Analytics domains smaller.
9. From the Groups section in your Palette, add another domain to your blueprint.
Name the domain Web Services.
10. Add an Information Service element from the Consumers and Delivery category to
your Web Services domain.
11. Add an Application object from the Consumers and Delivery category to the
Consumers domain. Rename the Application to SOA Application.
13. Note the incoming and outgoing links when you hover your mouse over the elements.
Click and drag these links to create the connections.
14. Mark the two new elements and go to the properties section on the lower right side of
the screen. Switch to the Milestones section and define these two objects to show up
at the End of workshop milestone.
15. You can now enable the timeline view again to reduce the blueprint scope to what
we are achieving in this class.
2. Create a new parallel job and save it as PolarityFollowerCountService in the Jobs >
WarehouseJobs folder. We will build a simple job in this exercise. Note that the real
value unfolds when you take advantage of DataStage’s and QualityStage’s full
transformation and data cleansing potential in combination with the service
endpoints.
4. Connect the stages and name the stages and links as shown:
9. Switch to the Columns tab and load the CUSTOMER_POLARITY table definition.
13. In the table definition field for the CustomerID link, fill in the following metadata:
14. Click and drag the UserID key from the input stream to the reference stream.
17. We need to specify what action should be taken if a lookup on a link fails. Make sure
the Lookup Failure field is set to Continue. This will set our reference data to NULL.
The stage continues processing any further lookups before sending the row to the
output link.
19. You will notice that every link now has table definition metadata defined, as indicated
by the small table icons. You will also notice that the ISD Input and Output stages are
sequential, as indicated by the fan-in and fan-out icons, while the Lookup stage and
the DB2 Connector stage are parallel stages.
20. Before we can compile the job, we will need to make this job available for information
services. Open the job properties.
21. In the General tab, set the checkmarks for “Allow Multiple Instance” and “Enabled for
Information Services”.
24. To start the Information Services Director client, double-click on the IBM InfoSphere
Information Server Console icon on the desktop; or select Start > Programs >
IBM InfoSphere Information Server > IBM InfoSphere Information Server Console.
26. Create a new Information Services project: open the menu under ‘No project
selected’, and select ‘New Project’.
28. Once the project is created, switch to the Users tab. You can fill in the users that may
connect to the project, and their roles for the project: Information Services Director
Designer, and/or Information Services Director Project Administrator.
31. Each project needs to connect to one or more information providers. This project
needs to access the DataStage engine.
32. From the home page of the Information Server Console, click on Home >
Configuration > Information Services Connections.
36. Fill in the user name and password information to connect to the DataStage engine:
dif / inf0server.
38. Click on the Save, Enable and Close option under the Save button.
45. Double-click on Bindings to define the bindings to be used by the service. From the
Attach Bindings menu, select SOAP over HTTP.
46. We are ready to define one or more operations to be associated with the service.
Under the Operations folder, double click newOperation1. The newOperation1 tab
will open.
47. The first operation was created automatically. Name the operation DataStageLookup
and select an information provider.
50. Select the PolarityFollowerCountService job from the Jobs > WarehouseJobs folder.
51. Select the job PolarityFollowerCountService, then save the application and
close it.
52. Our service is now ready to be deployed. Highlight the service and click Deploy.
54. Click Deploy. You can monitor the deployment status at the bottom of the screen.
55. Wait until the deployment has completed, and then start the InfoSphere DataStage
and QualityStage Director.
Expand the dif project and then the Jobs folder, and go to WarehouseJobs. You will
see that the job “PolarityFollowerCountService”, with its invocation ID, is currently up
and running and ready for the call.
59. The Information Services Director is now waiting for SOAP over HTTP requests.
Each request is processed, and the DataStage engine receives the UserID string as
input for our service-enabled job.
60. Go back to the ISD (Information Server Console). In the project menu, click the
OPERATE icon and select Deployed Information Services Application.
62. Click the ‘View Service in Catalog’ button next to the service; this will take you
directly to the service view within the Information Services Web Catalog.
63. Click on ‘Bindings’ in the catalog view. Expand the ‘SOAP over HTTP’ binding to
open the binding properties. Click on the Open WSDL Document link.
This will open the WSDL document in a separate browser window. Have a look at
the WSDL. You might recognize some of the information that we had looked at
earlier. Keep this WSDL browser window open; we will need to copy the link for
testing purposes in the next few steps.
64. Open InfoSphere Data Architect by double clicking on the IBM InfoSphere Data
Architect icon on the desktop.
66. Switch to the Web perspective by clicking the Open Perspective button in the top
right corner. Select ‘Other’, choose Web near the bottom of the list, and click OK.
67. Click ‘Run’ and select ‘Launch the Web Services Explorer’ from the pull-down menu.
68. In the top right corner of the Web Services Explorer window, click the ‘WSDL Page’
icon (second from the right).
69. In the Navigator of the Web Services Explorer, click ‘WSDL Main’ then copy the URL
of the WSDL document from the browser window into the WSDL URL text field and
click ‘Go’.
As you can see, the Web Services Explorer could interpret the WSDL and discovered
an operation ‘DataStageLookup’ and an endpoint (Service Provider) to which the
request would be sent.
70. Click on the ‘DataStageLookup’ operation name link.
71. On the ‘Invoke WSDL Operation’ window, enter a userid (e.g. 7653556196,
26876535196, or 76534565196) and click ‘Go’.
The response message at the bottom of the window includes the polarity and the
number of followers for that userid.
What has just happened here? The input in the Web Services Explorer was sent as
a SOAP/HTTP request to the service provider (InfoSphere Information Server, in this
case), which then invoked the InfoSphere DataStage job. The job did a lookup
against the customer polarity repository to retrieve any existing customer data and
returned the result to Information Server, which packaged it as a SOAP message
and sent it back to the Web Services Explorer.
We now have a Web service that checks a userid against the customers in our
customer repository. This service could be used by any JKLW application. All it
takes now is to publish this service in our service registry.
Summary
IBM InfoSphere Information Services Director is a powerful tool for creating Web services
on top of InfoSphere DataStage and QualityStage jobs, as well as SQL statements against
DB2, Oracle, or Classic Federation data sources. InfoSphere Information Services Director
services package information integration logic that insulates developers from the
underlying complexities of data sources. InfoSphere Information Services Director
provides support for load balancing and fault tolerance for requests across multiple
servers. It also provides foundation infrastructure for information services.