IBM® InfoSphere™
Data Integration
Fundamentals
Boot Camp
Lab Review
January 2014
Table of Contents
Task: Create a DDL script for the new table .............................................................. 124
Task: Export the physical data models ....................................................................... 128
Task: Importing physical data models using the ODBC Connector in DataStage ..... 130
Lab 09: Creating Mapping Specifications .............................................. 132
Task: Creating a FastTrack project ............................................................................. 133
Task: Import Metadata using FastTrack ..................................................................... 136
Task: Creating source to target specifications ............................................................ 138
Task: Use the source to target specification to generate a DataStage Job .................. 143
Lab 10: Combining and Sorting Data.................................................... 156
Task: Creating a source to target specification with a lookup table ........................... 157
Task: Completing the lookup stage job ...................................................................... 167
Task: Range lookup on reference link ........................................................................ 180
Task: Using the Sort stage .......................................................................................... 187
Task: Using the Remove Duplicates stage.................................................................. 191
Task: Using the Join stage .......................................................................................... 194
Task: Using the Merge stage....................................................................................... 201
Task: Using the Funnel stage ...................................................................................... 204
Task: Perform an impact analysis using the Repository window ............................... 208
Task: Find the differences between two jobs .............................................................. 210
Lab 11: Aggregating Data ....................................................................... 213
Task: Using the Aggregator stage ............................................................................... 213
Lab 12: Transforming Data .................................................................... 217
Task: Create a parameter set ....................................................................................... 217
Task: Add a Transformer stage to a job and define a constraint ................................ 219
Task: Define an Otherwise link .................................................................................. 225
Task: Define derivations ............................................................................................. 229
Lab 13: Operating and Deploying Data Integration Jobs ................... 234
Task: View the Metadata Lineage ............................................................................. 234
Task: Building a Job Sequence .................................................................................. 238
Task: Add a user variable .......................................................................................... 245
Task: Add a Wait For File stage ................................................................................ 248
Task: Add exception handling ................................................................................... 250
Lab 14: Real Time Data Integration ...................................................... 253
Task: Revisiting our Project Blueprint ....................................................................... 253
Task: Creating a Service Enabled Job ........................................................................ 256
Task: Create an Information Service project with Information Services Director ..... 262
Task: Create an Information Application and Service ................................................ 266
Summary ..................................................................................................................... 275
2. In the labs, we will use the term "VM Machine" to refer to the VMware environment running IBM InfoSphere Information Server, and the term "Host Machine" to refer to the machine running VMware Player or Workstation that loads and hosts the VMware image.
3. All the required data files are located at: /DS_Fundamentals/Labs. You will be using
the DataStage project called “dif”.
¹ IS admin: InfoSphere Information Server administrator
² WAS admin: WebSphere Application Server administrator
3. If the browser does not display the web page, ask your instructor for help and follow
these instructions together.
7. To change to the db2inst1 user, type the following command: su - db2inst1
10. To start the WebSphere Application Server, execute the following commands (this is a time-consuming process that can take several minutes to complete):
/opt/IBM/WebSphere/AppServer/bin/startServer.sh server1
/opt/IBM/InformationServer/ASBNode/bin/NodeAgents.sh start
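If you want to verify that the application server came up successfully, you can query its status. A minimal sketch, assuming the default install path used above and the wasadmin credentials introduced earlier in this guide:
/opt/IBM/WebSphere/AppServer/bin/serverStatus.sh server1 -username wasadmin -password inf0server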
The data file that we will discover, map, transform and load contains social
media sentiment information. The data was extracted from the web using a
web crawler and then processed on our InfoSphere BigInsights cluster,
IBM’s big data processing platform. The BigInsights processing reduced
the size of the data, which made it easier to store in our traditional
data warehouse infrastructure. It is now our job to make the information
available for consumption by our business users through our data
warehouse.
In this first exercise we will use Blueprint Director to better understand the
class scenario and our goals.
Log off from the IBM Information Server Web Console if you are connected to it.
1 Open Blueprint Director by double-clicking the 'InfoSphere Blueprint Director' icon, or click Start > All Programs > IBM InfoSphere > IBM Blueprint Director 2.2.0.
2 When you work on a project inside Blueprint Director, the information regarding the
project is stored into a workspace. A workspace is a folder on your system. In this
environment, it is located in directory C:\Users\Administrator\IBM\bpd\workspace.
You can also build your own template from scratch using Rational Method
Composer.
5 We will create a new blueprint from a template; in this case, we use the “Business
Driven BI Development” template. Save the blueprint in the destination folder
“Miscellaneous Project”, and name your blueprint ‘My_BI_Blueprint.bpt’.
6 Click ‘Finish’ to create the blueprint. The new blueprint, based on the ‘Business
Driven BI Development’ template, should now be visible.
8 Explore the content of the palette. There are various categories: Groups,
Operations, Analytics, Consumers and Delivery, Data Stores, Files, Conceptual
Models, Connections. Each category contains a number of elements. This list is
extensible, so you can add your own elements through Blueprint > Extend Palette.
9 The top level blueprint diagram of our BI project already uses a number of elements
from the palette.
Each domain in the diagram has one or more high level elements. For example,
the Data Sources domain contains a series of group elements called Asset Sets
such as Structured and Unstructured Sources, External Feeds and Enterprise
Applications. The Data Integration domain contains the Integrate Sources
Routine element, an element in the Operations category.
General Flow connectors link elements together within a domain and across
domains. These general flow links help you visualize the flow of information in
your information project.
10 Many elements on the diagram contain a sub-diagram, with lower level, more
detailed information. You can tell if there is a linked, lower level subject area if there
is an orange plus sign at the top left corner of an element. And again, any element
in a sub-diagram can itself contain a sub-diagram. This hierarchical representation of diagrams lets you keep higher-level diagrams uncluttered by unnecessary detail.
11 Click this '+' sign, or double-click the 'Integrate Sources' element, to drill down into more detail of the "Extract, Transform and Load" (ETL) process.
12 The ETL sub-diagram is now open. The highlighted tab at the top of the canvas
shows the diagram you are currently working on.
13 In this sub-diagram, notice that the elements on the left and the right side are in gray
italics. This indicates that these elements have been added from another diagram
by dragging them from the Blueprint Navigator. Changes to these elements are
kept synchronized across diagrams.
14 On the right-hand side of the Blueprint Director workspace, you may have noticed three content browsers.
The Method Browser displays the outline of the method that is associated with
the template diagram. A method provides guidance on recommended roles,
tasks, deliverables and dependencies for the overall project.
The Asset Browser browses IBM InfoSphere Information Server metadata
repositories based on a connection profile. You can drag & drop entries (e.g. a
database, a job, etc.) from the asset browser onto elements on the canvas.
These elements will be automatically linked so that you can open IBM InfoSphere
Metadata Workbench to view the metadata details from the blueprint.
The Glossary Browser, which is the standard IBM InfoSphere Business
Glossary eclipse plug-in, displays the glossary categories and terms in a tree
view and the detailed definition in the property view. You can drag & drop
glossary terms onto the blueprint diagram to define conceptual entities or tag
elements with terms.
15 In the Method Browser, expand the Business Driven BI development scenario. This template scenario provides you with a high-level overview and guidance for the required steps in a particular project. When you define and manage a new project, you have access to the corresponding method in a hierarchical view of high-level phases and activities, plus detailed descriptions of the method's activities for a selected project based on a template.
17 Click the ‘+’ sign in front of any phase, activity, or task to see the details.
18 In this boot camp, we will focus on the Discover Sources pattern and discovering,
defining and developing Information Integration activities.
19 Import the existing blueprint DIF_Scenario.bpt, by selecting from the menu File >
Import Blueprint.
Our blueprint has milestones defined and the elements are already assigned to
milestones.
24 View the Timeline tab on the bottom left side of the screen.
26 You can view the evolution of the blueprint by using the slider in the timeline
window.
27 Select the milestones that you want to visualize at each phase of the project. This
capability helps your team to understand the end-to-end project vision.
28 Move the timeline slider from ‘End of workshop’ to ‘Adding Sources’ and then to
‘Data Quality’. You will notice that a yellow circle appears around the Integrate
Sources routine element. This feature informs you about lower-level diagram
activity.
30 Double-click the Web Data asset set in the Data Sources domain, or alternatively click the plus sign. This will open the sub-level diagram.
31 In our scenario, data is extracted from the web and saved, together with additional user information, in a data file using the InfoSphere BigInsights product. Once the data was processed on our BigInsights cluster, the result was written to a file. We will need to read this file later on using DataStage. Note that DataStage has a connector, called the HDFS connector, that allows direct connections to the BigInsights/Hadoop file system. Since we don't have access to the BigInsights server in our workshop environment, we chose to export the file to the local file system.
34 Notice that the elements on the left and on the right side are in gray italics. This
indicates that these elements are the connection points from the top level diagram.
Changes to these elements are kept synchronized across diagrams.
35 The Integrate diagram shows a classic extract, transform and load (ETL) pattern.
The first part of an ETL process involves extracting the data from the source
systems. In our case this is reading the data from the web data feed file. We will
store the extracted data in a database.
The load phase loads the data into the end target which, in our case, is a data
warehouse.
Click Finish.
This server connection is now listed in the Manage Server Connections window.
You can click on the + sign in front of each database to discover the underlying schemas, tables, and columns.
43 We are now going to create a connection for a BI Report. We already have existing
customer data reports that access our warehouse. We will now link our blueprint to
one of the existing reports. Once the customer sentiment analysis report is built, we
could go back to our blueprint and include this asset link. In your blueprint, right click
on the Reports element.
44 Move the timeline to Advanced Analytics and make sure the 'Enable read-only Blueprint view by timeline' option is deselected. Then right-click the Reports element and select Add Asset Link.
47 Click Next.
49 Click Next.
52 Click Finish.
53 A green arrow has now been added to the Reports element. It indicates that one or more assets are associated with this element.
54 Click the green arrow and then the Customer Report link.
55 You can now browse the BI report representation in the Information Server
Metadata Repository using Metadata Workbench. The window is embedded in
Blueprint Director.
58 The above diagram shows part of the full lineage between the Customer Report and
the associated warehouse tables and the existing operational sources. You can use
the slider or the + / - sign on the tool bar to zoom in or zoom out.
59 Notice that by clicking on any of the links, information is displayed regarding the source and the target of the link, and the link type (model, design, operational, user-defined data).
60 Our task is to enrich our customer warehouse with additional customer information
that we have gained from the sentiment analysis. Exit the maximized view.
Click the method icon on the Web Data asset set. The next actions associated with this first set are:
Analyze Sources
In this exercise we are going to analyze the big data customer sentiment results file that
was exported from our BigInsights/Hadoop cluster. InfoSphere Discovery will help us get
a better understanding of the new data source and how we can process it in DataStage.
We will also check if we can discover any relationships between the text file and our
existing customer master database table.
This lab requires that DB2 and the InfoSphere Discovery services are up and running.
DB2 must be up and running because any request made in Discovery Studio is
processed by the Discovery Engine, which retrieves and/or stores objects in the
Discovery Repository database, a DB2 database.
Start the DB2 service.
Note: if the DB2 services or the Discovery services were not started, you would get
a Java error message when you try to bring up Discovery Studio.
2 In the Source Data Discovery tab, select the New Project icon.
4 Click OK.
Data Sources
Every Source Data Discovery project contains at least one data set.
A data set can contain physical tables from one or more databases,
text files, or a combination of tables and files.
5 We will define a data set for the data source we are interested in. This process will
consist of:
o Naming the data set
o Specifying a connection to the data set
o Selecting the tables or files to be included in the data set.
o Importing the physical tables to the Discovery staging database
o Defining logical views (Logical Tables), which let you eliminate unnecessary
columns, or perform pre-joins.
Task: Adding the customer sentiment text file to the data set
8 Highlight the Text File Formats & Files section.
10 Navigate to C:\bootcamp\dif\SAMPLE_Brand_Retail_Feedback.csv
12 Keep the Delimited File format and the other settings like Row Delimiter.
13 Check the Heading Line check box: the first line contains column names.
14 Click Next.
15 Keep the Column Delimiter as Comma and set the Text Delimiter to “ (double
quote).
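For illustration, here is what a record in such a file might look like. The values and the exact column order are hypothetical; only CATEGORY as the second column and TEXT as the last column are known from later steps of this course:
USERID,CATEGORY,BRAND,PRODUCT,POLARITY,FOLLOWERSCOUNT,FULLNAME,CREATEDTIME,TEXT
"100234","Retail","JK Outdoor","Trail Jacket","positive","57","Dale Hemmingway","2013-05-14 10:22:31","Love the new jacket, great quality"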
16 Click Next.
Password: inf0server
23 Highlight the CustomerMaster connection and click on the green plus sign to import
a database table.
26 Click Finish.
27 We have now defined all data sets for our project, and can start the analysis.
29 Click on the ‘Run Next Steps…’ button on the lower right side.
30 In the Processing Options window, notice the arrow pointing to Column Analysis;
you could drag that arrow down to any of the other tasks (PF Keys, Data Objects,
Overlaps). All the tasks up to the task being pointed to will be executed. Usually, it
is advised to perform each task separately, verify the results delivered by Discovery,
and make any appropriate modifications before proceeding to the next task.
31 Click the Run button.
32 While the task is executed, notice an information message informing you that the
project is locked and that you cannot perform changes. However, you can monitor
the progress of the task(s) being processed. A status indicator is displayed in the
top-right corner that indicates the number of tasks currently active.
33 You can click on 'Currently 1 Active tasks'. This opens the Activity Viewer window, which lists all the activities³, whether queued, running, completed, or completed with errors.
34 From the ‘Activity Viewer’ window, you can monitor the trace and error logs for each
activity.
35 Close the Activity Viewer window.
³ If all the activities remain in a queued state, this could mean that the InfoSphere Discovery Engine was not started.
36 Once the processing completes, the message 'Project is locked. You cannot make any change' will disappear, and InfoSphere Discovery will display column profiles.
37 The Column Analysis tab will appear with a green status icon indicating that the
analysis run was successful.
39 Review the column analysis results for that table. Notice that the analysis results are
composed of column metadata information on the left side, and column statistics
information on the right side. (Metadata, Statistics columns)
40 In the metadata section, we discovered that the native data type corresponds to the
defined data type.
41 Scroll to the right to see the statistics columns (alternatively, you can click on the
‘Column Chooser’ icon to select the columns that you would like to display).
47 Note that the BigInsights text analytics processing was not always able to identify which product category, brand, product family, and product name are associated with the sentiment data.
48 Highlight the BRAND column, click on Value Frequency and review the actual data
values in this column by frequency. You will find missing values or the value <null>
in it. Note that <null> is an actual value, since we have been analyzing a text file, which does not support NULL values. We can later convert the <null> strings to NULL values in DataStage when loading the data into an RDBMS.
50 So far, we have analyzed each table separately. We will now let the system discover
the relationships between tables in each data set. Text files cannot include primary-
foreign key metadata. This step is critical to identify keys that we can use to join the
data.
51 From the Column Analysis tab, click Run Next Steps.
52 In the Processing Options window, make sure the arrow points to PF Keys, and
click Run.
53 Once processing has completed, look at the result of the PF key discovery for the SocialData data set.
54 You can position the different objects on the graph as you wish, so that the lines
between tables do not overlap.
56 Click the arrow that connects the two tables. Two key associations were identified: FULLNAME and USERID. While FULLNAME has a relatively high hit rate, only the SOCIAL_USERID and USERID relationship has a 100% value row hit rate on both sides.
57 Only one relationship actually makes common sense. You can now exclude the FULLNAME key relationship by highlighting it and clicking the red X to delete it.
1. Between any pair of tables, there may be zero, one or several column matches.
2. Some column matches may be coincidental, and it is your responsibility to
remove the matches that have no meaning from a business perspective.
3. Column matches and the resulting PF keys are influenced by parameter values
defined in the processing options.
4. Table classification and PF key identification are also influenced by the data set classification type (operational or data warehouse).
5. Table classification is automatically determined by Discovery during PF key
analysis.
6. Discovery users (SMEs) can modify the table classification and the link type between tables.
7. Discovery users (SMEs) can add links between tables when desired.
8. Discovery will never fail to identify a link between two columns: a link is identified whenever the proportion of values common to the columns is above a threshold specified as a parameter in the processing options.
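To make the hit-rate idea concrete, the following SQL is a hedged illustration of the kind of value-overlap check this implies. It is not Discovery's actual implementation, and for concreteness it uses the DB2 table names the data will have once it is loaded later in this course:
-- Illustrative only: fraction of distinct USERID values that also
-- occur in CUSTOMER_MASTER.SOCIAL_USERID
SELECT CAST(COUNT(DISTINCT s.USERID) AS DECIMAL(9,4)) /
       (SELECT COUNT(DISTINCT USERID) FROM DIF.CSTSENTIMENT) AS hit_rate
FROM DIF.CSTSENTIMENT s
INNER JOIN DIF.CUSTOMER_MASTER m ON s.USERID = m.SOCIAL_USERID;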
Data Objects
Discovery organizes related tables into structures called data objects,
based on the primary-foreign keys.
In most cases, tables classified as root entity tables become root tables
(parents). Tables with foreign keys classified as child entity or
reference tables usually become child tables in data objects.
A table with no primary key or foreign keys is also considered a data
object.
Data objects never span across multiple data sets: if a table in one
data set is related to a table in another data set, this relationship will be
discovered in subsequent steps.
62 Review the objects generated for the SocialData data set. CustomerMaster was
classified as the root entity.
Data Archival
When archiving data, your goal is to minimize the traffic of data
between primary storage and the archive. Therefore, related tables
should be archived together.
The Data Object Discovery phase identified sets of related tables as
‘data objects’.
You can export these Discovery data objects to Optim for archival purposes. A filter (WHERE clause) can be applied to a data object to further restrict the size of the archive.
65 Export the data object: click Project > Export > Optim Data Models.
This concludes the Discovery lab. We now have a better picture of the data that we need to process. Save the Discovery project and exit the studio.
4. The Information Server Suite Administrator user ID, isadmin, is displayed. The WebSphere Application Server administrator user ID, wasadmin, the DataStage Administrator dsadm, and the DataStage user dsuser are also listed. We will be using the user ID that was created for the Data Integration Fundamentals class, called dif.
6. Note the properties of this user. Expand the Suite Component. Note the Suite Roles and the Component Roles that have been assigned to this user. Our user has access to most of the Information Server components in various roles (User, Administrator, etc.). The DataStage Administrator and User roles are checked.
7. Return to the Users main window by clicking on the Cancel button (you might have to
scroll down in order to see it).
8. Click Log Out on the upper right corner of the screen and then close the browser.
11. Specify the host name of the Information Server services tier computer (infosrvr),
followed by a colon, and followed by the port number (9080) to connect. Use dsadm
as the User name and inf0server as the password to attach to the DataStage
server. In our case the host name of the Information Server engine is the same as
the one for the services tier. Click Login.
12. Click the Projects tab. Here, you can add, delete, and move DataStage projects.
Select the “dif” project and click the Properties button.
Note that all users have the DataStage and QualityStage Administrator role except
dsuser.
15. Click the Environment button to open up the Environment variables window.
16. There are many environment variables that affect the design and running of parallel
jobs in DataStage. Commonly used ones are exposed in the DataStage
Administrator client, and can be set or unset using the Administrator. In the Parallel
folder, note that the APT_CONFIG_FILE parameter points to the default
configuration file location. We will learn about configuration files in a later module.
In the Reporting folder we have enabled additional reporting information that will
help us debug our jobs.
18. Go to the Parallel tab and browse the parameters and available settings. The parallel
page allows you to specify certain defaults for parallel jobs in the project, for example
format defaults for time and date. Click OK when done.
2. Use the dif / inf0server combination to log into the INFOSRVR/dif DataStage
project.
3. Once you log on to the Designer client, you will see this screen:
5. Save the job now with the name SampleSentiment into the Jobs folder in the repository by clicking File > Save As…
6. Type the name and save it into the Jobs folder. Click Save.
7. Add a Sequential File stage (‘File’ category in the Palette), a Copy stage
(‘Processing’ category in the Palette), and a second Sequential File stage. Draw links
between them. You can draw links by selecting the link element from the General tab
in the palette. The quickest way to draw links, however, is to right click onto the
originating stage (Sequential File) and drag the link onto the target stage (Copy
Stage).
Copy Stage
The Copy stage has one input link and can have multiple output links. It
copies all incoming records to all output links.
The single output link in our case means that the records are simply
passed along without any operation. It serves as a placeholder for
future logic.
8. Name the stages and links as shown. To rename a stage and link, select the object
and start typing over it. You can also right click on the object and select Rename.
9. We are going to read from the SampleSentiment data file. When reading or writing
sequential files using the sequential files stage, DataStage needs to know three
important facts:
Table Definitions
Table definitions describe the column layout of a file. They contain
information about the structure of your data. Within a table definition
are column definitions, which contain information about column names,
length, data type, and other column properties, such as keys and null
values.
Table definitions are stored in the metadata repository and can be used
in multiple DataStage jobs.
10. First, we need to check if we already have a Table Definition for the file we want to process. Browse the Repository window in the Table Definitions > Sequential folder.
12. Choose the /bootcamp/dif directory by clicking the button to the right of the Directory field.
Note: The files will not be displayed, because you are just selecting the directory.
Note: We are browsing the engine tier file system on the Linux server VM, not
the Windows client VM.
13. After you click OK in the directory browser, all text files will be displayed in the Files area.
15. Make sure you are saving the Table Definition to the \Table Definitions\Sequential\
folder and click Import.
16. Check the box First line is column names and then go to the Define tab.
17. By default, all fields are non-nullable. Since we have analyzed the file beforehand in
Discovery, we know that most fields contain empty values. We plan to use USERID
as an identifier and should keep this field as non-nullable. Define all other fields as
Nullable.
18. Change the SQL type for CREATEDTIME to Timestamp and remove the 255 value
from the length field.
USERID    VarChar    25    Nullable: No
20. Your sequential file meta data should look like this:
22. Close the import window. The new Table Definition will be displayed in the
Repository window under the Table Definitions > Sequential folder.
23. Double click on the source Sequential File stage. We need to specify the file to read
in the Properties tab. Select the File property and then use the right arrow to browse
for a file to find the SAMPLE_Brand_Retail_Feedback.csv file. Click OK. Hit the
Enter key to see the file path updated in the File property.
24. Set the ‘First Line is Column Names’ property to True. If you don’t, your job will have
trouble reading the first row and issue a warning message in the log.
25. Next, go to the Format tab and click the Load button to load the format from the
SAMPLE_Brand_Retail_Feedback.csv table definition under folder /Table
Definitions/Sequential.
26. Note that DataStage was able to identify the file format during the import of the Table Definition. This is how the file looks in raw format:
27. Next go to the Columns tab and load the columns from the same table definition in
the repository. Click OK to accept the columns.
28. Click View Data and then OK to verify that the metadata has been specified properly. If it has, you will see the data window; otherwise, you will get an error message. Close the View Data window and click OK to close the Sequential File stage editor.
29. Open the Copy stage. In the Copy stage Output tab > Mapping tab, select all source
columns and drag them across from the source to the target.
32. Ensure that in the Format tab, the Delimiter setting in the Field defaults folder is set
to comma delimited.
33. In the Properties tab, in the File property, type the directory name /bootcamp/dif/ and name the file SAMPLE_Brand_Retail_Feedback_Copy.csv. Instead of typing, you can use the right arrow button to 'browse for file', pick the SAMPLE_Brand_Retail_Feedback.csv file, and then edit the name to append "_Copy" before the extension.
34. Set option ‘First Line is Column Names’ to true. The File Update Mode should
continue to be set to ‘Overwrite’ every time the job is run. Click OK to save your
settings.
37. After the compilation is finished you can close the Compile Job window.
38. Right-click over an empty part of the canvas. Select or verify that “Show performance
statistics” is enabled (a checkmark should be present in front of “Show performance
statistics”). This will show, for each link, how many rows were processed and the
throughput per second.
39. Ensure that you have the Job Log view open. To open the window, click on the menu
View > Job Log. Enlarge the job log window.
42. Scroll through the messages in the log. There should be no warnings (yellow) or
errors (red). If there are, double-click on the messages to examine their contents.
Fix any problem and then recompile and run.
43. Rearrange the job log window to make more canvas space available.
44. You can view the result data by right clicking on the target sequential file and
choosing ‘View SampleSentimentCopy data…’.
50. Open up the job properties window by clicking the 5th icon on the tool bar.
51. Go to the Parameters tab. We will define two job parameters. Define the first job parameter, named TargetFile, of type String: double-click the Parameter name field, type the name, and then fill out the other fields. Create an appropriate default filename, e.g., TargetFile.txt. The second job parameter, TargetDirectory, will contain the target directory, /bootcamp/dif/.
Hit the Enter key to retain the changes. Click OK to close the window.
53. Open up your target Sequential File stage to the Properties tab.
54. Select the File property. Delete the content of the File property.
55. Click on the black arrow button on the right side of the text box.
57. Select the TargetDirectory parameter first. Place the cursor at the end of the inserted
#TargetDirectory# string. Repeat the step with the TargetFile parameter.
58. Your final File box string should look like: #TargetDirectory##TargetFile#
59. Note that the parameters are enclosed in # signs. If you did not add a final / to your target directory earlier, you can place one manually between the two parameters. At run time, the File property resolves to the concatenation of the two values, for example /bootcamp/dif/TargetFile.txt.
62. Run your job. Note that DataStage prompts for the parameter values. Leave the
default values intact. Click Run.
64. Scroll through the messages in the log. There should be no warnings (yellow) or
errors (red). If there are, double-click on the messages to examine their contents.
Fix any problem and then recompile and run.
65. Right-click the final sequential stage 'Target File'. Select 'View Target_File data…'.
71. Rename the stage and link names as shown for good standard practice.
72. Edit the SampleSentimentData Sequential File stage. Change the file name in the File property to SAMPLE_Brand_Retail_Feedback_Reject.csv and set the Reject Mode property to Output. This way, the rejected records will flow to the sequential file. This input file contains a few records that do not fit the Table Definition; these will be rejected.
73. Modify the Sequential File stage Sentiment_Rejects to write the output to a file called
Sentiment_Rejects.txt, located in /bootcamp/dif/.
75. Switch to the Columns tab. Note that there is only one output column and that it is greyed out. Often, column layout issues cause records to be rejected. DataStage outputs these records as a binary object.
78. Run the job and view the job log. The result will be as shown below. To see the number of records on the links, remember to enable Show performance statistics from the canvas if it is not already on.
79. The input file we used had one additional field defined for three records. Since these
records did not fit into the table definition where TEXT was defined as the last field,
they were sent down the reject link. Note that these data quality issues should be
caught upstream during the discover phase.
80. Replace the Sentiment_Rejects Sequential File stage with a Peek stage from the
Development/Debug category in the palette.
82. Observe the job run log. Instead of storing the records in a text file, the peek stage
has caused the records to be output to the log. You will notice two entries (one for
each processing node) that contain the actual rejected data records. You can
double-click on the log entry to view the full text.
85. Open the source sequential file stage and change the file attribute file name to
SAMPLE_Brand_Retail_Feedback_Null.csv.
86. We will now process an input file that has empty string values in it. The values occur
in the CATEGORY field on three records. We will define these as NULL.
87. Click the Columns tab of the source Sequential File stage.
88. Double-click the column number 2 (to the left of the column name) to open up the
Edit Column Meta Data window.
89. In the Properties section, click on the Nullable folder and then add the Null field value
property. Here, we will treat the empty string as meaning NULL. To do this specify “”
(back-to-back double quotes). Click on Apply and then Close to close the window.
91. Click the Columns tab of the target Sequential File stage. Double-click the
CATEGORY column number 2 (to the left of the column name) to open up the Edit
Column Meta Data window.
92. In the Properties section, click on the Nullable folder and then add the Null field value
property. Here, we will write the string NO CATEGORY when a NULL is
encountered. Click on Apply and then Close to close the window.
95. View the data at the target Sequential File stage by right-clicking on the stage and
selecting View TargetFile data…. Notice that DataStage prints the word “NULL” in all
records with empty strings. The NO CATEGORY value is not displayed. This is
because DataStage knows that these represent a NULL value. Let’s take a look at
the file on the DataStage server.
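For example, in the putty session you could inspect the written file like this (a sketch, assuming you kept the default TargetFile.txt name and /bootcamp/dif/ directory from the earlier parameter exercise):
grep "NO CATEGORY" /bootcamp/dif/TargetFile.txt | head -n 5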
102. You will see that the records contain the string that we assigned, “NO
CATEGORY”, to represent a NULL value.
103. You can keep the putty window open for now.
Task: Read data from multiple sequential files using File Pattern
In this task, we will create a job that will read data from multiple sequential files and write
to a sequential file. We will use the File Pattern option to read multiple files in a
Sequential File stage.
106. Edit the source Sequential File stage read method to File Pattern. Accept the
warning message.
108. This will read all the files matching the file pattern in the specified directory.
110. Compile and run the job. As can be seen on the auto partitioning icon on the link,
the source stage reads data from all the source files matching the pattern and writes
it to the output file.
111. You can right-click on the target sequential file and view the data of this stage. You may want to increase the number of rows to be displayed to 600. We have processed two input files with this file pattern. Check the results in the output file and verify that it has all the records from the files that satisfy the file pattern.
114. Delete the target sequential file and replace it with a Data Set file from the File
category in the Palette and name the link and stage TargetDataSet.
115. Edit the target Data Set stage properties. Write to a file named TargetDataSet.ds
in the /bootcamp/dif/ directory.
116. Verify the columns tab and that all columns are there. Click OK to close the stage
editor.
119. In Designer click on Tools > Data Set Management. Select the Data Set that was
just created.
121. Click the Show Data icon to view the data of the Data Set (3rd icon).
122. Close the data viewer. Click the Show Schema icon (2nd icon) to view the Data Set
schema.
123. The Data Set Management utility can be used to view the internal schema format.
Close the Dataset Management Utility.
2. In the Configurations box, select the default configuration. You might want to expand the window so that the lines do not wrap, which makes them easier to read.
3. Your file should look like the picture below, with two nodes already defined. If only one node is listed, make a copy of the node definition including its curly braces (that is, copy from the first "node" keyword to the matching "}"), paste it right after the end of the definition section for node1, and change the name of the new node to "node2". Be careful that you have a total of three pairs of block curly brackets: one encloses all the nodes, one encloses the node1 definition, and one encloses the node2 definition.
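For reference, a two-node configuration file has the following general shape. This is a sketch: the fastname and resource paths shown are assumptions based on default install locations and will likely differ in your environment. (The inline {pools ""} clauses on the resource lines are separate from the three block brace pairs counted above.)
{
  node "node1"
  {
    fastname "infosrvr"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "infosrvr"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}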
2. Note how the partitioning indicator is showing the ‘fan in’ symbol before the target
stage. This means the two partitions are currently collected into a single file.
3. In the target Sequential File stage, define two files, TargetFile1.txt and
TargetFile2.txt, in order to see how DataStage data partitioning works. To define
more than one target file, click on File property.
6. View the job log. Notice how the data is exported to the two different partitions (0
and 1).
7. Go back to the putty window where you should still be logged on as dsadm /
inf0server on the server.
Let’s view the first output file of the job by typing:
head /bootcamp/dif/TargetFile1.txt -n 5
8. Note the associated FULLNAME records for the first entries. In this case these were
Dale Hemmingway, Garth Karlson, Ralph Monk and Bill Lanford.
9. Next, view the first ten rows of the source file by typing:
head /bootcamp/dif/SAMPLE_Brand_Retail_Feedback.csv -n 10
Notice how the data is partitioned. Here, we see that the 1st, 3rd, 5th, etc. records go
into one file and the 2nd, 4th, 6th, etc. records go in the other file. This is because
the default partitioning algorithm is Round Robin.
2. Compile and run the job again. Open the target files and examine. Notice how the
data gets distributed. Experiment with different partitioning algorithms!
3. The following table shows the results for several partitioning algorithms. You will also find the row count in the log: observe the messages for the export of the TargetFile Sequential File operator for partitions 0 and 1, TargetFile,0: Export complete and TargetFile,1: Export complete.
Partitioning Algorithm    Records in File1    Records in File2    Comments
2. Open the Properties tab of the source Sequential File stage. Click the Options folder
and add the “Number of Readers Per Node” property.
6. In the job log, you will find log messages from Import SampleSentimentData,0 and
SampleSentimentData,1. These messages are from reader 1 and reader 2. In
addition, you can see that DataStage is now using the same partitioning before the
copy stage since the incoming data stream already has two partitions.
7. You may also notice that one record was dropped because the data string did not
match the timestamp format of the CREATEDTIME column. We sent the first column
name record into the data stream as well. The ‘First line is column names’ property is
invalid when reading with multiple readers per file.
2. Select the Other folder and then Data Connection. Click OK.
3. Name the Data Connection JKLW_DB and type ‘Sample Outdoor Operations
Database’ as a short description.
4. Switch to the Parameters tab. Browse for a stage type in the 'Connect using Stage Type' section. Note that there are many different stage types for which you can create Data Connections. Select the DB2 Connector stage type from Parallel > Database and click 'Open'.
ConnectionString: JKLW_DB
Username: db2admin
6. Click OK.
7. Save the Connection Object as JKLW_DB in the folder Jobs > Shared.
Task: Load the data from the sequential file to a DB2 UDB table
using a DB2 Connector stage
In this task, we will create a job that reads data from the sentiment data sequential file
and loads the records into a DB2 UDB table. We will use a DB2 Connector stage to
write data into a new DB2 database table.
9. Create a new parallel job named SequentialSentimentToDB2. Note: To save time, you can use the SampleSentiment job as a template and remove the target stage and link.
10. Drag and drop the JKLW_DB Connection Object that you just created onto the
canvas. Change the link to an input link. Connect the link to the Copy stage.
11. Rename the stage and link names as shown for good standard practice.
12. We will load the data into our database for further processing.
13. In the SampleSentimentData source stage, open the Output properties. Ensure that
you remove the Multiple Readers per Node property if it’s active. We need to read
the file with the column names. Ensure that the First Line is Column Names property
is True.
14. Open the Copy stage. Go to the Output tab and map all columns to the output link.
16. DataStage associated the stage with the Connection Object. Test the connection.
17. In the Properties tab, expand the Usage section. Change the Generate SQL option
to Yes. Specify the following Table name: DIF.CSTSENTIMENT.
21. Observe the job log. Note the SQL statements that were generated for each partition.
22. You will find a create table DIF.CSTSENTIMENT statement and also an INSERT
INTO DIF.CSTSENTIMENT table statement.
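The generated statements will look roughly like the following sketch. The exact column list, order, and lengths come from your Table Definition, and the SQL DataStage generates may differ in detail; at this point all fields are still character types as imported, except CREATEDTIME:
CREATE TABLE DIF.CSTSENTIMENT (
  USERID VARCHAR(25) NOT NULL,
  CATEGORY VARCHAR(255),
  BRAND VARCHAR(255),
  PRODUCT VARCHAR(255),
  POLARITY VARCHAR(255),
  FOLLOWERSCOUNT VARCHAR(255),
  FULLNAME VARCHAR(255),
  CREATEDTIME TIMESTAMP,
  TEXT VARCHAR(255)
);
INSERT INTO DIF.CSTSENTIMENT (USERID, CATEGORY, BRAND, PRODUCT, POLARITY,
  FOLLOWERSCOUNT, FULLNAME, CREATEDTIME, TEXT)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);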
23. In the following exercises we will use this table for further processing. This means we
will have to pay closer attention to the SQL data types in our table definitions. During
the discover phase, we found out that the UserID and Followerscount fields consist
entirely of numbers. We can use the default DataStage type conversion in the copy
stage to convert these two varchar fields to a numeric data type. This will make it
easier to run operations like joins, aggregations and sorts on these fields.
24. Open the properties of the Copy stage. Change to the Output tab. We are mapping all input columns to the output of the stage. Open the Columns tab.
25. Change USERID to the Numeric and FOLLOWERSCOUNT to the Integer SQL type.
USERID Numeric 25
27. Click OK. If we want to run the job again, we need to change the table action from
create to replace. You can make this change in the DB2 Connector stage properties.
30. View the job log. You will notice that two warnings were issued, one for each default type conversion that was carried out in the Copy stage. You will also notice that a drop table statement is issued before the create table statement.
31. You can view the data in the table by opening the DB2 Connector stage and then
clicking on the View Data button. DB2 has now created a table using native DB2
data types.
In this lab, we will build a physical data model for a new warehouse table. This table will
store the customer sentiment data that we have just loaded into the operational
database. For this, you will use IBM InfoSphere Data Architect to view the overall
database structure and manage system design changes.
3. From the main menu, click File > New > Data Design Project. A new data design project wizard will open.
11. Uncheck all default Database elements except ‘Tables’. Click Finish.
12. You can now browse the discovered table in the DIF schema. The CUSTOMER_MASTER table has a defined Primary Key (PK). It also has the SOCIAL_USERID column which, as we learned during the Discovery phase, has the same key values as our Customer Sentiment data.
In this step, we will create a new table to store our customer polarity information in the existing data warehouse.
13. We will now import the existing DIF schema from our data warehouse.
14. Right-click the Data Models folder and select New Physical Data Model.
15. Rename the data model to ‘Warehouse Physical Data Model’. Select ‘Create from
reverse engineering’.
20. Expand the JKLW_DWH database and the DIF schema. Right-click the DIF schema entry. Select 'Add Data Object' and then 'Table'.
21. Name the new table CUSTOMER_POLARITY. This table will store data about
identified positive or negative product experiences. This table will contain the
following fields:
Column      SQL Type        Description
UserID      DECIMAL(31)     Identifying key; helps us to join records.
Polarity    VARCHAR(255)    Positive or negative sentiment.
22. Add these columns by right-clicking on the CUSTOMER_POLARITY table and selecting 'Add Data Object' > 'Column'.
24. Go to the Properties view on the bottom right side of the screen. Switch to the Type
section. For the UserID column, change the type to Decimal, Precision 31.
25. Repeat these steps for the other three columns. Make sure you define the correct SQL Types for each column. This is how your CUSTOMER_POLARITY table should look in the end:
27. Right click on the CUSTOMER_POLARITY table and select Generate DDL…
30. In the Objects selection, Deselect All and then select Tables. Click Next.
31. The script is now created. We can now run the script on the server. Check the ‘Run
DDL on server’ option and click Next.
34. The SQL Results window will appear and the status should be ‘Succeeded’.
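The generated script should contain a statement along these lines. This is a sketch: the NUMFOLLOWERS and CATEGORY types are assumptions based on the mapping defined later in this course, and Data Architect may emit additional options:
CREATE TABLE DIF.CUSTOMER_POLARITY (
  USERID DECIMAL(31, 0),
  POLARITY VARCHAR(255),
  NUMFOLLOWERS INTEGER,
  CATEGORY VARCHAR(255)
);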
In this task, we will export the InfoSphere Data Architect Model to disk. This will enable
us to bring the physical model metadata into the Information Server Metadata
Repository.
36. To export the warehouse physical data model that we just created, go to File > Export.
37. Open the General folder and select File System. Click Next.
38. Browse for a directory to save the export files in. Choose
C:\bootcamp\dif\DataModels. Select the Physical Data Model and the Warehouse
Physical Data Model for export. Click Finish.
40. Before you can use the ODBC connector in a job, you need to configure database drivers, driver managers, and data source names. Our server already has two Data Source Names (DSNs) defined, for JKLW_DB and JKLW_DWH. We are set to import Table Definitions using the ODBC Connector.
41. Open DataStage Designer and log on to the dif project using dif / inf0server. Then,
go to Import > Table Definitions > ODBC Table Definitions.
45. DataStage is now searching for tables in the data source. Select the DIF.CSTSENTIMENT and DIF.CUSTOMER_MASTER tables from the list. Save them in the default \Table Definitions\ODBC\JKLW_DB folder. Click Import.
46. The Table Definitions are now available in the Repository window.
We will use FastTrack to specify a mapping of our customer sentiment source data into the new customer polarity data warehouse structure. InfoSphere FastTrack helps automate much of this process and provides a centralized location for tracking and auditing specifications for a data integration project.
Let's quickly revisit the steps we have taken so far by looking at the BI development method guidance in our blueprint. We have analyzed our source data and developed the necessary physical warehouse table; now we are going to define our mapping specification. The next step after that will be developing our information integration logic.
This also ensures that the mapping specifications are linking to the same metadata
artifacts that the data integration developers are using.
1. Start the Information Server FastTrack client from the desktop, or select Start > All Programs > IBM InfoSphere Information Server > IBM InfoSphere FastTrack Client.
5. In the description field, fill in ‘Mappings for customer sentiment tables’. Click Finish.
6. Double click on the new project. This will open the mappings tab.
7. Expand the DIF Customer Warehouse folder. You will notice the three folders in the
DIF Customer Warehouse project.
8. The Mapping Specifications folder holds the created and imported source to target
mapping specifications. The Mapping Components folder holds mapping
components which are the direct equivalent of DataStage shared containers. They
are integrated into mapping specifications as sources or targets. The Mapping
Compositions folder stores mapping compositions which consist of a set of
mapping specifications that share a relationship, for example, the same target
mapping.
You can use FastTrack to import metadata from existing physical tables.
10. Highlight the INFOSRVR host and, in the Tasks bar on the right, click on Import
Metadata.
11. Expand the JKLW_DB database connection with the JK Life & Wealth Operational
Database description.
13. Expand JKLW_DB > DIF. Select the CSTSENTIMENT table and the
CUSTOMER_MASTER table. Click Import.
17. Choose to import to the existing INFOSRVR host and click Next. Import to the JKLW_DWH database. Click Finish.
21. Switch over to the mappings view. Also note the Database Metadata folder in the
Browser window. Tip: If you do not see the Browser tab, click View > Browser.
26. Right-click the CUSTOMER_POLARITY table and select Map to.
27. The target fields are now populated. Highlight the source fields, right-click the highlighted area, and select 'Discover More…'.
31. FastTrack will find three results for the four columns. Add the three results as source
fields.
32. Drag and drop the FOLLOWERSCOUNT column from the CST_SENTIMENT source
table into the NumFollowers target field column. Your mapping table:
CSTSENTIMENT.POLARITY → CUSTOMER_POLARITY.Polarity
CSTSENTIMENT.USERID → CUSTOMER_POLARITY.UserId
CSTSENTIMENT.FOLLOWERSCOUNT → CUSTOMER_POLARITY.NumFollowers
CSTSENTIMENT.CATEGORY → CUSTOMER_POLARITY.Category
Validation
35. Ensure that you have no validation errors. Tip: To view the validation tab, click View
> Validation. It will appear in the top right area.
37. Select the DIF > Jobs > Warehouse Jobs folder to store the job. Store the Table
Definitions in the Table Definitions Folder. Click Next.
38. Now we can define the data source connection information. In the Connection
Configuration section, click ‘Manage’.
Name            Database Name    Connector    Write Mode
JKLW_DB_DB2     JKLW_DB          DB2          Insert
JKLW_DWH_DB2    JKLW_DWH         DB2          Insert
40. Name the connection JKLW_DB_DB2 and select the DB2 Connector. Specify
JKLW_DB for the Database Name property. For the authentication information,
select ‘Manage Parameters…’.
41. For the Userid field, create a new parameter called DB2 User. The default value is
db2admin. Click OK. Assign the DB2 User Parameter to the Userid field.
42. Switch to the Password field. Create another parameter called DB2 Password. You cannot assign a default value here. Click OK. Assign the DB2 Password parameter to the Password field.
43. To create the JKLW_DWH_DB2 Configuration, select New and repeat the steps
using the JKLW_DWH database name while reusing the default parameters.
47. You can now close the source to target mapping specification.
48. Let’s observe the generated job in DataStage. Open up DataStage Designer. Log on
to the dif project. Tip: If you still had the Designer window open, click Repository >
Refresh.
49. Open the new job from the Jobs > WarehouseJobs folder.
50. The job consists of a source DB2 connector stage, a transformer stage, and a target DB2 connector stage. Also note that FastTrack has automatically created an annotation on the canvas that documents the specification from which the job was created, along with the time and date.
51. We will now edit the job parameters. Open the job properties.
52. Switch to the Parameters tab. Specify 'inf0server' as the DB2 Password default value and 'db2admin' as the DB2 User default value. Click OK.
53. Open the properties of the source DB2 connector stage. You will notice that the user
name and password fields are already filled with those job parameters.
55. On the target connector stage, we can keep the Append table action with the Insert
write mode since we already created the empty table from InfoSphere Data Architect.
57. Double click on the transformer stage. Here you can see the simple 1:1 mapping we
created in our specification.
59. Open the source connector stage. Note that the Generate SQL property is set to
Yes.
60. Switch to the Columns tab. Only the four relevant columns from the
DIF.CSTSENTIMENT table are part of the Table Definition. This will result in a SQL
statement that will only select these columns, thus keeping the data extraction
process as efficient as possible. Cancel out of the source connector stage.
62. You should see 588 rows being loaded into the CUSTOMER_POLARITY table.
63. Go into the properties of the target DB2 connector and view the data of the last job
run.
64. You will notice that we have transferred all records into the data warehouse target,
including those that do not carry information about product polarity.
65. We need to make sure that we are only reading records from our source that have
this field populated. This can be done by adding a where clause to our select SQL
statement in the source DB2 connector stage.
66. Close the View Data Window and the target stage properties.
68. Switch the Generate SQL option to No. In the Select Statement field, click on the
Tools button.
70. On the left side of the screen, navigate into the Table Definitions folder. From there, open the JKLW_DB database and the DIF schema, and then drag the CSTSENTIMENT table into the area that says 'Drag tables to here'. Note that this is the Table Definition that we imported through FastTrack earlier.
73. Select the expression and copy it to your clipboard. Add the following statement at
the end of the current statement:
OR CSTSENTIMENT_ALIAS.POLARITY = 'negative'
The entire statement should look like this:
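A sketch of the resulting statement is shown below. The alias DataStage generates and the first predicate, assumed here to test for 'positive', may differ in your SQL builder session:
SELECT CSTSENTIMENT_ALIAS.POLARITY, CSTSENTIMENT_ALIAS.USERID,
       CSTSENTIMENT_ALIAS.FOLLOWERSCOUNT, CSTSENTIMENT_ALIAS.CATEGORY
FROM DIF.CSTSENTIMENT CSTSENTIMENT_ALIAS
WHERE CSTSENTIMENT_ALIAS.POLARITY = 'positive'
   OR CSTSENTIMENT_ALIAS.POLARITY = 'negative'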
74. Switch to the SQL tab and view the entire SQL statement.
77. View the data and make sure the select statement with the WHERE clause is working correctly.
78. Open the target DB2 connector stage and change the table action property to
Replace.
80. You can view the data after this second run. If you have specified the WHERE
condition correctly, the performance statistics will already tell you that this time only
133 records were processed.
We will now create a new table that combines the product information from the
CSTSENTIMENT table with the address information from our CUSTOMER_MASTER
table.
During the Discovery phase we found out that these two tables share a key:
CSTSENTIMENT.USERID = CUSTOMER_MASTER.SOCIAL_USERID. Remember that
CSTSENTIMENT was created when we extracted the records from the
SAMPLE_BRAND_RETAIL_FEEDBACK source file. We will use this key to join these
tables together.
5. From the metadata browser window, select the JKLW_DB > DIF > CSTSENTIMENT
> BRAND, CATEGORY, POLARITY, PRODUCT and USERID columns.
6. Drag and drop them into the Source Field side of the mapping.
9. Click OK.
10. Drag and drop the JKLW_DB.DIF.CUSTOMER_MASTER table into the Lookup
Column.
11. Drag and drop the JKLW_DB.DIF.CSTSENTIMENT table into the Sources column.
15. Click OK. The key association appears in the Keys and Fields box.
16. We just added a lookup table to our mapping specification. Save the mapping
specification.
18. Note that FastTrack supports a wide range of simple transformation options.
19. We could have also joined these two tables. Switch to the Mappings view.
20. Now we can add the columns from our lookup table to our mapping.
21. Right-click the next empty Source Field and click Add Lookup Field…
22. Expand ProductGeo and CUSTOMER_MASTER. Select ADDRESS and click OK.
23. Repeat the previous steps and add CITY and STATE.
25. We can now define the target fields. Click into the first Target Field. Switch the field
from a Physical column to a Candidate column.
26. We will create a candidate table called ProductGeo. Fill in the Table Name and Column Name information. You can keep the same column names and the STRING, length 250 data type, except for USERID, which is DECIMAL, length 31. Also make sure that you map the correct fields to each other, since your fields may appear in a different order.
27. Do this for all columns until every target field is defined: ProductGeo.Userid as
DECIMAL 31, and the remaining columns as STRING 250.
33. This time, we will use the ODBC Connector to read and write the data. In the
Connection Configuration, click Manage.
38. You can use the DB2 User and DB2 Password authentication parameters. Double-
click to select each parameter.
40. Choose the JKLW_DB_ODBC connection for the source. Do not define a connection
for the target. Click Finish.
41. Once the job is generated, you can close the Mapping Specification and the
FastTrack client.
DataStage Containers
Containers are reusable objects that hold groupings of stages and links. Containers
create a level of reuse that allows you to use the same set of logic several times
while reducing maintenance effort.
There are two kinds of containers:
Local container
A local container simplifies your job design. A local container can be used in only
one job. However, you can have one or more local containers within a job.
Shared container
A shared container facilitates reuse. They can be used in many jobs. As with local
containers, you can have one or more shared containers within a job.
47. Double-click on the container. The container content will open in a new tab.
48. Switch back to the Product_Geography_Mapping main job. For job design simplicity
we are going to deconstruct this local container. Right-click on the local container
and select ‘Deconstruct’.
51. Rearrange the stages. FastTrack created a job that reads from the DB2 data source
using an ODBC Connector, looks up records from the CUSTOMER_MASTER
reference table, and then writes out through an ODBC Connector stage. This
Lookup stage has only one reference link, but the stage allows for multiple
reference data sets.
53. Switch to the Parameters tab. Specify ‘inf0server’ as the DB2 Password default
value.
54. We will now create a parameter set from these two parameters. This will allow us to
reuse the database connection information in other jobs and centralize the database
authentication management.
Parameter Set
Use parameter set objects to define job parameters that you are likely to use over
and over again in different jobs. Then, whenever you need this set of parameters in
a job design, you can insert them into the job properties from the parameter set
object.
57. Verify that both parameters are present in the Parameters tab and that the DB2
Password has a default value. Click OK.
60. Notice that both parameters have collapsed into a single Parameter Set Object. Click
OK.
61. We now need to update the source ODBC Connector stage with the parameter set
object. Open the CSTSENTIMENT source stage.
62. In the Connection area, click on the #DB2_User# parameter in the User name row.
Specify the parameter from our new Parameter Set Object called DB2Authentication.
Repeat this step for the Password parameter.
64. Open the CUSTOMER_MASTER lookup ODBC Connector stage. In the Connection
properties, update the User name and Password properties with the new Parameter
Set object. Your connection properties should look like this:
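After the change, both connector stages reference the parameters through the
parameter set, using DataStage's #ParameterSet.Parameter# notation (assuming
the parameters kept the names DB2_User and DB2_Password inside the set):

User name: #DB2Authentication.DB2_User#
Password:  #DB2Authentication.DB2_Password#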
66. Double-click on the Lookup stage. You can see the columns of each table. The
source and lookup tables are on the left and the target table on the right. FastTrack
already defined the lookup key for us. Note that the keys are defined in the table
definitions in the lower part of the screen. The keys are also defined in the
CUSTOMER_MASTER table. The key type is = which stands for equality match.
Lookup Stage
For each record of the source data set from the primary link, the Lookup stage
performs a table lookup on each of the lookup tables attached by reference links.
The table lookup is based on the values of a set of lookup key columns, one set for
each table.
68. Notice the options for Condition Not Met and Lookup Failure. The Condition field is
empty.
Lookup Failure
When DataStage cannot find a corresponding record in the reference set based on
the defined key, you can choose one of four options: Continue (processing), Drop
(the record), Fail (the job), or Reject (send the record to a reject link).
71. Update the Connection Username and Password with Parameter set information.
72. The job is not yet ready to run. We need to define the target stage information first.
Open the properties of the ProductGeo ODBC Connector stage.
73. In the Connection area, highlight the Data source row and define JKLW_DWH as the
data source.
75. Use the Parameter Set for user name and password.
77. Define a schema for our ProductGeo table in the Table name section:
DIF.ProductGeo. You may want to update the description, too.
Change the Table action to Create. We will now create a table from the candidate
schema.
78. Switch to the Columns tab. We need to finish up the Table Definition that was
created by FastTrack. Define all columns as Nullable except UserID.
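For reference, with Table action set to Create, the connector will generate DDL
roughly like the following sketch (illustrative only; the exact type mapping depends
on the ODBC driver, and NOT NULL applies only to Userid per the step above):

CREATE TABLE DIF.ProductGeo (
    Userid   DECIMAL(31) NOT NULL,
    Brand    VARCHAR(250),
    Category VARCHAR(250),
    Polarity VARCHAR(250),
    Product  VARCHAR(250),
    Address  VARCHAR(250),
    City     VARCHAR(250),
    State    VARCHAR(250)
);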
79. Click the Save… button. Specify ODBC as the Data source type, JKLW_DWH
as the Data source name and ProductGeo as the Table/file name.
81. Save the Table Definition in the Table Definition\ODBC folder. Click OK.
84. The job should complete successfully. If not, go back and fix the errors. Look through
the log messages of the job. Everything should look OK with a few warnings for
default type conversions.
85. View the result set in the target stage by clicking on View Data.
2. Open the FollowerCountLevel Sequential File stage. On the Properties tab, specify
the file /bootcamp/dif/FollowerCountLevel.txt to be read.
4. In the Format tab, keep the Field Delimiter as comma but change Quote to ‘none’.
5. In the Columns tab, load the following table definition: dif > Table Definitions >
Sequential > FollowerCountLevel.txt. Click OK.
7. Click View Data to verify that the metadata has been specified properly. Select Close
and OK to close the Sequential File stage.
8. Open the Lookup stage and map the input columns from CSTSENTIMENT to the
output as shown below.
9. Set the FOLLOWERSCOUNT column as a Range Key by checking the box in the
‘Range’ column.
10. Then right-click on the FOLLOWERSCOUNT row and select “Edit Key Expression”.
The expression editor will be displayed.
13. In the FollowerLevel lookup set, switch the Key Type to Range (a..z).
14. Drag the Key Expression from the Link_CSTSENTIMENT table into the Key
Expression fields from FollowerLevel.FollowerFrom and FollowerLevel.FollowerTo
rows. From the FollowerLevel lookup table, drag the FollowerLevel column over to the
output table.
15. Click on the Constraints icon. Make sure the Link “FollowerLevel” is selected.
For the Lookup Failure option, select “Reject” and click OK twice.
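If it helps to think of the range lookup in SQL terms, the logic is roughly equivalent
to the following sketch (the table names stand in for the two input links; with Lookup
Failure set to Reject, source rows with no matching range go to the reject link
instead of the output):

SELECT c.*, f.FollowerLevel
FROM CSTSENTIMENT c
INNER JOIN FollowerCountLevel f
    ON c.FOLLOWERSCOUNT BETWEEN f.FollowerFrom AND f.FollowerTo;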
16. Add a sequential file stage above the lookup stage. Name the stage
FollowerCountRejects. Link the lookup stage to the sequential file stage. Name the
link Rejects.
17. Open the FollowerCountRejects sequential file stage and define the following file
property: /bootcamp/dif/FollowerCountRejects.txt. Click OK.
18. Open the Transformer Stage. Map all input columns to the output, replacing the
existing columns.
19. In the Transformer stage, click on the properties icon. In the Stage > General tab,
ensure that Legacy null processing is disabled. This means that we can process
NULL values inside the Transformer stage. Click OK.
21. Open the CUSTOMER_POLARITY target DB2 Connector stage properties. Switch
the table action property to Replace. Click OK.
23. Run the job and after it’s finished, validate the results by opening the
CUSTOMER_POLARITY target stage and viewing the data.
26. In the Repository window, right-click on the SampleSentiment job. Select Create
copy.
28. Edit the SampleSentimentData source sequential file stage table definition. Read the
USERID field as Numeric 25.
29. Edit the Sort stage to specify the key as USERID and Sort Order is ascending as
shown in the snapshot below:
30. Note that you could add additional sorting keys when the Sorting Keys folder is
highlighted. This allows you to sort records within the first sort key group. Keep
USERID as the only sorting key.
31. Don’t forget to map all the input columns to the output in the Output tab of the Sort
stage. If you did not delete the links earlier, the mappings will still be there. In this
case, update the target Table Definition by switching the UserID field to Numeric.
32. In the target Sequential File stage, update the file parameter to
/bootcamp/dif/SAMPLE_Brand_Retail_Feedback_Sorted.csv
33. Save and compile the Job. Run the job and check the results. The output should
contain data sorted by USERID in ascending order.
34. Our source and target stages are sequential while the Sort stage is a parallel stage.
In our case, DataStage is automatically collecting the records by the USERID column
to produce a sorted sequential output. You may also specify the collection method
explicitly. Go to the Partitioning tab in the target Sequential File stage. Select the
Sort Merge collector and specify USERID as the key.
35. The sorted output shows that we have multiple records per UserID. But what if we
are only interested in unique records? We will learn how to achieve this in the next
task.
1. Use the Sort stage when you need your data to be sorted in a specific order.
2. You specify sorting keys as the criteria on which to perform the sort.
3. The first column you specify as a key to the stage is the primary key, but you can
specify additional secondary keys.
38. Edit the Remove Duplicates stage and specify the Key column as USERID. Note the
options: you can choose to retain either the first or the last record of each duplicate
key group. If you want to retain records by a specific logic, you would have to apply
a multi-key sort (multiple sorting keys) to the input records as they pass into the
Remove Duplicates stage.
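As an illustrative aside (not part of the lab), the retain-first behavior corresponds
roughly to this SQL, where the ORDER BY inside the window carries your
secondary sort keys (BRAND here is just a hypothetical example of such a key,
and feedback is a hypothetical table holding the sorted input):

SELECT *
FROM (SELECT f.*,
             ROW_NUMBER() OVER (PARTITION BY USERID
                                ORDER BY BRAND) AS rn
      FROM feedback f) t
WHERE rn = 1;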
39. From the Output tab, click the Mapping tab, and specify the mapping between input
and output columns as shown below. Click OK to close the stage.
40. Open the target Sequential File stage and specify the output file as
/bootcamp/dif/SAMPLE_Brand_Retail_Feedback_NoDups.csv
42. Run the job and verify the results. You can already see from the performance
statistics that 588 rows entered the Remove Duplicates stage and only 207 rows
were carried over to the target stage.
43. Observe the job log. You will notice that the target sequential stage issued a
warning.
A stage can request that the next stage in the job preserves whatever
partitioning it has implemented. This is defined by the Preserve
Partitioning flag. If the next stage ignores this request, a warning is
displayed on the log to notify the developer.
In our case, the Remove Duplicates stage had the default Preserve
Partitioning flag, which is Propagate. Since we are writing to a
sequential target, the parallel partitions have to be collected and
cannot be propagated.
In the meantime, our QualityStage expert has improved our customer master data that is
stored in the CUSTOMER_MASTER table. The FULLNAME field is now split into first
and last names and the GENDER field values are now complete thanks to
QualityStage’s Country Rule Set processing.
We now have a file that contains the following fields: Identifier, Gender, Firstname and
Lastname. We will now join the new file with our existing master data and replace the
names and gender values. Processing a set of master records with update records is a
good use case for the Join and Merge stages that we will be looking at next.
45. Build a new parallel job that reads from the new file using a sequential file stage and
from the CUSTOMER_MASTER table using a DB2 Connector Stage. Both source
stages are joined and then the data is written to a dataset.
46. Click New > Parallel Job. Save this job as SampleSentimentJoin in the Jobs folder.
47. Properly name the stages and links as good standard practice.
48. Open the source Sequential File. For the File property, specify
/bootcamp/dif/CST_FIRST_LAST_NAME_GENDER.txt. This is what the file looks
like:
49. The file does not contain column names. The format is comma-separated with
double-quoted values. Load the table definition for CST_FIRST_LAST_NAME_GENDER.txt
from the Table Definitions > Sequential folder for the Format and the Columns. Click
the View Data button.
50. View the data again to make sure the file can be read. Click Close and OK to close
the Sequential File stage.
51. Open the properties of the CUSTOMER_MASTER DB2 Connector stage. Load the
JKLW_DB data connection and specify to generate the SQL. Specify
DIF.CUSTOMER_MASTER as the table name.
52. In the Columns tab, load the table definition from Table Definitions >
DIF.CUSTOMER_MASTER.
53. Check IDENTIFIER as your Key. Your Table Definition should look like this:
54. The stage will use this table definition when it’s generating the select statement. We
will not be using the GENDER and FULLNAME fields anymore since these are now
coming from the new master data file. For job performance, it is good practice not to
include these columns in the select statement. Delete these two columns from the
Columns list by highlighting them in the Columns tab and pressing the Delete button.
Your new column list should look like this:
55. Go back to the Properties tab and view the data to make sure your settings are fine.
Click Close and OK to close the DB2 Connector stage. You may save the job.
56. Open the Join Stage. In the Properties tab, specify the join key as IDENTIFIER and
Join Type as Inner as below.
57. Check the Link Ordering tab. It is important to identify the correct left link and right
link when doing either a left outer join or right outer join. Since we are doing an Inner
join, it only serves to identify which link the key column is coming from. You can keep
the default.
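In SQL terms, what the Join stage is configured to do is roughly the following
(a sketch; cst_names is a hypothetical table standing in for the sequential file's
records, and the remaining CUSTOMER_MASTER columns would be listed as well):

SELECT m.IDENTIFIER, n.FIRSTNAME, n.LASTNAME, n.GENDER
FROM DIF.CUSTOMER_MASTER m
INNER JOIN cst_names n
    ON m.IDENTIFIER = n.IDENTIFIER;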
58. Click on the Output > Mapping tab and map the columns to the target. Click OK.
59. Open the target DataSet Stage NEW_CUSTOMER_MASTER. In the Properties tab
specify the path and file to write the output records
/bootcamp/dif/NEW_CUSTOMER_MASTER.ds
61. Save and compile the job. Run the job. It should finish successfully.
62. View the generated file from the Dataset Stage and verify that First Name and Last
Name fields are now separate and that the Gender field is now fully populated.
68. Open the Merge stage and specify the Key which will be used for matching records
from the two files. Select IDENTIFIER. We will keep unmatched master records.
69. Check the Link Ordering tab to make sure that you have the two input sources set
correctly as Master and Update links. For this exercise, OldMasterData should be
the Master link and NewMasterData should be the Update link.
70. Click on the Output > Mapping tab. Verify that all columns are mapped and that they
are mapped correctly.
71. We will overwrite the joined dataset with this job run. You can keep the existing file
properties of the target Dataset Stage.
73. Observe the produced output and the job log. There is one warning message:
Merge,1: Master record (87) has no updates.
74. All 207 rows are passed down to the dataset since we decided to keep unmatched
master records in the Merge Stage properties.
75. NULL values were passed for the unmatched master record.
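Contrast this with the inner join from the previous task: keeping unmatched master
records makes the Merge behave roughly like a left outer join (a sketch; old_master
and new_master are hypothetical names for the Master and Update link data):

SELECT m.IDENTIFIER, u.FIRSTNAME, u.LASTNAME, u.GENDER
FROM old_master m
LEFT OUTER JOIN new_master u
    ON m.IDENTIFIER = u.IDENTIFIER;

Master record 87 has no match in the update data, so its update columns come
back as NULL, exactly as observed.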
77. Create a new parallel job called SampleSentimentFunnel with two Sequential File
stages. We will combine the two files that we split earlier in our job
SampleSentimentPartition.
Note: If you did not run the data partitioning and collection job that produced the two
files, you can load them from /bootcamp/dif/solutions.
78. Open Sequential File stage SampleSentiment1. On the Properties tab, specify the
file to read as /bootcamp/dif/TargetFile1.txt. Set the First Line is Column Names
property to True.
79. Click on the Columns tab, then on the Load button to load the
SAMPLE_Brand_Retail_Feedback.csv table definition from the folder /Table
Definitions/Sequential.
80. Click View Data to verify that the metadata has been specified properly. Click Close
and OK to close the source Sequential File.
81. Open the Sequential File stage SampleSentiment2. On the Properties tab specify the
file to read as /bootcamp/dif/TargetFile2.txt. Don’t forget to set the First Line is
Column Names to True.
82. Click on the Columns tab, then on the Load button to add the column definitions
from the SAMPLE_Brand_Retail_Feedback.csv table definition.
83. Click View Data to verify that the metadata has been specified properly. Click Close
and OK.
84. Open the Funnel stage and view the properties. Keep the Continuous Funnel mode.
85. Select the Output tab and map the input columns to the output columns.
87. Open target Sequential File stage SentimentCombined. On the Properties tab
specify the path and file to write the output records
/bootcamp/dif/CustomerSentimentCombined.txt. Set First Line is Column Names to
True.
Impact analysis can help you identify related assets. It is useful to identify affected
assets when you are about to change an asset like a Table Definition.
93. The result is shown in the Repository Advanced Find window. These are the jobs
that use this Table Definition.
94. Click the right mouse button over the SampleSentiment job and then click “Show
dependency path to…”
95. Maximize the window or use the Zoom button to adjust the size of the dependency
path. Notice that you have a detailed view of the stages and links that use this Table
Definition. The graph shows you in detail which stages require attention when you
are about to change the Table Definition.
96. Close the Path Viewer window at the bottom of the screen.
97. Mark the SampleSentiment and the SampleSentimentSort jobs in the Repository
Advanced Find window. Right-click on one of the highlighted jobs and choose
‘Compare selected’.
98. Once the result is available, close the Repository Advanced Find window.
99. DataStage displays the two jobs as well as the Comparison Results window. It
contains a detailed account of the changes made, e.g. that the Copy stage was
replaced with a Sort stage and that the USERID field was changed from VarChar to
Numeric to enable the calculation in the Sort stage. Note that you can also compare
Table Definitions with each other in the same way.
In this task, we will calculate the revenue generated by our customers this month.
2. Open the source sequential file properties and set the file path to
/bootcamp/dif/Sales.txt. Don’t forget to load the table definition in the Format and
Columns tab. It is located at \Table Definitions\Sequential\Sales.txt.
3. Click on View data and make yourself familiar with the input columns before they are
processed. We will use the Aggregator stage to calculate the total revenue for the
sales data in this file.
4. Edit the Aggregator stage to add the grouping key, CustomerID. Also set the
property Aggregation Type = Calculation, as shown below.
5. Select the Column for Calculation = TotalPrice and, in the lower-right portion of the
screen, select Sum Output Column.
6. A new column will be generated with the aggregation results. Name the new column
MonthlyRevenue.
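The Aggregator configuration corresponds to this SQL sketch (Sales is a
hypothetical table standing in for /bootcamp/dif/Sales.txt; the column names follow
the Sales.txt table definition):

SELECT CustomerID,
       SUM(TotalPrice) AS MonthlyRevenue
FROM Sales
GROUP BY CustomerID;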
7. Click on the Output tab. In the Mapping sub-tab, map both input fields into the target
file.
8. Click OK.
9. Open the MonthlySales target Sequential File stage. In the Properties tab specify
/bootcamp/dif/MonthlySales.txt as the file to write. Set First Line is Column Names to
True.
10. In the Partitioning tab, select Sort Merge for collector type; check the Perform sort
box, and select MonthlyRevenue as the key with an option of Descending order.
Run the job and verify the results. The final file should contain the Grouping Key as
CustomerID and the MonthlyRevenue column in descending order.
4. On the Parameters tab, define the parameters as shown. Don’t forget the last slash
for the directory value.
5. In the Values tab, specify a name for the Value File that holds all the job parameters
within this Parameter Set. Click OK.
8. In the menu bar, go to Edit and open up your Job Properties. Select the Parameters
tab. Click Add Parameter Set. Select your SourceTargetData parameter set.
9. Click OK.
10. Click OK.
12. Replace it with a sequential file stage called PositiveSentiment. Add another
sequential file stage called NegativeSentiment as a second output of the transformer.
13. In the Transformer stage, map all the columns from the source link to both target
links. Select all the source columns and drag and drop them to each output link. The
Transformer editor should appear as shown below:
14. Open the Transformer stage constraints by clicking on the chain icon. We will
now create a constraint that identifies records with positive or negative values for the
POLARITY column. Open the Constraint editor by double-clicking into the Constraint
field of the PositiveSentiment row. To insert a column name without typing it, click on
the … icon and select “Input Column”.
15. For the PositiveSentiment link, define the following constraint: Transform.POLARITY
= “positive”. Make sure to use all lowercase letters. Hit Enter.
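The constraint for the NegativeSentiment link follows the same pattern, again in all
lowercase:

Transform.POLARITY = "negative"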
19. Configure the properties for the target Sequential File stages. Open the
PositiveSentiment output stage. Use the Dir and PositiveTargetFile parameters
included in the SourceTargetData parameter set to define the File property as shown.
20. Open the NegativeSentiment output stage. Use the Dir and NegativeTargetFile
parameter included in the SourceTargetData parameter set to define the File
property as shown. Also, set the option First Line is Column Names as True.
22. View the data in the targets and verify that the records were split up correctly.
23. In the log you may notice warnings saying Exporting nullable field without null
handling properties for the three target Sequential File stages. We see this warning
since we are reading from a database table with a Table Definition that allows
NULL values. We are then writing to Sequential File stages, where NULL values
must have some character representation.
24. We can define this character representation in the target Sequential File stages by
adding the Null field value parameter in the Field defaults folder of the Format tab.
You may choose a number, string or escape character.
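For example (an arbitrary illustrative choice, not a required value):

Null field value = 'NULL'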
25. Note: When you read these sequential files again as source files in another job, you
will have to specify in the stage properties that this string represents the NULL
value.
27. Add a new Sequential File stage linked as an output to the Transformer stage and
name it as shown below.
28. In the Transformer, map all the input columns across to the new target link.
29. Open the Constraints window for the Otherwise output link. Note: You can also
double-click the Constraint box.
33. Add a null field value in the Format tab. Click OK.
34. Save, compile, and run your job. No rows should be going into the Otherwise link.
Our custom SQL select statement in the DB2 Connector stage has a WHERE clause
that reads only positive or negative POLARITY values. Let’s change that.
35. Open the source connector stage. Remove the WHERE clause from the select
statement.
36. Click OK. Compile and run the job again. You should now see records getting
passed into the otherwise link that do not satisfy the transformer constraint condition.
39. Open the Transformer stage. Right-click in the Stage Variables window and click
Stage Variable Properties…
40. Under the Stage Variables tab, create a stage variable named DateProcessed with
Date as the SQL type.
42. Double-click in the derivation editor for the DateProcessed stage variable. Define a
derivation that returns the current date using the function CurrentDate(). You can
either type it in or look the function up.
43. Create a new column named ProcessedDate with Date as the SQL type for each of
the three output links by typing the new column name and its corresponding
properties in the next empty row of the output column definition grid, located at the
bottom right, as shown here.
44. Define the derivations for these columns using the Stage Variable DateProcessed by
dragging the DateProcessed variable and dropping it into the Derivation space of the
ProcessedDate fields. The Transformer editor should look like this:
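In summary, the derivations end up as follows:

DateProcessed (stage variable):        CurrentDate()
ProcessedDate (on each output link):   DateProcessed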
45. Exit the Stage Editor by clicking OK, save, compile and run the job.
46. When you view the result data files, you will find the ProcessedDate column filled.
5. You must define a null value representation when writing nulls to sequential files.
6. The Transformer stage uses a set of functions to transform your data.
7. You can define constraints that allow you to pass data that meets the constraint
condition to a specific output link.
5. Select the dif project, which has new jobs, and then click the Detect Associations
icon. Note: Keep the dstage1 project checked. Confirm to run the service.
6. Once this step has completed, we are ready to create a data lineage from our
CUSTOMER_POLARITY data warehouse table back to the data source.
10. Our data lineage path shows two database tables and five jobs lined up in the data
lineage flow. The solution jobs that are in place are included as well.
11. You can zoom in and out, export the lineage as PDF or JPG, and also view specific
relationship types.
Relationship types
You can select to view Design, Operational, and User-Defined relationships. Our job
here only contains design information since at this point we haven’t imported our
operational data from the engine tier.
Job Design Relationships
Displays data items that the job reads from or writes to. Displays the previous and
next jobs based on job design information that is interpreted by the automated
services. Displays job design parameters and whether runtime column propagation
is enabled.
Job Operational Relationships
Displays the previous and next jobs based on the values of parameters at run time,
based on operational metadata that is interpreted by the automated services.
Job User-Defined Relationships
Displays the data items that a job reads from or writes to, based on the results of
manual linking actions that are performed by the Metadata Workbench
Administrator.
13. We can now build a job sequence that will run the SequentialSentimentToDB2 job
and then the CUSTOMER_SENTIMENT_TO_CUSTOMER_POLARITY_LOOKUP
job that also contains the follower category field.
14. Log out from the Metadata Workbench and close the browser.
17. Drag and drop two Job Activity stages to the canvas, link them, and name the stages
and links as shown.
18. Open the Job (Sequence) Properties. In the General tab, verify that all the
compilation options are selected.
19. Click the Parameters tab and add the parameter sets SourceTargetData and
DB2Authentication as shown. Load these parameters through the Add Parameter Set
button. These parameters will be available to all the stages within the job sequence
during execution.
21. Open up each of the Job Activity stages and associate the parallel job you want to
execute with each stage.
SeqJob Activity Stage    Parallel Job
CustomerSentiment        SequentialSentimentToDB2
CustomerPolarity         CUSTOMER_SENTIMENT_TO_CUSTOMER_POLARITY_LOOKUP
22. For the Job Activity stage CustomerSentiment, change the Execution action to
“Reset if required, then run”.
23. For the Job Activity stage CustomerPolarity, we want it to be executed only when the
upstream job ran without any error, although possibly with warnings.
24. In the first Job Activity stage CustomerSentiment, open the Triggers tab and set the
Expression Type to Custom (Conditional). The expression references the stage name
followed by $JobStatus.
28. The result for the CustomerSentiment stage should look like:
CustomerSentiment.$JobStatus = DSJS.RUNOK or
CustomerSentiment.$JobStatus = DSJS.RUNWARN
30. Open the job log for the job sequence. Verify that each job ran successfully. Locate
and examine the job sequence summary.
31. Examine what happens if the first job aborts. To cause that, open up the job
SequentialSentimentToDB2 and replace in the source Sequential File name
SAMPLE_Brand_Retail_Feedback.csv with the non-existent dummy.csv as shown
below. Save and compile SequentialSentimentToDB2.
32. Execute the job sequence SeqJob and check the log showing the job is aborted. The
first error message in the job log should contain the relevant error.
Note: you don’t need to recompile the job sequence to execute it since nothing was
changed in the job sequence.
33. Open the SequentialSentimentToDB2 job, replacing the dummy.csv source file with
the original SAMPLE_Brand_Retail_Feedback.csv in the source Sequential File
stage File property. Then save and compile the job.
34. Save the job sequence SeqJob as SeqJobVar. Add a User Variable Activity stage as
shown.
35. Open the User Variables Activity stage and select the User Variables tab. Right-click
in the gray space and select Add Row to create a variable named
EnableCustomerPolarity with value 0. Click OK.
36. We want to enable the execution of CustomerPolarity only if the value of the
EnableCustomerPolarity variable is 1. To specify this condition, open the Trigger tab
in the CustomerSentiment Job Activity stage and modify the expression as shown.
Note: you can refer to the User Variable Activity stage variables within any stage in
the job sequence using the syntax:
UserVariableActivityName.UservariableName
(CustomerSentiment.$JobStatus = DSJS.RUNOK or
CustomerSentiment.$JobStatus = DSJS.RUNWARN) and
UserVars.EnableCustomerPolarity = 1
38. Start the job using the DataStage and QualityStage Director client. The Director is
the client component that validates, runs, schedules, and monitors jobs. You can
invoke the Director client through Tools > Run Director.
39. Switch into the Jobs folder, highlight SeqJobVar and click on ‘Run now..’ in the
shortcut icon bar to execute the job sequence again. Click Run.
40. Switch to the job log view by clicking on the ‘Notebook’ icon.
42. Edit the UserVars stage and change the EnableCustomerPolarity value to 1. This
will cause CustomerPolarity to execute.
43. Compile and run the job sequence again and verify in the logs that CustomerPolarity
was executed.
46. Open the Wait For File stage and set the filename of the file as shown below.
Note: the “Do not timeout” option makes the stage wait indefinitely until the file
StartRun appears in the specified location.
48. Compile and run your job. Notice that after the job starts it waits for the file StartRun
to appear in the expected folder.
51. Create a file named StartRun in the directory /bootcamp/dif. You can use the
command “touch /bootcamp/dif/StartRun” for this purpose.
52. Switch back to the log view. Notice the log messages and the job sequence
execution should now continue by running the stage following the Wait For File
Activity.
55. Edit the Terminator stage so that any running job is stopped when an exception
occurs.
56. To see how the exception handling takes control over the job sequence, you will
have to make one of the jobs that are part of the Job Sequence fail. Modify the job
SequentialSentimentToDB2 replacing the SAMPLE_Brand_Retail_Feedback.csv file
name in the source sequential file stage with dummy.csv and compile the job.
57. Compile and run the job sequence again and check the log with the Director client.
Note that as SequentialSentimentToDB2 did not finish successfully, the sequence is
aborted.
3. You should find your blueprint with the timeline feature enabled. Make sure you are
viewing the End of workshop milestone as shown below.
4. We have extracted our Web Data from the file, transformed the data and loaded it
into the warehouse. Now our BI Analysts can start building reports based on the
customer sentiment found in our source data.
5. In this final exercise, we will learn how to expose the polarity and follower count data
that we loaded into the warehouse as a web service and therefore make it
consumable by other applications.
8. Create more space on the blueprint by moving the advanced analytics elements up
and making the Data Repositories and Analytics domains smaller.
9. From the Groups section in your Palette, add another domain to your blueprint.
Name the domain Web Services.
10. Add an Information Service element from the Consumers and Delivery category to
your Web Services domain.
11. Add an Application object from the Consumers and Delivery category to the
Consumers domain. Rename the Application to SOA Application.
13. Note the incoming and outgoing links when you hover your mouse over the elements.
Click and drag these links to create the connections.
14. Mark the two new elements and go to the properties section on the lower right side of
the screen. Switch to the Milestones section and define these two objects to show up
at the End of workshop milestone.
15. You can now enable the timeline view again to reduce the blueprint scope to what
we are achieving in this class.
2. Create a new parallel job and save it as PolarityFollowerCountService in the Jobs >
WarehouseJobs folder. We will build a simple job in this exercise. Note that the real
value unfolds when you take advantage of DataStage’s and QualityStage’s full
transformation and data cleansing potential in combination with the service
endpoints.
4. Connect the stages and name the stages and links as shown:
9. Switch to the Columns tab and load the CUSTOMER_POLARITY table definition.
13. In the table definition field for the CustomerID link, fill in the following metadata:
14. Click and drag the UserID key from the input stream to the reference stream.
17. We need to specify what action should be taken if a lookup on a link fails. Make sure
the Lookup Failure field is set to Continue. This will set our reference data to NULL.
The stage continues processing any further lookups before sending the row to the
output link.
19. You will notice that every link now has table definition metadata defined, as indicated
by the small table icons. You will also notice that the ISD Input and Output stages are
sequential, as indicated by the fan-in and fan-out icons, while the Lookup stage and
the DB2 Connector stage are parallel stages.
20. Before we can compile the job, we will need to make this job available for information
services. Open the job properties.
21. In the General tab, set the checkmarks for “Allow Multiple Instance” and “Enabled for
Information Services”.
24. To start the Information Services Director client, double-click on the IBM InfoSphere
Information Server Console icon on the desktop; or select Start > Programs >
IBM InfoSphere Information Server > IBM InfoSphere Information Server Console.
26. Create a new Information Services project: open the menu under ‘No project
selected’, and select ‘New Project’.
28. Once the project is created, switch to the Users tab. You can fill in the users that may
connect to the project, and their roles for the project: Information Services Director
Designer, and/or Information Services Director Project Administrator.
31. Each project needs to connect to one or more information providers. This project
needs to access the DataStage engine.
32. From the home page of the Information Server Console, click on Home >
Configuration > Information Services Connections.
36. Fill in the user name and password information to connect to the DataStage engine:
dif / inf0server.
38. Click on the Save, Enable and Close option under the Save button.
45. Double-click on Bindings to define the bindings to be used by the service. From the
Attach Bindings menu, select SOAP over HTTP.
46. We are ready to define one or more operations to be associated with the service.
Under the Operations folder, double click newOperation1. The newOperation1 tab
will open.
47. The first operation was created automatically. Name the operation DataStageLookup
and select an information provider.
50. Select the PolarityFollowerCountService job from the Jobs > WarehouseJobs folder.
51. Select the job PolarityFollowerCountService, then save the application and
close it.
52. Our service is now ready to be deployed. Highlight the service and click Deploy.
54. Click Deploy. You can monitor the deployment status at the bottom of the screen.
55. Wait until the deployment has completed, and then start the InfoSphere DataStage
and QualityStage Director.
Expand the dif project and then the Jobs folder, and go to WarehouseJobs. You will
see that the job “PolarityFollowerCountService”, with its invocation ID, is currently up
and running and ready for the call.
59. The Information Services Director is now waiting for SOAP over HTTP requests.
Each request is processed, and the DataStage engine receives the UserID string as
input for our service-enabled job.
60. Go back to the ISD (Information Server Console). In the project menu, click the
OPERATE icon and select Deployed Information Services Application.
62. Click the ‘View Service in Catalog’ button next to the service; this will take you
directly to the service view within the Information Services Web Catalog.
63. Click on ‘Bindings’ in the catalog view. Expand the ‘SOAP over HTTP’ binding to
open the binding properties. Click on the Open WSDL Document link.
This will open the WSDL document in a separate browser window. Have a look at
the WSDL. You might recognize some of the information that we had looked at
earlier. Keep this WSDL browser window open; we will need to copy the link for
testing purposes in the next few steps.
64. Open InfoSphere Data Architect by double clicking on the IBM InfoSphere Data
Architect icon on the desktop.
66. Switch to the Web perspective by clicking the Open Perspective button in the top
right corner. Select ‘Other’, choose Web near the bottom of the list, and click OK.
67. Click ‘Run’ and select ‘Launch the Web Services Explorer’ from the pull-down menu.
68. In the top right corner of the Web Services Explorer window, click the ‘WSDL Page’
icon (second from the right).
69. In the Navigator of the Web Services Explorer, click ‘WSDL Main’ then copy the URL
of the WSDL document from the browser window into the WSDL URL text field and
click ‘Go’.
As you can see, the Web Services Explorer could interpret the WSDL and discovered
an operation ‘DataStageLookup’ and an endpoint (Service Provider) to which the
request would be sent.
70. Click on the ‘DataStageLookup’ operation name link.
71. On the ‘Invoke WSDL Operation’ window, enter a userid (e.g. 7653556196,
26876535196, or 76534565196) and click ‘Go’.
The response message at the bottom of the window includes the polarity and the
number of followers for that userid.
What has just happened here? The input in the Web Services Explorer was sent as
a SOAP/HTTP request to the service provider (InfoSphere Information Server, in this
case), which then invoked the InfoSphere DataStage job. The job did a lookup
against the customer polarity repository to retrieve any existing customer data and
returned the result to Information Server, which packaged it as a SOAP message
and sent it back to the Web Services Explorer.
We now have a Web service that checks a userid against the customers in our
customer repository. This service could be used by any JKLW application. All it
takes now is to publish this service in our service registry.
Summary
IBM InfoSphere Information Services Director is a powerful tool for creating Web services
on top of InfoSphere DataStage and QualityStage jobs, as well as SQL statements against
DB2, Oracle, or Classic Federation data sources. InfoSphere Information Services Director
services package information integration logic that insulates developers from the
underlying complexities of data sources. InfoSphere Information Services Director
provides support for load balancing and fault tolerance for requests across multiple
servers. It also provides foundation infrastructure for information services.