
BODS15 SAP BO Data Integration 4.0
Defining Data Services
Business Objects Data Services provides a graphical interface that allows you to easily create jobs that extract data
from heterogeneous sources, transform that data to meet the business requirements of your organization, and load
the data into a single location.
Data Services combines both batch and real-time data movement and management with intelligent caching. It
provides you a single data integration platform for information management from any information source and for
any information use.

(Building a reporting data mart from scratch takes 6-12 months; building it with a Rapid Mart takes 6-12 days.)
Data Services performs 3 key functions that can be combined to create a scalable, high-performance data platform.
Loads ERP data into a DSO in batch or in real time.
Creates routing requests.
Applies transactions against ERP systems.

Data Services Designer is used to:
Create, test, and manually execute jobs that transform and populate a data warehouse.
Create data management applications that consist of data mappings, transformations, and control logic.

Data Services Repository is a set of tables that holds:
user-created and predefined system objects;
source and target metadata;
transformation rules.
There are 3 types of repositories:
1. local repository to store definitions of source and target metadata, and Data Services objects
2. central repository (optional) to support multiuser development
3. profiler repository to store information used to determine the quality of data
Data Services job server retrieves the job from its associated repository and starts the data movement engine.
Data Services Object types:
1. projects
2. jobs - the smallest unit of work that you can schedule independently for execution; a job can include workflows and data flows
3. workflows (optional) - order data flows and the operations that support them, and define the interdependencies between data flows
4. data flows - transform source data into target data
5. scripts
6. datastores
7. file formats (flat files, XML schemas, Excel)




Defining Source and Target Metadata
A datastore is a connection to a database. Data Services uses datastores to import the metadata that describes the data in the data source.
Datastore types:
Application datastores to import metadata from an ERP system
Database datastores to import metadata from an RDBMS
Adapter datastores to access application data and metadata
Metadata types:
External metadata: metadata as it exists in the external database or application
Repository metadata: metadata imported into the repository and used by Data Services
Reconcile vs. Reimport: reconcile compares external and repository metadata; reimport overwrites the repository metadata with the external metadata
A file format is a generic description that you can use to describe one file or multiple data files if they share the same format. File formats are used to connect to source or target data when the data is stored in a flat file.
The file format editor describes files in various formats:
Delimited format (commas or tabs)
Fixed width format (fixed column width is specified by the user)
SAP ERP format
The file format editor has 3 work areas:
Property value (helps you edit file format property values)
Column attributes (helps you edit and define columns or fields in the file)
Data preview (allows you to view how the settings affect sample data)
Flat file editor supports file readers and error handling.
There is also the Excel File format editor.
Creating Batch Jobs
Project
single-use object that groups jobs
the highest level of organization
Jobs
the only executable objects in Data Services
created in the project area or the Local Object Library
If a job has syntax errors, it does not execute
Workflows (Optional object)
sequence the decision-making process for executing data flows
Created in Tool Palette/local Object Library
Include:
Data flows
Conditionals
while loops
try/catch blocks
scripts
Other work flows
Workflows can:
1. Call data flows to perform data movement operations.
2. Define the conditions appropriate to run data flows.
3. Pass parameters to and from data flows.
Dataflows
Define how the information moves from a source to a target.
Closed operations
Data flows include:
Source objects
Target objects
Transforms
The key activities of data flows are extracting, transforming, and loading data to target objects. Data sets created in a data flow are not available to other steps in the work flow.

Query transforms perform operations such as:
1. filter data
2. join data
3. map columns
4. perform transformations and functions
5. perform data nesting and unnesting
6. add columns, schemas and function results
7. assign primary keys
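As an illustration, the column mappings and the WHERE clause of a Query transform are written in the Data Services expression language. A minimal sketch follows; the table and column names in it are hypothetical:
    # Hypothetical mapping for an output column FULL_NAME
    CUSTOMER.FIRST_NAME || ' ' || CUSTOMER.LAST_NAME
    # Hypothetical WHERE clause joining two sources and filtering by date
    CUSTOMER.CUST_ID = ORDERS.CUST_ID and ORDERS.ORDER_DATE >= to_date('2011.01.01', 'YYYY.MM.DD')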

Join ranks indicate the weight of the output data set if the data set is used in a join.
Cache holds the output from the transform in memory for use in subsequent transforms.
A template table allows you to modify the schema in the data flow without going to the RDBMS.

Troubleshooting Batch Jobs
Descriptions and annotations (a convenient way to add comments to objects and workspace diagrams)
Annotation: an object in the workspace, which describes a flow, a part of a flow, or a diagram.
An annotation looks like a sticky note with a folded-down corner, and you can add it from the Tool Palette.
A description is attached to a particular object.
You can make a description visible in the Designer by performing these tasks:
Enter a description into the properties of the object.
Enable the description on the properties of that object by right-clicking it.
Enable the View Enabled Object Descriptions option in the toolbar or menu.
The Designer determines when to show object descriptions based on a system-level setting and an object-level setting. Both settings must be activated to view the description for a particular object.
Note: The system-level setting is unique to your setup.

Tracing jobs
Use trace properties to select the information that Data Services monitors and writes to the trace log file during a
job.
Trace properties determine what information is written to the log.
They can be changed temporarily (execution properties) or persistently (job properties).
Be aware that some trace options write a message for every row processed.
Using log files
As a job executes, Data Services produces three log files (can view these from the project area).
The log files are, by default, also set to display automatically in the workspace when you execute a job.
You can select the Trace, Monitor, and Error icons to view the log files, which are created during job execution.



Examining monitor logs
Use monitor logs to quantify the activities of the components of the job.
A monitor log lists the time spent in a given component of a job and the number of data rows that streamed through the component.


View Data
View Data allows you to see source data before you execute a job. Using data details you can:
Create higher quality job designs.
Scan and analyze imported table and file data from the Local Object Library.
See the data for those same objects within existing jobs.
Refer back to the source data after you execute the job.
View Data displays your data in the rows and columns of a data grid. The default sample size held in memory is 1000 rows for imported sources, targets, and transforms.

Interactive Debugger
An interactive debugger is included in the Designer; it allows you to troubleshoot your jobs by placing filters and breakpoints on lines in a data flow diagram. This, in turn, enables you to examine and modify data row by row during a debug mode job execution.
When you execute a job in debug mode, Designer displays several additional windows:
Call stack,
Trace,
Variables,
View Data panes
Setting Filters and Breakpoints in the Data Flow
You can place a filter or a breakpoint on the line between a source and a transform, or between two transforms, when a job runs in debug mode.
When you set a filter and a breakpoint on the same line, Data Services applies the filter first and then applies the breakpoint (the breakpoint applies to the filtered rows only).
The debugger provides several commands:
Step Over allows you to go to the next connecting line in the data flow.
Get Next Row allows you to move down one record.
Continue allows you to proceed to the next breakpoint.
You can use a filter if you want to reduce a data set in a debug job execution. It does not support complex
expressions.
You can use a breakpoint to pause a job execution and return to the same location. It can be based on a condition or
set to break after a specific number of rows.

Using Auditing Points, Labels, and Functions
An audit point represents the object in a data flow where you collect statistics. You can use the auditing point to
audit a source, a transform, or a target in a data flow.
An audit label represents the unique name in the data flow that Data Services generates for the audit statistics
collected for each audit function that you define. You can use these labels to define audit rules for the data flow.
When the audit point is on a table or output schema, 2 labels are generated for the Count audit function:
$Count_objectname
$CountError_objectname
When the audit point is on a column, the audit label is generated with $auditfunction_objectname.
You can define audit points on objects in a data flow to specify an audit function. The audit function represents the
audit statistic that Data Services collects for a table, output schema, or column. You can choose from these audit
functions: Count, Sum, Average, and Checksum.
Defining Audit Rules and Actions
An audit rule is a Boolean expression that compares the collected audit statistics, typically by using the audit labels (for example, comparing the source row count to the target row count). If a rule fails during job execution, you can choose one or more actions: raise an exception, email a list of recipients, or run a script.
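For example, a simple audit rule built from Count labels might compare the source and target row counts; the object names below are hypothetical:
    # Hypothetical audit rule: every row read from the source must reach the target
    $Count_ODS_CUSTOMER = $Count_TARGET_CUSTOMER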


Using functions, scripts and variables
Note: Data Services does not support functions that include tables as input or output parameters, except functions
imported from SAP ERP.

The Smart Editor or the Function Wizard is used to add existing functions to an expression and is recommended for defining complex functions.
The Smart Editor offers a number of options such as variables, data types, keyboard shortcuts, and so on.
The Smart Editor has a user-friendly interface that allows you to drag and drop items.
You can use the Smart Editor in:
Query Editor;
Script Editor;
Conditional Editor;
Case Editor;
Function Wizard
Lookup functions allow you to use values from the source table to look up values in other tables to generate the
data that populates the target table.
Lookups enable you to store reusable values in memory to speed up the process.
Lookups are useful for values that rarely change.

All lookup functions provide a specialized type of join, similar to an SQL outer join.
Lookup function
Does not provide additional options for the lookup expression.
lookup_seq function
Searches in matching records to return a field from the record where the sequence column (ex:
effective_date) is closest to but not greater than a specified sequence value (ex: a transaction date).
lookup_ext function (supports multiple and complex join conditions, multiple possible return values, a more flexible return policy, and one-to-many situations)
Allows you to specify an Order by column and a Return policy (Min, Max) to return the record with the highest or lowest value in a given field (for example, a surrogate key).
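A schematic sketch of a lookup_ext call follows. In practice the call is generated with the function wizard; the datastore, table, and column names here are hypothetical, and the exact argument layout should be checked against the wizard output.
    # Hypothetical lookup of a customer name, with preload cache and MAX return policy
    lookup_ext([DS_Target.DBO.CUSTOMER, 'PRE_LOAD_CACHE', 'MAX'],
               [CUST_NAME],                       # column(s) to return
               ['UNKNOWN'],                       # default value(s) if no match
               [CUST_ID, '=', ODS_SALES.CUST_ID]) # join condition(s)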


PRE_LOAD_CACHE is the default value of the cache parameter for the lookup_ext function.
This value determines how Data Services uses the records of the lookup table in the cache, and it is directly related to the performance of the lookup job.




The decode function returns an expression based on the first condition in the specified list of conditions and expressions that evaluates to TRUE. It provides an alternate way to write nested if/then/else functions.
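A minimal decode sketch follows; the column and threshold values are hypothetical:
    # Returns the expression paired with the first condition that evaluates to TRUE;
    # the last argument is the default returned when no condition is TRUE
    decode(SALES.AMOUNT > 10000, 'HIGH',
           SALES.AMOUNT > 1000,  'MEDIUM',
           'LOW')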

A variable is a common component in scripts that acts as a placeholder to represent values that have the potential to change each time a job is executed. Variable names start with $.

There are 2 types of variables:

Global variables: These variables are restricted to the job in which they are created.
You can reference a global variable directly in expressions in any object in that job.
They are set at the job level and are available for assignment or reference in child objects.
Global variables help you simplify your work.

Local variables: These are restricted to the job or work flow in which they are created.
Local variables are not available in referenced objects.

Parameters: a parameter is another type of placeholder that calls a variable.
A parameter can be of type input, output, or input/output (one-way or two-way).
Parameters are most commonly used in WHERE clauses.
Parameters are assigned in the Calls tab of the Variables and Parameters window.

Note: You must use parameters to pass local variables to the work flows and data flows in the object.
Global variables do not require parameters to be passed to work flows and data flows in that job.
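As an illustration, a local variable defined on a work flow can be bound to a data flow parameter in the Calls tab and then referenced inside the data flow; the names below are hypothetical:
    # Hypothetical WHERE clause inside the data flow's Query transform;
    # $P_Region is an input parameter bound to the work flow's local variable $L_Region
    SALES.REGION_ID = $P_Region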



Substitution parameters provide a way to define parameters that have a constant value for one environment but might need to be changed in other environments or situations.
The name of a substitution parameter starts with double dollar signs ($$); an S_ prefix can be used to differentiate your own parameters from the out-of-the-box substitution parameters.
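A minimal sketch of this convention follows; the parameter name and paths are hypothetical, and the per-environment values are maintained in the Designer's substitution parameter configurations:
    # Hypothetical substitution parameter referenced in file formats or expressions:
    #   $$S_SourceFilePath = 'D:/data/dev/in'    (DEV configuration)
    #   $$S_SourceFilePath = 'E:/data/prod/in'   (PROD configuration)
    # Jobs reference $$S_SourceFilePath; only the configured value changes per environment.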



A script is a single-use object that is used to call functions and assign values in a work flow.
Use a script when you want to calculate values that are passed on to other parts of the work flow.
Use scripts to assign values to variables and execute functions.

When can a script be used?
Job initialization (to determine execution paths)




Basic Syntax
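In a Data Services script, each statement ends with a semicolon, variable names start with a dollar sign, string values are enclosed in single quotes, and comments start with #. A minimal sketch follows; the datastore, table, and variable names are hypothetical:
    # Capture the last successful load date into a global variable and log it
    $GV_LastLoad  = sql('DS_Admin', 'SELECT MAX(LOAD_DATE) FROM JOB_STATUS');
    $GV_StartTime = sysdate();
    print('Job started at [$GV_StartTime]; extracting rows changed since [$GV_LastLoad]');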



Using platform transforms
Transforms (optional objects in a data flow) allow you to transform your data as it moves from source to target. The platform transforms include Case, Map Operation, Merge, Query, Row Generation, SQL, and Validation.
Many transforms can call functions.



Functions operate on single values, such as values in specific columns in a data set.
Transforms operate on data sets by creating, updating, and deleting rows of data.




The Map Operation transform enables you to change the operation code on data sets to produce the desired output.
Operation codes describe the status of each row in the data sets that are the inputs to and outputs from objects in data flows. The operation codes indicate how each row in the data set would be applied to a target table if the data set were loaded into a target.


The operation codes are:
Normal
Insert
Delete
Update
Validation Transform
The Validation transform qualifies a data set based on rules for the input schema and moves data into target objects based on whether it passes or fails validation.
It filters out data that fails your criteria (you can have 1 validation rule per column):
If the data fails to meet the criteria, it is sent to the Fail schema.
If the data is correct, you can send it to the Pass schema, or to both.
Optionally, you can substitute a value for failing data and send it to the Pass schema. You can also collect statistics to produce validation reports.
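As an illustration, a validation rule is a Boolean expression bound to an input column; the column name and pattern below are hypothetical:
    # Hypothetical rule on the ZIP column: the value must be exactly 5 digits
    match_pattern(ZIP, '99999') = 1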

Rule violation statistics: Data Services adds two columns to the Fail schema (DI_ERRORACTION and DI_ERRORCOLUMNS).
The DI_ERRORACTION column indicates the path of the failed data:
The letter B indicates rows sent to both the Pass and Fail outputs.
The letter F indicates rows sent only to the Fail output.
The DI_ERRORCOLUMNS column shows error messages for the columns with failed rules.
Merge Transform allows you to combine multiple sources with the same schema into a single target.
Helps you combine multiple incoming data sets with the same schema structure, to generate a single output data
set, with the same schema as the input data sets.
The Merge transform performs a union of the sources; all the sources must have the same schema, including:
number of columns;
column names;
column data types.


The Case transform supports separating data from a source into multiple targets based on branch logic.
Only 1 data flow source is allowed as a data input for the Case transform.
It consolidates the branch or decision-making logic into a single transform.
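As an illustration, each Case transform output label has an associated Boolean expression, and a row is routed to the output(s) whose expression evaluates to TRUE (to the first match only or to every match, depending on the transform's options). The column and labels below are hypothetical:
    # Hypothetical Case expressions, one per output label
    REGION_ID = 1    # routes the row to the Region_East output
    REGION_ID = 2    # routes the row to the Region_West output
    # The transform also offers a default output for rows that match no expression.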


Setting up error handling
When a job fails to complete successfully during execution, some data flows may not have completed. When this
happens, some tables may have been loaded, partially loaded, or altered.
You may have to recover your data without introducing duplicate or missing data.
There are different types of data recovery mechanisms:
Recover entire database helps you restore a crashed data cache to an entire database using standard RDBMS services.
Automatic recovery allows you to recover a partially loaded job.
Table Comparison helps you restore data from partially loaded tables.
Validation transform allows you to retrieve missing values or rows.
Alternative Workflows ensure that all the exceptions are managed in a work flow using conditionals, try and
catch blocks, and scripts.

Alternative workflow with try and catch Blocks
You can automate the recovery of your results by setting up your jobs to use alternative work flows that cover all
the possible exceptions and have recovery mechanisms built in.
The components of an alternative work flow with try and catch blocks are:
A script determines if recovery is required. It reads the value in a status table and populates a global
variable with the same value.
A conditional calls the appropriate work flow based on whether recovery is required. It contains an If,
Then, or Else statement to specify how the work flows should process.
A work flow with a try and catch block executes a data flow without recovery.
A script in the catch object updates the status table, which specifies that recovery is required if any
exceptions are generated.
A work flow executes a data flow with recovery and a script to update the status table.
Conditionals are single-use objects used to implement conditional logic in a work flow.
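A sketch of the first two components follows; the status table, work flow, and variable names are hypothetical:
    # Script before the conditional: read the recovery flag into a global variable
    $GV_Recovery = sql('DS_Admin', 'SELECT RECOVERY_FLAG FROM JOB_STATUS');
    # Conditional (If/Then/Else):
    #   If:    $GV_Recovery = 'Y'
    #   Then:  WF_Load_With_Recovery
    #   Else:  WF_Load_Without_Recovery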

You fix the error and run the job in recovery mode.
During the recovery execution, the first work flow no longer generates the exception. Thus the value of variable $I is
different, and the job selects a different subsequent work flow, producing different results.
Capturing changes in data
Slowly Changing Dimensions (SCDs) are dimensions that have data relationships within them that change over time.
The three types of SCDs are:
SCD Type 1: No history is preserved in this type of dimension, which is the natural consequence of normalization.
SCD Type 2: Unlimited history can be preserved in this kind of dimension; each change adds a new dimension record.
SCD Type 3: Limited history can be preserved by generating new fields in this kind of dimension; additional fields are added to the dimension record rather than additional records.


Changed Data Capture (CDC) is the selective processing of data: identifying and loading only the changed data.
You can process the selective data after an initial full load is complete.
Only the changes made since the last load are processed.
You can use selective processing of data during extraction, which is known as source-based CDC, or during loading, which is known as target-based CDC.
The design considerations include identifying the changes, capturing all the changes, and preserving history.
You can do this by using these 2 methods:
1) Time stamps that you can use in your source data to track the changes in rows since the last extraction of data. Your database table must have an update time stamp to support this type of source-based CDC (see the sketch after this list).
You can use time stamp-based CDC when there is
o a small percentage of changes between extracts;
o no need to capture physical row deletes; and
o no requirement to capture intermediate results of each transaction between extracts.

Time stamp-based CDC is not recommended when you
o have a large percentage of changes between extracts;
o need to capture physical row deletes; and
o want to capture multiple events occurring on the same row between extracts.

2) Change logs, where you use log files from the relational database management system (RDBMS) to check for changes in the data.
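As a sketch of the time stamp method, the extraction data flow typically filters on the update time stamp column; the column and variable names below are hypothetical:
    # Hypothetical WHERE clause in the extraction Query transform;
    # $GV_LastLoad is set by an initialization script from a status table
    CUSTOMER.UPDATE_TS > $GV_LastLoad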
A fact table uses the surrogate key, while the original key remains with the dimension table. The original key is
used to look up historic data.
The overlap period is a period of time in which changes can be lost between 2 extraction runs if the source data is not rigorously isolated during the extraction process. Because source-based CDC depends on a static time stamp to determine the changed data, the overlap period can affect the CDC.
You can handle an overlap situation by 3 techniques:
1) Overlap reconciliation reapplies the changes that could have occurred during the overlap period after the maximum time stamp. It is recommended to use an overlap period that is longer than the maximum possible extract time.
2) Pre-sampling is similar to time stamp-based CDC. However, instead of a last update stamp, the status table has a start stamp (the same as the end time stamp of the previous job) and an end stamp (the most recent time stamp from the source table).
3) You can design an overlap strategy to meet your specific needs.
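A minimal pre-sampling sketch follows; the status table, datastores, and column names are hypothetical:
    # Start stamp = end stamp of the previous run; end stamp = newest source time stamp
    $GV_StartStamp = sql('DS_Admin', 'SELECT END_STAMP FROM CDC_STATUS');
    $GV_EndStamp   = sql('DS_Source', 'SELECT MAX(UPDATE_TS) FROM CUSTOMER');
    # The extraction data flow filters on UPDATE_TS > $GV_StartStamp and UPDATE_TS <= $GV_EndStamp,
    # and a closing script writes $GV_EndStamp back to CDC_STATUS as the new end stamp.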
Target-based CDC compares the source to the target to determine which records have changed







Using text data processing
The Entity Extraction transform is used to extract predefined entities and to perform sentiment analysis.
Text Data Processing analyzes text and derives meaning from it, then converts it into structured data and incorporates it into a database. It supports HTML, TXT, and XML input.
Dictionary is a file that contains a user-defined set of entities, each specifying the standard form, variant forms,
entity type, and so on.

Using data integrator platforms

Data Integrator transforms perform key operations on data sets to manipulate their structure as the data is passed from source to target.
Data transfer
Data conversion / Date Generation
Effective date
Hierarchy flattening
Map CDC Operation
Pivot creates a new row for each value in a column
Reverse Pivot
XML Pipeline
Performance Optimization
Pushing down operations reduces the number of rows and operations that the engine must retrieve and process.
This can be done as long as the sources and targets allow it and they are from the same datastore, or from datastores with a database link between them.
Process slicing splits a data flow into sub data flows that can run as separate processes.
