
TABLE OF CONTENTS:

Intro: Why Offload Is the Best Way to Get Started with Hadoop
A Framework for EDW Offload
From SQL to Hadoop ETL in 5 Steps
Case Study: Leading Healthcare Organization Offloads EDW to Hadoop
Conclusion

INTRO: WHY OFFLOAD IS THE BEST WAY TO GET STARTED WITH HADOOP
Wouldn't it be great to have a single, consistent version of the truth for all your corporate data? Nearly two decades ago, that was the vision of the Enterprise Data Warehouse (EDW), enabled by ETL tools that Extract data from multiple sources, Transform it with operations such as sorting, aggregating, and joining, and then Load it into a central repository. But early success resulted in greater demands for information; users became increasingly dependent on data for better business decisions. Unable to keep up with this insatiable hunger for information, data integration tools compelled organizations to push transformations down into the data warehouse, in many cases resorting back to hand coding, and ELT emerged.
Today, 70% of all data warehouses are performance and capacity constrained, according to Gartner. Many organizations are spending millions of dollars a year on database capacity just to process ELT workloads. Faced with growing costs and unmet requirements, many are looking for an alternative and are increasingly considering offloading data and transformations to Hadoop for the cost savings alone. Multiple sources report that managing data in Hadoop can cost from $500 to $2,000 per terabyte, compared to $20,000 to $100,000 per terabyte for high-end data warehouses.
Hadoop can become this massively scalable and cost-effective staging area, or Enterprise Data Hub, for all corporate data. By offloading ELT workloads into Hadoop you can:

• Keep data for as long as you want at a significantly lower cost
• Free up premium database capacity
• Defer additional data warehouse expenditures
• Significantly reduce batch windows so your users get access to fresher data
• Provide business users direct access to data stored in Hadoop for data exploration and discovery
However, Hadoop is not a complete ETL solution. While Hadoop offers powerful utilities and virtually unlimited horizontal scalability, it does not provide the complete set of functionality users need for enterprise ETL. In most cases, these gaps must be filled through complex manual coding and advanced programming skills in Java, Hive, Pig and other Hadoop technologies that are expensive and difficult to find, slowing Hadoop adoption and frustrating organizations eager to deliver results.

That's why Syncsort has developed DMX-h, Hadoop ETL software that combines the benefits of enterprise-caliber, high-performance ETL with Hadoop, enabling you to optimize your data warehouse while gaining all the benefits of a complete ETL solution. Hand-in-hand with DMX-h, Syncsort has identified a set of best practices to accelerate your data offload efforts into Hadoop. This three-phased approach begins with identifying the data and transformations to target first, then offloading the data and workloads into Hadoop with a graphical tool and no coding, and finally ensuring your Hadoop ETL environment can meet business requirements with enterprise-class performance optimization and security. Let's explore this approach further.

A THREE-PHASE APPROACH: THE SYNCSORT OFFLOAD FRAMEWORK

• Analyze, understand & document SQL jobs
• Identify SQL ELT workloads suitable for offload
• Use a single point-&-click interface to extract & load virtually any data into HDFS; replicate existing ELT workloads in Hadoop; develop in Windows & deploy in Hadoop
• Deploy as part of your Hadoop cluster
• No manual coding required
• Easily monitor via a Web console
• No code generation
• Close integration with Cloudera Manager & Hadoop JobTracker
• Deliver faster throughput per node
• Fully support Kerberos & LDAP

PHASE I: Identify the data and transformations that will bring you the highest savings with minimum effort and risk. In most cases, 20% of data transformations consume up to 80% of resources. Analyzing, understanding and documenting SQL workloads is an essential first step. Syncsort's unique utility, SILQ, helps you do this. SILQ takes a SQL script as an input and then provides a detailed flow chart of the entire data flow. Using an intuitive web-based interface, users can easily drill down to get detailed information about each step within the data flow, including tables and data transformations. SILQ even offers hints and best practices to develop equivalent transformations using Syncsort DMX-h, a unique solution for Hadoop ETL that eliminates the need for custom code, delivers smarter connectivity to all your data, and improves Hadoop's processing efficiency. One of the biggest barriers to offloading from the data warehouse into Hadoop has been a legacy of thousands of scripts built and extended over time. Understanding and documenting massive amounts of SQL code and then mastering the advanced programming skills to offload these transformations has left many organizations reluctant to move. SILQ removes this roadblock, eliminating the complexity and risk.
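To make the Phase I analysis concrete, the sketch below shows the kind of table-dependency scan that a utility like SILQ automates at far greater depth. This is not SILQ itself or its output; the class name, regular expressions, and file handling are purely illustrative.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: extracts source (FROM/JOIN) and target (INSERT INTO)
// table names from an ELT SQL script to sketch a crude data-flow view.
// A real utility such as SILQ performs far deeper parsing and produces a
// fully documented flow chart.
public class SqlFlowSketch {
    private static final Pattern SOURCES =
            Pattern.compile("(?i)\\b(?:FROM|JOIN)\\s+([\\w.]+)");
    private static final Pattern TARGETS =
            Pattern.compile("(?i)\\bINSERT\\s+INTO\\s+([\\w.]+)");

    public static void main(String[] args) throws Exception {
        String sql = new String(Files.readAllBytes(Paths.get(args[0])));
        System.out.println("Reads from : " + collect(SOURCES, sql));
        System.out.println("Writes to  : " + collect(TARGETS, sql));
    }

    private static Set<String> collect(Pattern p, String sql) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = p.matcher(sql);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```

Even this crude pass shows why automated analysis matters: real ELT scripts run to thousands of lines, and a generated flow chart beats tracing dependencies by hand.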
PHASE II: Offload expensive ETL workloads and the associated data to Hadoop quickly and securely with a single tool, using current skills within your organization. You need to be able to easily replicate existing workloads without intensive manual coding projects, even bringing mainframe data into Hadoop, which offers no native support for mainframes. The next section of this guide will focus exclusively on this phase.
PHASE III: Optimize & Secure the new environment. Once the transformations are complete, you then need to make sure you have the tools and processes in place to manage, secure and operationalize your Enterprise Data Hub for ongoing success. The organization expects the same level of functionality and services provided before, only faster and less costly now that the transformations are in Hadoop. You need to leverage business-class tools to optimize the performance of your Enterprise Data Hub. Syncsort DMX-h is fully integrated with Hadoop, running on every node of your cluster. This means faster throughput per node without code generation. Syncsort integrates with tools in the Hadoop ecosystem such as Cloudera Manager, Ambari, and the Hadoop JobTracker, allowing you to easily deploy and manage enterprise deployments from just a few nodes to several hundred nodes. A zero-footprint, web-based monitoring console allows users to monitor and manage data flows through a web browser and even mobile devices such as smart phones and tablets. You can also secure your Hadoop cluster using common security standards such as Kerberos and LDAP. And to simplify management and reusability in order to meet service level agreements, built-in metadata capabilities are available as part of Hadoop ETL. This guide focuses on Phase II.
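As a concrete example of the Phase III security requirement, the sketch below shows how a hand-written Hadoop client would authenticate against a Kerberos-secured cluster. DMX-h handles this integration for you; the principal name, keytab path, and configuration values here are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

// Minimal sketch of Kerberos authentication for a Hadoop client.
// The principal and keytab location are placeholders; in practice they
// come from your security administrator and cluster configuration.
public class KerberosClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "etl-user@EXAMPLE.COM",              // placeholder principal
                "/etc/security/keytabs/etl.keytab"); // placeholder keytab

        // Any HDFS access after login runs as the authenticated user.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Home dir: " + fs.getHomeDirectory());
        fs.close();
    }
}
```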

OVERCOMING SQL CHALLENGES WITH SILQ
SQL remains one of the primary approaches for data integration. Thus, data warehouse offload projects often start by analyzing and understanding ELT SQL scripts. In most cases, though, SQL can easily grow to hundreds or even thousands of lines developed by several people over the years, making it almost impossible to maintain and understand.

SILQ is the only SQL Offload utility specifically designed to overcome these challenges, helping your data warehouse offload initiative go smoothly. The following figure shows a snapshot of the SQL used for the purposes of this example, along with a fragment of the fully documented flow chart generated by SILQ.

A CLOSER LOOK AT OFFLOADING ETL WORKLOADS INTO HADOOP
Phase II, the task of re-writing heavy ELT workloads to run in Hadoop, is typically associated with the need for highly skilled programmers in Java, Hive and other Hadoop technologies. This requirement is even higher when the data and processing involve complex data structures like mainframe files and complex data warehouse SQL processes that need to be converted and run on Hadoop. But this doesn't have to be the case.

Syncsort DMX-h provides a simpler, all-graphical approach to shift ELT workloads and the associated data into Hadoop before loading the processed data back into the EDW for agile data discovery and visual analytics. This is done using native connectors to virtually any data source, including most relational databases, appliances, social data and even mainframes, and a graphical user interface to develop complex processing, workflows, scheduling and monitoring for Hadoop.
Graphical Offload Using DMX-h
The following flow demonstrates the simplicity of the approach. This end-to-end flow is implemented in a single job comprised of multiple steps that can include sub-jobs and tasks. Data is offloaded from the EDW (in this case Teradata) and the mainframe, and loaded to the Hadoop Distributed File System (HDFS) using native connectors. The loaded data is then transformed on Hadoop using DMX-h, followed by a load back to Teradata/EDW from HDFS.

The DMX-h Graphical User Interface (GUI) has 3 basic components:

1. The Job Editor is used to build a DMX-h job or workflow of sub-jobs or tasks. The job defines the execution dependencies (the order in which tasks will run) and the data flow for a set of tasks. The tasks may be DMX-h tasks or custom tasks, which allow you to integrate external scripts or programs into your DMX-h application.

2. The Task Editor is used to build the tasks that comprise a DMX-h job. Tasks are the simplest unit of work. Each task reads data from one or more sources, processes that data and outputs it to one or more targets.

3. The Server dialog is used to schedule and monitor Hadoop jobs.
Below is the equivalent job flow in the DMX-h Job Editor that represents the flow depicted in Figure 1. As you can see, the job contains 3 sub-jobs:

1. Load_DataWarehouse_Mainframe_To_HDFS
2. MapReduce_Join_Aggregate_EDW_and_Mainframe
3. Load_Aggregate_Data_to_DataWarehouse

Let's take a detailed look at the steps involved in each one of these sub-jobs.

STEP 1: EXTRACTING SOURCE DATA FROM THE MAINFRAME & THE EDW
The first sub-job, Load_DataWarehouse_Mainframe_To_HDFS, consists of 2 tasks: Extract_Active_EDW_data and ConvertLocalMainframeFileHDFS. The EDW to Hadoop extraction task and its functionality is described in Steps 1.1, 1.2 and 1.3. Although excluded from this guide, the second, mainframe to Hadoop, task follows a similar graphical approach, using mainframe COBOL copybooks to read and translate complex mainframe data structures from EBCDIC to ASCII delimited files on Hadoop. Syncsort DMX-h also includes a library of Use Case Accelerators for common data flows, which makes it easier to learn how to create your own jobs.
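To give a feel for what the mainframe translation involves, here is a minimal Java illustration of the EBCDIC-to-ASCII character conversion, assuming the JDK's extended IBM037 charset is available. Real mainframe records also carry packed-decimal and binary fields that DMX-h converts via COBOL copybooks; this sketch deliberately ignores them.

```java
import java.nio.charset.Charset;

// Rough illustration of EBCDIC-to-ASCII translation for a fixed-width
// mainframe text field. Real records also contain COMP-3 (packed decimal)
// and binary fields, which need copybook-driven conversion that this
// sketch does not attempt.
public class EbcdicSketch {
    public static void main(String[] args) {
        // "HELLO" encoded in EBCDIC code page 037 (IBM037).
        byte[] ebcdic = { (byte) 0xC8, (byte) 0xC5, (byte) 0xD3,
                          (byte) 0xD3, (byte) 0xD6 };
        String text = new String(ebcdic, Charset.forName("IBM037"));
        System.out.println(text); // prints HELLO
    }
}
```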


1.1: EXTRACTING DATA FROM THE ENTERPRISE DATA WAREHOUSE
Using native connectors, you can easily create a DMX-h task to extract data from Teradata or any other major database in parallel and load it into HDFS without writing any code.
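For comparison, a hand-coded equivalent of this extract might look like the JDBC-to-HDFS sketch below. The connection URL, credentials, table, and HDFS path are placeholders, it assumes the Teradata JDBC driver is on the classpath, and it reads row by row rather than in parallel, which is exactly the kind of work the native connector removes.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hand-coded equivalent of a simple EDW-to-HDFS extract: read rows over
// JDBC and write a pipe-delimited file into HDFS. Connection details,
// table and column names, and the HDFS path are placeholders.
public class EdwToHdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (Connection db = DriverManager.getConnection(
                     "jdbc:teradata://edw-host/DATABASE=sales", "user", "password");
             Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT order_id, customer_id, amount FROM orders");
             FileSystem fs = FileSystem.get(conf);
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     fs.create(new Path("/staging/edw/orders.txt")),
                     StandardCharsets.UTF_8))) {
            while (rs.next()) {
                out.write(rs.getString(1) + "|" + rs.getString(2) + "|" + rs.getString(3));
                out.newLine();
            }
        }
    }
}
```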


1.2: SOURCE DATABASE TABLE DIALOG
Similarly, you can specify the columns you need from the source data warehouse table.


1.3: REFORMAT TARGET LAYOUT DIALOG
Once the columns are specified, another dialog window allows you to specify the mapping between the extracted source EDW columns on the left and the delimited HDFS target file on the right.


STEP 2: JOINING AND SORTING THE SOURCE DATASETS USING A MAPREDUCE ETL JOB
The second sub-job in our example is the MapReduce sub-job, MapReduce_Join_Aggregate_EDW_and_Mainframe. DMX-h provides an extremely simple way of defining the mapper and reducer ETL jobs graphically without having to write a single line of code. Everything to the left of the Map Reduce link is the Map logic, and everything to the right is the Reduce logic.

DMX-h provides the ability to string multiple Map and Reduce tasks together without having to write any intermediate data between them. Furthermore, none of the Map and Reduce logic requires any coding or code generation. All of the tasks are executed natively by the pre-compiled DMX-h ETL engine on every Hadoop node as part of the mappers and reducers. In this case, the Map task filters and sorts the two files loaded from the EDW and the mainframe, followed by the Reduce task, which joins and aggregates the data and writes the result back to HDFS.
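For readers who want to see what DMX-h is saving them from writing, here is a stripped-down, hand-coded sketch of a reduce-side join with aggregation in the classic MapReduce API. The field positions, tagging logic, and paths are placeholders; a production job would also need secondary sorting, error handling, and tuning.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hand-coded reduce-side join with aggregation: the map side tags each
// record with its origin (EDW or mainframe) keyed on a shared id; the
// reduce side joins the two sides and sums an amount field.
public class JoinAggregateSketch {

    public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split("\\|");
            // Crude origin tagging based on the input split's path string;
            // f[0] is the join key, the full record is passed through.
            String tag = ctx.getInputSplit().toString().contains("mainframe") ? "M" : "E";
            ctx.write(new Text(f[0]), new Text(tag + "|" + value));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            boolean seenEdw = false;
            double total = 0.0;
            for (Text v : values) {
                String[] f = v.toString().split("\\|");
                if (f[0].equals("E")) {
                    seenEdw = true;                    // key exists in the EDW extract
                } else {
                    total += Double.parseDouble(f[3]); // placeholder amount position
                }
            }
            if (seenEdw) {
                ctx.write(key, new DoubleWritable(total)); // inner-join semantics
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "join-aggregate-sketch");
        job.setJarByClass(JoinAggregateSketch.class);
        job.setMapperClass(TagMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path("/staging/edw/orders.txt"));
        FileInputFormat.addInputPath(job, new Path("/staging/mainframe/orders.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/staging/joined"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```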


The No Coding Advantage
A Fortune 100 organization tried to create a Change Data Capture (CDC) job using HiveQL. It took 9 weeks to complete and resulted in over 47 scripts and numerous user-defined functions written in Java to overcome the limitations of HiveQL. The code wasn't scalable and would be costly to maintain with teams of skilled developers. Other code-heavy tools introduced more complexity, cost and performance issues. Syncsort DMX-h:

• Cut development time by 2/3
• Required only 4 DMX-h graphical jobs
• Eliminated the need for Java user-defined functions
• Delivered a 24x performance improvement
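To see why the HiveQL approach above ballooned, consider the comparison at the heart of any CDC job, shown here as a plain-Java sketch over two in-memory snapshots. The record layout is invented for illustration; at warehouse scale this same diff has to be expressed as a distributed join, which is where the scripts and user-defined functions multiply.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the comparison at the heart of Change Data Capture:
// given yesterday's and today's snapshots keyed by primary key, emit
// inserts, updates, and deletes. Record layouts here are placeholders.
public class CdcSketch {
    public static void diff(Map<String, String> previous, Map<String, String> current) {
        for (Map.Entry<String, String> e : current.entrySet()) {
            String old = previous.get(e.getKey());
            if (old == null) {
                System.out.println("INSERT " + e.getKey());
            } else if (!old.equals(e.getValue())) {
                System.out.println("UPDATE " + e.getKey());
            }
        }
        for (String key : previous.keySet()) {
            if (!current.containsKey(key)) {
                System.out.println("DELETE " + key);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> prev = new LinkedHashMap<>();
        Map<String, String> curr = new LinkedHashMap<>();
        prev.put("1001", "ACME|100.00");
        prev.put("1002", "GLOBEX|250.00");
        curr.put("1001", "ACME|120.00");   // changed  -> UPDATE
        curr.put("1003", "INITECH|75.00"); // new      -> INSERT
        diff(prev, curr);                  // 1002 gone -> DELETE
    }
}
```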
2.1: LEVERAGING METADATA FOR LINEAGE & IMPACT ANALYSIS FOR MAPREDUCE JOBS
The DMX-h GUI makes it easy for you to graphically link metadata across MapReduce jobs to perform metadata lineage and impact analysis across mappers and reducers. The blue arrows below track a certain column's lineage across multiple steps in the MapReduce job.


STEP 3: LOADING THE FINAL DATASET INTO THE DATA WAREHOUSE
The third and final sub-job in our example, Load_Aggregate_Data_to_DataWarehouse, takes the output of the Reduce tasks from HDFS and loads it back into the EDW using native Teradata connectors (TTU or TPT).
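A hand-coded alternative for a small result set could use JDBC batch inserts, as in the sketch below; the connection details, paths, and target table are placeholders, and large volumes would go through the TTU/TPT bulk utilities that DMX-h invokes natively.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hand-coded load of a small HDFS result set back into the EDW via JDBC
// batch inserts. Connection details, paths, and the target table are
// placeholders; large loads would use bulk utilities instead.
public class HdfsToEdwSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/staging/joined/part-r-00000")),
                     StandardCharsets.UTF_8));
             Connection db = DriverManager.getConnection(
                     "jdbc:teradata://edw-host/DATABASE=sales", "user", "password");
             PreparedStatement ps = db.prepareStatement(
                     "INSERT INTO order_totals (order_id, total_amount) VALUES (?, ?)")) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t"); // MapReduce text output is tab-delimited
                ps.setString(1, f[0]);
                ps.setDouble(2, Double.parseDouble(f[1]));
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```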

From Data Blending to Data Discovery
With Syncsort DMX-h it is easy to set up high-performance source and target access for all major databases, including highly optimized, native connectivity for Teradata, Vertica, EMC Greenplum, Oracle and more. Alternatively, you can also land the data back into HDFS or even create a Tableau data extract file for visual data discovery and analysis.


3.1: TARGET DATABASE TABLE DIALOG
You can graphically map the delimited HDFS data columns on the left to the EDW table columns on the right. Syncsort DMX-h supports graphical controls for inserting, truncating/inserting, and updating tables, as well as setting commit intervals for the EDW. You can also create new tables using the Create new button.

EXECUTING, SCHEDULING AND MONITORING GRAPHICALLY
Using the GUI, you can execute and schedule jobs graphically on a Hadoop cluster, on standalone UNIX, Linux and Windows servers, as well as on your workstation or laptop. You can also monitor the jobs and view logs through the Server dialog. E-mail notifications based on the status of the job are also available, including e-mailing copies of job logs. Steps 4 and 5 demonstrate how to do this.


STEP 4: EXECUTING AND SCHEDULING YOUR HADOOP ETL JOBS
Syncsort DMX-h allows you to easily develop and test your Hadoop ETL jobs graphically in Windows and then deploy them in Hadoop. The Run Job dialog allows users to specify the run-time parameters and execute jobs immediately or on a given schedule.
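Behind a run dialog like this, submitting and tracking a MapReduce job programmatically looks roughly like the sketch below; the paths are placeholders and no mapper or reducer is configured, so Hadoop's identity classes would run. The GUI and its scheduler take the place of this kind of driver code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of what graphical execution automates: submit a MapReduce job
// asynchronously and poll its progress until it completes.
public class SubmitAndPollSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "nightly-offload");
        // Placeholder paths; mapper/reducer setup omitted for brevity.
        FileInputFormat.addInputPath(job, new Path("/staging/edw/orders.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/staging/out"));

        job.submit(); // returns immediately
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
    }
}
```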


STEP 5: MONITORING YOUR HADOOP ETL JOBS
Comprehensive logging capabilities as well as integration with Cloudera Manager and the Hadoop JobTracker make it easy to monitor and track your DMX-h jobs. This last step is a very important one, as visibility into ETL workloads is critical. These tools provide the same level of enterprise-grade functionality the organization has come to expect before Hadoop, making it easy to identify and quickly correct errors, enhance productivity, and optimize the performance of your Hadoop environment.

5.1: DMX-H MONITORING CAPABILITIES
You can monitor the Server dialog for a comprehensive, real-time list of all the DMX-h jobs, including those running, completed (success, exceptions, terminated) and scheduled.


5.2: DMX-H JOB LOGS
Using the Job Log, you can track the successful completion of DMX-h MapReduce jobs. The same log and dialog can also include non-Hadoop job logs.


5.3: HADOOP JOBTRACKER
Since the DMX-h engine is executed natively as part of every mapper and reducer, you can monitor DMX-h statistics through the stderr logs of the map and reduce tasks in the Hadoop JobTracker. This is an example from one of the reduce task logs that invoked DMX-h for performing a Join between 5,023 and 4,858 records.
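The same statistics surfaced in the JobTracker can also be read programmatically once a job completes. The sketch below, which assumes a Job object like the one in the earlier driver sketch, simply prints every built-in counter group (records read and written by mappers and reducers, bytes moved, and so on).

```java
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Job;

// Sketch: after a job finishes, its built-in counters can be read
// programmatically as well as viewed in the JobTracker UI.
public class PrintCountersSketch {
    public static void printCounters(Job job) throws Exception {
        for (CounterGroup group : job.getCounters()) {
            System.out.println(group.getDisplayName());
            for (Counter counter : group) {
                System.out.printf("  %-40s %,d%n",
                        counter.getDisplayName(), counter.getValue());
            }
        }
    }
}
```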

See For Yourself with a Free Test Drive & Pre-Built Templates!

Download a free trial at syncsort.com/try and follow these steps with your own SQL scripts.

DMX-h Use Case Accelerators are a set of pre-built graphical template jobs that help you get started with loading data to Hadoop and processing data on Hadoop. You can find them at: http://www.syncsort.com/TestDrive/Resources


Case Study:
LEADING HEALTHCARE ORGANIZATION OFFLOADS EDW TO HADOOP
A leading healthcare organization continuously experiences exponential growth in data volumes and has invested millions of dollars in creating its data environment to support a team of skilled professionals who use this real-world data to drive safety, health outcomes, and late-phase and comparative effectiveness research.

Faced with a cost-cutting initiative, the organization needed to reduce its hardware and software spend and decided to explore moving its ETL/ELT workloads from its EDW to Hadoop. The healthcare organization found that Hadoop offers a cost-effective and scalable data processing environment; on average, the cost to store and process data in Hadoop would be 1% of the cost to process and store the same data in its EDW. But while Hadoop is a key enabling technology for large-scale, cost-effective data processing for ETL workloads, the native Hadoop tools for building and migrating applications (Hive, Pig, Java) require custom coding and lack enterprise features and enterprise support. To fully address its cost-cutting imperative, the organization needed tools that would allow it to leverage existing staff skilled in ETL without requiring significant additional staff with new skills (MapReduce), which are scarce and expensive.

[Figure: $1.4M projected TCO savings over 3 years — $1.8M TCO for ELT on the Enterprise Data Warehouse vs. $390K TCO for ELT on Hadoop]

The healthcare organization turned to Syncsort and found that its existing ETL developers could be productive in Hadoop by leveraging Syncsort DMX-h. The easy-to-use GUI allows existing staff to create data flows with a point-and-click approach and avoid the complexities of MapReduce and manual coding.

By offloading its EDW to Hadoop with Syncsort, the healthcare organization realized the following benefits:

• Projected TCO savings over 3 years of $1.4M
• Eliminated an immediate $300K EDW expense
• Activated its Hadoop initiative with a modern, secure and scalable enterprise-grade solution
• Enabled Big Data for next-generation analytics
• Fast-tracked its EDW offload to Hadoop with no need for specialized skills, manual coding or tuning
• Achieved comparable high-end performance at a tremendously lower cost

CONCLUSION

For years, many organizations have struggled with the cost and processing limitations of using their EDW for data integration. Once considered best practices, staging areas have become the dirty secret of every data warehouse environment, one that consumes the lion's share of time, money and effort. That's why many Hadoop implementations start with ETL initiatives. But Hadoop also presents its own challenges, and without the right tools, offloading ELT workloads can be a lengthy and expensive process, even with relatively inexpensive hardware.


Syncsort DMX-h addresses these challenges with an approach that doesn't require you to write, tune or maintain any code, but instead allows you to leverage your existing ETL skills. Even if you are not familiar with ETL, this graphical approach allows you to offload ELT workloads fast, in 5 easy steps. While we used a specific example in this guide for illustrative purposes, the steps can be applied to any ELT offload project as follows:

STEP 1: Extract data from the original sources, commonly including a mix of relational, multi-structured, social media, and mainframe sources.

STEP 2: Re-design data transformations previously developed in SQL to run on Hadoop using a point-and-click, graphical user interface.

STEP 3: Load the final dataset into the desired target repository, usually the data warehouse or Hadoop itself.

STEP 4: Execute and schedule your Hadoop ETL jobs on a Hadoop cluster. Simply specify the run-time parameters and execute jobs immediately or on a given schedule.

STEP 5: Manage and monitor your Hadoop ETL jobs in real time with comprehensive logging capabilities and integration with Cloudera Manager and the Hadoop JobTracker.

The 5-step process shows how you can use Syncsort DMX-h for a simpler, all-graphical approach to shift ELT workloads and the associated data into Hadoop before loading the processed data back into the EDW for agile data discovery and visual analytics. The DMX-h GUI and engine support important Hadoop authentication protocols, including Kerberos, and integrate natively with them thanks to a simple architecture and tight integration with Hadoop. DMX-h Use Case Accelerators, a library of reusable templates for some of the most common data flows, dramatically improve productivity when deploying Hadoop jobs. Finally, Syncsort's Hadoop ETL engine runs natively on each Hadoop node, providing much more scalability and efficiency per node versus custom Java, Hive and Pig code, whether written manually by developers or generated automatically by other tools.

With Syncsort DMX-h you gain a graphical approach to offload ELT workloads into Hadoop with no coding, and a practical way to optimize and free up one of the most valuable investments in your IT infrastructure: the data warehouse. Moreover, offloading ELT workloads into Hadoop will put you on the fast track to a modern data architecture that delivers a single, consistent version of the truth for all of your corporate data.


ABOUT US

Syncsort provides fast, secure, enterprise-grade software spanning Big Data solutions in Hadoop to Big
Iron on mainframes. We help customers around the world to collect, process and distribute more data in
less time, with fewer resources and lower costs. 87 of the Fortune 100 companies are Syncsort customers,
and Syncsort's products are used in more than 85 countries to offload expensive and inefficient legacy
data workloads, speed data warehouse and mainframe processing, and optimize cloud data integration.
Experience Syncsort at www.syncsort.com

Learn More!
GUIDE:
5 Steps to Offloading Your Data
Warehouse with Hadoop >
SOLUTIONS:
Explore More Hadoop Solutions >
RESOURCES:
View More Hadoop Guides, eBooks,
Webcasts, Videos, & More >

© 2014 Syncsort Incorporated. All rights reserved. Company and product names used herein
may be the trademarks of their respective companies. DMXh-EB-001-0614US
