
Deutsche Telekom Perspective on

HADOOP and Big Data Technologies


Gregory Smith
VP Solution Design and Emerging Technologies and Architectures
T-Systems North America
Gregory.Smith@t-systems.com
Deutsche Telekom and T-Systems Key Stats

Deutsche Telekom is Europe's largest telecom service provider


Revenue: $75 billion
Employees: 232,342

T-Systems is the enterprise division of Deutsche Telekom


Revenue: $13 billion
Employees: 52,742
Services: data center, end user computing, networking, systems
integration, cloud and big data

Overwhelmed by new data types?
Sensor- / machine-based data
Sentiment data
Clickstream data
Call detail records (CDRs)

Big Data: Transactions, Interactions, Observations

80% of new data in 2015 will land on Hadoop!

Hadoop is like a data warehouse, but it can store more data, more kinds of data, and perform more flexible analyses
Hadoop is open source and runs on industry standard hardware, so it's 1-2 orders of magnitude more economical than conventional data warehouse solutions
Hadoop provides more cost effective storage, processing, and analysis. Some existing workloads run faster, cheaper, better
Hadoop can deliver a foundation for profitable growth: Gain value from all your data by asking bigger questions

Reference architecture view of Hadoop

Legend: Hadoop Core, Hadoop Projects, Adjacent Categories

Data Presentation: Data Visualization and Reporting Clients
Application: Analytics Apps, Transactional Apps, Analytics, Middleware
Data Integration: Real Time Ingestion, Batch Ingestion, Data Connectors
Data Processing: Batch Processing, Real Time/Stream Processing, Search and Indexing
Data Management: Metadata Services, Distributed Storage (HDFS), Distributed Processing (MapReduce), Non-relational DB, Structured In Memory
Infrastructure: Virtualization, Compute / Storage / Network
Operations: Workflow and Scheduling, Management and Monitoring
Security: Data Isolation, Access Management, Data Encryption

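The Data Management layer pairs distributed storage (HDFS) with distributed processing (MapReduce). As a rough illustration of the MapReduce programming model only, the sketch below runs the map, shuffle and reduce steps in a single Python process. It does not use any Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce over an in-memory list of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big questions"]))
```

On a real cluster the map and reduce steps would run in parallel on the nodes that hold the data; here everything runs locally to keep the example self-contained.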
Example application landscape
Real Time Streams (Social, sensors)
Real-Time Processing (s4, storm, spark)
Machine Learning (Mahout, etc)
Data Visualization (Excel, Tableau)
ETL (Informatica, Talend, Spring Integration)
Real Time Database (Shark, Gemfire, hBase, Cassandra)
Interactive Analytics (Impala, Greenplum, AsterData, Netezza)
HIVE
Batch Processing (Map-Reduce)
Structured and Unstructured Data (HDFS, MAPR)
Cloud Infrastructure: Compute, Storage, Networking

Source: VMware

Disruptive innovations in Big Data

Technologies compared, left to right: Traditional Database, Data Warehouse, MPP Analytics, NoSQL Database, HADOOP

Data types: Structured (traditional end); Any, including unstructured (Hadoop end)
Schema: Pre-defined, fixed; required on write (traditional end). Required on read; store first, ask questions later (Hadoop end)
Processing: No or limited data processing (traditional end); Processing coupled with data; parallel processing / scale out (Hadoop end)
Physical infrastructure: Enterprise grade; mission critical (traditional end); Commodity is an option; much cheaper storage (Hadoop end)

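"Store first, ask questions later" can be illustrated with a small schema-on-read sketch. The raw events, field names and helper functions below are hypothetical and use only the Python standard library. Records are kept exactly as they arrived, and a schema (which fields, which types) is applied only when a question is asked.

```python
import json

# Hypothetical raw events, stored as-is with no schema enforced at write time.
RAW_EVENTS = [
    '{"type": "cdr", "caller": "A", "seconds": 120}',
    '{"type": "click", "url": "/home"}',
    '{"type": "cdr", "caller": "B", "seconds": "45"}',
]


def read_cdrs(raw_lines):
    """Apply a CDR schema at read time: select the needed fields and coerce types."""
    for line in raw_lines:
        event = json.loads(line)
        if event.get("type") != "cdr":
            continue  # other record types stay stored for future questions
        yield {"caller": str(event["caller"]), "seconds": int(event["seconds"])}


def total_call_seconds(raw_lines):
    """One possible question asked later: total call duration across all CDRs."""
    return sum(cdr["seconds"] for cdr in read_cdrs(raw_lines))


if __name__ == "__main__":
    print(total_call_seconds(RAW_EVENTS))
```

With schema on write, the third record (seconds stored as text) and the clickstream record would have to be cleaned or rejected before loading. Here they are stored untouched and interpreted at query time.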
Innovations: Hadoop is 100x cheaper per TB than in-memory appliances like HANA and handles unstructured data as well

Legacy BI
Business problem: Backward-looking analysis, using data out of business applications
Technology solution (selected vendors): SAP Business Objects, IBM Cognos, MicroStrategy
Data type/scalability: Structured; limited (2-3 TB in RAM)

High Performance BI
Business problem: Quasi-real-time analysis, using data out of business applications
Technology solution (selected vendors): Oracle Exadata, SAP HANA
Data type/scalability: Structured; limited (2-8 TB in RAM)

Hadoop Ecosystem
Business problem: Forward-looking predictive analysis; questions defined in the moment, using data from many sources
Technology solution (selected vendors): Hadoop distributions; no ACID transactions; limited SQL set (joins)
Data type/scalability: Structured or unstructured; unlimited (20-30 PB)
True big data
Legacy vendor definition of big data
Innovations:
Store first, ask questions later
Much cheaper storage
but not just storage
Illustrative acquisition cost

SAN Storage: 3-5/GB. Based on HDS SAN Storage
NAS Filers: 1-3/GB. Based on NetApp FAS-Series
Enterprise Class Hadoop Storage: ???/GB. Based on NetApp E-Series (NOSH)
White Box DAS 1): 0.50-1.00/GB. Hardware can be self-assembled
Data Cloud 1): 0.10-0.30/GB. Based on large scale object storage interfaces

1) Hadoop offers Storage + Compute (incl. search). Data Cloud offers Amazon S3 and native storage functions
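
To make the per-GB figures easier to compare, the sketch below scales them to a given capacity. It is plain arithmetic on the ranges shown on the slide. The currency is left unspecified because the slide text does not show one, 1 TB is assumed to be 1000 GB, and the names COST_PER_GB and cost_range are invented for this example. The Enterprise Class Hadoop Storage tier is omitted because its price is given as ???/GB.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes

# Per-GB acquisition cost ranges (low, high) as shown on the slide.
COST_PER_GB = {
    "SAN Storage": (3.00, 5.00),
    "NAS Filers": (1.00, 3.00),
    "White Box DAS": (0.50, 1.00),
    "Data Cloud": (0.10, 0.30),
}


def cost_range(tier, terabytes):
    """Return the (low, high) acquisition cost for the given capacity in TB."""
    low, high = COST_PER_GB[tier]
    gigabytes = terabytes * GB_PER_TB
    return low * gigabytes, high * gigabytes


if __name__ == "__main__":
    for tier in COST_PER_GB:
        low, high = cost_range(tier, 100)
        print(f"{tier}: {low:,.0f} to {high:,.0f} for 100 TB")
```

For 100 TB, the SAN range works out to 300,000 to 500,000 and the white box DAS range to 50,000 to 100,000, in whatever currency the slide intended.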
Target use cases
Axes: potential value (lower to higher) against time to value (shorter to longer)

IT Infrastructure & Operations: Capacity Planning & Utilization, Lower Cost Storage, Enterprise Data Lake
Business Intelligence & Data Warehousing: Enterprise Data Warehouse Offload, Enterprise Data Warehouse Archive, ETL Offload
Line of Business & CXO, Business Analysts: Customer Profiling & Revenue Analytics, Targeted Advertising Analytics, Service Renewal Implementation, CDR based Data Analytics, Fraud Management, New Business Models

From cost effective storage, processing, and analysis to a foundation for profitable growth

Enterprise data warehouse offload use case
The Challenge
Many EDWs are at capacity
Running out of budget before running out of relevant data
Older data archived in the dark, not available for exploration

The Solution
Hadoop for data storage and processing: parse, cleanse, apply structure and transform
Free EDW for valuable queries
Retain all data for analysis!

Before: DATA WAREHOUSE with Operational (44%), ETL Processing (42%), Analytics (11%)
After: DATA WAREHOUSE with Operational (50%), Analytics (50%); HADOOP for Storage & Processing, where cost is 1/10th

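The workload percentages on this slide can be turned into two quick figures: how much warehouse capacity ETL offload releases, and how much the analytics share grows afterwards. The sketch below uses only the numbers on the slide. The names BEFORE, AFTER, freed_share and growth_factor are invented for this example, and the result is an illustration, not a sizing method.

```python
# Workload shares read off the slide, in percent of EDW capacity.
BEFORE = {"Operational": 44, "ETL Processing": 42, "Analytics": 11}
AFTER = {"Operational": 50, "Analytics": 50}


def freed_share(before, offloaded):
    """Percent of EDW capacity released when the named workloads move to Hadoop."""
    return sum(before[name] for name in offloaded)


def growth_factor(before, after, workload):
    """How many times larger a workload's share is after the offload."""
    return after[workload] / before[workload]


if __name__ == "__main__":
    print(freed_share(BEFORE, ["ETL Processing"]))
    print(round(growth_factor(BEFORE, AFTER, "Analytics"), 1))
```

Moving ETL processing off the warehouse releases 42% of its capacity, and the analytics share rises from 11% to 50%, roughly 4.5 times its previous share.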
From data puddles and ponds to lakes and oceans

AVOID: Systems separated by workload type due to contention
BU1, BU2, BU3: each with its own Big Data system

GOAL: Platform that natively supports mixed workloads as shared service
Workloads: Batch, Interactive, Online
Uses: Refine, Explore, Enrich
Big Data: Transactions, Interactions, Observations

Questions to ask in designing a solution for a particular business use case

Architecture layers: Presentation, Application, Data Integration, Data Processing, Data Management, Infrastructure, Operations, Security

Which distribution is right for your needs today vs. tomorrow?
Which distribution will ensure you stay on the main path of open source innovation, vs. trap you in proprietary forks?

Widely adopted, mature distribution
GTM partners include Oracle, HP, Dell, IBM

Fully open source distribution (incl. management tools)
Reputation for cost-effective licensing
Strong developer ecosystem momentum
GTM partners include Microsoft, Teradata, Informatica, Talend

More proprietary distribution with features that appeal to some business critical use cases
GTM partner AWS (M3 and M5 versions only)

Just announced by EMC, very early stage
Differentiator is HAWQ: claims manifold query speed improvement, full SQL instruction set

Note: Distributions include more than just the Data Management layer but are discussed at this point in the presentation.
Not shown: Intel, Fujitsu and other distributions

Common objections to Hadoop

We can't justify the budget for a new project
We don't have big data problems
We don't have petabytes of data
We're not sure Hadoop is mature/secure/enterprise-ready
We already have a scale-out strategy for our EDW/ETL
We don't have the skills

Every organization has data problems!
Hadoop can help
MYTH: Big Data means Petabytes
Not just Volume
Remember Variety, Velocity
Plenty of issues at smaller scales: data processing, unstructured data
Often warehouse volumes are small because the technology is expensive, not because there is no relevant data
Scalability is about growing with the business, affordably and predictably

MYTH: Big Data means Data Science
Hadoop solves existing problems faster, better, cheaper than conventional technology, e.g.
Landing zone capturing and refining multi-structured data types with unknown future value
Cost effective platform for retaining lots of data for long periods of time
Walk before you run

Big Data Is a State of Mind

Waves of adoption: crossing the chasm

Wave 1: Batch Orientation
Adoption today*: Mainstream, 70% of organizations
Example use cases: Refine (archival and transformation)
Response time: Hour(s)
Architectural characteristic: EDW / RDBMS talk to Hadoop
Example technologies: MapReduce, Pig, Hive

Wave 2: Interactive Orientation
Adoption today*: Early adopters, 20% of organizations
Example use cases: Explore (query and visualization)
Response time: Minutes
Architectural characteristic: Analytic apps talk directly to Hadoop
Example technologies: ODBC/JDBC, Hive

Wave 3: Real-Time Orientation
Adoption today*: Bleeding edge, 10% of organizations
Example use cases: Enrich (real-time decisions)
Response time: Seconds
Architectural characteristic: Derived data also stored in Hadoop
Example technologies: HBase, NoSQL, SQL

Data characteristic across the waves: Volume, then Velocity

* Among organizations using Hadoop


Hadoop in a nutshell

The Hadoop open source ecosystem delivers powerful innovation in storage, databases and business intelligence, promising unprecedented price / performance compared to existing technologies.

Hadoop is becoming an enterprise-wide landing zone for large amounts of data. Increasingly it is also used to transform data.

Large enterprises have realized substantial cost reductions by offloading some enterprise data warehouse, ETL and archiving workloads to a Hadoop cluster.

Challenges in the Enterprise

Use-case identification and cost justification
Cooperation and coordination from independent business units
As Hadoop increases its footprint in business-critical areas, the business will demand mature enterprise capabilities, e.g. DR, snapshots, etc.
Hadoop's disruptive approach is challenging strong legacy EDW people, processes and technologies.
Data harmonization is often a significant challenge.
Fear of forking (think UNIX)
Proprietary absorption (Borged)
Audience: Hadoop addresses business problems, not IT problems
Fear of data complexity (I hated statistics class!)

Questions?
gregory.smith@t-systems.com
