Cloudera Tutorial

Introduction
to Data Science
with Hadoop
Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com
Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
1 of 36
Terms
I will cover:
Hadoop, Hadoop ecosystem
HDFS
MapReduce
Sqoop
Flume
Hive
Pig
Mahout
Machine learning
Data science using Hadoop

with a few extras:
YARN
HBase
Impala
Oozie
data products
2 of 36
Hadoop
Hadoop is:
a platform for big data
several Apache Software Foundation (ASF) projects
free, open source software
Major parts:
Hadoop Core
Hadoop ecosystem
3 of 36
Hadoop Core Main Features: File System and Batch Programming
4 of 36
Hadoop Core
Hadoop Core consists of:

HDFS
(Hadoop Distributed File System), for storage
MapReduce
for batch programming
5 of 36
HDFS Writes
6 of 36
HDFS Reads
7 of 36
HDFS Strengths and Weaknesses
HDFS is good at:

storing enormous files
storing a lot of data reliably

throughput on sequential writes
throughput on sequential reads of a file or part of a file
HDFS is not good at:

high-speed random reads of parts of a file
HDFS cannot:
update any part of a file once written*
* but you can always write a new file, and/or delete, move,
and rename files and directories
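
The footnote's workaround (write a new file, then rename it into place) can be sketched as follows. This is only an illustration: it uses Python's local filesystem calls as a stand-in for HDFS, and the helper name `replace_file_contents` and the file `events.txt` are hypothetical.

```python
import os
import tempfile

def replace_file_contents(path, transform):
    """Emulate an 'update' on a write-once store: read old, write new, rename."""
    tmp_path = path + ".tmp"
    with open(path, "r") as src, open(tmp_path, "w") as dst:
        for line in src:                # sequential read of the old file
            dst.write(transform(line))  # sequential write of the new file
    os.replace(tmp_path, path)          # swap the new file into place

# Local demonstration; a real job would do the same steps against HDFS.
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "events.txt")
with open(demo_path, "w") as f:
    f.write("alpha\nbeta\n")

replace_file_contents(demo_path, str.upper)

with open(demo_path) as f:
    result = f.read()
```

The pattern uses only the operations the slide lists as available: sequential reads, sequential writes, and renames.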
8 of 36
MapReduce: Programming with Simple Functions
9 of 36
MapReduce Chains
10 of 36
MapReduce at Scale
11 of 36
MapReduce in Hadoop
12 of 36
MapReduce Strengths and Weaknesses
MapReduce is good at:

processing enormous amounts of data
scaling out as you add more machines
continuing to completion, even when some machines die
MapReduce is not good at:

running any algorithm you can think up
algorithms that require shared state overall*
* but maybe you can get clever with your algorithm design
MapReduce cannot:
run in real time: MapReduce jobs are batch jobs
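
To make the "simple functions" idea concrete, here is a minimal in-memory sketch of a word count. It is not the Hadoop API: the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, and `run_job` only imitates what the framework does between the two phases (grouping values by key).

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)

def run_job(lines, mapper, reducer):
    """Stand-in for the framework: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

counts = run_job(["big data", "big clusters"], map_fn, reduce_fn)
# counts == {"big": 2, "clusters": 1, "data": 1}
```

Neither function keeps state between calls, which is what makes this style easy to spread across machines, and also why algorithms that need shared state are a poor fit.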
13 of 36
Detour: YARN, Yet Another Resource Negotiator (near future)
14 of 36
Hadoop Ecosystem
The Hadoop Ecosystem consists of other projects that round
out Hadoop Core to make it a useful platform:

Sqoop, for RDBMS integration
Flume, for event ingestion
Hive, for "SQL"-like high-level programming
Pig, another high-level programming paradigm
Mahout, a Java library for machine learning in Hadoop
Plus:
HBase, a "NoSQL" database system
Oozie, a workflow manager for Hadoop actions
....
15 of 36
Sqoop: RDBMS to Hadoop and Back
16 of 36
Flume: Ingesting Continuing Event Data
17 of 36
Detour: General File Input/Output
18 of 36
MapReduce revisited: How to write MapReduce programs?

Java MapReduce API
The most expressive technique possible
The most work, by far
(Can be easier with Hadoop Streaming: a way to write MapReduce programs in a
streaming style, such as shell scripting or Python)
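
As a rough sketch of the Streaming style: a mapper reads input lines and writes tab-separated key/value lines, and a reducer reads those lines back sorted by key. The record format (`station,temperature`) and the sample data are hypothetical, and both steps are written as functions over lists of lines so the logic can be tried without a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Turn 'station,temperature' records into tab-separated key/value lines."""
    for line in lines:
        station, temp = line.strip().split(",")
        yield f"{station}\t{temp}"

def reducer(sorted_lines):
    """Input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.split("\t") for line in sorted_lines)
    for station, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{station}\t{max(int(temp) for _, temp in group)}"

records = ["oslo,3", "cairo,31", "oslo,7", "cairo,28"]
# sorted() stands in for the sort that happens between the map and reduce phases.
output = list(reducer(sorted(mapper(records))))
# output == ["cairo\t31", "oslo\t7"]
```

In an actual Streaming job each function would live in its own script, reading `sys.stdin` and printing to standard output.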
19 of 36
Hive: MapReduce as "SQL"
Familiar language and programming paradigm
Provides interface to many SQL-compliant tools

20 of 36
Detour: Impala, High Speed Analytics in Hadoop
5 to 30 times faster than Hive queries (sometimes hundreds of times faster!)
Cloudera exclusive offering, but Apache licensed, so it's free and open source
21 of 36
Impala Does Not Use MapReduce
22 of 36
Detour: HBase, A NoSQL Database System
23 of 36
Detour: A bit more about HBase
HBase is a NoSQL database system:

programmers create and use database tables
high volume, high performance access to individual cells
much weaker query language than SQL
lacks ACID-compliant transactions
HBase is not strictly needed to do "data science"

a resource hog; competes with analytical programs
often deployed on its own separate cluster
may be part of your organization's data storage and delivery,
so you may need to get or put data into an HBase system*
* (or other NoSQL system)
24 of 36
Pig: Another Language for MapReduce
25 of 36
Mahout: Machine Learning in MapReduce

Mahout is:
a collection of algorithms, mainly focused on "the three C's" of
machine learning
written in Java
largely implemented over Hadoop MapReduce
invocable from the command line
extensible, with the Java API
Mahout is not:
a turnkey solution for doing machine learning
always user-friendly
26 of 36
Machine Learning
"The three C's" of machine learning:

Classification
Clustering
Collaborative filtering (recommenders)
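
To give a feel for the third "C", here is a toy recommender based on item co-occurrence: items that often appear together in users' histories are suggested to users who have only some of them. This is a simplified illustration, not Mahout's implementation; the function names and the sample `history` data are hypothetical.

```python
from collections import Counter
from itertools import permutations

def cooccurrence(baskets):
    """Count how often each ordered pair of items shares a user's history."""
    counts = Counter()
    for items in baskets.values():
        counts.update(permutations(sorted(set(items)), 2))
    return counts

def recommend(user, baskets, top_n=2):
    """Score unseen items by co-occurrence with items the user already has."""
    seen = set(baskets[user])
    scores = Counter()
    for (a, b), n in cooccurrence(baskets).items():
        if a in seen and b not in seen:
            scores[b] += n
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

history = {
    "ann": ["hive", "pig"],
    "bob": ["hive", "pig", "mahout"],
    "cat": ["hive", "mahout"],
    "dan": ["pig", "sqoop"],
}
suggestions = recommend("ann", history)
# suggestions == ["mahout", "sqoop"]
```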
27 of 36
Supervised Machine Learning: Classification
28 of 36
Machine Learning: Clustering
29 of 36
Machine Learning: Collaborative Filtering for Recommenders
30 of 36
Simple Enterprise Deployment: Hadoop as ETL Appliance
31 of 36
Detour: Oozie, Workflow within Hadoop

Simple workflow within Hadoop:
1. Clear out staging directory in HDFS
2. Sqoop import from OLTP tables
3. Hive (or Pig) script to transform data
4. Sqoop export to data warehouse
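
The four steps above can be sketched as a run-in-order, stop-on-failure sequence. Oozie itself is configured differently; this Python sketch only illustrates the control flow. All paths, table names, JDBC URLs, and the script name `transform_orders.hql` are hypothetical, and the commands assume the `hdfs`, `sqoop`, and `hive` command-line tools are installed.

```python
import subprocess

STAGING = "/staging/orders"                         # hypothetical HDFS path
JDBC_OLTP = "jdbc:mysql://oltp.example.com/shop"    # hypothetical source database
JDBC_DW = "jdbc:mysql://dw.example.com/warehouse"   # hypothetical data warehouse

WORKFLOW = [
    ("clear staging", ["hdfs", "dfs", "-rm", "-r", "-f", STAGING]),
    ("sqoop import", ["sqoop", "import", "--connect", JDBC_OLTP,
                      "--table", "orders", "--target-dir", STAGING]),
    ("hive transform", ["hive", "-f", "transform_orders.hql"]),
    ("sqoop export", ["sqoop", "export", "--connect", JDBC_DW,
                      "--table", "orders_summary",
                      "--export-dir", "/warehouse/orders_summary"]),
]

def run_workflow(steps, run=subprocess.call):
    """Run steps in order; stop at the first non-zero exit code."""
    completed = []
    for name, command in steps:
        if run(command) != 0:
            return completed, name   # report which step failed
        completed.append(name)
    return completed, None

# Dry run with a stand-in runner, so no cluster is needed to try the logic.
done, failed = run_workflow(WORKFLOW, run=lambda command: 0)
```

Passing the runner as a parameter keeps the ordering logic testable; calling `run_workflow(WORKFLOW)` with the default would execute the real commands.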
32 of 36
Hadoop: The Bigger Picture
33 of 36
Data Science with Hadoop

A data scientist will:
1. Identify internal and external data for potential use (general data wrangling tools).
2. Help build ingestion pipelines to obtain data for use (Flume, Sqoop, other).
3. Examine, clean, and anonymize ingested data (Hive, Impala, Pig, Hadoop Streaming).
4. Shape data into useful formats (Hive, Pig).
5. Explore data sets to gain understanding of problems, trends, reality (Impala, Hive, Pig,
statistical programming).
6. Build predictive models using statistical programming, machine learning (Mahout).
7. Contribute to data products: products in the organization that are built in large part
from the data itself (Mahout, Sqoop export, general file export).
8. Conduct experiments with data products, quantifying benefits and/or tradeoffs of
system changes (Flume, Sqoop, statistical tests).
9. Communicate results and insights to stakeholders (visualization*).
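
As one example of the "statistical tests" mentioned in step 8, the sketch below compares the success rates of two variants of a data product with a two-proportion z-test. The choice of test, the function name `two_proportion_z`, and the sample counts are assumptions made for illustration; the slides do not prescribe a particular test.

```python
import math

def two_proportion_z(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled rate under the null hypothesis that both variants perform alike.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 100 of 1000 conversions before, 130 of 1000 after.
z, p_value = two_proportion_z(100, 1000, 130, 1000)
```

In practice the counts would come from logs ingested through the pipelines named in the step (Flume, Sqoop).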
34 of 36
Visualization: Needs Visualization Software
35 of 36
Thank you!
Questions? Contributions?
Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com
36 of 36
