
Introduction

to Data Science
with Hadoop
Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com

Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.


Terms
I will cover:

Hadoop, Hadoop ecosystem
HDFS
MapReduce
Sqoop
Flume
Hive
Pig
Mahout
Machine learning
Data science using Hadoop

with a few extras:

YARN
HBase
Impala
Oozie
data products


Hadoop
Hadoop is:
a platform for big data
several Apache Software Foundation (ASF) projects
free open source software
Major parts:
Hadoop Core

Hadoop ecosystem

Hadoop Core Main Features: File System and Batch Programming


Hadoop Core

Hadoop Core consists of:


HDFS (Hadoop Distributed File System), for storage
MapReduce, for batch programming


HDFS Writes


HDFS Reads


HDFS Strengths and Weaknesses

HDFS is good at:

storing enormous files
storing a lot of data reliably
throughput on sequential writes
throughput on sequential reads of a file or part of a file

HDFS is not good at:

high speed random reads of parts of a file

HDFS cannot:

update any part of a file once written*

* but you can always write a new file, and/or delete, move, and rename files and directories

MapReduce: Programming with Simple Functions
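
As a minimal sketch of the idea in this slide's title, and not Hadoop's actual API, the following single-process Python model builds a job from two simple functions: a mapper that turns one record into key/value pairs, and a reducer that combines all values collected for one key. The names `run_map_reduce`, `map_words`, and `sum_counts` are hypothetical, and the input lines are made up.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Toy single-process model of a MapReduce job (not the Hadoop API)."""
    groups = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle: collect the values that share a key.
            groups[key].append(value)
    # Reduce phase: one reducer call per key, in sorted key order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}

def map_words(line):
    """Mapper: emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def sum_counts(word, counts):
    """Reducer: add up the counts emitted for one word."""
    return sum(counts)

lines = ["big data on Hadoop", "Hadoop stores big files"]
word_counts = run_map_reduce(lines, map_words, sum_counts)
print(word_counts)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, a framework is free to spread both phases across many machines.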


MapReduce Chains


MapReduce at Scale


MapReduce in Hadoop


MapReduce Strengths and Weaknesses

MapReduce is good at:

processing enormous amounts of data
scaling out as you add more machines
continuing to completion, even when some machines die

MapReduce is not good at:

running any algorithm you can think up
algorithms that require shared state overall*

* but maybe you can get clever with your algorithm design

MapReduce cannot:

run in real time: MapReduce jobs are batch jobs

Detour: YARN, Yet Another Resource Negotiator (near future)


Hadoop Ecosystem
The Hadoop Ecosystem consists of other projects that round out Hadoop Core to make it a useful platform:

Sqoop, for RDBMS integration
Flume, for event ingestion
Hive, for "SQL"-like high-level programming
Pig, another high-level programming paradigm
Mahout, a Java library for machine learning in Hadoop

Plus:

HBase, a "NoSQL" database system
Oozie, a workflow manager for Hadoop actions
....

Sqoop: RDBMS to Hadoop and Back


Flume: Ingesting Continuing Event Data


Detour: General File Input/Output


MapReduce revisited: How to write MapReduce programs?


Java MapReduce API

The most expressive technique possible
The most work, by far

(Can be easier with Hadoop Streaming: a way to write jobs as stream-processing programs, such as shell scripts or Python)
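
To illustrate the Hadoop Streaming style mentioned above, here is a hedged Python sketch of a word-count mapper and reducer. In Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, and the framework sorts the mapper output by key before the reducer sees it. To keep the sketch runnable without a cluster, the two steps are written as functions over iterables of lines, and the `sorted(...)` call is a local stand-in for the framework's sort. The function names and sample input are hypothetical.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Local stand-in for the framework: map, sort by key, then reduce.
sample = ["Pig and Hive", "Hive and Impala"]
result = list(reducer(sorted(mapper(sample))))
print("\n".join(result))
```

On a cluster, each function would live in its own script that reads `sys.stdin`, and the scripts would be named in the Streaming job's `-mapper` and `-reducer` options.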

Hive: MapReduce as "SQL"

Familiar language and programming paradigm

Provides interface to many SQL-compliant tools
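
To make the "MapReduce as SQL" idea concrete, here is a hedged sketch that uses Python's built-in `sqlite3` module purely as a local stand-in for Hive: it is not Hive, and it runs no MapReduce. The table `page_views` and its columns are hypothetical. The point is only that one `GROUP BY` query states what would otherwise be a hand-written map step (emit the grouping key) and reduce step (aggregate per key).

```python
import sqlite3

# In-memory database used only as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")],
)

# Grouping key = url (the "map" output key); COUNT(*) = the "reduce" step.
query = """
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY url
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```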



Detour: Impala, High Speed Analytics in Hadoop

5 to 30 times faster than Hive queries (sometimes hundreds of times faster!)

Cloudera exclusive offering, but Apache licensed, so it's free and open source

Impala Does Not Use MapReduce


Detour: HBase, A NoSQL Database System


Detour: A bit more about HBase

HBase is a NoSQL database system:

programmers create and use database tables
high volume, high performance access to individual cells
much weaker query language than SQL
lacks ACID-compliant transactions

HBase is not strictly needed to do "data science"

a resource hog; competes with analytical programs
often deployed on its own separate cluster
may be part of your organization's data storage and delivery, so you may need to get or put data into an HBase system*

* (or other NoSQL system)

Pig: Another Language for MapReduce


Mahout: Machine Learning in MapReduce


Mahout is:

a collection of algorithms, mainly focused on "the three C's" of machine learning
written in Java
largely implemented over Hadoop MapReduce
invocable from the command line
extensible, with the Java API

Mahout is not:

a turnkey solution for doing machine learning
always user-friendly

Machine Learning

"The three C's" of machine learning:


Classification
Clustering
Collaborative filtering (recommenders)

Supervised Machine Learning: Classification
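
As a small illustration of supervised classification (learn from labeled examples, then predict the label of a new example), here is a hedged pure-Python sketch of a nearest-centroid classifier. It is a swapped-in teaching technique, not Mahout's API or any algorithm the slides name. The function names, labels, and training points are all made up.

```python
from collections import defaultdict
from math import dist

def train_centroids(examples):
    """Average the feature vectors of each label (the 'training' step)."""
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in by_label.items()
    }

def classify(centroids, features):
    """Predict the label whose centroid is closest to the new point."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Made-up labeled training data: (features, label).
training = [
    ((1.0, 1.0), "ham"), ((1.5, 0.5), "ham"),
    ((8.0, 9.0), "spam"), ((9.0, 8.0), "spam"),
]
model = train_centroids(training)
print(classify(model, (2.0, 1.0)), classify(model, (7.5, 8.5)))
```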


Machine Learning: Clustering
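
As a small illustration of clustering (grouping unlabeled points by similarity), here is a hedged pure-Python sketch of plain k-means. It runs on one machine and is not Mahout's implementation. The function name, the starting centers, and the sample points are assumptions chosen so the run is deterministic.

```python
from math import dist

def k_means(points, centers, iterations=10):
    """Plain k-means: assign points to the nearest center, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
            clusters[nearest].append(point)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(column) / len(cluster) for column in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print(centers)
```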


Machine Learning: Collaborative Filtering for Recommenders
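
As a small illustration of collaborative filtering (recommending items based on what similar users chose), here is a hedged pure-Python sketch. It scores each unseen item by how many items its owners share with the target user. This is a deliberately simple overlap measure, not Mahout's recommender API. The function name and the user histories are made up.

```python
from collections import Counter

def recommend(histories, user, top_n=2):
    """Suggest unseen items, weighted by overlap with other users' histories."""
    seen = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(seen & items)  # similarity = number of shared items
        for item in items - seen:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked if score > 0][:top_n]

# Made-up histories: user -> set of items the user chose.
histories = {
    "ann": {"hive", "pig"},
    "bob": {"hive", "pig", "mahout"},
    "cat": {"hive", "impala"},
    "dan": {"flume"},
}
print(recommend(histories, "ann"))
```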


Simple Enterprise Deployment: Hadoop as ETL Appliance


Detour: Oozie, Workflow within Hadoop


Simple workflow within Hadoop:
1. Clear out staging directory in HDFS
2. Sqoop import from OLTP tables
3. Hive (or Pig) script to transform data
4. Sqoop export to data warehouse
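
The control flow of the four steps above can be sketched in Python. This is only an illustration of ordered steps, not how Oozie is used: Oozie workflows are defined in XML, and the step bodies here are hypothetical placeholders that record a message instead of calling HDFS, Sqoop, or Hive. Halting at the first failed step is an assumption of this sketch.

```python
def run_workflow(steps):
    """Run named steps in order; stop at the first failure."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            return completed, f"{name} failed: {error}"
        completed.append(name)
    return completed, "succeeded"

# Hypothetical placeholders: real steps would invoke HDFS, Sqoop, and Hive.
log = []
steps = [
    ("clear-staging", lambda: log.append("cleared staging directory")),
    ("sqoop-import", lambda: log.append("imported OLTP tables")),
    ("hive-transform", lambda: log.append("transformed data")),
    ("sqoop-export", lambda: log.append("exported to data warehouse")),
]
completed, status = run_workflow(steps)
print(completed, status)
```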


Hadoop: The Bigger Picture


Data Science with Hadoop


A data scientist will:

1. Identify internal and external data for potential use (general data wrangling tools).
2. Help build ingestion pipelines to obtain data for use (Flume, Sqoop, other).
3. Examine, clean, and anonymize ingested data (Hive, Impala, Pig, Hadoop Streaming).
4. Shape data into useful formats (Hive, Pig).
5. Explore data sets to gain understanding of problems, trends, reality (Impala, Hive, Pig, statistical programming).
6. Build predictive models using statistical programming, machine learning (Mahout).
7. Contribute to data products: products in the organization that are built in large part from the data itself (Mahout, Sqoop export, general file export).
8. Conduct experiments with data products, quantifying benefits and/or tradeoffs of system changes (Flume, Sqoop, statistical tests).
9. Communicate results and insights to stakeholders (visualization*).


Visualization: Needs Visualization Software


Thank you!
Questions? Contributions?
Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com
