
Hadoop Overview

Hadoop Tutorial Workshop

2008. 11. 28
Jaesun Han (CEO, NexR)
jshan0000@gmail.com
http://www.nexr.co.kr
Hadoop

Brief History
- Started in 2005 by Doug Cutting (developer of Lucene & Nutch)
- Grew out of the need to scale the open-source Nutch search engine across many machines
- 2006: full backing from Yahoo, which hired Doug Cutting and a dedicated team
- 2008: promoted to an Apache top-level project
- Current release (as of April 2008): 0.16.3

Hadoop
- Written in Java
- Apache license
- Many components: HDFS, HBase, MapReduce, Hadoop On Demand (HOD), Streaming, HQL, Hama, Mahout, etc.
Hadoop Architecture
- Hadoop components and their Google counterparts:
  - Nutch: open-source search engine ↔ Google Search
  - MapReduce: distributed data processing ↔ Google MapReduce
  - HBase: distributed data store ↔ Bigtable
  - HDFS: distributed file system ↔ GFS

Commodity Server Cluster
- Both stacks run on clusters of commodity PC servers
- 2002 price comparison: the commodity cluster WINs
  - x86-based server: dual 2 GHz Intel Xeon CPUs, 2 GB RAM, 80 GB disk
  - 88 such servers = 176 2 GHz CPUs, 176 GB RAM, ~7 TB disk, at $278,000
  - High-end server: 8 2 GHz Intel Xeon CPUs, 64 GB RAM, 8 TB disk, at $758,000
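The rack totals above are simple arithmetic; a quick sanity check of the figures quoted on the slide:

```python
# 2002-era comparison: 88 commodity servers vs. one 8-way server.
# Per-machine specs are taken from the slide.
machines = 88
cpus_per_machine = 2           # dual 2 GHz Xeon
ram_gb_per_machine = 2
disk_gb_per_machine = 80

total_cpus = machines * cpus_per_machine          # 176 CPUs
total_ram_gb = machines * ram_gb_per_machine      # 176 GB
total_disk_tb = machines * disk_gb_per_machine / 1000  # ~7 TB

print(total_cpus, total_ram_gb, round(total_disk_tb, 2))
```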
Goals of HDFS
- Very large distributed file system: 10K nodes, 100 million files, 10 PB
- Assumes commodity hardware
  - Files are replicated to handle hardware failure
  - Detects failures and recovers from them
- Optimized for batch processing
  - Data locations are exposed so that computations can move to where the data resides
  - Provides very high aggregate bandwidth
Features of HDFS
- Single namespace for the entire cluster, managed by a single namenode
- Files are write-once
- Optimized for streaming reads of large files
- Files are broken into large blocks (typically 128 MB) and replicated to several datanodes for reliability
- Clients talk to both the namenode and the datanodes; data is not sent through the namenode
- File system throughput scales nearly linearly with the number of nodes
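As a rough illustration of the block layout described above, a minimal sketch (the 128 MB block size is from the slide; the replication factor of 3 and the file size are assumptions for the example):

```python
# Sketch: how a file is split into HDFS blocks and replicated.
# Block size follows the slide (128 MB); replication factor 3
# is an assumed default, not stated on the slide.
import math

BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total replicated storage in MB)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION

blocks, storage = hdfs_footprint(1000)  # a hypothetical 1 GB file
print(blocks, storage)  # 8 blocks, 3000 MB spread over the cluster
```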
HDFS Architecture
MapReduce
- Distributed processing framework, following the Google MapReduce model
  - map (k1, v1) → list (k2, v2)
  - reduce (k2, list (v2)) → list (v2)
- Designed for parallel processing of large data sets
  - Parallelization, fault tolerance, and data distribution are handled by the framework
- Applications: log analysis, search indexing, collaborative filtering, clustering, machine learning, data mining, etc.
- Features of Hadoop MapReduce:
  - Mapper locality
  - Overlap of map, shuffle, and sort
  - Speculative execution
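The map/reduce signatures above can be sketched as a minimal single-process simulation of the model (illustrative only; real Hadoop distributes these phases across machines):

```python
# Minimal MapReduce simulation: map -> shuffle/sort -> reduce,
# all on one machine. Function names are illustrative.
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """map (k1, v1) -> list(k2, v2); group by k2; reduce (k2, list(v2)) -> list(v2)."""
    # Map phase over all input records
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/sort: bring equal keys together
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k2, [v for _, v in group]))
    return output

# Example: counting words; map emits (word, 1), reduce sums the ones
result = run_mapreduce(
    [("doc1", "to be or not to be")],
    lambda k, v: [(w, 1) for w in v.split()],
    lambda k, vs: [(k, sum(vs))],
)
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```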
MapReduce Execution Flow
- map (k, v) → list (k', v')
- reduce (k', list (v')) → list (v'')
Logical Processing Flow of MapReduce
- Tasks are processed in parallel
- Source: Google slides (http://labs.google.com/papers/mapreduce-osdi04-slides/index.html)

WordCount MapReduce
(Map and Reduce code shown on the slide)
WordCount MapReduce: Data Flow
- Input files (copied from local disk into HDFS):
  - file01.txt: "Hello World Bye World"
  - file02.txt: "Hello Hadoop Goodbye Hadoop"
- The wordcount job is submitted to the JobTracker
- Map tasks (M) emit one (word, 1) pair per word, e.g. (Hello, 1), (World, 1), (Bye, 1), (World, 1), ...
- The sorter groups the pairs by word
- Reduce tasks (R) sum the counts: (Bye, 1), (Goodbye, 1), (Hadoop, 2), (Hello, 2), (World, 2)
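The flow on this slide can be reproduced with a small single-process sketch (plain Python standing in for the Hadoop job):

```python
# Reproduce the WordCount example above without Hadoop:
# map emits (word, 1) pairs, shuffle/sort groups them by word,
# reduce sums the counts per word.
files = {
    "file01.txt": "Hello World Bye World",
    "file02.txt": "Hello Hadoop Goodbye Hadoop",
}

# Map phase: one (word, 1) pair per word
pairs = [(word, 1) for text in files.values() for word in text.split()]

# Shuffle/sort: group pairs by key
pairs.sort()

# Reduce phase: sum the values for each key
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one

print(sorted(counts.items()))
# [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```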
Nutch: Search over Hadoop
- Crawling MapReduce (repeated until the target crawl depth is reached):
  - Injector seeds the crawl db with a seed URL list
  - Generator produces the fetch URL list
  - Fetcher downloads pages (segments)
  - ParseSegment extracts page metadata
  - CrawlDb is updated with the newly found URLs
- Indexing MapReduce:
  - LinkDb (invert) builds link lists
  - Indexer builds Lucene indexes
  - DeleteDuplicates removes duplicate pages
  - IndexMerge produces the merged index
- All intermediate and final data live in HDFS
Hadoop MapReduce Architecture
- A JobClient submits a job to the JobTracker, which keeps a job queue and the job's input list
- The JobTracker allocates tasks to TaskTrackers; the TaskTrackers report status via heartbeat
- TaskTrackers run the tasks (t1, t2, t3, ...) on nodes #1..#n
- Input is written to HDFS beforehand, so tasks read their data from the datanodes on the same cluster
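The task allocation step above, combined with the mapper-locality feature mentioned earlier, can be sketched as follows (the data structures are hypothetical simplifications, not Hadoop's actual scheduler):

```python
# Sketch of locality-aware task allocation: the JobTracker prefers
# to run a map task on a TaskTracker whose node already holds the
# task's input block, falling back to any free tracker otherwise.
def allocate(task_input_block, block_locations, free_trackers):
    """Pick a tracker holding the block if possible, else the first free one."""
    local = [t for t in free_trackers
             if t in block_locations.get(task_input_block, ())]
    return local[0] if local else free_trackers[0]

# block -> nodes holding a replica (hypothetical example data)
blocks = {"blk_1": {"node2", "node3"}}
print(allocate("blk_1", blocks, ["node1", "node2"]))  # node2 (data-local)
print(allocate("blk_1", blocks, ["node1"]))           # node1 (remote read)
```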
MapReduce: Usage in Google
- Statistics on MapReduce development and use
- Steady growth in the number of MapReduce programs
- MapReduce jobs act as software components
  → greater reuse improves development productivity
- Source: MapReduce: simplified data processing on large clusters (CACM 2008)
Real Indexing MapReduce in Google

Stolen from
Michael Kleber’s Presentation
Hadoop Ecosystem
- Hadoop core: HDFS, MapReduce, HBase, HOD, Streaming, Fuse-DFS, EC2 support
- Cascading: workflow management for Hadoop MapReduce
- NexR VCC: Hadoop on virtualization
- Parhely: ORM for HBase
- Yahoo Pig: query language interface on Hadoop
- Katta: distributed indexing with HDFS and MapReduce
- Yahoo ZooKeeper: distributed management
- Mahout & Hama: machine learning using Hadoop MapReduce
- IBM MapReduce Tools: Eclipse plug-in for MapReduce programs
- Facebook Hive: data warehousing on Hadoop
Use Cases
Powered by Hadoop

Complete List: http://wiki.apache.org/hadoop/PoweredBy


Yahoo: Hadoop Cluster
- ~20,000 machines running Hadoop
- The largest cluster is currently 2,000 nodes
- Several petabytes of user data (compressed, unreplicated)
- Runs hundreds of thousands of jobs every month
Yahoo: Running WebMap

Search needs a graph of the “known” web


Invert edges, compute link text, whole graph heuristics
Periodic batch job using Map/Reduce
Uses a chain of ~100 map/reduce jobs
Scale
100 billion nodes and 1 trillion edges
Largest shuffle is 450 TB
Final output is 300 TB compressed
Runs on 10,000 cores
Written mostly using Hadoop’s C++ interface
Yahoo: Research Cluster
- The grid team runs research clusters as a service to Yahoo researchers ("Analytics as a Service")
- Mostly data mining / machine learning jobs
- Most research jobs are *not* Java:
  - 42% Streaming: uses Unix text processing to define map and reduce
  - 28% Pig: higher-level dataflow scripting language
  - 28% Java
  - 2% C++
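A Streaming job of the kind counted above wires two line-oriented scripts together: the mapper and reducer read stdin and write tab-separated key/value lines, and Hadoop sorts the mapper output by key in between. A minimal word-count pair, written as plain functions so the pipeline can be simulated here (the function names are illustrative):

```python
# Hadoop Streaming-style word count, simulated in-process.
def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts for consecutive lines sharing the same key."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate the pipeline: mapper | sort | reducer
out = list(reducer(sorted(mapper(["hello world", "hello hadoop"]))))
print(out)  # ['hadoop\t1', 'hello\t2', 'world\t1']
```

In a real Streaming job the two scripts would be passed to Hadoop as the map and reduce commands, with the sort done by the framework.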
Facebook: Hadoop Cluster
- Production cluster
  - 2,400 cores, 300 machines, 16 GB per machine (Oct 2008)
  - 4,800 cores, 600 machines, 16 GB/8 GB per machine (Nov 2008)
  - 4 SATA disks of 1 TB each per machine
  - 2-level network hierarchy, 40 machines per rack
- Test cluster
  - 800 cores, 16 GB each
Facebook: HDFS Storage Cluster
- Single HDFS cluster across all cores
- Running 0.17.2 + patches
- Over 1.2 PB raw capacity
- Ingest rate is 2 TB of compressed user data per day (10 TB uncompressed)
- 10 million files
- NameNode on a 64-bit JVM with a 20 GB heap
Facebook: MapReduce Compute Clusters
- 3 static MapReduce clusters, running 0.17.2 + patches
  - Main cluster has 2,240 cores and serves most users
  - Ads cluster has 80 cores, dedicated to advertisement-related computations
  - Test cluster of 80 cores, for testing miscellaneous fixes
- JobTracker(s) on a 32-bit JVM with a 3 GB heap
Facebook: Data Warehousing
- Data flow: web servers → Scribe servers → filers → Hive on the Hadoop cluster → Oracle RAC and federated MySQL
Facebook: Hive/Hadoop Usage
- Types of applications:
  - Summarization, e.g. daily/weekly aggregations of impression/click counts, complex measures of user engagement
  - Ad hoc analysis, e.g. how many group admins, broken down by state/country
  - Data mining (assembling training data), e.g. user engagement as a function of user attributes
  - Spam detection: anomalous patterns in UGC, application API usage patterns
  - Ad optimization
  - Too many to count ..
Facebook: Hadoop Usage
- Data statistics:
  - Total data: 180 TB (mostly compressed)
  - Net data added per day: 2+ TB (compressed)
    - 6 TB of uncompressed source logs
    - 4 TB of uncompressed dimension data reloaded daily
- Usage statistics:
  - 3,200 jobs/day with 800K map-reduce tasks/day
  - 55 TB of compressed data scanned daily
  - 15 TB of compressed output data written to HDFS
  - 80 million compute minutes/day
New York Times
- Hadoop on Amazon Web Services
- Converted 11 million articles (1851-1980) from TIFF to PDF
- Amazon S3 and EC2 for the hardware, Hadoop for the software
- Flow: TIFF images (4 TB) on Amazon S3 → Hadoop MapReduce on Amazon EC2 (100 instances, custom AMI) → PDFs (1.5 TB) back to S3
Hadoop Cluster for Academia
- Google and IBM support university courses on distributed platforms
  - Provide clusters of hundreds of machines
  - UW, MIT, Stanford, UM, CMU, UC Berkeley
  - MapReduce programming is taught with Hadoop
- Yahoo! supports large-scale data processing research at CMU
  - M45 cluster: 4,000 processors, 3 TB memory, 1.5 PB disk
  - Search, information retrieval, natural language processing, machine translation, etc.
  - Research on distributed data processing using Hadoop
Classes using Hadoop
- UW: Problem Solving on Large Scale Clusters (University of Washington, Spring 2007)
- UCSD: Networked Services (UC San Diego, Fall 2007)
- MIT: MapReduce course (part of the 2008 Independent Activities Period at MIT)
- UM: Web-Scale Information Processing Applications (University of Maryland, Spring 2008)
- KAIST: Distributed Algorithms and Systems (KAIST, Fall 2008)
Thank You!
