
Hadoop Overview

Hadoop Tutorial Workshop

2008. 11. 28
Jaesun Han (CEO, NexR)
jshan0000@gmail.com
http://www.nexr.co.kr
Hadoop

Brief History
- Started in 2005 by Doug Cutting (developer of Lucene & Nutch)
- Grew out of the need to scale the open-source Nutch search engine across many machines
- 2006: full backing from Yahoo, which hired Doug Cutting and a dedicated team
- 2008: promoted to an Apache top-level project
- Current release (as of April 2008): 0.16.3

Hadoop
- Written in Java
- Apache license
- Many components: HDFS, HBase, MapReduce, Hadoop On Demand (HOD), Streaming, HQL, Hama, Mahout, etc.
Hadoop Architecture
- Hadoop components and their Google counterparts:
  - Nutch: open-source search engine ↔ Google Search
  - MapReduce: distributed data processing ↔ Google MapReduce
  - HBase: distributed data store ↔ Bigtable
  - HDFS: distributed file system ↔ GFS

Commodity Server Cluster
- Both stacks run on clusters of commodity PC servers
- 2002 price comparison: the commodity cluster WINs
  - x86-based server: dual 2 GHz Intel Xeon CPUs, 2 GB RAM, 80 GB disk
  - 88 such servers = 176 2 GHz CPUs, 176 GB RAM, ~7 TB disk, at $278,000
  - High-end server: 8 2 GHz Intel Xeon CPUs, 64 GB RAM, 8 TB disk, at $758,000
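The rack totals above are simple arithmetic; a quick sanity check of the figures quoted on the slide:

```python
# 2002-era comparison: 88 commodity servers vs. one 8-way server.
# Per-machine specs are taken from the slide.
machines = 88
cpus_per_machine = 2           # dual 2 GHz Xeon
ram_gb_per_machine = 2
disk_gb_per_machine = 80

total_cpus = machines * cpus_per_machine          # 176 CPUs
total_ram_gb = machines * ram_gb_per_machine      # 176 GB
total_disk_tb = machines * disk_gb_per_machine / 1000  # ~7 TB

print(total_cpus, total_ram_gb, round(total_disk_tb, 2))
```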
Goals of HDFS
- Very large distributed file system: 10K nodes, 100 million files, 10 PB
- Assumes commodity hardware
  - Files are replicated to handle hardware failure
  - Detects failures and recovers from them
- Optimized for batch processing
  - Data locations are exposed so that computations can move to where the data resides
  - Provides very high aggregate bandwidth
Features of HDFS
- Single namespace for the entire cluster, managed by a single namenode
- Files are write-once
- Optimized for streaming reads of large files
- Files are broken into large blocks (typically 128 MB) and replicated to several datanodes for reliability
- Clients talk to both the namenode and the datanodes; data is not sent through the namenode
- File system throughput scales nearly linearly with the number of nodes
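As a rough illustration of the block layout described above, a minimal sketch (the 128 MB block size is from the slide; the replication factor of 3 and the file size are assumptions for the example):

```python
# Sketch: how a file is split into HDFS blocks and replicated.
# Block size follows the slide (128 MB); replication factor 3
# is an assumed default, not stated on the slide.
import math

BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total replicated storage in MB)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION

blocks, storage = hdfs_footprint(1000)  # a hypothetical 1 GB file
print(blocks, storage)  # 8 blocks, 3000 MB spread over the cluster
```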
HDFS Architecture
MapReduce
- Distributed processing framework, following the Google MapReduce model
  - map (k1, v1) → list (k2, v2)
  - reduce (k2, list (v2)) → list (v2)
- Designed for parallel processing of large data sets
  - Parallelization, fault tolerance, and data distribution are handled by the framework
- Applications: log analysis, search indexing, collaborative filtering, clustering, machine learning, data mining, etc.
- Features of Hadoop MapReduce:
  - Mapper locality
  - Overlap of map, shuffle, and sort
  - Speculative execution
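The map/reduce signatures above can be sketched as a minimal single-process simulation of the model (illustrative only; real Hadoop distributes these phases across machines):

```python
# Minimal MapReduce simulation: map -> shuffle/sort -> reduce,
# all on one machine. Function names are illustrative.
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """map (k1, v1) -> list(k2, v2); group by k2; reduce (k2, list(v2)) -> list(v2)."""
    # Map phase over all input records
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/sort: bring equal keys together
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k2, [v for _, v in group]))
    return output

# Example: counting words; map emits (word, 1), reduce sums the ones
result = run_mapreduce(
    [("doc1", "to be or not to be")],
    lambda k, v: [(w, 1) for w in v.split()],
    lambda k, vs: [(k, sum(vs))],
)
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```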
MapReduce Execution Flow
- map (k, v) → list (k', v')
- reduce (k', list (v')) → list (v'')
Logical Processing Flow of MapReduce
- Tasks are processed in parallel
- Source: Google slides (http://labs.google.com/papers/mapreduce-osdi04-slides/index.html)

WordCount MapReduce
(Map and Reduce code shown on the slide)
WordCount MapReduce: Data Flow
- Input files (copied from local disk into HDFS):
  - file01.txt: "Hello World Bye World"
  - file02.txt: "Hello Hadoop Goodbye Hadoop"
- The wordcount job is submitted to the JobTracker
- Map tasks (M) emit one (word, 1) pair per word, e.g. (Hello, 1), (World, 1), (Bye, 1), (World, 1), ...
- The sorter groups the pairs by word
- Reduce tasks (R) sum the counts: (Bye, 1), (Goodbye, 1), (Hadoop, 2), (Hello, 2), (World, 2)
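The flow on this slide can be reproduced with a small single-process sketch (plain Python standing in for the Hadoop job):

```python
# Reproduce the WordCount example above without Hadoop:
# map emits (word, 1) pairs, shuffle/sort groups them by word,
# reduce sums the counts per word.
files = {
    "file01.txt": "Hello World Bye World",
    "file02.txt": "Hello Hadoop Goodbye Hadoop",
}

# Map phase: one (word, 1) pair per word
pairs = [(word, 1) for text in files.values() for word in text.split()]

# Shuffle/sort: group pairs by key
pairs.sort()

# Reduce phase: sum the values for each key
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one

print(sorted(counts.items()))
# [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```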
Nutch: Search over Hadoop
- Crawling MapReduce (repeated until the target crawl depth is reached):
  - Injector seeds the crawl db with a seed URL list
  - Generator produces the fetch URL list
  - Fetcher downloads pages (segments)
  - ParseSegment extracts page metadata
  - CrawlDb is updated with the newly found URLs
- Indexing MapReduce:
  - LinkDb (invert) builds link lists
  - Indexer builds Lucene indexes
  - DeleteDuplicates removes duplicate pages
  - IndexMerge produces the merged index
- All intermediate and final data live in HDFS
Hadoop MapReduce Architecture
- A JobClient submits a job to the JobTracker, which keeps a job queue and the job's input list
- The JobTracker allocates tasks to TaskTrackers; the TaskTrackers report status via heartbeat
- TaskTrackers run the tasks (t1, t2, t3, ...) on nodes #1..#n
- Input is written to HDFS beforehand, so tasks read their data from the datanodes on the same cluster
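The task allocation step above, combined with the mapper-locality feature mentioned earlier, can be sketched as follows (the data structures are hypothetical simplifications, not Hadoop's actual scheduler):

```python
# Sketch of locality-aware task allocation: the JobTracker prefers
# to run a map task on a TaskTracker whose node already holds the
# task's input block, falling back to any free tracker otherwise.
def allocate(task_input_block, block_locations, free_trackers):
    """Pick a tracker holding the block if possible, else the first free one."""
    local = [t for t in free_trackers
             if t in block_locations.get(task_input_block, ())]
    return local[0] if local else free_trackers[0]

# block -> nodes holding a replica (hypothetical example data)
blocks = {"blk_1": {"node2", "node3"}}
print(allocate("blk_1", blocks, ["node1", "node2"]))  # node2 (data-local)
print(allocate("blk_1", blocks, ["node1"]))           # node1 (remote read)
```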
MapReduce: Usage in Google
- Statistics on MapReduce development and use
- Steady growth in the number of MapReduce programs
- MapReduce jobs act as software components
  → greater reuse improves development productivity
- Source: MapReduce: simplified data processing on large clusters (CACM 2008)
Real Indexing MapReduce in Google

Stolen from
Michael Kleber’s Presentation
Hadoop Ecosystem
- Hadoop core: HDFS, MapReduce, HBase, HOD, Streaming, Fuse-DFS, EC2 support
- Cascading: workflow management for Hadoop MapReduce
- NexR VCC: Hadoop on virtualization
- Parhely: ORM for HBase
- Yahoo Pig: query language interface on Hadoop
- Katta: distributed indexing with HDFS and MapReduce
- Yahoo ZooKeeper: distributed management
- Mahout & Hama: machine learning using Hadoop MapReduce
- IBM MapReduce Tools: Eclipse plug-in for MapReduce programs
- Facebook Hive: data warehousing on Hadoop
Use Cases
Powered by Hadoop

Complete List: http://wiki.apache.org/hadoop/PoweredBy


Yahoo: Hadoop Cluster
- ~20,000 machines running Hadoop
- The largest cluster is currently 2,000 nodes
- Several petabytes of user data (compressed, unreplicated)
- Runs hundreds of thousands of jobs every month
Yahoo: Running WebMap

Search needs a graph of the “known” web


Invert edges, compute link text, whole graph heuristics
Periodic batch job using Map/Reduce
Uses a chain of ~100 map/reduce jobs
Scale
100 billion nodes and 1 trillion edges
Largest shuffle is 450 TB
Final output is 300 TB compressed
Runs on 10,000 cores
Written mostly using Hadoop’s C++ interface
Yahoo: Research Cluster
- The grid team runs research clusters as a service to Yahoo researchers ("Analytics as a Service")
- Mostly data mining / machine learning jobs
- Most research jobs are *not* Java:
  - 42% Streaming: uses Unix text processing to define map and reduce
  - 28% Pig: higher-level dataflow scripting language
  - 28% Java
  - 2% C++
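A Streaming job of the kind counted above wires two line-oriented scripts together: the mapper and reducer read stdin and write tab-separated key/value lines, and Hadoop sorts the mapper output by key in between. A minimal word-count pair, written as plain functions so the pipeline can be simulated here (the function names are illustrative):

```python
# Hadoop Streaming-style word count, simulated in-process.
def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts for consecutive lines sharing the same key."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate the pipeline: mapper | sort | reducer
out = list(reducer(sorted(mapper(["hello world", "hello hadoop"]))))
print(out)  # ['hadoop\t1', 'hello\t2', 'world\t1']
```

In a real Streaming job the two scripts would be passed to Hadoop as the map and reduce commands, with the sort done by the framework.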
Facebook: Hadoop Cluster
- Production cluster
  - 2,400 cores, 300 machines, 16 GB per machine (Oct 2008)
  - 4,800 cores, 600 machines, 16 GB/8 GB per machine (Nov 2008)
  - 4 SATA disks of 1 TB each per machine
  - 2-level network hierarchy, 40 machines per rack
- Test cluster
  - 800 cores, 16 GB each
Facebook: HDFS Storage Cluster
- Single HDFS cluster across all cores
- Running 0.17.2 + patches
- Over 1.2 PB raw capacity
- Ingest rate is 2 TB of compressed user data per day (10 TB uncompressed)
- 10 million files
- NameNode on a 64-bit JVM with a 20 GB heap
Facebook: MapReduce Compute Clusters
- 3 static MapReduce clusters, running 0.17.2 + patches
  - Main cluster has 2,240 cores and serves most users
  - Ads cluster has 80 cores, dedicated to advertisement-related computations
  - Test cluster of 80 cores, for testing miscellaneous fixes
- JobTracker(s) on a 32-bit JVM with a 3 GB heap
Facebook: Data Warehousing
- Data flow: web servers → Scribe servers → filers → Hive on the Hadoop cluster → Oracle RAC and federated MySQL
Facebook: Hive/Hadoop Usage
- Types of applications:
  - Summarization, e.g. daily/weekly aggregations of impression/click counts, complex measures of user engagement
  - Ad hoc analysis, e.g. how many group admins, broken down by state/country
  - Data mining (assembling training data), e.g. user engagement as a function of user attributes
  - Spam detection: anomalous patterns in UGC, application API usage patterns
  - Ad optimization
  - Too many to count ..
Facebook: Hadoop Usage
- Data statistics:
  - Total data: 180 TB (mostly compressed)
  - Net data added per day: 2+ TB (compressed)
    - 6 TB of uncompressed source logs
    - 4 TB of uncompressed dimension data reloaded daily
- Usage statistics:
  - 3,200 jobs/day with 800K map-reduce tasks/day
  - 55 TB of compressed data scanned daily
  - 15 TB of compressed output data written to HDFS
  - 80 million compute minutes/day
New York Times
- Hadoop on Amazon Web Services
- Converted 11 million articles (1851-1980) from TIFF to PDF
- Amazon S3 and EC2 for the hardware, Hadoop for the software
- Flow: TIFF images (4 TB) on Amazon S3 → Hadoop MapReduce on Amazon EC2 (100 instances, custom AMI) → PDFs (1.5 TB) back to S3
Hadoop Cluster for Academia
- Google and IBM support university courses on distributed platforms
  - Provide clusters of hundreds of machines
  - UW, MIT, Stanford, UM, CMU, UC Berkeley
  - MapReduce programming is taught with Hadoop
- Yahoo! supports large-scale data processing research at CMU
  - M45 cluster: 4,000 processors, 3 TB memory, 1.5 PB disk
  - Search, information retrieval, natural language processing, machine translation, etc.
  - Research on distributed data processing using Hadoop
Classes using Hadoop
- UW: Problem Solving on Large Scale Clusters (University of Washington, Spring 2007)
- UCSD: Networked Services (UC San Diego, Fall 2007)
- MIT: MapReduce course (part of the 2008 Independent Activities Period at MIT)
- UM: Web-Scale Information Processing Applications (University of Maryland, Spring 2008)
- KAIST: Distributed Algorithms and Systems (KAIST, Fall 2008)
Thank You!
