2008. 11. 28
Jaesun Han (한재선), CEO of NexR
jshan0000@gmail.com
http://www.nexr.co.kr
Hadoop
Brief History
Started in 2005 by Doug Cutting (developer of Lucene & Nutch)
Grew out of the distributed-scaling issues of the Nutch open-source search engine
2006: full backing from Yahoo (Doug Cutting and a dedicated team hired)
2008: promoted to an Apache top-level project
Current release (as of April 2008): 0.16.3
Hadoop
Written in Java
Apache license
Many components:
HDFS, HBase, MapReduce, Hadoop On Demand (HOD), Streaming, HQL, Hama, Mahout, etc.
Hadoop Architecture
[Diagram: Google vs. the Nutch open-source search engine, a 2002 comparison]
MapReduce: Logical Processing Flow
Parallel processing of tasks
[Diagram: Map phase followed by Reduce phase]
WordCount MapReduce
[Diagram: WordCount dataflow. The input files are split across Map tasks (M), each emitting a (word, 1) pair per word; a Sorter groups the pairs by key; Reduce tasks (R) sum the counts per word; the JobTracker coordinates the job (wordcount). Final output: (Bye, 1), (Goodbye, 1), (Hadoop, 2), (Hello, 2), (World, 2)]
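The WordCount dataflow above can be simulated in a few lines. A minimal in-memory sketch (plain Python, not the Hadoop API) of the map, sort/shuffle, and reduce steps, using the diagram's example input:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit a (word, 1) pair for each word in one input line
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum the partial counts for a single word
    return (word, sum(counts))

def word_count(lines):
    # Shuffle/sort: group intermediate pairs by key, as the framework does
    pairs = sorted(kv for line in lines for kv in map_fn(line))
    return [reduce_fn(word, (v for _, v in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

result = word_count(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
# result matches the diagram: (Bye, 1), (Goodbye, 1), (Hadoop, 2), (Hello, 2), (World, 2)
```

On a real cluster each Map task processes one input split and each Reduce task one key range; the sort/shuffle step is performed by the framework, not by user code.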
Nutch: Search over Hadoop
[Diagram: Nutch crawl and index pipeline over HDFS. The CrawlDb drives the Fetcher, which writes fetched pages into segments; ParseSegment extracts page metadata; the CrawlDb is updated and the loop repeats until the target crawl depth is reached. The Indexer (Lucene) then builds indexes from the segments, DeleteDuplicates removes duplicate entries, and IndexMerger produces the merged index]
Hadoop MapReduce Architecture
[Diagram: job submission flow. A JobClient submits job #1 to the JobTracker, which places it in the job queue along with its input list; the JobTracker then allocates tasks to worker nodes in response to their heartbeat messages]
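Hadoop Streaming (listed among the components earlier) lets the mapper and reducer be any programs that read stdin and write tab-separated key/value lines to stdout; the JobTracker schedules them like any other tasks. A sketch of a Streaming-style wordcount pair in Python, driven in-process here for illustration (the function names and the in-memory driver are this sketch's own, not part of the Streaming API):

```python
import io
from itertools import groupby

def mapper(stdin, stdout):
    # Streaming mapper: emit one "key<TAB>value" line per word
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    # Streaming reducer: input arrives sorted by key; sum the values per key
    pairs = (line.rstrip("\n").split("\t") for line in stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        stdout.write(f"{word}\t{sum(int(v) for _, v in group)}\n")

# Simulate the framework: mapper -> sort by key -> reducer
source = io.StringIO("Hello World\nHello Hadoop\n")
intermediate = io.StringIO()
mapper(source, intermediate)
shuffled = io.StringIO("".join(sorted(intermediate.getvalue().splitlines(keepends=True))))
result = io.StringIO()
reducer(shuffled, result)
# result now holds one "word<TAB>count" line per distinct word
```

On a real cluster the same two scripts would be submitted with the bundled streaming jar, roughly `hadoop jar hadoop-streaming.jar -input <dir> -output <dir> -mapper mapper.py -reducer reducer.py` (paths illustrative).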
MapReduce Development and Usage Statistics
Growth in the number of MapReduce programs
[Chart taken from Michael Kleber's presentation]
Hadoop Ecosystem
Hadoop core: HDFS, MapReduce, HBase, HOD, Streaming, Fuse-DFS, EC2 support
Cascading: workflow management for Hadoop MapReduce
NexR VCC: Hadoop on virtualization
Parhely: ORM for HBase
Yahoo Pig: query language interface on Hadoop
Katta: distributed indexing with HDFS and MapReduce
Yahoo Zookeeper: distributed management
Mahout & Hama: machine learning using Hadoop MapReduce
IBM MapReduce Tools: Eclipse plug-in for MapReduce programs
Facebook Hive: data warehousing on Hadoop
Use Cases
Powered by Hadoop
Production cluster
2400 cores, 300 machines, 16GB per machine (Oct 2008)
4800 cores, 600 machines, 16GB/8GB per machine (Nov 2008)
4 SATA disks of 1 TB each per machine
2 level network hierarchy, 40 machines per rack
Test cluster
800 cores, 16GB each
Facebook: HDFS Storage Cluster
[Diagram: Filers and databases (Oracle RAC, federated MySQL) feeding the Hive on Hadoop cluster]
Facebook: Hive/Hadoop Usage
Types of applications:
Summarization
e.g. daily/weekly aggregations of impression/click counts
Complex measures of user engagement
Ad hoc analysis
e.g. how many group admins, broken down by state/country
Data mining (assembling training data)
e.g. user engagement as a function of user attributes
Spam detection
Anomalous patterns in UGC
Application API usage patterns
Ad optimization
Too many to count...
Facebook: Hadoop Usage
Data statistics:
Total Data: 180TB (mostly compressed)
Net Data added/day: 2+TB (compressed)
6TB of uncompressed source logs
4TB of uncompressed dimension data reloaded daily
Usage statistics:
3200 jobs/day, with 800K map-reduce tasks/day
55TB of compressed data scanned daily
15TB of compressed output data written to HDFS
80 million compute minutes/day
New York Times
Hadoop on Amazon EC2 (100 instances)
Hadoop Cluster for Academia