Hadoop Cluster

[Diagram: Two masters, the Name Node and the Job Tracker, oversee slave servers that each run a Data Node and a Task Tracker. The cluster connects to the outside world through the network.]
[Diagram: The cluster at scale. Each of Racks 1 through N has a top-of-rack switch uplinked to the core. Rack 1 also hosts the Name Node, the Job Tracker, the Secondary Name Node, and a Client machine; every other slot in every rack holds a Data Node + Task Tracker (DN + TT) server.]
Typical Workflow

- Load data into the cluster (HDFS writes)
- Analyze the data (Map Reduce)
- Store results in the cluster (HDFS writes)
- Read the results from the cluster (HDFS reads)

Sample Scenario: How many times did our customers type the word "Fraud" into emails sent to customer service? The input, File.txt, is a huge file containing all of the emails sent to customer service.
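As a single-machine warm-up, the scenario's computation is just a word count; Map Reduce parallelizes exactly this over the blocks of File.txt. A minimal sketch (the sample emails and the case-sensitive matching are assumptions for illustration):

```python
# Count occurrences of the word "Fraud" in a file, one machine, no Hadoop.
# This is the computation the rest of the walkthrough distributes.

def count_word(lines, word="Fraud"):
    """Count case-sensitive, whole-word occurrences of `word`."""
    return sum(line.split().count(word) for line in lines)

emails = [
    "I want to report fraud",        # lowercase: not counted in this sketch
    "This charge is Fraud",
    "Fraud Fraud on my statement",
]
print(count_word(emails))  # 3
```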
BRAD HEDLUND .com
[Diagram: The Client writes File.txt as blocks A, B, and C to Data Nodes 1, 5, 6, ... N, coordinated by the Name Node.]

- Client consults the Name Node
- Client writes each block directly to one Data Node
- Data Nodes replicate the block
- The cycle repeats for the next block
[Diagram: The Name Node keeps two tables. The rack awareness table maps racks to nodes (Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Nodes 5, 6, 7). The metadata table maps files to block locations: File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9. The replicas sit in Racks 1, 5, and 9.]

- Never lose all copies of the data if an entire rack fails
- Keep bulky flows in-rack when possible
- Assumes in-rack links have higher bandwidth and lower latency
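The two tables above can be sketched as plain Python dicts to show how they combine (a toy model, not the Name Node's actual data structures):

```python
# The Name Node's two tables, per the slide, as plain dicts.

rack_awareness = {
    "Rack 1": ["DN1", "DN2", "DN3"],
    "Rack 5": ["DN5", "DN6", "DN7"],
}

metadata = {
    "File.txt": {
        "Blk A": ["DN1", "DN5", "DN6"],
        "Blk B": ["DN7", "DN1", "DN2"],
        "Blk C": ["DN5", "DN8", "DN9"],
    }
}

def racks_holding(block, filename="File.txt"):
    """Which racks hold at least one replica of a block?"""
    nodes = metadata[filename][block]
    return {rack for rack, members in rack_awareness.items()
            if any(n in members for n in nodes)}

# Blk A is safe from a single rack failure: two racks hold it.
print(sorted(racks_holding("Blk A")))  # ['Rack 1', 'Rack 5']
```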
[Diagram: Before writing Blk A of File.txt, the Client asks the Name Node where to put it. Consulting its rack awareness table (Rack 1: Data Node 1; Rack 5: Data Nodes 5 and 6), the Name Node returns those three nodes, and the Client confirms the pipeline: "Data Node 6 ready?" ... "Data Node 5 ready!"]

- Name Node picks two nodes in the same rack and one node in a different rack
- Data protection
- Locality for Map Reduce
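The placement rule can be sketched in a few lines (a toy model only: the real HDFS policy also weighs the writer's location, free disk space, and node load):

```python
import random

# Toy replica placement: two replicas in one rack, the third in a
# different rack, as described on the slide.

def place_replicas(racks):
    """racks: rack name -> list of Data Nodes. Returns three (rack, node) picks."""
    rack_a, rack_b = random.sample(sorted(racks), 2)
    two = random.sample(racks[rack_a], 2)   # two distinct nodes, same rack
    one = random.choice(racks[rack_b])      # one node, different rack
    return [(rack_a, two[0]), (rack_a, two[1]), (rack_b, one)]

racks = {"Rack 1": ["DN1", "DN2", "DN3"], "Rack 5": ["DN5", "DN6", "DN7"]}
for rack, node in place_replicas(racks):
    print(rack, node)
```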
Pipelined Write

[Diagram: The Client streams block A of File.txt to Data Node 1 in Rack 1, which forwards it to Data Node 5 in Rack 5, which forwards it to Data Node 6, following the placement the Name Node chose.]

- Data Nodes 1 and 5 pass the data along as it's received
- Replication traffic runs over TCP port 50010
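The pass-it-along behavior can be sketched with simple callbacks (a simulation only; real Data Nodes stream packets to each other over TCP port 50010):

```python
# Simulated pipelined write: each Data Node stores a packet and
# immediately forwards it downstream, so the Client sends the block once.

def data_node(name, downstream=None):
    """Return a receiver that stores packets and forwards them downstream."""
    stored = []
    def receive(packet):
        stored.append(packet)    # "write to local disk"
        if downstream:
            downstream(packet)   # forward as received
    receive.stored = stored
    receive.name = name
    return receive

dn6 = data_node("DN6")
dn5 = data_node("DN5", downstream=dn6)
dn1 = data_node("DN1", downstream=dn5)

for packet in ["A-part1", "A-part2", "A-part3"]:  # block A as packets
    dn1(packet)

print(dn1.stored == dn5.stored == dn6.stored)  # True: all replicas identical
```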
Pipelined Write

[Diagram: Once block A is written to Data Nodes 1, 2, and 3, each Data Node reports "Block received" back along the pipeline, and the Name Node records File.txt Blk A: DN1, DN2, DN3 in its metadata. The Client moves on to blocks B and C.]
[Diagram: The Client repeats the pipelined write for blocks B and C. When File.txt is fully written, its replicas are spread across Data Nodes in Racks 1, 4, and 5.]
[Diagram: Racks 1 through N, each with a top-of-rack switch and a row of Data Nodes.]

Factors: the number of blocks in File.txt. More blocks = wider spread.
[Diagram: The blocks of Results.txt (Blk A, Blk B, Blk C) scattered across Data Nodes in Racks 1 through N, with replicas of each block on different racks.]
[Diagram: Data Nodes 1 through N check in with the Name Node: "I'm alive!" and "I have blocks: A, C". The Name Node's file system metadata records File.txt = A, C and replies, "Awesome! Thanks."]

- Data Nodes send Heartbeats over TCP every 3 seconds
- Every 10th Heartbeat is a Block report
- Name Node builds its metadata from Block reports
- If the Name Node is down, HDFS is down
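The cadence above can be sketched as a generator (timing simulated, not real networking; the message text mimics the slide's speech bubbles):

```python
# Heartbeat cadence: one heartbeat every 3 seconds, every 10th one
# carrying a block report (so roughly one block report per 30 seconds).

HEARTBEAT_INTERVAL_S = 3
BLOCK_REPORT_EVERY = 10

def heartbeats(blocks, count):
    """Yield (time, message) tuples for `count` heartbeats from one Data Node."""
    for i in range(1, count + 1):
        t = i * HEARTBEAT_INTERVAL_S
        if i % BLOCK_REPORT_EVERY == 0:
            yield t, f"I'm alive! I have blocks: {', '.join(blocks)}"
        else:
            yield t, "I'm alive!"

for t, msg in heartbeats(["A", "C"], 10):
    print(f"t={t}s {msg}")   # the t=30s heartbeat is a Block report
```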
Rack Awareness

[Diagram: The rack awareness table (Rack 1: DN1, DN2; Rack 5: DN3; Rack 9: DN8) alongside Data Nodes 2, 3, and 8, each holding copies of blocks A and C.]

- Missing Heartbeats signify lost Nodes
- Name Node consults its metadata and finds the affected data
- Name Node consults the Rack Awareness script
- Name Node tells a Data Node to re-replicate
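The recovery step can be sketched as follows (a toy model; node names and block lists come from the slides, the function is illustrative):

```python
# When a Data Node's heartbeats stop, find every block it held; the
# surviving replicas become the sources for re-replication.

REPLICATION = 3

metadata = {  # block -> Data Nodes holding it
    "Blk A": ["DN1", "DN5", "DN6"],
    "Blk B": ["DN7", "DN1", "DN2"],
    "Blk C": ["DN5", "DN8", "DN9"],
}

def handle_dead_node(dead, metadata):
    """Remove the dead node and report blocks now under-replicated."""
    under = {}
    for block, nodes in metadata.items():
        if dead in nodes:
            nodes.remove(dead)
            under[block] = nodes[:]  # survivors that can source a new copy
    return under

print(handle_dead_node("DN1", metadata))
# Blk A and Blk B each have only 2 replicas left and need a 3rd
```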
Secondary Name Node

- Not a hot standby for the Name Node
- Connects to the Name Node every hour*
- Housekeeping and backup of Name Node metadata
- Saved metadata can rebuild a failed Name Node
[Diagram: The Client asks the Name Node for Results.txt. The metadata table lists Results.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9, with the replicas spread across Racks 1, 5, and 9.]

- Client receives a list of Data Nodes for each block
- Client picks the first Data Node on each list
- Client reads the blocks sequentially
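The client's read plan can be sketched directly from the metadata above (block locations copied from the slide; the function is illustrative):

```python
# Client read path: take the first Data Node on each block's list,
# then read the blocks in order.

block_locations = {  # returned by the Name Node, per the slide
    "Blk A": ["DN1", "DN5", "DN6"],
    "Blk B": ["DN7", "DN1", "DN2"],
    "Blk C": ["DN5", "DN8", "DN9"],
}

def read_plan(block_locations):
    """Pick the first Data Node for each block, reading sequentially."""
    return [(blk, nodes[0]) for blk, nodes in block_locations.items()]

for blk, dn in read_plan(block_locations):
    print(f"read {blk} from {dn}")
```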
[Diagram: A Data Node asks the Name Node for Block A and gets back the ordered list "Block A = 1, 5, 6". The Name Node builds the order from its rack awareness table (Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Node 5) and its metadata (File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9).]

- Name Node provides rack-local Nodes first
- Leverages in-rack bandwidth and a single switch hop
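The rack-local ordering can be sketched with a stable sort (a toy model; the node-to-rack mapping below is taken from the slides):

```python
# Rack-local first: when a Data Node reads a block, the Name Node lists
# replicas in the reader's own rack before replicas in other racks.

rack_of = {"DN1": "Rack 1", "DN2": "Rack 1", "DN3": "Rack 1",
           "DN5": "Rack 5", "DN6": "Rack 5", "DN8": "Rack 9"}

def order_replicas(replicas, reader_rack):
    """Stable-sort a replica list so the reader's in-rack nodes come first."""
    return sorted(replicas, key=lambda dn: rack_of[dn] != reader_rack)

print(order_replicas(["DN1", "DN5", "DN6"], reader_rack="Rack 5"))
# ['DN5', 'DN6', 'DN1'] - the in-rack replicas lead the list
```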
[Diagram: The Client submits the job to the Job Tracker. Map Tasks run on Data Nodes 1, 5, and 9, each scanning its local block of File.txt and producing a partial result: Fraud = 3, Fraud = 0, and Fraud = 11.]

- Map: "Run this computation on your local data"
- Job Tracker delivers the Java code to the Nodes that hold the data locally
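The Map step can be sketched per node (the block contents are made up so the partial counts match the slide's Fraud = 3, 0, and 11):

```python
# Map step: each Map Task scans only its node's local block of File.txt
# and emits a partial count.

def map_task(local_block, word="Fraud"):
    """Count occurrences of `word` in this node's block."""
    return sum(line.split().count(word) for line in local_block)

# hypothetical block contents, one block per Data Node
blocks = {
    "DN1": ["Fraud on my card", "please help", "this is Fraud Fraud"],
    "DN5": ["where is my order"],
    "DN9": ["Fraud"] * 11,
}
partials = {dn: map_task(block) for dn, block in blocks.items()}
print(partials)  # {'DN1': 3, 'DN5': 0, 'DN9': 11}
```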
[Diagram: The Map Task on Data Node 1 does not hold block A locally; it announces "I need block A" and pulls the block over the network, while the tasks on Data Nodes 5 and 9 proceed on local data (Fraud = 0, Fraud = 11).]

- Job Tracker tries to select a Node in the same rack as the data
- It relies on the Name Node's rack awareness to do so
[Diagram: The Map Tasks send their intermediate output (X, Y, Z) over the network to a Reduce Task on Data Node 3, which computes Fraud = 14 and writes Results.txt to HDFS.]

- Reduce: "Run this computation across the Map results"
- Map Tasks deliver their output data over the network
- Reduce Task output is written to, and read from, HDFS
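The Reduce step for this job is just a sum of the partial counts (the numbers come from the Map slide; in real Hadoop the result would be written to HDFS rather than printed):

```python
# Reduce step: combine the partial counts the Map Tasks sent over
# the network into the final answer.

def reduce_task(partial_counts):
    """Combine Map outputs into the final result."""
    return sum(partial_counts)

map_outputs = [3, 0, 11]   # Fraud counts from the three Map Tasks
print(f"Fraud = {reduce_task(map_outputs)}")  # Fraud = 14
```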
Unbalanced Cluster

[Diagram: Racks 1 and 2 are full of Data Nodes holding the blocks of File.txt; two newly added racks contain mostly empty Data Nodes.]

- Hadoop prefers local processing when possible
- New servers sit underutilized for Map Reduce and HDFS*
- You might see more network bandwidth used and slower job times**

* "I'm bored!"
** "I was assigned a Map Task but don't have the block. Guess I need to go get it."
Cluster Balancing

[Diagram: The same cluster after balancing: blocks of File.txt have been redistributed onto the new racks' Data Nodes.]

brad@cloudera-1:~$ hadoop balancer

- The Balancer utility (if used) runs in the background
- It does not interfere with Map Reduce or HDFS
- Default speed limit: 1 MB/s
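What the balancer decides can be sketched as a threshold check (a toy model: the utilization numbers are made up, and the 10% threshold mirrors the balancer's default; the real tool also schedules the actual block moves):

```python
# Balancer decision: flag nodes whose disk utilization is more than
# `threshold` above or below the cluster average; blocks then move
# from the over-utilized nodes toward the under-utilized ones.

def balance_targets(utilization, threshold=0.10):
    """utilization: node -> fraction of disk used. Returns (over, under)."""
    avg = sum(utilization.values()) / len(utilization)
    over = [n for n, u in utilization.items() if u > avg + threshold]
    under = [n for n, u in utilization.items() if u < avg - threshold]
    return over, under

use = {"DN1": 0.85, "DN2": 0.80, "new-DN1": 0.05, "new-DN2": 0.02}
print(balance_targets(use))  # (['DN1', 'DN2'], ['new-DN1', 'new-DN2'])
```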
Thanks!
Narrated at: http://bradhedlund.com/?p=3108