
Understanding Hadoop Clusters and the Network

Part 1. Introduction and Overview


Brad Hedlund
http://bradhedlund.com http://www.linkedin.com/in/bradhedlund @bradhedlund

BRAD HEDLUND .com

Hadoop Server Roles

Clients

Distributed Data Analytics: Map Reduce (Job Tracker)
Distributed Data Storage: HDFS (Name Node, Secondary Name Node)

masters: Job Tracker, Name Node, Secondary Name Node
slaves: Data Node & Task Tracker (on every slave server)

Hadoop Cluster

[Figure: Racks 1, 2, 3, 4 ... N, each with its own switch, linked through upstream switches to the World. Nearly every server runs a Data Node and Task Tracker (DN + TT); the Name Node, Job Tracker, Secondary NN, and Client each take a server in the racks as well.]

Typical Workflow
Load data into the cluster (HDFS writes)
Analyze the data (Map Reduce)
Store results in the cluster (HDFS writes)
Read the results from the cluster (HDFS reads)

Sample Scenario:
How many times did our customers type the word "Fraud" into emails sent to customer service?
[Figure: File.txt, a huge file containing all emails sent to customer service]

Writing files to HDFS

Client: "I want to write Blocks A, B, C of File.txt"
Name Node: "OK. Write to Data Nodes 1, 5, 6"
[Figure: the Client holds File.txt split into Blk A, Blk B, and Blk C; Data Node 1 (Blk A), Data Node 5 (Blk B), Data Node 6 (Blk C) ... Data Node N]

Client consults Name Node
Client writes block directly to one Data Node
Data Nodes replicate block
Cycle repeats for next block
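The write cycle above can be sketched as a loop over blocks. This is a minimal model under stated assumptions, not the HDFS client API: `write_file`, `ask_name_node`, and `send_block` are hypothetical names, and the RPCs to the Name Node and Data Nodes are replaced by plain callables. The placements in the usage example reuse the block metadata from the rack-awareness slide.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for RPCs: (file name, block index) -> target Data Nodes,
# and (first Data Node, block bytes, remaining targets) -> None.
AskNameNode = Callable[[str, int], List[str]]
SendBlock = Callable[[str, bytes, List[str]], None]


def write_file(name: str, blocks: List[bytes], ask_name_node: AskNameNode,
               send_block: SendBlock) -> List[Tuple[int, List[str]]]:
    """Write a file one block at a time, following the slide's cycle."""
    placements: List[Tuple[int, List[str]]] = []
    for index, data in enumerate(blocks):
        # 1. Client consults the Name Node for this block's targets.
        targets = ask_name_node(name, index)
        # 2. Client writes the block directly to the FIRST Data Node only,
        #    handing over the rest of the list so the Data Nodes replicate it.
        send_block(targets[0], data, targets[1:])
        placements.append((index, targets))
        # 3. The loop repeats for the next block.
    return placements


if __name__ == "__main__":
    plan = {0: ["dn1", "dn5", "dn6"], 1: ["dn7", "dn1", "dn2"], 2: ["dn5", "dn8", "dn9"]}
    contacted: List[str] = []
    write_file("File.txt", [b"A", b"B", b"C"],
               ask_name_node=lambda _file, i: plan[i],
               send_block=lambda node, _data, _rest: contacted.append(node))
    print(contacted)  # ['dn1', 'dn7', 'dn5']
```

Only the first Data Node in each list hears from the Client; copying to the remaining nodes is left to the Data Nodes themselves.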

Hadoop Rack Awareness: Why?

[Figure: Racks 1, 5, and 9, each behind its own switch, with Data Nodes holding replicas of blocks A, B, and C; the Name Node keeps the rack map and the block metadata.]

Name Node rack-aware map:
Rack 1: Data Node 1, Data Node 2, Data Node 3
Rack 5: Data Node 5, Data Node 6, Data Node 7

Name Node metadata:
File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9

Never lose all data if an entire rack fails
Keep bulky flows in-rack when possible
Assumption that in-rack is higher bandwidth, lower latency
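Hadoop learns the rack map from an administrator-supplied topology script (configured with `topology.script.file.name` in Hadoop 1.x and `net.topology.script.file.name` in later releases). Hadoop runs the script with host names or IP addresses as arguments and reads back one rack path per argument. The sketch below assumes a hypothetical `RACKS` table mirroring the slide's map; `spans_multiple_racks` is an illustrative helper, not part of Hadoop, that checks the "never lose all data if an entire rack fails" goal for one block.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Hadoop rack-awareness (topology) script.

Hadoop runs the script with one or more host names or IP addresses as
arguments and reads back one rack path per argument, in the same order.
"""
import sys
from typing import Dict, List

# Hypothetical host-to-rack table mirroring the slide's rack-aware map.
RACKS: Dict[str, str] = {
    "dn1": "/rack1", "dn2": "/rack1", "dn3": "/rack1",
    "dn5": "/rack5", "dn6": "/rack5", "dn7": "/rack5",
}
DEFAULT_RACK = "/default-rack"  # rack used for hosts the table does not know


def resolve(hosts: List[str]) -> List[str]:
    """Map each host to its rack path, falling back to the default rack."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


def spans_multiple_racks(replica_hosts: List[str]) -> bool:
    """Illustrative check (not part of Hadoop): would one rack failure spare a copy?"""
    return len(set(resolve(replica_hosts))) > 1


if __name__ == "__main__":
    # Emit the rack paths as whitespace-separated tokens on standard output.
    print(" ".join(resolve(sys.argv[1:])))
```

With the slide's metadata, Blk A (DN1, DN5, DN6) and Blk B (DN7, DN1, DN2) both span two racks, so losing either rack still leaves a copy.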

Preparing HDFS writes

Client: "I want to write File.txt Block A"
Name Node: "OK. Write to Data Nodes 1, 5, 6"
Client to Data Node 1: "Ready Data Nodes 5, 6?"
Data Node 1 to Data Node 5: "Ready Data Node 6?"
Data Node 5 to Data Node 6: "Ready?"
Data Node 6, then Data Node 5, then Data Node 1: "Ready!"

Name Node rack-aware map:
Rack 1: Data Node 1
Rack 5: Data Node 5, Data Node 6

[Figure: the Client holds File.txt (Blk A, Blk B, Blk C); Data Node 1 sits in Rack 1 and Data Nodes 5 and 6 in Rack 5, each rack behind its own switch.]

Name Node picks two nodes in the same rack, one node in a different rack
Data protection
Locality for M/R
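The "Ready?" exchange can be modelled as a recursive chain in which each Data Node asks the next one before answering upstream. This is a toy sketch: `setup_pipeline` and the `alive` dictionary are hypothetical, and a real HDFS pipeline is set up over TCP connections between Data Nodes. The sketch simply reports failure; it does not model how a real client recovers from a node that is not ready.

```python
from typing import Dict, List


def setup_pipeline(pipeline: List[str], alive: Dict[str, bool], log: List[str]) -> bool:
    """Ask each Data Node in turn; a node answers "Ready!" only after its successor has."""
    if not pipeline:
        return True  # end of the chain: nobody left downstream to ask
    head, rest = pipeline[0], pipeline[1:]
    log.append(f"Ready {head}?" + (f" (then {', '.join(rest)})" if rest else ""))
    if not alive.get(head, False):
        return False  # this node cannot join, so the whole pipeline is not ready
    ready = setup_pipeline(rest, alive, log)  # head asks the next node first
    if ready:
        log.append(f"{head}: Ready!")
    return ready
```

For the pipeline `["dn1", "dn5", "dn6"]` the "Ready!" answers come back in the order dn6, dn5, dn1, matching the slide.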

Pipelined Write

[Figure: the Client sends Blk A of File.txt to Data Node 1 (Rack 1), which forwards it to Data Node 5, which forwards it to Data Node 6 (both in Rack 5).]

Name Node rack-aware map:
Rack 1: Data Node 1
Rack 5: Data Node 5, Data Node 6

Data Nodes 1 & 5 pass data along as it is received
TCP 50010
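Passing data along "as it is received" is what makes the write pipelined: a Data Node forwards each packet downstream before the rest of the block has arrived. The Python generators below sketch that behaviour under simplifying assumptions: disks are in-memory buffers, the network (TCP 50010 on the slide) is replaced by generator hand-offs, and the 4-byte packet size is purely illustrative. All function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def packets(block: bytes, size: int) -> Iterator[bytes]:
    """Cut a block into fixed-size packets (the last one may be shorter)."""
    for start in range(0, len(block), size):
        yield block[start:start + size]


def data_node(name: str, upstream: Iterable[bytes], disks: Dict[str, bytearray],
              events: List[str]) -> Iterator[bytes]:
    """Store each packet locally, then pass it downstream straight away."""
    disk = disks.setdefault(name, bytearray())
    for seq, packet in enumerate(upstream):
        disk.extend(packet)              # local write (an in-memory buffer here)
        events.append(f"{name}:{seq}")   # note when this node handled packet `seq`
        yield packet                     # forward before the rest of the block arrives


def pipelined_write(block: bytes, pipeline: List[str],
                    packet_size: int = 4) -> Tuple[Dict[str, bytearray], List[str]]:
    disks: Dict[str, bytearray] = {}
    events: List[str] = []
    stream: Iterable[bytes] = packets(block, packet_size)
    for node in pipeline:                # Client -> first node -> second -> third
        stream = data_node(node, stream, disks, events)
    for _ in stream:                     # the last node's output drives the whole chain
        pass
    return disks, events
```

For an 8-byte block and the pipeline `["dn1", "dn5", "dn6"]`, the recorded order is `dn1:0, dn5:0, dn6:0, dn1:1, dn5:1, dn6:1`: packet 0 reaches the last node before the first node has even received packet 1.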

Pipelined Write

[Figure: Data Node 1 (Rack 1) and Data Nodes 2 and 3 (Rack 5) each hold a copy of Blk A; "Block received" flows back to the Client, which reports "Success" to the Name Node.]

Name Node rack-aware map:
Rack 1: Data Node 1
Rack 5: Data Node 2, Data Node 3

Name Node metadata:
File.txt Blk A: DN1, DN2, DN3
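The completion step can be sketched as acknowledgements travelling back up the pipeline, after which the Client reports success and the Name Node records where the block lives. This follows the message flow drawn on the slide and is not HDFS's actual wire protocol; `ack_order`, `finish_block`, and the `stored` flags are hypothetical.

```python
from typing import Dict, List


def ack_order(pipeline: List[str]) -> List[str]:
    """'Block received' acknowledgements travel upstream: last Data Node first."""
    return list(reversed(pipeline))


def finish_block(metadata: Dict[str, Dict[str, List[str]]], file_name: str,
                 block: str, pipeline: List[str], stored: Dict[str, bool]) -> bool:
    """Report success for a block only if every Data Node in the pipeline stored it."""
    for node in ack_order(pipeline):
        if not stored.get(node, False):
            return False  # no complete ack reaches the Client, so no "Success"
    # Client tells the Name Node "Success"; the Name Node records the locations.
    metadata.setdefault(file_name, {})[block] = list(pipeline)
    return True
```

With all three nodes storing Blk A, the metadata ends up as `{"File.txt": {"A": ["dn1", "dn2", "dn3"]}}`, the same shape as the slide's "File.txt Blk A: DN1, DN2, DN3".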

Multi-block Replication Pipeline

[Figure: the Client sends Blk A, Blk B, and Blk C of File.txt into separate replication pipelines across Data Nodes 1, 2, 3, W, X, Y, and Z in Racks 1, 4, and 5.]

1TB File = 3TB storage, 3TB network traffic
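The 1 TB to 3 TB arithmetic follows from counting hops: with a replication factor of 3, every replica is stored once and every replica crosses the network once (Client to first Data Node, then Data Node to Data Node). A small sketch, assuming the Client is not itself a Data Node (if it were, the first copy would be a local write); the function names are hypothetical.

```python
from typing import List, Tuple


def replication_hops(pipeline: List[str]) -> List[Tuple[str, str]]:
    """Each hop that carries one full copy: Client -> first node, then node -> node."""
    senders = ["client"] + pipeline[:-1]
    return list(zip(senders, pipeline))


def replication_cost(file_tb: float, replication: int = 3) -> Tuple[float, float]:
    """Return (storage, network traffic) in TB for writing one file."""
    hops = replication_hops([f"dn{i}" for i in range(1, replication + 1)])
    storage_tb = file_tb * replication   # every replica lands on some disk
    network_tb = file_tb * len(hops)     # every hop moves one full copy
    return storage_tb, network_tb
```

`replication_cost(1)` returns `(3, 3)`: 3 TB of storage and 3 TB of network traffic for a 1 TB file.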

Client Writes Span the HDFS Cluster

[Figure: a Client writing File.txt, whose blocks land on Data Nodes spread across Racks 1, 2, 3, 4 ... N.]

Factors:
Block size
File size

More blocks = Wider spread
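How wide a file spreads follows from how many blocks it has. A minimal sketch, assuming the 64 MB default block size of Hadoop 1.x (Hadoop 2.x defaults to 128 MB); `block_count` and `max_nodes_touched` are hypothetical helpers, and the second one is only an upper bound, since replicas of different blocks can share a node.

```python
def block_count(file_bytes: int, block_bytes: int = 64 * 1024 * 1024) -> int:
    """Blocks needed for a file; only the last block may be partly full."""
    return -(-file_bytes // block_bytes)  # ceiling division on integers


def max_nodes_touched(file_bytes: int, cluster_nodes: int, replication: int = 3,
                      block_bytes: int = 64 * 1024 * 1024) -> int:
    """Upper bound on how many Data Nodes one file's replicas can land on."""
    return min(cluster_nodes, block_count(file_bytes, block_bytes) * replication)
```

A 1 GB file is 16 blocks at 64 MB but only 8 at 128 MB, so the smaller block size lets up to 48 replicas (instead of 24) spread over the cluster.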

Data Node writes span itself and other racks

[Figure: a Data Node in Rack 1 writing Results.txt (Blk A, Blk B, Blk C) keeps one replica of each block locally; the other two replicas of each block go to a pair of Data Nodes in another rack (Racks 2, 3, 4 ... N).]
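The placement shown here (first replica on the writing Data Node, the other two together on one other rack) can be sketched as below. It is modelled on HDFS's documented default placement policy but is not Hadoop's code: `choose_targets` is a hypothetical function, random choice stands in for the real node-selection logic, and fallbacks for tiny clusters are simplified.

```python
import random
from typing import Dict, List, Optional


def choose_targets(racks: Dict[str, str], writer: Optional[str] = None,
                   replication: int = 3,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Pick replica targets: writer first, then two nodes on one other rack."""
    if replication < 1:
        return []
    rng = rng or random.Random()
    everyone = sorted(racks)
    chosen: List[str] = []

    def pick(candidates: List[str]) -> Optional[str]:
        free = [n for n in candidates if n not in chosen]
        return rng.choice(free) if free else None

    # Replica 1: the writing Data Node itself, or any node for an outside client.
    first = writer if writer in racks else pick(everyone)
    if first is None:
        return chosen
    chosen.append(first)
    if replication >= 2:
        # Replica 2: a node on a different rack (any node if there is only one rack).
        second = pick([n for n in everyone if racks[n] != racks[first]]) or pick(everyone)
        if second is not None:
            chosen.append(second)
    if replication >= 3 and len(chosen) == 2:
        # Replica 3: another node on replica 2's rack, keeping that hop in-rack.
        third = pick([n for n in everyone if racks[n] == racks[chosen[1]]]) or pick(everyone)
        if third is not None:
            chosen.append(third)
    while len(chosen) < replication:
        extra = pick(everyone)  # any further replicas: any unused node
        if extra is None:
            break
        chosen.append(extra)
    return chosen
```

The result always covers two racks, which gives the data protection of the rack-awareness slide while keeping the second-to-third pipeline hop inside one rack.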

Name Node

Data Nodes: "I'm alive!" and "I have blocks: A, C"
Name Node: "Awesome! Thanks."

Name Node metadata:
DN1: A, C; DN2: A, C; DN3: A, C
File system: File.txt = A, C

[Figure: Data Nodes 1, 2, 3 ... N, each holding blocks A and C, reporting to the Name Node.]

Data Node sends Heartbeats
Every 10th heartbeat is a Block report
Name Node builds metadata from Block reports
TCP every 3 seconds
If Name Node is down, HDFS is down
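How the Name Node builds its block map purely from what Data Nodes report can be sketched as a small class. `NameNodeSketch` and its `dead_after` timeout are hypothetical; only the idea that heartbeats mark a node as alive and block reports say which blocks it holds comes from the slide.

```python
from typing import Dict, List, Set


class NameNodeSketch:
    """Toy Name Node: block locations come only from Data Node block reports."""

    def __init__(self, dead_after: float = 30.0) -> None:
        self.dead_after = dead_after              # hypothetical timeout, in seconds
        self.last_seen: Dict[str, float] = {}     # Data Node -> time of last heartbeat
        self.reported: Dict[str, Set[str]] = {}   # Data Node -> blocks it says it holds

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now                # "I'm alive!"

    def block_report(self, node: str, blocks: List[str], now: float) -> None:
        self.heartbeat(node, now)                 # a report also proves the node is alive
        self.reported[node] = set(blocks)         # "I have blocks: A, C"

    def live_nodes(self, now: float) -> Set[str]:
        return {n for n, t in self.last_seen.items() if now - t <= self.dead_after}

    def locations(self, now: float) -> Dict[str, List[str]]:
        """Rebuild block -> Data Nodes metadata from the live nodes' reports."""
        where: Dict[str, List[str]] = {}
        for node in sorted(self.live_nodes(now)):
            for block in sorted(self.reported.get(node, ())):
                where.setdefault(block, []).append(node)
        return where
```

After DN1, DN2, and DN3 each report blocks A and C, `locations` returns `{"A": ["DN1", "DN2", "DN3"], "C": ["DN1", "DN2", "DN3"]}`; a node whose heartbeats stop drops out of that map once the timeout passes.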

Re-replicating missing replicas

"Uh Oh! Missing replicas"

Name Node metadata:
DN1: A, C; DN2: A, C; DN3: A, C

Rack Awareness:
Rack1: DN1, DN2; Rack5: DN3; Rack9: DN8

Name Node to Data Node 1: "Copy blocks A, C to Node 8"
[Figure: Data Nodes 1, 2, and 3 hold blocks A and C; Data Node 8 receives new copies of A and C.]

Missing Heartbeats signify lost Nodes
Name Node consults metadata, finds affected data
Name Node consults Rack Awareness script
Name Node tells a Data Node to re-replicate
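The four steps above can be sketched as a planner that turns "which nodes are lost" into copy orders. This is a minimal sketch: `plan_rereplication` is hypothetical, it prefers a rack that holds no surviving replica, and it ignores the throttling and prioritisation a real Name Node applies. The usage scenario assumes Data Node 3 is the lost node, since the slide does not say which one failed.

```python
from typing import Dict, List, Set, Tuple


def plan_rereplication(locations: Dict[str, List[str]], dead: Set[str],
                       racks: Dict[str, str],
                       replication: int = 3) -> List[Tuple[str, str, str]]:
    """Return (block, source, target) copy orders for under-replicated blocks."""
    orders: List[Tuple[str, str, str]] = []
    for block, holders in sorted(locations.items()):
        live = [n for n in holders if n not in dead]
        if not live:
            continue  # every replica is gone: nothing left to copy from
        candidates = [n for n in sorted(racks) if n not in dead and n not in live]
        for _ in range(replication - len(live)):
            if not candidates:
                break
            used_racks = {racks[n] for n in live}
            # Prefer a node on a rack that holds no live replica yet.
            fresh = [n for n in candidates if racks[n] not in used_racks]
            target = (fresh or candidates)[0]
            orders.append((block, live[0], target))  # ask a surviving holder to copy
            live.append(target)
            candidates.remove(target)
    return orders
```

With DN3 lost, the surviving copies of A and C sit only in Rack 1, so the planner asks Data Node 1 to copy both blocks to Data Node 8 in Rack 9, the same instruction shown on the slide.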

Secondary Name Node

Name Node file system metadata: File.txt = A, C
Secondary Name Node: "It's been an hour, give me your metadata"

Not a hot standby for the Name Node
Connects to Name Node every hour*
Housekeeping, backup of Name Node metadata
Saved metadata can rebuild a failed Name Node
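The "housekeeping" is a checkpoint: in Hadoop 1.x the Secondary Name Node fetches the Name Node's saved namespace image and its log of later edits, replays the edits over the image, and hands back a fresh image (the hourly interval matches the `fs.checkpoint.period` default of 3600 seconds). The sketch below shows only the replay idea; the dictionary image, the `(operation, file, blocks)` edit tuples, and the operation names are hypothetical, not Hadoop's on-disk formats.

```python
from typing import Dict, List, Tuple

Image = Dict[str, List[str]]         # file name -> block list (hypothetical format)
Edit = Tuple[str, str, List[str]]    # (operation, file name, blocks)


def checkpoint(image: Image, edits: List[Edit]) -> Image:
    """Replay the edit log over the last image to produce a new image."""
    merged = {name: list(blocks) for name, blocks in image.items()}  # leave input intact
    for op, name, blocks in edits:
        if op == "create":
            merged[name] = list(blocks)
        elif op == "append":
            merged.setdefault(name, []).extend(blocks)
        elif op == "delete":
            merged.pop(name, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return merged
```

Because the merged image already contains every logged change, a failed Name Node can be rebuilt from it without replaying a long edit history.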

Client reading files from HDFS

Client: "Tell me the block locations of Results.txt"
Name Node: "Blk A = 1,5,6  Blk B = 8,1,2  Blk C = 5,8,9"

Name Node metadata:
Results.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9

[Figure: Rack 1 (Data Nodes 1 and 2), Rack 5 (Data Nodes 5 and 6), and Rack 9 (Data Nodes 8 and 9), each behind its own switch, holding replicas of blocks A, B, and C.]

Client receives Data Node list for each block
Client picks first Data Node for each block
Client reads blocks sequentially
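The read loop can be sketched as follows, using the block lists from the Name Node's reply. `read_file` and `fetch` are hypothetical; falling back to the next Data Node on the list when a read fails is an assumption added for illustration and is not stated on the slide.

```python
from typing import Callable, Dict, List, Optional


def read_file(block_ids: List[str], locations: Dict[str, List[str]],
              fetch: Callable[[str, str], bytes]) -> bytes:
    """Read blocks in order, each from the first Data Node on its list."""
    out = bytearray()
    for block in block_ids:                    # client reads blocks sequentially
        last_error: Optional[OSError] = None
        for node in locations[block]:          # first Data Node in the list first
            try:
                out.extend(fetch(node, block))
                break
            except OSError as err:             # assumption: try the next replica
                last_error = err
        else:
            raise OSError(f"no replica of {block} could be read") from last_error
    return bytes(out)
```

With the lists `A = 1,5,6`, `B = 8,1,2`, `C = 5,8,9`, a healthy cluster is read from Data Nodes 1, 8, and 5 in that order.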

Data Node reading files from HDFS

Data Node: "Tell me the locations of Block A of File.txt"
Name Node: "Block A = 1,5,6"

Name Node rack-aware map:
Rack 1: Data Node 1, Data Node 2, Data Node 3
Rack 5: Data Node 5

Name Node metadata:
File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9

[Figure: Racks 1, 5, and 9, each behind its own switch, with Data Nodes holding replicas of blocks A, B, and C.]

Name Node provides rack-local Nodes first
Leverage in-rack bandwidth, single hop
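"Rack-local first" amounts to sorting the replica list by network distance from the reader. Hadoop's network topology code counts hops in a rack/node tree, giving 0 for the same node, 2 for the same rack, and 4 for a different rack in the same data center; the sketch below reproduces only that two-level case. `distance`, `sort_by_proximity`, and the default-rack fallback are assumptions for illustration.

```python
from typing import Dict, List

DEFAULT_RACK = "/default-rack"  # assumed rack for hosts missing from the map


def distance(a: str, b: str, racks: Dict[str, str]) -> int:
    """Hops in a two-level rack/node tree: 0 same node, 2 same rack, 4 off-rack."""
    if a == b:
        return 0
    same_rack = racks.get(a, DEFAULT_RACK) == racks.get(b, DEFAULT_RACK)
    return 2 if same_rack else 4


def sort_by_proximity(reader: str, replicas: List[str], racks: Dict[str, str]) -> List[str]:
    """Closest replicas first; sorted() is stable, so ties keep their original order."""
    return sorted(replicas, key=lambda node: distance(reader, node, racks))
```

For a reader on Data Node 2 in Rack 1, Block A's replicas come back as DN1, DN5, DN6: the in-rack copy first, a single switch hop away.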

Data Processing: Map

Client: "How many times does "Fraud" appear in File.txt?"
Job Tracker: "Count "Fraud" in Block C" (and likewise for the other blocks)
[Figure: the Client, Job Tracker, and Name Node; Map Tasks on Data Nodes 1, 5, and 9 each process a block of File.txt locally.]
Map Task results: Fraud = 3, Fraud = 0, Fraud = 11

Map: Run this computation on your local data
Job Tracker delivers Java code to Nodes with local data
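The per-block computation each Map Task runs can be sketched as a plain function. This shows only the counting logic, not Hadoop's Java `Mapper` API; `map_count_fraud` is a hypothetical name, and a case-sensitive whole-word match on "Fraud" is an assumption about what "count Fraud" means.

```python
import re
from typing import Tuple


def map_count_fraud(block_text: str, word: str = "Fraud") -> Tuple[str, int]:
    """Map step: count whole-word occurrences of `word` in one local block."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return word, len(pattern.findall(block_text))
```

Each Map Task returns one `(word, count)` pair for its block; only these small results, not the blocks themselves, need to travel onward to the Reduce step.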

What if data isn't local?

Client: "How many times does "Fraud" appear in File.txt?"
Job Tracker: "Count "Fraud" in Block C" (and likewise for the other blocks)
Data Node 1: no Map tasks left
Data Node 2: "I need block A"
Map Tasks on Data Nodes 5 and 9: Fraud = 0, Fraud = 11

[Figure: the Client, Job Tracker, and Name Node above Racks 1, 5, and 9, each behind its own switch; Data Node 2 shares Rack 1 with Data Node 1.]

Job Tracker tries to select Node in same rack as data
Name Node rack awareness
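The Job Tracker's preference order (a node holding the block, then a node in the same rack, then anywhere) can be sketched as follows. `pick_task_node` and the free-slot counts are hypothetical and the real scheduler is more involved. In an assumed scenario where Block A sits on DN1, DN5, and DN6 and none of them has a free Map slot, it chooses Data Node 2 in Rack 1, as on the slide.

```python
from typing import Dict, List, Optional, Set


def pick_task_node(block_holders: List[str], free_slots: Dict[str, int],
                   racks: Dict[str, str]) -> Optional[str]:
    """Prefer a node holding the block, then a node in the same rack, then anyone."""
    has_slot = [n for n in sorted(free_slots) if free_slots[n] > 0]
    local = [n for n in has_slot if n in block_holders]
    if local:
        return local[0]                                   # data-local: no network read
    holder_racks: Set[str] = {racks[n] for n in block_holders if n in racks}
    rack_local = [n for n in has_slot if racks.get(n) in holder_racks]
    if rack_local:
        return rack_local[0]                              # rack-local: one switch hop
    return has_slot[0] if has_slot else None              # off-rack, or nothing free
```

A rack-local choice still costs a network read, but it stays behind a single rack switch instead of crossing the cluster core.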

Data Processing: Reduce

[Figure: Map Tasks on Data Nodes 1, 5, and 9 send their counts over the network to a Reduce Task on Data Node 3, which sums "Fraud" for the Job Tracker and Client and writes Results.txt (Fraud = 14; blocks X, Y, Z) to HDFS.]

Reduce: Run this computation across Map results
Map Tasks deliver output data over the network
Reduce Task data output written to and read from HDFS
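The Reduce step can be sketched as grouping the Map outputs by key and summing each group. The function names are hypothetical and this is not Hadoop's `Reducer` API; the usage below reuses the three Map results from the previous slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shuffle(map_outputs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group Map outputs by key, as they arrive over the network at the Reduce Task."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in map_outputs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_sum(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: add up the per-block counts for one key."""
    return key, sum(values)


if __name__ == "__main__":
    grouped = shuffle([("Fraud", 3), ("Fraud", 0), ("Fraud", 11)])
    print(reduce_sum("Fraud", grouped["Fraud"]))  # ('Fraud', 14)
```

The result, Fraud = 14, is what the Reduce Task writes to HDFS as Results.txt.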

Unbalanced Cluster

[Figure: Racks 1 and 2 full of Data Nodes holding File.txt's blocks; two NEW racks of Data Nodes, each behind a NEW switch, hold little or no data.]

New Data Node: *"I'm bored!"
New Data Node: **"I was assigned a Map Task but don't have the block. Guess I need to get it."

Hadoop prefers local processing if possible
New servers underutilized for Map Reduce, HDFS*
Might see more network bandwidth, slower job times**

Cluster Balancing

[Figure: the same cluster as before, Racks 1 and 2 plus the two NEW racks, holding File.txt's blocks.]

brad@cloudera-1:~$ hadoop balancer

Balancer utility (if used) runs in the background
Does not interfere with Map Reduce or HDFS
Default speed limit 1 MB/s
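The balancer's goal can be sketched as comparing each Data Node's utilization (used / capacity) with the cluster-wide figure: nodes more than a threshold above it are sources for block moves, nodes more than the threshold below it are destinations. `hadoop balancer -threshold` sets that threshold, 10 percent by default. The function names below are hypothetical, and `hours_to_move` only illustrates what the 1 MB/s default speed limit implies for large moves.

```python
from typing import Dict


def classify(used: Dict[str, float], capacity: Dict[str, float],
             threshold_pct: float = 10.0) -> Dict[str, str]:
    """Label each Data Node relative to the cluster-wide utilization."""
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    labels: Dict[str, str] = {}
    for node in sorted(capacity):
        node_pct = 100.0 * used.get(node, 0.0) / capacity[node]
        if node_pct > cluster_pct + threshold_pct:
            labels[node] = "over"       # candidate source for block moves
        elif node_pct < cluster_pct - threshold_pct:
            labels[node] = "under"      # candidate destination
        else:
            labels[node] = "balanced"
    return labels


def hours_to_move(gigabytes: float, mb_per_second: float = 1.0) -> float:
    """Time to move data at a fixed bandwidth cap (1 MB/s per the slide)."""
    return gigabytes * 1024 / mb_per_second / 3600
```

Two full old nodes (70% used) next to two empty new ones give a cluster average of 35%, so the old nodes are "over" and the new ones "under". At 1 MB/s, moving 100 GB takes about 28 hours, which is why the balancer stays in the background.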

Thanks!
Narrated at: http://bradhedlund.com/?p=3108
