
Hadoop

Fasilkom/Pusilkom UI
(Credit: Samuel Louvan)
Outline
● Motivation - Hadoop

● Hadoop ecosystem

○ HDFS, MapReduce, Spark, Hive, Ambari, ZooKeeper, etc.

● HDFS Concept

○ HDFS Architecture

○ HDFS Read-Write Mechanism

○ HDFS Replication

○ Rack Awareness Algorithm


Data!

Storing big data was a problem

Processing it took “years”


Problem
● Data storage
○ Although the storage capacities of hard drives have increased massively over
the years, access speeds (the rate at which data can be read from drives) have
not kept up

● Data processing/analysis
○ Most analysis tasks need to be able to combine the data in some way, and
data read from one disk may need to be combined with data from any of
the other disks
Hadoop

Hadoop provides a reliable, scalable platform for storage and analysis. What’s
more, because it runs on commodity hardware and is open source, Hadoop is
affordable.
Hadoop characteristics
● Open source

● Reliable

● Scalable

● Distributed data replication

● Commodity hardware
Hadoop vs RDBMS
Brief History of Hadoop
Hadoop Ecosystem
HDFS
● Stores different types of datasets

● Creates a level of abstraction over the resources

● Has two core components: NameNode and DataNode


MapReduce
● Core component in Hadoop

● Two functions: Map() and Reduce()

● The Map function performs actions such as filtering, grouping, and sorting

● The Reduce function aggregates and summarizes the results produced by Map

● MapReduce programs can be implemented in various programming languages (a
word-count sketch follows below)
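
For illustration, a minimal word-count job written against the classic MapReduce
Java API. This is a sketch following the well-known Apache WordCount example, not
something taken from the slides; the input and output HDFS paths are passed as
arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(): emit a (word, 1) pair for every token in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce(): sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Map() tokenizes each line and emits (word, 1) pairs; Reduce() sums the pairs that
share the same word.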
YARN
● Introduced in Hadoop 2.0

● Performs all processing activities by allocating resources and scheduling tasks

● Two services: ResourceManager (master) and NodeManager (slave)


Apache PIG
● Uses the Pig Latin language

● 1 line of Pig Latin ≈ 100 lines of MapReduce code

● The Pig compiler converts Pig Latin into MapReduce jobs

● Platform for building data flows for ETL (Extract, Transform, and Load)
Apache Hive
● Developed by Facebook

● Uses HQL (Hive Query Language), a SQL-like query language


Apache Spark
● Framework for real-time data analytics

● Written in Scala, originally developed at UC Berkeley

● In-memory computation

● Can be up to 100x faster than Hadoop MapReduce for large-scale data processing
(the same word count is sketched below in Spark’s Java API)
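
As a rough contrast in verbosity, the same word count written with Spark’s Java
API (Spark 2.x style); the application name and input/output paths are
placeholders, not something from the slides.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark word count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile(args[0]);           // e.g. an HDFS path
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);                         // counts combined in memory

    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}

The reduceByKey step keeps intermediate results in memory across the cluster,
which is a large part of the speed-up over disk-based MapReduce.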
Apache HBase
● NoSQL database

● Modelled after Google’s BigTable


Apache ZooKeeper

● Coordinates various applications within the Hadoop ecosystem

● Performs synchronization, configuration maintenance, grouping, and naming
Apache Ambari
● Software for provisioning, managing, and monitoring
Apache Hadoop clusters

● Monitors health and status of the Hadoop cluster


Original Google Data Stack
Facebook Stack
Cloudera Stack
Hadoop Core Components

● Processing: MapReduce / YARN

● Storage: HDFS (Hadoop Distributed File System)
HDFS (Distributed File System)

● 1 machine, 4 I/O channels, each channel 100 MB/s: 43 minutes

● 10 machines, 4 I/O channels each, each channel 100 MB/s: 4.3 minutes
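
A rough back-of-the-envelope check, assuming the comparison refers to reading
about 1 TB of data (the data size itself is not stated on the slide):

1 machine:   1,000,000 MB / (4 × 100 MB/s) = 2,500 s ≈ 43 minutes
10 machines: each reads 1/10 of the data in parallel ≈ 250 s ≈ 4.3 minutes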
Hadoop 2.x Daemons
Hadoop 2.x core components and their daemons:

● HDFS: NameNode (master), DataNode (slave)

● YARN (MapReduce): ResourceManager (master), NodeManager (slave)


NameNode and DataNode
NameNode:

● Master daemon

● Maintains and manages the DataNodes

● Records metadata (e.g. file names, permissions, block locations)

● Receives heartbeats from the DataNodes

DataNode:

● Slave daemon

● Stores the actual data blocks

● Serves read and write requests from clients
Hadoop Cluster Architecture
Diagram: a master machine and racks of slave machines (Rack 1, Rack 2, ...), each
rack holding computers 1 through n, with the racks connected through core switches.
How is data stored in HDFS?
HDFS blocks
● A file is divided into HDFS blocks (128 MB by default), and each block is
replicated to multiple places

Example: 2016-dec.log (513 MB) is split into five blocks a, b, c, d, and e:
four blocks of 128 MB plus one block of 1 MB (513 = 4 × 128 + 1)
Why is a block in HDFS so large?

“The reason is to minimize the cost of seeks. If the block is large enough, the
time it takes to transfer the data from the disk can be significantly longer than
the time to seek to the start of the block. Thus, transferring a large file made of
multiple blocks operates at the disk transfer rate.”
HDFS Blocks
Let’s say we have a file of size 100 MB:

Store it in 1 block

vs

Store it in 100 blocks

(Fewer, larger blocks mean less NameNode metadata and more sequential reads; a
block only occupies as much disk space as the data it actually holds.)

Is it safe to have just 1 copy of each block?
HDFS Block Replication
A file of five blocks, each block replicated 3 times across four DataNodes:

● Node 1: Blocks 1, 3, 4, 5

● Node 2: Blocks 1, 2, 4, 5

● Node 3: Blocks 2, 3, 4

● Node 4: Blocks 1, 2, 3, 5

Every block is held by three different nodes (the default replication factor is 3),
so losing any single node loses no data.
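
As a small hedged sketch, the replication factor of an existing file can be
changed through the HDFS Java API; the file path below is a placeholder, and the
cluster-wide default is normally set with the dfs.replication property.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/logs/2016-dec.log");   // placeholder HDFS path
    // Ask HDFS to keep 3 replicas of every block of this file
    fs.setReplication(file, (short) 3);

    fs.close();
  }
}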
HDFS: Rack Awareness

Prefer within-rack transfers: traffic between nodes in the same rack is cheaper
than traffic that has to cross the core switches between racks.


Rack Awareness Algorithm
Diagram: three racks, Rack 1 (nodes 1-4), Rack 2 (nodes 5-8), and Rack 3 (nodes
9-12), showing where the replicas of Block A, Block B, and Block C are placed.

Default placement policy (replication factor 3): the first replica is written to
the local node, and the other two replicas are written to two different nodes in
one other rack, so no single rack failure can lose every copy of a block.
HDFS Read/Write Mechanism
HDFS Write Operation

● The client asks the NameNode which DataNodes should store the blocks of the new file

● The client streams each block to the first DataNode, which forwards it along a
pipeline of DataNodes until all replicas are written; acknowledgements flow back
to the client

HDFS Read

● The client asks the NameNode for the block locations of the file

● The client then reads each block directly from the closest DataNode that holds it
Using HDFS
● Command Line

● UI (e.g. Ambari)

● Java interface
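
A minimal sketch of the Java interface, assuming the cluster configuration
(core-site.xml / hdfs-site.xml) is on the classpath; the HDFS path and file
contents are placeholders.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // picks up the cluster settings
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/hello.txt"); // placeholder HDFS path

    // Write a small file (the NameNode chooses the DataNodes; blocks get replicated)
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back directly from the DataNodes
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}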
References
● Tom White. Hadoop: The Definitive Guide, 4th ed. O’Reilly, 2015

● https://www.edureka.co/blog/hadoop-tutorial/

● https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/

● Hadoop Big Data Tutorial

● Coursera Big Data

● Hadoop Official Website


Hands-on HDFS
