
Hadoop

Fasilkom/Pusilkom UI
(Credit: Samuel Louvan)
Outline
● Motivation - Hadoop

● Hadoop ecosystem

○ HDFS, MapReduce, Spark, Hive, Ambari, ZooKeeper, etc.

● HDFS Concept

○ HDFS Architecture

○ HDFS Read-Write Mechanism

○ HDFS Replication

○ Rack Awareness Algorithm


Data!

Storing big data was a problem

Processing it took “years”


Problem
● Data storage
○ Although the storage capacities of hard drives have increased massively over
the years, access speeds (the rate at which data can be read from drives) have
not kept up

● Data processing/analysis
○ Most analysis tasks need to be able to combine the data in some way, and
data read from one disk may need to be combined with data from any of
the other disks
Hadoop

Hadoop provides a reliable, scalable platform for storage and analysis. What’s
more, because it runs on commodity hardware and is open source, Hadoop is
affordable.
Hadoop characteristics
● Open source

● Reliable

● Scalable

● Distributed data replication

● Commodity hardware
Hadoop vs RDBMS
Brief History of Hadoop
Hadoop Ecosystem
HDFS
● Stores different types of datasets

● Creates a level of abstraction over the resources

● Has two core components: NameNode and DataNode


MapReduce
● Core component in Hadoop

● Two functions: Map() and Reduce()

● The Map function performs actions such as filtering, grouping, and sorting

● The Reduce function aggregates and summarizes the results produced by Map

● MapReduce programs can be implemented in various programming languages (a
word-count sketch follows below)
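
For illustration, a minimal word-count job written against the classic MapReduce
Java API. This is a sketch following the well-known Apache WordCount example, not
something taken from the slides; the input and output HDFS paths are passed as
arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(): emit a (word, 1) pair for every token in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce(): sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Map() tokenizes each line and emits (word, 1) pairs; Reduce() sums the pairs that
share the same word.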
YARN
● Introduced in Hadoop 2.0

● Performs all processing activities by allocating resources and scheduling tasks

● Two services: ResourceManager (master) and NodeManager (slave)


Apache PIG
● Uses the Pig Latin language

● 1 line of Pig Latin ≈ 100 lines of MapReduce code

● The Pig compiler converts Pig Latin into MapReduce jobs

● Platform for building data flows for ETL (Extract, Transform, and Load)
Apache Hive
● Developed by Facebook

● Uses HQL (Hive Query Language), a SQL-like query language


Apache Spark
● Framework for real-time data analytics

● Written in Scala, originally developed at UC Berkeley

● In-memory computation

● Can be up to 100x faster than Hadoop MapReduce for large-scale data processing
(the same word count is sketched below in Spark’s Java API)
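
As a rough contrast in verbosity, the same word count written with Spark’s Java
API (Spark 2.x style); the application name and input/output paths are
placeholders, not something from the slides.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark word count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile(args[0]);           // e.g. an HDFS path
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);                         // counts combined in memory

    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}

The reduceByKey step keeps intermediate results in memory across the cluster,
which is a large part of the speed-up over disk-based MapReduce.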
Apache HBase
● NoSQL database

● Modelled after Google’s BigTable


Apache ZooKeeper

● Coordinates various applications within the Hadoop ecosystem

● Performs synchronization, configuration maintenance, grouping, and naming
Apache Ambari
● Software for provisioning, managing, and monitoring
Apache Hadoop clusters

● Monitors health and status of the Hadoop cluster


Original Google Data Stack
Facebook Stack
Cloudera Stack
Hadoop Core Components

● Processing: MapReduce / YARN

● Storage: HDFS (Hadoop Distributed File System)
HDFS (Distributed File System)

● 1 machine, 4 I/O channels, each channel 100 MB/s: 43 minutes

● 10 machines, 4 I/O channels each, each channel 100 MB/s: 4.3 minutes
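
A rough back-of-the-envelope check, assuming the comparison refers to reading
about 1 TB of data (the data size itself is not stated on the slide):

1 machine:   1,000,000 MB / (4 × 100 MB/s) = 2,500 s ≈ 43 minutes
10 machines: each reads 1/10 of the data in parallel ≈ 250 s ≈ 4.3 minutes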
Hadoop 2.x Daemons
Hadoop 2.x core components and their daemons:

● HDFS: NameNode (master), DataNode (slave)

● YARN (MapReduce): ResourceManager (master), NodeManager (slave)


NameNode and DataNode
NameNode:

● Master daemon

● Maintains and manages the DataNodes

● Records metadata (e.g. file names, permissions, block locations)

● Receives heartbeats from the DataNodes

DataNode:

● Slave daemon

● Stores the actual data blocks

● Serves read and write requests from clients
Hadoop Cluster Architecture
Diagram: a master machine and racks of slave machines (Rack 1, Rack 2, ...), each
rack holding computers 1 through n, with the racks connected through core switches.
How is data stored in HDFS?
HDFS blocks
● A file is divided into HDFS blocks (128 MB by default), and each block is
replicated to multiple places

Example: 2016-dec.log (513 MB) is split into five blocks a, b, c, d, and e:
four blocks of 128 MB plus one block of 1 MB (513 = 4 × 128 + 1)
Why is a block in HDFS so large?

“The reason is to minimize the cost of seeks. If the block is large enough, the
time it takes to transfer the data from the disk can be significantly longer than
the time to seek to the start of the block. Thus, transferring a large file made of
multiple blocks operates at the disk transfer rate.”
HDFS Blocks
Let’s say we have a file of size 100 MB:

Store it in 1 block

vs

Store it in 100 blocks

(Fewer, larger blocks mean less NameNode metadata and more sequential reads; a
block only occupies as much disk space as the data it actually holds.)

Is it safe to have just 1 copy of each block?
HDFS Block Replication
A file of five blocks, each block replicated 3 times across four DataNodes:

● Node 1: Blocks 1, 3, 4, 5

● Node 2: Blocks 1, 2, 4, 5

● Node 3: Blocks 2, 3, 4

● Node 4: Blocks 1, 2, 3, 5

Every block is held by three different nodes (the default replication factor is 3),
so losing any single node loses no data.
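
As a small hedged sketch, the replication factor of an existing file can be
changed through the HDFS Java API; the file path below is a placeholder, and the
cluster-wide default is normally set with the dfs.replication property.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/logs/2016-dec.log");   // placeholder HDFS path
    // Ask HDFS to keep 3 replicas of every block of this file
    fs.setReplication(file, (short) 3);

    fs.close();
  }
}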
HDFS: Rack Awareness

Prefer within-rack transfers: traffic between nodes in the same rack is cheaper
than traffic that has to cross the core switches between racks.


Rack Awareness Algorithm
Diagram: three racks, Rack 1 (nodes 1-4), Rack 2 (nodes 5-8), and Rack 3 (nodes
9-12), showing where the replicas of Block A, Block B, and Block C are placed.

Default placement policy (replication factor 3): the first replica is written to
the local node, and the other two replicas are written to two different nodes in
one other rack, so no single rack failure can lose every copy of a block.
HDFS Read/Write Mechanism
HDFS Write Operation

● The client asks the NameNode which DataNodes should store the blocks of the new file

● The client streams each block to the first DataNode, which forwards it along a
pipeline of DataNodes until all replicas are written; acknowledgements flow back
to the client

HDFS Read

● The client asks the NameNode for the block locations of the file

● The client then reads each block directly from the closest DataNode that holds it
Using HDFS
● Command Line

● UI (e.g. Ambari)

● Java interface
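
A minimal sketch of the Java interface, assuming the cluster configuration
(core-site.xml / hdfs-site.xml) is on the classpath; the HDFS path and file
contents are placeholders.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // picks up the cluster settings
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/hello.txt"); // placeholder HDFS path

    // Write a small file (the NameNode chooses the DataNodes; blocks get replicated)
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back directly from the DataNodes
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}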
References
● Tom White. Hadoop: The Definitive Guide, 4th ed. O’Reilly, 2015

● https://www.edureka.co/blog/hadoop-tutorial/

● https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/

● Hadoop Big Data Tutorial

● Coursera Big Data

● Hadoop Official Website


Hands-on HDFS
