
Hadoop Distributed File System – HDFS

Big Data & Hadoop

© EduPristine – www.edupristine.com
Prerequisites

1. Java
• OOP concepts
• Serialization
• Data structures (HashMap, Lists)
• File I/O
2. Basic UNIX commands



Big Data Problem Context

The Big Data problem has two sides:

• Storage – Distributed File System (HDFS)
• Processing – MapReduce, Hive, Pig, Spark, . . .



Problem Statement

We are an analytics company and deal with multiple data sets from various clients and sources.
These data sets are very large and continue to grow day by day.
Our analytical platform also processes social media feeds, which our servers keep pulling in real time.
Altogether, the data size runs into terabytes.
• In some cases, a single file is more than 1 TB in size.
• We need a cost-effective storage solution. We cannot use tapes, as we need the data in hot storage – readily available.
• Some data is the outcome of very long-running, resource-intensive jobs and is mission critical.
• We also need easy scalability, so we can keep adding storage as we consume our existing capacity.
• We do not want to be locked in to a particular hardware type.
• We want to keep hardware cost to a minimum along with fault tolerance.
• We receive data from upstream systems over SFTP onto our shared file system. These drives are mounted on our main data server.
• The storage should also support further complex analytical processing.



Solution

We need to deal with:


• Large data sets, including files of more than 1 TB, which might be larger than any single disk a machine has
• Fault tolerance
• Cost-effective and scalable storage
• Storage that is readily accessible for further processing

Think of the options . . .
• Can this data be stored on one machine? Hard drives are roughly 1 TB in size. Even if you add external hard drives, you can't store data at the petabyte scale. And even if you could store it, you wouldn't be able to open or process such a file because of insufficient RAM, and analyzing it on a single machine would take months.
Do we need a Distributed File System?



HDFS Features

• Data is distributed over several machines and replicated, to ensure durability against failures and high availability to parallel applications
• Designed for very large files (GBs, TBs)
• Block oriented
• Unix-like command interface
• Write once, read many times
• Runs on commodity hardware
• Fault tolerant when nodes fail
• Scalable by adding new nodes



It is Not Designed for

• Small files.
• Multiple writers or arbitrary file modification -> writes are only supported at the end of a file; modifications cannot be made at random offsets.
• Low-latency data access -> HDFS is optimized for high throughput over huge amounts of data, which comes at the expense of the time taken to access individual records.



HDFS Design

• Individual files are broken into blocks of a fixed size (typically 64 MB, versus the 4 KB blocks of a conventional file system) and stored across a cluster of nodes, not necessarily on the same machine. A file can therefore be larger than any individual machine's hard drive (see the fsck example below).
• The individual machines are called Data Nodes.
• Access to a file therefore requires the cooperation of several nodes.
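
You can see how a file has been split into blocks with the fsck utility. A minimal example, assuming a file /user/hadoop/file1 already exists in HDFS (the path is only an illustration):

hdfs fsck /user/hadoop/file1 -files -blocks -locations

The output lists each block of the file, its size, and the Data Nodes holding its replicas.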



Distributed File System - HDFS



Hadoop Master-Slave Architecture



Function of Name Node

• The Name Node is the controller and manager of HDFS. It knows the status and the metadata of all the files in HDFS.
• The metadata consists of file names, permissions, and the locations of each block of every file.
• An HDFS cluster can be accessed concurrently by multiple clients, yet this metadata must never become inconsistent. That is why all of it is handled by a single machine.
• Since the metadata is typically small, it is all held in the Name Node's main memory, allowing fast access.



Name Node HA & Node Manager

The function of the Name Node is critical to the overall health of HDFS. If individual data nodes fail, HDFS can recover and keep functioning with slightly less capacity, but a crash of the Name Node can lose the metadata and, with it, the complete file system irrecoverably. That is why the metadata is kept small and the Name Node's involvement in data transfer is kept minimal.

High Availability (HA)
• Hadoop 1.0 has a single NameNode, so if it fails the whole cluster is affected until the NameNode is restarted. This is addressed by the introduction of HA from Hadoop 2.0 onwards.
• The HDFS High Availability feature addresses this problem by running two redundant NameNodes. A typical HA cluster therefore has two separate machines configured as NameNodes. At any point in time, one NameNode is in an Active state and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a slave, maintaining enough state to provide a fast failover if required.



HDFS High Availability



Purpose of Secondary Name Node

• It is not a backup of the Name Node, and data nodes do not connect to it. It is just a helper of the Name Node.
• It only performs periodic checkpoints.
• It communicates with the Name Node to take snapshots of the HDFS metadata.
• These snapshots help minimize downtime and loss of data.



Purpose of Resource Manager

• The new architecture of Hadoop splits the two major functions of the JobTracker – resource management and job life-cycle management – into separate components.
• The ResourceManager manages the global assignment of compute resources to applications.
• It also launches a per-application ApplicationMaster, which manages the application's scheduling and coordination.
• It has two main components: the Scheduler and the ApplicationsManager.
• The Scheduler is responsible for allocating resources to the various running applications.
• The ApplicationsManager is responsible for accepting job submissions and negotiating the first container for executing the application-specific ApplicationMaster.



Overall Function

• To open a file, the client contacts the Name Node and retrieves the list of locations of the blocks of that file. These locations identify the data nodes that hold the blocks. The client then reads the data directly from the data nodes, in parallel. The Name Node is not involved in this stage.
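
A minimal sketch of this read path using the HDFS Java API (covered later in this deck); the path /user/hadoop/file1 is only an illustration:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();              // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);                  // client view of the distributed file system
    try (InputStream in = fs.open(new Path("/user/hadoop/file1"))) {   // name node returns block locations
      IOUtils.copyBytes(in, System.out, 4096, false);      // bytes are streamed directly from the data nodes
    }
  }
}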



HDFS Read

The read path (step numbers as in the original diagram):

1. The client opens the file via the DistributedFileSystem object in the client JVM.
2. The DistributedFileSystem asks the NameNode for the locations of the file's blocks.
3. The client reads from the returned FSDataInputStream.
4–5. Data is read directly from the DataNodes holding the blocks.
6. The client closes the stream.



HDFS Write

The write path (step numbers as in the original diagram):

1. The client creates the file via the DistributedFileSystem object in the client JVM.
2. The DistributedFileSystem asks the NameNode to create the file in the namespace.
3. The client writes to the returned FSDataOutputStream.
4. Data packets are written to the first DataNode and forwarded along a pipeline of DataNodes, one per replica.
5. Acknowledgement packets flow back through the pipeline to the client.
6. The client closes the stream.
7. The NameNode is notified that the write is complete.



Replica Placement

Replica placement takes the cluster topology into account at three levels: node, rack, and data center. (Diagram: replicas of a block are spread across nodes and racks for fault tolerance.)



How does it Run

• HDFS files are not part of the ordinary file system: the Unix "ls" command will not list files stored in HDFS.
• Hadoop runs as a separate set of processes, in a namespace isolated from the normal file system of the OS.
• HDFS files/blocks are stored, under their block IDs, in a directory controlled by the DataNode. You can't interact with them using normal Unix commands.
• There are separate, Unix-like utilities that are used to access HDFS.
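
For example (directory names are only illustrative):

ls /home/hduser              # lists the local Linux file system only
hdfs dfs -ls /user/hduser    # lists the HDFS namespace via the HDFS utility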



Listing files in HDFS



Environment Needed for HADOOP

• The production and stable environment for Hadoop is Unix.


• You can run Hadoop on Windows as well, but you still need a Unix-like layer (e.g., Cygwin) to run the Hadoop processes. For this, Cygwin has to run as a service on Windows, which can be a problem if the user doesn't have admin rights.
• Since most people have a Windows OS, we will run Hadoop on a preconfigured VM (provided by Yahoo/CDH).



Changing Hadoop configuration files

All the files are in /usr/local/hadoop/etc/hadoop/


In hadoop-env.sh, add:
export JAVA_HOME=/usr/lib/jvm/jdk/<java version should be mentioned here>

In core-site.xml, add the following between the <configuration> tags:


<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

In yarn-site.xml, add the following between the <configuration> tags:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
Changing Hadoop configuration files Cont’d…

Copy mapred-site.xml.template to mapred-site.xml and add the following between the <configuration> tags:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

In hdfs-site.xml, add the following between the <configuration> tags:


<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>



Core-site.xml & hdfs-site.xml



HDFS Interfaces / Clients

• HTTP interfaces, via a browser


• The CLI (command-line) utility
• Programmatically, using Java
• Thrift APIs for other languages such as C
• The most common usage is via the command line and the Java APIs



HDFS HTTP Interface – Hue GUI



HDFS CLI



ls

ls

Usage: hdfs dfs -ls [-R] <args>

The -R option lists recursively through the directory structure.

For a file, ls returns stat on the file in the following format:
permissions - number_of_replicas - userid - groupid - filesize - modification_date - modification_time - filename

For a directory, it returns the list of its direct children, as in Unix. A directory is listed as:
permissions - userid - groupid - modification_date - modification_time - dirname

Example:

hdfs dfs -ls /user/hadoop/file1

Exit Code:
Returns 0 on success and -1 on error.



chown

chown

Usage: hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ]

Change the owner of files. The user must be a super-user. Additional information is in the
Permissions Guide.

The -R option will make the change recursively through the directory structure.

chmod

Usage: hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

Change the permissions of files. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions Guide.

The -R option will make the change recursively through the directory structure.
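
Example (the user, group, and path are only illustrative):

hdfs dfs -chown -R hduser:hadoop /user/hadoop/dir1
hdfs dfs -chmod -R 750 /user/hadoop/dir1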



mkdir

mkdir

Usage: hdfs dfs -mkdir [-p] <paths>

Takes path URIs as arguments and creates directories.

Options:

The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

hdfs dfs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error



put

put

Usage: hdfs dfs -put <localsrc> ... <dst>

Copies a single src, or multiple srcs, from the local file system to the destination file system. Also reads input from stdin and writes to the destination file system.

hdfs dfs -put localfile /user/hadoop/hadoopfile

hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir

hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile

hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.

Exit Code:

Returns 0 on success and -1 on error.



For the complete documentation of the HDFS shell commands, please visit:

https://hadoop.apache.org/docs/r2.6.3/hadoop-project-dist/hadoop-common/FileSystemShell.html



Implement Solution



Step 1: Review the dataset & decide the permissions

• Please refer to the ‘master-datasets’ directory provided in the LMS; let's assume this is the data set for the given problem statement.
• Below is the table of data sets and their owners:

Data Set                        Owner
bse-data-sets                   Team-A
flight_data                     Team-A, but Team-B needs read access
t-air-mobile-subscriber-data    Team-B
truck-drivers-datasets          Team-A & Team-B
twitter-dataset                 All

• For this case study, pick any 2 users from the classroom and consider them Team-A and Team-B. Get their usernames so we can change ownership, change permissions, etc.



Step 2: Simulate the environment as per the requirement

• Transfer the master-datasets directory to the Linux machine using WinSCP. Assume the Linux machine is the production data server mentioned in the problem statement.
• Move the files from Linux into HDFS (example commands below).
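
A possible sequence of commands, assuming master-datasets was copied to the home directory of user hduser (names and paths are only illustrative):

hdfs dfs -mkdir -p /data/master-datasets
hdfs dfs -put /home/hduser/master-datasets/* /data/master-datasets/
hdfs dfs -ls /data/master-datasets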



Step 3: Setup permissions

• Change the owner of the directories (chown)

• Change the R/W/X permissions on the directories (chmod)

See the example commands below.
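
For example, assuming the two classroom users are teama and teamb and the data was loaded under /data/master-datasets (all user, group, and path names are placeholders for your environment):

hdfs dfs -chown -R teama:teama /data/master-datasets/bse-data-sets
hdfs dfs -chown -R teama:teama /data/master-datasets/flight_data
hdfs dfs -chmod -R 750 /data/master-datasets/flight_data      # read access for Team-B, assuming teamb is added to the teama group
hdfs dfs -chown -R teamb:teamb /data/master-datasets/t-air-mobile-subscriber-data
hdfs dfs -chmod -R 775 /data/master-datasets/truck-drivers-datasets
hdfs dfs -chmod -R 755 /data/master-datasets/twitter-dataset  # readable by all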



Java API



Case Study 2: Develop an HDFS tool to find files modified X days ago

• Problem Statement: We need to run a scheduled job that keeps cleaning up files older than X days, where X is configurable. We do not want to wrap this up in a script that can accidentally be modified and lead to data loss.
• HDFS does not provide such a command out of the box.
• We would like to further enhance this tool to meet all our domain-specific requirements for data archival and management.



Solution

Let's develop an HDFS tool that can list files older than X days.

But we will not use shell scripts or any other scripting.

HDFS provides Java APIs!

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html
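
A minimal sketch of such a tool, assuming it is invoked with two arguments, the HDFS directory and the age threshold in days (the class name is illustrative and is not the provided ‘hdfsapi’ source):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OldFileFinder {
  public static void main(String[] args) throws Exception {
    Path dir = new Path(args[0]);                       // HDFS directory to scan
    long days = Long.parseLong(args[1]);                // X, configurable
    long cutoff = System.currentTimeMillis() - days * 24L * 60 * 60 * 1000;

    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.listStatus(dir)) {      // one directory level; recurse for sub-directories
      if (status.isFile() && status.getModificationTime() < cutoff) {
        System.out.println(status.getPath());           // candidate for archival or deletion
      }
    }
  }
}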



Build Utility

1. Develop
   1. Create a Maven project
   2. Code it out
2. Build & Deploy
   1. Build with all dependencies
   2. Transfer the jar to the Linux machine
3. Execute

• Refer: Project Source Code:



Build

• Execute “mvn package” in the project directory from your command prompt.



Once the build is successful:
• You should see that a new directory ‘target’ has been created. Inside it there is a jar file with a name like ‘*with-dependencies*’.



Transfer jar file to Linux machine

• Deploy/transfer the jar file to the Linux machine, e.g., with WinSCP or scp (see the example below).
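
For example, with scp (the exact jar name, host, and user are only illustrative):

scp target/hdfsapi-1.0-jar-with-dependencies.jar hduser@dataserver:/home/hduser/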



Execute

The program expects two parameters:


• The full path of the HDFS directory that needs to be scanned for files
• How old the files to find are, as a number of days

• Note: To find the full HDFS path, refer to the fs.defaultFS property value in /etc/hadoop/conf/core-site.xml.
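
An illustrative invocation, assuming the main class is set in the jar's manifest (the jar name, HDFS path, and day count are placeholders):

hadoop jar hdfsapi-1.0-jar-with-dependencies.jar hdfs://localhost:9000/data/master-datasets 30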



Source Code

• The complete source code is provided with the content. Please find ‘hdfsapi’.


• ‘hdfsapi’ is a Maven project and can easily be imported into Eclipse.
• Maven downloads all dependencies from the Maven repository, so an internet connection is required the first time to download them.



Thank You!

care@edupristine.com
www.edupristine.com

© EduPristine – www.edupristine.com
