© EduPristine – www.edupristine.com
© EduPristine For [BDH – HDFS ] (Confidential)
Prerequisites
1. Java
• OOP concepts
• Serialization
• Data Structures (HashMap, Lists)
• File I/O
2. Basic UNIX Commands
Storage Processing
We are an analytics company and deal with multiple data sets from various clients and sources.
These data sets are very large and continue to grow day by day.
In our analytical platform we also process social media feeds, which our servers pull in
real time.
Altogether, the data runs into terabytes.
In some cases, a single file is larger than 1 TB.
We need a cost-effective storage solution. We cannot use tapes, as we need the data in hot
storage, readily available.
Some data is mission critical and is the outcome of very long-running, resource-intensive jobs.
We also need easy scalability, so we can keep adding storage as we use up our
capacity.
We do not want to be locked in to a particular hardware type.
We want to keep hardware cost to a minimum while retaining fault tolerance.
We receive data from upstream systems over SFTP onto our shared file system. These drives are
mounted on our main data server.
Also, the storage should support further complex analytical processing.
Think of Options . . .
Can this data be stored on one machine? Hard drives are approximately 1 TB in size. Even if you
add external hard drives, you cannot store petabytes. And even if you could store the data that
way, you would not be able to open or process a file of that size with the available RAM, and
analyzing the data would take months.
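As a back-of-the-envelope check (assuming roughly 10 TB of data and a sequential read rate of about 100 MB/s, a typical figure for a single spinning disk; both numbers are assumptions, not from this text):

```shell
# Time just to scan 10 TB from a single disk at ~100 MB/s
DATA_MB=$((10 * 1024 * 1024))   # 10 TB expressed in MB
RATE_MB_S=100                   # assumed sequential read rate, MB/s
SECS=$((DATA_MB / RATE_MB_S))
echo "one machine: $((SECS / 3600)) hours"
echo "100 machines in parallel: $((SECS / 100 / 60)) minutes"
```

Scanning in parallel across many machines is exactly what a distributed file system makes possible.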
Do we need a Distributed File System ?
Data is distributed over several machines and replicated to ensure durability against failure and
high availability to parallel applications.
Designed for very large files (in GBs, TBs)
Block oriented
Unix-like command interface
Write once and read many times
Commodity hardware
Fault Tolerant when nodes fail
Scalable by adding new nodes
Where HDFS is a poor fit:
Small files -> each file's metadata is held in the Name Node's memory, so huge numbers of small
files exhaust it.
Multiple writers, arbitrary file modification -> writes are only supported at the end of a file;
modifications cannot be made at random offsets.
Low-latency data access -> HDFS is optimized for high throughput over huge amounts of data,
which comes at the expense of the time taken to access any individual piece of it.
Individual files are broken into blocks of a fixed size (typically 64 MB, versus the 4 KB blocks of an
ordinary block-structured file system) and stored across a cluster of nodes (not necessarily on the
same machine). A file can therefore be larger than any individual machine's hard drive.
Individual machines are called Data Nodes.
So access to a file requires cooperation of several nodes.
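For intuition, the number of 64 MB blocks a single 1 TB file breaks into (and hence the number of block locations the Name Node must track for that one file) can be sketched as:

```shell
FILE_MB=$((1024 * 1024))        # a 1 TB file, expressed in MB
BLOCK_MB=64                     # block size used in this text
echo $((FILE_MB / BLOCK_MB))    # prints 16384 blocks, spread across the data nodes
```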
Name Node is controller and manager of HDFS. It knows the status and the metadata of all the
files in HDFS.
The metadata consists of file names, permissions, and the locations of each block of every file.
An HDFS cluster can be accessed concurrently by multiple clients, yet this metadata must never
become desynchronized. Hence, all of it is handled by a single machine.
Since metadata is typically small, all this info. is stored in main memory of Name Node, allowing
fast access to metadata.
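Because every file and block object lives in the Name Node's heap, the metadata footprint can be estimated. A commonly quoted rule of thumb (an assumption here, not a figure from this text) is on the order of 150 bytes of heap per file object and per block object:

```shell
# Rough Name Node heap for 10 million single-block files,
# assuming ~150 bytes per file object and ~150 bytes per block object
FILES=10000000
HEAP_BYTES=$((FILES * (150 + 150)))
echo "$((HEAP_BYTES / 1024 / 1024)) MB of Name Node heap"
```

This is also why huge numbers of small files are a poor fit for HDFS.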
The Name Node's function is critical to the overall health of HDFS. If individual data nodes fail,
HDFS can recover and keep functioning with slightly less capacity; a crash of the Name Node,
however, can lose the metadata and with it the complete file system, irrecoverably. That is why the
metadata is kept small and the Name Node's involvement in data transfer is kept minimal.
High Availability (HA)
In Hadoop 1.0 there is a single NameNode, so if it fails the whole cluster is affected until the
NameNode is restarted. This is taken care of by the introduction of HA from Hadoop 2.0 onwards.
The HDFS High Availability feature addresses the above problems by providing the option of
running two redundant Name Nodes. Therefore, a typical HA cluster is created in two separate
machines, each configured as a NameNode. At any point in time, one of the NameNodes is in
an Active state and the other is in a Standby state. The Active NameNode is responsible for all
client operations in the cluster, while the Standby simply acts as a slave, maintaining enough
state to provide a fast failover if required.
Secondary Name Node
It is not a backup of the Name Node, nor do data nodes connect to it. It is just a helper of the
Name Node.
It only performs periodic checkpoints.
It communicates with the Name Node to take snapshots of the HDFS metadata.
These snapshots help minimize downtime and loss of data.
YARN
The new architecture of Hadoop divides the two major functions of the JobTracker, resource
management and job life-cycle management, into separate components.
The ResourceManager manages the global assignment of compute resources to applications.
The per-application ApplicationMaster manages the application's scheduling and coordination.
It has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications.
The ApplicationsManager is responsible for accepting job submissions and negotiating the first
container for executing the application-specific ApplicationMaster.
To open a file, the client contacts the Name Node and retrieves the list of locations of the blocks of
that file. These locations identify the data nodes that hold the blocks. The client then reads the data
directly from the data nodes in parallel; the Name Node is not involved in this stage.
[Diagram: anatomy of an HDFS write. The HDFS client, running in a client JVM on the Client Node,
1: creates the file through the Distributed File System, which 2: creates it on the NameNode; 3: the
client writes through the FSDataOutputStream, which 4-5: streams packets to, and collects
acknowledgements from, a pipeline of DataNodes; 6: the client closes the stream and 7: the
Distributed File System signals completion to the NameNode.]
[Diagram: replica placement is rack aware, spreading copies across Node, Rack, and Data Center.]
HDFS files are not part of the ordinary file system; the local "ls" command will not list files stored
in HDFS. Hadoop runs as a separate process, in a namespace isolated from the normal file system
of the OS.
HDFS blocks are stored, under their block IDs, in a directory controlled by the Data Node; you
cannot interact with them using normal Unix commands.
There are separate, Unix-like utilities that can be used to access HDFS.
The MapReduce framework is switched to YARN in mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
ls
Usage: hdfs dfs -ls [-R] <args>
The -R option lists files recursively through the directory structure.
For a file, ls returns stat on the file in the following format:
permissions - number_of_replicas - userid - groupid - filesize - modification_date -
modification_time - filename
For a directory, it returns the list of its direct children, as in Unix. A directory is listed as:
permissions - userid - groupid - modification_date - modification_time - dirname
Example:
hdfs dfs -ls /user/hadoop/file1
Exit Code:
Returns 0 on success and -1 on error.
chown
Usage: hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
Change the owner of files. The user must be a super-user. Additional information is in the
Permissions Guide.
The -R option makes the change recursively through the directory structure.
chmod
Usage: hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Change the permissions of files. The user must be the owner of the file, or else a super-user.
Additional information is in the Permissions Guide.
The -R option makes the change recursively through the directory structure.
mkdir
Usage: hdfs dfs -mkdir [-p] <paths>
Takes path URIs as arguments and creates directories.
Options:
The -p option behaves much like Unix mkdir -p, creating parent directories along the path.
Example:
hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
Exit Code:
Returns 0 on success and -1 on error.
put
Usage: hdfs dfs -put <localsrc> ... <dst>
Copy a single src, or multiple srcs, from the local file system to the destination file system. Also
reads input from stdin and writes to the destination file system.
Example:
hdfs dfs -put localfile /user/hadoop/hadoopfile
Exit Code:
Returns 0 on success and -1 on error.
https://hadoop.apache.org/docs/r2.6.3/hadoop-project-dist/hadoop-common/FileSystemShell.html
Please refer to the 'master-datasets' directory provided in the LMS; let us assume this is the data
set for the given problem statement.
Below is the table of data sets and their owners:
Data Set Owner
bse-data-sets Team-A
flight_data Team-A, but Team-B needs read access
t-air-mobile-subscriber-data Team-B
truck-drivers-datasets Team-A & Team-B
twitter-dataset All
For this case study, pick any two users from the classroom and consider them Team-A and Team-B.
Get their usernames so we can change ownership, change permissions, etc.
Transfer master-datasets to the Linux machine using WinSCP. Assume the Linux machine is the
production data server mentioned in the problem statement.
Move the files to HDFS from Linux.
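One way the table above could translate into ownership and mode bits. Here teamA and teamB are placeholder usernames and /data is an assumed HDFS base path, so adapt both to your cluster. The octal semantics are identical to local Unix chmod, which this sketch verifies on a temporary local directory:

```shell
# 700: owner only           -> bse-data-sets (Team-A)
# 750: owner rwx, group r-x -> flight_data (Team-B gets read access via group)
# 770: owner and group rwx  -> truck-drivers-datasets (Team-A and Team-B)
# 755: world readable       -> twitter-dataset (all)
d=$(mktemp -d)
chmod 750 "$d"
stat -c '%a' "$d"    # prints 750
rmdir "$d"

# The equivalent HDFS commands (run on the cluster, not locally):
# hdfs dfs -chown -R teamA:teamA /data/bse-data-sets
# hdfs dfs -chmod -R 700         /data/bse-data-sets
# hdfs dfs -chown -R teamA:teamB /data/flight_data
# hdfs dfs -chmod -R 750         /data/flight_data
# hdfs dfs -chmod -R 755         /data/twitter-dataset
```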
Problem Statement: We need to run a scheduled job which keeps cleaning files older than X
days, where X is configurable. We do not want to wrap this up in a script, which could accidentally
be modified and lead to data loss.
HDFS does not provide such a command out of the box.
We would like to further enhance this tool to deliver all our domain-specific requirements for data
archival and management.
Let's develop an HDFS tool which can list files older than X days.
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html
1. Develop
1. Develop a maven project
2. Code it out
2. Build & Deploy
1. Build with all dependencies
2. Transfer jar to Linux machine
3. Execute
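Before coding the Java tool, the "older than X days" filter itself is easy to sanity-check against a local directory; `find -mtime +X` expresses the same predicate the tool will evaluate over HDFS file modification times. The files below are temporary ones created just for the demonstration:

```shell
X_DAYS=7
dir=$(mktemp -d)
touch -d '10 days ago' "$dir/old.log"   # last modified 10 days ago
touch "$dir/new.log"                    # last modified now
# -mtime +N matches files last modified more than N*24 hours ago
find "$dir" -type f -mtime +"$X_DAYS"   # prints only .../old.log
rm -r "$dir"
```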
Note: To find the full HDFS path, refer to /etc/hadoop/conf/core-site.xml for the fs.defaultFS
property value.
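For reference, the relevant fragment of core-site.xml looks like the following (the host and port are placeholders; your cluster's actual values will differ):

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>
```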
care@edupristine.com
www.edupristine.com