
What is HDFS used for?

Hadoop Distributed File System (HDFS) is used for storing structured and
unstructured data in a distributed manner using commodity hardware.

What is Hadoop Distributed File System and what are its components?
Hadoop HDFS is a distributed file-system that stores data on commodity
machines, providing very high aggregate bandwidth across the cluster.
Components of HDFS:
HDFS comprises three important components: NameNode, DataNode and
Secondary NameNode.
HDFS operates on a master-slave architecture model in which the NameNode
acts as the master node and keeps track of the storage cluster, while the
DataNodes act as slave nodes running on the various machines within
a Hadoop cluster.

What are the NameNode and DataNode in HDFS?


NameNode:

It maintains and manages the slave nodes and assigns tasks to them.
It stores only the metadata of HDFS. The NameNode executes file system
namespace operations such as opening, closing, and renaming files and directories.
Replication factor details are also maintained on the NameNode. This metadata is
kept in memory on the master for faster retrieval of data.

DataNode:

This is the daemon that runs on the slave machines; these are the actual worker
nodes that store the data. DataNodes are the slaves deployed on each machine
and provide the actual storage. DataNodes are responsible for serving read and
write requests from file system clients, and they also perform block creation,
deletion, and replication upon instruction from the NameNode.
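As a rough illustration of this split of responsibilities, the small HDFS Java client sketch below asks the NameNode for a file's metadata and block locations, while the actual bytes stay on the DataNodes (the path /data/file.txt is only a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                               // talks to the NameNode
        FileStatus status = fs.getFileStatus(new Path("/data/file.txt"));   // metadata only
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {                                    // which DataNodes hold each block
            System.out.println("offset " + b.getOffset() + " -> " + String.join(",", b.getHosts()));
        }
    }
}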

Why does Hadoop use a file system for storage?


HDFS is built to support applications with large data sets, including individual
files that reach into the terabytes. A distributed file system on commodity
hardware is a more affordable way to handle such huge amounts of data.

Where does the NameNode keep its metadata? Does it store it in HDFS,
on the local file system, or in memory, and why?

In Hadoop, the NameNode has two types of files:


1) edit logs files
2) FsImage files
These files are kept on the NameNode's disk (persistent storage).
When the NameNode starts, the latest fsimage file is loaded into memory, and
the edit log file is also loaded into memory if the fsimage does not contain
up-to-date information. The transactions recorded in the edit log(s) are then
replayed to update the in-memory fsimage data. What information is kept in
the edit log and fsimage is described below.
The NameNode keeps its metadata in memory in order to serve multiple client
requests as fast as possible. If the metadata were not kept in memory, then
for every operation the NameNode would have to load the metadata from disk
into memory and then perform its various checks on it. This would cost disk
seeks on every operation (reading from and writing to disk is a time-consuming
process). That is why the metadata is stored in memory. The in-memory state
includes both the file metadata and the block map information.

1) fsimage – An fsimage file contains the complete state of the file system at a
point in time. Every file system modification is assigned a unique,
monotonically increasing transaction ID. An fsimage file represents the file
system state after all modifications up to a specific transaction ID.
2) edits – An edits file is a log that lists each file system change (file creation,
deletion or modification) that was made after the most recent fsimage.

For a file of 1 GB, how much metadata will the NameNode store, assuming the
default block size and replication factor?

The NameNode metadata includes the file-to-block mapping, the locations of blocks
on DataNodes, the list of active DataNodes, and a range of other details, all
stored in memory on the NameNode. When we check the NameNode status website,
pretty much all of that information is being served from memory.
The only things stored on disk are the fsimage, the edit log, and status logs.
The NameNode never really uses these files on disk, except when it starts. The
fsimage and edits files pretty much only exist so that the NameNode can be
brought back up if it is stopped or crashes.

When a file is put into HDFS, it is split into blocks (of configurable size).
Let's say we have a file called "file.txt" that is 1 GB (1000 MB) and our block
size is 128 MB. We will end up with seven 128 MB blocks and one 104 MB block. The
NameNode keeps track of the fact that "file.txt" in HDFS maps to these eight
blocks and to three replicas of each block. DataNodes store blocks, not files, so
this mapping is important for understanding where our data is and what our
data is.
Roughly 150 bytes of metadata is created per block object. Since there are 8
blocks with replication factor 3, i.e. 24 block replicas, about 150 × 24 =
3600 bytes of metadata will be created.
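The same estimate can be reproduced with a few lines of arithmetic (a sketch only; the 150 bytes per block object is the rule of thumb quoted above, not an exact figure):

public class MetadataEstimate {
    public static void main(String[] args) {
        long fileSize  = 1000L * 1024 * 1024;   // the 1 GB file, treated as 1000 MB
        long blockSize = 128L * 1024 * 1024;    // default block size in Hadoop 2
        int replication = 3;                    // default replication factor
        long blocks = (fileSize + blockSize - 1) / blockSize;  // ceiling division -> 8 blocks
        long metadataBytes = blocks * replication * 150;       // ~150 bytes per block object -> 3600
        System.out.println(blocks + " blocks, ~" + metadataBytes + " bytes of NameNode metadata");
    }
}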

On disk, the NameNode stores the metadata for the file system. This includes
file and directory permissions, ownerships, and assigned blocks in the fsimage
and the edit logs. In properly configured setups, it also includes a list of
DataNodes that make up the HDFS (dfs.include) and of DataNodes that are to be
removed from that list (dfs.exclude). Note that which DataNodes hold which
blocks is only stored in memory and not on disk.

When does the NameNode enter Safemode? What is the need for Safemode?
Why does the NameNode enter Safemode?

The NameNode enters Safemode in primarily two cases:

During the start-up of the NameNode daemon, the NameNode enters Safemode
for a certain period of time.
The administrator can also put the NameNode into Safemode manually with the
command below (for example, in the case of maintenance or an upgrade of the cluster):
hadoop dfsadmin -safemode enter

What is the need for Safemode? / Why does the NameNode enter Safemode?

The primary reason for the NameNode to enter Safemode is to ensure that the
minimum replication condition is reached. The minimum replication condition
is met when 99.9% of the blocks in the whole of HDFS meet their minimum
replication factor/level (which is 1 by default, and can be configured via
dfs.replication.min).
This ensures the high availability, reliability, and fault tolerance of data
before any data manipulation in the file system takes place.

What are the modes Hadoop can run in?

Hadoop can run in the following modes.


Standalone Mode
1. The default mode for Hadoop.
2. HDFS is not used; the local file system is used for input and output instead.
3. Mainly used for debugging purposes.
4. No custom configuration is required in the three Hadoop files (mapred-site.xml,
core-site.xml, hdfs-site.xml).
5. Faster than pseudo-distributed mode.

Pseudo-distributed mode
1. A cluster in which all daemons (NameNode, DataNode, ResourceManager,
NodeManager) run on a single node.
2. The replication factor is 1 for HDFS.
3. Custom configuration is required in the three Hadoop files (mapred-site.xml,
core-site.xml, hdfs-site.xml).

Fully-distributed mode
1. Mainly used in production.
2. Data is stored and processed across multiple nodes.
3. Different nodes run the different daemons (NameNode, DataNode,
ResourceManager, NodeManager).

The default data block size of HDFS/Hadoop is 64 MB, while the block size on disk
is generally 4 KB. Why is the block size large in HDFS/Hadoop?

The block size in HDFS is configurable, but the default is 64 MB in Hadoop 1 and
128 MB in Hadoop 2.
If the block size were 4 KB, as on a Unix file system, it would lead to a far
larger number of blocks and too many mappers to process them, which would degrade
performance.
More importantly, Hadoop is built for handling big data, so the block size should
be large. A larger block size also reduces the load on the NameNode, which holds
the metadata: the more blocks there are, the more overhead there is on the NameNode.
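If a different block size is wanted, it can be requested on the client side before writing files. The sketch below assumes a Hadoop 2 style setup, where dfs.blocksize is the relevant property (older releases used dfs.block.size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSizeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);  // ask for 256 MB blocks for new files
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Client-side default block size: " + fs.getDefaultBlockSize());
    }
}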

Describe Read operation in HDFS?


How is the reading of a file done in HDFS?
To read data from HDFS, the client needs to communicate with the NameNode for
metadata. The client gets the name of the file and its location from the NameNode.
The NameNode responds with the details of the number of blocks of the file, the
replication factor, and the DataNodes where each block is stored.
Now the client communicates with the DataNodes where the blocks are actually
stored. The client starts reading data in parallel from the DataNodes based on
the information received from the NameNode. Once the client or application
receives all the blocks of the file, it combines these blocks to form the file.
To improve read performance, the locations of each block are ordered by their
distance from the client, and HDFS selects the replica that is closest to the
client. This reduces read latency and bandwidth consumption. The client first
tries to read the block from the same node, then from another node in the same
rack, and finally from a DataNode in another rack.
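A minimal client-side read corresponding to the steps above might look like the following sketch (the path /data/file.txt is only an example): fs.open() obtains the block metadata from the NameNode, and the returned stream then pulls the actual bytes from the DataNodes.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                        // contacts the NameNode
        try (InputStream in = fs.open(new Path("/data/file.txt"))) { // block locations come from the NameNode
            IOUtils.copyBytes(in, System.out, 4096, false);          // bytes are streamed from the DataNodes
        }
    }
}
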
Describe Write operation in HDFS?
How do we write data or files in HDFS?

The client needs to interact with the NameNode for any write operation in HDFS.
The NameNode provides the addresses of the slave nodes.

1. The client node (in the JVM) sends a create() request to the FileSystem API,
which in turn sends the create request to the NameNode.

2. The NameNode validates the client's access rights for the write operation and
then provides the metadata about the slave nodes to which the data can be written.

3. The client node now sends the write request directly to the OutputStream,
which writes directly to the slave nodes.

4. The data block is copied to the first DataNode, then copied to the second
DataNode, and then written to the third node; the acknowledgement is sent back to
the client in reverse order (DNn … DN3, DN2, DN1) once the block write operation
has completed.

5. The number of copies depends on the replication factor, and during the write
operation all the DataNodes involved stay in contact with the master node.

6. The writes of the different blocks of a single copy of the data are executed
in parallel across the DataNodes, which is handled by the Hadoop client (HDFS setup).

7. Finally, a close() request is sent to the HDFS client once all the actions are
completed.
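A minimal client-side write corresponding to steps 1-7 above might look like the following sketch (the output path /data/out.txt is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode to add the file to the namespace and returns a stream;
        // the stream then writes blocks to a pipeline of DataNodes chosen by the NameNode.
        try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) {
            out.writeBytes("hello hdfs\n");
        } // close() completes the file on the NameNode once the last block is acknowledged
    }
}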

What is a heartbeat in Hadoop? Who sends the heartbeat? What is sent as the heartbeat?

In Hadoop, the NameNode and the DataNodes communicate using heartbeats.

A heartbeat is the signal sent by a DataNode to the NameNode at regular
intervals to indicate its presence, i.e. to indicate that it is alive.

If the NameNode does not receive a heartbeat from a DataNode for a certain time,
that DataNode is declared dead. The default heartbeat interval is 3 seconds. If
a DataNode does not send a heartbeat to the NameNode within ten minutes, the
NameNode considers the DataNode to be out of service and the block replicas
hosted by that DataNode to be unavailable. The NameNode then schedules the
creation of new replicas of those blocks on other DataNodes.

The heartbeats received by the NameNode from a DataNode also carry information
such as the total storage capacity, the fraction of storage in use, and the
number of data transfers currently in progress. The NameNode uses these
statistics for its block allocation and load balancing decisions.
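The roughly ten-minute timeout mentioned above is not a separate setting; by default it is derived from two configuration properties. The sketch below shows the usual formula, assuming the stock Hadoop 2 property names and defaults:

public class HeartbeatTimeout {
    public static void main(String[] args) {
        long recheckIntervalMs = 5 * 60 * 1000;  // dfs.namenode.heartbeat.recheck-interval, default 5 minutes
        long heartbeatIntervalS = 3;             // dfs.heartbeat.interval, default 3 seconds
        long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalS * 1000;
        System.out.println("DataNode considered dead after ~" + timeoutMs / 1000 + " seconds"); // ~630 s
    }
}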

How does the Secondary NameNode solve the issue with the NameNode?


Why is the Secondary NameNode used in Hadoop?
Is the Secondary NameNode a hot standby for the NameNode?

From its name, one might assume the Secondary NameNode is a backup node, but
it is not. First, a brief recap of the NameNode:
The NameNode holds the metadata for HDFS, such as block information and sizes.
This information is stored in main memory as well as on disk for persistent storage.
On disk, the information is kept in two different files:

- Editlogs – keeps track of each and every change made to HDFS.
- Fsimage – stores a snapshot of the file system.
Any change made to HDFS gets recorded in the edit log, so its size grows, whereas
the size of the fsimage stays the same. This has no impact until we restart the
server. When we restart the server, the edit log entries are merged into the
fsimage file, which is then loaded into main memory, and this takes some time.
If we restart the cluster after a long time, there will be considerable downtime,
since the edit log will have grown. The Secondary NameNode comes into the picture
to solve this problem.

The Secondary NameNode periodically fetches the edit logs from the NameNode and
merges them into the fsimage. The new fsimage is copied back to the NameNode,
which uses it on the next restart, reducing the startup time. It is a helper node
to the NameNode; to be precise, the whole purpose of the Secondary NameNode is to
perform checkpoints in HDFS, which helps the NameNode function effectively. Hence
it is also called the Checkpoint node.
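How often these checkpoints happen is driven by configuration. The sketch below assumes the standard Hadoop 2 property names, where a checkpoint is triggered either after a time period or after a number of un-checkpointed transactions, whichever comes first:

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.namenode.checkpoint.period: seconds between checkpoints (default 3600)
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // dfs.namenode.checkpoint.txns: force a checkpoint after this many transactions (default 1000000)
        conf.setLong("dfs.namenode.checkpoint.txns", 1000000);
        System.out.println("Checkpoint period: " + conf.get("dfs.namenode.checkpoint.period") + " s");
    }
}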

Whenever I start the Hadoop cluster, the NameNode enters Safemode and takes a
very long time to leave it automatically. I want to leave Safemode forcefully;
is that possible? If the NameNode is taken out of Safemode manually, will there
be any consequences?

Safemode is a maintenance state of the NameNode during which the NameNode does
not allow any modifications to the file system. During Safemode, the HDFS
cluster is read-only.
At startup, the NameNode loads the file system namespace from the last saved
fsimage into its main memory, along with the edits log file. It then merges the
edits log into the fsimage, producing a new file system namespace. After that,
it receives block reports containing block location information from all the
DataNodes. In Safemode, the NameNode collects these block reports from the
DataNodes.
The NameNode enters Safemode automatically during start-up and leaves Safemode
once the DataNodes have reported that most blocks are available.

Use the command below to check the status of Safemode:
hadoop dfsadmin -safemode get
Use the command below to enter Safemode:
bin/hadoop dfsadmin -safemode enter
Use the command below to leave Safemode:
hadoop dfsadmin -safemode leave
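The same checks can also be done programmatically against a DistributedFileSystem instance; this is only a sketch and assumes fs.defaultFS points at an HDFS NameNode (otherwise the cast below fails):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafemodeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);  // query only, does not change state
        System.out.println("NameNode in safemode: " + inSafeMode);
    }
}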

Can Hadoop handle small files efficiently? What happens when we store small
files in Hadoop? What is the small-file problem, and how can it be resolved?

Hadoop is not suited to small data. Hadoop HDFS lacks the ability to efficiently
support random reads of small files because of its high-capacity design. A small
file in HDFS is one significantly smaller than the HDFS block size (default
128 MB). If we store huge numbers of such small files, HDFS cannot handle them
well, as HDFS was designed to work with a small number of large files for
storing large datasets, rather than a large number of small files.

Following are the issues with the large number of small files in Hadoop:

Every file in HDFS is mapped to an object stored in the NameNode's memory, so a
large number of small files ends up using a lot of the master's memory, and
scaling up in this fashion is not feasible.
When there is a large number of files, there will also be many disk seeks, with
frequent hopping from DataNode to DataNode, which increases the file read/write
time.
Solution
The usual solutions are to pack many small files into fewer large ones, for
example by using Hadoop Archives (HAR files), SequenceFiles keyed by the original
file names, or input formats such as CombineFileInputFormat that group several
small files into one split; a SequenceFile sketch is shown below.
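As an illustration of the SequenceFile approach (a minimal sketch; the output path and file names are hypothetical), many small files can be packed into one HDFS file, with the original file name as the key and the file contents as the value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/packed.seq");   // hypothetical output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // key = original file name, value = file contents
            writer.append(new Text("small-1.txt"), new BytesWritable("a".getBytes()));
            writer.append(new Text("small-2.txt"), new BytesWritable("b".getBytes()));
        }
    }
}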

Can multiple clients write into an HDFS file concurrently?


No, multiple clients cannot write to an HDFS file at the same time.
When one client is given permission by the NameNode to write data to a DataNode
block, the block gets locked until the write operation is completed. If another
client requests to write to the same block of a particular file on a DataNode,
it is not permitted to do so; it has to wait until the write lock is released.
All requests are queued, and only one client is allowed to write at a time.

In HDFS, can we read from a file that is already open for writing?
What happens when a client tries to read from a file already opened for writing
in HDFS?

Yes, the client can read a file that is already open for writing.
However, the problem with reading a file that is currently being written lies in
the consistency of the data: Hadoop HDFS does not guarantee that data which has
been written to the file will be visible to a new reader before the file has
been closed.
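If the writer wants newly written data to become visible to readers before the file is closed, it can flush explicitly. The sketch below relies on FSDataOutputStream.hflush(), which pushes the data written so far to the DataNodes and makes it visible to new readers (the path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VisibleWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/log.txt"))) {
            out.writeBytes("first record\n");
            out.hflush();  // make the data written so far visible to new readers
            out.writeBytes("second record\n");
        }
    }
}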

How indexing is done in Hadoop?


Hadoop emerged as a solution to "Big Data" problems. It is an open-source
software framework for the distributed storage and distributed processing of
large data sets.
Apache Hadoop has its own way of indexing. Since the Hadoop framework stores
data in blocks of the configured block size, HDFS keeps storing the last part
of the data, which indicates where the next part of the data is located. In
fact, this is the basis of HDFS.
