Hadoop Distributed File System (HDFS) is used for storing structured and
unstructured data in a distributed manner using commodity hardware.
What is Hadoop Distributed File System and what are its components?
Hadoop HDFS is a distributed file-system that stores data on commodity
machines, providing very high aggregate bandwidth across the cluster.
Components of HDFS:
HDFS comprises three important components: the NameNode,
the DataNode and the Secondary NameNode.
HDFS operates on a master-slave architecture model in which the NameNode
acts as the master node, keeping track of the storage cluster, and the
DataNodes act as slave nodes on the various machines within
a Hadoop cluster.
NameNode:
It maintains and manages the slave nodes and assigns tasks to them.
It stores only the metadata of HDFS. The NameNode executes file system
namespace operations such as opening, closing and renaming files and directories.
All replication factor details are maintained in the NameNode. This metadata is
kept in memory on the master for faster retrieval of data.
DataNode:
This is the daemon that runs on the slaves; these are the actual worker nodes
that store the data. DataNodes are deployed on each machine and provide the
actual storage. They are responsible for serving read and write requests from
file system clients, and also perform block creation, deletion, and
replication upon instruction from the NameNode.
Where does the NameNode keep its metadata? Does it store it in HDFS,
on the local file system, or in memory, and why?
The NameNode keeps the metadata in memory for fast access, and persists it
in two files on its local file system:
1) fsimage – An fsimage file contains the complete state of the file system at a
point in time. Every file system modification is assigned a unique,
monotonically increasing transaction ID. An fsimage file represents the file
system state after all modifications up to a specific transaction ID.
2) edits – An edits file is a log that lists each file system change (file creation,
deletion or modification) that was made after the most recent fsimage.
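The interplay between the two files can be sketched as a toy checkpoint in Python (illustrative only; `apply_edit` and `checkpoint` are hypothetical names, not Hadoop classes, and real fsimage/edits files are binary files on the NameNode's local disk):

```python
# Toy model of fsimage + edits: the fsimage is a snapshot of the namespace,
# and the edits log records every change made since that snapshot.

def apply_edit(namespace, edit):
    """Apply one logged change (op, path) to an in-memory namespace."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    return namespace

def checkpoint(fsimage, edits):
    """Replay every edit on top of the last fsimage to get a new fsimage."""
    namespace = set(fsimage)
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace  # new fsimage; the edits log can now be truncated

fsimage = {"/a", "/b"}
edits = [("create", "/c"), ("delete", "/a")]
new_image = checkpoint(fsimage, edits)  # {"/b", "/c"}
```

This replay-on-top-of-snapshot step is exactly what a checkpoint does, which is why a long edits log (many changes since the last fsimage) slows down NameNode start-up.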
Corresponding to a file of 1 GB, how much metadata will the NameNode store?
Assume the default block size and replication factor.
When a file is put into HDFS, it is split into blocks (of configurable size).
Let’s say we have a file called “file.txt” that is 1 GB (1000 MB) and our block
size is 128 MB. We will end up with seven 128 MB blocks and one 104 MB block.
The NameNode keeps track of the fact that “file.txt” in HDFS maps to these
eight blocks, and to three replicas of each block. DataNodes store blocks, not
files, so this mapping is important for understanding where our data is and
what our data is.
Corresponding to each block, roughly 150 bytes of metadata is created. Since
there are 8 blocks with replication factor 3, i.e. 24 block replicas, 150 × 24 =
3600 bytes of metadata will be created.
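The arithmetic above can be checked with a short snippet (the 150-bytes-per-replica figure is the rough rule of thumb used in this document, not an exact constant):

```python
import math

def block_count(file_mb, block_mb=128):
    """Number of HDFS blocks for a file, rounding the last partial block up."""
    return math.ceil(file_mb / block_mb)

def metadata_bytes(file_mb, block_mb=128, replication=3, bytes_per_replica=150):
    """Approximate NameNode metadata for all replicas of a file's blocks."""
    return block_count(file_mb, block_mb) * replication * bytes_per_replica

print(block_count(1000))     # 8  (seven full 128 MB blocks + one 104 MB block)
print(metadata_bytes(1000))  # 3600 bytes for the 24 block replicas
```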
On disk, the NameNode stores the metadata for the file system. This includes
file and directory permissions, ownerships, and assigned blocks in the fsimage
and the edit logs. In properly configured setups, it also includes a list of
DataNodes that make up the HDFS (dfs.include parameter) and DataNodes
that are to be removed from
that list (dfs.exclude parameter). Note that which DataNodes have which
blocks is only stored in memory and not on disk.
When does Hadoop / the NameNode enter safe mode? What is the need for
safe mode? Why does the NameNode enter safe mode?
During start-up of the NameNode daemon, the NameNode enters safe mode
for a certain period of time.
The administrator can also enter safe mode manually with the command below
(for example, during maintenance or upgrade of the cluster):
hdfs dfsadmin -safemode enter
What is the need for safe mode? Why does the NameNode enter safe mode?
Safe mode is a read-only state: the NameNode does not allow any modifications
to the file system and does not replicate or delete blocks. It is needed at
start-up because the NameNode must first load the fsimage and edits files and
then collect block reports from the DataNodes; it leaves safe mode
automatically once a sufficient percentage of blocks have been reported as
available.
Pseudo-distributed mode
1. This is a cluster where all daemons (NameNode, DataNode, ResourceManager,
NodeManager) run on one node.
2. The replication factor is 1 for HDFS.
3. Custom configuration is required in three Hadoop files (mapred-site.xml,
core-site.xml, hdfs-site.xml).
Fully-distributed mode
1. This is mainly used in the production phase.
2. Data is stored and processed across multiple nodes.
3. Different nodes run the different daemons (NameNode, DataNode,
ResourceManager, NodeManager).
The default data block size of HDFS/Hadoop is 64 MB, while the block size on
disk is generally 4 KB. Why is the block size large in HDFS/Hadoop?
The block size in HDFS is configurable, but the default is 64 MB, and 128 MB
in Hadoop version 2.
If the block size were 4 KB as in a Unix file system, this would lead to a far
larger number of blocks, and too many mappers to process them, which would
degrade performance.
Importantly, Hadoop is for handling big data, hence the block size should be
large. A large block size also reduces the load on the NameNode, which holds
the metadata: the more blocks, the more overhead on the NameNode.
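A quick comparison makes the point concrete (assuming the 1000 MB = 10^9 bytes convention used earlier in this document):

```python
import math

def blocks_for(file_bytes, block_bytes):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_bytes / block_bytes)

GB = 1000 * 1000 * 1000
print(blocks_for(GB, 4 * 1024))          # 244141 blocks with a 4 KB block size
print(blocks_for(GB, 128 * 1024 * 1024)) # 8 blocks with a 128 MB block size
```

Every one of those blocks costs NameNode memory and a potential map task, so the 4 KB choice would multiply both by roughly five orders of magnitude for the same 1 GB of data.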
The client needs to interact with the NameNode for any write operation in
HDFS; the NameNode provides the addresses of the slave nodes.
1. The client node's JVM sends a create() request to the file system API,
which in turn sends the create request to the NameNode.
2. The NameNode validates the client's access rights for the write operation
and then provides the metadata about the slave nodes to which the data can
be written.
3. Each data block is copied to the first DataNode, then copied to the second
DataNode, and then written to the third node; acknowledgements are sent back
to the client in reverse order (DNn...DN3, DN2, DN1) once the block write
operation completes.
4. The number of copies depends on the replication factor, and during the
write operation all the DataNodes stay in contact with the master node.
5. While a block is being written, its replicas flow through this pipeline of
DataNodes, coordinated by the Hadoop client (HDFS setup).
6. Finally, a close() request is sent by the HDFS client once all the actions
are completed.
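The pipeline and acknowledgement ordering above can be sketched with a toy model (the function and node names are hypothetical, not part of the HDFS API):

```python
# Toy model of the HDFS write pipeline for one block: the data flows
# DN1 -> DN2 -> DN3, and acknowledgements travel back in reverse order
# (DN3 -> DN2 -> DN1 -> client).

def write_block(block, pipeline):
    """Return (replica placements in write order, acks in receive order)."""
    stored = []
    for dn in pipeline:              # data is forwarded node to node
        stored.append((dn, block))
    acks = list(reversed(pipeline))  # acks come back last node first
    return stored, acks

stored, acks = write_block("blk_0001", ["DN1", "DN2", "DN3"])
print(acks)  # ['DN3', 'DN2', 'DN1']
```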
If after a certain time the NameNode does not receive any heartbeat from a
DataNode, that DataNode is declared dead.
The default heartbeat interval is 3 seconds. If a DataNode does not send a
heartbeat to the NameNode for ten minutes, the NameNode considers the
DataNode to be out of service and the block replicas hosted by that
DataNode to be unavailable. The NameNode then schedules the creation of
new replicas of those blocks on other DataNodes.
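The exact point at which a DataNode is declared dead is derived from two configuration values; assuming the usual defaults (dfs.heartbeat.interval = 3 s and dfs.namenode.heartbeat.recheck-interval = 300 s, worth verifying on your own cluster), the timeout works out to roughly the ten minutes mentioned above:

```python
# Dead-node timeout as commonly computed from the defaults
# (assumed values; check hdfs-site.xml on your cluster):
#   timeout = 2 * recheck_interval + 10 * heartbeat_interval

heartbeat_s = 3    # dfs.heartbeat.interval
recheck_s = 300    # dfs.namenode.heartbeat.recheck-interval (5 minutes)

timeout_s = 2 * recheck_s + 10 * heartbeat_s
print(timeout_s)  # 630 seconds, i.e. 10 minutes 30 seconds
```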
The Secondary NameNode periodically fetches the edit logs from the NameNode
and applies them to the fsimage. This new fsimage is copied back to the
NameNode, which uses it on the next restart, reducing the startup time.
It is a helper node to the NameNode; to be precise, the Secondary NameNode's
whole purpose is to perform checkpoints in HDFS, which helps the NameNode
function effectively. Hence, it is also called the Checkpoint node.
Can Hadoop handle small files efficiently? What happens when we store small
files in Hadoop? What is the small file problem? How can it be resolved?
Hadoop is not suited for small data. HDFS lacks the ability to efficiently
support random reading of small files because of its high-capacity design.
A small file in HDFS is one significantly smaller than the HDFS block size
(default 128 MB). HDFS cannot handle a huge number of such files, as it was
designed to work with a small number of large files for storing large
datasets, rather than a large number of small files.
The following are the issues with a large number of small files in Hadoop:
1. Every file, directory and block occupies roughly 150 bytes of NameNode
memory, so millions of small files can exhaust the NameNode's heap.
2. Reading many small files involves a large number of seeks and NameNode
lookups, which is inefficient.
3. Each small file typically becomes its own map task, so MapReduce jobs
launch far more tasks than necessary.
Common remedies are to merge small files into larger ones, for example with
HAR (Hadoop archive) files or SequenceFiles.
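Chief among these issues is NameNode memory: every file, directory and block consumes heap on the NameNode. A rough back-of-the-envelope calculation, reusing the approximate 150-bytes-per-object figure from earlier in this document, shows why:

```python
def namenode_mem_bytes(num_files, blocks_per_file=1, bytes_per_object=150):
    """Rough NameNode heap usage: one inode object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object

# 10 million 1 MB files (each fits in a single block) ...
small = namenode_mem_bytes(10_000_000)       # 3,000,000,000 bytes (~3 GB of heap)

# ... versus the same 10 TB stored as 10,000 files of 1 GB (8 blocks each)
large = namenode_mem_bytes(10_000, blocks_per_file=8)  # 13,500,000 bytes (~13.5 MB)

print(small, large)
```

The same volume of data costs the NameNode hundreds of times more memory when stored as small files, which is the heart of the small file problem.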
In HDFS, can we read from a file which is already open for writing?
What happens when a client tries to read a file already opened for writing in
HDFS?
Yes, the client can read a file which is already open for writing.
However, the problem with reading a file that is currently being written lies
in the consistency of the data: HDFS does not guarantee that data which has
been written into the file will be visible to a new reader before the file
has been closed.