Module-1
Hadoop Distributed File System (HDFS) Basics
Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. Commodity hardware is inexpensive. Because Hadoop needs the processing power of many machines, and deploying high-end hardware at that scale is expensive, commodity hardware is used instead. When commodity hardware is used, failures are the norm rather than the exception, so HDFS is designed to be highly fault-tolerant.
HDFS provides high-throughput access to the stored data, which makes it extremely useful for building applications that work with large data sets.
HDFS was originally built as the infrastructure layer for Apache Nutch. It is now a core part of the Apache Hadoop project.
DataNodes serve the read and write requests from HDFS file system clients. They are also responsible for creating block replicas and for checking whether blocks are corrupted. They periodically report to the NameNode through heartbeat messages and block reports that list the block mappings they hold.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each
file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks
belonging to a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. Files in HDFS are write-once and have strictly one writer at any time. An
application can specify the number of replicas of a file. The replication factor can be specified at
file creation time and can be changed later. The Namenode makes all decisions regarding
replication of blocks. It periodically receives Heartbeat and a Blockreport from each of the
Datanodes in the cluster. A receipt of a heartbeat implies that the Datanode is in good health and is
serving data as desired. A Blockreport contains a list of all blocks on that Datanode.
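For example, the replication factor of an existing file can be changed from the file system shell; the path below is only illustrative:
$ hadoop dfs -setrep -w 2 /foodir/myfile.txt
The -w option makes the command wait until the new replication level is reached.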
If a HDFS cluster spans multiple data centers, then a replica that is resident in the local data center
is preferred over remote replicas.
SafeMode
On startup, the Namenode enters a special state called Safemode. Replication of data blocks does
not occur when the Namenode is in Safemode state. The Namenode receives Heartbeat and
Blockreport from the Datanodes. A Blockreport contains the list of data blocks that a Datanode is hosting. Each block has a specified minimum number of replicas. A block is
considered safely-replicated when the minimum number of replicas of that data block has checked
in with the Namenode. When a configurable percentage of safely-replicated data blocks checks in
with the Namenode (plus an additional 30 seconds), the Namenode exits the Safemode state. It
then determines the list of data blocks (if any) that have fewer than the specified number of
replicas. The Namenode then replicates these blocks to other Datanodes.
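Safemode can also be queried or left manually with the dfsadmin tool, for example:
$ hadoop dfsadmin -safemode get
$ hadoop dfsadmin -safemode leave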
A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol. By design, the
Namenode never initiates an RPC. It responds to RPC requests issued by a Datanode or a client.
Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three
types of common failures are Namenode failures, Datanode failures and network partitions.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. It is possible that data may
move automatically from one Datanode to another if the free space on a Datanode falls below a
certain threshold. Also, a sudden high demand for a particular file can dynamically cause creation
of additional replicas and rebalancing of other data in the cluster. These types of rebalancing
schemes are not yet implemented.
Data Correctness
It is possible that a block of data fetched from a Datanode is corrupted. This corruption can occur
because of faults in the storage device, a bad network or buggy software. The HDFS client
implements checksum checking on the contents of a HDFS file. When a client creates a HDFS file,
it computes a checksum of each block on the file and stores these checksums in a separate hidden
file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it
received from a Datanode satisfies the checksum stored in the checksum file. If not, then the client
can opt to retrieve that block from another Datanode that has a replica of that block.
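Block health across the namespace can also be verified with the fsck utility, which reports corrupt, missing, and under-replicated blocks (the path here is illustrative):
$ hadoop fsck /foodir -files -blocks -locations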
The NameNode machine is a single point of failure for the HDFS cluster. If the Namenode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the Namenode software to another machine is not supported.
Snapshots
Snapshots support storing a copy of data at a particular instant of time. One use of the snapshot feature may be to roll back a corrupted cluster to a previously known good point in time. HDFS does not currently support snapshots, but they will be supported in a future release.
Data Blocks
HDFS is designed to support large files. Applications that are compatible with HDFS are those that
deal with large data sets. These applications write the data only once; they read the data one or
more times and require that reads are satisfied at streaming speeds. HDFS supports write-once-
read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and each chunk could reside on a different Datanode.
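The default block size is set cluster-wide in hdfs-site.xml; a minimal sketch follows. The property is named dfs.block.size in Hadoop 1 (dfs.blocksize in Hadoop 2), and the value shown is 64 MB expressed in bytes:
<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>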
Staging
A client-request to create a file does not reach the Namenode immediately. In fact, the HDFS client
caches the file data into a temporary local file. An application-write is transparently redirected to
this temporary local file. When the local file accumulates data worth over a HDFS block size, the
client contacts the Namenode. The Namenode inserts the file name into the file system hierarchy
and allocates a data block for it. The Namenode responds to the client request with the identity of
the Datanode(s) and the destination data block. The client flushes the block of data from the local
temporary file to the specified Datanode. When a file is closed, the remaining un-flushed data in
the temporary local file is transferred to the Datanode. The client then instructs the Namenode that
the file is closed. At this point, the Namenode commits the file creation operation into a persistent
store. If the Namenode dies before the file is closed, the file is lost. The above approach has been
adopted after careful consideration of target applications that run on HDFS. Applications need
streaming writes to files. If a client writes to a remote file directly without any client side
buffering, the network speed and the congestion in the network impacts throughput considerably.
This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
Pipelining
When a client is writing data to a HDFS file, its data is first written to a local file as explained
above. Suppose the HDFS file has a replication factor of three. When the local file accumulates a
block of user data, the client retrieves a list of Datanodes from the Namenode. This list represents
the Datanodes that will host a replica of that block. The client then flushes the data block to the
first Datanode. The first Datanode starts receiving the data in small portions (4 KB), writes each
portion to its local repository and transfers that portion to the second Datanode in the list. The
second Datanode, in turn, starts receiving each portion of the data block, writes that portion to its
repository and then flushes that portion to the third Datanode. The third Datanode writes the data to
its local repository. A Datanode could be receiving data from the previous one in the pipeline and
at the same time it could be forwarding data to the next one in the pipeline. Thus, the data is
pipelined from one Datanode to the next.
Accessibility
HDFS can be accessed by applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is available. An HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose an HDFS content repository through the WebDAV protocol.
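As a brief sketch of the native Java API, the following fragment writes and then reads a small file through org.apache.hadoop.fs.FileSystem; the class name, path, and contents are illustrative only, and the client-side staging and pipelining described above happen transparently behind these calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system
        Path file = new Path("/foodir/hello.txt");  // illustrative HDFS path

        // Write a small file.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read the file back and print its contents.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}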
DFSShell
HDFS allows user data to be organized in the form of files and directories. It provides an interface
called DFSShell that lets a user interact with the data in HDFS. The syntax of this command set is
similar to other shells (e.g. bash, csh) that users are already familiar with.
Here are some sample commands:
Create a directory named /foodir : hadoop dfs -mkdir /foodir
View a file /foodir/myfile.txt : hadoop dfs -cat /foodir/myfile.txt
Delete a file /foodir/myfile.txt : hadoop dfs -rm /foodir/myfile.txt
The command syntax for DFSShell is targeted for applications that need a scripting language to
interact with the stored data.
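A few more everyday operations in the same style (local and HDFS paths are illustrative):
Copy a local file into HDFS : hadoop dfs -put localfile.txt /foodir
List the contents of /foodir : hadoop dfs -ls /foodir
Copy a file back to the local disk : hadoop dfs -get /foodir/myfile.txt myfile.txt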
DFSAdmin
The DFSAdmin command set is used for administering a dfs cluster. These are commands that are
used only by a HDFS administrator. Here are some sample commands:
Put a cluster in Safe Mode : bin/hadoop dfsadmin -safemode enter
Generate a list of Datanodes : bin/hadoop dfsadmin -report
Decommission a Datanode : bin/hadoop dfsadmin -decommission datanodename
Browser Interface
A typical HDFS install configures a web-server to expose the HDFS namespace through a
configurable port. This allows a Web browser to navigate the HDFS namespace and view contents
of a HDFS file.
Space Reclamation
1. File Deletes and Undelete
When a file is deleted by a user or an application, it is not immediately removed from HDFS.
HDFS renames it to a file in the /trash directory. The file can be restored quickly as long as it
remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its
life in /trash, the Namenode deletes the file from the HDFS namespace. The deletion of the file
causes the blocks associated with the file to be freed. There could be an appreciable time delay
between the time a file is deleted by a user and the time of the corresponding increase in free space
in HDFS.
A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user
wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve
the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash
directory is just like any other directory with one special feature: HDFS applies specified policies
to automatically delete files from this directory. The current default policy is to delete files that are
older than 6 hours. In future, this policy will be configurable through a well-defined interface.
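For example, assuming the /trash layout described above (the exact trash path varies by HDFS version and configuration), a deleted file could be recovered with a simple move:
hadoop dfs -mv /trash/foodir/myfile.txt /foodir/myfile.txt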
Hadoop MapReduce is a programming paradigm at the heart of Apache Hadoop that provides massive scalability across hundreds or thousands of servers in a Hadoop cluster built from commodity hardware. The MapReduce model processes large unstructured data sets with a distributed algorithm on a Hadoop cluster.
The term MapReduce refers to two separate and distinct tasks that Hadoop programs perform: the Map job and the Reduce job. The Map job takes data sets as input and processes them to produce key-value pairs. The Reduce job takes the output of the Map job, i.e. the key-value pairs, and aggregates them to produce the desired result. The input and output of the map and reduce jobs are stored in HDFS.
The following word count example explains the MapReduce method. For simplicity, consider a few words of a text document; we want to find the number of occurrences of each word. First the input is split to distribute the work among all the map nodes, as shown in the figure. Then each word is identified and mapped to the number one, creating the pairs, also called tuples. In the first mapper node the three words Deer, Bear and River are passed, so the output of that node will be three key-value pairs with three distinct keys and the value set to one. The mapping process remains the same in all the nodes. These tuples are then passed to the reduce nodes. A partitioner comes into action and carries out shuffling so that all tuples with the same key are sent to the same node.
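The mapper and reducer for such a word count are only a few lines of Java. The sketch below follows the standard Apache WordCount example and is written in the spirit of the WordCount.java file compiled in the next section; the course's own file may differ in detail.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit the tuple (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the ones for each word to obtain its total count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: the class containing the main method referred to below.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");  // older constructor, matches hadoop-core-1.0.4
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g., "input"
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g., "output"
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}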
The first command compiles the program using the classes provided by Hadoop (i.e., hadoop-core-1.0.4.jar). The second command creates a jar file called WordCount.jar that you will use for running the WordCount program in Hadoop.
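For the classic WordCount example these two commands would look roughly as follows (the exact paths depend on where hadoop-core-1.0.4.jar is located):
$ javac -classpath hadoop-core-1.0.4.jar -d wordcount_classes WordCount.java
$ jar -cvf WordCount.jar -C wordcount_classes/ .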
The first command starts the Hadoop services. The second command establishes a secure connection with your machine. The third command creates the directory where you will put the LesMiserables.txt file.
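On a typical single-node Hadoop 1.x installation these three commands would be roughly:
$ bin/start-all.sh
$ ssh localhost
$ mkdir les_miserables
The directory name here is only illustrative.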
Afterwards, copy the WordCount.jar and the LesMiserables.txt file into the folder containing your Hadoop installation.
Then prepare the input for the WordCount program:
$ bin/hadoop dfs -mkdir input
$ bin/hadoop dfs -put LesMiserables.txt input
The former command creates a directory called input in the Hadoop Distributed File System (HDFS). The second command copies LesMiserables.txt into the input folder in HDFS. Without this command Hadoop cannot find the input file. Finally, execute the following
commands:
$ bin/hadoop jar WordCount.jar WordCount input output
$ bin/hadoop dfs -get output output
The first command runs the WordCount program in Hadoop. Note that the command specifies the names of:
• the class where the main method resides (cf. the WordCount.java file).
• the HDFS folder where the input files reside.
• the HDFS folder that will contain the output files.
The second command copies the output folder from HDFS to your machine. You will find the
result of the WordCount program in a file (probably) called part-00000.
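The word counts can then be inspected locally, for example:
$ head output/part-00000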
Module-2
Essential Hadoop Tools
• The Pig scripting tool is introduced as a way to quickly examine data both locally and on a
Hadoop cluster.
• The Hive SQL-like query tool is explained using two examples.
• The Sqoop RDBMS tool is used to import and export data from MySQL to/from HDFS.
• The Flume streaming data transport utility is configured to capture weblog data into HDFS.
• The Oozie workflow manager is used to run basic and complex Hadoop workflows.
• The distributed HBase database is used to store and access data on a Hadoop cluster.
The Hadoop ecosystem offers many tools to help with data input, high-level processing, workflow management, and creation of huge databases. The Hadoop ecosystem is neither a programming language nor a service; it is a platform or framework that solves big data problems. You can consider it a suite that encompasses a number of services (ingesting, storing, analyzing and maintaining data) inside it.
APACHE PIG
The compiler internally converts Pig Latin to MapReduce. It produces a sequential set of MapReduce jobs as an abstraction (which works like a black box). Pig was initially developed by Yahoo. It gives you a platform for building data flows for ETL (Extract, Transform and Load), and for processing and analyzing huge data sets. In Pig, the LOAD command first loads the data. Then we perform various functions on it such as grouping, filtering, joining, and sorting. At last, you can either dump the data on the screen or store the result back in HDFS.
Apache Pig has several usage modes. The first is a local mode in which all processing is done on
the local machine. The non-local (cluster) modes are MapReduce and Tez. These modes execute
the job on the cluster using either the MapReduce engine or the optimized Tez engine. There are
also interactive and batch modes available; they enable Pig applications to be developed locally in
interactive modes, using small amounts of data, and then run at scale on the cluster in a production
mode.
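As a small illustration of this flow, the interactive Grunt shell can be started in local mode and a delimited file processed with a few statements; the file name and field layout below are assumptions for the example:
$ pig -x local
grunt> A = LOAD 'passwd' USING PigStorage(':');  -- load colon-delimited records
grunt> B = FOREACH A GENERATE $0 AS user;        -- keep only the first field
grunt> DUMP B;                                   -- print the result to the screen
grunt> STORE B INTO 'userout';                   -- or store the result back in HDFS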
APACHE HIVE
Facebook created Hive for people who are fluent in SQL, so Hive makes them feel at home while working in the Hadoop ecosystem. Basically, Hive is a data warehousing component which performs reading, writing and managing of large data sets in a distributed environment using an SQL-like interface.
The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL. Hive has two basic components: the Hive command line and the JDBC/ODBC driver. The Hive command-line interface is used to execute HQL commands, while Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used to establish connections from applications to the data store.
Hive is also highly scalable, as it can serve both purposes: large data set processing (batch query processing) and real-time processing (interactive query processing). It supports all primitive data types of SQL. You can use predefined functions, or write tailored user-defined functions (UDFs) to accomplish your specific needs.
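A brief HQL session gives the flavor; the table and column names are illustrative only:
hive> CREATE TABLE logs (ip STRING, url STRING, hits INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH 'weblog.csv' INTO TABLE logs;
hive> SELECT url, SUM(hits) FROM logs GROUP BY url;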
APACHE SQOOP
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use
Sqoop to import data from a relational database management system (RDBMS) into the Hadoop
Distributed File System (HDFS), transform the data in Hadoop, and then export the data back into
an RDBMS.
Sqoop can be used with any Java Database Connectivity (JDBC)–compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
When we submit a Sqoop command, the main task gets divided into subtasks, each of which is handled by an individual map task internally. Each map task imports part of the data into the Hadoop ecosystem; collectively, the map tasks import the whole data set.
When we submit an export job, it is likewise mapped into map tasks, which bring chunks of data from HDFS. These chunks are exported to a structured data destination. Combining all these exported chunks of data, we receive the whole data set at the destination, which in most cases is an RDBMS (MySQL/Oracle/SQL Server).
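Typical import and export invocations look like the following; the connection string, table names, and HDFS directories are placeholders:
$ sqoop import --connect jdbc:mysql://dbhost/salesdb --table orders \
      --username sqoopuser -P --target-dir /user/hdfs/orders
$ sqoop export --connect jdbc:mysql://dbhost/salesdb --table orders_summary \
      --username sqoopuser -P --export-dir /user/hdfs/orders_summary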
Apache Solr and Apache Lucene are two services used for searching and indexing in the Hadoop ecosystem.
APACHE AMBARI
Ambari is an Apache Software Foundation project which aims at making the Hadoop ecosystem more manageable. It includes software for provisioning, managing, and monitoring Apache Hadoop clusters.
APACHE FLUME
Apache Flume is an independent agent designed to collect, transport, and store data into HDFS.
Often data transport involves a number of Flume agents that may traverse a series of machines and
locations. Flume is often used for log files, social media-generated data, email messages, and just
about any continuous data source.
As shown in Figure 2.3, a Flume agent is composed of three components.
• Source. The source component receives data and sends it to a channel. It can send the data
to more than one channel. The input data can be from a real-time source (e.g., weblog) or
another Flume agent.
• Channel. A channel is a data queue that forwards the source data to the sink destination. It
can be thought of as a buffer that manages input (source) and output (sink) flow rates.
• Sink. The sink delivers data to a destination such as HDFS, a local file, or another Flume agent.
A Flume agent must have all three of these components defined. A Flume agent can have several
sources, channels, and sinks. Sources can write to multiple channels, but a sink can take data from
only a single channel. Data written to a channel remain in the channel until a sink removes the
data. By default, the data in a channel are kept in memory but may be optionally stored on disk to
prevent data loss in the event of a network failure.
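A minimal agent definition ties one source, one channel, and one sink together in a properties file; the agent name, port, and HDFS path below are illustrative only:
# weblog-agent.conf (illustrative minimal agent)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Toy source listening on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# In-memory channel (see the note on durability above)
a1.channels.c1.type = memory

# Sink that delivers events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/weblogs
a1.sinks.k1.channel = c1
Such an agent would be started with a command like: flume-ng agent --conf conf --conf-file weblog-agent.conf --name a1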
APACHE HBASE
Specific HBase cell values are identified by a row key, column (column family and column), and version (timestamp). It is possible to have many versions of data within an HBase cell. A version is specified as a timestamp and is created each time data are written to a cell. Almost anything can serve as a row key, from strings to binary representations of longs to serialized data structures. Rows are lexicographically sorted, with the lowest order appearing first in a table. The empty byte array denotes both the start and the end of a table's namespace. All table accesses are via the table row key, which is considered its primary key.
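A short HBase shell session illustrates the row key and column family model; the table and column names are arbitrary examples:
hbase(main):001:0> create 'webtable', 'anchor'
hbase(main):002:0> put 'webtable', 'com.example.www', 'anchor:text', 'Example'
hbase(main):003:0> get 'webtable', 'com.example.www'
hbase(main):004:0> scan 'webtable'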
YARN DISTRIBUTED-SHELL
The Hadoop YARN project includes the Distributed-Shell application, which is an example
of a Hadoop non-MapReduce application built on top of YARN. Distributed-Shell is a simple
mechanism for running shell commands and scripts in containers on multiple nodes in a Hadoop
cluster.
The central YARN ResourceManager runs as a scheduling daemon on a dedicated machine
and acts as the central authority for allocating resources to the various competing applications in
the cluster. The ResourceManager has a central and global view of all cluster resources and,
therefore, can ensure fairness, capacity, and locality are shared across all users. Depending on the
application demand, scheduling priorities, and resource availability, the ResourceManager
dynamically allocates resource containers to applications to run on particular nodes. A container is
a logical bundle of resources (e.g., memory, cores) bound to a particular cluster node. To enforce
and track such assignments, the ResourceManager interacts with a special system daemon running
on each node called the NodeManager. Communications between the ResourceManager and
NodeManagers are heartbeat based for scalability. NodeManagers are responsible for local
monitoring of resource availability, fault reporting, and container life-cycle management (e.g.,
starting and killing jobs). The ResourceManager depends on the NodeManagers for its "global view" of the cluster.
User applications are submitted to the ResourceManager via a public protocol and go
through an admission control phase during which security credentials are validated and various
operational and administrative checks are performed. Those applications that are accepted pass to
the scheduler and are allowed to run. Once the scheduler has enough resources to satisfy the
request, the application is moved from an accepted state to a running state. Aside from internal
bookkeeping, this process involves allocating a container for the single ApplicationMaster and
spawning it on a node in the cluster. Often called container 0, the ApplicationMaster does not have
any additional resources at this point, but rather must request additional resources from the
ResourceManager.
The ApplicationMaster is the "master" user job that manages all application life-cycle
aspects, including dynamically increasing and decreasing resource consumption (i.e., containers),
managing the flow of execution (e.g., in case of MapReduce jobs, running reducers against the
output of maps), handling faults and computation skew, and performing other local optimizations.
The ApplicationMaster is designed to run arbitrary user code that can be written in any
programming language, as all communication with the ResourceManager and NodeManager is
encoded using extensible network protocols.
The ApplicationMaster will need to harness the processing power of multiple servers to
complete a job. To achieve this, the ApplicationMaster issues resource requests to the
ResourceManager. The form of these requests includes specification of locality preferences (e.g.,
to accommodate HDFS use) and properties of the containers. The ResourceManager will attempt to
satisfy the resource requests coming from each application according to availability and scheduling
policies. When a resource is scheduled on behalf of an ApplicationMaster, the ResourceManager
generates a lease for the resource, which is acquired by a subsequent ApplicationMaster heartbeat.
The ApplicationMaster then works with the NodeManagers to start the resource. A token-based
security mechanism guarantees its authenticity when the ApplicationMaster presents the container
lease to the NodeManager. In a typical situation, running containers will communicate with the
ApplicationMaster through an application-specific protocol to report status and health information
and to receive framework-specific commands. In this way, YARN provides a basic infrastructure
for monitoring and life-cycle management of containers, while each framework manages
application-specific semantics independently.
In Figure 2.6, the YARN components appear as the large outer boxes (ResourceManager and
NodeManagers), and the two applications appear as smaller boxes (containers), one dark and one
light. Each application uses a different ApplicationMaster; the darker client is running a Message
Passing Interface (MPI) application and the lighter client is running a traditional MapReduce
application.
Fig 2.6 YARN architecture with two clients (MapReduce and MPI).
Distributed-Shell
As described earlier in this chapter, Distributed-Shell is an example application included with the
Hadoop core components that demonstrates how to write applications on top of YARN. It provides
a simple method for running shell commands and scripts in containers in parallel on a Hadoop
YARN cluster.
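Submitting a Distributed-Shell job amounts to running the bundled client with a shell command to execute; the jar location below depends on the installation and is only indicative:
$ yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
      -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
      -shell_command uptime -num_containers 2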
Hadoop MapReduce
MapReduce was the first YARN framework and drove many of YARN‘s requirements. It is
integrated tightly with the rest of the Hadoop ecosystem projects, such as Apache Pig, Apache
Hive, and Apache Oozie.
Apache Tez
One great example of a new YARN framework is Apache Tez. Many Hadoop jobs involve the
execution of a complex directed acyclic graph (DAG) of tasks using separate MapReduce stages.
Apache Tez generalizes this process and enables these tasks to be spread across stages so that they
can be run as a single, all-encompassing job. Tez can be used as a MapReduce replacement for
projects such as Apache Hive and Apache Pig. No changes are needed to the Hive or Pig
applications.
Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability. Facebook, Twitter,
and LinkedIn use it to create social graphs of users. Giraph was originally written to run on
standard Hadoop V1 using the MapReduce framework, but that approach proved inefficient and totally unnatural for various reasons. In addition, using the flexibility of YARN, the Giraph
developers plan on implementing their own web interface to monitor job progress.
Apache Spark
Spark was initially developed for applications in which keeping data in memory improves
performance, such as iterative algorithms, which are common in machine learning, and interactive
data mining. Spark differs from classic MapReduce in two important ways. First, Spark holds
intermediate results in memory, rather than writing them to disk. Second, Spark supports more than
just MapReduce functions; that is, it greatly expands the set of possible analyses that can be
executed over HDFS data stores. It also provides APIs in Scala, Java, and Python.
Since 2013, Spark has been running on production YARN clusters at Yahoo!. The advantage of
porting and running Spark on top of YARN is the common resource management and a single
underlying file system.
Apache Storm
Traditional MapReduce jobs are expected to eventually finish, but Apache Storm continuously
processes messages until it is stopped. This framework is designed to process unbounded streams
of data in real time. It can be used in any programming language. The basic Storm use-cases
include real-time analytics, online machine learning, continuous computation, distributed RPC
(remote procedure calls), ETL (extract, transform, and load), and more. Storm provides fast
performance, is scalable, is fault tolerant, and provides processing guarantees. It works directly
under YARN and takes advantage of the common data and resource management substrate.
Apache Flink
Apache Flink is a platform for distributed, general-purpose data processing that features automatic program optimization. It also offers native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.
Flink is primarily a stream-processing framework that can look like a batch-processing
environment. The immediate benefit from this approach is the ability to use the same algorithms
for both streaming and batch modes (exactly as is done in Apache Spark). However, Flink can
provide low-latency similar to that found in Apache Storm, but which is not available in Apache
Spark. In addition, Flink has its own memory management system, separate from Java's garbage
collector. By managing memory explicitly, Flink almost eliminates the memory spikes often seen
on Spark clusters.
Hadoop has two main areas of administration: the YARN resource manager and the HDFS file
system. Other application frameworks (e.g., the MapReduce framework) and tools have their own
management files. Hadoop configuration is accomplished through the use of XML configuration
files. The basic files and their function are as follows:
core-default.xml: System-wide properties
hdfs-default.xml: Hadoop Distributed File System properties
mapred-default.xml: Properties for the YARN MapReduce framework
yarn-default.xml: YARN properties
YARN has several built-in administrative features and commands. The main administration
command is yarn rmadmin (resource manager administration). Enter yarn rmadmin -help to learn
more about the various options.
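For example, after editing the queue or node include/exclude configuration, the running ResourceManager can be told to re-read it:
$ yarn rmadmin -refreshQueues
$ yarn rmadmin -refreshNodes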
YARN WebProxy
The Web Application Proxy is a separate proxy server in YARN that addresses security issues with
the cluster web interface on ApplicationMasters. By default, the proxy runs as part of the Resource
Manager itself, but it can be configured to run in a stand-alone mode by adding the configuration
property yarn.web-proxy.address to yarn-site.xml. (Using Ambari, go to the YARN Configs view,
scroll to the bottom, and select Custom yarn-site.xml/Add property.) In stand-alone
mode, yarn.web-proxy.principal and yarn.web-proxy.keytab control the Kerberos principal name
and the corresponding keytab, respectively, for use in secure mode. These elements can be added
to the yarn-site.xml if required.
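A sketch of the stand-alone proxy setting in yarn-site.xml, with a placeholder host and port:
<property>
  <name>yarn.web-proxy.address</name>
  <value>proxyhost.example.com:8089</value>
</property>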
HDFS Web GUI
The NameNode web GUI provides essential information about HDFS and offers the capability to browse the HDFS namespace and logs.
The web-based UI can be started from within Ambari or from a web browser connected to the
NameNode. In Ambari, simply select the HDFS service window and click on the Quick Links pull-
down menu in the top middle of the page. Select NameNode UI. A new browser tab will open with
the UI shown. You can also start the UI directly by entering the following command
$ firefox http://localhost:50070
There are five tabs on the UI: Overview, Datanodes, Snapshot, Startup Progress, and Utilities.
The Overview page provides much of the essential information that the command-line tools also
offer, but in a much easier-to-read format. The Datanodes tab displays node information. The
Snapshot window lists the "snapshottable" directories and the snapshots.
In NameNode startup progress view, when the NameNode starts, it reads the previous file system
image file (fsimage); applies any new edits to the file system image, thereby creating a new file
system image; and drops into safe mode until enough DataNodes come online. This progress is
shown in real time in the UI as the NameNode starts. Completed phases are displayed in bold text.
The currently running phase is displayed in italics. Phases that have not yet begun are displayed in
gray text.
The Utilities menu offers two options. The first is a file system browser; from this window, you can easily explore the HDFS namespace. The second option links to the various NameNode logs.
By default, the balancer will continue to rebalance the nodes until the number of data blocks on all
DataNodes are within 10% of each other. The balancer can be stopped, without harming HDFS, at
any time by entering a Ctrl-C. Lower or higher thresholds can be set using the -threshold argument.
For example, giving the following command sets a 5% threshold:
$ hdfs balancer -threshold 5
SecondaryNameNode
To avoid long NameNode restarts and other issues, the performance of the SecondaryNameNode
should be verified. The hdfs-site.xml defines a property called fs.checkpoint.period (called HDFS
Maximum Checkpoint Delay in Ambari). This property provides the time in seconds between the
SecondaryNameNode checkpoints.
When a checkpoint occurs, a new fsimage* file is created in the directory corresponding to the
value of dfs.namenode.checkpoint.dir in the hdfs-site.xml file. This file is also placed in the
NameNode directory corresponding to the dfs.namenode.name.dir path designated in the hdfs-
site.xml file. To test the checkpoint process, a short time period (e.g., 300 seconds) can be used
for fs.checkpoint.period and HDFS restarted. After five minutes, two identical fsimage* files
should be present in each of the two previously mentioned directories. If these files are not recent
or are missing, consult the NameNode and SecondaryNameNode logs.
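For the test described above, the checkpoint period would temporarily be set in hdfs-site.xml roughly as follows:
<property>
  <name>fs.checkpoint.period</name>
  <value>300</value>
</property>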
Once the SecondaryNameNode process is confirmed to be working correctly, reset
the fs.checkpoint.period to the previous value and restart HDFS. (Ambari versioning is helpful
with this type of procedure.) If the SecondaryNameNode is not running, a checkpoint can be forced
by running the following command:
$ hdfs secondarynamenode -checkpoint force