
Cloudera Administrator

Training for Apache Hadoop

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.


01-1

Chapter 1
Introduction

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-2

Introduction
About This Course
About Cloudera
Course Logistics

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-3

Course Objectives
During this course, you will learn:
The core technologies of Hadoop
How to plan your Hadoop cluster hardware and software
How to deploy a Hadoop cluster
How to schedule jobs on the cluster
How to maintain your cluster
How to monitor, troubleshoot, and optimize the cluster
What system administrator issues to consider when installing
Hive, HBase and Pig
How to populate HDFS from external sources
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-4

Course Contents
Chapter 1: Introduction
Chapter 2: An Introduction to Hadoop and HDFS
Chapter 3: Planning Your Hadoop Cluster
Chapter 4: Deploying Your Cluster
Chapter 5: Scheduling Jobs
Chapter 6: Cluster Maintenance
Chapter 7: Cluster Monitoring, Troubleshooting, and Optimizing
Chapter 8: Installing and Managing Hadoop Ecosystem Projects
Chapter 9: Populating HDFS From External Sources
Chapter 10: Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-5

Cloudera Certified Administrator for Apache Hadoop
At the end of the course, you will take the Cloudera Certified
Administrator for Apache Hadoop exam
Passing the exam earns you the CCAH credential
Your instructor will tell you more about the exam during the week

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-6

Introduction
About This Course
About Cloudera
Course Logistics

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-7

About Cloudera
Cloudera is "The commercial Hadoop company"
Founded by leading experts on Hadoop from Facebook, Google,
Oracle and Yahoo
Staff includes several committers to Hadoop projects

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-8

Cloudera Products
Cloudera's Distribution including Apache Hadoop (CDH)
A single, easy-to-install package from the Apache Hadoop core
repository
Includes a stable version of Hadoop, plus critical bug fixes and
solid new features from the development version
Open-source
No vendor lock-in
Cloudera Manager
Easy, Wizard-based creation and management of Hadoop
clusters
Central monitoring and management point for the cluster
Free version supports up to 50 nodes

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-9

Cloudera Enterprise
Cloudera Enterprise
Complete package of software and support
Built on top of CDH
Includes full version of Cloudera Manager
Install, manage, and maintain a cluster of any size
LDAP integration
Includes powerful cluster monitoring and auditing tools
Resource consumption tracking
Proactive health checks
Alerting
Configuration change audit trails
And more
24 x 7 support
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-10

Cloudera Services
Provides consultancy services to many key users of Hadoop
Including Adconion, AOL Advertising, comScore, Groupon,
NAVTEQ, Samsung, Trend Micro, and Trulia
Solutions Architects and engineers are experts in Hadoop and
related technologies
Several are committers to Apache Hadoop and related projects
Provides training in key areas of Hadoop administration and development
Courses include Developer Training for Apache Hadoop,
Analyzing Data with Hive and Pig, HBase Training, Cloudera
Essentials
Custom course development available
Both public and on-site training available
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-11

Introduction
About This Course
About Cloudera
Course Logistics

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-12

Logistics
Course start and end times
Lunch
Breaks
Restrooms
Can I come in early/stay late?
Certification

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-13

Introductions
About your instructor
About you
Experience with Hadoop?
Experience as a System Administrator?
What platform(s) do you use?
Expectations from the course?

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

01-14

Chapter 2
An Introduction to Hadoop

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-1

An Introduction to Hadoop
In this chapter, you will learn:
What Hadoop is
Why Hadoop is important
What features the Hadoop Distributed File System (HDFS)
provides
How MapReduce works
What other Apache Hadoop ecosystem projects exist, and what
they do

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-2

An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-3

Some Numbers
Max data in memory (RAM): 64GB
Max data per computer (disk): 24TB
Data processed by Google every month: 400PB in 2007
Average job size: 180GB
Time 180GB of data would take to read sequentially off a single
disk drive: approximately 45 minutes

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-4

Data Access Speed is the Bottleneck


We can process data very quickly, but we can only read/write it
very slowly
Solution: parallel reads
1 HDD = 75MB/sec
1,000 HDDs = 75GB/sec
Far more acceptable

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-5

Sharing is Slow
Grid computing is not new
MPI, PVM, Condor, and others
Grid focus is on distributing the workload
Uses a NetApp filer or other SAN-based solution for many
compute nodes
Fine for relatively limited amounts of data
Reading large amounts of data from a single SAN device can
leave nodes starved

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-6

Sharing is Tricky
Exchanging data requires synchronization
Deadlocks become a problem
Finite bandwidth is available
Distributed systems can drown themselves
Failovers can cause cascading failure of the system
Temporal dependencies are complicated
Difficult to make decisions regarding partial restarts

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-7

Reliability

"Failure is the defining difference between distributed and local programming"
Ken Arnold, CORBA designer

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-8

Moving to a Cluster of Machines


In the late 1990s, Google decided to design its architecture using
clusters of low-cost machines
Rather than fewer, more powerful machines
Creating an architecture around low-cost, unreliable hardware
presents a number of challenges

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-9

System Requirements
System should support partial failure
Failure of one part of the system should result in a graceful
decline in performance
Not a full halt
System should support data recoverability
If components fail, their workload should be picked up by still-functioning units
System should support individual recoverability
Nodes that fail and restart should be able to rejoin the group
activity without a full group restart

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-10

System Requirements (contd)


System should be consistent
Concurrent operations or partial internal failures should not cause
the results of the job to change
System should be scalable
Adding increased load to a system should not cause outright
failure
Instead, should result in a graceful decline
Increasing resources should support a proportional increase in
load capacity

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-11

Hadoop's Origins
Google created an architecture which answers these (and other)
requirements
Released two White Papers
2003: Description of the Google File System (GFS)
A method for storing data in a distributed, reliable fashion
2004: Description of distributed MapReduce
A method for processing data in a parallel fashion
Hadoop was based on these White Papers
All of Hadoop is written in Java
Developers typically write their MapReduce code in Java
Higher-level abstractions on top of MapReduce have also been
developed
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-12

Hadoop: A Radical Way Out


The Google architecture, and hence Hadoop, provides a radical
approach to these issues:
Nodes talk to each other as little as possible
Probably never!
This is known as a "shared nothing" architecture
Programmer should not explicitly write code which communicates
between nodes
Data is spread throughout machines in the cluster
Data distribution happens when data is loaded on to the cluster
Computation happens where the data is stored
Instead of bringing data to the processors, Hadoop brings the
processing to the data

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-13

An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-14

HDFS: Hadoop Distributed File System


Based on Google's GFS (Google File System)
Provides redundant storage of massive amounts of data
Using cheap, unreliable computers
At load time, data is distributed across all nodes
Provides for efficient MapReduce processing (see later)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-15

HDFS Assumptions
High component failure rates
Inexpensive components fail all the time
Modest number of HUGE files
Just a few million
Each file likely to be 100MB or larger
Multi-Gigabyte files typical
Files are write-once
Append support is available in CDH3, primarily for HBase reliability
Should not be used by developers!
Large streaming reads
Not random access
High sustained throughput should be favored over low latency
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-16

HDFS Features
Operates on top of an existing filesystem
Files are stored as blocks
Much larger than for most filesystems
Default is 64MB
Provides reliability through replication
Each block is replicated across multiple DataNodes
Default replication factor is 3
Single NameNode daemon stores metadata and co-ordinates
access
Provides simple, centralized management
Blocks are stored on slave nodes
Running the DataNode daemon
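To see blocks and replication in practice, the hadoop fsck utility reports them for a given path; a quick check is sketched below (the path is illustrative).

# List the files, blocks, replication factor and DataNode locations under a path
hadoop fsck /user/diana/foo -files -blocks -locations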
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-17

HDFS: Block Diagram


[Diagram] NameNode: stores metadata only, e.g.
/user/diana/foo -> blocks 1, 2, 4
/user/diana/bar -> blocks 3, 5
DataNodes: store the blocks that make up the files


Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-18

The NameNode
The NameNode stores all metadata
Information about file locations in HDFS
Information about file ownership and permissions
Names of the individual blocks
Locations of the blocks
Metadata is stored on disk and read when the NameNode
daemon starts up
Filename is fsimage
Note: block locations are not stored in fsimage
When changes to the metadata are required, these are made in
RAM
Changes are also written to a log file on disk called edits
Full details later
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-19

The NameNode: Memory Allocation


When the NameNode is running, all metadata is held in RAM for
fast response
Each item consumes 150-200 bytes of RAM
Items:
Filename, permissions, etc.
Block information for each block

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-20

The NameNode: Memory Allocation (contd)


Why HDFS prefers fewer, larger files:
Consider 1GB of data, HDFS block size 128MB
Stored as 1 x 1GB file
Name: 1 item
Blocks: 8 x 3 = 24 items
Total items: 25
Stored as 1000 x 1MB files
Names: 1000 items
Blocks: 1000 x 3 = 3000 items
Total items: 4000

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-21

The Slave Nodes


Actual contents of the files are stored as blocks on the slave
nodes
Blocks are simply files on the slave node's underlying filesystem
Named blk_xxxxxxx
Nothing on the slave node provides information about what
underlying file the block is a part of
That information is only stored in the NameNode's metadata
Each block is stored on multiple different nodes for redundancy
Default is three replicas
Each slave node runs a DataNode daemon
Controls access to the blocks
Communicates with the NameNode
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-22

Anatomy of a File Write

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-23

Anatomy of a File Write (contd)


1. Client connects to the NameNode
2. NameNode places an entry for the file in its metadata, returns
the block name and list of DataNodes to the client
3. Client connects to the first DataNode and starts sending data
4. As data is received by the first DataNode, it connects to the
second and starts sending data
5. Second DataNode similarly connects to the third
6. ack packets from the pipeline are sent back to the client
7. Client reports to the NameNode when the block is written

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-24

Anatomy of a File Write (contd)


If a DataNode in the pipeline fails
The pipeline is closed
The data continues to be written to the two good nodes in the
pipeline
The NameNode will realize that the block is under-replicated, and
will re-replicate it to another DataNode
As the blocks are written, a checksum is also calculated and
written
Used to ensure the integrity of the data when it is later read

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-25

Hadoop is Rack-aware
Hadoop understands the concept of rack awareness
The idea of where nodes are located, relative to one another
Helps the JobTracker to assign tasks to nodes closest to the data
Helps the NameNode determine the closest block to a client
during reads
In reality, this should perhaps be described as being switch-aware
HDFS replicates data blocks on nodes on different racks
Provides extra data security in case of catastrophic hardware
failure
Rack-awareness is determined by a user-defined script
See later
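As a preview, a minimal sketch of such a script is shown below; the subnets and rack names are assumptions, and the script is wired in via the topology.script.file.name property (details later in the course).

#!/bin/bash
# Illustrative rack topology script: print one rack path per host/IP argument
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done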

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-26

HDFS Block Replication Strategy


First copy of the block is placed on the same node as the client
If the client is not part of the cluster, the first block is placed on a
random node
System tries to find one which is not too busy
Second copy of the block is placed on a node residing on a
different rack
Third copy of the block is placed on a different node in the same
rack as the second copy

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-27

Anatomy of a File Read

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-28

Anatomy of a File Read (contd)


1. Client connects to the NameNode
2. NameNode returns the name and locations of the first few
blocks of the file
Block locations are returned closest-first
3. Client connects to the first of the DataNodes, and reads the
block

If the DataNode fails during the read, the client will seamlessly
connect to the next one in the list to read the block

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-29

Dealing With Data Corruption


As the DataNode is reading the block, it also calculates the
checksum
Live checksum is compared to the checksum created when the
block was stored
If they differ, the client reads from the next DataNode in the list
The NameNode is informed that a corrupted version of the block
has been found
The NameNode will then re-replicate that block elsewhere
The DataNode verifies the checksums for blocks on a regular
basis to avoid bit rot
Default is every three weeks after the block was created

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-30

Data Reliability and Recovery


DataNodes send heartbeats to the NameNode
Every three seconds
After a period without any heartbeats, a DataNode is assumed to
be lost
NameNode determines which blocks were on the lost node
NameNode finds other DataNodes with copies of these blocks
These DataNodes are instructed to copy the blocks to other
nodes
Three-fold replication is actively maintained
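A quick way to observe this from the command line is the dfsadmin report, which lists live and dead DataNodes and per-node capacity (run it as the HDFS superuser).

hadoop dfsadmin -report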

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-31

The NameNode Is Not A Bottleneck


Note: the data never travels via the NameNode
For writes
For reads
During re-replication

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-32

HDFS File Permissions


Files in HDFS have an owner, a group, and permissions
Very similar to Unix file permissions
File permissions are read (r), write (w) and execute (x) for each of
owner, group, and other
x is ignored for files
For directories, x means that its children can be accessed
HDFS permissions are designed to stop good people doing
foolish things
Not to stop bad people doing bad things!
HDFS believes you are who you tell it you are
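Permissions are inspected and changed with commands that mirror their Unix counterparts; a brief sketch follows, where the paths and the analysts group are illustrative.

hadoop fs -ls /user/diana             # show owner, group and permissions
hadoop fs -chmod 640 /user/diana/foo  # rw for owner, r for group, none for other
hadoop fs -chown diana:analysts /user/diana/foo
hadoop fs -chgrp analysts /user/diana/bar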

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-33

Stronger Security in Hadoop


Hadoop has always had authorization
The ability to allow people to do some things but not others
Example: file permissions
CDH supports Kerberos-based authentication
Making people prove they are who they say they are
Disabled by default
Complex to configure and administer
Requires a good knowledge of Kerberos
In practice, few people use this
Most rely on firewalls and other systems to restrict access to
clusters

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-34

The Secondary NameNode: Caution!


The Secondary NameNode is not a failover NameNode!
It performs memory-intensive administrative functions for the
NameNode
NameNode keeps information about files and blocks (the
metadata) in memory
NameNode writes metadata changes to an editlog
Secondary NameNode periodically combines a prior
filesystem snapshot and editlog into a new snapshot
New snapshot is transmitted back to the NameNode
More on the detail of this process later in the course
Secondary NameNode should run on a separate machine in a
large installation
It requires as much RAM as the NameNode
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-35

An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-36

What Is MapReduce?
MapReduce is a method for distributing a task across multiple
nodes
Each node processes data stored on that node
Where possible
Consists of two developer-created phases
Map
Reduce
In between Map and Reduce is the shuffle and sort
Sends data from the Mappers to the Reducers

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-37

MapReduce: The Big Picture

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-38

What Is MapReduce? (contd)


Process can be considered as being similar to a Unix pipeline:
cat /my/log | grep '\.html' | sort | uniq -c > /my/outfile
Here grep corresponds to the Map phase, sort to the shuffle and sort, and uniq -c to the Reduce phase

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-39

What Is MapReduce? (contd)


Key concepts to keep in mind with MapReduce:
The Mapper works on an individual record at a time
The Reducer aggregates results from the Mappers
The intermediate keys produced by the Mapper are the keys on
which the aggregation will be based

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-40

Features of MapReduce
Automatic parallelization and distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers
MapReduce programs are usually written in Java
Can be written in any scripting language using Hadoop
Streaming
All of Hadoop is written in Java
MapReduce abstracts all the housekeeping away from the
developer
Developer can concentrate simply on writing the Map and
Reduce functions
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-41

MapReduce: Basic Concepts


Each Mapper processes a single input split from HDFS
Often a single HDFS block
Hadoop passes the developer's Map code one record at a time
Each record has a key and a value
Intermediate data is written by the Mapper to local disk
During the shuffle and sort phase, all the values associated with
the same intermediate key are transferred to the same Reducer
The developer specifies the number of Reducers
Reducer is passed each key and a list of all its values
Keys are passed in sorted order
Output from the Reducers is written to HDFS
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-42

MapReduce: A Simple Example


WordCount is the "Hello, World!" of Hadoop

Map
// assume input is a set of text files
// k is a byte offset
// v is the line for that offset
let map(k, v) =
  foreach word in v:
    emit(word, 1)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-43

MapReduce: A Simple Example (contd)


Sample input to the Mapper (byte offset as key, line as value):
(1202, the cat sat on the mat)
(1225, the aardvark sat on the sofa)

Intermediate data produced:


(the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1),
(mat, 1), (the, 1), (aardvark, 1), (sat, 1),
(on, 1), (the, 1), (sofa, 1)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-44

MapReduce: A Simple Example (contd)


Input to the Reducer:
(aardvark, [1])
(cat, [1])
(mat, [1])
(on, [1, 1])
(sat, [1, 1])
(sofa, [1])
(the, [1, 1, 1, 1])

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-45

MapReduce: A Simple Example (contd)


Reduce
// k is a word, vals is a list of 1s
let reduce(k, vals) =
  sum = 0
  foreach (v in vals):
    sum = sum + v
  emit(k, sum)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-46

MapReduce: A Simple Example (contd)


Output from the Reducer, written to HDFS:
(aardvark, 1)
(cat, 1)
(mat, 1)
(on, 2)
(sat, 2)
(sofa, 1)
(the, 4)
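For comparison, a minimal Java version of WordCount against the Hadoop 0.20 API is sketched below; it is illustrative only (class names and job wiring are assumptions), and writing MapReduce code is the focus of the Developer Training course rather than this one.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each input line, emit (word, 1) for every word in the line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the 1s emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}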

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-47

Some MapReduce Terminology


A user runs a client program on a client computer
The client program submits a job to Hadoop
The job consists of a mapper, a reducer, and a list of inputs
The job is sent to the JobTracker
Each Slave Node runs a process called the TaskTracker
The JobTracker instructs TaskTrackers to run and monitor tasks
A Map or Reduce over a piece of data is a single task
A task attempt is an instance of a task running on a slave node
Task attempts can fail, in which case they will be restarted (more
later)
There will be at least as many task attempts as there are tasks
which need to be performed
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-48

Aside: The Job Submission Process


When a job is submitted, the following happens:
The job configuration information is turned into an XML file
The client places the XML file and the job Jar in a temporary
directory in HDFS
The client calculates the input splits for the job
How the input data will be split up between Mappers
The client contacts the JobTracker with information on the
location of the XML and Jar files, and the list of input splits

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-49

MapReduce: High Level

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-50

MapReduce Failure Recovery


Task processes send heartbeats to the TaskTracker
TaskTrackers send heartbeats to the JobTracker
Any task that fails to report progress in 10 minutes is assumed to have failed
Its JVM is killed by the TaskTracker
Any task that throws an exception is said to have failed
Failed tasks are reported to the JobTracker by the TaskTracker
The JobTracker reschedules any failed tasks
It tries to avoid rescheduling the task on the same TaskTracker
where it previously failed
If a task fails four times, the whole job fails
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-51

MapReduce Failure Recovery (Contd)


Any TaskTracker that fails to report in 10 minutes is assumed to
have crashed
All tasks on the node are restarted elsewhere
Any TaskTracker reporting a high number of failed tasks is
blacklisted, to prevent the node from blocking the entire job
There is also a global blacklist, for TaskTrackers which fail on
multiple jobs
The JobTracker manages the state of each job
Partial results of failed tasks are ignored

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-52

An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-53

The Apache Hadoop Project


Hadoop is a top-level Apache project
Created and managed under the auspices of the Apache
Software Foundation
Several other projects exist that rely on some or all of Hadoop
Typically either both HDFS and MapReduce, or just HDFS
Ecosystem projects are often also top-level Apache projects
Some are Apache incubator projects
Some are not managed by the Apache Software Foundation
Ecosystem projects include Hive, Pig, Sqoop, Flume, HBase, Oozie, and others

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-54

Hive
Hive is a high-level abstraction on top of MapReduce
Initially created by a team at Facebook
Avoids having to write Java MapReduce code
Data in HDFS is queried using a language very similar to SQL
Known as HiveQL
HiveQL queries are turned into MapReduce jobs by the Hive
interpreter
Tables are just directories of files stored in HDFS
A Hive Metastore contains information on how to map a file to
a table structure

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-55

Hive (contd)
Example Hive query:
SELECT stock.product, SUM(orders.purchases)
FROM stock INNER JOIN orders
ON (stock.id = orders.stock_id)
WHERE orders.quarter = 'Q1'
GROUP BY stock.product;
We will discuss how to install Hive later in the course

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-56

Pig
Pig is another high-level abstraction on top of MapReduce
Originally created at Yahoo!
Uses a dataflow scripting language known as PigLatin
PigLatin scripts are converted to MapReduce jobs by the Pig
interpreter

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-57

Pig (contd)
Sample PigLatin script:
stock = LOAD '/user/fred/stock' AS (id, item);
orders = LOAD '/user/fred/orders' AS (id, cost);
grpd = GROUP orders BY id;
totals = FOREACH grpd GENERATE group, SUM(orders.cost) AS t;
result = JOIN stock BY id, totals BY group;
DUMP result;

We will discuss how to install Pig later in the course

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-58

HBase
HBase is described as "the Hadoop database"
A column-oriented data store
Provides random, real-time read/write access to large amounts of
data
Allows you to manage tables consisting of billions of rows, with
potentially millions of columns
HBase stores its data in HDFS for reliability and availability
We will discuss issues related to HBase installation and
maintenance later in the course

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-59

An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-60

Hands-On Exercise: Installing Hadoop


Please refer to the Exercise Manual

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-61

An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-62

Conclusion
In this chapter, you have learned:
What Hadoop is
Why Hadoop is important
What features the Hadoop Distributed File System (HDFS)
provides
How MapReduce works
What other Apache Hadoop ecosystem projects exist, and what
they do

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

02-63

Chapter 3
Planning Your
Hadoop Cluster

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-1

Planning Your Hadoop Cluster


In this chapter, you will learn:
What issues to consider when planning your Hadoop cluster
What types of hardware are typically used for Hadoop nodes
How to optimally configure your network topology
How to select the right operating system and Hadoop distribution

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-2

Planning Your Hadoop Cluster


General Planning Considerations
Choosing The Right Hardware
Network Considerations
Configuring Nodes
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-3

Thinking About the Problem


Hadoop can run on a single machine
Great for testing, developing
Obviously not practical for large amounts of data
Many people start with a small cluster and grow it as required
Perhaps initially just four or six nodes
As the volume of data grows, more nodes can easily be added
Ways of deciding when the cluster needs to grow
Increasing amount of computation power needed
Increasing amount of data which needs to be stored
Increasing amount of memory needed to process tasks

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-4

Cluster Growth Based on Storage Capacity


Basing your cluster growth on storage capacity is often a good
method to use
Example:
Data grows by approximately 1TB per week
HDFS set up to replicate each block three times
Therefore, 3TB of extra storage space required per week
Plus some overhead, say 30%
Assuming machines with 4 x 1TB hard drives, this equates to a
new machine required each week
Alternatively: two years of data (around 100TB) will require
approximately 100 machines

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-5

Planning Your Hadoop Cluster


General Planning Considerations
Choosing The Right Hardware
Network Considerations
Configuring Nodes
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-6

Classifying Nodes
Nodes can be classified as either slave nodes or master nodes
Slave node runs DataNode plus TaskTracker daemons
Master node runs either a NameNode daemon, a Secondary
NameNode Daemon, or a JobTracker daemon
On smaller clusters, NameNode and JobTracker are often run on
the same machine
Sometimes even Secondary NameNode is on the same machine
as the NameNode and JobTracker
Important that at least one copy of the NameNode's metadata
is stored on a separate machine (see later)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-7

Slave Nodes: Recommended Configuration


Typical base configuration for a slave node
4 x 1TB or 2TB hard drives, in a JBOD* configuration
Do not use RAID! (See later)
2 x Quad-core CPUs
24-32GB RAM
Gigabit Ethernet
Multiples of (1 hard drive + 2 cores + 6-8GB RAM) tend to work
well for many types of applications
Especially those that are I/O bound

* JBOD: Just a Bunch Of Disks


Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-8

Slave Nodes: More Details


In general, when considering higher-performance vs lower-performance components:

Save the money, buy more nodes!

Typically, a cluster with more nodes will perform better than one
with fewer, slightly faster nodes

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-9

Slave Nodes: More Details (CPU)


Quad-core CPUs are now standard
Hex-core CPUs are becoming more prevalent
But are more expensive
Hyper-threading should be enabled
Hadoop nodes are seldom CPU-bound
They are typically disk- and network-I/O bound
Therefore, top-of-the-range CPUs are usually not necessary

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-10

Slave Nodes: More Details (RAM)


Slave node configuration specifies the maximum number of Map
and Reduce tasks that can run simultaneously on that node
Each Map or Reduce task will take 1GB to 2GB of RAM
Slave nodes should not be using virtual memory
Ensure you have enough RAM to run all tasks, plus overhead for
the DataNode and TaskTracker daemons, plus the operating
system
Rule of thumb:
Total number of tasks = 1.5 x number of processor cores
This is a starting point, and should not be taken as a definitive
setting for all clusters

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-11

Slave Nodes: More Details (Disk)


In general, more spindles (disks) is better
In practice, we see anywhere from four to 12 disks per node
Use 3.5" disks
Faster, cheaper, higher capacity than 2.5" disks
7,200 RPM SATA drives are fine
No need to buy 15,000 RPM drives
8 x 1.5TB drives is likely to be better than 6 x 2TB drives
Different tasks are more likely to be accessing different disks
A good practical maximum is 24TB per slave node
More than that will result in massive network traffic if a node dies
and block re-replication must take place

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-12

Slave Nodes: Why Not RAID?


Slave Nodes do not benefit from using RAID* storage
HDFS provides built-in redundancy by replicating blocks across
multiple nodes
RAID striping (RAID 0) is actually slower than the JBOD
configuration used by HDFS
RAID 0 read and write operations are limited by the speed of
the slowest disk in the RAID array
Disk operations on JBOD are independent, so the average
speed is greater than that of the slowest disk
One test by Yahoo showed JBOD performing between 10%
and 30% faster than RAID 0, depending on the operations
being performed

* RAID: Redundant Array of Inexpensive Disks


Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-13

What About Virtualization?


Virtualization is usually not worth considering
Multiple virtual nodes per machine hurts performance
Hadoop runs optimally when it can use all the disks at once

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-14

What About Blade Servers?


Blade servers are not recommended
Failure of a blade chassis results in many nodes being
unavailable
Individual blades usually have very limited hard disk capacity
Network interconnection between the chassis and top-of-rack
switch can become a bottleneck

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-15

Master Nodes: Single Points of Failure


Slave nodes are expected to fail at some point
This is an assumption built into Hadoop
NameNode will automatically re-replicate blocks that were on the
failed node to other nodes in the cluster, retaining the 3x
replication requirement
JobTracker will automatically re-assign tasks that were running
on failed nodes
Master nodes are single points of failure
If the NameNode goes down, the cluster is inaccessible
If the JobTracker goes down, no jobs can run on the cluster
All currently running jobs will fail
Spend more money on your master nodes!

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-16

Master Node Hardware Recommendations


Carrier-class hardware
Not commodity hardware
Dual power supplies
Dual Ethernet cards
Bonded to provide failover
RAIDed hard drives
At least 32GB of RAM

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-17

Planning Your Hadoop Cluster


General Planning Considerations
Choosing The Right Hardware
Network Considerations
Configuring Nodes
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-18

General Network Considerations


Hadoop is very bandwidth-intensive!
Often, all nodes are communicating with each other at the same
time
Use dedicated switches for your Hadoop cluster
Nodes are connected to a top-of-rack switch
Nodes should be connected at a minimum speed of 1Gb/sec
For clusters where large amounts of intermediate data is
generated, consider 10Gb/sec connections
Expensive
Alternative: bond two 1Gb/sec connections to each node

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-19

General Network Considerations (contd)


Racks are interconnected via core switches
Core switches should connect to top-of-rack switches at 10Gb/sec or faster
Beware of oversubscription in top-of-rack and core switches
Consider bonded Ethernet to mitigate against failure
Consider redundant top-of-rack and core switches

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-20

Planning Your Hadoop Cluster


General Planning Considerations
Choosing The Right Hardware
Network Considerations
Configuring Nodes
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-21

Operating System Recommendations


Choose an OS you're comfortable administering
CentOS: geared towards servers rather than individual
workstations
Conservative about package versions
Very widely used in production
Red Hat Enterprise Linux (RHEL): Red Hat-supported analog to
CentOS
Includes support contracts, for a price
In production, we often see a mixture of RHEL and CentOS
machines
Often RHEL on master nodes, CentOS on slaves

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-22

Operating System Recommendations (contd)


Fedora Core: geared towards individual workstations
Includes newer versions of software, at the expense of some
stability
We recommend server-based, rather than workstation-based,
Linux distributions
Ubuntu: Very popular distribution, based on Debian
Both desktop and server versions available
Try to use an LTS (Long Term Support) version
SuSE: popular distribution, especially in Europe
Cloudera provides CDH packages for SuSE
Solaris, OpenSolaris: not commonly seen in production clusters

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-23

Configuring The System


Do not use Linux's LVM (Logical Volume Manager) to make all
your disks appear as a single volume
As with RAID 0, this limits speed to that of the slowest disk
Check the machine's BIOS* settings
BIOS settings may not be configured for optimal performance
For example, if you have SATA drives make sure IDE emulation is
not enabled
Test disk I/O speed with hdparm -t
Example:
hdparm -t /dev/sda1
You should see speeds of 70MB/sec or more
Anything less is an indication of possible problems
* BIOS: Basic Input/Output System


Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-24

Configuring The System (contd)


Hadoop has no specific disk partitioning requirements
Use whatever partitioning system makes sense to you
Mount disks with the noatime option
Common directory structure for data mount points:
/data/<n>/dfs/nn
/data/<n>/dfs/dn
/data/<n>/dfs/snn
/data/<n>/mapred/local
Reduce the swappiness of the system
Set vm.swappiness to 0 or 5 in /etc/sysctl.conf
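A minimal sketch of these settings on one node, assuming an illustrative /dev/sdb1 data disk mounted at /data/1:

# /etc/fstab entry: mount the data disk with noatime
/dev/sdb1   /data/1   ext3   defaults,noatime   0 0

# Reduce swappiness and apply the change without a reboot
echo "vm.swappiness = 0" >> /etc/sysctl.conf
sysctl -p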

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-25

Filesystem Considerations
Cloudera recommends the ext3 and ext4 filesystems
ext4 is now becoming more commonly used
XFS provides some performance benefit during kickstart
It formats in 0 seconds, vs several minutes for each disk with ext3
XFS has some performance issues
Slow deletes in some versions
Some performance improvements are available; see e.g.,
http://everything2.com/index.pl?node_id=1479435

Some versions had problems when a machine runs out of memory

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-26

Operating System Parameters


Increase the nofile ulimit for the mapred and hdfs users to at
least 32K
Setting is in /etc/security/limits.conf
Disable IPv6
Disable SELinux
Install and configure the ntp daemon
Ensures the time on all nodes is synchronized
Important for HBase
Useful when using logs to debug problems
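A brief sketch of some of these settings on a Red Hat-style system (file locations and commands vary by distribution; disabling IPv6 is distribution-specific and is omitted here):

# /etc/security/limits.conf: raise the open-file limit for the Hadoop users
hdfs    -   nofile   32768
mapred  -   nofile   32768

# Disable SELinux immediately and on subsequent boots
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Install and start the ntp daemon so node clocks stay synchronized
yum install -y ntp
chkconfig ntpd on
service ntpd start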

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-27

Java Virtual Machine (JVM) Recommendations


Always use the official Oracle JDK (http://java.com/)
Hadoop is complex software, and often exposes bugs in other
JDK implementations
Version 1.6 is required
Avoid 1.6.0u18
This version had significant bugs
Hadoop is not yet production-tested with Java 7 (1.7)
Recommendation: don't upgrade to a new version as soon as it
is released
Wait until it has been tested for some time

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-28

Which Version of Hadoop?


Standard Apache version of Hadoop is available at
http://hadoop.apache.org/

Cloudera's Distribution including Apache Hadoop (CDH) starts
with the latest stable Hadoop distribution
Includes useful patches and bugfixes backported from future
releases
Includes improvements developed by Cloudera for our Support
customers
Includes additional tools for ease of installation, configuration and
use
Ensures interoperability between different Ecosystem projects
Provided in RPM, Ubuntu and SuSE package, and tarball formats
Available from http://www.cloudera.com/

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-29

Planning Your Hadoop Cluster


General Planning Considerations
Choosing The Right Hardware
Network Considerations
Configuring Nodes
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-30

Conclusion
In this chapter, you have learned:
What issues to consider when planning your Hadoop cluster
What types of hardware are typically used for Hadoop nodes
How to optimally configure your network topology
How to select the right operating system and Hadoop distribution

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

03-31

Chapter 4
Configuring and
Deploying Your Cluster

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-1

Deploying Your Cluster


In this chapter, you will learn:
The different installation configurations available in Hadoop
How to install Hadoop
How SCM can make installation and configuration easier
How to launch the Hadoop daemons
How to configure Hadoop
How to specify your rack topology script

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-2

Deploying Your Cluster


Deployment Types
Installing Hadoop
Using SCM for Easy Installation
Typical Configuration Parameters
Configuring Rack Awareness
Using Configuration Management Tools
Hands-On Exercise: Install A Hadoop Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-3

Hadoop's Different Deployment Modes


Hadoop can be configured to run in three different modes
LocalJobRunner
Pseudo-distributed
Fully distributed

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-4

LocalJobRunner Mode
In LocalJobRunner mode, no daemons run
Everything runs in a single Java Virtual Machine (JVM)
Hadoop uses the machine's standard filesystem for data storage
Not HDFS
Suitable for testing MapReduce programs during development

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-5

Pseudo-Distributed Mode
In pseudo-distributed mode, all daemons run on the local
machine
Each runs in its own JVM (Java Virtual Machine)
Hadoop uses HDFS to store data (by default)
Useful to simulate a cluster on a single machine
Convenient for debugging programs before launching them on
the real cluster

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-6

Fully-Distributed Mode
In fully-distributed mode, Hadoop daemons run on a cluster of
machines
HDFS used to distribute data amongst the nodes
Unless you are running a small cluster (less than 10 or 20
nodes), the NameNode and JobTracker should each be running
on dedicated nodes
For small clusters, it's acceptable for both to run on the same
physical node

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-7

Deploying Your Cluster


Deployment Types
Installing Hadoop
Using SCM for Easy Installation
Typical Configuration Parameters
Configuring Rack Awareness
Using Configuration Management Tools
Hands-On Exercise: Install A Hadoop Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-8

Deploying on Multiple Machines


If you are installing multiple machines, use some kind of
automated deployment
Red Hat's Kickstart
Debian Fully Automatic Installation
Solaris JumpStart
Dell Crowbar

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-9

RPM/Package vs Tarballs
Cloudera's Distribution including Apache Hadoop (CDH) is
available in multiple formats
RPMs for Red Hat-style Linux distributions (RHEL, CentOS)
Packages for Ubuntu and SuSE Linux
As a tarball
RPMs/Packages include some features not in the tarball
Automatic creation of mapred and hdfs users
init scripts to automatically start the Hadoop daemons
Although these are not activated by default
Configures the alternatives system to allow multiple
configurations on the same machine
Strong recommendation: use the RPMs/packages whenever
possible
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-10

Installation From RPM or Package


Install Hadoop
Add the Cloudera repository
Full installation details at
http://archive.cloudera.com/docs/
yum install hadoop-0.20 (RPM-based systems)
apt-get -y install hadoop-0.20 (Debian-based systems)
Install the init scripts for the daemons which should run on each
machine
Example:
sudo yum install hadoop-0.20-datanode
sudo yum install hadoop-0.20-tasktracker

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-11

Installation From the Tarball


Install Java 6
Create mapred and hdfs system users and groups
Download and unpack the Hadoop tarball
Place this somewhere sensible, such as /usr/local
Edit the configuration files
Create the relevant directories
Format the HDFS filesystem
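As a rough illustration of those steps (the tarball name, install location, and user details are assumptions, not CDH requirements):

# Create the system groups and users
groupadd hadoop
useradd -g hadoop hdfs
useradd -g hadoop mapred

# Unpack the tarball under /usr/local (illustrative version number)
tar xzf hadoop-0.20.2-cdh3u2.tar.gz -C /usr/local
ln -s /usr/local/hadoop-0.20.2-cdh3u2 /usr/local/hadoop

# After editing the configuration files and creating the data directories,
# format HDFS (run once only, as the hdfs user)
sudo -u hdfs /usr/local/hadoop/bin/hadoop namenode -format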

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-12

Starting the Hadoop Daemons


CDH installed from package or RPM includes init scripts to
start the daemons
If you have installed Hadoop manually, or from the CDH tarball,
you will have to start the daemons manually
Not all daemons run on each machine
DataNode, TaskTracker
On each data node in the cluster
NameNode, JobTracker
One per cluster
Secondary NameNode
One per cluster
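With the CDH packages, the daemons are started through their init scripts; a sketch follows, assuming service names that match the CDH3 package names shown earlier.

# On each slave node
sudo service hadoop-0.20-datanode start
sudo service hadoop-0.20-tasktracker start

# On the master node(s)
sudo service hadoop-0.20-namenode start
sudo service hadoop-0.20-jobtracker start
sudo service hadoop-0.20-secondarynamenode start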

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-13

Avoid Using start-all.sh, stop-all.sh


Hadoop includes scripts called start-all.sh and stop-all.sh
These connect to, and start, all the DataNode and TaskTracker
daemons
Cloudera recommends not using these scripts
They require all DataNodes to allow passwordless SSH login,
which most environments will not allow

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-14

An Aside: SSH
Note that most tutorials tell you to create a passwordless SSH
login on each machine
This is not necessary for the operation of Hadoop
Hadoop does not use SSH in any of its internal communications
ssh is only required if you intend to use the start-all.sh and
stop-all.sh scripts

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-15

Verify the Installation


To verify that everything has started correctly, check by running
an example job:
Copy files into Hadoop for input
hadoop fs -put /etc/hadoop-0.20/conf/*.xml input

Run an example job


hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'

View the output


hadoop fs -cat output/part-00000 | head

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-16

Deploying Your Cluster


Deployment Types
Installing Hadoop
Using SCM for Easy Installation
Typical Configuration Parameters
Configuring Rack Awareness
Using Configuration Management Tools
Hands-On Exercise: Install A Hadoop Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-17

Cloudera's SCM For Easy Cluster Installation


Cloudera has released Service and Configuration Manager
(SCM), a tool for easy deployment and configuration of Hadoop
clusters
The free version, SCM Express, can manage up to 50 nodes
The version supplied with Cloudera Enterprise supports an
unlimited number of nodes

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-18

Installing SCM Express


1. Download SCM Express to a management machine
2. Make the binary executable with chmod, and run it
3. Follow the on-screen instructions

This process installs the SCM server

Once installed, you can access the server via its Web interface
http://scm_manager_host:7180/

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-19

Using SCM Express


The first time you connect to the SCM Express server via the
Web interface, a Wizard guides you through initial setup of your
cluster
You are asked for the names or IP addresses of the machines in
the cluster
SCM then connects to each machine and installs CDH, plus the
SCM agent which controls the Hadoop daemons

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-20

Using SCM Express (contd)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-21

Using SCM Express (contd)


Once CDH is installed, the Wizard allows you to set up any of
HDFS, MapReduce, and HBase on the cluster
By default, it will choose the most appropriate machine(s) to act
as the master nodes
Based on the hardware specifications of the nodes and the
number of nodes in the cluster
To specify machines manually, once the Wizard has completed
you may remove the services and re-create them manually from
the main configuration screen

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-22

Using SCM Express: Main Configuration Screen

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-23

Using SCM Express: Changing Configurations


SCM Express provides a central point from which to manage
machine configurations
Configurations are changed via the Web interface
They are then pushed out to the relevant machines on the cluster
You can restart daemons centrally, from the Web interface

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-24

Using SCM Express: Changing Configurations (contd)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-25

Deploying Your Cluster


Deployment Types
Installing Hadoop
Using SCM for Easy Installation
Typical Configuration Parameters
Configuring Rack Awareness
Using Configuration Management Tools
Hands-On Exercise: Install A Hadoop Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-26

Hadoop's Configuration Files


Each machine in the Hadoop cluster has its own set of
configuration files
Configuration files all reside in Hadoops conf directory
Typically /etc/hadoop/conf
Primary configuration files are written in XML

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-27

Hadoop's Configuration Files (contd)


Earlier versions of Hadoop stored all configuration in
hadoop-site.xml
From 0.20 onwards, configurations have been separated out
based on functionality
Core properties: core-site.xml
HDFS properties: hdfs-site.xml
MapReduce properties: mapred-site.xml
hadoop-env.sh sets some environment variables used by
Hadoop
Such as location of log files and pid files

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-28

Sample Configuration File


Sample configuration file (mapred-site.xml)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-29

Configuration Value Precedence


Configuration parameters can be specified more than once
Highest-precedence value takes priority
Precedence order (lowest to highest):
*-site.xml on the slave node
*-site.xml on the client machine
Values set explicitly in the JobConf object for a MapReduce job
If a value in a configuration file is marked as final it overrides
all others
<property>
  <name>some.property.name</name>
  <value>somevalue</value>
  <final>true</final>
</property>
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-30

Recommended Parameter Values


There are many different parameters which can be set
Defaults are documented at
http://archive.cloudera.com/cdh/3/hadoop/core-default.html
http://archive.cloudera.com/cdh/3/hadoop/hdfs-default.html
http://archive.cloudera.com/cdh/3/hadoop/mapred-default.html

Hadoop is still a young system


Best practices and optimal values change as more and more
organizations deploy Hadoop in production
Here we present some of the key parameters, and suggest
recommended values
Based on our experiences working with clusters ranging from a
few nodes up to 1,000+

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-31

hdfs-site.xml
The single most important configuration value on your entire
cluster, set on the NameNode:
dfs.name.dir: Where on the local filesystem the NameNode stores its metadata. A comma-separated list. Default is ${hadoop.tmp.dir}/dfs/name.
Loss of the NameNode's metadata will result in the effective loss
of all the data on the cluster
Although the blocks will remain, there is no way of reconstructing
the original files without the metadata
This must be at least two disks (or a RAID volume) on the
NameNode, plus an NFS mount elsewhere on the network
Failure to set this correctly will result in eventual loss of your
cluster's data
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-32

hdfs-site.xml (contd)
The NameNode will write to the edit log in all directories in
dfs.name.dir synchronously
If a directory in the list disappears, the NameNode will continue
to function
It will ignore that directory until it is restarted
Recommendation for the NFS mount point
tcp,soft,intr,timeo=10,retrans=10
Soft mount so the NameNode will not hang if the mount point
disappears
Will retry transactions 10 times, at 1-10 second intervals, before
being deemed to have failed
Note: no space between the comma and next directory name in
the list!
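For illustration only, a minimal hdfs-site.xml fragment following this recommendation; the local directory paths and NFS mount point shown here are hypothetical and would vary by site:

<property>
  <!-- Two local disks plus an NFS mount; note: no spaces after the commas -->
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/namenode-backup/dfs/nn</value>
</property>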
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-33

hdfs-site.xml (contd)
dfs.block.size: The block size for new files, in bytes. Default is 67108864 (64MB). Recommended: 134217728 (128MB). Specified on each node, including clients.
dfs.data.dir: Where on the local filesystem a DataNode stores its blocks. Can be a comma-separated list of directories (no spaces between the comma and the path); round-robin writes to the directories (no redundancy). Specified on each DataNode; can be different on different DataNodes.
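As an illustration only (the disk mount points are hypothetical), these values might be set in hdfs-site.xml as follows:

<property>
  <!-- 128MB block size for new files -->
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
<property>
  <!-- One directory per physical disk on this DataNode; no spaces after the commas -->
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
</property>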

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-34

hdfs-site.xml (contd)
dfs.namenode.handler.count: The number of threads the NameNode uses to handle RPC requests from DataNodes. Default: 10. Recommended: 10% of the number of nodes, with a floor of 10 and a ceiling of 200. Symptoms of this being set too low: "connection refused" messages in DataNode logs as they try to transmit block reports to the NameNode. Specified on the NameNode.
dfs.permissions: If true (the default), checks file permissions. If false, permission checking is disabled (everyone can access every file). Specified on the NameNode.

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-35

hdfs-site.xml (contd)
dfs.datanode.du.reserved: The amount of space on each volume which cannot be used for HDFS block storage. Recommended: at least 10GB. (See later.) Specified on each DataNode.
dfs.datanode.failed.volumes.tolerated: The number of volumes allowed to fail before the DataNode takes itself offline, ultimately resulting in all of its blocks being re-replicated. Default: 0, but often increased on machines with several disks. Specified on each DataNode.
dfs.replication: The number of times each block should be replicated when a file is written. Default: 3. Recommended: 3. Specified on each node, including clients.
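A hedged sketch of how these three values might look in hdfs-site.xml for a DataNode with several disks; the reserved size and tolerated-failure count shown are illustrative examples, not prescriptions:

<property>
  <!-- Reserve 10GB per volume for non-HDFS use (value in bytes) -->
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>
<property>
  <!-- Allow one disk to fail before the DataNode takes itself offline -->
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>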

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-36

core-site.xml
fs.default.name: The name of the default filesystem. Usually the name and port for the NameNode. Example: hdfs://<your_namenode>:8020/ Specified on every machine which needs access to the cluster, including all nodes.
fs.checkpoint.dir: Comma-separated list of directories in which the Secondary NameNode will store its checkpoint images. If more than one directory is specified, all are written to. Specified on the Secondary NameNode.

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-37

core-site.xml (contd)
fs.trash.interval: When a file is deleted, it is placed in a .Trash directory in the user's home directory, rather than being immediately deleted. It is purged from HDFS after the number of minutes specified. Default: 0 (disabled). Recommended: 1440 (one day). Specified on clients and on the NameNode.
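Putting the core-site.xml parameters above together, a minimal example might look like the following; the NameNode hostname is hypothetical:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode1.example.com:8020/</value>
</property>
<property>
  <!-- Keep deleted files in .Trash for one day (value in minutes) -->
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>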

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-38

core-site.xml (contd)
hadoop.tmp.dir: Base temporary directory, both on the local disk and in HDFS. Default is /tmp/hadoop-${user.name}. Specified on all nodes.
io.file.buffer.size: Determines how much data is buffered during read and write operations. Should be a multiple of the hardware page size. Default: 4096. Recommendation: 65536 (64KB). Specified on all nodes.
io.compression.codecs: List of compression codecs that Hadoop can use for file compression. Specified on all nodes. Default is org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.DeflateCodec, org.apache.hadoop.io.compress.SnappyCodec

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-39

mapred-site.xml
mapred.job.tracker: Hostname and port of the JobTracker. Example: my_job_tracker:8021. Specified on all nodes and clients.
mapred.child.java.opts: Java options passed to the TaskTracker child processes. Default is -Xmx200m (200MB of heap space). Recommendation: increase to 512MB or 1GB, depending on the requirements from your developers. Specified on each TaskTracker node.
mapred.child.ulimit: Maximum virtual memory in KB allocated to any child process of the TaskTracker. If specified, set to at least 2x the value of mapred.child.java.opts or the child JVM may not start.
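A sketch of the corresponding mapred-site.xml entries, assuming a 1GB child heap; the JobTracker hostname is hypothetical, and the ulimit is set to at least 2x the child heap, per the guidance above:

<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker1.example.com:8021</value>
</property>
<property>
  <!-- 1GB heap per task JVM -->
  <name>mapred.child.java.opts</name>
  <value>-Xmx1g</value>
</property>
<property>
  <!-- Virtual memory limit in KB: at least 2 x the child heap (here 2GB) -->
  <name>mapred.child.ulimit</name>
  <value>2097152</value>
</property>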

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-40

mapred-site.xml (contd)
mapred.local.dir: The local directory where MapReduce stores its intermediate data files. May be a comma-separated list of directories on different devices. Recommendation: list directories on all disks, and set dfs.datanode.du.reserved (in hdfs-site.xml) such that approximately 25% of the total disk capacity cannot be used by HDFS. Example: for a node with 4 x 1TB disks, set dfs.datanode.du.reserved to 250GB. Specified on each TaskTracker node.

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-41

mapred-site.xml (contd)
mapred.job.tracker.handler.count: Number of threads used by the JobTracker to respond to heartbeats from the TaskTrackers. Default: 10. Recommendation: approx. 4% of the number of nodes with a floor of 10 and a ceiling of 200. Specified on the JobTracker.
mapred.reduce.parallel.copies: Number of TaskTrackers a Reducer can connect to in parallel to transfer its data. Default: 5. Recommendation: SQRT(number_of_nodes) with a floor of 10. Specified on all TaskTracker nodes.
tasktracker.http.threads: The number of HTTP threads in the TaskTracker which the Reducers use to retrieve data. Default: 40. Recommendation: 80. Specified on all TaskTracker nodes.

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-42

mapred-site.xml (contd)
mapred.reduce.slowstart.completed.maps: The percentage of Map tasks which must be completed before the JobTracker will schedule Reducers on the cluster. Default: 0.05. Recommendation: 0.5 to 0.8. Specified on the JobTracker.
mapred.jobtracker.taskScheduler: The class used by the JobTracker to determine how to schedule tasks on the cluster. Default: org.apache.hadoop.mapred.JobQueueTaskScheduler. Recommendation: org.apache.hadoop.mapred.FairScheduler. (Job and task scheduling is discussed later in the course.) Specified on the JobTracker.

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-43

mapred-site.xml (contd)
mapred.tasktracker.map.tasks.maximum: Number of Map tasks which can be run simultaneously by the TaskTracker. Specified on each TaskTracker node.
mapred.tasktracker.reduce.tasks.maximum: Number of Reduce tasks which can be run simultaneously by the TaskTracker. Specified on each TaskTracker node.
Rule of thumb: total number of Map + Reduce tasks on a node
should be approximately 1.5 x the number of processor cores on
that node
Assuming there is enough RAM on the node
This should be monitored
If the node is not processor or I/O bound, increase the total
number of tasks
Typical distribution: 60% Map tasks, 40% Reduce tasks or 70%
Map tasks, 30% Reduce tasks
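As a worked example of the rule of thumb (assuming a hypothetical 8-core TaskTracker node with ample RAM): 1.5 x 8 = 12 total slots, split roughly 60/40 into 7 Map and 5 Reduce slots:

<property>
  <!-- ~60% of 12 slots on an 8-core node -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <!-- ~40% of 12 slots on an 8-core node -->
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>5</value>
</property>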
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-44

mapred-site.xml (contd)
mapred.map.tasks.speculative.execution: Whether to allow speculative execution for Map tasks. Default: true. Recommendation: true. Specified on the JobTracker.
mapred.reduce.tasks.speculative.execution: Whether to allow speculative execution for Reduce tasks. Default: true. Recommendation: false. Specified on the JobTracker.
If a task is running significantly more slowly than the average
speed of tasks for that job, speculative execution may occur
Another attempt to run the same task is instantiated on a different
node
The results from the first completed task are used
The slower task is killed
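Following the recommendations above, the corresponding mapred-site.xml settings on the JobTracker would look something like this:

<property>
  <!-- Allow speculative execution for Map tasks -->
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <!-- Disable speculative execution for Reduce tasks -->
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>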

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-45

mapred-site.xml (contd)
mapred.compress.map.output: Determines whether intermediate data from Mappers should be compressed before transfer across the network. Default: false. Recommendation: true. Specified on all TaskTracker nodes.
mapred.output.compression.type: If the output from the Reducers is SequenceFiles, determines whether to compress the SequenceFiles. Default: RECORD. Options: NONE, RECORD, BLOCK. Recommendation: BLOCK. Specified on all TaskTracker nodes.
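For illustration, the recommended compression settings as they might appear in mapred-site.xml:

<property>
  <!-- Compress intermediate Map output before transfer across the network -->
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <!-- Block-level compression for SequenceFile output -->
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>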

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-46

mapred-site.xml (contd)
io.sort.mb: The size of the buffer on the Mapper to which the Mapper writes its Key/Value pairs. Default: 100MB. Recommendation: 256MB. This allocation comes out of the task's JVM heap space. Specified on each TaskTracker node.
io.sort.factor: The number of streams to merge at once when sorting files. Specified on each TaskTracker node.

More discussion of these parameters later in the course

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-47

Additional Configuration Files


There are several more configuration files in
/etc/hadoop/conf
hadoop-env.sh: environment variables for Hadoop daemons
HDFS and MapReduce include/exclude files
Controls who can connect to the NameNode and JobTracker
masters, slaves: hostname lists for ssh control
hadoop-policy.xml: Access control policies
log4j.properties: logging (covered later in the course)
fair-scheduler.xml: Scheduler (covered later in the course)
hadoop-metrics.properties: Monitoring (covered later in
the course)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-48

Environment Setup: hadoop-env.sh


hadoop-env.sh sets environment variables necessary for
Hadoop to run
HADOOP_CLASSPATH
HADOOP_HEAPSIZE
HADOOP_LOG_DIR
HADOOP_PID_DIR
JAVA_HOME
Values are sourced into all Hadoop control scripts and therefore
the Hadoop daemons
If you need to set environment variables, do it here to ensure that
they are passed through to the control scripts

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-49

Environment Setup: hadoop-env.sh (contd)


HADOOP_HEAPSIZE
Controls the heap size for Hadoop daemons
Default 1GB
Comment this out, and set the heap for individual daemons
HADOOP_NAMENODE_OPTS
Java options for the NameNode
At least 4GB: -Xmx4g
HADOOP_JOBTRACKER_OPTS
Java options for the JobTracker
At least 4GB: -Xmx4g
HADOOP_DATANODE_OPTS, HADOOP_TASKTRACKER_OPTS
Set to 1GB each: -Xmx1g
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-50

masters and slaves


masters and slaves list the master and slave nodes in the
cluster
Only used by start-all.sh, stop-all.sh scripts
Recommended that these scripts are not used
Therefore the masters and slaves files are not necessary

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-51

Host include and exclude Files


Optionally, specify dfs.hosts in hdfs-site.xml to point to a
file listing hosts which are allowed to connect to the NameNode
and act as DataNodes
Similarly, mapred.hosts points to a file which lists hosts allowed
to connect as TaskTrackers
Both files are optional
If omitted, any host may connect and act as a DataNode/
TaskTracker
This is a possible security/data integrity issue
NameNode can be forced to reread the dfs.hosts file with
hadoop dfsadmin -refreshNodes
No such command for the JobTracker, which has to be restarted
to re-read the mapred.hosts file, so many System
Administrators only create a dfs.hosts file
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-52

Host include and exclude Files (contd)


It is possible to explicitly prevent one or more hosts from acting
as DataNodes
Create a dfs.hosts.exclude property, and specify a filename
List the names of all the hosts to exclude in that file
These hosts will then not be allowed to connect to the NameNode
This is often used if you intend to decommission nodes (see later)
Run hadoop dfsadmin -refreshNodes to make the
NameNode re-read the file
Similarly, mapred.hosts.exclude can be used to specify a file
listing hosts which may not connect to the JobTracker
Not as commonly used, since the JobTracker must be restarted in
order to re-read the file
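A minimal sketch of the relevant hdfs-site.xml properties; the file paths are hypothetical, and each listed file simply contains one hostname per line:

<property>
  <!-- Hosts allowed to connect as DataNodes -->
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/dfs.hosts</value>
</property>
<property>
  <!-- Hosts explicitly excluded (e.g., pending decommission) -->
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.hosts.exclude</value>
</property>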

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-53

Deploying Your Cluster


Deployment Types
Installing Hadoop
Using SCM for Easy Installation
Typical Configuration Parameters
Configuring Rack Awareness
Using Configuration Management Tools
Hands-On Exercise: Install A Hadoop Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-54

Rack Topology Awareness


Recall that HDFS is rack aware
Distributes blocks based on hosts' locations
Administrator supplies a script which tells Hadoop which rack a
node is in
Should return a hierarchical rack ID for each argument it is passed
Rack ID is of the form /datacenter/rack
Example: /datactr1/rack40
Script can use a flat file, database, etc.
Script name is in topology.script.file.name in
core-site.xml
If this is blank (default), Hadoop returns a value of
/default-rack for all nodes
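For example, pointing Hadoop at the sample script shown on the next slide might look like this in core-site.xml (the script path is an assumption):

<property>
  <!-- Path to the rack topology script on the NameNode/JobTracker -->
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.py</value>
</property>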

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-55

Sample Rack Topology Script


A sample rack topology script:
#!/usr/bin/env python
import sys

DEFAULT_RACK = "/default-rack"
HOST_RACK_FILE = "/etc/hadoop/conf/host-rack.map"

host_rack = {}
for line in open(HOST_RACK_FILE):
    (host, rack) = line.split()
    host_rack[host] = rack

for host in sys.argv[1:]:
    if host in host_rack:
        print host_rack[host]
    else:
        print DEFAULT_RACK

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-56

Sample Rack Topology Script (contd)


The /etc/hadoop/conf/host-rack.map file:
host1  /datacenter1/rack1
host2  /datacenter1/rack1
host3  /datacenter1/rack1
host4  /datacenter1/rack1
host5  /datacenter1/rack2
host6  /datacenter1/rack2
host7  /datacenter1/rack2
host8  /datacenter1/rack2
...

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-57

Naming Machines to Aid Rack Awareness


A common scenario is to name your hosts in such a way that the
Rack Topology Script can easily determine their location
Example: a host called r1m32
32nd machine in Rack 1
The Rack Topology Script can simply deconstruct the machine
name and then return the rack awareness information

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-58

A Note on DNS vs IP Addresses


You can use machine names or IP addresses to identify nodes in
Hadoop's configuration files
You should use one or the other, but not a combination!
Hadoop performs both forward and reverse lookups on IP
addresses in different situations; if the results don't match, it
could cause major problems
Most people use names rather than IP addresses
This means you must ensure DNS is configured correctly on your
cluster
Just using the /etc/hosts file on each node will cause
configuration headaches as the cluster grows

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-59

Reading Configuration Changes


Cluster daemons generally need to be restarted to read in
changes to their configuration files
DataNodes do not need to be restarted if only NameNode
parameters were changed
If you need to restart everything:
Put HDFS in Safe Mode
Take the DataNodes down
Stop and start the NameNode
Start the DataNodes

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-60

Deploying Your Cluster


Deployment Types
Installing Hadoop
Using SCM for Easy Installation
Typical Configuration Parameters
Configuring Rack Awareness
Using Configuration Management Tools
Hands-On Exercise: Install A Hadoop Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-61

Managing Large Clusters


Each node in the cluster requires its own configuration files
Managing a small cluster is relatively easy
Log in to each machine to make changes
Manually change configuration files
As the cluster grows larger, management becomes more
complex
Many administrators use cluster shell-type utilities to log in to
multiple machines simultaneously
Potentially dangerous!

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-62

Cluster Shell: Example (csshX)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-63

Configuration Management Tools


A much better solution: use configuration management software
Popular open source tools: Puppet, Chef
Many others exist
Many commercial tools also exist
These tools allow you to manage configuration of multiple
machines at once
Can update files, restart daemons or even reboot machines
automatically where necessary

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-64

Configuration Management Tools (contd)


Recommendation: start using such tools when the cluster is
small!
Retrofitting configuration management software to an existing
cluster can be difficult
Machines tend not to be set up identically
Configuration scripts end up containing many exceptions for
different machines
Alternative: Use Cloudera's Service and Configuration Manager
(SCM)
Free for clusters of up to 50 nodes

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-65

Deploying Your Cluster


Deployment Types
Installing Hadoop
Using SCM for Easy Installation
Typical Configuration Parameters
Configuring Rack Awareness
Using Configuration Management Tools
Hands-On Exercise: Install A Hadoop Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-66

Hands-On Exercise
In this exercise, you will collaborate with other students to create
a real Hadoop cluster in the classroom
Please refer to the hands-on exercise manual

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-67

Deploying Your Cluster


Deployment Types
Installing Hadoop
Using SCM for Easy Installation
Typical Configuration Parameters
Configuring Rack Awareness
Using Configuration Management Tools
Hands-On Exercise: Install A Hadoop Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-68

Conclusion
In this chapter, you have learned:
The different installation configurations available in Hadoop
How to install Hadoop
How to launch the Hadoop daemons
How to configure Hadoop
How to specify your rack topology

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

04-69

Chapter 5
Managing and
Scheduling Jobs

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-1

Managing and Scheduling Jobs


In this chapter, you will learn:
How to stop jobs running on the cluster
The options available for scheduling multiple jobs on the same
cluster
The downsides of the default FIFO Scheduler
How to configure the Fair Scheduler

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-2

Managing and Scheduling Jobs


Managing Running Jobs
Hands-On Exercise: Managing Jobs
The FIFO Scheduler
The Fair Scheduler
Configuring the Fair Scheduler
Hands-On Exercise: Using the Fair Scheduler
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-3

Displaying Running Jobs


To view all jobs running on the cluster, use hadoop job -list
[training@localhost ~]$ hadoop job -list
1 jobs currently running
JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201110311158_0008  1      1320210148487  training  NORMAL    NA

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-4

Displaying All Jobs


To display all jobs including completed jobs, use
hadoop job -list all
[training@localhost ~]$ hadoop job -list all
7 jobs submitted
States are:
    Running : 1
    Succeded : 2
    Failed : 3
    Prep : 4
JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201110311158_0004  2      1320177624627  training  NORMAL    NA
job_201110311158_0005  2      1320177864702  training  NORMAL    NA
job_201110311158_0006  2      1320209627260  training  NORMAL    NA
job_201110311158_0007  2      1320210018614  training  NORMAL    NA
job_201110311158_0008  2      1320210148487  training  NORMAL    NA
job_201110311158_0001  2      1320097902546  training  NORMAL    NA
job_201110311158_0003  2      1320099376966  training  NORMAL    NA

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.


05-5

Displaying All Jobs (contd)


Note that states are displayed as numeric values
1: Running
2: Succeeded
3: Failed
4: In preparation
5: (undocumented) Killed
Easy to write a cron job that periodically lists (for example) all
failed jobs, running a command such as
hadoop job -list all | grep '<tab>3<tab>'

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-6

Displaying the Status of an Individual Job


hadoop job -status <job_id> provides status about an
individual job
Completion percentage
Values of counters
System counters and user-defined counters
Note: job name is not displayed!
The Web user interface is the most convenient way to view more
details about an individual job
More details later

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-7

Killing a Job
It is important to note that once a user has submitted a job, they
cannot stop it just by hitting CTRL-C on their terminal
This stops job output appearing on the user's console
The job is still running on the cluster!

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-8

Killing a Job (contd)


To kill a job use hadoop job -kill <job_id>

[training@localhost ~]$ hadoop job -list
1 jobs currently running
JobId                  State  StartTime      UserName  Priority  SchedulingInfo
job_201110311158_0009  1      1320210791739  training  NORMAL    NA

[training@localhost ~]$ hadoop job -kill job_201110311158_0009
Killed job job_201110311158_0009

[training@localhost ~]$ hadoop job -list
0 jobs currently running
JobId                  State  StartTime      UserName  Priority  SchedulingInfo

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-9

Managing and Scheduling Jobs


Managing Running Jobs
Hands-On Exercise: Managing Jobs
The FIFO Scheduler
The Fair Scheduler
Configuring the Fair Scheduler
Hands-On Exercise: Using the Fair Scheduler
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-10

Hands-On Exercise: Managing Jobs


In this Hands-On Exercise, you will start and kill jobs from the
command line
Please refer to the Hands-On Exercise Manual

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-11

Managing and Scheduling Jobs


Managing Running Jobs
Hands-On Exercise: Managing Jobs
The FIFO Scheduler
The Fair Scheduler
Configuring the Fair Scheduler
Hands-On Exercise: Using the Fair Scheduler
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-12

Job Scheduling Basics


A Hadoop job is composed of
An unordered set of Map tasks which have locality preferences
An unordered set of Reduce tasks
Tasks are scheduled by the JobTracker
They are then executed by TaskTrackers
One TaskTracker per node
Each TaskTracker has a fixed number of slots for Map and
Reduce tasks
This may differ per node: a node with a powerful processor
may have more slots than one with a slower CPU
TaskTrackers report the availability of free task slots to the
JobTracker on the Master node
Scheduling a job requires assigning Map and Reduce tasks to
available Map and Reduce task slots
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-13

The FIFO Scheduler


Default Hadoop job scheduler is FIFO
First In, First Out
Given two jobs A and B, submitted in that order, all Map tasks
from job A are scheduled before any Map tasks from job B are
considered
Similarly for Reduce tasks
Order of task execution within a job may be shuffled around

A1  A2  A3  A4  B1  B2  B3

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-14

Priorities in the FIFO Scheduler


The FIFO Scheduler supports assigning priorities to jobs
Priorities are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
Set with the mapred.job.priority property
May be changed from the command-line as the job is running
hadoop job -set-priority <job_id> <priority>
All work in each queue is processed before moving on to the next
All higher-priority tasks are run first, if they exist

High Priority:    C1  C2  C3
(these run before any lower-priority tasks are started, regardless of submission order)
Normal Priority:  A1  A2  A3  A4  B1  B2  B3

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-15

Priorities in the FIFO Scheduler: Problems


Problem: Job A may have 2,000 tasks; Job B may have 20
Job B will not make any progress until Job A has nearly finished
Completion time should be proportional to job size
Users with poor understanding of the system may flag all their
jobs as HIGH_PRIORITY
Thus starving other jobs of processing time
"All or nothing" nature of the scheduler makes sharing a cluster
between production jobs with SLAs and interactive users
challenging

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-16

Managing and Scheduling Jobs


Managing Running Jobs
Hands-On Exercise: Managing Jobs
The FIFO Scheduler
The Fair Scheduler
Configuring the Fair Scheduler
Hands-On Exercise: Using the Fair Scheduler
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-17

Goals of the Fair Scheduler


Fair Scheduler is designed to allow multiple users to share the
cluster simultaneously
Should allow short interactive jobs to coexist with long
production jobs
Should allow resources to be controlled proportionally
Should ensure that the cluster is efficiently utilized

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-18

The Fair Scheduler: Basic Concepts


Each job is assigned to a pool
Default assignment is one pool per username
Jobs may be assigned to arbitrarily-named pools
Such as production
Physical slots are not bound to any specific pool
Each pool gets an even share
of the available task slots

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-19

Pool Creation
By default, pools are created dynamically based on the username
submitting the job
No configuration necessary
Jobs can be sent to designated pools (e.g., production)
Pools can be defined in a configuration file (see later)
Pools may have a minimum number of mappers and reducers
defined

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-20

Adding Pools Readjusts the Share of Slots


If Charlie now submits a job in a new pool, shares of slots are
adjusted

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-21

Determining the Fair Share


The fair share of tasks slots assigned to the pool is based on:
The actual number of task slots available across the cluster
The demand from the pool
The number of tasks eligible to run
The minimum share, if any, configured for the pool
The fair share of each other active pool
The fair share for a pool will never be higher than the actual
demand
Pools are filled up to their minimum share, assuming cluster
capacity
Excess cluster capacity is spread across all pools
Aim is to maintain the most even loading possible
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-22

Example Minimum Share Allocation

First, fill Production up to 20 slot minimum guarantee


Then distribute remaining 10 slots evenly across Alice and Bob
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-23

Example Allocation 2: Production Queue Empty

Production has no demand, so no slots reserved


All slots allocated evenly across Alice and Bob
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-24

Example Allocation 3: MinShares Exceed Slots

minShare of Production, Research exceeds available capacity


minShares are scaled down pro rata to match actual slots
No slots remain for users without minShare (i.e., Bob)
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-25

Example 4: minShare < Fair Share

Production filled to minShare


Remaining 25 slots distributed across all pools
Production pool gets more than minShare, to maintain fairness
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-26

Pools With Weights


Instead of (or in addition to) setting minShare, pools can be
assigned a weight
Pools with higher weight get more slots during free slot
allocation
Even water glass height analogy:
Think of the weight as controlling the width of the glass

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-27

Example: Pool With Double Weight

Production filled to minShare (5)


Remaining 25 slots distributed across pools
Bob's pool gets two slots instead of one during each round
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-28

Multiple Jobs Within A Pool


A pool exists if it has one or more jobs in it
So far, we've only described how slots are assigned to pools
We need to determine how jobs are scheduled within a given
pool

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-29

Job Scheduling Within a Pool


Within a pool, resources are fair-scheduled across all jobs
This is achieved via another instance of Fair Scheduler
It is possible to enforce FIFO scheduling within a pool
May be appropriate for jobs that would compete for external
bandwidth, for example
Pools can have a maximum number of concurrent jobs
configured
The weight of a job within a pool is determined by its priority
(NORMAL, HIGH, etc.)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-30

Preemption in the Fair Scheduler


If shares are imbalanced, pools which are over their fair share
may not assign new tasks when their old ones complete
Eventually, as tasks complete, free slots will become available
Those free slots will be used by pools which were under their fair
share
This may not be acceptable in a production environment, where
tasks take a long time to complete
Two types of preemption are supported
minShare preemption
Fair Share preemption

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-31

minShare Preemption
Pools with a minimum share configured are operating on an SLA
(Service Level Agreement)
Waiting for tasks from other pools to finish may not be
appropriate
Pools which are below their minimum guaranteed share can kill
the newest tasks from other pools to reap slots
Can then use those slots for their own tasks
Ensures that the minimum share will be delivered within a timeout
window

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-32

Fair Share Preemption


Pools not receiving their fair share can kill tasks from other
pools
A pool will kill the newest task(s) in an over-share pool to forcibly
make room for starved pools
Fair share preemption is used conservatively
A pool must be operating at less than 50% of its fair share for 10
minutes before it can preempt tasks from other pools

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-33

Managing and Scheduling Jobs


Managing Running Jobs
Hands-On Exercise: Managing Jobs
The FIFO Scheduler
The Fair Scheduler
Configuring the Fair Scheduler
Hands-On Exercise: Using the Fair Scheduler
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-34

Steps to Configure the Fair Scheduler


1. Enable the Fair Scheduler
2. Configure Scheduler parameters
3. Configure pools

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-35

1. Enable the Fair Scheduler


In mapred-site.xml on the JobTracker, specify the scheduler to
use:
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

Identify the pool configuration file:


<property>
<name>mapred.fairscheduler.allocation.file</name>
<value>/etc/hadoop/conf/allocations.xml</value>
</property>

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-36

Scheduler Parameters in mapred-site.xml


mapred.fairscheduler.poolnameproperty: Specifies which JobConf property is used to determine the pool that a job belongs in. Default is user.name (i.e., one pool per user). Other options include group.name, mapred.job.queue.name.
mapred.fairscheduler.sizebasedweight: Makes a pool's weight proportional to log(demand) of the pool. Default: false.
mapred.fairscheduler.weightadjuster: Specifies a WeightAdjuster implementation that tunes job weights dynamically. Default is blank; can be set to org.apache.hadoop.mapred.NewJobWeightBooster.
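As an illustration, assigning jobs to pools by the mapred.job.queue.name property rather than by username could be configured like this in mapred-site.xml:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>mapred.job.queue.name</value>
</property>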

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-37

Configuring Pools
The allocations configuration file must exist, and contain an
<allocations> entity
<pool> entities can contain minMaps, minReduces,
maxRunningJobs, weight
<user> entities (optional) can contain maxRunningJobs
Limits the number of simultaneous jobs a user can run
userMaxJobsDefault entity (optional)
Maximum number of jobs for any user without a specified limit
System-wide and per-pool timeouts can be set

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-38

Very Basic Pool Configuration


The allocations configuration file must exist, and contain at least
this:
<?xml version="1.0"?>
<allocations>
</allocations>

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-39

Example: Limit Users to Three Jobs Each


Limit max jobs for any user: specify userMaxJobsDefault
<?xml version="1.0"?>
<allocations>
<userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-40

Example: Allow One User More Jobs


If a user needs more than the standard maximum number of jobs,
create a <user> entity
<?xml version="1.0"?>
<allocations>
<userMaxJobsDefault>3</userMaxJobsDefault>
<user name="bob">
<maxRunningJobs>6</maxRunningJobs>
</user>
</allocations>

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-41

Example: Add a Fair Share Timeout


Set a Preemption timeout

<?xml version="1.0"?>
<allocations>
<userMaxJobsDefault>3</userMaxJobsDefault>
<user name="bob">
<maxRunningJobs>6</maxRunningJobs>
</user>
<fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
</allocations>

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-42

Example: Create a production Pool


Pools are created by adding <pool> entities
<?xml version="1.0"?>
<allocations>
<userMaxJobsDefault>3</userMaxJobsDefault>
<pool name="production">
<minMaps>20</minMaps>
<minReduces>5</minReduces>
<weight>2.0</weight>
</pool>
</allocations>

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-43

Example: Add an SLA to the Pool

<?xml version="1.0"?>
<allocations>
<userMaxJobsDefault>3</userMaxJobsDefault>
<pool name="production">
<minMaps>20</minMaps>
<minReduces>5</minReduces>
<weight>2.0</weight>
<minSharePreemptionTimeout>60</minSharePreemptionTimeout>
</pool>
</allocations>
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-44

Example: Create a FIFO Pool


FIFO pools are useful for jobs which are, for example, bandwidth-intensive
<?xml version="1.0"?>
<allocations>
<pool name="bandwidth_intensive">
<minMaps>10</minMaps>
<minReduces>5</minReduces>
<schedulingMode>FIFO</schedulingMode>
</pool>
</allocations>

Note: <schedulingMode>FAIR</schedulingMode> would use


the Fair Scheduler
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-45

Monitoring Pools and Allocations


The Fair Scheduler exposes a status page in the JobTracker Web
user interface at
http://<job_tracker_host>:50030/scheduler
Allows you to inspect pools and allocations
Any changes to the pool configuration file (e.g.,
allocations.xml) will automatically be reloaded by the running
scheduler
Scheduler detects a timestamp change on the file
Waits five seconds after the change was detected, then
reloads the file
If the scheduler cannot parse the XML in the configuration file,
it will log a warning and continue to use the previous
configuration

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-46

Managing and Scheduling Jobs


Managing Running Jobs
Hands-On Exercise: Managing Jobs
The FIFO Scheduler
The Fair Scheduler
Configuring the Fair Scheduler
Hands-On Exercise: Using the Fair Scheduler
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-47

Hands-On Exercise: Using The Fair Scheduler


In this Hands-On Exercise, you will run jobs in different pools
Please refer to the Hands-On Exercise manual

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-48

Managing and Scheduling Jobs


Managing Running Jobs
Hands-On Exercise: Managing Jobs
The FIFO Scheduler
The Fair Scheduler
Configuring the Fair Scheduler
Hands-On Exercise: Using the Fair Scheduler
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-49

Conclusion
In this chapter, you have learned:
How to stop jobs running on the cluster
The options available for scheduling multiple jobs on the same
cluster
The downsides of the default FIFO Scheduler
How to configure the Fair Scheduler

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

05-50

Chapter 6
Cluster Maintenance

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-1

Cluster Maintenance
In this chapter, you will learn:
How to check the status of HDFS
How to copy data between clusters
How to add and remove nodes
How to rebalance the cluster
The purpose of the Secondary NameNode
What strategies to employ for NameNode Metadata backup
How to upgrade your cluster

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-2

Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing
Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-3

Checking for Corruption in HDFS


fsck checks for missing or corrupt data blocks
Unlike system fsck, does not attempt to repair errors
Can be configured to list all files
Also all blocks for each file, all block locations, all racks
Examples:
hadoop fsck /
hadoop fsck / -files
hadoop fsck / -files -blocks
hadoop fsck / -files -blocks -locations
hadoop fsck / -files -blocks -locations -racks

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-4

Checking for Corruption in HDFS (contd)


Good idea to run fsck as a regular cron job that e-mails the
results to administrators
Choose a low-usage time to run the check
-move option moves corrupted files to /lost+found
A corrupted file is one where all replicas of a block are missing
-delete option deletes corrupted files

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-5

Using dfsadmin
dfsadmin provides a number of administrative features
including:
List information about HDFS on a per-datanode basis
$ hadoop dfsadmin -report

Re-read the dfs.hosts and dfs.hosts.exclude files


$ hadoop dfsadmin -refreshNodes

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-6

Using dfsadmin (contd)


Manually set the filesystem to safe mode
NameNode starts up in safe mode
Read-only no changes can be made to the metadata
Does not replicate or delete blocks
Leaves safe mode when the (configured) minimum
percentage of blocks satisfy the minimum replication
condition
$ hadoop dfsadmin -safemode enter
$ hadoop dfsadmin -safemode leave

Can also block until safemode is exited


Useful for shell scripts
hadoop dfsadmin -safemode wait
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-7

Using dfsadmin (contd)


Saves the NameNode metadata to disk and resets the edit log
Must be in safe mode
$ hadoop dfsadmin -saveNamespace

More on this later

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-8

Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing
Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-9

Hands-On Exercise: Breaking the Cluster


In this hands-on exercise, you will introduce some problems into
the cluster
Please refer to the Hands-On Exercise Manual

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-10

Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing
Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-11

Copying Data
Hadoop clusters can hold massive amounts of data
A frequent requirement is to back up the cluster for disaster
recovery
Ultimately, this is not a Hadoop problem!
It's a "managing huge amounts of data" problem
Cluster could be backed up to tape etc if necessary
Custom software may be needed

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-12

Copying Data with distcp


distcp copies data within a cluster, or between clusters
Used to copy large amounts of data
Turns the copy procedure into a MapReduce job
Syntax:
hadoop distcp hdfs://nn1:8020/path/to/src \
hdfs://nn2:8020/path/to/dest
hdfs:// and port portions are optional if source and destination
are on the local cluster
Copies files or entire directories
Files previously copied will be skipped
Note that the only check for duplicate files is that the file's
name and size are identical

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-13

Copying Data: Best Practices


In practice, many organizations do not copy data between
clusters
Instead, they write their data to two clusters as it is being
imported
This is often more efficient
Not necessary to run all MapReduce jobs on the backup cluster
As long as the source data is available, all derived data can be
regenerated later

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-14

Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing
Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-15

Adding Cluster Nodes


To add nodes to the cluster:
1. Add the names of the nodes to the include file(s), if you are
using this method to explicitly list allowed nodes
The file(s) referred to by dfs.hosts (and mapred.hosts
if that has been used)
2. Update your rack awareness script with the new information
3. Update the NameNode with this new information
hadoop dfsadmin -refreshNodes
4. Start the new DataNode and TaskTracker instances
5. Restart the JobTracker (if you have changed mapred.hosts)
There is currently no way to refresh a running JobTracker
6. Check that the new DataNodes and TaskTrackers appear in the
Web UI

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-16

Adding Nodes: Points to Note


The NameNode will not favor a new node added to the cluster
It will not prefer to write blocks to the node rather than to other
nodes
This is by design
The assumption is that new data is more likely to be processed
by MapReduce jobs
If all new blocks were written to the new node, this would impact
data locality for MapReduce jobs

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-17

Removing Cluster Nodes

To remove nodes from the cluster:


1. Add the names of the nodes to the exclude file(s)
The file(s) referred to by dfs.hosts.exclude (and
mapred.hosts.exclude if that has been used)
2. Update the NameNode with the new set of DataNodes
hadoop dfsadmin -refreshNodes
The NameNode UI will show the admin state change to
Decommission In Progress for affected DataNodes
When all DataNodes report their state as
Decommissioned, all the blocks will have been replicated
elsewhere
3. Shut down the decommissioned nodes
4. Remove the nodes from the include and exclude files and
update the NameNode as above
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-18

Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing
Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-19

Cluster Rebalancing
An HDFS cluster can become unbalanced
Some Nodes have much more data on them than others
Example: add a new Node to the cluster
Even after adding some files to HDFS, this Node will have far
less data than the others
During MapReduce processing, this Node will use much more
network bandwidth as it retrieves data from other Nodes
Clusters can be rebalanced using the balancer utility

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-20

Using balancer
balancer reviews data block placement on nodes and adjusts
blocks to ensure all nodes are within x% utilization of each other
Utilization is defined as amount of data storage used
x is known as the threshold
A node is under-utilized if its utilization is less than (average
utilization - threshold)
A node is over-utilized if its utilization is more than (average
utilization + threshold)
Note: balancer does not consider block placement on individual
disks on a node
Only the utilization of the node as a whole

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-21

Using balancer (contd)


Syntax:
hadoop balancer -threshold x
Threshold is optional
Defaults to 10 (i.e., 10% difference in utilization between nodes)
Rebalancing can be canceled at any time
Interrupt the command with Ctrl-C
Bandwidth usage can be controlled by setting the property
dfs.balance.bandwidthPerSec in hdfs-site.xml
Specifies a bandwidth in bytes/sec that each DataNode can use
for rebalancing
Default is 1048576 (1MB/sec)
Recommendation: approx. 0.1 x network speed
e.g., for a 1Gbps network, 10MB/sec
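For a 1Gbps network, following the 0.1 x network speed guideline above, the hdfs-site.xml entry might be (10MB/sec = 10485760 bytes/sec):

<property>
  <!-- Per-DataNode bandwidth allowed for rebalancing, in bytes/sec -->
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
</property>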
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-22

When To Rebalance
A cluster should not normally become unbalanced during regular usage
Rebalance immediately after adding new nodes to the cluster
Rebalancing does not interfere with any existing MapReduce
jobs
However, it does use bandwidth
Not a good idea to rebalance during peak usage times

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-23

Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing
Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-24

Hands-On Exercise: Verifying The Cluster's


Self-Healing Features
In this Hands-On Exercise, you will verify that the cluster has
recovered from the problems you introduced in the last exercise
You will also cause the cluster some other problems, and
observe what happens
Please refer to the Hands-On Exercise Manual

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-25

Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing
Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-26

HDFS Replicates Data By Default


Recall that HDFS replicates blocks of data on a cluster
Default is three-fold replication
This means it is highly unlikely that data will be lost as a result of
an individual node failing
However, metadata from the NameNode must be backed up to
avoid disaster should the NameNode fail

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-27

The NameNode's Filesystem


The NameNode's directory structure looks like this:
${dfs.name.dir}/current/VERSION
                       /edits
                       /fsimage
                       /fstime
VERSION is a Java properties file containing information about
the version of HDFS that is running
fstime is a record of the time the last checkpoint was taken
(covered later)
Stored in Hadoop's Writable serializable format

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-28

Backing Up the NameNode


Recall that dfs.name.dir is a comma-separated list of
directories
Writes go to all directories in the list
Recommendation: write to two local directories on different
physical volumes, and to an NFS-mounted directory
Data will be preserved even in the event of a total failure of the
NameNode machines
Recommendation: soft-mount the NFS directory
If the NFS mount goes offline, this will not cause the NameNode
to fail

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-29

fsimage and the Secondary NameNode


The fsimage file also contains the filesystem metadata
It is not updated at every write
This would be very slow
When an HDFS client performs a write operation, it is recorded in
the Primary NameNode's edit log
The edits file
The NameNode's in-memory representation of the filesystem
metadata is also updated
System resilience is not compromised, since recovery can be
performed by loading the fsimage file and applying all the
changes in the edits file to that information
Does not record the DataNodes on which blocks are stored
This is reported to the NameNode by the DataNodes when
they join the cluster
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-30

fsimage and the Secondary NameNode


(contd)
Applying all changes in the edits file could take a long time
The file would also grow to be huge
The Secondary NameNode periodically checkpoints the
NameNode's in-memory filesystem data
1. Tells the NameNode to roll its edits file
2. Retrieves fsimage and edits from the NameNode
3. Loads fsimage into memory and applies the changes from the
edits file
4. Creates a new, consolidated fsimage file
5. Sends the new fsimage file back to the primary NameNode
6. The NameNode replaces the old fsimage file with the new
one, replaces the old edits file with the new one it created in
step 1, and updates the fstime file to record the checkpoint
time
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-31

fsimage and the Secondary NameNode


(cont'd)
This checkpointing operation is performed every hour
Configured by fs.checkpoint.period
Checkpointing will also occur if the edit log reaches 64MB
Configured by fs.checkpoint.size, in bytes
Secondary NameNode checks this size every five minutes
This determines the worst-case amount of data loss should the
primary NameNode crash
Note: the Secondary NameNode is not a live backup of the
primary NameNode!
Hadoop 0.21 renamed the Secondary NameNode as the
Checkpoint Node

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-32

Manually Backing Up fsimage and edits


You can retrieve copies of the fsimage and edits files from the
NameNode at any time via HTTP
fsimage: http://<namenode>:50070/getimage?getimage=1
edits: http://<namenode>:50070/getimage?getedit=1
Note: fsimage is a copy of the NameNode's fsimage file, not the
in-memory version of the metadata
It is good practice to regularly retrieve these files for offsite
backup
Typically done using a shell script and the curl utility or similar
The command hadoop dfsadmin -saveNamespace
will force the NameNode to write its in-memory metadata as a
new fsimage file
Replaces the old fsimage and edits files
NameNode must be in safe mode to do this
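A minimal sketch of such a backup, assuming a NameNode host named namenode and a hypothetical /backups directory:
# Fetch the current fsimage and edits files over HTTP for offsite backup
curl -o /backups/fsimage-$(date +%Y%m%d) "http://namenode:50070/getimage?getimage=1"
curl -o /backups/edits-$(date +%Y%m%d) "http://namenode:50070/getimage?getedit=1"
# Alternatively, force the NameNode to write a fresh fsimage (requires safe mode)
hadoop dfsadmin -safemode enter
hadoop dfsadmin -saveNamespace
hadoop dfsadmin -safemode leave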
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-33

Recovering the NameNode


To recover from a NameNode failure, or to restore from a backup:
1. Stop the NameNode
2. Remove existing copies of fsimage and edits from all
directories in the dfs.name.dir list
3. Place the recovery versions of fsimage and edits in the
appropriate directory
4. Ensure the recovery versions are owned by hdfs:hdfs
5. Start the NameNode

Note: Step 2 is crucial. Otherwise, the NameNode will copy the


first valid fsimage and edits files it finds into the other
directories in the list
Potentially overwriting your recovery versions
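As a rough sketch of these five steps, assuming a single hypothetical dfs.name.dir of /data/1/dfs/nn and recovery copies in /backups:
service hadoop-0.20-namenode stop                                    # 1. stop the NameNode
rm /data/1/dfs/nn/current/fsimage /data/1/dfs/nn/current/edits       # 2. remove existing copies
cp /backups/fsimage /backups/edits /data/1/dfs/nn/current/           # 3. place the recovery versions
chown hdfs:hdfs /data/1/dfs/nn/current/fsimage /data/1/dfs/nn/current/edits   # 4. fix ownership
service hadoop-0.20-namenode start                                   # 5. start the NameNode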

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-34

Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing
Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-35

Upgrading Software: When to Upgrade?


Cloudera provides production and beta releases
Production:
Ready for production clusters
Passed all unit tests, functional tests
Has been tested on large clusters over a significant period
Beta:
Recommended for people who want more features
Passed unit tests, functional tests
Do not have the same soak time as our Production packages
A work in progress that will eventually be promoted to Production
Cloudera supports a Production release for at least one year
after a subsequent release is available
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-36

Upgrading Software: Procedures


Software upgrade procedure is fully documented on the
Cloudera Web site
General steps:
1. Stop the MapReduce cluster
2. Stop HDFS cluster
3. Install the new version of Hadoop
4. Start the NameNode with the -upgrade option
5. Monitor the HDFS cluster until it reports that the upgrade is
complete
6. Start the MapReduce cluster
Time taken to upgrade HDFS depends primarily on the number of
blocks per datanode
In general, 20-30 blocks per second on each node, depending on
hardware
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-37

Upgrading Software: Procedures (cont'd)


Once the upgraded cluster has been running for a few days with
no problems, finalize the upgrade by running
hadoop dfsadmin -finalizeUpgrade
DataNodes delete their previous version working directories, then
the NameNode does the same
If you encounter problems, you can roll back an (unfinalized)
upgrade by stopping the cluster, then starting the old version of
HDFS with the -rollback option
Note that this upgrade procedure is required when HDFS data
structures or RPC communication format change
For example, from CDH3 to CDH4
Probably not required for minor version changes
But see the documentation for definitive information!
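A hedged sketch of the commands involved in steps 4 to 6 and in finalizing or rolling back (exact scripts and service names depend on how Hadoop was installed):
# Start the NameNode with the upgrade flag (or pass -upgrade via your init script)
hadoop-daemon.sh start namenode -upgrade
# Monitor the upgrade until it reports completion
hadoop dfsadmin -upgradeProgress status
# After a few days of trouble-free operation, make the upgrade permanent
hadoop dfsadmin -finalizeUpgrade
# Or, to abandon an unfinalized upgrade, stop the cluster and start the old HDFS version with
hadoop-daemon.sh start namenode -rollback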
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-38

Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing
Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-39

Conclusion
In this chapter, you have learned:
How to check the status of HDFS
How to copy data between clusters
How to add and remove nodes
How to rebalance the cluster
The purpose of the Secondary NameNode
What strategies to employ for NameNode Metadata backup
How to upgrade your cluster

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

06-40

Chapter 7
Cluster Monitoring and
Troubleshooting

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-1

Cluster Monitoring and Troubleshooting


In this chapter, you will learn:
What general system conditions to monitor
How to use the NameNode and JobTracker Web UIs
How to view and manage Hadoop's log files
How the Ganglia monitoring tool works
Some common cluster problems, and their resolutions
How to benchmark your cluster's performance

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-2

Cluster Monitoring and Troubleshooting


General System Monitoring
Managing Hadoop's Log Files
Using the NameNode and JobTracker Web UI
Hands-On Exercise: Examining the Web UI
Cluster Monitoring with Ganglia
Common Troubleshooting Issues
Benchmarking Your Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-3

General System Monitoring


Later in this chapter you will see how to use Ganglia to monitor
your cluster
You should also use a general system monitoring tool to warn
you of potential or actual problems on individual machines in the
cluster
Many such tools exist, including
Nagios
Cacti
Hyperic
Zabbix
We do not have a specific recommendation
Use the tools with which you are most familiar
Here we present a list of items to monitor
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-4

Items to Monitor
Monitor the Hadoop daemons
Alert an operator if a daemon goes down
Check can be done with
service hadoop-0.20-daemon_name status
Monitor disks and disk partitions
Alert immediately if a disk fails
Send a warning when a disk reaches 80% capacity
Send a critical alert when a disk reaches 90% capacity
Monitor CPU usage on master nodes
Send an alert on excessive CPU usage
Slave nodes will often reach 100% usage
This is not a problem
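As one illustration, a disk-capacity check along these lines (thresholds from this list, partition path hypothetical) could be wired into whichever monitoring tool you use:
# Warn at 80% used, go critical at 90% used, for the example partition /data/1
df -P /data/1 | awk 'NR==2 {use=$5+0;
  if (use>=90) print "CRITICAL: " use "% used";
  else if (use>=80) print "WARNING: " use "% used";
  else print "OK: " use "% used"}'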

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-5

Items to Monitor (cont'd)


Monitor swap on all nodes
Alert if the swap partition starts to be used
Memory allocation is overcommitted
Monitor network transfer speeds
Ensure that the Secondary NameNode checkpoints regularly
Check the age of the fsimage file and/or check the size of the
edits file
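One simple way to check that checkpoints are still happening, assuming fs.checkpoint.dir is /var/lib/hadoop/dfs/namesecondary (a hypothetical path, not a course default):
# Report any fsimage not modified in the last two hours; output here suggests checkpointing has stalled
find /var/lib/hadoop/dfs/namesecondary -name fsimage -mmin +120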

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-6

Cluster Monitoring and Troubleshooting


General System Monitoring
Managing Hadoop's Log Files
Using the NameNode and JobTracker Web UI
Hands-On Exercise: Examining the Web UI
Cluster Monitoring with Ganglia
Common Troubleshooting Issues
Benchmarking Your Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-7

Hadoop Logging Basics


Hadoop logs are stored in $HADOOP_INSTALL/logs by default
Can be changed by setting the HADOOP_LOG_DIR environment
variable in hadoop-env.sh
Typically set to /var/log/hadoop/
Each Hadoop daemon writes two logfiles
*.log file written via log4j
Standard log4j configuration rotates logfiles daily
Old logfiles are not deleted (or gzipped) by default
First port of call when diagnosing problems
*.out file
Combination of stdout and stderr during daemon startup
Doesn't usually contain much output
Rotated when daemon restarts, five files retained
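For example, the log directory is typically redirected with a single line in hadoop-env.sh:
# In conf/hadoop-env.sh
export HADOOP_LOG_DIR=/var/log/hadoop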
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-8

Hadoop Logging Basics (cont'd)


Log file names:
hadoop-<user-running-hadoop>-<daemon>-<hostname>.{log|out}
Example:
hadoop-hadoop-datanode-r2n13.log
Configuration for log4j is at conf/log4j.properties
Log file growth:
Slow when cluster is idle
Can be very rapid when jobs are running
Monitor log directory to avoid out-of-space errors since old .log
files are not deleted by default

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-9

log4j Configuration
Log4j configuration is controlled by conf/log4j.properties
Default log level configured by hadoop.root.logger
Default is INFO
Log level can be set for any specific class with
log4j.logger.class.name = LEVEL
Example:
log4j.logger.org.apache.hadoop.mapred.JobTracker=INFO

Valid log levels:


FATAL, ERROR, WARN, INFO, DEBUG, TRACE
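Log levels can also be changed on a running daemon (until its next restart) with the daemonlog tool; a hedged example against the JobTracker, using its Web UI port and a placeholder host name:
# Query, then raise, the JobTracker's log level at runtime
hadoop daemonlog -getlevel jthost:50030 org.apache.hadoop.mapred.JobTracker
hadoop daemonlog -setlevel jthost:50030 org.apache.hadoop.mapred.JobTracker DEBUG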

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-10

The DailyRollingFileAppender
An Appender is the destination for log messages
Hadoops default for daemon logs is the
DailyRollingFileAppender (DRFA)
Rotates logfiles daily
Frequency is configurable
Cannot limit filesize
Cannot limit the number of files kept
You must provide your own scripts to compress, archive, delete
logs
DRFA is the most popular choice for Hadoop logs
It is the default, and many system administrators are not familiar
with Java logging

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-11

An Alternative Appender: RollingFileAppender


RollingFileAppender (RFA) is also available in CDH
Despite its name, not a superclass of DRFA
Lets you specify the maximum size of generated log files
Lets you set the number of files retained
To use RFA:
Edit $HADOOP_HOME/bin/hadoop-daemon.sh
change
export HADOOP_ROOT_LOGGER="INFO,DRFA"
to
export HADOOP_ROOT_LOGGER="INFO,RFA"

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-12

An Alternative Appender: RollingFileAppender


(cont'd)
To configure, edit /etc/hadoop/conf/log4j.properties
Uncomment the lines under
#
# Rolling File Appender
#
(except for the comment line # Log file size)
Edit to suit; in particular:
log4j.appender.RFA.MaxFileSize=100MB
log4j.appender.RFA.MaxBackupIndex=30

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-13

Job Logs: Created By Hadoop


When a job runs, two files are created:
The job configuration XML file
Contains job configuration settings specified by the developer
The job status file
Constantly updated as the job runs
Includes counters, task status information etc
These files are stored in multiple places:

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-14

Locations of Job Log Files


${hadoop.log.dir}/<job_id>_conf.xml (Job configuration
XML only)
Retained for mapred.jobtracker.retirejob.interval
milliseconds
Default is one day (24 * 60 * 60 * 1000)
hadoop.job.history.location
Default is ${hadoop.log.dir}/history
Retained for 30 days
hadoop.job.history.user.location
Default is <job_output_dir_in_hdfs>/_logs/history
Retained forever
In addition, the JobTracker keeps the data in memory for a
limited time
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-15

Developer Log Files


Caution: inexperienced developers will often create large log
files from their jobs
Data written to stdout/stderr
Data written using log4j from within the code
Large developer logfiles can run your slave nodes out of disk
space
Developer logs are stored in ${hadoop.log.dir}/userlogs
This is hardcoded
Ensure you have enough room on the partition for logs
Developer logs are deleted according to
mapred.userlog.retain.hours
Default is 24 hours

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-16

Cluster Monitoring and Troubleshooting


General System Monitoring
Managing Hadoop's Log Files
Using the NameNode and JobTracker Web UI
Hands-On Exercise: Examining the Web UI
Cluster Monitoring with Ganglia
Common Troubleshooting Issues
Benchmarking Your Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-17

Common Hadoop Ports


Hadoop daemons each provide a Web-based User Interface
Useful for both users and system administrators
Expose information on a variety of different ports
Port numbers are configurable, although there are defaults for
most
Hadoop also uses various ports for components of the system to
communicate with each other

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-18

Hadoop Ports for Administrators


Daemon        Default Port                  Configuration Parameter               Used for
NameNode      8020                          fs.default.name                       Filesystem metadata operations
DataNode      50010                         dfs.datanode.address                  DFS data transfer
DataNode      50020                         dfs.datanode.ipc.address              Block metadata operations and recovery
BackupNode    50100                         dfs.backup.address                    HDFS metadata operations (from Hadoop 0.21)
JobTracker    Usually 8021, 9001, or 8012   mapred.job.tracker                    Job submission, TaskTracker heartbeats
TaskTracker   Usually 8021, 9001, or 8012   mapred.task.tracker.report.address    Communicating with child jobs

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-19


Web UI Ports for Users

Daemon                    Default Port   Configuration Parameter
NameNode                  50070          dfs.http.address
DataNode                  50075          dfs.datanode.http.address
Secondary NameNode        50090          dfs.secondary.http.address
Backup/Checkpoint Node*   50105          dfs.backup.http.address
JobTracker                50030          mapred.job.tracker.http.address
TaskTracker               50060          mapred.task.tracker.http.address

* From Hadoop 0.21 onwards


Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-20

The JobTracker Web UI


JobTracker exposes its Web UI on port 50030

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-21

Drilling Down to Individual Jobs


Clicking on an individual job name will reveal more information
about that job

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-22

Stopping MapReduce Jobs From the Web UI


By default, the JobTracker Web UI is read-only
Job information is displayed, but the job cannot be controlled in
any way
It is possible to set the UI to allow jobs, or individual Map or
Reduce tasks, to be killed
Add the following property to core-site.xml
<property>
<name>webinterface.private.actions</name>
<value>true</value>
</property>

Restart the JobTracker


Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-23

Stopping Jobs From the Web UI (cont'd)


The Web UI will now include an actions column for each task
And an overall option to kill entire jobs

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-24

Stopping Jobs From the Web UI (cont'd)


Caution: anyone with access to the Web UI can now manipulate
running jobs!
Best practice: make this available only to administrative users
Better to use the command-line to stop jobs
Discussed earlier in the course

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-25

Cluster Monitoring and Troubleshooting


General System Monitoring
Managing Hadoop's Log Files
Using the NameNode and JobTracker Web UI
Hands-On Exercise: Examining the Web UI
Cluster Monitoring with Ganglia
Common Troubleshooting Issues
Benchmarking Your Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-26

Hands-On Exercise: Examining the Web UI


In this brief Hands-On Exercise, you will examine the NameNode
Web UI (at http://<namenode_location>:50070/) and JobTracker
Web UI (at http://<jobtracker_location>:50030/)
From the command line, run a Hadoop job
Example:
hadoop jar /usr/lib/hadoop/hadoop-examples.jar \
sleep -m 10 -r 10 -mt 10000 -rt 10000
Open the JobTracker Web UI and view the progress of the
Mappers and Reducers
Investigate the NameNode's Web UI

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-27

Cluster Monitoring and Troubleshooting


General System Monitoring
Managing Hadoop's Log Files
Using the NameNode and JobTracker Web UI
Hands-On Exercise: Examining the Web UI
Cluster Monitoring with Ganglia
Common Troubleshooting Issues
Benchmarking Your Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-28

Hadoop Metrics
Hadoop can be configured to log many different metrics
Metrics are grouped into contexts
jvm
Statistics from the JVM including memory usage, thread
counts, garbage collection information
All Hadoop daemons use this context
dfs
NameNode capacity, number of files, under-replicated blocks
mapred
JobTracker information, similar to that found on the
JobTracker's Web status page
rpc
For Remote Procedure Calls
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-29

Hadoop Metrics (cont'd)


Configure in conf/hadoop-metrics.properties
Example to log metrics to files:
# Configuration of the dfs context for file
dfs.class=org.apache.hadoop.metrics.file.FileContext
dfs.period=10
# You'll want to change the path
dfs.fileName=/tmp/dfsmetrics.log
# Configuration of the mapred context for file
mapred.class=org.apache.hadoop.metrics.file.FileContext
mapred.period=10
mapred.fileName=/tmp/mrmetrics.log
# Configuration of the jvm context for file
jvm.class=org.apache.hadoop.metrics.file.FileContext
jvm.period=10
jvm.fileName=/tmp/jvmmetrics.log
# Configuration of the rpc context for file
rpc.class=org.apache.hadoop.metrics.file.FileContext
rpc.period=10
rpc.fileName=/tmp/rpcmetrics.log

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-30

Hadoop Metrics (cont'd)


Example dfs metrics:
dfs.datanode: hostName=doorstop.local, sessionId=, blockReports_avg_time=0,
blockReports_num_ops=1, block_verification_failures=0, blocks_read=0,
blocks_removed=0, blocks_replicated=0, blocks_verified=0,
blocks_written=44, bytes_written=64223, copyBlockOp_avg_time=0,
copyBlockOp_num_ops=0, heartBeats_avg_time=1, heartBeats_num_ops=7,
readBlockOp_avg_time=0, readBlockOp_num_ops=0, readMetadataOp_avg_time=0,
readMetadataOp_num_ops=0, reads_from_local_client=0,
reads_from_remote_client=0, replaceBlockOp_avg_time=0,
replaceBlockOp_num_ops=0, writeBlockOp_avg_time=5, writeBlockOp_num_ops=44,
writes_from_local_client=44, writes_from_remote_client=0
dfs.namenode:
hostName=doorstop.local, sessionId=, AddBlockOps=44, CreateFileOps=44,
DeleteFileOps=0, FilesCreated=59, FilesRenamed=0, GetBlockLocations=0,
GetListingOps=1, SafemodeTime=102, Syncs_avg_time=0, Syncs_num_ops=100,
Transactions_avg_time=0, Transactions_num_ops=148, blockReport_avg_time=0,
blockReport_num_ops=1, fsImageLoadTime=98
dfs.FSNamesystem:
hostName=doorstop.local, sessionId=, BlocksTotal=44,
CapacityRemainingGB=78, CapacityTotalGB=201, CapacityUsedGB=0,
FilesTotal=60, PendingReplicationBlocks=0, ScheduledReplicationBlocks=0,
TotalLoad=1, UnderReplicatedBlocks=44

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-31

Hadoop Metrics (cont'd)


Metrics are also exposed via the Web interface at
http://<namenode_address>:50070/metrics
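The same pages can be fetched from the command line, which is handy for ad hoc scripting (host names are placeholders):
curl http://namenode:50070/metrics        # NameNode metrics
curl http://jthost:50030/metrics          # JobTracker metrics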

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-32

Monitoring Challenges
System monitoring becomes a challenge when dealing with large
numbers of systems
Multiple solutions exist, such as
Nagios
Hyperic
Zabbix
Many of these are very general purpose
Fine for monitoring the machines themselves
Not so useful for integrating with Hadoop

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-33

Monitoring Cluster Metrics with Ganglia


Ganglia is an open-source, scalable, distributed monitoring
product for high-performance computing systems
Specifically designed for clusters of machines
Collects, aggregates, and provides time-series views of metrics
Integrates with Hadoops metrics-collection system
Note: Ganglia doesn't provide alerts

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-34

Ganglia Network Architecture

(Diagram: each cluster node runs a GMOND data collector; a Web server runs the GMETAD
data consolidator, which aggregates the collected metrics into an rrdtool database and
presents them via Apache and PHP scripts.)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-35

Example Ganglia Web App Output

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-36

Ganglia Configuration
Install the GMOND daemon on every cluster Node
Make sure port 8649 is open for both UDP and TCP connections
Install GMETAD on a Web server
Configure Hadoop to publish metrics to Ganglia in
conf/hadoop-metrics.properties
Example:
# Configuration of the "dfs" context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaC
ontext
dfs.period=10
dfs.servers=127.0.0.1:8649

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-37

Ganglia Versions
Ganglia 3.0.x and 3.1 both work well with Hadoop out-of-the-box
Ganglia 3.1 also works out-of-the-box with CDH
Ganglia 3.1 uses
org.apache.hadoop.metrics.ganglia.GangliaContext31

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-38

Cluster Monitoring and Troubleshooting


General System Monitoring
Managing Hadoop's Log Files
Using the NameNode and JobTracker Web UI
Hands-On Exercise: Examining the Web UI
Cluster Monitoring with Ganglia
Common Troubleshooting Issues
Benchmarking Your Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-39

General Troubleshooting Issues: Introduction


On the next few slides you will find some common problems
exhibited by clusters, and suggested solutions
Note that these are just some of the issues you could run into on
a cluster
Also note that these are possible causes and resolutions
The problems could equally be caused by many other issues

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-40

Map/Reduce Task Out Of Memory Error


FATAL org.apache.hadoop.mapred.TaskTracker:
Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask
$MapOutputBuffer.<init>(MapTask.java:781)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Possible causes
Map or Reduce task has run out of memory
Possibly due to a memory leak in the job code
Possible resolution
Increase size of RAM allocated in mapred.child.java.opts
Ensure io.sort.mb is smaller than RAM allocated in
mapred.child.java.opts

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-41

JobTracker Out Of Memory Error


ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:122)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:653)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3965)
at org.apache.hadoop.mapred.EagerTaskInitializationListener
$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor
$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor
$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Cause: JobTracker has exceeded allocated memory


Possible resolutions
Increase JobTracker's memory allocation
Reduce mapred.jobtracker.completeuserjobs.maximum
Amount of job history held in JobTracker's RAM
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-42

Too Many Fetch Failures


INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for
output of task:

Cause
Reducers are failing to fetch intermediate data from a
TaskTracker where a Map process ran
Too many of these failures will cause a TaskTracker to be
blacklisted
Possible resolutions
Increase tasktracker.http.threads
Decrease mapred.reduce.parallel.copies
Upgrade to CDH3u2
The version of Jetty (the Web server) in earlier versions of the
TaskTracker was prone to fetch failures
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-43

Not Able To Place Enough Replicas


WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
enough replicas

Possible causes
Fewer DataNodes available than the replication factor of the
blocks
DataNodes do not have enough xciever threads
Default is 256 threads to manage connections
Note: yes, the configuration option is misspelled!
Possible resolutions
Increase dfs.datanode.max.xcievers to 4096
Check replication factor

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-44

Cluster Monitoring and Troubleshooting


General System Monitoring
Managing Hadoop's Log Files
Using the NameNode and JobTracker Web UI
Hands-On Exercise: Examining the Web UI
Cluster Monitoring with Ganglia
Common Troubleshooting Issues
Benchmarking Your Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-45

Why Benchmark?
Common question after adding new nodes to a cluster:
How much faster is my cluster running now?
Benchmarking clusters is not an exact science
Performance depends on the type of job you're running
Standard benchmark is Terasort
Example: Generate a 10,000,000-line file, each line containing
100 bytes, then sort that file
hadoop jar $HADOOP_HOME/hadoop-*-examples.jar teragen 10000000 input_dir
hadoop jar $HADOOP_HOME/hadoop-*-examples.jar terasort input_dir output_dir

This is predominantly testing network and disk I/O performance
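If you also want to confirm that the sorted output is correct, the examples jar includes a validation job; a hedged sketch using the directories from the example above:
hadoop jar $HADOOP_HOME/hadoop-*-examples.jar teravalidate output_dir validate_dir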


Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-46

Real-World Benchmarks
Test your cluster before and after adding nodes
Remember to take into account other jobs running on the nodes
while you're benchmarking!
As a (very high-end!) guide: in April 2009, Arun Murthy and Owen
O'Malley at Yahoo! sorted a terabyte of data in 62 seconds on a
cluster of 1,406 nodes
Albeit using a somewhat modified version of Hadoop

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-47

Cluster Monitoring and Troubleshooting


General System Monitoring
Managing Hadoop's Log Files
Using the NameNode and JobTracker Web UI
Hands-On Exercise: Examining the Web UI
Cluster Monitoring with Ganglia
Common Troubleshooting Issues
Benchmarking Your Cluster
Conclusion
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-48

Conclusion
In this chapter, you have learned:
What general system conditions to monitor
How to use the NameNode and JobTracker Web UIs
How to view and manage Hadoop's log files
How the Ganglia monitoring tool works
Some common cluster problems, and their resolutions
How to benchmark your cluster's performance

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

07-49

Chapter 8
Populating HDFS From
External Sources

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-1

Populating HDFS Using Flume and Sqoop


In this chapter, you will learn:
What Flume is
How Flume works
What Sqoop is
How to use Sqoop to import data from RDBMSs to HDFS
Best practices for importing data
Note: In this chapter we can only provide a brief
overview of Flume and Sqoop; consult the
documentation at
http://archive.cloudera.com/docs/
for full details on installation and configuration
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-2

Populating HDFS Using Flume and Sqoop


An Overview of Flume
Hands-On Exercise: Using Flume
An Overview of Sqoop
Best Practices for Importing Data
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-3

What Is Flume?
Flume is a distributed, reliable, available service for efficiently
moving large amounts of data as it is produced
Ideally suited to gathering logs from multiple systems and
inserting them into HDFS as they are generated
Developed in-house by Cloudera, and released as open-source
software
Design goals:
Reliability
Scalability
Manageability
Extensibility

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-4

Flume: High-Level Overview

(Diagram: Agents on the machines generating data send events, optionally encrypted, to
Processors. Processors are optionally deployable and centrally administered, and can
pre-process incoming data: transformations, suppressions, metadata enrichment. They
forward events, optionally compressed, batched, or encrypted, to one or more Collectors,
which write to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others).
Writes are parallelized across many collectors, giving as much write throughput as
required. A Master communicates with all Agents, specifying configuration. Multiple
configurable levels of reliability are available; Agents can guarantee delivery in the
event of failure. Decorators can be deployed flexibly at any step to improve performance,
reliability, or security.)

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-5

Flume's Design Goals: Reliability


Flume is designed to continue delivering events in the face of
system component failure
Provides configurable reliability guarantees
End-to-end
Once Flume acknowledges receipt of an event, the event will
eventually make it to the end of the pipeline
Store on failure
Nodes require acknowledgment of receipt from the node one
hop downstream
Best effort
No attempt is made to confirm receipt of data from the node
downstream

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-6

Flume's Design Goals: Scalability


Flume scales horizontally
As load increases, more machines can be added to the
configuration
Aggregator nodes can be configured to receive data from
multiple upstream nodes and then pass them on down the chain

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-7

Flume's Design Goals: Manageability


Flume provides a central Master controller
System Administrators can monitor data flows and reconfigure
them on the fly
Via a Web interface or a scriptable command-line shell
No remote logging in to a machine on which a Flume node is
running is required to change configuration

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-8

Flume's Design Goals: Extensibility


Flume can be extended by adding connectors to existing storage
layers or data platforms
General sources already provided include data from files, syslog,
and standard output (stdout) from a process
General endpoints already provided include files on the local
filesystem or in HDFS
Other connectors can be added using Flume's API

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-9

Flume: General System Architecture


The Master holds configuration information for each Node, plus a
version number for that node
Version number is associated with the Node's configuration
Nodes communicate with the Master every five seconds
Node passes its version number to the Master
If the Master has a later version number for the Node, it tells the
Node to reconfigure itself
The Node then requests the new configuration information
from the Master, and dynamically applies that new
configuration

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-10

Flume Node Characteristics


Each node has a source and a sink
Source tells the node where to receive data from
Sink tells the node where to send data to
Sink can have one or more decorators
Decorators perform simple processing on the data as it passes
through, such as:
Compression
Encryption
awk, grep-like functionality

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-11

Installing and Using Flume


Flume is available as a tarball, RPM or Debian package
Once installed, start the Master
Usually achieved via an init script, or as a stand-alone process
with
flume master
Configure Agent Nodes on the machine(s) generating the data
Minimum configuration:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>flume.master.servers</name>
<value>master_host_name</value>
</property>
</configuration>
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-12

Installing and Using Flume (cont'd)


Start the Agent Node(s)
Typically via an init script, or as a stand-alone process with
flume node
Configure the Agent Nodes via the Master's Web interface at
http://<master_host_name>:35871/

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-13

Populating HDFS Using Flume and Sqoop


An Overview of Flume
Hands-On Exercise: Using Flume
An Overview of Sqoop
Best Practices for Importing Data
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-14

Hands-On Exercise: Using Flume


In this Hands-On Exercise you will create a simple Flume
configuration to store dynamically-generated data in HDFS
Please refer to the Exercise Manual

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-15

Populating HDFS Using Flume and Sqoop


An Overview of Flume
Hands-On Exercise: Using Flume
An Overview of Sqoop
Best Practices for Importing Data
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-16

What is Sqoop?
Sqoop is the SQL-to-Hadoop database import tool
Developed at Cloudera
Open-source
Included as part of Cloudera's Distribution including Apache
Hadoop (CDH)
Designed to import data from RDBMSs (Relational Database
Management Systems) into Hadoop
Can also send data the other way, from Hadoop to an RDBMS
Uses JDBC (Java Database Connectivity) to connect to the
RDBMS

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-17

How Does Sqoop Work?


Sqoop examines each table and automatically generates a Java
class to import data into HDFS
It then creates and runs a Map-only MapReduce job to import the
data
By default, four Mappers connect to the RDBMS
Each imports a quarter of the data

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-18

Sqoop Features
Imports a single table, or all tables in a database
Can specify which rows to import
Via a WHERE clause
Can specify which columns to import
Can provide an arbitrary SELECT statement
Sqoop can automatically create a Hive table based on the
imported data
Supports incremental imports of data
Can export data from HDFS to a database table

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-19

Sqoop Connectors
Cloudera has partnered with third parties to create Sqoop
connectors
Add-ons to Sqoop which use a database's native protocols to
import data, rather than JDBC
Typically orders of magnitude faster
Not open-source, but freely downloadable from the Cloudera Web
site
Current products supported
Oracle Database
MicroStrategy
Netezza
Others being developed
Microsoft has produced a version of Sqoop optimized for SQL
Server
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-20

Sqoop Usage Examples


List all databases
sqoop list-databases --username fred --password derf \
--connect jdbc:mysql://dbserver.example.com/

List all tables in the world database


sqoop list-tables --username fred --password derf \
--connect jdbc:mysql://dbserver.example.com/world

Import all tables in the world database


sqoop import-all-tables --username fred --password derf \
--connect jdbc:mysql://dbserver.example.com/world
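A single-table import restricted to certain rows and columns might look like the following (the table and column names are hypothetical):
sqoop import --username fred --password derf \
--connect jdbc:mysql://dbserver.example.com/world \
--table City --columns "Name,Population" --where "Population > 1000000"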

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-21

Populating HDFS Using Flume and Sqoop


An Overview of Flume
Hands-On Exercise: Using Flume
An Overview of Sqoop
Best Practices for Importing Data
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-22

What Do Others See As Data Is Imported?


When a client starts to write data to HDFS, the NameNode marks
the file as existing, but being of zero size
Other clients will see that as an empty file
After each block is written, other clients will see that block
They will see the file growing as it is being created, one block at a
time
This is typically not a good idea
Other clients may begin to process a file as it is being written

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-23

Importing Data: Best Practices


Best practice is to import data into a temporary directory
After it's completely written, move data to the target directory
This is an atomic operation
Happens very quickly since it merely requires an update of the
NameNode's metadata
Many organizations standardize on a directory structure such as
/incoming/<import_job_name>/<files>
/for_processing/<import_job_name>/<files>
/completed/<import_job_name>/<files>
It is the job's responsibility to move the files from
for_processing to completed after the job has finished
successfully
Discussion point: your best practices?
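A minimal sketch of this staging pattern, using a hypothetical import job name:
# Land the files in the incoming area first
hadoop fs -mkdir /incoming/weblogs_2012_01_15
hadoop fs -put /local/exports/*.log /incoming/weblogs_2012_01_15/
# Then move the whole directory into place; the rename is an atomic metadata-only operation
hadoop fs -mv /incoming/weblogs_2012_01_15 /for_processing/weblogs_2012_01_15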
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-24

Populating HDFS Using Flume and Sqoop


An Overview of Flume
Hands-On Exercise: Using Flume
An Overview of Sqoop
Best Practices for Importing Data
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-25

Conclusion
In this chapter, you have learned:
What Flume is
How Flume works
What Sqoop is
How to use Sqoop to import data from RDBMSs to HDFS
Best practices for importing data

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

08-26

Chapter 9
Installing and Managing
Other Hadoop Projects

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-1

Installing and Managing Other Hadoop Projects


In this chapter, you will learn:
What features Hive, HBase, and Pig provide
What administrative requirements they impose

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-2

Note
Note that this chapter does not go into any significant detail
about Hive, HBase, or Pig
Our intention is to draw your attention to issues System
Administrators will need to deal with, if users request these
products be installed
For more details on the products themselves, Cloudera offers
dedicated training courses on HBase, and on Hive and Pig

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-3

Installing and Managing Other Hadoop Projects


Hive
Pig
HBase
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-4

Using Hive to Query Large Datasets


Hive is a project initially created at Facebook
Now a top-level Apache project
Motivation: many data analysts are very familiar with SQL (the
Structured Query Language)
The de facto standard for querying data in Relational Database
Management Systems (RDBMSs)
Data analysts tend to be far less familiar with programming
languages such as Java
Hive provides a way to query data in HDFS using an SQL-like language rather than Java
Around 99% of Facebook's Hadoop jobs are now created by the
Hive interpreter

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-5

Sample Hive Query


SELECT * from movies m JOIN scores s ON (m.id = s.movie_id)
WHERE m.year > 1995
ORDER BY m.name DESC
LIMIT 50;

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-6

What Hive Provides


Hive allows users to query data using HiveQL, a language very
similar to standard SQL
Hive turns HiveQL queries into standard MapReduce jobs
Automatically runs the jobs, and displays the results to the user
Note that Hive is not an RDBMS!
Results take many seconds, minutes, or even hours to be
produced
Not possible to modify the data using HiveQL
UPDATE and DELETE are not supported

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-7

Getting Data Into Hive


A Table in Hive represents an HDFS directory
Hive interprets all files in the directory as the contents of the table
by knowing how the columns and rows are delimited within the
files
As well as the datatypes and names of the resulting columns
Stores this information in the Hive Metastore

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-8

Installing Hive
Hive runs on a user's machine
Not on the Hadoop cluster itself
A user can set up Hive with no System Administrator input
Using the standard Hive command-line or Web-based
interface
If users will be running JDBC-based clients, Hive should be run
as a service on a centrally-available machine
By default, Hive uses a Metastore on the user's machine
Metastore uses Derby, a Java-based RDBMS
If multiple users will be running Hive, the System Administrator
should configure a shared Metastore for all users

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-9

Creating a Shared Metastore


A shared Metastore is a database in an RDBMS such as MySQL
Configuration is simple:
1. Create a user and database in your RDBMS
2. Modify hive-site.xml on each user's machine to refer to the
shared Metastore

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-10

Sample hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://DB_HOST_NAME:DB_PORT/DATABASE_NAME</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>USERNAME</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>PASSWORD</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://NAMENODE_HOST:NAMENODE_PORT/user/hive/warehouse</value>
</property>
</configuration>
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-11

Installing and Managing Other Hadoop Projects


Hive
Pig
HBase
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-12

What Is Pig?
Pig is another high-level abstraction on top of MapReduce
Originally developed at Yahoo!
Now a top-level Apache project
Provides a scripting language known as Pig Latin
Abstracts MapReduce details away from the user
Composed of operations that are applied to the input data to
produce output
Language will be relatively easy to learn for people experienced
in Perl, Ruby or other scripting languages
Fairly easy to write complex tasks such as joins of multiple
datasets
Under the covers, Pig Latin scripts are converted to MapReduce
jobs
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-13

Sample Pig Script


movies = LOAD '/data/films' AS
(id:int, name:chararray, year:int);
ratings = LOAD '/data/ratings' AS
(movie_id: int, user_id: int, score:int);
jnd = JOIN movies BY id, ratings BY movie_id;
recent = FILTER jnd BY year > 1995;
srtd = ORDER recent BY name DESC;
justafew = LIMIT srtd 50;
STORE justafew INTO '/data/pigoutput';

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-14

Installing Pig
Pig runs as a client-side application
There is nothing extra to install on the cluster
Set the configuration file to point to the Hadoop cluster
In the pig.properties file in Pig's conf directory, set
fs.default.name=hdfs://<namenode_location>/
mapred.job.tracker=<jobtracker_location>:8021

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-15

Installing and Managing Other Hadoop Projects


Hive
Pig
HBase
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-16

HBase: A Column-Oriented Datastore


HBase is a distributed, sparse, column-oriented datastore
Distributed: designed to use multiple machines to store and serve
data
Leverages HDFS
Sparse: each row may or may not have values for all columns
Column-oriented: Data is stored grouped by column, rather than
by row
Columns are grouped into column families, which define
what columns are physically stored together
Modeled after Google's BigTable datastore

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-17

HBase Usage Scenarios


Storing large amounts of data
Hundreds of gigabytes up to petabytes
Situations requiring high write throughput
Thousands of insert, update or delete operations per second
Rapid lookup of values by key

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-18

HBase Terminology
Region
A subset of a table's rows
Similar to a partition
HRegionServer
Serves data for reads and writes
Master
Responsible for coordinating HRegionServers
Assigns Regions, detects failures of HRegionServers, and
controls administrative functions

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-19

Installing and Running HBase


CDH3 includes HBase
In conf/hbase-site.xml, set the hbase.rootdir property to
point to the Hadoop filesystem to use
Don't manually create this directory the first time; HBase will do
so, and will add all the required files
Assuming you are running a fully-distributed Hadoop cluster,
set the hbase.cluster.distributed property to true
Edit the file ${HBASE_HOME}/conf/regionservers to list all
hosts running HRegionServers
Start HDFS
Start HBase by running ${HBASE_HOME}/bin/start-hbase.sh
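A hedged sketch of that start-up sequence and a quick sanity check (script locations depend on your installation):
start-dfs.sh                              # start HDFS first
${HBASE_HOME}/bin/start-hbase.sh          # then start HBase
${HBASE_HOME}/bin/hbase shell             # verify interactively, e.g. with the 'status' command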

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-20

Advanced HBase Configuration


HBase Master includes a built-in version of ZooKeeper
A scalable, highly-available system that facilitates coordination
among distributed processes
Commonly used to provide locking, configuration, and naming
services
Larger installations of HBase should use a separate ZooKeeper
cluster
Typically three or five machines external to the Hadoop cluster
There are other, more complex configuration options available
for HBase
Cloudera offers an HBase training course which covers many of
these issues

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-21

HBase Best Practices


We recommend deploying HBase on a dedicated cluster
HBase clients are latency-sensitive
General MapReduce jobs are batch jobs, with spiky load
characteristics
Mixing the two can cause problems
Deploy RegionServers on all machines running the DataNode
daemon
Only deploying RegionServers on some nodes can result in
uneven storage utilization
Allocate plenty of RAM to slave nodes
Extra RAM will be used by the operating system for file system
caches, resulting in faster disk I/O

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-22

HBase Best Practices


Never deploy ZooKeeper on slave nodes
ZooKeeper is extremely latency-sensitive
If a ZooKeeper node has to wait for a disk operation, or swaps
out of memory, it may be considered dead by its quorum peers
If ZooKeeper fails, the HBase cluster will fail
Increase dfs.datanode.max.xcievers to 8096 or even higher
Xcievers handle sending and receiving block data
HBase uses many xcievers
Allocate 8GB to 12GB of RAM to each RegionServer
Do not run the HDFS balancer!
This will destroy data locality for RegionServers
Deploy HBase on a homogeneous cluster
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-23

Installing and Managing Other Hadoop Projects


Hive
Pig
HBase
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-24

Conclusion
In this chapter, you have learned:
What features Hive, HBase, and Pig provide
What administrative requirements they impose

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

09-25

Chapter 10
Conclusion

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

10-1

Conclusion
During this course, you have learned:
The core technologies of Hadoop
How to plan your Hadoop cluster hardware and software
How to deploy a Hadoop cluster
How to schedule jobs on the cluster
How to maintain your cluster
How to monitor, troubleshoot, and optimize the cluster
What system administrator issues to consider when installing
Hive, HBase and Pig
How to populate HDFS from external sources
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

10-2

Next Steps
Cloudera offers a number of other training courses:
Developer training for Hadoop
Hive training
HBase training
Custom courses
Cloudera also provides consultancy and troubleshooting
services
Please ask your instructor for more information

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

10-3

Class Evaluation
Please take a few minutes to complete the class evaluation
Your instructor will show you how to access the online form

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

10-4

Certification Exam
You are now ready to take the Cloudera Certified Administrator
for Apache Hadoop examination
Your instructor will explain how to access the exam

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

10-5

Thank You!
Thank you for attending this course
If you have any further questions or comments, please feel free
to contact us
Full contact details are on our Web site at
http://www.cloudera.com/

Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

10-6
