Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Chapter 1
Introduction
Introduction
About This Course
About Cloudera
Course Logistics
Course Objectives
During this course, you will learn:
The core technologies of Hadoop
How to plan your Hadoop cluster hardware and software
How to deploy a Hadoop cluster
How to schedule jobs on the cluster
How to maintain your cluster
How to monitor, troubleshoot, and optimize the cluster
What system administrator issues to consider when installing
Hive, HBase and Pig
How to populate HDFS from external sources
Course Contents
Chapter 1: Introduction
Chapter 2: An Introduction to Hadoop and HDFS
Chapter 3: Planning Your Hadoop Cluster
Chapter 4: Deploying Your Cluster
Chapter 5: Scheduling Jobs
Chapter 6: Cluster Maintenance
Chapter 7: Cluster Monitoring, Troubleshooting, and Optimizing
Chapter 8: Installing and Managing Hadoop Ecosystem Projects
Chapter 9: Populating HDFS From External Sources
Chapter 10: Conclusion
Introduction
About This Course
About Cloudera
Course Logistics
About Cloudera
Cloudera is "The commercial Hadoop company"
Founded by leading experts on Hadoop from Facebook, Google,
Oracle and Yahoo
Staff includes several committers to Hadoop projects
Cloudera Products
Cloudera's Distribution of Hadoop (CDH)
A single, easy-to-install package from the Apache Hadoop core
repository
Includes a stable version of Hadoop, plus critical bug fixes and
solid new features from the development version
Open-source
No vendor lock-in
Cloudera Manager
Easy, wizard-based creation and management of Hadoop
clusters
Central monitoring and management point for the cluster
Free version supports up to 50 nodes
Cloudera Enterprise
Cloudera Enterprise
Complete package of software and support
Built on top of CDH
Includes full version of Cloudera Manager
Install, manage, and maintain a cluster of any size
LDAP integration
Includes powerful cluster monitoring and auditing tools
Resource consumption tracking
Proactive health checks
Alerting
Configuration change audit trails
And more
24 x 7 support
Cloudera Services
Provides consultancy services to many key users of Hadoop
Including Adconion, AOL Advertising, Comscore, Groupon,
NAVTEQ, Samsung, Trend Micro, Trulia
Solutions Architects and engineers are experts in Hadoop and
related technologies
Several are committers to Apache Hadoop and related projects
Provides training in key areas of Hadoop administration and
development
Courses include Developer Training for Apache Hadoop,
Analyzing Data with Hive and Pig, HBase Training, Cloudera
Essentials
Custom course development available
Both public and on-site training available
Introduction
About This Course
About Cloudera
Course Logistics
Logistics
Course start and end times
Lunch
Breaks
Restrooms
Can I come in early/stay late?
Certification
Introductions
About your instructor
About you
Experience with Hadoop?
Experience as a System Administrator?
What platform(s) do you use?
Expectations from the course?
Chapter 2
An Introduction to Hadoop
An Introduction to Hadoop
In this chapter, you will learn:
What Hadoop is
Why Hadoop is important
What features the Hadoop Distributed File System (HDFS)
provides
How MapReduce works
What other Apache Hadoop ecosystem projects exist, and what
they do
An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion
Some Numbers
Max data in memory (RAM): 64GB
Max data per computer (disk): 24TB
Data processed by Google every month: 400PB in 2007
Average job size: 180GB
Time 180GB of data would take to read sequentially off a single
disk drive: approximately 45 minutes
Sharing is Slow
Grid computing is not new
MPI, PVM, Condor, etc.
Grid focus is on distributing the workload
Uses a NetApp filer or other SAN-based solution for many
compute nodes
Fine for relatively limited amounts of data
Reading large amounts of data from a single SAN device can
leave nodes starved
Sharing is Tricky
Exchanging data requires synchronization
Deadlocks become a problem
Finite bandwidth is available
Distributed systems can drown themselves
Failovers can cause cascading failure of the system
Temporal dependencies are complicated
Difficult to make decisions regarding partial restarts
Reliability
System Requirements
System should support partial failure
Failure of one part of the system should result in a graceful
decline in performance
Not a full halt
System should support data recoverability
If components fail, their workload should be picked up by still-functioning units
System should support individual recoverability
Nodes that fail and restart should be able to rejoin the group
activity without a full group restart
Hadoop's Origins
Google created an architecture which answers these (and other)
requirements
Released two White Papers
2003: Description of the Google File System (GFS)
A method for storing data in a distributed, reliable fashion
2004: Description of distributed MapReduce
A method for processing data in a parallel fashion
Hadoop was based on these White Papers
All of Hadoop is written in Java
Developers typically write their MapReduce code in Java
Higher-level abstractions on top of MapReduce have also been
developed
An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion
HDFS Assumptions
High component failure rates
Inexpensive components fail all the time
Modest number of HUGE files
Just a few million
Each file likely to be 100MB or larger
Multi-Gigabyte files typical
Files are write-once
Append support is available in CDH3, to support HBase reliability
Should not be used by developers!
Large streaming reads
Not random access
High sustained throughput should be favored over low latency
HDFS Features
Operates on top of an existing filesystem
Files are stored as blocks
Much larger than for most filesystems
Default is 64MB
Provides reliability through replication
Each block is replicated across multiple DataNodes
Default replication factor is 3
Single NameNode daemon stores metadata and co-ordinates
access
Provides simple, centralized management
Blocks are stored on slave nodes
Running the DataNode daemon
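To make the block and replication settings above concrete, a quick back-of-the-envelope calculation (the 1GB file is hypothetical; the 64MB block size and replication factor 3 are the defaults stated above):

```python
import math

def hdfs_raw_storage(file_size_bytes, block_size=64 * 1024 * 1024, replication=3):
    """Return (number of HDFS blocks, raw bytes consumed across all DataNodes)."""
    blocks = math.ceil(file_size_bytes / block_size)
    return blocks, file_size_bytes * replication

# A hypothetical 1GB file occupies 16 blocks of 64MB,
# and consumes 3GB of raw disk across the cluster
blocks, raw = hdfs_raw_storage(1024 * 1024 * 1024)
```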
The NameNode
The NameNode stores all metadata
Information about file locations in HDFS
Information about file ownership and permissions
Names of the individual blocks
Locations of the blocks
Metadata is stored on disk and read when the NameNode
daemon starts up
Filename is fsimage
Note: block locations are not stored in fsimage
When changes to the metadata are required, these are made in
RAM
Changes are also written to a log file on disk called edits
Full details later
Hadoop is Rack-aware
Hadoop understands the concept of rack awareness
The idea of where nodes are located, relative to one another
Helps the JobTracker to assign tasks to nodes closest to the data
Helps the NameNode determine the closest block to a client
during reads
In reality, this should perhaps be described as being switch-aware
HDFS replicates data blocks on nodes on different racks
Provides extra data security in case of catastrophic hardware
failure
Rack-awareness is determined by a user-defined script
See later
If the DataNode fails during the read, the client will seamlessly
connect to the next one in the list to read the block
An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion
What Is MapReduce?
MapReduce is a method for distributing a task across multiple
nodes
Each node processes data stored on that node
Where possible
Consists of two developer-created phases
Map
Reduce
In between Map and Reduce is the shuffle and sort
Sends data from the Mappers to the Reducers
(Diagram: Map → Shuffle and sort → Reduce)
Features of MapReduce
Automatic parallelization and distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers
MapReduce programs are usually written in Java
Can be written in any scripting language using Hadoop
Streaming
All of Hadoop is written in Java
MapReduce abstracts all the housekeeping away from the
developer
Developer can concentrate simply on writing the Map and
Reduce functions
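As an illustration of the model described above, a minimal word count in Python, the classic MapReduce example (a sketch: with Hadoop Streaming the mapper and reducer would read stdin and write stdout, here they are plain functions, with the shuffle and sort simulated in-process):

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Sum all the counts emitted for one word."""
    return word, sum(counts)

pairs = [p for line in ["the cat", "the dog"] for p in map_phase(line)]
results = dict(reduce_phase(w, c) for w, c in shuffle_and_sort(pairs).items())
# results == {"the": 2, "cat": 1, "dog": 1}
```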
Map
// assume input is a set of text files
// k is a byte offset
// v is the line for that offset
let map(k, v) =
foreach word in v:
emit(word, 1)
An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion
Hive
Hive is a high-level abstraction on top of MapReduce
Initially created by a team at Facebook
Avoids having to write Java MapReduce code
Data in HDFS is queried using a language very similar to SQL
Known as HiveQL
HiveQL queries are turned into MapReduce jobs by the Hive
interpreter
Tables are just directories of files stored in HDFS
A Hive Metastore contains information on how to map a file to
a table structure
Hive (contd)
Example Hive query:
SELECT stock.product, SUM(orders.purchases)
FROM stock INNER JOIN orders
ON (stock.id = orders.stock_id)
WHERE orders.quarter = 'Q1'
GROUP BY stock.product;
We will discuss how to install Hive later in the course
Pig
Pig is another high-level abstraction on top of MapReduce
Originally created at Yahoo!
Uses a dataflow scripting language known as PigLatin
PigLatin scripts are converted to MapReduce jobs by the Pig
interpreter
Pig (contd)
Sample PigLatin script:
stock = LOAD '/user/fred/stock' AS (id, item);
orders = LOAD '/user/fred/orders' AS (id, cost);
grpd = GROUP orders BY id;
totals = FOREACH grpd GENERATE group, SUM(orders.cost) AS t;
result = JOIN stock BY id, totals BY group;
DUMP result;
HBase
HBase is described as "The Hadoop database"
A column-oriented data store
Provides random, real-time read/write access to large amounts of
data
Allows you to manage tables consisting of billions of rows, with
potentially millions of columns
HBase stores its data in HDFS for reliability and availability
We will discuss issues related to HBase installation and
maintenance later in the course
An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion
An Introduction to Hadoop
Why Hadoop?
What is HDFS?
What is MapReduce?
Hive, Pig, HBase and other Ecosystem projects
Hands-On Exercise: Installing Hadoop
Conclusion
Conclusion
In this chapter, you have learned:
What Hadoop is
Why Hadoop is important
What features the Hadoop Distributed File System (HDFS)
provides
How MapReduce works
What other Apache Hadoop ecosystem projects exist, and what
they do
Chapter 3
Planning Your
Hadoop Cluster
Classifying Nodes
Nodes can be classified as either slave nodes or master nodes
Slave node runs DataNode plus TaskTracker daemons
Master node runs either a NameNode daemon, a Secondary
NameNode Daemon, or a JobTracker daemon
On smaller clusters, NameNode and JobTracker are often run on
the same machine
Sometimes even Secondary NameNode is on the same machine
as the NameNode and JobTracker
Important that at least one copy of the NameNode's metadata
is stored on a separate machine (see later)
Typically, a cluster with more nodes will perform better than one
with fewer, slightly faster nodes
Filesystem Considerations
Cloudera recommends the ext3 and ext4 filesystems
ext4 is now becoming more commonly used
XFS provides some performance benefit during kickstart
It formats in 0 seconds, vs several minutes for each disk with ext3
XFS has some performance issues
Slow deletes in some versions
Some performance improvements are available; see e.g.,
http://everything2.com/index.pl?node_id=1479435
Conclusion
In this chapter, you have learned:
What issues to consider when planning your Hadoop cluster
What types of hardware are typically used for Hadoop nodes
How to optimally configure your network topology
How to select the right operating system and Hadoop distribution
Chapter 4
Configuring and
Deploying Your Cluster
LocalJobRunner Mode
In LocalJobRunner mode, no daemons run
Everything runs in a single Java Virtual Machine (JVM)
Hadoop uses the machine's standard filesystem for data storage
Not HDFS
Suitable for testing MapReduce programs during development
Pseudo-Distributed Mode
In pseudo-distributed mode, all daemons run on the local
machine
Each runs in its own JVM (Java Virtual Machine)
Hadoop uses HDFS to store data (by default)
Useful to simulate a cluster on a single machine
Convenient for debugging programs before launching them on
the real cluster
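In practice, pseudo-distributed mode is configured by pointing both the filesystem and the JobTracker at localhost; a sketch of the relevant properties (the port numbers 8020 and 8021 are the CDH conventions, so verify them against your distribution):

```xml
<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:8021</value>
</property>
```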
Fully-Distributed Mode
In fully-distributed mode, Hadoop daemons run on a cluster of
machines
HDFS used to distribute data amongst the nodes
Unless you are running a small cluster (less than 10 or 20
nodes), the NameNode and JobTracker should each be running
on dedicated nodes
For small clusters, it's acceptable for both to run on the same
physical node
RPM/Package vs Tarballs
Cloudera's Distribution including Apache Hadoop (CDH) is
available in multiple formats
RPMs for Red Hat-style Linux distributions (RHEL, CentOS)
Packages for Ubuntu and SuSE Linux
As a tarball
RPMs/Packages include some features not in the tarball
Automatic creation of mapred and hdfs users
init scripts to automatically start the Hadoop daemons
Although these are not activated by default
Configures the alternatives system to allow multiple
configurations on the same machine
Strong recommendation: use the RPMs/packages whenever
possible
An Aside: SSH
Note that most tutorials tell you to create a passwordless SSH
login on each machine
This is not necessary for the operation of Hadoop
Hadoop does not use SSH in any of its internal communications
SSH is only required if you intend to use the start-all.sh and
stop-all.sh scripts
Once installed, you can access the server via its Web interface
http://scm_manager_host:7180/
<property>
  <name>some.property.name</name>
  <value>somevalue</value>
  <final>true</final>
</property>
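Each of Hadoop's configuration files wraps such property elements in a single top-level configuration element; a minimal hdfs-site.xml skeleton (the property shown is a placeholder, not a real setting):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>some.property.name</name>
    <value>somevalue</value>
    <!-- marking a property final prevents jobs from overriding it -->
    <final>true</final>
  </property>
</configuration>
```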
hdfs-site.xml
The single most important configuration value on your entire
cluster, set on the NameNode:
dfs.name.dir
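dfs.name.dir takes a comma-separated list of directories, each of which receives a full copy of the NameNode metadata; a sketch (the paths are hypothetical: one local disk plus an NFS mount):

```xml
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/namenode-backup/dfs/nn</value>
</property>
```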
hdfs-site.xml (contd)
The NameNode will write to the edit log in all directories in
dfs.name.dir synchronously
If a directory in the list disappears, the NameNode will continue
to function
It will ignore that directory until it is restarted
Recommendation for the NFS mount point
tcp,soft,intr,timeo=10,retrans=10
Soft mount so the NameNode will not hang if the mount point
disappears
Will retry transactions 10 times, at 1-10 second intervals, before
being deemed to have failed
Note: no space between the comma and next directory name in
the list!
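As an /etc/fstab sketch using the recommended options (the filer name and paths are hypothetical):

```
filer:/vol/nn_backup  /mnt/namenode-backup  nfs  tcp,soft,intr,timeo=10,retrans=10  0 0
```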
hdfs-site.xml (contd)
dfs.block.size
dfs.data.dir
hdfs-site.xml (contd)
dfs.namenode.handler.count
dfs.permissions
hdfs-site.xml (contd)
dfs.datanode.du.reserved
core-site.xml
fs.default.name
fs.checkpoint.dir
core-site.xml (contd)
fs.trash.interval
core-site.xml (contd)
hadoop.tmp.dir
io.file.buffer.size
io.compression.codecs
mapred-site.xml
mapred.job.tracker
mapred.child.java.opts
mapred.child.ulimit
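For example, mapred.child.java.opts passes JVM options to every Map and Reduce task; a mapred-site.xml sketch giving each task a 1GB heap (the value is illustrative, not a recommendation):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1g</value>
</property>
```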
mapred-site.xml (contd)
mapred.local.dir
mapred-site.xml (contd)
mapred.job.tracker.handler.count
mapred.reduce.parallel.copies
tasktracker.http.threads
mapred-site.xml (contd)
mapred.reduce.slowstart.completed.maps: The percentage of Map tasks which must be completed before the JobTracker will schedule Reducers on the cluster. Default: 0.05. Recommendation: 0.5 to 0.8. Specified on the JobTracker.
mapred.jobtracker.taskScheduler
mapred-site.xml (contd)
mapred.tasktracker.map.tasks.maximum: Number of Map tasks which can be run simultaneously by the TaskTracker. Specified on each TaskTracker node.
mapred.tasktracker.reduce.tasks.maximum: Number of Reduce tasks which can be run simultaneously by the TaskTracker. Specified on each TaskTracker node.
mapred-site.xml (contd)
mapred.map.tasks.speculative.execution: Whether to allow speculative execution for Map tasks. Default: true. Recommendation: true. Specified on the JobTracker.
mapred.reduce.tasks.speculative.execution: Whether to allow speculative execution for Reduce tasks. Default: true. Recommendation: false. Specified on the JobTracker.
mapred-site.xml (contd)
mapred.compress.map.output
mapred-site.xml (contd)
io.sort.mb
io.sort.factor
Example rack topology: four nodes map to /datacenter1/rack1, and four nodes map to /datacenter1/rack2
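Rack mappings like those above are produced by a user-supplied topology script, named in the topology.script.file.name property: Hadoop invokes it with one or more host addresses and reads one rack path per host from its output. A minimal sketch; the 10.1.x.x addressing scheme is hypothetical:

```shell
# rack_of: map one host/IP to a rack path (hypothetical addressing scheme)
rack_of() {
  case "$1" in
    10.1.1.*) echo "/datacenter1/rack1" ;;
    10.1.2.*) echo "/datacenter1/rack2" ;;
    *)        echo "/default-rack"      ;;
  esac
}

# A real topology script would print one rack per argument:
# for h in "$@"; do rack_of "$h"; done
```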
Hands-On Exercise
In this exercise, you will collaborate with other students to create
a real Hadoop cluster in the classroom
Please refer to the hands-on exercise manual
Conclusion
In this chapter, you have learned:
The different installation configurations available in Hadoop
How to install Hadoop
How to launch the Hadoop daemons
How to configure Hadoop
How to specify your rack topology
Chapter 5
Managing and
Scheduling Jobs
Job listing fields: JobId, State, StartTime, UserName, Priority, SchedulingInfo
Example: JobId job_201110311158_0008, StartTime 1320210148487, UserName training, Priority NORMAL, SchedulingInfo NA
Killing a Job
It is important to note that once a user has submitted a job, they cannot stop it simply by hitting Ctrl-C in their terminal
This only stops job output appearing on the user's console
The job is still running on the cluster!
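To actually stop a running job, use the hadoop job command; the job ID below is the example from the listing earlier, and the commands must be run against a live cluster:

```shell
# Find the job's ID, then ask the JobTracker to kill it
hadoop job -list
hadoop job -kill job_201110311158_0008
```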
(Diagram: FIFO scheduling. Job A's tasks A1-A4 and job B's tasks B1-B3 run in submission order.)
(Diagram: priorities under FIFO. High-priority tasks C1-C3 run before any lower-priority tasks A1-A4 and B1-B3 are started, regardless of submission order.)
Pool Creation
By default, pools are created dynamically based on the username
submitting the job
No configuration necessary
Jobs can be sent to designated pools (e.g., production)
Pools can be defined in a configuration file (see later)
Pools may have a minimum number of mappers and reducers
defined
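Enabling the Fair Scheduler and pointing it at an allocations file is done in mapred-site.xml; a sketch, with an illustrative file path:

```xml
<!-- mapred-site.xml fragment: enable the Fair Scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/allocations.xml</value>
</property>
```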
minShare Preemption
Pools with a minimum share configured are operating on an SLA
(Service Level Agreement)
Waiting for tasks from other pools to finish may not be
appropriate
Pools which are below their minimum guaranteed share can kill the newest tasks from other pools to reclaim slots
Can then use those slots for their own tasks
Ensures that the minimum share will be delivered within a timeout
window
mapred.fairscheduler.sizebasedweight
  If true, a job's weight in the Fair Scheduler is also based on its size. Default: false.
mapred.fairscheduler.weightadjuster
  An optional class for adjusting job weights, for example to boost newly-started jobs.
Configuring Pools
The allocations configuration file must exist, and contain an
<allocations> entity
<pool> entities can contain minMaps, minReduces,
maxRunningJobs, weight
<user> entities (optional) can contain maxRunningJobs
Limits the number of simultaneous jobs a user can run
<userMaxJobsDefault> entity (optional)
Maximum number of jobs for any user without a specified limit
System-wide and per-pool timeouts can be set
<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <user name="bob">
    <maxRunningJobs>6</maxRunningJobs>
  </user>
  <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
</allocations>
<?xml version="1.0"?>
<allocations>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
  </pool>
</allocations>
Conclusion
In this chapter, you have learned:
How to stop jobs running on the cluster
The options available for scheduling multiple jobs on the same
cluster
The downsides of the default FIFO Scheduler
How to configure the Fair Scheduler
Chapter 6
Cluster Maintenance
Cluster Maintenance
In this chapter, you will learn:
How to check the status of HDFS
How to copy data between clusters
How to add and remove nodes
How to rebalance the cluster
The purpose of the Secondary NameNode
What strategies to employ for NameNode Metadata backup
How to upgrade your cluster
Cluster Maintenance
Checking HDFS status
Hands-On Exercise: Breaking The Cluster
Copying Data Between Clusters
Adding and Removing Cluster Nodes
Rebalancing The Cluster
Hands-On Exercise: Verifying The Cluster's Self-Healing Features
NameNode Metadata Backup
Cluster Upgrading
Conclusion
$ hadoop fsck /
$ hadoop fsck / -files
$ hadoop fsck / -files -blocks
$ hadoop fsck / -files -blocks -locations
$ hadoop fsck / -files -blocks -locations -racks
Using dfsadmin
dfsadmin provides a number of administrative features
including:
List information about HDFS on a per-datanode basis
$ hadoop dfsadmin -report
Copying Data
Hadoop clusters can hold massive amounts of data
A frequent requirement is to back up the cluster for disaster
recovery
Ultimately, this is not a Hadoop problem!
It's a "managing huge amounts of data" problem
The cluster could be backed up to tape, etc., if necessary
Custom software may be needed
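In practice, cluster-to-cluster copies are usually done with Hadoop's distcp tool, which runs the copy as a MapReduce job; the hostnames and paths below are illustrative:

```shell
# Copy a directory tree between the two clusters' NameNodes
hadoop distcp hdfs://nn-a:8020/data hdfs://nn-b:8020/data

# Between clusters running different Hadoop versions, read the
# source over HFTP (the NameNode's HTTP port) instead
hadoop distcp hftp://nn-a:50070/data hdfs://nn-b:8020/data
```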
Cluster Rebalancing
An HDFS cluster can become unbalanced
Some Nodes have much more data on them than others
Example: add a new Node to the cluster
Even after adding some files to HDFS, this Node will have far
less data than the others
During MapReduce processing, this Node will use much more
network bandwidth as it retrieves data from other Nodes
Clusters can be rebalanced using the balancer utility
Using balancer
balancer reviews data block placement on nodes and adjusts
blocks to ensure all nodes are within x% utilization of each other
Utilization is defined as amount of data storage used
x is known as the threshold
A node is under-utilized if its utilization is less than (average
utilization - threshold)
A node is over-utilized if its utilization is more than (average
utilization + threshold)
Note: balancer does not consider block placement on individual
disks on a node
Only the utilization of the node as a whole
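Running the balancer is a single command; the threshold argument is the x% described above (5 is a common choice, not a mandated default):

```shell
# Move blocks until every DataNode is within 5% of mean utilization
hadoop balancer -threshold 5

# Or run it in the background via the helper script; stop with stop-balancer.sh
start-balancer.sh -threshold 5
```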
When To Rebalance
A cluster should not become unbalanced during regular usage
Rebalance immediately after adding new nodes to the cluster
Rebalancing does not interfere with any existing MapReduce
jobs
However, it does use bandwidth
Not a good idea to rebalance during peak usage times
Conclusion
In this chapter, you have learned:
How to check the status of HDFS
How to copy data between clusters
How to add and remove nodes
How to rebalance the cluster
The purpose of the Secondary NameNode
What strategies to employ for NameNode Metadata backup
How to upgrade your cluster
Chapter 7
Cluster Monitoring and
Troubleshooting
Items to Monitor
Monitor the Hadoop daemons
Alert an operator if a daemon goes down
Check can be done with
service hadoop-0.20-daemon_name status (where daemon_name is, e.g., namenode or jobtracker)
Monitor disks and disk partitions
Alert immediately if a disk fails
Send a warning when a disk reaches 80% capacity
Send a critical alert when a disk reaches 90% capacity
Monitor CPU usage on master nodes
Send an alert on excessive CPU usage
Slave nodes will often reach 100% usage
This is not a problem
log4j Configuration
Log4j configuration is controlled by conf/log4j.properties
Default log level configured by hadoop.root.logger
Default is INFO
Log level can be set for any specific class with
log4j.logger.class.name = LEVEL
Example:
log4j.logger.org.apache.hadoop.mapred.JobTracker=INFO
The DailyRollingFileAppender
An Appender is the destination for log messages
Hadoop's default for daemon logs is the DailyRollingFileAppender (DRFA)
Rotates logfiles daily
Frequency is configurable
Cannot limit filesize
Cannot limit the number of files kept
You must provide your own scripts to compress, archive, and delete logs
DRFA is the most popular choice for Hadoop logs
It is the default, and many system administrators are not familiar
with Java logging
Daemon       Default Port                  Configuration Parameter
NameNode     8020                          fs.default.name
DataNode     50010                         dfs.datanode.address
DataNode     50020                         dfs.datanode.ipc.address
BackupNode   50100                         dfs.backup.address
JobTracker   usually 8021, 9001, or 8012   mapred.job.tracker
TaskTracker  usually 8021, 9001, or 8012   mapred.task.tracker.report.address
Daemon                    Default Port   Configuration Parameter
HDFS:
  NameNode                50070          dfs.http.address
  DataNode                50075          dfs.datanode.http.address
  Secondary NameNode      50090          dfs.secondary.http.address
  Backup/Checkpoint Node* 50105          dfs.backup.http.address
MapReduce:
  JobTracker              50030          mapred.job.tracker.http.address
  TaskTracker             50060          mapred.task.tracker.http.address
Hadoop Metrics
Hadoop can be configured to log many different metrics
Metrics are grouped into contexts
jvm
Statistics from the JVM including memory usage, thread
counts, garbage collection information
All Hadoop daemons use this context
dfs
NameNode capacity, number of files, under-replicated blocks
mapred
JobTracker information, similar to that found on the JobTracker's Web status page
rpc
For Remote Procedure Calls
Monitoring Challenges
System monitoring becomes a challenge when dealing with large
numbers of systems
Multiple solutions exist, such as
Nagios
Hyperic
Zabbix
Many of these are very general purpose
Fine for monitoring the machines themselves
Not so useful for integrating with Hadoop
(Diagram: Ganglia architecture. A data collector, GMOND, runs on each cluster node. A data consolidator, GMETAD, running on a Web server, gathers their metrics into an rrdtool database, which Apache and PHP scripts present as Web pages.)
Ganglia Configuration
Install the GMOND daemon on every cluster Node
Make sure port 8649 is open for both UDP and TCP connections
Install GMETAD on a Web server
Configure Hadoop to publish metrics to Ganglia in
conf/hadoop-metrics.properties
Example:
# Configuration of the "dfs" context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649
Ganglia Versions
Ganglia 3.0.x and 3.1 both work well with Hadoop out-of-the-box
Ganglia 3.1 also works out-of-the-box with CDH
Ganglia 3.1 uses
org.apache.hadoop.metrics.ganglia.GangliaContext31
Possible causes
Map or Reduce task has run out of memory
Possibly due to a memory leak in the job code
Possible resolution
Increase size of RAM allocated in mapred.child.java.opts
Ensure io.sort.mb is smaller than the RAM allocated in mapred.child.java.opts
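The memory increase is made in mapred-site.xml; the 1 GB heap below is illustrative, not a recommendation:

```xml
<!-- mapred-site.xml fragment: per-task JVM heap (illustrative value) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```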
Cause
Reducers are failing to fetch intermediate data from a
TaskTracker where a Map process ran
Too many of these failures will cause a TaskTracker to be
blacklisted
Possible resolutions
Increase tasktracker.http.threads
Decrease mapred.reduce.parallel.copies
Upgrade to CDH3u2
The version of Jetty (the Web server) in earlier versions of the
TaskTracker was prone to fetch failures
Possible causes
Fewer DataNodes available than the replication factor of the
blocks
DataNodes do not have enough xciever threads
Default is 256 threads to manage connections
Note: yes, the configuration option is misspelled!
Possible resolutions
Increase dfs.datanode.max.xcievers to 4096
Check replication factor
Why Benchmark?
Common question after adding new nodes to a cluster:
How much faster is my cluster running now?
Benchmarking clusters is not an exact science
Performance depends on the type of job you're running
Standard benchmark is Terasort
Example: Generate a 10,000,000-line file, each line containing
100 bytes, then sort that file
hadoop jar $HADOOP_HOME/hadoop-*-examples.jar teragen 10000000 input_dir
hadoop jar $HADOOP_HOME/hadoop-*-examples.jar terasort input_dir output_dir
Real-World Benchmarks
Test your cluster before and after adding nodes
Remember to take into account other jobs running on the nodes
while you're benchmarking!
As a (very high-end!) guide: in April 2009, Arun Murthy and Owen
O'Malley at Yahoo! sorted a terabyte of data in 62 seconds on a
cluster of 1,406 nodes
Albeit using a somewhat modified version of Hadoop
Conclusion
In this chapter, you have learned:
What general system conditions to monitor
How to use the NameNode and JobTracker Web UIs
How to view and manage Hadoop's log files
How the Ganglia monitoring tool works
Some common cluster problems, and their resolutions
How to benchmark your cluster's performance
Chapter 8
Populating HDFS From
External Sources
What Is Flume?
Flume is a distributed, reliable, available service for efficiently
moving large amounts of data as it is produced
Ideally suited to gathering logs from multiple systems and
inserting them into HDFS as they are generated
Developed in-house by Cloudera, and released as open-source
software
Design goals:
Reliability
Scalability
Manageability
Extensibility
(Diagram: Flume architecture. Agents feed data through optional Processors to one or more Collectors, which write to HDFS. The Master communicates with all Agents, specifying configuration. Multiple configurable levels of reliability: Agents can guarantee delivery in the event of failure. Processors, optionally deployable and centrally administered, can pre-process incoming data: transformations, suppressions, metadata enrichment. Decorators such as compress, batch, and encrypt can be deployed flexibly at any step to improve performance, reliability, or security. Writes are parallelized across many Collectors, providing as much write throughput as required, to multiple HDFS file formats: text, SequenceFile, JSON, Avro, and others.)
What is Sqoop?
Sqoop is the SQL-to-Hadoop database import tool
Developed at Cloudera
Open-source
Included as part of Cloudera's Distribution including Apache Hadoop (CDH)
Designed to import data from RDBMSs (Relational Database
Management Systems) into Hadoop
Can also send data the other way, from Hadoop to an RDBMS
Uses JDBC (Java Database Connectivity) to connect to the
RDBMS
Sqoop Features
Imports a single table, or all tables in a database
Can specify which rows to import
Via a WHERE clause
Can specify which columns to import
Can provide an arbitrary SELECT statement
Sqoop can automatically create a Hive table based on the
imported data
Supports incremental imports of data
Can export data from HDFS to a database table
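A typical import reflecting the features above; the connection string, credentials, table, and WHERE clause are all illustrative:

```shell
# Import one table into HDFS, prompting for the database password
sqoop import --connect jdbc:mysql://db.example.com/sales \
    --username analyst -P --table orders

# Import only selected rows, and create a matching Hive table
sqoop import --connect jdbc:mysql://db.example.com/sales \
    --username analyst -P --table orders \
    --where "order_date >= '2012-01-01'" --hive-import
```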
Sqoop Connectors
Cloudera has partnered with third parties to create Sqoop
connectors
Add-ons to Sqoop which use a database's native protocols to import data, rather than JDBC
Typically orders of magnitude faster
Not open-source, but freely downloadable from the Cloudera Web
site
Current products supported
Oracle Database
MicroStrategy
Netezza
Others being developed
Microsoft has produced a version of Sqoop optimized for SQL
Server
Conclusion
In this chapter, you have learned:
What Flume is
How Flume works
What Sqoop is
How to use Sqoop to import data from RDBMSs to HDFS
Best practices for importing data
Chapter 9
Installing and Managing
Other Hadoop Projects
Note
Note that this chapter does not go into any significant detail
about Hive, HBase, or Pig
Our intention is to draw your attention to issues System
Administrators will need to deal with if users request that these
products be installed
For more details on the products themselves, Cloudera offers
dedicated training courses on HBase, and on Hive and Pig
Installing Hive
Hive runs on a user's machine
Not on the Hadoop cluster itself
A user can set up Hive with no System Administrator input
Using the standard Hive command-line or Web-based
interface
If users will be running JDBC-based clients, Hive should be run
as a service on a centrally-available machine
By default, Hive uses a Metastore on the user's machine
Metastore uses Derby, a Java-based RDBMS
If multiple users will be running Hive, the System Administrator
should configure a shared Metastore for all users
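Running Hive as a central service is a matter of starting its daemons on a shared machine. A minimal sketch, assuming Hive of this era (HiveServer1) is installed on a central server; the host name and port are placeholders:

```bash
# Start the Metastore service, which serves the shared Metastore
# to all Hive clients:
hive --service metastore &

# Start the Hive server so JDBC/ODBC clients can connect:
hive --service hiveserver &

# Clients would then use a JDBC URL such as:
#   jdbc:hive://hive-server.example.com:10000/default
```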
Sample hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://DB_HOST_NAME:DB_PORT/DATABASE_NAME</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>USERNAME</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>PASSWORD</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://NAMENODE_HOST:NAMENODE_PORT/user/hive/warehouse</value>
</property>
</configuration>
What Is Pig?
Pig is another high-level abstraction on top of MapReduce
Originally developed at Yahoo!
Now a top-level Apache project
Provides a scripting language known as Pig Latin
Abstracts MapReduce details away from the user
Composed of operations that are applied to the input data to
produce output
The language is relatively easy to learn for people experienced
in Perl, Ruby, or other scripting languages
Fairly easy to write complex tasks such as joins of multiple
datasets
Under the covers, Pig Latin scripts are converted to MapReduce
jobs
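A short Pig Latin sketch illustrates the idea. This is a hypothetical example: the file paths and field layout are assumptions, not course material.

```pig
-- Load a tab-delimited log, keep only error lines, and count them
-- by source. Each statement below becomes part of one or more
-- MapReduce jobs when the script runs.
logs    = LOAD '/user/fred/logs' AS (level:chararray, source:chararray, msg:chararray);
errors  = FILTER logs BY level == 'ERROR';
grouped = GROUP errors BY source;
counts  = FOREACH grouped GENERATE group, COUNT(errors);
STORE counts INTO '/user/fred/error_counts';
```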
Installing Pig
Pig runs as a client-side application
There is nothing extra to install on the cluster
Set the configuration file to point to the Hadoop cluster
In the pig.properties file in Pig's conf directory, set
fs.default.name=hdfs://<namenode_location>/
mapred.job.tracker=<jobtracker_location>:8021
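Once the properties point at the cluster, running a script is straightforward. The script name below is a placeholder:

```bash
# Run a Pig Latin script on the cluster (the default execution mode):
pig myscript.pig

# Or run it in local mode for testing, bypassing the cluster entirely:
pig -x local myscript.pig
```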
HBase Terminology
Region
A subset of a table's rows
Similar to a partition
HRegionServer
Serves data for reads and writes
Master
Responsible for coordinating HRegionServers
Assigns Regions, detects failures of HRegionServers, and
controls administrative functions
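These components can be seen from the HBase shell. A brief sketch, assuming HBase is installed and running; the table and column names are illustrative:

```
$ hbase shell
> create 'users', 'info'                     # a table with one column family
> put 'users', 'row1', 'info:name', 'Alice'  # write a cell
> get 'users', 'row1'                        # read it back
> status                                     # reports live HRegionServers
```

As the `users` table grows, the Master splits it into Regions and assigns them across the HRegionServers.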
Conclusion
In this chapter, you have learned:
What features Hive, HBase, and Pig provide
What administrative requirements they impose
Chapter 10
Conclusion
Conclusion
During this course, you have learned:
The core technologies of Hadoop
How to plan your Hadoop cluster hardware and software
How to deploy a Hadoop cluster
How to schedule jobs on the cluster
How to maintain your cluster
How to monitor, troubleshoot, and optimize the cluster
What system administrator issues to consider when installing
Hive, HBase and Pig
How to populate HDFS from external sources
Next Steps
Cloudera offers a number of other training courses:
Developer training for Hadoop
Hive training
HBase training
Custom courses
Cloudera also provides consultancy and troubleshooting
services
Please ask your instructor for more information
Class Evaluation
Please take a few minutes to complete the class evaluation
Your instructor will show you how to access the online form
Certification Exam
You are now ready to take the Hadoop Certified System
Administrator examination
Your instructor will explain how to access the exam
Thank You!
Thank you for attending this course
If you have any further questions or comments, please feel free
to contact us
Full contact details are on our Web site at
http://www.cloudera.com/