
Hadoop Administration

What does Hadoop mean?

https://developers.google.com/hadoop/images/hadoop-elephant.png

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to
spell and pronounce, meaningless and not used elsewhere: those are my
naming criteria. Kids are good at generating such. Googol is a kid's term."

Doug Cutting, Hadoop Project Creator

Outline
First Day:

Session I:

Introduction to the course

Introduction to Big Data and Hadoop

Session II:
HDFS

Session III:

YARN

Session IV:

MapReduce

Outline #2
Second Day:

Session I:

Installation and configuration of Hadoop in pseudo-distributed mode

Session II:

Running MapReduce jobs on the Hadoop cluster

Session III:

Hadoop ecosystem tools: Pig, Hive, Sqoop, HBase, Flume, Oozie

Session IV:

Supporting tools: Hue, Cloudera Manager

Big Data future: Impala, Tez, Spark, NoSQL

Outline #3
Third Day:

Session I:

Hadoop cluster planning

Session II:
Hadoop cluster installation and configuration

Session III:

Hadoop cluster installation and configuration

Session IV:

Case Studies

Certification and Surveys

First Day - Session I

Introduction

http://news.nost.org.cn/wp-content/uploads/2013/06/BigDataBigBuildings.jpg

"I think there is a world market for maybe


five computers"
http://upload.wikimedia.org/wikipedia/commons/7/7e/Thomas_J_Watson_Sr.jpg

Thomas Watson, IBM, 1943.

How big are the data we are talking about?

http://www.atkearney.com.tr/documents/10192/698536/FG-Big-Data-and-the-Creative-Destruction-of-Todays-Business-Models-1.png

Foundations of Big Data

http://blog.softwareinsider.org/wp-content/uploads/2012/02/Screen-shot-2012-02-27-at-
11.18.56-AM.png

What is Hadoop?
Apache project for storing and processing large data sets

Open-source implementation of Google Big Data solutions

Components:

core:

HDFS (Hadoop Distributed File System)

YARN (Yet Another Resource Negotiator)

noncore:

Data processing paradigms (MapReduce, Spark, Impala, Tez, etc.)

Ecosystem tools (Pig, Hive, Sqoop, HBase, etc.)


Written in Java

Virtualization vs clustering

Data storage evolution


1956 - HDD (Hard Disk Drive), now up to 6 TB

1983 - SSD (Solid State Drive), now up to 16 TB

1984 - NFS (Network File System), first NAS (Network Attached Storage) implementation

1987 - RAID (Redundant Array of Independent Disks), now up to ~100 disks

1993 - Disk Arrays, now up to ~200 disks

1994 - Fibre-channel, first SAN (Storage Area Network) implementation

2003 - GFS (Google File System), first Big Data implementation

Big Data and Hadoop evolution


2003 - GFS (Google File System) publication, available here

2004 - Google MapReduce publication, available here


2005 - Hadoop project founded

2006 - Google BigTable publication, available here

2008 - First Hadoop implementation

2010 - Google Dremel publication, available here

2010 - First HBase implementation

2012 - Apache Hadoop YARN project founded

2013 - Cloudera releases Impala

2014 - Apache releases Spark

Hadoop distributions
Apache Hadoop, http://hadoop.apache.org/

CDH (Cloudera Distribution including Apache Hadoop), http://www.cloudera.com/

HDP (Hortonworks Data Platform), http://hortonworks.com/

M3, M5 and M7, http://www.mapr.com/

Amazon Elastic MapReduce, https://aws.amazon.com/elasticmapreduce/

BigInsights Enterprise Edition, http://www.ibm.com/

Intel Distribution for Apache Hadoop, http://hadoop.intel.com/

And more ...

Hadoop automated deployment tools


Cloudera Manager, https://www.cloudera.com/products/cloudera-manager.html

Apache Ambari, https://ambari.apache.org/

MapR Control System, http://doc.mapr.com/display/MapR/MapR+Control+System

VMware Big Data Platform, https://www.vmware.com/products/big-data-extensions

OpenStack Sahara, https://wiki.openstack.org/wiki/Sahara

Amazon Elastic MapReduce, https://aws.amazon.com/elasticmapreduce/

HDInsight, https://azure.microsoft.com/en-us/services/hdinsight/

Hadoop on Google Cloud Platform, https://cloud.google.com/hadoop/

Who uses Hadoop?

http://cdn2.hadoopwizard.com/wp-
content/uploads/2013/01/hadoopwizard_companies_by_node_size600.png

Where is it all going?


At this point we have all the tools to start changing the future:

capability of storing and processing any amount of data

tools for storing and processing both structured and unstructured data

tools for batch and interactive data processing


Big Data paradigm has matured

Big Data future:

search engines

scientific analysis

trends prediction

business intelligence

artificial intelligence

References
Books:

E. Sammer. "Hadoop Operations". O'Reilly, 1st edition

T. White. "Hadoop: The Definitive Guide". O'Reilly, 4th edition

Publications:

http://research.google.com/pubs/papers.html

http://ieeexplore.ieee.org/Xplore/home.jsp

Websites:

Apache Hadoop project website: http://hadoop.apache.org/

Cloudera website: http://www.cloudera.com/content/cloudera/en/home.html

Hortonworks website: http://hortonworks.com/

MapR website: http://www.mapr.com/

Internet

Introduction to the lab


Lab components:

Laptop with Windows 8

Virtual Machine with CentOS 7.2 on VirtualBox:

Hadoop instance in a pseudo-distributed mode

Hadoop ecosystem tools

terminal for connecting to the GCE

GCE (Google Compute Engine) cloud-based IaaS platform

Hadoop cluster

VirtualBox
VirtualBox 64-bit version (click here to download)

VM name: Hadoop Administration

Credentials: terminal / terminal (use root account)

Snapshots (top right corner)

Press right "Control" key to release

GCE
GUI: https://console.cloud.google.com/compute/

Project ID: check after logging in, in the "My First Project" tab

Connect the VM to the GCE:

sudo /opt/google-cloud-sdk/bin/gcloud auth login

follow on-screen instructions

sudo /opt/google-cloud-sdk/bin/gcloud config set project [Project ID]
sudo /opt/google-cloud-sdk/bin/gcloud components update

sudo /opt/google-cloud-sdk/bin/gcloud config list

First Day - Session II

HDFS

http://munnamark.files.wordpress.com/2012/03/petabyte1.jpg

Goals and motivation


Very large files:

file sizes larger than tens of terabytes

filesystem sizes larger than tens of petabytes

Streaming data access:

write-once, read-many-times approach

optimized for batch processing

Fault tolerance:
data replication

metadata high availability

Scalability:

commodity hardware

horizontal scalability

Resilience:

components failures are common

auto-healing feature

Support for MapReduce data processing paradigm

Design
FUSE (Filesystem in USErspace)

Non-POSIX-compliant filesystem (http://standards.ieee.org/findstds/standard/1003.1-2008.html)

Block abstraction:

block size: 64 MB

block replication factor: 3

Benefits:

support for file sizes larger than disk sizes

segregation of data and metadata

data durability
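
As an illustration of the block abstraction, the block layout of a single file can be inspected with the HDFS filesystem checker (a minimal sketch; the path /user/terminal/data.txt is only an assumed example):

hdfs fsck /user/terminal/data.txt -files -blocks -locations
# lists every block of the file, its size, its replication factor
# and the datanodes currently holding each replica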

Lab Exercise 1.2.1

Daemons
Datanode:

responsible for storing and retrieving data

many per cluster

Namenode:

responsible for storing and retrieving metadata

responsible for maintaining a database of data locations

1 active per cluster

Secondary namenode:

responsible for checkpointing metadata

1 active per cluster

Data
Stored in the form of blocks on datanodes

Blocks are files on the datanodes' local filesystem

Reported periodically to the namenode in the form of block reports

Durability and parallel processing thanks to replication

Metadata
Stored in the form of files on the namenode:

fsimage_[transaction ID] - complete snapshot of the filesystem metadata up to the specified transaction

edits_inprogress_[transaction ID] - incremental modifications made to the metadata since the specified transaction

Contain information on the HDFS filesystem's structure and properties

Copied to and served from the namenode's RAM

Don't confuse metadata with the database of data locations!

Lab Exercise 1.2.2

Metadata checkpointing
When does it occur?

every hour

edits file reaches 64 MB

How does it work?

T. White. "Hadoop: The Definitive Guide"
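
A checkpoint can also be forced by hand, which is handy before maintenance; a minimal sketch (requires HDFS superuser privileges):

hdfs dfsadmin -safemode enter     # make the metadata read-only
hdfs dfsadmin -saveNamespace      # merge the edits into a new fsimage file
hdfs dfsadmin -safemode leave     # resume normal operation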

Read path
1. A client opens a file by calling the "open()" method on the
"FileSystem" object

2. The client calls the namenode to return a sorted list of datanodes for the first batch of blocks in the file
3. The client connects to the first datanode from the list

4. The client streams the data block from the datanode

5. The client closes the connection to the datanode

6. The client repeats steps 3-5 for the next block or steps 2-5 for
the next batch of blocks, or closes the file when copied

Read path #2

T. White. "Hadoop: The Definitive Guide"

Data read preferences


Determines the datanode closest to the client

The distance is based on the theoretical network bandwidth and latency:

0 - the same node

2 - the same rack

4 - the same data center

6 - different data centers

The rack topology feature has to be configured beforehand


Write path
1. A client creates a file by calling the "create()" method on the
"DistributedFileSystem" object

2. The client calls the namenode to create the file with no blocks
in the filesystem namespace

3. The client calls the namenode to return a list of datanodes to store replicas of a batch of data blocks

4. The client connects to the first datanode from the list

5. The client streams the data block to the datanode

6. The datanode connects to the second datanode from the list, streams the data block, and so on

7. The datanodes acknowledge the receiving of the data block

8. The client repeats steps 4-5 for the next blocks or steps 3-5 for
the next batch of blocks, or closes the file when written

9. The client notifies the namenode that the file is written

Write path #2

T. White. "Hadoop: The Definitive Guide"

Data write preferences


1st copy - on the client (if running datanode daemon) or on randomly
selected non-overloaded datanode

2nd copy - on a datanode in a different rack in the same data center

3rd copy - on a randomly selected non-overloaded datanode in the same rack as the 2nd copy

4th and further copies - on randomly selected non-overloaded datanodes

Namenode High Availability


Namenode - SPOF (Single Point Of Failure)

Manual namenode recovery process may take up to 30 minutes

Namenode HA feature available since 0.23 release

High Availability type: active / standby

Based on: QJM (Quorum Journal Manager) and ZooKeeper

Failover type: manual or automatic with ZKFC (ZooKeeper Failover Controller)

STONITH (Shoot The Other Node In The Head)

Fencing methods

Secondary namenode functionality moved to the standby namenode
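
The HA pair can be inspected and switched over from the command line; a minimal sketch, assuming namenode1 and namenode2 are the logical namenode names defined in the configuration:

hdfs haadmin -getServiceState namenode1     # prints "active" or "standby"
hdfs haadmin -getServiceState namenode2
hdfs haadmin -failover namenode1 namenode2  # manual, fenced failover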

Namenode High Availability #2


http://www.binospace.com/wp-content/uploads/2013/07/slide-10-1024_%E5%89%AF
%E6%9C%AC.jpg

Journal Node
Relatively lightweight

Usually collocated on the machines running the namenode or resource manager daemons

Changes to the "edits" file must be written to a majority of journal nodes

Usually deployed in 3, 5, 7, 9, etc.

Namenode Federation
Namenode memory limitations

Namenode federation feature available since 0.23 release

Horizontal scalability

Filesystem partitioning

ViewFS

Federation of HA pairs

Lab Exercise 1.2.3


Lab Exercise 1.2.4

Namenode Federation #2

E. Sammer. "Hadoop Operations"

Interfaces
Underlying interface:

Java API - native Hadoop access method

Additional interfaces:
CLI

C interface

HTTP interface

FUSE

URIs
[path] - path on the default filesystem (typically HDFS)

file://[path] - local filesystem

hdfs://[namenode]/[path] - remote HDFS filesystem

http://[namenode]/[path] - HTTP

viewfs://[path] - ViewFS

Java interface
Reminder: Hadoop is written in Java

Intended for Hadoop developers rather than Hadoop administrators

Based on the "FileSystem" class

Reading data:

by streaming from Hadoop URIs

by using Hadoop API

Writing data:

by streaming to Hadoop URIs

by appending to files

Command-line interface
Shell-like commands

All begin with the "hdfs" keyword

No current working directory

Example: Overriding the replication factor:

hdfs dfs -setrep [RF] [file]

All commands available here

Base commands
hdfs dfs -mkdir [URI] - creates a directory

hdfs dfs -rm [URI] - removes the file

hdfs dfs -rmr [URI] - removes the directory

hdfs dfs -cat [URI] - displays a content of the file

hdfs dfs -ls [URI] - displays a content of the directory

hdfs dfs -cp [source URI] [destination URI] - copies the file or the
directory

hdfs dfs -mv [source URI] [destination URI] - moves the file or the
directory

hdfs dfs -du [URI] - displays a size of the file or the directory

Base commands #2
hdfs dfs -chmod [mode] [URI] - changes permissions of the file or
the directory

hdfs dfs -chown [owner] [URI] - changes the owner of the file or the
directory

hdfs dfs -chgrp [group] [URI] - changes the owner group of the file
or the directory
hdfs dfs -put [source path] [destination URI] - uploads the data
into HDFS

hdfs dfs -getmerge [source directory] [destination file] - merges files from the directory into a single file

hadoop archive -archiveName [name] [source URI]* [destination URI] - creates a Hadoop archive

hadoop fsck [URI] - performs an HDFS filesystem check

hadoop version - displays the Hadoop version

HTTP interface
HDFS HTTP ports:

50070 - namenode

50075 - datanode

50090 - secondary namenode

Access methods:

direct

via proxy
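
A minimal sketch of direct HTTP access through WebHDFS (the hostname and paths are assumptions, and WebHDFS has to be enabled with dfs.webhdfs.enabled):

curl -i "http://namenode.hadooplab:50070/webhdfs/v1/user/terminal?op=LISTSTATUS"
curl -i -L "http://namenode.hadooplab:50070/webhdfs/v1/user/terminal/data.txt?op=OPEN"
# the second call is redirected to a datanode on port 50075, which streams the file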

FUSE
Reminder:

HDFS is a user-space filesystem

HDFS is a non-POSIX-compliant filesystem

Fuse-DFS module

Allows HDFS to be mounted under the Linux FHS (Filesystem Hierarchy Standard)

Allows Unix tools to be used against the HDFS


Documentation: src/contrib/fuse-dfs

Quotas
Count quotas vs space quotas:

count quotas - limit the number of files inside the HDFS directory

space quotas - limit the post-replication disk space utilization of the HDFS directory

Setting quotas:

hadoop dfsadmin -setQuota [count] [path]


hadoop dfsadmin -setSpaceQuota [size] [path]

Removing quotas:

hadoop dfsadmin -clrQuota [path]


hadoop dfsadmin -clrSpaceQuota [path]

Viewing quotas:

hadoop fs -count -q [path]

Limitations
Low-latency data access:

tens of milliseconds

alternatives: HBase, Cassandra

Lots of small files

limited by namenode's physical memory

inefficient storage utilization due to large block size

Multiple writers, arbitrary file modifications:

single writer
append operations only

First Day - Session III

YARN

http://th00.deviantart.net/fs71/PRE/f/2013/185/c/9/spinning_yarn_by_chaosfissure-
d6byoqu.jpg

Legacy architecture
Historically, there had been only one data processing paradigm
for Hadoop - MapReduce

The Hadoop MRv1 architecture consisted of two core components: HDFS and MapReduce

The MapReduce component was responsible for cluster resource management and MapReduce job execution

As other data processing paradigms became available, Hadoop with MRv2 (YARN) was developed

Goals and motivation


Segregation of services:
a single MapReduce service for both resource and job management purposes

separate YARN services for resource and job management purposes

Scalability:

the MapReduce framework hits scalability limitations in clusters consisting of 5000 nodes

the YARN framework doesn't hit scalability limitations even in clusters consisting of 40000 nodes

Flexibility:

the MapReduce framework is capable of executing MapReduce jobs only

the YARN framework is capable of executing any type of job

Daemons
ResourceManager:

responsible for cluster resources management

consists of a scheduler and an application manager component

1 active per cluster

NodeManager:

responsible for node resource management, job execution and task execution

provides containers abstraction

many per cluster

JobHistoryServer:

responsible for serving information about completed jobs


1 active per cluster

Design

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html

YARN job stages


1. A client retrieves an application ID from the resource manager

2. The client calculates input splits and writes the job resources (e.g. the jar file) into HDFS

3. The client submits the job by calling the submitApplication() method on the resource manager

4. The resource manager allocates a container for the job execution purpose and launches the application master process inside the container

5. The application master process initializes the job and retrieves the job resources from HDFS

6. The application master process requests container allocation from the resource manager

7. The resource manager allocates containers for the task execution purpose

8. The application master process requests the node managers to launch JVMs inside the allocated containers

9. Containers retrieve job resources and data from HDFS

10. Containers periodically report progress and status updates to the application master process

11. The client periodically polls the application master process for progress and status updates
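
Submitted applications can be followed from the command line; a sketch (the application ID is a made-up example, and the log command requires log aggregation to be enabled):

yarn application -list                                      # running applications and their states
yarn application -status application_1400000000000_0001
yarn logs -applicationId application_1400000000000_0001    # aggregated container logs after completion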

YARN job stages #2

http://pravinchavan.files.wordpress.com/2013/04/yarn-2.png

ResourceManager high availability


ResourceManager - SPOF

High Availability type: active / passive

Based on: ZooKeeper

Failover type: manual or automatic with ZKFC

ZKFC daemon embedded in the ResourceManager code

STONITH
Fencing methods

Resources configuration
Allocated resources:

physical memory

virtual cores

Resources configuration (yarn-site.xml):

yarn.nodemanager.resource.memory-mb - available
memory in megabytes

yarn.nodemanager.resource.cpu-vcores - available number of virtual cores

yarn.scheduler.minimum-allocation-mb - minimum allocated memory in megabytes

yarn.scheduler.minimum-allocation-vcores - minimum
number of allocated virtual cores

yarn.scheduler.increment-allocation-mb - increment
memory in megabytes into which the request is rounded up

yarn.scheduler.increment-allocation-vcores - increment
number of virtual cores into which the request is rounded up
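
The resources advertised by each node manager can be verified after the configuration is applied; a minimal sketch:

yarn node -list        # node managers, their state and running container count
# per-node memory and vcore usage is also visible in the resource manager web UI on port 8088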

Lab Exercise 1.3.1

First Day - Session IV

MapReduce
http://cacm.acm.org/system/assets/0000/2020/121609_CACMpg73_MapReduce_A_Flexible.la
rge.jpg

Goals and motivation


Automatic parallelism and distribution of work:

developers are required to write only simple map and reduce functions

distribution and parallelism are handled by the MapReduce framework

Data locality:

computation operations are performed on data local to the computing node

data transfer over the network is reduced to an absolute minimum

Fault tolerance:
jobs high availability

tasks distribution

Scalability:

horizontal scalability

Design
Fully automated parallel data processing paradigm

Jobs abstraction as an atomic unit

Jobs operate on key-value pairs

Job components:

map tasks

reduce tasks

Benefits:

simplicity of development

fast data processing times

MapReduce by analogy
Lab Exercise 1.4.1

Input splits vs HDFS blocks


http://media.wiley.com/Lux/21/423321.image0.jpg

Sample data
http://academic.udayton.edu/kissock/http/Weather/gsod95-
current/allsites.zip

1 1 1995 82.4
1 2 1995 75.1
1 3 1995 73.7
1 4 1995 77.1
1 5 1995 79.5
1 6 1995 71.3
1 7 1995 71.4
1 8 1995 75.2
1 9 1995 66.3
1 10 1995 61.8

Sample map function


Function:

#!/usr/bin/perl
use strict;
my %results;
while (<>)
{
$_ =~ m/.*(\d{4}).*\s(\d+\.\d+)/;
push (@{$results{$1}}, $2);
}
foreach my $year (sort keys %results)
{
print "$year: ";
foreach my $temperature (@{$results{$year}})
{
print "$temperature ";
}
print "\n";
}

Function output:

year: temperature1 temperature2 ...

Sample reduce function


Function:

#!/usr/bin/perl
use List::Util qw(max);
use strict;
my %results;
while (<>)
{
$_ =~ m/(\d{4})\:\s(.*)/;
my @temperatures = split(' ', $2);
@{$results{$1}} = @temperatures;
}
foreach my $year (sort keys %results)
{
my $temperature = max(@{$results{$year}});
print "$year: $temperature\n";
}

Function output:

year: temperature
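
The two functions can be tested outside Hadoop by simulating the shuffle and sort phase with the Unix sort command; a sketch, assuming the scripts are saved as map.pl and reduce.pl and the sample data is in allsites.txt:

chmod +x map.pl reduce.pl
cat allsites.txt | ./map.pl | sort | ./reduce.pl
# prints one line per year with that year's maximum temperature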

Shuffle and sort phase

http://blog.cloudera.com/wp-content/uploads/2014/03/ssd1.png
Map task in details
Map function:

takes key-value pair

returns key-value pair

Input key-value pair: file - input split

Output key-value pair: defined by the map function

Output key-value pairs are sorted by the key

Partitioner assigns each key-value pair to a partition

Reduce task in details


Reduce function:

takes key-value pair

returns key-value pair

Input key-value pair: returned by the map function

Output key-value pair: final result

Input is received in the shuffle and sort phase

Each reducer is assigned to one or more partitions

Reducers start copying the data from partitions as soon as they are written

Combine function
Executed by the map task

Executed just after the map function

Most often the same as the reduce function


Lab Exercise 1.4.2

Java MapReduce
MapReduce jobs are written in Java

org.apache.hadoop.mapreduce namespace:

Mapper class and map() function

Reducer class and reduce() function

MapReduce jobs are executed by the hadoop program

Sample Java MapReduce job available here

Sample Java MapReduce job execution:

hadoop [MapReduce program path] [input data path] [output data path]

Streaming
Uses the hadoop-streaming.jar library

Supports all languages capable of reading from stdin and writing to stdout

Sample streaming MapReduce job execution:

hadoop jar [hadoop-streaming.jar path] \
    -input [input data path] \
    -output [output data path] \
    -mapper [map function path] \
    -reducer [reduce function path] \
    -combiner [combiner function path] \
    -file [attached files path]
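
For example, the Perl functions from Session IV of the first day could be submitted roughly as follows (the path to the streaming jar and the HDFS paths are assumptions and differ between distributions):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/terminal/weather \
    -output /user/terminal/weather-out \
    -mapper map.pl \
    -reducer reduce.pl \
    -file map.pl \
    -file reduce.pl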

Pipes
Supports all languages capable of reading from and writing to
network sockets

Mostly applicable for C and C++ languages


Sample piped MapReduce job execution:

hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input [input data path] \
    -output [output data path] \
    -program [MapReduce program path] \
    -file [attached files path]

Limitations
If one woman can have a baby in nine months, nine women should
be able to have a baby in one month.

Batch processing system:

MapReduce jobs run on the order of minutes or hours

no support for low-latency data access

Simplicity:

optimized for full table scan operations

no support for data access optimization

Too low-level:

raw data processing functions

alternatives: Pig, Hive

Not all algorithms can be parallelised:

algorithms with shared state

algorithms with dependent variables

Second Day - Session I


Pseudo-distributed mode

http://www.cpusage.com/blog/wp-content/uploads/2013/11/distributed-computing-1.jpg

Hadoop installation in pseudo-distributed


mode

Lab Exercise 2.1.1

Lab Exercise 2.1.2

Lab Exercise 2.1.3

Lab Exercise 2.1.4

Second Day - Session II


Running MapReduce jobs

http://news.nost.org.cn/wp-content/uploads/2013/06/BigDataBigBuildings.jpg

Running MapReduce jobs in


pseudo-distributed mode

Lab Exercise 2.2.1

Lab Exercise 2.2.2

Lab Exercise 2.2.3

Lab Exercise 2.2.4

Lab Exercise 2.2.5

Lab Exercise 2.2.6


Second Day - Session III

Hadoop ecosystem

http://www.imaso.co.kr/data/article/1(2542).jpg

Pig - Goals and motivation


Developed by Yahoo!

Simplicity of development:

while MapReduce programs themselves are simple, designing complex regular expressions may be challenging and time-consuming

Pig offers much richer data structures for pattern matching purposes

Length and clarity of the source code:

MapReduce programs are usually long and hard to understand

Pig programs are usually short and understandable

Simplicity of execution:
MapReduce programs require compiling, packaging and
submitting

Pig programs can be executed ad-hoc from an interactive shell

Pig - Design
Pig components:

Pig Latin - scripting language for data flows

each program is made up of a series of transformations applied to the input data

each transformation is made up of a series of MapReduce jobs run on the input data

Pig Latin programs execution environment:

launched from a single JVM

execution distributed over the Hadoop cluster

Pig - relational operators


LOAD - loads the data from a filesystem or other storage into a relation

STORE - saves the relation to a filesystem or other storage

DUMP - prints the relation to the console

FILTER - removes unwanted rows from the relation

DISTINCT - removes duplicate rows from the relation

SAMPLE - selects a random sample from the relation

JOIN - joins two or more relations

GROUP - groups the data in a single relation

ORDER - sorts the relation by one or more fields


LIMIT - limits the size of the relation to a maximum number of tuples

and more
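
A minimal Pig Latin sketch combining a few of these operators on the weather data set used earlier (the HDFS path and field layout are assumptions), entered in the interactive grunt shell:

grunt> records = LOAD '/user/terminal/weather' USING PigStorage(' ')
                 AS (month:int, day:int, year:int, temp:float);
grunt> hot     = FILTER records BY temp > 80.0;
grunt> by_year = GROUP hot BY year;
grunt> counts  = FOREACH by_year GENERATE group, COUNT(hot);
grunt> DUMP counts;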

Pig - mathematical and logical operators


literal - constant / variable

$n - n-th column

x + y / x - y - addition / subtraction

x * y / x / y - multiplication / division

x == y / x != y - equals / not equals

x > y / x < y - greater than / less than

x >= y / x <= y - greater than or equal / less than or equal

x matches y - pattern matching with regular expressions

x is null - is null

x is not null - is not null

x or y - logical or

x and y - logical and

not x - logical negation

Pig - types
int - 32-bit signed integer

long - 64-bit signed integer

float - 32-bit floating-point number

double - 64-bit floating-point number


chararray - character array in UTF-8 format

bytearray - byte array

tuple - sequence of fields of any type

map - set of key-value pairs:

key - character array

value - any type

Pig - Exercises
Lab Exercise 2.3.1

Hive - Goals and Motivation


Developed by Facebook

Data structurization:

HDFS stores data in an unstructured format

Hive stores data in a structured format

SQL interface:

HDFS does not provide the SQL interface

Hive provides an SQL-like interface

Hive - Design
Metastore:

stores Hive metadata

RDBMS database:

embedded: Derby
local: MySQL, PostgreSQL, etc.

remote: MySQL, PostgreSQL, etc.

HiveQL:

used to read data by accepting queries and translating them into a series of MapReduce jobs

used to write data by uploading it into HDFS and updating the Metastore

Interfaces:

CLI

Web GUI

ODBC

JDBC

Hive - Data Types


TINYINT - 1-byte signed integer

SMALLINT - 2-byte signed integer

INT - 4-byte signed integer

BIGINT - 8-byte signed integer

FLOAT - 4-byte floating-point number

DOUBLE - 8-byte floating-point number

BOOLEAN - true/false value

STRING - character string

BINARY - byte array

MAP - an unordered collection of key-value pairs


Hive - SQL vs HiveQL

Feature                 SQL                      HiveQL
Updates                 UPDATE, INSERT, DELETE   INSERT OVERWRITE TABLE
Transactions            Supported                Not supported
Indexes                 Supported                Not supported
Latency                 Sub-second               Minutes
Multitable inserts      Partially supported      Supported
Create table as select  Partially supported      Supported
Views                   Updatable                Read-only
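
A minimal HiveQL sketch run from the command line with hive -e (the table name, field layout and HDFS path are assumptions):

hive -e "CREATE TABLE weather (month INT, day INT, year INT, temp FLOAT)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';"
hive -e "LOAD DATA INPATH '/user/terminal/weather' INTO TABLE weather;"
hive -e "SELECT year, MAX(temp) FROM weather GROUP BY year;"
# the SELECT is translated into a MapReduce job and takes minutes rather than sub-seconds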

Hive - Exercises
Lab Exercise 2.3.2

Sqoop - Goals and Motivation


Cooperation with RDBMS:

RDBMS are widely used for data storing purpose

need for importing / exporting data from RDBMS to HDFS and vice versa

Data import / export flexibility:


manual data import / export using HDFS CLI, Pig or Hive

automatic data import / export using Sqoop

Sqoop - Design
JDBC - used to connect to the RDBMS

SQL - used to retrieve / populate data from / to RDBMS

MapReduce - used to retrieve / populate data from / to HDFS

Sqoop Code Generator - used to create a table-specific class to hold records extracted from the table
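
A minimal sketch of both directions (the JDBC connection string, credentials, table names and HDFS paths are assumptions):

sqoop import --connect jdbc:mysql://dbserver.hadooplab/sales \
    --username sqoop --password sqoop \
    --table customers --target-dir /user/terminal/customers

sqoop export --connect jdbc:mysql://dbserver.hadooplab/sales \
    --username sqoop --password sqoop \
    --table results --export-dir /user/terminal/weather-out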

Sqoop - Import process

T. White. "Hadoop: The Definitive Guide"

Sqoop - Export process


T. White. "Hadoop: The Definitive Guide"

Sqoop - Exercises
Lab Exercise 2.3.3

HBase - Goals and Motivation


Designed by Google (BigTable)

Developed by Apache

Interactive queries:

MapReduce, Pig and Hive are batch processing frameworks

HBase is a real-time processing framework

Data structurization:

HDFS stores data in an unstructured format


HBase stores data in a semi-structured format

HBase - Design
Runs on top of HDFS

Used to be considered a core Hadoop component

Key-value NoSQL database

Columnar storage

Metadata stored in the -ROOT- and .META. tables

Data can be accessed interactively or using MapReduce

HBase - Architecture
Tables are distributed across the cluster

Tables are automatically partitioned horizontally into regions

Tables consist of rows and columns

Table cells are versioned (by a timestamp by default)

Table cell type is an uninterpreted array of bytes

Table rows are sorted by the row key which is the table's primary key

Table columns are grouped into column families

Table column families must be defined on the table creation stage

Table columns can be added on the fly if the column family exists
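
A minimal HBase shell sketch illustrating the points above (the table name, column family and row key are assumptions):

hbase shell
create 'weather', 'd'                          # table with one column family
put 'weather', '1995-01-01', 'd:temp', '82.4'
get 'weather', '1995-01-01'
scan 'weather', {LIMIT => 10}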

HBase - Exercises
Lab Exercise 2.3.4

Flume - Goals and motivation


Streaming data:

HDFS does not have any built-in mechanism for handling streaming data flows

Flume is designed to collect, aggregate and move streaming data flows into HDFS

Data queueing:

when writing directly to HDFS, data can be lost during spike periods

Flume is designed to buffer data during spike periods

Guarantee of data delivery:

Flume is designed to guarantee delivery of the data by using single-hop message delivery semantics

Flume - Design
What is Flume? Flume is a frontend to HDFS

Event - unit of data that Flume can transport

Flow - movement of the events from a point of origin to their final destination

Client - an interface implementation that operates at the point of origin of the events

Agent - an independent process that hosts Flume components such as sources, channels and sinks

Source - an interface implementation that can consume the events delivered to it and deliver them to channels

Channel - a transient store for the events, which are delivered by the sources and removed by sinks

Sink - an interface implementation that can remove the events from the channel and transmit them further
Flume - Design #2

http://archive.cloudera.com/cdh/3/flume-ng-1.1.0-cdh3u4/images/UserGuide_image00.png

Flume - Closer look


Flume sources:

Avro, HTTP, Syslog, etc.

https://flume.apache.org/FlumeUserGuide.html#flume-sources

Sink types:

regular - transmit the event to another agent

terminal - transmit the event to its final destination

Flow pipeline:

from one source to one destination

from multiple sources to one destination

from one source to multiple destinations

from multiple sources to multiple destinations
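
A minimal single-agent sketch: a netcat source feeding a memory channel drained by an HDFS sink (all names, ports and paths are assumptions):

# /etc/flume-ng/conf/agent.conf
agent.sources  = src1
agent.channels = ch1
agent.sinks    = sink1
agent.sources.src1.type = netcat
agent.sources.src1.bind = localhost
agent.sources.src1.port = 44444
agent.sources.src1.channels = ch1
agent.channels.ch1.type = memory
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = hdfs://namenode.hadooplab:8020/flume/events
agent.sinks.sink1.channel = ch1

flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/agent.conf --name agent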

Oozie - Goals and motivation


Hadoop jobs execution:
Hadoop clients execute Hadoop jobs from CLI
using hadoop command

Oozie provides a web-based GUI for Hadoop job definition and execution

Hadoop jobs management:

Hadoop doesn't provide any built-in mechanism for job management (e.g. forks, merges, decisions, etc.)

Oozie provides a Hadoop job management feature based on a control dependency DAG

Oozie - Design
What is Oozie? Oozie is a frontend to Hadoop jobs

Oozie - workflow scheduling system

Oozie workflow

control flow nodes:

define the beginning and the end of the workflow

provide a mechanism for controlling the workflow execution path

action nodes:

are the mechanisms by which the workflow triggers the execution of Hadoop jobs

supported job types include Java MapReduce, Pig, Hive, Sqoop and more

Oozie - Web GUI


http://www.gruter.com/download_files/cloumon-oozie-01.png

Second Day - Session IV

Supporting tools and Big Data


future

http://www.datacenterjournal.com/wp-content/uploads/2012/11/big-data1-612x300.jpg
Hue - Goals and motivation
Hadoop cluster management:

each Hadoop component is managed independently

Hue provides centralized Hadoop components management tool

Hadoop components are mostly managed from the CLI

Hue provides a web-based GUI for Hadoop components management

Open Source:

other cluster management tools (e.g. Cloudera Manager) are paid and closed-source

Hue is a free-of-charge, open-source tool

Hue - Design
What is Hue? Hue is a frontend to Hadoop components

Hue components:

Hue server - web application's container

Hue database - web application's data

Hue UI - web user interface application

REST API - for communication with Hadoop components

Supported browsers:

Google Chrome

Mozilla Firefox 17+

Internet Explorer 9+
Safari 5+

Lab Exercise 2.4.1

Hue - Design #2

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/Hue-2-User-
Guide/images/huearch.jpg

Cloudera Manager - Goals and motivation


Hadoop cluster deployment and configuration:

Hadoop components need to be deployed and configured manually

Cloudera Manager provides automatic deployment and configuration of Hadoop components

Hadoop cluster management:

each Hadoop component is managed independently

Cloudera Manager provides a centralized Hadoop components management tool

Hadoop components are mostly managed from the CLI

Cloudera Manager provides a web-based GUI for Hadoop components management

Cloudera Manager - Design


What is Cloudera Manager? Cloudera Manager is a frontend to
Hadoop components

Cloudera Manager versions:

Express - deployment, configuration, management, monitoring and diagnostics

Enterprise - advanced management features and support

Cloudera Manager components:

Cloudera Manager Server - web application's container and core Cloudera Manager engine

Cloudera Manager Database - web application's data and monitoring information

Admin Console - web user interface application

Agent - installed on every component in the cluster

REST API - for communication with Agents

Lab Exercise 2.4.2

Lab Exercise 2.4.3

Lab Exercise 2.4.4

Cloudera Manager - Design #2


http://www.cloudera.com/content/cloudera-content/cloudera-docs/Navigator/1.0/Cloudera-
Navigator-Installation-and-User-Guide/images/image1.jpeg

Impala - Goals and Motivation


Designed by Google (Dremel)

Developed by Cloudera

Interactive queries:

MapReduce, Pig and Hive are batch processing frameworks

Impala is a real-time processing framework

Data structurization:

HDFS stores data in an unstructured format

HBase and Cassandra store data in a semi-structured format

Impala stores data in a structured format

SQL interface:

HDFS does not provide the SQL interface

HBase and Cassandra do not provide the SQL interface by default

Impala provides the SQL interface

Impala - Design
Runs on top of HDFS or HBase

Columnar storage

Uses Hive metastore

Implements SQL tables and queries

Interfaces:

CLI

web GUI

ODBC

JDBC

Impala - Daemons
Impala Daemon:

performs read and write operations

accepts queries and returns query results

parallelizes queries and distributes the work

Impala Statestore:

checks health of the Impala Daemons

reports detected failures to other nodes in the cluster

Impala Catalog Server:

relays metadata changes to all nodes in the cluster
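
A minimal sketch of querying through a chosen Impala daemon with impala-shell (the hostname is an assumption, and the weather table is assumed to exist in the shared Hive metastore):

impala-shell -i worker1.hadooplab -q "INVALIDATE METADATA"
impala-shell -i worker1.hadooplab -q "SELECT year, MAX(temp) FROM weather GROUP BY year"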

Tez - Goals and Motivation


Interactive queries:
MapReduce is a batch processing framework

Tez is a real-time processing framework

Job types:

MapReduce is destined for key-value-based data processing

Tez is destined for DAG-based data processing

Tez - Design

http://images.cnitblog.com/blog/312753/201310/19114512-
89b2001bf0444863b5c11da4d5100401.png

Spark
Fast and general engine for large-scale data processing.

Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Supports applications written in Java, Scala, Python, R.

Combines SQL, streaming, and complex analytics.

Runs on Hadoop, Mesos, standalone, or in the cloud.

Can access diverse data sources including HDFS, Cassandra, HBase, and S3.

NoSQL
"Not only SQL" database"

key-value store concept

fast and easy

examples:

MongoDB, https://www.mongodb.org/

Redis, http://redis.io/

Couchbase, http://www.couchbase.com/

Third Day - Session I

Hadoop cluster planning

http://news.fincit.com/wp-content/uploads/2012/07/tangle.jpg

Picking up a distribution
Considerations:
license cost

open / closed source

Hadoop version

available features

supported OS types

ease of deployment

integrity

support

Hadoop releases

http://drcos.boudnik.org/

Release notes: http://hadoop.apache.org/releases.html

Why CDH?
Free
Open source

Stable (always based on stable Hadoop releases)

Wide range of available features

Support for RHEL, SUSE and Debian

Easy to deploy

Very well integrated

Very well supported

Hardware selection
Considerations:

CPU

RAM

storage IO

storage capacity

storage resilience

network bandwidth

network resilience

hardware resilience

Hardware selection #2
Master nodes:

namenodes

resource managers
secondary namenode

Worker nodes

Network devices

Supporting nodes

Namenode hardware selection


CPU: quad-core 2.6 GHz

RAM: 24-96 GB DDR3 (1 GB per 1 million of blocks)

Storage IO: SAS / SATA II

Storage capacity: less than 1 TB

Storage resilience: RAID

Network bandwidth: 2 Gb/s

Network resilience: LACP

Hardware resilience: non-commodity hardware

ResourceManager hardware selection


CPU: quad-core 2.6 GHz

RAM: 24-96 GB DDR3 (unpredictable)

Storage IO: SAS / SATA II

Storage capacity: less than 1 TB

Storage resilience: RAID

Network bandwidth: 2 Gb/s

Network resilience: LACP


Hardware resilience: non-commodity hardware

Secondary namenode hardware selection


Almost always identical to namenode

The same requirements for RAM, storage capacity and desired resilience level

Very often used as replacement hardware in case of namenode failure

Worker hardware selection


CPU: 2 * 6 core 2.9 GHz

RAM: 64-96 GB DDR3

Storage IO: SAS / SATA II

Storage capacity: 24-36 TB JBOD

Storage resilience: none

Network bandwidth: 1-10 Gb/s

Network resilience: none

Hardware resilience: commodity hardware

Network devices hardware selection


Downlink bandwidth: 1-10 Gb/s

Uplink bandwidth: 10 Gb/s

Computing resources: vendor-specific

Resilience: by network topology

Supporting nodes hardware selection


Backup system

Monitoring system

Security system

Logging system

Cluster growth planning


Considerations:

replication factor: 3 by default

temporary data: 20-30% of the worker storage space

data growth: based on trends analysis

the amount of data being analysed: based on cluster requirements

usual task RAM requirements: 2-4 GB
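
A worked sizing example under the considerations above (all figures are illustrative assumptions):

100 TB of initial data * 3 (replication factor) * 1.25 (temporary data) = 375 TB of raw capacity
375 TB / 30 TB of usable JBOD storage per worker                        = 12.5, i.e. ~13 workers
# data growth from the trends analysis is then added on top of this baseline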

Lab Exercise 3.1.1

Lab Exercise 3.1.2

Network topologies
Hadoop East/West traffic flow design

Network topologies:

Leaf

Spine

Inter data center replication

Lab Exercise 3.1.3

Leaf topology
E. Sammer. "Hadoop Operations"

Spine topology

E. Sammer. "Hadoop Operations"

Blades, SANs, RAIDs and virtualization


Blades considerations:

Hadoop is designed to work on the commodity hardware

Hadoop cluster scales out rather than up

SANs considerations:

Hadoop is designed for processing local data

HDFS is distributed and shared by design


RAIDs considerations:

Hadoop is designed to provide data durability

RAID introduces additional limitations and overhead

Virtualization considerations:

Hadoop worker nodes don't benefit from virtualization

Hadoop is a clustering solution, which is the opposite of virtualization

Hadoop directories

Directory                               Linux path                     Grows?
Hadoop install directory                /usr/*                         no
Namenode metadata directory             defined by an administrator    yes
Secondary namenode metadata directory   defined by an administrator    yes
Datanode data directory                 defined by an administrator    yes
MapReduce local directory               defined by an administrator    no
Hadoop log directory                    /var/log/*                     yes
Hadoop pid directory                    /var/run/hadoop                no
Hadoop temp directory                   /tmp                           no

Filesystems
LVM (Logical Volume Manager) - HDFS killer

Considerations:

extent-based design

burn-in time

Winners:

EXT4

XFS
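
A minimal sketch of preparing a single datanode disk (the device name and mount point are assumptions); mounting with noatime avoids an extra metadata write on every block read:

mkfs -t ext4 -m 0 /dev/sdb1          # -m 0: no reserved root blocks on a data-only disk
mkdir -p /datanodedir1
mount -o noatime /dev/sdb1 /datanodedir1
# add the same options to /etc/fstab so the mount survives a reboot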

Lab Exercise 3.1.4

Third Day - Session II and III

Hadoop cluster installation and


configuration

http://www.michael-noll.com/blog/uploads/Yahoo-hadoop-cluster_OSCON_2007.jpeg
Underpinning software
Prerequisites:

JDK (Java Development Kit)

Additional software:

Cron daemon

NTP client

SSH server

MTA (Mail Transport Agent)

RSYNC tool

NMAP tool

Hadoop installation
Should be performed as the root user

Installation instructions are distribution-dependent

Installation types:

manual installation

automated installation

Installation sources:

packages

tarballs

source code

Hadoop package content


/etc/hadoop - configuration files

/etc/rc.d/init.d - SYSV init scripts for Hadoop daemons

/usr/bin - executables

/usr/include/hadoop - C++ header files for Hadoop pipes

/usr/lib - C libraries

/usr/libexec - miscellaneous files used by various libraries and scripts

/usr/sbin - administrator scripts

/usr/share/doc/hadoop - License, NOTICE and README files

CDH-specific additions
Binaries for Hadoop Ecosystem

Use of alternatives system for Hadoop configuration directory:

/etc/alternatives/hadoop-conf

Default nofile and nproc limits:

/etc/security/limits.d

Hadoop configuration
hadoop-env.sh - environmental variables

core-site.xml - all daemons

hdfs-site.xml - HDFS daemons

mapred-site.xml - MapReduce daemons

log4j.properties - logging information

masters (optional) - list of servers which run the secondary namenode daemon

slaves (optional) - list of servers which run the datanode and tasktracker daemons

fair-scheduler.xml (optional) - fair scheduler configuration and properties

capacity-scheduler.xml (optional) - capacity scheduler configuration and properties

dfs.include (optional) - list of servers permitted to connect to the namenode

dfs.exclude (optional) - list of servers denied to connect to the namenode

hadoop-policy.xml - user / group permissions for RPC function execution

mapred-queue-acls.xml - user / group permissions for job submission

Hadoop XML files structure


<?xml version="1.0"?>
<configuration>
<property>
<name>[name]</name>
<value>[value]</value>
(<final>{true | false}</final>)
</property>
</configuration>

Hadoop configuration
Parameter: fs.defaultFS

Configuration file: core-site.xml

Purpose: specifies the default filesystem used by clients

Format: Hadoop URL

Used by: NN, SNN, DN, RM, NM, clients


Example:

<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode.hadooplab:8020</value>
</property>

Hadoop configuration
Parameter: io.file.buffer.size

Configuration file: core-site.xml

Purpose: specifies a size of the IO buffer

Guideline: 64 KB

Format: integer

Used by: NN, SNN, DN, JT, TT, clients

Example:

<property>
<name>io.file.buffer.size</name>
<value>65536</value>
</property>

Hadoop configuration
Parameter: fs.trash.interval

Configuration file: core-site.xml

Purpose: specifies the amount of time (in minutes) the file is retained
in the .Trash directory

Format: integer

Used by: NN, clients

Example:

<property>
<name>fs.trash.interval</name>
<value>60</value>
</property>

Hadoop configuration
Parameter: ha.zookeeper.quorum

Configuration file: core-site.xml

Purpose: specifies the list of ZooKeeper quorum servers used for the automatic jobtracker failover purpose

Format: comma separated list of hostname:port pairs

Used by: JT

Example:

<property>
<name>ha.zookeeper.quorum</name>

<value>jobtracker1.hadooplab:2181,jobtracker2.hadooplab:2181</value>
</property>

Hadoop configuration
Parameter: dfs.name.dir

Configuration file: hdfs-site.xml

Purpose: specifies where the namenode should store a copy of the


HDFS metadata

Format: comma separated list of Hadoop URLs

Used by: NN

Example:

<property>
<name>dfs.name.dir</name>
<value>file:///namenodedir,file:///mnt/nfs/namenodedir</value>
</property>
Hadoop configuration
Parameter: dfs.data.dir

Configuration file: hdfs-site.xml

Purpose: specifies where the datanode should store HDFS data

Format: comma separated list of Hadoop URLs

Used by: DN

Example:

<property>
<name>dfs.data.dir</name>
<value>file:///datanodedir1,file:///datanodedir2</value>
</property>

Hadoop configuration
Parameter: fs.checkpoint.dir

Configuration file: hdfs-site.xml

Purpose: specifies where the secondary namenode should store HDFS


metadata

Format: comma separated list of Hadoop URLs

Used by: SNN

Example:

<property>
<name>fs.checkpoint.dir</name>
<value>file:///secondarynamenodedir</value>
</property>

Hadoop configuration
Parameter: dfs.permissions.supergroup

Configuration file: hdfs-site.xml


Purpose: specifies a group permitted to perform all HDFS filesystem
operations

Format: string

Used by: NN, clients

Example:

<property>
<name>dfs.permissions.supergroup</name>
<value>supergroup</value>
</property>

Hadoop configuration
Parameter: dfs.balance.bandwidthPerSec

Configuration file: hdfs-site.xml

Purpose: specifies the maximum allowed bandwidth for HDFS


balancing operations

Format: integer

Used by: DN

Example:

<property>
<name>dfs.balance.bandwidthPerSec</name>
<value>100000000</value>
</property>

Hadoop configuration
Parameter: dfs.blocksize

Configuration file: hdfs-site.xml

Purpose: specifies the HDFS block size of newly created files

Guideline: 64 MB
Format: integer

Used by: clients

Example:

<property>
<name>dfs.blocksize</name>
<value>67108864</value>
</property>

Hadoop configuration
Parameter: dfs.datanode.du.reserved

Configuration file: hdfs-site.xml

Purpose: specifies the disk space reserved for MapReduce temporary


data

Guideline: 20-30% of the datanode disk space

Format: integer

Used by: DN

Example:

<property>
<name>dfs.datanode.du.reserved</name>
<value>10737418240</value>
</property>

Hadoop configuration
Parameter: dfs.namenode.handler.count

Configuration file: hdfs-site.xml

Purpose: specifies the number of concurrent namenode handlers

Guideline: natural logarithm of the number of cluster nodes, times 20, as a whole number
Format: integer

Used by: NN

Example:

<property>
<name>dfs.namenode.handler.count</name>
<value>105</value>
</property>
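
The guideline value can be computed on the command line; the value 105 above corresponds to a cluster of roughly 200 nodes:

python -c "import math; print(int(math.log(200) * 20))"     # prints 105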

Hadoop configuration
Parameter: dfs.datanode.failed.volumes.tolerated

Configuration file: hdfs-site.xml

Purpose: specifies the number of datanode disks that are permitted to


die before failing the entire datanode

Format: integer

Used by: DN

Example:

<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>0</value>
</property>

Hadoop configuration
Parameter: dfs.hosts

Configuration file: hdfs-site.xml

Purpose: specifies the location of dfs.hosts file

Format: Linux path

Used by: NN

Example:
<property>
<name>dfs.hosts</name>
<value>/etc/hadoop/conf/dfs.hosts</value>
</property>

Hadoop configuration
Parameter: dfs.hosts.exclude

Configuration file: hdfs-site.xml

Purpose: specifies the location of dfs.hosts.exclude file

Format: Linux path

Used by: NN

Example:

<property>
<name>dfs.hosts.exclude</name>
<value>/etc/hadoop/conf/dfs.hosts.exclude</value>
</property>
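
These two files drive datanode commissioning and decommissioning; a minimal sketch of decommissioning one worker (the hostname is an assumption):

echo "worker5.hadooplab" >> /etc/hadoop/conf/dfs.hosts.exclude
hdfs dfsadmin -refreshNodes    # the namenode starts re-replicating the node's blocks
hdfs dfsadmin -report          # the node shows up as "Decommission in progress", then "Decommissioned"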

Hadoop configuration
Parameter: dfs.nameservices

Configuration file: hdfs-site.xml

Purpose: specifies virtual namenode names used for the namenode


HA and federation purpose

Format: comma separated strings

Used by: NN, clients

Example:

<property>
<name>dfs.nameservices</name>
<value>HAcluster1,HAcluster2</value>
</property>
Hadoop configuration
Parameter: dfs.ha.namenodes.[nameservice]

Configuration file: hdfs-site.xml

Purpose: specifies logical namenode names used for the namenode


HA purpose

Format: comma separated strings

Used by: DN, NN, clients

Example:

<property>
<name>dfs.ha.namenodes.HAcluster1</name>
<value>namenode1,namenode2</value>
</property>

Hadoop configuration
Parameter: dfs.namenode.rpc-address.[nameservice].[namenode]

Configuration file: hdfs-site.xml

Purpose: specifies namenode RPC address used for the namenode HA


purpose

Format: Hadoop URL

Used by: DN, NN, clients

Example:

<property>
<name>dfs.namenode.rpc-address.HAcluster1.namenode1</name>
<value>http://namenode1.hadooplab:8020</value>
</property>

Hadoop configuration
Parameter: dfs.namenode.http-address.[nameservice].[namenode]
Configuration file: hdfs-site.xml

Purpose: specifies namenode HTTP address used for the namenode


HA purpose

Format: Hadoop URL

Used by: DN, NN, clients

Example:

<property>
<name>dfs.namenode.http-address.HAcluster1.namenode1</name>
<value>http://namenode1.hadooplab:50070</value>
</property>

Hadoop configuration
Parameter: topology.script.file.name

Configuration file: core-site.xml

Purpose: specifies the location of a script for rack-awareness purpose

Format: Linux path

Used by: NN

Example:

<property>
<name>topology.script.file.name</name>
<value>/etc/hadoop/conf/topology.sh</value>
</property>
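
A minimal sketch of such a script (the mapping file and rack names are assumptions). The namenode calls it with one or more IP addresses or hostnames and expects one rack path per argument on standard output:

#!/bin/bash
# /etc/hadoop/conf/topology.sh - maps each host to its rack
# /etc/hadoop/conf/topology.map contains lines such as: 10.0.1.15 /dc1/rack1
MAP=/etc/hadoop/conf/topology.map
for host in "$@"; do
  rack=$(awk -v h="$host" '$1 == h {print $2}' "$MAP")
  echo "${rack:-/default-rack}"
done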

Hadoop configuration
Parameter: dfs.namenode.shared.edits.dir

Configuration file: hdfs-site.xml

Purpose: specifies the location of shared directory on QJM with


namenode metadata for namenode HA purpose
Format: Hadoop URL

Used by: NN, clients

Example:

<property>
<name>dfs.namenode.shared.edits.dir</name>

<value>qjournal://namenode1.hadooplab.com:8485/hadoop/hdfs/HAcluster</
value>
</property>

Hadoop configuration
Parameter: dfs.client.failover.proxy.provider.[nameservice]

Configuration file: hdfs-site.xml

Purpose: specifies the class used for locating active namenode for
namenode HA purpose

Format: Java class

Used by: DN, clients

Example:

<property>
<name>dfs.client.failover.proxy.provider.HAcluster1</name>

<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverPro
xyProvider</value>
</property>

Hadoop configuration
Parameter: dfs.ha.fencing.methods

Configuration file: hdfs-site.xml

Purpose: specifies the fencing methods for namenode HA purpose

Format: new line-separated list of Linux paths


Used by: NN

Example:

<property>
<name>dfs.ha.fencing.methods</name>
<value>/usr/local/bin/STONITH.sh</value>
</property>

Hadoop configuration
Parameter: dfs.ha.automatic-failover.enabled

Configuration file: hdfs-site.xml

Purpose: specifies whether the automatic namenode failover is


enabled or not

Format: 'true' / 'false'

Used by: NN, clients

Example:

<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>

Hadoop configuration
Parameter: ha.zookeeper.quorum

Configuration file: hdfs-site.xml

Purpose: specifies the list of ZooKeeper quorum servers used for the automatic namenode failover purpose

Format: comma separated list of hostname:port pairs

Used by: NN

Example:
<property>
<name>ha.zookeeper.quorum</name>
<value>namenode1.hadooplab:2181,namenode2.hadooplab:2181</value>
</property>

Hadoop configuration
Parameter: fs.viewfs.mounttable.default.link.[path]

Configuration file: hdfs-site.xml

Purpose: specifies the mount point of the namenode or namenode


cluster for namenode federation purpose

Format: Hadoop URL or namenode cluster virtual name

Used by: NN

Example:

<property>
<name>fs.viewfs.mounttable.default.link./federation1</name>
<value>hdfs://namenode.hadooplab:8020</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.hostname

Configuration file: yarn-site.xml

Purpose: specifies the location of resource manager

Format: hostname:port

Used by: RM, NM, clients

Example:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanager.hadooplab:8021</value>
</property>
Hadoop configuration
Parameter: yarn.nodemanager.local-dirs

Configuration file: yarn-site.xml

Purpose: specifies where the resource manager and node manager


should store application temporary data

Format: comma separated list of Hadoop URLs

Used by: RM, NM

Example:

<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///applocaldir</value>
</property>

Hadoop configuration
Parameter: yarn.nodemanager.resource.memory-mb

Configuration file: yarn-site.xml

Purpose: specifies memory in megabytes available for allocation for


the applications

Format: integer

Used by: NM

Example:

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value>
</property>

Hadoop configuration
Parameter: yarn.nodemanager.resource.cpu-vcores
Configuration file: yarn-site.xml

Purpose: specifies virtual cores available for allocation for the


applications

Format: integer

Used by: NM

Example:

<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>8</value>
</property>

Hadoop configuration
Parameter: yarn.scheduler.minimum-allocation-mb

Configuration file: yarn-site.xml

Purpose: specifies minimum amount of memory in megabytes


allocated for the applications

Format: integer

Used by: NM

Example:

<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>

Hadoop configuration
Parameter: yarn.scheduler.minimum-allocation-vcores

Configuration file: yarn-site.xml

Purpose: specifies minimum number of virtual cores allocated for the


applications
Format: integer

Used by: NM

Example:

<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>2</value>
</property>

Hadoop configuration
Parameter: yarn.scheduler.increment-allocation-mb

Configuration file: yarn-site.xml

Purpose: specifies increment amount of memory in megabytes for


allocation for the applications

Format: integer

Used by: NM

Example:

<property>
<name>yarn.scheduler.increment-allocation-mb</name>
<value>512</value>
</property>

Hadoop configuration
Parameter: yarn.scheduler.increment-allocation-vcores

Configuration file: yarn-site.xml

Purpose: specifies incremental number of virtual cores for allocation


for the applications

Format: integer

Used by: NM
Example:

<property>
<name>yarn.scheduler.increment-allocation-vcores</name>
<value>1</value>
</property>

Hadoop configuration
Parameter: mapreduce.task.io.sort.mb

Configuration file: mapred-site.xml

Purpose: specifies the size of the circular in-memory buffer used to


store MapReduce temporary data

Guideline: ~10% of the JVM heap size

Format: integer (MB)

Used by: NM

Example:

<property>
<name>mapreduce.task.io.sort.mb</name>
<value>128</value>
</property>

Hadoop configuration
Parameter: mapreduce.task.io.sort.factor

Configuration file: mapred-site.xml

Purpose: specifies the number of files to merge at once

Guideline: 64 for 1 GB JVM heap size

Format: integer

Used by: NM

Example:
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>64</value>
</property>

Hadoop configuration
Parameter: mapreduce.map.output.compress

Configuration file: mapred-site.xml

Purpose: specifies whether the output of the map tasks should be


compressed or not

Format: 'true' / 'false'

Used by: NM

Example:

<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>

Hadoop configuration
Parameter: mapreduce.map.output.compress.codec

Configuration file: mapred-site.xml

Purpose: specifies the codec used for map tasks output compression

Format: Java class

Used by: NM

Example:

<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
Hadoop configuration
Parameter: mapreduce.output.fileoutputformat.compress.type

Configuration file: mapred-site.xml

Purpose: specifies whether the compression should be applied to the


values only, key-value pairs or not at all

Format: 'RECORD' / 'BLOCK' / 'NONE'

Used by: NM

Example:

<property>
<name>mapreduce.output.fileoutputformat.compress.type</name>
<value>BLOCK</value>
</property>

Hadoop configuration
Parameter: mapreduce.reduce.shuffle.parallelcopies

Configuration file: mapred-site.xml

Purpose: specifies the number of threads started in parallel to fetch


the output from map tasks

Guideline: natural logarithm of the size of the cluster, times four, as a


whole number

Format: integer

Used by: NM

Example:

<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>5</value>
</property>

Hadoop configuration
Parameter: mapreduce.job.reduces

Configuration file: mapred-site.xml

Purpose: specifies the number of reduce tasks to use for the


MapReduce jobs

Format: integer

Used by: RM

Example:

<property>
<name>mapreduce.job.reduces</name>
<value>1</value>
</property>

Hadoop configuration
Parameter: mapreduce.shuffle.max.threads

Configuration file: mapred-site.xml

Purpose: specifies the number of threads started in parallel to serve


the output of map tasks

Format: integer

Used by: RM

Example:

<property>
<name>mapreduce.shuffle.max.threads</name>
<value>1</value>
</property>

Hadoop configuration
Parameter: mapreduce.job.reduce.slowstart.completedmaps

Configuration file: mapred-site.xml


Purpose: specifies when to start allocating reducers as the number of
complete map tasks grows

Format: float

Used by: RM

Example:

<property>
<name>mapreduce.job.reduce.slowstart.completedmaps</name>
<value>0.5</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.ha.rm-ids

Configuration file: yarn-site.xml

Purpose: specifies logical resource manager names used for resource


manager HA purpose

Format: comma separated strings

Used by: RM, NM, clients

Example:

<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>resourcemanager1,resourcemanager2</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.ha.id

Configuration file: yarn-site.xml

Purpose: specifies preferred resource manager for an active unit

Format: string
Used by: RM

Example:

<property>
<name>yarn.resourcemanager.ha.id</name>
<value>resourcemanager1</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.address.[resourcemanager]

Configuration file: yarn-site.xml

Purpose: specifies resource manager RPC address

Format: Hadoop URL

Used by: RM, clients

Example:

<property>
<name>yarn.resourcemanager.address.resourcemanager1</name>
<value>http://resourcemanager1.hadooplab:8032</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.scheduler.address.
[resourcemanager]

Configuration file: yarn-site.xml

Purpose: specifies resource manager RPC address for Scheduler


service

Format: Hadoop URL

Used by: RM, clients

Example:
<property>

<name>yarn.resourcemanager.scheduler.address.resourcemanager1</name>
<value>http://resourcemanager1.hadooplab:8030</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.admin.address.[resourcemanager]

Configuration file: yarn-site.xml

Purpose: specifies resource manager RPC address used for the


resource manager HA purpose

Format: Hadoop URL

Used by: RM, clients

Example:

<property>
<name>yarn.resourcemanager.admin.address.resourcemanager1</name>
<value>http://resourcemanager1.hadooplab:8033</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.resource-tracker.address.
[resourcemanager]

Configuration file: yarn-site.xml

Purpose: specifies resource manager RPC address for Resource


Tracker service

Format: Hadoop URL

Used by: RM, NM

Example:

<property>
<name>yarn.resourcemanager.resource-
tracker.address.resourcemanager1</name>
<value>http://resourcemanager1.hadooplab:8031</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.webapp.address.
[resourcemanager]

Configuration file: yarn-site.xml

Purpose: specifies resource manager address for web UI and REST API
services

Format: Hadoop URL

Used by: RM, clients

Example:

<property>
<name>yarn.resourcemanager.webapp.address.resourcemanager1</name>
<value>http://resourcemanager1.hadooplab:8088</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.ha.fencer

Configuration file: yarn-site.xml

Purpose: specifies the fencing methods for resourcemanager HA


purpose

Format: new-line-separated list of Linux paths

Used by: RM

Example:

<property>
<name>yarn.resourcemanager.ha.fencer</name>
<value>/usr/local/bin/STONITH.sh</value>
</property>
Hadoop configuration
Parameter: yarn.resourcemanager.ha.auto-failover.enabled

Configuration file: yarn-site.xml

Purpose: enables automatic failover in resource manager HA pair

Format: boolean

Used by: RM

Example:

<property>
<name>yarn.resourcemanager.ha.auto-failover.enabled</name>
<value>true</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.recovery.enabled

Configuration file: yarn-site.xml

Purpose: enables job recovery on resource manager restart or during


the failover

Format: boolean

Used by: RM

Example:

<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.ha.auto-failover.port

Configuration file: yarn-site.xml


Purpose: specifies ZKFC port

Format: integer

Used by: RM

Example:

<property>
<name>yarn.resourcemanager.ha.auto-failover.port</name>
<value>8018</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.ha.enabled

Configuration file: yarn-site.xml

Purpose: enables resource manager HA feature

Format: boolean value

Used by: RM

Example:

<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.store.class

Configuration file: yarn-site.xml

Purpose: specifies storage for resource manager internal state

Format: string

Used by: RM
Example:

<property>
<name>yarn.resourcemanager.store.class</name>

<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStat
eStore</value>
</property>

Hadoop configuration
Parameter: yarn.resourcemanager.cluster-id

Configuration file: yarn-site.xml

Purpose: specifies the logical name of YARN cluster

Format: string

Used by: RM, NM, clients

Example:

<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>HAcluster</value>
</property>

Hadoop configuration - exercises


Lab Exercise 3.2.1

Lab Exercise 3.2.2

Lab Exercise 3.2.3

Lab Exercise 3.3.1

Lab Exercise 3.3.2

Lab Exercise 3.3.3

Lab Exercise 3.3.4


Lab Exercise 3.3.5

Lab Exercise 3.3.6

Lab Exercise 3.3.7

Lab Exercise 3.3.8

Lab Exercise 3.3.9

Third Day - Session IV

Case Studies, Certification and


Surveys

http://zone16.pcansw.org.au/site/ponyclub/image/fullsize/60786.jpg

Certification and Surveys


Congratulations on completing the course!
Official Cloudera certification authority:

http://www.cloudera.com/content/cloudera/en/training/certificatio
n/ccah.html

Visit http://www.nobleprog.com for other courses

Surveys: http://www.nobleprog.pl/te
