In this document we provide some background information about the framework, the key distributions,
modules, components, and related products. We also provide you with single and multi-node Hadoop
installation commands and configuration parameters.
The final section includes some tips and tricks to help you get started, and provides guidance in setting up a
Hadoop project.
Contents
Hadoop Distributions
Hadoop Modules
Hadoop Components
Hadoop Ecosystem
https://jethro.io/hadoop-deployment-cheat-sheet 1/20
4/2/2018 Hadoop Deployment Cheat Sheet | Jethro
Pivotal Integration with Greenplum and Cloud Foundry (CF)
Hadoop Modules
Module Description
HDFS
Hadoop Distributed File System: provides high-throughput access to application data based on commodity hardware
YARN Yet Another Resource Negotiator: a framework for cluster resource management including job scheduling
MapReduce Software framework for parallel processing of large data sets based on YARN
Hadoop Components
Component / Module Description
NameNode / HDFS The directory tree of the Hadoop HDFS file system (a.k.a. Hadoop inode)
Secondary NameNode / HDFS
Checkpointing helper for the NameNode, not a standby for high availability. It provides checkpoints of the
namespace by merging the edits file into the fsimage file
JournalNode / HDFS Arbiter node that supports auto failover between NameNodes
DataNode / HDFS Nodes (or servers) that store the actual data
ResourceManager / YARN
Global daemon that arbitrates resources among all the applications in the Hadoop cluster
ApplicationMaster / YARN
Takes care of a single application: gets resources for it from the ResourceManager and works with the NodeManager
to consume them and monitor the tasks
NodeManager / YARN
Single machine agent that is responsible for the containers as well as allocation and monitoring of resource usage
such as CPU and disk, and reporting back to the ResourceManager
Container / YARN
Running specific tasks on a specific machine for a specific application based on allocated resources
Hadoop Ecosystem – Related Products
Product Description
Ambari
A completely open-source management platform for provisioning, managing, monitoring and securing Apache
Hadoop clusters
Flume Reliable, distributed and available service that streams logs into HDFS
Mahout
Implementation of machine learning algorithms (clustering, classification and batch-based collaborative filtering)
based on MapReduce
Ranger Access policy manager for HDFS files, folders, databases, tables and columns
Spark
Cluster computing framework that utilizes YARN and HDFS. Supports streaming and batch jobs. Has an SQL-like
interface and a machine learning library.
Sqoop Data migration application between RDBMS and Hadoop using CLI
Tez Application framework for running complex Directed Acyclic Graph (DAG) of tasks based on YARN
Pig High level platform (and script-like language) to create and run programs on MapReduce, Tez and Spark
ZooKeeper
Distributed name registry, synchronization service and configuration service that is used as a sub-system in Hadoop
Common Data Formats
Format Description
Avro JSON-based format that includes RPC and serialization support. Designed for systems that exchange data.
Java Installation / Install >sudo apt-get -y update && sudo apt-get -y install default-jdk
Create User and Permissions / Create User >useradd hadoop
>passwd hadoop
>mkdir /home/hadoop
>chown -R hadoop:hadoop /home/hadoop
Environment / Env Vars >source ~/.bashrc
>export HADOOP_HOME=/home/hadoop/hadoop
>export HADOOP_INSTALL=$HADOOP_HOME
>export HADOOP_MAPRED_HOME=$HADOOP_HOME
>export HADOOP_COMMON_HOME=$HADOOP_HOME
>export HADOOP_HDFS_HOME=$HADOOP_HOME
>export YARN_HOME=$HADOOP_HOME
>export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
>export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Start System >cd $HADOOP_HOME/sbin/
>start-dfs.sh
>start-yarn.sh
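After the daemons are started, a quick sanity check can confirm that the environment variables are set and that the expected processes are running. The following is a minimal sketch and not part of the original cheat sheet: the function names check_hadoop_env and missing_daemons are hypothetical, and the daemon list assumes a single-node setup where the NameNode, DataNode, ResourceManager and NodeManager all run on the same machine. jps is the JVM process listing tool that ships with the JDK.

```shell
#!/bin/sh
# Hypothetical helpers for a post-start sanity check (names are illustrative).

# Print the name of every required variable that is unset or empty.
# Returns 0 when all of them are set, 1 otherwise.
check_hadoop_env() {
  rc=0
  for var in HADOOP_HOME HADOOP_MAPRED_HOME HADOOP_COMMON_HOME HADOOP_HDFS_HOME YARN_HOME; do
    # Indirect lookup: copy the value of the variable named in $var into $value.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "not set: $var"
      rc=1
    fi
  done
  return $rc
}

# Read `jps` output ("<pid> <ClassName>" per line) on stdin and print
# every expected daemon that does not appear in it.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # The leading space keeps "SecondaryNameNode" from matching "NameNode".
    if ! printf '%s\n' "$jps_output" | grep " ${daemon}\$" >/dev/null; then
      echo "$daemon"
    fi
  done
}

# Usage on a running node:
#   check_hadoop_env && jps | missing_daemons
```

An empty result from missing_daemons means all four daemons were found. On a multi-node cluster the expected list differs per host, so the loop would need to be adjusted accordingly.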
Multi-node Installation
Task Command
Copy system >su - hadoop
>cd /opt/hadoop
>scp -r hadoop hadoop-slave-1:/opt/hadoop
>scp -r hadoop hadoop-slave-2:/opt/hadoop
>cd /opt/hadoop/hadoop
>vi conf/masters
//add your master node to the file:
hadoop-master
>vi conf/slaves
//add your slave nodes to the file, one hostname per line:
hadoop-slave-1
hadoop-slave-2
>su - hadoop
>cd /opt/hadoop/hadoop
List the content of the home directory >hdfs dfs -ls /data/
Upload a file from the local file system to HDFS >hdfs dfs -put logs.csv /data/
Read the content of the file from HDFS >hdfs dfs -cat /data/logs.csv
Move the file to the newly-created subdirectory >hdfs dfs -mv logs.csv logs/
HDFS Administration
Task Command
Run the RPC portmap for the NFS3 gateway >hdfs portmap
YARN
Task Command
Define log level
>yarn [--loglevel loglevel] where loglevel is FATAL, ERROR, WARN, INFO, DEBUG or TRACE
User commands
Administration commands
Run ResourceManager admin client >yarn rmadmin
MapReduce
Submit the WordCount MapReduce job to the cluster
Check the output of this job in HDFS >hadoop fs -cat logs-output/*
Resource Manager UI
Resource Default URI
NameNode http://:50070/
DataNode http://:50075/
Secure Hadoop
Aspect Best Practice
Authentication
Define users
Enable Kerberos in Hadoop
Set up Knox gateway to control access and authentication to the HDFS cluster
Integrate with the organization’s SSO and LDAP
Authorization
Define groups
Define HDFS Permissions
Define HDFS ACLs
Enable Ranger policies to control access to HDFS folders, directories, databases, tables and columns
Data Protection
Wire encryption with Knox or Hadoop
Iterate cluster sizing to optimize performance and meet actual load patterns
Hardware
The higher the storage per node, the longer the recovery time
Networking cost should be 20% of hardware budget
Your actual net storage capacity should be about 25% of raw storage capacity: keeping 25% of raw capacity spare
leaves 75%, which is divided across 3 replicas
Must be 64-bit
Disable hugepages
System
Monitor the checkpoints of the NameNodes to verify that they occur at the correct times. This will enable you to
recover your cluster when needed
Keep replication >= 3
Place quotas and limits on users and project directories, as well as on tasks to avoid cluster starvation
Verify that the file system you selected is supported by your Hadoop vendor
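The storage sizing rule in the best practices above (25% spare capacity and 3 replicas) can be worked through with a small calculation. This is a minimal sketch, not part of the original cheat sheet: the function name net_capacity and its optional arguments are hypothetical, and it uses integer arithmetic, so results are rounded down.

```shell
#!/bin/sh
# Hypothetical helper: estimate net (usable) HDFS capacity from raw capacity.
# Arguments: raw capacity, replication factor (default 3), spare percent (default 25).
net_capacity() {
  raw=$1
  replicas=${2:-3}
  spare_pct=${3:-25}
  # Remove the spare share first, then divide what is left across the replicas.
  echo $(( $raw * (100 - $spare_pct) / 100 / $replicas ))
}

net_capacity 120   # 120 TB raw -> prints 30 (TB usable)
```

With the default values, 120 TB of raw disk leaves 90 TB after the 25% reserve, and storing 3 replicas of every block brings the usable capacity to 30 TB, which is 25% of raw.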
© Copyright -
Jethro Data