Administration of Hadoop

Lab Guide
Summer 2014


This Certified Training Services Partner Program Guide (the Program Guide) is protected under
U.S. and international copyright laws, and is the exclusive property of MapR Technologies,
Inc. © 2014 MapR Technologies, Inc. All rights reserved.

PROPRIETARY AND CONFIDENTIAL INFORMATION



Contents
Administration of Hadoop Lab Guide
Get Started
Get Started 1: Set up a lab environment in Amazon Web Services (AWS)
Lab Procedure
Create an AWS Account
Configure Virtual Private Cloud (VPC) Networking
Create AWS Virtual Machine Instances for Hadoop Installation
Create an AWS VM Instance for NFS Access
Log in to AWS Nodes
Managing your Nodes
Terminating Your Instances and EBS Storage
Get Started 2: Set up passwordless ssh access between nodes
Get Started 3: Log into the class cluster
Lab Procedure
Get Started 4: Explore the MapR Control System
Lab Procedure
Log on and explore different views of the cluster
Identify Specific Characteristics of Your Cluster
Conclusion
Lessons Learned

Lesson 1: Pre-install
Lab Overview
Lab Procedures


Lab 1.1: Pre-install validation
Lab 1.2: Network, Memory and IO
Conclusion

Lesson 2: Install MapR software
Lab Overview
Lab Procedure
Install a MapR cluster using the mapr-installer on the AWS environment
Conclusion
Discussion

Lesson 3: Post-install
Lab Overview
Lab Procedures
3.1 Run RWSpeedTest
3.2 TeraGen/TeraSort

Lesson 4: Configure Cluster Storage Resources
Lab Overview
Lab Procedures
Rack Layout
Lab 4.1: Configure Node Topology
Lab 4.1 Steps
Lab 4.2: Create Volumes and Set Quotas
Lab 4.2 Overview
Lab 4.2 Set-up
Lab 4.2 Steps
Examine the volumes already on the cluster
Examine volume properties from the volumes list


Practice Creating and Removing Volumes
Remove one volume
Create a volume for each user
Create a volume for your team project
Verify that the volumes are set up correctly
Set disk usage quotas for your project Accounting Entity
Conclusion

Lesson 5: Data Ingestion, Access & Availability
Labs Overview
5.1 Get Data into an NFS Cluster
Create Input Directory for Data
Run a MapReduce Job on Data
Modify Data and Run MapReduce again
Compare Results from Both MapReduce Jobs
Conclusion
Lab 5.2: Snapshots
Create snapshot in two ways
Put some sample data into your volume
Create a volume snapshot of your volume using MCS
Create and view contents of a new snapshot
Show snapshots are time-specific
Create new data files in your volume by running a shell script
Create a new snapshot, wait about 30 seconds, then create another snapshot
Explore the snapshot directory from CLI
Show snapshots preserve deleted data
Stop the program
Remove all files except log and static
Schedule snapshots from MCS
Create a custom schedule
Apply the schedule


Lab 5.3: Mirrors and schedules
Create a mirror from the MCS
Create a local mirror based on a source volume
Copy data to your new mirror volume
Apply a schedule to the mirror
Create a mirror from CLI and initiate a mirror sync
Conclusion
Lab 5.4: Disaster Recovery
Set Up
Configure all nodes in the destination cluster
Verify that each cluster has a unique name
Create a remote mirror volume on the destination cluster
Initiate mirroring to the destination cluster
Verify data from source cluster was copied to destination cluster
Conclusion
Lab 5.5: Using the HBase shell
Start HBase shell
Create a Table using MapR Control System (MCS)
MapR Tables - Solutions
HBase Shell Solution
Troubleshooting
HBase shell commands (optional)
Using importtsv and copytable
View Existing Table Using MCS
Create a Table using MCS and import data using importtsv

Lesson 6: Cluster Monitoring
Lab Overview
Lab 6.1: Set up Email Addresses


Lab 6.2: Set up SMTP
Lab 6.3: Metrics, Monitoring & Troubleshooting in MCS
Explore MapR metrics via the MCS
Use MCS metrics to monitor MapReduce jobs in progress
Troubleshoot Jobs
Lesson 7: Managing Services on Nodes
Lab Overview
Lab 7.1: Managing Services
Use MCS to see what services are running on your cluster
Learn where active management services are running
Manage Node Services for a single node
Stop TaskTracker on a single node
Return to the Dashboard
Restart TaskTracker on your team node
Compare services view of MCS versus jps
Compare services view of MCS versus MapR CLI
Observe JobTracker failover
Lab 7.2: Decommissioning vs. Maintenance
Decommissioning
Maintenance
Identify and log onto the Master CLDB node
Navigate to the MapR logs directory and monitor the CLDB log file
Monitor the MCS
Installing passwordless ssh


Get Started
Get Started 1: Set up a lab environment in Amazon Web
Services (AWS)
This setup procedure shows you how to create your lab environment in AWS for the MapR
Hadoop Operations on-demand training. For a classroom or virtual, instructor-led training
session, these AWS environments will already be set up for you, and your instructor will give you
further instructions on how to access your lab environment.
Follow the steps below in order to set up the AWS lab environment properly.

Lab Procedure
Create an AWS Account
You need to have an account on Amazon Web Services. If you already have an AWS account,
you can skip this task. Note that you will need to provide your email address, billing information
(credit card), and a phone number where you can be contacted in order to create the account.
1. Point your Web browser to http://aws.amazon.com
2. Click the "Sign Up" button at the top right-hand side of the Web page
3. Select the "I am a new user" radio button
4. Type your email address in the "My e-mail address is:" text field
5. Click the "Sign in using our secure server" button
6. Fill out the "Login Credentials" Web form and click the "Continue" button
7. Fill out the "Contact Information" Web form and click the "Create Account and
Continue" button
8. Fill out the "Payment Information" Web form and click the "Continue" button
9. Fill out the "Identity Verification" Web form and click the "Call Me Now" button. Once
you reply to the phone call using your 4-digit code from this Web form, click the
"Continue to select your Support Plan" button
10. Fill out the "Support Plan" Web form. Note you will not need support services from
Amazon in order to run the labs in this class. Click the "Continue" button
11. Your AWS account is now provisioned and you can begin setting up the virtual machines
for your class

Configure Virtual Private Cloud (VPC) Networking


AWS provides two types of network configurations: VPC and "classic". This lab guide has been
written using the recommended VPC network. The configuration steps are below; an equivalent
AWS CLI sketch follows the steps.
1. Point your Web browser to http://aws.amazon.com
2. Select "AWS Management Console" from the "My Account / Console" drop-down list
3. Type your email address in the "My e-mail address is:" text field. Select the "I am a
returning user and my password is:" radio button. Click the "Sign in using our secure
server" button
4. In the "Compute & Networking" section of your AWS management console, click the
"VPC" link
5. In the "Virtual Private Cloud" section of your navigation pane, click "Your VPCs"
6. Click the "Create VPC" button and fill out the Web form as follows:
a. Name tag: mapr-odt-vpc
b. CIDR block: 10.0.0.0/16
c. Tenancy: Default
d. Click the "Yes, Create" button
7. In the "Virtual Private Cloud" section of your navigation pane, click "Subnets"
8. Click the "Create Subnet" button and fill out the Web form as follows:
a. Name tag: mapr-odt-subnet
b. VPC: mapr-odt-vpc
c. Availability Zone: No Preference
d. CIDR block: 10.0.0.0/24
e. Click the "Yes, Create" button
9. Select the "mapr-odt-subnet" checkbox and click the "Modify Auto-Assign Public IP"
button as follows:
a. Select the "Enable auto-assign Public IP" checkbox
b. Click the "Save" button
10. In the "Virtual Private Cloud" section of your navigation pane, click "Route Tables"
11. Click the "Create Route table" button and fill out the Web form as follows:
a. Name tag: mapr-odt-routes
b. VPC: mapr-odt-vpc
c. Click the "Yes, Create" button
12. In the "Virtual Private Cloud" section of your navigation pane, click "Internet Gateways"
13. Click the "Create Internet Gateway" button and fill out the Web form as follows:
a. Name tag: mapr-odt-gw
b. Click the "Yes, Create" button
c. Select the checkbox next to the "mapr-odt-gw" object and click the "Attach to
VPC" button

PROPRIETARY AND CONFIDENTIAL INFORMATION


2014 MapR Technologies, Inc. All Rights Reserved.

Cluster Admin on Hadoop

d. Select "mapr-odt-vpc" from the "VPC" drop-down list and click the "Yes, Attach"
button
14. In the "Virtual Private Cloud" section of your navigation pane, click "Route Tables"
15. Select the "mapr-odt-routes" object, select the "Routes" tab, and click the "Edit" button.
Fill out the Web form as follows:
a. Destination: 0.0.0.0/0
b. Target: mapr-odt-gw
c. Click the "Save" button
16. In the "Virtual Private Cloud" section of your navigation pane, click "Subnets"
17. Select the "mapr-odt-subnet" object and select the "Route Table" tab. Click the "Edit"
button and fill out the form as follows:
a. Select the "Change To" drop-down list and select "mapr-odt-routes"
b. Click the "Save" button
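
As referenced above, the same network can also be built from the AWS CLI. This is a minimal
sketch, assuming the AWS CLI is installed and configured with your credentials; the shell
variables ($VPC_ID, $SUBNET_ID, $IGW_ID, $RTB_ID) are placeholders for the IDs returned by the
corresponding create commands:

$ aws ec2 create-vpc --cidr-block 10.0.0.0/16
$ aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.0.0/24
$ aws ec2 modify-subnet-attribute --subnet-id $SUBNET_ID --map-public-ip-on-launch
$ aws ec2 create-internet-gateway
$ aws ec2 attach-internet-gateway --internet-gateway-id $IGW_ID --vpc-id $VPC_ID
$ aws ec2 create-route-table --vpc-id $VPC_ID
$ aws ec2 create-route --route-table-id $RTB_ID --destination-cidr-block 0.0.0.0/0 --gateway-id $IGW_ID
$ aws ec2 associate-route-table --route-table-id $RTB_ID --subnet-id $SUBNET_ID

Each create command prints a JSON description of the new resource; capture the Id fields from
that output before running the commands that reference them.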

Create AWS Virtual Machine Instances for Hadoop Installation


You need to provision at least 3 virtual machines in AWS in order to complete the labs in this
course, and you can provision more if you'd prefer. More VMs will allow you to experiment
with different cluster service layout plans (see Lesson 2 for more detail), and will give you better
performance when running jobs. The VMs needed for the lab environment are not included in
the Free Tier, however, and will accrue a nominal charge during the expected time to perform
the lab exercises; more VMs will result in a higher charge for their use. Read the
Managing your Nodes section of this manual to learn more about minimizing the EC2 use
charges. An AWS CLI sketch for launching the instances follows the steps below.
1. Point your Web browser to http://aws.amazon.com
2. Select "AWS Management Console" from the "My Account / Console" drop-down list
3. Type your email address in the "My e-mail address is:" text field. Select the "I am a
returning user and my password is:" radio button. Click the "Sign in using our secure
server" button
4. In the "Compute & Networking" section of your AWS management console, click the
"EC2" link
5. In the upper right-hand corner of the EC2 Web page, select the availability zone (from
the drop-down list next to the "Help" drop-down list) nearest to where you are
physically located from the following choices. Note that an availability zone will already
be selected based on the contact information you provided when you provisioned your
AWS account:
a. US East (N. Virginia)

b. US West (Oregon)
c. US West (N. California)
d. EU (Ireland)
e. Asia Pacific (Singapore)
f. Asia Pacific (Tokyo)
g. Asia Pacific (Sydney)
h. South America (Sao Paulo)
6. In the "INSTANCES" section of the navigation pane on the left-hand side of the Web
page, click the "Instances" link
7. Click the "Launch Instance" button
8. In the "Step 1: Choose an Amazon Machine Image" Web page, scroll down to the
bottom of the page and select the 64-bit version of an image of Red Hat v6.4 or 6.5.
Note: Red Hat 7.0 is NOT currently supported.
9. In the "Step 2: Choose an Instance Type" Web page, select the checkbox for "m3.large"
type and click the "Next: Configure Instance Details" button
10. In the "Step 3: Configure Instance Details" Web page, fill out the form as follows:
a. Number of instances: 3
b. Purchasing option: leave "Request Spot Instances" unchecked
c. Network: mapr-odt-vpc
d. Subnet: mapr-odt-subnet
e. Auto-assign Public IP: enable
f. IAM role: None
g. Shutdown behavior: Stop
h. Enable termination protection: Check "protect against accidental termination"
checkbox
i. Monitoring: leave "Enable CloudWatch detailed monitoring" unchecked
j. Tenancy: Shared tenancy (multi-tenant hardware)
k. Click the "Next: Add Storage" button
11. In the "Step 4: Add Storage" Web page:
a. Click the "Add New Volume" button
b. Leave all the defaults except check the "Delete on termination" checkbox
c. Repeat the above steps 2 more times to add a total of 3 EBS volumes to your
instances
d. Click the "Next: Tag Instance" button
12. In the "Step 5: Tag Instance" Web page, type "mapr-install-node" in the "Value" field
and click the "Next: Configure Security Group" button
13. In the "Step 6: Configure Security Group" Web page, select the "Create new security
group" radio button, type "mapr-sg" in the "Security group name:" field", and perform
the following steps:
a. Click the "Add Rule" button

b. Select "All TCP" from the "Type" drop-down list and select "Anywhere" from the
"Source" drop-down list
c. Click the "Add Rule" button
d. Select "All UDP" from the "Type" drop-down list and select "Anywhere" from
the "Source" drop-down list
e. Click the "Add Rule" button
f. Select "All ICMP" from the "Type" drop-down list and select "Anywhere" from
the "Source" drop-down list
g. Click the "Review and Launch" button
14. In the "Step 7: Review Instance Launch" Web page, review your instance launch details
and click the "Launch" button
15. In the "Select an existing key pair or create a new key pair" pop-up window, perform
one of the following steps:
a. select "Create a new key pair" and type "mapr-odt-keypair" in the "Key pair
name" text field. Click the "Download Key Pair" button.

OR

b. Select "select an existing key pair" and select the key pair from the "key pair
name" drop-down list

IMPORTANT NOTE: make sure you save a copy of the new or existing key pair
file in a location you can reference throughout your training. If you lose
this file, you will lose access to your AWS instances, and will have to create new
ones.
16. Click the "Launch Instances" button
17. In the "Launch Status" Web page, click the "View Instances" button
18. Wait for the instances to get in the "running" state and status checks to complete
19. Record the IP addresses of the VMs for later use.
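
As referenced above, the instance launch can also be scripted with the AWS CLI. This is a minimal
sketch only: the AMI ID is a placeholder for your chosen Red Hat 6.x image, $SUBNET_ID and
$SG_ID are the IDs of the subnet and security group created earlier, and the extra EBS data
volumes and Name tag would still need to be added (for example, with --block-device-mappings
and create-tags):

$ aws ec2 run-instances --image-id ami-xxxxxxxx --count 3 --instance-type m3.large --key-name mapr-odt-keypair --subnet-id $SUBNET_ID --security-group-ids $SG_ID --associate-public-ip-address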

Create an AWS VM Instance for NFS Access


You will need to launch an instance that will serve as your NFS client. This is the simplest
instance, and will qualify for Free Tier use. Use the following information to launch this instance
in AWS.
1. Point your Web browser to http://aws.amazon.com

2. Select "AWS Management Console" from the "My Account / Console" drop-down list
3. Type your email address in the "My e-mail address is:" text field. Select the "I am a
returning user and my password is:" radio button. Click the "Sign in using our secure
server" button
4. In the "Compute & Networking" section of your AWS management console, click the
"EC2" link
5. In the "INSTANCES" section of the navigation pane on the left-hand side of the Web
page, click the "Instances" link
6. Click the "Launch Instance" button
7. In the "Step 1: Choose an Amazon Machine Image" Web page, scroll down to the
bottom of the page and select the 64-bit version of an image of Red Hat v6.4 or 6.5.
Note: Red Hat 7.0 is NOT currently supported
8. In the "Step 2: Choose an Instance Type" Web page, select the checkbox for "t1.micro"
type and click the "Next: Configure Instance Details" button
9. In the "Step 3: Configure Instance Details" Web page, fill out the form as follows:
a. Number of instances: 1
b. Purchasing option: leave "Request Spot Instances" unchecked
c. Network: mapr-odt-vpc
d. Subnet: mapr-odt-subnet
e. Auto-assign Public IP: enable
f. IAM role: None
g. Shutdown behavior: Stop
h. Enable termination protection: Select "protect against accidental termination"
checkbox
i. Monitoring: leave "Enable CloudWatch detailed monitoring" unchecked
j. Tenancy: Shared tenancy (multi-tenant hardware)
k. Click the "Next: Add Storage" button
10. In the "Step 4: Add Storage" Web page, click the "Next: Tag Instance" button
11. In the "Step 5: Tag Instance" Web page, type "MapR-NFS-node" in the "Value" field and
click the "Next: Configure Security Group" button
12. In the "Step 6: Configure Security Group" Web page:
a. select the "select an existing security group" radio button
b. select the "mapr-sg" checkbox
c. Click the "Review and Launch" button
13. In the "Step 7: Review Instance Launch" Web page, review your instance launch details
and click the "Launch" button
14. In the "Select an existing key pair or create a new key pair" pop-up window:
a. select "select an existing key pair"

b. Select the "mapr-odt-keypair" key pair from the "key pair name" drop-down
list. Then select the "I acknowledge that I have access to the selected private
key file (name), and that without this file, I won't be able to log into my
instance" checkbox

REMINDER: You must keep a copy of this key file in a location you can
reference throughout your training. If you lose this file, you will lose access to
your AWS instances, and will have to create new ones.

15. Click the "Launch Instances" button
16. In the "Launch Status" Web page, click the "View Instances" button
17. Wait for the instance to get in the "running" state and status checks to complete

Log in to AWS Nodes


In order to log in to your AWS nodes, you will need to use the SSH key pair that you downloaded
when launching your instances. There is only one login account on your RHEL 6.x instance,
called "ec2-user", and it requires the SSH key pair to log in.

1. Open the terminal emulation application on your computer
2. Navigate to the location where the SSH key pair file is saved
3. Change the permission of the SSH key pair file:
$ chmod 600 mapr-odt-keypair.pem
4. Login as ec2-user
$ ssh -i mapr-odt-keypair.pem ec2-user@VM_IP_Address (such as 54.183.169.43)
5. Switch to user root:
$ sudo -s
6. Determine and log the internal IP address of the VM instance (save the result, such as
10.0.0.167):
$ hostname
7. Create a mapr user on this VM:
$ useradd mapr

$ passwd mapr
then type the password for the mapr user when prompted
8. Set the root user password:
$ passwd root
then type the password for the root user when prompted
9. Allow password authentication to the VM:
$ vi /etc/ssh/sshd_config
change PasswordAuthentication no to PasswordAuthentication yes
save and exit vi (see the note below about restarting sshd)
10. Repeat steps 6-8 for all VM instances and log the hostname of each instance

Now you have root access on your RHEL virtual machine instance, and you can proceed with
the MapR Hadoop Operations labs.
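
One caveat for step 9: sshd typically must be restarted before a change to
PasswordAuthentication takes effect. On RHEL 6 this is normally:
$ service sshd restart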

Managing your Nodes


AWS charges you by the hour for your instances so long as they are running. You don't need to
keep your nodes running while you are not performing tasks in the lab. You can safely stop your
instances while you are not using them and then restart them when you want to use them again.
This will ensure that you are only charged for the time that you are using your nodes to perform
lab exercises. The Public IPs of your VMs may change when stopped, but the internal IPs will
remain consistent. You should check the Public IP address and note any changes, but you will
not need to re-check the VM hostnames.
1. Point your Web browser to http://aws.amazon.com
2. Type your email address in the "My e-mail address is:" text field. Select the "I am a
returning user and my password is:" radio button. Click the "Sign in using our secure
server" button
3. In the "Compute & Networking" section of your AWS management console, click the
"EC2" link
4. In the "INSTANCES" section of the navigation pane on the left-hand side of the Web
page, click the "Instances" link
5. Select the instances that you want to stop, click on the Actions button, and select
Stop
6. Click on the Yes, Stop button

To restart the instances, repeat these steps, and select Start in step 5. Remember, you should
check the Public IP settings of your VMs and note any changes to your IP addresses. The
internal IP addresses will remain consistent, so the passwordless ssh and Hadoop software will
still function normally.
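
The stop/start cycle can also be scripted with the AWS CLI; a sketch, with placeholder instance
IDs:

$ aws ec2 stop-instances --instance-ids i-0a1b2c3d i-0e4f5a6b
$ aws ec2 start-instances --instance-ids i-0a1b2c3d i-0e4f5a6b
$ aws ec2 describe-instances --instance-ids i-0a1b2c3d --query 'Reservations[].Instances[].PublicIpAddress'

The describe-instances query shows the new public IP address after a restart.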

Terminating Your Instances and EBS Storage


When you are finished using your AWS nodes for the class exercises, you should terminate your
nodes. If you did not select the "Delete on Termination" box when creating the storage for your
nodes, then you will also need to delete your EBS storage volumes. An AWS CLI sketch follows
the steps below.
7. Point your Web browser to http://aws.amazon.com
8. Type your email address in the "My e-mail address is:" text field. Select the "I am a
returning user and my password is:" radio button. Click the "Sign in using our secure
server" button
9. In the "Compute & Networking" section of your AWS management console, click the
"EC2" link
10. In the "INSTANCES" section of the navigation pane on the left-hand side of the Web
page, click the "Instances" link
11. Disable Termination Protection on each instance, individually. You will have to perform
these steps on each instance, one at a time:
a. Select the instance for which you would like to disable termination protection
b. Click on the Actions button, and select Change Termination Protection
c. Select the Yes, Disable button
d. Repeat these steps for all instances that you want to terminate.
12. Select the instances that you want to terminate, click on the Actions button, and
select Terminate
13. Click on the Yes, Terminate button

If you did not select Delete on Termination when adding storage, follow the steps
below to delete the EBS storage volumes.
14. In the "ELASTIC BLOCK STORE" section of the navigation pane on the left-hand side of
the Web page, click the "Volumes" link.
15. Select the checkbox next to the volumes that you want to remove, click on the Actions
button and select Delete Volumes
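
As mentioned above, these clean-up steps can also be performed with the AWS CLI; a sketch
with placeholder instance and volume IDs:

$ aws ec2 modify-instance-attribute --instance-id i-0a1b2c3d --no-disable-api-termination
$ aws ec2 terminate-instances --instance-ids i-0a1b2c3d
$ aws ec2 delete-volume --volume-id vol-1a2b3c4d

The first command disables termination protection, which must be done per instance before
terminate-instances will succeed.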


Get Started 2: Set up passwordless ssh access between nodes
When testing hardware nodes and installing Hadoop, you will need to run various commands
and scripts on all of the nodes in the cluster. A tool like clustershell, or clush, will allow you to
propagate these commands from one master node to all of the other nodes on the cluster.
For clush to perform tasks on the other nodes, it needs passwordless ssh access, so that you do
not have to type in a password for every action on every node. Some of the actions we will do in
this course require root account access, and some require mapr account access. You will need to
perform the steps for passwordless ssh twice, once when logged in as root and once as mapr.
1. Log into one of your nodes. If you are using AWS VMs, use the instructions provided
above. We will set up passwordless ssh from this node to the other nodes. This node
will be the master node going forward, and we will run all further commands from this
node.
2. Su to root:
$ sudo -s
3. Generate an ssh key as the root user:
$ ssh-keygen
Enter file in which to save the key (/root/.ssh/id_rsa): leave as default
Enter passphrase (empty for no passphrase): leave empty
Enter same passphrase again: leave empty
4. Copy the ssh key to the other nodes. We will be using the internal IP addresses
compiled when checking the hostname above:
$ ssh-copy-id IP-address-1
Are you sure you want to continue connecting (yes/no)? yes
5. Test the passwordless connection:
$ ssh IP-address-1
6. Return to the master node:
$ exit
7. Repeat steps 4 and 5 for each internal IP address on your list (or script the loop, as sketched below).
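
With many nodes, steps 4 and 5 are easier to script. A minimal sketch, assuming your internal IP
addresses are listed one per line in a hypothetical file named nodelist.txt:

$ for ip in $(cat nodelist.txt); do ssh-copy-id $ip; done
$ for ip in $(cat nodelist.txt); do ssh $ip hostname; done

The first loop still prompts for each node's password once; the second loop should then print
each hostname without prompting, confirming passwordless access.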
Congratulations! Your AWS environment is set up and ready for you to begin the lab exercises
with this course. You can perform all of the labs after Lesson 1 in this environment. Lesson 1
requires hardware nodes for pre-installation testing, and will not work properly in a VM
environment such as AWS.


Get Started 3: Log into the class cluster


Students in a classroom or virtual, instructor-led class will be using a class cluster for
demonstrations. This setup procedure shows you how to log in to the MapR Hadoop cluster
that will be used by your class. It includes procedures for Windows users as well as Mac and
Linux users.

Lab Procedure
Windows and Unix users use these initial instructions:
1. Download PuTTY if you are on a Windows machine:
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
2. Connect to the MCS for your cluster. Go to the class doc link provided for this class.
(Example: http://doc.mapr.com/display/SE/San+Jose+2013Oct)
3. For Unix users, do the following in a terminal window:
ssh -i students07172012.pem ec2-user@<ec2-54-219-84-67.us-west-1.compute.amazonaws.com>
Note: be sure to use the DNS-resolvable names as indicated on your class webpage,
or use an outside IP address to get to the Amazon node. In this example it would be
54.219.84.67
4. For Windows users, take the following steps in a PuTTY window: before clicking the Open
button, load the PPK key file in the PuTTY configuration.



5. Log in as ec2-user; the password is mapr
6. To become root, use:
$ sudo -i
7. To become user02, use the su command:
$ su - user02


8. To log in to your MapR GUI, point your browser to the URL of the first node in your class
cluster. Don't forget to use https://<node-url>:8443 -- this is the port the MCS listens
on for MCS sessions




Click "I Understand the Risks", then add an exception, and confirm the security exception in the
pop-up window.
9. Log in as mapr; the password is mapr.


You should see a view of the MCS GUI upon successful login.


Get Started 4: Explore the MapR Control System


The MapR Control System, or MCS, is a MapR-specific feature that provides a convenient UI to
access, monitor, and interact with a MapR cluster. This brief exercise provides early
experience with the MCS and an opportunity to become familiar with some of the wide range of
information and cluster administration tasks to which it provides access.


Lab Procedure
Log on and explore different views of the cluster
1. Connect to the MCS for your cluster. Use the username and password provided:
loginID = mapr password = mapr
2. From the Dashboard, identify node icons on the cluster heat map. Notice the designation
for racks, if shown.


3. From the Dashboard, discover the different views of the cluster, including:
Health
CPU utilization
Memory utilization
Disk space utilization
4. Find out how to display the legend for each view.




5. In the navigation pane, change the view from Dashboard to Nodes.
Find the drop-down menu under Overview to see more options.

Identify Specific Characteristics of Your Cluster


Now that you are familiar with some ways of looking at a cluster via the MapR Control System,
practice by answering these simple questions about the class cluster:
1. How many nodes are in the cluster?
2. How are nodes distributed with regard to racks?
3. What version of MapR software is installed on the cluster? (Hint: this is displayed in the
upper right-hand corner)
4. Are any alarms raised? If so, what are they?
5. When does the cluster license expire?
6. What is the health status of the nodes?
7. What is the significance of a green/orange/red node icon with regard to:
a. Health?
b. CPU utilization?
c. Memory utilization?
8. What % of disk space is being used by the entire cluster?
9. Are any MapReduce jobs currently running on the cluster?

Conclusion
The MapR Control System is a convenient way to access, monitor, and perform administrative
tasks on your cluster.
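
Much of what the MCS displays can also be retrieved from the command line with maprcli. As a
sketch (run on any cluster node; the column list is illustrative):

$ maprcli node list -columns hostname,svc

This lists each node and the services running on it, which is a quick way to cross-check what the
Nodes view shows.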

Lessons Learned

Different filters can be applied to see different views of the cluster, including health,
CPU/memory utilization, and disk space utilization.

The left navigation bar directs you to different features or components of the system.

Some information displayed in the MCS is cluster-wide; other information is provided on
a per-node basis.


Lesson 1: Pre-install
Lab Overview
In this lesson you will learn where to download a collection of tools and scripts that we will
use to prepare the cluster hardware for the parallel execution of tests, and then to test and
measure the performance of the hardware components of our cluster to confirm that they
are functioning properly and within the specifications for a Hadoop installation. We will also
identify the current firmware for each of the new hardware components in the cluster, and
update these components to make sure that they have matching firmware.

Lab 1.1: Pre-install validation downloads, setup and clustershell

Lab 1.2: Network, Memory and IO

Lab Procedures
Lab 1.1: Pre-install validation
Note: One of the most common causes for a failure when installing Hadoop is that the hardware
is not within the necessary specifications. You can see a list of the current hardware and OS
specifications at: http://doc.mapr.com/display/MapR/Preparing+Each+Node
The Professional Services team at MapR has developed a collection of all of the tools and scripts
that we will need to validate our hardware and prepare it for installation.
1. Download the cluster-validation package onto your master node from:
https://github.com/jbenninghoff/cluster-validation/archive/master.zip
Extract master.zip and move the pre-install and post-install folders directly under /root for
simplicity.
2. Here we will find two directories, pre-install and post-install. We will use the tools and
scripts inside the pre-install directory to validate our new hardware prior to installing
Hadoop. We will use the tools and scripts in post-install later, to test our new cluster
after we have completed our install.


Note: The tools and files in this collection are updated frequently, so we should always make
sure we download the latest package when preparing for a new Hadoop installation.
3. To prepare the cluster for these validation tests, choose one node on the cluster to be
your setup master node. Generate ssh keys on this node, and make sure that it has
passwordless ssh access to all other nodes on the cluster. You can find steps for setting
up passwordless ssh at the end of this guide.
4. Inside the pre-install directory is a clustershell rpm. Install this rpm on the master node
that has passwordless ssh access to the rest of our cluster. We will be issuing all further
commands for this exercise from this master node, using clush to propagate those
commands throughout the rest of our hardware.
5. Once installed, update the /etc/clustershell/groups file to include an "all" entry listing
the host names of the nodes we will use, such as:

all: node[0-19]
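
Before copying files, it is worth confirming that clush can reach every node in the new "all"
group. A quick check (the command below simply runs date everywhere):

# clush -a date

Every node should report its date without a password prompt; any node that fails or prompts
needs its passwordless ssh setup fixed first.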

6. Once we have our node names listed, type the following to copy the /root/pre-install
directory to all of our node hardware:

# clush -a --copy /root/pre-install

7. When that is complete, confirm that all of the nodes have a copy of the package:

# clush -Ba ls /root/pre-install

8. After we have a copy of the pre-install package on all nodes, we are ready to start our
hardware validation tests. First, we will run an audit of our hardware to see exactly
what we have on each node, and to verify that they all have a similar configuration. To
run the cluster-audit.sh script, type:

# /root/pre-install/cluster-audit.sh | tee cluster-audit.log


This will list hardware specifications from each of the new nodes.
We can examine the output log to look for hardware or software that does not match the
requirements to install Hadoop, or discrepancies in the hardware or software from one node to
the next.

Note that the audit output will give us deltas when looking at things like the RAM. It will tell us
the total amount of RAM, the number of slots, and the types of DIMMs found, but it will not tell
us which exact DIMMs are in which slots. Also, if only one DIMM type is listed, then all slots
have the same DIMM type.

Lab 1.2: Network, Memory and IO


1. Evaluate the network interconnect bandwidth.
Inside the pre-install directory, update the network-test.sh file so that the half1
and half2 arrays contain the correct IP addresses for our hardware nodes (a hypothetical
example of the edited arrays appears at the end of this lab). Next, delete the exit
command and save the file.
2. When the file has been updated, type:
# /root/pre-install/network-test.sh | tee network-test.log
This will run an RPC test to validate our network bandwidth. This test should take about 2
minutes to run, maybe a little longer.
We should expect to see results of about 90% of our peak bandwidth. Thus, with a
1GbE network, we should expect results of about 115MB/sec; with a 10GbE network,
look for results around 1100MB/sec. If we are not seeing results in this range, then we
need to check with our network administrators to verify the connections and firmware.
3. Next, we will evaluate the raw memory performance. To run the stream59 utility, type:

# clush -Ba '/root/pre-install/memory-test.sh | grep ^Triad' | tee memory-test.log

This tests the memory performance of the cluster. The exact bandwidth of memory is
highly variable and depends on the speed of the DIMMs, the number of memory
channels and, to a lesser degree, the CPU frequency.
4. Evaluate the raw disk performance. The disk-test.sh script will run IOzone on our
hard drives to test their performance.
Note: This process is destructive to any existing data, so make sure the drives do not have any
needed data on them, and that you do not run this test after you have installed MapR Hadoop
on the cluster.


Type:

# clush -ab /root/pre-install/disk-test.sh

When you first run this script, it will list the spindles to be tested. We need to verify
that this list is correct, and then edit the script to run the test.
The comments in the script will direct us to the edits that we need to make. When we are done,
we save the file and run the script again to perform the test.
If we have a large number of total drives, the summIOzone.sh script will provide us with a
summary of the disk-test.sh output.
We will keep the results of this test with the other benchmark tests for post-installation
comparison.
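
As referenced in step 1 of this lab, the half1 and half2 arrays in network-test.sh split the
nodes into two groups that exchange traffic. A hypothetical example of the edited arrays for a
six-node cluster (substitute your own internal IP addresses):

half1=(10.0.0.10 10.0.0.11 10.0.0.12)
half2=(10.0.0.13 10.0.0.14 10.0.0.15)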

Conclusion
Now that we have run all of our hardware tests and compiled benchmarks for all of our
components, we have one final task to prepare our new hardware for installation.
The firmware for the new hardware must be up to date with vendor specifications and must
match across nodes of the same type. The BIOS versions and settings must also match for
similar nodes. In addition, the firmware for the management interfaces needs to be the same
on each of these nodes. Any other hardware components that we may have in our system, such
as NICs or onboard RAID controllers, also need to have updated and matching firmware.
We will need to refer to the manual for each node vendor that we are including, and update the
firmware and BIOS according to their specifications. If there is a discrepancy in our BIOS or
firmware between nodes from the same vendor, then we can see inconsistent performance
across nodes.


Lesson 2: Install MapR software


Lab Overview
In this exercise you will install a MapR cluster. It is also important to consider how many
instances of each service will be running on the entire cluster to ensure that you have a robust
HA service layout.

Lab Procedure
Install a MapR cluster using the mapr-installer on the AWS environment
Note: Verify that your nodes meet the installation requirements before you begin.
1. Log into the master node of your cluster as described above, or as described by your
instructor.
2. Navigate to the /home/mapr directory:
$ cd /home/mapr
3. Download the mapr-setup package:

$ wget http://package.mapr.com/releases/v3.1.1/<yourLinuxOS>/mapr-setup

4. Download the pem key to the master node in your cluster


5. Set the permissions on the mapr-setup file and pem key:
$ chmod 755 mapr-setup
$ chmod 600 <yourPEMkey>
6. Run the mapr-setup script. Note: this script will create the /opt/mapr-installer directory
and additional subdirectories:
$ sudo ./mapr-setup
===============================================
Self Extracting Installer for MapR Installation
===============================================
Extracting installer.......
Copying setup files to "/opt/mapr-installer"......
Installed to "/opt/mapr-installer"
====================================


Run "/opt/mapr-installer/bin/install" as super user, to


begin install process
[root@ip-10-170-125-38 ec2-user]#
7. Move your pem key to the /opt/mapr-installer/bin directory:
$ mv <yourPEMkey> /opt/mapr-installer/bin
8. If you are using a config file with the installer, edit the config.example file to specify the
control nodes and data nodes information.
$ vi config.example

This information can also be input when running the installer, if not using a config file.
Additional information can be specified in the config file as well, including:
o Disks used (note: for Amazon, the disks are /dev/xvdf and /dev/xvdg)
o MySQL database information (see your instructor for the IP address)
o Repositories (can be local)
o Version
o Security
o M7
o Cluster name
o Etc.
===============================================================
# Each Node section can specify nodes in the following format
# Node: disk1, disk2, disk3
# Specifying disks is optional. In which case the default disk information
# from the Default section will be picked up
[Control_Nodes]
<ip-10-171-58-175> : /dev/xvdf, /dev/xvdg
<ip-10-171-35-199> : /dev/xvdf, /dev/xvdg
<ip-10-174-18-198> : /dev/xvdf, /dev/xvdg
[Data_Nodes]
<ip-10-171-23-229> : /dev/xvdf, /dev/xvdg
<ip-10-170-118-127> : /dev/xvdf, /dev/xvdg
<ip-10-174-23-41> : /dev/xvdf, /dev/xvdg
[Client_Nodes]
#C1
#C2
[Options]

MapReduce = true
YARN = false
HBase = false
M7 = true
ControlNodesAsDataNodes = true
WirelevelSecurity = false
LocalRepo = false
[Defaults]
ClusterName = <your_Team#_cluster>
User = mapr
Group = mapr
Password = mapr
UID = 2000
GID = 2000
Disks =
CoreRepoURL = http://package.mapr.com/releases
EcoRepoURL = http://package.mapr.com/releases/ecosystem
Version = <3.1.0>
MetricsDBHost = <node1 of classcluster_if setup_by_instructor>
MetricsDBUser = <mapr>
MetricsDBPassword = <mapr>
MetricsDBSchema = <metrics[1-6]>

[root@ip-10-170-125-38 bin]# bash /opt/mapr-installer/bin/install --help
Verifying install pre-requisites
updating package cache...
installing pre-requisite openssl098e
installing pre-requisite sshpass
... verified
======================================================================
MapR Installer
======================================================================
Version: 2.0.135
usage: mapr-install.py [-h] [-s] [-U SUDO_USER] [-u REMOTE_USER]
                       [--private-key PRIVATE_KEY_FILE] [-k] [-K]
                       [--skip-checks] [--quiet] [--cfg CFG_LOCATION]
                       [--debug] [--password REMOTE_PASS]
                       [--sudo-password SUDO_PASS]
                       {new,add} ...

positional arguments:
  {new,add}
    new                 Start new Installation
    add                 Add to an existing Installation

optional arguments:
  --cfg CFG_LOCATION    config file to use
  --debug               run installer in debug mode
  --password REMOTE_PASS
                        remote ssh user password
  --private-key PRIVATE_KEY_FILE
                        use this file to authenticate the connection
  --quiet               run installer in non-interactive mode
  --skip-checks         skip pre-checks (DANGEROUS)
  --sudo-password SUDO_PASS
                        sudo user password
  -K, --ask-sudo-pass   ask for sudo password
  -U SUDO_USER, --sudo-user SUDO_USER
                        desired sudo user (default=root)
  -h, --help            show this help message and exit
  -k, --ask-pass        ask for SSH password
  -s, --sudo            run operations with sudo (nopasswd)
  -u REMOTE_USER, --user REMOTE_USER



9. A. If you are not using a config file, run the installer:
$ sudo /opt/mapr-installer/bin/install -K -s --private-key <yourPEMkey> -u ec2-user -U root --debug new

and fill in the cluster details listed above when prompted.



OR

B. If you are using a config file, run the installer to determine whether the parameters you
have specified are correct:
$ sudo /opt/mapr-installer/bin/install -K -s --cfg config.example --private-key <yourPEMkey> -u ec2-user -U root --debug new

10. In the summary response area, choose (a)bort after examining your parameters.
11. Rerun the installer with the --quiet argument for non-interactive mode and a trailing &
to background the installer in case the window is lost or the laptop goes into hibernate
mode. This time, select (c) to continue with the install after reviewing the parameters.
A. $ sudo /opt/mapr-installer/bin/install -K -s --private-key <yourPEMkey> -u ec2-user -U root --debug --quiet new &
OR
B. $ sudo /opt/mapr-installer/bin/install --cfg config.example --private-key <yourPEMkey> -u ec2-user -s -U root --debug --quiet new &
Note: View details about installing on an OS other than RedHat, or more options for
custom installation at:
http://www.mapr.com/doc/display/MapR/Preparing+Packages+and+Repositories
http://www.mapr.com/doc/display/MapR/Installing+MapR+Software

The administrative user who should be given full permission is mapr, and the user
password is mapr.

When registering your cluster, select an M7 Trial license. Also, be sure to apply your
M7 license before you close the License Management dialog.

12. Watch the installation process and look for the various packages being installed. After
the control nodes have been installed (usually 20-30 min), log into the MCS by pointing
your browser to the IP address of one of the control nodes, at port 8443:
https://ControlNodeIP:8443/
13. Accept the MapR agreement, and select the licenses link in the upper right corner.


14. Apply the temporary M7 license received when registering for the course. If you do not
have a temporary license, contact training@mapr.com or ask your instructor if you are
taking a classroom or virtual training class.
15. After you have successfully applied a trial license, you may notice that some of the nodes
in the cluster have orange icons in the heatmap, indicating that they have degraded
service.
16. As the installer continues to install packages, and the Warden starts the services
on each node, we will begin to see the nodes turn green. Eventually all of the nodes will
be green, indicating that all nodes are active and healthy.

Conclusion

Plan your service layout prior to installing the MapR software:
o Make sure that you have identified where the key management services (CLDB,
  Zookeeper, JobTracker, Webserver) will be running in the cluster
o Ensure that you have enough instances of the management services to maintain the
  level of service that is appropriate for your organization
Follow the procedures outlined in the MapR documentation under the Installation Guide:
o http://mapr.com/doc/display/MapR/Installation+Guide
Use the MCS to verify that the cluster installation is complete and that the cluster is now
active.

Discussion
1. Once you see that the cluster is active, try exploring the MCS by clicking on the different
links in the Navigation pane and on the Dashboard. What will you be able to monitor
once you begin to use your cluster?
2. What would your next step be after installing the cluster?


Lesson 3: Post-install
Lab Overview
If you remember, the package that we downloaded in our pre-install lesson contained a
post-install directory. That directory contains all of the tools and scripts that we need to run
post-install benchmarks to make sure our new cluster is performing as expected.
First, we will test the drive throughput. As with our pre-install tests, we will use clush to push
this test to all of the nodes on our cluster.

Lab Procedures
3.1 Run RWSpeedTest
1. Log into the master node that we used for our pre-install tests and navigate to the
directory /root/post-install. In it we will find the file runRWSpeedTest.sh.
2. Note: This script uses an HDFS API to stress-test the I/O subsystem. The output provides
an estimate of the maximum throughput the I/O subsystem can deliver. To begin the test,
type:
# clush -Ba /root/post-install/runRWSpeedTest.sh | tee
RWSpeedTest.log
3. After we run our RWSpeedTest, we can compare our results to our pre-install IOzone tests.
We should expect to see similar results, within 10-15% of the pre-installation test.

3.2 TeraGen/TeraSort
TeraGen is a map/reduce program that will generate 1 GB of synthetic data, and TeraSort
samples this data and uses map/reduce to sort it into a total order. These two tests together
will challenge the upper limits of our cluster's performance.
1. Type:
# maprcli volume create -name data1 -replication 1 -mount 1 \
    -path /data1
# mkdir data1/out1
# mkdir data1/out2

2. Verify that the new directories exist, then type:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar \
    teragen 10000000 /data1/out1


3. This will create 1 GB worth of small-number data. Once TeraGen has finished, type the
following to sort the newly created data:
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar \
    terasort /data1/out1 /data1/out2
While TeraSort is running, we can use the MCS to watch the node usage. When we set the
heatmap to show "Disk Usage", we can see the load on each node. We are looking for the load
to be spread evenly across our cluster. Hotspots suggest a problem with a hard drive or its
controller. We can change the view of our heatmap to look at the load on different resources of
our cluster as we run our tests.
In addition to the heatmap views, we can look at the services and jobs. Since we are using
synthetic code, we know that it functions properly. If we have a job or task failure, then we
have an issue with our hardware.
When TeraSort is finished, we can compare the results with our RWSpeedTest results. We
should expect our TeraSort throughput to be between 50% and 70% of our RWSpeedTest
throughput. Since we know the TeraSort job code does not have any errors, if we see
performance that doesn't match our expectations, we know we have a problem with the
hardware in our cluster.


Lesson 4: Configure Cluster Storage Resources

Lab Overview
The labs in this chapter cover all the basics of cluster storage resources, including:

Topology and Storage Architecture:
o the physical layer, including nodes, disks, and storage pools
o the logical layer, including files, chunks, and containers
Volumes, including mirrors, snapshots, and remote mirrors

These labs provide insight into how data is managed in a MapR cluster, and provide hands-on
experience configuring topologies, volumes, and quotas. You have a great degree of control over
your organization's MapR storage resources. Configuring a cluster with appropriate topologies
and volumes has long-term impacts on performance, reliability, and ease of management. This
lab is broken into three separate exercises that build on each other.

Lab Procedures
Key Tips:

Always set up node topology before deploying the cluster. Never leave nodes in
/data/default-rack.

Create volumes to contain different types of data on the cluster before deploying the
cluster. (E.g., create one volume per user, one volume per project, distinct volumes for
production work and development work, etc.) Don't let data accumulate at the root
level of the cluster.

MapR separates the concepts of volume ownership and quota accounting. Project
members can have full ownership of files and folders for a project, while the collective
storage for the whole project is restricted by a quota independent of individual users.

Rack Layout
In this training lab environment, our physical rack layout is hypothetical. If you were
configuring node topology in a physical cluster environment, then you would coordinate with
the team responsible for the physical setup of the cluster to build a diagram of the physical rack
layout. For this lab, let's assume our cluster's nodes are contained in two racks.


Note: If applicable, you may need to coordinate your activities on your Team# cluster with the
other members of your team.

Lab 4.1: Configure Node Topology

The first step in getting a cluster ready for data storage is to set up the node topology. Node
topology describes the logical organization of the cluster. Grouping nodes into proximity-based
topologies, i.e. racks, helps to distribute data across physical failure domains, thus decreasing
the probability of data loss. It is also important to define higher-level logical topologies, typically
named /data and /decommissioned, which serve as staging areas for nodes when
transitioning into and out of service.
/
  data/
    rack1/
      r1_node1
      r1_node2
      ...
      r1_nodeN
    rack2/
      r2_node1
      r2_node2
      ...
      r2_nodeN
  decommissioned/
    <nodes_to_remove>


Important!
Don't start using your cluster with nodes assigned to /data/default-rack. If you don't take
the time to set up topologies early on, you will have difficulty later taking advantage of
MapR's HA features.

In this exercise you will define two new topologies, /data/rack1 and /data/rack2, and
assign all nodes in the cluster to one of the two. The diagram above shows the logical
organization of our cluster's node topology.
Given this structure, assigning data to /data will distribute the data across all nodes in the
cluster. Assigning data to /data/rack1 will restrict data storage to the nodes in rack1. And, if
desired, assigning data to /data/rack2/node1 will restrict storage to that particular node.

Lab 4.1 Steps

1. Log on to the cluster. In a web browser open the MCS UI: browse to
https://<webserver hostname>:8443 and log in.
LoginID = mapr    password = mapr
The MCS launches and displays the dashboard view.

2. Open an SSH session with a node in the cluster and type:
# maprcli node list -columns svc
3. Confirm that all nodes are running in good health. In the MCS dashboard, you should
see all green node shapes, indicating all nodes are in good health.
Note: the slider bar displays more information in each node shape.
4. Move nodes to /data/rack1 and /data/rack2:
a. In the MCS Navigation pane under the Cluster group click Nodes.
b. In the Nodes panel, select the node(s) you want to move.
c. Click the Change Topology button. The Change node topology dialog box appears.
d. Under New Path, enter or pick the destination topology. Click OK. Note: new
topologies will be added to the menu offering after being created.
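Node topology can also be changed from the command line. A sketch (the server IDs are
placeholders taken from the node list output, and option syntax may vary slightly by MapR
version):
# maprcli node list -columns id
# maprcli node move -serverids <serverID1>,<serverID2> -topology /data/rack1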


5. Set the default physical topology using the CLI. You can change the default topology
such that any new node added to the cluster will appear in the specified topology. In
this step, you are going to change the default topology to /data.
a. Open an SSH session with a node in the cluster.
b. Type the following command at a command line:
maprcli config load -json | grep default
c. Notice the default topology.
d. To change it, you would do the following:
maprcli config save -values
'{"cldb.default.volume.topology":"/data"}'
6. Verify that all nodes are assigned to a physical topology.
a. In the MCS Navigation pane under the Cluster group, click Nodes.
b. Look at the Topology pane and confirm that each node in the cluster appears in a
specific rack, and that no nodes remain under /default-rack.


Lab 4.2: Create Volumes and Set Quotas


In this lab exercise you will learn how to manage a MapR cluster in a shared environment.
Imagine that your cluster is going to be shared by up to 5 different groups each with multiple
users working on development and production projects. You need to manage the resources of
the cluster so all of these groups can work simultaneously without consuming more than their
share of storage and compute resources. You also need to make sure that development
projects do not impinge upon production work.
In this exercise you will create independent volumes for each user and project, and then you will
impose quotas on those volumes.
Important!
Don't store data in the root volume (/).
If all data is in the root volume, you lose the ability to specify location, quota, or HA properties
for different types of data.
As soon as you set up your cluster, start creating volumes to organize data on the cluster. As this
lab will demonstrate, MapR recommends that you create at least the following volumes:
1. Create a separate volume for each user.
2. For active projects, create separate volumes for development work and production
activity.
Note: In order for a MapR cluster to function correctly, the user accounts and groups must be
set up identically across all nodes.


Lab 4.2 Overview


The diagram below illustrates the key concepts of this exercise. In this case user01 and user02
are in the Log Analysis Development group (loganalysis_dev). Each of these users has
permission to read and write data to the project volume as well as their own user volume. The
cumulative storage used by these volumes rolls up to a group referred to as an Accounting
Entity. Each user, volume and Accounting Entity can have a separate disk quota for flexible
management of cluster disk usage.


Lab 4.2 Set-up

1. Set up the users and groups on all your cluster nodes. Note: they must all have the
same UID and GID on every node in the cluster. This is an opportunity to use the clush
utility if you wish:
# yum install clustershell
For example, run groupadd on every node in your cluster:
# groupadd -g 5000 loganalysis_dev
Add individual users on every node:
# useradd -u 5001 -g loganalysis_dev user17
Or use clush to do both across all nodes at once (see the verification sketch below):
# clush -a groupadd -g 8000 <grouptest1>
# clush -a useradd -u 8001 -g 8000 <USER_number>
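To confirm that the accounts came out consistent, a quick sketch using clush (the -B option
gathers identical output from all nodes into one block, so a single merged result means every
node agrees on the UID/GID):
# clush -Ba 'id user17'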

2. Add the user to the MCS permissions popup.


Username/loginID    Groupname                Teamname/Clustername
user01              webcrawl_dev             Team1
user02              webcrawl_dev             Team1
user03              webcrawl_prod            Team1
user04              webcrawl_prod            Team1
user05              frauddetect_dev          Team2
user06              frauddetect_dev          Team2
user07              frauddetect_prod         Team2
user08              frauddetect_prod         Team2
user09              recommendations_dev      Team3
user10              recommendations_dev      Team3
user11              recommendations_prod     Team3
user12              recommendations_prod     Team3
user13              twittersentiment_dev     Team4
user14              twittersentiment_dev     Team4
user15              twittersentiment_prod    Team4
user16              twittersentiment_prod    Team4
user17              loganalysis_dev          Team5
user18              loganalysis_dev          Team5
user19              loganalysis_prod         Team5
user20              loganalysis_prod         Team5


Lab 4.2 Steps


Examine the volumes already on the cluster

1. Connect to the MCS for your cluster


2. Click Volumes under MapR-FS in the left navigation pane:

Notice how many volumes are listed. Do these include system volumes? Hint: notice
whether or not the System check box is selected on the upper menu.

Display only the non-system volumes by de-selecting the System check box on
the upper menu.

Locate the New Volume button that lets you create a new volume.

What other volume actions are allowed in the Volume Actions (Modify Volume)
menu?

Examine volume properties from the volumes list

1. From the list of volumes, choose a volume to examine.

Look across the columns to find whether the volume of interest contains data, and if
so, what is the data size?

What is the replication factor listed for the volume you are examining?

2. Find more details for this volume on the Volumes Properties pane. Hint: Open the pane
by clicking the highlighted name of the volume.

What is the minimum replication factor for this volume?

Does the volume have a quota?




Practice Creating and Removing Volumes

3. Click the New Volume button.


4. Select Standard Volume for the Volume Type in the new pop-up window.
5. Enter a volume name using your name (or some other unique name) and designate
volume number 1 (e.g. name-vol1, where "name" is your name) in the Volume Name
field.
6. Type the mount path /name-vol1 in the Mount Path field.
Note: MapR MCS will not create any parent directories above the mount point, so make them
beforehand if necessary with the mkdir command.
7. Verify /data is displayed in the Topology field (This is the default topology; we will
discuss topology in the next lecture).
8. Verify the default replication factor and minimum replication settings. Are they set to
what was recommended in the Volumes lecture?
9. At the bottom of the popup window, click OK to create the volume.
10. Verify that your new volume appears in the volumes list. Do you see the volumes
created by the other students in the class?
(Note: If not, you will need to go to the volume name filter (e.g. <ted-vol1>) at the top and
remove the filter by clicking the minus sign.)

Repeat the process above to create a second user volume for your name.

Verify that your new volume appears in the volumes list.

Once again remove the filter so that you can view the full list of non-system
volumes.



Remove one volume

1. Decide which of your own volumes you want to remove and select it by clicking the
check box by the volume name.
2. Select Remove on the Modify Volume menu. You will see a confirmation dialog box.

Make your choice for what style of removal you want and click the Remove Volume
button on the lower right.
Verify in the volumes list that one of your volumes has disappeared.
Create a volume for each user

In this step, you will create a home volume for all project members, if applicable. On each user
volume:

Restrict the volume to the /data/rack2 topology, which prevents users from consuming
storage resources on /data/rack1.

Assign the Accounting Entity of the user volume to the appropriate group for that user.
Assigning this Accounting Entity prevents the members of the group from collectively
overshooting a storage quota for the project.

Set quotas for the user volume.

Note: user17 and loganalysis_dev are used as examples below. Be sure to substitute the
appropriate user name and group when you create the volumes for your team members.
1. In the MCS, in the Navigation pane under the MapR-FS group, click Volumes.
2. In the Volumes tab click the New Volume button.
3. Following the example below, enter the volume settings for each user volume in the
New Standard Volume dialog box.
Volume Setup section

Volume Type: Standard Volume
Volume Name: user17-homedir
Mount Path: /mapr/<Team1_cluster>/home/user17/vol
Topology: /data/rack2

Permissions section
u:user17    fc

Usage Tracking section
User/Group (this specifies the Accounting Entity):
Group loganalysis_dev
Note: the group must exist on all nodes in the cluster

Quotas (this specifies the disk quota for the volume itself):
Volume Advisory Quota: 100G
Volume Hard Quota: 128G
4. Click OK.


Command Line
It is also possible to create a new volume at the command line. For example:
maprcli volume create -path /home/user17/vol \
-ae loganalysis_dev -aetype 1 -topology /data/rack2 \
-quota 128G -advisoryquota 100G \
-user user17:fc -name user17-homedir
Note: The maprcli volume create command requires specific ordering of
arguments. Make sure that the -name option comes last.
You can change quotas later at the command line. For example:
maprcli volume modify -quota 20G -advisoryquota 15G \
-name user17-homedir

5. Change ownership of the volume for the user. At a command line type:
chown user17 /mapr/<my.cluster.com>/home/user17/
Create a volume for your team project

In this step, you will create a volume for your team project, if applicable. Bear in mind the
following criteria for your project volume:

Restrict development volumes to the /data/rack2 topology, which prevents development
projects from consuming storage resources on /data/rack1.

Production volumes should be allowed to span the entire cluster, so they will have a
topology of /data.

Set group permissions on each volume:
o For development volumes, members of both prod and dev groups get full control
o For production volumes, only members of the prod group get full control

Assign your group as the Accounting Entity.

Set quotas for the project volume:
o Development volumes: the Advisory Quota is 9T and the Hard Quota is 10T
o Production volumes: the Advisory Quota is 19T and the Hard Quota is 20T

Note: loganalysis_dev is used in the examples below. Be sure to substitute the appropriate user
name and group when you create the volumes for your project.


1. Create the top-level project directory under /mapr/<my.cluster.com>/home/, if it
doesn't exist. For example, at a command line type:
mkdir /mapr/<my.cluster.com>/home/<loganalysis_dev>/
2. In the MCS, in the Navigation pane under the MapR-FS group, click Volumes.
3. Create the project volume. In the Volumes tab click the New Volume button.
4. Following the example below, enter the volume settings for the project volume in the
New Standard Volume dialog box.
Volume Setup section

Volume Type: Standard Volume
Volume Name: loganalysis-dev
Mount Path: /mapr/<Team1_cluster>/home/loganalysis_dev/vol
Topology: /data/rack2
Note: the example above is for a development group volume. If you are creating a
volume for a production group, then the topology would be /data.

Permissions section
Note: the example below is for a development group volume. If you are creating a
volume for a production group, do not add permissions for the development group.
g:loganalysis_dev     fc
g:loganalysis_prod    fc

Usage Tracking section
User/Group (this specifies the Accounting Entity):
Group loganalysis_dev

Quotas (this specifies the disk quota for the volume itself)
Note: the examples below are for a development group volume. If you are creating a
volume for a production group, the Advisory Quota is 19T and the Hard Quota is 20T.
Volume Advisory Quota: 9T
Volume Hard Quota: 10T

5. Click OK.
6. Change ownership and permissions of the project volume. At a command line type:

chgrp loganalysis_dev /mapr/<my.cluster.com>/home/loganalysis_dev/vol
chmod g+rwx /mapr/<my.cluster.com>/home/loganalysis_dev/vol
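Command Line
By analogy with the user-volume example above, the project volume can also be created with
maprcli. A sketch only; the -group ACL syntax below is an assumption, so check the maprcli
volume create help on your version:
maprcli volume create -path /home/loganalysis_dev/vol \
  -ae loganalysis_dev -aetype 1 -topology /data/rack2 \
  -quota 10T -advisoryquota 9T \
  -group 'loganalysis_dev:fc loganalysis_prod:fc' \
  -name loganalysis-dev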

Verify that the volumes are set up correctly

1. In the MCS, in the Navigation pane under the MapR-FS group, click Volumes. The
Volumes view appears, listing all volumes in the cluster.
2. Confirm that all of the volumes you created are listed in the Volumes view. Other
volumes that are part of the default cluster configuration may also appear here. You can
use the Filter option to list, for example, only the volumes with a mount path matching
/home*, as shown below.

3. Navigate the volumes at the command line and verify that they have been mounted. For
example:
ls -al /mapr/<my.cluster.com>/home/
ls -al /mapr/<my.cluster.com>/home/loganalysis_dev/vol
You should see the volumes you just created in the previous steps mounted in these
locations.
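You can also spot-check a single volume from the command line. A sketch (field names in the
JSON output may vary slightly by MapR version):
maprcli volume info -name user17-homedir -json | grep -Ei 'mount|quota'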



Set disk usage quotas for your project Accounting Entity

By setting a quota on an Accounting Entity, we can make sure that all volumes assigned to the
Accounting Entity (including user volumes and project volumes) do not collectively overshoot a
project maximum.
1. In the MCS, in the Navigation pane under the MapR-FS group, click User Disk Usage.
The User Disk Usage panel displays all users and groups that have been assigned as an
Accounting Entity (e.g. loganalysis_dev).

2. Click on your project Accounting Entity. The Group Properties dialog box appears.
3. Following the example below, enter the quota settings for your project Accounting
Entity in the Usage Tracking section of the Group Properties dialog box.

For development projects:
o Turn on User/Group Advisory Quota. Enter 9T.
o Turn on User/Group Hard Quota. Enter 10T.

For production projects:
o Turn on User/Group Advisory Quota. Enter 19T.
o Turn on User/Group Hard Quota. Enter 20T.


Command Line
It is also possible to set the Accounting Entity quotas at the command line. For example:
maprcli entity modify -quota 10T -advisoryquota 9T \
-name loganalysis_dev -type 1

Conclusion
Before you begin adding data to your cluster or submitting jobs, make a decision about topology
(node/data placement) and implement this decision on your cluster.
Create volumes early and often. It is much easier to manage cluster data at the volume level
than to manage all of the data on the cluster as one enormous data set. Imagine trying to
manage petabytes of data!
Creating separate volumes provides flexible resource management by separating ownership
from accounting.
Do not use the / or /data/default-rack topology for data placement.


Lesson 5: Data Ingestion, Access & Availability
Labs Overview
Lesson 5 labs cover the following topics:

Accessing the cluster using NFS

Snapshots

Mirrors

Multiple Clusters and Disaster Recovery

5.1 Get Data into the Cluster Using NFS


Topics and tasks in this first lab will help you to

understand the significance of NFS in MapR

learn how to get data into a cluster using NFS

view and manipulate data directly on your cluster using standard Linux file commands via
NFS

Before you begin the lab steps, the cluster filesystem must be mounted on the data instance.

Create Input Directory for Data

Copy Data from Data Instance to Input Directory on Cluster
1. SSH to the data instance (the NFS node for this exercise, contained in your hosts file)
and create a mount point for the cluster:
# mkdir /mapr/<TeamCluster3>
2. Mount your cluster on the NFS client node:
# mount -t nfs <TeamCluster3>:/mapr /mapr/<TeamCluster3>
3. Create an input directory on the project volume you created in the previous lab:
# mkdir /mapr/<my.cluster.com>/home/loganalysis_dev/input
4. Copy the data from the /etc directory on the data instance to the input directory on
your project volume:
cp -v /etc/*.conf /mapr/<my.cluster.com>/home/loganalysis_dev/input
5. Verify that the data you moved from the data instance is now in the input directory on
your cluster volume:
ls /mapr/<my.cluster.com>/home/loganalysis_dev/input
You should see a collection of files that end in .conf

Run a MapReduce Job on Data

1. Run a MapReduce job on the data:
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar \
  wordcount /home/loganalysis_dev/input /home/loganalysis_dev/output1
2. View the output of the MapReduce job:
ls /mapr/<my.cluster.com>/home/loganalysis_dev/output1

Modify Data and Run MapReduce Again

1. From the /home/loganalysis_dev/input directory, use sed to add some files to your
input data directory:
for i in `ls *`; do cp $i `echo $i | sed "s/.conf/AA.conf/g"`; done
2. Re-run the same MapReduce job on the data, sending the output to a new directory:
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar \
  wordcount /home/loganalysis_dev/input /home/loganalysis_dev/output2

Compare Results from Both MapReduce Jobs

1. Compare the output from the MapReduce jobs:
diff /mapr/my.cluster.com/home/loganalysis_dev/output1/part-r-00000 \
     /mapr/my.cluster.com/home/loganalysis_dev/output2/part-r-00000
You should see the change you made in the previous step.


Note: The diff and sed commands you used above are standard Linux commands. Because the
cluster filesystem is mounted via NFS, any standard Linux programs that operate on text files
(sed, awk, grep, etc.) can be used with data on your cluster. This would not be possible without
NFS: you would need to copy the file out of the cluster first before performing your task, and
then copy the resultant file back into the cluster.
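For example, a sketch of searching cluster-resident data in place over NFS (the search pattern
here is arbitrary):
grep -l localhost /mapr/<my.cluster.com>/home/loganalysis_dev/input/*.conf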

Conclusion
In this lab you experienced copying data from an external data source to the cluster storage via
NFS. You were able to do so with standard Linux file commands that are familiar to system
administrators. This process would have been much more technically challenging and taken a
significantly longer time to perform without NFS.

Lab 5.2: Snapshots

Explore how snapshots work by creating snapshots at various points in time of a volume
containing changing data, to see that each snapshot shows data from a fixed point in time. Also
see that snapshot creation is almost instantaneous and that a snapshot can preserve data that
has since been deleted or changed. Learn to apply a schedule so that snapshots are
automatically created at fixed intervals. Schedules also allow snapshots to expire at a time you
designate. This lab has 4 exercises:

Create a snapshot in two ways

Show how snapshots capture frozen views of past state

Show snapshots preserve deleted data

Create a snapshot schedule from the MCS

Create a snapshot in two ways

This exercise will create a snapshot of a volume at a particular point in time, using two different
methods for making a snapshot.
Preparation: Before starting this exercise, you should have created a volume for your
experiments and mounted it. If you haven't already created such a volume, do so now using the
MCS. Make sure that your volume is different from the volumes other students are using for
this exercise to minimize confusion about who is doing what.


Put some sample data into your volume

1. Use ssh to log in to a node in your cluster. Use your own user id here.
$ ssh mapr@classnode-cluster
2. Change directory to your personal volume.
$ cd /mapr/<my.cluster.com>/snapshot_lab_mnt_user01
3. Create a data file called STATIC in your personal user-volume containing whatever
data you choose.
$ cat /etc/hosts > STATIC
Create a volume snapshot of your volume using MCS

Use the MCS to create a snapshot, as shown here:

Select "New Snapshot" from the pull-down menu under "Modify Volume" on the top bar, provide a
name for your snapshot, and click OK to create a snapshot of the selected volume, in this case
snapshot_lab_vol_user01. This will create a snapshot of the volume you have selected.


Verify the snapshot was made by clicking Snapshots in the navigation pane at the left side of
the window to see the snapshot name, mount path, and reference volume.
Create and view contents of a new snapshot

Use the CLI to manually create a new snapshot and to see its contents for comparison to the
source volume.
1. Connect to your node via ssh and use a CLI command to create a snapshot. Make sure
that the name you give to your snapshot does not have a dash in it.
$ maprcli volume snapshot create \
    -volume snapshot_lab_vol_user01 \
    -snapshotname snapshot2_user01

2. Change directory to your volume mount point, list the snapshots, and then list the
contents of the .snapshot directory:
$ cd /mapr/<my.cluster.com>/snapshot_lab_mnt_user01
$ ls -al          (notice you don't see your snapshot)
STATIC
$ ls .snapshot
snapshot2_user01    SNP_of_lab_vol_user01_2013-07-16.12-31-44
$ ls .snapshot/snapshot2_user01/
STATIC
Notice that there is a directory with the name of your snapshot.
The contents of the .snapshot/snapshot2_user01 directory will be identical to the
contents of the volume at the time you took the snapshot.

Show snapshots are time-specific


You will run a program to keep updating the data with new files in the same volume so that you
have something to watch. Then you will manually create snapshots of the volume at different
points in time. When you look at the contents of these snapshots, you should see that each
snapshot preserves the state of the volume at a moment in time. This state includes which files
exist and what data is in these files.


Create new data files in your volume by running a shell script

Run the following commands:


$ cd /mapr/<my.cluster.com>/snapshot_lab_mnt_user01

$ while true; do
touch file-$(date +%T)
date >> log; sleep 13
done &
This creates a new file every 13 seconds as this script runs in the background. The file name of
each file will contain the time the file is created. The last command will also log the time each
file is created. This log file will look something like this:
Thu Dec 13 17:15:44 PST 2012
Thu Dec 13 17:15:57 PST 2012
Thu Dec 13 17:16:10 PST 2012
Thu Dec 13 17:16:23 PST 2012
The files created will look something like this:
$ ls
file-17:15:44  file-17:15:57  file-17:16:10  file-17:16:23  log  STATIC
$
Create a new snapshot, wait about 30 seconds, then create another snapshot

Note the time displayed in the original ssh window when you created each snapshot, by
putting a line into the log file:
$ maprcli volume snapshot create -volume snapshot_lab_vol_user01 \
    -snapshotname snapshot3_user01; echo "snapped at $(date)" >> log


Explore the snapshot directory from the CLI

1. Change directory into the mount point of the volume you created the snapshots for
earlier.
2. List all files and directories there using "ls -a". Note that you won't see the .snapshot
directory because it is hidden. You can see the contents of the .snapshot directory if
you explicitly give its name, but you won't see it otherwise.
Even though you don't see the .snapshot directory using ls in the volume mount point, it is still
there and you can look inside. Do this:
$ ls -alh .snapshot
total 2.5K
drwxr-xr-x. 5 root root 3 Jul 16 12:58 .
drwxr-xr-x. 2 root root 2 Jul 16 12:57 ..
drwxr-xr-x. 2 root root 1 Jul 16 12:24 snapshot2_user01
drwxr-xr-x. 2 root root 2 Jul 16 12:57 snapshot3_user01
drwxr-xr-x. 2 root root 1 Jul 16 12:24 SNP_of_lab_vol_user01_2013-07-16.12-31-44
You should see the snapshots that you created earlier.

Note: You can also see a list of snapshots in the MCS along with details like when they were
created and when they will expire. You will not, however, be able to see the contents of the
snapshots from the MCS.
1. List the contents of each snapshot. You should see that more files appear in each
subsequent snapshot, like this:
$ ls .snapshot/*
.snapshot/snapshot1:
STATIC

.snapshot/snapshot2:
file-08:39:16  file-08:39:29  file-08:39:42  file-08:39:55  file-08:40:08
file-08:40:21  file-08:40:34  file-08:40:47  file-08:41:00  file-08:41:13
log  STATIC

.snapshot/snapshot3:
file-08:39:16  file-08:39:29  file-08:39:42  file-08:39:55  file-08:40:08
file-08:40:21  file-08:40:34  file-08:40:47  file-08:41:00  file-08:41:13
file-08:41:26  file-08:41:39  file-08:41:52  file-08:42:05  file-08:42:19
file-08:42:32  file-08:42:45  file-08:42:58  file-08:43:11  log  STATIC

You can also look at the contents of the log files in each snapshot. In the second and third
snapshots, you should see everything in the log file up to the moment the snapshot was taken.
That means that you will see the log line for the second snapshot in the third snapshotted
version of the log.
$ cat .snapshot/snapshot3_user01/log
Thu Dec 13 08:39:16 UTC 2012
Thu Dec 13 08:39:29 UTC 2012
...
Thu Dec 13 08:41:13 UTC 2012
snapped at Thu Dec 13 08:41:24 UTC 2012
Thu Dec 13 08:41:26 UTC 2012
Thu Dec 13 08:41:39 UTC 2012
...
Thu Dec 13 08:43:11 UTC 2012
$


The parent directory has continued to fill up with files due to the script that has been running all
this time. Note that each snapshot has all of the files that were created before the snapshot
was created, but it has nothing else. The snapshots preserve a view of the content as it was
when the snapshot was created.

Show snapshots preserve deleted data

Now you will remove all data files except log and STATIC from your volume and check the
contents of your snapshots to see that they have preserved the deleted data as of specific points
in time.
Stop the program:

$ kill %1
Remove all files except log and STATIC:

$ rm file-* ; echo "removed files $(date)" >> log

1. List the contents of your volume and compare the volume contents to the contents of
the last snapshot of the volume:

$ ls
log  STATIC
$ ls .snapshot/snapshot3
file-08:39:16  file-08:39:29  file-08:39:42  file-08:39:55  file-08:40:08
file-08:40:21  file-08:40:34  file-08:40:47  file-08:41:00  file-08:41:13
file-08:41:26  file-08:41:39  file-08:41:52  file-08:42:05  file-08:42:19
file-08:42:32  file-08:42:45  file-08:42:58  file-08:43:11  log  STATIC
$
Note that files that you deleted are still present in each snapshot made before the deletion.
Remember you can review the exact sequence of events that happened by looking at your log
file. Comparing the final version of the log with each snapshotted version is very instructive.
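Because the snapshot contents are visible through the filesystem, recovering a deleted file is
just a copy. A minimal sketch:
$ cp .snapshot/snapshot3_user01/file-08:39:16 .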

Schedule snapshots from the MCS

You will need a schedule for this next part of the lab. Schedules are independent of volumes,
snapshots, and mirrors. A schedule simply expresses a policy in terms of frequency and
retention times.
Create a custom schedule

Using the MCS to create a schedule:

1. Click Schedules under MapR-FS in the navigation pane
2. Click the New Schedule button
3. Give the schedule a name (every_5_minutes) and a rule (say, every 5 minutes, and
retain (expire) after 45 minutes)
4. Click the Save Schedule button
Note: the schedule is not currently applied to any volumes

Apply the schedule

Now use the MCS to apply the custom schedule as a snapshot schedule for one of your
volumes:
1. Click Volumes under MapR-FS in the Navigation pane
2. Click the name of one of your volumes
3. Scroll down to the Snapshot Scheduling section


4. Select the custom schedule from the previous step. Click the OK button at the bottom
of the dialog.
Note that new snapshots are being created and that their content is a frozen view of the volume
as of the particular moment in time when each snapshot was created.
These new snapshots will have names based on the time that they are created, rather than the
sequentially numbered names of the snapshots that you created before.
5. Verify that the new snapshots are being created according to the schedule you applied.
6. Using the MCS, list snapshots and notice that the ones created by schedule have an
expiration date, while the ones created manually do not.
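The same can be done from the CLI; a sketch (the schedule ID is a placeholder taken from the
list output):
# maprcli schedule list
# maprcli volume modify -name snapshot_lab_vol_user01 -schedule <schedule_ID>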


Lab 5.3: Mirrors and schedules


Gain experience with making mirrors manually via the MCS and CLI. Also learn to apply a
schedule to update data from the source volume. This lab has three parts:

Create a mirror from the MapR Control System (MCS)

Apply a schedule to the mirror

Create a mirror from CLI and initiate a mirror sync

Create a mirror from the MCS

Create a local mirror based on a source volume

1. Connect to the cluster MCS.

2. Choose one of the ways to set up a mirror volume; for instance, choose Volumes from
the left bar to display candidate source volumes. If possible, pick a volume containing
data for the source volume, so you will be able to verify that it is copied to the
new mirror volume.

3. Select New Volume from the top menu and fill in the template to make a
local mirror (mounting is optional).


Now you have created the mirror volume, but no data has been copied to it.
4. Verify your new mirror volume exists by selecting Mirror Volumes on the left bar menu
to display the names of all mirrors.

Copy data to your new mirror volume

1. Use the MCS to start mirroring by selecting Start Mirroring from the Modify Volume
button's drop-down menu.

2. Verify that data is copied to your mirror volume by watching the display of mirror
volumes. If there is a lot of data, you will see an indication that the copying is in
progress.


Apply a schedule to the mirror

1. Use the MCS to apply a schedule to update your mirror volume.

Create a mirror from CLI and initiate a mirror sync

Use the CLI to create a new local mirror volume of a different source volume.
1. Use the CLI to manually create a mirror volume.

CLI example:

Determine the schedule IDs available:
# maprcli schedule list

# maprcli volume create -name <lab_mirror_2_user01> \
    -source <source_vol_mirror>@<clusterName> -type 1 \
    -schedule <schedule_ID>
2. Use the CLI to initiate a mirror sync:
# maprcli volume mirror start -name <mirror_volume_name>
OR
# maprcli volume mirror push -name <source_volume_name>
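One simple way to confirm the sync worked is to compare source and mirror through NFS. A
sketch (both mount paths are placeholders):
# diff -r /mapr/<clusterName>/<source_mount_point> \
       /mapr/<clusterName>/<mirror_mount_point>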


Conclusion
The mirror volumes you created are located on the local cluster. They would be appropriate for
load balancing, for making a read-only version of data available, for isolating a copy of data from
ongoing activities or for deployment.
Remember these lessons learned:

When you make a mirror volume, you must reference the source volume by name.

The new mirror volume you create does not contain data until you start the mirroring
process or apply a schedule.

Updates do not happen automatically unless you apply a schedule.

Mirrors do not expire unless you schedule expiration.

Lab 5.4: Disaster Recovery


One key aspect of Disaster Recovery (DR) is preservation of business-critical data in the event of a
disaster. Data preservation typically involves maintaining a consistent copy of the data in more than
one location. MapR provides remote mirroring as a means of ensuring that a consistent copy of your
most important business data is available elsewhere in the event of a disaster. The goal of this lab is
to understand how to set up two clusters so that data can be mirrored between them. You will also
learn how to configure, initiate and verify volume mirroring from one cluster to another.
The most important parts of configuring remote mirroring are the following:

Configure the clusters to be aware of each other

Create a remote mirror and initiate mirroring on the destination cluster

You will break into teams and, as a class, you will configure both clusters to be aware of each
other so data can be mirrored between them. Each team will configure one or two nodes on
each cluster so it is aware of the other cluster. Then one person will restart the Webserver on
each cluster. Then each team will create a mirror volume on the destination cluster that refers
back to a volume with data on the source cluster and initiate mirroring.
Before beginning the lab steps, you should verify that you have some data in a volume on the
source cluster. This data could be left over from a previous lab exercise (e.g. the NFS/Accessing
lab) or you can copy new data into a volume for the purpose of this lab. The instructor will
provide you with some test data to use if necessary.


Set Up
1. Verify all nodes in the source cluster have <Team name> for the source cluster (line 1)
and configure all nodes to be aware of the <destination Team cluster> (line 2).
2. SSH to the node you are configuring on the source cluster.
3. Verify in /opt/mapr/conf/mapr-clusters.conf that the <Teamname> is there:
Team1
4. Add a second line in /opt/mapr/conf/mapr-clusters.conf for the remote
cluster in the format:
<clusterTeam2> <CLDB1>:7222 <CLDB2>:7222

cluster2 is the name of the destination cluster

<CLDB1> and <CLDB2> are the CLDB nodes in the destination cluster

5. Restart the Warden on all your node(s):

service mapr-warden restart
Note: there is a small bug (3.0.2) that requires you to add more than one remote cluster to
mapr-clusters.conf for it to be visible in the GUI.
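If your team is responsible for several nodes, a sketch of pushing the same line to every node
with clush (the cluster name and CLDB hosts are the same placeholders as above):
# clush -a 'echo "<clusterTeam2> <CLDB1>:7222 <CLDB2>:7222" >> /opt/mapr/conf/mapr-clusters.conf'
# clush -a service mapr-warden restart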

Configure all nodes in the destination cluster


Configure all nodes in the destination cluster with a unique name for the destination cluster
and configure all nodes to be aware of the source cluster
1. SSH to the nodes you are configuring on the destination cluster
2. Edit /opt/mapr/conf/mapr-clusters.conf and verify your <Teamname>
3. Add a second line in /opt/mapr/conf/mapr-clusters.conf for the source
cluster in the format:
<Teamname2> <CLDB_A>:7222 <CLDB_B>:7222

Teamname2 is the name of the source cluster

<CLDB_A> and <CLDB_B> are the CLDB nodes in the source cluster

4. Restart the Warden on your nodes


service mapr-warden restart
Note: the remainder of the steps should be completed by each team

Verify that each cluster has a unique name


Verify that each cluster has a unique name and is aware of the other cluster
1. Log on to the MCS of the source cluster
2. Verify that the cluster name is cluster1

3. Click the + symbol next to the cluster1


4. Verify that cluster2 is listed under Available Clusters

5. Log on to the MCS of the destination cluster


6. Verify that the cluster name is cluster2
7. Click the + symbol next to the cluster2
8. Verify that cluster1 is listed under Available Clusters


Create a remote mirror volume on the destination cluster

You should be logged into the MCS on the destination cluster.
1. Select Volumes in the Navigation pane
2. Click the New Volume button
o Volume Type: Remote Mirror Volume
o Enter a unique name for your mirror volume
o Enter the name of the source volume
o Source Cluster Name: cluster1
o Enter a unique mount path for the mirror volume (the parent directory must already exist)
o Ensure that the Mounted checkbox is checked
o Topology: /data
3. Click the OK button

You should see confirmation at the top of the MCS indicating that the mirror volume was
created


Initiate mirroring to the destination cluster

1. If not already selected, click Volumes in the Navigation pane
2. Select the mirror volume you created above

3. Click Volume Actions

4. Select Start Mirroring


Verify data from source cluster was copied to destination cluster

1. SSH to any node on the destination cluster
2. List the contents of the destination mirror volume:
hadoop fs -ls /<mirror_volume_mount_point>
or, if the cluster filesystem is mounted via NFS:
ls /mapr/Team2/<mirror_volume_mount_point>
You should see the exact same contents in the mirror volume as you do in the original source
volume.

Conclusion
In this lab you learned how to copy data from one cluster to another using remote mirroring. As you
learned earlier in this course, MapR volumes allow you a greater degree of control over how to
manage data in the cluster. Mirroring the volumes that contain your business-critical data to a
remote cluster can significantly reduce the amount of key data you would lose and the time it would
take to resume productivity in the event of a disaster.

Lab 5.5: Using the HBase shell


The objective of this lab is to get you started with HBase shell and perform operations to create
a table, put data into the table, retrieve data from the table and delete data from the table.

Start HBase shell


1. Get a help listing which demonstrates some basic commands.
a. Get help specifically on the "put" command.
2. Create a table called 'Blog' with the following schema: blog title, blog topic, author first
name, author last name. The blog title and topic must be grouped together as they will
be saved together and retrieved together. Author first and last name must also be
grouped together.
3. List the new table you created in its directory, to confirm it was created.
4. Insert the following data into the 'Blog' table.


Where Title and Topic are in column family "info" and First and Last are in column family "author":

ID   Title                                       Topic       First      Last
1    MapR M7 is Now Available on Amazon EMR      cloud       Diana      Truman
2    Enterprise Grade Solutions for HBase        highavail   Roopesh    Nair
3    A Comparison of NoSQL Database Platforms    nosql       Jonathan   Morgan

5. Count the number of rows. Make sure every row is printed to the screen as it is
counted.
6. Retrieve the entire record with ID '2'.
7. Retrieve only the title and topic for the record with ID '3'.
8. Change the last name of the author with title "A Comparison of NoSQL Database
Platforms".

Display the record to verify the change.

Display both the new and old value. Can you explain why both values are there?

9. Display all the records.


10. Display the title and last name of all the records.
11. Display the title and topic of the first two records.
12. Delete the record with title "Enterprise Grade Solutions for HBase".

Verify that the record was deleted by scanning all records, or

Try to select just that record.

13. Drop the table 'Blog'.


Create a Table using MapR Control System (MCS)

1. Connect to the MCS from a browser using the notes from the instructor. Log in with your
account.
2. Create a table called 'Blogtest' with the following schema: blog title, blog topic, author
first name, author last name. The blog title and topic must be grouped together as they
will be saved together and retrieved together. Author first and last name must also be
grouped together.
3. List the new table you created in its directory, to confirm it was created.
4. Insert some data and test. Also change the number of versions of cells you can keep
and test.

MapR Tables - Solutions

1. Use cases fit for MapR tables:

A data store comprised of petabytes of semi-structured data.

A data store that will be accessed by large numbers of client requests, for example
thousands of reads per second.

2. Use cases not fit for MapR tables:

Accessing normalized relational data with SQL

Full-text search

3. Columns may be created when data is inserted; they don't have to be defined up front.
MapR can scale up to very large numbers of columns per column family. However, the table
name and column families have to be defined before data is inserted.
4. In addition to using the "list" command in the HBase shell, you can use the standard Linux
"ls" to list all tables (and files) stored in a particular directory.


HBase Shell Solution

You can find the commands below in the file Lab1_hbase_shell_commands.txt.
1. Start an HBase shell in your command window:
user02@ip-10-196-89-226:~$ hbase shell
2. HBase help commands:
hbase> help
hbase> help "put"
3. Create a table /user/user01/Blog with column families info and author:
hbase> create '/user/user01/Blog', {NAME=>'info'},
{NAME=>'author'}
Since it was required that title and topic be grouped, they will be stored as columns that
belong to the 'info' column family, while 'first' and 'last' will belong to the 'author'
column family.
4. List the table:
hbase> list '/user/user01/'
5. Execute the following put statements to insert the records into Blog table:
hbase> put '/user/user01/Blog','1','info:title', 'MapR M7
is Now Available on Amazon EMR'
hbase> put '/user/user01/Blog','1','info:topic','cloud'
hbase> put '/user/user01/Blog','1','author:first','Diana'
hbase> put '/user/user01/Blog','1','author:last','Truman'
hbase> put '/user/user01/Blog','2','info:title',
'Enterprise Grade Solutions for HBase'
hbase> put '/user/user01/Blog','2','info:topic','highavail'
hbase> put '/user/user01/Blog','2','author:first','Roopesh'
hbase> put '/user/user01/Blog','2','author:last','Nair'
hbase> put '/user/user01/Blog','3','info:title', 'A
Comparison of NoSQL Database Platforms'
hbase> put '/user/user01/Blog','3','info:topic','nosql'
hbase> put
'/user/user01/Blog','3','author:first','Jonathan'
hbase> put '/user/user01/Blog','3','author:last','Morgan'
6. Count the number of rows of data inserted
hbase> count '/user/user01/Blog',INTERVAL=>1

7. Retrieve the entire record with ID 2


hbase> get '/user/user01/Blog','2'

8. Retrieve only the title and topic for the record with ID '3'.
hbase> get '/user/user01/Blog','3',
  {COLUMNS=>['info:title','info:topic']}

9. The record with title "A Comparison of NoSQL Database Platforms" has ID 3. To update
its value, execute a put operation with that ID.
To verify the put worked, select the record:
hbase> get '/user/user01/Blog','3',
{COLUMNS=>'author:last'}
To display both version specify the number of versions in a get operation:
hbase> get '/user/user01/Blog','3',
{COLUMNS=>'author:last', VERSIONS=>3}
The reason we see the old value is cells have up to three versions by default in MapR
tables.
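If you want to retain more history, the column family can be altered; a sketch using standard
HBase shell syntax (5 versions is an arbitrary choice):
hbase> alter '/user/user01/Blog', {NAME=>'author', VERSIONS=>5}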
10. Display all the records.
hbase> scan '/user/user01/Blog'


11. Display the title and last name of all the records.
hbase> scan
'/user/user01/Blog',{COLUMNS=>['info:title','author:last']}

12. Display the title and topic of the first two records.
hbase> scan '/user/user01/Blog',
{COLUMNS=>['info:title','info:topic'],LIMIT=>2}
13. The record with title "Enterprise Grade Solutions for HBase" has record ID '2'; delete all
columns for the record with ID '2':
hbase> delete '/user/user01/Blog','2','info:title'
hbase> delete '/user/user01/Blog','2','info:topic'
hbase> delete '/user/user01/Blog','2','author:first'
hbase> delete '/user/user01/Blog','2','author:last'
14. To delete a table in HBase shell, the table must first be disabled, and then you can drop
it.
hbase> disable '/user/user01/Blog'
hbase> drop '/user/user01/Blog'
Troubleshooting

NameError: undefined local variable or method `interval' for #<Object:0x6ec12f3>

This happens for: hbase> count '/user/user02/Blog', interval=>1
Use uppercase INTERVAL. Example: hbase> count '/user/user02/Blog', INTERVAL=>1


HBase shell commands (optional)

The objective of this optional lab is to run scripts from your HBase shell. These commands
can be run individually in an HBase shell, or they can be pasted into a script and run. Example:
hbase> source "hbase_script.txt"
1. Open a vi session and insert the following into your script.
2. Adjust all references to the home directory to the appropriate directory.
3. Name your script.
4. Run your script.
Additional commands to experiment with

# NOTE: You can copy-paste multiple lines at a time
#       into HBase shell. Or, you can source a script.
#       Example: hbase> source "hbase_script.txt"
# Background information on HBase Shell at:
# http://wiki.apache.org/hadoop/Hbase/Shell
##########################################################
# Solution to Lab 1
# NOTE: Change the table paths to your own user directory
#       so your actions don't conflict with other students.
#       Example: create '/home/user12/atable', {NAME=>'cf1'}
##########################################################
help
help "put"
create '/home/user01/Blog', {NAME=>'info'}, {NAME=>'author'}
list '/home/user01/'

put '/home/user01/Blog','1','info:title', 'MapR M7 is Now Available on Amazon EMR'
put '/home/user01/Blog','1','info:topic','cloud'
put '/home/user01/Blog','1','author:first','Diana'
put '/home/user01/Blog','1','author:last','Truman'
put '/home/user01/Blog','2','info:title', 'Enterprise Grade
Solutions for HBase'
put '/home/user01/Blog','2','info:topic','highavail'
put '/home/user01/Blog','2','author:first','Roopesh'
put '/home/user01/Blog','2','author:last','Nair'
put '/home/user01/Blog','3','info:title', 'A Comparison of NoSQL
Database Platforms'
put '/home/user01/Blog','3','info:topic','nosql'
put '/home/user01/Blog','3','author:first','Jonathan'
put '/home/user01/Blog','3','author:last','Morgan'
count '/home/user01/Blog',INTERVAL=>1
get '/home/user01/Blog','2'
get '/home/user01/Blog','3',{COLUMNS=>['info:title','info:topic']}
put '/home/user01/Blog', '3','author:last','Smith'
get '/home/user01/Blog','3', {COLUMNS=>'author:last'}
get '/home/user01/Blog','3', {COLUMNS=>'author:last', VERSIONS=>3}
scan '/home/user01/Blog'
scan '/home/user01/Blog', {COLUMNS=>['info:title','author:last']}
scan '/home/user01/Blog', {COLUMNS=>['info:title','info:topic'],LIMIT=>2}
delete '/home/user01/Blog','2','info:title'
delete '/home/user01/Blog','2','info:topic'
delete '/home/user01/Blog','2','author:first'
delete '/home/user01/Blog','2','author:last'

#disable '/home/user01/Blog'
#drop '/home/user01/Blog'
##########################################################
# Additional commands to experiment with
# NOTE: You can copy-paste multiple lines at a time
#       into HBase shell. Or, you can source a script.
#       Example: hbase> source "hbase_script.txt"
##########################################################
# add content column-family to table
alter '/home/user01/Blog', {NAME=>'content'}
# insert row 1
put '/home/user01/Blog', 'Diana-001', 'info:title', 'MapR M7 is Now Available on Amazon EMR'
put '/home/user01/Blog', 'Diana-001', 'info:author', 'Diana'
put '/home/user01/Blog', 'Diana-001', 'info:date', '2013.05.06'
put '/home/user01/Blog', 'Diana-001', 'content:post', 'Lorem ipsum dolor sit amet, consectetur adipisicing elit'
# insert row 2
put '/home/user01/Blog', 'Diana-002', 'info:title', 'Implementing Timeouts with FutureTask'
put '/home/user01/Blog', 'Diana-002', 'info:author', 'Diana'
put '/home/user01/Blog', 'Diana-002', 'info:date', '2011.02.14'
put '/home/user01/Blog', 'Diana-002', 'content:post', 'Sed ut perspiciatis unde omnis iste natus error sit'
# insert row 3

put '/home/user01/Blog', 'Roopesh-003', 'info:title', 'Enterprise Grade Solutions for HBase'
put '/home/user01/Blog', 'Roopesh-003', 'info:author', 'Roopesh'
put '/home/user01/Blog', 'Roopesh-003', 'info:date', '2012.10.20'
put '/home/user01/Blog', 'Roopesh-003', 'content:post', 'At vero eos et accusamus et iusto odio dignissimos ducimus'
# insert row 4
put '/home/user01/Blog', 'Jonathan-004', 'info:title', 'A Comparison of NoSQL Database Platforms'
put '/home/user01/Blog', 'Jonathan-004', 'info:author', 'Jonathan'
put '/home/user01/Blog', 'Jonathan-004', 'info:date', '2013.01.08'
put '/home/user01/Blog', 'Jonathan-004', 'content:post', 'Duis aute irure dolor in reprehenderit in voluptate velit'
# insert row 5
put '/home/user01/Blog', 'Sylvia-005', 'info:title', 'NetBeans IDE 7.3.1 Introduces Java EE 7 Support'
put '/home/user01/Blog', 'Sylvia-005', 'info:author', 'Sylvia'
put '/home/user01/Blog', 'Sylvia-005', 'info:date', '2012.07.20'
put '/home/user01/Blog', 'Sylvia-005', 'content:post', 'Excepteur sint occaecat cupidatat non proident, sunt in culpa'
# count the data you inserted above; INTERVAL specifies how often counts are displayed
count '/home/user01/Blog', {INTERVAL=>2}
count '/home/user01/Blog', {INTERVAL=>1}
# this get won't return anything as the rowkey doesn't exist
get '/home/user01/Blog', 'unknownRowKey'
# retrieve ALL columns for the provided rowkey
get '/home/user01/Blog', 'Jonathan-004'
# retrieve specific columns for the provided rowkey
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>['info:author','content:post']}
# retrieve data for specific columns and time-stamp
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>['info:author','content:post'], TIMESTAMP=>1326061625690}
# exercise different scan options
scan '/home/user01/Blog'
scan '/home/user01/Blog', {STOPROW=>'Sylvia'}
# note: STARTROW must sort before STOPROW, or the scan returns nothing
scan '/home/user01/Blog', {COLUMNS=>'info:title', STARTROW=>'Jonathan', STOPROW=>'Sylvia'}
# update the record a few times, then retrieve multiple versions
# only 3 versions are kept by default
put '/home/user01/Blog', 'Jonathan-004', 'info:date', '2012.01.09'
put '/home/user01/Blog', 'Jonathan-004', 'info:date', '2012.01.10'
put '/home/user01/Blog', 'Jonathan-004', 'info:date', '2012.01.11'

get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date', VERSIONS=>3}
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date', VERSIONS=>2}
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date', VERSIONS=>1}
# selects 1 version by default
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date'}

# delete a record; delete removes all versions of the cell
get '/home/user01/Blog', 'Roopesh-003', 'info:date'
delete '/home/user01/Blog', 'Roopesh-003', 'info:date'
get '/home/user01/Blog', 'Roopesh-003', 'info:date'
# delete the versions before the provided timestamp
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date', VERSIONS=>3}
delete '/home/user01/Blog', 'Jonathan-004', 'info:date', 1326254739791
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date', VERSIONS=>3}
# drop the table
list '/home/user01/'
disable '/home/user01/Blog'
drop '/home/user01/Blog'
list '/home/user01/'

Using importtsv and copytable


The objective of this lab is to get you started with the HBase shell and perform operations to
create a table, import flat tab-separated data into the table, retrieve data from the table, and
delete data from the table.
View Existing Table Using MCS

1. Log onto the MCS.


2. Select MapR-FS > MapR Tables
3. Click /user/mapr/<sampletable> under Recently opened tables.
If /user/mapr/<sampletable> is not displayed under Recently opened tables, enter
/user/mapr/<sampletable> in the Go to table field and click the Go button.
4. Look at the information available in the Regions tab

Each row represents one region of data

The columns (Start Key, End Key, Physical Size, Logical size, etc.) represent
meaningful data about the table regions

Most of this information is read-only

Node hostnames are hyperlinks to the node details page for a given node

5. Click the Column Families tab to view column families

 Each row represents a unique column family

 Click the checkbox next to a column family to select it; you can then edit or delete it

6. Click the New Column Family button to see the options available for a new column
family (name, Max Versions, Min Versions, etc.)

 Click Cancel; we do not want to create a new column family at this time

Create a Table using MCS and import data using importtsv

1. Create a table called '/<data2>/<userX_voter_data_table>' with the following schema:

Column family = cf1

Column family = cf2

Column family = cf3

2. Observe the new table you have created in the UI or at the CLI (your choice)

 UI: highlight MapR Tables > Go to table and enter /<data2>/<userX_voter_data_table>

 CLI: cd /data2/ and type ls

3. Locate the sample data file (your instructor can point you to the right directory).
4. cd to the sample data directory.
5. Run the hadoop command with the importtsv option:
[root@CentOS001 data2]# su mapr -c "hadoop jar \
/opt/mapr/hbase/hbase-0.94.9/hbase-0.94.9-mapr-1308.jar \
importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,\
cf2:age,cf2:party,cf3:contribution_amount,cf3:voter_number \
/mapr/<Cluster_name>/<data2>/<userX_voter_data_table> \
/mapr/<Cluster_name>/data2/voter10M"


Note: the su mapr -c wrapper is the syntax required by MapR 3.1 permissions.

Notice the first column is defined as HBASE_ROW_KEY; this takes the first field of data
(namely the numerical index field) and makes it the row key.
Important: also notice that the command above identifies each column in the data file as well
as the column family it belongs in. The column families used in the example above are cf1, cf2,
and cf3. If the table you are importing into has different column family names, you will need to
modify the command to match.
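For reference, the general shape of an importtsv invocation looks like the following sketch; the paths and column names are placeholders, and -Dimporttsv.separator is only needed when the input is not tab-separated:
hadoop jar <hbase-jar> importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,<cf:col>[,<cf:col>...] \
  [-Dimporttsv.separator=,] \
  <output-table-path> <input-file-or-directory>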
6. While the import job is processing, look at the MCS to view changes to the table and
puts being processed on the node:

Click Nodes under Cluster

Click the Overview dropdown and change the value to Performance

If necessary, scroll to the right so you can see the Gets, Puts and Scans columns.

You should see a large number of puts across several nodes while your import is
processing

Click MapR Tables under MapR-FS

Click the name of the table you used for the import under Recently opened tables

Select the Regions tab

You should see that your table automatically split into a number of regions during
the import

7. In an HBase shell, examine the data that has been imported:

[root@CentOS001 data2]# hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.9-mapr-1307-SNAPSHOT, rcab8177f900b149296e367d79a28150c771958e6, Tue Aug 13:52:17 PDT 2013

Not all HBase shell commands are applicable to MapR tables. Consult MapR documentation
for the list of supported commands.
hbase(main):001:0> scan '/<data2>/<userX_voter_data_table>', LIMIT => 5



Sample output:

ROW      COLUMN+CELL
1        column=cf1:name, timestamp=1377028204508, value=david davidson
1        column=cf2:age, timestamp=1377028204508, value=49
1        column=cf2:party, timestamp=1377028204508, value=socialist
1        column=cf3:contribution_amount, timestamp=1377028204508, value=369.78
1        column=cf3:voter_number, timestamp=1377028204508, value=5108
10       column=cf1:name, timestamp=1377028204508, value=oscar xylophone
10       column=cf2:age, timestamp=1377028204508, value=74
10       column=cf2:party, timestamp=1377028204508, value=green
10       column=cf3:contribution_amount, timestamp=1377028204508, value=265.97
10       column=cf3:voter_number, timestamp=1377028204508, value=24970
100      column=cf1:name, timestamp=1377028204508, value=oscar carson
100      column=cf2:age, timestamp=1377028204508, value=77
100      column=cf2:party, timestamp=1377028204508, value=socialist
100      column=cf3:contribution_amount, timestamp=1377028204508, value=123.56
100      column=cf3:voter_number, timestamp=1377028204508, value=213
1000     column=cf1:name, timestamp=1377028204508, value=yuri brown
1000     column=cf2:age, timestamp=1377028204508, value=32
1000     column=cf2:party, timestamp=1377028204508, value=socialist
1000     column=cf3:contribution_amount, timestamp=1377028204508, value=847.50
1000     column=cf3:voter_number, timestamp=1377028204508, value=23520
10000    column=cf1:name, timestamp=1377028204508, value=ulysses zipper
10000    column=cf2:age, timestamp=1377028204508, value=33
10000    column=cf2:party, timestamp=1377028204508, value=libertarian
10000    column=cf3:contribution_amount, timestamp=1377028204508, value=866.72
10000    column=cf3:voter_number, timestamp=1377028204508, value=10729
5 row(s) in 0.0960 seconds

8. Exit the hbase shell


hbase(main):004:0> exit
[root@CentOS001 data2]#
9. In MCS, create another table called '/data2/<userX_voter_data_table>2' with the
following schema:

 Column family = cf1
 Column family = cf2
 Column family = cf3

Into this table we will import the same data, but create the HBASE_ROW_KEY from field
position 6.
hadoop jar /opt/mapr/hbase/hbase-0.94.9/hbase-0.94.9-mapr-1308.jar \
importtsv -Dimporttsv.columns=cf1:number,cf1:name,cf2:age,cf2:party,\
cf3:contribution_amount,HBASE_ROW_KEY \
/mapr/<Cluster_name>/data2/<userX_voter_data_table>2 \
/mapr/<Cluster_name>/data2/voter10M

10. Check the job progress in the MCS as you did in the previous step.
11. Enter an hbase shell

 Perform a scan operation:

hbase(main):001:0> scan '/data2/<userX_voter_data_table>2', LIMIT => 5

 Observe the output of the scan operation

Notice the row-key data is in a new position. This may or may not improve future scanning
operations. You now have a technique to import tab-separated data and to control the
definition/position of a ROW_KEY.
12. Create a presplit table (optional)

 hbase(main):002:0> create '/data1/presplit', {NAME => 'colFam', VERSIONS => 2, COMPRESSION => 'SNAPPY'}, {SPLITS => ['250000000','500000000','750000000']}

 Observe your newly created presplit table in the MCS or at the command line (a CLI sketch follows below).

 Rerun steps 9 and 10, substituting your newly created presplit table for the
destination table.
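One way to confirm the split points from the command line is to list the table's regions with maprcli; treat the exact subcommand as an assumption for your MapR release and check maprcli table -help if it differs:
maprcli table region list -path /data1/presplit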

Create a Table using MCS and copy data using copytable

1. In MCS, create another table called '/data2/<userX_voter_data_table>3' with the
following schema:
 Column family = cf1
 Column family = cf2
 Column family = cf3
2. At the system shell, type the following:
[mapr@CentOS001 data2]# hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
--new.name=/mapr/<Cluster_name>/data2/<userX_voter_data_table>3 \
/mapr/<Cluster_name>/data2/<userX_voter_data_table>2

This should generate a new job in your MCS
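CopyTable also accepts optional arguments that can be useful here; for example, to copy only selected column families or a time range of cells (the table paths and time values below are placeholders):
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --new.name=/mapr/<Cluster_name>/data2/<dest_table> \
  --families=cf1,cf2 \
  --starttime=<epoch-ms> --endtime=<epoch-ms> \
  /mapr/<Cluster_name>/data2/<source_table>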


Create a Table Using CLI

At the system prompt, type:


[mapr@CentOS001 data2]# maprcli table create -path /user/mapr/testtable

Note: Pick a unique name for your table when you create it.
Use the CLI to create a column family for the table you created in the previous step.
An example of the syntax is below, using the table created in the previous step and a column
family named cf1:

[mapr@CentOS001 data2]# maprcli table cf create -path /user/mapr/testtable -cfname cf1
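To verify the result, you can list the table's column families; the subcommand below is believed correct for this release, but confirm with maprcli table cf -help if it differs:
[mapr@CentOS001 data2]# maprcli table cf list -path /user/mapr/testtable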


Lesson 6: Monitor the Cluster


Managing your cluster through its lifetime involves understanding the key types of events that
will take place over time as well as knowing the relative frequency of these events. For
example, when a disk fails, it is important to understand the process for taking the node offline,
replacing the disk, and placing the node back online.
It is also important to understand what tools you can use to manage the services running on
your cluster overall as well as on individual nodes. The purpose of these labs is to familiarize you
with several key aspects of the lifecycle of a MapR cluster and some of the tools you will use to
manage services and nodes on your cluster.

Labs Overview

Lab 6.1: Setting Up email addresses to receive notifications on cluster health

Lab 6.2: Set up SMTP to configure the cluster to use your SMTP server to send mail

Lab 6.3: Metrics, Monitoring & Troubleshooting in MCS. Explore the various metrics
available through the MCS, then monitor and assess a set of MapReduce jobs

Lab 6.4: Managing Services: Practical Exercises. In this lab you will get practical
experience with managing services and nodes from MCS and CLI.

Lab 6.5: OPTIONAL - Decommissioning vs. Maintenance. In this lab you will experience
what happens when one or more nodes are moved out of the /data topology to
/offline before being decommissioned and contrast that behavior with what
happens when you temporarily shut down Warden on a node so you can maintain it.


Lab 6.1: Set up Email Addresses


This simple set of lab exercises introduces you to some common tools you can use to add email
addresses and setup SMTP to receive notifications on cluster health. Then, you will learn how
to use MapR metrics to monitor and assess MapReduce jobs.
1. To set up email addresses for cluster users, access the MCS and select System
Settings>Email as shown.

In the Configure Email Addresses dialog, you can specify whether MapR gets user email
addresses from an LDAP directory, or uses a company domain:
1. Use Company Domain - specify a domain to append after each username to
complete each user's email address.
2. Use LDAP - obtain each user's email address from an LDAP server.


Lab 6.2: Set up SMTP


Use the following procedure to configure the cluster to use your SMTP server to send mail:
1. In the MapR Control System, expand the System Settings Views group and click SMTP to
display the Configure Sending Email dialog.

2. Enter the information about how MapR will send mail:

 Provider: assists in filling out the fields if you use Gmail.
 SMTP Server: the SMTP server to use for sending mail.
 This server requires an encrypted connection (SSL): specifies an SSL connection to SMTP.
 SMTP Port: the SMTP port to use for sending mail.
 Full Name: the name MapR should use when sending email. Example: MapR Cluster
 Email Address: the email address MapR should use when sending email.
 Username: the username MapR should use when logging on to the SMTP server.
 SMTP Password: the password MapR should use when logging on to the SMTP server.

3. Click Test SMTP Connection. If there is a problem, check the fields to make sure the
SMTP information is correct.
4. Once the SMTP connection is successful, click Save to save the settings.
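If you prefer to script these settings, they can typically be read and written with maprcli config load and maprcli config save. The key names below are assumptions that vary by release; confirm them against the output of maprcli config load -json before saving anything:
# find the SMTP-related configuration keys (key names are assumptions)
maprcli config load -json | grep -i smtp
# save new values once you have confirmed the key names
maprcli config save -values '{"mapr.smtp.server":"smtp.example.com","mapr.smtp.port":"587"}'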

Lab 6.3: Metrics, Monitoring & Troubleshooting in MCS


The purpose of this lab is to gain experience with MapR metrics. You will explore the various
metrics available through the MCS, then monitor and assess a set of MapReduce jobs that will
be run during the lab.

Explore metrics options via the MCS

Use MCS metrics to monitor real-time MapReduce jobs in progress

Troubleshooting

Explore MapR metrics via the MCS


1. Log on to MCS using the ID provided by the instructor
2. Select Jobs under Cluster from the left column menu
You should see a histogram labeled Job Count by Job Duration. This histogram shows how
long previous jobs took to run. Each bar in the histogram represents job duration in a
particular range and the height of the bar indicates how many jobs had a similar duration.
Below the histogram is a table of all the jobs in the histogram.


3. Show just a few jobs by setting a filter


Hover the mouse over one of the smaller bars in the histogram and wait until a small menu
appears that offers you the ability to Filter or Zoom the histogram. Click on Filter.
Note: you can also just click on a bar in the graph as the default action is filter.

The table of jobs is narrowed down to just those with a duration in the range you selected.
You can see a filter setting above the histogram that expresses this limitation. The bar that
you used to filter the table is highlighted in yellow.

4. Remove the filter by clicking on the minus signs at the right of each filter expression.
5. Change the filter expression to be anything you like.
6. Try filtering on user name or job name.


7. Examine a particular set of jobs in more detail by clicking on Zoom instead of Filter.
The filter expression will be set as before but the histogram is limited to just those jobs
that match the filter. The horizontal axis of the histogram will be expanded
appropriately.
8. Explore a single job
The name of each job in the jobs table is highlighted in blue to indicate that it is an active link. If
you click on the name of a job, a new tab opens with information about that job. Whereas the
job metrics page has information about jobs, this new page has information about the tasks for
a single job. You can explore the tasks of a job just as you were able to previously explore all
jobs.
With tasks, you can control whether map tasks, reduce tasks or setup tasks are shown.
NOTE: One common thing to look for in a job is to see if there are map or reduce tasks that take
significantly longer to complete than others. To find such anomalous map tasks, display only
map tasks by checking the appropriate boxes above the histogram. For the job shown below,
there are 3 map tasks that took considerably longer than most tasks.

Isolate those tasks by adding a filter that only shows tasks with a duration > 3 seconds.
Common causes of slow tasks include a malfunctioning node (all or most of the tasks would be
on one node) or tasks that are reading data from other nodes rather than from a local replica
(the Local column in the table shows this for map tasks).


Use MCS metrics to monitor MapReduce jobs in progress


This lab shows you how to check jobs running on your cluster.
1. Use the MCS to monitor the progress of these jobs to answer the following questions:
a. How many jobs completed in less than 1 minute? How many jobs completed
between 1 and 2 minutes?
b. Did any jobs fail?
c. Click the name of a job to display information about the job's tasks. What do you
observe?
d. Which jobs had the longest duration? Shortest duration? Which had more map tasks?
e. Find out how to drill down to tasks and task attempts.

2. Compare different kinds of tasks.
a. Pick a job that has completed successfully.
b. Click on the job to see all the tasks.
c. Sort by task duration.
d. Do you see any patterns?
e. Try restricting your view to just map tasks. Then try restricting your view to just
reduce tasks. Do you note any differences?
f. Try showing both map and reduce tasks. Sort the task list by task duration and scroll
down the list. Do you see a pattern about the arrangement of map and reduce
tasks? Do you see a pattern between map tasks with local data or non-local data?

Troubleshoot Jobs
You can look for failed jobs by creating a filter. Go back to the Jobs page using the left
navigation panel. Reset any filters by making sure that the filters check box is unchecked, as
shown below:


Now check the filter check box again and add a filter to find jobs with Job Failed Task Count
greater than zero, as shown below:

If you click on a job that had failed tasks, you will see tasks with a variety of coded squares next
to them:


The green tasks completed successfully. The bright red tasks failed. The dark red tasks were
killed by the job tracker when the job could not be completed due to a persistent error.
You can find out more by clicking on the failed task. This will show you all of the attempts to run
this task. This should look something like this:

Clicking further on one of these task attempts takes you to a page that describes all that is
known about this task attempt. This includes all of the counters generated by this task, as well
as a link in the upper right of the page that allows you to get a stack trace from the failed
process:
Note: you may need to modify the URL by replacing the internal hostname with the external
hostname.


In this case, here is the stack trace for this task attempt. It looks very similar to the stack traces
for the other attempts.

In this case, this task died because your trainer removed permissions on the output directory. In
real life, the problems are not as simple as that. Often, the next task for you will be to look at
the source code of the program that failed.


Lab 6.4: Manage Services


Use MCS to see what services are running on your cluster

Connect to the MCS for your cluster.


Review cluster-wide view of services in the services pane

Figure 1: Services panel (right side of MCS Dashboard)

Note: Warden and ZooKeeper services are not displayed on the MCS.

Learn where active management services are running


Determine on which cluster nodes the active management services are running (CLDB,
JobTracker, etc.).
1. From the Services panel (displayed in Figure 1: Services panel right side of MCS
Dashboard) click on the name of any service to display the node(s) on which that service
is running. For example, click JobTracker to see information on the node running the
active JobTracker.


Figure 2: Nodes view showing services configured on and running on one node

Hint:
You can also click on the numbers in the various columns to view information about the
associated nodes. For example, if you wish to view information on the Standby JobTracker from
Figure 1, click on the number 1 in the Stby column for JobTracker.

Manage Node Services for a single node


1. From the Nodes view (displayed in Figure 2) click on the hostname of your team node to
display the node details view. This screen shows detailed information about the node
you selected.


Figure 3: Node details with Manage Node Services highlighted

Hint: You can also click on the icon for your team node in the heatmap to display the node
details.
2. The node details view contains a section called Manage Node services. Select a service
to display the available options at the bottom of the Manage Node Services pane.

Figure 4: Manage Node Services pane


Stop TaskTracker on a single node


Select the TaskTracker service on your node and click the Stop Service button. After stopping
this service you will see a change on this Manage Services pane for that node. You can also see
this information on the Services pane of the Dashboard.

Figure 5: Manage Node Services pane with TaskTracker stopped

Return to the Dashboard


Navigate back to the cluster dashboard so you can see the service you stopped in the previous
step in the Services pane. Notice that there is now one TaskTracker listed under the Stop
column.

Figure 6: Cluster-wide view of services with one TaskTracker stopped


Restart TaskTracker on your team node


1. Click the number 1 in the Stop column for TaskTracker.
2. Click the checkbox for the node
3. Click the Manage Services button

Figure 7: Manage services from Nodes view

4. On the TaskTracker dropdown select Start


5. Click the Change Node button

Figure 8: Manage Node Services dialog


Figure 9: Manage Node Services dialog

Hint: If you had selected more than one node in the previous step then any action you take
in the Manage Node Services dialog would affect all of those nodes simultaneously. The
hostname for each node would be displayed in the Nodes affected by service changes
section.

Notice that all TaskTrackers are once again running on the cluster

Figure 10: Services panel with all TaskTrackers running

Hint: Alternatively, you could have started the TaskTracker from the Manage Node Services
pane in the Node details view for your node.


Figure 11: Start TaskTracker from Manage Node Services pane in node details view

Compare services view of MCS versus jps


1. Log onto a cluster node via SSH
2. Switch to the root or mapr user
3. Run jps to determine what Java services are running on the node (an illustrative
session is sketched below)
4. Access the node details view (see Step 3/Figure 3) and compare the running services you
see in this view with the services displayed in the output of jps

 Are there any services displayed by jps that are not displayed in the MCS?

 Are there any services displayed in the MCS that are not displayed by jps?

 Can you explain the differences between these two methods for viewing services?
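For reference, a jps session might look like the following. The PIDs and the exact set of processes are illustrative only and depend on which services are installed on the node; treat the class names shown as assumptions:
# jps
2345 WardenMain
2456 CLDB
2567 JobTracker
2678 TaskTracker
2789 Jps
Note that the MapR FileServer (mfs) is a native process rather than a Java one, so it will not appear in jps output even when the MCS shows it running.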

Compare services view of MCS versus MapR CLI


1. Use the MapR CLI to view the services running on your node:
maprcli service list -node <hostname>
2. Access the node details view and compare the running services you see in this view
with the services displayed in the MapR CLI output (illustrative output is sketched below)
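Illustrative output; the hostname is a placeholder and the exact columns vary by release:
# maprcli service list -node node1.example.com
name          state  logpath                     displayname
cldb          2      /opt/mapr/logs/cldb.log     CLDB
fileserver    2      /opt/mapr/logs/mfs.log      FileServer
tasktracker   2      /opt/mapr/logs/...          TaskTracker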


Observe JobTracker failover


1. Stop the JobTracker service on the active JobTracker, then monitor the Services pane of
the Dashboard on the MCS and make note of the following:

 What happens when you stop the JobTracker service on the active JobTracker?

 How long does it take for the standby JobTracker to become the active
JobTracker?

2. Start the JobTracker service on the node where it was stopped.

How long does it take for the restarted JobTracker to appear in the Services pane of
the Dashboard?

Where does this service now appear in the Services pane of the Dashboard?

Lab 6.5: Decommissioning vs. Maintenance


In this optional activity you can observe the cluster behavior under two different scenarios:

Decommissioning
A node is placed into the /offline topology (a topology that has no volumes associated with it) so
that it can be removed from the cluster. The intention in this scenario is that the node will no
longer be used in the MapR cluster.

Maintenance
The Warden is shut down on a node so that it can be taken offline temporarily, possibly for
maintenance. The intention is that the Warden will be started once again in a relatively short
time (less than 1 hour) and that the node will return to the MapR cluster.
Both of these scenarios are likely to occur at some point in the lifecycle of a MapR cluster. The
frequency depends upon the size of the cluster and other factors such as disk failure rate, etc.
While these two scenarios may seem similar, the data container replication behavior is quite
different, as you will observe.


Identify and log onto the Master CLDB node


1. Log onto any cluster node via SSH
2. Switch to the root or mapr user
3. Determine which node is the master CLDB (illustrative output is sketched below):
maprcli node cldbmaster
4. Log onto the CLDB master node and switch to the root or mapr user
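The command reports the master CLDB by server ID and hostname; the values in this example are placeholders, and the exact output format may vary by release:
# maprcli node cldbmaster
cldbmaster
ServerID: 547819249997313015 HostName: node2.example.com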

Navigate to the MapR logs directory and monitor the CLDB log file
1. Navigate to /opt/mapr/logs
Notice that there are many log files in this location. This is where all MapR services
write their logs. You will also notice that some of the log files roll over periodically. For
example, you will see 10 files in the format warden.log.<date> in addition to the
current warden.log file. Each day a new log file is created, the previous day's log file
is renamed with the date, and the oldest log file is deleted. This helps to make the log
files manageable and keeps them from filling up too much disk space.
2. Begin to monitor the CLDB log file:
# tail -f cldb.log
3. Keep this window open so you can observe changes as they are recorded in the CLDB
log file (a filtered variation is sketched below)
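To focus on replication activity during the scenarios that follow, you can filter the stream; the grep pattern here is just a starting point, since the exact log wording varies by release:
# tail -f cldb.log | grep -i -E 'replic|container'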

Monitor the MCS


1. Log on to the MCS for your cluster and keep an eye out for changes such as alarms.
2. Step through the decommissioning scenario by moving a node into the /offline
topology (a CLI sketch follows the questions below). Continue to monitor the CLDB log
file and pay special attention to replication events.

 How long does it take for container replication to begin?

 Do you see any alarms in the MCS?
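From the CLI, moving a node between topologies can be done with maprcli node move; the server ID comes from maprcli node list (the ID below is a placeholder):
# find the server ID of the node to decommission (the 'id' field)
maprcli node list -columns id,hostname
# move the node into the /offline topology
maprcli node move -serverids <server-id> -topology /offline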

3. Step through the servicing scenario by stopping Warden on a node that hosts data containers (the commands are sketched after the questions below).

How long does it take for container replication to begin?


Do you see any alarms in the MCS?

How are the entries in the CLDB log file different from the decommissioning scenario?
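For the maintenance scenario, Warden is typically stopped and later restarted with the service wrapper, run as root on the node being maintained:
# on the node under maintenance
service mapr-warden stop
# ... perform maintenance, then bring the node back ...
service mapr-warden start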
