
MapR Upgrade Documentation

Version 3.0.2 to 4.0.2


David Schexnayder
MapR Data Engineer
Revision 1.2
3/05/15
Introduction
Scope
Software Support Matrices
  1. MapR JDK Support Matrix
  2. MapR Ecosystem Support for 3.0.X and 4.0.X
  3. MapR Ecosystem Upgrade Version Information
  4. Additional Ecosystem and MapR Partner Components
Upgrade Outline
  1. Upgrade to Redhat Linux 6.6
  2. Upgrade to Java 1.7.x
  3. Upgrade the M7 Cluster to 4.0.2
  4. Validate non-hadoop Components in 3.x Clients
  5. Resolve 3.x Client Issues
  6. Upgrade M5 Cluster to 4.0.2
  7. Configure MR1/MR2 Ratio to 80/20
  8. Upgrade Special non-hadoop Clients to 4.x Based on Certification
  9. Test Amex Application Code
Upgrade Components and Configuration
  1. Hadoop Mapreduce Configuration
  2. Fair Scheduler Configuration (YARN)
  3. Mapreduce v1 API Changes for 4.0.X
  4. Submitting Jobs to MR1 or MR2
  5. Edge Node Configuration
  6. Services Configuration and Ports Used
  7. MapR-DB (M7) Features and HBase API
     Enabling Table Authorization with Access Control Expressions
     Bulk Loading and MapR Tables
     HBase 0.98 API
  8. Apache Hive
  9. Apache Pig
  10. Apache Flume
  11. Apache Sqoop
  12. Apache Mahout
  13. Apache Oozie
  14. Apache HBase
  15. Apache Cascading
  16. Apache Whirr
  17. Apache Zookeeper
  18. Apache Storm
Amex Lab Upgrade Plan
  (Week 1: 3/2 - 3/6) Build New Sandbox Cluster
    Build StackIQ template for cluster
    Deploy Cluster
    Bring the Cluster Online and Apply License
  (Week 2: 3/9 - 3/13) Bring the Cluster Online, Operationalize, and Transfer Data
    Operationalize the Cluster
    Verify cluster operation
    Run Terry's Benchmark suite
    Stage Actual Data and Tables
  (Week 3: 3/16 - 3/20) Perform Upgrade and Develop Configuration
    Halt Jobs
    Stop Cluster Services
    Stop Ecosystem Component Services
    Stop MapR core services
    Backup the /opt/mapr/roles Directory
    Upgrade Packages and Configuration Files
    Restore roles files
    Install YARN Components (ResourceManager, NodeManager, HistoryServer)
    Verify that packages installed successfully on all nodes
    Update the warden configuration files located in /opt/mapr/conf/ and /opt/mapr/conf/conf.d
    Upgrade existing ecosystem components
    Edit the sethadoopenv.sh on edge nodes (as needed)
    Start the cluster and enable new features
    The same tests run in week 2 should be run again to validate functionality in both MR1 and MR2


Introduction
This document provides a step-by-step procedure to upgrade the MapR base packages from 3.0.2 to
4.0.2 and to update the ecosystem components. It is intended for validation on the AXP sandbox cluster
and will then be adapted for use in additional environments. With MapR v4.X, the cluster can run
hadoop1 (classic) and hadoop2 (YARN) workloads simultaneously. The goal is to provide a cluster with
80% of capacity initially dedicated to hadoop1 (classic) and 20% to hadoop2 (YARN). Over time, the
capacity can be shifted incrementally toward a higher percentage for hadoop2 (YARN). American
Express currently has 3 pilot clusters running MapR v4.0.2, and many of the configuration details in
this document were developed as part of that pilot program.

Scope
This document covers the MapR base packages and the ecosystem packages. It also covers the
configuration file updates and the hadoop1 / hadoop2 configuration needed on the edge nodes.

Software Support Matrices


Since MapR v4.X can run hadoop1 (classic) and hadoop2 (YARN) workloads simultaneously, MapR now
releases ecosystem packages with both hadoop1 and hadoop2 capabilities. The goal of this upgrade is
to maintain compatibility with existing workloads. Existing ecosystem components will need to be
uninstalled and replaced with the updated MapR v4.X versions.
1. MapR JDK Support Matrix
Below are the supported versions of Java per MapR release:

JDK / MapR Version | MapR 2.x | MapR 3.0.x | MapR 3.1.x | MapR 4.0.x
JDK 6 | Yes | Yes | Yes | No
JDK 7 | No | Yes | Yes | Yes
JDK 8 | No | No | No | Yes

2. MapR Ecosystem Support for 3.0.X and 4.0.X


Below is the current MapR ecosystem support matrix for MapR v3.0.X and v4.0.X:

Ecosystem Component | Version | MapR 3.0.x | MapR 4.0.2 (MapReduce v1 mode) | MapR 4.0.2 (YARN mode)
Apache Hive | 0.10 | Yes | No | No
Apache Hive | 0.11 | Yes | No | No
Apache Hive | 0.12 | Yes | Yes | Yes
Apache Hive | 0.13 | Yes | Yes | Yes
Apache Spark | 0.9.1 | Yes | No | No
Apache Spark | 0.9.2 | Yes | No | No
Apache Spark | 1.0.2 | Yes | Yes | Yes
Apache Spark | 1.1.0 | No | Yes | Yes
Impala | 1.1.1 | Yes | No | No
Impala | 1.2.3 | Yes | Yes | No
Impala | 1.4.1 | No | Yes | No
Apache Pig | 0.10 | Yes | Yes | No
Apache Pig | 0.11 | Yes | No | No
Apache Pig | 0.12 | Yes | Yes | Yes
Apache Pig | 0.13 | No | Yes | Yes
Apache Flume | 1.3.1 | Yes | No | No
Apache Flume | 1.4.0 | Yes | No | N/A
Apache Flume | 1.5 | No | Yes | N/A
Apache Sqoop | 1.4.3 | Yes | No | No
Apache Sqoop | 1.4.4 | Yes | Yes | Yes
Apache Sqoop | 1.4.5 | Yes | Yes | Yes
Apache Sqoop2 | 1.99.0 | Yes | Yes | Yes
Apache Mahout | 0.7 | Yes | No | No
Apache Mahout | 0.8 | Yes | No | No
Apache Mahout | 0.9 | No | Yes | Yes
Apache Oozie | 3.3.2 | Yes | No | No
Apache Oozie | 4.0.0 | Yes | No | No
Apache Oozie | 4.0.1 | No | Yes | Yes
Hue | 2.5 (Beta only) | Yes | No | No
Hue | 3.5 | Yes | No | No
Hue | 3.6 | No | Yes | Yes
Apache HBase | 0.92.2 | No | No | No
Apache HBase | 0.94.17 | Yes | No | No
Apache HBase | 0.94.21 | Yes | No | No
Apache HBase | 0.98.4 | No | Yes | Yes
Apache HBase | 0.98.7 | No | Yes | Yes
Apache Drill | 0.5 | No | Yes | N/A
Apache Drill | 0.6 | No | Yes | N/A
Apache Drill | 0.6R2 | No | Yes | N/A
Apache Drill | 0.7 | No | Yes | N/A
Asynchbase | 1.4.1 | Yes | No | No
Asynchbase | 1.5 | No | Yes | Yes
Cascading | 2.1.6 | Yes | No | No
Cascading | 2.5 | No | Yes | Yes
Whirr | 0.8.1 | No | No | No
HTTPFS | - | Yes | Yes | N/A
Apache Tez (Developer Preview) | 0.4 | N/A | N/A | Yes
MapReduce | 1.0.3 | Yes | Yes | N/A
MapReduce | 2.5.1 | N/A | N/A | Yes
Storm | 0.9.3 | N/A | Yes | N/A
Sentry | 1.4.0 | No | Yes | Yes

3. MapR Ecosystem Upgrade Version Information


Below is the ecosystem upgrade matrix with the current Amex version and the proposed upgrade
version. Packages highlighted in green maintain their current project release. Packages highlighted in
yellow do not have supported versions in both 3.0.X and 4.0.X.
Note: All ecosystem components will require an uninstall and reinstall following the core upgrade.
Ecosystem Component | Version | Amex Current Version | Proposed Upgrade Version
Apache Hive | 0.12 | 0.12.23716 | 0.12.201502021326
Apache Pig | 0.12 | 0.12.23716 | 0.12.27259
Apache Flume | 1.4.0 / 1.5.0 | 1.4.0.23547 | 1.5.0.201501191849
Apache Sqoop | 1.4.4 | 1.4.4.22554 | 1.4.4.201411051136
Apache Mahout | 0.7 / 0.9 | 0.7.22084 | 0.9.201409041745
Apache Oozie | 3.3.2 / 4.0.1 | 3.3.2.23554 | 4.0.1.201501231601
Apache HBase | 0.94.13 / 0.98.7 | 0.94.13.23554 | 0.98.7.201501291259
Apache Cascading | 2.1 / 2.5 | 2.1.20130606 | 2.5
Apache Whirr | 0.8.1 | 0.8.1.18380 | NA
Apache Zookeeper | 3.3.6 / 3.4.5 | 3.3.6 | 3.4.5
Apache Storm | 0.9.3 | 0.9.3 (on MapR 3.1.1) | 0.9.3 (no YARN integration)

4. Additional Ecosystem and MapR Partner Components


Below are the additional ecosystem components supported but not part of the MapR distribution.
Packages highlighted in Green will maintain current project release. Packages highlighted in Yellow do
not have supported versions in both 3.0.X and 4.0.X.
Note: Individual components may require reconfiguration following the upgrade.
Ecosystem Component | Amex Current Version | Proposed Upgrade Version
Oracle Java | 1.6.0_33 | 1.7.0_67 or later
MySQL Server | 5.6.1 | 5.6.1 (unchanged)
LWS-Solr | 2.6.3 | 2.6.3 (unchanged)
Elastic Search | 1.2.1 | 1.2.1 (unchanged)
Kognitio | 8.01.00-rel141029 | Should work with ODBC
Memcached | 1.4.4-3.el6 | Not certified, but should work
Talend | 5.4.1 | 5.6.1
Platfora | 4.0.3 | 4.1.X
Datameer | 4.5.6 | Under certification
Revolution R | 7.3.0 | Under certification
Dataguise | 4.4.2.10 / 4.4.2.11 | Should work with 4.0.X
Python | 2.6.6 | 2.6.6 (unchanged)
Redhat Linux | 6.3 | Up to 6.6

Upgrade Outline
The high-level plan below describes the order in which clusters and cluster infrastructure should be upgraded. Specific
details and configuration are found later in the document.
1. Upgrade to Redhat Linux 6.6
Linux should be upgraded prior to the upgrade of any MapR components. Any issues related to the
Linux upgrade should be resolved prior to the MapR upgrade. The upgrade consists of updating the
Linux packages in StackIQ and running yum upgrade. This can be performed in a rolling fashion on a
subset of nodes and requires a reboot once complete. The node can rejoin the cluster and run
workloads as usual.
Note: Any vendor specific drivers (Cisco UCS, IBM 3650) should also be updated as part of the process.
2. Upgrade to Java 1.7.x
Once all cluster nodes are running Redhat 6.6, Oracle Java 1.7.0_67 (or later) can be installed. The
package is released as an RPM, so it can be installed via StackIQ. Once installed, the JAVA_HOME
variable should be set via /opt/mapr/conf/env.sh or /etc/alternatives. The upgrade does not require a
reboot, but it does require a restart of MapR services (zookeeper and warden). Existing code should be
tested for compatibility.
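
As a minimal sketch of this step on a single node (assuming the Oracle JDK RPM installs under
/usr/java/jdk1.7.0_67; adjust the path to the actual install location):

Add to /opt/mapr/conf/env.sh:
export JAVA_HOME=/usr/java/jdk1.7.0_67

Then restart the MapR services so they pick up the new JVM:
$> sudo service mapr-zookeeper restart
$> sudo service mapr-warden restart
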
3. Upgrade the M7 Cluster to 4.0.2
The M7 clusters per environment (Gold, Platinum) should be upgraded prior to the analytical (M5)
clusters. Since the HBase API will be upgraded from 0.94.13 to 0.98.7, existing tables should be
backed up on the M7 clusters prior to the upgrade. The backup can be performed either by using the
standard HBase mapreduce jobs (CopyTable or Export) or by mirroring volumes containing M7 tables.
The backup should be stored on another cluster. Ecosystem components not in use should be
removed.
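
As an illustration of the backup step (a sketch only; the table and destination paths are hypothetical,
and it assumes the backup cluster is mounted under /mapr/<backup_cluster>), the CopyTable
MapReduce job described later in this document can copy a table to the other cluster:

$> hbase com.mapr.fs.hbase.mapreduce.CopyTable -src /user/apps/mytable -dst /mapr/backupcluster/backups/mytable
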
4. Validate non-hadoop Components in 3.x Clients
All client systems and services accessing the M7 cluster and tables should be validated. The client
code from 3.X clients should remain compatible with the 4.0.2 cluster and tables, but issues may arise
and client code may require a recompile. If required, upgrade client code and recompile applications
as necessary.
5. Resolve 3.x Client Issues
All issues related to accessing the M7 cluster and tables should be resolved prior to proceeding to
additional clusters. This may include upgrading MapR client code, upgrading ecosystem components
(such as HBase API), and recompiling client code. Upgrades or code changes should only be necessary
to resolve client access issues or specific performance issues. Once all upgrades are complete, an
additional step can be taken to upgrade all clients to a uniform code base.
6. Upgrade M5 Cluster to 4.0.2
The detailed procedure for upgrading the cluster and components is found later in the document. Roll
forward any lessons learned from upgrading the M7 cluster and its attached clients.
7. Configure MR1/MR2 Ratio to 80/20
Initially, the mix between MR1 (classic) and MR2 (YARN) should be set to 80/20 (80% of capacity by
CPU/memory/disk dedicated to running MR1). The ratio is set per node in warden.conf and should be
managed by StackIQ. Changes to warden.conf require a restart of the NodeManager and
TaskTracker services.
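
For example, after StackIQ pushes the new warden.conf, the affected services could be restarted with
maprcli roughly as follows (a sketch; verify the exact maprcli node services syntax against the
installed release, and substitute real hostnames):

$> maprcli node services -name nodemanager -action restart -nodes node01 node02
$> maprcli node services -name tasktracker -action restart -nodes node01 node02
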
8. Upgrade Special non-hadoop Clients to 4.x Based on Certification
The remaining Additional Ecosystem and MapR Partner Components should be upgraded as
necessary. Some may simply require configuration path updates. The components should be
upgraded and validated.
9. Test Amex Application Code
Test existing mapreduce code on both MR1 and MR2 for compatibility. A test job per component should
be performed along with actual client code. Any failures should be resolved prior to completion of the
upgrade.

Upgrade Components and Configuration


1. Hadoop Mapreduce Configuration
The cluster should be configured to run MR1 (classic) and MR2 (YARN) simultaneously, with 80% of
capacity dedicated to MR1. The configuration is set on a per-node basis. All nodes should initially have
hadoop_version set to classic (MR1):
Config file
/opt/mapr/conf/hadoop_version:



classic_version=0.20.2
yarn_version=2.5.1
default_mode=classic

To configure the CPU / Memory / Disk for 80% MR1, set the following in warden.conf
Config file
/opt/mapr/conf/warden.conf:
mr1.memory.percent=80
mr1.cpu.percent=80
mr1.disk.percent=80

Note: The MR1 memory is allocated from what remains after all other service heap space is reserved
(fileserver, NodeManager, TaskTracker, HBase RegionServer); the percentage applies to that remaining
memory. On M7 clusters, 4 CPUs are reserved for the fileserver; on M5 clusters, 2 CPUs are reserved.
Jobs can be submitted to either MR1 or MR2. When default_mode is set to classic in
/opt/mapr/conf/hadoop_version, the hadoop command points to /opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop.
The hadoop1 and hadoop2 commands are also available, and point to MR1 and MR2 respectively.
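
A quick way to confirm which stack each command resolves to on a node (a sketch; output will vary
by node configuration):

$> hadoop version      # follows default_mode in /opt/mapr/conf/hadoop_version
$> hadoop1 version     # always reports the MR1 (hadoop-0.20.2) stack
$> hadoop2 version     # always reports the MR2 (hadoop-2.5.1) stack
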
2. Fair Scheduler Configuration (YARN)
The fair scheduler for YARN is configured differently than the JobTracker fair scheduler. The JobTracker
fair scheduler configuration should remain the same. To enable the fair scheduler for YARN:
Config file
/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/yarn-site.xml:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>fairscheduler.xml</value>
</property>
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.fair.allow-undeclared-pools</name>
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <value>false</value>
</property>

Notes:
The queues are configured in /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/fair-scheduler.xml
Preemption is enabled
We do not allow undeclared pools (all queues must be defined)
We do not allow the user as the default queue (users must specify a queue; the user's name will not be
used)



The queues are defined in the /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/fair-scheduler.xml file.
Minimum queue resources are defined. The queues are hierarchical, with root required at the top.
Sub-queues can be created for projects under business units if desired.
Config file
/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/fair-scheduler.xml
<allocations>
  <queuePlacementPolicy>
    <rule name="specified" create="false"/>
    <rule name="reject"/>
  </queuePlacementPolicy>
  <queue name="root">
    <schedulingPolicy>fair</schedulingPolicy>
    <minResources>1000 mb, 0 vcores, 1 disks</minResources>
    <maxRunningApps>1</maxRunningApps>
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
    <weight>1.0</weight>
    <aclSubmitApps>root</aclSubmitApps>
    <aclAdministerApps>root</aclAdministerApps>
    <queue name="default">
      <schedulingPolicy>fair</schedulingPolicy>
      <minResources>1000 mb, 0 vcores, 1 disks</minResources>
      <maxRunningApps>5</maxRunningApps>
      <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
      <weight>1.0</weight>
      <aclSubmitApps>root</aclSubmitApps>
      <aclAdministerApps>root</aclAdministerApps>
    </queue>
    <queue name="myqueue">
      <schedulingPolicy>fair</schedulingPolicy>
      <minResources>1000 mb, 0 vcores, 1 disks</minResources>
      <maxRunningApps>10</maxRunningApps>
      <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
      <weight>1.0</weight>
      <aclSubmitApps>mapr mygroup</aclSubmitApps>
      <aclAdministerApps>root</aclAdministerApps>
    </queue>
  </queue>
</allocations>

Notes:
The root queue is necessary
Users in the group mygroup can submit to myqueue
The default queue will only allow root to submit, but additional configuration is required to allow root
to submit mapreduce jobs
3. Mapreduce v1 API Changes for 4.0.X
Existing compiled MapReduce V1 applications may need to be recompiled before they can be run as
MapReduce V1 applications in MapR Version 4.0.x. The small number of API changes that have been
made, including removal of classes and methods and conversion of classes to interfaces, are
documented here. If your application does not use any of the changes listed in this document, you do
not need to recompile the application.
When an application has been compiled against MapReduce V1 or MapReduce V2 (YARN) in MapR 4.0.x,
the application can be run in either mode.
The following list of changes is grouped by package name:
org.apache.hadoop.mapred.jobcontrol
  Job
    - Now extends ControlledJob.
    - getMapredJobID: return type changed from String to JobID.
  JobControl
    - Now extends org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.
    - String addJob(ControlledJob) instead of String addJob(Job), where Job extends ControlledJob.

org.apache.hadoop.mapred
  JobContext
    - Changed from class to interface.
  JobInProgress
    - All methods removed. The Counter class remains for backward compatibility.
  JobEndNotifier
    - Removed methods: void registerNotification(JobConf, JobStatus), void startNotifier(), void stopNotifier().
  Operation
    - QueueACL qACLNeeded: type changed from org.apache.hadoop.mapred.QueueManager.QueueACL to org.apache.hadoop.mapred.QueueACL.
  TaskAttemptContext
    - Changed from class to interface.
    - progress method removed.
  Counters
    - Extends AbstractCounters, and some methods have been type-parameterized using generics. This change breaks binary compatibility.
  TaskLog
    - Removed methods: captureDebugOut, captureOutAndError, getJobDir, getTaskLogFile, getUserLogDir.
  TaskStatus
    - Removed methods: boolean getIncludeCounters(), TaskLogFS getTaskLogFs(), void setIncludeCounters(boolean), void setTaskLogFs(TaskLogFS).
  TaskUmbilicalProtocol
    - Signature change for several methods: the JvmContext argument has been removed.
  Utils
    - getHttpScheme removed.
  ClusterStatus
    - getMaxPrefetchMapTasks() removed.
  JobClient
    - New exception thrown.
  Task
    - Change in exceptions: java.lang.ClassNotFoundException was removed; java.lang.InterruptedException was added.
    - Change of visibility from protected to public.

org.apache.hadoop.mapreduce
  Counter
    - Changed from class to interface.
    - Removed: boolean equals(Object), int hashCode(), void readFields(DataInput) (read the binary representation of the counter), void write(DataOutput).
  CounterGroup
    - Changed from class to interface.
    - Removed: boolean equals(Object), Counter findCounter(String, String) (internal, finds a counter in a group), Counter findCounter(String), String getDisplayName() (display name of the group), String getName() (internal name of the group), int hashCode(), void incrAllCounters(CounterGroup), Iterator<Counter> iterator(), void readFields(DataInput), int size() (number of counters in the group), void write(DataOutput).
  Counters
    - Extends AbstractCounters, and some methods have been type-parameterized using generics. This breaks binary compatibility.
  Job
    - Removed: TaskCompletionEventList getTaskCompletionEventList(int), void setUserClassesTakesPrecedence(boolean).
  JobContext
    - Changed from class to interface.
    - Removed: boolean userClassesTakesPrecedence().
  JobSubmissionFiles
    - getStagingDir: signature changed from (JobClient, Configuration) to (Cluster, Configuration).
  MapContext
    - Changed from class to interface.
  Mapper.Context
    - Changed from non-abstract to abstract.
  ReduceContext
    - Changed from class to interface.
  ReduceContext.ValueIterator
    - Changed from class to interface.
  Reducer.Context
    - Changed from non-abstract to abstract.
  TaskAttemptContext
    - Changed from class to interface.
    - Removed: progress.
  TaskInputOutputContext
    - Changed from class to interface.

org.apache.hadoop.mapreduce.lib.output
  FileOutputCommitter
    - abortTask: change in exceptions thrown, from no exceptions to java.io.IOException.

org.apache.hadoop.mapreduce.security.token
  DelegationTokenRenewal
    - Removed class.

org.apache.hadoop.mapreduce.util
  ProcessTree
    - Change in signature for some methods (removed the signal argument; now just takes pid): killProcess, killProcessGroup, etc.
4. Submitting Jobs to MR1 or MR2
If no recompilation of application code is necessary, jobs can be submitted to either MR1 or MR2.
Note: -Dmapred.job.queue.name=<queue> is deprecated in MR2 (YARN) but will still function.
MR1 job:
$> hadoop1 jar <path_to_jar_file> -Dmapred.job.queue.name=<queue>

MR2 job:
$> hadoop2 jar <path_to_jar_file> -Dmapreduce.job.queuename=<queue>

Note: Even though the MR2 (YARN) fair scheduler has hierarchical queues, the short queue name can be
used (i.e., myqueue instead of root.myqueue).
5. Edge Node Configuration
The edge nodes can be configured with either MR1 (classic) or MR2 (YARN) as the default. The edge
node services (HiveServer2, Oozie, etc.) will need to start with one or the other, so choices need to be
made on how best to run these services. For example, one edge node can be configured for MR1 and
another for MR2. A more complicated approach is to run multiple instances of the same service on
different ports; for example, HiveServer2 could run an MR1 instance on the standard port 10000 and an
MR2 instance on a separate port 10001. This should be tested for each service (HiveServer2, Oozie,
etc.) prior to rolling out.
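
A sketch of how two HiveServer2 instances could be started on one edge node (this assumes the
MapR 4.x client honors the MAPR_MAPREDUCE_MODE environment variable; the ports and the approach
itself should be validated before rollout):

$> MAPR_MAPREDUCE_MODE=classic hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 &
$> MAPR_MAPREDUCE_MODE=yarn hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10001 &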



6. Services Configuration and Ports Used
The existing Admin / CLDB nodes should also be configured to run the ResourceManager (analogous to
the JobTracker). There is also a MapReduce JobHistory Server, which is currently non-HA and runs on a
single node. In the event of a failure of the JobHistory Server node, re-run configure.sh and move the
HistoryServer to a new node; the files it needs exist on MapR-FS, and the new HistoryServer will read
that information.
Existing data nodes will need the NodeManager service installed.
configure.sh should be run on all nodes following the upgrade to set the hadoop version and identify
the HistoryServer address:
$> sudo clush -a /opt/mapr/server/configure.sh -R -N <cluster_name> -C <cldb_server_list> -Z <zookeeper_server_list> -HS <history_server>

Additional ports will need to be opened in order to interoperate with the ResourceManager, the
NodeManagers, the ApplicationMasters, and the HistoryServer. The ResourceManager port (8088) will
only be active on a single server at a time. An ApplicationMaster is assigned per application and can
run on any NodeManager in the cluster.
Here is the complete port listing for MapR v4.0.2 with default port numbers:
Service | Port
CLDB | 7222
CLDB JMX monitor port | 7220
CLDB web port | 7221
DNS | 53
HBase Master | 60000
HBase Master (for GUI) | 60010
HBase RegionServer | 60020
HBase Thrift Server | 9090
HistoryServer RPC | 10020
HistoryServer Web UI and REST APIs (HTTP) | 19888
Hive Metastore | 9083
Hiveserver2 | 10000
Httpfs | 14000
Hue Beeswax | 8002
Hue Webserver | 8888
Impala Catalog Daemon | 25020
Impala Daemon | 21000
Impala Daemon | 21050
Impala Daemon | 25000
Impala StateStore Daemon | 25010
JobTracker | 9001
JobTracker web | 50030
LDAP | 389
LDAPS | 636
Metrics RPC activity | 1111
MFS server | 5660
MySQL | 3306
NFS | 2049
NFS monitor (for HA) | 9997
NFS management | 9998
NFS VIP service | 9997 and 9998
NodeManager | 8041
NodeManager Localizer RPC | 8040
NodeManager Web UI and REST APIs (HTTP) | 8042
NTP | 123
Oozie | 11000
Port mapper | 111
ResourceManager Admin RPC | 8033
ResourceManager Client RPC | 8032
ResourceManager Resource Tracker RPC (for NodeManagers) | 8031
ResourceManager Scheduler RPC (for ApplicationMasters) | 8030
ResourceManager Web UI (HTTP) | 8088
Secure HistoryServer Web UI and REST APIs (HTTPS) | 19890
Secure NodeManager Web UI and REST APIs (HTTPS) | 8044
Secure ResourceManager Web UI (HTTPS) | 8090
Shuffle HTTP | 13562
SMTP | 25
Sqoop2 Server | 12000
SSH | 22
TaskTracker web | 50060
Web UI HTTPS | 8443
Web UI HTTP (off by default) | 8080
ZooKeeper | 5181
ZooKeeper follower-to-leader communication | 2888
ZooKeeper leader election | 3888

7. MapR-DB (M7) Features and HBase API

MapR-DB (M7) in v4.0.2 offers two key new features, which provide additional security and faster
loading: table security using Boolean access control expressions down to the column level, and bulk
loading of data into M7 tables. In addition, many enhancements from the HBase 0.98 API have been
included (see below).

Enabling Table Authorization with Access Control Expressions


The MapR distribution for Hadoop enables native storage for MapR Tables. You can set permissions
for access to these tables through the MapR Control System (MCS) or with the maprcli table
commands.
Permissions for MapR tables, column families, and columns are defined by Access Control
Expressions (ACEs). You can set permissions for tables when you create or edit tables. You can set
default permissions for column families when you create or edit tables, and you can override these
defaults when you create column families.
Syntax of Access Control Expressions
An ACE is defined by a combination of user, group, or role definitions. You can combine these
definitions using the following syntax:
Operator - Description
u - Username or user ID, as they appear in /etc/passwd, of a specific user. Usage: u:<username or user ID>
g - Group name or group ID, as they appear in /etc/group, of a specific group. Usage: g:<group name or group ID>
r - Name of a specific role. Usage: r:<role name>
p - Public. Specifies that this operation is available to the public without restriction. Cannot be combined with any other operator.
! - Negation operator. Usage: !<operator>
& - AND operation.
| - OR operation.
() - Delimiters for subexpressions.
"" - The empty string indicates that no user has the specified permission.

An example definition is u:1001 | r:engineering, which restricts access to the user with ID 1001 or to
any user with the role engineering.
In this next example, members of the group admin are given access, and so are members of the
group qa:
g:admin | g:qa
For another example, suppose that you have this list of groups to which you want to give read
permissions on a table:
The admin group as a whole, but not the admins for a particular cluster (which is named cl3).
Members of the qa group who are responsible for testing the two applications
(named app2 and app3) that access this table.
The business analysts (group ba) in department 7A (group dept_7a)
All of the data scientists (group ds) in the company.
To grant the read permission, you construct this boolean expression:
u:cfkane | (g:admin & g:!cl3) | (g:qa & (g:app2 | g:app3)) | (g:ba & g:dept_7a) | g:ds
This expression is made up of five subexpressions which are separated by OR operators.
The first subexpression u:cfkane grants the read permission to the username cfkane.
The subexpression (g:admin & g:!cl3) grants the read permission to the admins for all
clusters except cluster cl3. The operator g is the group operator, the value admin is the
name of the group of all admins. The & operator limits the number of administrators who
have read permission because only those administrators who meet the additional condition
will have it.
The condition g:!cl3 is a limiting condition. The operator ! is the NOT operator. Combined with the
group operator, this operator means that this group is excluded and does not receive the read
permission.
Be careful when using the NOT operator. You might exclude fewer people than you intended. For
example, suppose that you do not want anyone in the group group_a to have access. You therefore
define this ACE:
g:!group_a
You might think that the data in your table is now protected because members of group_a do not
have access to it. However, you have not restricted access for anyone else except the members
of group_a. The rest of the world can access the table.
You should not define ACEs through exclusion by using the NOT operator. You should define them by
inclusion and use the NOT operator to limit further the access of the groups or roles that you have
included.



In the subexpression (g:admin & g:!cl3), the NOT operator limits the number of members within
the admin group who have access. The admin group is included, and all users who are also part of
the cl3 group are excluded.

The subexpression (g:qa & (g:app2 | g:app3)) demonstrates that you can use a
subexpression within a subexpression. The larger subexpression means that only members
of group qa who are also members of group app2 or app3 have read access to the table. The
smaller subexpression limits the number of people in the qa group who have this
permission.
The next two subexpressions -- (g:ba & g:dept_7a) and g:ds -- grant the read permission to
the members of group ba who are also in the group dept_7a. It also grants permission to the
members of the group ds.

Defining ACEs with the MCS by using the Expression Builder


1. To define an ACE for an existing table, click Edit Table Permissions from the table's pane in
the MCS to display the Permissions pane.

2. Click the arrow at the right side of any field to display the Expression Builder for that field.



3. Use the + button to add a condition to the expression. Note that you cannot mix AND and OR
without using subexpressions.

You can also type expressions directly into the field. The MCS validates expressions when focus
leaves the field. The field is colored yellow for a warning and red for an error. Hover the cursor on
the field to display the error or warning message.
Defining ACEs by using maprcli commands
You can set ACEs with the following commands:
table create - Creates a new MapR table.
table edit - Edits a MapR table.
table cf create - Creates a column family for a MapR table.
table cf edit - Edits a column-family definition.
table cf colperm set - Sets Access Control Expressions (ACEs) for a specified column.
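
For example, a read ACE could be applied to a column family at creation time roughly as follows (a
sketch only; the table path, column family name, and expression are illustrative, and the exact
parameter names should be confirmed with the installed maprcli help output):

$> maprcli table create -path /user/juser/mytable
$> maprcli table cf create -path /user/juser/mytable -cfname f1 -readperm 'u:cfkane | (g:admin & g:!cl3)'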

Bulk Loading and MapR Tables


The most common way of loading data to a MapR table is with a put operation. At large scales, bulk
loads offer a performance advantage over put operations.
Bulk loading can be performed as a full bulk load or as an incremental bulk load. Full bulk loads
offer the best performance advantage for empty tables. Incremental bulk loads can add data to
existing tables concurrently with other table operations, with better performance
than put operations.
Bulk Load Process Flow
Once your source data is in the MapR-FS layer, bulk loading uses a MapReduce job to perform the
following steps:
1. Transform the source data into the native file format used by MapR tables.
2. Notify the database of the location of the resulting files.
A full bulk load operation can only be performed to an empty table and skips the write-ahead log
(WAL) typical of Apache HBase and MapR table operations, resulting in increased performance.
Incremental bulk load operations do use the WAL.
Creating a MapR Table with Full Bulk Load Support
When you create a new MapR table with the maprcli table create command, specify the value of
the -bulkload parameter as true.



When you create a new MapR table from the hbase shell, specify BULKLOAD as true, as in the
following example:
create '/a0', 'f1', BULKLOAD => 'true'

When you create a new MapR table from the MapR Control System (MCS), check the Bulk Load box
under Table Properties.

Performing Bulk Load Operations


Notes: You can only perform a full bulk load to empty tables that have the bulk load attribute set.
You can only set this attribute during table creation. The alter operation will not set this attribute
to true on an existing table.
Warning: Your table is unavailable for normal client operations, including put, get,
and scan operations, while a full bulk load operation is in progress. To keep your table available for
client operations, use an incremental bulk load.
Attempting a full bulk load to a table that does not have the bulk load attribute set will result in an
incremental bulk load being performed instead.



You can use incremental bulk loads to ingest large amounts of data to an existing table. Tables
remain available for standard client operations such as put, get, and scan while the bulk load is in
process. A table can perform multiple incremental bulk load operations simultaneously.
Bulk Loading Tools
Bulk loading is supported for the following tools, which can be used for both full or incremental bulk
load operations:
The CopyTable tool uses a MapReduce job to copy a MapR table.
hbase com.mapr.fs.hbase.mapreduce.CopyTable -src /table1 -dst /table2
The CopyTableTest tool copies a MapR table without using MapReduce.
hbase com.mapr.fs.CopyTableTest -src /table1 -dst /table2
The ImportTsv tool imports a tab-separated values file into a MapR table.
importtsv -Dimporttsv.columns=HBASE_ROW_KEY,CF-1:custkey,CF-1:orderstatus,CF-1:totalprice,CF-1:orderdate,CF-1:orderpriority -Dimporttsv.separator='|' -Dimporttsv.bulk.output=/dummy /table1 /orders
The ImportFiles tool imports HFile or Result files into a MapR table.
hbase com.mapr.fs.hbase.mapreduce.ImportFiles -Dmapred.reduce.tasks=2 -inputDir
/test/tabler.kv -table /table2 -format Result
Custom MapReduce jobs can use bulk loads with the configureIncrementalLoad() method from
the HFileOutputFormat class.
HTable table = new HTable(jobConf, tableName);
HFileOutputFormat.configureIncrementalLoad(mrJob, table);
After completing a full bulk load operation, take the table out of bulk load mode to restore normal
client operations. You can do this from the command line or the HBase shell with the following
commands:
# maprcli table edit -path /user/juser/mytable -bulkload false (command line)
hbase shell> alter '/user/juser/mytable', 'f2', BULKLOAD => 'false' (hbase shell)
Paths for HBase 0.98 Tools
Note the path name changes for the following tools in HBase 0.98:
Tool | HBase 0.98 Path
CopyTableTest | com.mapr.fs.hbase.tools.CopyTableTest
CopyTable | com.mapr.fs.hbase.tools.mapreduce.CopyTable
ImportFiles | com.mapr.fs.hbase.tools.mapreduce.ImportFiles

If you are running on an HBase 0.98 client but the exported files were generated with HBase 0.94,
include -Dhbase.import.version=0.94 in the ImportFiles job.

HBase 0.98 API


The API for accessing MapR tables works the same way as the Apache HBase API. Code written for
Apache HBase can be easily ported to use MapR tables.
MapR tables do not support low-level HBase API calls that are used to manipulate the state of an
Apache HBase cluster. HBase API calls that are not supported by MapR tables report successful
completion to allow legacy code written for Apache HBase to continue executing, but do not
perform any actual operations.
For details on the behavior of each function, refer to the Apache HBase API documentation.
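
As a quick illustration (a sketch; the table path is hypothetical), the same HBase shell and API calls
simply reference a MapR table by its MapR-FS path instead of a table name:

$> hbase shell
create '/user/juser/testtable', 'cf1'
put '/user/juser/testtable', 'row1', 'cf1:c1', 'value1'
get '/user/juser/testtable', 'row1'
scan '/user/juser/testtable'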



HBaseAdmin API | Available for MapR Tables? | Comments
void addColumn(String tableName, HColumnDescriptor column) | Yes
void close() | Yes
void createTable(HTableDescriptor desc, byte[][] splitKeys) | Yes | This call is synchronous.
void createTableAsync(HTableDescriptor desc, byte[][] splitKeys) | Yes | For MapR tables, this call is identical to createTable.
void deleteColumn(byte[] family, byte[] qualifier, long timestamp) | Yes
void deleteTable(String tableName) | Yes
HTableDescriptor[] deleteTables(Pattern pattern) | Yes
Configuration getConfiguration() | Yes
HTableDescriptor getTableDescriptor(byte[] tableName) | Yes
HTableDescriptor[] getTableDescriptors(List<String> tableNames) | Yes
boolean isTableAvailable(String tableName) | Yes
boolean isTableDisabled(String tableName) | Yes
boolean isTableEnabled(String tableName) | Yes
HTableDescriptor[] listTables() | Yes
void modifyColumn(String tableName, HColumnDescriptor descriptor) | Yes
void modifyTable(byte[] tableName, HTableDescriptor htd) | No
boolean tableExists(String tableName) | Yes
Pair<Integer, Integer> getAlterStatus(byte[] tableName) | Yes
CompactionState getCompactionState(String tableNameOrRegionName) | Yes | Returns CompactionState.NONE.
void split(byte[] tableNameOrRegionName) | Yes | The tableNameOrRegionName parameter has a different format when used with MapR tables than with Apache HBase tables. With MapR tables, specify both the table path and the FID as a comma-separated list.
void abort(String why, Throwable e) | No
void assign(byte[] regionName) | No
boolean balancer() | No
boolean balanceSwitch(boolean b) | No
void closeRegion(ServerName sn, HRegionInfo hri) | No
void closeRegion(String regionname, String serverName) | No
boolean closeRegionWithEncodedRegionName(String encodedRegionName, String serverName) | No
void flush(String tableNameOrRegionName) | No
ClusterStatus getClusterStatus() | No
HConnection getConnection() | No
HMasterInterface getMaster() | No
String[] getMasterCoprocessors() | No
boolean isAborted() | No
boolean isMasterRunning() | No
void majorCompact(String tableNameOrRegionName) | No
void move(byte[] encodedRegionName, byte[] destServerName) | No
byte[][] rollHLogWriter(String serverName) | No
boolean setBalancerRunning(boolean on, boolean synchronous) | No
void shutdown() | No
void stopMaster() | No
void stopRegionServer(String hostnamePort) | No
void unassign(byte[] regionName, boolean force) | No

HTable API | Available for MapR Tables? | Comments
void clearRegionCache() | No | Operation is silently ignored.
void close() | Yes
<T extends CoprocessorProtocol, R> Map<byte[], R> coprocessorExec(Class<T> protocol, byte[] startKey, byte[] endKey, Call<T, R> callable) | No | Returns null.
<T extends CoprocessorProtocol> T coprocessorProxy(Class<T> protocol, byte[] row) | No | Returns null.
Map<HRegionInfo, HServerAddress> deserializeRegionInfo(DataInput in) | Yes
void flushCommits() | Yes
Configuration getConfiguration() | Yes
HConnection getConnection() | No | Returns null.
int getOperationTimeout() | No | Returns null.
ExecutorService getPool() | No | Returns null.
int getScannerCaching() | No | Returns 0.
ArrayList<Put> getWriteBuffer() | No | Returns null.
long getWriteBufferSize() | No | Returns 0.
boolean isAutoFlush() | Yes
void prewarmRegionCache(Map<HRegionInfo, HServerAddress> regionMap) | No | Operation is silently ignored.
void serializeRegionInfo(DataOutput out) | Yes

Configuration and State Management
void setAutoFlush(boolean autoFlush, boolean clearBufferOnFail) | Same as setAutoFlush(boolean autoFlush)
void setAutoFlush(boolean autoFlush) | Yes
void setFlushOnRead(boolean val) | Yes
boolean shouldFlushOnRead() | Yes
void setOperationTimeout(int operationTimeout) | No | Operation is silently ignored.
void setScannerCaching(int scannerCaching) | No | Operation is silently ignored.
void setWriteBufferSize(long writeBufferSize) | No | Operation is silently ignored.

Atomic operations
Result append(Append append) | Yes
boolean checkAndDelete(byte[] row, byte[] family, byte[] qualifier, byte[] value, Delete delete) | Yes
boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) | Yes
Result increment(Increment increment) | Yes
long incrementColumnValue(byte[] row, byte[] family, byte[] qualifier, long amount, boolean writeToWAL) | Yes
long incrementColumnValue(byte[] row, byte[] family, byte[] qualifier, long amount) | Yes
void mutateRow(RowMutations rm) | Yes

DML operations
void batch(List actions, Object[] results) | Yes
Object[] batch(List<? extends Row> actions) | Yes
void delete(Delete delete) | Yes
void delete(List<Delete> deletes) | Yes
boolean exists(Get get) | Yes
Result get(Get get) | Yes
Result[] get(List<Get> gets) | Yes
Result getRowOrBefore(byte[] row, byte[] family) | No
ResultScanner getScanner(...) | Yes
void put(Put put) | Yes
void put(List<Put> puts) | Yes

Table Schema Information
HRegionLocation getRegionLocation(byte[] row, boolean reload) | Yes
Map<HRegionInfo, HServerAddress> getRegionsInfo() | Yes
List<HRegionLocation> getRegionsInRange(byte[] startKey, byte[] endKey) | Yes
byte[][] getEndKeys() | Yes
byte[][] getStartKeys() | Yes
Pair<byte[][], byte[][]> getStartEndKeys() | Yes
HTableDescriptor getTableDescriptor() | Yes
byte[] getTableName() | Yes | Returns the table path.

Row Locks
RowLock lockRow(byte[] row) | No
void unlockRow(RowLock rl) | No

HTablePool API | Available for MapR Tables? | Comments
close() | Yes
closeTablePool(byte[] tableName) | Yes
closeTablePool(String tableName) | Yes
protected HTableInterface createHTable(String tableName) | Yes
int getCurrentPoolSize(String tableName) | Yes
HTableInterface getTable(byte[] tableName) | Yes
HTableInterface getTable(String tableName) | Yes
void putTable(HTableInterface table) | Yes

MapR Tables and Filters


MapR tables support the following built-in filters. These filters work identically to their Apache
HBase versions.
ColumnCountGetFilter - Returns the first N columns of a row.
ColumnPaginationFilter
ColumnPrefixFilter
ColumnRangeFilter
CompareFilter
FilterList
FirstKeyOnlyFilter
FirstKeyValueMatchingQualifiersFilter - Looks for the given columns in KeyValue.
FuzzyRowFilter
InclusiveStopFilter
KeyOnlyFilter
MultipleColumnPrefixFilter
PageFilter
PrefixFilter
RandomRowFilter
RegexStringComparator
SingleColumnValueFilter
SkipFilter
TimestampsFilter
WhileMatchFilter

HBase Shell Commands


The following table lists support information for HBase shell commands for managing
MapR tables.
Command | Available for MapR Tables? | Comments
alter | Yes
alter_async | Yes
create | Yes
describe | Yes
disable | Yes
drop | Yes
enable | Yes
exists | Yes
is_disabled | Yes
is_enabled | Yes
list | Yes
disable_all | Yes
drop_all | No | Obsolete. Use the rm <table names> command from the MapR filesystem or hadoop fs -rm <table names> instead.
enable_all | Yes
show_filters | Yes
count | Yes
get | Yes
put | Yes
scan | Yes
delete | Yes
deleteall | Yes
incr | Yes
truncate | Yes
get_counter | Yes
assign | No
balance_switch | No
balancer | No
close_region | No
major_compact | No
move | No
unassign | No
zk_dump | No
status | No
version | Yes
whoami | Yes

8. Apache Hive
We plan to keep the current major version of Hive in place for the upgrade. Since HiveServer2 is
already configured with impersonation, no specific configuration changes should be necessary. The
Hive metastore should be backed up prior to the upgrade. The newer version of Hive works with both
MR1 and MR2, and Hive will follow the mode set in /opt/mapr/conf/hadoop_version (by default,
HiveServer2 will start against MR1 or MR2 based on hadoop_version).
For the upgrade, the existing hive-0.12 package should be uninstalled and the new version installed.
This ensures the proper hadoop jar files are in place.
Changes to hive-site.xml should include the updated jar files. In addition, the HBase jar files are now
split into 3 files, which will need to be updated:
hbase-client-0.98.7-mapr-1501.jar
hbase-common-0.98.7-mapr-1501.jar
hbase-protocol-0.98.7-mapr-1501.jar
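
One way to make the split jars visible to Hive is through the auxiliary jar path, for example in
hive-env.sh (a sketch; the jar locations under /opt/mapr/hbase/hbase-0.98.7/lib are assumed, and the
site may instead standardize on hive.aux.jars.path in hive-site.xml):

export HIVE_AUX_JARS_PATH=/opt/mapr/hbase/hbase-0.98.7/lib/hbase-client-0.98.7-mapr-1501.jar,/opt/mapr/hbase/hbase-0.98.7/lib/hbase-common-0.98.7-mapr-1501.jar,/opt/mapr/hbase/hbase-0.98.7/lib/hbase-protocol-0.98.7-mapr-1501.jar
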
The following Hive patches are included in the latest version:



Commit | Date (YYYY-MM-DD) | Comment
4441453 | 2015-01-19 | MAPR-16784: Secure Hive metastore no longer fails to start on a MapR 4.x cluster that runs MapReduce v1.
a3e0516 | 2015-01-14 | MAPR-15645: Pig HCatalog query no longer fails on Hive due to Hadoop 2.x changes to the JobContext API.
58b1655 | 2014-12-30 | MAPR-12880/HIVE-6082: Hive no longer continually retries to delete a lock if ZooKeeper responds that the node does not exist.
c414fc8 | 2014-11-21 | MAPR-14270/HIVE-7279: format_number now supports the decimal datatype.
491ec41 | 2014-11-20 | MAPR-15833: The hive metatool updateLocation command no longer fails.
415b1b5 | 2014-10-15 | MAPR-15696: Changes were made to support Sentry for Hive 12.
c5854a7 | 2014-11-03 | MAPR-15816: HCatStorer failed to store data in a Hive table on a non-secure 4.0.1 cluster running MRv1.
5a2322a | 2014-11-03 | MAPR-15467: HiveServer2 failed to execute a join query when Kerberos authentication was enabled.
c19ade2 | 2014-09-04 | MAPR-15135: Created shims to handle Hive queries in a mixed-mode cluster.
3cef19f | 2014-08-06 | MAPR-14731: Hue 3.5 works with Hive 0.12 on MapR.
f301d59 | 2014-09-02 | MAPR-14901: Changed the default HiveServer2 port in warden.hs2.conf to 10000.

9. Apache Pig
We plan to keep the current major version of Pig in place for the upgrade. As with Hive, Pig will follow
the default MR1 / MR2 configuration set in /opt/mapr/conf/hadoop_version. This can be overridden via
script or environment variable. No further Pig configuration changes will be necessary.
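
For example, a single Pig run could be forced onto one framework or the other from the command line
(a sketch; it assumes the MapR 4.x client honors the MAPR_MAPREDUCE_MODE environment variable,
and myscript.pig is a placeholder):

$> MAPR_MAPREDUCE_MODE=classic pig myscript.pig
$> MAPR_MAPREDUCE_MODE=yarn pig myscript.pig
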
The following Pig patches have been included in the latest version:
Commit | Date (YYYY-MM-DD) | Comment
39efebb | 2014-08-21 | PIG-4139, MapR-14982: Pig jobs that run on a YARN cluster configured with MRv1 no longer produce the following exception: NoSuchFieldException: jobsInProgress (Hadoop 0.23 API) caught and logged at level: DEBUG.
d017b26 | 2014-08-19 | MapR-14963: You can run PigMix queries and MapReduce jobs on YARN.
c6fbb42 | 2014-06-27 | PIG-3813, MapR-13391: Pig jobs do not go into an infinite loop when the FILTER operator is immediately followed by the RANK operator.

10. Apache Flume
Apache Flume must be upgraded from 1.4.0 to 1.5.0 as part of the MapR v4.0.2 upgrade. Flume 1.5.0
should be completely backward compatible with 1.4.0 and should not require additional configuration.
Below is a list of new features available in Flume 1.5.0:

New in-memory channel that can spill to disk
A new dataset sink that uses the Kite API to write data to HDFS and HBase
Support for the Elastic Search HTTP API in the Elastic Search Sink
Much faster replay in the File Channel

Flume will require configuration for use with HBase 0.98.

1. Go to the directory where the Flume scripts are stored:
$> cd /opt/mapr/flume/flume-<version>/bin/

2. Execute the update script:
$> bash flume-jars.sh

3. During the script execution, you should see log messages similar to:
POST_YARN=1, HBASE_VERSION=0.98.7: installing flume*hbase.98h2 jars

The latest Flume package contains the following patches:


Commit | Date (YYYY-MM-DD) | Comment
7096e0f | 2014-12-11 | Added tmp extensions to prevent the JVM from using jar files from other versions of HBase.
e544e94 | 2014-12-04 | Added a script to configure Flume jars after an upgrade to HBase 0.98.
334a2bc | 2014-12-04 | mapr-flume-1.5.0 was built with support for HBase 0.94 and 0.98.

11. Apache Sqoop
Apache Sqoop will remain at the 1.4.4 base version. No configuration file updates should be necessary.
The existing version of Sqoop will need to be uninstalled and the new version installed.
The latest Sqoop 1.4.4 package contains the following patches:
Commit | Date (YYYY-MM-DD) | Comment
d0b9345 | 2014-10-14 | MapR-8915: Sqoop import operations failed with an "error in opening zip file" exception.
303ad67 | 2014-08-27 | (Bug 14922) Changes the Sqoop script to pick up Hadoop version numbers so Sqoop 1.4.4 can work with MapR 4.0.1. (Bug 12836) By default, Sqoop does not create job history files.
545068f | 2014-08-27 | (Bug 14838) The bc command is removed from the install and uninstall scripts, since it is not part of a standard Linux installation.
7e4d225 | 2014-06-14 | MapR-14355: Corrects a packaging error that resulted in a spurious error message: mv: target `/opt/mapr/sqoop/sqoop-1.4.4/sqoop-test-1.4.4-mapr.jar' is not a directory.
f54e037 | 2013-11-20 | Fix for MapR-12083. Sqoop can find the Kerberos tgt file.
6825655 | 2013-12-02 | SQOOP-1246: HBaseImportJob should add job authtoken only if HBase is secured.

12. Apache Mahout
Apache Mahout requires an upgrade from 0.7 to 0.9. No real configuration changes will be necessary
aside from updating paths, but because this jumps two full versions, some existing jobs may depend on
deprecated or removed algorithms.
Below is a list of changes from 0.8 to 0.9:

New and improved Mahout website based on Apache CMS - MAHOUT-1245


Early implementation of a Multi Layer Perceptron (MLP) classifier - MAHOUT-1265
Scala DSL Bindings for Mahout Math Linear Algebra. See this blogpost and MAHOUT-1297
Recommenders as Search. See [https://github.com/pferrel/solr-recommender] and MAHOUT-1288
Support for easy functional Matrix views and derivatives - MAHOUT-1300
JSON output format for ClusterDumper - MAHOUT-1343
Enabled randomised testing for all Mahout modules using Carrot RandomizedRunner - MAHOUT-1345
Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering - See
this pdf and MAHOUT-1361
Upgrade to Lucene 4.6.1 - MAHOUT-1364
The following algorithms that were marked deprecated in 0.8 have been removed in 0.9:
 Switched LDA implementation from Gibbs Sampling to Collapsed Variational Bayes
 Meanshift - removed due to lack of actual usage and support
 MinHash - removed due to lack of actual usage and support
 Winnow - removed due to lack of actual usage and support
 Perceptron - removed due to lack of actual usage and support
 Slope One - removed due to lack of actual usage
 Distributed Pseudo recommender - removed due to lack of actual usage
 TreeClusteringRecommender - removed due to lack of actual usage
Below is a list of changes from 0.7 to 0.8:
 Numerous performance improvements to Vector and Matrix implementations, APIs, and their iterators


 Numerous performance improvements to the recommender implementations
 MAHOUT-1088: Support for biased item-based recommender
 MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases
 MAHOUT-1106: Support for SVD++
 MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles, as well
as an upgrade of the supported Lucene version to Lucene 4.3.1
 MAHOUT-1154 and friends: New streaming k-means implementation that offers online (and fast)
clustering
 MAHOUT-833: Make conversion to SequenceFiles MapReduce; 'seqdirectory' can now be run as a
MapReduce job
 MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash
(indexes or values)
 MAHOUT-884: Matrix Concat utility; presently only concatenates two matrices
 MAHOUT-1187: Upgraded to CommonsLang3
 MAHOUT-916: Speed up the Mahout build by making tests run in parallel
Existing code should be run on one of the existing MapR v4.0.2 clusters to test for compatibility.
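As a hedged smoke test tied to the seqdirectory change noted above (MAHOUT-833), the 0.9 driver can be
pointed at a small directory of text files; the input and output paths are illustrative:
$> /opt/mapr/mahout/mahout-0.9/bin/mahout seqdirectory \
     -i /user/$USER/mahout-smoke/input \
     -o /user/$USER/mahout-smoke/seqfiles -ow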
24.Apache Oozie
Apache Oozie will require an upgrade from 3.3.2 to 4.0.1, along with some configuration changes to work
with MapR v4.0.1. Oozie 4.0.1 supports both MR1 and MR2; however, Oozie must be started against one
mode or the other.
If you are running Oozie jobs on YARN, you must provide the address of the node running the active
ResourceManager and the port used for ResourceManager client RPCs (port 8032). Edit
the job.properties file and insert the following statement:
JobTracker=<ResourceManager_address>:8032
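A minimal job.properties sketch for a YARN submission is shown below. The hostnames and the workflow
path are illustrative, and the property names assume the common ${nameNode}/${jobTracker}
parameterization used inside workflow definitions:
nameNode=maprfs:///
jobTracker=<ResourceManager_address>:8032
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/workflows/example-wf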

You also need to copy the yarn-site.xml file for the active ResourceManager to the following location:
/opt/mapr/oozie/oozie-<version>/conf/hadoop-conf

Restart Oozie after making this change.
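A hedged restart sketch using maprcli (the hostname placeholder is illustrative; run the command from
any node where maprcli is configured):
$> maprcli node services -name oozie -action restart -nodes <oozie_hostname>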


Installation Notes
Before installing Oozie 4.0.1-1501, remove the old share libraries from: maprfs://oozie/share
If you are running MapR Version 3.x, copy the new Oozie share libraries manually:
$> hadoop fs -put /opt/mapr/oozie/oozie-${oozie_version}/share1 /oozie/share

Note: With MapR v4.0.2, the share libs are copied automatically when you first start Oozie 4.0.1.
The latest Oozie package contains the following patches:
Commit     Date (YYYY-MM-DD)   Comment
2973bc3    2014-11-20          MAPR-16032: Oozie 4.0.1 did not pick up variables from the
                               mapred-site.xml file.
911f78a    2014-11-04          MAPR-15524: Oozie server installation on a Version 3.1.1
                               mapr-client failed.
e6230c     2014-10-30          MAPR-12997: Oozie ShareLib was updated to include Hive 0.13
                               and Pig 0.12.

25.Apache HBase
Apache HBase will require an upgrade from 0.94.12 to 0.98.7 for MapR v4.0.2. This also requires an
upgrade of the HFile format on disk, so it is highly recommended to back up important tables to another
cluster prior to the upgrade.
Here is a summary of the steps needed to upgrade HBase:
1. Install the HBase 0.98.7 binaries.
2. Use the 0.98.7 binary to check for incompatible files on the running 0.94.x cluster. (The purpose of
this step is to identify HFiles still written in the old, incompatible format.)
$> /opt/mapr/hbase/hbase-0.98.7/bin/hbase upgrade -check

If incompatible files are found, you must purge them by running a major compaction (see the example
after these steps).
3. Shut down the HBase 0.94.x services on the MapR cluster.
4. Execute the upgrade on the cluster:
$> /opt/mapr/hbase/hbase-0.98.7/bin/hbase upgrade -execute
5. Start the upgraded HBase services.
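As a hedged sketch of the compaction step mentioned above, a major compaction can be triggered from the
HBase shell of the still-running 0.94 installation; the table name 'mytable' and the 0.94.12 install
path are illustrative:
$> echo "major_compact 'mytable'" | /opt/mapr/hbase/hbase-0.94.12/bin/hbase shell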
The Apache upgrade documentation can be found here:
http://hbase.apache.org/book.html#upgrade0.96
Note: Upgrading from 0.94.X to 0.98.X is the same as upgrading from 0.94.X to 0.96.X.
HBase 0.98.7 is the initial release of the 0.98 line for the MapR Distribution for Hadoop. It includes
the following new features:
1. C APIs for HBase (libhbase). This library is not supported by MapR-DB.
2. HTable.checkAndMutate(). This API is not supported by MapR-DB.
3. Impersonation for HBase REST gateway with MapR-DB tables is supported on a MapR 4.0.2 cluster.
26.Apache Cascading
Apache Cascading will require an upgrade from 2.1 to 2.5 for MapR v4.0.2. We support both MR1 and MR2
with Cascading 2.5. Users are advised to test existing code with the newer version of Cascading.
Placeholder: additional information needed
27.Apache Whirr
Apache Whirr is a set of libraries for running cloud services. There are currently no Amex
projects using this software, so it is recommended to remove Whirr prior to the upgrade (see the
removal sketch below).
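A hedged removal sketch, assuming Whirr was installed from the MapR repository under the package name
mapr-whirr (verify the installed package name on your nodes first):
$> sudo clush -a 'rpm -qa | grep -i whirr'
$> sudo clush -a yum remove -y mapr-whirr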
28.Apache Zookeeper
With MapR v3.1.1 and later, MapR uses ZooKeeper 3.4.5 instead of 3.3.6. The 3.4.5 version requires no
additional configuration and should be compatible with other open source projects such as Solr and
Storm. Prior to the upgrade, it is recommended to back up /opt/mapr/zkdata on each node configured
to run ZooKeeper.
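A hedged backup sketch using the zk clush group defined in this plan (the archive location under /tmp
is illustrative; move the archives off-node per local policy):
$> sudo clush -g zk 'tar czf /tmp/zkdata-backup-$(hostname -s).tar.gz -C /opt/mapr zkdata'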


29.Apache Storm
Apache Storm is running on a single cluster at Amex, which is running MapR v3.1.1. The same version
of Storm is supported on both 3.1.1 and 4.0.2. No additional configuration will be necessary until the
Storm cluster itself is ready for upgrade.
Note: although a Storm-on-YARN project exists, the code was forked from an earlier version of Storm
and should be considered alpha quality.

Amex Lab Upgrade Plan


This procedure details the high-level plan to test the upgrade on a 13-node cluster (Legolas) in the AXP
IPC1 data center. The cluster will be configured with the Silver cluster as a template and will use data
from Silver. Prior to this step, it is assumed the cluster is running JDK 1.7 and Redhat Linux 6.6. The
details below cover the first three weeks of testing, beginning March 2nd.
(Week 1: 3/2 - 3/6) - Build New Sandbox Cluster
Activities for the first week include building the template for the new cluster and performing the
build.

Build StackIQ Template for the Cluster
Determine the cluster layout and develop the template for zk, admin, and data nodes.

Deploy the Cluster
Deploy the cluster via StackIQ and resolve any hardware or build issues.

Bring the Cluster Online and Apply License
For testing purposes, the cluster should initially use an M5 temporary license key (fully functional for
30 days).
(Week 2: 3/9 - 3/13) - Bring the Cluster Online, Operationalize, and Transfer Data
Activities in week 2 should focus on bringing the cluster online using Silver as a template. The cluster
should be tested thoroughly to ensure compatibility with all ecosystem components and the end-user
environment, using real data to simulate the Silver cluster.

Operationalize the Cluster
The cluster post-install scripts should be run in order to configure all necessary components, the user
environment, and data volumes. Tasks include:
 Create standard AXP volumes
 Create user home directories and MCS access
 Configure MySQL databases for Hive and MapR Metrics
 Set up NFS mounts for edge and CLDB nodes
 Configure node and volume topologies
 Configure Hive, HBase, and other ecosystem components
 Configure JobTracker and Fair Scheduler queues and ACLs
 Configure Clustershell (clush) groups for zk, cldb, edge, data, mysql, etc. (see the example groups
file below)
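A hedged sketch of an /etc/clustershell/groups file; the node names and ranges are purely illustrative,
while the group names match the clush commands used later in this plan:
zk: node[001-003].example.com
cldb: node[001-003].example.com
admin: node004.example.com
edge: node[005-006].example.com
data: node[007-013].example.com
mysql: node004.example.com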

Verify Cluster Operation
To ensure cluster operation, run representative jobs on the cluster (see the command sketches below):
 TeraSort (100 million * 100-byte records = 10 billion bytes, roughly 10 GB)
 TestDFSIO (10 output files, 1 GB each)
 Hive, Pig, and HBase validation
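Hedged command sketches for the TeraSort and TestDFSIO runs. The jar names below assume the MR1 examples
and test jars shipped under /opt/mapr/hadoop/hadoop-0.20.2; verify the exact jar names on the cluster and
submit through the proper queue (for example with -Dmapred.job.queue.name=<queue>):
$> hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar teragen 100000000 /benchmarks/tera/in
$> hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar terasort /benchmarks/tera/in /benchmarks/tera/out
$> hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000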


Run Terry's Benchmark Suite
Terry has a standard benchmark suite that has been run on every cluster deployed in the AXP
environment. The suite stages and generates data, then runs representative MapReduce code. Three
MapReduce jobs and one Hive job stage the data, which takes 4-6 hours, followed by the actual tests.
All staging and tests should be submitted by an unprivileged user through the proper queue. The tests
will take approximately 16 hours to run on a cluster of this size:
 In-Memory Map-Only Join
 Java Raw Comparator
 Hive Join
 Avro Join

Stage Actual Data and Tables
Real data and actual tables should be copied to the cluster and validated in order to confirm
equivalent functionality following the upgrade. Using existing code from application teams on Silver
is encouraged.
 Hive tables
 HBase tables
 Data for existing MapReduce code
(Week 3: 3/16 - 3/20) - Perform Upgrade and Develop Configuration
This week should be dedicated to performing the upgrade manually and developing the configuration
files and templates needed to perform subsequent upgrades. All YARN configurations should be
developed using Bronze and Gimli as templates. Updates to existing configurations should be noted
and captured in StackIQ.
To perform the upgrade, use the following steps as a guideline:

Halt Jobs
As defined by your upgrade plan, halt activity on the cluster in the following sequence before you
begin upgrading packages:
1. Notify stakeholders.
2. Stop accepting new jobs.
3. Terminate any running jobs.
The following commands can be used to list and terminate MapReduce jobs:
# hadoop job -list
# hadoop job -kill <job-id>
# hadoop job -kill-task <task-id>

4. You might also need specific commands to terminate custom applications.


At this point the cluster is ready for maintenance but still operational. The goal is to perform the
upgrade and get back to normal operation as safely and quickly as possible.

Stop Cluster Services


The following sequence will stop cluster services gracefully. When you are done, the cluster will be
offline. The maprcli commands used in this section can be executed on any node in the cluster.
Disconnect NFS Mounts
Unmount the MapR NFS share from all clients connected to it, including other nodes in the cluster.
This allows all processes accessing the cluster via NFS to disconnect gracefully. The following
example unmounts the NFS shares from the edge and cldb nodes.


%> sudo clush -g cldb umount -l /mapr/axp
%> sudo clush -g edge umount -l /idn/axp

Stop Ecosystem Component Services


Stop ecosystem component services on each node in the cluster.
1. Run the maprcli node list command to display the services on each node in the cluster:
# maprcli node list -columns hostname,csvc

2. Stop ecosystem component services.


For example, you can use the following command to stop Oozie and Hive on the edge nodes:
# maprcli node services -multi '[{"name":"oozie","action":"stop"},
{"name":"hs2","action":"stop"}]' -nodes <hostnames>

Stop MapR core services


1. Stop warden on the CLDB nodes first:
$> sudo clush -g cldb service mapr-warden stop

2. Stop warden on the remaining admin and data nodes:
$> sudo clush -g admin,data service mapr-warden stop

3. Stop zookeeper
$> sudo clush -g zk service mapr-zookeeper stop

Back Up the /opt/mapr/roles Directory
$> sudo clush -a 'mkdir -p /tmp/roles && cp /opt/mapr/roles/* /tmp/roles'

Upgrade Packages and Configuration Files


Upgrade the following MapR core component packages on all nodes where they exist:

mapr-cldb
mapr-core
mapr-fileserver
mapr-jobtracker
mapr-metrics
mapr-nfs
mapr-tasktracker
mapr-webserver
mapr-zookeeper
mapr-zk-internal
$> sudo clush -a yum update -y mapr-cldb mapr-core mapr-fileserver mapr-jobtracker \
   mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper mapr-zk-internal

Restore roles files


$> sudo clush -a 'rm -f /opt/mapr/roles/* && cp /tmp/roles/* /opt/mapr/roles/'


Install YARN Components (ResourceManager, NodeManager, HistoryServer)
$> sudo clush -g cldb yum install -y mapr-resourcemanager
$> sudo clush -g data yum install -y mapr-nodemanager
$> sudo ssh <historyserver> yum install -y mapr-historyserver

Verify that packages installed successfully on all nodes.


Confirm that there were no errors during installation, and check
that /opt/mapr/MapRBuildVersion contains the expected value.
Example:
$> sudo clush -aB cat /opt/mapr/MapRBuildVersion

Update the warden configuration files, which are located in /opt/mapr/conf/
and /opt/mapr/conf/conf.d:
 Manually merge new configuration settings from the files in /opt/mapr/conf/conf.d.new/ into
/opt/mapr/conf/conf.d/.
 Manually merge new configuration settings from /opt/mapr/warden.conf.new into
/opt/mapr/warden.conf.
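A hedged sketch for reviewing the differences before merging; the paths follow the locations noted
above, and -aB groups identical output across nodes:
$> sudo clush -aB 'diff -u /opt/mapr/warden.conf.new /opt/mapr/warden.conf'
$> sudo clush -aB 'ls /opt/mapr/conf/conf.d.new/'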

Upgrade existing ecosystem components


Notes: Hive does not require a metastore upgrade, but the metastore should be backed up prior to the
upgrade (see the backup sketch below).
HBase tables should be backed up to another cluster prior to the upgrade, as the HFile format on disk
has changed.
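A hedged backup sketch, assuming the Hive metastore lives in the MySQL database configured in week 2;
the host, user, and database name (hive_metastore) are illustrative:
$> mysqldump -h <mysql_host> -u <metastore_user> -p --single-transaction hive_metastore > /tmp/hive_metastore_$(date +%F).sql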

Edit sethadoopenv.sh on the edge nodes (as needed).

File after edits:
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
export SQOOP_HOME=/opt/mapr/sqoop/sqoop-1.4.4
export MAHOUT_HOME=/opt/mapr/mahout/mahout-0.9
export HBASE_HOME=/opt/mapr/hbase/hbase-0.98.7
export HIVE_HOME=/opt/mapr/hive/hive-0.12
export PIG_HOME=/opt/mapr/pig/pig-0.12
export PIG_CLASSPATH=$HADOOP_HOME/conf
export PATH=$PATH:$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$SQOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin:$PIG_HOME/bin
export CLASSPATH=$HADOOP_HOME/conf
export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=lpdbd0000.phx.aexp.com,lpdbd0010.phx.aexp.com,lpdbd0016.phx.aexp.com"

# Determine the user's Fair Scheduler queue (skipped for root).
LOGNAME=`whoami`
if [ "$LOGNAME" == "root" ]
then
    :  # root does not need MY_QUEUE
else
    # Reuse the cached queue name if ~/.my_queue is non-empty and less than 24 hours old.
    if [[ -f ~/.my_queue && `cat ~/.my_queue | grep "[a-z]" | wc -l` -gt 0 ]] && \
       [[ $(echo "`date +%s` - `stat -L --format %Y ~/.my_queue`" | bc) -lt 86400 ]];
    then
        export MY_QUEUE=`cat ~/.my_queue`;
        echo -e "\nUsing Existing Queue Info";
    else
        # Cache the first non-default queue the user can submit jobs to.
        $HADOOP_HOME/bin/hadoop queue -showacls 2>/dev/null | \
            grep -v "default" | grep submit-job | awk '{print $1}' | head -1 > ~/.my_queue;
        export MY_QUEUE=`cat ~/.my_queue`;
        echo -e "\nCreating Queue Info";
    fi
    if [ "`echo ${MY_QUEUE:-null}`" == "null" ]; then
        echo -e "\n!Error: Unable to set MY_QUEUE; Please check if you are a member of any queue other than \"default\"";
    else
        echo -e "\nDefined MY_QUEUE=$MY_QUEUE\n";
    fi
fi

Start the cluster and enable new features


Start zookeeper on the zookeeper nodes
$> sudo clush -g zk service mapr-zookeeper start

Start warden on all nodes


$> sudo clush -a service mapr-warden start

After the cluster comes up, enable the cldb v4 features from the command line
(as root or sudo on a single node)
$> sudo maprcli config save -values {cldb.v4.features.enabled:1}

Set the MapR version


(as root or sudo on a single node)
$> cat /opt/mapr/MapRBuildVersion
4.0.2.29870.GA
$> sudo maprcli config save -values {mapr.targetversion:"4.0.2.29870.GA"}

Promotable Mirror Volumes (new feature as of version 4.0.2)


Issue the following command to enable support for promotable mirror volumes:
(as root or sudo on a single node)
# maprcli config save -values {mfs.feature.rwmirror.support:1}

Note: This feature is automatically enabled with a fresh install.


Reduce On-Disk Container Size (new feature as of version 4.0.2)
Issue the following command to reduce the space required on-disk for each container:
(as root or sudo on a single node)
# maprcli config save -values {cldb.reduce.container.size:1}

The reduction of the on-disk container size will take effect after the CLDB service restarts or fails
over.

The same tests run in week 2 should be run again to validate functionality in both MR1 and MR2.
