
MapR Upgrade Documentation

Version 3.0.2 to 4.0.2


David Schexnayder
MapR Data Engineer
Revision 1.2
3/05/15
Introduction
Scope
Software Support Matrices
  1. MapR JDK Support Matrix
  2. MapR Ecosystem Support for 3.0.X and 4.0.X
  3. MapR Ecosystem Upgrade Version Information
  4. Additional Ecosystem and MapR Partner Components
Upgrade Outline
  1. Upgrade to Redhat Linux 6.6
  2. Upgrade to Java 1.7.x
  3. Upgrade the M7 Cluster to 4.0.2
  4. Validate non-hadoop Components in 3.x Clients
  5. Resolve 3.x Client Issues
  6. Upgrade M5 Cluster to 4.0.2
  7. Configure MR1/MR2 Ratio to 80/20
  8. Upgrade Special non-hadoop Clients to 4.x Based on Certification
  9. Test Amex Application Code
Upgrade Components and Configuration
  1. Hadoop Mapreduce Configuration
  2. Fair Scheduler Configuration (YARN)
  3. Mapreduce v1 API Changes for 4.0.X
  4. Submitting Jobs to MR1 or MR2
  5. Edge Node Configuration
  6. Services Configuration and Ports Used
  7. MapR-DB (M7) Features and HBase API
     Enabling Table Authorization with Access Control Expressions
     Bulk Loading and MapR Tables
     HBase 0.98 API
  8. Apache Hive
  9. Apache Pig
  10. Apache Flume
  11. Apache Sqoop
  12. Apache Mahout
  13. Apache Oozie
  14. Apache HBase
  15. Apache Cascading
  16. Apache Whirr
  17. Apache Zookeeper
  18. Apache Storm
Amex Lab Upgrade Plan
  (Week 1: 3/2 - 3/6) Build New Sandbox Cluster
    Build StackIQ template for cluster
    Deploy Cluster
    Bring the Cluster Online and Apply License
  (Week 2: 3/9 - 3/13) Bring the Cluster Online, Operationalize, and Transfer Data
    Operationalize the Cluster
    Verify cluster operation
    Run Terry's Benchmark suite
    Stage Actual Data and Tables
  (Week 3: 3/16 - 3/20) Perform Upgrade and Develop Configuration
    Halt Jobs
    Stop Cluster Services
    Stop Ecosystem Component Services
    Stop MapR core services
    Backup the /opt/mapr/roles Directory
    Upgrade Packages and Configuration Files
    Restore roles files
    Install YARN Components (ResourceManager, NodeManager, HistoryServer)
    Verify that packages installed successfully on all nodes
    Update the warden configuration files located in /opt/mapr/conf/ and /opt/mapr/conf/conf.d
    Upgrade existing ecosystem components
    Edit the sethadoopenv.sh on edge nodes (as needed)
    Start the cluster and enable new features
    The same tests run in week 2 should be run again to validate functionality in both MR1 and MR2


Introduction
This document provides a step-by-step procedure to upgrade the MapR base packages from 3.0.2 to
4.0.2 and to update the ecosystem components. It is intended for validation on the AXP sandbox cluster
and will then be adapted for use in additional environments. With MapR v4.X, the cluster can run
hadoop1 (classic) and hadoop2 (YARN) workloads simultaneously. The goal is to provide a cluster with
80% of capacity initially dedicated to hadoop1 (classic) and 20% to hadoop2 (YARN). Over time, the
capacity can be shifted incrementally toward a higher percentage for hadoop2 (YARN). American
Express currently has 3 pilot clusters running MapR v4.0.2, and many of the configuration details in
this document were developed as part of that pilot program.

Scope
This document covers the MapR base packages and the ecosystem packages. It also covers the
configuration file updates and the hadoop1 / hadoop2 configuration needed on the edge nodes.

Software Support Matrices


Since MapR v4.X can run hadoop1 (classic) and hadoop2 (YARN) workloads simultaneously, MapR now
releases ecosystem packages with both hadoop1 and hadoop2 capabilities. The goal of this upgrade is
to maintain compatibility with existing workloads. Existing ecosystem components will need to be
uninstalled and replaced with the updated MapR v4.X versions.
1. MapR JDK Support Matrix
Below are the supported versions of Java per MapR release:

JDK / MapR Version | MapR 2.x | MapR 3.0.x | MapR 3.1.x | MapR 4.0.x
JDK 6 | Yes | Yes | Yes | No
JDK 7 | No | Yes | Yes | Yes
JDK 8 | No | No | No | Yes

2. MapR Ecosystem Support for 3.0.X and 4.0.X


Below is the current MapR ecosystem support matrix for MapR v3.0.X and v4.0.X:

Ecosystem Component | Version | MapR 3.0.x | MapR 4.0.2 (MapReduce v1 mode) | MapR 4.0.2 (YARN mode)
Apache Hive | 0.10 | Yes | No | No
Apache Hive | 0.11 | Yes | No | No
Apache Hive | 0.12 | Yes | Yes | Yes
Apache Hive | 0.13 | Yes | Yes | Yes
Apache Spark | 0.9.1 | Yes | No | No
Apache Spark | 0.9.2 | Yes | No | No
Apache Spark | 1.0.2 | Yes | Yes | Yes
Apache Spark | 1.1.0 | No | Yes | Yes
Impala | 1.1.1 | Yes | No | No
Impala | 1.2.3 | Yes | Yes | No
Impala | 1.4.1 | No | Yes | No
Apache Pig | 0.10 | Yes | Yes | No
Apache Pig | 0.11 | Yes | No | No
Apache Pig | 0.12 | Yes | Yes | Yes
Apache Pig | 0.13 | No | Yes | Yes
Apache Flume | 1.3.1 | Yes | No | No
Apache Flume | 1.4.0 | Yes | No | N/A
Apache Flume | 1.5 | No | Yes | N/A
Apache Sqoop | 1.4.3 | Yes | No | No
Apache Sqoop | 1.4.4 | Yes | Yes | Yes
Apache Sqoop | 1.4.5 | Yes | Yes | Yes
Apache Sqoop2 | 1.99.0 | Yes | Yes | Yes
Apache Mahout | 0.7 | Yes | No | No
Apache Mahout | 0.8 | Yes | No | No
Apache Mahout | 0.9 | No | Yes | Yes
Apache Oozie | 3.3.2 | Yes | No | No
Apache Oozie | 4.0.0 | Yes | No | No
Apache Oozie | 4.0.1 | No | Yes | Yes
Hue | 2.5 (Beta only) | Yes | No | No
Hue | 3.5 | Yes | No | No
Hue | 3.6 | No | Yes | Yes
Apache HBase | 0.92.2 | No | No | No
Apache HBase | 0.94.17 | Yes | No | No
Apache HBase | 0.94.21 | Yes | No | No
Apache HBase | 0.98.4 | No | Yes | Yes
Apache HBase | 0.98.7 | No | Yes | Yes
Apache Drill | 0.5 | No | Yes | N/A
Apache Drill | 0.6 | No | Yes | N/A
Apache Drill | 0.6R2 | No | Yes | N/A
Apache Drill | 0.7 | No | Yes | N/A
Asynchbase | 1.4.1 | Yes | No | No
Asynchbase | 1.5 | No | Yes | Yes
Cascading | 2.1.6 | Yes | No | No
Cascading | 2.5 | No | Yes | Yes
Whirr | 0.8.1 | No | No | No
HTTPFS | - | Yes | Yes | N/A
Apache Tez (Developer Preview) | 0.4 | N/A | N/A | Yes
MapReduce | 1.0.3 | Yes | Yes | N/A
MapReduce | 2.5.1 | N/A | N/A | Yes
Storm | 0.9.3 | N/A | Yes | N/A
Sentry | 1.4.0 | No | Yes | Yes

3. MapR Ecosystem Upgrade Version Information


Below is the ecosystem upgrade matrix with the current Amex version and the proposed upgrade
version. Packages highlighted in green maintain their current project release. Packages highlighted in
yellow do not have supported versions in both 3.0.X and 4.0.X.
Note: All ecosystem components will require an uninstall and reinstall following the core upgrade.
Ecosystem Component | Version | Amex Current Version | Proposed Upgrade Version
Apache Hive | 0.12 | 0.12.23716 | 0.12.201502021326
Apache Pig | 0.12 | 0.12.23716 | 0.12.27259
Apache Flume | 1.4.0 / 1.5.0 | 1.4.0.23547 | 1.5.0.201501191849
Apache Sqoop | 1.4.4 | 1.4.4.22554 | 1.4.4.201411051136
Apache Mahout | 0.7 / 0.9 | 0.7.22084 | 0.9.201409041745
Apache Oozie | 3.3.2 / 4.0.1 | 3.3.2.23554 | 4.0.1.201501231601
Apache HBase | 0.94.13 / 0.98.7 | 0.94.13.23554 | 0.98.7.201501291259
Apache Cascading | 2.1 / 2.5 | 2.1.20130606 | 2.5
Apache Whirr | 0.8.1 | 0.8.1.18380 | NA
Apache Zookeeper | 3.3.6 / 3.4.5 | 3.3.6 | 3.4.5
Apache Storm | 0.9.3 | 0.9.3 (on MapR 3.1.1) | 0.9.3 (no YARN integration)

4. Additional Ecosystem and MapR Partner Components


Below are the additional ecosystem components supported but not part of the MapR distribution.
Packages highlighted in Green will maintain current project release. Packages highlighted in Yellow do
not have supported versions in both 3.0.X and 4.0.X.
Note: Individual components may require reconfiguration following the upgrade.
Ecosystem Component | Amex Current Version | Proposed Upgrade Version
Oracle Java | 1.6.0_33 | 1.7.0_67 or later
MySQL Server | 5.6.1 | 5.6.1 (unchanged)
LWS-Solr | 2.6.3 | 2.6.3 (unchanged)
Elastic Search | 1.2.1 | 1.2.1 (unchanged)
Kognitio | 8.01.00-rel141029 | Should work with ODBC
Memcached | 1.4.4-3.el6 | Not certified, but should work
Talend | 5.4.1 | 5.6.1
Platfora | 4.0.3 | 4.1.X
Datameer | 4.5.6 | Under certification
Revolution R | 7.3.0 | Under certification
Dataguise | 4.4.2.10 / 4.4.2.11 | Should work with 4.0.X
Python | 2.6.6 | 2.6.6 (unchanged)
Redhat Linux | 6.3 | Up to 6.6

Upgrade Outline
The high-level plan below describes the order in which clusters and cluster infrastructure should be upgraded. Specific
details and configuration are found later in the document.
1. Upgrade to Redhat Linux 6.6
Linux should be upgraded prior to the upgrade of any MapR components. Any issues related to the
Linux upgrade should be resolved prior to the MapR upgrade. The upgrade consists of updating the
Linux packages in StackIQ and running yum upgrade. This can be performed in a rolling fashion on a
subset of nodes and requires a reboot once complete. The node can rejoin the cluster and run
workloads as usual.
Note: Any vendor specific drivers (Cisco UCS, IBM 3650) should also be updated as part of the process.
2. Upgrade to Java 1.7.x
Once all cluster nodes are running Redhat 6.6, Oracle Java 1.7.0_67 (or later) can be installed. The
package is released as an RPM, so it can be installed via StackIQ. Once installed, the JAVA_HOME
variable should be set via /opt/mapr/conf/env.sh or /etc/alternatives. The upgrade does not require a
reboot, but it does require a restart of MapR services (zookeeper and warden). Existing code should be
tested for compatibility.
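
As a minimal sketch of this step on a single node (assuming the Oracle JDK RPM installs under
/usr/java/jdk1.7.0_67; adjust the path to the actual install location):

Add to /opt/mapr/conf/env.sh:
export JAVA_HOME=/usr/java/jdk1.7.0_67

Then restart the MapR services so they pick up the new JVM:
$> sudo service mapr-zookeeper restart
$> sudo service mapr-warden restart
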
3. Upgrade the M7 Cluster to 4.0.2
The M7 clusters per environment (Gold, Platinum) should be upgraded prior to the analytical (M5)
clusters. Since the HBase API will be upgraded from 0.94.13 to 0.98.7, existing tables should be
backed up on the M7 clusters prior to the upgrade. The backup can be performed either by using the
standard HBase mapreduce jobs (CopyTable or Export) or by mirroring volumes containing M7 tables.
The backup should be stored on another cluster. Ecosystem components not in use should be
removed.
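
As an illustration of the backup step (a sketch only; the table and destination paths are hypothetical,
and it assumes the backup cluster is mounted under /mapr/<backup_cluster>), the CopyTable
MapReduce job described later in this document can copy a table to the other cluster:

$> hbase com.mapr.fs.hbase.mapreduce.CopyTable -src /user/apps/mytable -dst /mapr/backupcluster/backups/mytable
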
4. Validate non-hadoop Components in 3.x Clients
All client systems and services accessing the M7 cluster and tables should be validated. The client
code from 3.X clients should remain compatible with the 4.0.2 cluster and tables, but issues may arise
and client code may require a recompile. If required, upgrade client code and recompile applications
as necessary.
5. Resolve 3.x Client Issues
All issues related to accessing the M7 cluster and tables should be resolved prior to proceeding to
additional clusters. This may include upgrading MapR client code, upgrading ecosystem components
(such as HBase API), and recompiling client code. Upgrades or code changes should only be necessary
to resolve client access issues or specific performance issues. Once all upgrades are complete, an
additional step can be taken to upgrade all clients to a uniform code base.
6. Upgrade M5 Cluster to 4.0.2
The detailed procedure for upgrading the cluster and components is found later in the document. Roll
forward any lessons learned from upgrading the M7 cluster and its attached clients.
7. Configure MR1/MR2 Ratio to 80/20
Initially, the mix between MR1 (classic) and MR2 (YARN) should be set to 80/20 (80% of capacity by
CPU/memory/disk dedicated to running MR1). The ratio is set per node in warden.conf and should be
managed by StackIQ. Changes to warden.conf require a restart of the NodeManager and
TaskTracker services.
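
For example, after StackIQ pushes the new warden.conf, the affected services could be restarted with
maprcli roughly as follows (a sketch; verify the exact maprcli node services syntax against the
installed release, and substitute real hostnames):

$> maprcli node services -name nodemanager -action restart -nodes node01 node02
$> maprcli node services -name tasktracker -action restart -nodes node01 node02
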
8. Upgrade Special non-hadoop Clients to 4.x Based on Certification
The remaining Additional Ecosystem and MapR Partner Components should be upgraded as
necessary. Some may simply require configuration path updates. The components should be
upgraded and validated.
9. Test Amex Application Code
Test existing mapreduce code on both MR1 and MR2 for compatibility. A test job per component should
be performed along with actual client code. Any failures should be resolved prior to completion of the
upgrade.

Upgrade Components and Configuration


1. Hadoop Mapreduce Configuration
The cluster should be configured to run MR1 (classic) and MR2 (YARN) simultaneously, with 80% of
capacity dedicated to MR1. The configuration is set on a per-node basis. All nodes should initially have
hadoop_version set to classic (MR1):
Config file
/opt/mapr/conf/hadoop_version:



classic_version=0.20.2
yarn_version=2.5.1
default_mode=classic

To configure the CPU / Memory / Disk for 80% MR1, set the following in warden.conf
Config file
/opt/mapr/conf/warden.conf:
mr1.memory.percent=80
mr1.cpu.percent=80
mr1.disk.percent=80

Note: The MR1 memory is allocated from what remains after all other service heap space is reserved
(fileserver, NodeManager, TaskTracker, HBase RegionServer); the percentage applies to that remaining
memory. On M7 clusters, 4 CPUs are reserved for the fileserver; on M5 clusters, 2 CPUs are reserved.
Jobs can be submitted to either MR1 or MR2. When default_mode is set to classic in
/opt/mapr/conf/hadoop_version, the hadoop command points to /opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop.
The hadoop1 and hadoop2 commands are also available, and point to MR1 and MR2 respectively.
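
A quick way to confirm which stack each command resolves to on a node (a sketch; output will vary
by node configuration):

$> hadoop version      # follows default_mode in /opt/mapr/conf/hadoop_version
$> hadoop1 version     # always reports the MR1 (hadoop-0.20.2) stack
$> hadoop2 version     # always reports the MR2 (hadoop-2.5.1) stack
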
2. Fair Scheduler Configuration (YARN)
The fair scheduler for YARN is configured differently than the JobTracker fair scheduler. The JobTracker
fair scheduler configuration should remain the same. To enable the fair scheduler for YARN:
Config file
/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/yarn-site.xml:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>fairscheduler.xml</value>
</property>
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.fair.allow-undeclared-pools</name>
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <value>false</value>
</property>

Notes:
The queues are configured in /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/fair-scheduler.xml
Preemption is enabled
We do not allow undeclared pools (all queues must be defined)
We do not allow the user as the default queue (users must specify a queue; the user's name will not be
used)



The queues are defined in the /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/fair-scheduler.xml file.
Minimum queue resources are defined. The queues are hierarchical, with root required at the top.
Sub-queues can be created for projects under business units if desired.
Config file
/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/fair-scheduler.xml
<allocations>
  <queuePlacementPolicy>
    <rule name="specified" create="false"/>
    <rule name="reject"/>
  </queuePlacementPolicy>
  <queue name="root">
    <schedulingPolicy>fair</schedulingPolicy>
    <minResources>1000 mb, 0 vcores, 1 disks</minResources>
    <maxRunningApps>1</maxRunningApps>
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
    <weight>1.0</weight>
    <aclSubmitApps>root</aclSubmitApps>
    <aclAdministerApps>root</aclAdministerApps>
    <queue name="default">
      <schedulingPolicy>fair</schedulingPolicy>
      <minResources>1000 mb, 0 vcores, 1 disks</minResources>
      <maxRunningApps>5</maxRunningApps>
      <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
      <weight>1.0</weight>
      <aclSubmitApps>root</aclSubmitApps>
      <aclAdministerApps>root</aclAdministerApps>
    </queue>
    <queue name="myqueue">
      <schedulingPolicy>fair</schedulingPolicy>
      <minResources>1000 mb, 0 vcores, 1 disks</minResources>
      <maxRunningApps>10</maxRunningApps>
      <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
      <weight>1.0</weight>
      <aclSubmitApps>mapr mygroup</aclSubmitApps>
      <aclAdministerApps>root</aclAdministerApps>
    </queue>
  </queue>
</allocations>

Notes:
The root queue is necessary
Users in the group mygroup can submit to myqueue
The default queue will only allow root to submit, but additional configuration is required to allow root
to submit mapreduce jobs
3. Mapreduce v1 API Changes for 4.0.X
Existing compiled MapReduce V1 applications may need to be recompiled before they can be run as
MapReduce V1 applications in MapR Version 4.0.x. The small number of API changes that have been
made, including removal of classes and methods and conversion of classes to interfaces, are
documented here. If your application does not use any of the changes listed in this document, you do
not need to recompile the application.
When an application has been compiled against MapReduce V1 or MapReduce V2 (YARN) in MapR 4.0.x,
the application can be run in either mode.
The following list of changes is grouped by package name:
org.apache.hadoop.mapred.jobcontrol
  Job
    - Now extends ControlledJob.
    - getMapredJobID: return type changed from String to JobID.
  JobControl
    - Now extends org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.
    - String addJob(ControlledJob) instead of String addJob(Job), where Job extends ControlledJob.

org.apache.hadoop.mapred
  JobContext
    - Changed from class to interface.
  JobInProgress
    - All methods removed. The Counter class remains for backward compatibility.
  JobEndNotifier
    - Removed methods: void registerNotification(JobConf, JobStatus), void startNotifier(), void stopNotifier().
  Operation
    - QueueACL qACLNeeded: type changed from org.apache.hadoop.mapred.QueueManager.QueueACL to org.apache.hadoop.mapred.QueueACL.
  TaskAttemptContext
    - Changed from class to interface.
    - progress method removed.
  Counters
    - Extends AbstractCounters, and some methods have been type-parameterized using generics. This change breaks binary compatibility.
  TaskLog
    - Removed methods: captureDebugOut, captureOutAndError, getJobDir, getTaskLogFile, getUserLogDir.
  TaskStatus
    - Removed methods: boolean getIncludeCounters(), TaskLogFS getTaskLogFs(), void setIncludeCounters(boolean), void setTaskLogFs(TaskLogFS).
  TaskUmbilicalProtocol
    - Signature change for several methods: the JvmContext argument has been removed.
  Utils
    - getHttpScheme removed.
  ClusterStatus
    - getMaxPrefetchMapTasks() removed.
  JobClient
    - New exception thrown.
  Task
    - Change in exceptions: java.lang.ClassNotFoundException was removed; java.lang.InterruptedException was added.
    - Change of visibility from protected to public.

org.apache.hadoop.mapreduce
  Counter
    - Changed from class to interface.
    - Removed: boolean equals(Object), int hashCode(), void readFields(DataInput) (read the binary representation of the counter), void write(DataOutput).
  CounterGroup
    - Changed from class to interface.
    - Removed: boolean equals(Object), Counter findCounter(String, String) (internal, finds a counter in a group), Counter findCounter(String), String getDisplayName() (display name of the group), String getName() (internal name of the group), int hashCode(), void incrAllCounters(CounterGroup), Iterator<Counter> iterator(), void readFields(DataInput), int size() (number of counters in the group), void write(DataOutput).
  Counters
    - Extends AbstractCounters, and some methods have been type-parameterized using generics. This breaks binary compatibility.
  Job
    - Removed: TaskCompletionEventList getTaskCompletionEventList(int), void setUserClassesTakesPrecedence(boolean).
  JobContext
    - Changed from class to interface.
    - Removed: boolean userClassesTakesPrecedence().
  JobSubmissionFiles
    - getStagingDir: signature changed from (JobClient, Configuration) to (Cluster, Configuration).
  MapContext
    - Changed from class to interface.
  Mapper.Context
    - Changed from non-abstract to abstract.
  ReduceContext
    - Changed from class to interface.
  ReduceContext.ValueIterator
    - Changed from class to interface.
  Reducer.Context
    - Changed from non-abstract to abstract.
  TaskAttemptContext
    - Changed from class to interface.
    - Removed: progress.
  TaskInputOutputContext
    - Changed from class to interface.

org.apache.hadoop.mapreduce.lib.output
  FileOutputCommitter
    - abortTask: change in exceptions thrown, from no exceptions to java.io.IOException.

org.apache.hadoop.mapreduce.security.token
  DelegationTokenRenewal
    - Removed class.

org.apache.hadoop.mapreduce.util
  ProcessTree
    - Change in signature for some methods (removed the signal argument; now just takes pid): killProcess, killProcessGroup, etc.
4. Submitting Jobs to MR1 or MR2
If no recompilation of application code is necessary, jobs can be submitted to either MR1 or MR2.
Note: -Dmapred.job.queue.name=<queue> is deprecated in MR2 (YARN) but will still function.
MR1 job:
$> hadoop1 jar <path_to_jar_file> -Dmapred.job.queue.name=<queue>

MR2 job:
$> hadoop2 jar <path_to_jar_file> -Dmapreduce.job.queuename=<queue>

Note: Even though the MR2 (YARN) fair scheduler has hierarchical queues, the short queue name can be
used (i.e., myqueue instead of root.myqueue).
5. Edge Node Configuration
The edge nodes can be configured with either MR1 (classic) or MR2 (YARN) as the default. The edge
node services (HiveServer2, Oozie, etc.) will need to start with one or the other, so choices need to be
made on how best to run these services. For example, one edge node can be configured for MR1 and
another for MR2. A more complicated approach is to run multiple instances of the same service on
different ports; for example, HiveServer2 could run an MR1 instance on the standard port 10000 and an
MR2 instance on a separate port 10001. This should be tested for each service (HiveServer2, Oozie,
etc.) prior to rolling out.
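
A sketch of how two HiveServer2 instances could be started on one edge node (this assumes the
MapR 4.x client honors the MAPR_MAPREDUCE_MODE environment variable; the ports and the approach
itself should be validated before rollout):

$> MAPR_MAPREDUCE_MODE=classic hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 &
$> MAPR_MAPREDUCE_MODE=yarn hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10001 &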



6. Services Configuration and Ports Used
The existing Admin / CLDB nodes should also be configured to run the ResourceManager (analogous to
the JobTracker). There is also a MapReduce JobHistory Server, which is currently non-HA and runs on a
single node. In the event of a failure of the JobHistory Server node, re-run configure.sh and move the
HistoryServer to a new node; the files it needs exist on MapR-FS, and the new HistoryServer will read
that information.
Existing data nodes will need the NodeManager service installed.
configure.sh should be run on all nodes following the upgrade to set the hadoop version and identify
the HistoryServer address:
$> sudo clush -a /opt/mapr/server/configure.sh -R -N <cluster_name> -C <cldb_server_list> -Z <zookeeper_server_list> -HS <history_server>

Additional ports will need to be opened in order to interoperate with the ResourceManager, the
NodeManagers, the ApplicationMasters, and the HistoryServer. The ResourceManager port (8088) will
only be active on a single server at a time. An ApplicationMaster is assigned per application and can
run on any NodeManager in the cluster.
Here is the complete port listing for MapR v4.0.2 with default port numbers:
Service | Port
CLDB | 7222
CLDB JMX monitor port | 7220
CLDB web port | 7221
DNS | 53
HBase Master | 60000
HBase Master (for GUI) | 60010
HBase RegionServer | 60020
HBase Thrift Server | 9090
HistoryServer RPC | 10020
HistoryServer Web UI and REST APIs (HTTP) | 19888
Hive Metastore | 9083
Hiveserver2 | 10000
Httpfs | 14000
Hue Beeswax | 8002
Hue Webserver | 8888
Impala Catalog Daemon | 25020
Impala Daemon | 21000
Impala Daemon | 21050
Impala Daemon | 25000
Impala StateStore Daemon | 25010
JobTracker | 9001
JobTracker web | 50030
LDAP | 389
LDAPS | 636
Metrics RPC activity | 1111
MFS server | 5660
MySQL | 3306
NFS | 2049
NFS monitor (for HA) | 9997
NFS management | 9998
NFS VIP service | 9997 and 9998
NodeManager | 8041
NodeManager Localizer RPC | 8040
NodeManager Web UI and REST APIs (HTTP) | 8042
NTP | 123
Oozie | 11000
Port mapper | 111
ResourceManager Admin RPC | 8033
ResourceManager Client RPC | 8032
ResourceManager Resource Tracker RPC (for NodeManagers) | 8031
ResourceManager Scheduler RPC (for ApplicationMasters) | 8030
ResourceManager Web UI (HTTP) | 8088
Secure HistoryServer Web UI and REST APIs (HTTPS) | 19890
Secure NodeManager Web UI and REST APIs (HTTPS) | 8044
Secure ResourceManager Web UI (HTTPS) | 8090
Shuffle HTTP | 13562
SMTP | 25
Sqoop2 Server | 12000
SSH | 22
TaskTracker web | 50060
Web UI HTTPS | 8443
Web UI HTTP (off by default) | 8080
ZooKeeper | 5181
ZooKeeper follower-to-leader communication | 2888
ZooKeeper leader election | 3888

7. MapR-DB (M7) Features and HBase API

MapR-DB (M7) in v4.0.2 offers two key new features, which provide additional security and faster
loading: table security using Boolean access control expressions down to the column level, and bulk
loading of data into M7 tables. In addition, many enhancements from the HBase 0.98 API have been
included (see below).

Enabling Table Authorization with Access Control Expressions


The MapR distribution for Hadoop enables native storage for MapR Tables. You can set permissions
for access to these tables through the MapR Control System (MCS) or with the maprcli table
commands.
Permissions for MapR tables, column families, and columns are defined by Access Control
Expressions (ACEs). You can set permissions for tables when you create or edit tables. You can set
default permissions for column families when you create or edit tables, and you can override these
defaults when you create column families.
Syntax of Access Control Expressions
An ACE is defined by a combination of user, group, or role definitions. You can combine these
definitions using the following syntax:
Operator - Description
u - Username or user ID, as they appear in /etc/passwd, of a specific user. Usage: u:<username or user ID>
g - Group name or group ID, as they appear in /etc/group, of a specific group. Usage: g:<group name or group ID>
r - Name of a specific role. Usage: r:<role name>
p - Public. Specifies that this operation is available to the public without restriction. Cannot be combined with any other operator.
! - Negation operator. Usage: !<operator>
& - AND operation.
| - OR operation.
() - Delimiters for subexpressions.
"" - The empty string indicates that no user has the specified permission.

An example definition is u:1001 | r:engineering, which restricts access to the user with ID 1001 or to
any user with the role engineering.
In this next example, members of the group admin are given access, and so are members of the
group qa:
g:admin | g:qa
For another example, suppose that you have this list of groups to which you want to give read
permissions on a table:
The admin group as a whole, but not the admins for a particular cluster (which is named cl3).
Members of the qa group who are responsible for testing the two applications
(named app2 and app3) that access this table.
The business analysts (group ba) in department 7A (group dept_7a)
All of the data scientists (group ds) in the company.
To grant the read permission, you construct this boolean expression:
u:cfkane | (g:admin & g:!cl3) | (g:qa & (g:app2 | g:app3)) | (g:ba & g:dept_7a) | g:ds
This expression is made up of five subexpressions which are separated by OR operators.
The first subexpression u:cfkane grants the read permission to the username cfkane.
The subexpression (g:admin & g:!cl3) grants the read permission to the admins for all
clusters except cluster cl3. The operator g is the group operator, the value admin is the
name of the group of all admins. The & operator limits the number of administrators who
have read permission because only those administrators who meet the additional condition
will have it.
The condition g:!cl3 is a limiting condition. The operator ! is the NOT operator. Combined with the
group operator, this operator means that this group is excluded and does not receive the read
permission.
Be careful when using the NOT operator. You might exclude fewer people than you intended. For
example, suppose that you do not want anyone in the group group_a to have access. You therefore
define this ACE:
g:!group_a
You might think that the data in your table is now protected because members of group_a do not
have access to it. However, you have not restricted access for anyone else except the members
of group_a. The rest of the world can access the table.
You should not define ACEs through exclusion by using the NOT operator. You should define them by
inclusion and use the NOT operator to limit further the access of the groups or roles that you have
included.



In the subexpression (g:admin & g:!cl3), the NOT operator limits the number of members within
the admin group who have access. The admin group is included, and all users who are also part of
the cl3 group are excluded.

The subexpression (g:qa & (g:app2 | g:app3)) demonstrates that you can use a
subexpression within a subexpression. The larger subexpression means that only members
of group qa who are also members of group app2 or app3 have read access to the table. The
smaller subexpression limits the number of people in the qa group who have this
permission.
The next two subexpressions -- (g:ba & g:dept_7a) and g:ds -- grant the read permission to
the members of group ba who are also in the group dept_7a. It also grants permission to the
members of the group ds.

Defining ACEs with the MCS by using the Expression Builder


1. To define an ACE for an existing table, click Edit Table Permissions from the table's pane in
the MCS to display the Permissions pane.

2. Click the arrow at the right side of any field to display the Expression Builder for that field.



3. Use the + button to add a condition to the expression. Note that you cannot mix AND and OR
without using subexpressions.

You can also type expressions directly into the field. The MCS validates expressions when focus
leaves the field. The field is colored yellow for a warning and red for an error. Hover the cursor on
the field to display the error or warning message.
Defining ACEs by using maprcli commands
You can set ACEs with the following commands:
table create - Creates a new MapR table.
table edit - Edits a MapR table.
table cf create - Creates a column family for a MapR table.
table cf edit - Edits a column-family definition.
table cf colperm set - Sets Access Control Expressions (ACEs) for a specified column.
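
For example, a read ACE could be applied to a column family at creation time roughly as follows (a
sketch only; the table path, column family name, and expression are illustrative, and the exact
parameter names should be confirmed with the installed maprcli help output):

$> maprcli table create -path /user/juser/mytable
$> maprcli table cf create -path /user/juser/mytable -cfname f1 -readperm 'u:cfkane | (g:admin & g:!cl3)'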

Bulk Loading and MapR Tables


The most common way of loading data to a MapR table is with a put operation. At large scales, bulk
loads offer a performance advantage over put operations.
Bulk loading can be performed as a full bulk load or as an incremental bulk load. Full bulk loads
offer the best performance advantage for empty tables. Incremental bulk loads can add data to
existing tables concurrently with other table operations, with better performance
than put operations.
Bulk Load Process Flow
Once your source data is in the MapR-FS layer, bulk loading uses a MapReduce job to perform the
following steps:
1. Transform the source data into the native file format used by MapR tables.
2. Notify the database of the location of the resulting files.
A full bulk load operation can only be performed to an empty table and skips the write-ahead log
(WAL) typical of Apache HBase and MapR table operations, resulting in increased performance.
Incremental bulk load operations do use the WAL.
Creating a MapR Table with Full Bulk Load Support
When you create a new MapR table with the maprcli table create command, specify the value of
the -bulkload parameter as true.



When you create a new MapR table from the hbase shell, specify BULKLOAD as true, as in the
following example:
create '/a0', 'f1', BULKLOAD => 'true'

When you create a new MapR table from the MapR Control System (MCS), check the Bulk Load box
under Table Properties.

Performing Bulk Load Operations


Notes: You can only perform a full bulk load to empty tables that have the bulk load attribute set.
You can only set this attribute during table creation. The alter operation will not set this attribute
to true on an existing table.
Warning: Your table is unavailable for normal client operations, including put, get,
and scan operations, while a full bulk load operation is in progress. To keep your table available for
client operations, use an incremental bulk load.
Attempting a full bulk load to a table that does not have the bulk load attribute set will result in an
incremental bulk load being performed instead.



You can use incremental bulk loads to ingest large amounts of data to an existing table. Tables
remain available for standard client operations such as put, get, and scan while the bulk load is in
process. A table can perform multiple incremental bulk load operations simultaneously.
Bulk Loading Tools
Bulk loading is supported for the following tools, which can be used for both full or incremental bulk
load operations:
The CopyTable tool uses a MapReduce job to copy a MapR table.
hbase com.mapr.fs.hbase.mapreduce.CopyTable -src /table1 -dst /table2
The CopyTableTest tool copies a MapR table without using MapReduce.
hbase com.mapr.fs.CopyTableTest -src /table1 -dst /table2
The ImportTsv tool imports a tab-separated values file into a MapR table.
importtsv -Dimporttsv.columns=HBASE_ROW_KEY,CF-1:custkey,CF-1:orderstatus,CF-1:totalprice,CF-1:orderdate,CF-1:orderpriority -Dimporttsv.separator='|' -Dimporttsv.bulk.output=/dummy /table1 /orders
The ImportFiles tool imports HFile or Result files into a MapR table.
hbase com.mapr.fs.hbase.mapreduce.ImportFiles -Dmapred.reduce.tasks=2 -inputDir
/test/tabler.kv -table /table2 -format Result
Custom MapReduce jobs can use bulk loads with the configureIncrementalLoad() method from
the HFileOutputFormat class.
HTable table = new HTable(jobConf, tableName);
HFileOutputFormat.configureIncrementalLoad(mrJob, table);
After completing a full bulk load operation, take the table out of bulk load mode to restore normal
client operations. You can do this from the command line or the HBase shell with the following
commands:
# maprcli table edit -path /user/juser/mytable -bulkload false (command line)
hbase shell> alter '/user/juser/mytable', 'f2', BULKLOAD => 'false' (hbase shell)
Paths for HBase 0.98 Tools
Note the path name changes for the following tools in HBase 0.98:
Tool | HBase 0.98 Path
CopyTableTest | com.mapr.fs.hbase.tools.CopyTableTest
CopyTable | com.mapr.fs.hbase.tools.mapreduce.CopyTable
ImportFiles | com.mapr.fs.hbase.tools.mapreduce.ImportFiles

If you are running on an HBase 0.98 client but the exported files were generated with HBase 0.94,
include -Dhbase.import.version=0.94 in the ImportFiles job.

HBase 0.98 API


The API for accessing MapR tables works the same way as the Apache HBase API. Code written for
Apache HBase can be easily ported to use MapR tables.
MapR tables do not support low-level HBase API calls that are used to manipulate the state of an
Apache HBase cluster. HBase API calls that are not supported by MapR tables report successful
completion to allow legacy code written for Apache HBase to continue executing, but do not
perform any actual operations.
For details on the behavior of each function, refer to the Apache HBase API documentation.
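
As a quick illustration (a sketch; the table path is hypothetical), the same HBase shell and API calls
simply reference a MapR table by its MapR-FS path instead of a table name:

$> hbase shell
create '/user/juser/testtable', 'cf1'
put '/user/juser/testtable', 'row1', 'cf1:c1', 'value1'
get '/user/juser/testtable', 'row1'
scan '/user/juser/testtable'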



HBaseAdmin API | Available for MapR Tables? | Comments
void addColumn(String tableName, HColumnDescriptor column) | Yes
void close() | Yes
void createTable(HTableDescriptor desc, byte[][] splitKeys) | Yes | This call is synchronous.
void createTableAsync(HTableDescriptor desc, byte[][] splitKeys) | Yes | For MapR tables, this call is identical to createTable.
void deleteColumn(byte[] family, byte[] qualifier, long timestamp) | Yes
void deleteTable(String tableName) | Yes
HTableDescriptor[] deleteTables(Pattern pattern) | Yes
Configuration getConfiguration() | Yes
HTableDescriptor getTableDescriptor(byte[] tableName) | Yes
HTableDescriptor[] getTableDescriptors(List<String> tableNames) | Yes
boolean isTableAvailable(String tableName) | Yes
boolean isTableDisabled(String tableName) | Yes
boolean isTableEnabled(String tableName) | Yes
HTableDescriptor[] listTables() | Yes
void modifyColumn(String tableName, HColumnDescriptor descriptor) | Yes
void modifyTable(byte[] tableName, HTableDescriptor htd) | No
boolean tableExists(String tableName) | Yes
Pair<Integer, Integer> getAlterStatus(byte[] tableName) | Yes
CompactionState getCompactionState(String tableNameOrRegionName) | Yes | Returns CompactionState.NONE.
void split(byte[] tableNameOrRegionName) | Yes | The tableNameOrRegionName parameter has a different format when used with MapR tables than with Apache HBase tables. With MapR tables, specify both the table path and the FID as a comma-separated list.
void abort(String why, Throwable e) | No
void assign(byte[] regionName) | No
boolean balancer() | No
boolean balanceSwitch(boolean b) | No
void closeRegion(ServerName sn, HRegionInfo hri) | No
void closeRegion(String regionname, String serverName) | No
boolean closeRegionWithEncodedRegionName(String encodedRegionName, String serverName) | No
void flush(String tableNameOrRegionName) | No
ClusterStatus getClusterStatus() | No
HConnection getConnection() | No
HMasterInterface getMaster() | No
String[] getMasterCoprocessors() | No
boolean isAborted() | No
boolean isMasterRunning() | No
void majorCompact(String tableNameOrRegionName) | No
void move(byte[] encodedRegionName, byte[] destServerName) | No
byte[][] rollHLogWriter(String serverName) | No
boolean setBalancerRunning(boolean on, boolean synchronous) | No
void shutdown() | No
void stopMaster() | No
void stopRegionServer(String hostnamePort) | No
void unassign(byte[] regionName, boolean force) | No

HTable API | Available for MapR Tables? | Comments
void clearRegionCache() | No | Operation is silently ignored.
void close() | Yes
<T extends CoprocessorProtocol, R> Map<byte[], R> coprocessorExec(Class<T> protocol, byte[] startKey, byte[] endKey, Call<T, R> callable) | No | Returns null.
<T extends CoprocessorProtocol> T coprocessorProxy(Class<T> protocol, byte[] row) | No | Returns null.
Map<HRegionInfo, HServerAddress> deserializeRegionInfo(DataInput in) | Yes
void flushCommits() | Yes
Configuration getConfiguration() | Yes
HConnection getConnection() | No | Returns null.
int getOperationTimeout() | No | Returns null.
ExecutorService getPool() | No | Returns null.
int getScannerCaching() | No | Returns 0.
ArrayList<Put> getWriteBuffer() | No | Returns null.
long getWriteBufferSize() | No | Returns 0.
boolean isAutoFlush() | Yes
void prewarmRegionCache(Map<HRegionInfo, HServerAddress> regionMap) | No | Operation is silently ignored.
void serializeRegionInfo(DataOutput out) | Yes

Configuration and State Management
void setAutoFlush(boolean autoFlush, boolean clearBufferOnFail) | Same as setAutoFlush(boolean autoFlush)
void setAutoFlush(boolean autoFlush) | Yes
void setFlushOnRead(boolean val) | Yes
boolean shouldFlushOnRead() | Yes
void setOperationTimeout(int operationTimeout) | No | Operation is silently ignored.
void setScannerCaching(int scannerCaching) | No | Operation is silently ignored.
void setWriteBufferSize(long writeBufferSize) | No | Operation is silently ignored.

Atomic operations
Result append(Append append) | Yes
boolean checkAndDelete(byte[] row, byte[] family, byte[] qualifier, byte[] value, Delete delete) | Yes
boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) | Yes
Result increment(Increment increment) | Yes
long incrementColumnValue(byte[] row, byte[] family, byte[] qualifier, long amount, boolean writeToWAL) | Yes
long incrementColumnValue(byte[] row, byte[] family, byte[] qualifier, long amount) | Yes
void mutateRow(RowMutations rm) | Yes

DML operations
void batch(List actions, Object[] results) | Yes
Object[] batch(List<? extends Row> actions) | Yes
void delete(Delete delete) | Yes
void delete(List<Delete> deletes) | Yes
boolean exists(Get get) | Yes
Result get(Get get) | Yes
Result[] get(List<Get> gets) | Yes
Result getRowOrBefore(byte[] row, byte[] family) | No
ResultScanner getScanner(...) | Yes
void put(Put put) | Yes
void put(List<Put> puts) | Yes

Table Schema Information
HRegionLocation getRegionLocation(byte[] row, boolean reload) | Yes
Map<HRegionInfo, HServerAddress> getRegionsInfo() | Yes
List<HRegionLocation> getRegionsInRange(byte[] startKey, byte[] endKey) | Yes
byte[][] getEndKeys() | Yes
byte[][] getStartKeys() | Yes
Pair<byte[][], byte[][]> getStartEndKeys() | Yes
HTableDescriptor getTableDescriptor() | Yes
byte[] getTableName() | Yes | Returns the table path.

Row Locks
RowLock lockRow(byte[] row) | No
void unlockRow(RowLock rl) | No

HTablePool API | Available for MapR Tables? | Comments
close() | Yes
closeTablePool(byte[] tableName) | Yes
closeTablePool(String tableName) | Yes
protected HTableInterface createHTable(String tableName) | Yes
int getCurrentPoolSize(String tableName) | Yes
HTableInterface getTable(byte[] tableName) | Yes
HTableInterface getTable(String tableName) | Yes
void putTable(HTableInterface table) | Yes

MapR Tables and Filters


MapR tables support the following built-in filters. These filters work identically to their Apache
HBase versions.
ColumnCountGetFilter - Returns the first N columns of a row.
ColumnPaginationFilter
ColumnPrefixFilter
ColumnRangeFilter
CompareFilter
FilterList
FirstKeyOnlyFilter
FirstKeyValueMatchingQualifiersFilter - Looks for the given columns in KeyValue.
FuzzyRowFilter
InclusiveStopFilter
KeyOnlyFilter
MultipleColumnPrefixFilter
PageFilter
PrefixFilter
RandomRowFilter
RegexStringComparator
SingleColumnValueFilter
SkipFilter
TimestampsFilter
WhileMatchFilter

HBase Shell Commands


The following table lists support information for HBase shell commands for managing
MapR tables.
Command | Available for MapR Tables? | Comments
alter | Yes
alter_async | Yes
create | Yes
describe | Yes
disable | Yes
drop | Yes
enable | Yes
exists | Yes
is_disabled | Yes
is_enabled | Yes
list | Yes
disable_all | Yes
drop_all | No | Obsolete. Use the rm <table names> command from the MapR filesystem or hadoop fs -rm <table names> instead.
enable_all | Yes
show_filters | Yes
count | Yes
get | Yes
put | Yes
scan | Yes
delete | Yes
deleteall | Yes
incr | Yes
truncate | Yes
get_counter | Yes
assign | No
balance_switch | No
balancer | No
close_region | No
major_compact | No
move | No
unassign | No
zk_dump | No
status | No
version | Yes
whoami | Yes

8. Apache Hive
We plan to keep the current major version of Hive in place for the upgrade. Since HiveServer2 is
already configured with impersonation, no specific configuration changes should be necessary. The
Hive metastore should be backed up prior to the upgrade. The newer version of Hive works with both
MR1 and MR2, and Hive will follow the mode set in /opt/mapr/conf/hadoop_version (by default,
HiveServer2 will start against MR1 or MR2 based on hadoop_version).
For the upgrade, the existing hive-0.12 package should be uninstalled and the new version installed.
This ensures the proper hadoop jar files are in place.
Changes to hive-site.xml should include the updated jar files. In addition, the HBase jar files are now
split into 3 files, which will need to be updated:
hbase-client-0.98.7-mapr-1501.jar
hbase-common-0.98.7-mapr-1501.jar
hbase-protocol-0.98.7-mapr-1501.jar
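
One way to make the split jars visible to Hive is through the auxiliary jar path, for example in
hive-env.sh (a sketch; the jar locations under /opt/mapr/hbase/hbase-0.98.7/lib are assumed, and the
site may instead standardize on hive.aux.jars.path in hive-site.xml):

export HIVE_AUX_JARS_PATH=/opt/mapr/hbase/hbase-0.98.7/lib/hbase-client-0.98.7-mapr-1501.jar,/opt/mapr/hbase/hbase-0.98.7/lib/hbase-common-0.98.7-mapr-1501.jar,/opt/mapr/hbase/hbase-0.98.7/lib/hbase-protocol-0.98.7-mapr-1501.jar
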
The following Hive patches are included in the latest version:



Commit | Date (YYYY-MM-DD) | Comment
4441453 | 2015-01-19 | MAPR-16784: Secure Hive metastore no longer fails to start on a MapR 4.x cluster that runs MapReduce v1.
a3e0516 | 2015-01-14 | MAPR-15645: Pig HCatalog query no longer fails on Hive due to Hadoop 2.x changes to the JobContext API.
58b1655 | 2014-12-30 | MAPR-12880/HIVE-6082: Hive no longer continually retries to delete a lock if ZooKeeper responds that the node does not exist.
c414fc8 | 2014-11-21 | MAPR-14270/HIVE-7279: format_number now supports the decimal datatype.
491ec41 | 2014-11-20 | MAPR-15833: The hive metatool updateLocation command no longer fails.
415b1b5 | 2014-10-15 | MAPR-15696: Changes were made to support Sentry for Hive 12.
c5854a7 | 2014-11-03 | MAPR-15816: HCatStorer failed to store data in a Hive table on a non-secure 4.0.1 cluster running MRv1.
5a2322a | 2014-11-03 | MAPR-15467: HiveServer2 failed to execute a join query when Kerberos authentication was enabled.
c19ade2 | 2014-09-04 | MAPR-15135: Created shims to handle Hive queries in a mixed-mode cluster.
3cef19f | 2014-08-06 | MAPR-14731: Hue 3.5 works with Hive 0.12 on MapR.
f301d59 | 2014-09-02 | MAPR-14901: Changed the default HiveServer2 port in warden.hs2.conf to 10000.

9. Apache Pig
We plan to keep the current major version of Pig in place for the upgrade. As with Hive, Pig will follow
the default MR1 / MR2 configuration set in /opt/mapr/conf/hadoop_version. This can be overridden via
script or environment variable. No further Pig configuration changes will be necessary.
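
For example, a single Pig run could be forced onto one framework or the other from the command line
(a sketch; it assumes the MapR 4.x client honors the MAPR_MAPREDUCE_MODE environment variable,
and myscript.pig is a placeholder):

$> MAPR_MAPREDUCE_MODE=classic pig myscript.pig
$> MAPR_MAPREDUCE_MODE=yarn pig myscript.pig
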
The following Pig patches have been included in the latest version:
Commit | Date (YYYY-MM-DD) | Comment
39efebb | 2014-08-21 | PIG-4139, MapR-14982: Pig jobs that run on a YARN cluster configured with MRv1 no longer produce the following exception: NoSuchFieldException: jobsInProgress (Hadoop 0.23 API) caught and logged at level: DEBUG.
d017b26 | 2014-08-19 | MapR-14963: You can run PigMix queries and MapReduce jobs on YARN.
c6fbb42 | 2014-06-27 | PIG-3813, MapR-13391: Pig jobs do not go into an infinite loop when the FILTER operator is immediately followed by the RANK operator.

10. Apache Flume
Apache Flume must be upgraded from 1.4.0 to 1.5.0 as part of the MapR v4.0.2 upgrade. Flume 1.5.0
should be completely backward compatible with 1.4.0 and should not require additional configuration.
Below is a list of new features available in Flume 1.5.0:

New in-memory channel that can spill to disk
A new dataset sink that uses the Kite API to write data to HDFS and HBase
Support for the Elastic Search HTTP API in the Elastic Search Sink
Much faster replay in the File Channel

Flume will require configuration for use with HBase 0.98.

1. Go to the directory where the Flume scripts are stored:
$> cd /opt/mapr/flume/flume-<version>/bin/

2. Execute the update script:
$> bash flume-jars.sh

3. During the script execution, you should see log messages similar to:
POST_YARN=1, HBASE_VERSION=0.98.7: installing flume*hbase.98h2 jars

The latest Flume package contains the following patches:


Commit | Date (YYYY-MM-DD) | Comment
7096e0f | 2014-12-11 | Added tmp extensions to prevent the JVM from using jar files from other versions of HBase.
e544e94 | 2014-12-04 | Added a script to configure Flume jars after an upgrade to HBase 0.98.
334a2bc | 2014-12-04 | mapr-flume-1.5.0 was built with support for HBase 0.94 and 0.98.

11. Apache Sqoop
Apache Sqoop will remain at the 1.4.4 base version. No configuration file updates should be necessary.
The existing version of Sqoop will need to be uninstalled and the new version installed.
The latest Sqoop 1.4.4 package contains the following patches:
Commit | Date (YYYY-MM-DD) | Comment
d0b9345 | 2014-10-14 | MapR-8915: Sqoop import operations failed with an "error in opening zip file" exception.
303ad67 | 2014-08-27 | (Bug 14922) Changes the Sqoop script to pick up Hadoop version numbers so Sqoop 1.4.4 can work with MapR 4.0.1. (Bug 12836) By default, Sqoop does not create job history files.
545068f | 2014-08-27 | (Bug 14838) The bc command is removed from the install and uninstall scripts, since it is not part of a standard Linux installation.
7e4d225 | 2014-06-14 | MapR-14355: Corrects a packaging error that resulted in a spurious error message: mv: target `/opt/mapr/sqoop/sqoop-1.4.4/sqoop-test-1.4.4-mapr.jar' is not a directory.
f54e037 | 2013-11-20 | Fix for MapR-12083. Sqoop can find the Kerberos tgt file.
6825655 | 2013-12-02 | SQOOP-1246: HBaseImportJob should add job authtoken only if HBase is secured.

12. Apache Mahout
Apache Mahout requires an upgrade from 0.7 to 0.9. No real configuration changes will be necessary
aside from updating paths, but because this jumps two full versions, some existing jobs may depend on
deprecated or removed algorithms.
Below is a list of changes from 0.8 to 0.9:

New and improved Mahout website based on Apache CMS - MAHOUT-1245


Early implementation of a Multi Layer Perceptron (MLP) classifier - MAHOUT-1265
Scala DSL Bindings for Mahout Math Linear Algebra. See this blogpost and MAHOUT-1297
Recommenders as Search. See [https://github.com/pferrel/solr-recommender] and MAHOUT-1288
Support for easy functional Matrix views and derivatives - MAHOUT-1300
JSON output format for ClusterDumper - MAHOUT-1343
Enabled randomised testing for all Mahout modules using Carrot RandomizedRunner - MAHOUT-1345
Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering - See
this pdf and MAHOUT-1361
Upgrade to Lucene 4.6.1 - MAHOUT-1364
The following algorithms that were marked deprecated in 0.8 have been removed in 0.9:
 Switched LDA implementation from Gibbs Sampling to Collapsed Variational Bayes
 Meanshift - removed due to lack of actual usage and support
 MinHash - removed due to lack of actual usage and support
 Winnow - removed due to lack of actual usage and support
 Perceptron - removed due to lack of actual usage and support
 Slope One - removed due to lack of actual usage
 Distributed Pseudo recommender - removed due to lack of actual usage
 TreeClusteringRecommender - removed due to lack of actual usage
Below is a list of changes from 0.7 to 0.8:
 Numerous performance improvements to Vector and Matrix implementations, APIs, and their iterators


 Numerous performance improvements to the recommender implementations
 MAHOUT-1088: Support for biased item-based recommender
 MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases
 MAHOUT-1106: Support for SVD++
 MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles, as well
as an upgrade of the supported Lucene version to Lucene 4.3.1
 MAHOUT-1154 and friends: New streaming k-means implementation that offers online (and fast)
clustering
 MAHOUT-833: Make conversion to SequenceFiles MapReduce; 'seqdirectory' can now be run as a
MapReduce job
 MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash
(indexes or values)
 MAHOUT-884: Matrix Concat utility; presently only concatenates two matrices
 MAHOUT-1187: Upgraded to CommonsLang3
 MAHOUT-916: Speed up the Mahout build by making tests run in parallel
Existing code should be run on one of the existing MapR v4.0.2 clusters to test for compatibility.
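As a hedged smoke test tied to the seqdirectory change noted above (MAHOUT-833), the 0.9 driver can be
pointed at a small directory of text files; the input and output paths are illustrative:
$> /opt/mapr/mahout/mahout-0.9/bin/mahout seqdirectory \
     -i /user/$USER/mahout-smoke/input \
     -o /user/$USER/mahout-smoke/seqfiles -ow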
24.Apache Oozie
Apache Oozie will require an upgrade from 3.3.2 to 4.0.1, along with some configuration changes to work
with MapR v4.0.1. Oozie 4.0.1 supports both MR1 and MR2; however, Oozie must be started against one
mode or the other.
If you are running Oozie jobs on YARN, you must provide the address of the node running the active
ResourceManager and the port used for ResourceManager client RPCs (port 8032). Edit
the job.properties file and insert the following statement:
JobTracker=<ResourceManager_address>:8032
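A minimal job.properties sketch for a YARN submission is shown below. The hostnames and the workflow
path are illustrative, and the property names assume the common ${nameNode}/${jobTracker}
parameterization used inside workflow definitions:
nameNode=maprfs:///
jobTracker=<ResourceManager_address>:8032
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/workflows/example-wf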

You also need to copy the yarn-site.xml file for the active ResourceManager to the following location:
/opt/mapr/oozie/oozie-<version>/conf/hadoop-conf

Restart Oozie after making this change.
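A hedged restart sketch using maprcli (the hostname placeholder is illustrative; run the command from
any node where maprcli is configured):
$> maprcli node services -name oozie -action restart -nodes <oozie_hostname>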


Installation Notes
Before installing Oozie 4.0.1-1501, remove the old share libraries from: maprfs://oozie/share
If you are running MapR Version 3.x, copy the new Oozie share libraries manually:
$> hadoop fs -put /opt/mapr/oozie/oozie-${oozie_version}/share1 /oozie/share

Note: With MapR v4.0.2, the share libs are copied automatically when you first start Oozie 4.0.1.
The latest Oozie package contains the following patches:
Commit     Date (YYYY-MM-DD)   Comment
2973bc3    2014-11-20          MAPR-16032: Oozie 4.0.1 did not pick up variables from the
                               mapred-site.xml file.
911f78a    2014-11-04          MAPR-15524: Oozie server installation on a Version 3.1.1
                               mapr-client failed.
e6230c     2014-10-30          MAPR-12997: Oozie ShareLib was updated to include Hive 0.13
                               and Pig 0.12.

25.Apache HBase
Apache HBase will require an upgrade from 0.94.12 to 0.98.7 for MapR v4.0.2. This also requires an
upgrade of the HFile format on disk, so it is highly recommended to back up important tables to another
cluster prior to the upgrade.
Here is a summary of the steps needed to upgrade HBase:
1. Install the HBase 0.98.7 binaries.
2. Use the 0.98.7 binary to check for incompatible files on the running 0.94.x cluster. (The purpose of
this step is to identify HFiles still written in the old, incompatible format.)
$> /opt/mapr/hbase/hbase-0.98.7/bin/hbase upgrade -check

If incompatible files are found, you must purge them by running a major compaction (see the example
after these steps).
3. Shut down the HBase 0.94.x services on the MapR cluster.
4. Execute the upgrade on the cluster:
$> /opt/mapr/hbase/hbase-0.98.7/bin/hbase upgrade -execute
5. Start the upgraded HBase services.
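As a hedged sketch of the compaction step mentioned above, a major compaction can be triggered from the
HBase shell of the still-running 0.94 installation; the table name 'mytable' and the 0.94.12 install
path are illustrative:
$> echo "major_compact 'mytable'" | /opt/mapr/hbase/hbase-0.94.12/bin/hbase shell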
The Apache upgrade documentation can be found here:
http://hbase.apache.org/book.html#upgrade0.96
Note: Upgrading from 0.94.X to 0.98.X is the same as upgrading from 0.94.X to 0.96.X.
HBase 0.98.7 is the initial release of the 0.98 line for the MapR Distribution for Hadoop. It includes
the following new features:
1. C APIs for HBase (libhbase). This library is not supported by MapR-DB.
2. HTable.checkAndMutate(). This API is not supported by MapR-DB.
3. Impersonation for HBase REST gateway with MapR-DB tables is supported on a MapR 4.0.2 cluster.
26.Apache Cascading
Apache Cascading will require an upgrade from 2.1 to 2.5 for MapR v4.0.2. We support both MR1 and MR2
with Cascading 2.5. Users are advised to test existing code with the newer version of Cascading.
Placeholder: additional information needed
27.Apache Whirr
Apache Whirr is a set of libraries for running cloud services. There are currently no Amex
projects using this software, so it is recommended to remove Whirr prior to the upgrade (see the
removal sketch below).
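A hedged removal sketch, assuming Whirr was installed from the MapR repository under the package name
mapr-whirr (verify the installed package name on your nodes first):
$> sudo clush -a 'rpm -qa | grep -i whirr'
$> sudo clush -a yum remove -y mapr-whirr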
28.Apache Zookeeper
With MapR v3.1.1 and later, MapR uses ZooKeeper 3.4.5 instead of 3.3.6. The 3.4.5 version requires no
additional configuration and should be compatible with other open source projects such as Solr and
Storm. Prior to the upgrade, it is recommended to back up /opt/mapr/zkdata on each node configured
to run ZooKeeper.
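A hedged backup sketch using the zk clush group defined in this plan (the archive location under /tmp
is illustrative; move the archives off-node per local policy):
$> sudo clush -g zk 'tar czf /tmp/zkdata-backup-$(hostname -s).tar.gz -C /opt/mapr zkdata'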


29.Apache Storm
Apache Storm is running on a single cluster at Amex, which is running MapR v3.1.1. The same version
of Storm is supported on both 3.1.1 and 4.0.2. No additional configuration will be necessary until the
Storm cluster itself is ready for upgrade.
Note: although a Storm-on-YARN project exists, the code was forked from an earlier version of Storm
and should be considered alpha quality.

Amex Lab Upgrade Plan


This procedure details the high-level plan to test the upgrade on a 13-node cluster (Legolas) in the AXP
IPC1 data center. The cluster will be configured with the Silver cluster as a template and will use data
from Silver. Prior to this step, it is assumed the cluster is running JDK 1.7 and Redhat Linux 6.6. The
details below cover the first three weeks of testing, beginning March 2nd.
(Week 1: 3/2 - 3/6) - Build New Sandbox Cluster
Activities for the first week include building the template for the new cluster and performing the
build.

Build StackIQ Template for the Cluster
Determine the cluster layout and develop the template for zk, admin, and data nodes.

Deploy the Cluster
Deploy the cluster via StackIQ and resolve any hardware or build issues.

Bring the Cluster Online and Apply License
For testing purposes, the cluster should initially use an M5 temporary license key (fully functional for
30 days).
(Week 2: 3/9 - 3/13) - Bring the Cluster Online, Operationalize, and Transfer Data
Activities in week 2 should focus on bringing the cluster online using Silver as a template. The cluster
should be tested thoroughly to ensure compatibility with all ecosystem components and the end-user
environment, using real data to simulate the Silver cluster.

Operationalize the Cluster
The cluster post-install scripts should be run in order to configure all necessary components, the user
environment, and data volumes. Tasks include:
 Create standard AXP volumes
 Create user home directories and MCS access
 Configure MySQL databases for Hive and MapR Metrics
 Set up NFS mounts for edge and CLDB nodes
 Configure node and volume topologies
 Configure Hive, HBase, and other ecosystem components
 Configure JobTracker and Fair Scheduler queues and ACLs
 Configure Clustershell (clush) groups for zk, cldb, edge, data, mysql, etc. (see the example groups
file below)
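A hedged sketch of an /etc/clustershell/groups file; the node names and ranges are purely illustrative,
while the group names match the clush commands used later in this plan:
zk: node[001-003].example.com
cldb: node[001-003].example.com
admin: node004.example.com
edge: node[005-006].example.com
data: node[007-013].example.com
mysql: node004.example.com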

Verify Cluster Operation
To ensure cluster operation, run representative jobs on the cluster (see the command sketches below):
 TeraSort (100 million * 100-byte records = 10 billion bytes, roughly 10 GB)
 TestDFSIO (10 output files, 1 GB each)
 Hive, Pig, and HBase validation
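Hedged command sketches for the TeraSort and TestDFSIO runs. The jar names below assume the MR1 examples
and test jars shipped under /opt/mapr/hadoop/hadoop-0.20.2; verify the exact jar names on the cluster and
submit through the proper queue (for example with -Dmapred.job.queue.name=<queue>):
$> hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar teragen 100000000 /benchmarks/tera/in
$> hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar terasort /benchmarks/tera/in /benchmarks/tera/out
$> hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000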


Run Terry's Benchmark Suite
Terry has a standard benchmark suite that has been run on every cluster deployed in the AXP
environment. The suite stages and generates data, then runs representative MapReduce code. Three
MapReduce jobs and one Hive job stage the data, which takes 4-6 hours, followed by the actual tests.
All staging and tests should be submitted by an unprivileged user through the proper queue. The tests
will take approximately 16 hours to run on a cluster of this size:
 In-Memory Map-Only Join
 Java Raw Comparator
 Hive Join
 Avro Join

Stage Actual Data and Tables
Real data and actual tables should be copied to the cluster and validated in order to confirm
equivalent functionality following the upgrade. Using existing code from application teams on Silver
is encouraged.
 Hive tables
 HBase tables
 Data for existing MapReduce code
(Week 3: 3/16 - 3/20) - Perform Upgrade and Develop Configuration
This week should be dedicated to performing the upgrade manually and developing the configuration
files and templates needed to perform subsequent upgrades. All YARN configurations should be
developed using Bronze and Gimli as templates. Updates to existing configurations should be noted
and captured in StackIQ.
To perform the upgrade, use the following steps as a guideline:

Halt Jobs
As defined by your upgrade plan, halt activity on the cluster in the following sequence before you
begin upgrading packages:
1. Notify stakeholders.
2. Stop accepting new jobs.
3. Terminate any running jobs.
The following commands can be used to list and terminate MapReduce jobs:
# hadoop job -list
# hadoop job -kill <job-id>
# hadoop job -kill-task <task-id>

4. You might also need specific commands to terminate custom applications.


At this point the cluster is ready for maintenance but still operational. The goal is to perform the
upgrade and get back to normal operation as safely and quickly as possible.

Stop Cluster Services


The following sequence will stop cluster services gracefully. When you are done, the cluster will be
offline. The maprcli commands used in this section can be executed on any node in the cluster.
Disconnect NFS Mounts
Unmount the MapR NFS share from all clients connected to it, including other nodes in the cluster.
This allows all processes accessing the cluster via NFS to disconnect gracefully. The following
example unmounts the NFS shares from the edge and cldb nodes.


%> sudo clush -g cldb umount -l /mapr/axp
%> sudo clush -g edge umount -l /idn/axp

Stop Ecosystem Component Services


Stop ecosystem component services on each node in the cluster.
1. Run the maprcli node list command to display the services on each node in the cluster:
# maprcli node list -columns hostname,csvc

2. Stop ecosystem component services.


For example, you can use the following command to stop Oozie and Hive on the edge nodes:
# maprcli node services -multi '[{"name":"oozie","action":"stop"},
{"name":"hs2","action":"stop"}]' -nodes <hostnames>

Stop MapR core services


1. Stop warden on the CLDB nodes first:
$> sudo clush -g cldb service mapr-warden stop

2. Stop warden on the remaining admin and data nodes:
$> sudo clush -g admin,data service mapr-warden stop

3. Stop zookeeper
$> sudo clush -g zk service mapr-zookeeper stop

Back Up the /opt/mapr/roles Directory
$> sudo clush -a 'mkdir -p /tmp/roles && cp /opt/mapr/roles/* /tmp/roles'

Upgrade Packages and Configuration Files


Upgrade the following MapR core component packages on all nodes where they exist:

mapr-cldb
mapr-core
mapr-fileserver
mapr-jobtracker
mapr-metrics
mapr-nfs
mapr-tasktracker
mapr-webserver
mapr-zookeeper
mapr-zk-internal
$> sudo clush -a yum update -y mapr-cldb mapr-core mapr-fileserver mapr-jobtracker \
   mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper mapr-zk-internal

Restore roles files


$> sudo clush -a 'rm -f /opt/mapr/roles/* && cp /tmp/roles/* /opt/mapr/roles/'


Install YARN Components (ResourceManager, NodeManager, HistoryServer)
$> sudo clush -g cldb yum install -y mapr-resourcemanager
$> sudo clush -g data yum install -y mapr-nodemanager
$> sudo ssh <historyserver> yum install -y mapr-historyserver

Verify that packages installed successfully on all nodes.


Confirm that there were no errors during installation, and check
that /opt/mapr/MapRBuildVersion contains the expected value.
Example:
$> sudo clush -aB cat /opt/mapr/MapRBuildVersion

Update the warden configuration files, which are located in /opt/mapr/conf/
and /opt/mapr/conf/conf.d:
 Manually merge new configuration settings from the files in /opt/mapr/conf/conf.d.new/ into
/opt/mapr/conf/conf.d/.
 Manually merge new configuration settings from /opt/mapr/warden.conf.new into
/opt/mapr/warden.conf.
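A hedged sketch for reviewing the differences before merging; the paths follow the locations noted
above, and -aB groups identical output across nodes:
$> sudo clush -aB 'diff -u /opt/mapr/warden.conf.new /opt/mapr/warden.conf'
$> sudo clush -aB 'ls /opt/mapr/conf/conf.d.new/'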

Upgrade existing ecosystem components


Notes: Hive does not require a metastore upgrade, but the metastore should be backed up prior to the
upgrade (see the backup sketch below).
HBase tables should be backed up to another cluster prior to the upgrade, as the HFile format on disk
has changed.
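A hedged backup sketch, assuming the Hive metastore lives in the MySQL database configured in week 2;
the host, user, and database name (hive_metastore) are illustrative:
$> mysqldump -h <mysql_host> -u <metastore_user> -p --single-transaction hive_metastore > /tmp/hive_metastore_$(date +%F).sql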

Edit sethadoopenv.sh on the edge nodes (as needed).

File after edits:
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
export SQOOP_HOME=/opt/mapr/sqoop/sqoop-1.4.4
export MAHOUT_HOME=/opt/mapr/mahout/mahout-0.9
export HBASE_HOME=/opt/mapr/hbase/hbase-0.98.7
export HIVE_HOME=/opt/mapr/hive/hive-0.12
export PIG_HOME=/opt/mapr/pig/pig-0.12
export PIG_CLASSPATH=$HADOOP_HOME/conf
export PATH=$PATH:$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$SQOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin:$PIG_HOME/bin
export CLASSPATH=$HADOOP_HOME/conf
export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=lpdbd0000.phx.aexp.com,lpdbd0010.phx.aexp.com,lpdbd0016.phx.aexp.com"

# Determine the user's Fair Scheduler queue (skipped for root).
LOGNAME=`whoami`
if [ "$LOGNAME" == "root" ]
then
    :  # root does not need MY_QUEUE
else
    # Reuse the cached queue name if ~/.my_queue is non-empty and less than 24 hours old.
    if [[ -f ~/.my_queue && `cat ~/.my_queue | grep "[a-z]" | wc -l` -gt 0 ]] && \
       [[ $(echo "`date +%s` - `stat -L --format %Y ~/.my_queue`" | bc) -lt 86400 ]];
    then
        export MY_QUEUE=`cat ~/.my_queue`;
        echo -e "\nUsing Existing Queue Info";
    else
        # Cache the first non-default queue the user can submit jobs to.
        $HADOOP_HOME/bin/hadoop queue -showacls 2>/dev/null | \
            grep -v "default" | grep submit-job | awk '{print $1}' | head -1 > ~/.my_queue;
        export MY_QUEUE=`cat ~/.my_queue`;
        echo -e "\nCreating Queue Info";
    fi
    if [ "`echo ${MY_QUEUE:-null}`" == "null" ]; then
        echo -e "\n!Error: Unable to set MY_QUEUE; Please check if you are a member of any queue other than \"default\"";
    else
        echo -e "\nDefined MY_QUEUE=$MY_QUEUE\n";
    fi
fi

Start the cluster and enable new features


Start zookeeper on the zookeeper nodes
$> sudo clush -g zk service mapr-zookeeper start

Start warden on all nodes


$> sudo clush -a service mapr-warden start

After the cluster comes up, enable the cldb v4 features from the command line
(as root or sudo on a single node)
$> sudo maprcli config save -values {cldb.v4.features.enabled:1}

Set the MapR version


(as root or sudo on a single node)
$> cat /opt/mapr/MapRBuildVersion
4.0.2.29870.GA
$> sudo maprcli config save -values {mapr.targetversion:"4.0.2.29870.GA"}

Promotable Mirror Volumes (new feature as of version 4.0.2)


Issue the following command to enable support for promotable mirror volumes:
(as root or sudo on a single node)
# maprcli config save -values {mfs.feature.rwmirror.support:1}

Note: This feature is automatically enabled with a fresh install.


Reduce On-Disk Container Size (new feature as of version 4.0.2)
Issue the following command to reduce the space required on-disk for each container:
(as root or sudo on a single node)
# maprcli config save -values {cldb.reduce.container.size:1}

The reduction of the on-disk container size will take effect after the CLDB service restarts or fails
over.

The same tests run in week 2 should be run again to validate functionality in both MR1 and MR2.
