
IBM Security Guardium V10.1
Deployment Guide for Hadoop Systems
July 27, 2016

Revisions
12/16/2015 - Kerberos configuration IS needed if either HBase or Hive is used. Previously this guide said it was not needed for Hive, which was incorrect.
01/12/2016 - Added Solr port (8983) and IE type (HTTP).
06/08/2016 - Updated sizing rules of thumb. Added GPFS information for BigInsights. Removed the requirement to customize the HBase report, as this has been fixed in 10.1.
06/27/2016 - Added a reference to the deployment guide for Hortonworks for the Ranger integration.

IBM Security Guardium V10.1 - Deployment Guide for Hadoop Systems 1


Hadoop Deployment Guide
Objectives of this Guide...................................................................................................... 4
What's new in V10 for Hadoop .......................................................................................... 4
What's new in V10.1 for Hadoop ....................................................................................... 5
Planning .............................................................................................................................. 5
Sizing (capacity planning) .............................................................................................. 5
What version and distribution of Hadoop? ..................................................................... 6
What components of Hadoop are you running? ............................................................. 6
What are the business requirements for auditing? Considerations for policies and
reporting .......................................................................................................................... 7
Considerations for policy rules ....................................................................................... 7
Considerations for reporting ........................................................................................... 8
What is an object in Hadoop? ..................................................................................... 8
What is a verb for HDFS? ........................................................................................... 8
What are verbs for HBase? ......................................................................................... 9
Big SQL, Impala, and Hive verbs and objects ............................................................ 9
Restrictions on Monitoring and other operational considerations .................................. 9
Are you using Kerberos? .............................................................................................. 10
Deploying S-TAPS (and GIM clients) .............................................................................. 10
Configuring inspection engines .................................................................................... 11
Configuring inspection engines using Guardium API .................................................. 13
Default Hadoop policy ...................................................................................................... 15
Simple production policy .................................................................................................. 16
Rule: Privileged user activity: Log full details ......................................................... 17
Rule: Privileged user access to sensitive data: log policy violation ......................... 18
Built-in reports .................................................................................................................. 19
Hadoop Permissions...................................................................................................... 19
Privileged users accessing sensitive data ...................................................................... 20
Access denied exception report .................................................................................... 22
Users on the Hadoop cluster ......................................................................................... 22
Policies that support redaction and blocking (Advanced) ................................................ 23
Prerequisites for blocking ............................................................................................. 24
Blocking rules ............................................................................................................... 24
Prerequisites for redaction ............................................................................................ 27
Redaction rules.............................................................................................................. 27
Deployment Recommendations ........................................................................................ 30
Resources .......................................................................................................................... 31
Appendixes ....................................................................................................................... 32
Appendix A. Kerberos setup instructions when HBASE or Hive is used ................... 32
Step 1: Creating a keytab for use with Guardium .................................................. 32
Step 2. Configure Guardium ..................................................................................... 35
Step 3: Basic operational testing of the configuration .............................................. 37
Alternate approach to creating Kerberos keytabs (for Cloudera) ............................. 37
Appendix B. Using computed attributes to pull out db user from SOLR, Impala Hue, or
Hive Hue/Beeswax........................................................................................................ 42



Impala: computed attribute to get user name from Hue ........................................... 45
Hive: computed attribute to get user name from Hue/Beeswax ............................... 46
Appendix C: Considerations for IBM InfoSphere BigInsights and Big SQL .............. 47
Hadoop on GPFS (IBM Spectrum Scale) ................................................................. 47
Big SQL .................................................................................................................... 47
Appendix D. Supported Hadoop components (Hadoop 2) ........................................... 48
Notices ...................................................................................................................... 49



Objectives of this Guide
This guide is intended to help customers, IBM Business Partners, and IBM technical staff plan for and validate Guardium for Hadoop in a test or sandbox environment. Over time, we hope this guide can be augmented with more information about best practices and performance, but it will not and cannot replace the IBM Redbook, Deployment Guide for InfoSphere Guardium, which is currently the best source of information for overall planning, performance tuning, and troubleshooting.

Assumptions:
- A client has already set up the environment on their own or worked with technical sales, lab services, or a knowledgeable Business Partner to do initial sizing and to set up the environment, including network connectivity for the Guardium appliances.
- The person doing the implementation has Guardium knowledge and has involved the relevant people who understand the client's Hadoop architecture.
- The client has a clear set of use cases to test based on an understanding of known capabilities in the product for Hadoop (as described in this document, for example). If a requirement is not addressed by the product, the client can use the Request for Enhancement process to ensure that IBM product management is aware of the request: https://www.ibm.com/developerworks/rfe/
- This space is changing frequently and Guardium is evolving constantly to keep up with the changes. Be sure to work with IBM to ensure that you have the latest information and to check whether there is a later version of this guide.

What's new in V10 for Hadoop

The major new enhancement in V10 is support for blocking and redaction. There were also many changes to improve overall parsing, collection of relevant information, and reporting. This section will continue to be updated as further improvements roll out through the maintenance stream.

- Blocking and redaction for Hive and Impala (this was already supported for Big SQL). For more information, see Policies that support redaction and blocking (Advanced) on page 23.
- New inspection engines for Hive, Hue, Impala, and WebHDFS. For more information, see Table 1 on page 12.
- Removed restriction on the Hue Metastore. Previously only MySQL was supported; now PostgreSQL and Oracle are also supported.
- Failed logins are now captured from Hue for MySQL, Oracle, and PostgreSQL datastores.



- Support for ODBC traffic (delivered originally in V9p500).
- Improved built-in reports that are more focused on security, such as a permissions report, privileged users accessing sensitive data, and so forth. For more information, see Built-in reports, on page 19.
- Dropped support for CDH3 and for BigInsights 1.x.
- BigInsights: added support for 4.0 and 4.1, and dropped GuardiumProxy support.
- If you are NOT using HBase or Hive, you no longer need to specifically configure Guardium for activity that uses Kerberos authentication. (This is also true in V9.)

What's new in V10.1 for Hadoop

The major enhancement in 10.1 is integration with Apache Ranger for Hortonworks distributions. If you are using SSL encryption with your Hortonworks Hadoop cluster, the S-TAP as described in this guide will not work. Instead, you will need to reference another guide: Guardium Activity Monitoring (and blocking) for Hortonworks Hadoop using Apache Ranger Integration.

Planning
Sizing (capacity planning)
This section refers to capacity planning, not sizing for pricing. Pricing for Hadoop is per
node.

The current rules of thumb are based on deployments that are not high volume in terms of what is being audited:
- 10 management/server nodes per collector
- 20+ data nodes per collector, assuming S-TAPs are needed for the data nodes (they are not needed for all components)
- Possibly even more nodes per collector if physical appliances are used

Your sizing may vary, of course.

The other option is to size by the PVUs of the nodes. This may result in oversizing if you are not auditing significant amounts of traffic. The capacity sizing guideline for Version 10 is 4000 PVUs per collector.

http://www-01.ibm.com/support/docview.wss?uid=swg27046184
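As a quick worked example of the node-count rules of thumb above, the sketch below estimates a collector count with ceiling division. The cluster sizes are hypothetical; substitute your own numbers and remember that actual sizing depends on audited traffic volume.

```shell
# Rough collector estimate from the rules of thumb above:
# 10 management/server nodes per collector, 20 data nodes per collector.
# The node counts below are hypothetical examples.
mgmt_nodes=8
data_nodes=45
mgmt_collectors=$(( (mgmt_nodes + 9) / 10 ))    # ceiling of 8/10  -> 1
data_collectors=$(( (data_nodes + 19) / 20 ))   # ceiling of 45/20 -> 3
echo "Estimated collectors: $(( mgmt_collectors + data_collectors ))"
```

This is only a starting point; validate against observed traffic in a test environment before finalizing capacity.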



What version and distribution of Hadoop?
Although Guardium supports multiple versions and distributions of Hadoop, it is important to record the exact distribution being used in your environment. This is because the different levels of Apache Hadoop itself impact Guardium processing and thus could require additional patches, and also because different distributions support or include different add-on components, either open source or proprietary. It's important to know exactly what is and isn't covered by Guardium from a monitoring perspective, and to ensure that you have all the correct Guardium prerequisites/patches installed for your version.

This space is changing frequently, so double check with IBM if you are unsure or to get the latest information.

As of the writing of this guide:

Hadoop 1.x is used with the following distributions:
- Cloudera 4.x
- Hortonworks 1.x
- IBM BigInsights 2.1
- Pivotal (Greenplum) HD 1.2

Hadoop 2.x is used with the following distributions:
- Cloudera 5.x
- Hortonworks 2.x
- BigInsights 2.1.2, 3.0, 4.x
- Pivotal 1.5

Record your Hadoop distribution and release here:

_______________________________________________________________________

What components of Hadoop are you running?


The basics in Hadoop include the file system (HDFS), where the data is stored, and
MapReduce (or MapReduce 2, YARN), which is the framework for accessing and
analyzing data. From a monitoring perspective, if you capture activity on these two
components, you are covering basic auditing requirements because at the end of the day
everything (except management console traffic) goes through HDFS.



Figure 1. Hadoop architecture

However, HDFS activity is not the most auditor-friendly; it is somewhat like monitoring file accesses in a relational database. You may want to also monitor activity from other components that your organization is probably using, such as Hive, Big SQL, or Impala, which are more akin to what one might expect from database access.

Example report outputs from some of these components are included in this guide.

You can record which components you are running in Table 1 as well as whether you
require monitoring above and beyond HDFS monitoring.

What are the business requirements for auditing? Considerations for policies and reporting

For monitoring purposes, you must think about the user, the data object being monitored, and what actions/commands are being performed. In Guardium terminology, these are, respectively, the DB User, the Object, and the Verb (the command). Those of you familiar with Guardium will remember that these entities can be used in policy rules to trigger particular actions, such as a real-time alert.

So, as with any other auditing exercise, a key step in setting up your security policies is to
inventory your assets and map your inventory of assets to users and servers.

Considerations for policy rules

Guardium policy rule actions not only allow you to alert on or log policy violations, but also enable you to filter certain traffic for performance reasons.



For Hadoop traffic, you cannot use session-level filtering actions, such as Ignore S-TAP session. This is because Hadoop does not do session management in the same way as relational databases, where you log in to the database (which establishes a session), run SQL traffic within that session, and then log out again. With Hadoop, each command is its own session and can spawn many more sessions as work gets distributed throughout the cluster.¹

In most cases, Guardium cannot catch failed logins for command-line components. Guardium can see failed logins from Hue and through IBM Big SQL.

You will get permission exceptions at the file system level, so you can report on those using the exceptions domain.

Considerations for reporting

This section includes lists of objects and commands (verbs) that apply to Hadoop. You can cut and paste the commands into a group in Guardium if you like, using the Group Builder tool. You will also need to create groups of users and objects based on your own environment.

What is an object in Hadoop?


An object is one of the following:
- HDFS files/directories
- MapReduce job name (YARN only). Prior to MapReduce 2, the MapReduce job name was not logged as a separate object, but you could obtain it by using the built-in MapReduce report, which used computed attributes to pull the job name out of the full message.
- IBM Big SQL, Impala, Hive, and HBase table and view names

What is a verb for HDFS?


Read verbs for HDFS:
- getFileInfo
- getBlockLocations
- getFileLocation
- getListing

Write verbs for HDFS:
- addBlock
- complete
- create
- delete
- mkdirs
- rename

¹ Note that BigSQL traffic in BigInsights does have session information, even if the underlying HDFS does not.
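If you want to manage the HDFS verbs listed above as a Guardium group without typing them into the Group Builder, the grdapi group APIs can script the load. The sketch below only prints the commands (a dry run) so you can review them before running them in the Guardium CLI; the group name is made up, and the exact group-API parameter names can vary by Guardium version, so check the grdapi reference before executing.

```shell
# Dry run: print grdapi commands that would load the HDFS read verbs
# into a command group. Group name and group-API parameters are
# illustrative assumptions; verify against your grdapi reference.
GROUP="Hadoop HDFS Read Verbs"
echo "grdapi create_group desc=\"$GROUP\" type=COMMANDS appid=Public owner=admin"
for verb in getFileInfo getBlockLocations getFileLocation getListing; do
  echo "grdapi create_member_to_group_by_desc desc=\"$GROUP\" member=$verb"
done
```

The same pattern applies to the write verbs and the HBase verbs in the next section.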

What are verbs for HBase?


Read verbs:
- list
- scan

Write verbs:
- createTable
- disableTable
- deleteTable
- multi (this is an insert/update; with the Ranger integration deployment option, this is put)
- drop

Big SQL, Impala, and Hive verbs and objects


The Big SQL, Impala, and Hive query languages are SQL-like, and thus the normal parsing and logging rules apply, as with most other relational databases in Guardium. Many of those commands are already included in built-in Guardium command groups, such as ALTER commands, CREATE commands, and administrative commands. The extent of SQL syntax support varies greatly among these components, with Big SQL having the most extensive support.

Restrictions on Monitoring and other operational considerations


- SSL encryption is not supported. The one exception is for Hortonworks deployments that use Ranger; Guardium can leverage Ranger auditing to capture traffic after decryption. This integration is covered in another deployment guide.
- UID chaining is not supported.
- Blocking and redaction are only supported for Big SQL, Hive, and Impala.
- Configuration Audit System and sensitive data discovery are not supported at this time.
- Guardium currently does not support auditing of administration commands (stopping and starting services, and so on).
- Guardium load balancing and failover options are not supported when Kerberos is used. (F5 or another load balancer in which a virtual IP is used may be an option.)



Are you using Kerberos?
Guardium supports the use of Kerberos-secured clusters with some restrictions (such as load balancing not being supported). To decrypt Kerberos user IDs, Guardium requires that keytab files be generated and placed in a specific location. Instructions are included in Appendix A, Kerberos setup instructions, on page 32.

If you are NOT using HBase or Hive, you do not need to configure Guardium for a
Kerberos configuration.

Deploying S-TAPs (and GIM clients)


Only S-TAP and GIM clients are needed since Guardium does not yet support CAS and
database discovery for the Hadoop platforms.

As with any S-TAP deployment, be sure to download the correct S-TAP for your
operating system and kernel level.

Figure 2 provides a high level overview of where S-TAPs should be installed depending
on what you want to monitor. Note that the graphic does not necessarily reflect physical
servers.

Figure 2. S-TAPs in Hadoop

Edge nodes: An S-TAP is recommended for edge nodes as well, particularly if you are
using them as a landing zone for data.



Configuring inspection engines
After S-TAPs are deployed, the appropriate inspection engines must be defined from the Guardium appliance. Inspection engines specify what traffic is to be monitored from a particular S-TAP host. For example, the figure below shows that on this particular S-TAP host, Guardium should monitor traffic on ports 8032 and 60000. Inspection engines are also where you define the protocol, such as Hadoop or HTTP.
Figure 3. S-TAP detailed architecture

Use the table below to record the ports and inspection engine protocols required for each
node. Combine this information into a spreadsheet with the server IP (S-TAP host IP) and
you will have everything you need to create grdapi commands if you prefer to use that
instead of configuring each of these using the Guardium UI.



Table 1. Indicate which services require monitoring and their associated ports

Required (Y/N) | Hadoop Node           | Service                                          | Default Ports                 | Your Ports | IE Protocol
---------------|-----------------------|--------------------------------------------------|-------------------------------|------------|------------
               | HDFS Name Node        | Namenode                                         | 8020                          |            | Hadoop
               | HDFS Name Node        | Namenode HTTP port (for WebHDFS)                 | 50070                         |            | WEBHDFS
               | Namenode              | Resource Manager (YARN only)                     | 8032                          |            | Hadoop
               | MapReduce Job Tracker | Job Tracker (only for MapReduce 1)               | 8021, 9290, 50030             |            | Hadoop
               | HBase Master          | HBase Master                                     | 60000                         |            | Hadoop
               | HBase Region          | HBase Region                                     | 60020                         |            | Hadoop
               | Hive                  | Hive Server 2 Thrift protocol messages           | 10000                         |            | HIVE
               | Hive Metastore        | Thrift protocol message used to get the Impala and Hive DB user from Hue (requires computed attribute) | 9083 |  | HADOOP
               | Impala daemons        | Impala                                           | 21000                         |            | IMPALA
               | Impala                | Impala from Hue                                  | 21050                         |            | HIVE (because Impala from Hue uses HiveServer2)
               | Management node       | BigSQL Server                                    | 51000; 32051 (changed in 4.1) |            | DB2
               | Compute node          | BigSQL Server                                    | 51000; 32051 (changed in 4.1) |            | DB2
               | Hue node              | Hue UI (Oracle backend)                          | 1521                          |            | HUE
               | Hue node              | Hue UI (MySQL backend)                           | 3306                          |            | HUE
               | Hue node              | Hue UI (PGSQL backend)                           | 5432                          |            | HUE
               | Solr search node      | Solr Search                                      | 8983                          |            | HTTP



Notes:
- Hive CLI: This is deprecated in Hadoop distributions and is not supported by Guardium.
- Impala: You must set up inspection engines on all nodes that run Impala daemons.
- HBase: You need S-TAPs on all data nodes as well as on the Master.
- Big SQL: If you are using Kerberos or GPFS, you must configure the S-TAP with the DB2_Exit, which is a safe, efficient way to capture Big SQL/DB2 encrypted traffic and/or GPFS. This means, however, that blocking and redaction are not supported. See the developerWorks article here for more details on configuration and support for Big SQL: http://www.ibm.com/developerworks/data/library/techarticle/dm-1411hadoop-biginsights-guardium/index.html

Configuring inspection engines using Guardium API


These examples use port ranges to reduce the number of inspection engines that must be configured; in general, however, it is a best practice to limit the number of ports that Guardium is listening on. Create your own grdapi scripts to match your own configuration. (This can also be done through the Guardium user interface under Manage > Activity Monitoring > S-TAP Control.)
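For a multi-node cluster, commands like the ones below can be generated from a simple host,port,protocol list instead of being typed by hand. This sketch prints one create_stap_inspection_engine command per row; the inline rows and the comma-separated format are illustrative assumptions, and the parameter values mirror the examples that follow.

```shell
# Generate grdapi inspection-engine commands from host,port,protocol rows.
# The rows below are examples; replace them with your own cluster data
# (for instance, exported from the spreadsheet built from Table 1).
while IFS=, read -r host port protocol; do
  echo "grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0" \
       "protocol=$protocol ktapDbPort=$port portMax=$port portMin=$port" \
       "connectToIp=127.0.0.0 stapHost=$host"
done <<'EOF'
10.19.232.21,8020,HADOOP
10.19.232.21,10000,HIVE
10.19.232.22,60020,HADOOP
EOF
```

Review the generated output, then run it in the Guardium CLI.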

/* Master or NameNode
/* YARN
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP
ktapDbPort=8032 portMax=8050 portMin=8032 connectToIp=127.0.0.0
stapHost=10.19.232.21

/*HDFS
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP
ktapDbPort=8020 portMax=8020 portMin=8020 connectToIp=127.0.0.0
stapHost=10.19.232.21

/*Hive metastore to capture impala and hive db user through Hue


grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP
ktapDbPort=9083 portMax=9083 portMin=9083 connectToIp=127.0.0.0
stapHost=10.19.232.21

/* WEBHDFS
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=WEBHDFS
ktapDbPort=50070 portMax=50070 portMin=50070 connectToIp=127.0.0.0
stapHost=10.19.232.21

/*HBASE Master
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP
ktapDbPort=60000 portMax=60000 portMin=60000 connectToIp=127.0.0.0
stapHost=10.19.232.21

/*Impala daemon
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=IMPALA
ktapDbPort=21000 portMax=21000 portMin=21000 connectToIp=127.0.0.0
stapHost=10.19.232.21

/*Impala from hue



grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HIVE
ktapDbPort=21050 portMax=21050 portMin=21050 connectToIp=127.0.0.0
stapHost=10.19.232.21

/*Hive
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HIVE
ktapDbPort=10000 portMax=10000 portMin=10000 connectToIp=127.0.0.0
stapHost=10.19.232.21

/* BigSQL prior to 4.1


grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=DB2
dbInstallDir=/home/bigsql procName=/home/bigsql/sqllib/adm/db2sysc
ktapDbPort=51000 portMax=51000 portMin=51000 connectToIp=127.0.0.0
stapHost=10.19.232.21

/* BigSQL 4.1 or later


grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=DB2
dbInstallDir=/home/bigsql procName=/home/bigsql/sqllib/adm/db2sysc
ktapDbPort=32051 portMax=32051 portMin=32051 connectToIp=127.0.0.0
stapHost=10.19.232.21

/*Hue Oracle backend


grdapi create_stap_inspection_engine stapHost=10.19.232.21 protocol=HUE
portMin=1521 portMax=1521 ktapDbPort=1521 connectToIp=127.0.0.0
client=0.0.0.0/0.0.0.0 dbInstallDir=/home/oracle11
procName=/home/oracle11/product/11.1.0/db_1/bin/oracle

/*Hue MySQL backend


grdapi create_stap_inspection_engine stapHost=10.19.232.21
protocol=HUE portMin=3306 portMax=3306 ktapDbPort=3306
connectToIp=127.0.0.0 client=0.0.0.0/0.0.0.0 procName=MySQL

/*Hue Postgres backend


grdapi create_stap_inspection_engine stapHost=10.19.232.21 protocol=HUE
portMin=5432 portMax=5432 ktapDbPort=5432
connectToIp=127.0.0.0 client=0.0.0.0/0.0.0.0 procName=PGSQL

/* Solr search
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HTTP
ktapDbPort=8983 portMax=8983 portMin=8983 stapHost=10.19.232.21

/* data nodes
/* HBASE Region
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP
ktapDbPort=60020 portMax=60020 portMin=60020 connectToIp=127.0.0.0
stapHost=10.19.232.21

/*Impala daemon
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=IMPALA
ktapDbPort=21000 portMax=21000 portMin=21000 connectToIp=127.0.0.0
stapHost=10.19.232.21

/*Big sql server prior to 4.1


grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=DB2
dbInstallDir=/home/bigsql procName=/home/bigsql/sqllib/adm/db2sysc
ktapDbPort=51000 portMax=51000 portMin=51000 connectToIp=127.0.0.0
stapHost=10.19.232.21



/* BigSQL 4.1 and later
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=DB2
dbInstallDir=/home/bigsql procName=/home/bigsql/sqllib/adm/db2sysc
ktapDbPort=32051 portMax=32051 portMin=32051 connectToIp=127.0.0.0
stapHost=10.19.232.21

Default Hadoop policy


Use the built-in Hadoop policy first (shown in Figure 4) to ensure traffic is being captured. It's recommended that you first try this in a low-traffic test environment, and you may even want to add one more access rule to restrict traffic to just one server type, such as Hive, to reduce the amount of noise you see.

After you are comfortable that traffic is flowing to the collector, you can clone the default
policy and create one that aligns with your security and compliance requirements, as
described in Simple production policy on page 16.

Figure 4. Default Hadoop policy

There is a lot of noise with Hadoop internal communications, and the more background
noise you can filter out, the better. Rule 1 will filter out (not log) activity in which the
object is one of the items in the Hadoop Skip Objects group. You can edit this group to
add objects that you observe in your traffic.

The second rule filters out noisy commands that reflect internal communications.



The third rule is mostly used in a test environment where there may be non-Hadoop servers associated with this appliance; it filters out traffic from any unrelated servers based on the server IPs you specify.

Tip: You must put something in the Not Hadoop Servers group, even if it's a dummy IP, or you will not collect traffic. If you don't have any such servers, make sure you remove this group altogether and uncheck the Not checkbox.

This rule also specifies the LOG FULL DETAILS action for all nonfiltered traffic, which may be handy for a small environment, but is probably not what you want in production: because each command is logged in full, this may overload the collector. Thus, you will likely modify or delete it after doing initial validation in a test environment.

Recommendation: If you are not familiar with the way Guardium policies impact data collection and reporting, familiarize yourself with that before moving ahead with policy definitions. Some recommended resources on the Guardium community on developerWorks (bit.ly/guardwiki) include:
- The 4-part video series on policies
- Tech Talk: Reporting 101
The Deployment Guide for InfoSphere Guardium also includes a good introduction (http://www.redbooks.ibm.com/abstracts/sg248129.html?Open).

Simple production policy


Figure 5 shows a simple production policy. It uses the default logging of constructs only
for most traffic and logs full details only for privileged user activity. With the default
construct logging, each unique message construct is logged and the number of times that
unique message construct is executed is aggregated once per hour. In general, log full
details is required only when exact timestamps are critical.

Figure 5. Simple Hadoop Production Policy



We also deleted the server IP filter rule that was present in the default policy.

Rule: Privileged user activity: Log full details


This rule is an example of a business-oriented rule that you might use if you need detailed, exact-time recording of all privileged user activity, as required by some compliance regulations.

Figure 6. Log full details of privileged user activity



Tip: Your Log Full Details rule must be before any rule that does not log full details (that
is, that uses normal logging of constructs only).

Rule: Privileged user access to sensitive data: log policy violation

In this case, a policy violation of medium severity will be logged whenever someone in
the privileged user group accesses any object (HDFS file, HBase Table, etc.) with the
string customer in its name. (Most likely you will be creating a group of sensitive
objects.)

Figure 7. Log policy violation of access to sensitive objects

The violation will appear in the Policy Violations / Incident Management report.
(Comply>Reports>Incident Management).

Figure 8. Log policy violation for access to sensitive data



Built-in reports
This section includes a selection of prebuilt reports. To see the complete list of built-in
reports for Hadoop, simply go to My Dashboards > Create a new dashboard, then
click on Add Report. Start typing in Hadoop and you will see the full list. A partial list
is shown below.
Figure 9. Partial list of Hadoop reports in Guardium

Some of these reports are component-based, which is probably most useful when validating your configuration and confirming that you are catching traffic from each component.

We'll go into a little more detail on the following reports, which are more focused on security and compliance:
- Hadoop - Permissions
- Privileged Users Accessing Sensitive Objects
- Exception report
- Hadoop logged-in users

Hadoop Permissions
This report shows when permissions are changed on any Hadoop file system object.



Figure 10. Hadoop permissions
report

This report uses a built-in group called Hadoop Permissions, shown below. You could also choose to include Hive, BigSQL, or Impala grant/revoke statements in this report by adding those commands as well, or catch those using another report (for example, the built-in Execution of Grant Commands report).

Figure 11. Group of Hadoop permission commands

Privileged users accessing sensitive data


This report relies on two groups: privileged users (Figure 12) and sensitive objects (Figure 13). For users, include the complete user name. For Kerberos systems, include the user name and the domain (or use a wildcard if appropriate).



Figure 12. Hadoop privileged users

For objects, you can use full file directory paths (for HDFS), wild cards, or a combination
of both. Note that if you are also specifically monitoring HBase, BigSQL, or Hive and
they also use customer in their names, those will also match.

Figure 13. Group of sensitive objects

And here is the report.



Figure 14. Privileged users accessing sensitive data

Access denied exception report

File permission exceptions are indicated by error code 101, which is used in the query conditions section of the Exception Report query builder, shown in Figure 15.

Figure 15. Hadoop Permission Exception Report

Users on the Hadoop cluster


This report can help you understand which users (IDs) are accessing the Hadoop cluster.
The Session Start attribute shows the latest date and time that the particular DB User with
the corresponding attributes of Client IP, Server IP, and Server Type was active on the
system.
Figure 16. Hadoop Users

Policies that support redaction and blocking (Advanced)


Guardium V10 supports redaction (using extrusion rules) and blocking (S-GATE
Terminate) for Hive and Impala. (Blocking for BigSQL was supported in V9.x when the
S-TAP is used.)

Here is a policy that includes both blocking and redaction rules. We'll examine these
rules and their prerequisites in more detail in this section.

Figure 17. Policy with blocking and redaction rules added

Prerequisites for blocking
As with blocking on other databases, the S-TAP must be configured with
firewall_installed=1:

BigSQL: on all nodes where BigSQL is running. Important: If you are using the
communications exit to facilitate BigSQL auditing, then blocking is not supported.
Impala: on all nodes where Impala is running.
Hive: on the node where HiveServer2 is running.
For more information about other firewall parameters, see http://www-
01.ibm.com/support/knowledgecenter/SSMPHH_10.0.0/com.ibm.guardium.doc.stap/stap
/r_stapparmsw_firewall.html

For more information about the blocking actions (S-GATE Terminate) see http://www-
01.ibm.com/support/knowledgecenter/SSMPHH_10.0.0/com.ibm.guardium.doc/protect/r
ule_actions.html

Blocking rules
Although there are additional nuances that are covered in other sources, blocking requires
a minimum of two actions:
1. S-GATE Attach: specify the conditions under which S-TAP must start watching
the session traffic for possible blocking (which requires checking all actions
against the policy on the collector).
2. S-GATE Terminate: when this condition is met, terminate the connection.

Important: Blocking has performance implications because S-TAP must hold each
command and check with the policy on the appliance to see whether the command should
be allowed through. Thus, it is important that you limit the attach conditions
to those that are not performance sensitive, such as privileged user access.

Also, because of the way Hive and Impala traffic is processed in Hadoop, you must do
the following in the blocking policy rules:

1. Specify the DB Type (either Impala or Hive) in all S-GATE ATTACH and
S-GATE TERMINATE policy rules.
2. Ensure that ATTACH happens on a combination of user and object and/or
command.

The rules shown below in Figure 18 and Figure 19 can be translated as follows:
1. Whenever there is a connection from svoruga to any Hive table that includes
customer in the name, start watching this for possible blocking.
2. If svoruga issues a SELECT command (or any command in that group) on a
customer table, block the connection.

Figure 18. Attach rule specifies DB Type, User and Object

Figure 19. Block this user when they issue a SELECT command on customer objects

Figure 20 shows a SELECT from a customer table in Hive using beeline, and how it was
blocked. The policy violation report shows that the rule was triggered.

Figure 20. Hive select was blocked

Prerequisites for redaction
As a reminder, you must enable inspection of returned data in the master inspection
engine configuration. Go to Manage > Activity Monitoring > Inspection Engines and
select the Inspect Returned Data checkbox as shown here:

Figure 21. Required configuration to enable redaction

Redaction rules

Figure 22 below shows one of the extrusion rules in our policy, which inspects the returned
data for a pattern that matches social security numbers and then redacts the data matching
the pattern. Figure 23 shows the same rule for credit card numbers.

For information about the special pattern tests, such as those for credit cards and social
security numbers, see the Knowledge Center here:
http://www-
01.ibm.com/support/knowledgecenter/SSMPHH_10.0.0/com.ibm.guardium.doc/protect/r_patterns
.html

Figure 22. Redact social security numbers (Hive, Impala or BigSQL)

Figure 23. Redact credit card data (Hive, Impala or BigSQL)

Figure 24 shows the redaction on Hive data whether the query was issued in the UI (Hue)
or in the beeline command line.

Figure 24. Redacted data for Hive

Deployment Recommendations
To avoid flooding the collector and to make problem diagnosis simpler, consider tactics
to reduce the amount and types of traffic that must be processed by the Guardium
collector.

To limit data that must flow across the network to the appliance, restrict the
number of inspection engines you configure.

To limit the amount of data that is logged on the collector, put conditions in
the policy.

One strategy might be to configure just for Hive command line queries and try that before
adding additional inspection engines and opening up the policy to more types of traffic,
such as HDFS, which generates a much higher volume of traffic.

For each new inspection engine that is configured, you must restart S-TAP.

Monitor the appliance as more services generate more traffic. The Guardium deployment
redbook includes details on how to monitor the appliance and make sure the traffic is not
excessive for the collector.

Resources
IBM Redbook: Information Governance Principles and Practices for a Big Data
Landscape, http://www.redbooks.ibm.com/abstracts/sg248165.html?Open

IBM Redbook: Deployment Guide for InfoSphere Guardium,
http://www.redbooks.ibm.com/abstracts/sg248129.html?Open

Guardium Activity Monitoring (and blocking) for Hortonworks Hadoop using Apache
Ranger Integration, http://www.ibm.com/support/docview.wss?uid=swg21987893

IBM developerWorks article: Protect sensitive Hadoop data using InfoSphere
BigInsights Big SQL and InfoSphere Guardium,
http://www.ibm.com/developerworks/data/library/techarticle/dm-1411hadoop-
biginsights-guardium/index.html

IBM Security Guardium product documentation: http://www-
01.ibm.com/support/knowledgecenter/SSMPHH/SSMPHH_welcome.html

Appendixes
Appendix A. Kerberos setup instructions when HBASE or Hive
is used
This appendix provides the procedure to configure Guardium so that it can properly
decrypt user names when Kerberos is used as the authentication mechanism for Hadoop
and HBase or Hive is also used.

The configuration requires that each node in the cluster (that is running Guardium S-
TAP) has a keytab that includes the Hadoop services that are running on the node.
Guardium will use those keys to decrypt the user name for services running on the node.

Important: These instructions use Cloudera as the Hadoop distribution. The same basic
instructions can be used for other distributions, but the process to obtain the Kerberos
principals will vary.

A keytab file must be created or updated each time a principal's encryption key has
been changed, or when a service is added or deleted on a node.

Overview of the procedure:

1. Export keys from Kerberos and create a keytab on each node. This uses the
-norandkey option to export the principals. If you cannot use this option, see the
heading below entitled Alternate approach to creating Kerberos keytabs (for Cloudera)
for a different set of instructions for this step.
2. Configure Guardium to look for that keytab.
3. Sanity check the configuration.

Step 1: Creating a keytab for use with Guardium


Important: If your Kerberos does not support the use of -norandkey, use the alternate
instructions below for this step.

1. Identify the principals needed from the Cloudera Manager interface.


From the Cloudera Administration menu, select Kerberos. This will provide a list of
principals that are used by each node. Make a copy of this list for your reference.

The principals are defined in the following manner:


<service>/<nodename>@<kerberos domain>

Example: hdfs/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM

2. On the Kerberos server, as root (or after a kinit with a user that has kadmin
privileges), use the kadmin.local command to access the Kerberos server. Verify that the
principals identified in step 1 are available.

Example:
kadmin.local: listprincs
HTTP/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
HTTP/rh6-cl-02.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
HTTP/rh6-cl-03.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
HTTP/rh6-cl-04.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
HTTP/rh6-cl-05.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-02.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-03.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-04.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-05.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
... etc.

3. Use the xst command in kadmin.local to export every principal for a particular node
into the same keytab file:
xst -k /tmp/krb5.keytab-nodename -norandkey <service>/<nodename>@<kerberos
domain>

IMPORTANT: -norandkey is an important flag that prevents the invalidation of
previous keytabs that included the principal being exported. For example, if you are using
Cloudera, be sure to use -norandkey to avoid invalidating Cloudera keytabs.

NOTE: Each node may have a different set of services depending on what that node is running.

Example: Exporting all of the services for a single node to one keytab file:
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey HTTP/rh6-cl-
01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hdfs/rh6-cl-
01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hbase/rh6-cl-
01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hive/rh6-cl-
01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM

kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hue/rh6-cl-
01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey zookeeper/rh6-cl-
01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM

This will create a file named krb5.keytab-node01 in the /tmp directory on the server.

Create a keytab file for each node in your cluster.
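Typing each xst command by hand is error prone on nodes that run many services. The sketch below only prints the xst commands for one node so you can review them before piping them into kadmin.local; the node name, realm, keytab path, and service list are taken from the examples above and are assumptions to adjust for your own cluster.

```shell
# Sketch: print the kadmin.local xst commands for one node's services.
# NODE, REALM, KEYTAB, and the service list are assumptions -- edit them
# to match your own cluster before piping the output into kadmin.local.
gen_xst_commands() {
  NODE="rh6-cl-01.guard.swg.usma.ibm.com"
  REALM="GUARD.SWG.USMA.IBM.COM"
  KEYTAB="/tmp/krb5.keytab-node01"
  for svc in HTTP hdfs hbase hive hue zookeeper; do
    echo "xst -k $KEYTAB -norandkey $svc/$NODE@$REALM"
  done
}

gen_xst_commands
# review the output, then:  gen_xst_commands | kadmin.local
```

Because the script only prints commands, you can inspect the principal list for mistakes (and confirm -norandkey is present) before anything touches the Kerberos database.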

4. Copy the keytabs to each respective node as /etc/krb5.keytab.

NOTE: The name of the keytab should be the same on each respective node:
scp /tmp/krb5.keytab-node01 root@node01:/etc/krb5.keytab
scp /tmp/krb5.keytab-node02 root@node02:/etc/krb5.keytab
scp /tmp/krb5.keytab-node03 root@node03:/etc/krb5.keytab

5. Verify your keytab principals by running the klist command on the node:
klist -k <keytab>

Example:
[root@rh6-cl-01:]$ klist -k /etc/krb5.keytab
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
2 hbase/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
2 hbase/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
2 hbase/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
.... etc

6. Verify that you can authenticate with the keytab:

kinit -k -t /etc/krb5.keytab <service>/<nodename>@<kerberos domain>

Use the klist command to verify the authentication. In the example below, klist first
shows that no credentials are in use, a kinit is then done using the keytab file, and klist is
issued again to show that the credentials are in use.

Example:
[root@rh6-cl-01 ~]# klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_0)
[root@rh6-cl-01 ~]# kinit -k -t /etc/krb5.keytab hdfs/rh6-cl-
01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
[root@rh6-cl-01 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM

Valid starting Expires Service principal


06/11/14 11:31:25 06/12/14 11:31:25
krbtgt/GUARD.SWG.USMA.IBM.COM@GUARD.SWG.USMA.IBM.COM
renew until 06/12/14 11:31:25
[root@rh6-cl-01 ~]#

7. Restart the UTAP so that it re-reads the keytab, using stop and start:
stop utap
start utap

Example:
[root@rh6-cl-01 ~]# stop utap
utap stop/waiting
[root@rh6-cl-01 ~]# start utap
utap start/running, process 366
[root@rh6-cl-01 ~]#

Step 2. Configure Guardium


Many enterprise deployments use the Guardium Installation Manager (GIM) to push out
server-side updates, such as S-TAPs. This step has two subsections: one if GIM is
installed on the server node and one if it is not.

Use these steps if GIM is not installed on the server:


1. Stop STAP temporarily:
$ stop utap

2. Create a new directory: /usr/local/guardium/kerberos


To that new directory, copy the following files from directory
/usr/local/guardium/lib64:
guard_stap_runner
guardkerbplugin.conf
libguardkerbplugin.so
utap.conf

Important: Make sure all the files are readable by root, and that
guard_stap_runner and libguardkerbplugin.so are executable by root.

3. Make sure the kerberos libraries (libkrb5.so, libk5crypto.so, etc.) are in
one of /lib64, /usr/lib64, /lib, or /usr/lib. If they are not there, then edit
guard_stap_runner so that LD_LIBRARY_PATH includes their location.

4. Make sure that the kerberos configuration file is at /etc/krb5.conf. If not, then
edit the file /usr/local/guardium/kerberos/guardkerbplugin.conf
appropriately.

5. Make sure that the kerberos keytab file is at /etc/krb5.keytab. If not, then edit the
file /usr/local/guardium/kerberos/guardkerbplugin.conf appropriately.

6. Make sure that the guard_stap entry in guard_stap_runner points to the executable
file and not to a directory.

7. Make sure the kerberos configuration file (/etc/krb5.conf) includes the following
line in the [libdefaults] section:
clockskew=600

8. Configure the /usr/local/guardium/guard_stap/guard_tap.ini file as
needed, such as adding inspection engines and making sure the SQLGuard
section(s) point to the appropriate Guardium appliance(s).

Edit the kerberos_plugin_dir line to be this:


kerberos_plugin_dir=/usr/local/guardium/kerberos

9. Replace the file /etc/init/utap.conf with the one in this directory:


$ mv /etc/init/utap.conf /etc/init/utap.conf.O
$ cp utap.conf /etc/init

Make sure the new file has the same ownership/permissions as the old one.

10. Restart STAP:


$ start utap

Use these steps if GIM is installed on the server:


These steps assume the following:
The server has an S-TAP of Version 9 or later
Guardium Installation Manager (GIM) is installed.
All the relevant nodes are configured with the same directory structures

Create a new directory: /usr/local/guardium/kerberos


To that new directory, copy the following files from directory
/usr/local/guardium/lib64:
libguardkerbplugin.so
guardkerbplugin.conf

Alternatively, create a file with the following lines:


#comment
KRB5RCACHETYPE=none
KRB5_KTNAME=/etc/krb5.keytab
KRB5_CONFIG=/etc/krb5.conf

Important: Make sure the files are readable by root, and libguardkerbplugin.so is
executable by root.

Make sure the kerberos libraries (libkrb5.so, libk5crypto.so, etc.) are in
one of /lib64, /usr/lib64, /lib, or /usr/lib.

Make sure that the kerberos configuration file is at /etc/krb5.conf and that the
kerberos keytab file is at /etc/krb5.keytab. If not, then edit the file
/usr/local/guardium/kerberos/guardkerbplugin.conf appropriately.

Create a tar file of the kerberos/ directory.

Copy the tar file to the server node and extract it with the -C option to create
the destination directory.

Edit the /usr/local/guardium/guard_stap/guard_tap.ini file to add the
Kerberos directory setting:
kerberos_plugin_dir=/usr/local/guardium/kerberos

Stop S-TAP:
$ ps -ef | grep stap
$ kill <stap_pid>

The GIM (actually, the GIM supervisor process) restarts the S-TAP after it is
killed.

Step 3: Basic operational testing of the configuration


1. When the STAP started above, it printed a PID for the STAP process. Make sure
the STAP continues to run with that PID, and does not continually restart, by
running this command several times:
$ ps -ef| grep guard_stap | grep -v grep

The STAP puts any error messages in the file /tmp/guard_stap.stderr.txt.
Make sure that file is not growing in size. If it is, or if the STAP is constantly
restarting, check the file contents for error messages.

2. Make sure the kerberos plugin is loaded: Using the PID from above, do the
following (this example assumes the pid is 12345):
$ lsof -p 12345

Make sure that libguardkerbplugin.so is one of the open files listed.
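The "run ps several times" check can be wrapped in a small helper. This is a sketch, not part of Guardium; the two-second interval is an arbitrary assumption, and on a real node you would pass guard_stap as the process name.

```shell
# Sketch: verify that a process keeps the same PID over a short interval,
# i.e. that it is not crash-looping. The 2-second wait is arbitrary.
pid_stable() {
  name="$1"
  p1=$(pgrep -f "$name" | head -n 1)
  sleep 2
  p2=$(pgrep -f "$name" | head -n 1)
  [ -n "$p1" ] && [ "$p1" = "$p2" ]
}

# On a Hadoop node you would run something like:
#   pid_stable guard_stap && echo "S-TAP PID is stable"
```

If the function fails, fall back to the manual checks above: inspect /tmp/guard_stap.stderr.txt for the reason the process is restarting.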

Alternate approach to creating Kerberos keytabs (for Cloudera)


These instructions are to be used in a Cloudera deployment when the -norandkey option
described in Step 1 above cannot be used. After you have the keytab, continue with the
instructions above for configuring the Guardium S-TAP.

Overview
In a Cloudera deployment where Kerberos is used for authentication, the following
guidelines can be used to create a single keytab file for use with Guardium S-TAP for
monitoring.

The instructions below use the ktutil utility, which is found in the krb5-workstation
package for Linux environments. Similar tools for Microsoft Windows Active Directory
environments can also be used.

In general, each node in the Cloudera deployment will contain a keytab for each service.
These services will vary depending on your cluster and the services installed/running on
each host. For each node, each service keytab will be read into ktutil. Once all the
required keytabs are read, then a single keytab is written. This resultant keytab can then
be used by Guardium S-TAP for monitoring.

Identifying keytabs
On each node, identify the services that are running. The service files are typically found
in /var/run/cloudera-scm-agent/process/.
Search for the latest process number for each service. In the example below, the ls -ld
command lists the possible folders for the hdfs processes; you'd find
that the newest folders contain the correct keytabs.

[root@cloudera-cl1-01 process]# ls -ld *hdfs*

drwxr-x--x 3 hdfs hdfs 460 Oct 31 13:08 2697-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Oct 31 13:08 2704-hdfs-HTTPFS

drwxr-x--x 3 hdfs hdfs 460 Oct 31 16:00 2895-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Oct 31 16:00 2902-hdfs-HTTPFS

drwxr-x--x 3 hdfs hdfs 460 Oct 31 16:05 2947-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Oct 31 16:05 2954-hdfs-HTTPFS

drwxr-x--x 3 hdfs hdfs 460 Nov 5 11:24 3077-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Nov 5 11:24 3084-hdfs-HTTPFS

drwxr-x--x 3 hdfs hdfs 460 Nov 5 13:40 3161-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Nov 5 13:40 3168-hdfs-HTTPFS

drwxr-x--x 3 hdfs hdfs 460 Dec 3 11:49 3712-hdfs-NAMENODE

drwxr-x--x 4 httpfs httpfs 240 Dec 3 11:49 3719-hdfs-HTTPFS

Inside these folders, you will find the service keytab. For example, for HDFS, the
corresponding keytab is named hdfs.keytab.
Identify all keytabs that each node uses and note their locations.
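Picking the newest process directory for each service can be scripted. The helper below is a sketch (not part of Guardium or Cloudera) that returns the most recently modified directory under a base path whose name matches a service pattern.

```shell
# Sketch: return the most recently modified directory under $1 whose
# name contains the pattern $2 (e.g. hdfs-NAMENODE). Relies on ls -td
# sorting entries newest-first by modification time.
newest_dir() {
  base="$1"
  pattern="$2"
  ls -td "$base"/*"$pattern"* 2>/dev/null | head -n 1
}

# e.g.:  newest_dir /var/run/cloudera-scm-agent/process hdfs-NAMENODE
```

Run it once per service on each node and note the keytab path inside each directory it returns.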
You can test each keytab to verify that it is valid using kinit:
kinit -k -t <keytab> <service>/<principal>@<domain>
Use klist to verify that you have obtained a ticket. For example, the following
tests an HDFS keytab:
[root@cloudera-cl1-01 ~]# kdestroy
[root@cloudera-cl1-01 ~]# klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_0)
[root@cloudera-cl1-01 ~]# kinit -k -t /var/run/cloudera-scm-agent/process/3712-hdfs-
NAMENODE/hdfs.keytab hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
[root@cloudera-cl1-01 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA

Valid starting Expires Service principal


12/04/14 16:47:49 12/05/14 16:47:49 krbtgt/CLOUDERA@CLOUDERA
renew until 12/09/14 16:47:49

Once all the keytabs on all nodes are identified, the keytabs can be merged.

Merging the keytabs


Once the keytabs are identified, they will need to be merged together for use with
Guardium S-TAP.

Use ktutil to read in the keytabs and write keytabs. On each node, merge the identified
keytabs together.
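The interactive read_kt/write_kt sequence shown in the examples that follow can also be generated by a small script. This sketch only prints the ktutil commands; the output path and keytab arguments are whatever you pass in, and you pipe the result into ktutil yourself.

```shell
# Sketch: print the ktutil command sequence that merges the given
# keytabs (arguments 2..n) into a single output keytab (argument 1).
merge_keytabs_script() {
  out="$1"
  shift
  for kt in "$@"; do
    printf 'read_kt %s\n' "$kt"
  done
  printf 'write_kt %s\nquit\n' "$out"
}

# e.g.:  merge_keytabs_script /tmp/krb5.keytab hdfs.keytab hive.keytab | ktutil
```

Generating the command list first lets you review exactly which keytabs will be merged before ktutil writes anything.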

The following example uses ktutil to read in the keytab and list the principal for HDFS.

# ktutil
ktutil: read_kt /var/run/cloudera-scm-agent/process/3712-hdfs-NAMENODE/hdfs.keytab
ktutil: list
slot KVNO Principal
---- ---- ---------------------------------------------------------------------
1 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
2 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
3 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
4 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
5 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
6 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
7 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
8 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
9 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
10 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
11 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
12 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA

The following example uses ktutil to read in the keytab and list the principals for Hive.

# ktutil
ktutil: read_kt /var/run/cloudera-scm-agent/process/3737-hive-HIVEMETASTORE/hive.keytab
ktutil: list
slot KVNO Principal
---- ---- ---------------------------------------------------------------------
1 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
2 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
3 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
4 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
5 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
6 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA

The following example uses ktutil to read in the HDFS and Hive keytabs, then writes out
a single krb5.keytab. Once the krb5.keytab is written, it is read back into ktutil and
the principals are listed.

# ktutil
ktutil: read_kt /var/run/cloudera-scm-agent/process/3712-hdfs-NAMENODE/hdfs.keytab
ktutil: read_kt /var/run/cloudera-scm-agent/process/3737-hive-HIVEMETASTORE/hive.keytab
ktutil: write_kt /tmp/krb5.keytab
ktutil: read_kt /tmp/krb5.keytab
ktutil: list
slot KVNO Principal
---- ---- ---------------------------------------------------------------------
1 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
2 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
3 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
4 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
5 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
6 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
7 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA

8 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
9 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
10 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
11 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
12 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
13 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
14 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
15 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
16 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
17 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
18 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA

You can then use kinit as shown in previous examples to test the keytab and the different
principals.

Once this single keytab is created (krb5.keytab in the example above), it can be
copied or moved to its final location, and the Guardium S-TAP can be configured to use
it. Use the steps described in Step 2: Configure Guardium above to do this.

Appendix B. Using computed attributes to pull out db user from
SOLR, Impala Hue, or Hive Hue/Beeswax

Here are the steps to create a computed attribute for SOLR. Use the same basic procedure
for Impala Hue and Hive Hue/Beeswax. The SQL expressions for those are included
below.

Navigate to Reports > Guardium Configuration Items > Query Entities & Attributes.

Right-click on any row in the report and select Invoke create_computed_attribute
as shown below.

In the UI, enter information as shown below. (The attributeLabel can be anything you
want.) The SQL used in the expression is shown here; you can copy and paste it into
the expression field.

if( (LOCATE('user.name=hue&doAs=',FULL_SQL)>0),
substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,
(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'doAs=')+0)),'&',1),' ')

Click Invoke now. The attribute will then be available in the FULL SQL entity of the
access domain to be used in reports, as shown here.

Figure 25. Computed attribute for Solr users now appears in Query Builder

Here is an example of what SOLR message traffic looks like before and after the
computed attribute is applied.
GET

/solr/yelp_demo/select?user.name=hue&doAs=svoruga&q=%2A%3A%2A&wt=json&r
ows=10&start=0&facet=true&facet.mincount=0&facet.limit=10&facet.field={
%21ex%3Dstars}stars&f.stars.facet.limit=16&f.stars.facet.mincount=0&fac
et.field={%21ex%3Dbusiness_id}business_id&f.business_id.facet.limit=21&
f.business_id.facet.mincount=0&facet.field={%21ex%3Dfull_address}full_a
ddress&f.full_address.facet.limit=11&f.full_address.facet.mincount=0&fa
cet.range={%21ex%3Duseful}useful&f.useful.facet.range.start=0&f.useful.
facet.range.end=0&f.useful.facet.range.gap=1&f.useful.facet.mincount=0&
fq={%21tag%3Duseful}useful%3A[0+TO+0}&fq={%21tag%3Dfull_address}{%21fie
ld+f%3Dfull_address}az&fl=date%2Cid&hl=true&hl.fl=%2A&hl.snippets=3
HTTP/1.1
Host: cloudera-cl1-06.guard.swg.usma.ibm.com:8983
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-
431.11.2.el6.x86_64

Result after the computed attribute is applied:

doAs=svoruga
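The LOCATE/substring_index logic of the computed attribute can be mimicked outside Guardium to sanity-check what it returns. The shell sketch below reproduces the same extraction on a shortened version of the request above; it is only an illustration of the string logic, not how Guardium evaluates the expression.

```shell
# Sketch: mimic the computed-attribute expression -- if the request
# contains 'user.name=hue&doAs=', return the text from 'doAs=' up to
# the next '&'; otherwise return a single space, as the SQL does.
extract_doas() {
  case "$1" in
    *"user.name=hue&doAs="*)
      rest="doAs=${1#*doAs=}"       # keep 'doAs=' plus what follows it
      printf '%s\n' "${rest%%&*}"   # cut at the next '&'
      ;;
    *)
      printf ' \n'
      ;;
  esac
}

extract_doas '/solr/yelp_demo/select?user.name=hue&doAs=svoruga&q=%2A%3A%2A&wt=json'
# prints: doAs=svoruga
```

Note that, like the SQL expression, this returns the doAs= prefix along with the user name, matching the report output shown above.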

Impala: computed attribute to get user name from Hue

This SQL assumes you are gathering information based on a log full details policy rule:
IF (instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'__THRIFT message={method=''query'',struct:1=')>0,
substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,
(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'string:4')+10)),'''',1), '')

Use this SQL if you are not using log full details:

IF (instr(GDM_CONSTRUCT.ORIGINAL_SQL,'__THRIFT message={method=''query'',struct:1=')>0,
substring_index(substring(GDM_CONSTRUCT.ORIGINAL_SQL,
(instr(GDM_CONSTRUCT.ORIGINAL_SQL,'string:4')+10)),'''',1), '')

Hive: computed attribute to get user name from Hue/Beeswax


This SQL assumes you are gathering information based on a log full details policy rule:

IF (instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'__THRIFT message={method=''get_table'',struct:0=')>0,
substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,
(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'string:3')+10)),'''',1), '')

Use this SQL if you are not using log full details (as in the Impala case, it references
ORIGINAL_SQL consistently throughout):

IF (instr(GDM_CONSTRUCT.ORIGINAL_SQL,'__THRIFT message={method=''get_table'',struct:0=')>0,
substring_index(substring(GDM_CONSTRUCT.ORIGINAL_SQL,
(instr(GDM_CONSTRUCT.ORIGINAL_SQL,'string:3')+10)),'''',1), '')

Appendix C: Considerations for IBM InfoSphere BigInsights and
Big SQL

For most Hadoop activity, the recommendations in this guide apply just as they do for all
other Hadoop distributions, with the following exceptions.

Hadoop on GPFS (IBM Spectrum Scale)


As of Version 10.1 of Guardium, you can use the GPFS deployment of BigInsights by
configuring the HDFS Transparency Connector. You can find out more about the
connector here:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General
%20Parallel%20File%20System%20%28GPFS%29/page/2nd%20generation%20HDFS
%20Transparency%20Protocol

Big SQL
An S-TAP must be installed on all nodes on which a Big SQL engine is installed. The
support for Big SQL is quite comprehensive and is similar to what Guardium already
supports for DB2. For more details on configuration and capabilities, see the
developerWorks article on this topic here:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1411hadoop-
biginsights-guardium/index.html

If Kerberos and/or GPFS is used, then you must configure a special communications exit
on each Big SQL node. Guardium provides a dynamically loaded shared library that
interacts with Big SQL. Big SQL will invoke functions within that library at run time
when it performs SQL and utility requests. Directions for this are included in the
developerWorks article.

Restrictions: Only monitoring and auditing are supported using the exit methodology.
Redaction and blocking are advanced features that are only supported using S-TAP.

Appendix D. Supported Hadoop components (Hadoop 2)
The table summarizes the degree to which each Hadoop service's traffic is parsed
and logged. For example, a complete level of logging means you get the user name, object
names, and so on for that specific message type. Other components can only be captured at the
lower level of MapReduce/YARN and/or HDFS traffic.

Component | Level of parsing/logging | Computed attributes required?

Hue | Traffic through Hue is captured, except for Impala (bug opened). Failed logins for Hue are not currently captured (bug). | N/A
HDFS | Complete | No
MapReduce | Complete | Yes (included in prebuilt report)
YARN | Complete | No
HBase | Complete | No
Hive | Complete | Yes, if you use Hue, to return the user name for THRIFT messages into the DB User field. See Hive: computed attribute to get user name from Hue/Beeswax in Appendix B.
WebHDFS | Complete | No
Solr | Complete | Yes, to return DB User. Note that you need a log full details policy rule for Solr traffic to get the computed attribute.
Impala | Complete | Yes, if you use Hue, to return the user name for THRIFT messages into the DB User field. See Impala: computed attribute to get user name from Hue in Appendix B.
SPARK | Not supported for in-memory usage (open requirement) | You will catch HDFS and MapReduce traffic as data is brought into memory, but in-memory interactions are not captured by Guardium.
Shark | Not supported | n/a
Sqoop | Returned as HDFS and YARN traffic | No
Pig | Returned as HDFS and YARN traffic | No
Zookeeper | Returned as HDFS and YARN traffic | No
Avro | Returned as HDFS and YARN traffic | No
Flume | Returned as HDFS and YARN traffic | No
Cascading | Returned as HDFS and YARN traffic | No
Slider | Returned as HDFS and YARN traffic | No
Storm | Returned as HDFS and YARN traffic | No
Knox | Guardium catches the WebHDFS and other resulting traffic | N/A
Ambari | Guardium catches resulting traffic issued from Ambari | N/A
Tez | Returned as HDFS and YARN traffic | No
Falcon | Returned as HDFS and YARN traffic | No
NFS | Not supported | N/A
Java/Scala | Returned as HDFS and YARN traffic | No

Notices
Copyright IBM Corp. 2016. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA
ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, Guardium, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM
or other companies. A current list of IBM trademarks is available on the Web at Copyright and trademark information
(www.ibm.com/legal/copytrade.shtml)
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
