Deployment Guide for Hadoop Systems
July 27, 2016
Revisions
12/16/2015 - Clarified that you DO need Kerberos configuration if either HBase or Hive is used. Previously the guide said you don't need it for Hive, but this was incorrect.
01/12/2016 - Added the Solr port (8983) and inspection engine type (HTTP).
06/08/2016 - Updated the sizing rules of thumb. Added GPFS information for BigInsights. Removed the requirement to customize the HBase report, as this has been fixed in 10.1.
06/27/2016 - Added a reference to the deployment guide for Hortonworks for Ranger integration.
Assumptions:
The client has already set up the environment on their own, or has worked with technical
sales, lab services, or a knowledgeable Business Partner to do the initial sizing and to
set up the environment, including network connectivity for the Guardium appliances.
The person doing the implementation has Guardium knowledge and has involved
the relevant people who understand the client's Hadoop architecture.
The client has a clear set of use cases to test based on an understanding of known
capabilities in the product for Hadoop (as described in this document, for
example). If a requirement is not addressed by the product, the client can use the
Request for Enhancement process to ensure that IBM product management is
aware of the request. https://www.ibm.com/developerworks/rfe/
Blocking and redaction for Hive and Impala (this was already supported for Big
SQL). For more information, see Policies that support redaction and blocking
(Advanced) on page 23.
New inspection engines for Hive, Hue, Impala and WebHDFS. For more
information, see Table 1 on page 12.
Removed the restriction on the Hue metastore. Previously only MySQL was supported;
now PostgreSQL and Oracle are also supported.
Guardium now captures failed logins from Hue for MySQL, Oracle, and PostgreSQL
datastores.
Planning
Sizing (capacity planning)
This section refers to capacity planning, not sizing for pricing. Pricing for Hadoop is per
node.
The current rule of thumb is based on deployments that are not high volume in terms of
what is being audited:
10 management/server nodes per collector
20+ data nodes per collector, assuming S-TAPs are needed for the data nodes
(they are not needed for all components)
Possibly even more nodes per collector if physical appliances are used
The other option is to size by the PVUs of the nodes. This may result in oversizing if you
are not auditing significant amounts of traffic.
The capacity sizing guideline for Version 10 is 4000 PVUs per collector.
http://www-01.ibm.com/support/docview.wss?uid=swg27046184
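For example (illustrative numbers only): a 30-node cluster rated at 100 PVUs per node totals 3,000 PVUs, which fits within a single collector under this guideline, although the volume of audited traffic should still drive the final collector count.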
This is a space that is changing frequently, so double check with IBM if you are unsure or
to get the latest information.
However, HDFS activity is not the most auditor-friendly; it is somewhat like monitoring
file accesses in a relational database. You may want to consider also monitoring activity
from other components that your organization is probably using, such as Hive, Big SQL,
or Impala, which are more akin to what one might expect from database access.
Example report outputs from some of these components are included in this guide.
You can record which components you are running in Table 1 as well as whether you
require monitoring above and beyond HDFS monitoring.
For monitoring purposes, you must think about the user, the data object being monitored
and what actions/commands are being done. In Guardium terminology, these are,
respectively, the DB User, the Object, and the Verb (the command). Those of you
familiar with Guardium will remember that these entities can be used in policy rules to
trigger particular actions, such as a real time alert.
So, as with any other auditing exercise, a key step in setting up your security policies is to
inventory your assets and map your inventory of assets to users and servers.
Guardium policy rule actions not only allow you to alert on or log policy violations, but
also enable you to filter certain traffic for performance.
In most cases, Guardium cannot catch failed logins for command-line components.
Guardium can see failed logins from Hue and through IBM BigSQL.
You will get permission exceptions at the file system level, so you can report on those using
the exceptions domain.
This section includes lists of objects and commands (verbs) for Hadoop. For the
commands, you can cut and paste these into a group in Guardium if you like, using the
Group Builder tool (or with grdapi, as sketched after the verb list below). You will also
need to create groups of users and objects based on your own environment.
Note: BigSQL traffic in BigInsights does have session information, even if the underlying
HDFS does not.
Write verbs:
createTable
disableTable
deleteTable
multi (this is an insert/update; with the Ranger integration deployment option, this is put)
drop
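If you prefer the command line to the Group Builder UI, groups and members can typically also be created with grdapi. The group name below is a hypothetical example, and the parameter names and group type values should be verified against the GuardAPI reference for your Guardium version:
/* illustrative sketch only - verify parameters against your GuardAPI reference
grdapi create_group desc="Hadoop Write Verbs" type="COMMANDS" appid="Public" owner="admin"
grdapi create_member_to_group_by_desc desc="Hadoop Write Verbs" member="createTable"
grdapi create_member_to_group_by_desc desc="Hadoop Write Verbs" member="disableTable"
grdapi create_member_to_group_by_desc desc="Hadoop Write Verbs" member="deleteTable"
grdapi create_member_to_group_by_desc desc="Hadoop Write Verbs" member="multi"
grdapi create_member_to_group_by_desc desc="Hadoop Write Verbs" member="drop"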
If you are NOT using HBase or Hive, you do not need to configure Guardium for a
Kerberos configuration.
As with any S-TAP deployment, be sure to download the correct S-TAP for your
operating system and kernel level.
Figure 2 provides a high level overview of where S-TAPs should be installed depending
on what you want to monitor. Note that the graphic does not necessarily reflect physical
servers.
Edge nodes: An S-TAP is recommended for edge nodes as well, particularly if you are
using them as a landing zone for data.
Use the table below to record the ports and inspection engine protocols required for each
node. Combine this information into a spreadsheet with the server IP (S-TAP host IP) and
you will have everything you need to create grdapi commands if you prefer to use that
instead of configuring each of these using the Guardium UI.
Required (Y/N) | Node | Hadoop Service | Default Ports | Your Ports | IE Protocol
 | Namenode | HDFS Name Node | 8020 | | HADOOP
 | Namenode | Namenode HTTP port (for WebHDFS) | 50070 | | WEBHDFS
 | Namenode | Resource Manager (YARN only) | 8032 | | HADOOP
 | Job Tracker (only for MapReduce 1) | MapReduce Job Tracker | 8021, 9290, 50030 | | HADOOP
 | HBase Master | HBase Master | 60000 | | HADOOP
 | HBase Region | HBase Region | 60020 | | HADOOP
 | Hive | Hive Server 2 Thrift protocol messages | 10000 | | HIVE
 | Hive Metastore | Thrift protocol message used to get the Impala and Hive DB user from Hue (requires computed attribute) | 9083 | | HADOOP
 | Impala daemons | Impala | 21000 | | IMPALA
 | Impala | Impala from Hue | 21050 | | HIVE (Impala from Hue uses HiveServer2)
 | Management node | BigSQL Server | 51000; 32051 (changed in 4.1) | | DB2
 | Compute node | BigSQL Server | 51000; 32051 (changed in 4.1) | | DB2
 | Hue node | Hue UI (Oracle backend) | 1521 | | HUE
 | Hue node | Hue UI (MySQL backend) | 3306 | | HUE
 | Hue node | Hue UI (PGSQL backend) | 5432 | | HUE
 | Solr search node | Solr Search | 8983 | | HTTP
/* Master or NameNode
/* YARN
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP
ktapDbPort=8032 portMax=8050 portMin=8032 connectToIp=127.0.0.0
stapHost=10.19.232.21
/*HDFS
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP
ktapDbPort=8020 portMax=8020 portMin=8020 connectToIp=127.0.0.0
stapHost=10.19.232.21
/* WEBHDFS
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=WEBHDFS
ktapDbPort=50070 portMax=50070 portMin=50070 connectToIp=127.0.0.0
stapHost=10.19.232.21
/*HBASE Master
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP
ktapDbPort=60000 portMax=60000 portMin=60000 connectToIp=127.0.0.0
stapHost=10.19.232.21
/*Impala daemon
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=IMPALA
ktapDbPort=21000 portMax=21000 portMin=21000 connectToIp=127.0.0.0
stapHost=10.19.232.21
/*Hive
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HIVE
ktapDbPort=10000 portMax=10000 portMin=10000 connectToIp=127.0.0.0
stapHost=10.19.232.21
/* Solr search
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HTTP
ktapDbPort=8983 portMax=8983 portMin=8983 stapHost=10.19.232.21
/* data nodes
/* HBASE Region
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=HADOOP
ktapDbPort=60020 portMax=60020 portMin=60020 connectToIp=127.0.0.0
stapHost=10.19.232.21
/*Impala daemon
grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=IMPALA
ktapDbPort=21000 portMax=21000 portMin=21000 connectToIp=127.0.0.0
stapHost=10.19.232.21
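If you recorded the ports, protocols, and S-TAP host IPs in a spreadsheet as suggested above, a small shell loop can generate these grdapi commands in bulk. This is a sketch only; the CSV file name and column layout are illustrative assumptions, not a required format:
# assumed CSV columns: stap_host,protocol,port_min,port_max (adjust to your spreadsheet)
while IFS=, read -r HOST PROTOCOL PMIN PMAX; do
  echo "grdapi create_stap_inspection_engine client=0.0.0.0/0.0.0.0 protocol=${PROTOCOL} ktapDbPort=${PMIN} portMax=${PMAX} portMin=${PMIN} connectToIp=127.0.0.0 stapHost=${HOST}"
done < inspection_engines.csv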
After you are comfortable that traffic is flowing to the collector, you can clone the default
policy and create one that aligns with your security and compliance requirements, as
described in Simple production policy on page 16.
There is a lot of noise with Hadoop internal communications, and the more background
noise you can filter out, the better. Rule 1 will filter out (not log) activity in which the
object is one of the items in the Hadoop Skip Objects group. You can edit this group to
add objects that you observe in your traffic.
The second rule filters out noisy commands that reflect internal communications.
Tip: You must put something in the Not Hadoop Servers group, even if it's a dummy IP,
or you will not collect traffic. If you don't have any such servers, make sure you remove
this group altogether and uncheck the Not checkbox.
This rule also specifies LOG FULL DETAILS action for all nonfiltered traffic, which
may be handy for a small environment, but it is probably not what you want to do in
production. This may overload the collector because each command is logged in full.
Thus, you will likely modify or delete it after doing initial validation in a test
environment.
Recommendation: If you are not familiar with the way Guardium policies impact data
collection and reporting, familiarize yourself with that before moving ahead with policy
definitions. Some recommended resources on the Guardium community on
developerWorks (bit.ly/guardwiki) include:
4-part video series on policies
Tech Talk: Reporting 101
The Deployment Guide for InfoSphere Guardium also includes a good introduction.
(http://www.redbooks.ibm.com/abstracts/sg248129.html?Open)
In this case, a policy violation of medium severity will be logged whenever someone in
the privileged user group accesses any object (HDFS file, HBase Table, etc.) with the
string customer in its name. (Most likely you will be creating a group of sensitive
objects.)
The violation will appear in the Policy Violations / Incident Management report.
(Comply>Reports>Incident Management).
Some of these reports are component-based, which is probably most useful when
validating your configuration and confirming that you are catching traffic from the component.
We'll go into a little more detail on the following reports, which are more focused on
security and compliance.
Hadoop Permissions
Privileged Users Accessing Sensitive Objects
Exception report
Hadoop logged in users
Hadoop Permissions
This report shows when permissions are changed on any Hadoop file system object.
This report uses a built-in group called Hadoop Permissions, shown below. You could also
choose to include Hive, BigSQL, or Impala grant/revoke statements in this report by
adding those commands as well, or catch those using another report (for example, the
built-in Execution of Grant Commands report).
For objects, you can use full file directory paths (for HDFS), wild cards, or a combination
of both. Note that if you are also specifically monitoring HBase, BigSQL, or Hive and
they also use customer in their names, those will also match.
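For example (illustrative patterns, assuming the % wildcard that Guardium groups use): /user/hive/warehouse/customer% matches HDFS objects under the warehouse directory whose names begin with customer, while %customer% matches any object containing that string.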
File permission exceptions are indicated by error code 101, which is used in the query
conditions section of the Exception Report query builder.
Here is a policy that includes both blocking and redaction rules. We'll examine these
rules and their prerequisites in more detail in this section.
For more information about the blocking actions (S-GATE Terminate), see
http://www-01.ibm.com/support/knowledgecenter/SSMPHH_10.0.0/com.ibm.guardium.doc/protect/rule_actions.html
Blocking rules
Although there are additional nuances that are covered in other sources, blocking requires
a minimum of two actions:
1. S-GATE Attach: specify the conditions under which S-TAP must start watching
the session traffic for possible blocking (which requires checking all actions
against the policy on the collector).
2. S-GATE Terminate: when this condition is met, terminate the connection.
Important: Blocking has performance implications because S-TAP must hold the
command and check with the policy on the appliance to see if this command should be
allowed through. Thus, it is important that you limit which conditions you use for attach
to those that are not performance sensitive, such as privileged user access.
Also, because of the way Hive and Impala traffic is processed in Hadoop, you must do
the following in the blocking policy rule:
The rules shown below in Figure 18 and Figure 19 can be translated as follows:
1. Whenever there is a connection from svoruga to any Hive table that includes
customer in the name, start watching this for possible blocking.
2. If svoruga issues a SELECT command (or any command in that group) on a
customer table, block the connection.
Figure 20 shows a select from the customer table in Hive using beeline and how it was blocked.
The policy violation report shows that the rule was triggered.
Redaction rules
Figure 22 below is one of the extrusion rules in our policy to inspect the returned data for
a pattern that matches social security numbers and then redact data in the pattern. Figure
23 has the same rules except for credit cards.
For information about the special pattern tests, such as those for credit cards and social
security numbers, see the Knowledge Center here:
http://www-01.ibm.com/support/knowledgecenter/SSMPHH_10.0.0/com.ibm.guardium.doc/protect/r_patterns.html
Figure 24 shows the redaction on Hive data whether the query was issued in the UI (Hue)
or in the beeline command line.
To limit data that must flow across the network to the appliance, restrict the
number of inspection engines you configure.
To limit the amount of data that is logged on the collector, put in conditions on
the policy.
One strategy might be to just configure for Hive command line queries and try that before
adding additional inspection engines and opening up the policy to more types of traffic
such as HDFS, which will generate a much higher volume of traffic.
For each new inspection engine that is configured, you must restart S-TAP.
Monitor the appliance as more services generate more traffic. The Guardium deployment
redbook includes details on how to monitor the appliance and make sure the traffic is not
excessive for the collector.
Guardium Activity Monitoring (and blocking) for Hortonworks Hadoop using Apache
Ranger Integration. http://www.ibm.com/support/docview.wss?uid=swg21987893
The configuration requires that each node in the cluster (that is running Guardium S-
TAP) has a keytab that includes the Hadoop services that are running on the node.
Guardium will use those keys to decrypt the user name for services running on the node.
Important: These instructions use Cloudera as the Hadoop distribution. The same basic
instructions can be used for other distributions, but the process to obtain the Kerberos
principals will vary.
A keytab file must be created or updated each time a principal's encryption key has
been changed or when a service is added or deleted on a node.
Example: hdfs/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
Example:
kadmin.local: listprincs
HTTP/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
HTTP/rh6-cl-02.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
HTTP/rh6-cl-03.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
HTTP/rh6-cl-04.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
HTTP/rh6-cl-05.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-02.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-03.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-04.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
hbase/rh6-cl-05.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
... etc.
3. Use the xst command in kadmin.local to export every principal for a particular node
into the same keytab file:
xst -k /tmp/krb5.keytab-nodename -norandkey <service>/<nodename>@<kerberos domain>
Example: Exporting the services for one node (node01) to a single keytab file (repeat for each node):
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey HTTP/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hdfs/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hbase/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
kadmin.local: xst -k /tmp/krb5.keytab-node01 -norandkey hive/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
This will create a file named krb5.keytab-node01 in the /tmp directory on the server.
5. Verify your keytab principals from the node's command line using the klist command:
klist -k <keytab>
Example:
[root@rh6-cl-01:]$ klist -k /etc/krb5.keytab
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
2 hbase/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
2 hbase/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
2 hbase/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
.... etc
Use the klist command to verify the authentication. In the example below, klist shows
that no credentials are in use. A kinit is then done using the keytab file. A klist is issued
to show that the credentials are in use.
Example:
[root@rh6-cl-01 ~]# klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_0)
[root@rh6-cl-01 ~]# kinit -k -t /etc/krb5.keytab hdfs/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
[root@rh6-cl-01 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs/rh6-cl-01.guard.swg.usma.ibm.com@GUARD.SWG.USMA.IBM.COM
Example:
[root@rh6-cl-01 ~]# stop utap
utap stop/waiting
[root@rh6-cl-01 ~]# start utap
utap start/running, process 366
[root@rh6-cl-01 ~]#
Important: Make sure all the files are readable by root, and that
guard_stap_runner and libguardkerbplugin.so are executable by root.
4. Make sure that the kerberos configuration file is at /etc/krb5.conf. If not, then
edit the file /usr/local/guardium/kerberos/guardkerbplugin.conf
appropriately.
5. Make sure that the kerberos keytab file is at /etc/krb5.keytab. If not, then edit the
file /usr/local/guardium/kerberos/guardkerbplugin.conf appropriately.
6. Make sure that the guard_stap in the guard_stap_runner points to the executable
file and not a directory.
Make sure the new file has the same ownership/permissions as the old one.
Important: Make sure the files are readable by root, and libguardkerbplugin.so is
executable by root.
Make sure that the kerberos configuration file is at /etc/krb5.conf and that the
kerberos keytab file is at /etc/krb5.keytab. If not, then edit the file
/usr/local/guardium/kerberos/guardkerbplugin.conf appropriately.
Copy the tar file to the server node and extract the tar with the -C option to create
the destination directory.
1. Stop S-TAP:
$ ps -ef | grep stap
$ kill <stap_pid>
The GIM (actually, the GIM supervisor process) restarts the S-TAP after it is
killed.
2. Make sure the kerberos plugin is loaded: Using the PID from above, do the
following (this example assumes the pid is 12345):
$ lsof -p 12345
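For example, to check specifically for the Kerberos plugin library (the PID is illustrative):
$ lsof -p 12345 | grep libguardkerbplugin.so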
The instructions below utilize the ktutil utility that is found in the krb5-workstation package
for Linux environments. Similar tools for Microsoft Windows Active Directory environments can
also be used.
In general, each node in the Cloudera deployment will contain a keytab for each service.
These services will vary depending on your cluster and the services installed/running on
each host. For each node, each service keytab will be read into ktutil. Once all the
required keytabs are read, then a single keytab is written. This resultant keytab can then
be used by Guardium S-TAP for monitoring.
Identifying keytabs
On each node, identify the service that is running. The services files are typically found
in /var/run/cloudera-scm-agent/process/.
Search for the latest process number for the service. In the example below, the ls -l
command lists the possible folders for the hdfs processes; the newest folders contain
the correct keytabs.
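For example (the path is the default Cloudera agent location; process numbers are illustrative):
[root@cloudera-cl1-01 ~]# ls -lt /var/run/cloudera-scm-agent/process/ | grep -i hdfs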
Inside these folders, you will find the service keytab. For example, for HDFS, the
corresponding keytab is named hdfs.keytab.
Identify all keytabs that each node uses and note their locations.
You can test each keytab to verify that it is a valid keytab using kinit:
kinit -k -t <keytab> <service>/<principal>@<domain>
Use klist to verify that you have obtained a ticket. For example, the following example
tests an HDFS keytab:
[root@cloudera-cl1-01 ~]# kdestroy
[root@cloudera-cl1-01 ~]# klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_0)
[root@cloudera-cl1-01 ~]# kinit -k -t /var/run/cloudera-scm-agent/process/3712-hdfs-NAMENODE/hdfs.keytab hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
[root@cloudera-cl1-01 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
Once all the keytabs on all nodes are identified, the keytabs can be merged.
Use ktutil to read in the keytabs and write keytabs. On each node, merge the identified
keytabs together.
The following example uses ktutil to read in a keytab and list its principals, in this case
for Hive; the same procedure applies to the HDFS keytab.
# ktutil
ktutil: read_kt /var/run/cloudera-scm-agent/process/3737-hive-HIVEMETASTORE/hive.keytab
ktutil: list
slot KVNO Principal
---- ---- ---------------------------------------------------------------------
1 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
2 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
3 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
4 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
5 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
6 9 hive/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
The following example uses ktutil to read in the HDFS and Hive keytabs, then writes out
a single krb5.keytab. Once the krb5.keytab is written, it is then read back into ktutil and
the principals listed.
# ktutil
ktutil: read_kt /var/run/cloudera-scm-agent/process/3712-hdfs-NAMENODE/hdfs.keytab
ktutil: read_kt /var/run/cloudera-scm-agent/process/3737-hive-HIVEMETASTORE/hive.keytab
ktutil: write_kt /tmp/krb5.keytab
ktutil: read_kt /tmp/krb5.keytab
ktutil: list
slot KVNO Principal
---- ---- ---------------------------------------------------------------------
1 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
2 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
3 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
4 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
5 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
6 11 hdfs/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
7 10 HTTP/cloudera-cl1-01.guard.swg.usma.ibm.com@CLOUDERA
You can then use kinit as shown in previous examples to test the keytab and the different
principals.
Once this single keytab is created, krb5.keytab in the example above, it can be
copied/moved to its final location. Then the Guardium S-TAP can be configured to use
this keytab. Use the steps described in Step 2. Configure Guardium on page 35 to do this.
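For example, assuming the default location used elsewhere in this guide (ownership and permissions shown are illustrative; keep them consistent with what the S-TAP Kerberos plugin expects):
[root@cloudera-cl1-01 ~]# cp /tmp/krb5.keytab /etc/krb5.keytab
[root@cloudera-cl1-01 ~]# chown root:root /etc/krb5.keytab
[root@cloudera-cl1-01 ~]# chmod 600 /etc/krb5.keytab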
Here are the steps to create a computed attribute for SOLR. Use the same basic procedure
for Impala Hue and Hive Hue/Beeswax; the SQL expressions for those are included below.
Right-click on any row in the report and select Invoke create_computed_attribute
as shown below.
if( (LOCATE('user.name=hue&doAs=',FULL_SQL)>0) ,
substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'doAs=')+0)),'&',1),' ')
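As an alternative to the UI step above, a computed attribute can typically also be created from the command line with grdapi create_computed_attribute. The attribute and entity labels below are illustrative assumptions; check the GuardAPI reference for your version for the exact parameter names:
/* illustrative sketch only - verify labels and parameter names for your version
grdapi create_computed_attribute attributeLabel="Hue DB User" entityLabel="Full SQL" expression="if( (LOCATE('user.name=hue&doAs=',FULL_SQL)>0), substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'doAs=')+0)),'&',1),' ')"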
Here is an example of what SOLR message traffic looks like before and after the
computed attribute is applied.
GET
/solr/yelp_demo/select?user.name=hue&doAs=svoruga&q=%2A%3A%2A&wt=json&r
ows=10&start=0&facet=true&facet.mincount=0&facet.limit=10&facet.field={
%21ex%3Dstars}stars&f.stars.facet.limit=16&f.stars.facet.mincount=0&fac
et.field={%21ex%3Dbusiness_id}business_id&f.business_id.facet.limit=21&
f.business_id.facet.mincount=0&facet.field={%21ex%3Dfull_address}full_a
ddress&f.full_address.facet.limit=11&f.full_address.facet.mincount=0&fa
cet.range={%21ex%3Duseful}useful&f.useful.facet.range.start=0&f.useful.
facet.range.end=0&f.useful.facet.range.gap=1&f.useful.facet.mincount=0&
fq={%21tag%3Duseful}useful%3A[0+TO+0}&fq={%21tag%3Dfull_address}{%21fie
ld+f%3Dfull_address}az&fl=date%2Cid&hl=true&hl.fl=%2A&hl.snippets=3
HTTP/1.1
Host: cloudera-cl1-06.guard.swg.usma.ibm.com:8983
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-
431.11.2.el6.x86_64
This SQL assumes you are gathering information based on a log full details policy rule:
IF (instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'__THRIFT message={method=''query'',struct:1=')>0 ,
substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'string:4')+10)),'''',1), '')

IF (instr(GDM_CONSTRUCT.ORIGINAL_SQL,'__THRIFT message={method=''query'',struct:1=')>0 ,
substring_index(substring(GDM_CONSTRUCT.ORIGINAL_SQL,(instr(GDM_CONSTRUCT.ORIGINAL_SQL,'string:4')+10)),'''',1), '')

IF (instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'__THRIFT message={method=''get_table'',struct:0=')>0 ,
substring_index(substring(GDM_CONSTRUCT_TEXT.FULL_SQL,(instr(GDM_CONSTRUCT_TEXT.FULL_SQL,'string:3')+10)),'''',1), '')
For most Hadoop activity, the recommendations in this guide apply just as they do for all
other Hadoop distributions, with the following exceptions.
Big SQL
An S-TAP must be installed on all nodes in which a Big SQL engine is installed. The
support for Big SQL is quite comprehensive and is similar to what Guardium already
supports for DB2. For more details on configuration and capabilities, see the
developerWorks article on this topic here:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1411hadoop-biginsights-guardium/index.html
If Kerberos and/or GPFS is used, then you must configure a special communications exit
on each Big SQL node. Guardium provides a dynamically loaded shared library that
interacts with Big SQL. Big SQL will invoke functions within that library at run time
when it performs SQL and utility requests. Directions for this are included in the
developerWorks article.
Restrictions: Only monitoring and auditing are supported using the exit methodology.
Redaction and blocking are advanced features that are only supported using S-TAP.
Component | Support | Computed attribute required / notes
WebHDFS | Complete | No
Solr | Complete | Yes, to return the DB User. Note that you will need a log full details policy rule for Solr traffic to get the computed attribute.
Impala | Complete | Yes, if you use Hue, to return the user name for THRIFT messages into the DB User field. See Impala: computed attribute to get user name on page 45.
SPARK | Not supported for in-memory usage (open requirement) | You will catch HDFS and MapReduce as data is brought into memory, but interactions in memory will not be captured by Guardium.
Shark | Not supported | n/a
Sqoop | Returned as HDFS and YARN traffic | No
Pig | Returned as HDFS and YARN traffic | No
Zookeeper | Returned as HDFS and YARN traffic | No
Avro | Returned as HDFS and YARN traffic | No
Flume | Returned as HDFS and YARN traffic | No
Notices
Copyright IBM Corp. 2016. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA
ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, Guardium, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM
or other companies. A current list of IBM trademarks is available on the Web at Copyright and trademark information
(www.ibm.com/legal/copytrade.shtml)
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.