
Best Practices

Native Hadoop Tool Sqoop

Version: 2.00
Date: XXX

Copyright 2013 by Teradata. All Rights Reserved.

History / Issues / Approvals

Revision History

Date             Version   Description                                        Author
March 3, 2013    1.0       Initial Draft                                      Sarang Patil
March 12, 2013   1.5       Removed other tools and added Sqoop of Teradata    Sarang Patil
April 2, 2013    2.0       Added interface to Teradata Aster                  Sarang Patil
April 5, 2013    2.2       For Internal DI CoE Review                         Sarang Patil
April 9, 2013    2.3       Review Comments included                           Sarang Patil

Table 1: Revision History

Issues Documentation
The following Issues were defined during Design/Preparation.
Raised By                      Issue                                           Date Needed   Resolution/Answer                               Date Completed   Resolved By
Sada Shiro / Deepak Manganil   We need to include some examples                              Section added.
Chris Ward                     Needs to understand the internals of the                      Section added; no UDA available at this point.
                               Sqoop connector

Table 2: Issues Documentation

Approvals
The undersigned acknowledge they have reviewed the high-level architecture design and agree with its contents.
Name           Role                          Email Approval                                          Approved Version
Steve Fox      Sr. Architect DI CoE          RE List of potential BP reviewers.msg                   Version 2.2
Alex Tuabman   Data Integration Consultant   Approve Best Practice - Native Hadoop Tool Sqoop.msg    Version 2.2

Table 3: Document Signoff


Table of Contents

Executive summary
Apache Sqoop - Overview
    Ease of Use
    Ease of Extension
    Security
Apache Sqoop help tool
Best practices for Sqoop Installation
    Server installation
    Installing Dependencies
    Configuring Server
    Server Life Cycle
    Client installation
    Debugging information
Best practices for importing data to Hadoop
    Importing data to HDFS
    Importing Data into Hive
    Importing Data into HBase
Best practices for exporting data from Hadoop
Best practices NoSQL database
Best practices operational
    Operational Do's
    Operational Don'ts
Technical implementation of Sqoop JDBC
Sqoop sample use case
    Exporting data to HDFS
    Importing data from HDFS
Sqoop informational links
Summary

Table of Figures

Figure 1: Sqoop Architecture
Figure 2: Sqoop Import Job
Figure 3: Sqoop Export Job

Executive summary
Currently there are three proven standard methods to interface Hadoop with Teradata, as well as with
Teradata Aster:
1. Using flat file interfaces
   a. Available for Teradata
   b. Available for Teradata Aster
2. Using the SQL-H interface
   a. Available for Teradata in Q3 of 2013
   b. Available for Teradata Aster
3. Using the Apache tool Sqoop
   a. Available for Teradata
   b. Available for Teradata Aster
In a big data environment it is not recommended to write and read big data multiple times, so the flat file
interface should be used only when none of the other options are available. The best practices
for these interfaces are documented in Best Practices for Teradata Tools and Utilities.
Detailed best practices for SQL-H are documented in the separate document Best Practices for Aster Data
Integration. Currently (Q2 2013) the SQL-H interface is available for the Teradata Aster platform; the SQL-H
interface for Teradata will be available in Q3 of 2013.
The scope of this document is to detail best practices for the native Hadoop tool Sqoop. The current version of
Sqoop is Sqoop 2.

Apache Sqoop - Overview


Using Hadoop for analytics and data processing requires loading data into clusters and processing it in
conjunction with other data that often resides in production databases across the enterprise. Loading
bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on
large clusters, can be a challenging task. Users must consider details like ensuring consistency of the data,
the consumption of production system resources, and data preparation for provisioning downstream
pipelines. Transferring data using scripts is inefficient and time consuming. Directly accessing data
residing on external systems from within MapReduce applications complicates those applications and
exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software
Foundation. More information on the project can be found at http://incubator.apache.org/sqoop.
Sqoop allows easy import and export of data from structured data stores such as relational databases,
enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from an external
system onto HDFS, and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to
schedule and automate import and export tasks. Sqoop uses a connector-based architecture which
supports plugins that provide connectivity to new external systems.
What happens underneath the covers when you run Sqoop is very straightforward. The dataset being
transferred is sliced up into different partitions and a map-only job is launched with individual mappers
responsible for transferring a slice of the dataset. Each record of the data is handled in a type-safe
manner since Sqoop uses the database metadata to infer the data types.
The rest of this document walks through examples that show the various ways you can use Sqoop. The
goal is to give an overview of Sqoop operation without going into much detail or advanced functionality.

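To make the parallel transfer concrete, the sketch below shows how the number of mappers and the split column can be chosen on the command line. The table ORDERS, the column ORDER_ID and the connection details are illustrative placeholders only, not values prescribed by this document.

# Illustrative only: split the ORDERS table on ORDER_ID and transfer it with 8 parallel mappers
$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
  --table ORDERS --username <<USERNAME>> --password <<PASSWORD>> \
  --split-by ORDER_ID --num-mappers 8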
Figure 1: Sqoop Architecture

Ease of Use
Whereas Sqoop requires client-side installation and configuration, Sqoop 2 will be installed and
configured server-side. This means that connectors will be configured in one place, managed by the
Admin role and run by the Operator role. Likewise, JDBC drivers will be in one place and database
connectivity will only be needed on the server. Sqoop 2 will be a web-based service: front-ended by a
Command Line Interface (CLI) and browser and back-ended by a metadata repository. Moreover, Sqoop
2's service level integration with Hive and HBase will be on the server-side. Oozie will manage Sqoop
tasks through the REST API. This decouples Sqoop internals from Oozie, i.e. if you install a new Sqoop
connector then you won't need to install it in Oozie also.

Ease of Extension
In Sqoop 2, connectors will no longer be restricted to the JDBC model, but can rather define their own
vocabulary, e.g. Couchbase no longer needs to specify a table name, only to overload it as a backfill or
dump operation.

Common functionality will be abstracted out of connectors, holding them responsible only for data
transport. The reduce phase will implement common functionality, ensuring that connectors benefit
from future development of functionality.
Sqoop 2's interactive web-based UI will walk users through import/export setup, eliminating redundant
steps and omitting incorrect options. Connectors will be added in one place, with the connectors
exposing necessary options to the Sqoop framework. Thus, users will only need to provide information
relevant to their use-case.
With the user making an explicit connector choice in Sqoop 2, it will be less error-prone and more
predictable. In the same way, the user will not need to be aware of the functionality of all connectors.
As a result, connectors no longer need to provide downstream functionality, transformations, and
integration with other systems. Hence, the connector developer no longer has the burden of
understanding all the features that Sqoop supports.

Security
Currently, Sqoop operates as the user that runs the 'sqoop' command. The security principal used by a
Sqoop job is determined by what credentials the users have when they launch Sqoop. Going forward,
Sqoop 2 will operate as a server based application with support for securing access to external systems
via role-based access to Connection objects. For additional security, Sqoop 2 will no longer allow code
generation, require direct access to Hive and HBase, nor open up access to all clients to execute jobs.
Sqoop 2 will introduce Connections as First-Class Objects. Connections, which will encompass
credentials, will be created once and then used many times for various import/export jobs. Connections
will be created by the Admin and used by the Operator, thus preventing credential abuse by the end
user. Furthermore, Connections can be restricted based on operation (import/export). By limiting the
total number of physical Connections open at one time, and with an option to disable Connections,
resources can be managed.

Apache Sqoop help tool


Sqoop ships with a help tool. To display a list of all available tools,
type the following command:
$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

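The same mechanism can be used to display the options of a single tool, for example:

$ sqoop help import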
Best practices for Sqoop Installation


Sqoop ships as one binary package; however, it is composed of two separate parts - client and server.
You need to install the server on a single node in your cluster. This node will then serve as an entry point for
all connecting Sqoop clients. The server acts as a MapReduce client and therefore Hadoop must be installed
and configured on the machine hosting the Sqoop server. Clients can be installed on any arbitrary number of
machines. The client does not act as a MapReduce client and thus you do not need to install Hadoop on
nodes that will act only as a Sqoop client.

Server installation
Copy the Sqoop artifact to the machine where you want to run the Sqoop server. This machine must have
Hadoop installed and configured. You do not need to run any Hadoop related services there; however, the
machine must be able to act as a Hadoop client. You should be able to list the contents of HDFS, for example:

hadoop dfs -ls

The Sqoop server supports multiple Hadoop versions. However, as Hadoop major versions are not
compatible with each other, Sqoop has multiple binary artifacts - one for each supported major version
of Hadoop. You need to make sure that you are using the appropriate binary artifact for your specific
Hadoop version. To install the Sqoop server, decompress the appropriate distribution artifact in a location of
your convenience and change your working directory to this folder.

# Decompress the Sqoop distribution tarball
tar -xvf sqoop-<version>-bin-hadoop<hadoop-version>.tar.gz

# Move the decompressed content to any location
mv sqoop-<version>-bin-hadoop<hadoop-version> /usr/lib/sqoop

# Change working directory
cd /usr/lib/sqoop

Installing Dependencies
You need to install the Hadoop libraries into the Sqoop server war file. Sqoop provides the convenience script
addtowar.sh to do so. If you have installed Hadoop in the usual location /usr/lib and the executable
hadoop is in your path, you can use the automatic Hadoop installation procedure:

./bin/addtowar.sh -hadoop-auto

In case you have Hadoop installed in a different location, you will need to manually specify the Hadoop
version and the path to the Hadoop libraries. You can use the parameter -hadoop-version for specifying the
Hadoop major version; versions 1.x and 2.x are currently supported. The path to the Hadoop libraries can be
specified using the -hadoop-path parameter. In case your Hadoop libraries are in multiple different
folders, you can specify all of them separated by ":".
Example of manual installation:

./bin/addtowar.sh -hadoop-version 2.0 -hadoop-path /usr/lib/hadoop-common:/usr/lib/hadoop-hdfs:/usr/lib/hadoop-yarn

Lastly, you might need to install JDBC drivers that are not bundled with Sqoop because of incompatible
licenses. You can add any arbitrary Java jar file to the Sqoop server using the script addtowar.sh with the -jars
parameter. Similarly to the Hadoop path, you can enter multiple jars separated by ":".
Example of installing the MySQL JDBC driver into the Sqoop server:

./bin/addtowar.sh -jars /path/to/jar/mysql-connector-java-*-bin.jar

Configuring Server
Before starting the server you should revise the configuration to match your specific environment. Server
configuration files are stored in the server/config directory of the distributed artifact, alongside the other
configuration files of Tomcat.
The file sqoop_bootstrap.properties specifies which configuration provider should be used for
loading the configuration for the rest of the Sqoop server. The default value
PropertiesConfigurationProvider should be sufficient.
The second configuration file, sqoop.properties, contains the remaining configuration properties that can
affect the Sqoop server. The file is very well documented, so check whether all configuration properties fit your
environment. The default, or very little tweaking, should be sufficient for most common cases.

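As a quick check before the first start, the two files can be reviewed in place (paths are relative to the Sqoop distribution directory described above):

# Review the bootstrap and main server configuration before starting the server
cat server/config/sqoop_bootstrap.properties
vi server/config/sqoop.properties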
Server Life Cycle


After installation and configuration you can start the Sqoop server with the following command:

./bin/sqoop.sh server start

Similarly, you can stop the server using the following command:

./bin/sqoop.sh server stop

Client installation
The client does not need any extra installation and configuration steps. Just copy the Sqoop distribution artifact to the
target machine and unzip it in the desired location. You can start the client with the following command:

bin/sqoop.sh client

Debugging information
The logs of the Tomcat server are located under the server/logs directory in the Sqoop2 distribution
directory.
The logs of the Sqoop2 server and the Derby repository are located as sqoop.log and derbyrepo.log (by
default, unless changed by the above configuration), respectively, under the (LOGS) directory in the
Sqoop2 distribution directory.

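A simple way to inspect these logs, assuming the default names and locations described above (adjust the paths if your configuration changes them):

# Tomcat server logs
tail -n 100 server/logs/*.log
# Sqoop2 server and Derby repository logs (default file names)
tail -n 100 sqoop.log derbyrepo.log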
Best practices for importing data to Hadoop


The following section describes the options for importing data from an RDBMS into Hadoop HDFS, as well as
into higher-level constructs like Hive and HBase.

Importing data to HDFS


The following command is used to import all data from a table called ORDERS from a Teradata database:
$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
  --table <<TABLE NAME>> --username <<USERNAME>> --password <<PASSWORD>>

import: This is the sub-command that instructs Sqoop to initiate an import.

--connect <connect string>, --username <user name>, --password <password>: These are
connection parameters that are used to connect with the database. This is no different from the
connection parameters that you use when connecting to the database via a JDBC connection.

--table <table name>: This parameter specifies the table which will be imported.

The import is done in two steps as depicted in Figure 2 below. In the first step Sqoop introspects the
database to gather the necessary metadata for the data being imported.
The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is this job that does the
actual data transfer using the metadata captured in the previous step.
The imported data is saved in a directory on HDFS based on the table being imported. As is the case with
most aspects of Sqoop operation, the user can specify any alternative directory where the files should
be populated.
By default these files contain comma delimited fields, with new lines separating different records. You
can easily override the format in which data is copied over by explicitly specifying the field separator and
record terminator characters.
Sqoop also supports different data formats for importing data. For example, you can easily import data
in Avro data format by simply specifying the option --as-avrodatafile with the import command.
There are many other options that Sqoop provides which can be used to further tune the import
operation to suit your specific requirements.
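For illustration, the sketch below shows a few of these options. The target directory, delimiter and connection details are placeholders and should be adapted to your environment.

# Illustrative only: explicit HDFS target directory and pipe-delimited fields
$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
  --table <<TABLE NAME>> --username <<USERNAME>> --password <<PASSWORD>> \
  --target-dir /user/stagedata/ORDERS --fields-terminated-by '|'

# Illustrative only: the same import written as Avro data files
$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
  --table <<TABLE NAME>> --username <<USERNAME>> --password <<PASSWORD>> \
  --as-avrodatafile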

Figure 2: Sqoop Import Job

Importing Data into Hive


In most cases, importing data into Hive is the same as running the import task and then using Hive to
create and load a certain table or partition. Doing this manually requires that you know the correct type
mapping between the data and other details like the serialization format and delimiters.
Sqoop takes care of populating the Hive meta-store with the appropriate metadata for the table and
also invokes the necessary commands to load the table or partition as the case may be. All of this is
done by simply specifying the option --hive-import with the import command.

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
  --table <<TABLE NAME>> --username <<USERNAME>> --password <<PASSWORD>> \
  --hive-import

When you run a Hive import, Sqoop converts the data from the native datatypes within the external
datastore into the corresponding types within Hive.
Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported has new
line or other Hive delimiter characters in it, Sqoop allows you to remove such characters and get the
data correctly populated for consumption in Hive.
Once the import is complete, you can see and operate on the table just like any other table in Hive.
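As a hedged illustration of the delimiter handling, the target Hive table can be named explicitly and problem characters stripped from string fields; the Hive table name below is a placeholder.

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
  --table <<TABLE NAME>> --username <<USERNAME>> --password <<PASSWORD>> \
  --hive-import --hive-table <<HIVE TABLE NAME>> --hive-drop-import-delims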

Importing Data into HBase


You can use Sqoop to populate data in a particular column family within the HBase table. Much like the
Hive import, this can be done by specifying the additional options that relate to the HBase table and
column family being populated. All data imported into HBase is converted to its string representation
and inserted as UTF-8 bytes.

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
  --table <<TABLE NAME>> --username <<USERNAME>> --password <<PASSWORD>> \
  --hbase-create-table --hbase-table MYTABLE --column-family Teradata
In this command the various options specified are as follows:

--hbase-create-table: This option instructs Sqoop to create the HBase table.


--hbase-table: This option specifies the table name to use.
--column-family: This option specifies the column family name to use.

Export is done in two steps as depicted in Figure 3 below. The first step is to introspect the database for
metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into
splits and then uses individual map tasks to push the splits to the database. Each map task performs this
transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.
Some connectors support staging tables that help isolate production tables from possible corruption in
case of job failures due to any reason. Staging tables are first populated by the map tasks and then
merged into the target table once all of the data has been delivered.

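A sketch of how a staging table can be requested for connectors that support it; ORDERS_STAGE is a placeholder and must already exist with the same schema as the target table.

# Illustrative only: export via a staging table so the target is loaded in one final step
$ sqoop export --connect jdbc:teradata://12.13.24.54/ \
  --table ORDERS --username <<USERNAME>> --password <<PASSWORD>> \
  --export-dir /user/stagedata/20130201/ORDERS \
  --staging-table ORDERS_STAGE --clear-staging-table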
Figure 3: Sqoop Export Job

Best practices for exporting data from Hadoop


In some cases data processed by Hadoop pipelines may be needed in production systems to help run
additional critical business functions. Sqoop can be used to export such data into external data stores as
necessary.
Continuing our example from above - if data generated by the pipeline on Hadoop corresponded to the
ORDERS table in a database somewhere, you could populate it using the following command:
$ sqoop export --connect jdbc:teradata://12.13.24.54/ \
  --table ORDERS --username test --password **** \
  --export-dir /user/stagedata/20130201/ORDERS

export: This is the sub-command that instructs Sqoop to initiate an export.


--connect <connect string>, --username <user name>, --password <password>: These are
connection parameters that are used to connect with the database. This is no different from the
connection parameters that you use when connecting to the database via a JDBC connection.
--table <table name>: This parameter specifies the table which will be populated.
--export-dir <directory path>: This is the directory from which data will be exported.

Export is done in two steps as depicted in Figure 3. The first step is to introspect the database for
metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into
splits and then uses individual map tasks to push the splits to the database. Each map task performs this
transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

Best practices NoSQL database


Using specialized connectors, Sqoop can connect with external systems that have optimized import and
export facilities, or that do not support native JDBC. Connectors are plugin components based on Sqoop's
extension framework and can be added to any existing Sqoop installation. Once a connector is installed,
Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the
connector.
By default Sqoop includes connectors for various popular databases such as Teradata, Teradata Aster,
MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and
PostgreSQL databases. Fast-path connectors are specialized connectors that use database specific batch
tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be
used to connect to any database that is accessible via JDBC.
Apart from the built-in connectors, many companies have developed their own connectors that can be
plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to
NoSQL datastores.
Sqoop2 can transfer large datasets between Hadoop and external datastores such as relational
databases. Beyond this, Sqoop offers many advanced features such as different data formats,
compression, working with queries instead of tables, etc.

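As a hedged sketch, a database without a dedicated connector can be reached through the generic JDBC connector by naming the driver class explicitly. The driver class, connect string and credentials below are placeholders, not values from this document.

$ sqoop import --driver com.example.jdbc.Driver \
  --connect jdbc:example://dbserver:1234/salesdb \
  --table ORDERS --username <<USERNAME>> --password <<PASSWORD>>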
Best practices operational


Operational Do's

If you need to move big data, make it small first, and then move the small data.
Prepare the data model in advance to ensure that queries touch the least amount of data.
Always create an empty export table.
Do use the --escaped-by option during import and --input-escaped-by during export.
Do use --fields-terminated-by during import and --input-fields-terminated-by during export.
Do specify the direct mode option (--direct) if you use the direct connector.
Develop some kind of incremental import when sqoop-ing in large tables (see the sketch after this list).
o If you do not, your Sqoop jobs will take longer and longer as the data grows at the source.
Compress data in HDFS.
o You will save space on HDFS, as your replication factor makes multiple copies of your data.
o You will benefit in processing, as your Map/Reduce jobs have less data to read and Hadoop becomes less I/O bound.

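The sketch below illustrates the incremental import and compression recommendations above; the check column and last value are placeholders. With --incremental append, only rows whose check column value is greater than the recorded last value are transferred.

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
  --table ORDERS --username <<USERNAME>> --password <<PASSWORD>> \
  --incremental append --check-column ORDER_ID --last-value 1000 \
  --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec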
Operational Don'ts

Don't use the same table for both import and export.
Don't specify the query if you use the direct connector.
Don't have too many partitions of the same file stored in HDFS.
o This translates into time consuming map tasks; use partitioning if possible.
o 1,000 partitions will perform better than 10,000 partitions.

Technical implementation of Sqoop JDBC


The following section describes how the data is transferred using the JDBC connection, including the
technical implementation of the data pipes into and out of Teradata as well as HDFS.
To be updated once we have this information.

Sqoop sample use case


The following section describes the use cases and examples of how to transfer the data. To be updated
once we have this information.

Exporting data to HDFS

Exporting an entire table to HDFS
Exporting a table to Hive using a SQL statement
Exporting a table to Hive using a SQL join statement
Exporting an entire table to HBase

Importing data from HDFS

Importing an entire table from HDFS to Teradata
Importing an entire table from HDFS to Teradata Aster
Importing a table to Hive using a SQL statement to Teradata
Importing a table to Hive using a SQL join statement to Teradata Aster

Sqoop informational links


Subject Area                               Links to Sqoop Project

Sqoop2 Download                            Download
Sqoop2 Documentation                       Documentation
API Documentation                          Sqoop2 API documentation
Sqoop Troubleshooting Guide                Sqoop Troubleshooting Tips
Teradata Sqoop connector                   Teradata Sqoop Connector
Teradata Aster Sqoop connector             Teradata Aster Sqoop connector
Frequently asked questions                 FAQ
Sqoop2 Project Status                      Sqoop2 Project Status
Sqoop2 command line interface details      Command Line Client
Issues related to Sqoop                    Issue Tracker (JIRA)

Summary
Sqoop 2 will enable users to use Sqoop effectively with a minimal understanding of its details by having
a web-application run Sqoop, which allows Sqoop to be installed once and used from anywhere.
In addition, having a REST API for operation and management will help Sqoop integrate better with
external systems such as Oozie.
Also, introducing a reduce phase allows connectors to be focused only on connectivity and ensures that
Sqoop functionality is uniformly available for all connectors. This facilitates ease of development of
connectors.
