Version: 2.00
Date: XXX
Revision History
Date             Version   Description     Author
March 3, 2013    1.0       Initial Draft   Sarang Patil
March 12, 2013   1.5                       Sarang Patil
April 2, 2013    2.0                       Sarang Patil
April 5, 2013    2.2                       Sarang Patil
April 9, 2013    2.3                       Sarang Patil
Table 1: Revision History
Issues Documentation
The following issues were defined during design/preparation.
Raised By                                 Issue   Date Needed   Resolution/Answer   Date Completed   Resolved By
Sada Shiro / Deepak Mangani                                     Section added.                       Chris Ward
Approvals
The undersigned acknowledge they have reviewed the high-level architecture design and agree with its contents.
Name            Role   Email Approval                                           Approved Version
Steve Fox              RE List of potential BP reviewers.msg                    Version 2.2
Alex Tuabman           Approve Best Practice - Native Hadoop Tool Sqoop.msg     Version 2.2
Table of Contents
Executive summary
Apache Sqoop - Overview
    Ease of Use
    Ease of Extension
    Security
Summary
Table of Figures
Figure 1: Sqoop Architecture
Figure 2: Sqoop Import Job
Figure 3: Sqoop Export Job
Executive summary
Currently there are three proven standard methods for interfacing Hadoop with Teradata and with Teradata Aster.
1. Using Flat file interfaces
a. Available for Teradata
b. Available for Teradata Aster
2. Using the SQL-H interface
a. Available for Teradata in Q3 2013
b. Available for Teradata Aster
3. Using the Apache tool Sqoop
a. Available for Teradata
b. Available for Teradata Aster
In a big data environment it is not recommended to write and read big data multiple times; the flat file interface should be used only when none of the other options is available. The best practices for these interfaces are documented in Best Practices for Teradata Tools and Utilities.
Detailed best practices for SQL-H are documented in the separate document Best Practices for Aster Data Integration. Currently (Q2 2013) the SQL-H interface is available for the Teradata Aster platform; the SQL-H interface for Teradata will be available in Q3 2013.
The scope of this document is to detail best practices for the native Hadoop tool Sqoop. The current version of Sqoop is Sqoop 2.
Apache Sqoop - Overview
This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.
Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from external systems onto HDFS and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture which supports plugins that provide connectivity to new external systems.
What happens underneath the covers when you run Sqoop is very straightforward. The dataset being
transferred is sliced up into different partitions and a map-only job is launched with individual mappers
responsible for transferring a slice of this dataset. Each record of the data is handled in a type safe
manner since Sqoop uses the database metadata to infer the data types.
In the rest of this document we will walk through an example that shows the various ways you can use Sqoop. The goal is to give an overview of Sqoop operation without going into much detail or advanced functionality.
Ease of Use
Whereas Sqoop requires client-side installation and configuration, Sqoop 2 will be installed and
configured server-side. This means that connectors will be configured in one place, managed by the
Admin role and run by the Operator role. Likewise, JDBC drivers will be in one place and database
connectivity will only be needed on the server. Sqoop 2 will be a web-based service: front-ended by a
Command Line Interface (CLI) and browser and back-ended by a metadata repository. Moreover, Sqoop
2's service-level integration with Hive and HBase will be on the server side. Oozie will manage Sqoop tasks through the REST API. This decouples Sqoop internals from Oozie, i.e., if you install a new Sqoop connector, you won't need to install it in Oozie as well.
Ease of Extension
In Sqoop 2, connectors will no longer be restricted to the JDBC model, but can rather define their own
vocabulary, e.g. Couchbase no longer needs to specify a table name, only to overload it as a backfill or
dump operation.
Common functionality will be abstracted out of connectors, holding them responsible only for data
transport. The reduce phase will implement common functionality, ensuring that connectors benefit
from future development of functionality.
Sqoop 2's interactive web-based UI will walk users through import/export setup, eliminating redundant
steps and omitting incorrect options. Connectors will be added in one place, with the connectors
exposing necessary options to the Sqoop framework. Thus, users will only need to provide information
relevant to their use-case.
With the user making an explicit connector choice in Sqoop 2, it will be less error-prone and more
predictable. In the same way, the user will not need to be aware of the functionality of all connectors.
As a result, connectors no longer need to provide downstream functionality, transformations, and
integration with other systems. Hence, the connector developer no longer has the burden of
understanding all the features that Sqoop supports.
Security
Currently, Sqoop operates as the user that runs the 'sqoop' command. The security principal used by a
Sqoop job is determined by what credentials the users have when they launch Sqoop. Going forward,
Sqoop 2 will operate as a server based application with support for securing access to external systems
via role-based access to Connection objects. For additional security, Sqoop 2 will no longer allow code
generation, require direct access to Hive and HBase, nor open up access to all clients to execute jobs.
Sqoop 2 will introduce Connections as First-Class Objects. Connections, which will encompass
credentials, will be created once and then used many times for various import/export jobs. Connections
will be created by the Admin and used by the Operator, thus preventing credential abuse by the end
user. Furthermore, Connections can be restricted based on operation (import/export). By limiting the total number of physical Connections open at one time, and with an option to disable Connections, resources can be managed.
Available commands:
codegen
create-hive-table
eval
export
help
import
import-all-tables
list-databases
list-tables
version
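Details for any individual command can be displayed with the built-in help; for example, the following invocation (output varies by Sqoop version) prints the usage of the import tool:
sqoop help import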
Server installation
Copy the Sqoop artifact to the machine where you want to run the Sqoop server. This machine must have Hadoop installed and configured. You don't need to run any Hadoop-related services there; however, the machine must be able to act as a Hadoop client. You should be able to list HDFS contents, for example:
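A minimal check of this kind, assuming the hadoop client script is on your PATH, is simply:
hadoop fs -ls /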
The Sqoop server supports multiple Hadoop versions. However, as Hadoop major versions are not compatible with each other, Sqoop has multiple binary artifacts - one for each supported major version of Hadoop. You need to make sure that you're using the appropriate binary artifact for your specific Hadoop version. To install the Sqoop server, decompress the appropriate distribution artifact in a location of your convenience and change your working directory to that folder.
Installing Dependencies
You need to install the Hadoop libraries into the Sqoop server war file. Sqoop provides the convenience script addtowar.sh to do so. If you have installed Hadoop in the usual location in /usr/lib and the hadoop executable is in your path, you can use the automatic Hadoop installation procedure:
./bin/addtowar.sh -hadoop-auto
If you have Hadoop installed in a different location, you will need to manually specify the Hadoop version and the path to the Hadoop libraries. You can use the -hadoop-version parameter to specify the Hadoop major version; versions 1.x and 2.x are currently supported. The path to the Hadoop libraries can be specified using the -hadoop-path parameter. If your Hadoop libraries are in multiple different folders, you can specify all of them separated by ':'.
Example of manual installation:
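(The Hadoop version and library paths below are illustrative placeholders; substitute the values that match your installation.)
./bin/addtowar.sh -hadoop-version 2.0 -hadoop-path /usr/lib/hadoop:/usr/lib/hadoop-hdfs:/usr/lib/hadoop-mapreduce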
Lastly, you might need to install JDBC drivers that are not bundled with Sqoop because of incompatible licenses. You can add any arbitrary Java jar file to the Sqoop server using the addtowar.sh script with the -jars parameter. As with the Hadoop path, you can enter multiple jars separated by ':'.
Example of installing MySQL JDBC driver to Sqoop server:
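(The jar path below is a placeholder for wherever the MySQL Connector/J jar resides on your system.)
./bin/addtowar.sh -jars /path/to/mysql-connector-java.jar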
Configuring Server
Before starting the server, you should revise the configuration to match your specific environment. Server configuration files are stored in the server/config directory of the distributed artifact, alongside the other Tomcat configuration files.
The file sqoop_bootstrap.properties specifies which configuration provider should be used for loading the configuration for the rest of the Sqoop server. The default value, PropertiesConfigurationProvider, should be sufficient.
The second configuration file, sqoop.properties, contains the remaining configuration properties that can affect the Sqoop server. The file is very well documented, so check whether all configuration properties fit your environment. Defaults or very little tweaking should be sufficient for the most common cases.
Client installation
The client does not need extra installation and configuration steps. Just copy the Sqoop distribution artifact to the target machine and unzip it in the desired location. You can start the client with the following command:
bin/sqoop.sh client
Debugging information
The logs of the Tomcat server are located under the server/logs directory in the Sqoop2 distribution directory.
The logs of the Sqoop2 server and the Derby repository are written to sqoop.log and derbyrepo.log respectively (by default, unless changed by the above configuration), under the (LOGS) directory in the Sqoop2 distribution directory.
--connect <connect string>, --username <user name>, --password <password>: These are
connection parameters that are used to connect with the database. This is no different from the
connection parameters that you use when connecting to the database via a JDBC connection.
--table <table name>: This parameter specifies the table which will be imported.
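Putting these options together, an import invocation might look like the following sketch; the JDBC URL, credentials, and table name are illustrative only:
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password ****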
The import is done in two steps, as depicted in Figure 2 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported.
The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is this job that does the
actual data transfer using the metadata captured in the previous step.
The imported data is saved in a directory on HDFS based on the table being imported. As is the case with
most aspects of Sqoop operation, the user can specify any alternative directory where the files should
be populated.
By default these files contain comma delimited fields, with new lines separating different records. You
can easily override the format in which data is copied over by explicitly specifying the field separator and
record terminator characters.
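For instance, tab-separated fields and newline-terminated records could be requested as in this sketch (the connection details are the same illustrative placeholders as above):
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --fields-terminated-by '\t' --lines-terminated-by '\n'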
Sqoop also supports different data formats for importing data. For example, you can easily import data
in Avro data format by simply specifying the option --as-avrodatafile with the import command.
There are many other options that Sqoop provides which can be used to further tune the import
operation to suit your specific requirements.
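As a sketch, the earlier import command simply gains the extra flag:
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --as-avrodatafile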
Figure 2: Sqoop Import Job
When you run a Hive import, Sqoop converts the data from the native datatypes within the external
datastore into the corresponding types within Hive.
Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported has new
line or other Hive delimiter characters in it, Sqoop allows you to remove such characters and get the
data correctly populated for consumption in Hive.
Once the import is complete, you can see and operate on the table just like any other table in Hive.
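A Hive import is requested with the --hive-import flag; an illustrative sketch (connection details are placeholders):
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --hive-import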
Export is done in two steps as depicted in Figure 3. The first step is to introspect the database for
metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into
splits and then uses individual map tasks to push the splits to the database. Each map task performs this
transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.
Some connectors support staging tables that help isolate production tables from possible corruption in case of job failures. Staging tables are first populated by the map tasks and then merged into the target table once all of the data has been delivered.
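An export invocation might look like the following sketch; the connection details and the HDFS directory passed to --export-dir are illustrative placeholders:
sqoop export --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --export-dir /user/example/ORDERS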
Figure 3: Sqoop Export Job
Operational Do's
- If you need to move big data, make it small first, and then move the small data.
- Prepare the data model in advance to ensure that queries touch the least amount of data.
- Always create an empty export table.
- Do use the --escaped-by option during import and --input-escaped-by during export.
- Do use --fields-terminated-by during import and --input-fields-terminated-by during export.
- Do specify the direct mode option (--direct) if you use the direct connector.
- Develop some kind of incremental import when sqoop-ing in large tables (see the sketch after this list).
  o If you do not, your Sqoop jobs will take longer and longer as the data grows.
- Compress data in HDFS.
  o You will save space on HDFS, as your replication factor makes multiple copies of your data.
  o You will benefit in processing, as your Map/Reduce jobs have less data to read and Hadoop becomes less I/O bound.
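As referenced in the list above, a sketch of an incremental import; the check column and last value are placeholders, while --incremental, --check-column, and --last-value are standard Sqoop import options:
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --incremental append --check-column ORDER_ID --last-value 1000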
Operational Don'ts
- Don't use the same table for both import and export.
- Don't specify the query if you use the direct connector.
- Don't create too many partitions of the same file stored in HDFS.
  o This translates into many time-consuming map tasks; keep partitioning coarse where possible.
  o 1,000 partitions will perform better than 10,000 partitions.
Further information, including the Sqoop2 download, the Sqoop2 documentation, and the API documentation, is available from the Apache Sqoop project site (http://incubator.apache.org/sqoop).
Summary
Sqoop 2 will enable users to use Sqoop effectively with a minimal understanding of its details by having
a web-application run Sqoop, which allows Sqoop to be installed once and used from anywhere.
In addition, having a REST API for operation and management will help Sqoop integrate better with
external systems such as Oozie.
Also, introducing a reduce phase allows connectors to be focused only on connectivity and ensures that
Sqoop functionality is uniformly available for all connectors. This facilitates ease of development of
connectors.