Big Data and Hadoop

Module 9: Introduction to Apache Sqoop and Flume


Why Sqoop?

As companies across industries move data from structured relational databases such as MySQL, Teradata, and Netezza to Hadoop, there are concerns about how easily existing data can be transitioned to Hadoop systems.

User considerations:

Data consistency

Production system resource consumption

Data preparation

Transition efficiency


What is Sqoop?

Sqoop, an Apache Hadoop Ecosystem project, is a command-line interface application for transferring data between
relational databases and Hadoop. It supports incremental loads of a single table, or a free-form SQL query. Exports can be
used to put data from Hadoop into a relational database.
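For example, an incremental load of a single table might look like the following sketch (the connection details, table, check column, and last imported value are illustrative placeholders):

$sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table orders --incremental append --check-column id --last-value 100 --target-dir /user/sqoop/orders -m 1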


Sqoop and Its Uses

Sqoop is an Apache Hadoop ecosystem project that handles import and export operations between relational databases and Hadoop. Reasons for using Sqoop are as follows:

SQL servers are deployed worldwide

Allows selected parts of data to be moved from traditional SQL databases to Hadoop

Transferring data using hand-written scripts is inefficient and time-consuming, hence Sqoop is used

Handles large volumes of data through the Hadoop ecosystem

Brings processed data from Hadoop back to the applications


Sqoop and Its Uses (Cont’d)

Sqoop is required when data is imported from a relational database (RDB) to Hadoop or exported from Hadoop to an RDB.

While importing data from an RDB to Hadoop, users must consider the consistency of the data, the consumption of production system resources, and the preparation of the data for provisioning the downstream pipeline.

While exporting data from Hadoop to an RDB, users must keep in mind that directly accessing data residing on external systems from within a MapReduce framework complicates applications and exposes the production system to excessive load originating from cluster nodes.


Benefits of Sqoop

The following are the benefits of using Sqoop:

Used to import data from an RDB to the Hadoop Distributed File System (HDFS)

Exports data back to the RDB

Transfers data from Hadoop to an RDB and vice versa

Transforms data in Hadoop with MapReduce without extra coding


Sqoop Processing

The processing of Sqoop can be summarized as follows:

It runs in the Hadoop Cluster

It imports data from RDB or NoSQL DB to Hadoop

It has access to the Hadoop core, which helps it use mappers to slice the incoming data into parts and place them in HDFS

It exports data back into the RDB, ensuring that the schema of the data in the database is maintained
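For instance, the degree of parallelism and the column used to slice the data can be controlled with -m and --split-by; a sketch of such an import (database, table, and directory names are illustrative) is:

$sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table orders --split-by id -m 4 --target-dir /user/sqoop/orders_parallel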


Importing Data Using Sqoop

To import data present in a MySQL database using Sqoop, use a command of the following form:
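(The database, table, and HDFS target directory below are illustrative placeholders.)

$sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table employees --target-dir /user/sqoop/employees -m 1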


Exporting Data from Hadoop Using Sqoop

Use a command of the following form to export data from Hadoop into a relational database using Sqoop:
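(The database, table, export directory, and field delimiter below are illustrative placeholders.)

$sqoop export --connect jdbc:mysql://localhost/mydb --username root --password root --table results --export-dir /user/sqoop/results --input-fields-terminated-by ','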


Sqoop Connectors

The different types of Sqoop connectors are:

Generic JDBC connector: Used to connect to any database that is accessible via JDBC

Default Sqoop connector: Designed for specific databases

Fast-path connector: Specialized to use specific batch tools for transferring data with high throughput
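For example, with MySQL the fast-path connector can be selected by adding the --direct flag, which uses mysqldump for higher-throughput transfers (connection details below are illustrative):

$sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table orders --direct --target-dir /user/sqoop/orders_direct -m 1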


Sample Sqoop Commands

$sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table sqoop_demo --target-dir /user/sqoop_batch6 -m 1 --as-textfile

$sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table sqoop_demo --target-dir /user/sqoop_batch6 -m 1 --as-textfile --where "id>2"

$sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root -e 'select * from sqoop_demo where id=13 and $CONDITIONS' --target-dir /user/sqoop_batch6 -m 1 --as-textfile


More Sample Sqoop Commands

$sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table sqoop_demo --target-dir /user/sqoop_batch6 -m 1 --as-textfile --split-by id

$sqoop job --list

$sqoop export --connect jdbc:mysql://localhost/a2 --username root --password root --table report1 --export-dir /user/hive/warehouse/mobile_vs_sentiments1/ --input-fields-terminated-by '\001'


Demo

Export Data Using Sqoop from Hadoop


Why Flume?

A business scenario in which Flume is beneficial:

Current scenario: A company has thousands of services running on different servers in a cluster, and these services produce many large logs that need to be analyzed together.


Why Flume (Cont’d)

Consider the following case study:

Problem: The issue is how to send these logs to a setup that has Hadoop. The channel or method used for the transfer must be reliable, scalable, extensible, and manageable.


Why Flume (Cont’d)

Consider the following case study:

Solution: A log aggregation tool called Flume can be used to solve this problem. Apache Sqoop and Flume are the tools used to gather data from different sources and load it into HDFS. Sqoop is used to extract structured data from databases such as Teradata and Oracle, whereas Flume sources data stored in different systems and deals with unstructured data.


Apache Flume – Introduction

Apache Flume is a distributed and reliable service for efficiently collecting, aggregating, and moving large amounts of
streaming data into the Hadoop Distributed File System (HDFS)

It has a simple and flexible architecture based on streaming data flows, which is robust and fault-tolerant


Flume Agent

An agent is an independent daemon process (JVM) in Flume

An agent is a collection of sources, sinks, and channels

It receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent)

A Flume deployment may have more than one agent. The following sketch represents a Flume agent:
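(A simplified text sketch; the data generators and destination shown are only examples.)

Web servers / clients  →  [ Source → Channel → Sink ]  →  HDFS
                               (Flume agent)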


Architecture – Components of Flume


Flume Configuration

After we have installed Flume, we need to configure it

For this, we use a configuration file, which is a Java properties file containing key-value pairs

We need to pass values to the keys in the file

In the Flume configuration file we need to:

Name the components of the current agent

Describe and Configure the source

Describe and Configure the sink

Describe and Configure the channel

Bind the source and the sink to the channel

In a typical Flume deployment, we may have multiple agents. We therefore use a unique name to differentiate and
configure each agent
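A minimal sketch of such a configuration file, assuming an agent named agent1 with a netcat source, a memory channel, and an HDFS sink (all names, ports, and paths are illustrative):

# Name the components of the agent
agent1.sources = src1
agent1.sinks = snk1
agent1.channels = ch1

# Describe and configure the source
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444

# Describe and configure the sink
agent1.sinks.snk1.type = hdfs
agent1.sinks.snk1.hdfs.path = /user/flume/events

# Describe and configure the channel
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Bind the source and the sink to the channel
agent1.sources.src1.channels = ch1
agent1.sinks.snk1.channel = ch1

The agent can then be started with a command such as:

$flume-ng agent --name agent1 --conf conf --conf-file flume.conf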


Demo

Configure and Run Flume Agents
