
A LITTLE BEE BOOK

How it Works
Apache Hadoop
Adapted from a blog post by Mike Ferguson

For more copies of this book, or to read others in the series, visit: littlebeelibrary.com
Apache Hadoop is a set of open-source software components that together form a scalable system optimised for analysing data.

Data analysed on Hadoop is typically:

• Structured (e.g. customer data, transaction data and clickstream data from websites)
• Unstructured (e.g. text from news feeds, documents or social media)
• Very large in volume
• Created and arriving at very high speed.

In addition to the open-source Apache Hadoop, there are a number of commercial distributions of Hadoop available from various vendors, including IBM.

Let's look at Hadoop in a little more detail.

A Hadoop system is made up of a number of key components. These are:

• YARN
• The Hadoop Distributed File System (HDFS)
• Highly parallel analytics engines
• Pig
• Hive
• Search.

We will look at each of these in turn.

Think of YARN (Yet Another Resource Negotiator)
as the operating system for Hadoop. It is the
cluster management software that controls how the
computing resources are allocated to the different
applications and execution engines across
the cluster.

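As a rough illustration, when you submit a Spark application to a YARN-managed cluster you tell YARN how much of the cluster's resources the application needs, and YARN schedules the work accordingly. The sketch below (application file, sizes and settings are invented for illustration) launches a job with spark-submit and asks YARN for four executors with 4 GB of memory each.

    import subprocess

    # Ask YARN to run a Spark application with an explicit resource request.
    # The application name and resource sizes below are illustrative only.
    subprocess.run([
        "spark-submit",
        "--master", "yarn",            # let YARN manage the resources
        "--deploy-mode", "cluster",    # run the driver inside the cluster
        "--num-executors", "4",        # YARN allocates four containers...
        "--executor-memory", "4g",     # ...each with 4 GB of memory
        "--executor-cores", "2",       # ...and two CPU cores
        "analyse_transactions.py",     # the (hypothetical) application to run
    ], check=True)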
The Hadoop Distributed File System (HDFS) provides highly scalable storage. Many different types of data in many different formats can be loaded into and stored in HDFS. Data on HDFS is spread across servers so that it can be accessed in parallel, and it is triple replicated for high availability if disks and/or servers fail.

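As a small sketch of how this looks in practice (the file and directory names are made up), a file can be loaded into HDFS with the hdfs command-line tool and you can then see how its blocks have been spread and replicated across the cluster:

    import subprocess

    # Copy a local file into HDFS; the NameNode splits it into blocks and
    # spreads the blocks (replicated three times by default) across servers.
    subprocess.run(["hdfs", "dfs", "-put", "transactions.csv",
                    "/data/transactions.csv"], check=True)

    # List the directory to confirm the file is there.
    subprocess.run(["hdfs", "dfs", "-ls", "/data"], check=True)

    # Show how the file's blocks and replicas are laid out across the cluster.
    subprocess.run(["hdfs", "fsck", "/data/transactions.csv",
                    "-files", "-blocks", "-locations"], check=True)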
Hadoop handles data at scale due to the design of
four highly parallel analytics engines: MapReduce,
Tez, Spark and Storm.

The engines execute application logic and analytics in parallel across the cluster to process data stored in HDFS. In general, Hadoop achieves parallelism by taking the application logic to the data rather than taking the data to the application.

In other words, copies of the application (or each task within an application) are run on every server to process local data physically stored on that server. This avoids moving data elsewhere in the cluster to be processed.

The MapReduce engine runs analytic applications in batch. To do this, application developers and data scientists need to write applications as two distinct program components: the map component and the reduce component.

The MapReduce engine runs the map step on all nodes in the cluster to produce a set of intermediate output files. It then sorts these intermediate files and runs a reduce step that takes the sorted files and aggregates the data to produce a final result.

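The classic illustration is counting words. In the sketch below (a simplified, hypothetical example in the style of a Hadoop Streaming job), the map step emits a count of 1 for every word it sees, and the reduce step adds up the counts for each word after the intermediate output has been sorted.

    import sys
    from itertools import groupby

    def map_step(lines):
        # Runs on every node against the data stored locally on that node.
        for line in lines:
            for word in line.split():
                yield word, 1

    def reduce_step(pairs):
        # Receives the sorted intermediate output and aggregates it.
        for word, group in groupby(pairs, key=lambda pair: pair[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # With Hadoop Streaming the map and reduce steps would run as separate
        # scripts on the cluster; here they are chained locally to show the idea.
        intermediate = sorted(map_step(sys.stdin))
        for word, total in reduce_step(intermediate):
            print(word, total)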
This process is scalable but relatively slow because of the need to write lots of intermediate files to disk and then read them again.

Tez is an alternative to the MapReduce engine.
Tez does not need to write and read intermediate
files to disk. For this reason it is generally faster
than MapReduce.

Spark accelerates application execution even further by enabling data from HDFS to be read into memory, meaning that Spark analytic applications can analyse data at scale in parallel and in memory. Once the data has been loaded, repeated processing needs no further disk I/O. Speed is the reason why Spark is getting so much attention.

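As a rough sketch (the file path and column names are invented), a PySpark application can read a file from HDFS, cache it in memory across the cluster, and then run several aggregations over the cached data without going back to disk:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spend-analysis").getOrCreate()

    # Read the data from HDFS once...
    transactions = spark.read.csv("hdfs:///data/transactions.csv",
                                  header=True, inferSchema=True)

    # ...and keep it in memory across the cluster.
    transactions.cache()

    # Repeated analyses now run against the in-memory copy.
    transactions.groupBy("customer_id").sum("amount").show()
    transactions.groupBy("region").count().show()

    spark.stop()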
Storm is an execution engine for real-time streaming analytic applications. This means data can be analysed as soon as it is generated and BEFORE it is stored anywhere. Sensor data and market data are examples.

Storm applications run in parallel in a cluster and look for patterns in the data. When a condition is detected or predicted, some kind of action is taken. This may be an alert, or it could be an automated action such as shutting down a piece of equipment because it is predicted to fail.

Spark offers an alternative to Storm called Spark Streaming; again, this runs in memory to accelerate processing. Spark is emerging as the general-purpose execution engine of choice for all types of analytic applications, and so interest in MapReduce and Storm is fading.

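A minimal Spark Streaming sketch is shown below (the host name, port and threshold are invented). It receives sensor readings as they arrive, checks each small batch for values above a threshold and prints an alert, all before anything is written to HDFS.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="sensor-alerts")
    ssc = StreamingContext(sc, batchDuration=5)   # process data in 5-second batches

    # Receive readings from a (hypothetical) sensor gateway as lines of text.
    readings = ssc.socketTextStream("sensor-gateway", 9999)

    # Flag any temperature reading above 90 degrees as it streams in.
    alerts = readings.map(float).filter(lambda temperature: temperature > 90.0)
    alerts.pprint()   # a real application might raise an alert or shut equipment down

    ssc.start()
    ssc.awaitTermination()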
If you don't want to write analytic applications in a programming language like R, Python, Java or Scala, you can always write Pig scripts instead.

This is a higher-level language optimised for data flow processing whereby the output of one step is fed into the next step and so on. The difference here is that Pig is a declarative language.

In simple terms this means that you state what you want to happen in Pig, and then your Pig script is compiled into MapReduce, Tez or Spark jobs to run in parallel on Hadoop clusters processing data in HDFS.

Pig is very popular for Extract, Transform and Load (ETL) processing.

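Below is a rough sketch of what a Pig script looks like, wrapped in Python so it can be submitted with the pig command-line tool (the input path and field names are invented). Each statement describes one step of the data flow, and Pig compiles the whole script into jobs that run in parallel on the cluster.

    import subprocess

    # A small Pig Latin data flow: load, group, aggregate, store.
    script = """
    transactions = LOAD '/data/transactions.csv' USING PigStorage(',')
                   AS (customer_id:int, amount:double);
    by_customer  = GROUP transactions BY customer_id;
    totals       = FOREACH by_customer GENERATE group, SUM(transactions.amount);
    STORE totals INTO '/data/customer_totals';
    """

    with open("customer_totals.pig", "w") as f:
        f.write(script)

    # Ask Pig to compile the script into parallel jobs and run them on the cluster.
    subprocess.run(["pig", "-x", "tez", "customer_totals.pig"], check=True)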
Hive is a free SQL interface to data stored in Hadoop
HDFS files. It allows you to connect self-service BI
tools (and applications that query data using SQL) to
Hadoop Hive and then use SQL to access data.

Note that Hadoop is NOT a relational database. However, Hive enables you to create a schema (e.g. a table structure) on top of a file to describe the structure of the data within the file. You can then query the table using SQL, and Hive will navigate the file to get your data.

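As a sketch of the idea (the host, table and file names are invented, and it assumes the PyHive client library is available), the snippet below defines a table over a CSV file already sitting in HDFS and then queries it with ordinary SQL:

    from pyhive import hive

    cursor = hive.connect(host="hive-server", port=10000).cursor()

    # Describe the structure of an existing HDFS file: schema on read,
    # no data is moved or converted.
    cursor.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS transactions (
            customer_id INT,
            amount      DOUBLE
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/transactions'
    """)

    # Query the file through the table with plain SQL.
    cursor.execute(
        "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id")
    for row in cursor.fetchall():
        print(row)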
There are also many SQL-on-Hadoop alternatives to Hive, e.g. IBM BigSQL and Spark SQL, which also offer SQL access. The difference is that technologies such as IBM BigSQL offer a more comprehensive SQL capability and better performance than Hive.

Finally, there is Search. This allows you to build search indexes by crawling the data in HDFS. Once the indexes are built, you can then explore the data using a familiar search interface and search queries.

The queries access the indexes and let you discover what is in your data. This is particularly useful for unstructured data like text, but search can also work on structured data and semi-structured data like JSON or XML.

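The search component in several Hadoop distributions is based on Apache Solr. As a rough sketch (the host, collection and field names are invented), a search query can be issued over HTTP and the matching documents examined:

    import requests

    # Query a (hypothetical) Solr collection built by indexing documents in HDFS.
    response = requests.get(
        "http://search-node:8983/solr/documents/select",
        params={"q": "text:outage", "rows": 10, "wt": "json"},
    )

    # Each hit points back to a document that matched the query.
    for doc in response.json()["response"]["docs"]:
        print(doc.get("id"), doc.get("title"))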
Many clients are planning to use Hadoop in the future to manage all their data, and now you know why.

Copyright IBM Corporation 2017. All Rights Reserved.
IBM, the IBM logo and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.
Other product, company or service names may be trademarks or service marks of others.
