HADOOP
A BRIEF INTRODUCTION TO THE HADOOP ECOSYSTEM
Matteo Golfarelli
A Definition
“The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability and
reliability, the library itself is designed to detect and handle failures at the application layer, so
delivering a highly-available service on top of a cluster of computers, each of which may be
prone to failures.”
SMP Architecture
Traditional RDBMSs are based on a Symmetric Multi-Processing (SMP) architecture: several
processors share the same RAM, the same I/O bus and the same disk/disk array
• The limit of such an architecture depends on the physical number of devices that can be
mounted and on the bus bottleneck
[Figure: SMP architecture — processors sharing a single bus to memory and disks]
MPP Architecture
In a Massively Parallel Processing (MPP) architecture several processors, each equipped with
its own RAM and disks, collaborate to solve a single problem by splitting it into several
independent tasks. It is also called a shared-nothing architecture
• Examples of MPP appliances are Teradata and Netezza, which are mainly used for Data
Warehouse applications
[Figure: MPP architecture — independent nodes, each with its own processors, RAM and disks, connected by a messaging layer]
Hadoop vs MPP
In MPP, as in MapReduce, the processing of data is distributed across a bank of compute nodes;
these separate nodes process their data in parallel and the node-level output sets are
assembled together to produce a final result set, but:
• MPP systems are based on high-level HW and proprietary SW while Hadoop is based on
commodity HW and open source SW
• Hadoop is natively controlled through imperative code while MPP appliances are queried
through declarative queries.
• In a great many cases, SQL is easier and more productive than writing MapReduce jobs,
and database professionals with the SQL skill set are more plentiful and less costly than
Hadoop specialists.
Since they share the same computation mechanism, Hadoop and MPP systems are close relatives,
and there is no reason why the two approaches should not converge. Some solutions, such as
Cloudera Impala, are examples of this process.
HPC Architecture
The term High-Performance Computing (HPC) refers to massively parallel systems specifically
devoted to solving CPU-intensive tasks:
• HPC is often used as a synonym of supercomputer, even if supercomputer is a more
general term referring to a system that is at the frontline of contemporary processing
capacity. Its architecture and features can change over time
• HPC main applications are scientific simulations, weather forecasts, 3D modeling, etc.
• In many cases the processors share the same set of disk drives
Scalability
Scalability is the ability of a system, network, or process to handle a growing amount of work
in a capable manner, or its ability to be enlarged to accommodate that growth.
Methods of adding more resources for a particular application fall into two broad
categories: horizontal and vertical scaling.
• To scale horizontally (or scale out) means to add more nodes to a system, such as
adding a new computer to a distributed software application.
• To scale vertically (or scale up) means to add resources to a single node in a system,
typically involving the addition of CPUs or memory to a single computer.
Vertical scalability is typically limited by the actual capability of the architecture to mount
more resources and by the physical limitations of such resources (e.g. processor speeds).
Nowadays, the availability of low-cost commodity hardware and network connections
makes scaling out often more convenient than scaling up.
Hadoop 1 vs Hadoop 2
The Hadoop 1.0 project has been around for about ten years. Its main features are:
• It was inspired by a similar Google project proposing the Map-Reduce computation framework
and the proprietary Google File System
• It is mainly oriented to executing batch Map-Reduce jobs
The first stable Hadoop 2.0 release is dated 2013. Its main features are:
• It solves some bottlenecks and single points of failure of Hadoop 1
• It turns Hadoop into a data operating system by adding YARN (Yet Another Resource
Negotiator), a resource-management platform responsible for managing compute
resources in clusters and using them for scheduling users' applications
• It opens Hadoop up to computation mechanisms different from Map-Reduce
Main modules
Common: the common utilities that support the other Hadoop modules
Distributed File System: a distributed file system that provides high-throughput access to
application data
YARN: A framework for job scheduling and cluster resource management
Map Reduce: A YARN-based system for parallel processing of large data sets
[Figure: the Hadoop core stack — batch Map Reduce running on YARN, on top of HDFS]
More modules
Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
which includes support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig
and Sqoop.
ZooKeeper: A high-performance coordination service for distributed applications.
Oozie: a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs
are Directed Acyclical Graphs (DAGs) of actions.
Sqoop: a tool designed for efficiently transferring bulk data between Hadoop and structured
datastores such as relational databases.
[Figure: the stack extended with Ambari, ZooKeeper, Sqoop and Oozie around batch Map Reduce on YARN and HDFS]
Since Pig code and Hive queries are implemented as map-reduce jobs, they result in batch
execution.
[Figure: the stack extended with Hive and Pig, which run as batch Map Reduce jobs on YARN and HDFS]
[Figure: the same ecosystem stack drawn directly on YARN and HDFS, without the batch Map Reduce layer]
Spark
Spark was initially started at UC Berkeley AMPLab in 2009. It requires a cluster manager and a
distributed storage system. As a cluster manager, Spark supports standalone mode (native Spark
cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a
wide variety of systems, including HDFS, Cassandra, and Amazon S3.
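As a complement, a minimal sketch of a Spark job that runs on YARN and reads its input from HDFS (the application name and the HDFS path are illustrative, not from the slides):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkOnYarnSketch {
  public static void main(String[] args) {
    // The cluster manager is chosen at submit time (e.g. spark-submit --master yarn).
    SparkConf conf = new SparkConf().setAppName("line-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Spark reads the file through the HDFS client, typically one partition per HDFS block.
    JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log");
    System.out.println("lines: " + lines.count());

    sc.stop();
  }
}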
[Figures: successive views of the ecosystem stack, with further engines such as Phoenix added alongside Hive and Pig on YARN and HDFS]
Main Distributions
Many companies provide products that include Apache Hadoop, a derivative work thereof,
commercial support, and/or tools and utilities related to Hadoop
• Cloudera distributes a platform of open-source projects called Cloudera's Distribution
including Apache Hadoop, or CDH. In addition, Cloudera offers its enterprise customers a
family of products and services that complement the open-source Apache Hadoop platform.
These include comprehensive training sessions, architectural services and technical support
for Hadoop clusters in development or in production
• Hortonworks offers a 100% open source Hadoop platform. It works to make Apache Hadoop
more robust and easier to install, manage, use, integrate and extend
• MapR sells a high performance map-reduce framework based on Apache Hadoop that
includes many of the standard eco-system components
• IBM InfoSphere BigInsights brings the power of Apache Hadoop to the enterprise
• Amazon Web Services: Amazon offers a version of Apache Hadoop on their EC2
infrastructure, sold as Amazon Elastic MapReduce
HDFS Federation
Unlike HDFS 1, HDFS 2 implements a federation of name spaces.
• A name space (NS) is a hierarchy of files and directories (i.e. two name spaces can have the
same file name in the same directory). Metadata for each name space are stored on a Name
Node (NN).
• The NN maintains the name space tree and the mapping of data blocks to Data Nodes (DN).
• Name spaces use blocks grouped under a Block Pool. A DN can provide blocks to more than
one name space.
[Figure: federated Name Nodes (NN1, NN2, NN3, …), each managing its own name space and block pool over the shared Data Nodes]
HDFS Federation
Scalability: because the NN keeps all the name space and block locations in memory, the
size of the NN heap limits the number of files and also the number of blocks addressable.
This also limits the total cluster storage that can be supported by the NN.
Performance: the NN is the single point for storage and management of meta-data; it can
become a bottleneck when supporting a huge number of files, especially a large number of
small files.
Availability: you can separate name spaces of different applications improving the overall
availability of the cluster.
Maintainability, Security & Flexibility: block pool abstraction allows other services to use
the block storage with perhaps a different name space structure. Each name space is
isolated and not aware of the others.
Cluster Topology
In order to make proper choices the cluster must be aware of the cluster topology, which is
defined during the cluster setup phase. Block storage and process allocation (data locality) are
tasks that need such information.
Nodes are organized in racks and racks are organized in data centers.
Hadoop models such concepts in a tree-like fashion and computes the distance between nodes
as their distance on the tree.
[Figure: the topology tree — the cluster at the root, then data centers (d1, d2), racks (r1, r2) and data nodes (d1–d4) as leaves]
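As an illustration of this tree distance (a hypothetical helper for the example, not Hadoop's own NetworkTopology class), nodes can be identified by paths of the form /datacenter/rack/node, and the distance is the number of edges up to the closest common ancestor and back down:

// Illustrative sketch: distance between two nodes on the topology tree.
// Same node: 0, same rack: 2, same data center: 4, different data centers: 6.
public final class TopologyDistanceSketch {

  // locations look like "/d1/r1/n3" (data center / rack / node)
  static int distance(String a, String b) {
    String[] pa = a.split("/"), pb = b.split("/");
    int depth = Math.min(pa.length, pb.length);
    int common = 0;
    while (common < depth && pa[common].equals(pb[common])) {
      common++;
    }
    // edges from each node up to the lowest common ancestor
    return (pa.length - common) + (pb.length - common);
  }

  public static void main(String[] args) {
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
    System.out.println(distance("/d1/r1/n1", "/d1/r2/n1")); // 4: same data center
    System.out.println(distance("/d1/r1/n1", "/d2/r1/n1")); // 6: different data centers
  }
}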
[Figure: a data block B1 and its replicas placed on data nodes across racks r1 and r2 of a data center]
• Locally, on a data node, a data block request will be turned into several OS block requests
• The OS blocks (i.e. disk pages) composing a data block can be accessed independently
• If a file is smaller than the data block size it will be stored in a smaller file on disk
Data Locality
Hadoop exploits cluster topology and data block replication to apply the data locality principle.
When computations involve large sets of data it is cheaper (i.e. faster) to move code to the data
rather than data to the code.
The following cases are listed in the order the resource manager prefers them (a small
illustrative ranking function follows the list):
1. Process and data on the same node
2. Process and data on different nodes of the same rack
3. Process and data on different racks of the same data center
4. Process and data on racks of different data centers
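A minimal, purely illustrative sketch of that preference order (the Locality names and the /datacenter/rack/node location format are assumptions for the example, not YARN's API):

// Ranks where a block replica sits relative to the node a task would run on.
enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK, OFF_DATACENTER }

final class LocalityRankSketch {

  // locations look like "/d1/r1/n3" (data center / rack / node)
  static Locality rank(String taskNode, String blockNode) {
    if (taskNode.equals(blockNode)) {
      return Locality.NODE_LOCAL;                  // case 1: same node
    }
    String[] t = taskNode.split("/"), b = blockNode.split("/");
    if (t[1].equals(b[1]) && t[2].equals(b[2])) {
      return Locality.RACK_LOCAL;                  // case 2: same rack
    }
    if (t[1].equals(b[1])) {
      return Locality.OFF_RACK;                    // case 3: same data center, different racks
    }
    return Locality.OFF_DATACENTER;                // case 4: different data centers
  }
}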
The Name Node records changes to HDFS in a write-ahead log called the journal in its
local native file system
A secondary name node regularly connects with the primary one and builds snapshots of
the primary name node's directory information, which the system then saves to local or
remote directories. These check-pointed images can be used to restart a failed primary
name node without having to replay the entire journal of file-system actions.
High-Availability
Prior to Hadoop 2, the NN was a single point of failure in an HDFS cluster.
In a typical HA cluster, two separate machines are configured as NNs. At any point in time,
exactly one of the NNs is in an Active state, and the other is in a Standby state.
Reading in HDFS
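A minimal sketch of reading a file with the HDFS Java client (the file path is illustrative): the client contacts the NN for the block locations and then streams the data directly from the DNs holding the replicas.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // the client asks the NN for metadata
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {  // bytes are streamed from the DNs
        System.out.println(line);
      }
    }
  }
}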
Writing in HDFS
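Similarly, a minimal sketch of writing a file with the HDFS Java client (path and content illustrative): the client asks the NN to create the file, and the bytes are pipelined to the DNs chosen for each block replica.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create() registers the new file with the NN; true = overwrite if it exists
    try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"), true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    } // close() flushes the last block and completes the file on the NN
  }
}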
YARN
YARN is the resource-management platform responsible for managing compute resources in
clusters and using them for scheduling of users' applications
While in Hadoop 1 the job scheduling/monitoring functionalities were both handled by the Job
Tracker, in Hadoop 2 such functionalities are redistributed among the following agents:
• A global Resource Manager (RM) that is the ultimate authority that arbitrates resources
among all the applications in the system. Resources are negotiated, assigned and used
based on the abstract notion of a resource Container, which incorporates elements such as
memory, CPU, disk, network, etc. In the first version, only memory is supported.
• A per-node slave, the Node Manager (NM), which is responsible for containers, monitoring
their resource usage (CPU, memory, disk, network) and reporting the same to the RM.
• A per-application Application Master (AM) that is tasked with negotiating resources from the
RM and working with the NMs to execute and monitor the tasks.
YARN
The RM has two main components: Scheduler and Applications Manager.
The Scheduler is responsible for allocating resources to the various running applications
subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in
the sense that:
• it performs no monitoring or tracking of status for the application.
• it offers no guarantees about restarting failed tasks either due to application failure or
hardware failures.
YARN
[Figure: a YARN cluster — one RM and several NMs; two applications, each with its own AM (AM1, AM2) and containers (Cont1,1–Cont1,2 and Cont2,1–Cont2,4) spread across the nodes]
Application Master
The AM allows YARN to be more:
Scalable: many functionalities are distributed over several AMs, thus the RM is no longer a
bottleneck in the cluster. The RM is a pure scheduler, i.e. it does not attempt to provide fault-
tolerance for resources.
Open: moving all application-framework-specific code into the AM generalizes the system so it
can now support multiple frameworks such as MapReduce, MPI and Graph Processing.
• Since the AM is essentially user code, the RM cannot trust it (i.e. an AM is not a privileged service)
• The YARN system (RM and NM) has to protect itself, and the resources it grants, from faulty
or malicious AMs at all costs
Depending on the Containers it receives from the RM, the AM may update its execution plan to
accommodate the excess or lack of resources. Container allocation/de-allocation can take place
in a dynamic fashion as the application progresses
YARN – Walkthrough
Application execution consists of the following steps:
• Application submission
• Bootstrapping the AM instance for the application
• Application execution managed by the AM instance
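A hedged sketch of the first step, application submission, using YARN's client API (the application name, AM command and resource sizes are illustrative):

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitAppSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");

    // The AM container spec: the command the NM will run to bootstrap the AM
    ContainerLaunchContext amSpec = Records.newRecord(ContainerLaunchContext.class);
    amSpec.setCommands(Collections.singletonList("java -Xmx512m my.example.AppMaster"));
    ctx.setAMContainerSpec(amSpec);
    ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM container

    yarnClient.submitApplication(ctx); // the RM schedules and bootstraps the AM
  }
}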
YARN – Walkthrough
4. During normal operation the AM negotiates appropriate resource containers via the
resource-request protocol.
5. On successful container allocations, the AM launches the container by providing the
container launch specification to the NM. The launch specification, typically, includes the
necessary information to allow the container to communicate with the AM itself.
6. The application code executing within the container then
provides necessary information (progress, status etc.)
to its AM.
7. During the application execution, the client that submitted
the program communicates directly with the AM to get
status, progress updates etc.
8. Once the application is complete, and all necessary work
has been finished, the AM deregisters with the RM and
shuts down, allowing its own container to be repurposed.
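A hedged sketch of steps 4 and 5 from inside an AM, using YARN's AMRMClient and NMClient (resource sizes and the launch command are illustrative):

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class AppMasterSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(conf); rm.start();
    NMClient nm = NMClient.createNMClient();
    nm.init(conf); nm.start();

    rm.registerApplicationMaster("", 0, "");   // register with the RM

    // Step 4: resource-request protocol (1 GB, 1 vcore, any node/rack)
    rm.addContainerRequest(new ContainerRequest(
        Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

    // Step 5: once a container is granted, hand a launch spec to the owning NM
    for (Container c : rm.allocate(0.1f).getAllocatedContainers()) {
      ContainerLaunchContext spec = Records.newRecord(ContainerLaunchContext.class);
      spec.setCommands(Collections.singletonList("sleep 30"));
      nm.startContainer(c, spec);
    }

    // Step 8: deregister when the work is done
    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}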
AVRO
Avro is a remote procedure call and data serialization framework used as:
• a serialization format for persistent data
• a wire format for communication between Hadoop nodes, and from client programs to the
Hadoop services
• a format used by Map-Reduce during the shuffle phase
It is based on JSON for defining data types and protocols, and serializes data in a compact
binary format (or in plain text typically for debug purposes).
Avro is designed to be language-independent and there are several language bindings for it,
including Java, C, C++, Python, and Ruby.
AVRO
Avro relies on schemas. When Avro data is read, the schema used when writing it is always
present. The main advantages of Avro with respect to similar systems (e.g. Thrift) are:
• Untagged data: Since the schema is present when data is read, considerably less type
information need be encoded with data, resulting in smaller serialization size.
• Dynamic typing: does not require that code be generated. Data is always accompanied by a
schema that permits full processing of that data without code generation, static datatypes,
etc. This facilitates construction of generic data-processing systems and languages.
• Schema evolution: when a schema changes, both the old and new schemas are always
present when processing data, so differences may be resolved symbolically, using field
names. You can use different schemas for serialization and deserialization, and Avro will
handle the missing/extra/modified fields.
AVRO: a sample
First schema:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "first", "type": "string"},
    {"name": "sport", "type": "string"}
  ]
}

Second schema:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "first", "type": "string"},
    {"name": "age", "type": "int", "default": 0}
  ]
}
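A minimal sketch of reading data written with the first schema using the second one (file names and values are illustrative; Avro fills the missing age field with its default and ignores sport):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroEvolutionSketch {
  public static void main(String[] args) throws Exception {
    // person-v1.avsc / person-v2.avsc contain the two schemas shown above
    Schema oldSchema = new Schema.Parser().parse(new File("person-v1.avsc"));
    Schema newSchema = new Schema.Parser().parse(new File("person-v2.avsc"));

    // Write a record with the old schema (first, sport)
    GenericRecord rec = new GenericData.Record(oldSchema);
    rec.put("first", "Ada");
    rec.put("sport", "fencing");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(oldSchema))) {
      writer.create(oldSchema, new File("people.avro"));
      writer.append(rec);
    }

    // Read it back with the new schema: "sport" is dropped, "age" takes its default (0)
    try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
             new File("people.avro"), new GenericDatumReader<GenericRecord>(null, newSchema))) {
      GenericRecord person = reader.next();
      System.out.println(person.get("first") + " " + person.get("age"));
    }
  }
}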
Parquet
A columnar storage format for efficient querying of nested structures in a flat format
• Derived from the Google Dremel file format
• Parquet is open source
• Ensures interoperability across several frameworks (Spark, MapReduce, etc.) and query
engines (Hive, Impala, Pig, Presto, etc.)
[Figure: the same table stored in a row-oriented layout vs a column-oriented layout]
message ExampleDefinitionLevel {
optional group a {
optional group b {
optional string c;
}
}
}
Parquet: unnesting
To store nested data structures in columnar format, we need to map the schema to a list of columns
in a way that we can write records to flat columns and read them back to their original nested data
structure. Parquet creates one column per primitive type field in the schema. If we represent the
schema as a tree, the primitive types are the leaves of this tree.
The structure of the record is captured for each value by two integers called repetition level and
definition level. Using definition and repetition levels, we can fully reconstruct the nested structures.
message ExampleDefinitionLevel {
optional group a {
required group b {
optional string c;
}
}
}
Making definition levels small is important as the goal is to store the levels in as few bits as possible.
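A short worked example, reconstructed from standard Parquet/Dremel semantics rather than taken from the original slides: the definition level of a value records how many of the optional fields along its path are actually defined. In the schema above, a is optional, b is required and c is optional, so the maximum definition level of column a.b.c is 2 (required fields do not add a level). A record where a is null is encoded with definition level 0; a record where a is present but c is null with level 1; a record where c is present with level 2. Because b is required, the case "a present, b null" cannot occur, which is why the levels stay small.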
Repetition level indicates at which level we have to create a new list for the current value. In other
words, the repetition level can be seen as a marker of when to start a new list and at which level. For
example consider the following representation of a list of lists of strings:
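The example itself is reconstructed here from the standard Parquet/Dremel illustration rather than copied from the slide: take the two records [[a, b, c], [d, e, f, g]] and [[h], [i, j]] for a schema with a repeated outer list of repeated strings. The stored column contains a, b, c, d, e, f, g, h, i, j with repetition levels 0, 2, 2, 1, 2, 2, 2, 0, 1, 2: level 0 starts a new record (a and h), level 1 starts a new inner list within the same record (d and i), and level 2 appends to the current inner list (all the other values).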
Parquet: configuration
Row group size: larger row groups allow for larger column
chunks, which makes it possible to do larger sequential I/O.
• Large row groups (512 MB - 1 GB) are recommended; since an entire
row group might need to be read, we want it to fit completely on
one HDFS block. An optimized read setup would be: 1 GB row
groups, 1 GB HDFS block size, 1 HDFS block per HDFS file.
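A hedged sketch of setting the row group size when writing Parquet from Avro records (this uses the parquet-avro builder API; exact method names can vary between Parquet versions, and the schema, path and values are illustrative):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":"
      + "[{\"name\":\"first\",\"type\":\"string\"}]}");

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("people.parquet"))
                 .withSchema(schema)
                 .withRowGroupSize(1024 * 1024 * 1024)  // ~1 GB row groups, matching the HDFS block size
                 .build()) {
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("first", "Ada");
      writer.write(rec);
    }
  }
}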