
Apache Spark

Fundamentals of big data hardware and software technologies


Álvaro Méndez Civieta

Contents
1 Introduction

2 What is Spark
   2.1 Parts of Spark
      2.1.1 Spark core
      2.1.2 Spark SQL
      2.1.3 Spark Streaming
      2.1.4 Spark MLlib (Machine Learning Library)
      2.1.5 GraphX

3 How does Spark work
   3.1 Data coherence and fault tolerance

4 Spark vs Hadoop: velocity

5 Conclusion

6 Bibliography

1 Introduction

When anyone talks about Big Data, there is one thing that comes to mind: Hadoop. Hadoop was first developed in 2003 and has since become a synonym for Big Data.
It is an open-source project that lets people work with thousands of nodes to analyse petabytes of data reasonably fast. It is based on the well-known map-reduce paradigm (first developed by Google) and on HDFS (the Hadoop Distributed File System).
Almost every month a new technology appears in the Big Data ecosystem to enrich the widely used Hadoop. One might think that Spark is just another of those technologies, but far from it: it is becoming famous and is starting to replace Hadoop as the Big Data star. Will Spark become as popular in the future as Hadoop is nowadays? It is hard to say, but throughout this technical review I will analyse the principles that are making this new technology famous.
A bit of history

Spark started as a project developed at the University of California, Berkeley, in 2009 and was later donated to the Apache Software Foundation in 2010, where it still remains. It is still under development and receives commercial support from companies such as Cloudera, Hortonworks and Databricks. Spark is one of the most active projects in the Apache Software Foundation, as its large number of contributors shows.
The community and the documentation of this new technology grow every month as it becomes more and more widely used.

2 What is Spark

In this section the main characteristics of Spark will be described.

Spark is an open-source project. This means that Spark's source code is available for anyone to see and modify with total freedom.
It is a cluster computing framework. This means that it runs on a cluster of machines and can execute tasks in parallel, taking advantage of all the computing power the cluster provides.
Spark needs two things to work: a cluster manager and a distributed storage system.
- Cluster manager: Spark supports different types of cluster managers, starting with its native standalone cluster manager, as well as Hadoop YARN, Apache Mesos, etc.
- Distributed storage: Spark can work with several distributed storage systems. It usually works with HDFS (the Hadoop Distributed File System), but it can also interface with Cassandra, Amazon S3, etc.
- Spark also supports a pseudo-distributed mode, which runs everything locally. This means that it can be used on a personal computer, where distributed storage is not required and the local file system is used instead. In this situation, Spark runs with one executor per CPU core (see the sketch below).
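As a minimal sketch of this local mode (assuming PySpark is installed; the application name is just a placeholder):

    from pyspark import SparkConf, SparkContext

    # local[*] asks Spark for a local, pseudo-distributed run with one
    # worker thread per CPU core; the local file system replaces HDFS.
    conf = SparkConf().setMaster("local[*]").setAppName("local-example")
    sc = SparkContext(conf=conf)

    print(sc.defaultParallelism)   # usually the number of CPU cores
    sc.stop()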

As a cluster computing framework, Spark scales horizontally easily: more nodes can be added to the cluster as they are needed to reduce the processing time.
Spark can run applications written in Scala, Java or Python (and, in some situations, even in R). This makes Spark a comfortable framework to work with, suitable for many situations.
2.1 Parts of Spark

Spark is not just one program: the project consists of multiple components.
2.1.1 Spark core

Spark core is, as its name says, the core of the whole project. It provides distributed task dispatching and scheduling based on the RDD structure. These concepts will be analysed in depth in the following sections of the report. Spark core can run applications written in Scala, Java, Python and R.

2.1.2 Spark SQL

It is a component that sits on top of Spark core and provides Spark with a way to work with structured and semi-structured data sets. Spark SQL can be used from Scala, Java and Python.
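As a minimal sketch of how this is used (PySpark with the SparkSession entry point of more recent Spark versions; the JSON file and its columns are made-up placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-example").getOrCreate()

    # Load semi-structured data (JSON) into a DataFrame with an inferred schema.
    people = spark.read.json("people.json")        # hypothetical input file

    # Query it through the DataFrame API ...
    people.filter(people.age > 30).select("name", "age").show()

    # ... or with plain SQL over a temporary view.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()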
2.1.3 Spark Streaming

With Spark Streaming, Spark is able to perform streaming analytics. This means that it can process large data streams and deliver results in near real time. To do so, it ingests data in mini-batches and performs RDD transformations on those mini-batches. Working this way lets developers reuse the same application code written for batch analytics in streaming analytics, which is quite useful.
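A minimal sketch of this mini-batch model (PySpark; the socket source on localhost:9999 is just a placeholder):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "streaming-example")   # at least 2 threads: receiver + processing
    ssc = StreamingContext(sc, 5)                         # group incoming data into 5-second mini-batches

    # Each mini-batch becomes an RDD, so the usual batch operations apply.
    lines = ssc.socketTextStream("localhost", 9999)       # hypothetical text source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()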
2.1.4 Spark MLlib (Machine Learning Library)

Spark MLlib is a machine learning framework built on top of Spark core that can be used to work with common machine learning and statistical algorithms such as hypothesis testing or logistic regression.
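A minimal sketch of training a logistic regression model with the RDD-based MLlib API (PySpark; the dataset is a tiny made-up example):

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext("local[*]", "mllib-example")

    # A toy dataset: a label followed by two features.
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(1.0, [2.0, 1.0]),
        LabeledPoint(0.0, [0.0, 2.0]),
    ])

    # Train the model on the distributed dataset and predict a new point.
    model = LogisticRegressionWithLBFGS.train(data)
    print(model.predict([1.5, 0.5]))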
2.1.5 GraphX

GraphX is a distributed graph processing framework built on top of Spark core that provides a way to work with graph computation.

3 How does Spark work

Throughout this section we will analyse how Spark's dispatching and scheduling processes work.
Spark is built around the concept of Resilient Distributed Datasets (RDDs). RDDs, the main abstraction provided by Spark, are immutable, resilient and partitioned sets of records that can be stored in the cluster's memory or in persistent storage such as HDFS (the Hadoop Distributed File System), HBase or Cassandra, and that can be converted into new RDDs through certain operations.
Spark supports two types of operations over RDDs: transformations and actions.
- Transformation: a transformation is an operation that creates a new, modified RDD based on the original ones. Examples of transformations are map, filter, join, etc. Transformations can in turn be classified into two types:
  - Narrow transformations: each partition of the new RDD can be computed from a single partition of the old one, so no data has to be mixed between different partitions. An example of this is the "filter" transformation.
  - Wide transformations: building the new RDD requires mixing data coming from different partitions (for instance, from different RDDs). An example of this is the "join" transformation.
- Action: an action applies an operation to an RDD and returns a value as a result, depending on the action. Examples of actions are reduce, collect, count, etc. Both kinds of operations are shown in the sketch below.
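A minimal sketch of transformations and actions on a pair of toy RDDs (PySpark, with made-up data):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-example")

    ages   = sc.parallelize([("ana", 31), ("bob", 17), ("carla", 45)])
    cities = sc.parallelize([("ana", "Madrid"), ("carla", "Paris")])

    # Transformations (lazy): filter is narrow, join is wide.
    adults = ages.filter(lambda kv: kv[1] >= 18)
    joined = adults.join(cities)                   # (name, (age, city)) pairs

    # Actions trigger the computation and return values to the driver.
    print(joined.count())                          # 2
    print(joined.collect())                        # e.g. [('ana', (31, 'Madrid')), ('carla', (45, 'Paris'))]
    print(ages.map(lambda kv: kv[1]).reduce(lambda a, b: a + b))   # 93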
Transformations in Spark are lazy: they do not calculate their results immediately. All the transformations applied to an RDD are recorded but not executed until a result is requested, that is, until an action is applied. Once an action is applied, Spark performs all the recorded transformations and then the action. This is a great advantage, because Spark does not have to calculate and write to persistent storage a new RDD after each transformation. In fact, a dataset created through a series of transformations can be used in an action, and Spark returns only the result of the action rather than the entire set of data produced by the transformations.
The reason for this is that Spark keeps intermediate results in memory (as cached data) rather than writing them to the persistent storage system. Memory has much faster access than persistent storage, so this greatly improves Spark's speed, even more so when you need to work on the same dataset multiple times, for example when running several queries over the same data. When intermediate data does not fit in memory, Spark spills part of it to disk, so it is designed to work both in-memory and on-disk.
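A small sketch of this lazy behaviour, together with explicit caching of an intermediate RDD (PySpark; the log file name is a made-up placeholder):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lazy-example")

    lines  = sc.textFile("server.log")                      # hypothetical input file
    errors = lines.filter(lambda l: "ERROR" in l).cache()   # recorded, nothing computed yet

    # The first action runs the recorded transformations and caches 'errors' in memory.
    print(errors.count())

    # Later actions reuse the cached partitions instead of re-reading
    # and re-filtering the whole log file.
    print(errors.filter(lambda l: "timeout" in l).count())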
But it does not end here. The transformations over the RDDs are not necessarily executed in the order in which they were requested: Spark optimizes them to reduce the processing time of the operations even further. Let's explain this with an example.
Suppose we have two RDDs and we apply two transformations: first a "join" and then a "filter". These transformations can be carried out in two ways. If Spark executed the transformations exactly in the order they were written, it would first join the RDDs and then filter the whole resulting RDD. This is slower than filtering each RDD separately and then joining the filtered results, and for this reason that is the ordering Spark aims for: it first applies the filter and then joins the filtered RDDs.
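To make the two orderings concrete, here is a sketch that spells out both plans by hand on two toy pair RDDs (made-up data; both plans return the same result, but the second one shuffles far less data in the join):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "ordering-example")

    purchases = sc.parallelize([("u1", 100), ("u2", 5), ("u3", 250)])
    countries = sc.parallelize([("u1", "ES"), ("u2", "FR"), ("u3", "ES")])

    # Plan 1: join first (wide, shuffles everything), then filter.
    plan1 = (purchases.join(countries)
                      .filter(lambda kv: kv[1][0] > 50 and kv[1][1] == "ES"))

    # Plan 2: filter each RDD first (narrow), then join only what survives.
    plan2 = (purchases.filter(lambda kv: kv[1] > 50)
                      .join(countries.filter(lambda kv: kv[1] == "ES")))

    # Same result either way; plan 2 moves much less data in the shuffle.
    print(sorted(plan1.collect()) == sorted(plan2.collect()))   # True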
Spark can optimize the transformations this way because it supports DAGs (directed acyclic graphs) of operations. Every job in Spark creates a new DAG of work steps that can be executed in the cluster and that can have as many steps as necessary. The intermediate steps are kept in memory, not on disk; Spark only writes the final result to disk, which is quite fast.
When a DAG of operations is built, Spark runs a mechanism that generates a new, modified DAG based on the old one. This new DAG produces the same final result as the old one, but it minimizes the quantity of data that has to be mixed between different RDDs, which is the slowest type of transformation. This way the total operation time is reduced. The new, modified DAG is built as follows:
1. Narrow transformations are applied.
2. Wide transformations are applied.
3. Actions are applied.
3.1 Data coherence and fault tolerance

RDDs are immutable, partitioned sets of records. This means that when we apply a transformation to an RDD we are actually creating a new RDD, while the original RDD remains unchanged. This way, data coherence is preserved more easily.
Thanks to the immutability of RDDs and to the fact that Spark records all the transformations applied to them (their lineage), Spark can also ensure fault tolerance. If a node of the system fails, the data stored there can be rebuilt by applying the recorded transformations to the old RDDs again.
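The recorded chain of transformations (the lineage) can be inspected directly from the API; a small sketch with a toy RDD:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lineage-example")

    ages   = sc.parallelize([("ana", 31), ("bob", 17), ("carla", 45)])
    adults = ages.filter(lambda kv: kv[1] >= 18).map(lambda kv: kv[0])

    # toDebugString() shows the lineage Spark has recorded for this RDD;
    # if a partition is lost, Spark re-runs exactly these steps to rebuild it.
    lineage = adults.toDebugString()      # PySpark returns this as bytes
    print(lineage.decode("utf-8"))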


4 Spark vs Hadoop: velocity

Thanks to this new way of dispatching and scheduling tasks, Spark is much faster than other big data solutions such as Hadoop. To see this, we can compare the map-reduce paradigm with Spark's way of working in terms of DAGs:
- Spark allows programmers to develop complex, multi-step data pipelines using DAGs in which the intermediate steps are kept in memory. It also supports sharing cached data across different DAGs (remember, each DAG is executed in the cluster), so that different jobs can work with the same data.
- Hadoop's map-reduce paradigm can only build DAGs with two predefined steps: map and reduce. If you want to do something more complicated, you need to string together a series of map-reduce DAGs and execute them in sequence. Each of those DAGs is high-latency, and none can start until the previous one has finished. Also, the intermediate output of each step must be written to the distributed file system (HDFS) before the next step can begin, which tends to be slow due to replication and the use of disk storage instead of memory.
The speed of the two approaches (Hadoop's and Spark's) is compared in the following table.


As we can see, Spark's approach is much faster than Hadoop's map-reduce. The table compares the Hadoop World Record (in the Daytona GraySort benchmark) with two Spark executions. If we look at the Spark 100 TB execution, we can see that it uses 10 times fewer nodes than Hadoop and almost 8 times fewer cores, and applies 3 times more reducers, yet it is still 3 times faster.


5 Conclusion

As we said at the beginning, Spark is the new star of the Big Data environment, and there are many reasons why this is happening.
Spark gives us a unified framework to manage big data processing requirements. In addition to the usual map-reduce operations, Spark supports SQL queries, streaming data, machine learning and graph data processing, individually or combined.
Spark lets developers write in Scala, Python, Java and, in some situations, R, and includes a lot of predefined operations that users can build on.
But perhaps the biggest difference between Spark and other Big Data solutions is the speed of its map-reduce operations. As we have just seen, Spark is much faster than the other solutions: it runs applications up to 100 times faster in memory and 10 times faster on disk than Hadoop, and this increase in speed does not imply a loss in other factors. For example, if we look at data coherence or fault tolerance, we can see that both are perfectly preserved in Spark.
All these factors make Spark a project to keep in mind for the future.


6 Bibliography
- Material from the subjects "Fundamentals of Big Data software and hardware technologies" and "High-performance computer systems and architectures"
- http://www.infoq.com/articles/apache-spark-introduction
- https://en.wikipedia.org/wiki/Apache_Spark
- http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html
- http://spark.apache.org/
- https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
- http://insights.dice.com/2014/03/12/apache-spark-next-big-thing-big-data/
- http://bigdata4success.com/blog/spark-nuevo-referente-en-big-data/
- https://geekytheory.com/apache-spark-que-es-y-como-funciona/

