Introduction
When anyone talks about Big Data, there is one thing that comes to mind: Hadoop. Hadoop was first developed in 2003, and since then it has become synonymous with Big Data.
It is an open-source project that lets people work with thousands of nodes to analyse petabytes of data reasonably fast. It is based on the well-known map-reduce paradigm (first developed by Google) and on HDFS (the Hadoop Distributed File System).
Almost every month a new technology appears in the Big Data environment to enrich the widely used Hadoop. One might think that Spark is just one of those technologies but, far from that, it is gaining fame and starting to replace Hadoop as the Big Data star. Will Spark become as popular in the future as Hadoop is nowadays? It is hard to say, but throughout this technical review I will analyse the principles that are making this new technology famous.
A bit of history
Spark was originally developed at UC Berkeley's AMPLab and was later donated to the Apache Software Foundation, where it still remains. It is still under development and receives commercial support from Cloudera, Hortonworks, Databricks, etc. Spark is one of the most active projects in the Apache Software Foundation, as can be seen by looking at its number of contributors.
The number of users and the amount of documentation for this new technology grow every month as it becomes more and more widely used.
What is Spark
Parts of Spark
Spark is not just one program. The project consists of multiple components:
2.1.1 Spark Core
Spark Core is, as its name says, the core of the whole project. It provides distributed task dispatching and scheduling based on the RDD abstraction. These concepts will be analysed at length in later sections of the report. Spark Core can run applications written in Scala, Java, Python and R.
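To make the role of Spark Core more concrete, the following is a minimal sketch (not taken from the report) of a Scala application that obtains a SparkContext, the entry point through which Spark Core dispatches and schedules tasks; the application name and master URL are arbitrary values chosen for the example.

import org.apache.spark.{SparkConf, SparkContext}

object CoreExample {
  def main(args: Array[String]): Unit = {
    // Configuration for a local run; on a real cluster the master URL would
    // point to YARN, Mesos or a standalone Spark master instead.
    val conf = new SparkConf().setAppName("CoreExample").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Spark Core splits this collection into partitions, distributes them to
    // the workers and schedules the tasks needed to compute the sum.
    val total = sc.parallelize(1 to 1000).reduce(_ + _)
    println(s"Sum: $total")

    sc.stop()
  }
}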
2.1.2 Spark SQL
Spark SQL is a component set on top of Spark Core that provides Spark with a way to work with structured and semi-structured data sets. Spark SQL can be used from Scala and Python.
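As a rough illustration (not from the original report), the sketch below reads a semi-structured JSON file and queries it with SQL; the file path and column names are hypothetical, and it uses the SparkSession entry point of the newer Spark releases.

import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlExample")
      .master("local[*]")
      .getOrCreate()

    // JSON is semi-structured: Spark SQL infers a schema from the records.
    val people = spark.read.json("hdfs:///data/people.json") // hypothetical path

    // Register the dataset as a temporary view so it can be queried with SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age >= 18").show()

    spark.stop()
  }
}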
2.1.3 Spark Streaming
Spark Streaming is a component set on top of Spark Core that lets Spark process live data streams, using the same kind of API as batch jobs.
2.1.4 Spark MLlib
Spark MLlib is a machine learning framework set on top of Spark Core that can be used to run common machine learning and statistical algorithms, such as hypothesis testing or logistic regression.
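As an illustrative sketch (not part of the original report), the following trains a logistic regression model with the RDD-based MLlib API; the tiny hand-made dataset exists only for the example.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object MlExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("MlExample").setMaster("local[*]"))

    // A toy training set: a label (0 or 1) plus a two-dimensional feature vector.
    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(2.0, 3.0)),
      LabeledPoint(1.0, Vectors.dense(2.5, 2.8)),
      LabeledPoint(0.0, Vectors.dense(0.4, 0.9)),
      LabeledPoint(0.0, Vectors.dense(0.6, 1.1))
    ))

    val model = new LogisticRegressionWithLBFGS().run(training)
    println(model.predict(Vectors.dense(2.2, 2.9))) // should predict class 1.0

    sc.stop()
  }
}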
2.1.5 GraphX
GraphX is a distributed graph-processing framework set on top of Spark Core that provides an API for expressing graph computations.
In this section we will analyse how Spark's dispatching and scheduling processes work.
Spark is built around the concept of Resilient Distributed Datasets (RDDs). RDDs, the main abstraction provided by Spark, are immutable, resilient, partitioned sets of records that can be stored in the cluster in memory or in persistent storage such as HDFS (the Hadoop Distributed File System), HBase, Cassandra, etc., and that can be converted into new RDDs through operations.
Spark supports two types of operations over RDDs: transformations and actions.

Transformation: A transformation is an operation that creates a new, modified RDD based on the original ones. Examples of transformations are map, filter, join, etc. Transformations can be classified into two types:

Narrow transformations: All the data needed to create each part of the new RDD is already stored together in the old RDD, and there is no need to mix data from different RDDs (or partitions) to create the new one. An example of this is the transformation "filter".

Wide transformations: It is necessary to mix data from different RDDs (or partitions) to create the new one. An example of this is the transformation "join".

Action: An action is an operation that returns a result instead of a new RDD. Transformations are lazy: Spark only records them, and nothing is computed until an action is applied. Once the action is applied, Spark runs all the recorded transformations and then performs the action. This is a great advantage, because Spark does not have to calculate and store a new RDD on the persistent storage system after each transformation. In fact, a dataset created through several transformations can be used in an action, and Spark returns only the result of the action, rather than the entire set of data produced by the transformations.
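To make the distinction concrete, here is a small sketch in Scala (the input path and log format are hypothetical): filter and map are narrow transformations, reduceByKey is a wide one, and nothing is computed until the count action at the end.

import org.apache.spark.{SparkConf, SparkContext}

object LazyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LazyExample").setMaster("local[*]"))

    val lines = sc.textFile("hdfs:///data/access.log")          // hypothetical input

    // Transformations: each call only records a step; nothing runs yet.
    val errors   = lines.filter(_.contains("ERROR"))             // narrow
    val byModule = errors.map(line => (line.split(" ")(0), 1))   // narrow
    val counts   = byModule.reduceByKey(_ + _)                   // wide (data is mixed)

    // Action: only now does Spark execute the recorded transformations and
    // return a single result to the driver, not the intermediate RDDs.
    println(counts.count())

    sc.stop()
  }
}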
The reason for this is that Spark stores intermediate results in memory rather than writing them to the persistent storage system. Memory has much faster access than persistent storage, so this greatly improves Spark's speed, all the more so when you need to work on the same dataset multiple times, for example when running several queries over the same data. When intermediate data does not fit in memory, Spark spills part of it to disk, so it is designed to work both in-memory and on-disk.
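The following sketch (hypothetical input path, not from the report) shows the typical pattern: an intermediate RDD is cached in memory so that several queries over the same data do not recompute it from persistent storage.

import org.apache.spark.{SparkConf, SparkContext}

object CacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CacheExample").setMaster("local[*]"))

    // Ask Spark to keep the split words in memory after the first computation.
    val words = sc.textFile("hdfs:///data/corpus.txt")   // hypothetical path
      .flatMap(_.split("\\s+"))
      .cache()

    // The first action computes the RDD and stores it in memory ...
    println(s"total words:    ${words.count()}")
    // ... the second one reuses the cached data instead of re-reading the file.
    println(s"distinct words: ${words.distinct().count()}")

    sc.stop()
  }
}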
But it does not end there. The transformations over the RDDs are not necessarily executed in the order they were requested: Spark optimizes them to reduce the processing time of the operations even further. Let us explain this with an example.
Suppose we have two RDDs and we apply two transformations: first a "join" transformation, and then a "filter" transformation. Those transformations can be executed in two ways. If Spark executed the transformations in the order they were written, it would first join the RDDs and then filter the whole new RDD. This is slower than filtering each RDD separately and then joining the filtered results. For this reason, this is the way Spark applies the transformations: it first applies the filter, and then joins the filtered RDDs.
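A place where this reordering can be observed directly is the Spark SQL / DataFrame API, whose optimizer pushes filters below joins (predicate pushdown). The sketch below (hypothetical file paths and column names) writes the join first and the filter second, and then prints the query plans, where the optimized plan applies the filter before the join.

import org.apache.spark.sql.SparkSession

object PushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PushdownExample")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical inputs sharing a "userId" column.
    val users  = spark.read.parquet("hdfs:///data/users.parquet")
    val orders = spark.read.parquet("hdfs:///data/orders.parquet")

    // Written as: join first, then filter.
    val result = users.join(orders, "userId")
      .filter(orders("amount") > 100)

    // explain(true) prints the logical and optimized plans; in the optimized
    // plan the filter is applied to "orders" before any data is joined.
    result.explain(true)

    spark.stop()
  }
}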
Spark can optimize the transformations this way because it supports DAG (directed acyclic graph) execution flows. Every job in Spark creates a new DAG of work steps that can be executed in a cluster and that can have as many steps as necessary. The intermediate steps are stored in memory, not on disk; Spark only stores the final result on disk, which is quite fast.
When a DAG of operations is built, Spark runs a mechanism that generates a new, modified DAG based on the old one. This new DAG produces the same final result as the old one, but it minimizes the quantity of data that has to be mixed from different RDDs, which is the slowest type of transformation. This way, the operation time is optimized (a way to inspect the resulting DAG is sketched after the list below). The new modified DAG is built this way:
1. Narrow transformations are applied.
2. Wide transformations are applied.
3. Actions are applied.
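As a sketch of how this DAG can be inspected (hypothetical input path and file layout, not from the report), toDebugString prints the lineage of an RDD: the narrow transformations are grouped together, and the shuffle introduced by the wide transformation (reduceByKey) marks a boundary in the printed graph.

import org.apache.spark.{SparkConf, SparkContext}

object DagExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DagExample").setMaster("local[*]"))

    val perProduct = sc.textFile("hdfs:///data/sales.csv")    // hypothetical path
      .map(_.split(","))
      .filter(_.length >= 2)                                   // narrow
      .map(cols => (cols(0), cols(1).toDouble))                // narrow
      .reduceByKey(_ + _)                                       // wide: data is mixed

    // Prints the DAG (lineage) of the final RDD; the indented ShuffledRDD
    // entry marks where data from different partitions has to be mixed.
    println(perProduct.toDebugString)

    sc.stop()
  }
}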
3.1 Immutability and fault tolerance
RDDs are immutable, partitioned sets of records. This means that when we apply a transformation to an RDD, we are actually creating a new RDD, while the original RDD remains the same. This way, data coherency is preserved more easily.
Due to the immutability of the RDDs, and to the fact that Spark records all the transformations applied to them, Spark can also ensure fault tolerance. If a node of the system fails, the data stored there can be rebuilt by applying the recorded transformations to the old RDDs again.
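A small sketch with toy data (not from the report) illustrating both points: a transformation produces a new RDD while the original one stays intact, and it is this recorded lineage that Spark replays to rebuild lost partitions.

import org.apache.spark.{SparkConf, SparkContext}

object ImmutabilityExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ImmutabilityExample").setMaster("local[*]"))

    val original = sc.parallelize(Seq(1, 2, 3, 4))
    val doubled  = original.map(_ * 2)            // a NEW RDD; 'original' is untouched

    println(original.collect().mkString(","))     // 1,2,3,4  -> still intact
    println(doubled.collect().mkString(","))      // 2,4,6,8

    // If a node holding partitions of 'doubled' fails, Spark rebuilds them by
    // replaying the recorded transformation (the map) on the corresponding
    // partitions of 'original', instead of restoring them from a backup.

    sc.stop()
  }
}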
Thanks to this way of dispatching and scheduling tasks, Spark is much faster than other Big Data solutions such as Hadoop. To see this, we can compare the map-reduce paradigm with Spark's way of working in terms of DAGs:
Spark allows programmers to develop complex, multi-step data pipelines using DAGs in which the intermediate steps are kept in memory (a sketch of such a pipeline is given after this comparison). It also supports sharing cached data across different DAGs (remember, each DAG is executed in the cluster), so that different jobs can work with the same data.
Hadoop's map-reduce paradigm can only build DAGs with two predefined steps: map and reduce. If you want to do something more complicated, you need to string together a series of map-reduce DAGs and execute them in sequence. Each of those DAGs is high-latency, and none can start until the previous one has finished. Also, the intermediate data output of each step must be stored in the distributed file system (HDFS) before the next step can begin. This tends to be slow because of replication and the use of disk storage instead of memory.
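As an illustration of such a multi-step pipeline (a sketch with a hypothetical input path, not taken from the report), the job below chains several transformations into a single Spark DAG; expressed with classic Hadoop MapReduce it would require several chained jobs, each writing its intermediate output to HDFS.

import org.apache.spark.{SparkConf, SparkContext}

object PipelineExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PipelineExample").setMaster("local[*]"))

    // One multi-step pipeline, one DAG: intermediate results stay in memory.
    val topWords = sc.textFile("hdfs:///data/corpus.txt")      // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(word => (word.toLowerCase, 1))
      .reduceByKey(_ + _)
      .filter { case (_, count) => count > 100 }
      .sortBy({ case (_, count) => count }, ascending = false)
      .take(10)                                                 // single action at the end

    topWords.foreach(println)

    sc.stop()
  }
}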
The speed of both approaches (Hadoop's and Spark's) is compared in the following table.
Conclusion
As we said at the beginning, Spark is the new star of the Big Data environment, and there are many reasons why this is happening.
Spark gives us a unified framework to manage Big Data processing requirements. In addition to the usual map-reduce operations, Spark supports SQL queries, streaming data, machine learning and graph data processing, individually or combined.
Spark lets developers write in Scala, Python, Java and, in some situations, R, and it includes many predefined operations that users can rely on.
But perhaps the biggest difference between Spark and other Big Data solutions is the speed of its map-reduce operations. As we have just seen, Spark is much faster than the other solutions: it runs applications up to 100 times faster in memory and 10 times faster on disk than Hadoop. This increase in speed does not imply a loss in other respects; for example, if we look at data coherency or fault tolerance, we can see that both are fully preserved in Spark.
All these factors make Spark a project to take into account in the future.