
Introduction to Big Data

Not all Big Data applications are based on the internet; in fact, big data can be used for several purposes.
An example was Google Flu Trends: the idea was that, by looking at the queries users submit to the search engine, we can understand what is happening in near real time (this is called nowcasting). By doing this, Google detected flu outbreaks about two weeks ahead of the CDC. That example shows that we can use data to understand whether there are emergencies.
Once we have the data, the most important thing is to understand what to do with it.

Who generates big data?


User Generated Content: through the use of the web and mobile applications, for example Facebook, YouTube, TripAdvisor, etc.
Health and scientific computing: these datasets are very large.
Log files: these are potentially interesting.
IoT: sensors, smart meters, etc. Nowadays there are sensors everywhere, also thanks to the increasing power of our mobile devices.
One thing we must not forget: all the data coming from sensors must be sent to a data centre, and this operation is NOT inexpensive.

This is an example of big data at work. From the first image we have the locations of the users who use their devices with integrated sensors, which give us a lot of information. These data are processed and then merged to suggest, for example, a new route if the indicated one presents traffic. An app that does this is Waze, for example.

What is big data? There are many different definitions, but generally when we talk about big data we refer to something big, large, complex and heterogeneous. The original definition also included the 3Vs of big data that characterize it:
1. Volume: the scale of data increases exponentially over time and continues to grow very fast. For example, it increased 44 times from 2009 to 2020.
2. Variety: different forms of data; there are several types and structures, and even a single application may generate many different formats, which leads to problems such as heterogeneous data and complex data integration.
3. Velocity: analysis of streaming data; the generation rate is really high and we have to ensure very fast data processing to guarantee timeliness.
Typically, you do not have all these 3 Vs at the same time: if you do, you may be in trouble.
Nowadays two other Vs have been added:
4. Veracity: uncertainty of data; data quality is an important feature.
5. Value: exploit the information provided by data; this is the most important one, because it translates data into business advantage.
Remember that you can be a data scientist even if you are analysing a small dataset, so it is not true that data science is useful only for Big Data; and you do not have to perform the analysis alone: there are teams of data scientists, because each one is specialized in a certain field.
Big Data value chain
Let's now see the steps to follow when you want to analyse Big Data:
Generation: it can be
o Passive recording: structured data from many sources such as shopping records, bank trading transactions, etc.
o Active generation: semi-structured or unstructured data, generated by the users through web and mobile apps.
o Automatic production: data generated automatically, for example location-aware data produced by the Internet-connected sensors on our devices.
Acquisition: it depends on which data you are analysing. There are three phases:
o Collection: pull-based, like a web crawler, or push-based, like video surveillance (the difference is that in the first case you write something that goes and fetches the data, while in the second case something already sends the data to you).
o Transmission: we need to know the characteristics of the network, because we have to send the data to the data center over high-capacity links and we need to do it in the best way.
o Preprocessing: it is composed of a series of sub-operations such as integration (of sources) or redundancy elimination.
Storage: it is necessary to know the
o Storage infrastructure: both the technology used to store the data (HDD and SSD, for example) and the networking one.
o Data management: in particular the file system (HDFS) and the key-value stores (e.g. Memcached).
o Programming models: MapReduce, stream and graph processing.
Analysis: our objectives are descriptive, predictive and prescriptive analytics, and to achieve them we use methods such as data mining, statistical analysis, clustering, etc.
Big data challenges
The main difference is that now data are really important, and to analyse them we need a new architecture (but also new programming paradigms and techniques). The traditional approach is not enough because it is not able to support big data: traditional computation is processor-bound, and to increase performance we need a newer and faster processor, but also more RAM. So what is the bottleneck?
In the traditional approach you have something that stores the data in your server so that you can analyse them, so the data transfer from disk to processors becomes a problem.
The solution used in a big data framework is very simple: we split the dataset into many chunks, we assign a chunk of the data to each server, each server analyses its own data using its own CPU and emits a partial result, and the partial results are aggregated (at the end) to obtain the final result.
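
As a hedged illustration of this split-and-aggregate idea (plain Python with multiprocessing, not Hadoop's actual API; the dataset and the "count the even numbers" task are invented for the example), here is a minimal sketch:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Worker: compute a partial result on its own chunk of the data.
    The toy task here is counting the even numbers in the chunk."""
    return sum(1 for x in chunk if x % 2 == 0)

if __name__ == "__main__":
    # Toy dataset standing in for a huge collection of records.
    dataset = list(range(1_000_000))

    # Split the dataset into chunks, one per "server".
    n_workers = 4
    chunk_size = len(dataset) // n_workers
    chunks = [dataset[i:i + chunk_size]
              for i in range(0, len(dataset), chunk_size)]

    # Each worker processes its chunk in parallel and emits a partial result.
    with Pool(n_workers) as pool:
        partial_results = pool.map(process_chunk, chunks)

    # Aggregate the partial results to obtain the final result.
    print(sum(partial_results))  # 500000
```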

Why do we have to use Hadoop and MapReduce?


The basic motivation concerns the data volumes, which, as we already said, increase every day at a really fast pace. As an example, suppose we want to analyse 10 billion web pages and the average size of a web page is 20 KB: the size of the entire collection is 200 TB. If the hard disk has a read bandwidth of 100 MB/s (notice that this value is already quite high), the time needed just to read all the web pages, without analysing them, is about 23 days. We understand that a single-node architecture cannot be adequate: you must have several servers that work in parallel (first problem).
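
To make the reasoning explicit, here is the back-of-the-envelope calculation behind these numbers, written as a small Python snippet (decimal units are assumed, i.e. 1 KB = 10^3 bytes and 1 MB = 10^6 bytes):

```python
pages = 10_000_000_000        # 10 billion web pages
page_size = 20 * 10**3        # 20 KB per page, in bytes
read_bandwidth = 100 * 10**6  # 100 MB/s, in bytes per second

collection_size = pages * page_size            # 2e14 bytes = 200 TB
read_time_s = collection_size / read_bandwidth # 2e6 seconds

print(collection_size / 10**12, "TB")   # 200.0 TB
print(read_time_s / 86_400, "days")     # ~23.1 days just to read the data
```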
Another problem that occurs when you have to manage a big quantity of data is failures. In fact, if we increase the number of servers, it becomes more probable that we face a failure. Failures are part of everyday life, especially in a data center, and we have to deal with them, so we need a system that provides solutions to tolerate failures. Generally, a single server stays up (if it is a good one) for about 3 years; if we have 10 servers we get roughly 1 failure every 100 days, which is not so bad, but if we have 1000 servers we may have about 1 failure per day, which is not good. What can a failure be? Permanent, if for example the motherboard is broken and the only solution is to replace it, or transient, for example the unavailability of a resource because too many users are using it at that moment.
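
The failure rates above come from a simple expected-value argument: if a single server fails on average once every ~1000 days (about 3 years), then with N independent servers you expect roughly N/1000 failures per day. A minimal check of this, assuming independence:

```python
mean_time_between_failures = 1000  # days of uptime for one server (~3 years)

for n_servers in (10, 1000):
    failures_per_day = n_servers / mean_time_between_failures
    days_between_failures = mean_time_between_failures / n_servers
    print(n_servers, "servers ->", failures_per_day, "expected failures/day,",
          "i.e. one failure every", days_between_failures, "days")
# 10 servers   -> one failure every 100 days
# 1000 servers -> about one failure per day
```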
Another problem, as we said before, is connected with the network bandwidth, because it can become the bottleneck of the system if big amounts of data need to be exchanged between nodes/servers. To avoid this problem we exploit data locality: we move the code to the data, because code and programs are usually small and it is much cheaper to transfer the code than to transfer the data.
Let's now look at a single-node architecture. In the first image the architecture works well with small datasets, since the data can be completely loaded into main memory. In the second one (which is the standard solution) the data are stored on disk; this works with large datasets, because the data cannot be completely loaded into main memory but only one chunk at a time, so that you can process each chunk, keep some partial statistics, and combine them all at the end to obtain the final result.
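
Here is a hedged sketch of this "one chunk at a time" strategy on a single node (the file name, the chunk size and the mean statistic are invented for the example): the file is read in blocks, a running statistic is updated for each block, and the partial statistics are combined at the end.

```python
def mean_of_values(path, chunk_size=1_000_000):
    """Compute the mean of one number per line in a file too big for RAM,
    keeping only a running (count, sum) instead of loading everything."""
    count, total = 0, 0.0
    with open(path) as f:
        while True:
            lines = f.readlines(chunk_size)  # read roughly chunk_size bytes
            if not lines:
                break
            # Update the partial statistics for this chunk.
            for line in lines:
                total += float(line)
                count += 1
    # Combine the partial statistics into the final result.
    return total / count if count else float("nan")

# Hypothetical usage on a large file with one numeric value per line:
# print(mean_of_values("measurements.txt"))
```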
Now consider a cluster architecture: if I have a huge dataset I can have problems with storage, so we use clusters of servers (data centers) where the computation is distributed across the servers and the data are also stored/distributed on different servers. Each server will contain part of the initial data. A standard architecture is composed of a cluster of commodity Linux nodes (servers), with for example 32 GB of main memory per node (a quantity that can be more or less depending on the company we are considering), interconnected by a network connection, in most cases Gigabit Ethernet. The idea is to have many racks, each containing 16-64 nodes. The nodes in each rack are connected by a switch, and usually the intra-rack connection is fast. If I have to send data from one node to another in the same rack I can do it efficiently; if I want to send data from one rack to a server in another rack I still have a good connection, because I am in the same data center, but I have the limitation given by the maximum bandwidth that the backbone can carry.
It is important to know how data centers are organised in order to write better applications. Hadoop provides many of the services needed to obtain a good distribution of the execution of your application, and it already knows where the data are and where it must execute your code.
If you want to scale, the best solution is to use a cluster of servers; we need to understand how to use this cluster of servers to solve the problem. Two approaches are usually used to address scalability issues:
Vertical scalability: used in high-performance computing; it can be summarized as follows: if I have a problem that my server cannot solve, I buy a new and better one. More power/resources are added to a single node.
Horizontal scalability: this is the common approach and also the one used in Hadoop: if I have a problem I do not buy a better server but another server equal to the others, and they will work in parallel. More nodes are added to the system and costs grow more linearly, since we pay the same amount of money for each server.
To sum up, for data-intensive workloads a large number of commodity servers is preferred over a small number of high-performing servers, because at the same cost we get a system that processes data in a more efficient and fault-tolerant way; so horizontal scalability is preferred for big data applications, but distributed computing is hard. The important thing is that you do not have to care about how to distribute the application among the different servers or how to synchronize all its instances, because the system does it for you; however, you do need to understand what the system is doing in order to avoid certain kinds of problems.
When you use a big data framework like Hadoop or Spark, the user does not have to manage tasks such as the scheduling of the distributed application, which is a critical operation: you just write a high-level application. Hadoop also manages the distribution of the data storage thanks to its distributed file system, which allows multiple copies of the same data to be kept on different servers; so if you have a problem with a server you have a copy elsewhere and you can still analyse the data (generally in Hadoop you have 3 copies of each piece of data; only if the 3 servers holding those copies all fail do you lose data, and that is a very uncommon situation).
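
To see why a replication factor of 3 protects against single-server failures, here is a small conceptual sketch, not HDFS's real placement policy (which is also rack-aware): each block is copied to 3 distinct servers, so losing any one server still leaves readable copies of every block. The server names and the round-robin placement are made up for the illustration.

```python
servers = ["s1", "s2", "s3", "s4", "s5"]
replication = 3

def place(block_id):
    """Toy placement: pick 3 distinct servers, round-robin by block id."""
    start = block_id % len(servers)
    return {servers[(start + i) % len(servers)] for i in range(replication)}

# Place 10 blocks, each replicated on 3 different servers.
blocks = {b: place(b) for b in range(10)}

# Simulate the failure of a single server: every block is still available.
failed = "s2"
still_readable = all(holders - {failed} for holders in blocks.values())
print(still_readable)  # True: losing any single server never loses a block
```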
What are the typical Big Data problems?
Usually you do not have complex problems: you want to perform simple analyses, but on a large amount of data. You have to iterate over a large number of records/objects, performing a full scan of your data, and compute a local statistic for each record. Each local statistic is then combined with the other partial results, aggregating them to obtain the final result. The basic idea is to analyse each record without needing information about the other records, and this is the idea Hadoop uses to parallelize applications (a minimal sketch of this pattern is shown after the list below). The main challenges (to sum up) are:
Parallelization
Distributed storage of large datasets
Node failure management
Network bottleneck
Data diversity and heterogeneity
But fortunately, the majority of these problems are managed by Hadoop.
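
Here is the promised minimal sketch of this record-at-a-time pattern, in the style of MapReduce (plain Python, not Hadoop's actual Java API; word count is just one example of a local statistic): each record is mapped to (key, partial statistic) pairs without looking at the other records, and a reduce step aggregates the partial statistics per key.

```python
from collections import defaultdict

def map_record(record):
    """Process one record in isolation: emit (key, local statistic) pairs."""
    for word in record.split():
        yield (word, 1)

def reduce_pairs(pairs):
    """Aggregate the local statistics per key to get the final result."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = ["big data is big", "data is everywhere"]
all_pairs = (pair for record in records for pair in map_record(record))
print(reduce_pairs(all_pairs))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```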
