This is an example of big data at work. From the first image we get the location of users from the integrated sensors on their devices, which gives us a lot of information. These data are processed and then merged to suggest, for example, a new route if the indicated one is congested. An app that does this is, for example, Waze.
What is big data? There are many different definitions, but generally when we talk about big data we refer to something big, large, complex and heterogeneous. The original definition also characterized it through the 3 Vs of big data:
1. Volume: the scale of data increases exponentially over time and continues to grow very fast. For example, it increased 44 times from 2009 to 2020.
2. Variety: different forms of data. There are several types and structures, and even a single application may generate many different formats, which forces us to face problems such as heterogeneous data and complex data integration.
3. Velocity: analysis of streaming data. The generation rate is really high, so we need very fast data processing to ensure timeliness.
Typically, you do not have all three Vs at the same time: if you do, you are probably in trouble.
Nowadays two other Vs have been added:
4. Veracity: uncertainty of data. Data quality is an important feature.
5. Value: exploiting the information provided by the data. This is the most important V because it translates data into a business advantage.
Remember that you can be a data scientist even if you are analysing a small dataset, so it is not true that data science is useful only for Big Data. Also, you do not have to perform the analysis alone: there are teams of data scientists, where each member is specialized in a certain field.
Big Data value chain
Let's now see the steps to follow when you want to analyse Big Data:
• Generation: it could be
o Passive recording: here we deal with structured data and have many inputs such as shopping records, bank trading transactions, etc.
o Active generation: data are semi-structured or unstructured; they are generated by users through apps and mobile apps.
o Automatic production: data generated automatically, for example by location-aware, sensor-based devices connected to the Internet.
• Acquisition: it depends on which data you are analysing. There are three phases (a small collection-and-preprocessing sketch follows this list):
o Collection: pull-based, like a web crawler, or push-based, like video surveillance (the difference is that in the first case you write something that goes and requests the data, while in the second case something already sends the data to you).
o Transmission: we need to know the characteristics of the network, because we have to send the data to a data center over high-capacity links and we need to do it in the best way.
o Preprocessing: it is composed of a series of sub-operations such as integration (of resources) and redundancy elimination.
• Storage: it is necessary to know the
o Storage infrastructure: both the technology used to store the data (HDDs and SSDs, for example) and the networking one.
o Data management: in particular the file system (e.g. HDFS) and key-value stores (e.g. Memcached); a toy key-value store is also sketched after this list.
o Programming models: MapReduce, stream processing and graph processing.
• Analysis: our objectives are descriptive, predictive and prescriptive analytics, and to reach them we use methods such as data mining, statistical analysis, clustering, etc. (a minimal clustering example closes the sketches after this list).
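To make the Acquisition step more concrete, here is a minimal Python sketch of pull-based collection followed by redundancy elimination in preprocessing. The URL and the hash-based deduplication are illustrative assumptions, not something from the notes, and the fetch performs a real HTTP request.

```python
import hashlib
import urllib.request

def pull_collect(urls):
    """Pull-based collection: we actively request the data, like a web crawler."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(url) as response:
            pages.append(response.read())
    return pages

def deduplicate(records):
    """Preprocessing: redundancy elimination, dropping byte-identical records."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

# Hypothetical usage: the URL is a placeholder. Fetching the same page twice
# produces a duplicate that the preprocessing step then eliminates.
raw = pull_collect(["https://example.com", "https://example.com"])
print(len(raw), len(deduplicate(raw)))  # 2 1
```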
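For the key-value stores mentioned under Data management, the sketch below mimics the basic get/set interface that Memcached-style systems expose, using a plain in-memory dict purely for illustration; a real store adds networking, eviction and expiry on top of this.

```python
class TinyKeyValueStore:
    """Toy in-memory key-value store with a Memcached-style get/set interface."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Store the value under the key, overwriting any previous value.
        self._data[key] = value

    def get(self, key, default=None):
        # Return the stored value, or the default if the key is missing.
        return self._data.get(key, default)

store = TinyKeyValueStore()
store.set("user:42:last_route", "ring road")  # key naming is a made-up convention
print(store.get("user:42:last_route"))        # ring road
print(store.get("user:43:last_route"))        # None
```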
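Finally, for the Analysis step, a tiny clustering example: a one-dimensional k-means written from scratch. The data and parameters are made up; the point is only to show the assign/update loop that clustering methods are built around.

```python
import random

def kmeans_1d(points, k=2, iterations=10, seed=0):
    """Very small k-means on 1-D data, one of the clustering methods used in analysis."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])
print(sorted(centroids))  # roughly [1.0, 9.5]: the two groups in the data
```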
Big data challenges
The main difference is that data are now really important, and to analyse them we need a new architecture (but also new programming paradigms and techniques). The traditional approach is not enough because it cannot support big data: traditional computation, for example, is processor-bound, and to increase performance we need newer and faster processors, but also more RAM. So which is the bottleneck?
In the traditional approach you have something that stores the data in your server so that you can analyse them there, so the data transfer from disk to processors becomes the problem.
The solution used in big data frameworks is very simple: we split the dataset into many smaller datasets, assign a chunk of the data to each server, and analyse the chunks using the CPU that each server already has; every server emits a partial result, and the partial results are aggregated at the end to obtain the final result.
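A minimal sketch of this split-process-aggregate idea in plain Python (not a real framework such as Hadoop): the dataset is split into chunks, each chunk is processed independently, as each server's CPU would do, and the partial results are merged at the end. Word counting is just an illustrative task.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Per-chunk work: each worker computes a partial result on its own data."""
    return Counter(word for line in chunk for word in line.split())

def split(data, n):
    """Split the dataset into n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    lines = ["big data at work", "big data value chain", "data data data"]
    with Pool(3) as pool:
        partials = pool.map(count_words, split(lines, 3))  # independent partial results
    total = sum(partials, Counter())                       # final aggregation
    print(total.most_common(2))  # [('data', 5), ('big', 2)]
```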