
Big Data: Comments from a Beginner

Pedro Francisco Borges Pereira


Undergraduate Student, Universidade Luterana do Brasil (Lutheran University of Brazil)
Major: Information Systems
International Exchange Student at Kansas State University
Fall/2014 Semester

Foreword
I was very interested in Big Data (is there some IT professional who is not, nowadays?) and decided to learn about it. So I've enrolled in, and I'm attending, CIS 798 - Programming Techniques for Big Data Analytics at Kansas State University, with Professor William Hsu.
I started the course aware of some catchwords, like:
"Big Data is NoSQL."
"Big Data made relational databases obsolete."
Professor Hsu started the course teaching us that Big Data is about "Volume, Velocity, Variety", and proceeded with a machine lab implementing the prototypical Word Count Big Data programming example, using Hadoop with a Java plug-in.
Well, that makes sense, but I think Professor Hsu underestimated my lack of knowledge. I was not just ignorant regarding the programming techniques; I did not know when, in which use cases, to apply Big Data programming techniques.
So I thought I should take some time to learn about the context: in what use cases we could and/or should use Big Data techniques and also, if traditional relational databases are not dead, when we should stay with them.
Since my knowledge about Big Data (or, more properly, my lack of it) summed up to little more than the catchword "Big Data is NoSQL", when I speak, in these initial paragraphs, about "Big Data infrastructure", please understand that I mean anything different from relational databases.
I called myself a beginner. Indeed I am, regarding Big Data. I'm familiar (proficient, I would dare to say) with relational databases. I think these notes can be useful to anyone aiming to get a general overview of Big Data, but please take into account that I'm writing as somebody used to relational databases and new to Big Data. I did a reasonable amount of reading about Big Data, but of course I am still a beginner. I may have made mistakes; I very probably did. Your feedback will be very welcome at my personal e-mail: pedrofbpereira@yahoo.com.br. Thank you very much!

Contents
Foreword
Initial Statements
For some Use Cases, Relational databases are better
Big Data does things that Relational Databases can't
    Un-Structured Data
    Complex Event Processing
    Sessionization
Volume, Velocity, Variety
    Volume
    Velocity
    Volume, Velocity: Facebook
    Variety
    NoSQL
        Schema-agnostic, but organized
        Key-Value Stores
        Column Family
        Document Databases
        Graph Databases
        Big Table and HBase
Infrastructure Characteristics
    Distributed, Parallel and Redundant: High Performance and Availability
    Super-Computing / Cloud Computing
    Hadoop Distributed File System (HDFS)
    MapReduce
        The MapReduce Pipeline
        Hadoop: Simple Definition
        Hadoop BeoCat instance
        AMAZON WEB SERVICES CLOUD COMPUTING
The Canonical Wordcount Example
    Environment: Linux; Programming Language: Python
    Word Count Diagram
    hadoop-streaming.jar
    Mapper
    Reducer
    Our Mapper and Reducer can run in plain Linux
    Running our Hadoop Job
    Using Parallel Processing
BIBLIOGRAPHY
Glossary
Appendix A: BigData Solution Architecture
    Felipe Renz MBA Project at Unisinos College, Brazil

Initial Statements
I've read a dozen articles (see Bibliography, in these Notes), as well as Professor Hsu's handouts, and reached, in short, the following conclusions:
- Despite the fact that (understandably) some NoSQL database vendors say so, not all applications using relational databases are worth converting to Big Data techniques; relational databases (and data warehouses using non-normalized star schemas and cubes) are still, frequently, the best option.
- Despite the fact that some relational database vendors (understandably) state that their products can do everything that Big Data frameworks do, they can't. They can indeed deliver some Big Data tools. But amazing new fields (for example, Complex Event Processing) are being pioneered thanks to Big Data techniques.
- "Volume, Velocity and Variety" is a simple way to describe Big Data features to outsiders. Reality is not always that simple. Actually, most relational databases can scale to "big data" volume (usually at a higher cost; usually not instantly) and can perform at high speed (depending on the use case, faster than Big Data frameworks). Variety would be the key feature. I'd like to highlight that, as I see it, this variety is more relevant in the sense of changes in the database schema (or even absence of a schema) than in the sense of variety of data sources or data content.
- Economic differences and infrastructure differences:
  o A strong point of Big Data frameworks is distributed processing and storage, which leads to great ease of scalability and fault tolerance. Relational databases can be distributed and grow to sizes that would grant them the right to be called "big data" too. But since Big Data frameworks are based on standard cheap computers, and relational databases (mostly) use expensive, powerful servers, distributing conventional relational databases is a lot more expensive. Big Data frameworks tend to be cheaper.
  o Big Data frameworks can scale instantly. Relational databases, usually, are not able to respond instantly to explosive demand increases.
  o Relational databases can be hosted in the cloud. Most Big Data frameworks are born in the cloud. Open source frameworks (for example, Apache Hadoop) are less expensive than cloud versions of relational databases. And big providers of Infrastructure as a Service (for example, Amazon) allow using a safe, best-practices-managed, and highly (and instantly!) scalable platform, for an affordable cost.
- Giants such as Google, Yahoo, Amazon, the government, and also academic researchers almost certainly need to use Big Data frameworks. Smaller companies, thanks to IaaS providers, can also use Big Data.
- There is no free lunch! The increased capability of Big Data frameworks to deal with variable schemas implies not being ACID [1] (Atomic, Consistent, Isolated, Durable). Although some NoSQL distributors claim to be ACID, I honestly can't understand how you can be distributed, unrelated, unlocked and, at the same time, ACID. They aim to be, instead, BASE [2] (Basically Available, Soft-state and Eventually consistent).
[1] This performance/ACID exchange is clearly stated in the article Dynamo: Amazon's Highly Available Key-value Store, at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf .
[2] See Enterprise NoSQL for Dummies, MarkLogic Special Edition, by Charlie Brooks.

For some Use Cases, Relational databases are better


Which is the nature of your data? Tabular data, like those used in accounting, payroll, inventory, and so on? Then there is no reason to think about anything but relational databases.
Regardless of the complexity level of your data, how deeply do you know and understand your application's requisites, your users' needs? If you're a Subject Matter Expert and you can foresee the main query needs, then you or your Database Designer will be able to design a relational database schema to match the requisites; you don't need anything else.
Some article authors state that complexly structured data does not fit into the two-dimensional row-column structure of relational databases; that, for example, multi-level nesting and hierarchies would be more easily represented with JSON. I would dare to say that these authors are just not familiar enough with relational databases; they are not taking into account that proper use of relationships provides a lot more than a two-dimensional row-column structure and can easily store multi-level nesting and hierarchies!
If it is a matter of cost, of avoiding paying for licenses, remember that although NoSQL and Open Source are usually seen as associated, open source does not always mean free. Furthermore, SQL Server has free versions, Oracle's MySQL is free, and there are several other free relational database implementations.
Are your data big? Well, StackOverflow has 4 million users and 560 million page views a month and manages this with one SQL Server deployment running on four servers [3]. Even Facebook uses MySQL!
I think I don't need to say it (every IT professional is familiar with this saying), but here it goes: if it is working, don't fix it! If your application is working fine with a relational database, there is definitely no reason why you should convert it to any other technology.

Big Data does things that Relational Databases can't


Un-Structured Data
The absence of a rigid schema gives Big Data frameworks the capability of using the same source data in several different ways, without ETL, without data re-organization. You can just use different plug-ins, in Java, Python, or several other languages, to make sense of your raw data (for example, free-format text). It reminds me of the old times of COBOL, when there were no databases, the data structure was defined inside the programs, and joins were done by manually coding the logic. The difference is that with Big Data a lot of the time there is no data structure at all; you build a data structure (for example, key-value pairs) on the fly.
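As an illustration (a minimal sketch of my own, not tied to Hadoop or any particular framework), the Python snippet below takes free-format text lines and builds key-value pairs on the fly; the choice of what is "key" and what is "value" lives entirely in the code, not in any schema:

# Minimal sketch: deriving key-value pairs from raw, free-format text.
# There is no predefined schema; the "structure" exists only in this code.
raw_lines = [
    "2014-10-01 ERROR disk /dev/sda1 full",
    "2014-10-01 INFO user pedro logged in",
    "2014-10-02 ERROR disk /dev/sdb1 full",
]

pairs = []
for line in raw_lines:
    tokens = line.split()
    # Decide, at processing time, what the key and the value are:
    # here the key is (date, severity) and the value is the rest of the line.
    key = (tokens[0], tokens[1])
    value = " ".join(tokens[2:])
    pairs.append((key, value))

for key, value in pairs:
    print(key, "->", value)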
[3] Big Data: How to Pick Your Platform - Tech Digest, InformationWeek, September 2014 - http://dc.ubm-us.com/i/378226#201-element

Complex Event Processing


Quoting literally the article Big Data: How to Pick Your Platform, from Tech Digest, September 2014 (http://dc.ubm-us.com/i/378226#201-element):
"Complex Event Processing - Next Big Data Frontier
Complex event processing is poised to open new avenues of data analysis, especially given the rise of the Internet of Things. Think of it as multidimensional clickstream analysis: Identify which sessions in certain clickstreams are like sessions in other clickstreams, then figure out what it all means. Four interesting projects illustrate the wealth of possibilities.
Not-so-weird science: Remember Isaac Asimov's fictional science Psychohistory, which posits that there may only be a set number of human-reaction patterns to events? Google's BigQuery database now provides access to the GDELT Event Database, which tracks global broadcast, print, and web news media and applies algorithms to see what's going on, and whether something similar has happened before. If so, could the past help us understand how to react to current events?
Spotting disease: Researchers at Boston Children's Hospital helped develop HealthMap, an outbreak monitoring system to detect and track the spread of disease. It aggregates big data from sources including the World Health Organization, Google News, Baidu News, and the European Centre for Disease Prevention and Control. Users can look at a map and find out what health risks are in their neighborhoods, or in areas where they plan to travel."
The article lists more CEP uses. These two, I think, are interesting and impressive enough to highlight the point.

Sessionization
Sessionization is the capability of analyzing clickstream-related events by grouping all the events from a session. With this capability you can query, for example, what happened before a particular type of event (for example, before a piece of equipment fails). Big Data frameworks such as Hadoop have had sessionization features for years; Oracle 12c recently released this feature. I'm not aware whether the new Oracle sessionization feature follows the pattern described in Case Study: Sessionization (Social Network) [4].
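As a toy illustration of my own (not Hadoop's or Oracle's implementation), the sketch below groups a small clickstream per user into sessions, using an arbitrary 30-minute inactivity cutoff, and then asks what happened right before each failure event:

from datetime import datetime, timedelta

# Toy sessionization sketch: group clickstream events per user into sessions,
# starting a new session after 30 minutes of inactivity (arbitrary cutoff).
events = [
    ("alice", "2014-10-01 10:00", "view_page"),
    ("alice", "2014-10-01 10:05", "add_to_cart"),
    ("alice", "2014-10-01 11:30", "view_page"),   # new session (gap > 30 min)
    ("bob",   "2014-10-01 10:02", "equipment_warning"),
    ("bob",   "2014-10-01 10:10", "equipment_failure"),
]

GAP = timedelta(minutes=30)
sessions = {}  # user -> list of sessions, each session a list of (time, action)

for user, ts, action in sorted(events):
    when = datetime.strptime(ts, "%Y-%m-%d %H:%M")
    user_sessions = sessions.setdefault(user, [])
    if not user_sessions or when - user_sessions[-1][-1][0] > GAP:
        user_sessions.append([])          # start a new session
    user_sessions[-1].append((when, action))

# Example query: what happened right before each "equipment_failure"?
for user, user_sessions in sessions.items():
    for session in user_sessions:
        for i, (when, action) in enumerate(session):
            if action == "equipment_failure" and i > 0:
                print(user, "before failure:", session[i - 1][1])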

Volume, Velocity, Variety


As we said, it's fair to say that the three Vs define Big Data. But, even if that is correct, we could also say that things are not always so simple.

Volume
Ok, Big Data. How big? It's usually accepted that the big data range starts at the terabytes level.
And, indeed, it could be said that, in general, relational databases and data warehouses start to be overwhelmed from 10 terabytes on and that, because of this, when you reach some point near 10 terabytes, it's better to think about alternatives. But "alternatives" does not necessarily mean Big Data.

[4] Case Study: Sessionization (Social Network) - https://www.youtube.com/watch?v=hzZ3bM80pJg

Oracle Exadata, for example, can provide, in a single rack, 12 terabytes of system memory and 672 terabytes of disk, with more than 400 high-performance CPU cores [5]. With two of these racks, we are in the petabytes range!
Large volumes don't necessarily demand big data tools.

Velocity
Relational databases can be scaled to deal with a large number of transactions with good performance.
As we mentioned before, Stack Exchange, which maintains the programming Q&A site Stack Overflow, ranked 54th in the world for traffic, uses SQL Server [6].
Of course, if you are dealing with machine-created data, from sensors for example, that would create a massive amount of traffic. But very few companies that use caching and indexing properly will have traffic on a web application that could justify moving to a Big Data platform for the sole reason of needing more velocity.

Volume, Velocity: Facebook
Actually, we have a very illustrative example both in terms of volume and velocity: Facebook uses MySQL [7]. It's true that they customized MySQL; they developed and are using several customized patches. But it's still MySQL; it's a relational database, and not even one of the most powerful ones.
In this particular case, we even have standard, cheap machines being used as servers, something we usually associate with Big Data distributed frameworks.

Variety
Variety is the key word. Relational databases, due to their well-defined schemas, can grant data integrity and consistency. Exactly due to those well-defined schemas, sometimes it's not easy to make real-life, changing data fit the schema. ETL operations are needed; sometimes the schema needs to be changed, which usually implies time-consuming operations and probably some downtime.
Nowadays businesses need to handle unstructured data, such as text; flexibly structured JSON data; geospatial data...
Variety is indeed a stronger reason for a platform change. If your data are unstructured, or flexibly structured, there is no reason to stay with a structured approach. You should think about changing from SQL to NoSQL.

NoSQL
It's hard to define NoSQL; the difficulty starts with the name: it says what it is not, instead of what it is!
Furthermore, Hive (a query language for Hadoop) has a syntax very similar to SQL, and the Cassandra database provides the CQL query language, which also has similarities with SQL.
I think that's why some are defining NoSQL as "Not Only SQL", instead of, literally, "no SQL".

[5] Oracle Exadata Database Machine: Extreme Performance for the Cloud - https://www.oracle.com/engineered-systems/exadata/index.html
[6] Big Data: How to Pick Your Platform - Tech Digest, InformationWeek, September 2014 - http://dc.ubm-us.com/i/378226#201-element
[7] See (among other sources) Facebook Showcasing Its Open Source Database, at http://allfacebook.com/facebook-showcasing-its-open-source-database_b21560 .

Trying to define this kind of database, the similarities we find in most NoSQL databases are that they are schema-agnostic and that they provide access to the database through Application Programming Interfaces (APIs).
One inference from NoSQL being schema-agnostic is that it is extremely well suited to dealing with non-structured data.
Since there is no standard query language for NoSQL databases, using an API is necessary. You code a customized program using, for example, Java or Python to access the database according to your particular use case.
As we said, some high-level languages to query NoSQL databases have been developed, for example Pig and Hive. Usually, these languages' performance is not as good as when using the API.
Schema-agnostic, but organized
According to the book Enterprise NoSQL for Dummies [8], NoSQL databases are usually structured in one of these forms:
Key-Value pairs
Document
Column Family
Graph
If we take a look at some NoSQL database implementations, we'll see that more than one logical concept can be present in one database.
Key-Value Stores
This is a simple data model; a rough sketch of it appears below.
The main implementation is Dynamo, Amazon's key-value store.
Other vendors are Redis, Riak and Voldemort [9].
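Just to make the data model concrete, here is a sketch in pure Python (my own illustration; it is not Dynamo's, Redis's or any vendor's API): a key-value store boils down to put/get/delete by key, and the value is an opaque blob to the store.

# Minimal sketch of the key-value data model (not any vendor's API):
# the store knows nothing about the structure of the values it holds.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1001", '{"name": "Pedro", "major": "Information Systems"}')
print(store.get("user:1001"))   # the value is just an opaque blob to the store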
Column Family
This model supports semi-structured data. Its prototype is BigTable, from Google.
HBase is an open source Apache implementation of this model.
Cassandra is another Apache implementation of BigTable. Cassandra incorporates Cassandra Query
Language (CQL), which has been modeled after SQL.
Document Databases
This model defines a document as a collection of key-value pairs. The database itself is a collection of documents (see the sketch below). MongoDB is an implementation of this model.
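A rough sketch of the idea, using plain Python dictionaries to stand in for JSON documents (my own illustration, not MongoDB's API):

import json

# Minimal sketch of the document model: a collection of schema-free documents.
# Each document is a set of key-value pairs; documents need not share fields.
collection = [
    {"_id": 1, "type": "book", "title": "Enterprise NoSQL for Dummies",
     "authors": ["Charlie Brooks"]},
    {"_id": 2, "type": "student", "name": "Pedro",
     "school": {"name": "Kansas State University", "term": "Fall/2014"}},
]

# "Querying" is just filtering on whatever fields a document happens to have.
students = [doc for doc in collection if doc.get("type") == "student"]
print(json.dumps(students, indent=2))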
Graph Databases
This model is based on nodes and relationships, and has a higher level of complexity. It stores and supports querying linked data, equivalent to several graph types (Undirected, Directed, Multi Graphs, Hyper Graphs, etc.). Each node in the database structure knows its adjacent nodes, as in the toy sketch below. Some implementations are OrientDB, InfiniteGraph and AllegroGraph.
[8] See Enterprise NoSQL for Dummies, MarkLogic Special Edition, by Charlie Brooks.
[9] Introduction to Graph Databases - Chicago Graph Database Meet-Up, by Max de Marzi - http://www.slideshare.net/maxdemarzi/introduction-to-graph-databases-12735789 .


Big Table and HBase
As we said above, BigTable and HBase are column family NoSQL database implementations, from Google and Apache respectively.
According to the definition of BigTable which I found in the article Understanding HBase and BigTable [10] and also in the article BigTable: a NoSQL massively parallel table [11], BigTable is "a sparse, distributed, persistent multidimensional sorted map". And Understanding HBase and BigTable proceeds: "The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes."
"Map", in this sense, is an associative array. According to Krzyzanowski, BigTable is "a collection of (key, value) pairs where the key identifies a row and the value is the set of columns".
"Multidimensional" means (still according to Krzyzanowski) that "A table is indexed by rows. Each row contains one or more named column families. Column families are defined when the table is created. Within a column family, one may have one or more named columns. Columns within a column family can be created on the fly" [12].
So we can see that both the Key-Value Pairs and the Column Family structures are implemented in BigTable; the sketch below tries to make this concrete.
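Below is a naive, in-memory sketch of that data model (my own illustration, not the HBase or BigTable API): a sorted map indexed by (row key, column family, column qualifier, timestamp), where an "update" is simply a new timestamped version.

import itertools

# Naive in-memory sketch of the BigTable/HBase data model (not their API).
# The map is indexed by (row key, column family, column qualifier, timestamp);
# values are opaque byte strings, and writes never update in place.
table = {}
_ts = itertools.count(1)   # stand-in for real timestamps, strictly increasing

def put(row, family, column, value):
    table[(row, family, column, next(_ts))] = value

def get_latest(row, family, column):
    versions = [(ts, v) for (r, f, c, ts), v in table.items()
                if (r, f, c) == (row, family, column)]
    return max(versions)[1] if versions else None

put("com.example.www", "contents", "html", b"<html>v1</html>")
put("com.example.www", "contents", "html", b"<html>v2</html>")   # a new version, not an update
put("com.example.www", "anchor", "cnnsi.com", b"Example")

print(get_latest("com.example.www", "contents", "html"))   # b'<html>v2</html>'
print(sorted(table)[0])   # the smallest key when the map is scanned in sorted order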
In The NoSQL Movement: Big Table Databases [13], it is stated that "Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes." The timestamp index allows each cell in a Bigtable to contain multiple versions of the data, indexed by time.
These time-indexed versions of the data allow BigTable to avoid updating records; instead, it just inserts a new version. This is not just useful but needed, since HBase (the open source counterpart of BigTable) rests on the Hadoop Distributed File System and, as described below in this document, HDFS is not oriented to updating data.
Since a Document database is also a collection of key-value pairs, BigTable also implements this kind of database.
Regarding the Graph database, I would say that BigTable implements it partially. Since the value part of the key-value pairs can be another key-value collection, and this can occur several times, at nested levels, you can easily implement a specific kind of graph: a tree. If you need a kind of graph which is able to directly link two branches in different trunks (i.e., able to navigate the tree without mandatorily following the branches), probably a BigTable implementation will not be the best choice and you would need a specific Graph database.
[10] Understanding HBase and BigTable, by Jim R. Wilson - http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
[11] BigTable: a NoSQL massively parallel table, by Paul Krzyzanowski - available in the handouts of Professor Hsu's course.
[12] This results in that (strangely for those used to relational databases) there is no built-in way to query for a list of all columns in all rows. However, you can query for a list of all column families.
[13] The NoSQL Movement: Big Table Databases, by Paul Wiliams - http://www.dataversity.net/the-nosql-movement-big-table-databases/

Infrastructure Characteristics
Please let me highlight that, when describing Big Data infrastructure characteristics in this topic, I have in mind two specific implementations: Apache Hadoop and Amazon Web Services. These are two of the most important Big Data framework implementations. Since the field is still being shaped, it's a lot easier to describe actual implementations than to define a general model. The features we describe below are common to these two implementations.
Let us mention that there is also a Hadoop on Google Cloud implementation. We are not going to describe it; you can find a description at https://cloud.google.com/hadoop/what-is-hadoop .
What these Big Data frameworks have in common is that they are distributed, parallel and redundant, resulting in high performance and availability. In order to achieve these features, the frameworks are usually hosted on HPC clusters (also called super-computers) or use cloud computing. An indispensable requisite for these frameworks is some kind of distributed file system; most use some variation of the Hadoop Distributed File System (HDFS). A recurrent programming pattern is MapReduce.

Distributed, Parallel and Redundant: High Performance and Availability


Most of Big Data frameworks' good results are obtained through high performance and high availability. We could say that, even when the data are not "big", using a Big Data framework could be advantageous because of this.
But high performance and availability are not intrinsic to Big Data itself. They are a result of distributed storage and processing. Notice that distributed processing embeds the notion of parallel processing.

Super-Computing / Cloud Computing


These advantages are not intrinsic to Big Data frameworks, nor restricted to them.
Big Data frameworks just take advantage of super-computing [14] and/or cloud computing [15] frameworks to get their high performance. Several other areas beyond Big Data benefit from them, including relational databases. Relational databases can also be hosted in the cloud; Amazon Web Services offers both products.
High availability is a plus usually included in super-computing/cloud computing, since the physical layer (servers, disk racks, etc.) is typically redundant in these environments. The function of a failed disk or processor is immediately taken over by another in the cluster.
Let us describe HDFS, a distributed file system widely used in Big Data frameworks, and MapReduce, a programming pattern which can be used together with HDFS and is also very common when working with Big Data.

[14] Supercomputer: a supercomputer is a computer that performs at or near the currently highest operational rate for computers. A supercomputer is typically used for scientific and engineering applications that must handle very large databases or do a great amount of computation (or both). Source: http://whatis.techtarget.com/definition/supercomputer
[15] According to Amazon (aws.amazon.com), "Cloud Computing, by definition, refers to the on-demand delivery of IT resources and applications via the Internet with pay-as-you-go pricing."


Hadoop Distributed File System (HDFS) [16]


HDFS is an Apache Hadoop subproject which implements a distributed file system that runs on commodity, low-cost hardware.
An HDFS instance can run on hundreds or even thousands of server machines, each one storing a part of the data. Its architecture aims to deal with fault detection and recovery; even if some component fails, the HDFS instance continues to work.
HDFS is designed to support large files, typically in the terabytes range.
HDFS applications need a write-once-read-many-times access model. Once a file has been created, it shall not be changed. MapReduce applications fit this model.
An assumption of the HDFS architecture is that "moving computation is cheaper than moving data"; i.e., a computation runs more efficiently when it runs near the data it uses, and this is especially true when the data size is large. HDFS provides interfaces to move applications closer to the data and thus avoid network congestion.
We are not going to describe the HDFS architecture in detail here; you can find more details, including an architecture diagram, in the HDFS Architecture Guide (http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html).

[16] HDFS Architecture Guide: Introduction - http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html


MapReduce
This expression, MapReduce, is used to refer both to a programming pattern and to several software frameworks which implement it.
According to Donald Miner and Adam Shook [17], MapReduce "is a computing paradigm for processing data that resides on hundreds of computers, which has been popularized recently by Google, Hadoop, and many others".
Miner and Shook also emphasize that "The trade-off of being confined to the MapReduce framework is the ability to process your data with distributed computing, without having to deal with concurrency, robustness, scale, and other common challenges".
The pattern can be very simplistically described as consisting of a Map phase, which converts input data into key-value pairs, and a Reduce phase, which summarizes the key-value pairs as aggregated data. In A Beginner's Guide to Hadoop, Matthew Rathbone illustrates this with a diagram of the pipeline [18] (not reproduced here).
The MapReduce Pipeline

The Mappers are distributed through hundreds (or thousands) of cluster nodes, increasing performance. Actually, the input data themselves are distributed through HDFS.
The Reducers then narrow this distribution down until finally releasing the summarized, analytical data.
Comparing the HDFS architecture with the MapReduce pipeline, you can see clearly that the HDFS distributed architecture matches MapReduce's parallel processing.
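To make the pattern concrete, here is a tiny single-machine simulation of the map / shuffle / reduce pipeline (my own illustration; Hadoop performs the same phases, but distributed across many nodes and backed by HDFS):

from collections import defaultdict

# Tiny single-machine simulation of the MapReduce pipeline (illustration only).

def map_phase(record):
    # Map: turn one input record into key-value pairs.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Reduce: aggregate all the values seen for one key.
    return (key, sum(values))

input_records = ["Big Data is NoSQL", "Big Data is about Volume Velocity Variety"]

# Shuffle: group every value emitted by the mappers under its key.
grouped = defaultdict(list)
for record in input_records:
    for key, value in map_phase(record):
        grouped[key].append(value)

results = [reduce_phase(key, values) for key, values in sorted(grouped.items())]
print(results)   # e.g. [('about', 1), ('big', 2), ('data', 2), ...]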

[17] MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly. Available at www.it-ebooks.info .
[18] A Beginner's Guide to Hadoop, Matthew Rathbone - http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html

Hadoop: Simple Definition


According to a SAS [19] page, Hadoop "is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing" [20].
The Apache page [21] states that "The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."

Hadoop BeoCat instance


An actual Hadoop instance is installed on BeoCat, Kansas State University's supercomputer [22]. BeoCat is a supercomputer built as a Beowulf [23] computer cluster, hence the name [24].
BeoCat has 164 nodes, totaling 2528 processors and more than 14 TB of RAM [25].
Residing in this clustered supercomputer, the Hadoop instance can be fully distributed and parallel. The BeoCat admins grant that it is also redundant and highly available.
The main Hadoop components are described (in a simplistic way) below [26]:

[19] http://www.sas.com/en_us/company-information.html
[20] http://www.sas.com/en_us/insights/big-data/hadoop.html
[21] http://hadoop.apache.org/
[22] See BeoCat, at https://www.cis.ksu.edu/beocat, for additional information.
[23] Beowulf cluster - http://en.wikipedia.org/wiki/Beowulf_cluster
[24] A Wildcat is Kansas State University's symbol.
[25] About BeoCat, at http://support.cis.ksu.edu/BeocatDocs (this page has been moved to http://support.beocat.cis.ksu.edu).
[26] Big Data: Hadoop from an Infrastructure Perspective - http://blogs.cisco.com/datacenter/big-data-hadoop-from-an-infrastructure-perspective



HDFS and MapReduce we have just discussed in this document.
HBase, as we already said in this document, is a Column Family NoSQL database based on Google's BigTable; HBase is the Apache implementation of BigTable.
Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs. Pig is procedural, as opposed to SQL, which is declarative.
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. It was initially developed by Facebook; nowadays several other companies also use Hive. Amazon has a Hive fork included in Amazon Elastic MapReduce. Hive has a SQL-like syntax.
Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

AMAZON WEB SERVICES CLOUD COMPUTING


For those who do not have access to a High Performance Computing cluster, Amazon makes its Web Services available as Infrastructure as a Service. The range of products is very wide; Amazon classifies it as Compute, Storage & Content Delivery, Databases, Networking, Analytics, Mobile Services, Enterprise Applications, etc.
They claim to deliver IT services through the Internet with pay-as-you-go pricing [27].
In this document we are going to focus on some of these services directly connected to databases and/or Big Data.
This subset of services provides the equivalent of the Hadoop instance we described above in this document, just in the cloud [28].

Amazon EC2 - provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers and system administrators.
Amazon EBS (Elastic Block Store) - provides block-level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are off-instance storage that persists independently from the life of an instance.
Databases - Amazon Web Services provides fully managed relational and NoSQL database services, as well as fully managed in-memory caching as a service and a fully managed petabyte-scale data-warehouse service. Or, you can operate your own database in the cloud on Amazon EC2 and Amazon EBS.
Storage & Content Delivery - Amazon S3 (Simple Storage Service) provides a fully redundant data storage infrastructure for storing and retrieving any amount of data, at any time, from anywhere on the Web.

[27] What is Cloud Computing - http://aws.amazon.com/what-is-cloud-computing/
[28] Source: aws.amazon.com


Analytics - Amazon Web Services provides cloud-based analytics services to help you process and analyze any volume of data, whether your need is for managed Hadoop clusters, real-time streaming data, petabyte-scale data warehousing, or orchestration.
Amazon EMR (Elastic MapReduce) - a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. Amazon EMR uses Hadoop to distribute your data and processing across a resizable cluster of Amazon EC2 instances.

The text below is literally copied from the Amazon site (aws.amazon.com):

"How to Use Amazon EMR
To use Amazon EMR, you simply:
1. Develop your data processing application. You can use Java, Hive (a SQL-like language), Pig (a data processing language), Cascading, Ruby, Perl, Python, R, PHP, C++, or Node.js. Amazon EMR provides code samples and tutorials to get you up and running quickly.
2. Upload your application and data to Amazon S3. If you have a large amount of data to upload, you may want to consider using AWS Import/Export (to upload data using physical storage devices) or AWS Direct Connect (to establish a dedicated network connection from your data center to AWS). If you prefer, you can also write your data directly to a running cluster.
3. Configure and launch your cluster. Using the AWS Management Console, EMR's Command Line Interface, SDKs, or APIs, specify the number of EC2 instances to provision in your cluster, the types of instances to use (standard, high memory, high CPU, high I/O, etc.), the applications to install (Hive, Pig, HBase, etc.), and the location of your application and data. You can use Bootstrap Actions to install additional software or change default settings.
4. (Optional) Monitor the cluster. You can monitor the cluster's health and progress using the Management Console, Command Line Interface, SDKs, or APIs. EMR integrates with Amazon CloudWatch for monitoring/alarming and supports popular monitoring tools like Ganglia. You can add/remove capacity to the cluster at any time to handle more or less data. For troubleshooting, you can use the console's simple debugging GUI.
5. Retrieve the output. Retrieve the output from Amazon S3 or HDFS on the cluster. Visualize the data with tools like Tableau and MicroStrategy. Amazon EMR will automatically terminate the cluster when processing is complete. Alternatively you can leave the cluster running and give it more work to do."
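As an illustration of step 3, the sketch below launches a small cluster programmatically and submits a Hadoop Streaming word count step. This is my own example using the boto3 Python SDK, not something taken from Amazon's text above; it assumes AWS credentials are configured and that the default EMR roles exist, and the bucket names, release label and instance types are placeholders.

import boto3

# Hedged sketch: launch an EMR cluster that runs one Hadoop Streaming step.
# Bucket names, release label and instance types below are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="wordcount-example",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when the step finishes
    },
    Steps=[{
        "Name": "streaming wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-bucket/input/",
                "-output", "s3://my-bucket/output/",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])   # the id of the cluster that was just launched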


The AWS documentation shows a (simplified) diagram of how this works (not reproduced here): Amazon S3 (Simple Storage Service) provides the client's data storage for a Hadoop cluster.
Notice that, although it is not represented in that picture, S3 is supported by EC2 (resizable compute capacity) and EBS (Elastic Block Store).
S3, in turn, supports an Elastic MapReduce framework, which allows you to run your analytics application.

The Canonical Wordcount Example


The "Hello World" of MapReduce development is the Word Count example: reading a text file and counting the number of times each word appears in the text.
Of course this is not actual data mining of the text; it's just a "Hello World" example.

Environment: Linux; Programming Language: Python


You can (or so I've read) install Hadoop on a Windows server; this example, however, is completely Linux-based.
If you are not familiar with Linux, you'll have to take a brief tutorial or, at least, search for the meaning of the commands we are using in this example.


The mapper and the reducer are coded in Python, as shown in Glenn Lockwood's article Writing Hadoop Applications in Python with Hadoop Streaming [29].
You could use Java to code the mapper/reducer as well. In Miner and Shook's MapReduce Design Patterns [30] you can find the same example using Java.

Word Count Diagram


Glenn Lockwood's article Writing Hadoop Applications in Python with Hadoop Streaming [31] includes a diagram of the word count data flow; the picture is not reproduced here.

hadoop-streaming.jar
Our example is a MapReduce example, and Hadoop Streaming is the JAR utility that allows users to run MapReduce jobs whose mapper and reducer read standard input and write standard output. The hadoop-streaming.jar utility will control our example's execution. This file comes with the Hadoop distribution.

[29] Writing Hadoop Applications in Python with Hadoop Streaming - http://www.glennklockwood.com/di/hadoop-streaming.php
[30] MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly. Available at www.it-ebooks.info .
[31] http://www.glennklockwood.com/di/hadoop-streaming.php


So we need to use a Linux command to make Hadoop use hadoop-streaming.jar to run our MapReduce job. In this command we usually specify the mapper, the reducer, and the input and output files.
The format of the command to do this is

hadoop \
jar /Hadoop-path/in-your-install/streaming/hadoop-streaming-x.y.z.jar \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input "wordcount/your-text-file.txt" \
-output "wordcount/output"

Please notice that:
- you need to provide the actual path for the JAR file; also, x.y.z is the version of your JAR file;
- you can name your Python mapper and reducer as you wish (you could use Java classes instead);
- your Python (.py) files are regular Linux files; they should be in your working directory, as you can see by the use of the $PWD (working directory path) environment variable;
- your-text-file.txt, the specified input, is an HDFS file, not a regular Linux file; remember that we are not just using Linux, we are using Hadoop MapReduce over Linux;
- output is an HDFS directory which will be used by the Reduce step.
You can find a complete description of the Hadoop Streaming parameters in the Hadoop Wiki, at http://wiki.apache.org/hadoop/HadoopStreaming .

Mapper
Our mapper shall just read the text file and output key-value pairs.
The key part will be each word in the text; the value part will always be 1.
We can do this with the Python code below:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( "%s\t%d" % (key, value) )

Reducer
Our reducer shall take the key-value pairs and aggregate them; the output will be one key-value pair for each different word in the text, the value being the number of times that word appears in the text.
The algorithm shall be built on the assumption that the input will be sorted.
Another assumption is that the value will not always be 1; it may already have been aggregated by a Hadoop Streaming combiner/grouping step.


A possible pseudocode is

If this key is the same as the previous key,
    add this key's value to our running total.
Otherwise,
    print out the previous key's name and the running total,
    reset our running total to 0,
    add this key's value to the running total, and
    "this key" is now considered the "previous key"

And the Python code is

#!/usr/bin/env python
import sys

last_key = None
running_total = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)
    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_total) )
        running_total = value
        last_key = this_key

if last_key == this_key:
    print( "%s\t%d" % (last_key, running_total) )

Our Mapper and Reducer can run in plain Linux


We can run a Linux command to test our Mapper and our Reducer (after making the scripts executable with chmod +x mapper.py reducer.py).
We should do this just to be sure they're working as expected.
For example, running

cat linux_input.txt | ./mapper.py | sort | ./reducer.py > output.txt

or, assuming that our linux_input.txt is big and we want to use just the first 10 lines, and that we want to see the reducer's output on the screen (and hence will not redirect the standard output),

head -n10 linux_input.txt | ./mapper.py | sort | ./reducer.py

Please notice that our input file, since we are using just Linux, is a regular Linux file!
And, since we are using just Linux, there is no parallel processing, there are no multiple instances of mappers and reducers, there is no Big Data framework being used at all!
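As a quick sanity check (my own tiny example), if linux_input.txt contained only the line "big data is big", the pipeline above would print:

big     2
data    1
is      1

(The mapper emits one "word 1" pair per word, sort brings the two "big" lines together, and the reducer sums the values per word.)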


Running our Hadoop Job


In order to use Hadoop, and to be able to multi-instantiate our Map and Reduce processes as shown in the diagram The MapReduce Pipeline, the first step is copying our input data to HDFS. The commands below create an HDFS directory, wordcount, and then copy the Linux text file to an HDFS file.
The commands start with "hadoop", because they are directed to the Hadoop framework, and "dfs", meaning Distributed File System.

hadoop dfs -mkdir wordcount
hadoop dfs -copyFromLocal ./linux_input.txt wordcount/hdfs_input.txt

You can find a list of Hadoop User Commands at http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#User+Commands .
And then we can run our Hadoop job:

hadoop \
jar /Hadoop-path/in-your-install/streaming/hadoop-streaming-x.y.z.jar \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input "wordcount/hdfs_input.txt" \
-output "wordcount/output"

Using Parallel Processing


But are we using parallel processing? Are we instantiating several mappers and several reducers?
Probably not!
Hadoop, by default, instantiates one mapper for each 64 MB (the default HDFS block size) of the input file.
So, unless the input file is bigger than 64 MB, the Hadoop framework will be doing the same thing the regular Linux pipeline did: regular batch processing!
We can tell Hadoop that we want more mapper and/or reducer instances through the command line, adding the

-D mapred.map.tasks=desired-number-of-tasks

parameter. We can also use a mapred.reduce.tasks parameter.
Please notice that with these parameters we are only suggesting to Hadoop that it use, for example, 4 map tasks; several other settings can influence the actual number.
We strongly suggest that you take a look at the Hadoop Wiki page on partitioning your job into maps and reduces, at http://wiki.apache.org/hadoop/HowManyMapsAndReduces .
But, for now, as novices, we can just use mapred.map.tasks and mapred.reduce.tasks:

hadoop \
jar /Hadoop-path/in-your-install/streaming/hadoop-streaming-x.y.z.jar \
-D mapred.map.tasks=4 \
-D mapred.reduce.tasks=2 \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input "wordcount/hdfs_input.txt" \
-output "wordcount/output"

We finished our "Hello, World" demo; welcome to Big Data!


BIBLIOGRAPHY

Oracle: SQL Best For Big Data Analysis - http://www.informationweek.com/big-data/big-data-analytics/oracle-sql-best-for-big-data-analysis/a/d-id/1316272
Big Data: How To Pick Your Platform - Tech Digest, InformationWeek, September 2014 - http://www.informationweek.com/big-data/software-platforms/big-data-how-to-pick-your-platform/d/d-id/1315609 - full version at http://dc.ubm-us.com/i/378226#201-element
16 Top Big Data Analytics Platforms - http://www.informationweek.com/big-data/big-data-analytics/16-top-big-data-analytics-platforms/d/d-id/1113609
RDBMS vs. NoSQL: How do you pick? - http://www.zdnet.com/article/rdbms-vs-nosql-how-do-you-pick/
Big data means the reign of the relational database is over - http://www.itproportal.com/2014/03/07/big-data-means-the-reign-of-the-relational-database-is-over/
Relational Databases Aren't Dead. Heck, They're Not Even Sleeping - http://readwrite.com/2013/03/26/relational-databases-far-from-dead
Introduction to the Big Data Issue, by Marvin Waschke, Senior Principal Software Architect, Editor in Chief of the CA Technology Exchange, CA Technologies - http://www.ca.com/us/~/media/Files/Articles/ca-technology-exchange/introduction-to-big-data-waschke.pdf
When NoSQL Databases Are Yes Good For You And Your Company - http://readwrite.com/2013/03/25/when-nosql-databases-are-good-for-you
Big Data: Big Opportunity or a Big Mess? A Conversation with David Simon of CSC - http://www.washingtonexec.com/2014/12/big-data-big-opportunity-big-mess-conversation-david-simon-csc/
Facebook Showcasing Its Open Source Database - http://allfacebook.com/facebook-showcasing-its-open-source-database_b21560
Dynamo: Amazon's Highly Available Key-value Store - http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Enterprise NoSQL for Dummies, MarkLogic Special Edition, by Charlie Brooks
Understanding HBase and BigTable, by Jim R. Wilson - http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
BigTable: a NoSQL massively parallel table, by Paul Krzyzanowski - available in the handouts of Professor Hsu's course
Introduction to Graph Databases - Chicago Graph Database Meet-Up, by Max de Marzi - http://www.slideshare.net/maxdemarzi/introduction-to-graph-databases-12735789
Case Study: Sessionization (Social Network) - https://www.youtube.com/watch?v=hzZ3bM80pJg
HDFS Architecture Guide - http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly - available at www.it-ebooks.info
A Beginner's Guide to Hadoop, Matthew Rathbone - http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html
Big Data: Hadoop from an Infrastructure Perspective - http://blogs.cisco.com/datacenter/big-data-hadoop-from-an-infrastructure-perspective
The NoSQL Movement: Big Table Databases, by Paul Wiliams - http://www.dataversity.net/the-nosql-movement-big-table-databases/
Writing Hadoop Applications in Python with Hadoop Streaming, Glenn Lockwood - http://www.glennklockwood.com/di/hadoop-streaming.php
New SQL: An Alternative to NoSQL and Old SQL for New OLTP Apps - http://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext
Oracle Exadata Database Machine: Extreme Performance for the Cloud - https://www.oracle.com/engineered-systems/exadata/index.html

Glossary
We are listing here some terms we found in our readings, not necessarily cited in our text.

NewSQL - a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.
4Vs - Volume, Variety, Velocity, Value
5Vs - Volume, Variety, Velocity, Veracity, Value
Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. See http://sqoop.apache.org/ .
Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. See http://flume.apache.org/ .
Pig - a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs.
Mahout - an Apache project to produce free implementations of scalable machine learning algorithms, focused primarily on collaborative filtering, clustering and classification. Many implementations use the Hadoop platform.
HBase - an open source, non-relational, distributed database modeled after Google's BigTable and written in Java; an Apache project.
Hive - a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. Initially developed by Facebook; now used by Netflix and others as well. Amazon has a fork included in Amazon Elastic MapReduce.
YARN - Hadoop's resource-management layer, often described as an improved implementation of MapReduce.
ZooKeeper - Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
QlikView - a visualization tool. See www.qlik.com - Business Intelligence and Data Visualization Software.


Appendix A: BigData Solution Architecture


Felipe Renz MBA Project at Unisinos College, Brazil
The diagrams listed here belong to Mr. Felipe Renz's MBA project at Universidade do Vale do Rio dos Sinos (Sinos River Valley University), São Leopoldo campus, Rio Grande do Sul State, Brazil (2014).
They show the architecture of a Big Data solution conceived to predict failures in heavy agricultural machinery. It has not been completely implemented yet, but it shows how complex a real-world Big Data project can be.
Picture 3: Big Data Solution Architecture - Source: Sawant & Shah, 2014, page 10.


Picture 6: Proposed Big Data Architecture - Source: designed by Felipe Renz


Picture 7: Big Data Solution Data Flow - Source: designed by Felipe Renz

