Foreword
I was very interested in Big Data - is there some IT professional who is not, nowadays? - and decided to
learn about it. So I enrolled in, and I am attending, CIS 798 - Programming Techniques for Big Data
Analytics, at Kansas State University, with Professor William Hsu.
I started the course aware of some catchwords, like:
"Big data is NoSQL."
"Big data made relational databases obsolete."
Professor Hsu started the course teaching us that Big data is about Volume, Velocity, Variety, and
proceeded with a Machine Lab implementing the prototypical Word Count Big Data programming
example, using Hadoop with a Java plug-in.
Well, that makes sense, but I think Professor Hsu underestimated my lack of knowledge. I was not just
ignorant regarding the programming techniques; I did not know when, in which use cases, to apply Big
Data programming techniques.
So I thought I should take some time to learn about the context: in which use cases we could
and/or should use Big Data techniques and also - if traditional relational databases are not dead - when
we should stay with them.
Since my knowledge about Big Data (or, more properly, my lack of it) summed up to little more than the
catchword "Big Data is NoSQL", when I speak, in these initial paragraphs, about "Big Data
infrastructure", please understand that I mean anything different from relational databases.
I called myself a beginner. Indeed I am, regarding Big Data. I am familiar (proficient, I would dare to
say) with relational databases. I think these notes can be useful to anyone aiming to get a general
overview of Big Data, but please take into account that I am writing as somebody used to relational
databases and new to Big Data. I did a reasonable amount of reading about Big Data, but of
course I am still a beginner. I may have made mistakes; I very probably did. Your feedback will be very
welcome at my personal e-mail: pedrofbpereira@yahoo.com.br . Thank you very much!
Contents
Foreword
Initial Statements
For some Use Cases, Relational databases are better
Big Data does things that Relational Databases can't
    Un-Structured Data
    Complex Event Processing
    Sessionization
Volume, Velocity, Variety
    Volume
    Velocity
    Volume, Velocity - Facebook
    Variety
    NoSQL
        Schema-agnostic, but organized
        Key-Value Stores
        Column Family
        Document Databases
        Graph Databases
        Big Table and HBase
Infrastructure Characteristics
    Distributed, Parallel and Redundant - High Performance and Availability
    Super-Computing / Cloud Computing
    Hadoop Distributed File System - HDFS
    MapReduce
        The MapReduce Pipeline
    Hadoop - Simple Definition
    Hadoop - BEOCAT instance
    AMAZON WEB SERVICES - CLOUD COMPUTING
The Canonical Wordcount Example
    Environment: Linux; Programming Language: Python
    Word Count Diagram
    hadoop-streaming.jar
    Mapper
    Reducer
Initial Statements
I have read a dozen articles (see the Bibliography in these Notes), as well as Professor Hsu's handouts, and
reached, in short, the following conclusions:
Despite the fact that (understandably) some NoSQL database vendors say so, not all applications
using relational databases are worth converting to Big Data techniques; relational databases -
and data warehouses using non-normalized star schemas and cubes - are still, frequently, the
best option.
Despite the fact that some relational database vendors (understandably) state that their
products can do everything that Big Data frameworks do, they can't. They can indeed deliver
some Big Data tools. But amazing new fields - as for example Complex Event Processing - are
being pioneered thanks to Big Data techniques.
Volume, Velocity and Variety is a simple way to describe Big Data features to outsiders.
Reality is not always that simple. Actually most relational databases can scale to big data
volume (usually, at a higher cost; usually, not instantly) and can perform at high speed
(depending on the use case, faster than Big Data frameworks). Variety would be the key
feature. Id like to highlight that, as I see, this variety is more relevant in a sense of changes
in database schema (or even absence of schema) instead of variety of data sources or in data
content.
Economic differences and infrastructure differences:
o A strong point of Big Data frameworks is distributed processing and storage, which
leads to great ease of scalability and failure tolerance. Relational databases can be
distributed and grow to sizes that would grant them the right to be called "big
data" too. But since Big Data frameworks are based on standard cheap computers, and
relational databases (mostly) use expensive powerful servers, distributing
conventional relational databases is a lot more expensive. Big Data frameworks tend
to be cheaper.
o Big Data frameworks can scale instantly. Relational databases, usually, are not able to
respond instantly to explosive demand increases.
o Relational databases can be hosted in the cloud. Most Big Data frameworks are born
in the cloud. Open source frameworks - as for example Apache Hadoop - are less
expensive than cloud versions of relational databases. And big providers of
Infrastructure as a Service - as for example Amazon - allow using a safe, best-practices-managed, and highly (and instantly!) scalable platform, for an affordable cost.
Giants such as Google, Yahoo, Amazon, the government, and also academic researchers almost
certainly need to use Big Data frameworks. Smaller companies - thanks to IaaS providers - can also
use Big Data.
There is no free lunch! The increased capability of Big Data frameworks to deal with variable
schemas implies not being ACID [1] (Atomic, Consistent, Isolated, Durable). Although some NoSQL
distributors claim to be ACID, I honestly can't understand how you can be distributed, unrelated, un-locked and, at the same time, ACID. They aim to be, instead, BASE [2] (Basically
Available, Soft-state and Eventually consistent).
[1] This performance/ACID exchange is clearly stated in the "Dynamo: Amazon's Highly Available Key-value Store" article, at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf .
[2] See Enterprise NoSQL for Dummies, MarkLogic Special Edition, by Charlie Brooks.
Big Data - How to Pick your Platform - Tech Digest - Information Week - September 2014 - http://dc.ubmus.com/i/378226#201-element
Sessionization
Sessionization is the capability of analyzing clickstream-related events by grouping all the events from a
session. With this capability you can query, for example, what happened before a particular type of
event - for example, before an equipment failure. Big Data frameworks such as Hadoop have had sessionization
features for years; Oracle 12c recently released this feature. I am not aware whether the new Oracle
sessionization feature follows the pattern described in "Case Study: Sessionization (Social Network)" [4].
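To make the idea concrete, here is a purely illustrative Python sketch of sessionization: it groups clickstream events by user and starts a new session whenever the gap between two consecutive events exceeds an inactivity timeout. All names (sessionize, SESSION_TIMEOUT, the sample events) are hypothetical; real frameworks do this at a much larger scale.

from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # assumed threshold: 30 minutes of inactivity ends a session

def sessionize(events):
    """Group (user_id, timestamp, action) events into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, ts, action in events:
        by_user[user_id].append((ts, action))

    sessions = []
    for user_id, user_events in by_user.items():
        user_events.sort()                       # order each user's events by time
        current, last_ts = [], None
        for ts, action in user_events:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
                sessions.append((user_id, current))  # gap too big: close the session
                current = []
            current.append((ts, action))
            last_ts = ts
        if current:
            sessions.append((user_id, current))
    return sessions

# Example query: what happened before the "error" event in each session?
events = [("u1", 0, "view"), ("u1", 100, "click"), ("u1", 200, "error"),
          ("u1", 10000, "view"), ("u2", 50, "view")]
for user_id, session in sessionize(events):
    actions = [a for _, a in session]
    if "error" in actions:
        print(user_id, "before error:", actions[:actions.index("error")])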
Volume
OK, Big Data. How big? It is usually accepted that the big data range starts at the terabytes level.
And, indeed, it could be said that, in general, relational databases and data warehouses start to be
overwhelmed from about 10 terabytes on; so, when you reach some point near 10 terabytes, it is
better to think about alternatives - but alternatives do not necessarily mean Big Data.
https://www.youtube.com/watch?v=hzZ3bM80pJg
Oracle Exadata, for example, can provide, in a single rack, 12 terabytes of system memory and 672
terabytes of disk, with more than 400 high-performance CPU cores [5]. With two of these racks, we are
into the petabytes range!
Large volumes don't necessarily demand big data tools.
Velocity
Relational databases can be scaled to deal with a large number of transactions with good performance.
As we mentioned before, Stack Exchange, which maintains the programming Q&A site Stack Overflow -
ranked 54th in the world for traffic - uses SQL Server [6].
Of course, if you are dealing with machine-created data - from sensors, for example - that would create a
massive amount of traffic. But very few companies that are using caching and indexing properly will
have traffic on a web application that could justify moving to a Big Data platform for the sole reason of
needing more velocity.
Variety
Variety is the key word. Relational databases, due to their well-defined schemas, can guarantee data integrity
and consistency. Exactly because of those well-defined schemas, sometimes it is not easy to make real-life,
changing data fit in the schema. ETL operations are needed; sometimes, the schema needs to be
changed - which usually implies time-consuming operations and probably some downtime.
Nowadays businesses need to handle unstructured data, as for example text; flexibly structured JSON
data; geospatial data...
Variety is indeed a stronger reason for a platform change. If your data are unstructured, or flexibly
structured, there is no reason to stay with a structured approach. You should think about changing from
SQL to NoSQL.
NoSQL
It is hard to define NoSQL; the difficulty starts with the name - it says what it is not, instead of what it is!
Furthermore, Hive (a query language for Hadoop) has a syntax very similar to SQL, and the Cassandra database
provides the CQL query language, which also has similarities with SQL.
I think that is why some are defining NoSQL as "Not ONLY SQL", instead of, literally, "no SQL".
[5] Oracle Exadata Database Machine - Extreme Performance for the Cloud - https://www.oracle.com/engineered-systems/exadata/index.html
[6] Big Data - How to Pick your Platform - Tech Digest - Information Week - September 2014 - http://dc.ubmus.com/i/378226#201-element
[7] See (among other sources) "Facebook Showcasing Its Open Source Database", at http://allfacebook.com/facebook-showcasing-its-open-source-database_b21560 .
Trying to define this kind of database, the similarities we find in most NoSQL databases are that they
are schema-agnostic and that they provide access to the database through Application Programming
Interfaces (APIs).
An inference from NoSQL being schema-agnostic is that it is extremely well suited to deal with non-structured data.
Since there is no standard query language for NoSQL databases, using an API is necessary. You code a
customized program using, for example, Java or Python to access the database according to your
particular use case.
As we said, some high-level languages to query NoSQL databases have been developed - as for example
Pig and Hive. Usually, these languages' performance is not as good as when using the API.
Schema-agnostic, but organized
According to the book Enterprise NoSQL for Dummies [8], NoSQL databases are usually structured in one
of these forms:
Key-Value pairs
Document
Column Family
Graph
If we take a look at some NoSQL database implementations, we'll see that more than one logical concept
can be present in one database.
Key-Value Stores
This is a simple data model.
The main implementation is Dynamo, Amazon's key-value store.
Other implementations are Redis, Riak and Voldemort [9].
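As a purely illustrative sketch of the key-value access pattern - assuming the redis-py client is installed and a Redis server is running locally, which is not part of the course material, and with hypothetical key names - storing and retrieving opaque values by key looks like this:

import redis

# Connect to a local Redis server (hypothetical setup).
r = redis.Redis(host="localhost", port=6379)

# The database knows nothing about the structure of the value:
# it simply stores whatever string we associate with each key.
r.set("user:1000:name", "Pedro")
r.set("user:1000:last_login", "2014-11-02")

print(r.get("user:1000:name"))        # b'Pedro'
print(r.get("user:1000:last_login"))  # b'2014-11-02'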
Column Family
This model supports semi-structured data. Its prototype is BigTable, from Google.
HBase is an open source Apache implementation of this model.
Cassandra is another Apache database inspired by the BigTable model. Cassandra incorporates the Cassandra Query
Language (CQL), which has been modeled after SQL.
Document Databases
This model defines a document as a collection of key-value pairs. The database itself is a collection of
documents. MongoDB is an implementation of this model.
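As a purely illustrative sketch - assuming the pymongo driver is installed and a MongoDB server is running locally, and using hypothetical database and collection names - note that the two documents below do not need to share the same fields:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["notes_demo"]   # hypothetical database name

# Each document is a collection of key-value pairs; no fixed schema is required.
db.people.insert_one({"name": "Ada", "languages": ["Python", "Java"]})
db.people.insert_one({"name": "Bob", "city": "Manhattan, KS", "age": 42})

print(db.people.find_one({"name": "Bob"}))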
Graph Databases
This model is based on nodes and relationships. It has a high level of complexity. It stores and supports
querying linked data, covering several graph types - undirected, directed, multigraphs, hypergraphs,
etc. Each node in the database structure knows its adjacent nodes.
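The core idea - each node knows its adjacent nodes - can be sketched in a few lines of plain Python. This is an illustrative toy, not how a real graph database is implemented:

# A tiny directed graph stored as adjacency lists: each node maps to its neighbors.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": [],
}

def neighbors(node):
    """Return the nodes directly linked from the given node."""
    return graph.get(node, [])

def reachable(start):
    """Walk the links to find every node reachable from 'start'."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbors(node))
    return seen

print(neighbors("alice"))   # ['bob', 'carol']
print(reachable("alice"))   # {'alice', 'bob', 'carol'}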
[8] See Enterprise NoSQL for Dummies, MarkLogic Special Edition, by Charlie Brooks.
Introduction to Graph Databases - Chicago Graph Database Meet-Up, by Max de Marzi - http://www.slideshare.net/maxdemarzi/introduction-to-graph-databases-12735789 .
Infrastructure Characteristics
Please let me highlight that, when describing Big Data infrastructure characteristics in this topic, I have in
mind two specific implementations - Apache Hadoop and Amazon Web Services. These are two of the
most important Big Data framework implementations. Since the field is still being shaped, it is a lot
easier to describe actual implementations than to define a general model. The features we describe
below are common to these two implementations.
Let us mention that there is also a Hadoop on Google Cloud implementation. We are not going to
describe it; you can find a description at https://cloud.google.com/hadoop/what-is-hadoop .
What these Big Data frameworks have in common is that they are distributed, parallel and redundant,
resulting in high performance and availability. In order to achieve these features, the frameworks are
usually hosted on HPC clusters (also called super-computers [14]) or use cloud computing [15]. An indispensable requisite
for these frameworks is some kind of distributed file system; most use some variation of the Hadoop
Distributed File System (HDFS). A recurrent programming pattern is MapReduce.
[14] Supercomputer: A supercomputer is a computer that performs at or near the currently highest operational rate for computers. A supercomputer is typically used for scientific and engineering applications that must handle very large databases or do a great amount of computation (or both). Source: http://whatis.techtarget.com/definition/supercomputer
[15] According to Amazon (aws.amazon.com), "Cloud Computing, by definition, refers to the on-demand delivery of IT resources and applications via the Internet with pay-as-you-go pricing."
MapReduce
The expression "MapReduce" is used to refer both to a programming pattern and to several software
frameworks which implement it.
According to Donald Miner and Adam Shook [17], MapReduce is "a computing paradigm for processing
data that resides on hundreds of computers, which has been popularized recently by Google, Hadoop,
and many others."
Miner and Shook also emphasize that "The tradeoff of being confined to the MapReduce framework is
the ability to process your data with distributed computing, without having to deal with concurrency,
robustness, scale, and other common challenges."
The pattern can be, very simplistically, described as consisting of a Map phase, which converts input data
into key-value pairs, and a Reduce phase, which summarizes the key-value pairs as aggregated data.
In "A Beginner's Guide to Hadoop", Matthew Rathbone uses the diagram below to illustrate it [18]:
The MapReduce Pipeline
The Mappers are distributed across hundreds (or thousands) of cluster nodes, increasing the performance.
Actually, the input data themselves are distributed through HDFS.
The Reducers then narrow this distribution down, until finally releasing the summarized analytical data.
Comparing the HDFS diagram with the MapReduce diagram, you can clearly see that HDFS's distributed
architecture matches MapReduce's parallel processing.
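To make the pattern concrete, here is a purely illustrative, single-machine Python sketch of the Map -> group-by-key -> Reduce pipeline for word counting. The grouping step stands in for the shuffle/sort that Hadoop performs between the two phases; none of these function names come from Hadoop itself:

from itertools import groupby

def map_phase(line):
    # Emit one (word, 1) pair per word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Summarize all values seen for one key.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: every input line produces key-value pairs.
pairs = [pair for line in lines for pair in map_phase(line)]

# Shuffle/sort: group the pairs by key (Hadoop does this between Map and Reduce).
pairs.sort(key=lambda kv: kv[0])

# Reduce: one aggregated pair per distinct key.
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(reduce_phase(word, (count for _, count in group)))
# ('brown', 1) ('dog', 1) ('fox', 2) ('lazy', 1) ('quick', 1) ('the', 3)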
[17] MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly. Available at www.it-ebooks.info .
[18] A Beginner's Guide to Hadoop, Matthew Rathbone - http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html .
[19] http://www.sas.com/en_us/company-information.html
[20] http://www.sas.com/en_us/insights/big-data/hadoop.html
[21] http://hadoop.apache.org/
[22] See BeoCat, at https://www.cis.ksu.edu/beocat , for additional information.
[23] Beowulf cluster - http://en.wikipedia.org/wiki/Beowulf_cluster .
[24] A Wild Cat is Kansas State University's symbol.
[25] About BeoCat, at http://support.cis.ksu.edu/BeocatDocs (this page has been moved to http://support.beocat.cis.ksu.edu ).
Amazon EC2 - provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers and system administrators.
Amazon EBS (Elastic Block Store) - provides block level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are off-instance storage that persists independently from the life of an instance.
Databases - Amazon Web Services provides fully managed relational and NoSQL database services, as well as fully managed in-memory caching as a service and a fully managed petabyte-scale data-warehouse service. Or, you can operate your own database in the cloud on Amazon EC2 and Amazon EBS.
Storage & Content Delivery - Amazon S3 (Simple Storage Service) - provides a fully redundant data storage infrastructure for storing and retrieving any amount of data, at any time, from anywhere on the Web.
Analytics - Amazon Web Services provides cloud based analytics services to help you process
and analyze any volume of data, whether your need is for managed Hadoop clusters, real-time
streaming data, petabyte scale data warehousing, or orchestration.
Amazon EMR (Elastic MapReduce) - is a web service that enables businesses, researchers, data
analysts, and developers to easily and cost-effectively process vast amounts of data. Amazon
EMR uses Hadoop to distribute your data and processing across a resizable cluster of Amazon
EC2 instances.
The mapper and the reducer are coded in Python, as shown in Glenn Lockwood's article "Writing
Hadoop Applications in Python with Hadoop Streaming" [29].
You could use Java to code the mapper/reducer as well. In Miner and Shook's MapReduce Design
Patterns [30] you can find the same example using Java.
hadoop-streaming.jar
Our example is a MapReduce example. Hadoop Streaming is the JAR utility that allows users to run
MapReduce jobs with executables (such as our Python scripts) as the mapper and the reducer. The hadoop-streaming.jar utility will control our example's execution. This file comes
with the Hadoop distribution.
[29] http://www.glennklockwood.com/di/hadoop-streaming.php
[30] MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly. Available at www.it-ebooks.info .
[31] http://www.glennklockwood.com/di/hadoop-streaming.php
So we need to use a Linux command to make Hadoop use hadoop-streaming.jar to run our MapReduce
job. In this command we specify the mapper, the reducer, and the input and output files.
The format of the command to do this is:
hadoop \
    jar /Hadoop-path/in-your-install/streaming/hadoop-streaming-x.y.z.jar \
    -mapper "python $PWD/mapper.py" \
    -reducer "python $PWD/reducer.py" \
    -input "wordcount/your-text-file.txt" \
    -output "wordcount/output"
Mapper
Our mapper shall just read the text file and output key-value pairs.
The key part will be every word in the text; the value part will always be 1.
We can do this with the Python code below:
#!/usr/bin/env python
import sys

# Hadoop Streaming pipes the input file to our standard input, line by line.
for line in sys.stdin:
    line = line.strip()
    keys = line.split()              # split the line into words
    for key in keys:
        value = 1                    # every occurrence of a word counts as 1
        print( "%s\t%d" % (key, value) )
Reducer
Our reducer shall take the key-value pairs and aggregate them; the output will be one key-value pair for
each different word in the text, the value being the number of times that word appears in the text.
The algorithm shall be built with the assumption that the input will be sorted.
Another assumption is that the value will not always be 1; it may already have been aggregated by
the group phase embedded in Hadoop Streaming.
A possible pseudocode is
If this key is the same as the previous key,
add this key's value to our running total.
Otherwise,
print out the previous key's name and the running total,
reset our running total to 0,
add this key's value to the running total, and
"this key" is now considered the "previous key"
We can test the whole pipeline locally through a Unix pipe. Assuming that our input.txt is big, that we want to use just its
first 10 lines, and that you want to see the reducer's output on the screen (and hence you will not redirect the standard output):
head -n10 linux_input.txt | ./mapper.py | sort | ./reducer.py
Please notice that, since we are using just Linux, our input file is a regular file!
And, since we are using just Linux, there is no parallel processing, there are no multiple instances of
mappers and reducers - there is no Big Data framework being used at all!
BIBLIOGRAPHY
Oracle Exadata Database Machine - Extreme Performance for the Cloud - https://www.oracle.com/engineered-systems/exadata/index.html
Glossary
We are listing here some terms we found in our readings, not necessarily cited in our text.
NewSQL - a class of modern relational database management systems that seek to provide the same
scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads
while still maintaining the ACID guarantees of a traditional database system.
4Vs - Volume, Variety, Velocity, Value
5Vs - Volume, Variety, Velocity, Veracity, Value
Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured
datastores such as relational databases. See http://sqoop.apache.org/ .
Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving
large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is
robust and fault tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows for online analytic application. See
http://flume.apache.org/
Pig - a platform for analyzing large data sets. Consists of a high-level language for expressing data analysis
programs.
Mahout - Apache project to produce free implementations of scalable machine learning algorithms.
Focused primarily on collaborative filtering, clustering and classification. Many implementations use
the Hadoop platform.
HBase - open source, non-relational, distributed database modeled after Google's BigTable and written
in Java. Part of the Apache Hadoop project.
Hive - data warehouse infrastructure built on top of Hadoop for providing data summarization, query
and analysis. Initially developed by Facebook; now Netflix and others contribute as well. Amazon maintains a fork included
in Amazon Elastic MapReduce.
YARN - Hadoop's resource management layer, introduced in Hadoop 2; often described as an improved implementation of MapReduce (MapReduce 2).
Zookeeper - Apache ZooKeeper is an effort to develop and maintain an open-source server which
enables highly reliable distributed coordination.
QlikView - a visualization tool. See www.qlik.com - Business Intelligence and Data Visualization
Software.