Big Data Technologies

Jnaneshwar Bohara

Introduction to Big Data|JBohara


Chapter 1: Introduction to Big Data

1.1 Big Data Overview

What's Big Data?
No single definition:

Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using
on-hand database management tools or traditional data
processing applications.
Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making.
Big Data refers to datasets whose size is beyond the
ability of typical database software tools to capture,
store, manage and analyze.
Big Data Everywhere!
Lots of data is being collected and warehoused:
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks

Who's Generating Big Data?

Mobile devices (tracking all objects all the time)
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion.

How much data?
500 Million Tweets sent each day!
More than 4 Million Hours of content uploaded to YouTube every day!
3.6 Billion Instagram Likes each day.
4.3 BILLION Facebook messages posted daily!
5.75 BILLION Facebook likes every day.
40 Million Tweets shared each day!
6 BILLION daily Google Searches!

https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day/
What is Big Data?
It is a new set of approaches for analysing data sets that
were not previously accessible because they posed
challenges across one or more of the 3 Vs of Big Data:

Volume - too big: terabytes and more of credit card transactions, web usage data, system logs
Variety - too complex: truly unstructured data such as social media, customer reviews, call center records
Velocity - too fast: sensor data, live web traffic, mobile phone usage, GPS data
Big Data: 3Vs

Volume (Scale)

Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially

Exponential increase in
collected/generated data
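As a rough check of the figures above, the overall growth factor and the average yearly growth it implies can be computed directly. This is only a sketch: the function name is made up for illustration, and the constant yearly rate over the 11 years from 2009 to 2020 is a simplifying assumption, not something the source states.

```python
def growth_summary(start_zb: float, end_zb: float, years: int) -> tuple[float, float]:
    """Return (overall growth factor, implied constant yearly growth rate)."""
    factor = end_zb / start_zb
    # Assumption: growth compounds at the same rate every year.
    yearly_rate = factor ** (1 / years) - 1
    return factor, yearly_rate


factor, yearly_rate = growth_summary(0.8, 35.0, 2020 - 2009)
print(f"growth factor: {factor:.1f}x")
print(f"implied yearly growth: {yearly_rate:.0%}")
```

The factor comes out close to the quoted 44x. Under the constant-rate assumption the volume grows by roughly two fifths each year, which is what "increasing exponentially" means in practice.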

12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
2+ billion people on the Web by end of 2011
76 million smart meters in 2009, 200M by 2014
Variety (Complexity)
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data: Social Network, Semantic Web (RDF)
Streaming Data: you can only scan the data once

A single application can be generating/collecting many types of data
Big Public Data (online, weather, finance, etc.)

To extract knowledge, all these types of data need to be linked together
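Because streaming data can be scanned only once, summaries have to be updated incrementally as each item arrives instead of being computed over a stored data set. A minimal sketch using Welford's online algorithm for a running mean and variance follows. The class name and the sample readings are made up for illustration.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's online algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new value into the summary; the value is not stored."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def sensor_stream():
    """Hypothetical stream: each reading is yielded once and then gone."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)
print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, because only three numbers are kept.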

Velocity (Speed)

Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions mean missed opportunities
Examples
E-Promotions: based on your current location, your purchase history, and
what you like, send promotions right now for the store next to you
Healthcare monitoring: sensors monitor your activities and body;
any abnormal measurements require immediate reaction

Some Make it 4Vs

Harnessing Big Data

OLTP: Online Transaction Processing (DBMSs)


OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

Challenges in Handling Big Data

The Bottleneck is in technology


New architecture, algorithms, techniques are needed
Also in technical skills
Experts in using the new technology and dealing with big data

Big Data Challenges

https://blogs.wsj.com/experts/2014/03/26/six-challenges-of-big-data/
Chapter 1: Introduction to Big Data

1.2 Background of Data Analytics

Data Analytics
Big data analytics is the process of examining
large amounts of data of a variety of types.

The primary goal of big data analytics is to help
companies make better business decisions.

Analyze huge volumes of transaction data as well
as other data sources that may be left untapped
by conventional business intelligence (BI)
programs.

Data Analytics
Big data analytics can uncover:
Hidden patterns.
Unknown correlations and other useful information.

Such information can provide business benefits, such as
more effective marketing and increased revenue.


Data Analytics
Big data analytics can be done with the software
tools commonly used as part of advanced
analytics disciplines such as predictive analysis
and data mining.

But the unstructured data sources used for big
data analytics may not fit in traditional data
warehouses.

Traditional data warehouses may not be able to
handle the processing demands posed by big data.
Data Analytics
The technologies associated with big data analytics
include NoSQL databases, Hadoop and MapReduce.

These technologies form the core of an open source
software framework that supports the processing of
large data sets across clustered systems.
Challenges for big data analytics initiatives include:
a lack of internal data analytics skills,
the high cost of hiring experienced analytics professionals,
challenges in integrating Hadoop systems and data warehouses.
Data Analytics
Big Analytics delivers competitive advantage
compared to the traditional analytical model.

Big Analytics describes the efficient use of a simple
model applied to volumes of data that would be too
large for the traditional analytical environment.

Research suggests that a simple algorithm with a
large volume of data is more accurate than a
sophisticated algorithm with little data.
Data Analytics
Big Analytics supports the following objectives
for working with Big Data Analytics:

1. Avoid sampling / aggregation.
2. Reduce data movement and replication.
3. Bring the analytics as close as possible to the data.
4. Optimize computation speed.
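Objectives 2 and 3 can be illustrated with a small sketch: each partition is reduced to a compact summary where the data lives, and only those summaries are moved and merged. Every raw value is still read, so nothing is sampled. The partitions and function names are hypothetical. On a real cluster each partition would sit on a different node.

```python
from typing import Iterable


def local_summary(values: Iterable[float]) -> tuple[int, float]:
    """Runs next to the data: reduce one partition to (count, sum)."""
    count, total = 0, 0.0
    for v in values:
        count += 1
        total += v
    return count, total


def global_mean(summaries: list[tuple[int, float]]) -> float:
    """Runs centrally: combine the small summaries into one mean."""
    total_count = sum(c for c, _ in summaries)
    total_sum = sum(s for _, s in summaries)
    return total_sum / total_count


# Hypothetical partitions, standing in for data held on three nodes.
partitions = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
summaries = [local_summary(p) for p in partitions]
print(global_mean(summaries))
```

Only one (count, sum) pair per partition crosses the network, however many raw values the partition holds.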


The Process of Data Analytics
Data Analytics Process: Discovery
The knowledge discovery phase involves:
gathering data to be analyzed,
pre-processing it into a format that can be used,
consolidating it for analysis,
analyzing it to discover what it may reveal,
and interpreting it to understand the processes by
which the data was analyzed and how conclusions
were reached.
Data Analytics Process: Discovery
Acquisition
Data acquisition involves collecting or acquiring data
for analysis.
Acquisition requires access to information and a
mechanism for gathering it.
Pre-processing
Pre-processing is necessary if analytics is to yield
trustworthy, useful results.
It places the data in a standard format for analysis.
Data Analytics Process: Discovery

Integration
Integration involves consolidating data for
analysis.
Retrieving relevant data from various sources for analysis
Eliminating redundant data or clustering data to obtain a smaller
representative sample

Analysis
Searching for relationships between data items in a database, or
exploring data in search of classifications or associations.
Analysis can yield descriptions or predictions.
Based on interpretation of the results, organizations can determine
whether and how to act on them.
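The pre-processing, integration and analysis steps described above can be sketched on a toy data set. Everything here is an assumption made for illustration: the "transaction_id|items" record layout, the two overlapping sources, and the choice of pair counting as the association search.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two overlapping sources, in inconsistent formats.
source_a = ["T1| Milk, Bread ", "T2|bread,butter"]
source_b = ["t2|Bread,Butter", "T3|MILK,BREAD,BUTTER"]


def preprocess(raw: str) -> tuple[str, frozenset[str]]:
    """Pre-processing: put a raw record into one standard format."""
    tx_id, items = raw.split("|", 1)
    clean = frozenset(i.strip().lower() for i in items.split(",") if i.strip())
    return tx_id.strip().upper(), clean


def integrate(*sources: list[str]) -> dict[str, frozenset[str]]:
    """Integration: consolidate sources and drop redundant transactions."""
    merged: dict[str, frozenset[str]] = {}
    for source in sources:
        for raw in source:
            tx_id, items = preprocess(raw)
            merged.setdefault(tx_id, items)  # keep the first copy of each id
    return merged


def analyze(transactions: dict[str, frozenset[str]]) -> Counter:
    """Analysis: count how often each pair of items occurs together."""
    pairs: Counter = Counter()
    for items in transactions.values():
        pairs.update(combinations(sorted(items), 2))
    return pairs


transactions = integrate(source_a, source_b)
pair_counts = analyze(transactions)
print(pair_counts.most_common(3))
```

The pair counts are a descriptive result. Whether a frequent pair justifies any action is left to the interpretation step.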
Data Analytics Process: Discovery

Interpretation
Analytic processes are reviewed by data scientists
to understand results and how they were
determined.
Interpretation involves retracing methods,
understanding choices made throughout the
process and critically examining the quality of the
analysis.
It provides the foundation for decisions about
whether analytic outcomes are trustworthy.
Data Analytics Process: Application
Application
Associations discovered amongst data in the
knowledge phase of the analytic process are
incorporated into an algorithm and applied.
In the application phase organizations gather the
benefits of knowledge discovery.
Through application of derived algorithms,
organizations make determinations upon which
they can act.
A Brief History of Big Data

c. 18,000 BCE
Humans use tally sticks to record data for
the first time. These are used to track
trading activity and record inventory.
c. 2400 BCE
The abacus is developed, and the first
libraries are built in Babylonia.
300 BCE to 48 AD
The Library of Alexandria is the world's largest
data storage center until it is destroyed by
the Romans.
A Brief History of Big Data

100 AD to 200 AD
The Antikythera Mechanism, the first mechanical
computer, is developed in Greece.
1663
John Graunt conducts the first recorded statistical
analysis experiments in an attempt to curb the
spread of the bubonic plague in Europe.
1865
The term "business intelligence" is used by Richard
Millar Devens in his Encyclopedia of Commercial
and Business Anecdotes.
A Brief History of Big Data

1881
Herman Hollerith creates the Hollerith Tabulating
Machine which uses punch cards to vastly reduce
the workload of the US Census.
1926
Nikola Tesla predicts that in the future, a man will
be able to access and analyze vast amounts of data
using a device small enough to fit in his pocket.

A Brief History of Big Data

1944
Fremont Rider speculates that Yale Library will
contain 200 million books stored on 6,000 miles of
shelves, by 2040.
1958
Hans Peter Luhn defines Business Intelligence as the
ability to apprehend the interrelationships of
presented facts in such a way as to guide action
towards a desired goal.
1965
The US Government plans the world's first data center
to store 742 million tax returns and 175 million sets of
fingerprints on magnetic tape.

A Brief History of Big Data

1970
The relational database model is developed by IBM mathematician
Edgar F. Codd. The hierarchical file system allows records to be
accessed using a simple index system. This means anyone can
use databases, not just computer scientists.
1976
Material Requirements Planning (MRP) systems are commonly
used in business. Computers and data storage are used for
everyday routine tasks.
1989
Early use of the term "Big Data" in a magazine article by fiction author
Erik Larson, commenting on advertisers' use of data to target
customers.

A Brief History of Big Data

1991
The birth of the World Wide Web. Anyone can now go
online and upload their own data, or analyze data
uploaded by other people.
1996
The price of digital storage falls to the point
where it is more cost-effective than paper.
1997
Google launches its search engine, which will
quickly become the most popular in the world.

Michael Lesk estimates the digital universe is
increasing tenfold in size every year.
A Brief History of Big Data

1999
First use of the term "Big Data" in an academic paper: Visually
Exploring Gigabyte Datasets in Real-time (ACM).

First use of the term "Internet of Things", in a business presentation
by Kevin Ashton to Procter and Gamble.
2001
Three Vs of Big Data (Volume, Velocity, Variety) defined by
Doug Laney.
2005
Hadoop, an open source Big Data framework now developed
by Apache, is created.

The birth of Web 2.0, the user-generated web.

A Brief History of Big Data

2008
Globally 9.57 zettabytes (9.57 trillion gigabytes) of information is
processed by the world's CPUs.

An estimated 14.7 exabytes of new information is produced this year.
2009
The average US company with over 1,000 employees is storing more
than 200 terabytes of data, according to the report "Big Data: The
Next Frontier for Innovation, Competition and Productivity" by
McKinsey Global Institute.
2010
Eric Schmidt, executive chairman of Google, tells a conference that
as much data is now being created every two days, as was created
from the beginning of human civilization to the year 2003.

A Brief History of Big Data

2011
The McKinsey report states that by 2018 the US will
face a shortfall of between 140,000 and 190,000
professional data scientists, and warns that issues
including privacy, security and intellectual property will
have to be resolved before the full value of Big Data
will be realized.
2014
Mobile internet use overtakes desktop for the first
time.
88% of executives responding to an international
survey by GE say that big data analysis is a top priority

Chapter 1: Introduction to Big Data

1.3 Role of Distributed System in Big Data

What is a Distributed System?
A distributed system consists of a collection of
autonomous computers, connected through a
network and distribution middleware, which
enables computers to coordinate their
activities and to share the resources of the
system, so that users perceive the system as a
single, integrated computing facility.

Big data is distributed data
Big data is distributed data: the data is so
massive that it cannot be stored or processed by a
single node.
The way to scale fast and affordably is to use
commodity hardware to distribute the storage
and processing of our massive data streams
across several nodes, adding and removing
nodes as needed.
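One simple way to spread records across several nodes is to hash each record's key and use the result to pick a node. The sketch below is illustrative only: the node names, the key format and the modulo placement rule are assumptions, not how any particular system assigns data.

```python
import hashlib


def node_for(key: str, nodes: list[str]) -> str:
    """Pick a node from a stable hash of the key (unlike built-in hash(),
    which can change between Python processes for strings)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribute(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Assign every key to exactly one node."""
    placement: dict[str, list[str]] = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


keys = [f"record-{i}" for i in range(1000)]
three = distribute(keys, ["node-1", "node-2", "node-3"])
four = distribute(keys, ["node-1", "node-2", "node-3", "node-4"])  # one node added
for name, held in four.items():
    print(name, len(held))
```

With plain modulo placement, adding or removing a node changes the target of many keys. Consistent hashing is a common technique for limiting how much data has to move.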

Distributed data generation is fueling
big data growth
The reason we have data problems so big that
we need large-scale distributed computing
architecture to solve them is that the creation of the
data is also large-scale and distributed.
Most of us walk around carrying devices that
are constantly pulsing all sorts of data into the
cloud and beyond: our locations, our photos,
our tweets, our status updates, our
connections, even our heartbeats.

Hadoop and MapReduce
Hadoop: An open source platform for
consolidating, combining and understanding
large-scale data in order to make better
business decisions.
Two key parts of Hadoop:
HDFS (Hadoop Distributed File System), which lets
you store data across multiple nodes.
MapReduce, which lets you process data in parallel
across multiple nodes.
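The MapReduce idea can be imitated in a single Python process with a word count. This sketch does not use Hadoop's API. The function names and the two input splits are made up, and the splits are processed one after another here, where a cluster would process them in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input splits, standing in for file blocks stored on two nodes.
splits = [["big data is big"], ["data is distributed data"]]

mapped = [pair for split in splits for line in split for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Each map call sees only its own line and each reduce call sees only one key, which is why both phases can be spread across many nodes.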
Chapter 1: Introduction to Big Data

1.4 Role of Data Scientist

Data Science
Data science, also known as data-driven science, is an
interdisciplinary field about scientific methods,
processes and systems to extract knowledge or insights
from data in various forms, either structured or
unstructured, similar to data mining.
It employs techniques and theories drawn from many
fields within the broad areas of mathematics, statistics,
information science, and computer science, in
particular from the subdomains of machine learning,
classification, cluster analysis, data mining, databases,
and visualization.

Data Science

Data Scientist
A high-ranking professional with the training and
curiosity to make discoveries in the world of big
data.
The people who understand how to fish out
answers to important business questions from
today's tsunami of unstructured information.
The term was coined in 2008 by D.J. Patil and Jeff
Hammerbacher.
A hybrid of data hacker, analyst,
communicator, and trusted adviser. The
combination is extremely powerful and rare.
Data Scientist
The sudden appearance of data scientists on the
business scene reflects the fact that companies
are now wrestling with information that comes in
varieties and volumes never encountered before.
If the organization stores multiple petabytes of
data, if the information most critical to the
business resides in forms other than rows and
columns of numbers, or if answering the biggest
question would involve a mashup of several
analytical efforts, it has got a big data
opportunity.

Data Scientist

Data scientist: a brand new profession
Data Scientist: The Sexiest Job of the 21st Century
[Harvard Business Review 2013]
Data scientist? A guide to 2015's hottest
profession [Mashable 2015]
It's official: data scientist is the best job in
America [Forbes, 2016]
"This hot new field promises to revolutionize
industries from business to government, health
care to academia."
The New York Times

Successful Data Scientist
Characteristics
Intellectual curiosity, intuition
  Find a needle in a haystack (something that is difficult to locate in a
  much larger space)
  Ask the right questions: value to the business
Communication and engagement
  Presentation skills
  Let the data speak, but tell a story
  Storyteller: drive business value, not just data insights
Creativity
  Guide further investigation
Business savvy
  Discovering patterns that identify risks and opportunities
  Measure

Skills of data scientists
Role/Skill of Data Scientist
A data scientist should have the skill set to:
use technologies that make taming big data possible, including Hadoop (the most widely used framework for distributed file system processing) and related open-source tools, cloud computing, and data visualization
make discoveries while swimming in a pool of data
bring structure to large quantities of formless data and make analysis possible
identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set
Role/Skill of Data Scientist
A data scientist should have the skill set to:
communicate what they've learned and suggest its implications for new business directions
be creative in displaying information visually and making the patterns they find clear and compelling
fashion their own tools and even conduct academic-style research
write code
go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested

Data Scientist Job Description
Amazon's Shopper Marketing & Insights team focuses
on serving the advertisers and our overall ad business
to provide strategic media planning, customer insights,
targeting recommendations, and measurement and
optimization of advertising.
We are hiring outstanding Data Scientists who will use
innovative statistical and machine learning approaches
to drive advertising optimization and contribute to the
creation of scalable insights. The ideal candidate
should have one hand on the white-board writing
equations and one hand on the keyboard writing code.

Data Scientist
A recent study by the McKinsey Global Institute concludes,
"a shortage of the analytical and managerial talent
necessary to make the most of Big Data is a significant and
pressing challenge (for the U.S.)."
The report estimates that there will be four to five million
jobs in the U.S. requiring data analysis skills by 2018, and
that large numbers of positions will only be filled through
training or retraining. The authors also project a need for
1.5 million more managers and analysts with deep
analytical and technical skills "who can ask the right
questions and consume the results of analysis of big data
effectively."
https://datascience.berkeley.edu/about/what-is-data-science/

https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm

Chapter 1: Introduction to Big Data

1.5 Current Trend in Big Data Analytics

Top 10 Big Data Trends 2017
1. Big data becomes fast and approachable
2. Big data no longer just Hadoop
3. Organizations leverage data lakes from the get-go to drive value
4. Architectures mature to reject one-size-fits-all frameworks
5. Variety, not volume or velocity, drives big-data investments
6. Spark and machine learning light up big data
7. Convergence of IoT, cloud, and big data creates new opportunities for self-service analytics
8. Self-service data prep becomes mainstream as end users begin to shape big data
9. Big data grows up: Hadoop adds to enterprise standards
10. Rise of metadata catalogs helps people find analysis-worthy big data
https://www.tableau.com/resource/top-10-big-data-trends-2017#1H9I1K2oDuvrgFhr.99