
Introduction to Big Data & Basic Data Analysis


Basic Concepts in Big Data
What is big data?
"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (Gartner 2012)
Complicated (intelligent) analysis of data may make a small data set appear to be big
Bottom line: Any data that exceeds our current capability of processing can be regarded as big
Why is big data a big deal?
Government
Obama administration announced big data initiative
Many different big data programs launched
Private Sector
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data
Facebook handles 40 billion photos from its user base.
Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide
Science
Large Synoptic Survey Telescope will generate 140 terabytes of data every 5 days.
Biomedical computation like decoding the human genome & personalized medicine
Social science revolution
Lifecycle of Data: 4 As
[Figure: cycle linking the four stages Acquisition, Aggregation, Analysis and Application; the arrows are labeled Scattered Data, Integrated Data, Knowledge and Logged Data]
Computational View of Big Data

[Figure: layered view with the labels Data Visualization; Data Access, Data Analysis; Data Understanding, Data Integration; Formatting, Cleaning; Storage, Data]
Big Data & Related Topics/Courses
CS199
[Figure: the layers Data Visualization; Data Access, Data Analysis; Data Understanding, Data Integration; Formatting, Cleaning; Storage, Data, annotated with related topics/courses]
Human-Computer Interaction
Machine Learning
Databases
Information Retrieval
Data Mining
Computer Vision
Speech Recognition
Natural Language Processing
Data Warehousing
Signal Processing
Information Theory
Many Applications!
Some Data Analysis Techniques

Visualization
Classification
Predictive Modeling
Time Series
Clustering
Big Data Everywhere!

Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
How much data?
Google processes 20 PB a day (2008)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)

"640K ought to be enough for anybody."
The Earthscope
The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. [...] much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Type of Data
Relational Data
(Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), ...
What to do with these data?
Aggregation and Statistics
Data warehouse and OLAP
Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
OLAP and Data Mining
Warehouse Architecture
[Figure: warehouse architecture, from top to bottom: Clients; Query & Analysis; Warehouse with Metadata; Integration; multiple Sources]
Star Schemas

A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales. Often insert-only.
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Terms
Fact table: sale (orderId, date, custId, prodId, storeId, qty, amt)
Measures: qty, amt
Dimension tables:
product (prodId, name, price)
customer (custId, name, address, city)
store (storeId, city)
Star
product   prodId  name  price
          p1      bolt  10
          p2      nut   5

store     storeId  city
          c1       nyc
          c2       sfo
          c3       la

sale      orderId  date    custId  prodId  storeId  qty  amt
          o100     1/7/97  53      p1      c1       1    12
          o102     2/7/97  53      p2      c1       2    11
          105      3/8/97  111     p1      c3       5    50

customer  custId  name   address    city
          53      joe    10 main    sfo
          81      fred   12 main    sfo
          111     sally  80 willow  la
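The star schema on this slide can be tried out directly. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for a real warehouse engine; the column types and the foreign-key declarations are assumptions, while the table names, column names and rows are the ones shown on the slide (with the order id column spelled orderId).

```python
import sqlite3

# In-memory database standing in for the warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
-- Fact table: one row per sale, pointing at the three dimension tables.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
""")

conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# A typical star join: start from the fact table and look up each dimension.
rows = conn.execute("""
    SELECT s.orderId, c.name, p.name, st.city, s.amt
    FROM sale s
    JOIN customer c ON c.custId  = s.custId
    JOIN product  p ON p.prodId  = s.prodId
    JOIN store    st ON st.storeId = s.storeId
    ORDER BY s.orderId
""").fetchall()

for row in rows:
    print(row)
```

The fact table carries only keys and measures; descriptive attributes such as the customer name or the store city are fetched from the dimension tables at query time.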

Cube

Fact table view:
sale  prodId  storeId  amt
      p1      c1       12
      p2      c1       11
      p1      c3       50
      p2      c2       8

Multi-dimensional cube:
      c1  c2  c3
p1    12      50
p2    11  8

dimensions = 2
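The step from the fact table view to the cube view is a pivot. A minimal sketch in plain Python, using the four rows from this slide; the variable names are illustrative only:

```python
from collections import defaultdict

# Fact table rows from the slide: (prodId, storeId, amt).
fact = [
    ("p1", "c1", 12),
    ("p2", "c1", 11),
    ("p1", "c3", 50),
    ("p2", "c2", 8),
]

# cube[prodId][storeId] -> amt; cells with no sale are simply absent.
cube = defaultdict(dict)
for prod, store, amt in fact:
    cube[prod][store] = cube[prod].get(store, 0) + amt

# Print one row per product and one column per store, None for empty cells.
stores = sorted({store for _, store, _ in fact})
print("  ", stores)
for prod in sorted(cube):
    print(prod, [cube[prod].get(store) for store in stores])
```

Each dimension of the cube (here product and store) becomes one level of indexing, and the measure sits in the cell.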

3-D Cube

Fact table view:
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube:
day 1     c1  c2  c3
      p1  12      50
      p2  11  8
day 2     c1  c2  c3
      p1  44  4
      p2

dimensions = 3

ROLAP vs. MOLAP
ROLAP: Relational On-Line Analytical Processing
MOLAP: Multi-Dimensional On-Line Analytical Processing

Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4
Result: 81

Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans   date  sum
      1     81
      2     48
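Both aggregate queries can be checked against the slide's numbers. A minimal sketch, assuming Python's built-in sqlite3 module and integer column types (the slides do not say which engine or types are used):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Add up amounts for day 1.
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day; ORDER BY is added only to make the output order stable.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)  # 81
print(by_day)      # [(1, 81), (2, 48)]
```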

Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Result:
sale  prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48

rollup: from the detailed table to the aggregated table
drill-down: from the aggregated table back to the detailed table
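Rollup can be illustrated by aggregating in two steps: first by day and product, then by day alone. A minimal sketch, assuming Python's built-in sqlite3 module; the second step is done in plain Python to make the point that a coarser aggregate can be computed from a finer one without going back to the fact table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Finer level: amounts by day and product (the store dimension is summed away).
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Rollup: collapse the product dimension as well, leaving totals by day.
by_day = {}
for _prod, day, total in by_day_product:
    by_day[day] = by_day.get(day, 0) + total

print(by_day_product)  # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
print(by_day)          # {1: 81, 2: 48}
```

Drill-down is the opposite direction and does need the finer data: the totals by day cannot be split back into products without the by-day-and-product table or the fact table.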

Aggregates
Operators: sum, count, max, min, median, avg
Having clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)
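The HAVING clause and the dimension hierarchy can be sketched as follows, assuming Python's built-in sqlite3 module. The region column and its east/west values are hypothetical additions to the store dimension, invented here only to give the hierarchy a level above store; the sales rows and store cities are the ones used on the earlier slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale  (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
-- region is a hypothetical attribute, not part of the slides' store table.
CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT, region TEXT);
""")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)
conn.executemany(
    "INSERT INTO store VALUES (?, ?, ?)",
    [("c1", "nyc", "east"), ("c2", "sfo", "west"), ("c3", "la", "west")],
)

# HAVING filters groups after aggregation: keep stores whose total exceeds 20.
big_stores = conn.execute(
    "SELECT storeId, sum(amt) FROM sale "
    "GROUP BY storeId HAVING sum(amt) > 20 ORDER BY storeId"
).fetchall()

# Dimension hierarchy: stores roll up into regions, so average by region.
avg_by_region = conn.execute(
    "SELECT st.region, avg(s.amt) FROM sale s "
    "JOIN store st ON st.storeId = s.storeId "
    "GROUP BY st.region ORDER BY st.region"
).fetchall()

print(big_stores)     # [('c1', 67), ('c3', 50)]
print(avg_by_region)  # east averages 67/3, west averages 62/3
```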

What is Data Mining?

Discovery of useful, possibly unexpected, patterns in data
Extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
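The train/test workflow described above can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the records (a single numeric attribute and a yes/no class) are invented, and the model is a one-threshold rule (a "decision stump"), chosen only because it is the simplest model that can be fitted and validated.

```python
def predict(threshold, x):
    """Model: the class is 'yes' when the attribute exceeds the threshold."""
    return "yes" if x > threshold else "no"


def accuracy(threshold, records):
    """Fraction of records whose predicted class matches the recorded class."""
    hits = sum(predict(threshold, x) == label for x, label in records)
    return hits / len(records)


def fit_stump(train):
    """Build the model: pick the threshold with the best training accuracy."""
    values = sorted(x for x, _ in train)
    # Candidate thresholds are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, train))


# Hypothetical records: (attribute value, class).
train = [(20, "no"), (25, "no"), (30, "no"), (45, "yes"), (50, "yes"), (60, "yes")]
test = [(22, "no"), (35, "no"), (55, "yes"), (36, "yes")]

threshold = fit_stump(train)               # model is built on the training set only
test_accuracy = accuracy(threshold, test)  # and validated on unseen records

print(threshold)      # 37.5
print(test_accuracy)  # 0.75
```

The model fits the training set perfectly but misclassifies one test record, which is why accuracy is reported on the test set rather than on the data used to build the model.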
Decision Trees
Example:
Conducted a survey to see which customers were interested in a new model car
Want to select customers for an advertising campaign
[Figure: training set]
Clustering

[Figure: plot with axes income, education and age]
K-Means Clustering
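The slide gives only the name of the method. As a rough sketch of the standard k-means procedure (Lloyd's algorithm), the plain-Python code below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sample points, the choice of k = 2 and the use of the first k points as starting centroids are all assumptions made for illustration.

```python
def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        clusters[nearest].append(p)
    return clusters


def mean(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(members) for coord in zip(*members))


def kmeans(points, k, max_iter=100):
    # Deterministic start for the example: the first k points are the centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical 2-D records forming two obvious groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)

print(centroids)
print(clusters)
```

Real implementations usually pick the starting centroids at random (or with a seeding scheme) and rerun several times, since the result depends on the starting point.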

Association Rule Mining
sales records (market-basket data): transaction id, customer id, products bought

Trend: Products p5, p8 often bought together
Trend: Customer 12 likes product p9
Association Rule Discovery
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, ...} --> {Potato Chips}
Potato Chips as consequent => can be used to determine what should be done to boost its sales.
Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
Supermarket shelf management.
Inventory Management
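How strongly a rule such as {Bagels} --> {Potato Chips} holds is usually expressed with two standard measures that the slides do not spell out: support (the fraction of baskets containing all items of the rule) and confidence (among baskets containing the antecedent, the fraction that also contain the consequent). A minimal sketch in plain Python, with invented baskets:

```python
# Hypothetical market-basket data: one set of products per transaction.
baskets = [
    {"bagels", "potato chips", "milk"},
    {"bagels", "potato chips"},
    {"bagels", "milk"},
    {"potato chips", "soda"},
    {"bagels", "potato chips", "soda"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets)


def support(antecedent, consequent, baskets):
    """Fraction of all baskets containing antecedent and consequent together."""
    return count(antecedent | consequent, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets with the antecedent, fraction that also have the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)


rule_support = support({"bagels"}, {"potato chips"}, baskets)
rule_confidence = confidence({"bagels"}, {"potato chips"}, baskets)

print(rule_support)     # 0.6
print(rule_confidence)  # 0.75
```

Association rule miners typically report only rules whose support and confidence exceed user-chosen thresholds.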
Other Types of Mining
Text mining: application of data mining to textual documents
cluster Web pages to find related pages
cluster pages a user has visited to organize their visit history
classify Web pages automatically into a Web directory
Graph Mining:
Deals with graph data
The Meaning of Big Data - 3 Vs
Big Volume
With simple (SQL) analytics
With complex (non-SQL) analytics

Big Velocity
Drink from the fire hose

Big Variety
Large number of diverse data sources to integrate
The Participants

Row storage and row executor
Microsoft Madison, DB2, Netezza, Oracle(!)

Column store grafted onto a row executor (wannabes)
Teradata/Asterdata, EMC/Greenplum

Column store and column executor
HP/Vertica, Sybase/IQ, Paraccel

Oracle Exadata is not:
a column store
a scalable shared-nothing architecture
Hadoop

Simple analytics
X100 times a parallel DBMS
Complex analytics (Mahout or roll-your-own)
X100 times ScaLAPACK
Parallel programming
Parallel grep (great)
Everything else (awful)
Hadoop lacks
Stateful computations
Point-to-point communication
Big Velocity
Sensor tagging everything of value sends velocity through the roof
E.g. car insurance
Smart phones as a mobile platform send velocity through the roof
State of multi-player internet games must be recorded, which sends velocity through the roof
New OLTP
You need to ingest a fire hose in real-time
You need to perform high volume OLTP
You often need real-time analytics
VoltDB: an example of New SQL
A main memory SQL engine
Open source
Shared nothing, Linux, TCP/IP on jelly beans
Light-weight transactions
Run-to-completion with no locking
Single-threaded
Multi-core by splitting main memory
About 100x RDBMS on TPC-C
Big Variety
Typical enterprise has 5000 operational systems
Only a few get into the data warehouse
What about the rest?
And what about all the rest of your data?
Spreadsheets
Access databases
Web pages
And public data from the web?
The World of Data Integration
[Figure: regions labeled data warehouse, enterprise text, and the rest of your data]
Summary
The rest of your data (public and private)
is a treasure trove of incredibly valuable information
Largely untapped
IoT Meets Big Data

Big Data Value Chain
Collection -> Ingestion -> Discovery & Cleansing -> Integration -> Analysis -> Delivery

Collection: structured, unstructured and semi-structured data from multiple sources
Ingestion: loading vast amounts of data onto a single data store
Discovery & Cleansing: understanding format and content; clean-up and formatting
Integration: linking, entity extraction, entity resolution, indexing and data fusion
Analysis: intelligence, statistics, predictive and text analytics, machine learning
Delivery: querying, visualization, real-time delivery on enterprise-class availability

Need for Standardized Approaches At Each Step
Source: O'Reilly Strata 2012
Considerations for Big Data Standardization

Variety of Use Cases
Mobility
Security & Privacy
Lifecycle Management & Data Quality
System Management & Other Issues
Data Characteristics
Distributed / Centralized
The 4 Vs: Volume, Velocity, Variety, Veracity
Data Collection
Data Visualization
Data Quality
Data Analytics & Action
Data Sources
Source: Sensors, Applications, Software agents, Individuals, Organizations, Hardware resources
Any*: Anytime, Anything, Any Device, Any Context, Any Place, Anywhere, Anyone
Big Data Standardization Challenges (1)
Big Data use cases, definitions, vocabulary and reference architectures (e.g. system, data, platforms, online/offline)
Specifications and standardization of metadata including data provenance
Application models (e.g. batch, streaming)
Query languages including non-relational queries to support diverse data types (XML, RDF, JSON, multimedia) and Big Data operations (e.g. matrix operations)
Domain-specific languages
Semantics of eventual consistency
Advanced network protocols for efficient data transfer
General and domain-specific ontologies and taxonomies for describing data semantics including interoperation between ontologies

Source: ISO
Big Data Standardization Challenges (2)
Big Data security and privacy access controls
Remote, distributed, and federated analytics (taking the analytics to the data) including data and processing resource discovery and data mining
Data sharing and exchange
Data storage, e.g. memory storage system, distributed file system, data warehouse, etc.
Human consumption of the results of big data analysis (e.g. visualization)
Interface between relational (SQL) and non-relational (NoSQL)
Big Data Quality and Veracity description and management
Source: ISO
Big Data Seminar Report with ppt and pdf
The Structure of Big Data
Structured
Most traditional data sources
Semi-structured
Many sources of big data
Unstructured
Video data, audio data
Benefits of Big Data
Big Data is already an important part of the $64 billion
database and data analytics market
It offers commercial opportunities of a comparable
Sekhar Kondepudi
sekhar.kondepudi@nus.edu.sg
www.kondepudi-group.info
M : +65 98566472

