
1. When does data become big data?

a) Volume - the sheer amount of data
b) Velocity - the frequency at which data is generated
c) Variety - the different types and formats of data in the data set

2. Architectural Components
a) Parallelism
b) Hadoop - scalable data storage and batch processing framework;
ingests, processes and aggregates external data
c) HDFS - Hadoop Distributed File System; the distributed storage layer
(see the write sketch after this list)
d) YARN - Yet Another Resource Negotiator; framework for scheduling and
execution of data processing
e) Types of Jobs
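
Since HDFS is only named in passing above, here is a minimal sketch (not
from the notes; the cluster URI and path are hypothetical) of writing a
file through Hadoop's Java FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; HDFS splits the file into blocks
        // and replicates them across data nodes.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeUTF("hello hdfs");
        }
    }
}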
3. Big Data Job types
A) Map Reduce-
Map phase - user-defined map tasks run as parallel processes on different
nodes
Reduce phase - the output of the map phase is aggregated
intermediate steps - Partition, Shuffle, Sort

B) Hive
Data warehousing infrastructure
HQL (Hive Query Language)
Query unstructured data
Provides summarization, analysis, ad hoc querying (see the JDBC sketch
after this list)

C) Pig Latin
Queries
Data manipulation
Combines SQL-like operations with MapReduce execution.
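
As a concrete illustration of ad hoc querying with HQL, a hedged sketch
using Hive's JDBC interface (the host and the "logs" table are
hypothetical; 10000 is HiveServer2's default port):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hive compiles the SQL-like query into MapReduce jobs over HDFS data.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}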

***********************************************************************

MODULE 2

1. Database Architectures - peanuts case study

Flat Systems
Relational Systems
NoSQL

2. Relational Systems
Attribute - a column in the table
Tuple - a row in the table
Relation - the table itself
View - a requested result set

3. NoSQL
designed specifically for distributed environments
stores data in documents
common document format is JSON
the structure of tables can be changed seamlessly, unlike in a relational DB
Types:
key-value stores - highly effective for high-velocity workloads
document stores
wide column stores
graph stores
4. Data Warehousing: OLTP vs OLAP

                      OLTP                        OLAP
1. Nature             constant transactions       periodic large updates,
                      (queries/updates)           complex queries
2. Types              operational data            consolidated data
3. Data retention     short term (2-6 months)     long term (2-5 years)
4. Storage            GB                          TB/PB
5. Users              many                        few
6. Protection         robust, constant data       periodic protection
                      protection and fault
                      tolerance

8. Data Warehousing

Source data is loaded through ETL into the data warehouse (metadata, raw
data, summary data).
The data stored there can be used for OLAP, reporting and data mining.

9. Data Mart
a smaller data warehouse

     Data Warehouse             Data Mart
     enterprise-wide data       department-wide data
     multiple subject areas     single subject area
     complex                    less complex

10. Benefits of Data Warehousing

Consolidation
Isolation
History
Consistency
Performance
Value

***********************************************************************

MODULE 3

1. Hadoop Architecture Overview

                            HADOOP 1.X        HADOOP 2.X (2013)
Data Processing             MapReduce v1      MapReduce v2
Resource Management         MapReduce v1      YARN
Distributed Data Storage    HDFS              HDFS
2. Hadoop 2.x architecture - data processing
1. Application jobs are submitted to the YARN Resource Manager, which runs
on the master (name) node. The Resource Manager is responsible for
accepting job submissions, coordinating the allocation of compute
resources in the Hadoop cluster, and deciding which data node the
Application Master should be launched on.
2. Node Managers exist on each node and are responsible for launching and
monitoring compute containers. They manage the life cycle of containers
and report container resource usage back to the Resource Manager.
3. A container is a dedicated unit of resources (CPU, memory) for
applications to execute in.
4. The Resource Manager decides which node to launch the Application
Master on, based on container availability.
5. The Node Manager is also responsible for launching the Application
Master, which is responsible for the entire execution of an application
job. There is only one Application Master per application job.
6. When the job is complete, the output is sent back to the Resource
Manager, which frees up the resources on the node. If an application job
fails, it is the responsibility of the Application Master to restart it.
(a minimal client-side sketch of step 1 follows)
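
To make step 1 concrete, a minimal client-side sketch (assuming a
reachable Resource Manager and default configuration). It only asks the
Resource Manager for a new application id; a real submission would also
build a ContainerLaunchContext for its Application Master:

import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The Resource Manager accepts the submission and hands back an id;
        // the Application Master for this id is then launched on a node
        // chosen by the Resource Manager based on container availability.
        YarnClientApplication app = yarnClient.createApplication();
        System.out.println("New application id: "
                + app.getNewApplicationResponse().getApplicationId());

        yarnClient.stop();
    }
}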

3. Map Reduce
parallel processing framework for distributed data processing.

4. Why Map Reduce?

Provides:
Scale-out architecture
Processing of structured and unstructured data
Runs on large clusters of cheap commodity hardware
Fault tolerance
Optimised scheduling
Flexibility for developers
Interoperates with tools like Hive and Pig
Computes data locally instead of moving it over the network

5. Map Reduce Phases

Map Phase
operates on key/value pairs
accepts key/value pairs as input
produces key/value pairs as output

Shuffle Phase
performs the sort
transfers map outputs to the reducers as inputs

Reduce Phase
aggregates the values
produces one output element for each input list
(see the word-count sketch below)
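
A minimal word-count sketch of the key/value contract across these phases
(class names are my own): the mapper emits (word, 1) pairs, the framework
shuffles and sorts them by key, and the reducer sums each word's list of
counts.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE); // map output: (word, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get(); // aggregate the shuffled values for this key
            }
            context.write(word, new IntWritable(sum)); // one output per key
        }
    }
}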

6. Map Reduce Joins

combines tables or datasets

Three types:
Map-Side Join
    performed in the mapper
    faster, but has constraints:
        inputs sorted by the same key
        equal number of partitions
        all records for the same key in the same partition
Reduce-Side Join
    performed in the reducers
    fewer constraints - input need not be structured
    less efficient - both data sets must go through the shuffle
Distributed Cache
    used for map-side joins
    copies small files into memory on the nodes before the join
    more efficient; eliminates the need for a reducer
    (see the sketch below)
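
A hedged sketch of a map-side join using the distributed cache (the
users file of "id,name" lines and the "userId,orderDetails" records are
hypothetical). The small table is localized on every node and loaded
into memory in setup(), so each map record is joined without a shuffle
or a reducer:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> users = new HashMap<>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Files added with job.addCacheFile(...) are copied to each node
        // and, by default, linked into the task's working directory under
        // their base name.
        for (URI cached : context.getCacheFiles()) {
            String localName = new Path(cached.getPath()).getName();
            try (BufferedReader in = new BufferedReader(new FileReader(localName))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split(",", 2); // "id,name"
                    users.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text order, Context context)
            throws IOException, InterruptedException {
        String[] parts = order.toString().split(",", 2); // "userId,orderDetails"
        String name = users.get(parts[0]);
        if (name != null) {
            // The join happens here in the mapper; no reducer is required.
            context.write(new Text(parts[0]), new Text(name + "," + parts[1]));
        }
    }
}
// Driver side (sketch): job.addCacheFile(new URI("/data/users.txt"));
//                       job.setNumReduceTasks(0);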

7. Map Reduce Combiner

optional mini-reducer
runs in memory after the map phase and before the reducer (if used)
main purpose is to optimize bandwidth (see the driver sketch below)
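
A driver sketch wiring a combiner into the word-count example above.
Because summing is associative, the reducer class can double as the
combiner, and the map-side pre-aggregation shrinks what is shuffled over
the network (input/output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class); // optional mini-reducer
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}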

8. Map Reduce Partitioner

application code that defines how keys are assigned to reducers
determines which reducer receives which key/value pairs
when a MapReduce job starts, it determines how many partitions it will
divide the data into
the default partitioner uses a simple hash of the key (HashPartitioner)
(a custom partitioner sketch follows)
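
A hedged sketch of a custom partitioner (the first-letter routing rule is
my own illustration): all keys starting with the same character go to the
same reducer, whereas the default HashPartitioner uses
(key.hashCode() & Integer.MAX_VALUE) % numPartitions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0; // empty keys all land in partition 0
        }
        // Text.charAt returns the Unicode code point at that position.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}
// Driver side (sketch): job.setPartitionerClass(FirstLetterPartitioner.class);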

***********************************************************************
