1. Big Data Characteristics
a) Volume - scale or amount of data generated
b) Velocity - frequency at which data is generated
c) Variety - different types and formats of data
2. Architectural Components
a) Parallelism
b) Hadoop - scalable data storage and batch processing framework;
ingests, processes, and aggregates external data.
c) HDFS - Hadoop Distributed File System; stores data across the cluster.
d) YARN - Yet Another Resource Negotiator; framework for scheduling and
executing data processing jobs.
e) Types of Jobs
3. Big Data Job types
A) MapReduce
Map phase - user-defined map tasks run in parallel on different nodes.
Reduce phase - output of the map phase is aggregated.
Intermediate steps - Partition, Shuffle, Sort
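The phases above can be sketched in plain Python. This is a single-process simulation only (the names `map_phase`, `shuffle_sort`, and `reduce_phase` are illustrative, not Hadoop APIs); on a real cluster the map tasks run in parallel on different nodes and the framework performs the shuffle.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit (word, 1) for every word; runs in parallel on a real cluster
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Partition/Shuffle/Sort: collect all values for the same key together
    return [(key, [v for _, v in grp])
            for key, grp in groupby(sorted(pairs, key=itemgetter(0)),
                                    key=itemgetter(0))]

def reduce_phase(grouped):
    # Reduce: aggregate each key's value list into one output element
    return {key: sum(values) for key, values in grouped}

counts = reduce_phase(shuffle_sort(map_phase(["big data", "big jobs"])))
# counts -> {"big": 2, "data": 1, "jobs": 1}
```

Each stage feeds the next exactly as in the notes: map output is partitioned and sorted by key before the reducer ever sees it.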
B) Hive
Data warehousing infrastructure
HQL (Hive Query Language)
Query unstructured data
Provides summarization, analysis, and ad hoc querying
C) Pig Latin
Queries
Data manipulation
Combines a SQL-like syntax with MapReduce processing.
***********************************************************************************
***********************************************************************************
*****************
MODULE 2
1. Data Systems
Flat systems
Relational systems
NoSQL
2. Relational Systems
Attribute - column in the table
Tuple - row in the table
Relation - the table itself
View - requested result set
3. NoSQL
designed specifically for distributed environments
stores data in documents
document format is typically JSON
schema is flexible - you can change the structure seamlessly, unlike in a relational DB
Types:
key-value stores - highly effective for high-velocity workloads
document stores
wide-column stores
graph stores
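The schema flexibility described above can be shown with a toy document store (the collection here is just a Python list; field names are made up for illustration): two documents in the same collection can have different fields, with no table alteration required.

```python
import json

# A toy "collection": documents need not share a schema
collection = [
    json.loads('{"id": 1, "name": "Ana"}'),
    json.loads('{"id": 2, "name": "Raj", "tags": ["ml", "etl"]}'),
]

# Schemaless read: a missing field is handled per document,
# instead of by altering a table definition up front
names_with_tags = [doc["name"] for doc in collection if "tags" in doc]
# names_with_tags -> ["Raj"]
```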
4. Data Warehousing
OLTP - Online Transaction Processing
OLAP - Online Analytical Processing
8. Data Warehousing
Source data is loaded through ETL into the data warehouse (metadata, raw data,
summary data).
The data stored here can be used for OLAP, reporting, and data mining.
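A minimal sketch of the ETL flow above, with made-up source rows and field names: extract raw records, transform them into clean types, and load both the raw data and a summary table into the warehouse for OLAP-style queries.

```python
# Extract: raw rows from a hypothetical source system (strings, inconsistent case)
source_rows = [
    {"region": "east", "amount": "100"},
    {"region": "west", "amount": "250"},
    {"region": "east", "amount": "50"},
]

# Transform: normalize values and fix types
transformed = [{"region": row["region"].upper(), "amount": int(row["amount"])}
               for row in source_rows]

# Load: the warehouse keeps raw data plus a summary table for reporting/OLAP
warehouse = {"raw": transformed, "summary": {}}
for row in transformed:
    region = row["region"]
    warehouse["summary"][region] = warehouse["summary"].get(region, 0) + row["amount"]
# warehouse["summary"] -> {"EAST": 150, "WEST": 250}
```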
9. Data Mart
a smaller data warehouse, typically scoped to a single department or subject area
***********************************************************************************
***********************************************************************************
*****************
MODULE 3
3. MapReduce
parallel processing framework for distributed data processing.
Shuffle Phase
performs the sort
transfers map outputs to the reducers as inputs
Reducer Phase
aggregates the values
produces one output element for each input list.
Three types of joins
Map Side Join
Performed in the mapper
Faster, but has constraints:
inputs sorted by the same key
equal number of partitions
all records for the same key in the same partition
Reduce Side Join
performed on the reducers
fewer constraints
input need not be structured
less efficient
both data sets must go through the shuffle phase
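A reduce-side join can be sketched as follows (a single-process simulation with illustrative names): the map step tags each record with its source table, the shuffle groups both data sets by the join key, and the reducer crosses the matching records, which is why both inputs must pass through the shuffle.

```python
from collections import defaultdict

users  = [(1, "Ana"), (2, "Raj")]               # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "ink")]  # (user_id, item)

# Map: tag each record with its source so the reducer can tell them apart
tagged = [(k, ("U", v)) for k, v in users] + [(k, ("O", v)) for k, v in orders]

# Shuffle: group every tagged record by the join key
groups = defaultdict(list)
for key, record in tagged:
    groups[key].append(record)

# Reduce: cross each user record with that key's order records
joined = []
for key, records in sorted(groups.items()):
    names = [v for tag, v in records if tag == "U"]
    items = [v for tag, v in records if tag == "O"]
    joined.extend((name, item) for name in names for item in items)
# joined -> [("Ana", "book"), ("Ana", "pen"), ("Raj", "ink")]
```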
Distributed Cache
used in map-side joins
copies small files into memory on the nodes before the join
More efficient, and eliminates the need for a reducer
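The distributed-cache idea can be sketched like this (the dict stands in for the cached small file; record values are made up): the small table is loaded into memory on every mapper, so each mapper joins its own records locally and no shuffle or reducer is needed.

```python
# Distributed-cache analogue: the small table lives in memory on every mapper
small_table = {1: "Ana", 2: "Raj"}  # user_id -> name

big_records = [(1, "book"), (2, "ink"), (3, "pen")]  # (user_id, item)

# Map-side join: each mapper joins locally; unmatched keys are dropped
joined = [(small_table[uid], item)
          for uid, item in big_records
          if uid in small_table]
# joined -> [("Ana", "book"), ("Raj", "ink")]
```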
***********************************************************************************
***********************************************************************************
*****************