
SANKHYANA CONSULTANCY SERVICES

Data Driven Decision Science

Chapter 1
Introduction to Big Data and Hadoop

Big data is data that is beyond the storage capacity or beyond the processing power of traditional systems.
It is:
1. High-volume, high-velocity and high-variety data
2. that demands cost-effective and innovative ways of data processing
3. for enhanced insight and decision making.

The 4 Vs of Big Data:

1. Volume – the huge amount of data generated from various sources. Overall, more than 3 exabytes of data is generated every day, and it keeps growing.
2. Velocity – the speed of data generation, i.e. how fast data is generated and processed to meet the demands and challenges of enterprise growth.
3. Variety – the formats of data.
4. Veracity – the quality of data; it refers to the validity and accuracy of data. If data is not accurate, it is of no use.

Types of Data:
1. Structured data has a pre-defined schema or format. Eg. - RDBMS tables.
2. Semi-structured data may have a pre-defined schema, but it is often ignored. Eg. - XML, JSON.
3. Unstructured data has no pre-defined format. Eg. - text files, images.

Problems:
1. Storing and managing massive amounts of data.
2. Analysing this data to predict customer behaviour, trends and outcomes in a cost-effective manner.

Existing solutions: RDBMS and DFS (Distributed File System)

Drawbacks:
1. RDBMS => Supports only GBs of data and only structured data, and processing speed is low for large volumes.
2. DFS => Programming is complex when dealing with massive data sets, and moving the data from the storage device to the processor is a bottleneck (transferring 100 GB at 100 MB/s takes roughly 1,000 seconds, i.e. approximately 17 minutes). Partial failures of the system are difficult to handle, and scaling up or scaling out is not easy: adding or removing a node in an existing distributed system is complex.
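
As a quick back-of-the-envelope check of that transfer-time figure (plain Java; the sizes are the ones quoted above):

// Rough check of the data-movement cost mentioned above:
// 100 GB moved at a transfer rate of 100 MB/s.
public class TransferTime {
    public static void main(String[] args) {
        double dataMb = 100 * 1024;     // 100 GB expressed in MB
        double rateMbPerSec = 100;      // transfer rate in MB/s
        double seconds = dataMb / rateMbPerSec;
        System.out.printf("%.0f s  (~%.1f minutes)%n", seconds, seconds / 60);
        // prints: 1024 s  (~17.1 minutes)
    }
}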

New Approach ->

Hadoop is the new approach to all of these problems.

Limitations:
1. Data can be accessed only in sequential order (no random access).
2. Data can be written once, but existing data cannot be updated or deleted, nor can new records be inserted into a written file.


Difference between HDFS and RDBMS:

· HDFS offers far more storage capacity, whereas RDBMS is limited to GBs.
· HDFS supports all types of data, whereas RDBMS supports only structured data.
· HDFS gives faster processing of large data, whereas RDBMS is slow for large data.
· HDFS reads data in a sequential manner, whereas RDBMS reads data randomly.
· HDFS does not allow update operations, whereas RDBMS allows insert, update and delete.
· HDFS is for batch processing, whereas RDBMS is for real-time processing.

Where the Hadoop concept came from:

Google created its own distributed computing framework and published papers describing it. It includes GFS (the Google File System) and MapReduce.

Hadoop was developed by Doug Cutting and Mike Cafarella, based on the Google File System. Hadoop was named after Doug's son's toy elephant. It was released as an Apache open-source project. It is a framework that uses a cluster of commodity hardware with a streaming access pattern to achieve the goal of storing and analysing large sets of data. It is built for reliable, scalable and distributed computing, and is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It allows distributed processing of large data sets across clusters of computers using simple programming models. Rather than relying on hardware to deliver high availability, the Hadoop software library itself is designed to detect and handle failures at the application layer.

Streaming access pattern – write once, read any number of times, but do not try to change the content of the file.

History:
2002 – Doug Cutting and Mike Cafarella started working on Nutch.
2003 – Google published papers describing GFS and MapReduce.
2004 – Doug Cutting added DFS and MapReduce support to Nutch.
Jan 2006 – Doug joined Yahoo!.
Feb 2006 – The Hadoop project was started separately to support HDFS and MapReduce, and Yahoo! adopted the Hadoop project.
Apr 2006 – Sort benchmark (1 GB/node) run on 188 nodes in 47.9 hours.
Jan 2008 – Hadoop was made its own top-level project at Apache.
Apr 2008 – Hadoop won the 1 TB sort benchmark in 3.5 minutes on 900 nodes.
Nov 2008 – Google reported that its MapReduce implementation sorted one terabyte in 68 seconds.
Apr 2009 – Yahoo! used Hadoop to sort one terabyte in 62 seconds.
2014 – A team at Databricks used a 207-node Spark cluster to sort 100 terabytes of data in 1,406 seconds.

Features of Hadoop:
1. Distributes the data across nodes.
2. Brings the computation local to the data, which avoids network overhead.
3. Data is replicated for high availability and reliability.
4. Scalable.
5. Fault tolerant.


Advantages of using Hadoop:

i. Distributes the data: when you store data, the data files are split into multiple blocks, and each block is replicated on multiple nodes (3 by default); a small sketch of checking this through the HDFS API follows this list.
ii. Hadoop brings the computation to the data. Unlike traditional systems, where computations are performed by applications on separate machines while the data resides on a central server, Hadoop brings the computation closer to the node where the data is stored, making data processing much faster.
iii. Parallel execution and fault tolerance. Jobs are divided into multiple tasks and these tasks run independently. Speculative execution is possible because of parallel execution. Node failure is common, so partial failures can be handled easily.
If a node fails:
· the system continues to run
· the master re-assigns its tasks to a different node
· there is no loss of data, because of data replication
· when the failed node recovers, it joins the cluster again automatically.
iv. Scalable. We can easily add nodes to a Hadoop cluster to increase capacity, and we can upgrade the capabilities of existing nodes.
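
As promised above, a minimal sketch, assuming a reachable HDFS and a hypothetical file /user/demo/sample.txt, of how a file's block size and replication factor can be inspected through Hadoop's Java FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // HDFS when fs.defaultFS points at it
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
        System.out.println("Block size (bytes) : " + status.getBlockSize());
        System.out.println("Replication factor : " + status.getReplication()); // typically 3
        fs.close();
    }
}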

Installation modes:
1. Standalone Mode: Hadoop is configured to run in a non-distributed mode, as a single Java process. In this mode there are no Hadoop daemons running in the background. HDFS is not used; only the local file system is used for storing files. It is best when you want to test your program for bugs with a small input. It is also known as Local Job Runner mode.
2. Pseudo-Distributed Mode: runs on a single machine, but with all the daemons running as separate processes. In this mode we can emulate a multi-node cluster on a single machine: different daemons run in different JVM instances, but on one machine. HDFS is used instead of the local file system.
3. Fully-Distributed Mode: the code runs on an actual Hadoop cluster, which may contain a large number of nodes.
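
From a client's point of view, the difference between these modes comes down largely to configuration, chiefly the fs.defaultFS property (along with a few related settings). A minimal sketch with illustrative values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ModeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standalone / Local Job Runner: local file system, no daemons.
        // conf.set("fs.defaultFS", "file:///");
        // Pseudo-distributed: HDFS daemons running on this same machine.
        // conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default file system: " + fs.getUri());
        fs.close();
    }
}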

Hadoop Components => Hadoop Core Components and Hadoop Ecosystem Components

1.Hadoop Core Components :


Hadoop = HDFS + MapReduce.
a. HDFS (Hadoop Distributed File System) is Hadoop's own file system, written in Java and implemented based on the Google File System. It sits on top of a native Unix file system such as ext3, ext4 or xfs (high-performance 64-bit journaling file systems; journaling means keeping track of changes). It stores the data on the cluster.

b. MapReduce is Hadoop's data processing component. It distributes tasks across multiple nodes, and each node processes the data stored locally on that node. It contains two phases, Map and Reduce. Another important phase, called shuffle and sort, runs after the map phase and before the reduce phase.
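
To make the Map and Reduce phases concrete, here is a minimal sketch of the classic word-count job written against Hadoop's MapReduce Java API (the input and output paths are passed as arguments and are purely illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on each node against the blocks stored locally,
    // emitting (word, 1) for every word it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle and sort, receives all counts for one word
    // and emits (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}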


A Hadoop cluster is a group of machines working together to store and process data. It follows a Master-Slave architecture; usually it contains one master and many slave nodes.
The master contains:
a. The NameNode, which stores metadata about the data held in the DataNodes. The metadata includes the number of blocks, which DataNode stores each block, and so on.
b. The JobTracker, a master process which creates and runs jobs. It runs on the NameNode machine and allocates tasks to TaskTrackers, which run on the DataNodes.
The slaves contain:
a. The DataNodes, which store the actual data.
b. The TaskTrackers, which run the tasks and report the status of each task to the JobTracker.

Hadoop Daemons:
A daemon is a process that runs in the background. Hadoop has five daemons:
1. NameNode
2. Secondary NameNode
3. DataNode
4. JobTracker
5. TaskTracker

2.Hadoop Ecosystem Components :

Sqoop (SQL-to-Hadoop) -> implemented in Java; used to transfer data between HDFS and an RDBMS.

Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data from servers into HDFS. It has a simple and flexible architecture based on streaming data flows. Flume channels data between sources and sinks, and its data harvesting can be either scheduled or event-driven. Possible sources for Flume include Avro (a remote procedure call and data serialization framework), files and system logs. Possible sinks (which remove events from the channel and put them into an external repository, or forward them to the Flume source of the next Flume agent in the flow) include HDFS and HBase.


[Diagram: a Flume agent takes events from a source (e.g. a web server), buffers them in a channel, and delivers them through a sink to HDFS.]

Hive is an SQL-based tool used to query and manage large datasets residing in distributed storage. It provides a mechanism to query the data using an SQL-like language called HiveQL. It was developed by Facebook and is 100% open source. It converts HiveQL queries into MapReduce jobs and submits those jobs to the Hadoop cluster.
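
A minimal sketch of submitting a HiveQL query from Java through the HiveServer2 JDBC driver; the host, port and the pageviews table are assumptions made for illustration:

// Hypothetical example: the connection URL and table name are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL-like query into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}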
Impala is a high-performance SQL engine for vast amounts of data. Its query language is similar to HiveQL. It is 10-50 times faster than Hive, Pig or MapReduce. It was developed by Cloudera and is 100% open source. The data is stored in HDFS, and Impala does not use MapReduce.
Pig is a platform for data analysis and processing on Hadoop. It offers an alternative to writing MapReduce programs directly. It was developed by Yahoo! and is 100% open source. It provides a high-level scripting language called Pig Latin and is designed to help with ETL tasks.

[Diagram: Hive, Pig and Impala sit at the top of the stack; Hive and Pig run via MapReduce, which runs on HDFS, while Impala reads HDFS directly without MapReduce.]

Oozie is a workflow engine. It runs on a server, typically outside the cluster. Oozie workflows are composed of Pig jobs, Hive jobs, Sqoop jobs and MapReduce jobs; they can also include non-MapReduce actions such as sending emails. Oozie is used for scheduling jobs.
