
Contents

1. The Growth Of Data
   1.1 Sources of Data Inflow & Data Growth
   1.2 Problems with Enormous Data
2. Introduction to Apache Hadoop
   2.1 Introduction
   2.2 Problems with Shared Storage
   2.3 Components Of Apache Hadoop
   2.4 RDBMS Vs Apache Hadoop
3. Hadoop Distributed File System
   3.1 Introduction
   3.2 Assumptions & Goals of HDFS
   3.3 Terms Associated To HDFS
   3.4 Data Replication
4. MapReduce
   4.1 Introduction
   4.2 MapReduce - The Flow

1. The Growth Of Data

The data on the World Wide Web has grown at an enormous rate. As the number of internet users increases, the amount of data being uploaded grows manifold. Some estimates of this growth:
- In 2006, the total size of digital data was estimated at 0.18 zettabytes.
- By 2011, it was forecast to reach 1.8 zettabytes.
- One zettabyte = 1,000 exabytes = 1 million petabytes = 1 billion terabytes.

1.1 Sources of Data Inflow & Data Growth

This huge amount of data flows in from various sources, such as:
- Social networking sites hosting photos
- Video streaming sites
- Stock exchanges, which generate huge amounts of data every second
- Internet archives
- Real-time systems that monitor the activity of critical systems (e.g. the Hadron Collider), which generate millions of transactions every second that need to be stored for data analysis

1.2 Problems with Enormous Data

The growth of data also brings problems with it. Although the storage capacity of drives has increased over time, data access speed has not increased at the same rate. As an example:

Year | Storage available per drive | Access speed | Total time to read the drive
1990 | 1,370 MB                    | 4.4 MB/s     | ~5 minutes
2010 | 1 TB                        | 100 MB/s     | ~2.5 hours

Considering the statistics above, reading a full drive now takes far longer than it used to. If all the data resides on a single node, overall access time suffers: reading is slow, and writing is even slower. However, if the same data is spread across multiple disks and accessed in parallel, the overall access time can be reduced dramatically. This is the concept Hadoop applies to access data faster; to make it a reality, the data has to be distributed among multiple nodes. In other words, the way to reduce access time is to add more storage hardware working in parallel, rather than to wait for faster individual drives.
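As a rough check of the figures in the table above, the following small Java snippet works out the read times. It assumes 1 TB is treated as 1,000,000 MB, and the 100-disk scenario at the end is purely hypothetical, used only to show the effect of parallel reads:

```java
// Back-of-the-envelope read-time arithmetic based on the table above.
// Assumption: 1 TB ~ 1,000,000 MB; the 100-disk case is hypothetical.
public class DriveReadTimes {
    public static void main(String[] args) {
        double mb1990 = 1_370, rate1990 = 4.4;       // capacity (MB) and speed (MB/s), 1990
        double mb2010 = 1_000_000, rate2010 = 100;   // capacity (MB) and speed (MB/s), 2010

        System.out.printf("1990: %.0f s (~%.0f min)%n",
                mb1990 / rate1990, mb1990 / rate1990 / 60);      // ~311 s, about 5 minutes
        System.out.printf("2010: %.0f s (~%.1f h)%n",
                mb2010 / rate2010, mb2010 / rate2010 / 3600);    // ~10,000 s, about 2.8 hours
        System.out.printf("2010, 100 disks in parallel: %.0f s%n",
                (mb2010 / 100) / rate2010);                      // ~100 s
    }
}
```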

2. Introduction to Apache Hadoop

2.1 Introduction

Apache Hadoop is essentially a framework that enables applications to work across multiple nodes holding enormous amounts of data. It comprises two components:
a. The Apache Hadoop Distributed File System (HDFS)
b. A MapReduce framework, based on Google's MapReduce

Apache Hadoop was created by Doug Cutting, who named it after his son's stuffed elephant. It was initially developed to support the Nutch search engine project and has since become a top-level Apache project, built and used by a community of users. Facebook and The New York Times are some examples of Apache Hadoop implementations.

2.2 Problems with Shared Storage


The following table highlights the problems that can occur with shared storage and how the Apache Hadoop framework tries to handle them:

Problem | Solution approach
Hardware failure (not all nodes will be functional at any given moment in time) | Replicate the data. However, the Hadoop file system handles this in a slightly different manner.
Combining the data retrieved from multiple nodes (merging the output of each worker node is a complex task) | Google's MapReduce framework helps solve this problem.

MapReduce = Map + Reduce. Map essentially works on key-value pairs. It is a computational method that maps the data retrieved from multiple disks and then combines the results to generate a single output.
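As a minimal, single-machine sketch of that idea (plain Java collections and streams, not the Hadoop API; the input lines are invented for illustration), the snippet below "maps" lines to words and "reduces" by grouping and counting per key:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountLocal {
    public static void main(String[] args) {
        // Pretend these lines were read from several disks/nodes.
        List<String> lines = Arrays.asList("to be or not to be", "to do or not to do");

        // "Map": emit a key for every word; "Reduce": combine all values sharing a key.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))              // map lines to words
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));  // reduce by key

        System.out.println(counts);   // e.g. {not=2, be=2, or=2, to=4, do=2}
    }
}
```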

2.3 Components Of Apache Hadoop

The Hadoop framework can be thought of as consisting of two parts:

Component | Provider
Storage | Apache Hadoop Distributed File System (HDFS)
Analysis of data | MapReduce

2.4 RDBMS Vs Apache Hadoop

The following are the categories in which Hadoop differs from a traditional RDBMS:

Category | Traditional RDBMS | Apache Hadoop (MapReduce)
Size of data handled | Gigabytes | Petabytes
Data access strategy | Interactive and batch | Batch
Optimum updates | Read and write many times | Write once, read many times
Storage integrity | High | Low
Scaling | Nonlinear | Linear

3. Hadoop Distributed File System

3.1 Introduction

HDFS is a distributed file system designed to run on commodity hardware. Because its implementers treat the failure of any node in a commodity cluster as the norm rather than the exception, HDFS has been designed to be highly fault tolerant. It is also designed to run on low-cost, shared hardware, and it provides high-throughput access to data for applications that deal with huge amounts of it.

3.2 Assumptions & Goals of HDFS

Assumption / Goal | Description
Hardware failure | Failure is the norm rather than the exception in HDFS. At any given point in time, one or the other node is non-functional, so detection of faults and quick, automatic recovery are key.
Streaming data access | Applications running on HDFS need streaming access to their data. HDFS is not meant for general-purpose applications; it is designed for batch processing.
Large data sets | HDFS is tuned to support large files; each file may be 1 TB in size.
Simple coherency model | HDFS applications are mostly write-once, read-many applications.
Moving computation is cheaper than moving data | Running computation close to the data minimizes network congestion and increases the overall throughput of the system.
Portability across heterogeneous hardware and software platforms | HDFS has been designed to be easily portable from one platform to another.
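As a small illustration of the streaming-access, write-once-read-many model described above, the following sketch uses the standard Hadoop FileSystem API to read a file sequentially from HDFS. The path /data/sample.txt is hypothetical, and the cluster configuration is assumed to be picked up from the usual core-site.xml/hdfs-site.xml files on the classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical path

        // open() returns a stream: the file is read sequentially, block by block.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process each line as it streams in
                System.out.println(line);
            }
        }
    }
}
```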

3.3 Terms Associated To HDFS

Term | Description
Type of architecture | HDFS has a master/slave architecture.
Master node - Name Node | The master node is known as the Name Node. It manages the file system namespace and regulates access to files by clients. There is usually one per cluster.
Slave nodes - Data Nodes | Data Nodes manage the storage attached to the node on which they run.
Responsibility of the Name Node | Executes file system namespace operations such as opening, closing, and renaming files and directories; determines the mapping of blocks to Data Nodes; decides on the replication of data blocks.
Responsibility of the Data Nodes | Serve read and write requests from the file system's clients; also perform block creation, deletion, and replication as instructed by the Name Node.
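To see the Name Node's block-to-Data-Node mapping from a client's point of view, a sketch along the following lines can be used. It relies on the standard FileSystem.getFileBlockLocations call; the path is hypothetical:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt");   // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        // The client asks the Name Node which Data Nodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}
```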


3.4 Data Replication

(Source: www.hadoop.apache.org)

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance, and this replication is configurable: an application can specify the number of replicas of a file, and the replication factor can be specified at file creation time and changed later.

The Name Node makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the Data Nodes in the cluster. Receipt of a Heartbeat implies that the Data Node is functioning properly; a Blockreport contains a list of all blocks on a Data Node.

Replica placement is crucial for faster retrieval of data by clients. For this, HDFS uses a technique known as Rack Awareness and tries to satisfy a read request from the replica closest to the client. All HDFS communication protocols are layered on top of the TCP/IP protocol.
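As a sketch of how an application might specify the replication factor at creation time and change it later, the snippet below uses the standard FileSystem.create and setReplication calls; the path and the replication values are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/replicated.txt");   // hypothetical path

        // Specify the replication factor at file creation time ...
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
            out.writeUTF("stored as replicated blocks");
        }

        // ... and change it later; the Name Node re-replicates or removes block copies.
        fs.setReplication(file, (short) 2);
    }
}
```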

4. MapReduce

4.1 Introduction

MapReduce is the framework that handles the data analysis part of an Apache Hadoop implementation. The key points about MapReduce are:
- MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets across clusters of computers.
- The framework is inspired by the map and reduce functions commonly used in functional programming.
- Overall, it consists of a Map step and a Reduce step to solve a given problem.
- Map step:
  o The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes.
  o A worker node may do this again in turn, leading to a multi-level tree structure.
  o Each worker node processes its smaller problem and passes the answer back to its master node.
- Reduce step:
  o The master node then takes the answers to all the sub-problems and combines them to produce the output.
- All Map steps execute in parallel.

- The Reduce step takes its input from the Map step. All map outputs with the same key go to the same reducer; there are multiple reducers, again working in parallel.
- This parallelism offers the possibility of recovery from partial failure: if one node (mapper or reducer) fails, its work can be rescheduled on another node.
- Logical view of the Map step:
  o Data is structured in the form of (key, value) pairs.
  o Map(k1, v1) -> list(k2, v2)
  o Input: data of one type in one domain.
  o Output: a list of data in a different domain.
- Logical view of the Reduce step:
  o The Reduce function is applied in parallel to each group of values sharing a key, producing a collection of values in the same domain.
  o Reduce(k2, list(v2)) -> list(v3)
  o Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.
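The classic word-count job illustrates these signatures: the mapper turns (offset, line) pairs into (word, 1) pairs, and the reducer sums the counts for each word. The sketch below follows the standard Hadoop word-count shape using the org.apache.hadoop.mapreduce API; input and output HDFS paths are assumed to be passed on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(k1, v1) -> list(k2, v2): (offset, line) -> [(word, 1), (word, 1), ...]
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(k2, list(v2)) -> list(v3): (word, [1, 1, ...]) -> (word, total)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```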

4.2 MapReduce - The Flow

The following diagram explains the flow of data as described above:
