
Let's Hadoop

1. WHAT'S THE BIG DEAL WITH BIG DATA?

Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Gartner predicts 800% data growth over the next 5 years

Big Data opens the door to a new approach to engaging customers and making decisions

2. BIG DATA: WHAT ARE THE CHALLENGES?

How can we capture and deliver data to the right people in real time?

How can we understand and use big data when it arrives in a variety of forms?

How can we store and analyze the data given its size and the computational capacity required? While the storage capacity of hard drives has increased massively over the years, access speeds (the rate at which data can be read from a drive) have not kept up.
Example: processing a 100 TB dataset by scanning at 50 MB/s takes about 23 days on 1 node, but only about 33 minutes on a 1,000-node cluster (see the sketch below). The hardware challenge then becomes processing and combining data read from many disks in parallel.

Traditional systems can't scale, are not reliable, and are expensive.
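To make the arithmetic above easy to check, here is a minimal back-of-the-envelope sketch in plain Java; it uses only the slide's own example figures (100 TB, 50 MB/s, 1,000 nodes).

    // Back-of-the-envelope scan-time estimate using the example figures above.
    public class ScanTime {
        public static void main(String[] args) {
            double datasetBytes = 100e12;          // 100 TB
            double bytesPerSecond = 50e6;          // 50 MB/s per drive
            double oneNodeSeconds = datasetBytes / bytesPerSecond;
            double clusterSeconds = oneNodeSeconds / 1000; // 1,000 nodes scanning in parallel
            System.out.printf("1 node:      %.1f days%n", oneNodeSeconds / 86400);
            System.out.printf("1,000 nodes: %.1f minutes%n", clusterSeconds / 60);
        }
    }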

3. WHAT TECHNOLOGIES SUPPORT BIG DATA?

Scale-out everything: storage and compute

4. WHAT MAKES HADOOP DIFFERENT?

Accessible: Hadoop runs on large clusters of commodity machines or in the cloud (e.g., EC2).
Robust: Hadoop is architected with the assumption of frequent hardware malfunctions; it can gracefully handle most such failures.
Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple: Hadoop allows users to quickly write efficient parallel code.
Data locality: move computation to the data.
Replication: use replication across servers to deal with unreliable storage and servers (see the sketch below).
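As a hedged illustration of the replication bullet, the sketch below uses Hadoop's FileSystem API to request and read back a file's replication factor. The path /data/example.txt and the factor of 3 are illustrative assumptions, not values from the slides, and the file is assumed to already exist in HDFS.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch: ask HDFS to keep 3 replicas of an existing file,
    // then read the setting back from the file's metadata.
    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/example.txt");  // hypothetical path
            fs.setReplication(file, (short) 3);         // request 3 replicas across DataNodes
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Replication factor: " + status.getReplication());
        }
    }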

5. IS HADOOP A ONE-STOP SOLUTION?

Good for....

Batch processing of very large datasets (petabytes), split across a cluster of commodity machines and analyzed in parallel.

Not good for...

Real-time processing, small datasets, algorithms that require large temporary space, and problems that are CPU-bound with lots of cross-talk between nodes.

Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data.
The framework is written in Java.
It is designed to solve problems that involve analyzing large data sets (petabytes).
Its programming model is based on Google's MapReduce.
Its infrastructure is based on Google's distributed file system (GFS).
Hadoop consists of two core components:
o The Hadoop Distributed File System (HDFS), a distributed file system
o MapReduce, distributed processing on compute clusters


NameNode: manages the file system namespace (metadata) and regulates access to files by clients. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories.
DataNode: manages the storage attached to the node on which it runs. A DataNode serves read and write requests and performs block creation, deletion, and replication upon instruction from the NameNode. There are many DataNodes, typically one per physical node.
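To make the NameNode/DataNode split concrete, here is a minimal client sketch against the HDFS FileSystem API: metadata operations go through the NameNode, while the file bytes are streamed to and from DataNodes. The path is hypothetical and error handling is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal HDFS client: the NameNode handles the namespace operations,
    // the DataNodes store and serve the actual blocks.
    public class HdfsClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/hello.txt");  // hypothetical path

            // Write: the NameNode allocates blocks, the client streams bytes to DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello hdfs");
            }

            // Read: the NameNode returns block locations, the client reads from DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }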

Large-scale data processing:
o We want to use 1000s of CPUs
o But we don't want the hassle of managing them
The MapReduce architecture provides:
o Automatic parallelization and distribution
o Fault tolerance
o I/O scheduling
o Monitoring and status updates
MapReduce is a method for distributing a task across multiple nodes. Each node processes the data stored on that node. A job consists of two phases (sketched below):
o Map
o Reduce
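The standard word-count example is a compact way to see the two phases. The sketch below uses the org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are the usual textbook ones, not classes defined elsewhere in this deck.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in the input line.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word and write the final (word, total) pair.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }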


In the map phase, the mapper reads data in the form of key/value pairs and emits intermediate key/value pairs. The reducer processes all the output from the mappers, arrives at the final key/value pairs, and writes them to HDFS.
There are two types of nodes that control the job execution process:
o JobTracker
o TaskTrackers
The JobTracker coordinates all the jobs run on the system by scheduling tasks to run on TaskTrackers. TaskTrackers run the tasks and send progress reports to the JobTracker. The JobTracker runs on the master node alongside the NameNode; a TaskTracker runs on each worker node alongside its DataNode.
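A driver sketch for submitting such a job follows. It reuses the hypothetical WordCountMapper and WordCountReducer classes from the sketch above and takes input/output paths from the command line; under classic MapReduce (MRv1), the submitted job is what the JobTracker splits into tasks for the TaskTrackers.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Job driver: configures the word-count job and submits it to the cluster.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);  // optional local aggregation
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);        // block until the job completes
        }
    }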

