
Hadoop Education

DBA Team 4/14/2011

SHI BI Landscape

Jan. 16, 2009

SHC Hadoop Landscape


SHC Hadoop
Hadoop is a software framework for processing large amounts of data distributed across multiple commodity nodes (servers). The base Hadoop environment contains a distributed file system (HDFS) and a parallel programming framework (MapReduce). Additional projects may be added to the Hadoop software framework.

Hadoop is not a replacement for an RDBMS (Relational Database Management System).


SHC Hadoop Projects Overview


HADOOP CORE


HDFS
HDFS (Hadoop Distributed File System) is a distributed, fault-tolerant file system designed to be deployed on low-cost commodity hardware. HDFS provides high-throughput access to large amounts of application data. It does not require expensive, fast disk drives with RAID (Redundant Array of Independent Disks) to provide high throughput and fault tolerance.


MapReduce
MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.

MapReduce is not a replacement for an RDBMS (Relational Database Management System) or SQL (Structured Query Language).
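
[Figure: MapReduce data flow. Input splits (split0, split1, split2) each feed a Map task; map output is sorted, copied, and merged into the Reduce tasks, which write the output partitions (part0, part1).]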
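The map, sort, and reduce stages can be sketched in plain Python on a single machine. This is an illustration of the programming model only, not Hadoop's API: the function names, the sample splits, and the character-sum partitioner are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs, num_reducers):
    """Route each key to one reducer partition, then sort each partition by key."""
    partitions = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            # Hypothetical partitioner: character-code sum modulo reducer count.
            partitions[sum(map(ord, key)) % num_reducers].append((key, value))
    return {p: sorted(pairs, key=itemgetter(0)) for p, pairs in partitions.items()}


def reduce_phase(sorted_pairs):
    """Sum the counts for each key in one sorted partition."""
    return {key: sum(value for _, value in group)
            for key, group in groupby(sorted_pairs, key=itemgetter(0))}


# Made-up input: three splits, as in the diagram (split0, split1, split2).
splits = [
    ["hadoop stores data"],
    ["hadoop processes data"],
    ["data data data"],
]

mapped = [map_phase(split) for split in splits]           # one Map task per split
parts = shuffle(mapped, num_reducers=2)                   # sort / copy / merge
results = {p: reduce_phase(pairs) for p, pairs in parts.items()}  # one output per reducer

# Combine the per-reducer outputs only to display them together.
totals = {}
for part in results.values():
    totals.update(part)
print(totals)
```

Because every occurrence of a key lands in the same partition, each reducer can total its keys independently of the others, which is what lets the work run in parallel.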


HADOOP PROJECTS & SUBPROJECTS


AVRO
Avro is a data serialization system. It provides a consistent way to distribute non-text files (such as .zip archives, graphics, and other binary files) as well as text files across a distributed (Hadoop) environment.


FLUME
Flume (Log Flume) is a horizontally scalable data aggregation tool, which can support different levels of compression, batching and reliability for each unique data flow.


HBase
HBase is a NoSQL multi-dimensional, distributed, highly available data store made up of rows and column families, which can support billions of rows and millions of columns. HBase is not a SQL database and thus does not have the concepts of joins, data types, SQL or even a query engine.
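The row and column-family layout can be pictured as nested maps. The sketch below is an in-memory illustration only, not the HBase client API: the `put`, `get`, and `scan` helpers, the row keys, and the sample values are all hypothetical, and cell versions (timestamps) are left out for brevity. Values are kept as plain strings to mirror the absence of data types.

```python
from collections import defaultdict

# Hypothetical in-memory layout: table[row_key][column_family][qualifier] -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, qualifier, value):
    """Store one cell; columns within a family are created on demand."""
    table[row_key][family][qualifier] = value


def get(row_key):
    """Fetch one row by key as a plain nested dict (empty if the row is absent)."""
    row = table.get(row_key, {})
    return {family: dict(columns) for family, columns in row.items()}


def scan(start_row, stop_row):
    """Return rows in sorted key order, from start_row up to but excluding stop_row."""
    return [(key, get(key)) for key in sorted(table) if start_row <= key < stop_row]


# Made-up sample data; rows need not share the same columns.
put("store#0001", "info", "city", "Chicago")
put("store#0001", "sales", "2011-04", "1200")
put("store#0002", "info", "city", "Dallas")

print(get("store#0001"))
print([key for key, _ in scan("store#0001", "store#0002")])
```

Access is by row key or key range only; there is no join or query language in this model, so anything beyond a lookup or scan has to be done by the application.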


Hive
Hive is a data warehouse environment built on top of Hadoop. It gives SQL programmers and MapReduce programmers a common SQL-like query language, called QL, which can be extended with custom mapper and reducer plug-ins. It is best used for batch jobs over large, immutable sets of data. Hive is not designed for online transaction processing (OLTP) and does not offer real-time queries or row-level updates.


HUE
Hue (Hadoop User Experience) is a unified web-based UI for interacting with Hadoop. Hue provides an interface to submit jobs, watch running jobs, browse the file system, and interact with Hive. Additional UI applications can be built for use with Hue, making it a single access point into Hadoop.


LUCENE/SOLR
Lucene and Solr are two projects that merged into one in March 2010.
Lucene is a Java-based indexing and search library that also provides spellchecking, hit highlighting, and advanced analysis/tokenization capabilities. Solr is a high-performance enterprise search server with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, and web admin and search interfaces.

www.yonik.com
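The indexing and search idea behind Lucene can be illustrated with a toy inverted index: a map from each term to the documents that contain it. This is a generic sketch of the technique, not Lucene's or Solr's API; the tokenizer, the function names, and the sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up documents keyed by a hypothetical id.
docs = {
    1: "Hadoop Distributed File System",
    2: "Distributed search with Solr",
    3: "File system monitoring",
}
index = build_index(docs)
print(search(index, "distributed"))
print(search(index, "file system"))
```

Looking a term up in the index avoids rescanning every document at query time, which is the core reason search engines build the index ahead of time.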


PIG
Apache Pig is a platform for exploring large datasets with a scripting language called Pig Latin. A few commands are enough to search terabytes of data. Pig programs run in a distributed environment on a cluster: they are compiled into MapReduce jobs and executed using Hadoop.


OOZIE (Yahoo)
Oozie is a workflow and coordination server for managing jobs in a distributed (Hadoop) environment. Job execution can be triggered by time, by data availability, or by both.

http://yahoo.github.com/oozie/design.html
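The time and data-availability conditions can be expressed as a simple readiness check. The sketch below only illustrates the idea and does not use Oozie's configuration format or API; the `ready_to_run` function, its parameters, and the sample paths are hypothetical.

```python
from datetime import datetime


def ready_to_run(now, start_time=None, required_inputs=(), available_inputs=frozenset()):
    """Return True when the time condition and the data condition both hold.

    A condition that is not configured (no start time, no required inputs)
    is treated as already satisfied.
    """
    time_ok = start_time is None or now >= start_time
    data_ok = all(path in available_inputs for path in required_inputs)
    return time_ok and data_ok


# Made-up schedule and input paths for illustration.
nightly_start = datetime(2011, 4, 14, 1, 0)
inputs = ["/data/in/sales", "/data/in/stores"]

# Time has arrived but one input is still missing: the job waits.
print(ready_to_run(datetime(2011, 4, 14, 2, 0), nightly_start, inputs, {"/data/in/sales"}))

# Both inputs have landed and the start time has passed: the job can run.
print(ready_to_run(datetime(2011, 4, 14, 2, 0), nightly_start, inputs, set(inputs)))
```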


SQOOP
Sqoop (SQL-to-Hadoop) is a database import tool that makes it easy to copy tables or entire databases between SQL databases (RDBMS) and files in HDFS (Hadoop Distributed File System).

[Figure: Sqoop sits between the RDBMS and Hadoop and produces generated record datatype definitions.]


ZOOKEEPER
ZooKeeper enables highly reliable distributed coordination by providing a centralized service for maintaining configuration information, naming, distributed synchronization, and group services for distributed (Hadoop) applications.


NON-HADOOP PROJECTS


DATAMEER
The Datameer Analytics Solution (DAS) gives easy access to Big Data analytics via a spreadsheet interface. DAS gives the capability to create or embed graphs and statistics into dashboards for easy viewing.


GANGLIA
Ganglia is a scalable distributed monitoring system used to monitor clusters and grids. It provides the ability to drill down through standard or custom textual and graphical views at the level of a single node or a whole cluster.

[Figure: Ganglia layout. A client browser, an RRD store, a node running gmetad, and a cluster of nodes (Node #1, Node #2, Node #3, ...) each running gmond.]


NAGIOS
Nagios is an open source monitoring, alerting, response, reporting, maintenance, and capacity planning tool for servers and networks. Nagios can be set up to monitor critical infrastructure, such as network protocols, applications, services, servers, and network components. It is very flexible: custom Nagios plugins can be created and shared through the open community to extend Nagios's features.


INFOBRIGHT
Infobright is a columnar, MySQL-compatible analytic database.


JASPERSOFT
Jaspersoft is an open source set of BI (Business Intelligence) and ETL (Extract, Transform and Load) tools, which incorporates R (the project for statistical computing) and supports Hadoop/Hive.


R
R is a language and environment for statistical computing and graphics, covering linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and more.


HADOOP HARDWARE


Hardware

Production
1,008 Cores, 2,688GB Memory, 672TB Raw (~168TB Usable)

  Rack  Role    Nodes
  NW1   Worker  22
  NW2   Worker  22
  NW3   Worker  22
  NW4   Worker  18
  NW5   Master  6
  NW6   Master  9

Integration/UAT and Backup

  Rack  Role    Nodes
  OW1   Master  5
  OW2   Worker  15
  OW3   Master  5
  OW4   Worker  22
  OW5   Worker  8

Integration/UAT
180 Cores, 480GB Memory, 120TB Raw (~30TB Usable)

Backup
120 Cores, 240GB Memory, 240TB Raw (~80TB Usable)
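The cluster totals can be cross-checked against the per-node worker specifications given on the Hadoop Nodes slide. The sketch below rests on assumptions that are not stated in the deck: that the totals count worker/data nodes only, that the Production workers are the 22 + 22 + 22 + 18 nodes, that the 15-node worker group belongs to Integration/UAT, and that the 22-node and 8-node worker groups belong to Backup. The `WorkerSpec` class and `cluster_totals` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSpec:
    cores: int       # cores per node
    memory_gb: int   # RAM per node
    disks: int       # data disks per node
    disk_tb: int     # capacity of each disk


# Per-node worker specs as listed on the Hadoop Nodes slide.
R415 = WorkerSpec(cores=12, memory_gb=32, disks=4, disk_tb=2)
R310 = WorkerSpec(cores=4, memory_gb=8, disks=4, disk_tb=2)


def cluster_totals(spec, worker_nodes):
    """Return (cores, memory in GB, raw storage in TB) for a group of identical workers."""
    return (
        spec.cores * worker_nodes,
        spec.memory_gb * worker_nodes,
        spec.disks * spec.disk_tb * worker_nodes,
    )


# Assumed assignment of worker node counts to clusters (see the note above).
production = cluster_totals(R415, 22 + 22 + 22 + 18)
integration_uat = cluster_totals(R415, 15)
backup = cluster_totals(R310, 22 + 8)

print(production)       # (1008, 2688, 672)
print(integration_uat)  # (180, 480, 120)
print(backup)           # (120, 240, 240)

# Ratio of raw to usable capacity implied by the slide's figures.
print(672 / 168, 120 / 30, 240 / 80)  # 4.0 4.0 3.0
```

Under these assumptions the computed cores, memory, and raw storage match the figures quoted for all three clusters.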


Hadoop Nodes

Production Cluster
DL380 - Master Nodes (HP ProLiant DL380 G7): 8 x 2.8GHz Intel, 60GB RAM, 4 x 146GB 10k SAS (RAID), 6 x Gb NICs, Mgmt Onboard, Redundant Power Supplies
R515 - Access Nodes: 12 x 2.6GHz AMD, 64GB RAM, 12 x 2TB SATA (RAID), 4 x 10Gb NICs, Mgmt Onboard, Redundant Power Supplies
R415 - Worker/Data Nodes: 12 x 2.6GHz AMD, 32GB RAM, 4 x 2TB SATA (JBOD), 4 x Gb NICs, Mgmt Onboard, Single Power Supply

Backup Cluster
R710 - Master Nodes: 8 x 2.6GHz Intel, 48GB RAM, 4 x 300GB 15k SAS (RAID), 8 x Gb NICs, Mgmt Onboard, Redundant Power Supplies
R515 - Access Nodes: 12 x 2.6GHz AMD, 64GB RAM, 12 x 2TB SATA (RAID), 4 x 10Gb NICs, Mgmt Onboard, Redundant Power Supplies
R310 - Worker/Data Nodes: 4 x 2.4GHz Intel, 8GB RAM, 4 x 2TB SATA (JBOD), 2 x Gb NICs, No Mgmt, Single Power Supply

Integration/UAT Cluster
R710 - Master Nodes: 8 x 2.6GHz Intel, 48GB RAM, 4 x 300GB 15k SAS (RAID), 8 x Gb NICs, Mgmt Onboard, Redundant Power Supplies
R515 - Access Nodes: 12 x 2.6GHz AMD, 64GB RAM, 12 x 2TB SATA (RAID), 4 x 10Gb NICs, Mgmt Onboard, Redundant Power Supplies
R415 - Worker/Data Nodes: 12 x 2.6GHz AMD, 32GB RAM, 4 x 2TB SATA (JBOD), 4 x Gb NICs, Mgmt Onboard, Single Power Supply

Networks shown for the worker/data nodes: Primary Network (1GbE), Secondary Network (1GbE), Management Network


Hadoop - Production


Hadoop Integration/UAT and Backup
