Hadoop Education v1

Hadoop Education
DBA Team 4/14/2011
SHI BI Landscape
Jan. 16, 2009
SHC Hadoop Landscape
SHC Hadoop
Hadoop is a software framework for processing large amounts of data distributed across multiple commodity nodes (servers). The base Hadoop environment contains a distributed file system (HDFS) and a parallel programming component (MapReduce). Additional projects may be added to the Hadoop software framework.
Hadoop is not a replacement for an RDBMS (Relational Database Management System).
SHC Hadoop Projects Overview
HADOOP CORE
HDFS
HDFS (Hadoop Distributed File System) is a distributed, fault-tolerant file system designed to be deployed on low-cost commodity hardware. HDFS provides high-throughput access to large amounts of application data. HDFS does not require expensive fast disk drives with RAID (Redundant Array of Independent Disks) to provide high throughput and fault tolerance.
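As a rough illustration of how a file system can be fault tolerant without RAID, the sketch below splits a file into fixed-size blocks and keeps several copies of each block on different nodes. The 64 MB block size, the replication factor of 3, the node names and the round-robin placement are assumptions made for this example; real HDFS placement is rack-aware and configurable.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (64 MB); configurable in a real cluster
REPLICATION = 3                # assumed replication factor; also configurable


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    This toy policy only shows that every block lives on several machines;
    it is not the placement policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(any(n != failed_node for n in replicas) for replicas in placement.values())


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])  # hypothetical nodes
print(len(blocks), layout[0], readable_after_failure(layout, "node2"))
```

Because each block has copies on other machines, losing a single commodity node does not make the file unreadable.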
MapReduce
MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
MapReduce is not a replacement for an RDBMS (Relational Database Management System) or SQL (Structured Query Language).
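[Figure: MapReduce data flow. Input splits (split0, split1, split2) each feed a Map task; map output goes through sort, copy and merge steps to the Reduce tasks, which write the output parts (part0, part1).]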
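The map, sort/copy/merge and reduce stages can be imitated in a single process. The following is a minimal sketch that counts words; the function names, the sample splits and the character-sum partitioner are illustrative assumptions, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs, num_reducers):
    """Sort/copy/merge: route each key to one reducer and group its values."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_outputs:
        for key, value in pairs:
            # Sum of character codes: a deterministic stand-in for a hash partitioner
            index = sum(ord(c) for c in key) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reduce: sum the grouped values per key, in sorted key order."""
    return {key: sum(values) for key, values in sorted(partition.items())}


splits = ["Hadoop stores data", "Hadoop processes data", "data data data"]
mapped = [map_phase(s) for s in splits]  # one map task per split
parts = [reduce_phase(p) for p in shuffle(mapped, num_reducers=2)]
print(parts)
```

Each reducer produces one output part, and every word ends up in exactly one part because the partitioner always sends the same key to the same reducer.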
HADOOP PROJECTS & SUBPROJECTS
AVRO
Avro is a data serialization system. It provides a means to store and exchange data, binary (non-text) as well as text, in a consistent, schema-defined format across a distributed (Hadoop) environment.
FLUME
Flume (Log Flume) is a horizontally scalable data aggregation tool, which can support different levels of compression, batching and reliability for each unique data flow.
HBase
HBase is a NoSQL, multi-dimensional, distributed, highly available data store made up of rows and column families, which can support billions of rows and millions of columns. HBase is not a SQL database and thus has no concept of joins, data types, SQL, or even a query engine.
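To make the row and column-family layout concrete, here is a minimal in-memory sketch of such a table. The class ToyTable, its methods and the sample row keys are hypothetical and are not the HBase client API; cell versions (timestamps) are left out to keep the example short.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one table: row key -> column family -> qualifier -> value."""

    def __init__(self, column_families):
        self.column_families = set(column_families)  # fixed when the table is created
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Values are kept as raw bytes; the store itself has no data types
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier):
        """Return the stored bytes, or None if the cell does not exist."""
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

    def scan(self, start_row, stop_row):
        """Yield (row_key, row) for keys in [start_row, stop_row), in key order."""
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, self.rows[key]


table = ToyTable(["info", "metrics"])
table.put("row-001", "info", "name", b"alpha")
table.put("row-002", "info", "name", b"beta")
table.put("row-002", "metrics", "hits", b"42")  # rows need not share the same columns
table.put("row-003", "info", "name", b"gamma")
print(table.get("row-002", "metrics", "hits"))
print([key for key, _ in table.scan("row-001", "row-003")])
```

Access is always by row key or key range. There is no join or query planner, which is why lookups such as get and scan are the only operations in the sketch.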
Hive
Hive is a data warehouse environment built on top of Hadoop. Hive lets SQL programmers and MapReduce programmers use a common SQL-like query language called QL (HiveQL), which can be extended with custom mapper and reducer plug-ins. It is best used for batch jobs over large, immutable sets of data. Hive is not designed for online transaction processing (OLTP) and does not offer real-time queries or row-level updates.
HUE
Hue (Hadoop User Experience) is a unified web-based UI for interacting with Hadoop. Hue provides an interface to submit jobs, watch running jobs, browse the file system, and interact with Hive. Additional UI applications can be built for use with Hue, thus providing a single access point into Hadoop.
LUCENE/SOLR
Lucene/Solr are two projects that merged into one in March 2010.
Lucene is a Java-based indexing and search implementation that also provides spellchecking, hit highlighting, and advanced analysis/tokenization capabilities. Solr is a high-performance enterprise search server with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, and web admin and search interfaces.
www.yonik.com
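The idea behind indexing and search can be shown with a small inverted index, which maps each term to the documents that contain it. This is only a sketch of the general technique with a deliberately simple tokenizer; the function names and sample documents are assumptions, and it does not use the Lucene or Solr APIs.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Hive is a data warehouse environment built on top of Hadoop",
    2: "Pig programs are compiled into MapReduce jobs",
    3: "Sqoop copies tables between SQL databases and Hadoop",
}
index = build_index(docs)
print(sorted(search(index, "hadoop")), sorted(search(index, "Hadoop tables")))
```

Analysis/tokenization decides which terms enter the index, so the same tokenizer is applied to both the documents and the query.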
PIG
Apache Pig is a platform for exploring large datasets with a scripting language called Pig Latin. A few commands are enough to search terabytes of data. Pig programs run in a distributed environment on a cluster: they are compiled into MapReduce jobs and executed using Hadoop.
OOZIE (Yahoo)
Oozie is a workflow and coordination server for managing jobs in a distributed (Hadoop) environment. Oozie job execution can be driven by time, by data availability, or by both.
http://yahoo.github.com/oozie/design.html
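Time-driven and data-driven execution can be pictured as a simple readiness check: a job starts once its start time has passed, its input data exists, or both, depending on which conditions are configured. The function below is a minimal sketch of that decision; ready_to_run, the schedule and the input path are hypothetical and do not reflect how Oozie is configured or implemented.

```python
from datetime import datetime


def ready_to_run(now, start_time=None, required_inputs=(), available_inputs=()):
    """Decide whether a job may start.

    start_time: earliest allowed start, or None for no time condition.
    required_inputs: paths that must exist, or empty for no data condition.
    """
    time_ok = start_time is None or now >= start_time
    data_ok = set(required_inputs) <= set(available_inputs)
    return time_ok and data_ok


# Hypothetical schedule and input path, for illustration only
start = datetime(2011, 4, 14, 2, 0)
needed = ["/data/logs/2011-04-13"]

print(ready_to_run(datetime(2011, 4, 14, 1, 0), start, needed, needed))  # too early
print(ready_to_run(datetime(2011, 4, 14, 3, 0), start, needed, []))      # data missing
print(ready_to_run(datetime(2011, 4, 14, 3, 0), start, needed, needed))  # both satisfied
```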
SQOOP
Sqoop (SQL-to-Hadoop) is a database import tool that copies tables or entire databases between SQL databases (RDBMS) and files in HDFS (Hadoop Distributed File System).
[Figure: RDBMS, Sqoop, Hadoop; Generated Record Datatype Definitions]
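The two ideas on this slide, copying table rows into delimited files and generating a record definition from the table's columns, can be sketched with the Python standard library alone. In the example an in-memory SQLite database stands in for the RDBMS and a string buffer stands in for a file in HDFS. The table, the function names and the type mapping are assumptions for illustration; this is not how Sqoop is invoked.

```python
import csv
import io
import sqlite3
from typing import NamedTuple

SQL_TO_PYTHON = {"INTEGER": int, "TEXT": str, "REAL": float}  # illustrative subset


def generate_record_type(conn, table):
    """Build a record type whose fields mirror the table's columns and declared types."""
    # The table name is trusted here; it is interpolated directly into the statement
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    fields = [(name, SQL_TO_PYTHON.get(sql_type.upper(), str))
              for _cid, name, sql_type, *_rest in info]
    return NamedTuple(table.capitalize() + "Record", fields)


def import_table(conn, table, out):
    """Write every row of the table to `out` as delimited text; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY rowid"):
        writer.writerow(row)
        count += 1
    return count


conn = sqlite3.connect(":memory:")  # stands in for the source RDBMS
conn.execute("CREATE TABLE orders (id INTEGER, item TEXT, price REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "widget", 9.5), (2, "gadget", 20.0)])

OrdersRecord = generate_record_type(conn, "orders")
target = io.StringIO()  # stands in for a file in HDFS
rows_copied = import_table(conn, "orders", target)
print(OrdersRecord._fields, rows_copied)
print(target.getvalue())
```

The generated record type gives downstream code named, typed access to each imported row instead of positional access to split text.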
ZOOKEEPER
ZooKeeper enables highly reliable distributed coordination by providing a centralized service for maintaining configuration information, naming, distributed synchronization, and group services for distributed (Hadoop) applications.
NON-HADOOP PROJECTS
DATAMEER
The Datameer Analytics Solution (DAS) gives easy access to Big Data analytics via a spreadsheet interface. DAS gives the capability to create or embed graphs and statistics into dashboards for easy viewing.
GANGLIA
Ganglia is a scalable distributed monitoring system used to monitor clusters and grids. It provides the ability to drill down through standard or custom textual and graphical views at the single-node or cluster level.
[Figure: Ganglia architecture: client browser, RRD, gmetad node, and a cluster of nodes (Node #1, Node #2, Node #3), each running gmond.]
NAGIOS
Nagios is an open source monitoring, alerting, response, reporting, maintenance, and capacity planning tool for servers and networks. Nagios can be set up to monitor critical infrastructure, such as network protocols, applications, services, servers and network components. It is very flexible: custom Nagios plugins can be created and shared via the open community to enhance Nagios's features.
INFOBRIGHT
Infobright is a columnar, MySQL-compatible analytic database.
JASPERSOFT
Jaspersoft is an open source set of BI (Business Intelligence) and ETL (Extract, Transform and Load) tools, which incorporates R (a project for statistical computing) and supports Hadoop/Hive.
R
R is a language and environment for statistical computing and graphics. It covers linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.
HADOOP HARDWARE
Hardware
NW1 (Worker): 22 Nodes
NW2 (Worker): 22 Nodes
NW3 (Worker): 22 Nodes
NW4 (Worker): 18 Nodes
NW5 (Master): 6 Nodes
NW6 (Master): 9 Nodes
Production: 1,008 Cores, 2,688GB Memory, 672TB Raw (~168TB Usable)
OW1 (Master): 5 Nodes
OW2 (Worker): 15 Nodes
OW3 (Master): 5 Nodes
OW4 (Worker): 22 Nodes
OW5 (Worker): 8 Nodes
Integration/UAT: 180 Cores, 480GB Memory, 120TB Raw (~30TB Usable)
Backup: 120 Cores, 240GB Memory, 240TB Raw (~80TB Usable)
Hadoop Nodes
Production Cluster: DL380 Master Nodes (HP ProLiant DL380 G7), R515 Access Nodes, R415 Worker/Data Nodes
Backup Cluster: R710 Master Nodes, R515 Access Nodes, R310 Worker/Data Nodes
Integration/UAT Cluster: R710 Master Nodes, R515 Access Nodes, R415 Worker/Data Nodes
Master Nodes: 8 x 2.8GHz Intel, 60GB RAM, 4 x 146GB 10k SAS (RAID), 6 x GB NICs, Mgmt Onboard, Redundant Power Supplies
Access Nodes: 12 x 2.6GHz AMD, 64GB RAM, 12 x 2TB SATA (RAID), 4 x 10GB NICs, Mgmt Onboard, Redundant Power Supplies
Worker/Data Nodes (Production): 12 x 2.6GHz AMD, 32GB RAM, 4 x 2TB SATA (JBOD), 4 x GB NICs, Mgmt Onboard, Single Power Supply
Worker/Data Nodes (Backup): 4 x 2.4GHz Intel, 8GB RAM, 4 x 2TB SATA (JBOD), 2 x GB NICs, No Mgmt, Single Power Supply
Worker/Data Nodes (Integration/UAT): 12 x 2.6GHz AMD, 32GB RAM, 4 x 2TB SATA (JBOD), 4 x GB NICs, Mgmt Onboard, Single Power Supply
Networks: Primary Network (1GbE), Secondary Network (1GbE), Management Network
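The cluster totals on the Hardware slide can be cross-checked against the per-node worker specifications with simple arithmetic. The sketch below multiplies worker node counts by cores, memory and raw disk per node. Which rack counts belong to which cluster, and which worker specification belongs to which cluster, are inferred from the order of items on the slides, so treat that mapping as an assumption.

```python
# Per worker node: cores, memory in GB, raw disk in TB (4 x 2TB drives)
WORKER_SPECS = {
    "R415": {"cores": 12, "memory_gb": 32, "raw_tb": 8},
    "R310": {"cores": 4, "memory_gb": 8, "raw_tb": 8},
}

# Worker model and worker node count per cluster (assumed mapping, see lead-in)
CLUSTERS = {
    "Production": ("R415", 22 + 22 + 22 + 18),  # NW1..NW4
    "Integration/UAT": ("R415", 15),            # OW2
    "Backup": ("R310", 22 + 8),                 # OW4, OW5
}


def totals(model, nodes):
    """Multiply each per-node figure by the number of worker nodes."""
    return {key: value * nodes for key, value in WORKER_SPECS[model].items()}


summary = {name: totals(model, nodes) for name, (model, nodes) in CLUSTERS.items()}
for name, figures in summary.items():
    print(name, figures)
```

Under these assumptions the computed figures agree with the Cores, Memory and Raw values stated for all three clusters, which suggests those totals count worker nodes only.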
Hadoop - Production
Hadoop Integration/UAT and Backup
