SHI BI Landscape
SHC Hadoop
Hadoop is a software framework for processing large amounts of data scattered across multiple commodity nodes (servers). The base Hadoop environment will contain a Distributed File System (HDFS) and a Parallel Programming (MapReduce) piece. Additional projects may be added to the Hadoop software framework.
HADOOP CORE
HDFS
HDFS (Hadoop Distributed File System) is a distributed, fault-tolerant file system designed to be deployed on low-cost commodity hardware. HDFS provides high-throughput access to large amounts of application data. It does not require expensive, fast disk drives with RAID (Redundant Array of Independent Disks) to provide high throughput and fault tolerance.
MapReduce
MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
MapReduce is not a replacement for an RDBMS (Relational Database Management System) or SQL (Structured Query Language).
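[Figure: MapReduce data flow. Input splits (split0, split1, split2) each feed a Map task; the sorted map output is copied to the Reduce tasks, where it is merged and reduced into the output parts (part0, part1).]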
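The data flow above can be imitated in a few lines of plain Python. This is only a single-process sketch of the programming model, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`), the word-count task, and the sample splits are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(split: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.split():
        yield (word.lower(), 1)


def shuffle(pairs: Iterable[tuple[str, int]], num_reducers: int) -> list[dict[str, list[int]]]:
    """Copy/merge: route each key to one reducer and group its values."""
    partitions: list[dict[str, list[int]]] = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # A stable hash keeps the key-to-reducer assignment repeatable.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    return partitions


def reduce_phase(partition: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: walk the keys in sorted order and sum the values of each."""
    return {key: sum(values) for key, values in sorted(partition.items())}


def run_job(splits: list[str], num_reducers: int = 2) -> list[dict[str, int]]:
    """Run map over every split, shuffle, then reduce; one output part per reducer."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    return [reduce_phase(partition) for partition in shuffle(pairs, num_reducers)]


# Hypothetical input: three splits, two reducers (part0 and part1).
splits = ["the quick brown fox", "the lazy dog", "the fox"]
parts = run_job(splits, num_reducers=2)
counts = {word: n for part in parts for word, n in part.items()}
print(counts)
```

In a real cluster each Map and Reduce call would run as a separate task on a different node, and the splits would be blocks of a file in HDFS rather than in-memory strings.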
AVRO
AVRO is a data serialization system. It provides a means to distribute non-text files, such as .zip, graphics, binary files (and text files) in a consistent manner across a distributed (Hadoop) environment.
FLUME
Flume (Log Flume) is a horizontally scalable data aggregation tool, which can support different levels of compression, batching and reliability for each unique data flow.
HBase
HBase is a NoSQL multi-dimensional, distributed, highly available data store made up of rows and column families, which can support billions of rows and millions of columns. HBase is not a SQL database and thus does not have the concepts of joins, data types, SQL or even a query engine.
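To make the "rows and column families" model more concrete, here is a toy in-memory imitation in Python. It is a sketch only and does not use the HBase client API: the class `ToyTable`, its methods, and the sample row keys are hypothetical. It assumes the usual description of the model, in which rows are kept sorted by row key, each cell is addressed as family:qualifier, cells keep timestamped versions, and values are uninterpreted bytes.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> 'family:qualifier' -> versions of (timestamp, value)."""

    def __init__(self, column_families):
        # Column families are fixed up front; qualifiers are free-form.
        self.column_families = set(column_families)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        family, _, _qualifier = column.partition(":")
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        cells = self._rows[row_key][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest version first

    def get(self, row_key, column):
        """Return the newest value of one cell, or None if it does not exist."""
        cells = self._rows.get(row_key, {}).get(column)
        return cells[0][1] if cells else None

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row in row-key order."""
        for row_key in sorted(self._rows):
            if start_row <= row_key < stop_row:
                yield row_key, {col: cells[0][1] for col, cells in self._rows[row_key].items()}


# Hypothetical data: values are plain bytes, since the store has no data types.
table = ToyTable(["info"])
table.put("user#001", "info:name", b"Ada", timestamp=1)
table.put("user#001", "info:name", b"Ada L.", timestamp=2)
table.put("user#002", "info:name", b"Bob", timestamp=1)

print(table.get("user#001", "info:name"))
print([key for key, _ in table.scan("user#001", "user#002")])
```

Lookups go by row key or by a range of row keys; there is no join or query language here, which mirrors the point made above.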
Hive
Hive is a data warehouse environment built on top of Hadoop. Hive lets SQL programmers and MapReduce programmers use a common SQL-like query language called QL, which can be extended with custom mapper and reducer plug-ins. It is best used for batch jobs over large, immutable sets of data. Hive is not designed for online transaction processing (OLTP) and does not offer real-time queries or row-level updates.
HUE
Hue (Hadoop User Experience) is a unified web-based UI for interacting with Hadoop. Hue provides an interface to submit jobs, watch running jobs, browse the file system, and interact with Hive. Additional UI applications can be built for use with Hue, which makes it a single access point into Hadoop.
LUCENE/SOLR
Lucene/Solr are two projects that merged into one in March 2010.
Lucene is a Java-based indexing and search library that also provides spellchecking, hit highlighting, and advanced analysis/tokenization capabilities. Solr is a high-performance enterprise search server with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, and web admin and search interfaces.
www.yonik.com
PIG
Apache Pig is a platform for exploring large datasets with a scripting language called Pig Latin. A few commands are enough to search terabytes of data. Pig programs run in a distributed environment on a cluster: they are compiled into MapReduce jobs and executed using Hadoop.
OOZIE (Yahoo)
Oozie is a workflow and coordination server for managing jobs in a distributed (Hadoop) environment. Oozie job execution can be triggered by time, by data availability, or by both.
http://yahoo.github.com/oozie/design.html
SQOOP
Sqoop (Sql-to-hadoop) is a database import tool which provides the capability to easily copy tables or entire databases between SQL databases (RDBMS) and Hadoop files in HDFS (Hadoop Distributed File System).
[Figure: RDBMS <-> SQOOP <-> HADOOP]
ZOOKEEPER
ZooKeeper enables highly reliable distributed coordination by providing a centralized service for maintaining configuration information, naming, distributed synchronization, and group services for distributed (Hadoop) applications.
NON-HADOOP PROJECTS
DATAMEER
The Datameer Analytics Solution (DAS) gives easy access to Big Data analytics via a spreadsheet interface. DAS gives the capability to create or embed graphs and statistics into dashboards for easy viewing.
GANGLIA
Ganglia is a scalable distributed monitoring system used to monitor clusters and grids. It provides the ability to drill down through standard or custom textual and graphical views at the level of a single node or a whole cluster.
[Figure: Ganglia architecture. Each node in the cluster (Node #1, Node #2, Node #3, ...) runs gmond; a gmetad node stores the collected data in RRD; the client views it through a browser.]
NAGIOS
Nagios is an open source monitoring, alerting, response, reporting, maintenance, and capacity planning tool for servers and networks. Nagios can be set up to monitor critical infrastructure, such as network protocols, applications, services, servers and network components. It is very flexible: custom Nagios plugins can be created and shared via the open community to enhance Nagios's features.
INFOBRIGHT
Infobright is a columnar, MySQL-compatible analytic database.
JASPERSOFT
Jaspersoft is an open source set of BI (Business Intelligence) and ETL (Extract, Transform and Load) tools, which incorporates R (a project for statistical computing) and supports Hadoop/Hive.
R
R is a language and environment for statistical computing and graphics. It covers linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and more.
HADOOP HARDWARE
Hardware
Production: 1,008 Cores, 2,688GB Memory, 672TB Raw (~168TB Usable)

Rack   Role     Nodes
NW1    Worker   22
NW2    Worker   22
NW3    Worker   22
NW4    Worker   18
NW5    Master   6
NW6    Master   9

Integration/UAT: 180 Cores, 480GB Memory, 120TB Raw (~30TB Usable)
Backup: 120 Cores, 240GB Memory, 240TB Raw (~80TB Usable)

Rack   Role     Nodes
OW1    Master   5
OW2    Worker   15
OW3    Master   5
OW4    Worker   22
OW5    Worker   8
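The cluster totals can be cross-checked against the per-node worker specifications given on the Hadoop Nodes slide (12 cores, 32GB RAM and 4 x 2TB disks per worker for Production and Integration/UAT; 4 cores, 8GB RAM and 4 x 2TB disks per worker for Backup). The sketch below is a worked arithmetic check only. The worker counts per cluster (84, 15 and 30) are an assumption read off the rack node counts, and the names `WorkerSpec` and `cluster_totals` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSpec:
    cores: int    # CPU cores per worker node
    ram_gb: int   # memory per worker node, in GB
    disks: int    # data disks per worker node
    disk_tb: int  # capacity of each disk, in TB


def cluster_totals(nodes: int, spec: WorkerSpec) -> tuple[int, int, int]:
    """Return (cores, memory in GB, raw storage in TB) summed over all workers."""
    return (nodes * spec.cores, nodes * spec.ram_gb, nodes * spec.disks * spec.disk_tb)


AMD_WORKER = WorkerSpec(cores=12, ram_gb=32, disks=4, disk_tb=2)
INTEL_WORKER = WorkerSpec(cores=4, ram_gb=8, disks=4, disk_tb=2)

# Assumption: worker counts inferred from the rack node counts
# (22 + 22 + 22 + 18 for Production, 15 for Integration/UAT, 22 + 8 for Backup).
CLUSTERS = {
    "Production": (22 + 22 + 22 + 18, AMD_WORKER),
    "Integration/UAT": (15, AMD_WORKER),
    "Backup": (22 + 8, INTEL_WORKER),
}

totals = {name: cluster_totals(nodes, spec) for name, (nodes, spec) in CLUSTERS.items()}
for name, (cores, ram_gb, raw_tb) in totals.items():
    print(f"{name}: {cores} cores, {ram_gb}GB memory, {raw_tb}TB raw")
```

Under these assumptions the computed figures match the stated totals for all three clusters. The stated usable capacity is one quarter of raw for Production and Integration/UAT and one third of raw for Backup.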
Hadoop Nodes
[Figure: front and rear panel diagrams of each server type, showing NIC ports (Gb 1 to Gb 4, iLO) and the Management Network connections.]

Production Cluster
DL380 - Master Nodes (HP ProLiant DL380 G7): 8 x 2.8GHz Intel, 60GB RAM, 4 x 146GB 10k SAS (RAID), 6 x Gb NICs, Mgmt Onboard, Redundant Power Supplies
12 x 2.6GHz AMD, 64GB RAM, 12 x 2TB SATA (RAID), 4 x 10Gb NICs, Mgmt Onboard, Redundant Power Supplies
12 x 2.6GHz AMD, 32GB RAM, 4 x 2TB SATA (JBOD), 4 x Gb NICs, Mgmt Onboard, Single Power Supply

Backup Cluster
R710 - Master Nodes: 8 x 2.6GHz Intel, 48GB RAM, 4 x 300GB 15k SAS (RAID), 8 x Gb NICs, Mgmt Onboard, Redundant Power Supplies
12 x 2.6GHz AMD, 64GB RAM, 12 x 2TB SATA (RAID), 4 x 10Gb NICs, Mgmt Onboard, Redundant Power Supplies
4 x 2.4GHz Intel, 8GB RAM, 4 x 2TB SATA (JBOD), 2 x Gb NICs, No Mgmt, Single Power Supply

Integration/UAT Cluster
R710 - Master Nodes: 8 x 2.6GHz Intel, 48GB RAM, 4 x 300GB 15k SAS (RAID), 8 x Gb NICs, Mgmt Onboard, Redundant Power Supplies
12 x 2.6GHz AMD, 64GB RAM, 12 x 2TB SATA (RAID), 4 x 10Gb NICs, Mgmt Onboard, Redundant Power Supplies
12 x 2.6GHz AMD, 32GB RAM, 4 x 2TB SATA (JBOD), 4 x Gb NICs, Mgmt Onboard, Single Power Supply
Hadoop - Production