Introduction To Hadoop

INTRODUCTION TO HADOOP
Explaining a complex product in 20 minutes or less
Keith R. Davis Data Architect NEMSIS Project University of Utah, School of Medicine keith.davis@hsc.utah.edu
INTRODUCTION
Hadoop is an open source Apache software project that enables the distributed processing of large data sets across clusters of commodity servers.
WHAT IS HADOOP?
(2003-2004) Google publishes the GFS and MapReduce papers
(2005) Apache Nutch search project rewritten to use MapReduce
(2006) Hadoop is factored out of the Apache Nutch project
(2006) Development is sponsored by Yahoo
(2008) Hadoop becomes a top-level Apache project

(Trivia) Why is it called Hadoop?
It was named after the principal architect's son's toy elephant!
A QUICK BIT OF HISTORY
And more
WHO IS USING HADOOP?
Data is not stored in tables
Hadoop supports only forward parsing
Hadoop doesn't guarantee ACID properties
Hadoop takes code to the data
Hadoop scales horizontally rather than vertically
HOW IS HADOOP DIFFERENT FROM A TRADITIONAL RDBMS?
Hadoop is:

Easily scalable: new cluster nodes can be added as needed
Cost effective: Hadoop brings massively parallel computing to commodity servers
Flexible: Hadoop is schema-less and can absorb any type of data
Fault tolerant: a shared-nothing architecture guards against data loss and process failure
WHAT'S THE BIG DEAL?
Use Hadoop when:
You need to process terabytes of unstructured data
Running batch jobs is acceptable
You have access to a lot of cheap hardware
DO NOT use Hadoop when you need to:

Perform calculations with little or no data (Pi to one million places)
Process data in a transactional manner
Get interactive, ad hoc results (this is changing)
WHEN SHOULD I USE HADOOP?
Hadoop consists of two primary services:

1. Reliable storage through HDFS (Hadoop Distributed File System)
2. Parallel data processing using a technique known as MapReduce
BASIC ARCHITECTURE
[Diagram: the input data (CSV) is split into Block #1, Block #2, and Block #3]
HOW IT WORKS: HDFS WRITE STEP #1 (FILE SPLITS)
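The file-split step can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the function name `split_into_blocks` and the tiny 12-byte block size are assumptions chosen so that the sample CSV yields three blocks. Real HDFS blocks are far larger (64 MB or 128 MB by default, depending on the Hadoop version), and splits fall on byte boundaries, not record boundaries.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Hypothetical input file: 30 bytes of CSV.
csv_data = b"id,name\n1,alice\n2,bob\n3,carol\n"

# A 12-byte block size is unrealistically small; it only keeps the demo readable.
blocks = split_into_blocks(csv_data, block_size=12)

for number, block in enumerate(blocks, start=1):
    print(f"Block #{number}: {block!r}")
```

Note that a block can end in the middle of a CSV row. Joining the blocks in order gives back the original file.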
[Diagram: the blocks are replicated across the cluster, so that Node #1, Node #2, and Node #3 each hold two of the three blocks]
HOW IT WORKS: HDFS WRITE STEP #2 (REPLICATION)
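A rough sketch of the replication idea follows, assuming a simple round-robin placement. This is NOT the real HDFS placement policy (which is rack-aware and defaults to three replicas); the names `place_replicas` and `surviving_blocks` are hypothetical helpers for illustration.

```python
def place_replicas(block_ids, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {node: [] for node in nodes}
    for index, block_id in enumerate(block_ids):
        for offset in range(replication):
            node = nodes[(index + offset) % len(nodes)]
            placement[node].append(block_id)
    return placement


def surviving_blocks(placement, failed_node):
    """Return the set of blocks still readable after one node fails."""
    return {
        block_id
        for node, block_ids in placement.items()
        if node != failed_node
        for block_id in block_ids
    }


nodes = ["Node #1", "Node #2", "Node #3"]
placement = place_replicas([1, 2, 3], nodes, replication=2)

for node, block_ids in placement.items():
    print(node, "holds blocks", block_ids)

# With two copies of every block, losing any single node loses no data.
for node in nodes:
    print("if", node, "fails, blocks left:", sorted(surviving_blocks(placement, node)))
```

Because every block lives on two different nodes, the cluster can lose any one node and still serve the whole file.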
[Diagram: a client, a job scheduler, and several data nodes running mappers and reducers, reading from the HDFS file system (input) and writing to the HDFS file system (output)]
HOW IT WORKS: MAP/REDUCE
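The MapReduce flow can be imitated on a single machine with plain Python. The sketch below is a word count under stated assumptions: it runs in one process, the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for illustration, and it does not use the Hadoop API. In a real cluster the mappers run on the nodes that hold the input blocks, and the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(splits):
    """Run map, shuffle, and reduce over a list of input splits."""
    mapped = [
        pair
        for split in splits
        for line in split
        for pair in map_phase(line)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


# Two hypothetical input splits, standing in for two HDFS blocks.
splits = [
    ["hadoop stores data", "hadoop processes data"],
    ["data lives in hdfs"],
]
counts = run_job(splits)
print(counts)
```

Each mapper sees only its own split, which is why the map phase parallelizes so well. Only the shuffle needs to move data between nodes.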
Not to worry, there are many ways to access the power of MapReduce:

Hadoop Java API (if you like Java and low-level stuff)
Pig (if you are a script wiz and LINQ doesn't scare you)
Hive (if you know some SQL and coding isn't your thing)
RHadoop (if R is your thing)
SAS/ACCESS (if SAS is your thing)
LOOKS COMPLICATED!
Supports the concepts of databases, tables, and partitions through the use of metadata (think of views over delimited text files)
Supports a restricted version of SQL (no updates or deletes)
Supports joins between tables: INNER and OUTER (FULL, LEFT, and RIGHT)
Supports UNION to combine multiple SELECT statements
Provides a rich set of data types and predefined functions
Allows the user to create custom scalar and aggregate functions
Executes queries via MapReduce
Provides JDBC and ODBC drivers for integration with other applications
Hive is NOT a replacement for a traditional RDBMS, as it is not ACID compliant
HIVE: THE EASY WAY TO GET DATA OUT
If you use Hive to create sample sets for your analysis, here are a few standard functions you may find useful:
round(), floor(), ceil(), rand(), exp(), ln(), log10(), log2(), log(), pow(), sqrt(), bin(), hex(), unhex(), conv(), abs(), pmod(), sin(), asin(), cos(), acos(), tan(), atan(), degrees(), radians(), positive(), negative(), sign(), e(), pi(), count(), sum(), avg(), min(), max(), variance(), var_samp(), stddev_pop(), stddev_samp(), covar_pop(), covar_samp(), corr(), percentile(), percentile_approx(), histogram_numeric(), collect_set()
HIVE: MATH AND STATS FUNCTIONS
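Several of these functions come in population and sample flavors (for example stddev_pop() vs. stddev_samp()), and the difference matters when building sample sets. The sketch below uses Python's standard `statistics` module as a stand-in to show the distinction. It does not call Hive, and the data values are made up for illustration.

```python
import statistics

# Hypothetical column values pulled into a sample set.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population statistics divide by n (compare Hive's variance() and stddev_pop()).
population_variance = statistics.pvariance(values)
population_stddev = statistics.pstdev(values)

# Sample statistics divide by n - 1 (compare Hive's var_samp() and stddev_samp()).
sample_variance = statistics.variance(values)
sample_stddev = statistics.stdev(values)

print("population variance:", population_variance)
print("population stddev:  ", population_stddev)
print("sample variance:    ", sample_variance)
print("sample stddev:      ", sample_stddev)
```

The sample versions are always a little larger, because dividing by n - 1 corrects for estimating the mean from the same data.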
Cloudera (Easy Setup) - http://www.cloudera.com/content/cloudera/en/home.html
NoSQL - http://nosql-database.org/
Emulab - http://www.emulab.net/
Apache Hadoop - http://hadoop.apache.org/#Getting+Started
RHadoop - https://github.com/RevolutionAnalytics/RHadoop/wiki
SAS/ACCESS - http://www.sas.com/software/data-management/access/index.html
RESOURCES
THANK YOU!
