

BIG DATA & HADOOP


Alchetron.com
Free Social Encyclopedia

BIG DATA
HADOOP
HDFS
MAPREDUCE
ALCHETRON
FEEDBACK
Q/A

To understand BIG DATA, we first have to understand data

THIS DRAWING WAS CREATED ABOUT 40,000 YEARS AGO. IT IS AMONG THE EARLIEST EXAMPLES OF HUMANS RECORDING DATA.

STONE TABLETS
AS TIME PASSED, WE STARTED CREATING MORE DATA, AS YOU CAN SEE IN THIS PICTURE, WHICH IS 3,000-10,000 YEARS OLD.

Johannes
Gutenberg

This man invented the printing press around 1439, which meant far more data could be recorded than ever before.

100 crore (1 billion) books had been printed by the 18th century, and, my dear friends, you were not even born yet.

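THIS MAN INVENTED THE WORLD WIDE WEB

Sir Tim Berners-Lee invented the World Wide Web (proposed in 1989, publicly available in 1991), which runs on top of the internet. With the web, the amount of data generated grew faster than ever.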

30 years of mobile
Technology

In the next 20 years, computing will move to the microscopic level.
Computers won't be in our pockets but inside our bodies and minds.
This is where technology and biology will merge, multiplying and enhancing our capabilities a thousand times.

Technological change will be rapid and exponential

With the invention of the internet + small, less expensive storage devices !!

Data creation explodes

Data generation statistics

2.7 zettabytes of data exist in the digital universe today.
Facebook stores, accesses, and analyzes 50+ petabytes of user-generated data.
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.
More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
YouTube users upload 48 hours of new video every minute of the day.
In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day.

SO WHAT IS BIG DATA ??

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few.

This data is big data.

With the invention of the internet, data creation explodes


Who will manage BIG DATA?

HADOOP
Open Source Apache Project
Written in Java
Runs on Linux, Mac OS X, Windows, and Solaris
Runs on commodity hardware

Contents

History of Hadoop
The current applications of Hadoop
Hadoop HDFS + MAP-REDUCE
Other Hadoop projects

Fun Fact about Hadoop

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria."
- Doug Cutting, Hadoop project creator

History of Hadoop
Doug Cutting reads the Map-reduce paper (2004)
It is an important technique!
Joins Yahoo! in 2006
Extended Apache Nutch
The great journey begins

Yahoo! became the primary contributor in 2006.
Yahoo! deployed large-scale science clusters in 2007.
Tons of Yahoo! Research papers emerged at WWW, CIKM, and SIGIR.
Yahoo! began running major production jobs in Q1 2008.

Hadoop consists of two parts: HDFS and MapReduce.

HDFS

NameNodes and DataNodes are simply machines that help the client store data.
Metadata (which blocks make up a file and where they live) is stored on the NameNode, and the actual data is stored on the DataNodes.
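
The split between metadata and actual data can be illustrated with a toy model. This is only a sketch in plain Python, not the HDFS API: the class `ToyCluster`, the tiny block size, the replication factor, and the round-robin placement are all assumptions made for illustration (real HDFS uses much larger blocks and rack-aware placement).

```python
from itertools import cycle

BLOCK_SIZE = 8    # bytes; tiny on purpose so the example stays readable
REPLICATION = 2   # illustrative replication factor


class ToyCluster:
    """Hypothetical stand-in for one NameNode plus a few DataNodes."""

    def __init__(self, datanode_names):
        # DataNodes hold the actual bytes: {node: {block_id: bytes}}
        self.datanodes = {name: {} for name in datanode_names}
        # The NameNode holds only metadata: {filename: [(block_id, [nodes])]}
        self.namenode = {}
        self._placement = cycle(datanode_names)

    def put(self, filename, data):
        """Split data into blocks, store replicas, record where they went."""
        blocks = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{filename}#blk{offset // BLOCK_SIZE}"
            chunk = data[offset:offset + BLOCK_SIZE]
            # Round-robin placement (an assumption; not how HDFS chooses nodes).
            targets = [next(self._placement) for _ in range(REPLICATION)]
            for node in targets:
                self.datanodes[node][block_id] = chunk
            blocks.append((block_id, targets))
        self.namenode[filename] = blocks

    def get(self, filename):
        """Look up block locations in the metadata, then read each block."""
        return b"".join(
            self.datanodes[nodes[0]][block_id]
            for block_id, nodes in self.namenode[filename]
        )


if __name__ == "__main__":
    cluster = ToyCluster(["dn1", "dn2", "dn3"])
    cluster.put("notes.txt", b"hello big data world")
    print(cluster.namenode["notes.txt"])   # metadata only
    print(cluster.get("notes.txt"))        # bytes reassembled from DataNodes
```

Note that `get` never scans the DataNodes: it asks the metadata where each block lives, which is the role the slide assigns to the NameNode.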

A TaskTracker is a daemon that runs on a DataNode. It accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
A JobTracker is a daemon that usually runs on the master machine, often alongside the NameNode. It farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack.
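
The "nodes that have the data, or at least the same rack" preference can be sketched as a small selection function. This is an illustrative approximation in plain Python, not JobTracker source code; the function name `pick_node`, its arguments, and the returned labels are hypothetical.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Choose a node for a map task: data-local first, then rack-local.

    block_replicas: set of nodes holding a replica of the input block
    free_nodes:     list of nodes with a free task slot
    rack_of:        dict mapping node name -> rack name
    """
    # 1. Best case: a free node that already stores the block.
    for node in free_nodes:
        if node in block_replicas:
            return node, "data-local"

    # 2. Next best: a free node in the same rack as some replica.
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # 3. Fallback: any free node; the block must cross racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "no free node"


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
    print(pick_node({"n1"}, ["n3", "n2"], racks))
```

The ordering matters because moving the computation to the data is cheaper than moving a large block across the network.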

Map-Reduce Architecture
Map-reduce is basically a data processing engine.
To understand it deeply, you should have some experience coding in Java.
Let's try to learn the architecture of map-reduce.
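
To make the map, shuffle and reduce steps concrete, here is a minimal word-count sketch. It is written in plain Python rather than the Hadoop Java API, it runs in a single process, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are chosen for illustration only. On a real cluster the same three steps are spread across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data data"]))
```

Each mapper only sees its own line and each reducer only sees one key, which is why the work can be split across nodes.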

An example

BORED

ALMOST THERE

JUST ONE MORE


CODE

Another Example code

Nowadays (as per the latest job market)

Software Developer Intern - IBM - Somers, NY +3 locations - Agile development - Big data / Hadoop / data analytics a plus
Software Developer - IBM - San Jose, CA +4 locations - include Hadoop-powered distributed parallel data processing system, big data analytics ... multiple technologies, including Hadoop

Other Hadoop Projects: Ecosystem

Hadoop Core
Distributed File System
MapReduce Framework

Pig (initiated by Yahoo!)
Parallel Programming Language and Runtime

HBase (initiated by Powerset)
Table storage for semi-structured data

ZooKeeper (initiated by Yahoo!)
Coordinating distributed systems

Hive (initiated by Facebook)
SQL-like query language and metastore

TYPICAL HADOOP CLUSTER HANDLING & PROCESSING PETABYTES OF DATA
1,000 TB = 1 PETABYTE (APPROX.)
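
The unit arithmetic behind figures like "20,000 terabytes (20 petabytes)" can be checked with a few lines of Python. This sketch uses decimal (power-of-ten) units; the "approx." in the slide presumably reflects that binary units (1 PiB = 1,024 TiB) are also in common use. The constant and function names are illustrative.

```python
TB = 10**12   # terabyte, decimal definition
PB = 10**15   # petabyte
EB = 10**18   # exabyte ("quintillion" bytes)


def to_petabytes(n_bytes):
    """Convert a byte count to decimal petabytes."""
    return n_bytes / PB


if __name__ == "__main__":
    # Google's 2008 daily volume quoted earlier: 20,000 TB.
    print(to_petabytes(20_000 * TB))
    # 2.5 quintillion bytes per day, expressed in exabytes.
    print((25 * 10**17) / EB)
    # One binary petabyte (2**50 bytes) in decimal petabytes.
    print(to_petabytes(2**50))
```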

Nowadays, who uses Hadoop?

Amazon/A9
Alchetron
Fox Interactive Media
Google
IBM
Facebook
Quantcast
Rackspace/Mailtrust
Veoh
Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy

Let's see how we implemented this at Alchetron

When you visit Alchetron.com, you are interacting with data processed with Hadoop

Search Index

Organizing data

Content Filtering

References
For more information:
http://hadoop.apache.org/
http://developer.yahoo.com/hadoop/
http://alchetron.com/What-is-Big-data1530-W
http://alchetron.com/Big-Data-Hadoop260-W
