
Mining of Massive Datasets

Leskovec, Rajaraman, and Ullman
Stanford University

[Figure: the classical single-machine setup for machine learning, statistics, and classical data mining: CPU, Memory, Disk]

10 billion web pages
Average size of a web page = 20 KB
10 billion * 20 KB = 200 TB
Disk read bandwidth = 50 MB/sec
Time to read = 4 million seconds = 46+ days
Even longer to do something useful with the data
(The quick check below re-derives these numbers.)
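A quick sanity check of this arithmetic, in a few lines of Python (decimal units, matching the slide's figures):

```python
pages = 10e9        # 10 billion web pages
page_size = 20e3    # 20 KB per page
bandwidth = 50e6    # 50 MB/sec sequential disk read

total_bytes = pages * page_size          # 2.0e14 bytes = 200 TB
seconds = total_bytes / bandwidth        # 4.0e6 seconds
print(f"{total_bytes / 1e12:.0f} TB; {seconds:,.0f} s = {seconds / 86400:.1f} days")
# 200 TB; 4,000,000 s = 46.3 days
```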



Cluster architecture:

2-10 Gbps backbone between racks
1 Gbps between any pair of nodes in a rack
Each rack contains 16-64 commodity Linux nodes

[Figure: racks of commodity nodes (CPU, Mem, Disk), one switch per rack, connected by a backbone switch between racks]


In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO

Node failures

A single server can stay up for 3 years (1000 days)
1000 servers in cluster => 1 failure/day
1M servers in cluster => 1000 failures/day
How to store data persistently and keep it available if nodes can fail?
How to deal with node failures during a long-running computation?

Network bottleneck

Network bandwidth = 1 Gbps
Moving 10TB takes approximately 1 day
(Both estimates are re-derived in the sketch below.)
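Both estimates follow from simple arithmetic; a minimal check, assuming independent failures and decimal units:

```python
# Expected failures per day if each server fails about once per
# 1000 days (the slide's ~3-year server lifetime).
mtbf_days = 1000
for servers in (1_000, 1_000_000):
    print(f"{servers:>9,} servers -> ~{servers / mtbf_days:,.0f} failures/day")

# Time to move 10 TB over a 1 Gbps link.
seconds = (10e12 * 8) / 1e9      # bits to move / bits per second
print(f"10 TB over 1 Gbps: {seconds / 3600:.0f} hours (~1 day)")
# 22 hours
```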
Distributed programming is hard!
Need a simple model that hides most of the complexity

Map-Reduce addresses the challenges of cluster computing:

Store data redundantly on multiple nodes for persistence and availability
Move computation close to data to minimize data movement
Simple programming model to hide the complexity of all this magic
(A toy version of that programming model is sketched below.)
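To make the programming model concrete, here is a minimal single-process sketch of the Map-Reduce pattern on the classic word-count task; a real cluster runs many map and reduce tasks in parallel, and the function names here are illustrative only:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (key, value) pair per word."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: combine all values that share a key."""
    return (word, sum(counts))

def mapreduce(lines):
    groups = defaultdict(list)            # "shuffle": group values by key
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(mapreduce(["the quick brown fox", "the lazy dog"]))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```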

Distributed File System

Provides global file namespace, redundancy, and availability
E.g., Google GFS; Hadoop HDFS

Typical usage pattern:

Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common

Data kept in chunks spread across machines
Each chunk replicated on different machines
Ensures persistence and availability
(A toy chunking-and-replication sketch follows.)
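As a sketch of how chunking and replication might look, with a deliberately simplified rack-aware placement policy (a stand-in for illustration, not GFS's or HDFS's actual algorithm):

```python
CHUNK_SIZE = 64 * 1024 * 1024   # chunks are 16-64 MB; 64 MB used here

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Split a file's bytes into contiguous fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(chunk_id, racks, replication=3):
    """Choose `replication` servers for one chunk, cycling through
    racks so replicas land in different racks (simplified policy)."""
    rack_names = sorted(racks)
    placement = []
    for i in range(replication):
        servers = racks[rack_names[(chunk_id + i) % len(rack_names)]]
        placement.append(servers[chunk_id % len(servers)])
    return placement

racks = {"rack1": ["cs1", "cs2"], "rack2": ["cs3", "cs4"], "rack3": ["cs5"]}
print(place_replicas(0, racks))   # ['cs1', 'cs3', 'cs5']
```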


[Figure: chunks C0-C5 and D0-D1 spread across Chunk server 1 ... Chunk server N, with each chunk stored on more than one server]

Chunk servers also serve as compute servers


Bring computation to data!

Chunk servers:

File is split into contiguous chunks (16-64 MB)
Each chunk replicated (usually 2x or 3x)
Try to keep replicas in different racks

Master node:

a.k.a. Name Node in Hadoop's HDFS
Stores metadata about where files are stored
Might be replicated

Client library for file access:

Talks to master to find chunk servers
Connects directly to chunk servers to access data
(This read path is sketched below.)
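A minimal sketch of that read path; the class and method names are illustrative, not the actual GFS or HDFS API:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, upper end of the slide's range

class Master:
    """Holds only metadata: which servers store each chunk of each file."""
    def __init__(self, chunk_locations):
        self.chunk_locations = chunk_locations   # {(path, chunk_index): [servers]}
    def locate(self, path, chunk_index):
        return self.chunk_locations[(path, chunk_index)]

def read(master, chunk_servers, path, offset, length):
    """Client read: ask the master where the chunk lives, then fetch the
    bytes directly from one replica. No data flows through the master."""
    index = offset // CHUNK_SIZE
    server = random.choice(master.locate(path, index))   # any replica will do
    chunk = chunk_servers[server][(path, index)]
    start = offset % CHUNK_SIZE
    return chunk[start:start + length]

# Toy setup: one file, one chunk, replicated on two chunk servers.
servers = {"cs1": {("/crawl/part-0", 0): b"hello world"},
           "cs2": {("/crawl/part-0", 0): b"hello world"}}
master = Master({("/crawl/part-0", 0): ["cs1", "cs2"]})
print(read(master, servers, "/crawl/part-0", 0, 5))      # b'hello'
```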
