Professional Documents
Culture Documents
Some material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under Creation Commons Attribution 3.0 License) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Source: http://www.free-pictures-photos.com/
3.
4.
1. Web-Scale Problems
Characteristics:
Examples:
Crawling, indexing, searching, mining the Web Post-genomics life sciences research Other scientific data (physics, astronomers, etc.) Sensor networks Web 2.0 applications
Learning relations
Start with seed instances Search for patterns on the Web Using patterns to find more instances
Wolfgang Amadeus Mozart (1756 - 1791) Einstein was born in 1879
(Brill et al., TREC 2001; Lin, ACM TOIS 2007) (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; )
Web-scale problems? Throw more machines at it! Clear trend: centralization of computing resources in large data centers
Necessary ingredients: fiber, juice, and space What do Oregon, Iceland, and abandoned mines have in common?
Important Issues:
App OS
Traditional Stack
Virtualized Stack
Utility computing
Why buy machines when you can rent cycles? Examples: Amazons EC2, GoGrid, AppNexus
Give me nice API and take care of the implementation Example: Google App Engine
4. Web Applications
From the desktop to the browser SaaS == Web-based applications Examples: Google Maps, Facebook
Rent computing resources by the hour Basic unit of accounting = instance-hour Additional costs for bandwidth
If youre not genuinely interested in the topic If youre not ready to do a lot of programming
Source: http://davidzinger.wordpress.com/2007/05/page/2/
This is bleeding edge technology Those W$*#T@F! moments This is the second first time Ive taught this course
Be patient
Be flexible
There will be unanticipated issues along the way Tell me how I can make everyones experience better
Be constructive
Source: Wikipedia
Source: Wikipedia
Source: Wikipedia
Source: Wikipedia
Things to go over
Amazon EC2/S3
Web-Scale Problems?
Partition
w1
worker
w2
worker
w3
worker
r1
r2
r3
Result
Combine
Different Workers
Different threads in the same core Different cores in the same CPU
Commodity vs. exotic hardware Number of machines vs. processor vs. cores
Flynns Taxonomy
Instructions Single (SI) Single (SD) SISD Single-threaded process Multiple (MI) MISD Pipeline architecture
Data
Multiple (MD)
SISD
Processor
Instructions
SIMD
Processor
D0 D1 D2
D0 D1 D2
D0 D1 D2
D0 D1 D2
D0 D1 D2
D0 D1 D2
D0 D1 D2
D3
D4 Dn
D3
D4 Dn
D3
D4 Dn
D3
D4 Dn
D3
D4 Dn
D3
D4 Dn
D3
D4 Dn
Instructions
MIMD
Processor
Instructions Processor
Instructions
Processor
Processor
Processor
Memory
Processor Network
Memory
Processor
Memory
Processor
Memory
Processor
Memory Processor
Processor
Memory Processor Network
Parallelization Problems
How do we assign work units to workers? What if we have more work units than workers?
General Theme?
Difficult because
(Often) dont know the order in which workers run (Often) dont know where the workers are running (Often) dont know when workers interrupt each other
Thus, we need:
Parallel computing has been around for decades Here are some design patterns
Master/Slaves
master
slaves
Producer/Consumer Flow
P P P
C C C
P P P
C C C
Work Queues
P
shared queue P
W W W W W
C C C
The reality:
Lots of one-off solutions, custom code Write you own dedicated library, then program with it Burden on the programmer to explicitly manage everything