
Objectives

Learn and Share Recent advances in cluster computing (both in research and commercial settings):

Architecture, System Software, Programming Environments and Tools, Applications

Agenda

Overview of Computing
Motivations & Enabling Technologies
Cluster Architecture & its Components
Clusters Classifications
Cluster Middleware
Single System Image
Representative Cluster Systems
Resources and Conclusions

Computing Elements

[Figure: computing elements - Applications run over Programming Paradigms and a Threads Interface, on an Operating System (or Microkernel), on the Hardware of a multi-processor computing system (processors, processes, threads).]
3

Two Eras of Computing


[Figure: the two eras of computing - Sequential and Parallel - each evolving through Architectures, System Software, Applications, and Problem Solving Environments (P.S.Es), on a timeline from the 1940s to the 2000s and beyond, moving from R&D to commercialization to commodity.]

4

Computing Power and Computer Architectures

Computing Power (HPC) Drivers


Solving grand challenge applications using computer modeling, simulation, and analysis:

Life Sciences, Aerospace, CAD/CAM, Digital Biology, Military Applications, E-commerce/anything

6

How to Run App. Faster ?


There are 3 ways to improve performance:

1. Work Harder
2. Work Smarter
3. Get Help



Computer Analogy

1. Use faster hardware: e.g., reduce the time per instruction (clock cycle).
2. Use optimized algorithms and techniques.
3. Use multiple computers to solve the problem: that is, increase the aggregate number of instructions executed per unit time.
7

Computing Platforms Evolution


[Figure: computing platforms evolution - performance grows as administrative barriers are broken, scaling from Individual to Group, Department, Campus, State, National, Globe, Inter-Planet, and Universe-wide platforms.]
8

Application Case Study

Web Serving and E-Commerce

E-Commerce and PDC ?


What are/will be the major problems/issues in e-commerce?
How will or can PDC be applied to solve some of them?
Other than compute power, what else can PDC contribute to e-commerce?
How would/could the different forms of PDC (clusters, hyperclusters, GRID, ...) be applied to e-commerce?
Could you describe one hot research topic for PDC applied to e-commerce?
A killer e-commerce application for PDC? ...
10

Killer Applications of Clusters


Numerous Scientific & Engineering Apps.
Parametric Simulations
Business Applications: E-commerce Applications (Amazon.com, eBay.com ...), Database Applications (Oracle on cluster), Decision Support Systems
Internet Applications: web serving / searching, Infowares (yahoo.com, AOL.com), ASPs (application service providers), eMail, eChat, ePhone, eBook, eCommerce, eBank, eSociety, eAnything!, Computing Portals
Mission Critical Applications: command and control systems, banks, nuclear reactor control, star-wars, and handling life-threatening situations

11

Major problems/issues in Ecommerce


Social Issues
Capacity Planning
Multilevel Business Support (e.g., B2P2C)
Information Storage, Retrieval, and Update
Performance
Heterogeneity
System Scalability
System Reliability
Identification and Authentication
System Expandability
Security
Cyber Attacks Detection and Control (cyberguard)
Data Replication, Consistency, and Caching
Manageability (administration and control)

12

Amazon.com: Online sales/trading killer E-commerce Portal

Several thousands of items: books, publishers, suppliers
Millions of customers: customer details, transaction details, support for transaction updates
(Millions) of partners: keep track of partner details, referral links to partners, sales and payment
Sales based on advertised price
Sales through auction/bids: a mechanism for participating in the bid (buyers/sellers define the rules of the game)

13


Can these drive E-Commerce ?


Clusters are already in use for web serving, web hosting, and a number of other Internet applications including e-commerce:

scalability, availability, performance, reliable high-performance massive storage and database support. Attempts to support online detection of cyber attacks (through data mining) and control.

Hyperclusters and the GRID:

Support for transparency in (secure) site/data replication for high availability and quick response time (taking the site closer to the user). Compute power from hyperclusters/Grid can be used for data mining for cyber-attack and fraud detection and control. Helps to build Compute Power Market, ASPs, and Computing Portals.

14

Science Portals - e.g., PAPIA system

Pentiums, Myrinet, NetBSD/Linux, PM, SCore-D, MPC++

RWCP Japan: http://www.rwcp.or.jp/papia/

PAPIA PC Cluster

15

PDC hot topics for Ecommerce


Cluster-based web servers, search engines, portals
Scheduling and Single System Image
Heterogeneous Computing
Reliability, High Availability, and Data Recovery
Parallel databases and high-performance, reliable mass storage systems
CyberGuard! Data mining for detection of cyber attacks, frauds, etc.; detection and online control
Data mining for identifying sales patterns and automatically tuning the portal for special sessions/festival sales
eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment, eTravel, eGoods, and so on
Data/site replication and caching techniques
Compute Power Market
Infowares (yahoo.com, AOL.com)
ASPs (application service providers)
...

16

Sequential Architecture Limitations


Sequential architectures are reaching physical limitations (speed of light, thermodynamics).

Hardware improvements like pipelining, superscalar execution, etc., are not scalable and require sophisticated compiler technology.

Vector processing works well for certain kinds of problems.


17

Computational Power Improvement

[Figure: computational power improvement (C.P.I.) vs. number of processors - a multiprocessor keeps improving as processors are added, while a uniprocessor stays flat.]
18

Human Physical Growth Analogy: Computational Power Improvement

[Figure: human physical growth analogy - vertical growth in the early years, then horizontal growth, plotted against age (5, 10, 15, ... 45).]
19

Why Parallel Processing NOW?

The technology of PP is mature and can be exploited commercially; there is significant R&D work on the development of tools & environments.

Significant development in networking technology is paving the way for heterogeneous computing.
20

History of Parallel Processing

PP can be traced to a tablet dated around 100 BC.
  The tablet has 3 calculating positions.
  Infer that multiple positions were used for reliability and/or speed.

21

Motivating Factors

The aggregate speed with which complex calculations are carried out by millions of neurons in the human brain is amazing, although an individual neuron's response is slow (milliseconds); this demonstrates the feasibility of PP.
22

Taxonomy of Architectures

Simple classification by Flynn:


(No. of instruction and data streams)

SISD - conventional
SIMD - data parallel, vector computing
MISD - systolic arrays
MIMD - very general, multiple approaches

Current focus is on the MIMD model, using general-purpose processors or multicomputers.

23

Main HPC Architectures..1a


SISD - mainframes, workstations, PCs
SIMD Shared Memory - vector machines, Cray...
MIMD Shared Memory - Sequent, KSR, Tera, SGI, Sun
SIMD Distributed Memory - DAP, TMC CM-2...
MIMD Distributed Memory - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation clusters (IBM SP2, DEC, Sun, HP)
24

Motivation for using Clusters


The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.
Workstation clusters are easier to integrate into existing networks than special parallel computers.

25

Main HPC Architectures..1b.


NOTE: Modern sequential machines are not purely SISD - advanced RISC processors use many concepts from vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc.) in order to achieve one or more arithmetic operations per clock cycle.

26

Parallel Processing Paradox


Time required to develop a parallel application for solving a GCA is equal to:

  half the life of parallel supercomputers.

27

The Need for Alternative Supercomputing Resources


Vast numbers of under-utilised workstations are available to use.
Huge numbers of unused processor cycles and resources could be put to good use in a wide variety of application areas.
Reluctance to buy supercomputers due to their cost and short life span.
Distributed compute resources fit better into today's funding model.

28

Technology Trend

29

Scalable Parallel Computers

30

Design Space of Competing Computer Architecture

31

Towards Inexpensive Supercomputing


It is:

Cluster Computing..
The Commodity Supercomputing!
32

Cluster Computing Research Projects


Beowulf (CalTech and NASA) - USA
CCS (Computing Centre Software) - Paderborn, Germany
Condor - University of Wisconsin-Madison, USA
DQS (Distributed Queuing System) - Florida State University, USA
EASY - Argonne National Lab, USA
HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA
far - University of Liverpool, UK
Gardens - Queensland University of Technology, Australia
MOSIX - Hebrew University of Jerusalem, Israel
MPI (MPI Forum; MPICH is one of the popular implementations)
NOW (Network of Workstations) - Berkeley, USA
NIMROD - Monash University, Australia
NetSolve - University of Tennessee, USA
PBS (Portable Batch System) - NASA Ames and LLNL, USA
PVM - Oak Ridge National Lab / UTK / Emory, USA

33

Cluster Computing Commercial Software


Codine (Computing in Distributed Network Environment) - GENIAS GmbH, Germany
LoadLeveler - IBM Corp., USA
LSF (Load Sharing Facility) - Platform Computing, Canada
NQE (Network Queuing Environment) - Craysoft Corp., USA
OpenFrame - Centre for Development of Advanced Computing, India
RWPC (Real World Computing Partnership), Japan
UnixWare (SCO - Santa Cruz Operation), USA
Solaris MC (Sun Microsystems), USA
ClusterTools (a number of free HPC cluster tools from Sun)
A number of commercial vendors worldwide are offering clustering solutions, including IBM, Compaq, Microsoft, and a number of startups like TurboLinux, HPTI, Scali, BlackStone...

34

Motivation for using Clusters


Surveys show utilisation of CPU cycles of desktop workstations is typically <10%.
Performance of workstations and PCs is rapidly improving.
As performance grows, percent utilisation will decrease even further!
Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.

35

Motivation for using Clusters


The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.
Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.
Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of the system!

36

Cycle Stealing
Usually a workstation will be owned by an individual, group, department, or organisation - it is dedicated to the exclusive use of its owners.
This brings problems when attempting to form a cluster of workstations for running distributed applications.

37

Cycle Stealing
Typically, there are three types of owners, who use their workstations mostly for:

1. Sending and receiving email and preparing documents.
2. Software development - the edit, compile, debug and test cycle.
3. Running compute-intensive applications.

38

Cycle Stealing
Cluster computing aims to steal spare cycles from (1) and (2) to provide resources for (3).
However, this requires overcoming the ownership hurdle - people are very protective of their workstations.
It usually requires an organisational mandate that computers are to be used in this way.
Stealing cycles outside standard work hours (e.g. overnight) is easy, but stealing idle cycles during work hours without impacting interactive use (both CPU and memory) is much harder.

39

Rise & Fall of Computing Technologies


[Figure: rise & fall - Mainframes gave way to Minis (1970), Minis to PCs (1980), and PCs to Network Computing (1995).]


40

Original Food Chain Picture

41

1984 Computer Food Chain

[Figure: the 1984 computer food chain - Mainframe, Mini Computer, Workstation, PC, Vector Supercomputer.]

42

1994 Computer Food Chain

[Figure: the 1994 computer food chain - Mini Computer (hitting wall soon), Mainframe (future is bleak), Workstation, PC, Vector Supercomputer, MPP.]

43

Computer Food Chain (Now and Future)

44

What is a cluster?
A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.

A typical cluster:
  Network: faster, closer connection than a typical network (LAN)
  Low-latency communication protocols
  Looser connection than SMP

45

Why Clusters now? (Beyond Technology and Cost)


Building block is big enough:
  complete computers (HW & SW) shipped in millions: killer micro, killer RAM, killer disks, killer OS, killer networks, killer apps.

Workstation performance is doubling every 18 months.

Networks are faster:
  Higher link bandwidth (vs 10 Mbit Ethernet)
  Switch-based networks coming (ATM)
  Interfaces simple & fast (Active Msgs)

Striped files preferred (RAID)
Demise of Mainframes, Supercomputers, & MPPs
46


Architectural Drivers (cont)


Node architecture dominates performance
  processor, cache, bus, and memory design and engineering; $ => performance

Greatest demand for performance is on large systems
  must track the leading edge of technology without lag

MPP network technology => mainstream
  system area networks

System on every node is a powerful enabler
  very high speed I/O, virtual memory, scheduling, ...

47

...Architectural Drivers
Clusters can be grown: incremental scalability (up, down, and across)
  Individual node performance can be improved by adding additional resources (new memory blocks/disks)
  New nodes can be added or nodes can be removed
  Clusters of Clusters and Metacomputing

Complete software tools
  Threads, PVM, MPI, DSM, C, C++, Java, Parallel C++, Compilers, Debuggers, OS, etc.

Wide class of applications
  Sequential and grand challenge parallel applications

48

Clustering of Computers for Collective Computing: Trends


[Figure: timeline of clustering computers for collective computing - from 1960 through 1990, 1995+, and 2000 onwards.]

Example Clusters: Berkeley NOW


100 Sun UltraSparcs, 200 disks
Myrinet SAN, 160 MB/s
Fast comm.: AM, MPI, ...
Ether/ATM switched external net
Global OS
Self Config
50

Basic Components
[Figure: basic NOW node - a Sun Ultra 170 (processor, cache, memory) with a Myricom NIC on the I/O bus, connected to Myrinet at 160 MB/s.]

51

Massive Cheap Storage Cluster


Basic unit: 2 PCs double-ending four SCSI chains of 8 disks each

Currently serving Fine Art at http://www.thinker.org/imagebase/


52

Cluster of SMPs (CLUMPS)


Four Sun E5000s
  8 processors, 4 Myricom NICs each

Multiprocessor, Multi-NIC, Multi-Protocol

NPACI => Sun 450s

53

Millennium PC Clumps

Inexpensive, easy to manage cluster
Replicated in many departments
Prototype for very large PC cluster

54

Adoption of the Approach

55

So What's So Different?

Commodity parts?
Communications packaging?
Incremental scalability?
Independent failure?
Intelligent network interfaces?
Complete system on every node
  virtual memory
  scheduler
  files
  ...


56

OPPORTUNITIES & CHALLENGES

57

Opportunity of Large-scale Computing on NOW


Shared pool of computing resources: processors, memory, disks, interconnect

Guarantee at least one workstation to many individuals (when active)

Deliver a large % of collective resources to a few individuals at any one time


58

Windows of Opportunities
MPP/DSM:
  Compute across multiple systems: parallel.

Network RAM:
  Idle memory in other nodes. Page across other nodes' idle memory.

Software RAID:
  File system supporting parallel I/O and reliability, mass storage.

Multi-path Communication:
  Communicate across multiple networks: Ethernet, ATM, Myrinet

59

Parallel Processing

Scalable parallel applications require:
  good floating-point performance
  low overhead communication
  scalable network bandwidth
  a parallel file system

60

Network RAM
The performance gap between processor and disk has widened.
Thrashing to disk degrades performance significantly.
Paging across networks can be effective with high-performance networks and an OS that recognizes idle machines.
Typically, thrashing to network RAM can be 5 to 10 times faster than thrashing to disk.
61

Software RAID: Redundant Array of Workstation Disks


I/O Bottleneck:
  Microprocessor performance is improving by more than 50% per year.
  Disk access improvement is < 10% per year.
  Applications often perform I/O.

RAID cost per byte is high compared to single disks.
RAIDs are connected to host computers, which are often a performance and availability bottleneck.
RAID in software - writing data across an array of workstation disks - provides performance and some degree of redundancy, which provides availability.

62

Software RAID, Parallel File Systems, and Parallel I/O

63

Cluster Computer and its Components

64

Clustering Today
Clustering gained momentum when 3 technologies converged:

1. Very high-performance microprocessors
   workstation performance = yesterday's supercomputers
2. High-speed communication
   Comm. between cluster nodes >= between processors in an SMP
3. Standard tools for parallel/distributed computing
   and their growing popularity

65

Cluster Computer Architecture

66

Cluster Components...1a Nodes


Multiple high-performance components:
  PCs
  Workstations
  SMPs (CLUMPS)
  Distributed HPC systems leading to Metacomputing

They can be based on different architectures and run different OSes.

67

Cluster Components...1b Processors


There are many (CISC/RISC/VLIW/Vector...):
  Intel: Pentiums, Xeon, Merced
  Sun: SPARC, UltraSPARC
  HP PA
  IBM RS6000/PowerPC
  SGI MIPS
  Digital Alphas

Integrate memory, processing, and networking into a single chip:
  IRAM (CPU & Mem): http://iram.cs.berkeley.edu
  Alpha 21364 (CPU, Memory Controller, NI)
68

Cluster Components2 OS
State of the art OS:
  Linux (Beowulf)
  Microsoft NT (Illinois HPVM)
  Sun Solaris (Berkeley NOW)
  IBM AIX (IBM SP2)
  HP UX (Illinois - PANDA)
  Mach, a microkernel-based OS (CMU)
  Cluster operating systems: Solaris MC, SCO Unixware, MOSIX (academic project)
  OS gluing layers (Berkeley Glunix)

69

Cluster Components3 High Performance Networks


Ethernet (10 Mbps)
Fast Ethernet (100 Mbps)
Gigabit Ethernet (1 Gbps)
SCI (Dolphin - MPI - 12 microsecond latency)
ATM
Myrinet (1.2 Gbps)
Digital Memory Channel
FDDI
70

Cluster Components4 Network Interfaces


Network Interface Card
  Myrinet has its own NIC
  User-level access support
  The Alpha 21364 processor integrates processing, memory controller, and network interface into a single chip.
71

Cluster Components5 Communication Software


Traditional OS-supported facilities (heavyweight due to protocol processing):
  Sockets (TCP/IP), Pipes, etc. (see the sketch below)

Lightweight protocols (user level):
  Active Messages (Berkeley)
  Fast Messages (Illinois)
  U-net (Cornell)
  XTP (Virginia)

Systems can be built on top of the above protocols.
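As a concrete illustration of the traditional, heavyweight path, here is a minimal sketch (not from the original slides) of a TCP client sending one message to another cluster node through the kernel socket stack; the host name "node1" and port "5000" are placeholder values.

/* Minimal TCP socket sketch: connect to a peer node and send one message.
   "node1" and port "5000" are illustrative placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_INET;       /* IPv4 */
    hints.ai_socktype = SOCK_STREAM;   /* TCP */

    if (getaddrinfo("node1", "5000", &hints, &res) != 0) {
        fprintf(stderr, "cannot resolve peer node\n");
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }

    const char *msg = "hello from a cluster node";
    write(fd, msg, strlen(msg));       /* the kernel TCP/IP stack does the protocol work */

    close(fd);
    freeaddrinfo(res);
    return 0;
}

Every send here crosses the kernel and the full TCP/IP stack, which is exactly the overhead that the user-level protocols listed above (Active Messages, Fast Messages, U-net, XTP) were designed to avoid.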
72

Cluster Components 6a Cluster Middleware


Resides between the OS and applications and offers an infrastructure for supporting:
  Single System Image (SSI)
  System Availability (SA)

SSI makes the collection appear as a single machine (globalised view of system resources), e.g. telnet cluster.myinstitute.edu
SA - checkpointing and process migration, etc.

73

Cluster Components 6b Middleware Components


Hardware
  DEC Memory Channel, DSM (Alewife, DASH), SMP techniques

OS / Gluing Layers
  Solaris MC, Unixware, Glunix

Applications and Subsystems
  System management and electronic forms
  Runtime systems (software DSM, PFS, etc.)
  Resource management and scheduling (RMS):
    CODINE, LSF, PBS, NQS, etc.

74

Cluster Components7a Programming environments


Threads (PCs, SMPs, NOW...)
  POSIX Threads (see the sketch below)
  Java Threads
MPI
  Linux, NT, on many supercomputers
PVM
Software DSMs (Shmem)
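To make the threads entry concrete, here is a minimal POSIX threads sketch in C (not part of the original slides); the array size and thread count are illustrative values only.

/* Minimal POSIX threads sketch: each thread sums one slice of an array.
   Compile with: cc pthreads_demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N       1000000
#define THREADS 4

static double data[N];
static double partial[THREADS];

static void *sum_slice(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / THREADS), hi = lo + (N / THREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;                 /* each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[THREADS];
    for (long t = 0; t < THREADS; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);

    double total = 0.0;
    for (long t = 0; t < THREADS; t++) {
        pthread_join(tid[t], NULL);  /* wait for the thread, then combine its result */
        total += partial[t];
    }
    printf("sum = %f\n", total);
    return 0;
}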

75

Cluster Components 7b Development Tools ?


Compilers
  C/C++/Java; Parallel Programming with C++ (MIT Press book)
RAD (rapid application development) tools: GUI-based tools for PP modeling
Debuggers
Performance Analysis Tools
Visualization Tools
76

Cluster Components8 Applications


Sequential
Parallel / Distributed (cluster-aware apps)
  Grand Challenge applications:
    Weather Forecasting
    Quantum Chemistry
    Molecular Biology Modeling
    Engineering Analysis (CAD/CAM)
    ...
  PDBs, web servers, data mining


77

Key Operational Benefits of Clustering

System availability (HA): clusters offer inherent high system availability due to the redundancy of hardware, operating systems, and applications.
Hardware fault tolerance: redundancy for most system components (e.g. disk RAID), covering both hardware and software.
OS and application reliability: run multiple copies of the OS and applications, and benefit from this redundancy.
Scalability: add servers to the cluster, add more clusters to the network as the need arises, or add CPUs to an SMP.
High performance: running cluster-enabled programs.
78

Classification of Cluster Computers

79

Clusters Classification..1
Based on Focus (in Market)
  High Performance (HP) Clusters
    Grand Challenge applications
  High Availability (HA) Clusters
    Mission-critical applications

80

HA Cluster: Server Cluster with "Heartbeat" Connection

81

Clusters Classification..2
Based on Workstation/PC Ownership
  Dedicated clusters
  Non-dedicated clusters
    Adaptive parallel computing
    Also called communal multiprocessing

82

Clusters Classification..3
Based on Node Architecture
  Clusters of PCs (CoPs)
  Clusters of Workstations (COWs)
  Clusters of SMPs (CLUMPs)

83

Building Scalable Systems: Cluster of SMPs (Clumps)

Performance of SMP Systems Vs. Four-Processor Servers in a Cluster

84

Clusters Classification..4
Based on Node OS Type
  Linux clusters (Beowulf)
  Solaris clusters (Berkeley NOW)
  NT clusters (HPVM)
  AIX clusters (IBM SP2)
  SCO/Compaq clusters (Unixware)
  Digital VMS clusters, HP clusters, ...

85

Clusters Classification..5
Based on node component architecture & configuration (processor arch, node type: PC/workstation, and OS: Linux/NT):
  Homogeneous clusters: all nodes have a similar configuration
  Heterogeneous clusters: nodes based on different processors and running different OSes
86

Clusters Classification..6a
Dimensions of Scalability & Levels of Clustering

[Figure: three scaling dimensions - (1) Platform: Uniprocessor, SMP, Cluster, MPP; (2) Technology: CPU, Memory/OS, I/O; (3) Network: Workgroup, Department, Campus, Enterprise, Public - scaling up to Metacomputing (GRID).]
87

Clusters Classification..6b Levels of Clustering


Group clusters (#nodes: 2-99)
  (a set of dedicated/non-dedicated computers, mainly connected by a SAN like Myrinet)
Departmental clusters (#nodes: 99-999)
Organizational clusters (#nodes: many 100s) (using ATM nets)
Internet-wide clusters = Global clusters (#nodes: 1000s to many millions)
  Metacomputing
  Web-based computing
  Agent-based computing

Java plays a major role in web- and agent-based computing.

88

Major issues in cluster design


Size Scalability (physical & application)
Enhanced Availability (failure management)
Single System Image (look-and-feel of one system)
Fast Communication (networks & protocols)
Load Balancing (CPU, Net, Memory, Disk)
Security and Encryption (clusters of clusters)
Distributed Environment (social issues)
Manageability (administration and control)
Programmability (simple API if required)
Applicability (cluster-aware and non-aware applications)
89

Cluster Middleware and Single System Image

90

A typical Cluster Computing Environment

Application
PVM / MPI/ RSH

???

Hardware/OS
91

CC should support
Multi-user, time-sharing environments
Nodes with different CPU speeds and memory sizes (heterogeneous configuration)
Many processes with unpredictable requirements

Unlike SMP, there are insufficient bonds between nodes:
  Each computer operates independently
  Inefficient utilization of resources

92

The missing link is provided by cluster middleware/underware

Application
PVM / MPI/ RSH Middleware or Underware

Hardware/OS
93

SSI Clusters - SMP services on a CC

Pool together the cluster-wide resources:
  Adaptive resource usage for better performance
  Ease of use - almost like an SMP
  Scalable configurations - by decentralized control

Result: HPC/HAC at PC/workstation prices

94

What is Cluster Middleware ?


An interface between user applications and the cluster hardware and OS platform.
Middleware packages support each other at the management, programming, and implementation levels.
Middleware layers:
  SSI layer
  Availability layer: enables cluster services such as checkpointing, automatic failover, recovery from failure, and fault-tolerant operation among all cluster nodes.
95

Middleware Design Goals


Complete Transparency (Manageability)
  Lets the user see a single cluster system
  Single entry point, ftp, telnet, software loading...

Scalable Performance
  Easy growth of cluster
  No change of API & automatic load distribution

Enhanced Availability
  Automatic recovery from failures
    Employ checkpointing & fault-tolerant technologies
  Handle consistency of data when replicated


96

What is Single System Image (SSI) ?


A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
SSI makes the cluster appear like a single machine to the user, to applications, and to the network.
A cluster without SSI is not a cluster.

97

Benefits of Single System Image


Use of system resources transparently
Transparent process migration and load balancing across nodes
Improved reliability and higher availability
Improved system response time and performance
Simplified system management
Reduction in the risk of operator errors
The user need not be aware of the underlying system architecture to use these machines effectively

98

Desired SSI Services


Single Entry Point
  telnet cluster.my_institute.edu
  telnet node1.cluster.institute.edu

Single File Hierarchy: xFS, AFS, Solaris MC Proxy
Single Control Point: management from a single GUI
Single Virtual Networking
Single Memory Space - Network RAM / DSM
Single Job Management: GLUnix, Codine, LSF
Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may even use web technology

99

Availability Support Functions


Single I/O Space (SIOS):
  any node can access any peripheral or disk device without knowledge of its physical location.

Single Process Space (SPS):
  any process on any node can create processes with cluster-wide process IDs, and they communicate through signals, pipes, etc., as if they were on a single node.

Checkpointing and Process Migration:
  saves process state and intermediate results in memory or to disk to support rollback recovery when a node fails; process migration also supports load balancing.

Reduction in the risk of operator errors.
The user need not be aware of the underlying system architecture to use these machines effectively.
100

Scalability Vs. Single System Image


101

SSI Levels/How do we implement SSI ?


It is a computer science notion of levels of abstraction (a house is at a higher level of abstraction than walls, ceilings, and floors):
  Application and Subsystem Level
  Operating System Kernel Level
  Hardware Level
102

SSI at Application and Subsystem Level


Level: application - Examples: cluster batch system, system management - Boundary: an application - Importance: what a user wants
Level: subsystem - Examples: distributed DB, OSF DME, Lotus Notes, MPI, PVM - Boundary: a subsystem - Importance: SSI for all applications of the subsystem
Level: file system - Examples: Sun NFS, OSF DFS, NetWare, and so on - Boundary: shared portion of the file system - Importance: implicitly supports many applications and subsystems
Level: toolkit - Examples: OSF DCE, Sun ONC+, Apollo Domain - Boundary: explicit toolkit facilities: user, service name, time - Importance: best level of support for heterogeneous system
(c) In search of clusters 103

SSI at Operating System Kernel Level


Level: Kernel/OS layer - Examples: Solaris MC, Unixware, MOSIX, Sprite, Amoeba/GLUnix - Boundary: each name space: files, processes, pipes, devices, etc. - Importance: kernel support for applications, adm subsystems
Level: kernel interfaces - Examples: UNIX (Sun) vnode, Locus (IBM) vproc - Boundary: type of kernel objects: files, processes, etc. - Importance: modularizes SSI code within kernel
Level: virtual memory - Examples: none supporting the OS kernel - Boundary: each distributed virtual memory space - Importance: may simplify implementation of kernel objects
Level: microkernel - Examples: Mach, PARAS, Chorus, OSF/1 AD, Amoeba - Boundary: each service outside the microkernel - Importance: implicit SSI for all system services
(c) In search of clusters 104

SSI at Hardware Level

Level: memory - Examples: SCI, DASH - Boundary: memory space - Importance: better communication and synchronization
Level: memory and I/O - Examples: SCI, SMP techniques - Boundary: memory and I/O device space - Importance: lower overhead cluster I/O
(c) In search of clusters 105

SSI Characteristics
1. Every SSI has a boundary.
2. Single system support can exist at different levels within a system, one able to be built on another.

106

SSI Boundaries -- an application's SSI boundary

Batch System SSI Boundary

(c) In search of clusters 107

Relationship Among Middleware Modules

108

SSI via OS path!


1. Build as a layer on top of the existing OS
  Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time.
  i.e. new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath.
  E.g.: Glunix

2. Build SSI at kernel level (true cluster OS)
  Good, but can't leverage OS improvements by the vendor.
  E.g.: Unixware, Solaris MC, and MOSIX

109

SSI Representative Systems


OS level SSI
  SCO NSC UnixWare, Solaris MC, MOSIX, ...
Middleware level SSI
  PVM, TreadMarks (DSM), Glunix, Condor, Codine, Nimrod, ...
Application level SSI
  PARMON, Parallel Oracle, ...

110

http://www.sco.com/products/clustering/

SCO NonStop Cluster for UnixWare

[Figure: two UP or SMP nodes; on each, users, applications, and systems management issue standard OS kernel calls to standard SCO UnixWare with clustering hooks, extensions, and modular kernel extensions; each node has its own devices and connects over ServerNet to the other nodes.]

111

How does NonStop Clusters Work?


Modular extensions and hooks to provide:
  Single cluster-wide filesystem view
  Transparent cluster-wide device access
  Transparent swap-space sharing
  Transparent cluster-wide IPC
  High-performance internode communications
  Transparent cluster-wide processes, migration, etc.
  Node-down cleanup and resource failover
  Transparent cluster-wide parallel TCP/IP networking
  Application availability
  Cluster-wide membership and cluster time sync
  Cluster system administration
  Load leveling
112

Solaris-MC: Solaris for MultiComputers


Solaris MC provides:
  a global file system
  globalized process management
  globalized networking and I/O

[Figure: Solaris MC architecture - applications call the standard system call interface; networking, the file system, C++ objects, and processes sit above an object framework with object invocations on the existing Solaris 2.5 kernel, communicating with other nodes.]

http://www.sun.com/research/solaris-mc/

113

Solaris MC components

[Figure: Solaris MC architecture, as on the previous slide.]

Object and communication support
High availability support
PXFS global distributed file system
Process management
Networking
114

Multicomputer OS for UNIX (MOSIX) http://www.mosix.cs.huji.ac.il/


An OS module (layer) that provides applications with the illusion of working on a single system
Remote operations are performed like local operations
Transparent to the application - user interface unchanged

[Figure: MOSIX sits between PVM/MPI/RSH and the Hardware/OS, below the Application layer.]
115

Main tool
Preemptive process migration that can migrate any process, anywhere, anytime

Supervised by distributed algorithms that respond on-line to global resource availability - transparently
Load balancing - migrate processes from overloaded to under-loaded nodes
Memory ushering - migrate processes from a node that has exhausted its memory, to prevent paging/swapping
116

MOSIX for Linux at HUJI


A scalable cluster configuration:
  50 Pentium-II 300 MHz
  38 Pentium-Pro 200 MHz (some are SMPs)
  16 Pentium-II 400 MHz (some are SMPs)
Over 12 GB cluster-wide RAM
Connected by a Myrinet 2.56 Gb/s LAN
Runs Red Hat 6.0, based on kernel 2.2.7
Upgrades: HW with Intel, SW with Linux
Download MOSIX:

http://www.mosix.cs.huji.ac.il/
117

NOW @ Berkeley
Design & implementation of higher-level systems:
  Global OS (Glunix)
  Parallel File Systems (xFS)
  Fast Communication (HW for Active Messages)
  Application Support
Overcoming technology shortcomings:
  Fault tolerance
  System management
NOW Goal: faster for parallel AND sequential

http://now.cs.berkeley.edu/

118

NOW Software Components


[Figure: NOW software components - large sequential apps and parallel apps run over Sockets, Split-C, MPI, HPF, and vSM; the Global Layer Unix (Glunix) provides Active Messages, a name server, and a scheduler; each Unix (Solaris) workstation runs a VN segment driver and the AM L.C.P.; nodes are connected by a Myrinet scalable interconnect.]

119

3 Paths for Applications on NOW?


Revolutionary (MPP style): write new programs from scratch using MPP languages, compilers, libraries, ...
Porting: port programs from mainframes, supercomputers, MPPs, ...
Evolutionary: take a sequential program and use
  1) Network RAM: first use the memory of many computers to reduce disk accesses; if not fast enough, then:
  2) Parallel I/O: use many disks in parallel for accesses not in the file cache; if not fast enough, then:
  3) Parallel program: change the program until it sees enough processors that it is fast
  => Large speedup without a fine-grain parallel program

120

Comparison of 4 Cluster Systems

121

Cluster Programming Environments


Shared Memory Based
  DSM
  Threads/OpenMP (enabled for clusters) (see the sketch below)
  Java threads (HKU JESSICA, IBM cJVM)
Message Passing Based
  PVM
  MPI
Parametric Computations
  Nimrod/Clustor
Automatic Parallelising Compilers
Parallel Libraries & Computational Kernels (NetSolve)

122

Levels of Parallelism

Large grain (task level): program - PVM/MPI
Medium grain (control level): function (thread) - Threads
Fine grain (data level): loop - Compilers
Very fine grain (multiple issue): with hardware - CPU

[Figure: code granularity at each level - tasks i-1, i, i+1; functions func1(), func2(), func3(); loop bodies a(0)=.., b(0)=..; individual machine operations (+, x, load).]

123

MPI (Message Passing Interface)


http://www.mpi-forum.org/
A standard message passing interface.
  MPI 1.0 - May 1994 (started in 1992)
  C and Fortran bindings (now also Java)
Portable (once coded, it can run on virtually all HPC platforms, including clusters!)
Performance (by exploiting native hardware features)
Functionality (over 115 functions in MPI 1.0)
  environment management, point-to-point & collective communications, process groups, communication world, derived data types, and virtual topology routines
Availability - a variety of implementations available, both vendor and public domain.


124

A Sample MPI Program...


#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int my_rank;          /* process rank */
    int p;                /* no. of processes */
    int source;           /* rank of sender */
    int dest;             /* rank of receiver */
    int tag = 0;          /* message tag, like an email subject */
    char message[100];    /* buffer */
    MPI_Status status;    /* receive status */

    /* Start up MPI */
    MPI_Init(&argc, &argv);

    /* Find our process rank/id */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out how many processes/tasks are part of this run */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

[Figure: the master (rank 0) collects "Hello, ..." messages from the worker processes.]

125

A Sample MPI Program

    if (my_rank == 0) {    /* Master process */
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag,
                     MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    } else {               /* Worker process */
        sprintf(message, "Hello, I am your worker process %d!", my_rank);
        dest = 0;
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag,
                 MPI_COMM_WORLD);
    }

    /* Shut down the MPI environment */
    MPI_Finalize();
    return 0;
}

126

Execution
% cc -o hello hello.c -lmpi
% mpirun -p2 hello
Hello, I am your worker process 1!
% mpirun -p4 hello
Hello, I am your worker process 1!
Hello, I am your worker process 2!
Hello, I am your worker process 3!
% mpirun hello
(no output: with a single process there are no workers, hence no greetings)

127

PARMON: A Cluster Monitoring Tool


PARMON client on JVM; PARMON server (parmond) on each node

[Figure: PARMON clients (parmon) monitor nodes running the parmond server across a high-speed switch.]

http://www.buyya.com/parmon/

128

Resource Utilization at a Glance

129

Globalised Cluster Storage

Single I/O Space and Design Issues


Reference: K. Hwang, H. Jin et al., "Designing SSI Clusters with Hierarchical Checkpointing and Single I/O Space", IEEE Concurrency, March 1999.


130

Clusters with & without Single I/O Space

[Figure: two clusters - one without Single I/O Space, where users see only per-node storage, and one with Single I/O Space services presenting a unified view.]

131

Benefits of Single I/O Space

Eliminates the gap between accessing local disk(s) and remote disks
Supports a persistent programming paradigm
Allows striping on remote disks, accelerating parallel I/O operations
Facilitates the implementation of distributed checkpointing and recovery schemes

132

Single I/O Space Design Issues

Integrated I/O space
Addressing and mapping mechanisms
Data movement procedures

133

Integrated I/O Space


[Figure: an integrated I/O space with sequential addresses spanning local disks LD1..LDn (the RADD space), shared RAIDs SD1..SDm (the NASD space), and peripherals (the NAP space).]
134

Addressing and Mapping

[Figure: user applications go through a name agent and a disk/RAID/NAP mapper; I/O agents and a block mover (user-level middleware plus some modified OS system calls) move data among the RADD, NASD, and NAP spaces.]
135

Data Movement Procedures


User Application I/O Agent Node 1 Request Data Block A Block Mover

Node 2 I/O Agent

LD2 or SDi LD1 of the NASD Node 1 Block Mover

User Application I/O Agent

Node 2

I/O Agent

LD2 or SDi LD1 of the NASD

A
136

What Next ??
Clusters of Clusters (HyperClusters)
Global Grid
Interplanetary Grid
Universal Grid??

137

Clusters of Clusters (HyperClusters)


[Figure: three clusters (Cluster 1, 2, 3), each with a scheduler, master daemon, execution daemon, and submit/graphical control clients, interconnected over a LAN/WAN to form a hypercluster.]

138

Towards Grid Computing.

For illustration, placed resources arbitrarily on the GUSTO test-bed!!

139

What is Grid ?
An infrastructure that couples:
  Computers (PCs, workstations, clusters, traditional supercomputers, and even laptops, notebooks, mobile computers, PDAs, and so on)
  Software (e.g., renting expensive special-purpose applications on demand)
  Databases (e.g., transparent access to the human genome database)
  Special instruments (e.g., radio telescopes - SETI@Home searching for life in the galaxy, Astrophysics@Swinburne for pulsars)
  People (maybe even animals, who knows?)
across local/wide-area networks (enterprise, organisations, or the Internet) and presents them as a unified integrated (single) resource.

140

Conceptual view of the Grid

Leading to Portal (Super)Computing

http://www.sun.com/hpc/

141

Grid Application-Drivers
Old and new applications getting enabled due to the coupling of computers, databases, instruments, people, etc.:
  (distributed) supercomputing
  collaborative engineering
  high-throughput computing
    large-scale simulation & parameter studies
  remote software access / renting software
  data-intensive computing
  on-demand computing

142

Grid Components
[Figure: Grid component layers -
  Grid Apps: applications and portals (scientific, engineering, collaboration, problem-solving environments, web-enabled apps);
  Grid Tools: development environments and tools (languages, libraries, debuggers, monitoring, resource brokers, web tools);
  Grid Middleware: distributed resource coupling services (communication, sign-on & security, information, process, data access, QoS);
  Grid Fabric: local resource managers (operating systems, queuing systems, libraries & app kernels, TCP/IP & UDP) over networked resources across organisations (computers, clusters, storage systems, data sources, scientific instruments).]
143

Many GRID Projects and Initiatives


PUBLIC FORUMS: Computing Portals, Grid Forum, European Grid Forum, IEEE TFCC!, GRID2000, and more
Public Grid Initiatives: Distributed.net, SETI@Home, Compute Power Grid
Australia: Nimrod/G, EcoGrid and GRACE, DISCWorld
USA: Globus, Legion, JAVELIN, AppLes, NASA IPG, Condor, Harness, NetSolve, NCSA Workbench, WebFlow, EveryWhere, and many more...
Europe: UNICORE, MOL, METODIS, Globe, Poznan Metacomputing, CERN Data Grid, MetaMPI, DAS, JaWS, and many more...
Japan: Ninf, Bricks, and many more...

http://www.gridcomputing.com/

144

NetSolve
Client/Server/Agent-based computing

An easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms:
  Client-server design
  Network-enabled solvers
  Network resources
  Seamless access to resources
  Non-hierarchical system
  Load balancing
  Fault tolerance
  Interfaces to Fortran, C, Java, Matlab, and more
  Software is available

[Figure: a NetSolve client sends a request to the NetSolve agent, which chooses a server from the software repository of network resources and returns the reply.]
145

HARNESS Virtual Machine


Scalable distributed control and CCA-based daemon

[Figure: hosts A, B, C, D join a virtual machine through discovery and registration; another VM can coexist; operation within a VM uses distributed control. The component-based HARNESS daemon provides process control and user features.]

Customization and extension by dynamically adding plug-ins

http://www.epm.ornl.gov/harness/
146

HARNESS Core Research


Parallel plug-ins for a heterogeneous distributed virtual machine. One research goal is to understand and implement a dynamic parallel plug-in environment.
  This provides a method for many users to extend Harness in much the same way that third-party serial plug-ins extend Netscape, Photoshop, and Linux.

Research issues with parallel plug-ins include heterogeneity, synchronization, interoperation, and partial success. Three typical cases:
  load a plug-in into a single host of the VM w/o communication
  load a plug-in into a single host, broadcast to the rest of the VM
  load a plug-in into every host of the VM w/ synchronization
147

Nimrod - A Job Management System

http://www.dgs.monash.edu.au/~davida/nimrod.html

148

Job processing with Nimrod

149

Nimrod/G Architecture
[Figure: Nimrod/G architecture - Nimrod/G clients talk to the Nimrod engine, which uses a schedule advisor, trading manager, persistent store, and dispatcher; middleware services (grid explorer, grid information services GIS, trade server TS) connect to local resource managers (RM) and trade servers on the GUSTO test bed. RM: Local Resource Manager, TS: Trade Server.]

150

Compute Power Market

[Figure: Compute Power Market - an application's job control agent works with a resource broker (trade manager, schedule advisor, grid explorer) that consults the grid information server and trades with a trade server (charging algorithms, accounting, other services) in a resource domain; resource reservation, allocation, and a deployment agent manage resources R1..Rn.]

151

Pointers to Literature on Cluster Computing

152

Reading Resources..1a Internet & WWW


Computer Architecture:
http://www.cs.wisc.edu/~arch/www/

PFS & Parallel I/O


http://www.cs.dartmouth.edu/pario/

Linux Parallel Processing


http://yara.ecn.purdue.edu/~pplinux/Sites/

DSMs
http://www.cs.umd.edu/~keleher/dsm.html

153

Reading Resources..1b Internet & WWW


Solaris-MC
http://www.sunlabs.com/research/solaris-mc

Microprocessors: Recent Advances


http://www.microprocessor.sscc.ru

Beowulf:
http://www.beowulf.org

Metacomputing
http://www.sis.port.ac.uk/~mab/Metacomputing/
154

Reading Resources..2 Books

In Search of Clusters
  by G. Pfister, Prentice Hall (2nd ed.), 1998

High Performance Cluster Computing
  Volume 1: Architectures and Systems
  Volume 2: Programming and Applications
  edited by Rajkumar Buyya, Prentice Hall, NJ, USA

Scalable Parallel Computing
  by K. Hwang & Z. Xu, McGraw Hill, 1998
155

Reading Resources..3 Journals


A Case for NOW (Networks of Workstations), IEEE Micro, Feb. 1995
  by Anderson, Culler, Patterson

Fault Tolerant COW with SSI, IEEE Concurrency (to appear)
  by Kai Hwang, Chow, Wang, Jin, Xu

Cluster Computing: The Commodity Supercomputer, Journal of Software Practice and Experience (available from my web page)
  by Mark Baker & Rajkumar Buyya
156

Cluster Computing Infoware

http://www.csse.monash.edu.au/~rajkumar/cluster/
157

Cluster Computing Forum


IEEE Task Force on Cluster Computing (TFCC)

http://www.ieeetfcc.org
158

TFCC Activities...
Network Technologies
OS Technologies
Parallel I/O
Programming Environments
Java Technologies
Algorithms and Applications
Analysis and Profiling
Storage Technologies
High Throughput Computing
159

TFCC Activities...
High Availability
Single System Image
Performance Evaluation
Software Engineering
Education
Newsletter
Industrial Wing
TFCC Regional Activities

All of the above have their own pages; see pointers from: http://www.ieeetfcc.org
160

TFCC Activities...
Mailing list, workshops, conferences, tutorials, web resources, etc.
Resources for introducing the subject at senior undergraduate and graduate levels.
Tutorials/workshops at IEEE Chapters.
... and so on.
FREE MEMBERSHIP, please join!
Visit the TFCC page for more details: http://www.ieeetfcc.org (updated daily!)


161

Clusters Revisited

162

Summary

We have discussed Clusters


Enabling Technologies
Architecture & its Components
Classifications
Middleware
Single System Image
Representative Systems

163

Conclusions
Clusters are promising...
  They solve the parallel processing paradox.
  They offer incremental growth and match funding patterns.
  New trends in hardware and software technologies are likely to make clusters even more promising, so that cluster-based supercomputers can be seen everywhere!

164

Computing Platforms Evolution


[Figure: computing platforms evolution - performance grows as administrative barriers are broken, scaling from Individual to Group, Department, Campus, State, National, Globe, Inter-Planet, and Universe-wide platforms.]
165

Thank You ... Thank You ...

166

Backup Slides...

167

SISD : A Conventional Computer


[Figure: SISD - a single processor takes one instruction stream and one data input stream and produces one data output stream.]

Speed is limited by the rate at which the computer can transfer information internally.
Ex: PC, Macintosh, workstations
168

The MISD Architecture


[Figure: MISD - instruction streams A, B, C feed separate processors that operate on a single data input stream, producing a data output stream.]

More of an intellectual exercise than a practical configuration. Few were built, and none are commercially available.
169

SIMD Architecture
[Figure: SIMD - a single instruction stream drives multiple processors, each operating on its own data input stream (A, B, C) and producing its own data output stream.]

Example: Ci = Ai * Bi (see the sketch below)
Ex: CRAY machine vector processing, Thinking Machines CM*
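A plain C rendering of the Ci = Ai * Bi example above (not from the original slides; the array size is illustrative) - on SIMD/vector hardware, or with an auto-vectorizing compiler, all the elementwise multiplications proceed under a single instruction stream.

/* The SIMD idea Ci = Ai * Bi written as a plain C loop. */
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    double c[N];

    for (int i = 0; i < N; i++)   /* one operation applied to many data items */
        c[i] = a[i] * b[i];

    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}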


170

MIMD Architecture
[Figure: MIMD - instruction streams A, B, C drive separate processors, each with its own data input stream and data output stream.]

Unlike SISD and MISD, a MIMD computer works asynchronously.
  Shared memory (tightly coupled) MIMD
  Distributed memory (loosely coupled) MIMD
171

Shared Memory MIMD machine


Processor Processor A A
Processor Processor B B Processor Processor C C

M E M B O U R S Y

M E M B O U R S Y

M E M B O U R S Y

Global Memory System Global Memory System


Comm: Source PE writes data to GM & destination retrieves it Easy to build, conventional OSes of SISD can be easily be ported Limitation : reliability & expandability. A memory component or any processor failure affects the whole system. Increase of processors leads to memory contention. Ex. : Silicon graphics supercomputers....

172

Distributed Memory MIMD


IPC channel Processor Processor A A
Processor Processor B B Processor Processor C C

IPC channel

M E M B O U R S Y

M E M B O U R S Y

M E M B O U R S Y

Memory Memory System A System A


q q q

Memory Memory System B System B

Memory Memory System C System C

Communication : IPC on High Speed Network. Network can be configured to ... Tree, Mesh, Cube, etc. Unlike Shared MIMD easily/ readily expandable Highly reliable (any CPU failure does not affect the whole system)

173
