Parallel Processing
Contents at a Glance
Review of Week 11
Threads and Processes
Implicit and Explicit Multithreading
Clusters
Parallelizing
Vector Computation
Problem: multiple copies of the same data may reside in several caches and in main memory. If the processors are allowed to update their own copies (in the cache) freely, an inconsistent view of memory may result. This is the cache coherence problem; hence the multiple copies of the data (in the caches) have to be kept identical.
The objective is to let recently used local variables reside in the cache through numerous reads and writes, while the protocol maintains consistency of shared variables that are in multiple caches at the same time.
Cache Coherence approaches
Directory Protocol
Collect and maintain information about copies of data in caches. A centralized controller, which is part of the main memory controller, and a directory stored in main memory. When a request is made, the centralized controller checks and issues the necessary commands for data transfer between memory and cache, or between caches. The central controller keeps the state information up to date.
Snoopy Protocol
Distribute cache coherence responsibility among the cache controllers. A cache recognises that a line it holds is shared. Updates are announced to other caches by a broadcast mechanism. Each cache controller snoops on the network to observe the broadcast notifications and reacts accordingly.
MESI Protocol
Modified: the line in the cache has been modified (different from main memory) and is available only in this cache. Exclusive: the line in the cache is the same as that in main memory and is not present in any other cache. Shared: the line in the cache is the same as that in main memory and may be present in another cache. Invalid: the line in the cache does not contain valid data.
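To make the state transitions concrete, here is a small C sketch of how a single cache line might move between the four states on local and snooped events. It is a simplified illustrative model (the function names and event set are invented), not a full MESI controller; bus signalling and write-back are omitted.

/* Simplified MESI state model for a single cache line (illustrative only). */
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

/* Local write: the line becomes Modified; other copies must be invalidated
   (the invalidate broadcast on the bus is not modelled here). */
static mesi_state local_write(mesi_state s) { (void)s; return MODIFIED; }

/* Local read: a miss (Invalid) fetches the line; it enters Shared if another
   cache holds it, otherwise Exclusive. Other states are unchanged. */
static mesi_state local_read(mesi_state s, int held_elsewhere) {
    if (s == INVALID) return held_elsewhere ? SHARED : EXCLUSIVE;
    return s;
}

/* Snooped write by another processor: our copy becomes Invalid. */
static mesi_state snoop_remote_write(mesi_state s) { (void)s; return INVALID; }

/* Snooped read by another processor: a Modified/Exclusive line drops to Shared
   (a Modified line would first be written back to main memory). */
static mesi_state snoop_remote_read(mesi_state s) {
    return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
}

int main(void) {
    mesi_state s = INVALID;
    s = local_read(s, 0);        /* miss, no other copy  -> EXCLUSIVE */
    s = local_write(s);          /* write hit            -> MODIFIED  */
    s = snoop_remote_read(s);    /* another CPU reads it -> SHARED    */
    printf("final state: %d\n", s);
    (void)snoop_remote_write;    /* unused in this short trace */
    return 0;
}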
CPU performance
CPU Execution Time = Instruction Count × CPI × Clock Cycle Time
(seconds) = (instructions/program) × (cycles/instruction) × (seconds/cycle)
but it's an intuitive and useful concept. Note: use the dynamic instruction count (number of instructions executed), not the static count (number of instructions in the compiled code).
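As a worked example with made-up numbers, the sketch below applies the formula to a program that executes 10 million instructions with an average CPI of 2 on a 1 GHz clock.

/* CPU time = instruction count x CPI x clock cycle time (illustrative numbers). */
#include <stdio.h>

int main(void) {
    double instruction_count = 10e6;        /* dynamic instructions executed */
    double cpi = 2.0;                       /* average clock cycles per instruction */
    double clock_hz = 1e9;                  /* 1 GHz clock */
    double cycle_time = 1.0 / clock_hz;     /* seconds per cycle */

    double cpu_time = instruction_count * cpi * cycle_time;
    printf("CPU execution time = %g seconds\n", cpu_time);   /* 0.02 s */
    return 0;
}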
Same machine, different programs
Same program, different machines, but same ISA
Same program, different ISAs
Can clock speed be used to compare machines? No, not unless the ISA is the same.
Physics: speed of light, size of atoms, heat generated (speed requires energy loss), capacity of electromagnetic spectrum (for wireless), ... Limits with current technology: size of magnetic domains, chip size (due to defects), lithography, pin count. New technologies on the horizon: quantum computers, molecular computers, superconductors, optical computers, holographic storage, ... Fallacy: improvements will stop. Pitfall: trying to predict more than 5 years into the future.
Processor Performance
Increase performance by increasing the clock frequency and by increasing the number of instructions that complete during a cycle (pipelining).
Alternative approach
The instruction stream is divided into smaller streams called threads. The threads can be executed in parallel (multithreading).
Process switch: an operation that switches the processor from one process to another.
Resource ownership: a virtual address space to hold the process image (a collection of program, data, stack and other attributes). Scheduling/execution: execution state and dispatching priority.
Includes processor context (program counter and stack pointer) and a data area for the stack. A thread executes sequentially and is interruptible: the processor can turn to another thread.
Thread switch
Switching the processor between threads within the same process. Typically less costly than a process switch.
Explicit Multithreading
All commercial processors and most experimental ones use explicit multithreading: instructions from different explicit threads are executed concurrently, either by interleaving instructions from different threads on shared pipelines or by parallel execution on parallel pipelines.
Implicit Multithreading
Implicit multithreading is the concurrent execution of multiple threads extracted from a single sequential program.
Interleaved
Also known as fine-grained multithreading. The processor deals with two or more thread contexts at a time, switching threads at each clock cycle. If a thread is blocked it is skipped and a ready thread is executed.
Blocked
Also known as coarse-grained multithreading. Threads are executed successively until an event occurs that causes a delay, e.g. a cache miss. Effective on an in-order processor; avoids pipeline stalls.
Multithreading with a dual-issue superscalar CPU (a) Fine-grained multithreading (b) Coarse-grained multithreading (c) Simultaneous multithreading
(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)
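One way to see the difference between the fine-grained and coarse-grained policies above is a toy scheduling model: fine-grained rotates to another thread every cycle, while coarse-grained stays with one thread until it stalls. The C sketch below uses an invented stall pattern purely for illustration; it is not a model of any real pipeline.

/* Toy comparison of fine-grained vs coarse-grained thread selection. */
#include <stdio.h>

#define THREADS 3
#define CYCLES  12

/* 1 means the thread would stall (e.g. cache miss) in that cycle (made up). */
static const int stalls[THREADS][CYCLES] = {
    {0,0,1,0,0,0,1,0,0,0,0,0},
    {0,1,0,0,1,0,0,0,0,1,0,0},
    {0,0,0,0,0,1,0,0,1,0,0,0},
};

int main(void) {
    int fine = 0, coarse = 0;

    /* Fine-grained: move to the next thread every cycle, losing only
       the cycles where the selected thread happens to be stalled. */
    for (int c = 0, t = 0; c < CYCLES; c++, t = (t + 1) % THREADS)
        if (!stalls[t][c]) fine++;

    /* Coarse-grained: stay on the current thread until it stalls,
       then switch (the switch itself costs the stalled cycle here). */
    for (int c = 0, t = 0; c < CYCLES; c++) {
        if (stalls[t][c]) { t = (t + 1) % THREADS; continue; }
        coarse++;
    }

    printf("useful cycles: fine-grained=%d coarse-grained=%d (of %d)\n",
           fine, coarse, CYCLES);
    return 0;
}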
Simultaneous (SMT)
Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor.
Chip multiprocessing
The processor is replicated on a single chip; each processor handles separate threads.
Single-threaded scalar
Simple pipeline No multithreading
Easiest multithreading to implement. Switch threads at each clock cycle, so pipeline stages are kept close to fully occupied. Hardware needs to switch thread context between cycles.
Scalar Diagrams
Superscalar
No multithreading
Each cycle, as many instructions as possible are issued from a single thread. Delays due to thread switches are eliminated. The number of instructions issued in a cycle is limited by dependencies.
e.g. IA-64. Multiple instructions in a single word, typically constructed by the compiler. Operations that may be executed in parallel are placed in the same word, padded with no-ops if necessary.
Simultaneous multithreading
Issue multiple instructions at a time. One thread may fill all horizontal slots, or instructions from two or more threads may be issued. With enough threads, the maximum number of instructions can be issued on each cycle.
Chip multiprocessor
Multiple processors on one chip, each a two-issue superscalar processor. Each processor is assigned a thread.
Can
Examples
Some Pentium 4
Intel calls it hyper-threading: SMT with support for two threads; a single multithreaded processor appears logically as two processors.
IBM Power5
High-end PowerPC. Combines chip multiprocessing with SMT: the chip has two separate processors, each supporting two threads concurrently using SMT.
Clusters (1)
Alternative to SMP
High performance
High availability
Server applications
Clusters (2)
A group of interconnected whole computers working together as a unified resource, giving the illusion of being one machine. Each computer is called a node.
Cluster Benefits
Absolute scalability
Incremental scalability
High availability
Superior price/performance
Failback
Restoration
Load balancing
Incremental scalability: automatically include new computers in scheduling. Middleware needs to recognise that processes may switch between machines.
Parallelizing (1)
A parallelizing compiler determines at compile time which parts can be executed in parallel; these are split off for different computers.
Parallelizing (2)
Application written from scratch to be parallel. Message passing is used to move data between nodes. Hard to program, but gives the best end result.
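Such an application typically uses an explicit message-passing library. The sketch below is a minimal example using MPI (assuming an MPI implementation and compiler wrapper are available): each node computes a partial sum and node 0 collects the results.

/* Minimal message-passing sketch with MPI: node 0 collects partial sums. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each node computes its own share of the work (here: a trivial sum). */
    double partial = 0.0;
    for (int i = rank; i < 1000; i += size) partial += i;

    if (rank != 0) {
        /* Worker nodes send their partial result to node 0. */
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        double total = partial, incoming;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += incoming;
        }
        printf("total = %.0f\n", total);   /* 0 + 1 + ... + 999 = 499500 */
    }

    MPI_Finalize();
    return 0;
}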
Parametric computing
If a problem is repeated, execute the algorithm on different sets of data, e.g. a simulation using different scenarios. Needs effective tools to organize and run the work.
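As a small illustration, the sketch below runs the same placeholder simulation over several parameter sets as separate processes on one machine; on a real cluster a job-management tool would dispatch each scenario to a different node. The simulate() function and the scenario values are invented for the example.

/* Parametric computing sketch: the same algorithm run over different scenarios. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* A stand-in for the real simulation: one run for one parameter value. */
static double simulate(double parameter) {
    return parameter * parameter;   /* placeholder result */
}

int main(void) {
    const double scenarios[] = {0.5, 1.0, 2.0, 4.0};
    const int n = sizeof scenarios / sizeof scenarios[0];

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {             /* child process: one scenario */
            printf("scenario %d: result %f\n", i, simulate(scenarios[i]));
            _exit(0);
        }
    }
    for (int i = 0; i < n; i++) wait(NULL);   /* collect all runs */
    return 0;
}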
SMP:
Clustering:
Superior incremental & absolute scalability. Superior availability through redundancy.
Uniform memory access (UMA): all processors have access to all parts of memory, the access time to all regions of memory is the same, and the access time to memory is the same for all processors. As used by SMP.
Non-uniform memory access (NUMA): all processors have access to all parts of memory, but the access time of a processor differs depending on the region of memory; different processors access different regions of memory at different speeds.
Motivation
NUMA retains the SMP flavour while giving large-scale multiprocessing. The objective is to maintain a transparent system-wide memory while permitting multiprocessor nodes, each with its own bus or internal interconnection system.
Each processor has its own L1 and L2 cache. Each node has its own main memory. Nodes are connected by some networking facility. Each processor sees a single addressable memory space. Memory request order:
L1 cache (local to processor)
L2 cache (local to processor)
Main memory (local to node)
Remote memory
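The request order can be pictured as a simple latency model in which each level is tried in turn and the first hit determines the cost. The latencies in the sketch below are invented for illustration only.

/* Illustrative CC-NUMA lookup order with made-up latencies (in cycles). */
#include <stdio.h>

typedef struct { const char *level; int latency; int hit; } mem_level;

int main(void) {
    /* Order in which a memory request is satisfied on a CC-NUMA node. */
    mem_level path[] = {
        {"L1 cache (local to processor)", 2,   0},
        {"L2 cache (local to processor)", 10,  0},
        {"Main memory (local to node)",   100, 1},
        {"Remote memory (another node)",  400, 1},
    };

    for (int i = 0; i < 4; i++) {
        if (path[i].hit) {
            printf("satisfied by %s after ~%d cycles\n",
                   path[i].level, path[i].latency);
            break;
        }
    }
    return 0;
}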
CC-NUMA Organization
P2-3 issues a read request on the snoopy bus of node 2. The directory on node 2 recognises that the location is on node 1. Node 2's directory sends a request to node 1's directory. Node 1's directory requests the contents of location 798. Node 1's memory puts the data on the (node 1 local) bus. Node 1's directory gets the data from the (node 1 local) bus. The data is transferred to node 2's directory. Node 2's directory puts the data on the (node 2 local) bus. The data is picked up, put in P2-3's cache and delivered to the processor.
Cache Coherence
Node 1 directory keeps note that node 2 has copy of data If data modified in cache, this is broadcast to other nodes Local directories monitor and purge local cache if necessary Local directory monitors changes to local data in remote caches and marks memory invalid until write-back Local directory forces write-back if memory location requested by another processor
Higher effective levels of parallelism than SMP. No major software changes. Performance can break down if there is too much access to remote memory. This can be avoided by:
L1 & L2 cache design reducing all memory accesses, good temporal locality of software, good spatial locality of software, and virtual memory management moving pages to the nodes that are using them most.
Not transparent
Availability?
Vector Computation
Maths problems involving physical processes present different difficulties for computation.
Supercomputers
Hundreds of millions of FLOPS. $10-15 million. Optimised for calculation rather than multitasking and I/O. Limited market: research, government agencies, meteorology.
Array processor
An alternative to the supercomputer. Configured as a peripheral to mainframes and minicomputers. Runs just the vector portion of problems.
Approaches (1)
General-purpose computers rely on iteration to do vector calculations; in the example above this needs six calculations.
Vector processing
Assume it is possible to operate on a one-dimensional vector of data. All elements in a particular row can be calculated in parallel.
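As a sketch of the iterative form, the loop below adds two six-element vectors one element at a time (the six-element size is chosen for illustration); a vector processor would perform the same work as a single vector operation across all elements.

/* Scalar form of a six-element vector addition: one iteration per element. */
#include <stdio.h>

int main(void) {
    double a[6] = {1, 2, 3, 4, 5, 6};
    double b[6] = {6, 5, 4, 3, 2, 1};
    double c[6];

    for (int i = 0; i < 6; i++)     /* six separate calculations */
        c[i] = a[i] + b[i];         /* a vector processor does this as one operation */

    for (int i = 0; i < 6; i++) printf("%g ", c[i]);
    printf("\n");
    return 0;
}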
Approaches (2)
Parallel processing
Independent processors functioning in parallel. Use FORK N to start an individual process at location N. JOIN N causes N independent processes to join and merge following the JOIN.
O/S
Co-ordinates JOINs. Execution is blocked until all N processes have reached the JOIN.
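The FORK/JOIN idea maps loosely onto thread creation and joining in a modern threads library. The sketch below uses POSIX threads as a rough analogue (FORK N approximated by pthread_create, JOIN N by joining all N threads); it is an illustration, not the historical FORK/JOIN construct itself.

/* Rough analogue of FORK/JOIN using POSIX threads. */
#include <stdio.h>
#include <pthread.h>

#define N 4

/* The activity started by each FORK: every thread runs this routine. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("process %ld running in parallel\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[N];

    for (long i = 0; i < N; i++)               /* FORK: start N independent activities */
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (int i = 0; i < N; i++)                /* JOIN: blocked until all N have finished */
        pthread_join(threads[i], NULL);

    printf("all %d processes joined; execution continues here\n", N);
    return 0;
}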
Processor Designs
Pipelined ALU
Within operations Across operations
Computer Organizations
Summary
An alternative approach to improving performance is to divide the instruction stream into smaller streams, called threads, and execute them in parallel (multithreading). Clustering is an alternative to symmetric multiprocessing for providing high performance and high availability.
Reference
Stallings, William, 2003, Computer Organization & Architecture: Designing for Performance, Sixth Edition, Pearson Education, ISBN 0-13-049307-4.
Mano, M. Morris, Computer System Architecture, Third Edition, Prentice Hall.
Carter, Larry, Measuring Performance, UCSD CSE 141, Winter 2002.
Tanenbaum, Structured Computer Organization, Fifth Edition, 2006, Pearson Education, ISBN 0-13-148521-0.
Thornley, John, CS 284a lecture, 7 October 1997.
Further Reading
Manufacturers' websites, relevant Special Interest Groups (SIGs), articles in magazines, and the IEEE Computer Society Task Force on Cluster Computing website.