
Centre for Computer Technology

ICT123 Computer Architecture


Week 12

Parallel Processing

Contents at a Glance
Review of Week 11
Threads and Processes
Implicit and Explicit Multithreading
Clusters
Parallelizing
Vector Computation

March 20, 2012

Richard Salomon, Sudipto Mitra Copyright Box Hill Institute

Shared-Memory Architecture: Cache Coherence


Problem: multiple copies of the same data may reside in several caches and in main memory
If processors are allowed to update their own copies (in the cache) freely, the result may be an inconsistent view of memory
This is the cache coherence problem
Hence multiple copies of the data (in the caches) have to be kept identical

Cache Coherence Protocols

The objective is to keep recently used local variables in the appropriate cache and let them reside there through numerous reads and writes
The protocol must maintain consistency of shared variables that are held in multiple caches at the same time
Cache coherence approaches:
- Software solutions
- Hardware solutions



Directory Protocol

- Collect and maintain information about copies of data in caches
- A centralized controller, part of the main memory controller, and a directory stored in main memory
- When a request is made, the centralized controller checks and issues the necessary commands for data transfer between memory and cache, or between caches
- The central controller keeps the state information up to date

Snoopy Protocol
- Distributes cache coherence responsibility among the cache controllers
- A cache recognizes when a line it holds is shared
- Updates are announced to other caches by a broadcast mechanism
- Each cache controller snoops on the network to observe the broadcast notifications and reacts accordingly


MESI Protocol

Modified: the line in the cache has been modified (different from main memory) and is present only in this cache
Exclusive: the line in the cache is the same as that in main memory and is not present in any other cache
Shared: the line in the cache is the same as that in main memory and may be present in another cache
Invalid: the line in the cache does not contain valid data
(A small state-machine sketch follows below.)
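To make the four states concrete, here is a minimal, illustrative sketch (Python, not from the lecture) of how a single cache line's MESI state might change in response to local and bus events. The event names and the transition table are simplified assumptions, not a complete protocol.

```python
# Minimal, illustrative MESI state machine for a single cache line.
# Event names and transitions are simplified assumptions, not a full protocol.

TRANSITIONS = {
    # (current_state, event) -> next_state
    ("I", "local_read_miss_shared"):    "S",  # another cache already holds the line
    ("I", "local_read_miss_exclusive"): "E",  # no other cache holds the line
    ("I", "local_write_miss"):          "M",  # read-for-ownership, then modify
    ("E", "local_write"):               "M",  # no bus traffic needed
    ("E", "bus_read"):                  "S",  # another cache reads our line
    ("E", "bus_write"):                 "I",
    ("S", "local_write"):               "M",  # other copies must be invalidated
    ("S", "bus_write"):                 "I",  # another cache writes the line
    ("M", "bus_read"):                  "S",  # write back, then share
    ("M", "bus_write"):                 "I",  # write back, then invalidate
}

def next_state(state: str, event: str) -> str:
    """Return the next MESI state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

if __name__ == "__main__":
    state = "I"
    for event in ["local_read_miss_exclusive", "local_write", "bus_read", "bus_write"]:
        state = next_state(state, event)
        print(f"{event:28s} -> {state}")   # traces I -> E -> M -> S -> I
```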

MESI State Transition Diagram


CPU performance
One of P&H's "big pictures":

CPU Execution Time = Instruction Count × CPI × Clock Cycle Time
(instructions/program) × (cycles/instruction) × (seconds/cycle) = seconds/program

Note: CPI is somewhat artificial (it is computed from the other numbers using this formula), but it is an intuitive and useful concept.
Note: use the dynamic instruction count (number of instructions executed), not the static count (number of instructions in the compiled code).
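As a worked illustration of the formula (the numbers below are invented, not from the slides): a program that executes 2 billion instructions with an average CPI of 1.5 on a 2 GHz clock takes 1.5 seconds.

```python
# Worked example of: CPU time = instruction count x CPI x clock cycle time.
# The numbers below are invented for illustration only.

instruction_count = 2_000_000_000   # dynamic instructions executed
cpi = 1.5                           # average clock cycles per instruction
clock_rate_hz = 2_000_000_000       # 2 GHz, so cycle time = 1 / clock rate

cycle_time_s = 1 / clock_rate_hz
cpu_time_s = instruction_count * cpi * cycle_time_s
print(f"CPU time = {cpu_time_s:.2f} s")   # 2e9 * 1.5 / 2e9 = 1.50 s
```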

Explaining performance variation


CPU Execution Time = Instruction Count × CPI × Clock Cycle Time

- Same machine, different programs
- Same program, different machines, but same ISA
- Same program, different ISAs

How do you judge computer performance?

- Clock speed? No (unless the ISA is the same)
- Peak MIPS rate? No
- Relative MIPS, normalized MFLOPS? Sometimes (if the program tested is like yours)
- How fast does it execute MY program? The best method!

What are limits?

- Physics: speed of light, size of atoms, heat generated (speed requires energy loss), capacity of the electromagnetic spectrum (for wireless), ...
- Limits with current technology: size of magnetic domains, chip size (due to defects), lithography, pin count
- New technologies on the horizon: quantum computers, molecular computers, superconductors, optical computers, holographic storage, ...
- Fallacy: assuming improvements will stop
- Pitfall: trying to predict more than 5 years into the future

Centre for Computer Technology

Processor Performance

Increasing Performance (1)


Processor performance can be measured by the rate at which it executes instructions:

MIPS rate = f × IPC

where f is the processor clock frequency in MHz and IPC is the average number of instructions completed per cycle.

Performance can therefore be increased by:
- increasing the clock frequency
- increasing the number of instructions that complete during a cycle (e.g. pipelining)
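A quick illustration of MIPS rate = f × IPC with assumed numbers (not from the lecture); doubling either the clock frequency or the IPC doubles the MIPS rate.

```python
# MIPS rate = f * IPC, with f expressed in MHz.
# The example values are assumptions for illustration.

f_mhz = 400      # processor clock frequency in MHz
ipc = 2.5        # average instructions completed per cycle

mips_rate = f_mhz * ipc
print(f"MIPS rate = {mips_rate:.0f}")   # 400 MHz * 2.5 IPC = 1000 MIPS
```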


Increasing Performance (2)

May be reaching a limit due to:
- Complexity
- Power consumption

Alternative approach:
- Divide the instruction stream into smaller streams, called threads
- Execute the threads in parallel (multithreading)

There is a wide variety of multithreading designs.

Threads and Processes (1)

A thread in a multithreaded processor may or may not be the same as a software thread

A process is an instance of a program running on a computer. A process has two main characteristics:
- Resource ownership: a virtual address space to hold the process image (a collection of program, data, stack and other attributes)
- Scheduling/execution (execution state, dispatching priority)

Process switch: an operation that switches the processor from one process to another

Threads and Processes (2)

Thread: a dispatchable unit of work within a process
- Includes a processor context (which includes the program counter and stack pointer) and its own data area for a stack
- A thread executes sequentially
- Interruptible: the processor can turn to another thread

Thread switch
- Switching the processor between threads within the same process
- Typically less costly than a process switch

Explicit Multithreading

- All commercial processors and most experimental ones use explicit multithreading
- Concurrently execute instructions from different explicit threads
- Either by interleaving instructions from different threads on shared pipelines, or by parallel execution on parallel pipelines


Implicit Multithreading

Implicit multithreading is the concurrent execution of multiple threads extracted from a single sequential program

Implicit threads are defined either:
- statically, by the compiler, or
- dynamically, by the hardware


Approaches to Explicit Multithreading (1)

Interleaved
- Also known as fine-grained multithreading
- The processor deals with two or more thread contexts at a time
- The thread being executed is switched at each clock cycle
- If a thread is blocked, it is skipped and a ready thread is executed instead


Approaches to Explicit Multithreading (2)

Blocked
- Also known as coarse-grained multithreading
- A thread is executed until an event occurs that causes a delay, e.g. a cache miss
- Effective on an in-order processor
- Avoids pipeline stalls
(A toy scheduling sketch contrasting the two policies follows below.)
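The contrast between the two policies can be sketched as a toy scheduler (Python, illustrative only; the thread traces and "stall" events are invented): fine-grained switches threads on every cycle, while coarse-grained stays on one thread until it hits a latency event.

```python
# Toy comparison of fine-grained (switch every cycle) and coarse-grained
# (switch only on a stall event) multithreading. Thread traces are invented.

threads = {
    "T0": ["op", "op", "stall", "op"],
    "T1": ["op", "stall", "op", "op"],
}

def fine_grained(threads, cycles=8):
    """Issue from a different thread on every cycle (round robin)."""
    names = list(threads)
    pcs = {name: 0 for name in names}
    schedule = []
    for cycle in range(cycles):
        name = names[cycle % len(names)]
        if pcs[name] < len(threads[name]):
            schedule.append((cycle, name, threads[name][pcs[name]]))
            pcs[name] += 1
    return schedule

def coarse_grained(threads, cycles=8):
    """Stay on one thread until it stalls, then switch to the next thread."""
    names = list(threads)
    pcs = {name: 0 for name in names}
    current = 0
    schedule = []
    for cycle in range(cycles):
        name = names[current]
        if pcs[name] < len(threads[name]):
            op = threads[name][pcs[name]]
            schedule.append((cycle, name, op))
            pcs[name] += 1
            if op == "stall":                       # latency event: switch thread
                current = (current + 1) % len(names)
        else:                                       # thread finished: switch thread
            current = (current + 1) % len(names)
    return schedule

print("fine-grained:  ", fine_grained(threads))
print("coarse-grained:", coarse_grained(threads))
```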


Approaches to Explicit Multithreading (3)

[Figure: (a)-(c) three threads (empty boxes indicate that the thread has stalled waiting for memory); (d) fine-grained multithreading; (e) coarse-grained multithreading.]
(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Approaches to Explicit Multithreading (4)

[Figure: multithreading with a dual-issue superscalar CPU: (a) fine-grained multithreading; (b) coarse-grained multithreading; (c) simultaneous multithreading.]
(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Approaches to Explicit Multithreading (5)

Simultaneous multithreading (SMT)
- Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor

Chip multiprocessing
- The processor is replicated on a single chip
- Each processor handles separate threads


Scalar Processor Approaches (1)

Single-threaded scalar
- Simple pipeline
- No multithreading

Interleaved multithreaded scalar
- Easiest multithreading to implement
- Switch threads at each clock cycle
- Pipeline stages can be kept close to fully occupied
- Hardware needs to switch the thread context between cycles

Scalar Processor Approaches (2)

Blocked multithreaded scalar


- A thread is executed until a latency event occurs that would stop the pipeline
- The processor then switches to another thread


Scalar Diagrams


Multiple Instruction Issue Processors (1)

Superscalar

No multithreading

Interleaved multithreading superscalar:


- Each cycle, as many instructions as possible are issued from a single thread
- Delays due to thread switches are eliminated
- The number of instructions issued in a cycle is limited by dependencies

Multiple Instruction Issue Processors (2)


Multiple Instruction Issue Processors (3)

Blocked multithreaded superscalar

- Instructions are issued from only one thread at a time
- Blocked multithreading is used

Very long instruction word (VLIW)
- e.g. IA-64
- Multiple instructions in a single word
- Typically constructed by the compiler
- Operations that may be executed in parallel are placed in the same word
- May pad with no-ops

Multiple Instruction Issue Processors (4)

Interleaved multithreading VLIW

Similar efficiencies to interleaved multithreading on superscalar architecture

Blocked multithreaded VLIW

Similar efficiencies to blocked multithreading on superscalar architecture


Multiple Instruction Issue Processors (5)


Parallel, Simultaneous Execution of Multiple Threads (1)

Simultaneous multithreading
- Issue multiple instructions at a time
- One thread may fill all the horizontal slots
- Alternatively, instructions from two or more threads may be issued
- With enough threads, the maximum number of instructions can be issued on each cycle


Parallel, Simultaneous Execution of Multiple Threads (2)

Chip multiprocessor
- Multiple processors on a single chip
- Each is a two-issue superscalar processor
- Each processor is assigned a thread
- Can issue up to two instructions per cycle per thread


Parallel, Simultaneous Execution of Multiple Threads (3)


Examples

Some Pentium 4 models
- Intel calls it hyper-threading
- SMT with support for two threads
- A single multithreaded processor appears, logically, as two processors

IBM Power5 (high-end PowerPC)
- Combines chip multiprocessing with SMT
- The chip has two separate processors
- Each supports two threads concurrently using SMT

Hyperthreading on the Pentium 4

Resource sharing between threads in the Pentium 4 NetBurst microarchitecture


(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Power5 Instruction Data Flow


Clusters (1)
- Alternative to SMP
- High performance
- High availability
- Server applications


Clusters (2)
- A group of interconnected whole computers
- Working together as a unified resource
- Illusion of being one machine
- Each computer is called a node


Cluster Benefits
- Absolute scalability
- Incremental scalability
- High availability
- Superior price/performance


Cluster Configurations - Standby Server, No Shared Disk


Cluster Configurations - Shared Disk


Operating Systems Design Issues (1)

Failure management
- High availability
- Fault tolerant

Failover
- Switching applications & data from the failed system to an alternative system within the cluster

Failback
- Restoration of applications and data to the original system after the problem is fixed

Operating Systems Design Issues (2)

Load balancing
- Incremental scalability
- Automatically include new computers in scheduling
- Middleware needs to recognise that processes may switch between machines


Parallelizing (1)

A single application executing in parallel on a number of machines in a cluster

Parallelizing compiler
- Determines at compile time which parts can be executed in parallel
- These parts are then split off and assigned to different computers


Parallelizing (2)

Parallelized application
- The application is written from scratch to be parallel
- Message passing is used to move data between nodes
- Hard to program, but gives the best end result

Parametric computing
- Used when a problem consists of repeating the same algorithm on different sets of data, e.g. a simulation using different scenarios
- Needs effective tools to organize and run the jobs

Cluster Computer Architecture


Cluster v. SMP (1)


- Both provide multiprocessor support to high-demand applications
- Both are available commercially (SMP for longer)


Cluster v. SMP (2)

SMP:
- Easier to manage and control
- Closer to single-processor systems (scheduling is the main difference)
- Less physical space
- Lower power consumption

Clustering:
- Superior incremental & absolute scalability
- Superior availability (redundancy)

Non-uniform Memory Access (NUMA) (1)

Alternative to both SMP & clustering

Uniform memory access (UMA) systems
- All processors have access to all parts of memory, using load & store
- Access time to all regions of memory is the same
- Access time to memory is the same for all processors
- As used by SMP

Non-uniform memory access (NUMA) systems
- All processors have access to all parts of memory, using load & store
- A processor's access time differs depending on the region of memory accessed
- Different processors access different regions of memory at different speeds

Non-uniform Memory Access (NUMA) (2)

Cache-coherent NUMA (CC-NUMA)
- Cache coherence is maintained among the caches of the various processors
- Significantly different from both SMP and clusters


Motivation

SMP has a practical limit to the number of processors
- Bus traffic limits this to between 16 and 64 processors

In a cluster, each node has its own private memory
- Applications do not see a large global memory
- Coherence is maintained by software, not hardware

NUMA retains the SMP flavour while allowing large-scale multiprocessing
- The objective is to maintain a transparent system-wide memory while permitting multiple multiprocessor nodes, each with its own bus or internal interconnection system

CC-NUMA Operation (1)

- Each processor has its own L1 and L2 cache
- Each node has its own main memory
- Nodes are connected by some networking facility
- Each processor sees a single addressable memory space

Memory request order (a small lookup sketch follows below):
1. L1 cache (local to the processor)
2. L2 cache (local to the processor)
3. Main memory (local to the node)
4. Remote memory
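A minimal sketch of that lookup order (Python, illustrative only; the cache and memory contents below are assumptions, not from the lecture):

```python
# Illustrative CC-NUMA lookup order: L1 -> L2 -> local main memory -> remote memory.
# The contents below are invented for the example.

def load(address, l1, l2, local_memory, remote_memory):
    """Return (value, where_found) following the CC-NUMA request order."""
    if address in l1:
        return l1[address], "L1 cache"
    if address in l2:
        return l2[address], "L2 cache"
    if address in local_memory:
        return local_memory[address], "local main memory"
    return remote_memory[address], "remote memory (other node)"

l1 = {0x10: 1}
l2 = {0x20: 2}
local_memory = {0x30: 3}
remote_memory = {0x798: 42}

for addr in (0x10, 0x20, 0x30, 0x798):
    value, where = load(addr, l1, l2, local_memory, remote_memory)
    print(f"address {addr:#x}: value {value} from {where}")
```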


CC-NUMA Operation (2)


- Memory requests are delivered to the requesting processor's local cache
- Automatic and transparent
- Each node maintains a directory of the location of portions of memory and of cache status


CC-NUMA Organization


Memory Access Sequence Example

Node 2, processor 3 (P2-3) requests location 798, which is in the memory of node 1 (a code sketch of this flow follows below):

1. P2-3 issues a read request on the snoopy bus of node 2
2. The directory on node 2 recognises that the location is on node 1
3. Node 2's directory sends a request to node 1's directory
4. Node 1's directory requests the contents of 798
5. Node 1's memory puts the data on the (node 1 local) bus
6. Node 1's directory gets the data from the (node 1 local) bus
7. The data is transferred to node 2's directory
8. Node 2's directory puts the data on the (node 2 local) bus
9. The data is picked up, put in P2-3's cache and delivered to the processor
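The same sequence expressed as code (Python, illustrative; the node layout, directory structure and address mapping are simplified assumptions, not the real hardware protocol):

```python
# Illustrative walk-through of the remote fetch: P2-3 on node 2 reads
# location 798, which lives in node 1's memory. Structures are simplified.

nodes = {
    1: {"memory": {798: "data@798"}, "directory": {}},
    2: {"memory": {},                "directory": {}},
}

def remote_read(requesting_node, home_node, address, cache):
    # Steps 1-2: read request on node 2's bus; node 2's directory sees that
    # the location belongs to node 1 and forwards the request.
    home = nodes[home_node]
    # Steps 3-6: node 1's directory fetches the contents from node 1's memory.
    value = home["memory"][address]
    # Node 1's directory records that node 2 now holds a copy (for coherence).
    home["directory"].setdefault(address, set()).add(requesting_node)
    # Steps 7-9: the data returns to node 2's directory, goes on node 2's bus,
    # and lands in the requesting processor's cache.
    cache[address] = value
    return value

p2_3_cache = {}
print(remote_read(requesting_node=2, home_node=1, address=798, cache=p2_3_cache))
print(nodes[1]["directory"])   # {798: {2}} -- node 1 knows node 2 has a copy
```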

Cache Coherence

- Node 1's directory keeps a note that node 2 has a copy of the data
- If the data is modified in a cache, this is broadcast to the other nodes
- Local directories monitor these broadcasts and purge the local cache if necessary
- The local directory monitors changes to local data held in remote caches and marks the memory invalid until write-back
- The local directory forces a write-back if the memory location is requested by another processor

NUMA Pros & Cons (1)

- Higher effective levels of parallelism than SMP, with no major software changes
- Performance can break down if there is too much access to remote memory
- This can be avoided by:
  - L1 & L2 cache design that reduces all memory accesses
  - Good temporal locality of software
  - Good spatial locality of software
  - Virtual memory management that moves pages to the nodes that are using them most

NUMA Pros & Cons (2)

Not transparent

Page allocation, process allocation and load balancing changes needed

Availability?


Vector Computation

Maths problems involving physical processes present particular difficulties for computation:
- High precision is required
- Repeated floating-point calculations on large arrays of numbers

Examples: aerodynamics, seismology, meteorology, continuous-field simulation


Vector Computation Handling

Supercomputers
- Hundreds of millions of FLOPS
- $10-15 million
- Optimised for calculation rather than multitasking and I/O
- Limited market: research, government agencies, meteorology

Array processor
- Alternative to the supercomputer
- Configured as a peripheral to a mainframe or minicomputer
- Runs just the vector portion of problems

Vector Addition Example


Approaches (1)

General-purpose computers rely on iteration to do vector calculations; in the vector addition example above this needs six separate calculations

Vector processing
- Assumes it is possible to operate on a one-dimensional vector of data
- All elements in a particular row can be calculated in parallel
(A small sketch contrasting the two approaches follows below.)
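A small sketch of the difference (Python; the six-element vectors below are stand-ins for the missing vector addition figure, which is an assumption): a general-purpose loop performs one addition per iteration, while a vector-style operation expresses the whole addition at once, so the elements can be computed in parallel on suitable hardware.

```python
# Element-by-element iteration versus a single whole-vector operation.
# The six-element vectors are stand-ins for the vector addition example.

a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]

# Iterative approach: one addition per loop trip (six separate calculations).
c_iterative = []
for i in range(len(a)):
    c_iterative.append(a[i] + b[i])

# Vector-style approach: the whole addition expressed as one operation.
# With NumPy (or on vector hardware) the elements can be computed in parallel.
try:
    import numpy as np
    c_vector = (np.array(a) + np.array(b)).tolist()
except ImportError:            # fall back to a pure-Python one-liner
    c_vector = [x + y for x, y in zip(a, b)]

print(c_iterative)   # [11, 22, 33, 44, 55, 66]
print(c_vector)      # same result, expressed as a single vector operation
```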


Approaches (2)

Parallel processing
- Independent processors functioning in parallel
- Use FORK N to start an individual process at location N
- JOIN N causes N independent processes to join and merge following the JOIN
- The O/S co-ordinates the JOINs; execution is blocked until all N processes have reached the JOIN
(A fork/join sketch follows below.)
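The FORK/JOIN idea can be sketched with Python's multiprocessing module (illustrative only; the slide's FORK N / JOIN N are abstract machine operations, not this API): N independent processes are started, and execution blocks at the join until all of them have finished.

```python
# Illustrative FORK/JOIN: start N independent processes, then block at the
# "join" until all N have finished. Uses multiprocessing, not the abstract
# FORK N / JOIN N operations described on the slide.
from multiprocessing import Process

def worker(n):
    print(f"process {n}: doing its share of the work")

if __name__ == "__main__":
    n_processes = 4
    procs = [Process(target=worker, args=(i,)) for i in range(n_processes)]
    for p in procs:          # FORK: start the independent processes
        p.start()
    for p in procs:          # JOIN: execution blocks until all have finished
        p.join()
    print("all processes reached the join point")
```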

Processor Designs

Pipelined ALU
- Within operations
- Across operations

Parallel ALUs

Parallel processors


Approaches to Vector Computation


Computer Organizations


IBM 3090 with Vector Facility


Summary

Processor performance can be measured by the rate at which it executes instructions: MIPS rate = f × IPC

An alternative approach to improving performance is to divide the instruction stream into smaller streams, called threads, and execute them in parallel (multithreading)

Clustering is an alternative to symmetric multiprocessing for providing high performance and high availability.

Reference

Stallings, William, 2003, Computer Organization & Architecture: Designing for Performance, Sixth Edition, Pearson Education, ISBN 0-13-049307-4.
Mano, M. Morris, Computer System Architecture, Third Edition, Prentice Hall.
Carter, Larry, Measuring Performance (CSE 141 Performance I and II), UCSD, Winter 2002.
Tanenbaum, Andrew S., 2006, Structured Computer Organization, Fifth Edition, Pearson Education, ISBN 0-13-148521-0.
Thornley, John, CS 284a Lecture, 7 October 1997.

Further Reading
- Manufacturers' websites
- Relevant Special Interest Groups (SIGs)
- Articles in magazines
- IEEE Computer Society Task Force on Cluster Computing website

