1st Module
Computer Generations
First generation
A single CPU performed serial fixed-point arithmetic.
It consisted of a program counter, branch instructions, and an
accumulator.
Machine or assembly languages were used.
Second generation
Index registers, floating point arithmetic, multiplexed memory, and
I/O registers were introduced.
High Level Languages were used.
Batch processing monitors were used.
Third generation
Integrated circuits, microprogrammed control, pipelining,
and cache memory were introduced.
Multiprogramming and time-sharing operating systems were used.
Fourth generation
Shared or distributed memory or vector hardware
was used.
Multiprocessing OS
Special languages and compilers were developed
for parallelism.
Software tools were created for parallel
processing.
Elements of Modern
Computers
Flynn's Classification
Classification of computer architectures based on
instruction stream and data stream.
Instruction stream
Sequence of instructions executed by the machine.
Data stream
Data values needed for execution of the program.
In the previous example, values of R1, R2, and R3
constitute data stream.
System Attributes to
Performance
Contd
CPI can also be expressed in terms of the number of
processor cycles needed per instruction (p) and the number
of memory references needed per instruction (m).
The duration of a memory cycle is k times that of a
processor cycle.
CPI = p + m x k
Thus, T = IC x (p + m x k) x t
Hence, IC, p, m, k, and t are the five performance
factors.
Instruction set architecture, compiler technology,
CPU implementation and control, and cache and
memory hierarchy are the four system attributes.
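As a quick check of the formula above, here is a short sketch with hypothetical values for the five performance factors (the numbers are illustrative, not from the source):

```python
# Hypothetical performance factors (illustrative values only).
IC = 200_000   # instruction count
p = 4          # processor cycles per instruction
m = 2          # memory references per instruction
k = 10         # a memory cycle takes k processor cycles
t = 1e-9       # processor cycle time (1 GHz clock)

CPI = p + m * k       # cycles per instruction: 4 + 2*10 = 24
T = IC * CPI * t      # total CPU time: 200000 * 24 * 1e-9 seconds

print(CPI)  # 24
print(T)    # ~0.0048 seconds
```

Changing any one of the five factors (for instance, a faster memory lowers k) changes T directly, which is the point of the decomposition.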
Contd
Multiprocessors and
Multicomputers
Parallel computers can be classified into
multiprocessors and multicomputers.
Basis of classification: whether they have
shared or distributed memory.
Multiprocessors: Parallel computers having
shared memory.
Multicomputers: Parallel computers having
distributed memory.
Multiprocessors
Three types
Uniform Memory Access (UMA) model
Non-Uniform Memory Access (NUMA) model
Cache Only Memory Architecture (COMA) model
UMA Model
Main memory is divided into shared memory modules
(SM1, SM2, ..., SMm).
All processors have equal access to all the shared
memory modules. Hence the name uniform memory
access.
Processors access the shared memory through
a system interconnect.
Contd
UMA can further be classified into two:
asymmetric and symmetric based on the kind of
access to peripheral devices.
Symmetric: All processors have equal access to
peripherals.
Asymmetric: Only one or a subset of processors
can have control over peripherals. These are
called master (or executive) processors and the
remaining are called attached processors.
NUMA Model
The shared memory is physically distributed among the
processors as local memories.
The collection of all local memories forms a global
address space accessible by all processors.
Access time varies with the location of the memory word:
accessing a remote memory (the local memory of another
processor) takes longer than accessing one's own local
memory.
COMA Model
Each processor has its own local cache
memory (C).
A processor can access its cache directly.
For accessing a cache associated with another
processor (i.e. remote cache access),
interconnection network and cache directories (D)
are used.
Accessing a remote cache takes more time
compared to local caches.
Hence, the COMA model can be viewed as a special case
of the NUMA model.
Limitations of a multiprocessor
Lack of scalability due to the use of shared
memory.
Latency in remote memory access.
Multicomputers
Each node is an autonomous computer consisting of a
processor and local memory.
Nodes communicate with one another by message passing
through an interconnection network.
Hence, multicomputers are also called no-remote-memory-access
(NORMA) machines.
Advantage of multicomputers
Since distributed memory is used instead of
shared memory, scalability is not an issue.
Vector Supercomputers
A conventional computer usually contains only a scalar
processor.
Vector computers have vector processors provided
as an additional feature along with scalar processors.
Program and data are loaded into the main memory
through a host computer.
All instructions, irrespective of whether they are
scalar or vector, are decoded in the scalar control
unit.
If the decoded instruction is scalar, it is executed by
the scalar processor using scalar functional
pipelines.
Contd
If the decoded instruction is a vector
instruction, it is sent to the vector control unit.
Vector control unit controls the flow of
vector data between main memory and
vector functional pipelines.
The number of vector functional pipelines
depends on the size of vector to be
executed.
Contd
SIMD Supercomputers
The operational model of an SIMD
supercomputer is specified by a 5-tuple:
(N, C, I, M, R)
N: Number of processing elements (PEs).
C: Set of instructions directly executed by the
control unit (CU).
I: Set of instructions broadcast by the CU to all
PEs for parallel execution.
M: Set of masking schemes for partitioning the PEs into
enabled and disabled subsets.
R: Set of routing functions for inter-PE communications.
Contd
PRAM
Parallel Random Access Machines
Used for modeling parallel computers with zero memory
access overhead.
An n processor PRAM has a globally addressable shared
memory (as shown in fig).
Four memory update operations are possible:
exclusive read (ER), exclusive write (EW),
concurrent read (CR), and concurrent write (CW).
Combining the read and write policies gives four PRAM variants:
EREW PRAM
CREW PRAM
ERCW PRAM
CRCW PRAM
Conditions of Parallelism
Data and resource dependences
Hardware and software parallelism
The role of compilers
Flow Dependence
Consider two statements S1 and S2.
S2 is flow dependent on S1 if there exists an
execution path from S1 to S2 and at least one output
of S1 feeds in as input to S2.
Denoted as S1 → S2.
Antidependence: S2 is antidependent on S1 if
S2 follows S1 in program order.
Contd
Output of S2 overlaps the input of S1.
Denoted as S1 → S2.
Output Dependence
S1 and S2 are output dependent if they produce the
same output variable.
Denoted as S1 → S2.
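A minimal illustration of the three data dependences, using hypothetical assignment statements (the statements and variable names are made up for illustration):

```python
# S1/S2: flow dependence   (S2 reads the value of a written by S1)
# S3/S4: antidependence    (S4 overwrites d, which S3 reads)
# S5/S6: output dependence (S5 and S6 both write the variable f)
b, c = 2, 3
a = b + c     # S1
d = a * 2     # S2: flow dependent on S1
e = d + 1     # S3
d = 5         # S4: antidependent on S3
f = 1         # S5
f = 2         # S6: output dependent on S5
print(a, d, e, f)  # 5 5 11 2
```

Reordering any of these statement pairs would change the final values, which is why the dependences constrain parallel execution.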
Contd
A dependence cannot be determined (unknown dependence)
when the subscript contains a nonlinear loop index
variable (e.g., A[x^2]).
I/O Dependence
Read and write are the I/O statements.
I/O dependence occurs when same file is referenced
by both the I/O statements.
Read F4 and Write F4 are I/O dependent as the same
file F4 is involved.
Control Dependence
Occurs in programs involving branch instructions.
if (P1) then S1 indicates that S1 is control dependent on P1.
Contd
Two constraints
A statement that is not control dependent on a branch should
not be moved to after the branch instruction.
A statement that is control dependent on a branch should not
be moved to before the branch.
Resource Dependence
Occurs when multiple processes are accessing the same
resources such as integer units, FP units etc.
Storage dependence occurs when the same storage
location is accessed by multiple processes.
ALU dependence occurs when the conflicting resource is
ALU.
ADD R1, R2 and ADD R3, R2 cause resource dependence as
the same resource, the adder, is used.
Contd
Bernstein's Conditions
Set of conditions based on which two processes
can execute in parallel.
Input set of a process: Set of all input variables
needed to execute a process.
Output Set of a Process: Set of all output variables
generated after execution of the process.
Let I1 and I2 be the input sets of processes P1 and
P2 respectively.
Let O1 and O2 be the output sets of P1 and P2.
P1||P2 indicates P1 and P2 can execute in parallel.
Contd
P1 || P2 iff
P1 and P2 are not flow dependent, i.e., the output set of
P1 and the input set of P2 do not intersect: O1 ∩ I2 = ∅.
P1 and P2 are not antidependent, i.e., the input set of
P1 and the output set of P2 do not intersect: I1 ∩ O2 = ∅.
P1 and P2 are not output dependent, i.e., the output sets
of P1 and P2 do not intersect: O1 ∩ O2 = ∅.
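Bernstein's three conditions map directly onto set intersections; a sketch, with hypothetical processes and variable names:

```python
def can_parallelize(I1, O1, I2, O2):
    """P1 || P2 iff O1 ∩ I2 = ∅, I1 ∩ O2 = ∅, and O1 ∩ O2 = ∅."""
    return not (O1 & I2) and not (I1 & O2) and not (O1 & O2)

# P1: c = a + b  ->  I1 = {a, b}, O1 = {c}
# P2: d = a * e  ->  I2 = {a, e}, O2 = {d}  (independent of P1)
print(can_parallelize({"a", "b"}, {"c"}, {"a", "e"}, {"d"}))  # True

# P3: a = c + 1  ->  reads c, which P1 writes: flow dependent
print(can_parallelize({"a", "b"}, {"c"}, {"c"}, {"a"}))  # False
```

Note that both processes may read the same variable (a above) without violating any condition: input sets are allowed to intersect.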
Software Parallelism
Depends on algorithm, programming style, and
program design.
Contd
Two types: Control parallelism and data parallelism.
Control parallelism: Two or more operations can be performed
simultaneously.
Data parallelism: Same operation is performed over many data
items by many processors simultaneously.
Network Properties
Node degree
The number of edges (links) incident on a node.
Network Diameter
Find the shortest path between each pair of nodes in the network.
The maximum of these shortest paths gives the network diameter.
Bisection Width
The minimum number of edges that need to be cut in order to bisect
the entire network into 2 equal halves.
2 equal halves means each half has an equal number of nodes.
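The diameter definition above can be sketched as a breadth-first search from every node; the 4-node ring below is a made-up test network:

```python
from collections import deque

def diameter(adj):
    """Maximum over all nodes of the longest shortest path (BFS eccentricity)."""
    def eccentricity(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(eccentricity(n) for n in adj)

# Bidirectional 4-node ring 0-1-2-3-0: diameter = N/2 = 2.
ring4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(diameter(ring4))  # 2
```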
Permutations
For n objects, there are n! permutations by which
they can be reordered.
We can use permutations to connect n Processing
Elements among themselves.
Permutations can be done using crossbar switches
and multistage networks.
Permutation capability of a network reflects the
data routing capability.
Higher permutation capability indicates higher
routing efficiency.
Contd
Perfect Shuffle
Consider n = 2^k nodes
Represent each node by a k bit binary number.
For example if there are 8 nodes (n = 2^3 = 8), the value of k is
3.
Hence, represent each node by 3 bit binary numbers.
For instance, represent node 0 by 000, node 1 by 001, node 2 by
010 and so on.
Perfect shuffle maps x to y where y is obtained by performing a
circular left shift of 1 bit on x.
Example: Consider node 3 (i.e. 011). Circular left shift of 011
gives 110 (i.e. node 6). Thus, routing is done between nodes 3
and 6. Repeat this for all nodes.
Inverse shuffle: Instead of circular left shift, perform a circular
right shift.
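The shuffle mapping above can be sketched with a few bit operations (a sketch, assuming k-bit node addresses as described):

```python
def perfect_shuffle(x, k):
    """Circular left shift of the k-bit address x by one bit."""
    msb = (x >> (k - 1)) & 1
    return ((x << 1) & ((1 << k) - 1)) | msb

def inverse_shuffle(x, k):
    """Circular right shift of the k-bit address x by one bit."""
    lsb = x & 1
    return (x >> 1) | (lsb << (k - 1))

# Node 3 (011) maps to node 6 (110), matching the example above.
print(perfect_shuffle(3, 3))  # 6
print(inverse_shuffle(6, 3))  # 3
```

Because the shift is circular, the mapping is a permutation: every node gets exactly one incoming and one outgoing connection.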
Contd
Contd
Hypercube routing
An n-dimensional cube has 2^n nodes, i.e., a node on
each of its vertices.
Each node is represented as an n-bit binary number.
An n-dimensional cube allows n routing functions.
A 3-dimensional cube has 2^3 = 8 nodes and each
one is represented by a 3-bit binary number.
There are 3 different routing functions possible in a
3-dimensional cube (binary 3-cube).
Routing by least significant bit
Routing by middle bit
Routing by most significant bit
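Each routing function complements one bit of the node address; a minimal sketch (the node numbers are illustrative):

```python
def route_by_bit(x, i):
    """Hypercube routing function: complement bit i of node address x."""
    return x ^ (1 << i)

# In a 3-cube, node 5 (101) has three neighbors, one per routing function:
print(route_by_bit(5, 0))  # 4 (100): routing by least significant bit
print(route_by_bit(5, 1))  # 7 (111): routing by middle bit
print(route_by_bit(5, 2))  # 1 (001): routing by most significant bit
```

Applying the same function twice returns to the starting node, since complementing a bit is its own inverse.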
Network Performance
Network performance can be affected by different
factors.
Functionality: Refers to how a network supports data
routing, synchronization, coherence etc.
Network Latency: Worst case time delay for a unit
message to be transmitted through the network.
Bandwidth: Maximum data transfer rate (Mbytes/s or
Gbytes/s).
Hardware complexity: Implementation cost for the
hardware such as wires, switches etc.
Scalability: Ability of the network to be expandable
with a scalable performance.
Static Connection
Networks
Static connection networks have fixed links
which do not change during execution.
Suitable for communication patterns which
are predictable.
Linear Array
N nodes are connected in a line by N - 1 links.
Internal nodes have degree 2; the two terminal nodes
have degree 1.
Network diameter = N - 1.
Ring
Obtained by connecting two terminal nodes of a
linear array with an extra link.
All nodes have a degree of 2.
Ring can be of unidirectional or bidirectional
types.
The diameter of a bidirectional ring = N/2
Diameter of a unidirectional ring = N
Chordal Ring
It is obtained from a ring by increasing the node
degree from 2 to a higher value.
Contd
Barrel Shifter
Obtained from a ring by adding links from each
node to all those nodes having distance equal to
an integer power of 2.
Node i is connected to node j if |j - i| = 2^r, where
r = 0, 1, 2, ..., n-1.
A barrel shifter has a network size of N = 2^n.
Node degree = 2n - 1
Network diameter = n/2
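Enumerating a node's neighbors confirms the degree formula; a sketch assuming the modular (wraparound) form of the connection rule:

```python
def barrel_neighbors(i, n):
    """Neighbors of node i in a barrel shifter with N = 2^n nodes:
    nodes at distance 2^r (mod N) for r = 0..n-1, in both directions."""
    N = 1 << n
    nbrs = set()
    for r in range(n):
        nbrs.add((i + (1 << r)) % N)  # forward link at distance 2^r
        nbrs.add((i - (1 << r)) % N)  # backward link at distance 2^r
    return sorted(nbrs)

# N = 8 (n = 3): node 0 connects to 2n - 1 = 5 distinct neighbors.
print(barrel_neighbors(0, 3))       # [1, 2, 4, 6, 7]
print(len(barrel_neighbors(0, 3)))  # 5
```

The degree is 2n - 1 rather than 2n because the link at distance 2^(n-1) is its own opposite (halfway around the ring), so it is counted once.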
Binary Tree
A k-level binary tree has (2^k) - 1 nodes.
A 5-level tree has 31 nodes.
Maximum node degree is 3 and diameter is 2(k-1).
Fat Tree
Channel width increases as we ascend from leaves to
the root.
Can cause performance bottleneck as the amount of
traffic becomes heavier towards the root.
Star Tree
Two level tree with central node having the highest
degree.
Mesh
Not symmetric in the sense that node degrees are
different for boundary nodes and interior nodes.
Illiac Mesh
Obtained from a mesh.
Connect the terminal node of each row with the
starting node of subsequent row.
Also, connect the terminal node of last row with
the starting node of first row.
Starting node of each column is connected to the
end node of that column itself.
Contd
Torus
Similar to the Illiac mesh.
The only difference: the ending node of each
row is connected to the starting node of that row
itself.
Systolic array
Used for implementing fixed algorithms.
Hypercubes
Dynamic Connection
Networks
Connections are established dynamically by setting
switches or arbiters according to program demands.
Examples: bus systems, crossbar switches, and
multistage interconnection networks.
Switch modules
An a x b switch module has a input
ports and b output ports.
A 2 x 2 switch module is also known as binary
switch.
Usually a and b are chosen as integer powers
of two.
A switch module has two basic modes of operation
Straight: each input is connected to the corresponding
output (one-to-one mapping).
Crossover: the inputs are swapped to the opposite outputs.
Broadcast settings, which map one input to many outputs,
are also possible.
Omega Network
It is a kind of multistage network made of 2 x 2
switches.
Consider an N x N omega network.
It contains log2(N) stages.
Each stage contains N/2 switches.
Communication between switches present in
different stages is done through the technique of
perfect shuffle.
Output of the first stage serves as input to the
next stage.
Contd
Number of stages = log2(8) = 3
Each stage contains 8/2 = 4 switches of size 2 x 2.
Stage 1 contains the switches A1, A2, A3, A4.
Stage 2 contains B1, B2, B3, B4.
Stage 3 contains C1, C2, C3, C4.
A1 has two inputs marked as 0 (000) and 1 (001), and
two outputs marked again as 0 (000) and 1 (001).
A2 has two inputs 2 (010) and 3 (011), and two
outputs 2 (010) and 3 (011).
Similarly mark the inputs and outputs of other
switches also.
Contd
Each stage consists of 8 inputs (each of the 4
switches having 2 inputs).
Perform a perfect shuffle on each input.
Perfect shuffle of 000 gives 000 itself.
Perfect shuffle of 001 gives 010, hence connect
port 001 to 010 in all the three stages.
Repeat this for the remaining inputs also.
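Routing through the stages can be sketched with the standard destination-tag scheme: at each stage the port number is circularly shifted left (the perfect shuffle described above), and the switch then sets the low bit to the next destination bit (MSB first). A sketch for the 8 x 8 case; the port numbers are illustrative.

```python
def omega_route(src, dst, k):
    """Path of a message through a 2^k x 2^k omega network.
    Each stage: circular-shift the address left, then replace
    the low bit with the next destination bit (MSB first)."""
    mask = (1 << k) - 1
    path = [src]
    cur = src
    for i in range(k - 1, -1, -1):
        bit = (dst >> i) & 1
        cur = ((cur << 1) & mask) | bit
        path.append(cur)
    return path  # the last entry always equals dst

print(omega_route(1, 6, 3))  # [1, 3, 7, 6]
```

After k stages all source bits have been shifted out and replaced by destination bits, so the message always arrives at dst.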
Baseline Networks
Another variant of multistage interconnection
networks.
Consider an N x N baseline network.
Number of stages = log2(N)
Each stage contains N/2 switches of size 2 x 2.
The first stage contains one N x N block.
The second stage contains two N/2 x N/2 sub
blocks (C0 and C1).
The process is repeated with all the sub blocks
until N/2 sub blocks of size 2 x 2 are reached.
Crossbar Network
It is a single-stage switch network.
Provides dynamic connections between source-destination pairs.
A switch is placed at each crosspoint.
These crosspoint switches are responsible for
controlling the traffic between source-destination
pairs.
Scalability Metrics
Machine size (n): number of processors in the system.
Problem size (s): amount of computational workload.
Execution time, T(s, n): time taken to execute a program
of size s on a machine with n processors.
Memory capacity (m): amount of main memory
(in bytes or words) needed to execute the
program.
Communication overhead (h): time spent on
interprocessor communication, synchronization,
remote memory access, etc. Expressed as h(s, n).
Computer cost (c): total cost of hardware and
software required.
Programming overhead (p): overhead associated
with the development of an application program.
Speedup, S(s, n) = T(s, 1) / (T(s, n) + h(s, n))
Efficiency, E(s, n) = S(s, n) / n
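The two formulas above, sketched with hypothetical timings (the numbers are illustrative, not from the source):

```python
def speedup(T1, Tn, h):
    """S(s, n) = T(s, 1) / (T(s, n) + h(s, n))."""
    return T1 / (Tn + h)

def efficiency(T1, Tn, h, n):
    """E(s, n) = S(s, n) / n."""
    return speedup(T1, Tn, h) / n

# Hypothetical: 100 s on one processor, 30 s on 4 processors,
# plus 10 s of communication/synchronization overhead.
print(speedup(100.0, 30.0, 10.0))        # 2.5
print(efficiency(100.0, 30.0, 10.0, 4))  # 0.625
```

Note how the overhead h keeps the speedup (2.5) well below the ideal value of n = 4, which is exactly what the formula is meant to capture.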