
CSA

1st Module

Computer Generations

First generation
Single CPU which performed serial fixed point arithmetic.
Consisted of program counter, branch instructions, and an
accumulator.
Machine or assembly languages were used.

Second generation
Index registers, floating point arithmetic, multiplexed memory, and
I/O registers were introduced.
High Level Languages were used.
Batch processing monitors were used.

Third generation

Microprogrammed control was used.


Pipelining and cache memory were introduced.
Multiprogramming was implemented.
Time sharing OS was used.

Contd

Fourth generation
Shared or distributed memory or vector hardware
was used.
Multiprocessing OS
Special languages and compilers were developed
for parallelism.
Software tools were created for parallel
processing.

Elements of Modern
Computers

Flynn's Classification
Classification of computer architectures based on
instruction stream and data stream.
Instruction stream

Constitutes different instructions to be executed.


In a program ADD R1, R2; SUB R2, R3; MUL R1, R3
ADD, SUB, and MUL constitute the instruction stream.

Data stream
Data values needed for execution of the program.
In the previous example, the values of R1, R2, and R3
constitute the data stream.

Computer architectures can be classified into four:
SISD, SIMD, MISD, and MIMD.

Contd

Single Instruction stream over Single Data stream (SISD)
At a time, only a single instruction can be
executed using a single data stream.
No parallelism

Contd

Single Instruction stream over Multiple Data stream (SIMD)
Single instruction stream can be executed using
multiple data streams at the same time.
ADD R1, R2 and ADD R3, R4 can be executed
simultaneously.
Exploits data level parallelism.
Used in vector processors.
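As an illustration only (this sketch and its values are my own, not from the slides), the Python fragment below contrasts the SISD view, where one add executes per step, with the SIMD view, where a single vector add is applied to all data pairs at once.

# Illustrative sketch: SISD-style vs SIMD-style execution of the same adds.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# SISD view: one instruction operates on one data pair per step.
sisd_result = []
for x, y in zip(a, b):              # each iteration is a separate ADD
    sisd_result.append(x + y)

# SIMD view: a single "vector add" is applied to every data pair at once;
# the comprehension stands in for the hardware's lock-step lanes.
simd_result = [x + y for x, y in zip(a, b)]

assert sisd_result == simd_result == [11, 22, 33, 44]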

Contd

Multiple Instruction stream over Single Data stream (MISD)
Multiple instructions can be executed together
using the same data stream.
ADD R1, R2; SUB R1, R2; MUL R1, R2 can be
executed in parallel.
Exploits instruction level parallelism.

Contd

Multiple Instruction stream over Multiple Data stream (MIMD)
Multiple instructions can be executed in parallel
using multiple data streams.
ADD R1, R2; SUB R2, R3; MUL R1, R3 can be
executed in parallel.
Exploits task level parallelism (both instruction
level and data level parallelism)
Employed in most of the modern day
architectures.

Contd

Contd

System Attributes to
Performance

Factors which affect the performance of a processor.


CPU is driven by a clock. Execution of a program by CPU
involves a number of clock cycles.
Each clock cycle has a specific duration (t).
Clock Frequency (f): Inverse of the clock cycle duration (i.e., 1/t)
Clock cycles Per Instruction (CPI): Ratio of the total no of clock cycles needed to execute a program to the total number of instructions present in the program.
CPI = Total no of clock cycles / Instruction Count (IC)
CPU execution time for a program (T):
T = Instruction Count x CPI x t, OR
T = Instruction Count x CPI x (1/f)
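A quick worked example (the numbers below are assumed for illustration, not taken from the slides) shows how T follows from IC, CPI, and the clock:

# Hypothetical values chosen only to exercise T = IC x CPI x t.
IC = 2_000_000        # instruction count of the program
CPI = 1.5             # average clock cycles per instruction
f = 500e6             # clock frequency in Hz (500 MHz)
t = 1 / f             # clock cycle duration in seconds

T = IC * CPI * t      # CPU execution time, same as IC * CPI * (1/f)
print(f"T = {T * 1000:.1f} ms")   # 6.0 ms for these assumed values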

Contd
CPI can also be expressed in terms of the number of
processor cycles needed (p) and the number of memory
references needed (m).
Duration of a memory cycle is k times that of a
processor cycle.
CPI = p + m x k
Thus, T = IC x (p + m x k) x t
Hence, IC, p, m, k, and t are the five performance
factors.
Instruction set architecture, compiler technology,
CPU implementation and control, and cache and
memory hierarchy are the four system attributes.

Contd

Performance factors are influenced by the system attributes in the following ways:

Instruction set architecture affects IC and p


Compiler technology affects IC, p, and m
CPU implementation and control affects p and t
Cache and memory hierarchy affects k and t

MIPS rate (Million Instructions Per Second rate):
MIPS = IC / (T x 10^6) = f / (CPI x 10^6)

Throughput rate (W)


W = f/ (IC x CPI)
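Continuing with the same assumed numbers (again illustrative, not from the slides), the MIPS rate and throughput follow directly from the definitions above:

# Hypothetical values plugged into the MIPS and throughput formulas.
IC = 2_000_000             # instruction count
CPI = 1.5                  # cycles per instruction
f = 500e6                  # clock frequency in Hz
T = IC * CPI / f           # execution time in seconds

mips = IC / (T * 1e6)      # equals f / (CPI * 1e6)
W = f / (IC * CPI)         # programs completed per second

print(f"MIPS rate = {mips:.1f}")             # about 333.3
print(f"Throughput W = {W:.2f} programs/s")  # about 166.67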

Implicit and Explicit Parallelism

Multiprocessors and
Multicomputers
Parallel computers can be classified into
multiprocessors and multicomputers.
Basis of classification: whether they are
having shared or distributed memory.
Multiprocessors: Parallel computers having
shared memory.
Multicomputers: Parallel computers having
distributed memory.

Multiprocessors

Three types
Uniform Memory Access (UMA) model
Non-Uniform Memory Access (NUMA) model
Cache Only Memory Architecture (COMA) model

UMA Model
Main memory is divided into shared memory modules (SM1, SM2, ..., SMm).
All processors have equal access time to all the shared memory modules. Hence the name uniform memory access.
Processors access the shared memory modules through
a system interconnect.

Contd
UMA can further be classified into two:
asymmetric and symmetric based on the kind of
access to peripheral devices.
Symmetric: All processors have equal access to
peripherals.
Asymmetric: Only one or a subset of processors
can have control over peripherals. These are
called master(or executive) processors and the
remaining are called attached processors.

Contd

Contd

NUMA Model (shared local memory model)


Shared memory is distributed to processors as
local memories (LM1 for P1, LM2 for P2 and so
on).
A processor can access its own local memory
quickly i.e., P1 can access LM1 quickly.
But, it takes more time for the processor to access
local memory of other processors i.e., P1 can
access LM2 (through interconnection network) but
it will take more time.
Thus, the time of access to the local memories is
non uniform.

Contd

NUMA model (Hierarchical Cluster Model)


Processors are divided into clusters.
Each cluster is a UMA model i.e., all processors belonging to
a cluster can uniformly access the Cluster Shared Memory
(CSM) modules through Cluster Interconnection Networks
(CIN).
The clusters are connected to Global Shared Memory (GSM)
modules.
All clusters have equal access time to GSMs.
Access time to cluster shared memory (CSM) is shorter than
the access time to global shared memory (GSM) which in
turn is shorter than intercluster memory access times.
Access to GSM as well as intercluster memory are done
through Global Interconnection Networks.

Contd

Contd

COMA Model
Each processor consists of its own local cache
memories (C).
A processor can access its cache directly.
For accessing a cache associated with another
processor (i.e. remote cache access),
interconnection network and cache directories (D)
are used.
Accessing a remote cache takes more time
compared to local caches.
Hence, the COMA model can be regarded as a special case
of the NUMA model.

Contd

Contd

Limitation of a multiprocessor
Lack of scalability due to the usage of shared
memories.
Latency in remote memory access.

Multicomputers

Multicomputers have distributed memories.


Consists of multiple computers called nodes.
Each node is an autonomous system consisting
of a processor and a local memory.
Unlike multiprocessors, local memory
associated with a processor can be accessed
by that processor alone.
No remote access is possible.
Communication between nodes is carried out
by message passing interconnection networks.

Contd

Contd

Advantage of multicomputers
Since distributed memory is used instead of
shared memory, scalability is not an issue.

Vector Supercomputers
A normal computer usually contains the scalar
processor alone.
Vector computers have vector processors provided
as an additional feature along with scalar processors.
Program and data are loaded into the main memory
through a host computer.
All instructions, irrespective of whether they are
scalar or vector, are decoded in the scalar control
unit.
If the decoded instruction is scalar, it is executed by
the scalar processor using scalar functional
pipelines.

Contd
If the decoded instruction is a vector instruction, it is
sent to the vector control unit.
Vector control unit controls the flow of
vector data between main memory and
vector functional pipelines.
The number of vector functional pipelines
depends on the size of vector to be
executed.

Contd

SIMD Supercomputers
The operational model of an SIMD
supercomputer is specified by a 5 tuple form:
(N, C, I, M, R)
N: No of processing elements
C: set of instructions directly executed by the
control unit.
I: set of instructions broadcast by the CU to all
PEs for parallel execution.
M: Masking schemes for partitioning PEs into
enabled and disabled subsets.
R: Routing functions for inter PE communications.

Contd

Theoretical Models of Parallel Computers
Theoretical models do not exist in reality.
Used by algorithm developers for
developing parallel algorithms.
Two theoretical models:

Parallel Random Access Machines (PRAM)


VLSI Complexity models

PRAM
Parallel Random Access Machines
Used for modeling parallel computers with zero memory
access overhead.
An n processor PRAM has a globally addressable shared
memory (as shown in fig).
Four memory update operations are possible:
Exclusive Read (ER): at most one processor can read from a memory location in each cycle.
Exclusive Write (EW): at most one processor can write to a memory location at a time.
Concurrent Read (CR): multiple processors can read from a memory location at the same time.
Concurrent Write (CW): allows simultaneous writes to the same memory location.

Contd

Contd

Variants of PRAM (based on memory update operations):

EREW PRAM
CREW PRAM
ERCW PRAM
CRCW PRAM
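To make the PRAM abstraction concrete, here is a minimal sketch (my own illustration, not from the slides) of the classic pairwise-sum pattern that an EREW PRAM could execute in O(log n) synchronous cycles; Python simply serialises each cycle.

# Each while-iteration models one PRAM cycle: every active processor adds
# two cells whose indices it alone reads and writes (exclusive access).
def pram_style_sum(values):
    cells = list(values)          # shared memory cells
    n = len(cells)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            cells[i] = cells[i] + cells[i + stride]   # done "in parallel"
        stride *= 2
    return cells[0]

print(pram_style_sum([3, 1, 4, 1, 5, 9, 2, 6]))   # 31, in log2(8) = 3 cycles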

VLSI Complexity Model

VLSI chips are used to fabricate major components such as processor arrays and memory arrays.
Analysis can be done by setting limits on
memory, IO, and communication.
Memory bound on chip area
Memory required is set as a lower bound on chip area,
A.

I/O bound on volume
The volume of the cube is represented as AT, where T is the period of time for which information flows through the chip.
The no of input bits cannot exceed AT.

Contd

Contd

Bisection communication bound, √A x T

The bisection is represented by a vertical slice that cuts the chip into two halves.
Let the width of this bisection (the chip cross section) be √A.
Let the height of the cross section be T.
Hence, the bisection area will be √A x T.
The bisection area represents the maximum
amount of information exchange between the two
halves of the chip.
Hence, we can say that the bisection area limits
the communication bandwidth.

Contd

The AT^2 Model


Let A be the chip area.
Let T be the latency for completing a given computation using the VLSI chip.
Let s denote the problem size involved in the computation.
There exists a lower bound f(s) such that:
A x T^2 >= O(f(s))

Conditions of Parallelism
Data and resource dependences
Hardware and software parallelism
The role of compilers

Data and Resource Dependences
Data Dependence: Indicates the ordering
relationship between statements.
5 types of data dependence.

Flow Dependence
Consider two statements S1 and S2.
S2 is flow dependent on S1 if there exists an
execution path from S1 to S2 and at least one output
of S1 feeds in as input to S2.
Denoted as S1 → S2.

Antidependence: S2 is antidependent on S1 if
S2 follows S1 in program order.

Contd
Output of S2 overlaps the input of S1.
Denoted as S1 ↛ S2 (an arrow crossed with a bar).

Output Dependence
S1 and S2 are output dependent if they produce the
same output variable.
Denoted as S1 o→ S2 (an arrow marked with a small circle).
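The short sketch below (an illustration of my own, with registers modelled as Python variables) shows all three dependence types on four simple statements:

# S1: R1 = R2 + R3
# S2: R4 = R1 * 2   -> flow dependent on S1 (S2 reads R1, which S1 writes)
# S3: R2 = R5 - 1   -> antidependent on S1 (S3 writes R2, which S1 reads)
# S4: R4 = R6 + R7  -> output dependent on S2 (both write R4)
R2, R3, R5, R6, R7 = 2, 3, 7, 1, 4
R1 = R2 + R3       # S1
R4 = R1 * 2        # S2
R2 = R5 - 1        # S3
R4 = R6 + R7       # S4
print(R1, R2, R4)  # 5 6 5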

Unknown: the dependence relation cannot be determined in the following cases:
The subscript of a variable is itself subscripted, i.e., indirect addressing (e.g., A[B[i]]).
The subscript does not contain the loop index variable (e.g., A[k] inside a loop over i).
The variable appears more than once with different subscript values (e.g., A[2] and A[3] in the same program).

Contd
The subscript is a nonlinear function of the loop index variable (e.g., A[x^2]).

I/O Dependence
Read and write are the I/O statements.
I/O dependence occurs when same file is referenced
by both the I/O statements.
Read F4 and Write F4 are I/O dependent as the same
file F4 is involved.

Control Dependence
Occurs in programs involving branch instructions.
If (P1) then {S1} indicates that S1 is control dependent on P1.

Contd
Two constraints
A statement which is not control dependent on a branch should
not be brought after the branch instruction.
A statement which is control dependent on a branch should not
be taken before the branch.

Resource Dependence
Occurs when multiple processes are accessing the same
resources such as integer units, FP units etc.
Storage dependence occurs when the same storage
location is accessed by multiple processes.
ALU dependence occurs when the conflicting resource is
ALU.
ADD R1, R2 and ADD R3, R2 cause a resource dependence as
the same resource, the adder, is used.

Contd

Bernstein's Conditions
Set of conditions based on which two processes
can execute in parallel.
Input set of a process: Set of all input variables
needed to execute a process.
Output Set of a Process: Set of all output variables
generated after execution of the process.
Let I1 and I2 be the input sets of processes P1 and
P2 respectively.
Let O1 and O2 be the output sets of P1 and P2.
P1||P2 indicates P1 and P2 can execute in parallel.

Contd
P1 || P2 iff
P1 and P2 are not flow dependent, i.e., the output set of P1 and the input set of P2 do not intersect: O1 ∩ I2 = ∅.
P1 and P2 are not antidependent, i.e., the input set of P1 and the output set of P2 do not intersect: I1 ∩ O2 = ∅.
P1 and P2 are not output dependent, i.e., the output sets of P1 and P2 do not intersect: O1 ∩ O2 = ∅.
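A minimal sketch of the pairwise test (my own illustration; the variable sets used as examples are assumed, not from the slides):

def bernstein_parallel(I1, O1, I2, O2):
    """True if two processes with these input/output sets may run in parallel."""
    no_flow = not (O1 & I2)     # P2 does not read anything P1 writes
    no_anti = not (I1 & O2)     # P2 does not write anything P1 reads
    no_output = not (O1 & O2)   # P1 and P2 do not write the same variable
    return no_flow and no_anti and no_output

# P1: C = A + B  and  P2: D = E * F  -> independent, can run in parallel
print(bernstein_parallel({"A", "B"}, {"C"}, {"E", "F"}, {"D"}))   # True
# P1: C = A + B  and  P3: G = C - 1  -> flow dependent, cannot
print(bernstein_parallel({"A", "B"}, {"C"}, {"C"}, {"G"}))        # False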

Bernstein's conditions for a set of processes


A set of processes can execute in parallel if and
only if each pair of processes within that set can
also execute in parallel.
P1 || P2 || ... || Pk iff Pi || Pj for all i ≠ j

Hardware and Software Parallelism
Special hardware and software support are
needed for implementation of parallelism.
Hardware Parallelism

Parallelism defined by the machine architecture.


Characterized by the number of instructions
issued per machine cycle.
A k-issue processor issues k instructions per
machine cycle.

Software Parallelism
Depends on algorithm, programming style, and
program design.

Contd
Two types: Control parallelism and data parallelism.
Control parallelism: Two or more operations can be performed
simultaneously.
Data parallelism: Same operation is performed over many data
items by many processors simultaneously.

Hardware/ Software mismatch


Consider a program with an average software parallelism of 2.6
Now, consider the execution of the same program by a 2-issue
processor, i.e., a processor which can issue only two
instructions per cycle.
Such a processor cannot fully exploit an average software
parallelism of 2.6; some of the available parallelism goes unused.
This gap between the parallelism available in the software and the
parallelism supported by the hardware is called hardware/software mismatch.

The Role of Compilers

Hardware/ software mismatch can be prevented by


the use of an optimizing compiler.
The problem occurs when a program is compiled without regard to the target hardware.
Such a compilation generates object code that does not match the parallelism the hardware can exploit.
If appropriate object code can be generated, mismatch can
be prevented.
A compiler is responsible for generating object code and
hence, the usage of an appropriate compiler can prevent
mismatch.

Compilers are of great use in implicit parallelism.


Parallelizing compilers generate parallel object codes from
sequential inputs.

Network Properties

A network consists of nodes connected by appropriate interconnections.
Network size
No of nodes in the network.

Node degree

The number of edges incident on a node.


The number of edges going out of a node is the out-degree.
The number of edges coming into a node is the in-degree.
Node degree = in-degree + out-degree

Network Diameter
Find out the shortest path between each pair of nodes in the network.
The maximum of these shortest paths gives the network diameter.

Bisection Width
The minimum no of edges that need to be cut in order to bisect the entire
network into 2 equal halves.
2 equal halves indicate that each half has equal no of nodes.
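The following sketch (illustrative only; the 4-node ring used as input is an assumed example) computes node degrees and the network diameter for a network given as an adjacency list, using breadth-first search for the shortest paths:

from collections import deque

def diameter(adj):
    """Maximum over all node pairs of the shortest-path length in hops."""
    best = 0
    for src in adj:                       # BFS from every node
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

# A 4-node bidirectional ring: every node has degree 2, diameter N/2 = 2.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print({node: len(nbrs) for node, nbrs in ring.items()})   # node degrees
print(diameter(ring))                                     # 2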

Network Data Routing Functions

Permutations
For n objects, there are n! permutations by which
they can be reordered.
We can use permutations to connect n Processing
Elements among themselves.
Permutations can be done using crossbar switches
and multistage networks.
Permutation capability of a network reflects the
data routing capability.
Higher permutation capability indicates higher
routing efficiency.

Contd

Perfect Shuffle
Consider n = 2^k nodes
Represent each node by a k bit binary number.
For example if there are 8 nodes (n = 2^3 = 8), the value of k is
3.
Hence, represent each node by 3 bit binary numbers.
For instance, represent node 0 by 000, node 1 by 001, node 2 by
010 and so on.
Perfect shuffle maps x to y where y is obtained by performing a
circular left shift of 1 bit on x.
Example: Consider node 3 (i.e. 011). Circular left shift of 011
gives 110 (i.e. node 6). Thus, routing is done between nodes 3
and 6. Repeat this for all nodes.
Inverse shuffle: Instead of circular left shift, perform a circular
right shift.
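A small sketch of the mapping (my own illustration; the helper name perfect_shuffle is assumed): the destination address is a one-bit circular left shift of the k-bit source address.

def perfect_shuffle(x, k):
    """Rotate the k-bit number x left by one bit."""
    msb = (x >> (k - 1)) & 1
    return ((x << 1) | msb) & ((1 << k) - 1)

k = 3                                    # 2^3 = 8 nodes
for node in range(2 ** k):
    print(f"{node:0{k}b} -> {perfect_shuffle(node, k):0{k}b}")
# e.g. 011 (node 3) -> 110 (node 6), matching the example above.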

Contd

Contd

Hypercube routing
An n dimensional cube has 2^n nodes, i.e., a node on
each of its vertices.
Each node is represented as an n bit binary number.
An n dimensional cube allows n number of routing
functions.
A 3 dimensional cube has 2^3 = 8 nodes and each
one is represented by a 3 bit binary number.
There are 3 different routing functions possible in a 3
dimensional or a binary 3 cube.
Routing by least significant bit
Routing by middle bit
Routing by most significant bit

Contd

Routing by least significant bit


Routing is possible between two nodes if they
differ only in their least significant bit.
For example, routing is possible between 000 and
001 (i.e. between nodes 0 and 1), but not possible
between 000 and 010.

Routing by middle bit


Routing is possible if middle bit differs.

Routing by most significant bit


Routing is possible if the most significant bit
differs.
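A compact way to state all three routing functions (illustrative sketch, not from the slides): the routing function for bit position i connects two nodes exactly when their addresses differ only in that bit.

def routed_by_bit(a, b, i):
    """True if node addresses a and b differ only in bit position i (0 = LSB)."""
    return (a ^ b) == (1 << i)

print(routed_by_bit(0b000, 0b001, 0))   # True : routing by least significant bit
print(routed_by_bit(0b000, 0b010, 0))   # False: they differ in the middle bit
print(routed_by_bit(0b000, 0b010, 1))   # True : routing by middle bit
print(routed_by_bit(0b011, 0b111, 2))   # True : routing by most significant bit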

Contd

Network Performance
Network performance can be affected by different
factors.
Functionality: Refers to how a network supports data
routing, synchronization, coherence etc.
Network Latency: Worst case time delay for a unit
message to be transmitted through the network.
Bandwidth: Maximum data transfer rate (Mbytes/s or
Gbytes/s).
Hardware complexity: Implementation cost for the
hardware such as wires, switches etc.
Scalability: Ability of the network to be expandable
with a scalable performance.

Static Connection
Networks
Static connection networks have fixed links
which do not change during execution.
Suitable for communication patterns which
are predictable.
Linear Array

N nodes are connected by N-1 links in a line.


Internal nodes have a degree of 2 and terminal
nodes have a degree of 1.
Diameter is N-1 and bisection width is 1.
Causes communication inefficiency if the value of
N is large.

Contd

Ring
Obtained by connecting two terminal nodes of a
linear array with an extra link.
All nodes have a degree of 2.
Ring can be of unidirectional or bidirectional
types.
The diameter of a bidirectional ring = N/2
The diameter of a unidirectional ring = N-1

Chordal Ring
It is obtained from a ring by increasing the node
degree from 2 to a higher value.

Contd

Completely connected network


Each node is connected to all the other nodes.

Barrel Shifter
Obtained from a ring by adding links from each
node to all those nodes having distance equal to
an integer power of 2.
Node i is connected to node j if |j - i| = 2^r, where r = 0, 1, 2, ..., n-1.
The barrel shifter has a network size of N = 2^n.
Node degree = 2n - 1
Network diameter = n/2
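An illustrative sketch (not from the slides) that enumerates the links of a barrel shifter for n = 3, taking the distance 2^r around the ring with wrap-around; every node ends up with degree 2n - 1 = 5:

n = 3
N = 2 ** n                               # network size
links = set()
for i in range(N):
    for r in range(n):
        j = (i + 2 ** r) % N             # node at distance 2^r, wrapping around
        links.add(tuple(sorted((i, j))))

degree = {i: sum(i in edge for edge in links) for i in range(N)}
print(sorted(links))
print(degree)                            # every node has degree 2n - 1 = 5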

Contd

Contd

Binary Tree
A k-level binary tree has (2^k) - 1 nodes.
A 5-level tree has 31 nodes.
Maximum node degree is 3 and diameter is 2(k-1).

Fat Tree
Channel width increases as we ascend from leaves to
the root.
Can cause performance bottleneck as the amount of
traffic becomes heavier towards the root.

Star Tree
Two level tree with central node having the highest
degree.

Contd

Contd

Mesh
Not symmetric in the sense that node degrees are
different for boundary nodes and interior nodes.

Illiac Mesh
Obtained from a mesh.
Connect the terminal node of each row with the
starting node of subsequent row.
Also, connect the terminal node of last row with
the starting node of first row.
Starting node of each column is connected to the
end node of that column itself.

Contd

Torus
Similar to illiac mesh.
Only difference being, the ending node of each
row is connected to the starting node of that row
itself.

Systolic array
Used for implementing fixed algorithms.

Contd

Contd

Hypercubes

Contains a node at each vertex of the cube.


n dimensional cube contains 2^n nodes.
A 3 dimensional cube contains 8 nodes.
We can connect two 3 dimensional cubes to obtain a 4
dimensional cube ( as shown in fig).

Cube Connected Cycles


Obtained by replacing each node of a hypercube with a ring.

K-ary n-cube networks


Combination of hypercube and torus.
n represents the dimension of hypercube involved.
k represents the number of nodes along each row (or
dimension) of the torus and is termed the radix.

Contd

Contd

Dynamic Connection
Networks

The structure of such networks changes during program execution.
Digital Bus
Collection of wires and connectors for data transactions
between master and slave devices.
Master or active devices
Generate requests to access the slave devices.
Processors and the I/O subsystem are examples of master
devices.

Slave or passive devices


Memory and peripheral devices form the group of slave devices.
They respond to the requests generated by master devices.

This interaction between the master and slave devices is done on a time-sharing basis using the digital bus.

Contd

Contd

Switch modules
An a x b switch module has a input ports and b output ports.
A 2 x 2 switch module is also known as binary
switch.
Usually a and b are chosen as integer powers
of two.
A switch module has different connection settings:
Straight: one-to-one mapping, with each input connected to the corresponding output.
Crossover: the connections are interchanged; one-to-many (broadcast) mappings may also be allowed, but many-to-one mappings are not.

An n x n crossbar can achieve n! permutations.

Contd

Multistage Interconnection Networks


Has multiple stages with each stage consisting of
a number of switches.
Communication between switches present in
different stages takes place using an Inter Stage
Connection (ISC).

Contd

Contd

Omega Network
It is a kind of multistage network made of 2 x 2
switches.
Consider an N x N omega network.
It contains log2(N) number of stages.
Each stage contains N/2 switches.
Communication between switches present in
different stages is done through the technique of
perfect shuffle.
Output of the first stage serves as input to the
next stage.

Contd

Construct an 8 x 8 omega network.

No of stages = log2(8) = 3
Each stage contains 8/2 = four 2 x 2 switches.
Stage 1 contains the switches A1, A2, A3, A4.
Stage 2: B1, B2, B3, B4.
Stage 3: C1, C2, C3, C4.
A1 has two inputs marked as 0 (000) and 1 (001), and
two outputs marked again as 0 (000) and 1 (001).
A2 has two inputs 2 (010) and 3 (011), and two
outputs 2 (010) and 3 (011).
Similarly mark the inputs and outputs of other
switches also.

Contd
Each stage consists of 8 inputs (each of the 4
switches having 2 inputs).
Perform a perfect shuffle on each input.
Perfect shuffle of 000 gives 000 itself.
Perfect shuffle of 001 gives 010, hence connect
port 001 to 010 in all the three stages.
Repeat this for the remaining inputs also.
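A small sketch of the inter-stage wiring (my own illustration, reusing the perfect-shuffle rule described above): each output port p of one stage is wired to the input port obtained by circularly left-shifting p's 3-bit number.

def perfect_shuffle(x, k=3):
    """Circular left shift of a k-bit port number by one bit."""
    return ((x << 1) | (x >> (k - 1))) & ((1 << k) - 1)

N = 8
stages = 3                               # log2(8)
print(f"{N} x {N} omega network: {stages} stages, {N // 2} switches per stage")
for port in range(N):
    print(f"port {port:03b} -> port {perfect_shuffle(port):03b}")
# e.g. 001 -> 010, as constructed in the text above.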

Contd

Contd

Baseline Networks
Another variant of multistage interconnection
networks.
Consider an N x N baseline network.
No of stages = log2(N)
Each stage contains N/2 number of 2 x 2 switches.
The first stage contains one N x N block.
The second stage contains two N/2 x N/2 sub
blocks (C0 and C1).
The process is repeated with all the sub blocks
until N/2 sub blocks of size 2 x 2 are reached.

Contd

Contd

Contd

Crossbar Network
It is a single stage switch network.
Provides dynamic connections between source-destination pairs.
Switches are placed at each crosspoint.
These crosspoint switches are responsible for
controlling the traffic between source-destination
pairs.

Contd

Scalability Analysis and Approaches

Scalability is defined as the property by which the performance of a computer system increases linearly with respect to the no of processors used for a given application.
Scalability metrics: Factors affecting the scalability of a
computer system.
Machine size (n): no of processors employed in a parallel computer system.
Clock rate (f): no of clock cycles per unit time.
CPU time (T): time in seconds elapsed in executing a program on a parallel machine with n processors collectively. Denoted as T(s, n).
Problem size (s): no of data points (inputs) needed to execute a program. Directly proportional to T.
I/O demand (d): input/output demand of a program.

Contd
Memory capacity (m): amount of main memory (in bytes or words) needed to execute the program.
Communication overhead (h): time spent for inter-processor communication, synchronization, remote memory access etc. Expressed as h(s, n).
Computer cost (c): total cost of hardware and software required.
Programming overhead (p): overhead associated with the development of an application program.
Speed up, S(s, n) = T(s, 1)/ (T(s, n) + h(s, n))
Efficiency = S(s, n)/ n
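A final worked example (the timing numbers are assumed purely for illustration, not from the slides) plugging values into the speedup and efficiency definitions:

def speedup(T1, Tn, h):
    """S(s, n) = T(s, 1) / (T(s, n) + h(s, n))."""
    return T1 / (Tn + h)

def efficiency(S, n):
    """E(s, n) = S(s, n) / n."""
    return S / n

T1 = 100.0      # seconds on a single processor (assumed)
Tn = 30.0       # seconds on n = 4 processors (assumed)
h = 10.0        # communication/synchronisation overhead in seconds (assumed)
n = 4

S = speedup(T1, Tn, h)              # 100 / 40 = 2.5
print(S, efficiency(S, n))          # 2.5 0.625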
