
Systolic Algorithms

Part 2
Motivation & Introduction
● We need high-performance, special-purpose computer systems to meet the
needs of specific applications.
● I/O and computation imbalance is a notable problem.
● Systolic architectures map high-level computations directly onto
hardware structures.
● A systolic system works like an automobile assembly line.
● A systolic system is easy to implement because of its regularity, and it is
easy to reconfigure.
● Systolic architecture can result in cost-effective, high-performance
special-purpose systems for a wide range of problems.

Pipelined Computations
● A pipelined program is divided into a series of tasks that have to be
completed one after the other.
● Each task is executed by a separate pipeline stage.
● Data is streamed from stage to stage to form the computation.

Pipelined Computations
● Computation consists of data streaming through the pipeline stages.
● For N data items and P pipeline stages (N >= P):
Execution Time = Time to fill the pipeline (P-1)
+ Time to run in steady state (N-P+1)
+ Time to empty the pipeline (P-1)
= N + P - 1 time steps
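
The three phases of the execution time can be checked with a small simulation. The sketch below is only illustrative: the function names `pipeline_schedule` and `phase_lengths` are made up for this example, and it assumes that every stage takes exactly one time step and that there are at least as many items as stages.

```python
def pipeline_schedule(num_items, num_stages):
    """Return, for each tick, the (stage, item) pairs that are busy.

    Assumption: item k enters stage 0 at tick k and advances one stage per tick.
    """
    total_ticks = num_items + num_stages - 1
    schedule = []
    for tick in range(total_ticks):
        active = [(stage, tick - stage)
                  for stage in range(num_stages)
                  if 0 <= tick - stage < num_items]
        schedule.append(active)
    return schedule


def phase_lengths(num_items, num_stages):
    """Count the ticks spent filling, in steady state, and emptying.

    Assumes num_items >= num_stages, so the pipeline becomes full at some point.
    """
    busy = [len(active) for active in pipeline_schedule(num_items, num_stages)]
    fill = busy.index(num_stages)                       # ticks before all stages are busy
    steady = sum(1 for b in busy if b == num_stages)    # ticks with every stage busy
    drain = len(busy) - fill - steady                   # ticks after the last input entered
    return fill, steady, drain


if __name__ == "__main__":
    n, p = 10, 4
    print(phase_lengths(n, p))              # (P-1, N-P+1, P-1)
    print(len(pipeline_schedule(n, p)))     # N + P - 1
```

With N = 10 and P = 4, the phases come out as 3, 7, and 3 ticks, for 13 ticks in total.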

Processors for Systolic Arrays
● In a pipelined algorithm, the data flows through processors in lockstep.
● The design attempts to balance the work so that there is no bottleneck at
any processor.
● In the mid-1980s, processors were developed to support this kind of
parallel pipelined computation in hardware.
● Two resulting machines (Warp from Carnegie Mellon University; iWarp, its
commercial successor, from Intel and Carnegie Mellon):
○ Warp (1D array)
○ iWarp (components for 2D array)
● Warp and iWarp were meant to operate synchronously.
● The Wavefront Array Processor (S.Y. Kung) was meant to operate
asynchronously, i.e., the arrival of data would signal that it was time to execute.
Systolic Arrays from Intel
● Warp and iWarp were examples of systolic arrays.
● Systolic means regular and rhythmic.
● Data was supposed to move through pipelined computational units in a
regular and rhythmic fashion.
● Systolic arrays were meant to be special-purpose processors or co-processors.
● They were very fine-grained.
● The processors, usually called cells, each implement a limited and very
simple computation.
● Communication is very fast; the granularity was meant to be around one
operation per communication.

Example 1: Finding primes
● Problem:
○ Input: the sequence of natural numbers from 2 to n
○ Output: the list of primes smaller than or equal to n
○ Example: For input (2, 3, 4, 5, 6, 7, 8, 9, 10), the output is (2, 3, 5, 7)
● A sequential approach: Sieve of Eratosthenes
● A pipelined approach:
○ We need enough processors Pi to hold all the primes less than or equal to n. The Prime
Number Theorem states that there are approximately n/ln(n) prime numbers less than or equal to n.
○ Each processor has a register p, initially 0. It receives a value a from its previous neighbor and
provides a value a’ to the next neighbor. The first processor receives its values from the input list;
the last processor provides its values as the output.

Example 1: Finding primes
Pseudocode for cell Pi (input a, register p, output a’):

    if a <= 0 then
        a’ = a
    else if p == 0 then
        p = a
        a’ = -a
    else if a % p == 0 then
        a’ = 0
    else
        a’ = a

Observations:
● The cells are configured as the data passes through.
● An input number flows through the array and undergoes the following transformation:
○ If it reaches an unconfigured cell, it is a prime number. The cell is configured with
the value and its negated value is forwarded to the end (the processors simply pass
through all non-positive values).
○ If it is divisible by the value of the cell, it is not prime and 0 is forwarded to the end.
● The systolic array output contains 0 or the negated primes.
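
The cell behaviour described above can be tried out with a small lock-step simulation. This is a minimal sketch, not a model of any real systolic hardware: the names `cell_step` and `systolic_primes`, the list of latches between cells, and the use of `None` for an empty pipeline slot are all assumptions made for this example. The number of cells is passed in by the caller and must be at least the number of primes up to n; otherwise a prime that finds no free cell leaves the array as a positive value.

```python
def cell_step(p, a):
    """Apply one cell of the prime pipeline; return (new register p, output a')."""
    if a <= 0:
        return p, a          # non-positive values are passed through unchanged
    if p == 0:
        return a, -a         # unconfigured cell: a is prime, store it, forward -a
    if a % p == 0:
        return p, 0          # divisible by the stored prime: not prime
    return p, a              # undecided: pass on to the next cell


def systolic_primes(n, num_cells):
    """Stream 2..n through num_cells cells, one value per tick.

    latches[i] holds the value waiting at the input of cell i (None = empty);
    latches[num_cells] is the output of the array.
    """
    inputs = list(range(2, n + 1))
    regs = [0] * num_cells
    latches = [None] * (num_cells + 1)
    outputs = []
    for tick in range(len(inputs) + num_cells):
        new = [None] * (num_cells + 1)
        new[0] = inputs[tick] if tick < len(inputs) else None
        for i in range(num_cells):
            if latches[i] is not None:
                regs[i], new[i + 1] = cell_step(regs[i], latches[i])
        if new[num_cells] is not None:
            outputs.append(new[num_cells])
        latches = new
    return outputs


if __name__ == "__main__":
    out = systolic_primes(10, 4)
    print(out)                              # zeros and negated primes
    print([-v for v in out if v < 0])       # the primes themselves
```

For the input (2, ..., 10) and four cells, the output stream is -2, -3, 0, -5, 0, -7, 0, 0, 0, which matches the example output (2, 3, 5, 7) once the zeros are dropped and the signs are flipped.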
Example 2: Polynomial evaluation
P(x) = a_n*x^n + a_{n-1}*x^{n-1} + a_{n-2}*x^{n-2} + a_{n-3}*x^{n-3} + … + a_1*x + a_0

P(x) = (…((((0*x + a_n)*x + a_{n-1})*x + a_{n-2})*x + a_{n-3})*x + … + a_1)*x + a_0

● The second form is equivalent but the computations are uniform now.
● The innermost parentheses define the “stage algorithm”:
○ Take the intermediate value from the previous cell
○ Multiply it by x and add a_i
○ Pass it to the next cell to repeat the algorithm with a_{i-1}
● The first cell receives x and the cells propagate it unchanged
● The first cell receives 0 as the intermediate value because there is none

Example 2: Polynomial evaluation
Pseudocode for cell Pi (inputs p and x, stored coefficient a_{n-i}, outputs p’ and x’):

    p’ = p * x + a_{n-i}
    x’ = x

Observations:
● The algorithm is not more efficient than a sequential one for a single evaluation.
● If several x values are pipelined into the first cell, then after an initial delay
the last cell will output one evaluation of the polynomial at every tick.
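
The stage algorithm can be sketched as a lock-step simulation in which each cell holds one coefficient and a new x value enters the first cell at every tick. The names `horner_cell` and `systolic_polyval`, the latch list, and the `None` marker for empty slots are assumptions made for this illustration; the coefficients are given from a_n down to a_0, one per cell.

```python
def horner_cell(p, x, coeff):
    """One cell: p' = p * x + coeff, and x is propagated unchanged."""
    return p * x + coeff, x


def systolic_polyval(coeffs, xs):
    """Evaluate the polynomial with coefficients [a_n, ..., a_0] at every x in xs.

    latches[i] holds the (p, x) pair waiting at the input of cell i
    (None = empty slot); latches[len(coeffs)] is the output of the array.
    """
    num_cells = len(coeffs)
    latches = [None] * (num_cells + 1)
    results = []
    for tick in range(len(xs) + num_cells):
        new = [None] * (num_cells + 1)
        # The first cell receives 0 as the intermediate value, together with x.
        new[0] = (0, xs[tick]) if tick < len(xs) else None
        for i, coeff in enumerate(coeffs):
            if latches[i] is not None:
                p, x = latches[i]
                new[i + 1] = horner_cell(p, x, coeff)
        if new[num_cells] is not None:
            results.append(new[num_cells][0])
        latches = new
    return results


if __name__ == "__main__":
    # P(x) = 2*x^3 - x + 3, evaluated at x = 0, 1, 2
    print(systolic_polyval([2, 0, -1, 3], [0, 1, 2]))
```

Once the first result has emerged, each further tick produces the next evaluation, which is where the pipelined version gains over repeating the sequential algorithm for every x.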
