
Register Reuse

Part 1
For dgemm0 the body of the innermost (third) for loop consists of one
floating-point multiplication and one floating-point addition. Since the body is
executed $n^3$ times, we get a total of $2n^3$ floating-point operations. Also,
because no register reuse is employed, every value of a, b and c is loaded on
every iteration, which results in $100 + 100 + \frac{1}{4} = 200.25$ cycles for
the multiplication and $100 + \frac{1}{4} = 100.25$ cycles for the addition. The
addition does not need to load its second operand, since that operand is the
result of the multiplication, which is already stored in a register. Hence, for
$n = 1000$ we get $(200.25 + 100.25)n^3 = 300.5n^3 = 300.5 \cdot 1000^3 =
300.5 \cdot 10^9$ cycles in total, or

\[ \frac{300.5 \cdot 10^9 \text{ cycles}}{2 \cdot 10^9 \text{ cycles/sec}} = 150.25 \text{ seconds} \]

The time wasted in accessing operands that are not in registers is

\[ \frac{(100 + 100 + 100) \cdot 10^9 \text{ cycles}}{2 \cdot 10^9 \text{ cycles/sec}} = 150 \text{ seconds} \]

For dgemm1 the multiplication operands are loaded $n^3$ times each in total,
whereas the first addition operand is loaded $n^2$ times in total. This results
in a total of $100(2n^3 + n^2)$ cycles for loading operands into registers. We
again have $2n^3$ floating-point operations in total, which consume
$\frac{1}{4}(2n^3) = 0.5n^3$ cycles. Therefore, for $n = 1000$ the total
execution time will be

\[ \frac{100(2n^3 + n^2) + 0.5n^3 \text{ cycles}}{2 \cdot 10^9 \text{ cycles/sec}} = 100.3 \text{ seconds} \]

The time wasted in accessing operands that are not in registers is

\[ \frac{100(2n^3 + n^2) \text{ cycles}}{2 \cdot 10^9 \text{ cycles/sec}} = 100.05 \text{ seconds} \]
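The arithmetic for both routines can be checked with a minimal Python sketch. The constants below are the assumptions the derivation itself uses (100 cycles per operand load, 1/4 cycle per floating-point operation, a 2 GHz clock, n = 1000). The function names are hypothetical and are not taken from the assignment code.

```python
from fractions import Fraction

# Assumed cost model, taken from the figures used in the text.
LOAD_CYCLES = 100              # cycles to load one operand into a register
FLOP_CYCLES = Fraction(1, 4)   # cycles per floating-point operation
CLOCK_HZ = 2 * 10**9           # 2 GHz clock


def dgemm0_cycles(n):
    """Return (total, wasted) cycles when a, b and c are loaded every iteration."""
    loads = 3 * n**3 * LOAD_CYCLES      # three loads in each of the n^3 iterations
    flops = 2 * n**3 * FLOP_CYCLES      # one multiply and one add per iteration
    return loads + flops, loads


def dgemm1_cycles(n):
    """Return (total, wasted) cycles when c[i][j] is loaded once per (i, j)."""
    loads = (2 * n**3 + n**2) * LOAD_CYCLES
    flops = 2 * n**3 * FLOP_CYCLES
    return loads + flops, loads


def seconds(cycles):
    """Convert a cycle count to seconds at the assumed clock rate."""
    return float(cycles / CLOCK_HZ)


if __name__ == "__main__":
    for name, model in (("dgemm0", dgemm0_cycles), ("dgemm1", dgemm1_cycles)):
        total, wasted = model(1000)
        print(f"{name}: total {seconds(total):.2f} s, "
              f"waiting on loads {seconds(wasted):.2f} s")
```

Under these assumptions almost all of the run time is spent waiting on loads, which is why keeping c[i][j] in a register removes roughly a third of the total.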

Part 2
n=64
dgemm0 0.000000 sec inf Gflops
dgemm1 0.000000 sec inf Gflops max|C0-C1|=0.0000000000000000
dgemm2 0.000000 sec inf Gflops max|C0-C1|=0.0000000000000000
n=128
dgemm0 0.030000 sec 0.139810 Gflops
dgemm1 0.020000 sec 0.209715 Gflops max|C0-C1|=0.0000000000000000
dgemm2 0.010000 sec 0.419430 Gflops max|C0-C1|=0.0000000000000000
n=256
dgemm0 0.260000 sec 0.129056 Gflops
dgemm1 0.130000 sec 0.258111 Gflops max|C0-C1|=0.0000000000000000
dgemm2 0.120000 sec 0.279620 Gflops max|C0-C1|=0.0000000000000000
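The Gflops figures in this listing are consistent with dividing $2n^3$ floating-point operations by the measured time. The sketch below reproduces them, assuming that is how the benchmark computes the rate; the helper name `gflops` is hypothetical. A reported time of zero gives an infinite rate, which matches the n=64 rows.

```python
def gflops(n, elapsed_seconds):
    """Rate for an n x n matrix product: 2*n^3 operations over the elapsed time."""
    if elapsed_seconds == 0:
        return float("inf")    # the timer reported no elapsed time
    return 2 * n**3 / elapsed_seconds / 1e9


if __name__ == "__main__":
    timings = [(64, 0.0), (128, 0.03), (128, 0.02), (128, 0.01),
               (256, 0.26), (256, 0.13), (256, 0.12)]
    for n, t in timings:
        print(f"n={n} {t:.6f} sec {gflops(n, t):.6f} Gflops")
```

All measured times are multiples of 0.01 sec, so the timings for the smaller sizes are probably limited by timer resolution and should be read with caution.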

Part 3

Cache Reuse
Part 1
• 10x10
1. ijk

First, for i=j=0 we will have a miss for a[0][0] and one miss for each
element in the first column of b. After that, the whole b will be in
cache along with the first row of a. Similarly, we will have a miss for
the rest of the elements in the first column of a. After the miss in
a[n-1][0], all rows of a will have been stored in the cache. Also, we
will not have any more misses for b since the cache is large enough
to hold all a and b (and c).
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.

2. jik

First, for i=j=0 we will have a miss for a[0][0] and one miss for each
element in the first column of b. After that, the whole b will be in
cache along with the first row of a. Similarly, we will have a miss for
each element of the first column of a just before j=0 becomes j=1.
At this point, both a and b are in cache in their entirety and no more
misses will occur.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.

3. ikj

First, for i=k=0 we will have a miss for a[0][0] and b[0][0], and the
first rows of a and b are loaded in cache. For the remaining values of k,
and while i=0, we will have a miss for the rest of the elements in the
first column of b, and the whole of b is loaded in the cache. For the
remaining values of i we will also have a miss for the rest of the elements
in the first column of a. Now both a and b are in cache.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.

4. jki

First, for j=k=0 we will have a miss for all the elements in the first
column of a and a miss for b[0][0]. Now the first row of b and the
whole of a are loaded in cache. For the remaining values of k, and while
j=0, we will have a miss for the rest of the elements in the first column
of b, and the whole of b is loaded in the cache. Now both a and b are in
cache.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.

5. kij

First, for k=i=j=0 we will have a miss for a[0][0] and b[0][0], and the
first row of a and b are loaded in cache. Then, for the rest of the
values of i, and while k=0, we will have a miss for all the elements
in the first column of a. Now the first row of b and the whole a are
in cache. Now for the rest of the values of k, we will have a miss for
the rest of the elements in the first column of b, and the whole b is
loaded in the cache. Now both a and b are in the cache.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.

6. kji

First, for k=j=0, we will have a miss for b[0][0] and all the elements
in the first column of a. Now the whole a and the first row of b are in
the cache. After that, for the rest of the values of k, we will also have
a miss for all the elements in the first column of b, and the whole b
is now loaded in the cache.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.
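The 1% result for all six loop orders can be checked with a small cache simulation. This is a minimal sketch under stated assumptions rather than the model prescribed by the assignment: a fully associative LRU cache of 60 lines, 10 elements per line, row-major storage with b placed directly after a, and only the reads of a[i][k] and b[k][j] counted (matching the $2 \cdot 10^3$ accesses in the denominator). The function name `miss_rate` is hypothetical.

```python
from collections import OrderedDict
from itertools import permutations, product

N = 10            # matrix dimension
LINE_ELEMS = 10   # elements per cache line (assumed)
CACHE_LINES = 60  # cache capacity in lines (assumed)


def miss_rate(order, n=N, line_elems=LINE_ELEMS, cache_lines=CACHE_LINES):
    """Simulate an LRU cache for one loop order such as "ijk".

    Only reads of a[i][k] and b[k][j] are counted; accesses to c are
    not modelled.
    """
    cache = OrderedDict()          # line id -> None, least recently used first
    misses = accesses = 0
    a_base, b_base = 0, n * n      # row-major layout, b stored right after a
    for idx in product(range(n), repeat=3):
        v = dict(zip(order, idx))  # first letter of `order` is the outermost loop
        i, j, k = v["i"], v["j"], v["k"]
        for addr in (a_base + i * n + k, b_base + k * n + j):
            line = addr // line_elems
            accesses += 1
            if line in cache:
                cache.move_to_end(line)        # refresh the LRU position
            else:
                misses += 1
                cache[line] = None
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict the least recently used line
    return misses / accesses


if __name__ == "__main__":
    for perm in permutations("ijk"):
        order = "".join(perm)
        print(order, miss_rate(order))
```

With these parameters a and b together occupy only 20 of the 60 lines, so every order sees exactly the 20 compulsory misses.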

• 10000x10000
1. ijk

For a specific i and j, we will have a miss every ten elements in
the i-th row of a, and a miss for every single element in the j-th
column of b. This is because the contents in the cache at the end of
each k-loop cannot be used at the beginning of the next k-loop, since
by the time the new k-loop would use them, they will have already
been overwritten by newer data due to the least-recently-used
replacement policy.
Since each element of a and b is used in n combinations of i and j, we
will have n misses for each element in a and b where misses occur,
and 0 misses for the rest of the elements of a.
Hence, in a we will have $n^2/10$ elements where misses occur, which gives
$n \cdot n^2/10 = n^3/10$ misses, and in b we will have $n \cdot n^2 = n^3$
misses. Therefore, we get a miss rate of

\[ \frac{n^3/10 + n^3}{2 \cdot n^3} = \frac{1.1}{2} = 55\% \]
2. jik

Same as in ijk.
3. ikj

In this case, for each combination of i and k we first load 10 elements
of the i-th row of a, and 10 elements of the k-th row of b, in the cache.
Then, while we traverse the k-th row of b, a miss occurs every ten
elements. Note that the elements of a in the cache will not be
overwritten since they are constantly in use. They will be overwritten
only after a new line containing elements of a is loaded in the cache.
Hence, for each combination of i and k we get a miss for a[i][k] if
k mod 10 = 0, and a miss every ten elements in the k-th row of b. Notice
that in the end the misses will happen at elements at intervals of 10
horizontally, both for a and b.
The k-th row of b appears in n combinations of i and k, and hence we
have $n \cdot n/10 = n^2/10$ misses for each row of b, or
$n \cdot n^2/10 = n^3/10$ for the whole of b. Note that we have n misses
for each element where misses occur, in every row of b.
Also, k mod 10 = 0 holds for n/10 values of k, and hence for
$n \cdot n/10 = n^2/10$ combinations of i and k. Hence, we get a total of
$n^2/10$ misses for a. Note that every a[i][k] corresponds to a specific
combination of i and k, and, therefore, every element of a where a
miss occurs will have only a single miss.
Therefore, we have

\[ \frac{n^3/10 + n^2/10}{2 \cdot n^3} = \frac{1 + 1/n}{20} = 5.0005\% \]
4. jki

In this case we will have misses every time we try to access any
element of a or b. This happens because we traverse both a and
b column-wise whereas only elements of the same row are stored in
cache.
Since each element of a and b is loaded n times, it will also cause n
misses.
Obviously, the miss rate is 100%.

5. kij

For each combination of k and i we have a miss for a[i][k] and a
miss every ten elements in the k-th row of b. Hence, there is a total
of one miss for every element in a, and n misses for each element
where misses occur in b. Therefore, we get a miss rate of

\[ \frac{n^2 + n \cdot (n \cdot n/10)}{2 \cdot n^3} = \frac{1/n + 1/10}{2} = 5.005\% \]
6. kji

For each combination of k and j we have a miss for b[k][j] if
j mod 10 = 0 and a miss for all elements in the k-th column of a.
Hence, we have a single miss every ten elements in every row of b,
and since there are n combinations of k and j for a specific k, we
understand that there will be n misses for every element in every
column of a, or more simply, n misses for each element of a.
Therefore, the miss rate will be

\[ \frac{n \cdot n^2 + n^2/10}{2 \cdot n^3} = \frac{1 + 1/(10n)}{2} = 50.0005\% \]
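The percentages for the 10000x10000 case follow directly from the miss counts derived for each loop order. The sketch below only evaluates those closed-form counts exactly and does not simulate the cache; the function name `miss_rates` and the `line_elems` parameter are hypothetical.

```python
from fractions import Fraction


def miss_rates(n, line_elems=10):
    """Evaluate the closed-form miss rates derived for matrices far larger than the cache."""
    L = line_elems
    total = 2 * n**3                                   # reads of a and b
    misses = {
        "ijk": Fraction(n**3, L) + n**3,               # a: every L-th element, b: every element
        "ikj": Fraction(n**3, L) + Fraction(n**2, L),  # b rows, then a[i][k] when k mod L = 0
        "jki": Fraction(2 * n**3),                     # every access misses
        "kij": n**2 + Fraction(n**3, L),               # a once per element, b rows
        "kji": n**3 + Fraction(n**2, L),               # a every access, b every L-th element
    }
    misses["jik"] = misses["ijk"]                      # same analysis as ijk
    return {order: m / total for order, m in misses.items()}


if __name__ == "__main__":
    for order, rate in sorted(miss_rates(10000).items()):
        print(f"{order}: {float(rate * 100):.4f}%")
```

Using exact fractions avoids rounding in the small $1/n$ correction terms, which is what separates 5.0005% (ikj) from 5.005% (kij).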

Part 2
We can calculate the misses for this part based on Part 1. Specifically, first
consider the block multiplication part (the outer 3 loops) as a multiplication
of two m × m matrices where m = n/10, using a cache of 60/10 = 6 lines, where
each line can hold one element. Since m is far greater than the 6 lines of
cache, the number of misses for each block will be equal to the number of misses
of each element for the 10000 × 10000 matrix in Part 1, with the only difference
that we substitute the number 10 in every formula with the number 1, since we no
longer have 10 elements per cache line, but 1.
Now that we have the number of misses for each block in both a and b, we can
calculate the number of misses for every element in each block by multiplying
the number of misses of this block by the number of misses of the 10 × 10 matrix
multiplication in Part 1 for the same access pattern.
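The two-level structure this argument relies on can be sketched as follows. The loop order inside and across blocks, the function name `blocked_matmul`, and the requirement that the block size divides n are assumptions made for illustration; the assignment's actual blocked routine may differ.

```python
def blocked_matmul(a, b, block):
    """Multiply square matrices a and b (lists of lists) block by block."""
    n = len(a)
    assert n % block == 0, "this sketch assumes the block size divides n"
    c = [[0.0] * n for _ in range(n)]
    # Outer three loops: walk over blocks, like an (n/block) x (n/block) product.
    for ib in range(0, n, block):
        for jb in range(0, n, block):
            for kb in range(0, n, block):
                # Inner three loops: an ordinary block x block multiplication.
                for i in range(ib, ib + block):
                    for j in range(jb, jb + block):
                        acc = c[i][j]
                        for k in range(kb, kb + block):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    n = 4
    a = [[float(i * n + j) for j in range(n)] for i in range(n)]
    b = [[float(i + j) for j in range(n)] for i in range(n)]
    print(blocked_matmul(a, b, block=2))
```

With block = 10 the inner three loops correspond to the 10 × 10 case of Part 1, and the outer three loops to the m × m multiplication with m = n/10.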

Part 3
Part 4
