1. Suppose a galaxy has $10^{11}$ stars. Estimate the time it would take to perform 100 iterations of the basic N-body algorithm using $O(N^2)$ computations and a computer that is capable of 500 MFlops.
2. Find the diameter of: (a) a torus; (b) a tree network; (c) a d-dimensional mesh.
3. Look at the minimal distance deadlock-free algorithm for hypercube networks described in the textbook, page 15. Apply it for: (a) a five-dimensional hypercube network from node 7 to node 22; (b) repeat for an $8 \times 8$ mesh, using its perfect embedding in a hypercube network.
4. Determine how the largest complete binary tree can be embedded into a hypercube. What is the dilation of the mapping?
5. What is the average distance between two nodes in: (a) a mesh network; (b) a hypercube?
of the textbook for details on these routines.)
(ii) Use the procedure described in the above question to estimate the time taken by your simulating routines and compare with the time taken by the corresponding MPI routines.
5. Experiment with latency hiding on your system to determine how much computation is possible between sending messages. Investigate using both nonblocking and locally blocking send routines.
5. Monte-Carlo method: Write an MPI program to compute $\pi/4$ using Monte-Carlo methods. (Run it on the Tembusu cluster.)
   1. Use a sequential random number generator and both methods described in class (slides 3.34-35): (1) score how many random points within a $2 \times 2$ square lie within a circle of unit radius and (2) compute the corresponding integral $\int_0^1 \sqrt{1 - x^2}\,dx$.
   2. Repeat the above question using a parallel random number generator (write your own implementation of such a parallel random number generator using the method described in class, slides 3.36-37).
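A minimal sequential sketch of the first method (counting random points of the square that fall inside the unit circle) may help when checking the output of the MPI version. The generator below is a plain 64-bit linear congruential generator chosen only for illustration; it is an assumption, not necessarily the generator from the slides, and the names `next_uniform` and `pi4_by_hits` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 64-bit linear congruential generator (an assumption,
   standing in for whatever sequential generator the course uses). */
static uint64_t lcg_state = 12345u;

/* Returns a pseudo-random double in [0, 1). */
static double next_uniform(void)
{
    lcg_state = lcg_state * 6364136223846793005ULL + 1442695040888963407ULL;
    /* Keep the top 53 bits and scale by 2^53. */
    return (double)(lcg_state >> 11) / 9007199254740992.0;
}

/* Fraction of random points of the 2 x 2 square [-1, 1) x [-1, 1)
   that fall inside the circle of unit radius. The expected value of
   this fraction is pi/4 (circle area pi over square area 4). */
double pi4_by_hits(long samples)
{
    long hits = 0;

    assert(samples > 0);
    for (long s = 0; s < samples; s++) {
        double x = 2.0 * next_uniform() - 1.0;
        double y = 2.0 * next_uniform() - 1.0;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return (double)hits / (double)samples;
}
```

In a parallel version each process would run the same loop on its own share of the samples and the hit counts would be summed at the end; how the random stream is split between processes is exactly what the second part of the question asks for.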
Regular questions
1. Analysis of divide-and-conquer method: Analyze the divide-and-conquer method of assigning one processor to each node in a tree for adding numbers (see textbook, sec. 4.1.2) in terms of communication, computation, overall parallel execution time, speedup, and efficiency.
2. Holes: Suppose you own a hole punch capable of putting a hole in an arbitrarily thick stack of paper. If you insert the paper into the hole punch and activate it, you will get a piece of paper with one hole in it. If you fold the paper in half before inserting it into the hole punch, you will have a piece of paper with two holes in it. If you can only use the hole punch once, how many times must you fold a piece of paper in order to put $n$ holes in it? Prove that your answer is correct and optimal.
3. Smallest value with an arbitrary number of processes: Develop a divide-and-conquer algorithm that finds the smallest value in a set of $n$ values in $O(\log n)$ steps using $n/2$ processors. What is the time complexity if there are fewer than $n/2$ processors?
4. Two variants of summation: Write a parallel program to compute the summation of $n$ integers in each of the following ways and assess their performance. Assume that $n$ is a power of 2.
   (a) Partition the $n$ integers into $n/2$ pairs. Use $n/2$ processes to add together each pair of integers, resulting in $n/2$ integers. Repeat the method on the $n/2$ integers to obtain $n/4$ integers and continue until the final result is obtained. (Binary tree algorithm.)
   (b) Divide the $n$ integers into $n/\log n$ groups of $\log n$ numbers each. Use $n/\log n$ processes, each adding the numbers in one group sequentially. Then add the $n/\log n$ results using method (a).
5. Integration: Write a static assignment parallel program to compute $\pi$ using the formula $\int_0^1 \sqrt{1 - x^2}\,dx = \pi/4$ in each of the following ways:
   1. rectangular decomposition 1 (slide 4.22)
   2. rectangular decomposition 2 (slide 4.23)
   3. trapezoidal decomposition (slide 4.24)
Analyze each method in terms of speed and accuracy.
4. Outer product of two vectors: The outer product of two vectors $A = (a_0, \ldots, a_{n-1})$ and $B = (b_0, \ldots, b_{n-1})$ is an $n \times n$ matrix $C$, where
\[
C = \begin{pmatrix} a_0 b_0 & \cdots & a_0 b_{n-1} \\ \vdots & \ddots & \vdots \\ a_{n-1} b_0 & \cdots & a_{n-1} b_{n-1} \end{pmatrix}
\]
Develop a pipeline implementation for the outer product of two vectors and analyze it.
5. Pipeline, sieve of Eratosthenes: Consider the following methods for implementing the sieve of Eratosthenes:
   1. By a pipeline approach (textbook 5.3.3; slides 5.29-33).
   2. By dividing the range of the numbers into $m$ regions and assigning one region to each process to strike out multiples of prime numbers; use a master process to broadcast each already found prime number to the processes.
Write parallel programs (pseudocode!) for each method and estimate their time complexity.
Regular questions
1. Partial barrier: Write a barrier, barrier(procno), which will block until procno processes reach the barrier and then release the processes. Allow for the barrier to be called with different numbers of processes and with different values for procno.
2. Implementations for tree and butterfly barriers: Implement the tree barrier described in Slide 6.9 (textbook 6.1.3) using individual send-receive routines. Analyze its time complexity. Repeat for the butterfly barrier described in Slide 6.10 (textbook, 6.1.4).
3. Prefix calculation: Analyze the prefix calculation method described in Slides 6.17-19 (textbook p. 171) and determine its efficiency. Modify the method to work for $m$ numbers and $p$ processes [arbitrary $m$, $p$] and repeat the above question for this new version.
4. Strips vs. squares: In our presentation of the heat distribution problem (Slides 6.33-45, textbook 6.3.2) we assumed a square array. What are the mathematical conditions for choosing a block or strip partition if the array is an $m \times n$ rectangle [arbitrary $m$, $n$]? Suppose the communication of a rectangle is proportional to its perimeter. Show that the square has the minimum communication of all rectangles of a fixed area (i.e., in the class of rectangles $R$ of dimensions $(x, y)$ such that $x \cdot y = a$, for a fixed $a$).
5. Second-largest key: Given a list of $n$ keys $a[0], \ldots, a[n-1]$, design a parallel algorithm to find the second-largest key in the list. [Note: Keys do not necessarily have distinct values.]
assuming that each assignment statement is atomic.
4: The following C-like parallel program is supposed to transpose a matrix:

    forall (i = 0; i < n; i++)
        forall (j = 0; j < n; j++)
            a[i][j] = a[j][i];

Explain why the code will not work and correct it.
5: Determine and explain how the following code for a barrier works (based upon the two-phase barrier given in textbook Section 6.1.3):

    void barrier()
    {
        lock(arrival);
        count++;
        if (count < n) unlock(arrival);
        else unlock(departure);
        lock(departure);
        count--;
        if (count > 0) unlock(departure);
        else unlock(arrival);
        return;
    }

Why is it necessary to use two lock variables, arrival and departure?
Regular questions
1: Modify the rank sort code given in Sec.9.1.3
    for (i = 0; i < n; i++) {          /* for each number */
        x = 0;
        for (j = 0; j < n; j++)        /* count numbers less than it */
            if (a[i] > a[j]) x++;
        b[x] = a[i];                   /* copy number into correct place */
    }

to cope with duplicates in the sequence of numbers (i.e., for it to sort in nondecreasing order).
2: The following is an attempt to code the odd-even transposition sort of Sec. 9.2.2 as an SPMD program:

Process P_i

    evenprocess = (i % 2 == 0);
    evenphase = 1;
    for (step = 0; step < n; step++, evenphase = !evenphase) {
        if ((evenphase && evenprocess) || (!evenphase) && !(evenprocess)) {
            send(&a, P_{i+1});
            recv(&x, P_{i+1});
            if (x < a) a = x;          /* keep smaller number */
        } else {
            send(&a, P_{i-1});
            recv(&x, P_{i-1});
            /* keep larger number */
            if (x > a) a = x;
        }
    }
Determine whether the code is correct and, if not, correct it. 3: Implement (in pseudo-code) shear-sort (Sec.9.2.3). Explain why log n + 1 phases are to be used.
4: Draw the exchange of numbers for Quicksort on a hypercube (Sec. 9.2.6) using the algorithm based on Gray code ordering (Fig. 9.21). Illustrate the procedure on a particular set of numbers.
5: Draw the compare-and-exchange circuit configurations for the odd-even merge-sort algorithm described in Sec. 9.2.7 to sort 16 numbers. Sort a sequence of numbers by hand using the odd-even merge-sort algorithm.
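For Question 2, a sequential reference version of odd-even transposition sort can be handy for checking the output of an SPMD program. The sketch below simulates the $n$ phases on a single array, one array element per simulated process. It is not the message-passing solution the question asks for, and the function name is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential simulation of odd-even transposition sort.
   a[i] plays the role of the value held by process P_i.
   n phases are run; even phases compare the pairs (0,1), (2,3), ...
   and odd phases compare the pairs (1,2), (3,4), ... */
void odd_even_transposition_sort(int *a, int n)
{
    assert(n >= 0 && (a != NULL || n == 0));
    for (int phase = 0; phase < n; phase++) {
        for (int i = phase % 2; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1]) {     /* left partner keeps the smaller value */
                int tmp = a[i];
                a[i] = a[i + 1];
                a[i + 1] = tmp;
            }
        }
    }
}
```

Note that the loop bound `i + 1 < n` means an element at either end of the array simply sits out a phase in which it has no partner.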
CS-3211; Tutorial 1
1. Suppose a galaxy has $10^{11}$ stars. Estimate the time it would take to perform 100 iterations of the basic N-body algorithm using $O(N^2)$ computations and a computer that is capable of 500 MFlops.

Solution: Each iteration takes $10^{11} \times 10^{11} = 10^{22}$ steps, so 100 iterations take $10^{24}$ steps. The computer handles $500 \times 10^6 = 5 \times 10^8$ floating-point operations per second. Hence the computation takes $10^{24} / (5 \times 10^8) = 2 \times 10^{15}$ seconds, which is about 63,419,500 years.

Notice: As was pointed out at one tutorial, this is correct provided we suppose that each step takes one floating-point operation. Otherwise the time is even larger.

2. Find the diameter of: (a) a torus; (b) a tree network; (c) a k-dimensional mesh.

Solution: (a) For an $m \times n$ torus ($m$ lines and $n$ columns), this is
\[ d = \lfloor n/2 \rfloor + \lfloor m/2 \rfloor. \]
The reason is that we may go in both directions on a line (respectively, column), so the shortest distance between two nodes in the same line is at most $\lfloor n/2 \rfloor$. Similarly for the columns. For example, in a $7 \times 10$ torus two points which realize this diameter are (1,1) and (4,6).

(b) In a (complete, balanced, binary) tree network, the longest (minimal) path is, for instance, between the leftmost and the rightmost leaves. If the tree has $k$ levels, this is $2(k - 1)$.
We have to express this in terms of the number of nodes in the network. If the tree has $k$ levels, then the number of vertices is $1 + 2 + 2^2 + \ldots + 2^{k-1} = 2^k - 1$. If there are $n$ nodes in the tree, this gives $n = 2^k - 1$, hence $k = \log_2(n + 1)$. To conclude, the diameter is
\[ d = 2(\log_2(n + 1) - 1). \]
Notice: If the branching degree $r$ is not 2, but still constant, a similar result is obtained, but the logarithm is in base $r$. If the tree is not balanced or the branching degree may be different for different nodes, then the analysis is more complicated and less precise results are obtained.

(c) We suppose that the mesh is a hypercube, i.e., it has the same length in all directions. In a $k$-dimensional mesh, the greatest (minimal) distance is between the corners $(0, 0, \ldots, 0)$ and $(1, 1, \ldots, 1)$. A path between them has to traverse all $k$ directions, with length $\sqrt[k]{n} - 1$ along each direction, where $n$ is the number of nodes. Hence the result is
\[ d = k(\sqrt[k]{n} - 1). \]

3. Look at the minimal distance deadlock-free algorithm for hypercube networks described in the textbook, page 15. Apply it for: (a) a five-dimensional hypercube network from node 7 to node 22; (b) repeat for an $8 \times 8$ mesh, using its perfect embedding in a hypercube network.

Solution: (a) The binary representation of 7 is 00111 and of 22 is 10110. The algorithm requires: (i) to compute the exclusive or of the two addresses, which is 10001, and (ii) to traverse the hypercube along the directions having 1 in the result, in our case directions 1 and 5 (counted left to right). The resulting routing of length 2 is: 7 = 00111 → 23 = 10111 → 22 = 10110.

(b) In the mesh, the routing algorithm is to go, say, first horizontally and then vertically from one node to the other. If the mesh is embedded in a hypercube, then this routing is different from the hypercube routing (generally it is longer), as the mesh has forgotten many of the hypercube links.

4. Determine how the largest complete binary tree can be embedded into a hypercube. What is the dilation of the mapping?

Solution: We may recursively define a perfect embedding as follows. If we know how to embed a $k$-level tree in an $r$-dimensional hypercube, then, to embed a tree with $k + 1$ levels, we take the $(r + 2)$-dimensional hypercube and map: the root of the tree to $(0, 0, \ldots, 0)$, the left subtree into the $r$-dimensional sub-hypercube $(1, 0, *, \ldots, *)$ and the right subtree into the $r$-dimensional sub-hypercube $(0, 1, *, \ldots, *)$. This is a perfect embedding (each connection in the tree network is realized by one connection in the hypercube), but only a very small number of nodes of the hypercube are used.

If we relax the condition to have a perfect embedding, sometimes it is possible to get an irregular embedding with fewer nodes in the hypercube. E.g., a 3-level tree with the nodes represented by ( ), (0), (1), (00), (01), (10), (11) may be embedded in a 3-dimensional cube by the mapping: ( ) → (0, 0, 0), (0) → (1, 0, 0), (1) → (0, 0, 1), (00) → (1, 1, 0), (01) → (1, 0, 1), (10) → (0, 1, 0), (11) → (1, 1, 1), having dilation 2.
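The routing used in the solution to Question 3(a) can be sketched in a few lines of C. This is a minimal sketch under one assumption: differing address bits are corrected from the most significant to the least significant, which is the order that reproduces the route 7 → 23 → 22 of the worked example. The function name and signature are hypothetical, not taken from the textbook.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the nodes visited when routing from src to dst in a
   dims-dimensional hypercube and returns how many were written
   (src and dst included). The caller supplies path with room for
   dims + 1 entries. */
size_t hypercube_route(unsigned src, unsigned dst, int dims, unsigned *path)
{
    unsigned diff = src ^ dst;   /* 1 bits mark dimensions still to traverse */
    unsigned node = src;
    size_t count = 0;

    assert(dims > 0 && dims <= 32 && path != NULL);
    path[count++] = node;
    for (int bit = dims - 1; bit >= 0; bit--) {
        if (diff & (1u << bit)) {
            node ^= 1u << bit;   /* follow the link in this dimension */
            path[count++] = node;
        }
    }
    return count;
}
```

Calling `hypercube_route(7, 22, 5, path)` fills `path` with 7, 23, 22. The number of hops equals the number of 1 bits in the exclusive or, so the route is always a shortest one.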
5. What is the average distance between two nodes in: (a) a mesh network; (b) a hypercube?

Solution: 5(a) Take an arbitrary position $(i, j)$ of an $m \times n$ mesh. The sum of the distances to the cells of the $i$-th line is
\[
S = (j-1) + (j-2) + \ldots + 2 + 1 + 0 + 1 + 2 + \ldots + (n-j-1) + (n-j) = \frac{(j-1)j}{2} + \frac{(n-j)(n-j+1)}{2}.
\]
For a line which is $k$ lines away from the $i$-th line we have to add $kn$ (an extra $k$ appears for each cell), hence the total sum of the distances from cell $(i, j)$ to the other ones is
\[
\begin{aligned}
S_{i,j} &= [S + (i-1)n] + [S + (i-2)n] + \ldots + [S + 2n] + [S + n] + [S] \\
&\quad + [S + n] + [S + 2n] + \ldots + [S + (m-i-1)n] + [S + (m-i)n] \\
&= mS + n\,\frac{(i-1)i}{2} + n\,\frac{(m-i)(m-i+1)}{2} \\
&= m\left(\frac{(j-1)j}{2} + \frac{(n-j)(n-j+1)}{2}\right) + n\left(\frac{(i-1)i}{2} + \frac{(m-i)(m-i+1)}{2}\right).
\end{aligned}
\]
Here we may apply two different, but equivalent, methods:

Method 1: Find the total length of all paths and divide by the number of paths. (This is a simple general method.)

Method 2: Find the average of the lengths from one cell to the other ones, then take the average of these results over the cells. (The number of paths to be counted for each vertex is the same, so a simple, non-weighted average is enough.)

By the first method, with $i = 1, \ldots, m$ and $j = 1, \ldots, n$, the total sum of the lengths is
\[
\begin{aligned}
S_t = \sum_{i,j} S_{i,j}
&= m\left[m \sum_j \left(\frac{(j-1)j}{2} + \frac{(n-j)(n-j+1)}{2}\right)\right] + n\left[n \sum_i \left(\frac{(i-1)i}{2} + \frac{(m-i)(m-i+1)}{2}\right)\right] \\
&= m^2\,\frac{1}{2}\left[2 \sum_j (j-1)j\right] + n^2\,\frac{1}{2}\left[2 \sum_i (i-1)i\right] \\
&= m^2\left[\sum_j j^2 - \sum_j j\right] + n^2\left[\sum_i i^2 - \sum_i i\right] \\
&= m^2\left[\frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)}{2}\right] + n^2\left[\frac{m(m+1)(2m+1)}{6} - \frac{m(m+1)}{2}\right] \\
&= m^2\,\frac{(n-1)n(n+1)}{3} + n^2\,\frac{(m-1)m(m+1)}{3} \\
&= \frac{mn}{3}(mn - 1)(m + n).
\end{aligned}
\]
The number of paths is $N = C_{mn}^2 = \frac{(mn)(mn-1)}{2}$, but each path is counted two times in the above sum (once for each end), hence the average is
\[
A = \frac{S_t}{2N} = \frac{m+n}{3}.
\]
Nice formula... Maybe there is a different, simpler proof...

5(b): In a hypercube all vertices are equivalent, so it is enough to compute the average path length for one vertex only. If we start with vertex $00 \ldots 0$ ($k$ times, for a $k$-dimensional hypercube), then the distance to an arbitrary vertex is given by the number of 1s in its representation. Since exactly $C_k^i$ vertices contain $i$ ones, the total sum of the lengths to the other vertices is
\[ S = 1 \cdot C_k^1 + 2 \cdot C_k^2 + \ldots + k \cdot C_k^k. \]
This sum may be computed by taking the derivative of the well-known identity
\[ (1 + x)^k = 1 + C_k^1 x + C_k^2 x^2 + \ldots + C_k^k x^k. \]
The derivative is
\[ k(1 + x)^{k-1} = 0 + 1 \cdot C_k^1 x^0 + 2 \cdot C_k^2 x^1 + \ldots + k \cdot C_k^k x^{k-1}. \]
Our sum is the right-hand side of the above identity when $x = 1$, hence $S = k\,2^{k-1}$. The number of vertices (different from $00 \ldots 0$) is $2^k - 1$, hence the average path length (for $00 \ldots 0$ and also for the whole hypercube) is
\[ A = \frac{S}{2^k - 1} = \frac{k\,2^{k-1}}{2^k - 1}. \]
For large $k$ a good approximation of this is $k/2$.

5(c): Try to find the average distance between two vertices of a tree network.
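Both closed forms from Question 5 can be checked by brute force on small networks. The sketch below enumerates all pairs of mesh cells and all hypercube vertices directly; the function names are illustrative, and the code assumes networks small enough for the sums to fit in a `long`.

```c
#include <assert.h>

/* Average distance between two distinct cells of an m x n mesh,
   obtained by enumerating every ordered pair of cells. */
double mesh_average_distance(int m, int n)
{
    long cells = (long)m * n;
    long total = 0;   /* sum over ordered pairs: every path is counted twice */

    assert(m > 0 && n > 0 && cells > 1);
    for (long a = 0; a < cells; a++) {
        for (long b = 0; b < cells; b++) {
            long di = a / n - b / n;   /* row difference */
            long dj = a % n - b % n;   /* column difference */
            total += (di < 0 ? -di : di) + (dj < 0 ? -dj : dj);
        }
    }
    /* There are cells * (cells - 1) ordered pairs of distinct cells. */
    return (double)total / (double)(cells * (cells - 1));
}

/* Average distance from vertex 00...0 to every other vertex of a
   k-dimensional hypercube; the distance to v is the number of 1 bits in v. */
double hypercube_average_distance(int k)
{
    long nodes;
    long total = 0;

    assert(k > 0 && k < 31);
    nodes = 1L << k;
    for (long v = 1; v < nodes; v++) {
        for (long bits = v; bits != 0; bits >>= 1)
            total += bits & 1;
    }
    return (double)total / (double)(nodes - 1);
}
```

For a $3 \times 4$ mesh the enumeration gives $7/3 = (3 + 4)/3$, and for $k = 4$ it gives $32/15 = 4 \cdot 2^3 / (2^4 - 1)$, in agreement with the two formulas.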