
Homework 1

CDA 5155: Spring 2012
Due Date: 02/08/2012 11:55 PM (EDGE Students: 02/11/2012 11:55 PM)
Total: 20 points (5% of overall score)
You are not allowed to give or receive help in completing this assignment. Submit the PDF version of your submission on the e-Learning (Sakai) website before the deadline. Please include the following sentence in bold at the top of your submission (PDF): I have neither given nor received any unauthorized aid on this assignment.

1. Assume that you are the product manager for the XXX processor. The chip has an area of 263 mm², with a defect rate of 0.025 defects per cm² and N = 11.5. The die of each chip is occupied by four identical cores (70% of the total area) and a shared L3 cache (30% of the total area). For simplicity, assume that each chip has only four cores and an L3 cache (no other components).

a. [1 Point] What is the yield of the die?

b. [1 Point] Some researchers have proposed that the number of defects in a die can be modeled by a geometric distribution. Suppose we use the yield as the probability that there is no defect on a die. What is the value of the parameter p in the geometric distribution here? Note: under the geometric distribution, the probability that there are exactly k defects (k being a non-negative integer, k = 0, 1, 2, ...) is equal to

$P(k) = (1 - p)^k \, p$

where k is the number of defects and p is a positive real number.

c. [3 Points] In a defective chip, assume that defects are independent and uniformly distributed within the die area. What is the probability that all defects in a DEFECTIVE die occur in the same core? (In other words, there is no defect in any of the other three cores or in the L3 cache.) Please note that there can be more than one defect on a die.

d. [1 Point] If there is only one defective core in a chip with a defect-free L3 cache, we can still sell the chip by shutting down the defective core. Suppose you can sell a perfectly working (defect-free) chip for $259.99. Also assume that it costs $179 to manufacture and test each chip. What is the minimum sale price for chips with 3 working cores (the defective core is shut down) that lets you break even (no profit, no loss)?
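As a rough illustration of the quantities in Problem 1, the sketch below evaluates a die-yield model and a geometric probability mass function. The assignment does not state the yield formula, so the model used here, yield = wafer_yield / (1 + defects_per_area * die_area)^N with a wafer yield of 100%, is an assumption taken from the common textbook form. The pmf is assumed to be P(k) = (1 - p)^k * p for k = 0, 1, 2, .... The function names are hypothetical.

```python
def die_yield(die_area_cm2, defects_per_cm2, n, wafer_yield=1.0):
    """ASSUMED model: wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n."""
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n


def geometric_pmf(k, p):
    """ASSUMED pmf: probability of exactly k defects, k = 0, 1, 2, ..."""
    return (1.0 - p) ** k * p


# Values from the problem statement; 263 mm^2 is converted to cm^2.
area_cm2 = 263 / 100.0
y = die_yield(area_cm2, defects_per_cm2=0.025, n=11.5)

# If the yield is read as P(0 defects), the pmf gives P(0) = p, so p equals the yield.
p = y

print(f"die yield       = {y:.4f}")
print(f"P(0 defects)    = {geometric_pmf(0, p):.4f}")
print(f"P(1 defect)     = {geometric_pmf(1, p):.4f}")
```

Whether the course expects this exact yield model is not confirmed by the assignment text, so the numbers should be treated as a check on the arithmetic rather than as an official answer.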

2. One day you got tired of your processor company, so you accepted an offer from the software division of a GPU company. Your job is to develop a numerical simulation program. The software will be used on a workstation with a single CPU core and 512 GPU cores. The CPU can achieve 1 GFlops, while each GPU core can deliver 3.9 GFlops (peak). For simplicity, we consider only floating-point operations in this problem. The CPU and all the GPU cores can perform calculations simultaneously, unless there are specific restrictions.

a. [2 Points] If all of the dynamic instructions in your main application are parallelizable, what is the maximum performance (Flops) you can get from your hardware in the optimal situation? What is the speedup compared with CPU-only execution? State your assumptions, if any.
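A minimal sketch of the peak-throughput arithmetic for part a, assuming perfect parallelism, no overhead, and that the CPU and all GPU cores work at their peak rates at the same time. The constant and variable names are hypothetical.

```python
CPU_GFLOPS = 1.0        # single CPU core, from the problem statement
GPU_CORE_GFLOPS = 3.9   # peak per GPU core, from the problem statement
NUM_GPU_CORES = 512

# With fully parallelizable work, the peak rates simply add up.
total_gflops = CPU_GFLOPS + NUM_GPU_CORES * GPU_CORE_GFLOPS

# Speedup relative to running everything on the CPU alone.
speedup = total_gflops / CPU_GFLOPS

print(f"aggregate peak = {total_gflops:.1f} GFlops")
print(f"speedup vs CPU = {speedup:.1f}x")
```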

b. [3 Points] In reality, the computation cannot be performed without data. Suppose all input data is stored in memory prior to execution. Before each round of computation, you have to load the required input data into the cache of the CPU or GPU cores. Similarly, the computation results, i.e., the output data, will be stored in the corresponding cache immediately after the computation. The output data must be written back to memory before the next round of computation. Assume that the memory has infinite capacity. The CPU and GPU caches are 5 MB and 1 MB, respectively. There is no overlap between data transfer and computation. In other words, at most 5 MB/1 MB of data can be moved to the CPU/GPU cache before any computation, and no data can be transferred during the computation. The bandwidth between the CPU cache and memory is 6 GB/s, while the bandwidth between the GPU cache and memory is 36 GB/s. Each byte of input data requires 2 Flops to produce 0.4 bytes of output data on average. There is no dependency among different parts of the data. What is the maximum performance (Flops) you can get from your hardware in the optimal situation if we take the data transfer time into consideration? What is the speedup in this case compared to CPU-only execution? What happens to Flops and speedup if the GPU cache is 20 MB?

[Figure: block diagram with the labels CPU Core, CPU Cache, Memory, GPU Cache, and GPU Core 1 ... GPU Core 512.]
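The round-based model in part b can be sketched as below. This is one possible reading, not a definitive solution. It assumes 1 GB = 1e9 bytes, that each round loads one chunk of input, computes on it, then writes the output back with no overlap, that all 512 GPU cores share one GPU cache and one 36 GB/s link, and that the CPU and GPU run their rounds independently. The function name `effective_gflops` and its parameters are hypothetical.

```python
def effective_gflops(peak_gflops, bandwidth_gb_s, chunk_bytes,
                     flops_per_byte=2.0, out_bytes_per_in_byte=0.4):
    """Sustained GFlops for one device when load, compute, and store do not overlap."""
    load_s = chunk_bytes / (bandwidth_gb_s * 1e9)
    compute_s = chunk_bytes * flops_per_byte / (peak_gflops * 1e9)
    store_s = chunk_bytes * out_bytes_per_in_byte / (bandwidth_gb_s * 1e9)
    total_flops = chunk_bytes * flops_per_byte
    return total_flops / (load_s + compute_s + store_s) / 1e9


# CPU: 1 GFlops peak, 6 GB/s, chunk sized to the 5 MB cache (ASSUMED 5e6 bytes).
cpu = effective_gflops(1.0, 6.0, 5e6)

# GPU: 512 cores at 3.9 GFlops, 36 GB/s shared link (ASSUMED), 1 MB chunk.
gpu_small = effective_gflops(512 * 3.9, 36.0, 1e6)
gpu_large = effective_gflops(512 * 3.9, 36.0, 20e6)

print(f"CPU only        = {cpu:.4f} GFlops")
print(f"GPU, 1 MB chunk = {gpu_small:.4f} GFlops")
print(f"GPU, 20 MB chunk= {gpu_large:.4f} GFlops")
print(f"speedup vs CPU  = {(cpu + gpu_small) / cpu:.2f}x")
```

In this simplified model the chunk size cancels out of the ratio, so the two GPU results coincide. A different reading of the problem (for example, a separate cache or link per GPU core) would change the numbers.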

3. [4 Points] Assume that the values of variables A, B, C and D reside in memory. Write the code sequence for D = B*(A+B-C) + A*D for four instruction-set architectures: i) Stack, ii) Accumulator, iii) Register-memory and iv) Register-register (Load-Store). (These four architectures are shown in Figure A.1 on page A-4 of Appendix A.) Please do not perform any scheduling or other optimizations of the above code sequence!

4. The ARM instruction set offers an instruction to load multiple registers. LDMIA R1,{R2,R7} will perform the following two operations: R2 = memory[R1], R7 = memory[R1+4]. In other words, we can replace

    LOAD R2, 0(R1)
    LOAD R7, 4(R1)
    ADD R5, R7, R2
    STORE R5, 4(R6)

by

    LDMIA R1,{R2,R7}
    ADD R5, R7, R2
    STORE R5, 4(R6)

There are two different formats of LDMIA available: LDMIA R1,{R2,R7} and LDMIA R1,{R2-R7}. The first format can load 2 registers. The second format can load all registers in the range, i.e., LDMIA R1,{R2-R7} will perform the following operations: R2 = memory[R1], R3 = memory[R1+4], R4 = memory[R1+8], R5 = memory[R1+12], R6 = memory[R1+16], R7 = memory[R1+20].

a. [3 Points] LDMIA is not currently supported by the MIPS instruction set. Suppose all instructions are still 32 bits. Since this is an R-Type instruction, there are 6 bits reserved for the opcode. Can you design the binary format (encoding) for this LDMIA instruction if we want to add it to MIPS? Please note that your encoding should support instructions like LDMIA R1,{R2,R7} and LDMIA R1,{R2-R7}.

b. [2 Points] Assume that the new instruction will cause the clock cycle to increase by 2.5%. Assume that 26% of dynamic instructions are loads. The new instruction affects only the clock speed and not the CPI. If only 20% of load instructions can be eliminated by the new instruction, will the overall performance change? Indicate the change (e.g., if improved, by how much?).
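The trade-off in part b can be checked with the usual relation execution time = instruction count x CPI x clock cycle time. The sketch below assumes that "eliminated" means the dynamic instruction count drops by exactly that fraction of the loads and that CPI is unchanged, as the problem states. The function name is hypothetical.

```python
def relative_exec_time(load_fraction, loads_eliminated, cycle_increase):
    """New execution time divided by old, with CPI held constant (ASSUMED)."""
    new_instruction_count = 1.0 - load_fraction * loads_eliminated
    new_cycle_time = 1.0 + cycle_increase
    return new_instruction_count * new_cycle_time


# Values from the problem statement: 26% loads, 20% of them removed, +2.5% cycle time.
ratio = relative_exec_time(0.26, 0.20, 0.025)
speedup = 1.0 / ratio

print(f"new time / old time = {ratio:.4f}")
print(f"speedup             = {speedup:.4f}")
```

A ratio below 1 means the shorter instruction stream outweighs the slower clock under these assumptions.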
