
Rapid Estimation of Instruction Cache Hit Rates Using Loop Profiling

Santanu Kumar Dash and Thambipillai Srikanthan
School of Computer Engineering, Nanyang Technological University, Singapore 639798
{askdash, astsrikan}@ntu.edu.sg

Abstract
Estimation of the hit rate curve for an application is the first step in application-specific cache tuning. Several techniques have been proposed to meet this objective; however, most of them have dealt with the data cache, with little attention paid to the instruction cache. In this paper, we propose a novel, lightweight and highly scalable technique for rapid estimation of the instruction cache hit rate curve of a given application. Our technique works at the basic block level and relies on a one-time loop profiling of the weighted control flow graph of the application, followed by estimation of the hit rate for different cache sizes. It accounts for spatial and temporal locality separately and is sensitive to both the cache size and the block size. The proposed technique is highly accurate: when compared with results from an actual cache simulator, the mean estimation error ranged from 1.11 % to 2.46 % for the benchmarks tested.

1 Introduction

Traditionally, simulation-based search methods were used to find optimal cache sizes. In such simulations, the time required for exhaustive search is prohibitively large. For such large design spaces, iterative heuristics were proposed in [10] so that a near-optimal cache configuration is reached without actually simulating every configuration. Some techniques were also reported to reduce the trace generated by the simulated application and/or simulate multiple configurations in one pass [6]. To improve simulation speed, techniques were formulated to find approximate traces [4] or perform lossless trace reduction [9]. Another approach to arriving at the optimal cache configuration is through analytical techniques, where characteristics are extracted from the application and used to find the miss rate for a given set of cache parameters, or the optimal cache parameters given a set of requirements. Such analytical models of program locality have focused primarily on characterizing the locality pattern in an instruction access stream. A power-law model to characterize the miss rate was proposed in [2]. In this model, the miss rate for a cache of size C is modeled by the equation MR(C) = C^(−α), where α could be found based on the available trace information. Based on Chow's empirical model, a fractal model of locality was also proposed [7]. Another analytic cache model was proposed in [1], which relied on breaking the reference stream into time granules and then modeling the temporal and spatial locality of the reference stream. Other methods for modeling hit and miss behavior have relied on least-recently-used stack methods [3]. However, most of these techniques were developed with the data cache in mind. For estimating data cache hit rates, these models went down to the level of individual data references because of the potential lack of spatial proximity in data references. A single instruction can reference data from memory locations wide apart from each other; it was therefore imperative to analyze each and every reference to glean information about program locality. However, for estimating hit rate curves for the instruction cache, it is not necessary to go down to the granularity of individual instructions. The single line of control in the basic blocks of a program allows hit rate estimation to be abstracted to the level of the basic block, because once control flow enters a basic block, all the instructions in it must execute. In this paper, we present a hit rate estimation technique that leverages the basic block execution sequence.

The rest of the paper is organized as follows. In section 2, we define the terms used in our estimation framework. We give an overview of the profiling framework in section 3. We present heuristics for hit rate estimation in section 4. Our results are presented in section 5 and we conclude the paper in section 6.

1-4244-1898-5/08/$20.00 © 2008 IEEE


2 Definitions

In this section, we define some terms used in our estimation framework. In the discussion that follows, the terms cache line and cache block are used interchangeably.

Cache Line Utilization: When a line is brought into the cache, not all the instructions in it may be executed. The ratio of the number of instructions actually executed to the number of instructions brought in from memory, expressed as a percentage, is the Cache Line Utilization.

Spatial Locality: Spatial locality is defined at the level of the cache block. It is the number of additional instructions (beyond the first) executed from the same cache line after it is brought into the cache from memory.

Temporal Locality: If the program is currently executing a loop that can be fully contained in the cache, the first useful instruction of every block will be a hit as the loop iterates, because all the instructions of the loop remain resident. Therefore, if we know the size and frequency of the loop and the size of the cache, we can estimate the number of hits on the very first useful instruction in a cache block. This hit or miss of the first useful instruction in a cache block is how we define temporal locality at the cache block level. Temporal locality, as we measure it, is therefore a function of the cache size and the nature of the loops in the application.

in a line. It is essentially similar to a line buffer with a counter: the counter is updated whenever there is a cache line change. This way we keep a tab on the total number of instruction lines that were executed. The Line Utilization Estimator also keeps count of the total number of instructions. At the end of the simulation, the total number of instructions divided by the total number of lines accessed gives the number of useful instructions per line.
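This line-change accounting can be sketched as follows, assuming 32 B lines and 4 B instructions as in the authors' experimental setup; the function and variable names are illustrative, not taken from their tool.

```python
LINE_SIZE = 32      # cache block size in bytes (assumed, as in section 5)
INST_SIZE = 4       # instruction size in bytes (assumed)

def line_utilization(trace):
    """trace: iterable of instruction addresses in execution order.

    Bumps a counter whenever the executing line changes (like a line
    buffer with a counter) and returns the average number of useful
    instructions per line."""
    lines_accessed = 0
    instructions = 0
    current_line = None
    for addr in trace:
        line = addr // LINE_SIZE
        if line != current_line:    # cache line change -> bump counter
            lines_accessed += 1
            current_line = line
        instructions += 1
    return instructions / lines_accessed if lines_accessed else 0.0

# Example: a straight-line run of 16 instructions spans exactly 2 lines,
# so every fetched instruction is useful: 8 instructions per line.
trace = [0x1000 + i * INST_SIZE for i in range(16)]
print(line_utilization(trace))  # -> 8.0
```

A real trace with branches would skip instructions inside fetched lines, pulling the average below the 8-instruction maximum, which is exactly what Table 1 reports for the benchmarks.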

3.2 Loop Profiler

Our loop profiler relies on the trace of basic blocks to generate a weighted partial CFG, or dynamic CFG. It is partial because it captures only the paths that the program takes while executing, and weighted because it also captures how many times each path is taken. This is in contrast to traditional CFGs built by compilers, which account for all control flow paths. The weighted partial CFG is thus representative of the different paths of execution taken by the program. The logical next step after obtaining the weighted partial CFG would be to identify the loops in it. However, to simplify the loop identification process, we partition the weighted partial CFG at the function level. The program information is then represented using three characteristics: the function call graph, the information about the loops in each function, and the set of functions that those loops invoke.

3.2.1 Identification of the Loops
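The paper performs this step with the dominator-join (DJ-graph) algorithm of Sreedhar et al. [5]. As an illustration of what loop identification computes, the sketch below finds natural loops on a toy CFG using a plain iterative dominator computation instead, which is simpler though asymptotically slower than DJ graphs; all names are hypothetical.

```python
def find_loops(cfg, entry):
    """cfg: node -> list of successor nodes. Returns a dict mapping each
    back edge (tail, header) to the set of nodes in its natural loop."""
    nodes = set(cfg) | {s for ss in cfg.values() for s in ss}
    preds = {n: set() for n in nodes}
    for n in nodes:
        for s in cfg.get(n, []):
            preds[s].add(n)
    # Iterative dominator computation: dom(n) = {n} union the
    # intersection of its predecessors' dominators, to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            ps = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*ps) if ps else set())
            if new != dom[n]:
                dom[n], changed = new, True
    # A back edge u -> h exists when h dominates u; the loop body is h
    # plus every node that reaches u without passing through h.
    loops = {}
    for u in nodes:
        for h in cfg.get(u, []):
            if h in dom[u]:
                body, work = {h, u}, [u]
                while work:
                    for p in preds[work.pop()]:
                        if p not in body:
                            body.add(p)
                            work.append(p)
                loops[(u, h)] = body
    return loops

# Toy partial CFG: entry A, loop B <-> C, exit D.
cfg = {"A": ["B"], "B": ["C"], "C": ["B", "D"], "D": []}
print(sorted(find_loops(cfg, "A")[("C", "B")]))  # -> ['B', 'C']
```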

3 Overview of the Profiling Framework

The profiling framework (figure 1) consists of a simulator augmented with a profiling tool to identify the amount of spatial locality, the nature of the loops and, subsequently, the temporal locality in the application. The framework uses a simulator which is dynamically instrumented to report information about every basic block it executes. This information is then fed to the profiling tool, which performs the instruction accounting and hit rate calculation. In this section, we discuss the role of each component in the framework.

The loops are identified as shown in figure 2. The dynamic control flow graph is divided at the function level using a hasher, which uses information about the function address ranges to do so. Once we have the weighted control flow graph for each function, we run the dominator-join algorithm [5] on each of these graphs to find the loops in the function. We also construct a function call graph of the application using the runtime information. This call graph is then fed to a module which performs loop-centric function profiling to identify the functions that could potentially be called by each loop. At the end of this step, we have two sets of information about every loop in the application:

BBList: The list of basic blocks of the function containing the loop that are also part of the loop body.

FuncList: The list of functions that the loop can potentially call as it executes.

3.2.2 Calculation of loop sizes

3.1 Line Utilization Estimator

Due to the nature of code layout, some bytes may be brought into the cache from memory even if they are not needed by the application. In other words, cache accesses by the application may not be aligned with cache line boundaries, so there can be unused instructions in every cache line fetched from memory. The Line Utilization Estimator measures the amount of useful instructions

To calculate the size of the loop, we use the BBList and FuncList information obtained from the previous step. Our


Figure 1. Hit Rate Estimation Framework

Loop-Iteration Count: The number of times a loop iterates. This variable is referred to as LITC subsequently.

Basic-Block Call Ratio: The ratio of the number of times a basic block in the loop executes to LITC. Subsequently, this is referred to as BBCR.

Function Call Ratio: The ratio of the number of times a function is called to the sum of the LITC of all the loops that could potentially call the function (many loops may call the same function). This is referred to as FCR hereafter.

Normalized Basic-Block Instruction Count: The number of distinct instructions executed per basic block for each call to the basic block. This is obtained by dividing the total number of instructions executed as part of the basic block by the total number of times the basic block is called. Henceforth, this is referred to as NBBIC.

Normalized Function Instruction Count: The number of distinct instructions executed per function for each call to the function. This is obtained by dividing the total number of instructions executed as part of the function by the total number of times the function is called. Henceforth, this is referred to as NFIC.

Normalized Loop Instruction Count: The number of distinct instructions executed as part of the loop body per iteration of the loop. Henceforth, this is referred to as NLIC.

Line Utilization Ratio: The ratio of useful instructions in every cache line that is brought into the cache. This is referred to as LUR in the equations.
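As an illustration, the normalized counts NBBIC and NFIC amount to simple divisions over raw profile totals; the data layout below is a hypothetical sketch, not the authors' data structures.

```python
def normalized_counts(bb_insts, bb_calls, fn_insts, fn_calls):
    """NBBIC / NFIC from raw totals: instructions executed as part of a
    block (or function) divided by the number of times it was entered."""
    nbbic = {b: bb_insts[b] / bb_calls[b] for b in bb_insts}
    nfic = {f: fn_insts[f] / fn_calls[f] for f in fn_insts}
    return nbbic, nfic

# A block that executed 300 instructions over 100 calls averages 3
# distinct instructions per call; a function with 5000 instructions
# over 50 calls averages 100.
nbbic, nfic = normalized_counts({"bb0": 300}, {"bb0": 100},
                                {"f0": 5000}, {"f0": 50})
print(nbbic["bb0"], nfic["f0"])  # -> 3.0 100.0
```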

Figure 2. Loop Identification Framework
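The first stage of this framework, accumulating the weighted partial CFG from the basic-block trace, can be sketched as follows; the representation (edge-to-count map) and names are assumptions, not the authors' implementation.

```python
from collections import Counter

def build_weighted_cfg(bb_trace):
    """bb_trace: sequence of basic-block identifiers in execution order.

    Returns a Counter mapping (src, dst) edges to execution counts: the
    weighted partial CFG, which records only the paths actually taken
    and how many times each was taken."""
    edges = Counter()
    for src, dst in zip(bb_trace, bb_trace[1:]):
        edges[(src, dst)] += 1
    return edges

# A tiny trace with a two-iteration loop: A -> B -> C -> B -> C -> D
trace = ["A", "B", "C", "B", "C", "D"]
wcfg = build_weighted_cfg(trace)
print(wcfg[("B", "C")], wcfg[("C", "B")])  # -> 2 1
```

Each trace event costs one map update, which is consistent with the near-logarithmic per-reference cost discussed in section 5.2.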

objective here is to determine the number of distinct instructions (ignoring any repetition of instructions due to loop iterations) executed during the lifetime of the loop. This figure is indicative of the size of the loop and can be approximated by normalizing the total number of instructions executed as part of the loop by the loop iteration count. After obtaining this figure, we divide it by the line utilization ratio to get the actual size of the loop. This process is shown in figure 3 and explained in detail below. To find the size of a loop, we use the information about the functions it can potentially call (FuncList) and the basic blocks of the containing function that are part of its body (BBList). We maintain a few variables for this purpose, as described below.


Then, using the block size of the cache and the size of each instruction, the number of useful instructions per line can be calculated. Of these useful instructions, depending on how many times the cache line is reused, the reuse of the first instruction boosts the temporal hit rate; the rest of the useful instructions in the line contribute to the spatial hit rate. So, the spatial hit rate (SHR) can be calculated from the line utilization ratio (LUR) and the instructions per cache block (IPB) using equation 3.

SHR = (LUR × IPB − 1) / (LUR × IPB)    (3)
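Equation 3 can be checked directly against the Table 1 figures; for example, SHA's 89 % line utilization with 8 instructions per block reproduces its reported 86 % spatial hit rate. A minimal sketch (the function name is an assumption):

```python
def spatial_hit_rate(lur, ipb):
    """Equation (3): of the LUR*IPB useful instructions per line, all
    but the first (a miss or a temporal hit) are spatial hits."""
    useful = lur * ipb
    return (useful - 1) / useful

# SHA from Table 1: LUR = 0.89, 8 instructions per block.
print(round(spatial_hit_rate(0.89, 8) * 100))  # -> 86
# Bitcount from Table 1: LUR = 0.54.
print(round(spatial_hit_rate(0.54, 8) * 100))  # -> 77
```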

4.2 Estimation of Temporal Hits

Figure 3. Estimation of the loop sizes

Actual Loop Size: The actual size of the loop after accounting for all the unutilized instructions that are brought into the cache as part of the loop body. This variable is referred to as LS in the equations.

Using the variables described above, the loop size can be determined with a pair of equations. As shown in figure 3, we first obtain the NLIC value by adding instruction counts from both the basic blocks and the functions that are part of the loop, using equation 1. By using equation 2, we then obtain the actual size of the loop. Here, IS is the instruction size.

NLIC = Σ (NBBIC × BBCR) + Σ (NFIC × FCR)    (1)

LS = (NLIC × IS) / LUR    (2)
Unlike the benefit from spatial hits, the possible benefit from temporal hits (reuse of the first instruction in a cache line) is a function of the cache size and the sizes of the loops. This is because whether a cache line belonging to a loop remains resident in the cache or is flushed out in subsequent iterations of the loop depends on how big a cache we use. To estimate the approximate number of temporal hits, we have to calculate the number of lines of the loop and the number of lines in the cache. This is done as shown in equations 4 and 5. Here C is the size of the cache, and LL and CL are the number of cache lines that the loop can occupy and the total number of lines in the cache, respectively.

CL = C / (IPB × IS)    (4)

LL = LS / (IPB × IS)    (5)

Once we know the number of lines that a loop spans and the total number of lines available in the cache, we calculate the cache residency (CR) of the loop: the average number of lines retained in the cache per iteration of the loop. There are three possible cases while a loop is executing.

When the complete loop is cache resident: This happens when the cache size is larger than the loop size. In this case, the cache-resident part of the loop is the entire loop body.

CR = LL    if LL < CL    (6)

4 Heuristics for Hit Rate Estimation


In this section, we present equations to estimate the number of spatial hits and the number of temporal hits as the application runs. These equations are used by the Spatial Hit Estimator and Temporal Hit Estimator modules to estimate the spatial and temporal hits in the application.

4.1 Estimation of Spatial Hits

When the partial loop is cache resident: When the loop size is greater than the cache size but smaller than twice the cache size, a certain portion of the loop is flushed from the cache with every iteration. The part that remains behind in the cache is given by equation 7.

CR = 2 × CL − LL    if CL ≤ LL < 2 × CL    (7)

The first step in calculating the spatial hit rate at the cache line level is to calculate the line utilization ratio.


When no temporal locality is possible: This happens when the loop size is greater than twice the cache size. In this case, not a single line is able to remain in the cache, because every line is flushed out by other lines during the same iteration, and CR is zero.

CR = 0    if LL ≥ 2 × CL    (8)
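The three residency cases of equations 6–8, together with the line counts of equations 4 and 5, can be sketched as one piecewise function; parameter names are assumptions.

```python
def cache_residency(ls, cache_size, ipb, inst_size):
    """Equations (4)-(8): lines of the loop retained per iteration."""
    line_bytes = ipb * inst_size
    cl = cache_size / line_bytes    # lines in the cache       (4)
    ll = ls / line_bytes            # lines the loop occupies  (5)
    if ll < cl:                     # loop fully resident      (6)
        return ll
    if ll < 2 * cl:                 # partially resident       (7)
        return 2 * cl - ll
    return 0                        # fully flushed each pass  (8)

# 1 kB cache with 32 B lines (8 x 4 B instructions): a 512 B loop fits
# entirely, a 1.5 kB loop keeps half its lines, a 3 kB loop keeps none.
print(cache_residency(512, 1024, 8, 4))   # -> 16.0
print(cache_residency(1536, 1024, 8, 4))  # -> 16.0
print(cache_residency(3072, 1024, 8, 4))  # -> 0
```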

Once we know the cache-resident part of a loop in terms of the number of lines, we can multiply this value by LITC to get the number of temporal hits for that loop (because temporal locality corresponds only to the first instruction in each cache line) and sum this over every loop to get the total number of temporal hits for the application. This number divided by the total instruction count (TIC) gives us the temporal hit rate (THR), as shown in equation 9.

THR = Σ (CR × LITC) / TIC    (9)

Benchmark        Line util (inst/line)   Line util (%)   Spatial hr (%)
Bitcount         4.35                    54              77
Qsort            4.55                    57              78
Dijkstra         5.00                    63              80
Patricia         4.55                    57              78
Stringsearch     4.35                    54              77
Rijndael         5.88                    74              83
SHA              7.14                    89              86
FFT              4.35                    54              77

Table 1. Line Utilization Ratios and Spatial Hit Rates

The final step is to calculate the total hit rate (HR) of the cache. This is done by adding SHR and THR, as shown in equation 10.

HR = SHR + THR    (10)
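Putting equations 9 and 10 together, the overall estimate reduces to the following sketch, shown with hypothetical numbers.

```python
def total_hit_rate(loops, shr, tic):
    """Equations (9) and (10): each retained line contributes one
    first-instruction hit per iteration. loops is a list of
    (cache_residency, litc) pairs, one per loop in the application."""
    thr = sum(cr * litc for cr, litc in loops) / tic
    return shr + thr

# Hypothetical run: one loop keeping 16 lines resident over 1000
# iterations, 100000 instructions in total, spatial hit rate 0.80.
print(round(total_hit_rate([(16, 1000)], 0.80, 100000), 2))  # -> 0.96
```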

Benchmark        Mean Error (%)   Max Error (%)
Bitcount         2.07             3.91
Qsort            1.86             3.85
Dijkstra         2.46             5.88
Patricia         1.34             3.12
StringSearch     1.11             2.65
Rijndael         1.27             2.04
SHA              1.50             1.86
FFT              1.48             3.78

Table 2. Estimation Error

5 Results and Discussions

The profile-driven scheme for instruction cache hit rate estimation was tested using standard benchmark programs from MiBench. The benchmarks were run on the Skyeye ARM simulator running a Linux 2.6 kernel. We compared the estimated instruction hit rates with actual instruction hit rates obtained using the Dinero cache simulator. The values reported below are for 4-way set-associative cache memories ranging from 128 B to 64 kB with a block size of 32 B. The cache hierarchy consists of a split instruction/data L1 cache connected to an L2 cache; we estimate the hit rate for the instruction cache only. The instruction size is assumed to be 4 B, so every cache block can hold up to 8 instructions.

5.1 Program Locality and Hit Rates

Cache memories are meant to hold frequently executed code segments, so it is natural that there should be a correlation between the loop sizes and the increase in hit rate for a cache of comparable size. Figure 4 shows the sizes of the loops (shown as bars) vs. the percentage of time the program spends executing those loops (right Y axis). As is evident from the figure, whenever there is a loop in which the program spends a significant amount of time, the hit rate shoots up for a cache size in the vicinity of the loop size. For example, consider the benchmark program SHA. SHA has many loops in the size range of 128 B to 256 B and the program spends a significant amount of time executing them. As a result, when the cache size is increased from 128 B to 256 B, the actual hit rate (represented by the solid line) increases dramatically. We compared the estimated hit rates with the actual hit rates obtained using the Dinero cache simulator. As can be seen in figure 4, the estimated hit rate values (shown as the dotted line) follow the actual hit rate values (shown as the solid line) closely. Table 2 shows the error in the estimated hit rates relative to those calculated using the Dinero cache simulator. The estimated values exhibit a high degree of accuracy, with the mean estimation error for the benchmarks around 2 %.

Table 1 shows the line utilization for different benchmark programs. The figures reported are for cache blocks that can hold up to 8 instructions. Very few programs have close to 100 % line utilization, which justifies a method that measures the spatial hit rate using line utilization instead of assuming 100 % cache line utilization, as was done in [8]. Also, an average measure of the line utilization ratio is required to estimate actual loop sizes in the presence of unutilized instructions.


Figure 4. Actual vs. estimated hit rates for (a) Bitcount, (b) Qsort, (c) Dijkstra, (d) Patricia, (e) Stringsearch, (f) Rijndael.enc, (g) SHA and (h) FFT. Each panel plots Hit Rate (%) and Loop Execution Time (%) against Cache Size/Loop Size (in bytes) from 128 B to 64 kB.

5.2 Complexity of the estimation process

References

[1] A. Agarwal, J. Hennessy, and M. Horowitz. An analytical cache model. ACM Transactions on Computer Systems, 7(2):184-215, 1989.
[2] C. K. Chow. Determination of cache's capacity and its matching storage hierarchy. IEEE Transactions on Computers, 25(2):677-688, 1976.
[3] M. Kobayashi. A cache multitasking model. SIGMETRICS Performance Evaluation Review, 20(2):27-37, 1992.
[4] S. Laha, J. H. Patel, and R. K. Iyer. Accurate low-cost methods for performance evaluation of cache memory systems. IEEE Transactions on Computers, 37(11):1325-1336, 1988.
[5] V. C. Sreedhar, G. R. Gao, and Y.-F. Lee. Identifying loops using DJ graphs. ACM Transactions on Programming Languages and Systems, 18(6):649-658, 1996.
[6] R. Sugumar and S. Abraham. Efficient simulation of multiple cache configurations using binomial trees. Technical report, 1991.
[7] D. Thiebaut. On the fractal dimension of computer programs and its application to the prediction of the cache miss ratio. IEEE Transactions on Computers, 38(7):1012-1026, 1989.
[8] K. Vivekanandarajah, T. Srikanthan, and C. T. Clarke. Profile directed instruction cache tuning for embedded systems. In IEEE Symposium on Emerging VLSI Technologies and Architectures (ISVLSI), page 277, Washington, DC, USA, 2006. IEEE Computer Society.
[9] Z. Wu and W. Wolf. Iterative cache simulation of embedded CPUs with trace stripping. In International Workshop on Hardware/Software Codesign (CODES), pages 95-99, New York, NY, USA, 1999. ACM Press.
[10] C. Zhang and F. Vahid. Cache configuration exploration on prototyping platforms. In Proceedings of the International Workshop on Rapid Systems Prototyping, page 164, Washington, DC, USA, 2003. IEEE Computer Society.

The time complexity of the estimation process is dominated by building the dynamic control flow graph. The time complexity of this step is O(P log Q), where P is the total number of basic blocks referenced and Q is the number of distinct basic blocks executed: insertion into and updating of the DCFG can be done in logarithmic time, and this has to be done every time a basic block is referenced, hence O(P log Q) for this task. The time complexity of loop identification has been shown to be almost linear [5], which can be approximated as O(Q). So, the time complexity of the entire process can be taken to be O(P log Q + Q). The space complexity of the estimation framework is O(Q), dependent on the number of distinct basic blocks processed.

6 Conclusion

In this paper, we proposed an analytical technique for estimating instruction cache hit rates. We relied on loop profiling to achieve this objective, using cache block sizes to model spatial locality and a combination of cache sizes and loop sizes to model temporal locality. The estimated values showed a high level of accuracy when compared with values obtained using a cache simulator, with mean errors ranging from 1.11 % to 2.46 % and maximum errors ranging from 1.86 % to 5.88 % for the benchmarks tested. The hit rates estimated using this technique can be used for a variety of memory hierarchy optimizations, including instruction cache tuning.

