next up previous
Next: Future Work Up: No Title Previous: First Miss Adjustment

Measurements

This section evaluates the benefits of instruction cache analysis via static cache simulation. Cache measurements were obtained for user programs, benchmarks, and UNIX utilities. The measurements were produced by modifying the back-end of the optimizing compiler VPO (Very Portable Optimizer) [] and by performing static cache simulation. The simulation was performed for the Sun SPARC instruction set, a RISC architecture with a uniform instruction size of one word (four bytes).

The parameters for cache simulation included direct-mapped caches with sizes of 64B to 8kB. The cache line size was fixed at 4 words. No context switches were simulated. The size of the programs varied between 2kB and 18kB. This provided a range of measurements from capacity misses dominating for small cache sizes to some programs entirely fitting in cache for large cache sizes. The number of instructions executed for each program comprised a range of 1 to 19 million using realistic input data for each program.

Table 1 shows the measurements of each test program for a 1kB cache. The static measurements reflect the percentage of always hits, always misses, first misses, and conflicts out of the total number of instructions in the function instance tree. It can be seen that a large number of hits and misses can be predicted statically. The number of always hits is slightly above 70% in average and does not change significantly with varying cache sizes. The number of first misses increases for larger caches while conflicts and misses decrease at the same time. This can be explained as follows. First misses occur when a program line without any conflicts is placed in cache on its first reference and remains in cache thereafter. For very small caches, always misses dominate due to capacity misses. For medium-sized caches, program lines tend to conflict with one another more frequently resulting in more conflict instructions. As the programs begin to fit into cache, fewer program lines are in conflict and more references become first misses due to the increased cache capacity. In the worst case, only every sixth instruction is statically predicted as a conflict and will have to be simulated at execution time. At best, there are virtually no conflicts and almost the entire runtime simulation can be performed using efficient frequency counters.

Column 6 indicates the percentage of measurement points required for our method versus the number of measurement points inserted in a conventional cache simulation (i.e., one measurement point per basic block). Our method requires only 76% of the measurement points required for the traditional trace-driven methods, i.e. about 24% fewer measurement points statically. The run-time savings (column 7) are even higher, requiring only about 69% of the measurement points executed under traditional trace-driven cache simulation. The additional dynamic savings are due to reducing sequences of basic blocks inside loops to fewer UPs, sometimes just to a single UP.

The static cache simulation results were verified (for each program execution and cache size) by ensuring that the exact same number of hits and misses were produced as obtained by traditional trace-driven cache analysis. As the cache size increases, the hit ratio (column 8) increases as well. Column 9 and 10 represent the quotient of the execution time of a program with instrumentation over the execution time for the same program without instrumentation. Column 9 refers to a trace-driven method that has been optimized such that the cache simulator is only called once per basic blockgif. Column 10 shows that cache simulation via static cache simulation is more efficient than the trace-driven method.gif Column 11 refers to the analysis via static cache simulation. The percentage of conflicts (out of all instruction references) simulated at execution time is shown in the last column.

Figure 5 shows the overhead for different cache sizes. With the traditional trace-driven method, the execution time of instrumented programs is 14x to 24x slower than the execution time of regular programs without instrumentation. The overhead for the new method using static cache simulation is much lower, only a factor of 1.1 to 2.8. This overhead depends slightly on the ratio of program size and cache size. The variation can be explained as follows.

Let the conflict degree be the number of program lines that map into the same cache line. This is a useful term to characterize the size of shared path states (SPSs) and the execution overhead due to order-dependent simulation. For small caches, the conflict degree is relatively small. Many references will result in always misses due to a lack of cache capacity, which require only efficient frequency counting. For medium-sized caches, the conflict degree increases, peaking at a 512B cache for this test set, while always misses decrease. This requires an increased number of dynamically simulated state transitions for conflicts. For larger caches, capacity misses and the conflict degree of program lines decrease. They are replaced by first misses. With a diminishing number of conflicts for large caches, the size of SPSs decreases as the cache size increases. In other words, fewer and fewer conflicting program lines map into the same cache lines so that less instrumentation code to update conflicting SPSs is needed since hardly any conflicts remain. Thus, the cache simulation at execution time can be reduced to simple frequency counting, which imposes a much lower overhead than conventional cache simulation. To summarize this discussion, it is observed that the new method requires slightly more execution overhead for small caches than for large caches since more SPSs have to be updated dynamically.

The new method outperforms conventional trace-driven cache simulation by almost an order of a magnitude without compromising the accuracy of measurements. Even the best results published in [] required an overhead factor of 2-15 over uninstrumented code for hit ratios between 96% and 99%. This highly tuned traditional method required a recompilation pass

  
Figure 5: Average Overhead for varying Cache Sizes

for better instrumentation. Under all conditions, the new method using static cache simulation outperforms the best traditional trace-driven methods published.


next up previous
Next: Future Work Up: No Title Previous: First Miss Adjustment

Robert Palmer
Mon May 19 10:44:04 EDT 1997