next up previous
Next: Code Instrumentation Up: Static Cache Simulation Previous: Instruction Categorization

 

Implementation

The iterative algorithm in Figure 2 was used to calculate the abstract cache states. Each UP has an input and output state of program lines that can potentially be in cache at that point. Initially, the input states of the top paths (entry paths of the main function) are set to all invalid lines. The input state of a path is calculated by taking the union of the output states of its immediate predecessors. The output state of a path is calculated by taking the union of its input state and the program lines accessed by the path and subtracting the program lines with which the path conflicts. This calculation includes the interprocedural propagation of abstract cache states (not explicitly shown in the algorithm): At a path p with a call to function f as the last instruction, the output state of the UP is propagated to the input states of the entry paths of the corresponding function instance f(i). Similarly, the union of the output states of f(i)'s exit paths provides the input state for p's successor paths.

  figure142
Figure 2: Algorithm to Calculate Cache States

 
 

Figure 3: Example with Flow Graph

The algorithm is a variation of an iterative data-flow analysis algorithm commonly used in optimizing compilers. Thus, the time overhead of the algorithm is comparable to that of data-flow analysis and the space overhead is O(pl*UPs*fi), where pl is the number of program lines, UPs is the number of paths, and fi the number of function instances. The correctness of the algorithm for data-flow analysis is discussed in []. The calculation can be performed for an arbitrary control-flow graph, even if it is irreducible. In addition, the order of processing basic blocks is irrelevant for the correctness of the algorithm. The reaching states can be calculated using the same base algorithm with input_state(main) = conf_lines(UP) = tex2html_wrap_inline667.

Figure 3 depicts the calculation of input and output states. The paths are restricted to basic blocks to simplify the example. In the example, there are 4 cache lines and the line size is 16 bytes (4 instructions). Thus, program line 0 and 4 map into cache line 0, program line 1 and 5 map into cache line 1, program line 2 maps into cache line 2, and program line 3 maps into cache line 3. The immediate successor of a block with a call is the first block in that instance of the called function. Block 8a corresponds to the first instance of foo() called from block 1 and block 8b corresponds to the second instance of foo() called from block 5. Two passes are required to calculate the input and output states of the blocks, given that the blocks are processed in the order shown in Figure 3. Only the states of some blocks inside the loop change on the second pass. Pass 3 results in no more changes.

After determining the input states of all blocks, each instruction is categorized based on its abstract cache state (derived from the input state) and the reaching state.gif By inspecting the input states of each block, one can make some observations that may not have been detected by a naive inspection of only physically contiguous sequences of references. For instance, the static simulation determined that the first instruction in block 7 will always be in cache (always hit) due to spatial locality. This can be determined by observing that line 4 is in in(7) and no conflicting program line is in in(7). It was also determined that the first instruction in basic block 8b will always be in cache (always hit) due to temporal locality. The static simulation determined that the last instruction in block 3 will not be in cache on its first reference, but will always be in cache on subsequent references (first miss). This is indicated by in(3), which includes program line 2 but also a conflicting program line ``invalid'' for cache line 3. Yet, the conflicting program line cannot be reached. This is also true for the first instructions of block 5 and 6, though a miss will only occur on the first reference of either one of the instructions. This is termed a group first miss. Finally, the first instruction in block 3 is classified as a conflict since it could either be a hit or a miss (due to the conditional call to foo). This is indicated by in(3), which includes program line 1 and a conflicting program line 5 that can still be reached.

The current implementation of the static cache simulator imposes some restrictions. First, only direct-mapped cache configurations are supported. Recent results have shown that direct-mapped caches have a faster access time for hits, which outweighs the benefit of a higher hit ratio in set-associative caches for large cache sizes []. Second, context switches cannot be simulated using this method. Third, recursive programs currently are not allowed since cycles in the call graph would complicate the generation of unique function instances. Finally, indirect calls are not handled since the static cache simulator must be able to generate an explicit call graph.


next up previous
Next: Code Instrumentation Up: Static Cache Simulation Previous: Instruction Categorization

Robert Palmer
Mon May 19 10:44:04 EDT 1997